Paying for the sins of slow performance
Posted by Andy Singleton on Wed, May 19, 2010 @ 09:36 AM
It's time to confess something. Forgive me father, for I have sinned. Assembla was getting slow. It wasn't fun, and we wasted our users' time. Users were waiting more than 400ms on average for pages, and 27% of the time they hit the “frustrated” mark of over half a second. 10% of the time they waited more than two seconds, which caused deep spiritual distress. Emperor Google tells us that we will get an earthly reward if we repent and improve.
Here is our penance. During the last four weeks, we have reduced the average time to serve a request to 190ms, with only about 4% going longer than half a second. It's not perfect yet, but it is much better.
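For readers who want to track the same numbers, here is a minimal sketch of how the average and the "over half a second" and "over two seconds" shares can be computed from a list of request times. The method name `summarize_timings` and the sample data are made up for illustration; only the 500ms and 2-second thresholds come from the figures above, and this is not how New Relic computes its own scores.

```ruby
# Summarize request timings given in milliseconds.
# The thresholds mirror the ones quoted in the post; the method name
# and the sample data are hypothetical.
def summarize_timings(durations_ms, frustrated_ms: 500, painful_ms: 2000)
  raise ArgumentError, "no samples" if durations_ms.empty?

  count = durations_ms.size.to_f
  {
    average_ms: durations_ms.sum / count,
    frustrated_pct: 100.0 * durations_ms.count { |d| d > frustrated_ms } / count,
    painful_pct: 100.0 * durations_ms.count { |d| d > painful_ms } / count
  }
end

# Ten made-up request times, in milliseconds.
sample = [120, 180, 90, 650, 210, 2400, 150, 300, 95, 105]
summary = summarize_timings(sample)
puts format("average %.0f ms, %.0f%% over 0.5 s, %.0f%% over 2 s",
            summary[:average_ms], summary[:frustrated_pct], summary[:painful_pct])
```

Watching the share of slow requests alongside the average matters because a handful of very slow pages can hide behind a reasonable-looking mean.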
First, we deployed application servers with faster processors. Our application servers run Ruby and Ruby on Rails. This is a nice environment for development, but the Ruby interpreter is very slow. Adding more processors or more cores didn't help, because each request runs on a single Ruby thread. We actually had to get a faster processor to run that single thread.
We also looked at core libraries to see if we could get faster versions. Sometimes the gain was only 2%, but every little bit helps.
Most of the gains after that came from classic optimization work. We used New Relic to log our request times, looked at the slowest requests one by one, and figured out how to make each of them faster. In some cases we fixed the data retrieval strategy or changed the data structure. We found more places where we can use memcache caching. We found some places where we could render a page faster by loading data later with an Ajax call, for example on ticket Edit. We are also working on threading problems on the repository servers that sometimes block the code browser. And so on.
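The caching idea is easy to show in a few lines. This is a minimal sketch of the common "cache-aside" pattern, not Assembla's actual code: `FakeCache` is an in-memory stand-in so the example runs without a memcached server, and `fetch_cached`, the cache key, and the ticket counts are all hypothetical names and data.

```ruby
# A tiny in-memory stand-in for a memcache client, so the sketch runs
# without a server. A real client would talk to memcached instead.
class FakeCache
  def initialize
    @store = {}
  end

  # Return the stored value, or nil if it is missing or expired.
  def get(key)
    entry = @store[key]
    return nil if entry.nil? || entry[:expires_at] < Time.now

    entry[:value]
  end

  # Store a value that expires after ttl_seconds.
  def set(key, value, ttl_seconds)
    @store[key] = { value: value, expires_at: Time.now + ttl_seconds }
  end
end

# Cache-aside: return the cached value if present; otherwise run the
# block (the slow path), store its result, and return it.
def fetch_cached(cache, key, ttl_seconds: 60)
  cached = cache.get(key)
  return cached unless cached.nil?

  value = yield
  cache.set(key, value, ttl_seconds)
  value
end

cache = FakeCache.new
slow_calls = 0
load_ticket_counts = lambda do
  slow_calls += 1 # stands in for an expensive database query
  { open: 12, closed: 30 }
end

first  = fetch_cached(cache, "space:42:ticket_counts") { load_ticket_counts.call }
second = fetch_cached(cache, "space:42:ticket_counts") { load_ticket_counts.call }
puts "slow path ran #{slow_calls} time(s)"
```

The second lookup is served from the cache, so the expensive query runs only once until the entry expires. The price is that users may see data up to `ttl_seconds` old, which is why this only fits pages where slightly stale numbers are acceptable.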
We will continue our work, struggling as always against the forces of evil in the search for a fast and fun experience.
Our users experienced other disruptions in the last few weeks. The optimization work has made our release times longer than usual, with servers down for an hour during each of the last two releases while we changed databases and server topology. We do releases at 10:00 am Moldova time, which gives us an alert team, and is in the middle of the night in the US. However, it is a bad time for people in Europe. We will work on finding a better time. Also, about two weeks ago, the Amazon datacenter in Virginia had a power outage that took out a few of our servers. We were running on slower failover servers for a few days, and we had some incidents where the system slowed to a crawl or crashed. Hopefully, we won't be struck by more such acts of God, but if we are, we'll be quicker to start new machines.