Faster Dedicated Servers
Posted by Andy Singleton on Tue, Sep 20, 2011 @ 08:38 AM
Assembla.com is now running on shiny new dedicated servers in Atlanta. We made the change on Saturday night. The new servers give us advantages in reliability, failover, and speed (2.5 times faster) compared with our previous configuration in a cloud datacenter.
We apologize for some problems that users experienced during the move. There may be some remaining configuration issues, but today we see that error rates in our logs have dropped to historic lows.
Reliability: We had downtime in the last year caused by problems with cloud storage. The new system has directly connected disks and it will not be vulnerable to problems with cloud networking or storage. I will analyze this aspect of cloud computing in a future article.
Failover: The new servers are “triple redundant”. (1) They have internally redundant power, disks, and networking. (2) Data is replicated in real time to a local twin failover server. (3) Data is also replicated to a disaster recovery system in a remote Amazon datacenter. We will continue our tradition of never losing data, and we will be able to make your data accessible under (dare I say it) almost any circumstance.
Speed: This turned out to be a big win. I am amazed at the difference in performance between today and last Friday. The new servers are 2.5 times faster, with response times about 40% of the old response times. For users who are not in North America, we will make further improvements in the next month by deploying a CDN.
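For readers who want to check how "2.5 times faster" and "about 40% of the old response times" fit together, here is a small sketch of the arithmetic. The function name is made up for illustration; the only figure taken from this post is the 2.5x speedup.

```python
def response_time_ratio(speedup: float) -> float:
    """Return new response time as a fraction of the old one.

    A server that is `speedup` times faster finishes the same request
    in 1/speedup of the original time.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


if __name__ == "__main__":
    ratio = response_time_ratio(2.5)  # 2.5x is the figure quoted above
    print(f"New response time is {ratio:.0%} of the old one")
```

In other words, the two numbers are the same measurement stated two ways: 1 / 2.5 = 0.4.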
THERE WERE PROBLEMS. I apologize for glitches that affected our users. This move was supposed to be transparent for users, and we were able to do it without shutting down the site at any time. However, as the configurations and locations changed over the past three weeks, we had problems that affected users in various ways. At one point, the system was not correctly creating new repositories or importing repositories. Some repository operations saw network errors. Some git users could not log in or saw “repository not found”. Some FTP deploys failed. We had problems immediately after announcing that the new Subversion servers were "more reliable". It's true that the servers are more reliable, but the services running on them needed improved configurations. In most cases, we fixed the problems within 12 hours.
For those of you who are curious, here is what we did:
Data migration: We copied the data over the network. This took many weeks. In general, SaaS vendors have to cope with the trend (Landry’s law) of data storage growing faster than the network’s capacity to move that data.
Data replication: Then, we set our various systems to replicate data to the new servers, so that it would stay current in real time.
Proxying: We were able to move things around without affecting users (mostly) through the magic of HAProxy. This is a proxy for HTTP and database traffic (and even other types of network requests) that can be switched to send requests to different locations.
Subversion repository move: We moved Subversion repositories first, because they were the most vulnerable to cloud storage problems. I announced this move a few weeks ago. We actually switched them one at a time with a lookup table in our front-end proxies.
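The idea of switching repositories one at a time with a lookup table can be sketched roughly as follows. This is an illustration only, not our actual proxy configuration: the host names, the `RepoRouter` class, and its methods are all hypothetical, and a real front-end proxy would do this lookup in its own configuration or map files rather than in Python.

```python
from dataclasses import dataclass, field
from typing import Set

# Hypothetical backend addresses, used only for this illustration.
OLD_BACKEND = "svn-old.example.internal"
NEW_BACKEND = "svn-new.example.internal"


@dataclass
class RepoRouter:
    """Decides, per repository, which backend should serve a request."""

    # The lookup table: names of repositories already moved.
    migrated: Set[str] = field(default_factory=set)

    def mark_migrated(self, repo: str) -> None:
        """Switch one repository to the new servers."""
        self.migrated.add(repo)

    def roll_back(self, repo: str) -> None:
        """Send one repository back to the old servers if something breaks."""
        self.migrated.discard(repo)

    def backend_for(self, repo: str) -> str:
        """Return the backend for a repository; unmoved ones stay on the old servers."""
        return NEW_BACKEND if repo in self.migrated else OLD_BACKEND


if __name__ == "__main__":
    router = RepoRouter()
    router.mark_migrated("project-a")
    for name in ("project-a", "project-b"):
        print(name, "->", router.backend_for(name))
```

The appeal of this design is that each repository can be switched, checked, and switched back on its own, so a problem with one move does not affect the others.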
Git and Mercurial repository move: Next, we moved the Git and Mercurial repositories. The move itself was smooth, but the repositories went into a new, more scalable configuration with front-end proxies, and this had a few glitches.
Switch the other systems (Db, app, queue, etc.): With replication running, and all application servers configured, we used the proxy servers to send Web traffic to the new servers. That part was smooth.