Some Assembla Subversion repositories are currently inaccessible. This is because of problems in the EC2 storage architecture that we use. The same problem is affecting services like Reddit and Quora. Those services are completely down. At Assembla, the majority of our services are working normally. However, the svn outage is very difficult for the people who can't code [must code, must code].
We made the decision to leave Amazon EC2 a few weeks ago because of this storage problem. We are currently setting up dedicated servers with hardwired storage.
I discussed cloud storage architecture in a blog post last year. Now we have an update. The Amazon version doesn't work well enough to deliver reliable service. I think this is because the storage is network-attached and depends on a big, complicated network. This network has failed at least four times in the last year. Amazon has had at least 10 months to fix it, but the problems have recurred.
We use Amazon EC2, and we recommend it because their truly on-demand server resources make it possible to rapidly try things, fix things, and innovate. Innovation speed is important. We recommend Amazon because they have done the most to deliver "on demand".
Assembla has a further responsibility to deliver at least 99.9% uptime - down no more than about 2 hours per quarter. We beat this over the last year for all services except some svn repositories. We use some of the time budget on releases where we take down the database to make significant changes to the system. Then we run into the limits of the Internet and cloud infrastructure. We can easily exceed our reliability budget if the EC2 storage network gets slow. During the past year, we have had a total of about 24 hours of svn downtime scattered among our various repositories. About 90 minutes of that was due to scheduled builds. The rest was because of problems with EBS storage.
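To make the arithmetic behind that budget concrete, here is a minimal sketch. The function name and the 90-day quarter are assumptions for illustration; the 99.9% target, the 24-hour total, and the 90 minutes of scheduled work are the approximate figures quoted above.

```python
HOURS_PER_DAY = 24


def downtime_budget_hours(uptime_target: float, period_days: float) -> float:
    """Hours of downtime allowed by an uptime target over a period."""
    return (1.0 - uptime_target) * period_days * HOURS_PER_DAY


# Assumption: a quarter is treated as 90 days.
quarter_budget = downtime_budget_hours(0.999, 90)   # roughly 2.16 hours
year_budget = downtime_budget_hours(0.999, 365)     # roughly 8.76 hours

# Approximate figures from the text: 24 hours of svn downtime in total,
# of which about 90 minutes was scheduled.
total_svn_downtime_hours = 24.0
scheduled_hours = 1.5
storage_related_hours = total_svn_downtime_hours - scheduled_hours

print(f"Quarterly budget at 99.9%: {quarter_budget:.2f} hours")
print(f"Yearly budget at 99.9%: {year_budget:.2f} hours")
print(f"Downtime attributed to storage: {storage_related_hours:.1f} hours")
```

Note that the 24 hours is a total scattered across repositories, so it cannot be compared directly with the per-service yearly budget. What the split does show is that nearly all of the downtime came from storage rather than from planned work.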
If you use EC2 (or you run any high-availability system), you can usually design your system to withstand the disappearance of at least one server. You use a cluster of servers where one can fail, or you use pairs with replication and failover between a live server and a hot standby.
However, storage failures are more difficult to manage. You might try failing over to a different server, but if that server is using the same network-attached storage system, it will also be slow or stopped. You can replicate to other storage systems, but that involves a tradeoff. The replication uses network IO, so you run into the storage bandwidth problems even more frequently.
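The difference between a server failure and a storage failure can be sketched with a toy model. This is only an illustration, not a description of our actual setup: the `Storage` and `Server` classes, the `pick_server` function, and all the names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Storage:
    name: str
    healthy: bool = True


@dataclass
class Server:
    name: str
    storage: Storage
    up: bool = True

    def can_serve(self) -> bool:
        # A server is only useful if both it and its storage are working.
        return self.up and self.storage.healthy


def pick_server(servers: List[Server]) -> Optional[Server]:
    """Return the first server that can serve requests, or None."""
    for server in servers:
        if server.can_serve():
            return server
    return None


shared = Storage("shared-network-storage")
primary = Server("primary", shared)
standby = Server("standby", shared)

# Case 1: the primary server disappears; the standby takes over.
primary.up = False
after_server_failure = pick_server([primary, standby])

# Case 2: the shared storage stalls; neither server can serve.
primary.up = True
shared.healthy = False
after_storage_failure = pick_server([primary, standby])

# Case 3: a replica on independent storage survives the stall,
# at the cost of the replication traffic needed to keep it current.
replica = Server("replica", Storage("independent-storage"))
with_independent_replica = pick_server([primary, standby, replica])

print(after_server_failure.name if after_server_failure else None)
print(after_storage_failure)
print(with_independent_replica.name if with_independent_replica else None)
```

In this model, failover only helps when the standby does not share the failed component, which is why a stall in shared storage takes out the live server and its standby together.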
In fact, this is exactly what happens at the Amazon EC2 datacenters that we use. Sometimes, the network that connects to the storage becomes slow. This crashes any system that uses a lot of storage IO. Our Subversion systems are particularly sensitive to this.
The problem was particularly bad last summer. Amazon datacenters must have been overloaded. Since then, the disk IO speed has, on average, gotten much faster. However, it is still variable, and during the past two weeks, it crashed twice.
The obvious solution for us is to move into dedicated racks with an attached SAN and solid-state disks, which is where we are going.