
Problems with Amazon EC2's storage architecture

Posted by Andy Singleton on Thu, Apr 21, 2011
  
  

Some Assembla Subversion repositories are currently inaccessible. This is because of problems in the EC2 storage architecture that we use. The same problem is affecting services like Reddit and Quora. Those services are completely down. At Assembla, the majority of our services are working normally. However, the svn outage is very difficult for the people who can't code [must code, must code].

We made the decision to leave Amazon EC2 a few weeks ago because of this storage problem.  We are currently setting up dedicated servers with hardwired storage.

I discussed cloud storage architecture in a blog post last year. Now we have an update. The Amazon version doesn't work well enough to deliver reliable service. I think this is because the storage is network attached, and the network it depends on is big and complicated. This network has failed at least four times in the last year. They have had at least 10 months to fix it, but the problems have recurred.

We use Amazon EC2, and we recommend it because their truly on-demand server resources make it possible to rapidly try things, fix things, and innovate. Innovation speed is important. We recommend Amazon because they have done the most to deliver "on demand".

Assembla has a further responsibility to deliver at least 99.9% uptime, which means being down no more than about 2 hours per quarter. We beat this over the last year for all services except some svn repositories. We use some of the time budget on releases where we take down the database to make significant changes to the system. Then we run into the limits of the Internet and cloud infrastructure. We can easily exceed our reliability budget if the EC2 storage network gets slow. During the past year, we have had a total of about 24 hours of svn downtime scattered among our various repositories. About 90 minutes of that was due to scheduled builds. The rest was because of problems with EBS storage.
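
To make the uptime arithmetic concrete, here is a rough sketch in Python. It assumes a quarter of 91 days and a year of 365 days (both assumptions, not figures from this post), and the function name is just illustrative. The last comparison treats the whole 24 hours as if it had hit a single repository, which is a worst case, since the downtime was actually scattered among repositories.

```python
HOURS_PER_DAY = 24


def downtime_budget_hours(availability: float, period_days: float) -> float:
    """Hours of downtime permitted over period_days at the given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_days * HOURS_PER_DAY


# Assumption: a quarter is taken as 91 days and a year as 365 days.
quarter_budget = downtime_budget_hours(0.999, 91)   # roughly 2.2 hours
year_budget = downtime_budget_hours(0.999, 365)     # roughly 8.8 hours

# Figure from the post: about 24 hours of svn downtime over the past year.
# Worst-case assumption: all of it counted against one repository.
reported_svn_downtime_hours = 24.0
overrun = reported_svn_downtime_hours / year_budget

print(f"Quarterly budget: {quarter_budget:.2f} h")
print(f"Yearly budget:    {year_budget:.2f} h")
print(f"Reported svn downtime vs yearly budget: {overrun:.1f}x")
```

With these assumptions, the quarterly budget comes out slightly above 2 hours, which matches the "2 hours per quarter" rule of thumb.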

If you use EC2 (or you run any high-availability system), you can usually design your system to withstand the disappearance of at least one server. You use a cluster of servers where one can fail, or you use pairs with replication and failover between a live and a hot swap server.

However, storage failures are more difficult to manage. You might try failing over to a different server, but if that server is using the same network-attached storage system, it will also be slow or stopped. You can replicate to other storage systems, but that gives you a tradeoff. The replication uses network IO, so you get the storage bandwidth problems even more frequently.
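
A toy model of that failover problem, as a minimal sketch: the server names, the storage pool names, and the `failover_candidates` helper are all hypothetical, and this is not how our systems are actually configured. It only shows that a standby sharing the degraded storage system is not a useful failover target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    storage: str  # identifier of the storage system this server depends on


def failover_candidates(servers, failed_storage):
    """Return the servers that do not depend on the degraded storage system."""
    return [s for s in servers if s.storage != failed_storage]


# Hypothetical layout: the live and standby servers share one network-attached
# storage pool, while a replica sits on a separate storage system.
servers = [
    Server("svn-live", storage="pool-a"),
    Server("svn-standby", storage="pool-a"),
    Server("svn-replica", storage="pool-b"),
]

usable = failover_candidates(servers, failed_storage="pool-a")
print([s.name for s in usable])  # ['svn-replica']
```

In this layout only the replica survives a slowdown of the shared pool, and keeping that replica current is exactly the replication traffic that adds to the network IO load described above.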

In fact, this is exactly what happens at the Amazon EC2 datacenters that we use. Sometimes, the network that connects to the storage becomes slow. This crashes any system that uses a lot of storage IO.  Our Subversion systems are particularly sensitive to this.

The problem was particularly bad last summer. Amazon datacenters must have been overloaded. Since then, the disk IO speed has, on average, gotten much faster. However, it is still variable, and during the past two weeks, it crashed twice.

The obvious solution for us is to move into dedicated racks with attached SAN and solid-state disks, which is where we are going.


COMMENTS

Have you guys looked into EC2 alternatives? This space is hot, and a dedicated setup also means headaches of a different kind.

posted @ Thursday, April 21, 2011 11:08 AM by Sami Hoda


Hey, When is this likely to be working again?

posted @ Thursday, April 21, 2011 11:49 AM by Jonny Shaw


Any ETA on when the repositories will be back up?

posted @ Thursday, April 21, 2011 11:56 AM by Melvin


Thanks for this blog post explaining the problem. It's appreciated that users can at least know what's going on. Good luck getting things up and running.

posted @ Thursday, April 21, 2011 12:39 PM by Sven


Hi I was just comparing which svn service to use and have decided to go with assembla because of this fiasco. 
 
I like the fast response on the forums, the website which clearly states what is going on and where to find more info, and this blog which states you are attempting to solve the problem. 
 
It shows you are on top of things and working hard to make a great service!

posted @ Thursday, April 21, 2011 1:16 PM by shwick


Please let us know the ETA. Maybe it is time for Git :) - stashes help a lot in such situations.

posted @ Thursday, April 21, 2011 1:18 PM by Andrzej Liśkiewicz


Yes, indeed, thank you for the update. Just a small suggestion/request: it would be nice to get an e-mail notification as well, since this is a rather major outage. While I do check the blog on a regular basis, email is still my primary work space. I am guessing I am not alone. Maybe a banner message on top of the Assembla pages as well. Just a suggestion.
 
Also, I might be missing the point a bit, but why not replicate to AWS zones in the EU and Asia, or back up to S3? I understand that the bandwidth costs $, but if that gives better reliability then why not? I know I would be willing to pay for added reliability, since today is a lost day for coding/deploying/testing for us.

posted @ Thursday, April 21, 2011 1:33 PM by Eugene


I'd also like to mirror the previous comments. An e-mail notification of this would have been much appreciated as I didn't know this blog even existed. I've been having a few computer problems lately and initially didn't even think that there was a problem with Assembla and assumed it was my PC. 
 
Any news on when the repositories will be back up? There's getting to be quite a buildup of changes that need to be checked in and revisions that I need to reference. Normally I'd sacrifice the day and work on something else under such circumstances, but I'm approaching a deadline, so time is precious...
 
Any updates would be greatly appreciated.

posted @ Thursday, April 21, 2011 1:56 PM by Chris


Thank you for a very detailed description of the problem. However, we all would like to know when this problem will be fixed. Having had a conversation with someone from Amazon, can you estimate when this is going to be fixed?

posted @ Thursday, April 21, 2011 2:05 PM by Grzegorz Błaszczyk


I would also appreciate much more proactive communication from Assembla. Let us know when you plan maintenance, and let us also know by email that problems are solved and services are available.
Let us also know (by email) that some services are temporarily not available.
Currently the only communication we receive is a timeout on the SVN repositories. Come on guys, you are monitoring the storage and the SVN usage, right? Or is this assumption too optimistic? I don't think so.
The reason I push on this is simple: business continuity. We, and also other users of Assembla, have deadlines in our projects. Be transparent with us!
 
Thanks. 

posted @ Thursday, April 21, 2011 2:57 PM by Martin


Another lost day of work for me and my team.. I don't mind small outages, but this is the 2nd time our SVN has been down for an ENTIRE day in the past 2 weeks!!! 
 
I'm glad to hear that you are planning on moving away from Amazon, and hopefully that happens before their EC2 datacenter implodes.

posted @ Thursday, April 21, 2011 4:11 PM by Derek Knapp


A couple of simple notifications could make most of us patient/understanding when there are issues:
 
1. Place a status indicator at the top of the page when signed into assembla online and ... if there is a major issue, maybe a link beside the status. 
 
and/or ... 
 
2. Send an email to the registered svn (or other service) members, letting them know when there is an issue/resolution. (I understand this would use up more bandwidth, which is why I listed it second)
 
Even so, this blog post is a welcome update.
 
Thanks 
 

posted @ Thursday, April 21, 2011 5:00 PM by Randy


whoever relies solely on the cloud for their svn is an idiot 
 
it takes 2 seconds to make a local svn server to use for normal daily operations 
 
do a daily or hourly sync with a cloud service for additional backup, but definitely don't rely on it as your sole source it's just too risky with any provider

posted @ Thursday, April 21, 2011 7:25 PM by shwick


Almost 24 hours have passed and I can't access my SVN. I hope you get the SVN service back ASAP. My team and I are panicking.

posted @ Friday, April 22, 2011 3:41 AM by uuook


Any news when all repositories will be back? I have all but one working.

posted @ Friday, April 22, 2011 4:11 AM by Andreas Reuterberg


Hey man, first thank you for informing us of the situation, when I try to access my repository I am getting the following message: 
"The repository is empty." 
 
And when I go directly to the link of my repository I get a 404 page: 
https://subversion.assembla.com/svn/wuytonl/ 
 
Does that mean I lost the content that was in my repository, or will the content be recovered?

posted @ Friday, April 22, 2011 7:36 AM by Jonathan


Thanks guys. I for one appreciate your efforts to recover and restore the system and to be so open and frank about it. Far from being turned off by your service, my faith in it has improved.

posted @ Friday, April 22, 2011 10:18 AM by Muhammad OMer


>>> Right, and I just happen to be in the 30%. I know this game because I feed the same BS to my customers who are not technical when something goes wrong. Just admit it is probably still messed up for everyone. 
 
- well, I am glad I am not your customer. 7 of 8 of our repos on Assembla are back in business. aint no lie, I'll tell you that. Bad mouthing someone without having facts is immature.

posted @ Friday, April 22, 2011 10:29 AM by Eugene


I wanted to follow up to say that I'm still hopeful of having our main repo back this afternoon. 
 
I have a couple repo spaces, and all but one are online ... of course it is the one that is currently under deadline. But I understand that this was caused by Amazon (everyone else would too, if they turned on or read any current news). 
 
Anyway, I just wanted to say thanks for the effort the assembla team is putting into getting everyone back online and even giving the advance option of rolling back (if possible) for some repo before the full restore is done. 
 
Much appreciated!

posted @ Friday, April 22, 2011 12:49 PM by Randy


Repositories still down? Early weekend!

posted @ Friday, April 22, 2011 2:25 PM by Sam


People, 
 
you need to understand that lack of information is making lots of people very nervous here... Please provide REGULAR status updates. I understand you are busy with restoring all this, but I don't mind waiting if I know what is going on...

posted @ Saturday, April 23, 2011 7:41 PM by Perica


Is the problem not solved? Because for me it already is...

posted @ Saturday, April 23, 2011 8:08 PM by Jonathan


Any news on this? Is it safe to resume normal operation? 
 
I do have my backups and workaround and so on, but we'd very much like to get started again with full confidence.

posted @ Monday, April 25, 2011 5:50 AM by Bart


Every service experiences issues occasionally. However, Assembla completely fell down on keeping its customers informed. I'm more concerned about what you'll do in the future to COMMUNICATE than about your technical excuses for the mess.

posted @ Monday, April 25, 2011 4:36 PM by Raymond Plante


@Bart:  
Bart, everything is fully operational (http://blog.assembla.com/assemblablog/tabid/12618/bid/48362/All-repositories-are-back-to-normal.aspx), if there is any problem with your space, you can file a ticket at https://www.assembla.com/spaces/AssemblaSupport/support/tickets and our admins will check into it specifically. 
 
@Raymond: 
Raymond, here is what we did to keep everyone informed:
1) we posted an alert at the top of all Assembla.com pages;
2) we answered all emails, support tickets, forum posts, blog posts etc;
3) we sent personal emails to users on the svn server that was down the longest;
4) we posted on Twitter and answered Twitter posts at http://twitter.com/#!/assembla;
5) we posted an announcement on the forum at http://forum.assembla.com/forums and replied to every single request regarding the outage;
6) we posted public support tickets at https://www.assembla.com/spaces/AssemblaSupport/support/tickets and answered those with "Public" view so that everyone could read about the outage.
 
Please let us know if you have additional recommendations. 
 
Thank you. 
 
Assembla Customer Service.

posted @ Tuesday, April 26, 2011 12:48 PM by Assembla Customer Service


Comments have been closed for this article.
