An explosion ate our servers - how cool is that?
Posted by Andy Singleton on Mon, Jun 02, 2008 @ 12:18 PM
Our site started misbehaving on Saturday when we got this message from our datacenter: "...electrical gear shorted, creating an explosion and fire that knocked down three walls surrounding our electrical equipment room ... This is a significant outage, impacting approximately 9,000 servers".
An explosion. How cool is that? It's way cooler than having a guy accidentally dig up your wiring with a backhoe, or run it over with a truck. It's an outright privilege compared with the squirrel that ate the fiber cable. Imagine spending nights and weekends rallying your troops around a disaster recovery plan, all to do battle with a squirrel. That would be humiliating. My wall of flame incinerates your squirrel.
The flame walled us off from three of the approximately eight servers that we need to run Assembla.com at full power. In theory, the rest of the system, including the www.assembla.com site itself, should have continued to work properly. In practice, we found that processes on the remaining servers would hang while waiting for responses from the missing servers, and users would get “Unavailable” errors on www.assembla.com or find themselves unable to access Trac.
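A common guard against this kind of hang is to put a deadline on every call to another server and fall back to a degraded answer when the deadline passes. The Python sketch below is illustrative only and does not describe Assembla's code: `call_with_timeout`, `fetch_from_server`, the pool size, and the timeout values are all hypothetical, and the "missing server" is simulated with a sleep.

```python
import concurrent.futures
import time

# Shared worker pool; the size is an arbitrary choice for this sketch.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn() in a worker thread; return fallback if it exceeds timeout_s."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The worker keeps running in the background; we only stop waiting.
        return fallback


def fetch_from_server(delay_s):
    """Hypothetical stand-in for a request to another server."""
    time.sleep(delay_s)  # simulates how long the server takes to answer
    return "real data"


# A server behind the wall of flame: answers far too slowly.
degraded = call_with_timeout(lambda: fetch_from_server(1.0), 0.1, "unavailable")

# A healthy server: answers well within the deadline.
healthy = call_with_timeout(lambda: fetch_from_server(0.0), 1.0, "unavailable")

print(degraded, healthy)
```

The caller gets an answer after at most `timeout_s` seconds either way, so one unreachable server slows a page down instead of freezing it. The slow call itself is not cancelled here; a real client would usually also set a timeout on the underlying socket or HTTP request.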
It's better now. The missing servers are back online or replaced. We'll be moving to improved virtual server topologies that will hold up under explosive attack.