Reddit, HootSuite, Foursquare and the great AWS crash of 4/21/2011
As I type this, the lead story on http://money.cnn.com/ reads "AMAZON CRASH ZAPS WEBSITES" (their choice of all caps). Ouch. The outage has kept Reddit, HootSuite, Foursquare and lots of other sites offline for the better part of the day. Double ouch.
But... but... isn't "The Cloud" about keeping your site up at all times? Doesn't it make that easy? Isn't that one of the primary reasons to use Cloud hosting? Absolutely.
One of the few things that an online market leader needs to do to maintain that lead is keep its services up. At all times. Regardless of what goes down below. The Cloud makes that easier than ever. It is surprising to find companies of this scale not leveraging this benefit.
The exceptionally sad thing about outages like this is that they are entirely foreseeable, and take maybe 4-6 weeks (at most) of concerted effort to avoid. Technologies like Heartbeat, DRBD, and Master/Slave Replication, when used in combination, ensure that even if your whole stack goes down on one provider or availability zone, an exact replica comes up in another provider or availability zone, usually within seconds.
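To make the failover idea concrete, here is a minimal Python sketch of the decision a standby node makes when the primary stops answering. It is a toy model, not how Heartbeat itself is configured or implemented: the `FailoverMonitor` class, the `max_missed` threshold, and the `simulate` helper are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Toy standby-side monitor (hypothetical, for illustration only).

    The standby promotes itself to primary once it has missed
    `max_missed` heartbeats in a row from the current primary.
    """
    max_missed: int = 3   # assumed threshold, not a Heartbeat default
    missed: int = 0
    role: str = "standby"

    def observe(self, heartbeat_received: bool) -> str:
        """Record one heartbeat interval and return the current role."""
        if self.role == "primary":
            # Already promoted; this sketch does not model failback.
            return self.role
        if heartbeat_received:
            # Any successful heartbeat resets the consecutive-miss counter.
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self.role = "primary"
        return self.role


def simulate(beats, max_missed=3):
    """Feed a sequence of heartbeat results to a fresh monitor."""
    monitor = FailoverMonitor(max_missed=max_missed)
    return [monitor.observe(beat) for beat in beats]


if __name__ == "__main__":
    # Two isolated misses are tolerated; three in a row trigger promotion.
    print(simulate([True, False, False, True, False, False, False, True]))
```

In a real setup the promotion step would also take over the service IP and mount the replicated storage; the sketch only shows the "how many missed beats before we act" logic.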
As anyone who has set up these systems can tell you, they are fairly easy to install and configure, work like magic, and have been battle tested by lots of other companies before yours. Heartbeat setup? Maybe 1-2 days. DRBD? 1 to 1.5 weeks. Master / Slave setups? 1 to 1.5 weeks again. Testing the whole shebang, and accounting for the complexities of a large codebase and database like these companies have, would take up the rest of the time. Seriously. 4 to 6 weeks, at a cost of maybe $20K to $40K.
So why would any company the size of these 3 giants NOT invest this small amount of time and money in order to avoid a fiasco like the one that happened today? I'm sure the rationale was one or more of the following: 1) We're too busy developing and maintaining the product; 2) It will be too difficult / cost too much money; 3) Full scale outages at AWS almost never happen; 4) We just never really thought about it.
Projects like Disaster Recovery, High Availability, and Security seem to never get any love, until neglect causes them to bring the rest of the business to its knees. It doesn't matter how busy you are... If you don't install (and use) seat belts in your car, you will die or be seriously injured when a crash happens. It took lots of deaths / injuries in cars before the "seat belt" law was passed, and there were similar arguments of "it'll be too expensive" right up to the end. And the odds of a Hosting provider crash are much higher than a car crash.
"We just never thought about it" is bordering on gross negligence, and rarely happens in our experience. It's highly likely that the tech teams at these companies made management well aware of the possibility of a full scale outage, but the powers that be either didn't grant the funds, or the tech teams made the effort sound too difficult / expensive to justify.
In any case, outages like this at a major online company are inexcusable, especially considering how easy they are to avoid. This one-day outage alone probably cost each company 4-5x the $40K that a Highly Available, redundant architecture would have cost.
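The arithmetic behind that comparison is simple enough to run against your own numbers. The sketch below uses only the estimates from this post (a $40K build and a 4-5x multiple); the function name and parameters are hypothetical, and the figures are rough guesses rather than measured losses.

```python
def implied_outage_cost(ha_cost, low_multiple=4, high_multiple=5):
    """Return the (low, high) outage cost implied by a cost multiple.

    ha_cost: estimated one-time cost of the redundant setup, in dollars.
    low_multiple / high_multiple: how many times that cost one outage
    is assumed to burn (the 4-5x guess from this post by default).
    """
    return ha_cost * low_multiple, ha_cost * high_multiple


if __name__ == "__main__":
    low, high = implied_outage_cost(40_000)
    print(f"Estimated one-day outage cost: ${low:,} to ${high:,}")
```

Swap in your own build estimate and your own daily revenue at risk to see how the comparison looks for your company.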
Which raises the question... Is your organization set up to continue in the event of an AWS or Rackspace Cloud outage? What would it cost your company to be offline for a day? If your team is busy on other efforts, doesn't it make sense to bring in Consultants who have experience setting up High Availability systems?











