Aaron Cordova's Blog

Cloud Failures


Recently there has been a spate of high-profile failures related to cloud technology. Gmail was down for 100 minutes, and Microsoft's Danger lost data belonging to a number of Sidekick mobile device users. While some people are taking these as a sign either that cloud technology is immature or that cloud will never work, a closer look suggests that this may not be the case.

One of the core tenets of cloud technology is that applications should be distributed and single points of failure should be eliminated. Making the individual components of a system more independent reduces the chance of catastrophic failure, provided the system is also designed to tolerate some fraction of its machines going down at any given time. Google's MapReduce framework is designed this way: a number of machines can fail, and the framework automatically reassigns their portions of the work to other machines.
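To make the reassignment idea concrete, here is a minimal sketch in Python. It is not Google's MapReduce scheduler; the function name `assign_tasks`, the round-robin policy, and the worker names are all assumptions made up for illustration. It only shows the principle that work belonging to a failed machine is handed to a healthy one.

```python
def assign_tasks(tasks, workers, failed):
    """Map each task to a worker, routing around failed workers.

    Hypothetical illustration only: tasks are first spread round-robin
    over all workers, and any task whose worker has failed is reassigned
    round-robin over the workers that are still healthy.
    """
    healthy = [w for w in workers if w not in failed]
    if not healthy:
        # No independent component is left to absorb the work.
        raise RuntimeError("no healthy workers left")

    assignment = {}
    for i, task in enumerate(tasks):
        worker = workers[i % len(workers)]
        if worker in failed:
            # Reassign this portion of the work to a healthy machine.
            worker = healthy[i % len(healthy)]
        assignment[task] = worker
    return assignment


if __name__ == "__main__":
    plan = assign_tasks(range(8), ["w0", "w1", "w2", "w3"], failed={"w1"})
    print(plan)
```

In this sketch the job still completes as long as at least one worker is healthy, which is the property the paragraph above describes.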

What happened in the Gmail and Microsoft failures is revealing. In Google's case, engineers were performing upgrades to routers, which meant taking a fraction of them offline and leaving the remaining routers to handle all the traffic. They underestimated the load, and individual routers became overloaded. All this would have been fine if it weren't for a flaw in the routers' programming: a router under heavy load would notify the others to send less traffic its way. When one router did this, the others became more heavily loaded, sent out their own messages to reduce traffic, and so on, leading to a cascading failure.
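The feedback loop described above can be sketched as a toy simulation. This is a deliberately simplified model and not a reconstruction of Google's routers: the capacities and traffic figure are invented, traffic is assumed to split evenly across active routers, and an overloaded router is assumed to shed all of its traffic rather than just some of it.

```python
def simulate_cascade(capacities, total_traffic):
    """Return the number of routers accepting traffic after each round.

    Toy model (all assumptions): traffic is split evenly over the active
    routers, and any router whose share exceeds its capacity tells the
    others to stop sending it traffic, dropping out of the active set.
    """
    active = list(capacities)
    history = [len(active)]
    while active:
        load = total_traffic / len(active)
        survivors = [cap for cap in active if cap >= load]
        if len(survivors) == len(active):
            break  # every router can carry its share; the system is stable
        # Overloaded routers push their traffic onto the rest.
        active = survivors
        history.append(len(active))
    return history


if __name__ == "__main__":
    # All six routers online: each carries 100 units and copes.
    print(simulate_cascade([100, 110, 120, 130, 140, 150], 600))  # [6]
    # One router offline for an upgrade: the rest fall over in two rounds.
    print(simulate_cascade([100, 110, 120, 130, 140], 600))  # [5, 3, 0]
```

With every router online the load is stable, but removing a single router pushes the two weakest over their limits, and the traffic they shed then overwhelms the remaining three.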

In Microsoft's case, the prevailing story is that "Microsoft's engineers attempted to perform a SAN transition that failed without any contingency plans in place." Okay. Number one: SANs and cloud don't make a whole lot of sense together in the first place. The I/O isn't distributed, and SANs are expensive. One can get the same storage with more independence for much less, but only by running software that is designed for distributed architectures. Number two: the engineers had no contingency plan, so when their primary system failed, that was it.

So in both examples, the failures came from engineers taking the initiative to change the architecture, which they have to do from time to time in order to grow. Indeed, another tenet of cloud technology is being able to grow the architecture while it is online. But this is not the same as the architecture failing one day because of a traffic spike or hardware failure, which these architectures handle well. The second feature the failures share is that contingency plans were insufficient, or the engineers failed to plan at all. In Google's case, it was specifically the failure of the router software to adhere to cloud principles that caused the cascading failure.

In sum, not all clouds are created equal. If you're going to go with a cloud provider, investigate whether they're really using cloud technology made up of highly distributed and independent components, or whether they've simply slapped the cloud label on the same old stuff. Unfortunately, it will probably be a while before someone develops an architecture that is resistant to system engineers' mistakes or, in some cases, gross neglect.