Friday, April 22, 2011

Episode 4: The Day The Cloud Crashed & People Lost Their Minds

April 21st 2011 will be a date that cloud commentators, cloud zealots & the opportunists in the cloud will make sure is not forgotten. Amazon AWS had a colossal outage. This article from the BBC exemplifies the kind of coverage that went along with the event. Needless to say, a lot of people directly affected as customers of AWS were miffed, as were users of the services hosted in the affected area. And no, for those who were concerned, SkyNet did not begin its take-over with AWS.

First off, one thing really needs clarifying about this event, as the reaction in social media circles, especially amongst the Twitterati, was grossly out of proportion. The reality is that a SINGLE region in Amazon's network was down. The rest of their services in the USA were fine, as were their European & Asian services. The fact that the affected region serves so many companies made the issue seem far greater than it was. Amazon AWS customers who had deployed their cloud strategy across multiple regions in Amazon's EC2 system were completely unaffected.

The fact that it went on for over ten hours is, yes, a concern. And rightfully so. But did it violate Amazon AWS's 99.95% SLA, which allows for roughly four hours of downtime per year? Nope. Not even in the slightest, even with 10 hours of being unavailable to people who were screaming over lack of access to key services, because an SLA only counts downtime as its own fine print defines it, not as customers experience it. Screaming doesn't get around the SLAs you agree to for the services you take or use. Always check the warranty.
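For anyone wanting to check where that "four hours" figure comes from, here is a minimal sketch of the arithmetic. It assumes a plain 365-day year and treats the SLA as a simple uptime percentage; the function name is my own for illustration, & a real SLA may measure over a different period or define 'downtime' far more narrowly.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (an assumption)


def allowed_downtime_hours(sla_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime an uptime percentage permits over a period.

    Hypothetical helper: treats the SLA as a raw percentage of the
    period, which is simpler than most real SLA definitions.
    """
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return period_hours * (100 - sla_percent) / 100


if __name__ == "__main__":
    for sla in (99.9, 99.95, 99.99):
        hours = allowed_downtime_hours(sla)
        print(f"{sla}% uptime allows about {hours:.2f} hours of downtime per year")
```

For 99.95% this works out at about 4.38 hours a year. Whether a given outage actually eats into that allowance depends entirely on what the provider's terms count as 'unavailable'.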

And this is the real thing to remember: the fine print of your SLAs or terms & conditions of service is the last word on any comeback you have. Cloud providers around the world, especially in Ireland, cried foul as they tried to win business from AWS. What they neglected to tell the Irish companies they were courting as a result of the outage was that their own SLAs & guarantees are in fact no better than Amazon's. Some of them even state in their terms & conditions that you have absolutely no comeback whatsoever in the event of an outage, & that there are no guarantees on up-time at all, not even at the centre power/connectivity level, which some providers at least offer.

The 'solid SLAs' of the companies who promote their uptime are, if you dig into them, nothing more than guarantees of power & network connectivity to the hosting centre itself. Unless both of those fail for more than four hours in a year, you could lose access to your VPS or cloud for days on end due to a hardware, virtualisation or internal networking issue & they would still not have violated their SLA with you.

Beware of service providers who are eager to bash the performance of their competitors openly. They'll mouth off quite happily about others' lack of 'service', while not being so mouthy about what happens when (not a case of 'if' with technology, but 'when') their own services fail on you. And believe me, they will. If multi-billion dollar global companies like Amazon, Google, Microsoft, Apple & others have outages, so will your local provider, who is less equipped in staff, finances & technology to deal with outages as efficiently as corporations with vast resources in all areas. It is also worth remembering a very old adage here: empty vessels make the most noise.

So, you're a company looking to engage a cloud strategy because you can see the benefits, but you are scared by what you read on blogs & Twitter about what happened with Amazon AWS. You don't know what to do next. Firstly, the most important thing to do is ignore Twitter & the blogs decrying AWS. These are but a noisy few out of millions. Many of them are vested interests, & vested interests should be avoided like the plague.

A good cloud service provider will be upfront with you when you engage them. They should be knowledgeable enough to work with you in understanding your requirements, explain what risks there are to what you want to achieve, & advise you on how to mitigate those risks. Sure, they're there to sell you services & gain your custom, but a good consultant will tell you that they are, & should only be, part of your solution. As good as the company they represent may be, risk should always be spread.

Every company in the business of risk management will tell you that the absolute fundamental of risk management is spreading that risk around in a controlled manner to shore up your mitigation. Mitigating risk is not cheap. So don't fall for companies promising to be the 'cheapest solution for your business' - they're not. Given their pricing, they are if anything a small part of your solution. You also need to ensure that you have a communications plan in place for any outage, as well as documented & tested internal procedures covering how your teams & staff need to act, & what events, if any, need to be triggered to mitigate the circumstances or ease them as much as possible.

But this issue goes beyond your cloud provider. It also comes down to your choice of developer. Your developer, if they are worth their salt, should build an application that allows for spread & for redundancy. They should also be advising you to spread your system across at least two providers, or across two centres at the very least if your single provider can actually do this. Your cloud provider really should be doing this too. Single cloud services are single points of failure.
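To put a rough number on why spreading across providers helps, here is a minimal sketch. It assumes each provider fails independently of the others & that your application stays up so long as any one of them is up; both are strong assumptions (shared networks, DNS or your own failover code can break them), & the function name & availability figures are purely illustrative.

```python
from math import prod


def combined_availability(availabilities):
    """Chance that at least one provider is up at a given moment.

    Assumes failures are independent and that any single provider
    being up is enough to serve traffic (both are assumptions).
    """
    values = list(availabilities)
    if not values:
        raise ValueError("at least one provider is required")
    if any(not 0 <= a <= 1 for a in values):
        raise ValueError("each availability must be between 0 and 1")
    # The system is down only when every provider is down at once.
    return 1 - prod(1 - a for a in values)


if __name__ == "__main__":
    single = combined_availability([0.9995])
    double = combined_availability([0.9995, 0.9995])
    print(f"one provider:  {single:.8f}")
    print(f"two providers: {double:.8f}")
```

On those assumptions, two providers at 99.95% each take the theoretical figure from hours of downtime a year to seconds. Real-world results will be worse than the formula suggests, which is exactly why the failover itself needs to be tested.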

And the issue of disaster recovery & planning doesn't stop at the developer or the service provider either. You, as the business owner/operator leading your organisation, are the absolute linchpin of it all. Fundamentally, being a good leader means being a good planner. As the leader of your organisation, it is incumbent upon you to plan, & to plan well & properly.

'The Cloud' is not a solution to redundancy or disaster recovery. At best, it is a tool that helps mitigate some aspects of risk in a cost-effective manner. It should never be a case of "Oh, it's in the cloud, no need to worry or care. It's taken care of already by my cloud provider." Just because it's easy to set up a business in the internet space doesn't mean the normal conventions of business disaster recovery or 'battle-stations' planning don't apply. The fundamentals of good business planning apply to the Internet as much as to the high street. Most of the time, it's just cheaper. Shortcuts in these areas are just that: shortcuts. Take them & expect to one day be caught, proverbially, with your pants around your ankles.

Remember: a blip in the operation of your business from an outage won't kill your business, but how you manage that blip, communicate & work towards the point of restoration will determine whether your business recovers when it happens. A few more adages worth closing this blog post with are 'plan for the worst, hope for the best', 'expect the unexpected' & 'if you want peace, prepare for war'.
