Is Amazon REALLY responsible for Instagram, Netflix and other outages?


Let's review the recent 30 June Amazon Cloud failure:

Amazon never claimed that EC2 is bulletproof against regional outages. Check out the EC2 highlights here: http://aws.amazon.com/ec2/#highlights

It says "The Amazon EC2 Service Level Agreement commitment is 99.95% availability for each Amazon EC2 Region". Nothing here mentions cross-region high availability.
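To put that number in perspective, here is a rough sketch of how much downtime a 99.95% commitment still permits. It assumes the percentage is measured over a full 365-day year; the actual SLA defines its own measurement period and exclusions, so treat the result as an illustration only. The function name is made up for this example.

```python
def allowed_downtime_hours(availability_percent, period_hours):
    """Hours of downtime that still satisfy an availability target."""
    return (1 - availability_percent / 100) * period_hours


# Assumption: the target is measured over one 365-day year.
HOURS_PER_YEAR = 365 * 24

yearly_budget = allowed_downtime_hours(99.95, HOURS_PER_YEAR)
print(f"99.95% over a year allows about {yearly_budget:.2f} hours of downtime")
```

In other words, several hours of outage per year in a single region can be fully within the stated commitment.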

The Amazon S3 storage service does have redundancy across several facilities in the same region, but you can choose to reduce this redundancy if you wish to lower your storage costs…

Was the Amazon outage affecting more than one facility?

Were the affected web sites using reduced S3 storage protection to save costs?


Anyway, whatever service you provide through Amazon or another cloud provider, you should make sure that your required availability level is covered, either by using additional cloud providers or by creating your own data and service protection process. In Amazon's case, you could look into using a third-party service to duplicate data, as well as using the Amazon EC2 load balancing option, which lets you run several EC2 instances of your service (this does not take care of data duplication).
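As a back-of-the-envelope illustration of why spreading a service across more than one deployment helps, the sketch below estimates the combined availability of several redundant copies. It assumes that each copy fails independently and that any single surviving copy can serve all traffic. Both are strong assumptions: copies that sit in the same region can go down together, and the calculation says nothing about whether the data is duplicated.

```python
def combined_availability(single_availability, copies):
    """Availability of `copies` redundant deployments.

    Assumes independent failures and that one healthy copy is enough.
    """
    probability_all_down = (1 - single_availability) ** copies
    return 1 - probability_all_down


# Hypothetical example: each deployment is available 99.95% of the time.
for copies in (1, 2, 3):
    print(copies, combined_availability(0.9995, copies))
```

The gain only materializes if the copies really are independent, which is why a second region or a second provider matters more than a second instance in the same facility.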

Here are lessons learned from the Microsoft Azure cloud failure, which are still relevant:

The Microsoft Azure cloud failure happened 4 months ago as a result of a leap-year bug in Azure. Here is a recording with a few notes: http://audioboo.fm/boos/707727-your-cloud-computing-provider-is-down-now-what

Here is the extended "transcript" version of those notes:

1. Think of the cloud as a new OS that has just been released: it is not very mature, so avoid putting critical stuff on it without additional protection

2. Cloud platform apps are different from generic cloud: Salesforce is a cloud platform app, more focused and mature, and thus more resilient than newer generic cloud providers

3. Cloud providers should tag their cloud release versions and offer several versions (stable, mature, new)

4. You should plan your task deployment so that tasks can migrate to, or operate on, several cloud platforms at once, so when one fails, you use the other

5. You should plan for a massive periodic outage, and consider that it may involve much more than just you failing. It could hit your providers, partners and customers.

6. Planning for an outage should include basic and complementary services that people can still benefit from even when the main system is down. This could include read-only access to data, the opportunity to review and update some basic records, a bonus option that stays active until systems are back, and so on.
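Points 4 and 6 can be combined in a small sketch: try each cloud platform in turn, and if all of them are down, serve a read-only copy of the data instead of failing outright. Everything here is hypothetical and for illustration only: the provider names, the fetch functions and the cache stand in for whatever your own deployment uses.

```python
class AllProvidersDown(Exception):
    """Raised when no provider answers and no cached copy exists."""


def fetch_with_failover(key, providers, read_only_cache):
    """Try each (name, fetch) pair in order, then fall back to the cache."""
    for name, fetch in providers:
        try:
            return {"value": fetch(key), "source": name, "read_only": False}
        except ConnectionError:
            continue  # this provider is down, try the next one
    if key in read_only_cache:
        # Degraded mode: possibly stale data, and no updates allowed.
        return {"value": read_only_cache[key], "source": "cache", "read_only": True}
    raise AllProvidersDown(key)


# Hypothetical stand-ins for real provider clients.
def provider_down(key):
    raise ConnectionError("provider unreachable")


def provider_up(key):
    return f"live record for {key}"


cache = {"customer-42": "yesterday's record for customer-42"}

print(fetch_with_failover("customer-42", [("cloud-a", provider_down), ("cloud-b", provider_up)], cache))
print(fetch_with_failover("customer-42", [("cloud-a", provider_down), ("cloud-b", provider_down)], cache))
```

The `read_only` flag lets the front end tell users that they are looking at a degraded view, which is usually better than showing an error page.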

What do you think, and what would you add to this list?
