Is a 100% cloud computing uptime SLA a real possibility?

Recently I was asked for my views on Amazon AWS and their 100% redundant cloud computing SLA. Here are a couple of notes I took on the subject…

First of all, according to the Amazon AWS EC2 Service Level Agreement (2017), Amazon’s uptime SLA for the EC2 compute cloud isn’t 100%, but only 99.95%.

The very same uptime SLA rate (99.95%) is also guaranteed by Google Cloud Platform and Microsoft Azure.

Just to explain what this means:

99.95% essentially implies that the AWS, Google or Azure compute cloud can go down for 43.2 seconds every day (or, equivalently, once a year for up to 4h 22m 58.5s).
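As a rough sketch of the arithmetic behind those two figures, here is a small Python example. The function names are my own, and the year length of 365.2425 days (an average Gregorian year) is an assumption, chosen because it reproduces the 4h 22m 58.5s figure; a 365-day year would give a slightly smaller number.

```python
from fractions import Fraction

SECONDS_PER_DAY = 86_400
# Assumption: an average Gregorian year of 365.2425 days.
DAYS_PER_YEAR = Fraction("365.2425")


def allowed_downtime_seconds(uptime_percent: str, days: Fraction) -> Fraction:
    """Seconds of downtime an uptime percentage tolerates over `days` days."""
    downtime_fraction = 1 - Fraction(uptime_percent) / 100
    return downtime_fraction * days * SECONDS_PER_DAY


def format_duration(seconds: Fraction) -> str:
    """Render a number of seconds as 'Hh Mm S.Ss'."""
    hours, rem = divmod(float(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{int(hours)}h {int(minutes)}m {secs:.1f}s"


per_day = allowed_downtime_seconds("99.95", Fraction(1))
per_year = allowed_downtime_seconds("99.95", DAYS_PER_YEAR)
print(f"per day:  {float(per_day):.1f}s")
print(f"per year: {format_duration(per_year)}")
```

The percentage is passed as a string and handled with Fraction so that 99.95 is not distorted by floating-point rounding before the final formatting step.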

For those who never read SLAs: if your business hosts its infrastructure on Amazon, Google or Microsoft and everything suddenly goes dark for 4 hours, and this happens only once a year, there isn’t much you can do, because that’s the SLA you as a customer agreed to during the sign-up process. In other words: no reimbursement for the downtime, period.

Why 99.95%? 

It’s quite interesting that the uptime SLA of 99.95% is the current maximum offered by Amazon, Microsoft, and Google.

I think this is because cloud giants such as Amazon, Google, and Microsoft realized that it’s not cost-effective to guarantee anything better than 99.95%, which is why they all aligned around this rate. In my view, a 99.95% uptime rate for cloud computing is a benchmark of quality that is nearing perfection, given the current state of cloud infrastructure. Improving the uptime rate further would likely carry a financial burden that just wouldn’t be worth the additional investment. So it’s possible that all three companies simply arrived at their own version of Six Sigma at this specific rate.

To illustrate this a bit more, imagine being a cloud company that wants to improve its uptime by 0.04 percentage points and go from 99.95% to 99.99%. It doesn’t seem like much of a change, but the additional 0.04 points would bring the unscheduled outage quota down from about four and a half hours to only about 52 minutes a year. Pushing further to 99.9999% would be even harder, allowing barely 32 seconds of unscheduled outage a year, and adding the full 0.05 points would mean 100%, with no outage allowed at all.
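To make the comparison concrete, here is a minimal Python sketch that prints the yearly outage budget for a few uptime tiers. The helper name is hypothetical, the year is assumed to be an average Gregorian year of 365.2425 days, and the 99.9999% tier is included only as an illustrative stricter level.

```python
from fractions import Fraction

# Assumption: an average Gregorian year of 365.2425 days.
SECONDS_PER_YEAR = Fraction("365.2425") * 86_400


def yearly_downtime_seconds(uptime_percent: str) -> float:
    """Seconds of downtime per year that an uptime percentage leaves room for."""
    return float((1 - Fraction(uptime_percent) / 100) * SECONDS_PER_YEAR)


# Print the yearly budget for each tier, in minutes.
for tier in ("99.95", "99.99", "99.9999"):
    seconds = yearly_downtime_seconds(tier)
    print(f"{tier}% -> {seconds / 60:.1f} minutes per year")
```

Each extra nine cuts the budget by an order of magnitude, which is why a seemingly tiny improvement in the percentage is such a large operational commitment.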

Exceptions?

Of course, there are always some exceptions – even among prominent cloud players. In this space, IBM is the winner, as they calmly claim 100% uptime. For some reason, whenever I see their 100% uptime SLA, I imagine everyone in the cloud business shaking their heads.

But, one look at IBM’s SLA and it’s clear that it’s not exactly 100%.

Excerpt from IBM Cloud SLA: “SoftLayer cloud will use reasonable efforts to provide a service level of 100% for the Public Network.” (Softlayer.com, 2017).

Okay, that’s all I needed to know. In my view, this is an excellent example of client hooking – or in other words, selling the pipe dream. What’s the definition of a reasonable effort?

IBM, please let us know what the definition of a reasonable effort is. In my view, the illustration below could be a reasonable effort too… :)

So is 100% uptime truly possible?

Well, hard to say. We certainly don’t see Amazon, Google and Microsoft (who take 63% of the cloud market) claiming 100% uptime for their cloud computing. I suspect this is primarily because of the costs associated with building and maintaining what would be quite a large redundancy infrastructure.

But even if the giants of IT spent all the money they have, could they truly guarantee 100% uptime? Or get even close to it?

Well, let’s say there is a disaster affecting an entire continent, such as the solar storm of 1859 (also known as the Carrington Event).

According to Board, S. S., & National Research Council (2009), “a solar storm of this magnitude occurring today would likely cause widespread problems for a modern and technology-dependent society.” Business Insider says: “A massive solar storm [such as Carrington Event] could wipe out almost all of our modern technology — and we’d have just hours to prepare” (Letzter, 2017).

So, in my view, even a single event like the one I just described would probably render any attempt to keep cloud uptime at 100% completely ineffective. In fact, even one additional hour of downtime (beyond what the SLA allows) could end up being an enormous problem – just imagine AWS reimbursing over 1 million of its current corporate clients (including NASDAQ, NASA, Netflix, etc.) for an hour of downtime – you get the picture.

 

Conclusion

Amazon, Google, and Microsoft (the leading cloud vendors) feel the same way about redundancy. 99.95% is their way of saying that a cloud solution can go down, because given today’s infrastructure it’s currently the best they can do. At the end of the day, it comes down to the profitability of building the additional supporting infrastructure.

I like the following quote by Albert Einstein:

In my view, it essentially says that nothing that is actively used can be 100% perfect. Or in other words, customers shouldn’t take seriously any cloud company that claims 100% of anything – including cloud uptime. So, to conclude: while the 99.95% uptime rate is not perfect, it seems to be the final answer in today’s world of cloud computing. It’s a rational number based on the reality of the current cloud infrastructure.

References

Amazon Web Services, Inc. (2017). Amazon EC2 SLA. [online] Available at: https://aws.amazon.com/ec2/sla/ [Accessed 22 April, 2018].

Softlayer.com. (2017). [online] Available at: https://www.softlayer.com/sites/default/files/sla.pdf [Accessed 22 April, 2018].

Board, S. S., & National Research Council. (2009). Severe space weather events: Understanding societal and economic impacts: A workshop report. National Academies Press.

Letzter, R. (2017). A massive solar storm could wipe out almost all of our modern technology — and we’d have just hours to prepare. [online] Business Insider. Available at: http://www.businessinsider.com/massive-1859-solar-storm-telegraph-scientists-2016-9 [Accessed 22 April, 2018].