Simplifying cloud: Reliability

The original Google server rack

Reliability in cloud computing is a very simple concept which I’ve explained in many presentations but never actually documented:

Traditional legacy IT systems consist of relatively unreliable software (Microsoft Exchange, Lotus Notes, Oracle, etc.) running on relatively reliable hardware (Dell, HP, IBM servers, Cisco networking, etc.). Unreliable software is not designed for failure and thus any fluctuations in the underlying hardware platform (including power and cooling) typically result in partial or system-wide outages. In order to deliver reliable service using unreliable software you need to use reliable hardware, typically employing lots of redundancy (dual power supplies, dual NICs, RAID arrays, etc.). In summary:

unreliable software
reliable hardware

Cloud computing platforms typically prefer to build reliability into the software such that it can run on cheap commodity hardware. The software is designed for failure and assumes that components will misbehave or go away from time to time (which will always be the case, regardless of how much you spend on reliability; the more you spend the lower the chance of failure, but it will never be zero). Reliability is typically delivered by replication, often in the background (so as not to impair performance). Multiple copies of data are maintained such that if you lose any individual machine the system continues to function, in the same way that losing a disk in a RAID array leaves the service uninterrupted. Large scale services will ideally also replicate data across multiple locations, such that if a rack, a row of racks or even an entire datacenter were to fail, the service would still be uninterrupted (see the sketch below). In summary:

reliable software
unreliable hardware
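
To make the replication idea concrete, here is a minimal sketch of a quorum write in the spirit of Cassandra-style stores. The Replica class, its simulated failure rate and the quorum of two are illustrative assumptions, not any particular product's API:

    import random

    # Each replica is a stand-in for a cheap commodity node that fails now and then.
    class Replica:
        def __init__(self, name, failure_rate=0.1):
            self.name = name
            self.failure_rate = failure_rate
            self.data = {}

        def write(self, key, value):
            if random.random() < self.failure_rate:  # simulate a flaky box
                raise IOError(f"{self.name} unavailable")
            self.data[key] = value

    # Reliability lives in software: tolerate individual failures, require a quorum.
    def replicated_write(replicas, key, value, quorum):
        acks = 0
        for replica in replicas:
            try:
                replica.write(key, value)
                acks += 1
            except IOError:
                pass  # an unreliable component went away; carry on
        if acks < quorum:
            raise IOError(f"only {acks} acks, needed {quorum}")
        return acks

    nodes = [Replica(f"node{i}") for i in range(3)]
    replicated_write(nodes, "user:42", "profile", quorum=2)  # succeeds if at most one simulated node fails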

When asked for a quote for Joe Weinman’s upcoming book, Cloudonomics: The Business Value of Cloud Computing, I said:

The marginal cost of reliable hardware is linear while the marginal cost of reliable software is zero.

That is to say, once you’ve written reliability into your software you can scale out with cheap hardware without spending more on reliability per unit, while if you’re using reliable hardware then each unit needs to include reliability (typically in the form of redundant components), which quickly gets very expensive.
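
A toy calculation makes the point (every number here is invented purely for illustration): say engineering software for failure is a one-off $500,000 cost, while hardware redundancy adds $6,000 to every $2,000 commodity server. Beyond roughly 84 machines the software approach is cheaper, and the gap widens linearly from there:

    # Illustrative arithmetic only; all prices are made-up assumptions.
    commodity_unit = 2_000          # cheap commodity server
    redundancy_premium = 6_000      # dual PSUs, dual NICs, RAID, support contract
    software_reliability = 500_000  # one-off cost of engineering for failure

    def reliable_hardware_cost(n):
        return n * (commodity_unit + redundancy_premium)  # marginal cost is linear

    def reliable_software_cost(n):
        return software_reliability + n * commodity_unit  # marginal reliability cost is zero

    for n in (10, 100, 1_000, 10_000):
        print(n, reliable_hardware_cost(n), reliable_software_cost(n))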
The other two permutations are ineffective:

Unreliable software on unreliable hardware gives an unreliable system. That’s why you should never try to install unreliable software like Microsoft Exchange, Lotus Notes, Oracle, etc. onto unreliable hardware like Amazon EC2:

unreliable software
unreliable hardware

Finally, reliable software on reliable hardware gives a reliable but inefficient and expensive system. That’s why you’re unlikely to see reliable software like Cassandra running on reliable platforms like VMware with brand name hardware:

reliable software
reliable hardware

Google enjoyed a significant competitive advantage for many years by using commodity components with a revolutionary proprietary software stack including components like the distributed Google File System (GFS). You can still see Google’s original hand-made racks built with motherboards laid on cork board at their Mountain View campus and the Computer History Museum (per image above), but today’s machines are custom made by ODMs and are a lot more advanced. Meanwhile Facebook have decided to focus on their core competency (social networking) and are actively commoditising “unreliable” web scale hardware (by way of the Open Compute Project) and software (by way of software releases, most notably the Cassandra distributed database, which is now used by services like Netflix).

The challenge for enterprises today is to adopt cheap reliable software so as to enable the transition away from expensive reliable hardware. That’s easier said than done, but my advice to them is to treat this new technology as another tool in the toolbox and use the right tool for the job. Set up cloud computing platforms like Cassandra and OpenStack and look for “low-hanging fruit” to migrate first, then deal with the recalcitrant applications once the “center of gravity” of your information technology systems has moved to cloud computing architectures.
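
As a taste of how little ceremony the software approach needs, here is a minimal sketch using the DataStax Python driver for Cassandra (the host addresses, keyspace name and ‘dc1’ datacenter label are placeholders for your own environment):

    from cassandra.cluster import Cluster

    # Connect to any reachable nodes; the driver discovers the rest of the ring.
    cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
    session = cluster.connect()

    # Three copies of every row: losing any one commodity node leaves the
    # service uninterrupted, much like losing one disk in a RAID array.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS app
        WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
    """)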

P.S. Before the server huggers get all pissy about my use of the term “relatively unreliable software”: running such software on reliable hardware is a perfectly valid way of achieving a reliable system, just not a cost effective one now that “relatively reliable software” is here.

Cloud computing’s concealed complexity

James Urquhart claims “Cloud is complex—deal with it”, adding that “If you are looking to cloud computing to simplify your IT environment, I’m afraid I have bad news for you” and citing his earlier CNET post, which drew analogies to a recent flash crash.

Cloud computing systems are complex, in the same way that nuclear power stations are complex — they also have catastrophic failure modes, but given that cloud providers rely heavily on their reputations, they go to great lengths to ensure continuity of service (I was previously the technical program manager for Google’s global tape backup program, so I appreciate this first hand). The best analogies to flash crashes are autoscaling systems making too many (or too few) resources available and spot price spikes, but these are isolated events and there are simple ways to mitigate the risk, such as DDoS protection and market limits (sketched below).
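
Both mitigations amount to putting hard limits in front of an open-ended feedback loop. A hypothetical sketch (the names and thresholds are illustrative, not any provider’s API):

    # Clamp autoscaling and cap spot bids; illustrative values throughout.
    MIN_INSTANCES, MAX_INSTANCES = 2, 50  # market-limit style bounds
    SPOT_PRICE_CAP = 0.10                 # never bid above this many $/hour

    def desired_capacity(current_load, load_per_instance):
        raw = -(-current_load // load_per_instance)         # ceiling division
        return max(MIN_INSTANCES, min(MAX_INSTANCES, raw))  # no runaway swings

    def place_spot_bid(current_spot_price):
        if current_spot_price > SPOT_PRICE_CAP:
            return None  # sit out the price spike rather than chase it
        return min(current_spot_price * 1.2, SPOT_PRICE_CAP)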

Fortunately this complexity is concealed behind well defined interfaces — indeed the term “cloud” itself comes from network diagrams in which complex interconnecting networks became the responsibility of service providers and were concealed by a cloud outline. Cloud computing is, simply, the delivery of information technology as a service rather than a product, and like other utility services there is a clear demarcation point (the first socket for telephones, the meter for electricity and the user or machine interface for computing).

Everything on the far side of the demarcation point is the responsibility of the provider, and users often don’t even know (nor do they need to know) how the services actually work — it could be an army of monkeys at typewriters for all they care. Granted it’s often beneficial to have some visibility into how the services are provided (in the same way that we want to know our phone lines are secure and power is clean), but we’ve developed specifications like CloudAudit to improve transparency.

Making simple topics complex is easy — what’s hard is making complex topics simple. We should be working to make cloud computing as approachable as possible, and drawing attention to its complexity does not further that aim. Sure there are communities of practitioners who need to know how it all works (and James is addressing that community via GigaOm), but consumers of cloud services should finally be able to apply information technology to business problems without unnecessary complexity.

If you find yourself using complex terminology or unnecessary acronyms (e.g. anything ending in *aaS), ask yourself whether you’re part of the problem rather than part of the solution.