12 October 2009

If it's dangerous it's NOT cloud computing

Having written something similar over the weekend myself (How Open Cloud could have saved Sidekick users' skins) I was getting ready to compliment Reuven Cohen on his latest post (really), but fear-mongering title aside (Cloud Computing is Dangerous) I was dismayed to see this:

"Let's call it what it is, it's a cloud app -- your data when using a Sidekick is hosted in some elses data center."

I simply cannot and will not accept this, and I'm not the only one:

Help me out here. I'm seeing really smart people I totally respect jump on this T-Mobile issue as a "Cloud" failure. Am I losing my mind?

Reuven: I'm disappointed that you feel this way, particularly as people (for better or worse) do actually listen to what you have to say. As such you owe it to the community you [unofficially] represent to think (or better yet, ask) before you speak on its behalf - what you consider "partly kidding" others take very seriously. I'd swear I spend half my life cleaning up after things like the Open Cloud Manifestation (though granted, if we all agreed from the outset we'd have nothing to talk about!).

For a start, Sidekicks predate cloud by half a dozen *years*, with the first releases back in 2001. Are we saying that they were so far ahead (like Google) that we just hadn't come up with a name for their technology yet? No. Is Blackberry cloud? No, it isn't either. This was a legacy n-tier Internet-facing application that catastrophically failed, as many such applications do. It was NOT cloud. As Alexis Richardson pointed out to Redmonk's James Governor, "if it loses your data - it's not a cloud".

While I know that this analogy is inconvenient for some vendors it works and it's the best we have: Cloud is resilient in the same way that the electricity grid is resilient. Power stations do fail and we (generally) don't hear about it. Similarly datacenters fail, get disconnected, overheat, flood, burn to the ground and so on, but these events should not cause any more than a minor interruption for end users. Otherwise how are they different from "legacy" web applications? Sure, occasionally we'll have cloud computing "blackouts" but we'll learn to live with them just as we do today when the electricity goes out.

As a more specific example, if an Amazon DC fails you'll lose your EC2 instances (the cost/performance hit of running lock-step across high latency links is way too high for live redundancy). However the virtual machine image itself should be automagically replicated across multiple geographically independent availability zones by S3 so it's just a case of starting them again. If you're using S3 directly (or Gmail for that matter) you should never need to know that something went wrong.
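The recovery path described above can be sketched as a toy model. This is only an illustration under stated assumptions: the zone names, the `ImageStore` and `Instance` classes and the `restart_elsewhere` helper are all hypothetical and are not Amazon's API. It simply shows a running instance being lost with its zone while the replicated image survives and is started again elsewhere.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Toy model only: zone names, classes and behaviour are illustrative
# assumptions, not any provider's real API.


@dataclass
class ImageStore:
    """Keeps a copy of each machine image in every zone it knows about."""
    zones: set[str]
    copies: dict[str, set[str]] = field(default_factory=dict)

    def put(self, image_id: str) -> None:
        # Replicate the image to all zones at write time.
        self.copies[image_id] = set(self.zones)

    def zone_lost(self, zone: str) -> None:
        # A whole zone disappears, taking its copies with it.
        self.zones.discard(zone)
        for held_in in self.copies.values():
            held_in.discard(zone)

    def available_in(self, image_id: str) -> set[str]:
        return self.copies.get(image_id, set())


@dataclass
class Instance:
    image_id: str
    zone: str


def restart_elsewhere(instance: Instance, store: ImageStore) -> Instance:
    """Start a fresh instance from a surviving copy of the same image."""
    survivors = store.available_in(instance.image_id)
    if not survivors:
        raise RuntimeError("no surviving copy of image " + instance.image_id)
    # Pick deterministically so the example is reproducible.
    return Instance(instance.image_id, min(survivors))


store = ImageStore(zones={"zone-a", "zone-b", "zone-c"})
store.put("web-server-image")
running = Instance("web-server-image", "zone-a")

store.zone_lost("zone-a")  # the running instance goes down with its zone
running = restart_elsewhere(running, store)
print(running.zone)  # zone-b
```

The instance itself is not kept alive across zones (that is the lock-step cost mentioned above); only the image is replicated, so recovery is a restart rather than a seamless failover.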

But Salesforce predates cloud by almost a decade you say? This data point was a thorn in my side until I found this article (Salesforce suffers gridlock as database collapses) and the associated Oracle press release (Salesforce.com’s 267,000 Subscribers To Go On Demand With Oracle® Grid). With wording like "one of its four data hubs collapsed" in what "appears to be a database cluster crash" I'm starting to question whether Salesforce really is as "cloudy" as they claim (and are assumed) to be. Indeed the URL I'm staring at as I use Salesforce.com now (https://na1.salesforce.com/home/home.jsp - note the "na1") would suggest that it is anything but. NA1 is one of half a dozen different data centers and their "cloud" only appears as a single point when you log in (http://login.salesforce.com/), at which time you are redirected to the one that hosts your data. Is it any wonder then that it's Google and Amazon that are topping the surveys now rather than Microsoft and Salesforce?
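To make the inference concrete, here is a minimal sketch of what "log in once, then get redirected to the instance that hosts your data" implies. It is an assumption about the architecture drawn from the URL, not a description of Salesforce's actual implementation: the lookup table, the user names and the "na2" instance are invented for illustration, and only the "na1" host and path come from the URL above.

```python
# Hypothetical sketch: the lookup table, users and "na2" are made up;
# only the "na1" host name and the path come from the post.

# Which instance holds each customer's data (invented assignments).
INSTANCE_FOR_USER = {
    "alice@example.com": "na1",
    "bob@example.com": "na2",
}


def redirect_after_login(username: str) -> str:
    """Return the URL of the one instance that hosts this user's data."""
    instance = INSTANCE_FOR_USER[username]
    return f"https://{instance}.salesforce.com/home/home.jsp"


def reachable(username: str, instances_down: set) -> bool:
    """A user is cut off whenever their single home instance is down."""
    return INSTANCE_FOR_USER[username] not in instances_down


print(redirect_after_login("alice@example.com"))
# https://na1.salesforce.com/home/home.jsp
print(reachable("alice@example.com", {"na1"}))  # False
print(reachable("bob@example.com", {"na1"}))    # True
```

If each customer really does live on exactly one instance, then losing that instance takes those customers offline while everyone else carries on, which is what the "one of its four data hubs collapsed" report describes.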

Don't get me wrong - Salesforce.com is a great company with a great product suite that I use and recommend every day. They may well be locked in to a legacy n-tier architecture but they do a great job of keeping it running at large scale and I almost can't believe it's not cloud. I see it as "Software. As a Service", bearing in mind that it's replacing some piece of software that traditionally would have run on the desktop by delivering it over the Internet via the browser. SaaS is, if anything, a subset of cloud and I'm sure that nobody here would suggest that any old LAMP application constitutes cloud. But we digress...

I honestly thought we had this issue resolved last year, having spent an inordinate amount of time discussing, blogging, writing Wikipedia articles and generally trying to extract sense (and consensus) from the noise. I was apparently wrong as even our self-appointed spokesman has foolishly conceded that what can only really be described as gross negligence in IT operations and a crass act of stupidity is somehow a failure of the cloud computing model itself. I agree completely with Chris Hoff in that "This T-Mobile debacle is a good thing. It will help further flush out definitions and expectations of Cloud. (I can dream, right?)" - it's high time for us to revisit and nail the issue of what is (and more importantly, what is not) cloud once and for all.


  1. So you have a definition of cloud computing under which things can't fail because they have cloud technology. Then every time a cloud app fails, you say they weren't as "cloudy" as they claimed.

    You're just moving shells around.

  2. No, I'm saying that cloud storage services are engineered to tolerate failure. Had this been an iPhone syncing with Google (ignoring for the sake of the argument that the iPhone maintains a fully fledged copy of the data and is not a "dumb terminal" like the Sidekick) and had the terrorists nuked a datacenter then I [hopefully] wouldn't have noticed - except for degraded performance, as is the failure mode for RAID today.


  3. "No, I'm saying that cloud storage services are engineered to tolerate failure."

    Oh, really? Take a completely new, mostly proprietary set of code managing a huge datacenter, and I'm supposed to -assume- these clouds are engineered to tolerate failure?

    The lesson of the Sidekick failure isn't a failure of "cloud computing", or of "bad backups", or of "old datacenters". As usual, everybody misses the real problem here. It's a disturbing reminder that reliability is completely dependent on people that you are hoping are running these networks correctly.

    That's where the Cloud services fail. The difference between a cloud and running my own server is that I know when my own server is being run correctly, because I can check everything on it and physically inspect and audit the datacenter it's placed in. The Cloud services promise that they do the same, but all I can do is trust them, because their process is completely opaque to me.

    People need to start demanding proof that these "Clouds" are being run correctly, and that's the hallmark difference between good engineers that know how servers work and fat nerds that jump on the hype bandwagon, becoming apologists for big companies that I hope are receiving bribes for their blind and unquestioning loyalty.

    As for your comment on "occasional blackouts", we run millions of dollars through our company. Our servers should have NO blackouts, at all. With a good server cluster and a real datacenter with generators and redundant internet connections, this is a very achievable goal.

  4. There's one other part that defines cloud: a virtual OS that abstracts the real OS and hardware. It's not just about not failing, but it's how it's able to prevent failure.

    (directed at the "moving shells around" comment)

  5. @Kyle: I guess you missed the part where I explained that Sidekick was NOT cloud. Nor am I defending any provider in particular, rather the cloud computing model. Also see the previous post where I explained that with Open Cloud (Open APIs + Open Formats) you can and should do more than trust your provider - you can maintain your own copies on another cloud or internally if you prefer. And as for your company paying "good" engineers good money to run a "good" server cluster in a "real datacenter" - go for your life but don't bitch to me when some competitor rips you a new one because of their operational efficiency and agility advantages.

    @AC: I think you meant tolerate/deal with failure rather than prevent it (which I agree is impossible).

  6. Sam notes:

    "but all I can do is trust them"

    and therein lies the whole value argument. For me this is the key statement that the cloud vendors have to wrestle with: they need to figure out how to answer it in a way that makes sense and makes comfortable reading for cloud users.

    The answer does not take the form of an SLA, which is just a dressed up legal letter to protect themselves when the worst actually does happen.

    Why should I trust the bookseller/search-engine with my IT infrastructure?

  7. Glad to see you taking Ruv to task for miscategorizing Sidekick's massive fail as a cloud computing issue.

    I'm waiting for Microsoft's official explanation of what went wrong, as are thousands (millions?) of others.


  8. Sam:

    I think cloud has great promise. I have been a very satisfied Amazon EC2 customer for about six months.

    However there doesn't seem to be any way to verify the claims you make about EC2 recovery time in the event of failure. There is no way to test it directly, there is no architecture description detailed enough to assess whether the architecture supports the claims made for recovery time and the data loss window, and there is nothing in the terms of service.

  9. I think the explanation in this post has its heart in the right place, but is fundamentally wrong.

    The correct way to say it is "Sidekick wasn't using cloud computing. They were just an ordinary company hosting their data the ordinary way, and they lost it."

    Cloud Computing is designed to handle failure better, but saying "if it failed it's not a cloud" is silly.

    I also think that the term "Cloud" has come to have different meanings to different groups: the technical meaning ("Amazon Cloud Computing") and the non-technical meaning ("my data is in the cloud because Apple/Google/whoever stores my data"). This is just like the word "Hacker", which means different things to different people.

  10. Trust is, and will continue to be, an important issue for cloud computing's business model, and its proponents will do themselves no favors if they act dismissively about it. An excessive preoccupation with definitions will look like ivory-tower irrelevance, and building an argument around a useless tautology is one way to lead customers into feeling that you don't take their concerns seriously, even though you do.

  11. The Sidekick failure is an example of something that bothered me a whole lot when I had a dumb cell phone: what would happen to all my contacts if I lost the phone? Luckily I never found out, given that I did not have a backup made with a data cable. Now I have both a local backup (on the computer) and a remote one (AT&T) for my iPhone. One should have both a local and a remote backup no matter what shape the remote backup takes, be it cloud, SaaS or mailing DVDs to your mom.

  12. The Titanic was unsinkable. Therefore the boat that hit the iceberg and sank was obviously not the Titanic.

  13. @AC: At which point did I argue that cloud is infallible? What I said was that it was "not dangerous", in response to a claim that it was. Microsoft appear to have lost all data for their entire user base - *that* is dangerous. Could it happen to Google or Amazon? Sure. Will it? I doubt it. That doesn't mean users shouldn't take precautions, such as having a local copy.


  14. Thank heavens we have people like you that keep trying to bring logic, reason and science to all the marketing and bandwagon hype that makes it difficult to move forward with real IT progress and solutions. Your analogy of the PowerGrid is spot on, and as we create open standards for cloud, "outages" will be less and less an issue for public media whores, and more of just a daily grind task for IT administrators.

