Leaving Equinix


It’s been three years to the day since my last post — a side effect of my being completely immersed in my job at Equinix (where I was, until last week, Director of Cloud & IT Services). I’ve been based in Zürich and working ostensibly in London for the past 5 years (having spent the last decade in Europe, and probably a year of it in Silicon Valley), though in reality I’ve spent most of my time on the road — according to TripIt I’ve traveled almost a million kilometres to almost 200 locations, be it to visit partners, customers, attend & present at events, or work with colleagues in other offices, as well as the occasional holiday. Here’s hoping I’ll be able to be more grounded for the coming years (though if the last week is any indication I’m not so sure)!

When non-technical people ask me who Equinix is (Americans often confuse it with Equinox, the gyms — maybe they’ll tie up one day so your treadmill will power the data centres?), I tell them they’re essentially the “landlord of the Internet”. That’s not entirely true — there are a number of carrier-neutral, multi-tenant data center providers in the market — but it’s understandable, and few can hold a candle to Equinix’s quality, scale, global reach, and (arguably most importantly), business ecosystems. Another analogy I use is the “Hilton of the Internet”, where companies wishing to participate can rent a room, meet each other at the “lobby bar” (regional events and Marketplace), and communicate over the “phone system” (Internet exchanges). Chairman Peter Van Camp refers to the data centres (“International Business Exchanges” or “IBXs”) as “international airports where passengers from many different airlines make connections to get to their final destinations”. You get the idea.

As the Internet developed, Equinix founders Jay Adelson and the late Al Avery identified a need for a neutral location for carriers to connect together — the Switzerland of the Internet if you like. Over the 15+ years since it was founded in 1998, Equinix has grown from its first location in the USA to a global footprint of 105 data centres in 33 metros (cities) in 15 countries spanning 5 continents (by the time you read this they may have many more thanks to the acquisition of Telecity which will basically double the size of the EMEA region). The company usually expands through acquisition or by building new data centres, typically following a “metro” model whereby an accessible (but not necessarily central) location is chosen for a “campus” of data centres (London for example now has 6 data centres, half of which are on the same road in Slough). Recent builds look something like these:

Amsterdam AM3


Melbourne ME1

Having established a critical mass of network service providers, Equinix IBXs became attractive to early content providers like Yahoo! They needed to reach the eyeballs which were connected to the carriers (at the time, typically by dialup or ADSL services), and tapping into many carriers in one location was far simpler than arranging to connect to each of those carriers wherever they happened to be. Furthermore, the carriers themselves needed to connect to each other (that’s the “inter” in “internet”), and they found it easier to do so in a neutral location rather than on their own turf.

Equinix went on to establish similar ecosystems around the financial industry, where trading exchanges (like Internet exchanges) would act as magnets for high frequency traders, news providers, etc. — there are now thriving financial hubs in 16 Equinix metros. More recent ecosystems include advertising, whereby a content provider could ask — in the milliseconds it takes to render a page — for advertisers to bid on ad placements. While light travels quickly, over long distances it can significantly impair the performance of an application (plus it travels slower inside glass fibres), and for these applications there’s no prize for second! The most interesting ecosystem though (in my somewhat biased opinion anyway) is the cloud ecosystem. By chance many of the content providers of yesterday (Amazon, Google, Microsoft) transformed into the cloud providers of today, and I think it’s safe to say now that Equinix is the “home of the cloud” (a term I introduced in 2011, albeit somewhat aspirational at the time).
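To put rough numbers on the distance penalty, here is a minimal sketch of propagation delay over fibre. The refractive index of about 1.47 and the example distances are my assumptions rather than measurements, and the calculation ignores switching, queuing and the fact that cables rarely run in a straight line.

```python
# Rough propagation delay for light travelling through optical fibre.
# Assumptions (not measured values): a refractive index of ~1.47 for glass
# fibre, a perfectly straight cable run, and no switching or queuing delay.

SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # speed of light in a vacuum
FIBRE_REFRACTIVE_INDEX = 1.47          # assumed typical value for glass fibre


def one_way_latency_ms(distance_km, refractive_index=FIBRE_REFRACTIVE_INDEX):
    """Milliseconds for light to cover distance_km inside the fibre."""
    speed_in_fibre = SPEED_OF_LIGHT_KM_PER_S / refractive_index
    return distance_km / speed_in_fibre * 1000.0


def round_trip_ms(distance_km, refractive_index=FIBRE_REFRACTIVE_INDEX):
    """Milliseconds for a request and its reply over the same distance."""
    return 2 * one_way_latency_ms(distance_km, refractive_index)


# Hypothetical distances: within a metro, across a continent, between continents.
for label, km in [("same metro", 50), ("cross-continent", 4000), ("intercontinental", 10000)]:
    print(f"{label:>16}: {round_trip_ms(km):7.2f} ms round trip")
```

Under these assumptions a round trip inside a metro costs about half a millisecond while an intercontinental one approaches 100 ms, which is why an ad auction that has to finish in the time it takes to render a page wants the bidders in the same building.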

When I joined Equinix the only way to access cloud providers was over the Internet, or by special arrangement (typically only available to the largest customers like Netflix). This was a problem for most enterprise consumers, and indeed 8 of the top 10 blockers for cloud adoption according to analysts are partially or fully addressed by bypassing the Internet. We first launched AWS Direct Connect with Amazon that year, and I proposed that the process should be more automated (at the time it required filling out paperwork and waiting for someone in the data centre to run a fibre from your infrastructure to a port you had to rent in Amazon’s). The solution proposed by product was a box of robots, and while I was no stranger to boxes of robots from my time at Google, I was convinced we could do better. Here’s the back-of-an-envelope blueprint I submitted in my first month in the company, which (following years of research and development by the CTO office and product teams) essentially went on to become the Equinix Cloud Exchange (I called it CloudConnect at the time, but there were trademark issues):

Cloud Exchange

This hybrid- and multi-cloud architecture allows customers to seamlessly integrate legacy/on-premises, hosted private, and public cloud infrastructure, and I believe it (or something like it) will be the “default” reference architecture for most enterprises in future. Anyone can automate a switch fabric though — indeed a number of competitors have (we even had something like it at UNSW ~20 years ago which would allow you to put any port anywhere on campus onto any network, via a web interface, using the same standards no less!). What differentiates Equinix’s is the presence of hundreds of cloud providers, including all of the top providers in the market today (thanks in no small part to the tireless efforts of the GAM and CAT teams).

Enterprise CIOs should look at the data centre as an operating system, only rather than installing best-of-breed applications like Office and Photoshop, they simply connect to services like Office 365 and AWS (after all, cloud is simply the migration from product to service). Alternatively I often use the shopping mall analogy, only rather than visiting to buy products from a store (like Apple), you’re buying from a service provider (like Apple).

Anyway, having spent the past decade on the provision of services at Citrix, Google, and Equinix, I’m hanging up my Equinix hat and getting to work on the consumption and application of information technology to solving business problems (among other things). Watch this space.

HTTP2 Expression of Interest

Here’s my (rather rushed) personal submission to the Internet Engineering Task Force (IETF) in response to their Call for Expressions of Interest in new work around HTTP; specifically, a new wire-level protocol for the semantics of HTTP (i.e., what will become HTTP/2.0), and new HTTP authentication schemes. You can also review the submissions of Facebook, Firefox, Google, Microsoft, Twitter and others.

[The views expressed in this submission are mine alone and not (necessarily) those of Citrix, Google, Equinix or any other current, former or future client or employer.]

My primary interest is in the consistent application of HTTP to (“cloud”) service interfaces, with a view to furthering the goals of the Open Cloud Initiative (OCI); namely widespread and ideally consistent interoperability through the use of open standard formats and interfaces.

In particular, I strongly support the use of the existing metadata channel (headers) over envelope overlays (SOAP) and alternative/ancillary representations (typically in JSON/XML) as this should greatly simplify interfaces while ensuring consistency between services. The current approach to cloud “standards” calls on vendors to define their own formats and interfaces and to maintain client libraries for the myriad languages du jour. In an application calling on multiple services this can result in a small amount of business logic calling on various bulky, often poorly written and/or unmaintained libraries. The usual counter to the interoperability problems this creates is to write “adapters” (à la ODBC) which expose a lowest-common-denominator interface, thus hiding functionality and creating an “impedance mismatch”. Ultimately this gives rise to performance, security, cost and other issues.

By using HTTP as intended it is possible to construct (cloud) services that can be consumed using nothing more than the built-in, standards compliant HTTP client. I’m not writing to discuss whether this is a good idea, but to expose a use case that I would like to see considered, and one that we have already applied with an amount of success in the Open Cloud Computing Interface (OCCI).

To illustrate the original scope, early versions of HTTP (RFC 2068) included not only the Link header (recently revived by Mark Nottingham in RFC 5988) but also LINK and UNLINK verbs to manipulate it (recently proposed for revival by James Snell). Unfortunately hypertext, and in particular HTML (which includes linking in-band rather than out-of-band) arguably stole HTTP’s thunder, leaving the overwhelming majority of formats that lack in-band linking (images, documents, virtual machines, etc.) high and dry and resulting in inconsistent linking styles (HTML vs XML vs PDF vs DOC etc.). This limited the extent of web linking as well as the utility of HTTP for innovative applications including APIs. Indeed HTTP could easily and simply meet the needs of many “Semantic Web” applications, but that is beyond the scope of this particular discussion.

To illustrate by way of example, consider the following synthetic request/response for an image hosting site which incorporates Web Linking (RFC 5988), Web Categories (draft-johnston-http-category-header) and Web Attributes (yet to be published):

GET /1.jpg HTTP/1.0

HTTP/1.0 200 OK
Content-Length: 69730
Content-Type: image/jpeg
Link: <http://creativecommons.org/licenses/by-sa/3.0/>; rel="license"
Link: </2.jpg>; rel="next"
Category: dog; label="Dog"; scheme="http://example.org/animals"
Attribute: name="Spot"

In order to “animate” resources, consider the use of the Link header to start a virtual machine in the Open Cloud Computing Interface (OCCI):

Link: </compute/123;action=start>; rel="http://schemas.ogf.org/occi/infrastructure/compute/action#start"
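To show how little a client needs in order to consume links from the metadata channel, here is a minimal sketch of a parser for Link values like the ones above, written in the angle-bracket form RFC 5988 specifies. The function name `parse_link` is mine, and the sketch assumes one link per header value with no RFC 2231 encoded (`param*=`) parameters, so treat it as an illustration rather than a conforming implementation.

```python
import re

# Minimal parser for a single Link header value in the RFC 5988 style:
#   <target>; param="value"; other=value
# Assumptions: one link per value (no comma-separated lists) and no
# RFC 2231 encoded parameters. A real client needs to handle both.

_LINK_RE = re.compile(r'^\s*<([^>]*)>(.*)$')
_PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^;\s]*))')


def parse_link(value):
    """Split a Link header value into its target and a dict of parameters."""
    match = _LINK_RE.match(value)
    if match is None:
        raise ValueError(f"not a Link header value: {value!r}")
    target, rest = match.groups()
    params = {}
    for name, quoted, bare in _PARAM_RE.findall(rest):
        # Keep the first occurrence of each parameter name.
        params.setdefault(name.lower(), quoted if quoted else bare)
    return {"target": target, "params": params}


headers = [
    '<http://creativecommons.org/licenses/by-sa/3.0/>; rel="license"',
    '</2.jpg>; rel="next"',
    '</compute/123;action=start>; '
    'rel="http://schemas.ogf.org/occi/infrastructure/compute/action#start"',
]

# Index the targets by relation type so a client can follow "next" or
# discover the "start" action without parsing the body at all.
by_rel = {}
for header in headers:
    link = parse_link(header)
    by_rel[link["params"]["rel"]] = link["target"]

print(by_rel["next"])
```

Note that the semicolon inside the OCCI target is preserved because everything between the angle brackets is treated as the target before any parameters are read.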

The main objection to the use of the metadata channel in this fashion (beyond the application of common sense in determining what constitutes data vs metadata) is implementation issues (arbitrary limitations, i18n, handling of multiple headers, etc.), which could be largely resolved through specification. For example, the (necessary) use of RFC 2231 encoding for header values (but not keys) in RFC 5988 Web Linking gives rise to unnecessary complexity that may lead to interoperability, security and other issues, which could be resolved through the specification of Unicode for keys and/or values. Another concern is the absence of features such as a standardised ability to return a collection (e.g. multiple responses). I originally suggested that HTTP 2.0 incorporate such ideas in 2009.

I’ll leave the determination of what would ultimately be required for such applications to the working group (should this use case be considered interesting by others), and while better support for performance, scalability and mobility are obviously required this has already been discussed at length. I strongly support Poul-Henning Kamp’s statement that “I think it would be far better to start from scratch, look at what HTTP/2.0 should actually do, and then design a simple, efficient and future proof protocol to do just that, and leave behind all the aggregations of badly thought out hacks of HTTP/1.1.” (and agree that we should incorporate the concept of a “HTTP Router”) as well as Tim Bray’s statement that: “I’m inclined to err on the side of protecting user privacy at the expense of almost all else” (and believe that we should prevent eavesdroppers from learning anything about an encrypted transaction; something we failed to do with DNSSEC even given alternatives like dnscurve that ensure confidentiality as well as integrity).

Leaving Google+

Ironically many Google employees have even given up on Google+
(though plenty still post annoying “Moved to Google+” profile pics on other social networks)

One of those sneaky tweets that links to Google+ just tricked me into wading back into the swamp that it’s become, hopefully for the last time (I say “hopefully” because in all likelihood I’ll be forced back onto it at some point — it’s already apparently impossible to create a Google Account for any Google services without also landing yourself a Google+ profile and Gmail account and it’s very likely that the constant prompting for me to “upgrade” to Google+ will be more annoying than the infamous red notification box). Here’s what I saw in my stream:

  • 20 x quotes/quotepics/comics
  • 8 x irrelevant news articles & op-eds
  • 1 x PHP code snippet
  • 3 x blatant ads
  • 2 x Google+ fanboi posts (including this little chestnut: “Saying nobody uses Google+ is like a virgin saying sex is boring. They’ve never actually tried it.” — you just failed at life by comparing Google+ to sex my friend).
  • 2 x random photos

That’s pretty much 0% signal and 100% noise, and before you jump down my throat about who I’m following, it’s a few hundred generally intelligent people (though I note it is convenient that the prevalent defense for Google+ being a ghost town, or worse, a cesspool, is that your experience depends not only on who you’re following, but what they choose to share with you — reminds me of the kind of argument you regularly hear from religious apologists).

Google+ Hangouts

My main gripe with Google+ this week though was the complete failure of Google+ Hangouts (which should arguably be an entirely separate product) for Rishidot Research’s Open Conversations: Cloud Transparency on Monday. The irony of holding an open/transparency discussion on a closed platform aside, we were plagued with technical problems from the outset. First it couldn’t find my MacBook Air’s camera so I had to move from my laptop to my iMac (which called for heavy furniture to be moved to get a clean background). When I joined we started immediately (albeit late, and sans 2-3 of the half dozen attendees), but it wasn’t long before one of the missing attendees joined and repeatedly interrupted the first half of the meeting with audio problems. The final attendee never managed to join, though their name and a blank screen appeared each of the 5-10 times they tried. We then inexplicably lost two attendees, and by the time they managed to re-join I too got a “Network failure for media packets” error:

Then there was “trouble connecting with the plugin”, which called for me to refresh the page and then reinstall the plugin:

Eventually I made it back in, only to discover that we had now lost the host(!?!) and before long it was down to just me and one other attendee. We struggled through the last half of the hour but it was only afterwards that we discovered we were talking to ourselves because the live YouTube stream and recording stopped when the host was kicked out. Needless to say, Google+ Hangouts are not ready for prime time, and if you invite me to join one then don’t be surprised if I refer you to this article.

Hotel California

To leave Google+ head over to Google Takeout and download your Circles (I grabbed data for other services too for good measure, and exported this blog separately since my profile is now Google+ integrated). You might want to see who’s following you, Actions->Select All and dump them into a circle first, otherwise you’ll probably lose that information when you close your account.

When you go to the Google+ “downgrade” page and select “Delete your entire Google profile” you’ll get a warning complicated enough to scare most people back into submission, but the most concerning part for me was this unhelpful help advising “Other Google products which require a profile will be impacted”:

Fortunately for YouTube and Blogger at least you can check and revert your decision to use a Google+ profile respectively, but you’ll immediately be told to “Connect to Google+” once you unplug:

After that it’s just a case of checking “I understand that deleting this service can’t be undone and the data I delete can’t be restored.” and clicking “Remove selected services” (what “selected services”? I just want to be rid of Google+!). I’ll let you know how that goes once my friends on Google+ have had a chance to read this.

Getting started with OpenStack in your lab

Having recently finished building my new home lab I wanted to put the second server to good use by installing OpenStack (the first is running VMware ESXi 5.0 with Windows 7, Windows 8, Windows 8 Server and Ubuntu 12.04 LTS virtual machines). I figured many of you would benefit from a detailed walkthrough so here it is (without warranty, liability, support, etc).

The two black boxes on the left are HP ProLiant MicroServer N36Ls with modest AMD Athlon(tm) II Neo 1.3GHz dual-core processors and 8GB RAM, and the one on the right is an iomega ix4-200d NAS box providing 8TB of networked storage (including over iSCSI for ESXi should I run low on direct attached storage). There’s a 5 port gigabit switch stringing it all together and a 500Mbps CPL (powerline) device connecting it back up the house. You should be able to set all this up inside 2 grand. Before you try to work out where I live, the safe is empty as I don’t trust electronic locks.

IMG 1198

Download Ubuntu Server (12.04 LTS or the latest long term support release) and write it to a USB key (if you’re a Mac OS X only shop then you’ll want to follow these instructions). Boot your server with the USB key inserted and it should drop you straight into the installer (if not you might need to tell the BIOS to boot from USB by pressing the appropriate key, usually F2 or F10, at the appropriate time). Most of the defaults are OK but you’ll probably want to select the “OpenSSH Server” option in tasksel unless you want to do everything from the console; be sure to tighten up the default configuration if you care about security. If you don’t enjoy mundane admin tasks you might want to enable automatic updates too. Even so, let’s ensure any updates since release have been applied:

sudo apt-get update
sudo apt-get -u upgrade

Next you’ll want to install DevStack (“a documented shell script to build complete OpenStack development environments from Rackspace Cloud Builders”), but first you’ll need to get git:

sudo apt-get install git

Now grab the latest version of DevStack from GitHub:

git clone git://github.com/openstack-dev/devstack.git

And run the script:

cd devstack/; ./stack.sh

The first thing it will do is ask you for passwords for MySQL, Rabbit, a SERVICE_TOKEN and SERVICE_PASSWORD and finally a password for Horizon & Keystone. I used the (excellent) 1Password to generate passwords like “sEdvEuHNNeA7mYJ8Cjou” (the script doesn’t like special characters) and stored them in a secure note.

The script will then go and download dozens of dependencies, which are conveniently packaged by Ubuntu and/or the upstream Debian distribution, run setup.py for a few python packages, clone some repositories, etc. While you wait you may as well go read the script to understand what’s going on. At this point the script failed because /opt/stack/nova didn’t exist. I filed bug 995078 but the script succeeded when I ran it for a second time — looks like it may have been a glitch with GitHub.

You should end up with something like this:

Horizon is now available at
Keystone is serving at
Examples on using novaclient command line is in exercise.sh
The default users are: admin and demo
The password: qqG6YTChVLzEHfTDzm8k
This is your host ip:
stack.sh completed in 431 seconds.

If you browse to that address you’ll be able to log in to the console:

Openstack login

That will drop you into the Admin section of the OpenStack Dashboard (Horizon) where you can get an overview and administer instances, services, flavours, images, projects, users and quotas. You can also download OpenStack and EC2 credentials from the “Settings” pages.

Openstack console

Switch over to the “Project” tab and “Create Keypair” under “Access & Security” (so you can access any instances you create):

Openstack keygen

The key pair will be created and downloaded as a .pem file (e.g. admin.pem).

Now select “Images & Snapshots” under “Manage Compute” and you’ll be able to launch the cirros-0.3.0-x86_64-uec image which is included for testing. Simply click “Launch” under “Actions”:

Openstack project

Give it a name like “Test”, select the key pair you created above and click “Launch Instance”:

Openstack launch

You’ll see a few tasks executed and your instance should be up and running (Status: Active) in a few seconds:

Openstack spawning

Now what? First, try to ping the running instance from within the SSH session on the server (you won’t be able to ping it from your workstation):

$ ping
PING ( 56(84) bytes of data.
64 bytes from icmp_req=1 ttl=64 time=0.734 ms
64 bytes from icmp_req=2 ttl=64 time=0.585 ms
64 bytes from icmp_req=3 ttl=64 time=0.588 ms

Next let’s copy some EC2 credentials over to our user account on the server so we can use the command line euca-* tools. Go to “Settings” in the top right and then the “EC2 Credentials” tab. Now “Download EC2 Credentials”, which come in the form of a ZIP archive containing an X.509 certificate (cert.pem) and key (pk.pem) pair as well as a CA certificate (cacert.pem) and an rc script (ec2rc.sh) to set various environment variables which tell the command line tools where to find these files:

Openstack ec2

While you’re at it you may as well grab your OpenStack Credentials which come in the form of an rc script (openrc.sh) only. It too sets environment variables which can be seen by tools running under that shell.

Openstack rc

Let’s copy them (and the key pair from above) over from our workstation to the server:

scp b34166e97765499b9a75f59eaff48b98-x509.zip openrc.sh admin.pem samj@

Stash the EC2 credentials in ~/.euca:

mkdir ~/.euca; chmod 0700 ~/.euca; cd ~/.euca
cp ~/b34166e97765499b9a75f59eaff48b98-x509.zip ~/.euca; unzip *.zip

Finally let’s source the rc scripts:

source ~/.euca/ec2rc.sh
source ~/openrc.sh

You’ll see the openrc.sh script asks you for a password. Given this is a dev/test environment and we’ve used a complex password, let’s modify the script and hard code the password by commenting out the last 3 lines and adding a new one to export OS_PASSWORD:

# With Keystone you pass the keystone password.
#echo "Please enter your OpenStack Password: "

You probably don’t want anyone seeing your password or key pair so let’s lock down those files:

chmod 0600 ~/openrc.sh ~/admin.pem

Just make sure the environment variables are set correctly:

echo $EC2_USER_ID
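If you would rather check everything in one go, here is a small sketch that reports which variables are still missing. Only EC2_USER_ID and OS_PASSWORD come from this walkthrough; the other names are my assumptions about what ec2rc.sh and openrc.sh export, so adjust the list to match your own files. Run it from the same shell in which you sourced the rc scripts.

```python
import os

# EC2_USER_ID and OS_PASSWORD appear in the walkthrough above. The remaining
# names are assumptions about what ec2rc.sh and openrc.sh export and may
# differ in your copies of those scripts.
EXPECTED_VARIABLES = [
    "EC2_USER_ID",
    "OS_PASSWORD",
    "EC2_ACCESS_KEY",  # assumed
    "EC2_SECRET_KEY",  # assumed
    "EC2_URL",         # assumed
    "OS_USERNAME",     # assumed
    "OS_AUTH_URL",     # assumed
]


def missing_variables(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_variables(EXPECTED_VARIABLES)
if missing:
    print("Not set: " + ", ".join(missing))
else:
    print("All expected variables are set.")
```

A variable only shows up here if the rc script exported it, since a plain shell assignment is not passed on to child processes.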

Finally we should be able to use the EC2 command line tools:

RESERVATION r-8wvdh1c7 b34166e97765499b9a75f59eaff48b98 default
INSTANCE i-00000001 ami-00000001 test test running None (b34166e97765499b9a75f59eaff48b98, ubuntu) 0 m1.tiny 2012-05-05T13:59:47.000Z nova aki-00000002 ari-00000003 monitoring-disabled instance-store

As well as the openstack command:

openstack list server
| ID | Name | Status | Networks |
| 44a43355-7f95-4621-be61-d34fe53e50a8 | Test | ACTIVE | private= |

You should be able to ssh to the running instance using the IP address and key pair from above:

ssh -i admin.pem -l cirros
$ uname -a
Linux cirros 3.0.0-12-virtual #20-Ubuntu SMP Fri Oct 7 18:19:02 UTC 2011 x86_64 GNU/Linux

That’s all for today — I hope you find the process as straightforward as I did and if you do follow these instructions then please leave a comment below (especially if you have any tips or solutions to problems you run into along the way).

Is carrying an iPhone worth the risk?

Update: It appears that Apple have resolved the issue with the September launch of iOS 7, essentially by implementing what I suggested below (highlighted):

Find my iphone
Yesterday I was robbed of my brand new iPhone (S/N: DNPGQ4RDDTDM IMEI: 013032008785006 ) for the second time, in public, in Paris. While I’m still a little shaken, angry and disappointed, I’m glad everyone survived unscathed… this time (last time I was assaulted in the process).

These less fortunate victims of crime lost their lives over iPhones, in the course of a robbery, in trying to retrieve the stolen device and as an innocent bystander respectively:

The latter story (around this time last year), in which a 68-year-old woman was pushed down a flight of stairs in a Chicago subway station by the fleeing thief only to die later of head injuries, is almost identical to a robbery in Paris in which a young woman also died of head injuries only weeks prior:

Paris police data from that period showed that 53 percent of 1,071 violent thefts on Paris public transport involved smartphones, and the last two models of iPhones accounted for almost 28 percent of items stolen on public transport. The Interior Minister was at the time seeking faster efforts to allow smartphone owners to “block” stolen phones, disabling calling functions to make them worthless in the resale market as a deterrent to theft. “It will be naturally much less attractive” to steal a phone that can be de-activated remotely, he noted, adding that “we have the technical means to deter thieves”. And yet the grey market for iPhones is obviously still alive and well some 18 months later, in no small part because the parties with the capability to solve the problem (carriers, manufacturers, etc.) lack the interest (stolen phones drive new sales).

This brings me to the point of this post — finding a technical solution to solve the problem once and for all. Indeed, if a smartphone can be “bricked” then its resale value is severely limited. Most efforts today involve blacklisting the IMEI number such that the phone cannot be used on the networks in that country, but this usually takes time as it has to be done securely (typically by the operator from which it was purchased, and only after receiving a police report — too bad for those of us who purchase outright from a retailer!). A few days is long enough for the thief to sell the phone, only to have the buyer find it stop working some time later, thus creating another victim of crime (albeit someone guilty of receiving stolen goods, and in doing so driving demand!). Unless the database is global (which gives rise to other problems including distributed trust, denial of service, duplicated IMEIs, equipment limitations, etc.) then the thief can just sell it into another market, especially here in Europe, or swap it.

Enter Apple, who already have (and heavily advertise) the capability to securely locate, message and wipe the device (should it be able to reach the Internet — too bad if you’re roaming and have data disabled, and care about security and have auto join networks disabled, as I did!). Their trivial restore process (which makes iPhones extremely, and I would argue unnecessarily, transferable) also apparently involves a handshake with Apple servers, so who better to “brick” stolen devices by preventing them from being restored until returned? This would make it essentially impossible for anyone but the legitimate owner of the device to make use of it, thereby destroying the market and going from the most attractive to least attractive smartphone for thieves overnight. Sure you could argue that it’s not their problem, but unlike the police they have the capability (and I would argue the interest) to put an end to it once and for all.

I for one will be seriously reconsidering the cost vs benefit of carrying a device that others value more than my own life, and I’m sure that the benefit of a “Remote Disable” function in competitive advantage would outstrip the profit from replacement of stolen devices, so it’s not just about doing the right thing.

Update: Brian Katz points out that the thief need only enter the wrong PIN 10 times and then the iPhone will factory reset itself (depending on settings), no need for iTunes restore!

P.S. Here’s some advice on protecting your iPhone as well as some tips for avoiding pickpockets in Paris from TripAdvisor and the US Embassy.

Simplifying cloud: Reliability

The original Google server rack

Reliability in cloud computing is a very simple concept which I’ve explained in many presentations but never actually documented:

Traditional legacy IT systems consist of relatively unreliable software (Microsoft Exchange, Lotus Notes, Oracle, etc.) running on relatively reliable hardware (Dell, HP, IBM servers, Cisco networking, etc.). Unreliable software is not designed for failure and thus any fluctuations in the underlying hardware platform (including power and cooling) typically result in partial or system-wide outages. In order to deliver reliable service using unreliable software you need to use reliable hardware, typically employing lots of redundancy (dual power supplies, dual NICs, RAID arrays, etc.). In summary:

unreliable software
reliable hardware

Cloud computing platforms typically prefer to build reliability into the software such that it can run on cheap commodity hardware. The software is designed for failure and assumes that components will misbehave or go away from time to time (which will always be the case, regardless of how much you spend on reliability – the more you spend the lower the chance but it will never be zero). Reliability is typically delivered by replication, often in the background (so as not to impair performance). Multiple copies of data are maintained such that if you lose any individual machine the system continues to function (in the same way that if you lose a disk in a RAID array the service is uninterrupted). Large scale services will ideally also replicate data in multiple locations, such that if a rack, row of racks or even an entire datacenter were to fail then the service would still be uninterrupted. In summary:

reliable software
unreliable hardware
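To see why replication on cheap machines works, here is a minimal sketch of the arithmetic. It assumes each machine fails independently with the same probability during whatever window matters to you, and the 5% figure is made up purely for illustration.

```python
# Probability that every replica of a piece of data is unavailable at once.
# Assumptions: machines fail independently and with the same probability
# during the window of interest. The 5% figure below is invented.

def loss_probability(machine_failure_probability, replicas):
    """Chance that all replicas are lost together under the assumptions above."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    if not 0.0 <= machine_failure_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return machine_failure_probability ** replicas


for replicas in (1, 2, 3):
    chance = loss_probability(0.05, replicas)
    print(f"{replicas} replica(s): {chance:.6f} chance of losing all copies")
```

Each extra copy multiplies the chance of total loss by the failure probability again, so it shrinks quickly but never reaches zero. In practice failures are not fully independent (a rack or a whole datacenter can go at once), which is exactly why large services also replicate across locations.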

Asked for a quote for Joe Weinman’s upcoming Cloudonomics: The Business Value of Cloud Computing book, I said:

The marginal cost of reliable hardware is linear while the marginal cost of reliable software is zero.

That is to say, once you’ve written reliability into your software you can scale out with cheap hardware without spending more on reliability per unit, while if you’re using reliable hardware then each unit needs to include reliability (typically in the form of redundant components), which quickly gets very expensive.
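Here is a back-of-an-envelope sketch of that claim. Every figure in it is hypothetical: a commodity server price, a per-server premium for redundant components, and a one-off engineering cost for writing reliability into the software.

```python
# Toy cost model for the quote above. All figures are hypothetical.
COMMODITY_UNIT_COST = 2_000           # assumed price of a cheap server
REDUNDANCY_PREMIUM = 3_000            # assumed extra per server for redundant parts
ONE_OFF_ENGINEERING_COST = 1_000_000  # assumed cost of building reliable software


def reliable_hardware_cost(units):
    """Every unit carries its own reliability premium."""
    return units * (COMMODITY_UNIT_COST + REDUNDANCY_PREMIUM)


def reliable_software_cost(units):
    """Reliability is paid for once, then each unit is plain commodity kit."""
    return ONE_OFF_ENGINEERING_COST + units * COMMODITY_UNIT_COST


def break_even_units():
    """Fleet size at which the two approaches cost the same."""
    return ONE_OFF_ENGINEERING_COST / REDUNDANCY_PREMIUM


for units in (100, 1_000, 10_000):
    hardware = reliable_hardware_cost(units)
    software = reliable_software_cost(units)
    print(f"{units:>6} units: reliable hardware {hardware:>12,}  reliable software {software:>12,}")
print(f"break-even at roughly {break_even_units():.0f} units")
```

With these made-up numbers the reliability spend grows by the premium for every unit in the hardware case and by nothing in the software case, so below the break-even point reliable hardware is cheaper and above it the gap widens with every server you add.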
The other two permutations are ineffective:

Unreliable software on unreliable hardware gives an unreliable system. That’s why you should never try to install unreliable software like Microsoft Exchange, Lotus Notes, Oracle etc. onto unreliable hardware like Amazon EC2:

unreliable software
unreliable hardware

Finally, reliable software on reliable hardware gives a reliable but inefficient and expensive system. That’s why you’re unlikely to see reliable software like Cassandra running on reliable platforms like VMware with brand name hardware:

reliable software
reliable hardware

Google enjoyed a significant competitive advantage for many years by using commodity components with a revolutionary proprietary software stack including components like the distributed Google File System (GFS). You can still see Google’s original hand-made racks built with motherboards laid on cork board at their Mountain View campus and the computer museum (per image above), but today’s machines are custom made by ODMs and are a lot more advanced. Meanwhile Facebook have decided to focus on their core competency (social networking) and are actively commoditising “unreliable” web scale hardware (by way of the Open Compute Project) and software (by way of software releases, most notably the Cassandra distributed database which is now used by services like Netflix).

The challenge for enterprises today is to adopt cheap reliable software so as to enable the transition away from expensive reliable hardware. That’s easier said than done, but my advice to them is to treat this new technology as another tool in the toolbox and use the right tool for the job. Set up cloud computing platforms like Cassandra and OpenStack and look for “low-hanging fruit” to migrate first, then deal with the reticent applications once the “center of gravity” of your information technology systems has moved to cloud computing architectures.

P.S. Before the server huggers get all pissy about my using the term “relatively unreliable software”, this is a perfectly valid way of achieving a reliable system, just not a cost-effective one now that “relatively reliable software” is here.

Cloud computing’s concealed complexity

James Urquhart claims Cloud is complex—deal with it, adding that “If you are looking to cloud computing to simplify your IT environment, I’m afraid I have bad news for you” and citing his earlier CNET post drawing analogies to a recent flash crash.

Cloud computing systems are complex in the same way that nuclear power stations are complex, and both have catastrophic failure modes. But given that cloud providers rely heavily on their reputations, they go to great lengths to ensure continuity of service (I was previously the technical program manager for Google’s global tape backup program, so I appreciate this first hand). The best analogies to flash crashes are autoscaling systems making too many (or too few) resources available and spot price spikes, but these are isolated and there are simple ways to mitigate the risk (DDoS protection, market limits, etc.).
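
As a rough illustration of the "limits" style of mitigation, the sketch below clamps whatever an autoscaler asks for to a configured range, and declines to buy spot capacity above a price ceiling. The function names and numbers are hypothetical and do not correspond to any particular provider's API.

```python
from typing import Optional


def clamp_capacity(desired: int, minimum: int, maximum: int) -> int:
    """Keep an autoscaler's requested instance count within fixed bounds."""
    if minimum > maximum:
        raise ValueError("minimum must not exceed maximum")
    return max(minimum, min(desired, maximum))


def capped_spot_price(market_price: float, price_ceiling: float) -> Optional[float]:
    """Return the price to pay, or None to sit out when the market exceeds the ceiling."""
    if market_price > price_ceiling:
        return None
    return market_price


if __name__ == "__main__":
    # A runaway feedback loop asks for 500 instances; the limit holds it at 50.
    print(clamp_capacity(500, minimum=2, maximum=50))
    # A faulty metric asks for zero; the floor keeps the service alive.
    print(clamp_capacity(0, minimum=2, maximum=50))
    # A spot price spike above our ceiling means we simply do not buy.
    print(capped_spot_price(0.12, price_ceiling=0.10))
```

Neither limit prevents the underlying misbehaviour, but each bounds the damage, which is the point of the comparison with market circuit breakers.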

Fortunately this complexity is concealed behind well defined interfaces — indeed the term “cloud” itself comes from network diagrams in which complex interconnecting networks became the responsibility of service providers and were concealed by a cloud outline. Cloud computing is, simply, the delivery of information technology as a service rather than a product, and like other utility services there is a clear demarcation point (the first socket for telephones, the meter for electricity and the user or machine interface for computing).

Everything on the far side of the demarcation point is the responsibility of the provider, and users often don’t even know (nor do they need to know) how the services actually work — it could be an army of monkeys at typewriters for all they care. Granted it’s often beneficial to have some visibility into how the services are provided (in the same way that we want to know our phone lines are secure and power is clean), but we’ve developed specifications like CloudAudit to improve transparency.
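
In code, a demarcation point looks like an interface. The sketch below is illustrative only: `BlobStore`, `InMemoryStore`, `ReplicatedStore` and `archive_report` are hypothetical names, not any real provider's API. The consumer code is identical whether one machine or several sit behind the interface.

```python
from typing import Dict, Protocol, Sequence


class BlobStore(Protocol):
    """The demarcation point: all the consumer ever sees."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Simplest possible provider: a single dictionary."""

    def __init__(self) -> None:
        self._items: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._items[key] = data

    def get(self, key: str) -> bytes:
        return self._items[key]


class ReplicatedStore:
    """A more complex provider: writes to every backend, reads from the first that answers."""

    def __init__(self, backends: Sequence[BlobStore]) -> None:
        self._backends = list(backends)

    def put(self, key: str, data: bytes) -> None:
        for backend in self._backends:
            backend.put(key, data)

    def get(self, key: str) -> bytes:
        for backend in self._backends:
            try:
                return backend.get(key)
            except KeyError:
                continue  # This copy is missing; try the next one.
        raise KeyError(key)


def archive_report(store: BlobStore, name: str, body: str) -> bytes:
    """Consumer code: it neither knows nor cares what is behind the interface."""
    store.put(name, body.encode("utf-8"))
    return store.get(name)


if __name__ == "__main__":
    simple = InMemoryStore()
    replicated = ReplicatedStore([InMemoryStore(), InMemoryStore(), InMemoryStore()])
    print(archive_report(simple, "q1", "all good"))
    print(archive_report(replicated, "q1", "all good"))
```

The replication logic could be swapped for an army of monkeys at typewriters without changing a line of `archive_report`.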

Making simple topics complex is easy; what’s hard is making complex topics simple. We should be working to make cloud computing as approachable as possible, and drawing attention to its complexity does not further that aim. Sure, there are communities of practitioners who need to know how it all works (and James is addressing that community via GigaOm), but consumers of cloud services should finally be enabled to apply information technology to business problems without unnecessary complexity.

If you find yourself using complex terminology or unnecessary acronyms (e.g. anything ending with *aaS) then ask yourself if you’re not part of the problem rather than part of the solution.