Werner Vogels, Vice President and CTO of Amazon.com, is famously quoted as having once said, “Everything fails all the time”. Let’s face it – he’s right.
No matter how hard you try to strengthen and ‘reinforce’ your IT systems and infrastructure, failure is not a matter of ‘if’ but ‘when’. Why, then, do we still spend so much of our time hardening our infrastructure in an attempt to protect ourselves from failure? These days, many decisions IT leaders make are still driven by the belief that if the underlying infrastructure is ‘solid’ enough, failure can be overcome once and for all. That’s why modern businesses still spend millions of dollars on hardware like IBM Power physical servers: they believe that record-low average downtimes can secure them a failure-free future.
What Vogels meant by his famous quote was that it essentially doesn’t matter how hard you try to avoid individual failures – they will happen. You can invest huge sums in the very best hardware, but who’s to say there will never be a power cut at your premises? Regardless of how good your hardware, processes or team’s capabilities are, the occasional failure is unstoppable. It’s what you might call an asymptote – if you drew a graph with dollars spent on the X-axis and reliability on the Y-axis, you’d notice that no matter how much money you spend, you’ll never reach 100% reliability. Investment will make a significant impact on reliability up to a certain point, but beyond that the returns diminish sharply: you might find, for example, that an extra $100 million investment only leads to a 0.1% increase in reliability.
So Vogels’ message (and ours too!) is that instead of pouring all your energy and time into trying to prevent failure from happening, plan for it. Invert the logic, embrace failure and approach everything you do with the goal of recovering as quickly as possible.
Here are three ways you can plan for failure in your IT:
- Spread the risk
Say you’ve identified that you’ll need 64GB of memory resources to effectively run a key application for your user base. Instead of choosing one 64GB machine, the principle of anti-fragility would suggest that it’s smarter to go for multiple, smaller servers. Perhaps deploy four (or more) 16GB machines instead and minimise the risk associated with failure. The truth is a server will fail eventually, but with four servers any single failure takes out only 25% of your capacity, and with ten servers the impact is reduced to just 10%. If one fails, the others can take over its load, which means a reduction in downtime.
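The arithmetic behind spreading the risk can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and it assumes the 64GB total is split evenly across identical servers.

```python
def capacity_after_failures(total_gb: float, servers: int, failed: int = 1) -> float:
    """Memory (GB) still available after `failed` equal-sized servers go down."""
    if servers <= 0 or not 0 <= failed <= servers:
        raise ValueError("need at least one server and 0 <= failed <= servers")
    per_server = total_gb / servers  # assumption: capacity is split evenly
    return per_server * (servers - failed)


def impact_of_single_failure(servers: int) -> float:
    """Fraction of total capacity lost when one of `servers` fails."""
    return 1 / servers


if __name__ == "__main__":
    for n in (1, 4, 10):
        lost = impact_of_single_failure(n)
        left = capacity_after_failures(64, n)
        print(f"{n:>2} server(s): one failure removes {lost:.0%}, {left:.1f} GB remain")
```

With a single 64GB machine one failure removes everything; with four machines 48GB remain. Whether the remaining capacity is enough to carry the full load is a separate sizing question, which is why the "or more" matters.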
- Compartmentalise as much as possible
The Service Oriented Architecture (SOA) paradigm dictates that instead of having one huge monolithic app, it’s better to break it down into smaller parts which are executed, managed and deployed separately. Where the old approach was to tightly integrate the various elements of an app and have them running together, a ‘planning-for-failure’ approach means that any issues can be isolated and resolved without the user experience of other elements being affected. For example, if there is a security breach in your monolithic enterprise app, the whole system is exposed to the threat. If the system were effectively compartmentalised, however, the security breach would only affect one particular area and could be resolved quickly without hampering the usability of other elements of the system.
Amazon.com is a great example of this. Each element of their website runs separately so that no one issue can cause the whole system to become unavailable. For example, the product recommendations feature might not be working at a given point in time, but customers will still have the ability to buy a product or read reviews. This ensures that the key functionality (purchasing) is not affected by an issue with a more minor element of the site. If Amazon.com operated as a monolithic app, a failure in the product recommendations module could cause the whole site to go down.
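To make the idea concrete, here is a minimal Python sketch of a product page that keeps working when a minor feature is down. It is not how Amazon.com is actually built; every function name here is hypothetical, and the recommendations call is hard-coded to fail in order to simulate an outage.

```python
from typing import Callable, List


def fetch_recommendations(product_id: str) -> List[str]:
    """Hypothetical non-critical service; always fails here to simulate an outage."""
    raise ConnectionError("recommendations service unavailable")


def fetch_reviews(product_id: str) -> List[str]:
    """Hypothetical reviews service that is healthy."""
    return ["Great value", "Arrived on time"]


def place_order(product_id: str) -> str:
    """Hypothetical purchasing service: the key functionality."""
    return f"order placed for {product_id}"


def optional(call: Callable[[], List[str]]) -> List[str]:
    """Run a non-critical call; on any failure fall back to an empty result."""
    try:
        return call()
    except Exception:
        return []


def render_product_page(product_id: str) -> dict:
    """Assemble the page; a failing optional section never breaks the rest."""
    return {
        "product": product_id,
        "reviews": optional(lambda: fetch_reviews(product_id)),
        "recommendations": optional(lambda: fetch_recommendations(product_id)),
    }


if __name__ == "__main__":
    print(render_product_page("book-123"))
    print(place_order("book-123"))
```

The design choice is that each optional section is called through its own boundary, so a failure is contained where it happens. In a monolith without that boundary, the same exception would propagate and take the whole page down with it.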
- Chaos engineering
Once you’re comfortable with the first two elements of planning for failure, you could take it a step further and do what’s called ‘chaos engineering’. This is essentially the art and science of deliberately provoking problems in your production environments, to see what happens and test how fast you’ll be able to recover. This sounds mad, but it can be a great way of pre-empting failure and fine-tuning your infrastructure accordingly. The most famous example of chaos engineering is Netflix’s ‘Chaos Monkey’, which they released into the wild in mid-2012. Netflix have come to believe that “by frequently causing failures, we force our services to be built in a way that is more resilient”. Chaos Monkey is software that runs in Netflix’s AWS environment and goes around randomly terminating machines. It spends its days roaming the company’s infrastructure, selecting machines in production (which might even be doing things for customers at that very moment!) and killing them. And this is across every part of Netflix’s service – from video streaming to billing to financial data. Does that sound outrageous to you? The amazing thing is that it’s been four years since Chaos Monkey caused any noticeable problems for customers and, through this process, Netflix have been able to tweak and improve their service to account for the inevitability of failure.
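The principle can be tried out safely in a toy simulation before going anywhere near production. The Python sketch below is not Netflix’s Chaos Monkey and makes no AWS calls; the `Fleet` class, the instance names and the `run_experiment` function are all invented for illustration. It randomly terminates one machine per round, checks whether the service can still answer, and then lets the fleet replace what was lost.

```python
import random


class Fleet:
    """Toy in-memory stand-in for a group of identical servers."""

    def __init__(self, size: int) -> None:
        self.desired = size
        self.instances = {f"i-{n}" for n in range(size)}
        self._next_id = size

    def terminate(self, instance_id: str) -> None:
        self.instances.discard(instance_id)

    def handle_request(self) -> bool:
        # The service stays up as long as at least one instance survives.
        return len(self.instances) > 0

    def heal(self) -> None:
        # Replace terminated instances until the desired size is restored.
        while len(self.instances) < self.desired:
            self.instances.add(f"i-{self._next_id}")
            self._next_id += 1


def chaos_monkey(fleet: Fleet, rng: random.Random) -> str:
    """Terminate one randomly chosen instance and return its name."""
    victim = rng.choice(sorted(fleet.instances))
    fleet.terminate(victim)
    return victim


def run_experiment(size: int, rounds: int, seed: int = 0) -> bool:
    """Return True if the service answered after every induced failure."""
    rng = random.Random(seed)
    fleet = Fleet(size)
    for _ in range(rounds):
        chaos_monkey(fleet, rng)
        if not fleet.handle_request():
            return False  # an outage a customer would have noticed
        fleet.heal()
    return True


if __name__ == "__main__":
    print("4 servers survive:", run_experiment(size=4, rounds=20))
    print("1 server survives:", run_experiment(size=1, rounds=1))
```

A fleet of four keeps serving through every induced failure, while a fleet of one goes down on the first. That is exactly the kind of weakness such an experiment is meant to expose before a real failure does.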
But why is designing for failure a key consideration when migrating to the cloud? In short, because the cloud lets you plan for failure far more thoroughly than on-premises infrastructure does. The flexibility the cloud affords means you can quickly bounce back from failure and even build it into your IT ecosystem. We’re used to planning for failure in everyday life – after all, that’s why we wear seat belts in cars and helmets on bikes. But for some reason, there is still an underlying belief that, when it comes to IT, everything should work all the time. In reality, that’s simply not the case. So don’t try to fight failure – accept it and plan for it. Transform your business’s IT infrastructure from a propped-up, weather-beaten oak tree into a flexible bamboo that sways with the wind instead of fighting against it.