AWS Outage: We Need to Talk About These Nines
I walk out of a meeting, getting ready to go to lunch. One of the guys on the database team grabs me as I pass and informs me that his AWS (Amazon Web Services) permissions are broken. He’s unable to see any of his S3 buckets. I walk to my workstation, sit down, log into the console and find that none of our S3 buckets seem to exist. First things first: let my boss, the director of technology, know, then run to the development directors to inform them. Grabbing my laptop, I head back to the conference room with my director and a fellow sysadmin. Amazon’s status site asserts that everything is actually okay. Operations is doing what it does best: scrambling. Email notifications are going out to the technology department, the sales department, and the customer management and support teams. Two team members from Operations and Help Desk are at an AWS conference. They chime in on the email threads, digitally chortling about how the presenters had just finished explaining that eleven nines of availability meant the S3 service would go down only once every 10 million years when their presentation ground to a halt, because S3 was… unavailable.
This was a small outage for us. But an outage is an outage, and they happen. Ultimately, they’re unavoidable because nothing is flawless. Mistakes will always happen eventually, and Murphy has a law out there that’s still on the books. Within a few hours, the services were back up and our products recovered. It was a shock when it happened, but once we got our bearings, there was nothing we could do but accept it and wait. Notifications were out, so it was time to monitor and send new ones when things were back up. This is both the benefit and the burden of relying on someone else’s infrastructure. The aftermath in the headlines the next day is where it gets interesting. The tech world was awash in scolding sentiments about redundancy and proper architecture. There was a considerable amount of finger wagging and condescension, with commentators exclaiming that all those companies that suffered outages should have used multiple providers or, at the very least, multiple regions. But in all fairness, that’s not what these cloud service companies sell us.
Everyone in the industry knows that buzzwords are just that: words. Those of us in the trenches hear them day in and day out. They’re sometimes what gets a company to buy into a brilliant new project, and other times what gets a company to push a futile, terrible and frustrating one. The latest thing, for now, is to brag about the nines of uptime. Five-nines has become a misused claim so pervasive that very few people consider what it actually means anymore. Truly offering that level of uptime (99.999%) would mean that something is unavailable for a total of only 5.26 minutes every year. So at eleven-nines, a five-hour outage on S3 would be valid only if the service didn’t have another outage for about 57.04 million years. I don’t think there’s anyone who doesn’t realize this is hyperbolic and that the vendors are really just trying to express an extremely high level of confidence (read: hubris) in their product.
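For anyone who wants to check that arithmetic, here is a minimal sketch in Python. The function names are my own, and it assumes a 365.25-day year, which is the assumption that reproduces the two figures above; a different year length shifts the last decimal place.

```python
# Back-of-the-envelope math for "N nines" of uptime.
# Assumption: a year is 365.25 days (8,766 hours).
HOURS_PER_YEAR = 365.25 * 24


def unavailability(nines: int) -> float:
    """Fraction of time a service may be down at N nines (5 -> 0.00001)."""
    return 10.0 ** -nines


def downtime_minutes_per_year(nines: int) -> float:
    """Minutes of downtime allowed per year at N nines."""
    return HOURS_PER_YEAR * 60 * unavailability(nines)


def years_to_amortize_outage(nines: int, outage_hours: float) -> float:
    """Years of otherwise perfect uptime needed for one outage of the
    given length to still average out to N nines."""
    total_hours = outage_hours / unavailability(nines)
    return total_hours / HOURS_PER_YEAR


print(f"{downtime_minutes_per_year(5):.2f} minutes per year at five nines")
print(f"{years_to_amortize_outage(11, 5) / 1e6:.2f} million years at eleven nines")
```

Run as written, this should print roughly 5.26 minutes and 57.04 million years, matching the numbers in the paragraph above.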
Technologists who work in an enterprise environment understand how important high availability is and the consequences of not configuring the proper level of redundancy. Those of us who have made the shift to a public cloud infrastructure are cautious about how readily we believe the claims these providers make. Every one of us has been burned at some point or another, but we try to recognize why it happens and make sure we find a way to correct it or work around it. While we may balk at their exaggerations, at some point you have to trust your vendor. Companies aren’t in the habit of purchasing an EMC array and then going out and buying a 3PAR as a backup when getting new SAN storage, so why should the same mindset not apply to infrastructure as a service?
While I understand that companies like Netflix invest large sums of money in building suites of applications specifically made to cause outages so they can work to ensure their reliability, I also understand that not every company can afford the time, effort or assets to do this. The promise of services such as AWS, Azure, Google Cloud and DigitalOcean is that companies can have access to the necessities of technological infrastructure in order to create, grow and innovate without the unattainable initial capital that was required just ten years ago. With these services comes a certain expectation from their customer base that is in no way unwarranted. Amazon itself got burned by this outage, as its status page was incorrect because it relied on the very services that went down. Most of the people who work with these things already knew they were relying on something that could fail and were taking a calculated risk in doing so, and those who didn’t know then know now. Unfortunately, mitigating those risks often takes considerable effort from several teams, and the driving forces behind it don’t always have the ability to allocate the necessary resources outside of their central team. Moments like this can sometimes be leveraged as proof of the value of spending those resources, and those of us who lead these public infrastructure migrations try to do just that.
However, I question the validity of the allegations that customers should “know better.” These services are touted as being unrealistically reliable, and while reason dictates that they can’t actually live up to their declarations, reality shouldn’t be as far a departure from them as it is. If a company states that it has x-nines of reliability, it is asking to be relied upon. And when it’s a company like Amazon, an established technological powerhouse that not only embraces its position as the leader in public cloud offerings but redefined the market, those assertions should come with a level of accountability and expectation. We have to trust our vendors, so maybe it’s time they assessed their claims and made them a bit more realistic.