AWS Outage that Broke the Internet Caused by Mistyped Command

This past Tuesday morning Pacific Time, an Amazon Web Services engineer debugging an issue with the billing system for the company's popular cloud storage service S3 accidentally mistyped a command. What followed was a cloud outage lasting several hours that wreaked havoc across the internet and resulted in potentially hundreds of millions of dollars in losses for AWS customers.

The long list of popular web services that either suffered full blackouts or degraded performance because of the AWS outage includes the likes of Coursera, Medium, Quora, Slack, Docker (which delayed a major news announcement by two days because of the issue), Expedia, and AWS's own cloud health status dashboard, which, as it turned out, relied on S3 infrastructure hosted in a single region.

Cyence, an analytics company that quantifies the economic impact of cyber risk, estimated that the S&P 500 companies affected by the outage collectively lost between $150 million and $160 million as a result of the incident. That estimate doesn't include countless other businesses that rely on S3, on other AWS services that rely on S3, or on service providers that built their services on Amazon's cloud.

Related: No Shortage of Twitter Snark as AWS Outage Disrupts the Internet

The engineer who made the expensive mistake meant to execute a command intended to remove only a small number of servers running one of the S3 subsystems. "Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended," according to a post-mortem Amazon published Thursday, which also included an apology.
The servers removed supported two other crucial S3 subsystems: one that manages metadata and location information for all S3 objects in Amazon's largest data center cluster, located in Northern Virginia, and one that manages allocation of new storage and relies on the first subsystem. Once the two subsystems lost a large chunk of their capacity, they had to be restarted, which is where another problem occurred. Restarting them took much longer than AWS engineers expected, and while they were being restarted, other services in the Northern Virginia region (US-East-1) that rely on S3 – namely the S3 console, launches of new cloud VMs by the flagship Elastic Compute Cloud service, Elastic Block Store volumes, and Lambda – were malfunctioning.

Amazon explained the prolonged restart by saying the two subsystems had not been completely restarted for many years. "S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected."

To prevent similar issues in the future, the AWS team modified its tool for removing capacity so that it removes capacity more slowly and refuses to remove servers once a subsystem is at its minimum required capacity. The team also reprioritized work to partition one of the affected subsystems into smaller "cells," which had been planned for later this year but will now begin right away. Finally, the Service Health Dashboard now runs across multiple AWS regions, so customers won't have to rely on Twitter to learn about the health of their cloud infrastructure in case of another outage.
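Amazon's post-mortem names two safeguards added to the capacity-removal tool, a rate limit on how quickly servers can be taken out of service and a floor below which removal is refused, but gives no implementation details. The short Python sketch below is purely illustrative: the Subsystem model, the MAX_REMOVALS_PER_RUN constant, and the remove_capacity function are hypothetical names and numbers, not AWS internals.

```python
from dataclasses import dataclass

@dataclass
class Subsystem:
    """Hypothetical model of one subsystem's server fleet (not an AWS data structure)."""
    name: str
    active_servers: int
    min_required_servers: int  # capacity floor the subsystem must never drop below

# Hypothetical rate limit: how many servers one invocation of the tool may remove.
MAX_REMOVALS_PER_RUN = 2

def remove_capacity(subsystem: Subsystem, requested: int) -> int:
    """Remove at most `requested` servers, applying the two safeguards the
    post-mortem describes: a removal rate limit and a minimum-capacity floor.
    Returns the number of servers actually removed."""
    # Safeguard 1: never take out more capacity than the rate limit allows per run.
    allowed = min(requested, MAX_REMOVALS_PER_RUN)

    # Safeguard 2: never let the fleet fall below its minimum required capacity.
    headroom = subsystem.active_servers - subsystem.min_required_servers
    allowed = min(allowed, max(headroom, 0))

    if allowed < requested:
        print(f"{subsystem.name}: request to remove {requested} server(s) "
              f"trimmed to {allowed} by safety checks")

    subsystem.active_servers -= allowed
    return allowed

# Example: a mistyped request for 50 servers is cut down to what the checks permit.
index = Subsystem(name="index", active_servers=10, min_required_servers=8)
removed = remove_capacity(index, requested=50)
print(f"removed {removed} server(s); {index.active_servers} remain active")
```

In this toy run the oversized request is trimmed to two servers, leaving the fleet at its minimum required capacity rather than taking the whole subsystem offline, which is the failure mode the real incident exposed.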