Amazon S3 problem caused by command line mistake during maintenance

Amazon Web Services (AWS) has explained the hours-long service disruption that caused many websites and Internet-connected services to go offline earlier this week.

The Amazon Simple Storage Service (S3) team was debugging a problem in the S3 billing system on Tuesday morning when one team member “executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process,” Amazon wrote in a post-mortem describing the incident. That’s when things went wrong. “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.”
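
The quoted passage does not say how Amazon's tooling works, so the following is only a minimal sketch of the kind of check that can stop a mistyped input from removing too much capacity. Every name here (`check_removal`, `max_batch`, `min_remaining`) is hypothetical and is not taken from Amazon's tools.

```python
def check_removal(fleet, requested, min_remaining, max_batch):
    """Validate a server-removal request before anything is taken offline.

    All names and limits here are illustrative assumptions, not Amazon's tooling.
    fleet: names of servers currently in service
    requested: names the operator asked to remove
    min_remaining: fewest servers the subsystem needs to keep running
    max_batch: most servers a single command may remove
    """
    targets = set(requested)

    # Reject names that are not part of this fleet (likely a typo).
    unknown = sorted(targets - set(fleet))
    if unknown:
        raise ValueError(f"unknown servers: {unknown}")

    # Reject requests that are larger than a single command may handle.
    if len(targets) > max_batch:
        raise ValueError(
            f"refusing to remove {len(targets)} servers (limit {max_batch})"
        )

    # Reject requests that would leave the subsystem under capacity.
    remaining = len(set(fleet)) - len(targets)
    if remaining < min_remaining:
        raise ValueError(
            f"only {remaining} servers would remain (minimum {min_remaining})"
        )

    return sorted(targets)
```

With a fleet of ten servers, a batch limit of two, and a minimum of six, a request for two servers passes, while a request for five is rejected before any server is touched.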

An index subsystem that “manages the metadata and location information of all S3 objects in the [Virginia data center] region” was one of the two affected, Amazon wrote. “This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects.”
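
The dependency Amazon describes can be sketched as a small model: every request type needs the index subsystem, PUT also needs the placement subsystem, and placement itself needs the index. The table and function names below are illustrative assumptions, not part of S3.

```python
# Subsystems each request type needs, per the post-mortem quotes above.
REQUEST_NEEDS = {
    "GET": {"index"},
    "LIST": {"index"},
    "PUT": {"index", "placement"},
    "DELETE": {"index"},
}

# Subsystems each subsystem itself depends on.
SUBSYSTEM_NEEDS = {
    "index": set(),
    "placement": {"index"},
}


def working_subsystems(healthy):
    """Return healthy subsystems whose own dependencies are also working."""
    working = set()
    changed = True
    while changed:
        changed = False
        for name in healthy:
            if name not in working and SUBSYSTEM_NEEDS[name] <= working:
                working.add(name)
                changed = True
    return working


def servable_requests(healthy):
    """Return the request types that can be served, in sorted order."""
    working = working_subsystems(healthy)
    return sorted(r for r, needs in REQUEST_NEEDS.items() if needs <= working)


print(servable_requests({"index", "placement"}))
print(servable_requests({"index"}))
print(servable_requests({"placement"}))
```

In this model, losing the index subsystem leaves no request type servable, even if the placement servers are healthy, which matches the quote's point that placement cannot operate correctly without the index.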

Technology Lab – Ars Technica