The Mishap

Last Tuesday, a number of websites began experiencing issues with their services, and many others went down completely. The cause appeared to be a severe outage at Amazon Web Services (AWS), Amazon’s sizable cloud-computing business, which hosts vast swaths of cyberspace.

The Simple Storage Service (S3) in the US-East-1 (Northern Virginia) region was disrupted for approximately 5 hours. Even the AWS service status indicators displayed misleading results, because they rely on S3 to store their health marker graphics. The outage had a massive disruptive impact on companies running their production workloads on AWS.


What is S3 and how does it work?

Amazon Simple Storage Service (Amazon S3) is object storage with a simple web service interface to store and retrieve any amount of data from anywhere on the web. It is designed to deliver 99.999999999% durability, and scale past trillions of objects worldwide.

Customers use Amazon S3 as primary storage, as a bulk repository for user-generated content, and as a tier in an active archive. Amazon S3 is key-based object storage, which means that every time one stores data, it is saved under a unique object key that is later used to retrieve it. Amazon S3 replicates the data across multiple devices within the Region, although at the time of writing it follows an eventual consistency model. This means that one may not be able to read the latest version of the data immediately after an S3 object is updated. The reason is that replicating objects between the Availability Zones (AZs) takes time, and AWS provides no status information while that replication is in progress.
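
The behaviour described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: it is not the S3 API, and the class ToyObjectStore, its methods, and the sample key are hypothetical names invented for illustration. It shows a write landing on one replica first, so that a read from another replica returns the old version until replication catches up.

```python
class ToyObjectStore:
    """Toy in-memory model of a key-based object store (hypothetical, not S3)."""

    def __init__(self, replica_count=3):
        # Each replica is a dict mapping object key -> data.
        self.replicas = [{} for _ in range(replica_count)]
        # Writes accepted by the first replica but not yet copied to the others.
        self.pending = []

    def put(self, key, data):
        # The write lands on one replica first; the others catch up later.
        self.replicas[0][key] = data
        self.pending.append((key, data))

    def get(self, key, replica=0):
        # A read may be served by any replica, including a stale one.
        return self.replicas[replica].get(key)

    def replicate(self):
        # Apply all pending writes, in order, to the remaining replicas.
        for key, data in self.pending:
            for other in self.replicas[1:]:
                other[key] = data
        self.pending.clear()


store = ToyObjectStore()
store.put("photos/cat.jpg", b"v1")
store.replicate()

store.put("photos/cat.jpg", b"v2")               # overwrite, not yet replicated
stale = store.get("photos/cat.jpg", replica=2)   # this replica still holds the old version
store.replicate()
fresh = store.get("photos/cat.jpg", replica=2)   # the update has now arrived
```

In this model the reader has no way of knowing whether replication has finished, which is the practical consequence of eventual consistency for applications that read right after they write.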


Acknowledging the mishap

In a statement, Amazon said: “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.” An engineer servicing Amazon’s S3 system was following an established playbook and executed a command with a mistyped input. Rather than taking a handful of servers offline for servicing, the command took a whole slew of them offline, including servers that supported two other S3 subsystems.

One of these subsystems was the index subsystem, which is responsible for managing the metadata and location information of all S3 objects in the region and serves all GET, LIST, PUT, and DELETE requests.
The second was the placement subsystem, which is responsible for allocating new storage and relies on the index subsystem to function properly. Removing a significant portion of the capacity forced each of these systems into a complete restart, and S3 was unable to process service requests while they restarted.
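
To make the failure mode concrete, here is a minimal sketch of the kind of safeguard that can stop a mistyped input from removing too much capacity. It is purely illustrative and does not describe Amazon's actual tooling: the function plan_removal, its parameters min_required and max_batch, and the server names are all hypothetical.

```python
def plan_removal(active, requested, min_required, max_batch=2):
    """Return the servers that are safe to remove, or raise ValueError.

    active       -- set of servers currently serving the subsystem
    requested    -- servers the operator asked to take offline
    min_required -- fewest servers the subsystem needs to keep serving
    max_batch    -- most servers that may be removed in one command
    """
    # Ignore names that are not part of this subsystem.
    to_remove = [server for server in requested if server in active]

    # Guard 1: refuse unexpectedly large removals (a likely sign of a typo).
    if len(to_remove) > max_batch:
        raise ValueError(
            f"refusing to remove {len(to_remove)} servers; limit is {max_batch}"
        )

    # Guard 2: never drop the subsystem below its minimum capacity.
    remaining = len(active) - len(to_remove)
    if remaining < min_required:
        raise ValueError(
            f"removal would leave {remaining} servers; need {min_required}"
        )
    return to_remove


index_servers = {f"index-{i}" for i in range(10)}

# A routine request for two servers passes both guards.
approved = plan_removal(index_servers, ["index-0", "index-1"], min_required=6)

# A mistyped request for eight servers is rejected before anything goes offline.
try:
    plan_removal(index_servers, [f"index-{i}" for i in range(8)], min_required=6)
    rejected = False
except ValueError:
    rejected = True
```

The two checks are independent on purpose: the batch limit catches an oversized request, while the capacity floor still protects the subsystem even if someone raises the batch limit.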


All that you can do!

Since the disruption was caused by a single typo, the need of the hour is undoubtedly a framework that guards against such failures in the future.

Solution 1: