The Mishap

Last Tuesday, a number of websites began to experience problems with their services, and many others went down completely. The cause appeared to be a severe outage at Amazon Web Services (AWS), Amazon’s sizable cloud-computing business, which hosts vast swaths of cyberspace.

The Simple Storage Service (S3) in the US-East-1 (Northern Virginia) region was disrupted for approximately 5 hours. Even the status indicators for AWS services displayed misleading results, because they rely on S3 to store their health marker graphics. The disruption had a massive impact on companies running their production workloads on AWS.

 

What is S3 and how does it work?

Amazon Simple Storage Service (Amazon S3) is object storage with a simple web service interface to store and retrieve any amount of data from anywhere on the web. It is designed to deliver 99.999999999% durability, and scale past trillions of objects worldwide.
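
To put the durability figure in perspective, here is a small worked calculation. It assumes the 99.999999999% figure is a per-object, per-year probability of survival, and the object count of 10,000,000 is only an illustrative value, not something taken from this incident.

```python
from fractions import Fraction

# Eleven nines of durability, as quoted above.
# Assumption: the figure applies per object, per year.
DURABILITY = Fraction("0.99999999999")


def expected_annual_losses(object_count):
    """Expected number of objects lost per year at the quoted durability."""
    return object_count * (1 - DURABILITY)


# Illustrative object count (an assumption, not a measured value).
losses = expected_annual_losses(10_000_000)
years_per_loss = 1 / losses

print(float(losses))        # 0.0001 objects per year
print(int(years_per_loss))  # 10000 years per expected single loss
```

Note that durability describes the chance of losing stored data, not availability. The outage described here made objects unreachable; it did not destroy them.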

Customers use Amazon S3 as primary storage, as a bulk repository for user-generated content, and as a tier in an active archive. Amazon S3 is key-based object storage, which means that every time one stores data, it is saved under a unique object key that is later used to retrieve it. Amazon S3 replicates the data across multiple devices within the Region, although it follows an eventual consistency model. This means that one may not be able to read the latest version of an S3 object straight after it has been updated: AWS provides no status information while the object is being replicated between the AZs, so a read may reach a copy that has not been updated yet.

 

Acknowledging the mishap

In a statement, Amazon said: “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.” An engineer servicing Amazon’s S3 system was following an established playbook and executed a command with an incorrect input. Rather than taking a handful of servers offline for servicing, the command took a whole slew of them offline, including servers that supported two other S3 subsystems.

One of these subsystems was the index subsystem, which manages the metadata and location information of all S3 objects in the region and serves all GET, LIST, PUT, and DELETE requests.
The second was the placement subsystem, which is responsible for allocating new storage and relies on the index subsystem to function properly. Removing a significant portion of the capacity forced each of these subsystems into a complete restart, and while they were restarting S3 could not process service requests.

 

All that you can do!

Since the disruption was undoubtedly caused by a single typo, it is high time to build a framework that guards against such failures in future.

Solution 1:

Configure Amazon S3 cross-region replication, which provides automatic, bucket-level, asynchronous replication of objects to a different AWS Region. AWS Lambda, Amazon SNS and Amazon Route 53 are then configured alongside Amazon S3: when the AWS Service Health Dashboard shows high API error rates, an SNS notification triggers a Lambda function that swaps the Route 53 entry so that traffic goes to the replica.

Pros:

  • This is an automated process and no manual intervention is required.

Cons:

  • Route 53 has to be configured with a low Time to Live (TTL).
  • The data served can be stale due to asynchronous replication.
  • The AWS Service Health Dashboard can be an unreliable signal, as it relies on S3 as well.
  • The latency of data transfer is high.
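
A minimal sketch of the Lambda function in Solution 1 might look like the following. The hosted zone ID, record name, replica endpoint and TTL are hypothetical placeholders, and the record is assumed to be a CNAME. How the endpoint maps to the replica bucket is left out here. The only AWS call used is the boto3 Route 53 client's change_resource_record_sets; the client can be passed in so the logic can be exercised without AWS access.

```python
# Hypothetical values for illustration; none of these come from the article.
HOSTED_ZONE_ID = "Z_EXAMPLE"
RECORD_NAME = "assets.example.com."
REPLICA_ENDPOINT = "replica.example.com"
LOW_TTL_SECONDS = 60  # a low TTL so that clients pick up the swap quickly


def build_failover_change(record_name, target, ttl):
    """Build the ChangeBatch that repoints a CNAME record at `target`."""
    return {
        "Comment": "Fail over to the replica region",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "TTL": ttl,
                "ResourceRecords": [{"Value": target}],
            },
        }],
    }


def handler(event, context, route53=None):
    """Lambda entry point, assumed to be subscribed to the SNS topic."""
    if route53 is None:
        import boto3  # only needed when running inside AWS
        route53 = boto3.client("route53")

    # An SNS-triggered event carries one or more records.
    messages = [r["Sns"]["Message"] for r in event.get("Records", [])]
    if not messages:
        return {"swapped": False}

    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch=build_failover_change(
            RECORD_NAME, REPLICA_ENDPOINT, LOW_TTL_SECONDS
        ),
    )
    return {"swapped": True, "target": REPLICA_ENDPOINT}
```

The sketch swaps on any notification. A real function would probably inspect the message and also handle switching back once the primary region recovers.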

Solution 2:

Configure Amazon S3 cross-region replication and use a secondary bucket URL as a fail-safe for when API calls to the first bucket fail.

Pros:

  • The configuration of Route 53 is not required.
  • There is no need to configure Amazon SNS and AWS Lambda.

Cons:

  • The latency of data transfer is high.
  • Automating the S3 URL swap at the code level can be intricate for developers.
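
One way the code-level swap in Solution 2 could be approached is a small wrapper that tries the primary bucket first and falls back to the replica. This is a sketch under assumptions: get_with_fallback and s3_fetcher are hypothetical helper names, and the bucket names in the usage comment are placeholders. s3_fetcher wraps the boto3 S3 client's get_object call.

```python
def s3_fetcher(client, bucket):
    """Adapt a boto3 S3 client to a simple fetch(key) -> bytes callable."""
    def fetch(key):
        return client.get_object(Bucket=bucket, Key=key)["Body"].read()
    return fetch


def get_with_fallback(key, fetchers):
    """Try each (name, fetch) pair in order; return (name, data) from the first success."""
    errors = []
    for name, fetch in fetchers:
        try:
            return name, fetch(key)
        except Exception as exc:  # real code would catch narrower botocore errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all buckets failed: " + "; ".join(errors))


# Usage inside AWS (placeholder bucket names and regions):
#   import boto3
#   primary = s3_fetcher(boto3.client("s3", region_name="us-east-1"), "my-bucket")
#   replica = s3_fetcher(boto3.client("s3", region_name="us-west-2"), "my-bucket-replica")
#   source, data = get_with_fallback("report.csv", [("primary", primary), ("replica", replica)])
```

Because replication is asynchronous, the replica may return an older version of the object, so callers may want to know which source answered. That is why the name is returned along with the data.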

Solution 3:

Write the metadata of each S3 object to DynamoDB whenever a PUT operation is performed on the S3 bucket, so that every write lands in S3 as well as in DynamoDB. All S3 read/list operations need to be rewritten to query DynamoDB, so that the applications rely only on the metadata stored in DynamoDB. In the case of a failure, it is then easier to update the metadata in DynamoDB and point it to the bucket which has the replicated data.

Pros:

  • The URL update is only done in DynamoDB.

Cons:

  • This can be challenging to code.
  • The latency of data transfer would still be an issue if CloudFront is not used.
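
The idea behind Solution 3 can be sketched with an in-memory dictionary standing in for the DynamoDB table. ObjectIndex and its method names are hypothetical. In a real system each method would be a DynamoDB write or query, and record_put would be called alongside the S3 PUT.

```python
class ObjectIndex:
    """Toy metadata index; a dict stands in for the DynamoDB table."""

    def __init__(self, primary_bucket, replica_bucket):
        self.primary_bucket = primary_bucket
        self.replica_bucket = replica_bucket
        self._items = {}  # key -> {"bucket": ..., "size": ...}

    def record_put(self, key, size):
        """Call on every S3 PUT so the index mirrors the bucket."""
        self._items[key] = {"bucket": self.primary_bucket, "size": size}

    def locate(self, key):
        """Replace a direct S3 lookup: return (bucket, key) to read from."""
        return self._items[key]["bucket"], key

    def list_keys(self, prefix=""):
        """Replace S3 LIST: answer from the index only."""
        return sorted(k for k in self._items if k.startswith(prefix))

    def fail_over(self):
        """Point every entry at the bucket holding the replicated data."""
        for item in self._items.values():
            if item["bucket"] == self.primary_bucket:
                item["bucket"] = self.replica_bucket


index = ObjectIndex("primary-bucket", "replica-bucket")  # placeholder names
index.record_put("logs/2017-02-28.txt", size=1024)
index.fail_over()
print(index.locate("logs/2017-02-28.txt"))
```

The point of the design is that applications never ask S3 where an object lives. During an outage only the index changes, and readers follow it to the replica.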

 

Conclusions

  • Reduce the blast radius by isolating workloads in multiple AWS accounts per Region and service, which limits the impact of a critical event such as an AWS Region or Availability Zone becoming unavailable.
  • Provisioning for future growth requires continuous iteration and adaptation of design. It is also necessary to design a framework that caters for elasticity.
  • Multi-region design is important and easier than multi-cloud.
  • No technology is ever 100% fail-proof, and hence strong operational performance is mandatory.
