Monday, October 29, 2012

How to work around Amazon EC2 outages « James Cohen

I am working on an AWS design for high availability and found this insight from the school of hard knocks very helpful.


A few of these options are good in principle, but are not necessarily informed by operational experience with the more common failure modes of AWS at medium to large scale (~50-100+ instances).
The author recommends using EBS volumes to provide backups and snapshots. However, Amazon’s EBS system is one of the more failure-prone components of the AWS infrastructure, and lies at the heart of this morning’s outage [1]. Any steps you can take to reduce your dependence on a service that is both critical to operation and failure-prone will shrink the surface of your vulnerability to such outages. While the snapshotting ability of EBS is nice, waking up to a buzzing pager to find that half of the EBS volumes in your cluster have dropped out, hosing each of the striped RAID arrays you’ve set up to achieve reasonable IO throughput, is not. Instead, consider using the ephemeral drives of your EC2 instances, switching to a non-snapshot-based backup strategy, and replicating data to other instances and AZs to improve resilience.
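To make the non-snapshot-based backup idea concrete, here is a minimal sketch that dumps data from an ephemeral drive and ships it to S3. The bucket name, dump path, use of pg_dump, and the boto3 library are illustrative assumptions on my part, not something from the original post.

```python
# Minimal sketch: back up a local data dump to S3 instead of relying on
# EBS snapshots. Bucket name, paths, and the datastore tool are hypothetical.
import datetime
import subprocess

import boto3

BUCKET = "my-backup-bucket"           # hypothetical S3 bucket
DUMP_PATH = "/mnt/ephemeral/db.dump"  # data lives on the instance's ephemeral drive


def backup_to_s3():
    # Produce a dump with whatever tool fits your datastore (pg_dump shown as an example).
    subprocess.check_call(["pg_dump", "-f", DUMP_PATH, "mydb"])
    key = "backups/db-%s.dump" % datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    # Ship the dump off the instance so a lost ephemeral drive does not mean lost data.
    boto3.client("s3").upload_file(DUMP_PATH, BUCKET, key)


if __name__ == "__main__":
    backup_to_s3()
```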
The author also recommends Elastic Load Balancers to distribute load across services in multiple availability zones. Load balancing across availability zones is excellent advice in principle, but it still succumbs to the problem above in the event of EBS unavailability: ELB instances are also backed by Amazon’s EBS infrastructure. ELBs can be excellent day-to-day and provide some great monitoring and introspection. However, having a quick chef script to spin up an Nginx or HAProxy balancer and flipping DNS could save your bacon in the event of an outage that also affects ELBs, like today’s.
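As an illustration of the DNS-flip fallback, here is a hedged sketch that repoints an A record from the ELB to a self-managed balancer via Route 53. The hosted zone ID, record name, standby IP, and use of boto3 are assumptions for the example only; the original post leaves the mechanism (chef plus Nginx/HAProxy) up to you.

```python
# Minimal sketch of the "flip DNS away from the ELB" fallback described above.
# The zone ID, record name, and standby balancer IP are hypothetical.
import boto3

HOSTED_ZONE_ID = "Z1EXAMPLE"           # hypothetical Route 53 hosted zone
RECORD_NAME = "www.example.com."       # record currently pointing at the ELB
STANDBY_BALANCER_IP = "203.0.113.10"   # self-managed Nginx/HAProxy instance


def flip_dns_to_standby():
    route53 = boto3.client("route53")
    # UPSERT replaces the existing record, so traffic drains to the standby
    # balancer as the (deliberately short) TTL expires.
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": "Fail over from ELB to self-managed balancer",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "A",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": STANDBY_BALANCER_IP}],
                },
            }],
        },
    )


if __name__ == "__main__":
    flip_dns_to_standby()
```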
With each service provider incident, you learn more about your availability, dependencies, and assumptions, along with what must improve. Proportional investment following each incident should reduce the impact of subsequent provider issues. Naming and shaming providers in angry Twitter posts will not solve your problem, and it most certainly won’t solve your users’ problem. Owning your availability by taking concrete steps following each outage to analyze what went down and why, mitigating your exposure to these factors, and measuring your progress during the next incident will. It is exciting to see these investments pay off.
Some of these investments:
– *Painfully* thorough monitoring of every subsystem of every component of your infrastructure. When you get paged, it’s good to know *exactly* what’s having issues rather than checking each manually in blind suspicion.
– Threshold-based alerting (a minimal sketch follows this list).
– Keeping failover for all systems as automated, quick, and transparent as is reasonably possible.
– Spreading your systems across multiple availability zones and regions, with the ideal goal of being able to lose an entire AZ/region without a complete production outage.
– Team operational reviews and incident analyses that not only expose the root cause of an issue but also spider out across your system’s dependencies to preemptively identify other components vulnerable to the same sort of problem.
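For the threshold-based alerting item above, here is a minimal sketch that polls a CloudWatch ELB latency metric and pages when the average crosses a limit. The load balancer name, metric choice, threshold, use of boto3, and the page_oncall() hook are all illustrative assumptions; your own metrics store and paging integration would slot in here.

```python
# Minimal sketch of threshold-based alerting against a CloudWatch metric.
# Names, thresholds, and the paging hook are hypothetical.
import datetime

import boto3


def page_oncall(message):
    # Stand-in for your paging integration (email, SMS, PagerDuty, ...).
    print("PAGE: %s" % message)


def check_latency_threshold(lb_name="my-elb", threshold_s=0.5):
    cloudwatch = boto3.client("cloudwatch")
    now = datetime.datetime.utcnow()
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/ELB",
        MetricName="Latency",
        Dimensions=[{"Name": "LoadBalancerName", "Value": lb_name}],
        StartTime=now - datetime.timedelta(minutes=10),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    # Page as soon as any recent datapoint breaches the threshold.
    for point in stats["Datapoints"]:
        if point["Average"] > threshold_s:
            page_oncall("%s average latency %.2fs exceeds %.2fs"
                        % (lb_name, point["Average"], threshold_s))


if __name__ == "__main__":
    check_latency_threshold()
```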
[1] See the response from AWS in the first reply here: https://forums.aws.amazon.com/thread.jspa?messageID=239106&tstart=0
