8 Aug 2012

Save your AWS webface when Godzilla strikes

The last AWS US-East outage ruffled a few feathers and wasn't fun for those who experienced downtime. As before, however, there's a lesson to be learnt: even the most resilient systems can fail, and contributory strings of events, no matter how disjointed, can and do happen.
The lesson? Simple: design for redundancy (or have contingency measures in place for a quick restore). Let's take a look at a few of the options available for AWS resilience and restore that could help you save face should Godzilla strike.

Multi Zone

Amazon Web Services operate a global infrastructure of data centres grouped into Regions. Inside each Region are a number (3+) of 'Availability Zones' (AZs). Spanning multiple AZs ('Multi-AZ') provides redundancy within a Region, and an intelligently designed platform will span at least two zones, with application data (where possible) held in AWS S3 storage (S3 is designed to store objects redundantly across multiple facilities within a Region).

Technologies used to facilitate redundancy and spread application load across AZs include AWS Elastic Load Balancing (ELB) and auto-scaling. For database replication the options are the AWS Relational Database Service (RDS) or a multiple-instance configuration with a tried, tested (and easy to configure) MySQL master/slave setup. Replicating Apache content can be achieved at a basic level with cron-triggered rsync jobs. If you're mad enough to be using an MS/Srv08/IIS setup, stay clear of the MS Web Farm Framework and contact Cirronix for advice. We have a MUCH easier way, seriously.
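As a rough illustration of the cron-plus-rsync approach, the sketch below builds the command a cron job might run to push a web root from a primary instance to a peer in another zone. The host name, paths and both helper functions are assumptions made up for this example; only the rsync flags themselves (-a, -z, -e, --delete) are standard.

```python
import shlex


def build_rsync_command(source_dir, peer_host, dest_dir, delete=True):
    """Return the rsync argument list that mirrors source_dir onto one peer."""
    # A trailing slash tells rsync to copy the directory's contents,
    # not the directory itself.
    source = source_dir.rstrip("/") + "/"
    cmd = ["rsync", "-az", "-e", "ssh"]
    if delete:
        # Remove files on the peer that no longer exist on the primary.
        cmd.append("--delete")
    cmd += [source, f"{peer_host}:{dest_dir.rstrip('/')}/"]
    return cmd


def cron_line(schedule, cmd):
    """Format a crontab entry from a schedule string and an argument list."""
    return schedule + " " + " ".join(shlex.quote(arg) for arg in cmd)


if __name__ == "__main__":
    # Hypothetical peer host in a second Availability Zone.
    command = build_rsync_command("/var/www", "web2.example.internal", "/var/www")
    print(cron_line("*/5 * * * *", command))
```

Keeping the command as a list makes it easy to hand to subprocess.run later without shell quoting problems; the crontab line is only formatted at the last step.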

Scaling in vanilla AWS is (at time of writing) configured only through scripts using the AWS suite of CLI tools, which requires a degree of familiarity with the backend. Scaling configs are also tied to set AMIs (Amazon Machine Images). Instance replacement is possible, but linked DNS updates and re-assignment of Elastic IPs (EIPs) are not automated and have to be applied manually.
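To give a feel for what scripting the manual EIP step might involve, here is a minimal sketch of the decision logic only: given which instances are healthy, it works out which Elastic IPs should move to which spare. The function name, instance IDs and addresses are all hypothetical, and the actual re-association call to AWS is deliberately left out.

```python
def plan_eip_moves(eip_to_instance, healthy_instances, spares):
    """Return {eip: new_instance} for every EIP whose instance is unhealthy.

    eip_to_instance   -- dict of Elastic IP -> instance ID currently holding it
    healthy_instances -- set of instance IDs that passed a health check
    spares            -- ordered list of standby instance IDs
    """
    # Only fail over to spares that are themselves healthy.
    available = [inst for inst in spares if inst in healthy_instances]
    moves = {}
    for eip, instance in sorted(eip_to_instance.items()):
        if instance in healthy_instances:
            continue  # this address is fine where it is
        if not available:
            break  # nothing left to fail over to
        moves[eip] = available.pop(0)
    return moves


if __name__ == "__main__":
    # Hypothetical state: i-aaa has failed, i-ccc is a healthy spare.
    current = {"203.0.113.10": "i-aaa", "203.0.113.11": "i-bbb"}
    print(plan_eip_moves(current, {"i-bbb", "i-ccc"}, ["i-ccc", "i-ddd"]))
```

Separating the plan from the API call means the logic can be checked without touching a live account.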

Multi Region

Multi-AZ design is a no-brainer, but what happens when Godzilla storms down the East Coast in a really bad mood and takes out a full region? What you need, ideally, is some sort of cross-region platform, right? Well, yes, and it can be achieved: AWS Route 53 can deliver weighted or round-robin DNS for your multi-region ELBs. Doing so, however, means mirroring on a global scale, with all the accompanying cost and maintenance overheads of two (or more) complete platforms. If you're prepared for that then good, just be aware that global redundancy costs.
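For a sense of what weighted DNS across regions looks like in practice, the sketch below builds a change batch in the JSON shape that Route 53's ChangeResourceRecordSets call accepts, one weighted CNAME per region, and works out the share of traffic each region would receive. The domain, ELB host names, weights and helper names are assumptions for illustration, and nothing here is sent to AWS.

```python
import json


def weighted_change_batch(record_name, targets, ttl=60):
    """Build a Route 53 change batch with one weighted CNAME per region.

    targets -- dict of region name -> (ELB DNS name, integer weight)
    """
    changes = []
    for region, (elb_dns, weight) in sorted(targets.items()):
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": region,  # must be unique per weighted record
                "Weight": weight,
                "TTL": ttl,
                "ResourceRecords": [{"Value": elb_dns}],
            },
        })
    return {"Comment": "Weighted multi-region records", "Changes": changes}


def traffic_share(targets):
    """Return the fraction of DNS answers each region should receive."""
    total = sum(weight for _, weight in targets.values())
    if total == 0:
        raise ValueError("at least one weight must be greater than zero")
    return {region: weight / total for region, (_, weight) in targets.items()}


if __name__ == "__main__":
    # Hypothetical ELB host names; 3:1 split in favour of the primary region.
    regions = {
        "us-east-1": ("east-elb.example.com", 3),
        "eu-west-1": ("west-elb.example.com", 1),
    }
    print(json.dumps(weighted_change_batch("www.example.com.", regions), indent=2))
    print(traffic_share(regions))
```

Equal weights give plain round robin; dropping a region's weight to zero while others stay positive takes it out of rotation without deleting the record.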

CloudFormation

AWS CloudFormation allows you to replicate your full stack from JSON templates, either in its home zone(s) or adapted for alternate regions. CF is incredibly useful for immediate re-launch (of all instances and accompanying parameters), not only for disaster recovery but also for development and testing.
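A minimal sketch of such a template, generated from Python so the per-region AMI mapping is easy to maintain. The AMI IDs are placeholders, and the resource name, parameter and helper function are assumptions for this example; a real stack would carry far more (security groups, ELBs, scaling groups and so on).

```python
import json


def build_template(region_amis, instance_type="t1.micro"):
    """Return a CloudFormation template (as a dict) for one web instance.

    region_amis -- dict of region name -> AMI ID to launch in that region
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Single web server, relaunchable in any mapped region",
        "Parameters": {
            "KeyName": {"Type": "String", "Description": "EC2 key pair name"},
        },
        "Mappings": {
            # One AMI per region, since AMI IDs are region-specific.
            "RegionMap": {region: {"AMI": ami} for region, ami in region_amis.items()},
        },
        "Resources": {
            "WebServer": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    # Look up the AMI for whichever region the stack launches in.
                    "ImageId": {"Fn::FindInMap": ["RegionMap", {"Ref": "AWS::Region"}, "AMI"]},
                    "InstanceType": instance_type,
                    "KeyName": {"Ref": "KeyName"},
                },
            },
        },
    }


if __name__ == "__main__":
    # Placeholder AMI IDs, not real images.
    amis = {"us-east-1": "ami-11111111", "eu-west-1": "ami-22222222"}
    print(json.dumps(build_template(amis), indent=2))
```

Because the AMI is resolved from the region at launch time, the same JSON can be used unchanged in the home region or in an alternate one.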

Snapshots

At the most basic end of redundancy (especially if you don't run S3-backed app data) are instance snapshots. In a word: take them and/or automate them (Here's how). If you have current snapshots of your EBS volume data you at least stand some chance of restoration, as creating new volumes and swapping one in as the /dev/sda1 boot volume of a new instance is relatively quick and straightforward.
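Automating snapshots usually also means pruning old ones. The sketch below shows one possible retention rule: always keep the newest few snapshots, and delete anything beyond those that is older than a cut-off. The function name, snapshot IDs and default values are assumptions for illustration; taking and deleting the snapshots themselves would be done through the AWS tools and is not shown.

```python
from datetime import datetime, timedelta


def snapshots_to_delete(snapshots, now, keep_days=7, keep_min=3):
    """Return the IDs of snapshots that are safe to prune.

    snapshots -- list of (snapshot_id, created_at) tuples
    keep_days -- anything newer than this many days is kept
    keep_min  -- the newest keep_min snapshots are kept regardless of age
    """
    newest_first = sorted(snapshots, key=lambda snap: snap[1], reverse=True)
    cutoff = now - timedelta(days=keep_days)
    doomed = []
    for index, (snap_id, created_at) in enumerate(newest_first):
        if index < keep_min:
            continue  # always keep the most recent few
        if created_at < cutoff:
            doomed.append(snap_id)
    return doomed


if __name__ == "__main__":
    today = datetime(2012, 8, 8)
    # Hypothetical snapshots taken 0, 1, 2, 8 and 30 days ago.
    history = [(f"snap-{age}", today - timedelta(days=age)) for age in (0, 1, 2, 8, 30)]
    print(snapshots_to_delete(history, today))
```

The keep_min floor matters when a cron job has silently stopped: without it, a pure age-based rule would eventually delete the last good snapshot.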

3rd Party Tools

Vanilla AWS provides an amazing base-level cloud framework but is admittedly lacking when it comes to easy config of automated redundancy. Thankfully, to make our lives easier, there are a number of 3rd party cloud management suites available which add substantial value to backend AWS. Scalr is one such offering, and one which Cirronix prefer and highly recommend.

With plans from $99, Scalr is feature rich and adds the following over base AWS:

  • Replication of instance changes to live (scaled) copies (includes base AMI update & live replacement).
  • Automated instance replacement from failure.
  • MySQL backup, rollback & instance resilience (Slave > Master promotion).
  • Multi-Region farms.
  • Multi-cloud deployment.
  • Managed (easy) vHosting.
  • phpMyAdmin.
  • Cron scripting.

You can, of course, never have enough redundancy, though by taking note of what's on offer and designing to your limits you will certainly reduce the potential for downtime. And even if you're a sole operator on a limited budget running nothing but a single micro-instance on the free tier, you can still take advantage of cost-effective options to implement solid disaster recovery. Scalr is the cream of the crop, but Pingdom and AutoSnappy can go a long way to fighting Godzilla.

2 comments:

techstyled said...

You recommend to stay clear of MS Web Farm Framework. Considering the level of activity from MS surrounding that "platform" (i.e. none), that makes sense but could you please elaborate on that?

RichBos said...

Hi, sure, basically, it just doesn't work and is horrendously convoluted. We lost a week (or more) trying to get it to replicate but with no joy at all. As I recall the 'slave' just wouldn't drop in. Admittedly we were installing it to an already populated IIS installation, and on further tests we managed to get a clean (new) empty framework connected with all parts confirmed as seeing each other, but by then we'd totally lost confidence in it (plus the client had overreached their budgeted development time for the project).

Our (now tried and tested) solution involves simple scripted robocopy with IIS shared config and works flawlessly.