Coping With Cloud Downtime
Tags: apache, aws, cloud, downtime, ec2, HA, iaas, linux, mylvmbackup, mysql, puppet, rdiff-backup, rsnapshot, rsync, s3, unix
In light of the recent Amazon cloud service interruptions, it seems like a good time to share some ideas about how to keep cloud-hosted services available during unexpected and potentially long-lasting outages.
Each of these items can be implemented using free and open source software either hosted in your own datacenter or *gasp* in a different cloud.
Use Puppet instead of custom machine images.
Sure, custom machine images are useful for quickly lighting up lots of copies of the same system, but they fall down in many ways.
A configuration management system like Puppet gives you the same ability to customize machines, avoids those shortcomings, and provides a number of extra features.
Synchronize your data elsewhere on a regular basis.
Cloud storage offers some really compelling features and is very useful, don’t get me wrong. But I wouldn’t expect it to always be there. Especially if the backups and snapshots that you depend on for recovery are hosted with the same provider, running the same software, potentially inside the same physical facilities. After all, cloud storage is susceptible to the same problems as any other storage platform. It’s a good idea to keep backups outside the cloud (or at least inside a different cloud). The simpler the backup methodology, the better.
Rsync, mylvmbackup, rsnapshot and rdiff-backup are great open source backup tools. They can transfer data securely and are efficient in both bandwidth and on-disk size.
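As a rough illustration of how simple an off-cloud sync can be, here is a minimal Python sketch that wraps rsync over SSH. The function names, the source path and the destination host are all made up for the example; only the rsync flags themselves (-a, -z, --delete, -e, --dry-run) are real. It assumes rsync and SSH key access are already set up on both ends.

```python
import shlex
import subprocess


def build_rsync_command(source_dir, dest, dry_run=False):
    """Return the rsync argument list for a one-way mirror of source_dir to dest."""
    cmd = [
        "rsync",
        "-a",         # archive mode: recurse, keep permissions, times, symlinks
        "-z",         # compress data in transit to save bandwidth
        "--delete",   # remove files at the destination that no longer exist locally
        "-e", "ssh",  # tunnel the transfer over SSH
    ]
    if dry_run:
        cmd.append("--dry-run")  # report what would change without copying anything
    # A trailing slash on the source copies the directory's contents,
    # not the directory itself.
    cmd += [source_dir.rstrip("/") + "/", dest]
    return cmd


def run_backup(source_dir, dest, dry_run=False):
    """Run the mirror and raise CalledProcessError if rsync exits non-zero."""
    cmd = build_rsync_command(source_dir, dest, dry_run)
    print("running:", shlex.join(cmd))
    subprocess.run(cmd, check=True)
```

Calling run_backup("/var/www", "backup@offsite.example.com:/srv/backups/www", dry_run=True) from cron on a regular schedule would show what a real run would transfer. Note that --delete makes the destination an exact mirror, so pair it with something like rsnapshot or rdiff-backup if you also want history.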
Manage your systems from outside the cloud.
If your infrastructure is a cloud and the running instances are aircraft, then configuration management and monitoring systems are ground control. If they are hosted in the same cloud as the systems they manage, a single outage takes out both. Game over.
By separating your management infrastructure and your service delivery infrastructure you gain the ability to monitor and manage systems remotely. You can even quickly deploy replacement resources elsewhere using a config management system and then copy your most recent off-cloud backup up there to restore your database and web content.
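To make the "ground control" idea concrete, here is a minimal sketch of a health check that could run from a machine outside the cloud. It uses only the Python standard library; the function names and the example URLs are placeholders, and a real deployment would more likely use a proper monitoring system than a hand-rolled script.

```python
import http.client
import urllib.error
import urllib.request


def is_healthy(url, timeout=5.0):
    """Return True if url answers with a 2xx status within timeout seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers HTTP error statuses, DNS failures, refused connections,
        # timeouts and malformed responses.
        return False


def failed_services(urls, timeout=5.0):
    """Return the subset of urls that did not pass the health check."""
    return [url for url in urls if not is_healthy(url, timeout)]
```

Run from cron on the management host, something like failed_services(["https://www.example.com/", "https://api.example.com/status"]) returns the list of endpoints that need attention. Because the check runs outside the cloud, it keeps working when the cloud does not.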
Optimize your DNS configuration.
If things are looking really bad you can at least put up a maintenance page. That is, if you can update your DNS and get the changes to propagate quickly. Ensuring the following ahead of time will save you many headaches in the event of a service outage.
Verify that you have a backup MX.
While we’re on the subject of DNS, it’s a good idea to make sure you have a backup mail exchanger configured and defined as an MX for your domain. It doesn’t have to be anything fancy, just something to receive and queue up mail until you’re able to get the primary mail system back online.
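If it isn't obvious why a second MX record helps, this small Python sketch models how a sending mail server chooses where to deliver: it tries exchangers in order of MX preference, lowest number first. The hostnames and preference values are invented for the example, and the model is simplified (real senders pick randomly among hosts with equal preference).

```python
def delivery_order(mx_records):
    """Sort (preference, hostname) pairs the way a sending server tries them.

    Lower preference numbers are tried first.
    """
    return [host for _, host in sorted(mx_records)]


def pick_exchanger(mx_records, reachable):
    """Return the first exchanger in preference order that is reachable."""
    for host in delivery_order(mx_records):
        if host in reachable:
            return host
    # Nothing reachable: the sender keeps the message queued and retries later.
    return None


# Hypothetical records: the primary lives in the cloud, the backup elsewhere.
records = [(20, "backup-mx.example.net"), (10, "mail.example.com")]
```

With the primary down, pick_exchanger(records, {"backup-mx.example.net"}) returns the backup host, which holds the mail until the primary is back. With no backup MX at all, senders have to queue the mail themselves and will eventually give up and bounce it.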
Keep thinking and talking about it!
Hopefully the ideas above are a good starting point for protecting yourself against unexpected outages, but they are definitely not the whole story.
What practices have worked well for you or your company?
What ideas do you have?