Coping With Cloud Downtime

Tags: apache, aws, cloud, downtime, ec2, HA, iaas, linux, mylvmbackup, mysql, puppet, rdiff-backup, rsnapshot, rsync, s3, unix

In light of the recent Amazon cloud service interruptions, it seems like a good time to share some ideas about how to keep cloud-hosted services available during unexpected and potentially long-lasting outages.

Each of these items can be implemented using free and open source software either hosted in your own datacenter or *gasp* in a different cloud.

Use Puppet instead of custom machine images.

Sure, custom machine images are useful for quickly lighting up lots of copies of the same system, but they fall down in many ways.

Applying/reverting changes to already running instances is a manual process.

No way to ensure that systems are in a consistent state.

Updating images is cumbersome, especially for minor config tweaks.

Machine instance versions are a pain to keep track of, and can sprawl out of control.

Dependency on specific cloud resources can make them difficult to access during an outage.

Small differences in system configs require a whole new machine image.

A configuration management system like Puppet gives you the ability to customize machine images, avoids the above shortcomings, and provides a number of extra features.

Automatic propagation of configuration changes to all relevant systems.

Regular checks are performed to ensure system consistency. If a difference is detected the necessary updates are made automatically.

Provides a single location and interface for system configuration.

Small differences between systems are easy to incorporate while using the same template.

Provides the ability to push a configuration change out to all systems right away.

There are no dependencies on a particular cloud technology.

Revision control is easy to hook in and lets you quickly revert changes that didn’t go quite as expected.

System configuration details are inherently documented in detail by the configuration manifests.
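To make the "check for consistency, then fix only what differs" idea concrete, here is a minimal sketch in Python. It is not Puppet code and does not use Puppet's API; the setting names, the `web01` host and the dictionaries standing in for system state are all made up for illustration.

```python
from typing import Dict


def plan_changes(desired: Dict[str, str], actual: Dict[str, str]) -> Dict[str, str]:
    """Return only the settings whose current value differs from the desired one."""
    return {key: value for key, value in desired.items() if actual.get(key) != value}


def converge(desired: Dict[str, str], actual: Dict[str, str]) -> Dict[str, str]:
    """Apply the needed changes in place and report what was changed."""
    changes = plan_changes(desired, actual)
    actual.update(changes)
    return changes


# Hypothetical shared template, plus one small per-host difference.
base = {"ntp_server": "pool.ntp.org", "max_clients": "150"}
web01 = {**base, "max_clients": "300"}

# Pretend this is what is currently on the machine.
state = {"ntp_server": "pool.ntp.org", "max_clients": "150"}

first_run = converge(web01, state)   # corrects the one setting that differs
second_run = converge(web01, state)  # already consistent, so nothing to do
```

The second run changing nothing is the important part: running the same check over and over is safe, which is what lets a config management system enforce consistency on a schedule.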

Synchronize your data elsewhere on a regular basis.

Cloud storage offers some really compelling features and is very useful, don’t get me wrong. But I wouldn’t expect it to always be there, especially if the backups and snapshots that you depend on for recovery are hosted with the same provider, running the same software, potentially inside the same physical facilities. After all, cloud storage is susceptible to the same problems as any other storage platform. It’s a good idea to keep backups outside the cloud (or at least inside a different cloud). The simpler the backup methodology, the better.

Rsync, mylvmbackup, rsnapshot and rdiff-backup are great open source backup tools that are secure and are optimized for efficiency with bandwidth and on-disk size.
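As one possible starting point, the sketch below builds an rsync command line for pushing a directory to a host outside the cloud over SSH. The paths and the `backup.example.net` host are placeholders, and the choice of options (including `--delete`, which removes files on the destination that no longer exist at the source) is an assumption you should adjust to your own retention needs.

```python
from typing import List


def build_rsync_command(source: str, destination: str, delete: bool = True) -> List[str]:
    """Build an rsync argument list for an off-cloud copy over SSH."""
    cmd = [
        "rsync",
        "--archive",    # preserve permissions, times, symlinks, etc.
        "--compress",   # compress data in transit to save bandwidth
        "--rsh=ssh",    # use SSH as the transport
    ]
    if delete:
        # Mirror deletions too. Leave this off if the destination is your only history.
        cmd.append("--delete")
    cmd += [source, destination]
    return cmd


# Placeholder source path and off-cloud destination.
command = build_rsync_command("/var/www/", "backup@backup.example.net:/srv/backups/www/")

# To actually run it (for example from cron):
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing the command as a list rather than a shell string avoids quoting problems with unusual file names.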

Manage your systems from outside the cloud.

If your infrastructure is a cloud and the running instances are aircraft, then configuration management and monitoring systems are ground control. If they were hosted in the same cloud as the systems they manage, game over.

By separating your management infrastructure and your service delivery infrastructure you gain the ability to monitor and manage systems remotely. You can even quickly deploy replacement resources elsewhere using a config management system and then copy your most recent off-cloud backup up there to restore your database and web content.
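Here is a rough sketch of the kind of logic an outside monitor might use to decide that an outage is real and not just a blip. The `OutageDetector` class, the threshold of three consecutive failures and the simulated probe results are all assumptions for illustration; a real probe would be something like an HTTP or TCP check run from outside the cloud.

```python
from typing import Callable


class OutageDetector:
    """Flag an outage only after several consecutive failed probes."""

    def __init__(self, probe: Callable[[], bool], threshold: int = 3) -> None:
        self.probe = probe          # returns True when the service looks healthy
        self.threshold = threshold  # consecutive failures needed to declare an outage
        self.failures = 0

    def check(self) -> bool:
        """Run one probe; return True once the failure threshold is reached."""
        if self.probe():
            self.failures = 0       # any success resets the count
        else:
            self.failures += 1
        return self.failures >= self.threshold


# Simulated probe results standing in for real checks.
results = iter([True, False, False, True, False, False, False])
detector = OutageDetector(lambda: next(results), threshold=3)
outcomes = [detector.check() for _ in range(7)]
```

Only the last check in this run reports an outage, since the earlier pair of failures was interrupted by a success. That result is the point at which you would kick off the replacement deployment and restore from the off-cloud backup.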

Optimize your DNS configuration.

If things are looking really bad you can at least put up a maintenance page. That is, if you can update your DNS and get the changes to propagate quickly. Ensuring the following ahead of time will save you many headaches in the event of a service outage.

Set your DNS TTL low. This makes your DNS updates take effect more quickly, reducing the number of error messages your users see.

Use as many DNS servers as you can get your hands on. With lots of DNS servers, problems with one or more of them are less likely to affect you. After all, they may be using the same cloud resources that you are!
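Both DNS checks are easy to automate. The sketch below audits a list of records for high TTLs and a short name server list. The `Record` type, the sample zone data, and the limits of 300 seconds and three name servers are illustrative assumptions, not recommendations from any standard.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    name: str
    ttl: int     # seconds
    rtype: str   # "A", "CNAME", "NS", ...
    value: str


def audit_zone(records: List[Record], max_ttl: int = 300, min_nameservers: int = 3) -> List[str]:
    """Return warnings for address records with a high TTL and for too few NS records."""
    warnings = []
    for rec in records:
        if rec.rtype in ("A", "AAAA", "CNAME") and rec.ttl > max_ttl:
            warnings.append(f"{rec.name} {rec.rtype}: TTL {rec.ttl}s exceeds {max_ttl}s")
    ns_count = sum(1 for rec in records if rec.rtype == "NS")
    if ns_count < min_nameservers:
        warnings.append(f"only {ns_count} NS records; want at least {min_nameservers}")
    return warnings


# Made-up zone data using documentation-only names and addresses.
zone = [
    Record("example.com.", 86400, "A", "192.0.2.10"),
    Record("www.example.com.", 300, "CNAME", "example.com."),
    Record("example.com.", 86400, "NS", "ns1.example.net."),
    Record("example.com.", 86400, "NS", "ns2.example.org."),
]
warnings = audit_zone(zone)
```

With a TTL of 86400 seconds, a resolver that cached the old address just before your change can keep handing it out for up to a day, which is exactly what you don't want during an outage.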

Verify that you have a backup MX.

While we’re on the subject of DNS, it’s a good idea to make sure you have a backup mail exchanger configured and defined as an MX for your domain. It doesn’t have to be anything fancy, just something to receive and queue up mail until you’re able to get the primary mail system back online.
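As a quick illustration of how a backup MX fits in: sending mail servers try the exchanger with the lowest preference number first and fall back to higher numbers when it is unreachable. The sketch below checks a set of MX records along those lines; the host names and preference values are placeholders.

```python
from typing import List, Tuple

MxRecord = Tuple[int, str]  # (preference, mail host)


def delivery_order(mx_records: List[MxRecord]) -> List[str]:
    """Return mail hosts in the order senders should try them (lowest preference first)."""
    return [host for _, host in sorted(mx_records)]


def has_backup_mx(mx_records: List[MxRecord]) -> bool:
    """True when at least two distinct mail hosts are listed."""
    return len({host for _, host in mx_records}) >= 2


# Placeholder records: a primary plus a backup that just queues mail.
mx = [(20, "backup-mx.example.net."), (10, "mail.example.com.")]
order = delivery_order(mx)
covered = has_backup_mx(mx)
```

For the backup to be useful during a cloud outage, it should of course live somewhere other than the primary.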

Keep thinking and talking about it!

The above ideas are hopefully a good starting point for protecting yourself against unexpected outages, but it definitely doesn’t stop here.

What practices have worked well for you or your company?

What ideas do you have?