When Failovers Fail

I was going to give this post the title “Everything will ultimately fail”, but that seemed a bit sad. So here are some kittens to make you feel better:

At my current company, we have crept forward in building highly available systems. We moved from the overnight backup, to the hourly differential backup, to the replicated database, and so on. Some of our systems are now multi-master or hot failover. Examples include Always On databases and clustered Windows services, and on the infrastructure side: blade clusters, SANs, replicated file shares and Active Directory.

That’s great, because these replicated systems have data loss that approaches zero (at our level of load, replication completes faster than the interval between writes), and our time to recovery is also much lower. Compare that to 10 years ago, when we were targeting a multi-hour failover process for a full-site DR.

What’s interesting is that we now have a new type of failure: the automated failover that itself fails. Most often this is because the “heartbeat” that tells the system to cut over is still working but the system is not; or the system is functionally up (i.e., powered on and responding to TCP connections) but so slow that no data is returned; or a downstream system is dead or responding very slowly.
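One way to picture the gap between a heartbeat and real health is a check that only passes when a request actually returns data within a deadline. The sketch below is an illustration, not a description of any particular cluster product: `deep_health_check`, `should_fail_over`, the status strings and the "three bad checks in a row" rule are all hypothetical names and choices, and `probe` stands in for whatever end-to-end request makes sense for the system (for example, a small query that also touches the downstream dependency).

```python
import concurrent.futures
import time
from typing import Callable, Sequence


def deep_health_check(probe: Callable[[], object], deadline_s: float) -> str:
    """Run one end-to-end probe and classify it (hypothetical helper).

    'healthy' -> the probe returned data within the deadline
    'slow'    -> the probe did not finish in time (node is up but useless)
    'failed'  -> the probe raised an error or returned no data
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(probe)
    try:
        result = future.result(timeout=deadline_s)
    except concurrent.futures.TimeoutError:
        return "slow"
    except Exception:
        return "failed"
    finally:
        # Do not wait for a hung probe; the check itself must stay responsive.
        executor.shutdown(wait=False)
    return "healthy" if result is not None else "failed"


def should_fail_over(results: Sequence[str], consecutive_bad: int = 3) -> bool:
    """Assumed policy: cut over only after N non-healthy checks in a row."""
    recent = results[-consecutive_bad:]
    return len(recent) == consecutive_bad and all(r != "healthy" for r in recent)


if __name__ == "__main__":
    def slow_probe() -> str:
        time.sleep(0.3)  # simulates a node that accepts connections but stalls
        return "late data"

    history = [deep_health_check(slow_probe, deadline_s=0.05) for _ in range(3)]
    print(history, should_fail_over(history))
```

A plain TCP heartbeat would call the stalled node in this example alive; the probe with a deadline treats "too slow to return data" as a failure in its own right.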

Well, who cares? It’s better than it was and we still have less impact than before, right? Sometimes, that is true.

My worry is that an excellent HA system is really expensive to build, so you tend to pile a lot onto it, and when it goes wrong you lose a lot at the same time. Automated failovers don’t degrade gracefully, and they are often hard to fail over manually (especially if they are partially dead). An example would be an HA blade cluster that supports VMware. If the SAN decides once a year that it isn’t happy, you lose 50 VMs. But did you lose more time than 50 physical machines throwing a hard drive over the same period?
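That last question is just arithmetic once you have your own figures. Here is a minimal sketch of the comparison; only the 50 machines and the once-a-year SAN problem come from the example above. The outage length, the disk failure rate and the repair time are made-up placeholders, so plug in real numbers before drawing any conclusion.

```python
def shared_outage_hours(vm_count: int, outages_per_year: float,
                        hours_per_outage: float) -> float:
    """Machine-hours lost per year when every VM goes down together."""
    return vm_count * outages_per_year * hours_per_outage


def independent_outage_hours(machine_count: int, annual_failure_rate: float,
                             hours_per_repair: float) -> float:
    """Expected machine-hours lost per year when machines fail one at a time."""
    return machine_count * annual_failure_rate * hours_per_repair


if __name__ == "__main__":
    # Hypothetical inputs: an 8-hour SAN outage, a 5% yearly disk failure
    # rate per machine, and 24 hours to repair and rebuild one machine.
    cluster = shared_outage_hours(50, outages_per_year=1, hours_per_outage=8)
    physical = independent_outage_hours(50, annual_failure_rate=0.05,
                                        hours_per_repair=24)
    print(f"HA cluster: {cluster:.0f} machine-hours per year")
    print(f"Physical:   {physical:.0f} machine-hours per year")
```

Machine-hours hide the part that hurts most: the cluster loses all 50 at the same moment, while the physical machines fail one at a time.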

Also, split-brain syndrome is hard to rehearse recovery from, because it happens so rarely and can be hard to trigger deliberately as a test.

I don’t have any answers. Just keep those overnight backups? And make sure you can restore them.

From another article on the same subject: page 48, https://manohars.files.wordpress.com/2009/11/97-things-every-software-architect-should-know.pdf



