Virtualization is bad for database integrity

2011-04-30

One of the lessons about Amazon's cloud failure last week is that "the cloud" is incredibly overhyped. The cloud doesn't necessarily deliver a lot of the things implicitly promised in the popular imagination – such as improved uptime or reliability, or even cost savings. But the real problem with the cloud is that you actually get less database reliability than proper hardware. This is because each layer of indirection introduces an additional unknown that must be accounted for.

A database like MySQL or Oracle is able to guarantee atomicity and durability through a number of features. One of these features is a transaction journal (or binary log). These files contain the sequential commands that mutate the database. In the event of a power or disk failure, a database will "replay" its journal upon restarting to ensure that all operations it has guaranteed to be completed are in fact completed.

One complication with the transaction journal is that hard disks are actually complicated abstractions of physical media. To achieve high performance, hard disks have buffers and caches that can be dangerous because they are volatile: imagine a scenario where the database is assured that data is written to disk, when in fact it has only been written to disk cache, and disappears on power loss. Any guarantee about reliability that depends on the data being written to disk is therefore invalid, because we can't be sure that it will actually survive a power failure. In fact, most implementations of fsync(), which flushes data to the disk, do not guarantee that the information is actually written to the physical media; it might just be saved in the write-back cache.

Another complication stemming from hard disk abstraction is write re-ordering. This is sometimes known as Native Command Queuing or the elevator algorithm. This feature, intended to increase performance, means the hard drive may write data to disk in a different order than the order in which the commands were issued. This is a massive problem for database integrity because it literally makes it impossible to reason about atomicity guarantees if a COMMIT can be written to disk before previously sent data.

In fact, losing power during a reordered transaction probably won't damage anything until power is restored, when the bad transaction is 'replayed', corrupting the filesystem.

Both of these are not new problems; they are well understood. Highly reliable databases hack fsync() with flags and other dummy commands for every unique snowflake operating system to ensure write order and data durability. However, when you introduce a virtualization layer, all of this basically goes to hell. You cannot reason that fsync() means anything in a virtual machine because it's just a virtual hard drive telling you that it has written the data to its virtual persistence. Is the data on the physical disk, or is it not? The only way to know for sure is to disable the write cache, which is not controlled by the guest operating system.

With respect to write reordering, one of the best defenses is to use a journaled filesystem and just pray that a journal write doesn't get reordered at the exact moment that you lose power or crash. Even better is to use a journaled filesystem that has journal checksumming, so that invalid journal entries can be detected and ignored. But again, the success of all these systems depends on actually having control over the physical disk, which is exactly what virtual operating systems don't have.

What this really means is that with a database on a virtual machine, the best you can hope for is that it is NO LESS STABLE than the same implementation on a real hardware. This is a big leap of faith, considering that it is already extremely difficult to offer ACID guarantees (ex: caches without battery backups, weird fsync() behavior, and the variability of platforms and disk controllers in general).