Debates about the risks of write-back caching have been around for decades. Write-back caches have been employed using many different methodologies in an attempt to work around the issues, but with each solution, risks remain. Write-back caching is not inherently bad; however, it should never be employed without a full understanding of the implications.
Writing before journaling
How systems deal with unexpected power failures – or temporary loss of access to the storage media – determines how reliable they are. In addition, how quickly they can boot into the operating system and be ready to service requests following recovery procedures is especially important for the modern enterprise.
In the bad old days of pre-journaled filesystems, filesystem integrity applications such as FSCK and CHKDSK had to run before a system could boot. This changed with the journaled filesystem.
Writing to a modern filesystem updates three main structures, and all three must stay in sync: the data blocks (user data), the metadata (such as the inode), and the volume bitmap. Unless all three are updated, you end up with an inconsistent filesystem. This can occur in a number of different ways, incorrect or cross-linked data being the most common. This is what journaling was supposed to help solve.
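The relationship between those three structures can be sketched with a toy in-memory filesystem. The names and layout here are purely illustrative, not any real on-disk format:

```python
# Toy sketch of the three structures that must stay in sync when a file
# gains a new block. Hypothetical in-memory layout, for illustration only.

class ToyFS:
    def __init__(self, nblocks=8):
        self.bitmap = [False] * nblocks   # volume bitmap: which blocks are in use
        self.blocks = [b""] * nblocks     # user data blocks
        self.inodes = {}                  # metadata: filename -> list of block numbers

    def append_block(self, name, payload):
        blk = self.bitmap.index(False)    # find a free block
        self.blocks[blk] = payload        # 1) write the user data
        self.inodes.setdefault(name, []).append(blk)  # 2) update the metadata
        self.bitmap[blk] = True           # 3) mark the block allocated
        return blk

fs = ToyFS()
fs.append_block("log.txt", b"hello")
# A crash between any two of the three steps leaves the structures
# disagreeing with each other: an inconsistent filesystem.
```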
Journaled filesystems: write order is important
To get away from the frequent FSCK-on-boot requirements, filesystem engineers had to rely on “disk magic.” This is where metadata journaling came about. The journal wouldn’t contain the data itself; it would only contain the metadata and the bitmap. Data and journal would be written separately. This would help ensure that the data was written out in the correct order.
In a journaled filesystem, the three elements (data block, metadata, and bitmap) are written separately and in sequence. The disk is first presented with the data block write, then the metadata, and then the update to the bitmap. This is done to preserve crash consistency; if the system crashes, it can replay the transactions from the filesystem log, resulting in a consistent system.
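A minimal sketch of that ordering, using an in-memory "disk" and a journal of committed metadata records (all names and structures here are hypothetical simplifications):

```python
# Metadata journaling, heavily simplified. The data block is written
# first; only then is the metadata transaction committed to the journal.
# On recovery, only committed records are replayed, so metadata never
# points at data that was never written.

disk_data = {}   # block number -> bytes (user data area)
disk_meta = {}   # filename -> block number (metadata area)
journal = []     # committed metadata transactions

def write_file(name, block, payload):
    disk_data[block] = payload          # step 1: data block reaches the disk
    journal.append({"name": name, "block": block})  # step 2: journal commit

def replay():
    # Crash recovery: rebuild metadata from committed journal records only.
    for rec in journal:
        disk_meta[rec["name"]] = rec["block"]

write_file("a.txt", 0, b"data")
disk_data[1] = b"orphan"   # simulate a crash *before* the journal commit
replay()
# Block 1 holds data but no metadata references it: the benign failure
# mode, since the space can simply be reclaimed later.
```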
Bad things happen if you perform these writes out of order. Writing metadata before the data block, for example, could result in a filesystem that believes data should exist at a given block when in fact the system had crashed before the data block had been written, thus resulting in corrupt data being presented to applications.
The worst-case scenario in a journaled filesystem that has preserved write order is when the filesystem believes that a block of data does not exist when in fact it had been written to disk before the system crashed. This is better than believing that data exists when it does not; it doesn’t result in “lying” to the operating system or applications about what data is present, and a block of written data that isn’t referenced by the metadata or bitmap can be reclaimed as free space later on.
Preserving write order can be hard
Over time, journaled filesystems have proved their worth. They have proved to be more reliable and quicker to mount, but they are highly vulnerable to any technology that proposes to accelerate writes. Journaled filesystems are absolutely reliant on the concept that the underlying storage medium will preserve the write ordering of critical operations. Here even common technologies such as native command queuing can cause problems, requiring techniques such as “write fences” that are beyond the scope of this article.
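At the application level, the closest analogue to a write fence is an explicit flush: fsync acts as a barrier ensuring the data is durable before anything that references it is written. This sketch is illustrative; filesystems enforce the same ordering internally with their own mechanisms:

```python
# An explicit flush (fsync) used as an ordering barrier: the data file is
# forced to stable storage before the record that references it is
# written. Illustrative application-level analogue of a write fence.

import os
import tempfile

d = tempfile.mkdtemp()
data_path = os.path.join(d, "data.bin")
ref_path = os.path.join(d, "ref.txt")

with open(data_path, "wb") as f:
    f.write(b"payload")
    f.flush()
    os.fsync(f.fileno())   # barrier: data is durable before we reference it

with open(ref_path, "w") as f:
    f.write(data_path)     # metadata-like record, written only after the barrier
    f.flush()
    os.fsync(f.fileno())
```

Without the first fsync, the operating system or a write-back cache is free to reorder the two writes, and a crash could leave a reference to data that never reached the disk.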
Write-back caches exacerbate this problem. Unless a write-back cache can preserve the order of critical operations, it offers no guarantee that the filesystem’s representation of the data on the storage device is in fact the correct data stored on that device.
For a write optimizer (including a write-back cache) to operate successfully and perform any type of optimization, it has to be filesystem-aware. Its optimizations must be designed to flush user data to disk and then flush metadata. This requirement is especially important in a high availability environment and increasingly an issue in large virtualized datacenters where a single storage point services hundreds or even thousands of servers that in turn can service millions of users.
Breaking crash consistency
Write optimizers can typically be lumped into one of three general categories:
1) One linear sweep (elevator sort)
2) Write coalescing
3) Removal of intermediate writes
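The three categories can be sketched on a toy queue of (block, payload) writes; the block numbers and payloads here are made up for illustration:

```python
# Toy sketches of the three write-optimizer categories. Each one reorders
# or drops writes, which is exactly what breaks crash consistency when
# the queue spans a write barrier.

pending = [(9, b"A"), (2, b"B"), (3, b"C"), (9, b"D"), (5, b"E")]

# 1) One linear sweep (elevator sort): reorder by block number so the
#    disk head moves in one direction.
elevator = sorted(pending, key=lambda w: w[0])

# 2) Write coalescing: merge writes to adjacent blocks into one larger
#    I/O (here, one byte per block for simplicity).
coalesced = []
for blk, payload in elevator:
    if coalesced and coalesced[-1][0] + len(coalesced[-1][1]) == blk:
        coalesced[-1] = (coalesced[-1][0], coalesced[-1][1] + payload)
    else:
        coalesced.append((blk, payload))

# 3) Removal of intermediate writes: keep only the last write to each
#    block and drop the ones it overwrote.
last = {}
for blk, payload in pending:
    last[blk] = payload
deduped = sorted(last.items())
```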
All of these can break crash consistency. In one way or another, they change the ordering of writes with respect to write barriers. User data is not necessarily written before metadata, at which point we’re back to storage we cannot trust.
It is theoretically possible to use these techniques and preserve crash consistency, but it is very difficult, and the gains obtained are not likely to be meaningful. If you aren’t worried about crash consistency, then the standard write optimization techniques can provide performance gains between 30% and 50%. If, however, reliable storage is a key concern, then personal experience and internal testing have shown that crash-consistent write optimization techniques typically deliver less than 10% improvement in write performance.
By taking the steps necessary to preserve crash consistency, write caches rarely actually reduce the write workload by any significant amount. Because of this, write caches do not accelerate write-bound systems (since the amount of data flowing to the disks is roughly the same, and the disks are the bottlenecks). Write caches only serve to delay writes in the hope that the disk will not be as busy in the future.
This can be of some benefit, assuming that your cumulative storage load is very bursty. In this case, a write-back cache can serve to buffer writes between lulls in activity. Unfortunately, during the period in which a write is delayed, the disk is out of date and may be in a non-crash-consistent state. What’s more, demand generally expands to meet capacity; although the lulls in your write patterns may be enough to empty the write-back cache today, there is no guarantee that this will continue.
Solving the problem
The holy grail of write-back caching is guest OS cooperation. In virtualized environments, that cooperation must extend to the hypervisor as well. The ultimate goal for host-side caching would be a hypervisor-aware and caching-aware filesystem built into the guest operating system. In this scenario guest OSs would be aware that they exist inside a hypervisor and that there is – or at least could be – a layer of cache attached to that hypervisor.
The guest OS needs a means of informing the layers below it (hypervisor, caching software) that it is trying to preserve consistency. It would force one set of data to be committed before another set of data is committed, passing “order of writes” information through the storage layers, allowing write eliminations to be done while still preserving consistency. This would also allow for logical writes to be split across independent devices (e.g. the cache device and the backing store) with the contextual information required to guarantee crash consistency.
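One way such "order of writes" information could flow through a caching layer is an explicit barrier API: the cache is free to optimize within a barrier epoch but never across one. This sketch is a hypothetical design, not any shipping product:

```python
# Hypothetical barrier-aware write cache: the layer above marks ordering
# points, and the cache may reorder or drop writes only *within* an
# epoch, never across a barrier.

class BarrierCache:
    def __init__(self):
        self.epochs = [[]]   # each epoch is a list of (block, payload) writes

    def write(self, block, payload):
        self.epochs[-1].append((block, payload))

    def barrier(self):
        # Ordering point passed down from the filesystem: start a new epoch.
        self.epochs.append([])

    def flush(self):
        out = []
        for epoch in self.epochs:
            last = {}
            for blk, payload in epoch:   # drop intermediate writes freely
                last[blk] = payload
            out.extend(sorted(last.items()))  # elevator sort within the epoch
        return out

c = BarrierCache()
c.write(7, b"data")
c.barrier()          # "data must be committed before metadata"
c.write(1, b"meta")
# flush() emits block 7 before block 1, even though an unconstrained
# elevator sort would have reversed them.
```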
The more layers of obfuscation between the application and the physical storage, the greater the risk of losing crash consistency, the more complex the code required to manage it, and the smaller the performance boost that can be realized risk-free.
Please watch for our next installment on write-back caching as we continue to unfold how we address this piece of the market.