The Technical Implications of Write-Back Cache

Posted August 13th, 2013 by Rich Pappas

Debates about the risks of write-back caching have been around for decades. Write-back caches have been implemented in many different ways in an attempt to work around the issues, but with each solution, risks remain. Write-back caching is not inherently bad; however, it should never be employed without a full understanding of the implications.

Writing before journaling

How systems deal with unexpected power failures – or temporary loss of access to the storage media – determines how reliable they are. In addition, how quickly they can boot into the operating system and be ready to service requests following recovery procedures is especially important for the modern enterprise.

In the bad old days of pre-journaled filesystems, filesystem integrity applications such as FSCK and CHKDSK had to run before a system could boot. This changed with the journaled filesystem.

Writing to a modern filesystem updates three things, and all three must stay in sync: the data blocks (user data), the metadata (such as the inode), and the volume bitmap. Unless all three are updated, you end up with an inconsistent filesystem. The inconsistency can take a number of forms, incorrect or cross-linked data being the most common. This is what journaling was supposed to help solve.

Journaled filesystems: write order is important

To get away from the frequent FSCK-on-boot requirements, filesystem engineers had to rely on “disk magic.” This is where metadata journaling came about. The journal wouldn’t contain the data itself; it would only contain the metadata and the bitmap. Data and journal would be written separately. This would help ensure that the data was written out in the correct order.

In a journaled filesystem, the three elements of user data are written separately and they are written in sequence. The disk is first presented with the data block write, then the metadata, and then updates to the bitmap occur. This is done to preserve crash consistency; if the system crashed, it could replay the transactions from the filesystem log, resulting in a consistent system.
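The ordering above can be sketched as a small simulation. This is illustrative pseudocode, not a real filesystem: the `Disk`, `journaled_write`, and `is_consistent` names are invented for the example, and "crash" is modeled simply as stopping after a given step.

```python
# Sketch of the crash-safe write order: data block, then metadata, then
# bitmap. A simulated crash after any step never leaves metadata pointing
# at data that was never written.

class Disk:
    def __init__(self):
        self.data = {}       # block number -> user data
        self.metadata = {}   # inode-like records written via the journal
        self.bitmap = set()  # allocated block numbers

def journaled_write(disk, block, payload, crash_after=None):
    """Write in order: data (1), metadata (2), bitmap (3).
    `crash_after` simulates power loss after that step."""
    disk.data[block] = payload                    # step 1: user data
    if crash_after == 1:
        return
    disk.metadata[block] = {"len": len(payload)}  # step 2: metadata
    if crash_after == 2:
        return
    disk.bitmap.add(block)                        # step 3: bitmap update

def is_consistent(disk):
    # Every block the metadata claims to exist must hold real data.
    return all(b in disk.data for b in disk.metadata)

# Crash after step 1, step 2, or not at all: the filesystem stays
# consistent. The worst case is an orphan data block, which is harmless.
for crash in (1, 2, None):
    d = Disk()
    journaled_write(d, 7, b"hello", crash_after=crash)
    assert is_consistent(d)

# Reversing the order (metadata before data) is exactly the failure mode:
d = Disk()
d.metadata[7] = {"len": 5}   # metadata first...
# ...crash here: metadata now points at a block that was never written.
assert not is_consistent(d)
```

The point of the final two lines is that no crash timing in the correctly ordered sequence can produce the state the reversed sequence produces.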

Bad things happen if you perform these writes out of order. Writing metadata before the data block, for example, could result in a filesystem that believes data should exist at a given block when in fact the system had crashed before the data block had been written, thus resulting in corrupt data being presented to applications.

The worst-case scenario in a journaled filesystem that has preserved write order is when the filesystem believes that a block of data does not exist when in fact it had been written to disk before the system crashed. This is better than believing that data exists when it does not; it doesn’t result in “lying” to the operating system or applications about what data is present, and a block of written data that isn’t referenced by the metadata or bitmap can be reclaimed as free space later on.

Preserving write order can be hard

Over time, journaled filesystems have proved their worth. They have proved to be more reliable and quicker to mount, but they are highly vulnerable to any technology that proposes to accelerate writes. Journaled filesystems are absolutely reliant on the concept that the underlying storage medium will preserve the write ordering of critical operations. Here even common technologies such as native command queuing can cause problems, requiring techniques such as “write fences” that are beyond the scope of this article.

Write-back caches exacerbate this problem. Unless a write-back cache can preserve the order of critical operations, it offers no guarantee that the filesystem’s representation of the data on the storage device is in fact the correct data stored on that device.
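To make that concrete, here is a minimal sketch of a write-back cache, with names and flush policy invented for the example. Flushing in arrival (FIFO) order preserves the filesystem's intent; flushing in some other order (here, sorted by block address, as a simple stand-in for LRU or size-based eviction) can land metadata on the device before the data it describes.

```python
# Sketch: a write-back cache whose flush policy decides whether the
# filesystem's write order survives. All names are illustrative.

class WriteBackCache:
    def __init__(self, preserve_order):
        self.pending = []   # (block, payload, kind) in arrival order
        self.preserve_order = preserve_order

    def write(self, block, payload, kind):
        self.pending.append((block, payload, kind))

    def flush(self, device):
        batch = self.pending
        if not self.preserve_order:
            # Reorder by block address, discarding arrival order.
            batch = sorted(batch, key=lambda w: w[0])
        for block, payload, kind in batch:
            device.append((block, kind))
        self.pending = []

device = []
cache = WriteBackCache(preserve_order=False)
cache.write(100, b"user data", "data")    # data block at a high address
cache.write(5, b"metadata", "metadata")   # metadata at a low address
cache.flush(device)

# With reordering, metadata (block 5) reaches the device first. A crash
# after that first device write leaves metadata without its data.
assert device[0] == (5, "metadata")
```

Both flushes eventually complete; the danger is entirely in what the device contains partway through.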

For a write optimizer (including a write-back cache) to operate safely while performing any type of optimization, it has to be filesystem-aware: its optimizations must flush user data to disk before flushing metadata. This requirement is especially important in high-availability environments, and it is increasingly an issue in large virtualized datacenters, where a single storage point services hundreds or even thousands of servers that in turn can serve millions of users.

Breaking crash consistency

Write optimizers can typically be lumped into one of three general categories:

1) One linear sweep (elevator sort)

2) Write coalescing

3) Removal of intermediate writes

All of these can break crash consistency. In one way or another, they change the ordering of writes with respect to write barriers. User data is not necessarily written before metadata, at which point we’re back to storage we cannot trust.
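The three categories can be shown in a few lines. This is a sketch over an invented queue of pending writes, not any particular implementation:

```python
# Sketch: the three optimizations applied to a queue of pending writes,
# written as (block, value). The values are labels for illustration.

pending = [(9, "data-A"), (2, "meta-A"), (9, "data-B"), (2, "meta-B")]

# 1) One linear sweep (elevator sort): reorder by block address.
elevator = sorted(pending, key=lambda w: w[0])

# 2) Write coalescing and 3) removal of intermediate writes: keep only
# the last value queued for each block.
coalesced = {}
for block, value in pending:
    coalesced[block] = value

# Both transformations discard the original ordering. After the elevator
# sort, metadata would hit the disk before its data; after coalescing,
# the intermediate states (data-A, meta-A) never reach the disk at all,
# so a crash mid-flush can expose a state the filesystem never produced.
assert elevator[0][0] == 2   # metadata now leads the sweep
assert list(coalesced.items()) == [(9, "data-B"), (2, "meta-B")]
```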

It is theoretically possible to use these techniques and preserve crash consistency, but it is very difficult, and the gains obtained are not likely to be meaningful. If you aren't worried about crash consistency, then the standard write optimization techniques can provide performance gains between 30% and 50%. If, however, reliable storage is a key concern, then personal experience and internal testing have shown that crash-consistent write optimization techniques typically deliver less than 10% improvement in write performance.

By taking the steps necessary to preserve crash consistency, write caches rarely actually reduce the write workload by any significant amount. Because of this, write caches do not accelerate write-bound systems (since the amount of data flowing to the disks is roughly the same, and the disks are the bottlenecks). Write caches only serve to delay writes in the hope that the disk will not be as busy in the future.

This can be of some benefit, assuming that your cumulative storage load is very bursty. In this case, a write-back cache can serve to buffer writes between lulls in activity. Unfortunately, during the period in which a write is delayed, the disk is out of date and may be in a non-crash-consistent state. What’s more, demand generally expands to meet capacity; although the lulls in your write patterns may be enough to empty the write-back cache today, there is no guarantee that this will continue.
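A toy queueing model makes the bursty-versus-sustained distinction concrete. The numbers and the `peak_queue` helper are invented for illustration; the device drains at a fixed rate and the cache absorbs whatever the device cannot take per tick.

```python
# Sketch: a write-back cache only helps when lulls let it drain.

def peak_queue(arrivals_per_tick, drain_per_tick):
    """Return (writes still queued at the end, peak queue depth)."""
    queued = peak = 0
    for arriving in arrivals_per_tick:
        queued = max(0, queued + arriving - drain_per_tick)
        peak = max(peak, queued)
    return queued, peak

# Bursty load: 40 writes, then a lull. The device (10/tick) catches up,
# and every queued write was at risk only until the cache emptied.
left, peak = peak_queue([40, 0, 0, 0], drain_per_tick=10)
assert left == 0   # the lull drained the cache

# Sustained load above the drain rate: the backlog only grows. The disk
# is still the bottleneck; the cache merely delayed the writes.
left, _ = peak_queue([15, 15, 15, 15], drain_per_tick=10)
assert left > 0    # backlog persists at the end of the window
```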

Solving the problem

The holy grail of write-back caching is guest OS cooperation. In virtualized environments, that cooperation must also be hypervisor-aware. The ultimate goal for host-side caching would be a hypervisor-aware, caching-aware filesystem built into the guest operating system. In this scenario, guest OSs would know that they run inside a hypervisor and that there is – or at least could be – a layer of cache attached to that hypervisor.

The guest OS needs a means of informing the layers below it (hypervisor, caching software) that it is trying to preserve consistency. It would force one set of data to be committed before another set of data is committed, passing “order of writes” information through the storage layers, allowing write eliminations to be done while still preserving consistency. This would also allow for logical writes to be split across independent devices (e.g. the cache device and the backing store) with the contextual information required to guarantee crash consistency.
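One way to picture that contract is an epoch-tagged write interface: the guest tags each write with an ordering epoch, and the layers below may reorder or coalesce freely within an epoch but must commit whole epochs in order. The `EpochCache` API below is hypothetical, a sketch of the idea rather than any real interface:

```python
# Sketch: passing "order of writes" information down the storage stack
# via epoch numbers. Hypothetical API; names are illustrative.

class EpochCache:
    def __init__(self):
        self.epochs = {}   # epoch number -> list of (block, payload)

    def write(self, epoch, block, payload):
        self.epochs.setdefault(epoch, []).append((block, payload))

    def flush(self, device):
        # Epochs commit in order. Writes inside one epoch may safely be
        # elevator-sorted or coalesced, because the guest declared them
        # order-independent relative to each other.
        for epoch in sorted(self.epochs):
            for block, payload in sorted(self.epochs[epoch]):
                device.append((epoch, block))
        self.epochs = {}

device = []
cache = EpochCache()
cache.write(1, 90, b"user data")   # epoch 1: the data blocks
cache.write(2, 3, b"metadata")     # epoch 2: metadata, must land later
cache.flush(device)

# Data committed before metadata, even though block 3 < block 90.
assert device == [(1, 90), (2, 3)]
```

This is also what makes splitting a logical write across a cache device and a backing store tractable: the epoch travels with each piece, so the consistency constraint survives the split.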

The more layers of obfuscation between the application and the physical storage, the greater the risk of losing crash consistency, the higher the difficulty of the required code around it, and the lower the performance boost that can be realized risk-free.

Please watch for our next installment on write-back caching as we continue to unfold how we address this piece of the market.


2 Responses to “The Technical Implications of Write-Back Cache”

  1. David Lethe says:

    Good thing that we have ZFS. All of this is a non-issue with the ZFS file system. Slap in a pair of mirrored SSDs as the write intent log and you can lose power in the middle of a transaction and you’re still OK.

The O/S eliminates these problems by doing a flush on write and by ensuring rewrites never overwrite data in place. It writes the new data elsewhere, then once the data has been flushed to the drive it marks the original data as free.

    Problem solved.

The downside is that any external write cache/accelerator or optimizer solution is a total waste of money on ZFS.

    • Rich Pappas says:

      Thx for the comment, David. ZFS is a very good filesystem, and one of the advances it has made is in design provisions for cache devices which it can manage directly.

ZFS is a very reasonable solution for environments where it is supported and where it is practical for ZFS to have dedicated control of the physical cache and disk resources, and it serves to illustrate the need for coordination between the filesystem and the cache devices to ensure coherency, consistency, and recoverability – the very point we were attempting to make in our blog!

Unfortunately it is not quite so simple a matter to deploy ZFS in a virtualized environment such as ESXi, particularly so if you want to leverage the efficiency of shared resources. Sure, you could try to deploy ZFS to every guest instance, and you could allocate FLASH resources to each guest for its write intent log, but is this practical at scale? Another ZFS solution would be to use a central, shared storage resource that was ZFS based, which may also be practical for some users (although many environments have a lot of internal resistance to changes in storage infrastructure). One downside of this approach would be the latencies and protocol overhead associated with accessing “off box” FLASH vs a host-based FLASH resource.

Virtualization has taken over the datacenter; more servers are virtualized than run on bare metal. Oracle has made announcements this week regarding the availability of their virtualization server and we assume it will use ZFS, however VMware is unquestionably the market leader in virtualization today. ZFS is absolutely a step in the right direction, and part of what we are advocating – ideally this type of file system would be natively supported by all major operating systems. That said, we firmly believe that in virtualized environments there are additional benefits above what ZFS can deliver from a file system being both cache-device aware as well as hypervisor aware – the goal is to be able to leverage shared resources and optimize performance while maintaining coherency, consistency, and recoverability. It is easy to imagine ZFS as part of such a solution!
