Multi-log structured storage layer

I recently had to build an SSD garbage collector, which proved to be an interesting and fun exercise. SSDs, or Solid State Drives, are storage devices that promise low-latency operations compared to mechanical hard drives, which are prone to seek / rotational latency. While reads on SSDs are low latency, writes are a different story. Flash media (the basis of most SSDs) requires “erasing” a block (typically 128 KB) of flash before anything can be written to it, and block erasure takes a long time compared to actually reading or writing. Most SSDs have firmware which pre-erases a pool of blocks so that write latency is minimized. Also, since a block cannot be erased partially, the valid content on a partially used block needs to be re-written elsewhere before the block can be erased and its space made available for newer content. SSD firmware is, hence, also responsible for this “garbage collection” process. A good starting point for further reading on SSD garbage collection is Wikipedia: Write Amplification.

Note that garbage collection can be a processor-heavy operation given the data movement involved. I recently designed and implemented a multi-log structured storage layer which I ran on top of an SSD to relieve it of the garbage collection task (which it was performing poorly). However, such a multi-log storage layer is very versatile and has a number of other applications.

A multi-log storage layer is essentially a storage layer in which content is laid out in multiple contiguous logs. While I kept the metadata for the content on the SSD in RAM, it is possible to tweak the implementation to keep metadata in the log as well. The primary reason to divide the content into multiple logs is to keep garbage collection efficient (not re-writing too much data) when the storage space is close to full. However, similar principles could also be used for garbage collection in RAM (in garbage-collected language runtimes such as Java). A multi-log storage layout can also be used over hard drives to get good sequential write performance, though reads would need to be supported by a layer of caching to avoid too many seeks.

Multi-log Storage Layer Design

First, the storage layer (whether backed by a file, a device, or even RAM) needs to be divided into multiple fixed-size contiguous logs. These logs should be large enough that IO writes of that size can be done at near peak write bandwidth. At the same time, the log size should be small enough that we see a _large_ variance in the amount of garbage across logs. This variance in garbage will allow us to select logs with more garbage for garbage collection. It is advisable to use the smallest log size at which we can achieve near peak read/write bandwidth. The storage layer will mostly write full logs at a time. If the durability requirements of the data are relaxed (such as in a cache) then one can do away with forced flushes of data and flush logs _only_ when a full log’s worth of content is available. Reads may require random access across logs, but that should be fine because random access is essentially free on an SSD and, given the log sizing, we shouldn’t be IOPS bound.
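
To make this concrete, here is a minimal C++ sketch of the log geometry, assuming (hypothetically) a 1 MiB log size and a device exposed as a plain file descriptor; the names kLogSize, SsdLog and carve_into_logs are illustrative rather than taken from the actual implementation.

```cpp
#include <cstdint>
#include <vector>

constexpr uint64_t kLogSize = 1ull << 20;  // assumed: ~1 MiB, large enough for near-peak bandwidth

struct SsdLog {
    int      fd;             // device / file backing this log
    uint64_t device_offset;  // byte offset of this log within the device
    uint64_t garbage_size;   // grows as blocks in this log are deleted
};

// Carve a device (or file) of device_bytes into fixed-size contiguous logs.
std::vector<SsdLog> carve_into_logs(int fd, uint64_t device_bytes) {
    std::vector<SsdLog> logs;
    for (uint64_t off = 0; off + kLogSize <= device_bytes; off += kLogSize)
        logs.push_back(SsdLog{fd, off, 0});
    return logs;
}
```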

The Write Queues

An incoming write is first kept in a RAM write queue. If the write queue is built from blocks of RAM of the same log size, the write-queue RAM logs can be read in the same manner as the regular logs when serving content. Otherwise, reading content that is still in the write queue would need to be coded separately.
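
One way to get that uniform read path is to give RAM logs and SSD logs the same read interface. The sketch below assumes a POSIX pread()-style backing file; the Log layout shown here is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unistd.h>

struct Log {
    // RAM log: memory != nullptr and the bytes still live in the write queue.
    // SSD log: memory == nullptr and the bytes are read back with pread().
    char*    memory = nullptr;
    int      fd = -1;
    uint64_t device_offset = 0;

    bool read(uint64_t offset, size_t size, char* out) const {
        if (memory != nullptr) {                       // content still in RAM
            std::memcpy(out, memory + offset, size);
            return true;
        }
        return pread(fd, out, size, device_offset + offset) ==
               static_cast<ssize_t>(size);
    }
};
```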

It should be possible to enable concurrent writes into the write queue by keeping the allocation of content blocks separate from the actual write (memory copy). The allocation of content blocks would require synchronized access to (only) a current_RAM_log and a current_RAM_log_offset. Once the allocation has been performed, multiple writes across multiple RAM logs at different offsets can happen concurrently. As the writes complete, they update a bytes_committed variable present in every RAM log (under a lock or using CAS). Once bytes_committed equals the log size, the RAM log is ready to be committed to media.
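
Below is a minimal sketch of this two-step write path. The RamLog, Extent and WriteQueue names are hypothetical, and the sketch assumes the allocation fits the current RAM log; the split across two logs is covered under block handles below.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>
#include <mutex>

constexpr uint64_t kLogSize = 1ull << 20;   // assumed log size

struct RamLog {
    char*                 memory;
    std::atomic<uint64_t> bytes_committed{0};
};

struct Extent { RamLog* log; uint64_t offset; uint64_t size; };

class WriteQueue {
    std::mutex mu_;                 // the storage-layer mutex from the text
    RamLog*    current_log_;
    uint64_t   current_offset_ = 0;

public:
    explicit WriteQueue(RamLog* first) : current_log_(first) {}

    // Step 1: allocate space; cheap and serialized under the mutex.
    Extent allocate(uint64_t size) {
        std::lock_guard<std::mutex> lk(mu_);
        Extent e{current_log_, current_offset_, size};
        current_offset_ += size;
        return e;
    }

    // Step 2: copy the payload; copies into different extents run concurrently.
    static void fill(const Extent& e, const char* data) {
        std::memcpy(e.log->memory + e.offset, data, e.size);
        uint64_t done = e.log->bytes_committed.fetch_add(e.size) + e.size;
        if (done == kLogSize) {
            // RAM log is full: hand it to the to-be-committed list (next step)
        }
    }
};
```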

A (synchronized) list of to-be-committed RAM logs is maintained, and a task is spawned as soon as there are to-be-committed RAM logs, to commit them to media. Care should be taken to not have more than one log write in flight on any device at any point in time. This avoids presenting the device firmware with multiple simultaneous writes that it would interleave as random writes.
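
A sketch of that commit task, assuming one dedicated committer thread per device so that only one log-sized write is ever in flight; write_log_to_ssd() is a hypothetical stand-in for a single pwrite() of a full log.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>

struct RamLog;                              // as sketched above
void write_log_to_ssd(const RamLog& log);   // hypothetical: one pwrite() of a full log

class Committer {
    std::mutex              mu_;
    std::condition_variable cv_;
    std::deque<RamLog*>     to_commit_;     // the to-be-committed list

public:
    void enqueue(RamLog* log) {
        std::lock_guard<std::mutex> lk(mu_);
        to_commit_.push_back(log);
        cv_.notify_one();
    }

    // Run by exactly one thread per device, so the device never sees more
    // than one log-sized write in flight at a time.
    void run() {
        for (;;) {
            RamLog* log;
            {
                std::unique_lock<std::mutex> lk(mu_);
                cv_.wait(lk, [this] { return !to_commit_.empty(); });
                log = to_commit_.front();
                to_commit_.pop_front();
            }
            write_log_to_ssd(*log);          // single outstanding log write
            // afterwards: recycle the RAM log and publish the new SSD log
        }
    }
};
```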

Content Block handles

A content block allocation also involves the creation of a block handle that is given back to the higher software layers. This block handle will be used by the higher layers to read the block (possibly multiple times) and then delete it. Writes to the same block shouldn’t be supported, as that would present the media firmware with random writes and break the one-log-sized-write-at-a-time rule. Instead, writes should always go through new block creation.

A block handle encapsulates an ordered pair of (log, offset [into the log], size [after the offset]) tuples. In most cases, only the first tuple is sufficient for the allocation request. However, when the current_RAM_log in the write queue doesn’t have enough space for the allocation request, the second tuple denotes the rest of the allocation. This also means that an allocation request cannot be larger than the log size, though an aggregate data structure on top of this tuple pair could be built to address that. The tuple pair also comes in handy during garbage collection when the content block is re-written to the media. Using a tuple pair puts a cap on the maximum size of data movement at any given point during garbage collection. Thus, large blocks (built using an aggregate data structure at a higher layer) would be moved piecemeal during garbage collection, improving garbage collection efficiency. The other advantage is the simplicity of using just a tuple pair, which avoids code complexity and the allocation/manipulation of list data structures.
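
As an illustration, the handle can be as small as the following (the field names are mine, not from the original implementation):

```cpp
#include <cstdint>

struct Log;                    // a RAM or SSD log, as sketched earlier

struct Extent {
    Log*     log    = nullptr;
    uint64_t offset = 0;       // offset into the log
    uint64_t size   = 0;       // bytes after that offset
};

struct BlockHandle {
    Extent first;
    Extent second;             // second.log stays nullptr when the block fits one log

    bool     is_split() const { return second.log != nullptr; }
    uint64_t size()     const { return first.size + second.size; }
};
```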

Content Block Deletion – Garbage Accounting/Collection

When content blocks are deleted, a (protected) garbage_size variable on the log is updated to reflect the new, larger garbage size. Note that a content block deletion can cause up to two garbage size updates, on the up to two logs it points to. A max-heap of the logs (ordered w.r.t. garbage size) is maintained and updated upon block deletion. The top of the heap (max garbage size) is the most eligible log for garbage collection. Note that this won’t give us SSD wear-leveling, but 1) we can depend on the device firmware to do that, and 2) it can be approximated at the max-heap by designing a metric that combines the number of writes with the garbage size.
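
A sketch of the accounting on delete is below. Since a log’s garbage size changes on every delete, the sketch substitutes an ordered set keyed by garbage size for the max-heap (both yield the log with the most garbage efficiently); GarbageIndex, add_garbage and most_garbage are hypothetical names.

```cpp
#include <cstdint>
#include <mutex>
#include <set>
#include <utility>

struct Log { uint64_t garbage_size = 0; /* RW lock, valid-block set, ... */ };
struct Extent { Log* log; uint64_t offset; uint64_t size; };
struct BlockHandle { Extent first, second; };

struct GarbageIndex {
    std::mutex mu_;   // folded into the storage-layer mutex in the text
    // ordered by (garbage_size, pointer); the last element is the GC candidate
    std::set<std::pair<uint64_t, Log*>> by_garbage_;

    void add_garbage(Log* log, uint64_t bytes) {
        std::lock_guard<std::mutex> lk(mu_);
        by_garbage_.erase({log->garbage_size, log});
        log->garbage_size += bytes;
        by_garbage_.insert({log->garbage_size, log});
    }

    Log* most_garbage() {
        std::lock_guard<std::mutex> lk(mu_);
        return by_garbage_.empty() ? nullptr : by_garbage_.rbegin()->second;
    }
};

// A delete touches at most the two logs named in the handle.
void delete_block(GarbageIndex& gi, const BlockHandle& h) {
    gi.add_garbage(h.first.log, h.first.size);
    if (h.second.log) gi.add_garbage(h.second.log, h.second.size);
}
```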

Garbage collection can be triggered whenever a (small) reserved pool of empty SSD logs falls below its threshold. The RAM logs flush their content into SSD logs taken from this pool, which is what drains it and triggers garbage collection. The garbage collection task picks a log from the max-heap outlined above and starts re-writing the valid blocks present in it back to the storage layer. It can use the same read and write APIs that clients use to access the storage layer. Once all the valid blocks in a log have been re-written, the log can be given back to the reserved pool of empty SSD logs.
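
Here is a rough sketch of that trigger and loop. Every function named in it (free_ssd_log_count, pick_most_garbage_log, valid_blocks, rewrite_block, recycle_log) is hypothetical glue standing in for the structures described above, and the threshold is an assumed value.

```cpp
#include <cstddef>
#include <vector>

struct Log;
struct BlockHandle;

size_t free_ssd_log_count();                      // size of the reserved pool
Log*   pick_most_garbage_log();                   // top of the max-heap
std::vector<BlockHandle*> valid_blocks(Log*);     // snapshot of the log's live blocks
void   rewrite_block(BlockHandle*);               // read + re-write via the normal APIs
void   recycle_log(Log*);                         // return the log to the reserved pool

constexpr size_t kReserveThreshold = 4;           // assumed low-water mark for the pool

void maybe_collect() {
    while (free_ssd_log_count() < kReserveThreshold) {
        Log* victim = pick_most_garbage_log();
        if (!victim) return;
        for (BlockHandle* b : valid_blocks(victim))
            rewrite_block(b);        // uses the GC-reserved RAM logs (see below)
        recycle_log(victim);         // the log is empty again
    }
}
```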

Note that client writes depend on space in the RAM-log-based write queue, which in turn depends on space in the reserved pool of empty SSD logs, which in turn depends on garbage collection. If garbage collection uses the same write APIs as the client, it closes a dependency cycle back to the RAM-log-based write queue. To prevent deadlock, the write API reserves a few RAM logs for GC-induced writes and never uses those logs for client writes; this breaks the cycle.
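
A small sketch of that reservation, assuming a hypothetical RamLogPool with a take(for_gc) accessor; the reserve size is an assumed constant.

```cpp
#include <cstddef>
#include <deque>
#include <mutex>

struct RamLog;    // as sketched earlier

class RamLogPool {
    std::mutex          mu_;
    std::deque<RamLog*> free_;
    static constexpr std::size_t kGcReserve = 2;   // assumed: RAM logs held back for GC

public:
    // Client writes must leave kGcReserve logs untouched; GC-induced writes
    // may dip into them, so garbage collection can always make progress.
    RamLog* take(bool for_gc) {
        std::lock_guard<std::mutex> lk(mu_);
        if (free_.empty()) return nullptr;
        if (!for_gc && free_.size() <= kGcReserve) return nullptr;  // client must wait
        RamLog* log = free_.front();
        free_.pop_front();
        return log;
    }

    void give_back(RamLog* log) {
        std::lock_guard<std::mutex> lk(mu_);
        free_.push_back(log);
    }
};
```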

Data structures and Locking

Here’s an outline of the data structures described above:

Storage Layer:

    • Storage Layer mutex (protects all the data structures below in the Storage Layer)
    • Write queue of RAM logs, current_RAM_log, current_RAM_log_offset
    • To-be-committed list of RAM logs
    • Reserved pool of SSD logs
    • Max heap of logs w.r.t. garbage size

Log:

    • Log RW lock
    • Unordered Set of valid content blocks
    • Garbage Size
    • (For RAM logs only)
      • Pointer to memory
      • bytes_committed (used in write API for supporting concurrent writes)
    • (For SSD logs only)
      • SSD device / file
      • SSD offset into device / file

Content Block:

    • pair of (log, offset, size) tuples

The Storage Layer mutex is taken to protect the various free lists, the write queues, and the allocation log and offset. The log’s RW lock protects the log’s metadata (garbage size, set of content blocks). Note that the actual content doesn’t need any locking, as it is immutable once committed. During a read, we take read locks on the (up to) pair of logs referenced by the content block. These read locks protect against garbage collection of those logs (which would take write locks on them to change their garbage size and/or set of valid content blocks). Note that the locks taken while reading should honor a lock ordering to avoid deadlocks; a simple scheme is to just use the log data structure’s address as the lock order. Technically, we could allow for more concurrency by allowing append modifications to the set of valid content blocks in RAM logs, or by allowing removal of content blocks and increases of garbage size when the log is not selected for garbage collection. However, the above locking scheme works well in practice and I could see no surprising bottlenecks caused by lock contention.
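
To illustrate the read path’s lock ordering, here is a hedged sketch assuming each log carries a std::shared_mutex and that the log’s address is used as the lock rank, as described above; read_extent() stands in for the RAM memcpy or pread() read shown earlier.

```cpp
#include <cstdint>
#include <shared_mutex>
#include <utility>

struct Log { std::shared_mutex rw; /* garbage size, valid-block set, ... */ };
struct Extent { Log* log = nullptr; uint64_t offset = 0; uint64_t size = 0; };
struct BlockHandle { Extent first, second; };

bool read_extent(const Extent& e, char* out);   // hypothetical: RAM memcpy or pread()

bool read_block(const BlockHandle& h, char* out) {
    Log* a = h.first.log;
    Log* b = h.second.log;                      // may be nullptr
    if (b && b < a) std::swap(a, b);            // lock order: lower address first

    std::shared_lock<std::shared_mutex> la(a->rw);
    std::shared_lock<std::shared_mutex> lb;     // stays empty unless the block spans two logs
    if (b && b != a) lb = std::shared_lock<std::shared_mutex>(b->rw);

    // Holding the read locks keeps GC from rewriting these logs under us.
    bool ok = read_extent(h.first, out);
    if (ok && h.second.log) ok = read_extent(h.second, out + h.first.size);
    return ok;
}
```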
