Skip to content

mv overview

matthewvon edited this page Dec 5, 2014 · 41 revisions

Summary

This wiki entry contains notes concerning internal structures and functions within Google's leveldb. The notes are not authoritative. They are based upon impressions and research.

Key space

leveldb is by definition a key/value store. It has to track every key in a manner that allows any key to be retrieved at any time in the future. leveldb access a requested key via a logical search tree. This search tree has four layers.

The first layer is the "manifest". Every table file that contains key/values has an entry in the manifest. The manifest entry tracks the first and last key contained in each table file. The manifest keeps the table file entries in one of seven sorted arrays. Each of the seven arrays represents one "level" of table files. A user request for a key causes leveldb to check each table file, by level, that overlaps the target key. leveldb searches each potential table file, level by level, until the first yields an exact match for requested key.

The second layer of the search tree is a table file's block index of keys. The block index is part of the table file's metadata which is located at the end of the physical data file. The index contains one entry for each logical data block within the table file. The entry represents the last key in the block. leveldb performs a binary search to locate a candidate data block. It reads the candidate data block from the file.

The third layer is the restart index that is part of each data block. The restart index is located at the end of the data block. The restart index contains a subset of keys contained in the block. Each key in the subset includes an offset for where it is located within the data block. leveldb performs a binary search of the restart index to locate the nearest key to the user's requested key.

The fourth layer is simply a list of key/values in sorted order within a data block. leveldb now performs a linear search of the key/values to see if the user's requested key exists.

leveldb contains several optimizations to enhance its key retrieval performance. There is a table cache (also called the file cache in some places). All leveldb access to the second layer of the search tree goes through the table cache. The table cache attempts to keep the most popular table files open and their block index available. Similarly, there is a block cache. The block cache keeps recently read blocks in memory. Finally, each table file also contains a bloom filter in its metadata. The bloom filter is kept in the table cache along with the block index. The bloom filter is quickly checked before the second layer search to see if there is any possibility of the key actually being present within the table file.

The bloom filter is very good at eliminating unnecessary reads into a table file. Google's original table file implementation had a 1% false positive rate. Basho's bloom filter has a 0.04% false positive rate. A false positive is when the bloom filter suggests that the key might be present in the table file, but a complete search finds that it is not present.

Limitation

Note that the entire manifest structure must fit into memory. leveldb will fail if there are more manifest entries than available memory.

Updating the key space

leveldb makes changes to the key space in batches. Seldom will a user's single write of a key/value make a direct change to the key space. A user's individual write first posts to a write buffer. Google's write buffer has two components. The first component is a "skip list". A skip list is an alternative to a binary tree structure. It keeps key/values in sorted order within memory. The second component is a "recovery log" file. The recovery log contains a copy of the user's written key/value. leveldb uses the recovery log to repopulate a write buffer if the program terminates for any reason before the write buffer is converted to a table file.

skip lists paper

leveldb continues to add new user key/value writes to the write buffer until the total size of the key/values exceeds the write_buffer_size parameter. The user gives the write_buffer_size in the Options structure as part of a database Open operation. Once the total size exceeds write_buffer_size, leveldb performs a quick operation to shift the current write buffer to a secondary slot and create a new write buffer. The secondary slot for the old write buffer is called "imm" in the leveldb code (imm stand for immutable memory). leveldb creates a background task to convert the imm to a table file in level 0.

Note: leveldb will stall if it needs to convert the current write buffer to the imm, but the imm is still populated with the previous write buffer. The stall only ends once the background task successfully converts the previous imm to a level 0 table file, making the imm available to the current write buffer.

leveldb now posts the new level 0 table file to the manifest via a "VersionEdit" object. The post operation adds the new level 0 table file to the key space.

The leveldb manifest is therefore subject to change over time. leveldb has the concept of manifest "versions" (i.e. key space versions). A given user operation may start while one manifest version exists and conclude after one or more manifest versions have transpired. Each user operation must process completely against a single version of the manifest. Therefore leveldb maintains multiple manifest versions as necessary to give each user operation a stable view of the database. leveldb can therefore easily support "snapshots" so that the user can perform multiple operations against a single version of the dataset. leveldb's "VersionSet" object manages access to the current manifest version and tracks any older versions necessary for snapshots and pending user operations.

Clone this wiki locally