
Commit Graph

Dolt's unique storage engine implements a Git-style commit graph of Prolly Trees. Dolt's commit graph facilitates common version control operations like log, diff, branch and merge on database tables instead of files.

Dolt commit graph

Git vs Dolt

Git and Dolt share the same version control conceptual underpinnings. In Git and Dolt, the commit graph concepts are the same. In Git and Dolt, the commands to modify the commit graph are the same. The only difference is what Git and Dolt version. Git versions files. Dolt versions tables. Thus, if you know how the Git commit graph works, you know how the Dolt commit graph works. If not, read on.

What is a Commit?

A commit is a marker in your version history that stores all the relevant information for recreating that version. A commit is identified by a unique commit hash that looks something like 9shmcqu3q4o6ke8807pedlad2cfakvl7.

A commit contains two sets of information: the content and the metadata.

In Git, the content is the set of files as they existed at that point in time, identified by a content address. In Dolt, the content is the set of tables in the database at that point in time, identified by a content address. In Dolt, content addresses are created using a novel data structure called a Prolly Tree that allows for structural sharing, efficient diff, and fast querying of table data.

Additionally, commit metadata like author, date, and message are stored so it is easier to identify the commit you are looking for in the version history. This metadata is considered when creating the content address that you see in the commit log. So, even if two commits have the exact same content but are committed at different times or by different authors, they will have different commit hashes.

Why put Commits in a Graph?

Putting commits in a graph allows for a representation of history, branches, and merges: core concepts of version control. A branch allows for multiple evolving histories. A merge allows two disparate histories to be combined.

How to Build a Commit Graph

The easiest way to understand the commit graph is to build one. Let's build a simple commit graph from scratch.

The Init Commit

To create a commit graph you must "initialize" one. Initialization can be done with the init command via the command line. This creates an "init commit". In Dolt, the create database statement also creates an init commit if you are in the SQL context.

The init command creates a commit with metadata taken from the environment and empty contents.

The init commit is made on the default branch, usually named main. A branch is a pointer to a commit. The tip of a branch has a special name or reference called HEAD.

WORKING and STAGED

In Git and Dolt, at the HEAD of a branch there are two additional special references, called STAGED and WORKING. These references point to active changes you are making to the HEAD of the branch. If there are no changes, the contents of HEAD, STAGED, and WORKING are the same.

When you make changes to the content of a branch, changes are made in WORKING. Changes to the content of WORKING change its content address.

When you are ready to make a commit, you stage the changes using the add command. If you stage all your changes, STAGED and WORKING point to the same content and thus share the same content address.

STAGED and WORKING allow for changes to content to be tested and verified before being stored permanently in the commit graph.

WORKING is often called the working set. An interesting way to think about the working set is that traditional file systems that don't use Git only have a working set. Traditional databases like MySQL or Postgres only have a working set. If you create a database in Dolt and only run traditional SQL, your working set will look and act exactly like a MySQL database.

History

Commits are created using the aptly named commit command. When you commit your STAGED changes, the content that is staged is moved to the tip of the branch and you have an opportunity to add metadata like message and author. The HEAD of the branch main becomes this newly created commit.

Commits have zero to many parents. The init commit has zero parents. A normal commit has one parent, representing the previous commit metadata and content. A merge commit, which we'll discuss later, has many parents.

Parents allow for the history of branches to be computed by walking the branch from its HEAD. This is commonly called the commit log and generated using the log command. For instance, in our pictured example, using the log command on the main branch here would list commits h512kl and t1ms3n.
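
To make the parent walk concrete, here is a hedged Go sketch of how a simple commit log could be produced from a branch HEAD. The types and names are illustrative stand-ins, not Dolt's actual implementation, and the walk follows only first parents for simplicity.

```go
package commitgraph

// Commit is a hypothetical, simplified commit type. Dolt's real commits live
// in its storage layer; this sketch only illustrates the parent walk behind log.
type Commit struct {
	Hash    string
	Parents []*Commit // zero for the init commit, two or more for merge commits
	Message string
}

// Log walks first-parent links from a branch HEAD, newest commit first.
func Log(head *Commit) []*Commit {
	var history []*Commit
	for c := head; c != nil; {
		history = append(history, c)
		if len(c.Parents) == 0 {
			break // reached the init commit
		}
		c = c.Parents[0]
	}
	return history
}
```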

Branches

Up to this point, we are dealing only with linear history. If there is only one editor making serial changes, the commit graph will look like a long line of commits. A linear commit graph is still a graph, but not a very interesting graph.

Branches allow for non-linear history, a fork in the commit graph. Branches are often used to isolate multiple users' changes. Two users can make changes to content without worrying about what the other is changing. This capability is quite powerful. We've all worked on a shared document where people stomp on each other's changes. Branches prevent stomping.

Branches are created using the branch command. When a branch is created, its HEAD points at a specified commit, usually the HEAD commit of the branch you are currently using.

Now, using the same process above we can make a commit on the branch. The HEAD of the new branch now points at this new commit.

In parallel, we can make a commit on main.

The two branches now contain different contents and share a common ancestor. As you can see, parallel, isolated evolving histories are now possible using branches.

Merges

Merges are performed using the merge command. Merges allow you to join separate histories that exist on branches. Merges create a commit with multiple parents.

Merges are performed by finding the common ancestor commit and applying the changes from other branches in the merge to the current branch. Merge functionality requires the ability to quickly find the differences between the contents of two commits.

After merging, it is common to delete the branch that was merged, signaling that the change intended on the branch is complete.

Merges can generate conflicts. If two branches modify the same value, Git and Dolt notify the user. The user has the opportunity to resolve the conflicts as part of the merge.

Merges allow for collaboration among multiple users. Usually prior to merge, changes are reviewed by observing the computed differences between the branch you are merging from and the branch you are merging to. If the changes pass review, the merge is executed.


Storage Engine

Dolt is the world's first version controlled SQL database. How would you build a storage engine for such a thing?

Dolt's storage engine is heavily influenced by and shares code with Noms. We here at DoltHub have immense respect for the Noms team's pioneering work, without which Dolt would not exist.

Motivation: Database Version Control

Dolt's storage engine is motivated by the desire to add Git-style version control to databases. Both Noms and Dolt share this vision. Noms attempts to achieve the vision on a generic, document-like database while Dolt restricts the vision to an Online Transaction Processing (OLTP) SQL database. Currently, Noms is not under active development while Dolt is.

Git-style version control on a database provides a number of useful features including:

  1. Instantaneous rollback to any previous state

  2. A full, queryable audit log of all data back to inception

  3. Multiple evolving branches of data

  4. The ability to merge data branches

  5. Fast synchronization with remote versions for backup or decentralized collaboration

  6. Queryable differences (ie. diffs) between versions

Requirements

As noted in an earlier requirements exercise for Dolt itself, a storage engine for a SQL database with Git-style versioning would provide:

  1. Tables

A SQL database is a collection of tables. Tables have rows and columns. The columns have schema: names, types, and constraints. Columns can have indexes to improve query performance. Advanced database features like views, stored procedures, and triggers would also be useful. Tables must be able to be accessed via standard SQL select, insert, update, and delete queries.

  2. Performance at Scale

A modern relational database must be performant at large scale. In particular, sub-linear worst case scaling on seek (ie. select) is a must.

  3. Fast diff

Comparing and producing the differences between two versions must be fast. Displaying diffs to a user is a common operation. Querying diffs must also be supported and should be as fast as a regular select query. Moreover, merge relies on diffs so fast diff is essential for merge operations at scale.

  4. Structural Sharing

To offer version control in a database, all versions of the data must be stored. Storing a full copy of the database at every change is untenable. Data that does not change between versions must share storage.

Prolly Trees + Commit Graph = Dolt Storage Engine

Dolt's storage engine is built on a Git-style commit graph of Prolly Trees. Table schema and data are stored in Prolly Trees. The roots of those Prolly Trees, along with other metadata, are stored in a commit graph to provide Git-style version control. We'll start by explaining Prolly Trees and then show how they fit into a commit graph.

The Database Backbone: Search Trees

Databases are built on top of Search Trees, a class of data structure. Most databases are built on B-trees.

Dolt is built on a novel Search Tree, closely related to a B-tree, called a Probabilistic B-Tree, or Prolly Tree for short. As far as we can tell, Prolly Trees were invented by the Noms team specifically for database version control, and they also coined the term.

B-Trees

Most SQL databases you are familiar with, like Postgres or MySQL, are built on B-tree storage. Tables are represented as a map of primary keys to values and the keys of that map are stored in a B-tree. Values are stored in the leaf nodes.

B-tree storage allows for fast seek (ie. select) with reasonable write performance (ie. insert, update, delete). B-trees have proven to be the ideal data structure to back databases.

However, finding the differences between two B-trees requires scanning both trees and comparing all values, an operation that scales with the size of the tree. This operation is slow for large trees.

Also, writes to B-trees are not history independent: the order of the writes changes the internal structure of the tree. Thus, storage cannot be easily shared between two versions of the same tree.

Prolly Trees

A Prolly Tree, or Probabilistic B-tree, is a content-addressed B-tree.

Similarly to B-trees, a map of primary keys to values is used to represent tables. Keys form the intermediary nodes and values are stored in the leaf nodes. Because of their similarity to B-trees, Prolly Trees approximate B-tree performance for basic SQL database read and write operations like select, insert, update, and delete. Content-addressing allows for fast comparison (ie. diff) and structural sharing of versions.

In Prolly Trees, each internal node in the B-tree is labelled with an immutable, history independent hash or content address built from the node's contents. This allows fast comparison of two Prolly Trees. One simply needs to look at the root hash of each tree to tell whether the two trees are different. Walking the tree following paths with nodes of different values quickly identifies differences in the tree. Thus, calculating the difference between two Prolly Trees scales with the size of the differences, not the size of the tree.

Moreover, sections of the tree that share the same root hash can share storage between versions because the contents of the trees are necessarily the same. When laid out in a commit graph, Prolly Trees share storage if the root hash of the tree or any sub-tree is the same in each version.

Prolly Trees are described in more detail in the Prolly Trees section below.

Comparison

The Noms documentation provides the following useful algorithmic, big O() comparison of B-trees and Prolly Trees:

| Operation | B-Trees | Prolly Trees |
| --- | --- | --- |
| 1 Random Read | logk(n) | logk(n) |
| 1 Random Write | logk(n) | (1+k/w)*logk(n) |
| Ordered scan of one item with size z | z/k | z/k |
| Calculate diff of size d | n | d |
| Structural sharing | ❌ | ✅ |

n: total leaf data in tree, k: average block size, w: window width

As you can see a Prolly Tree approximates B-tree performance on reads and writes while also offering the ability to compute differences in time proportional to the size of differences rather than the total size of the data.

Commit Graph

Effectively, Prolly Tree storage laid out in a Commit Graph, or Merkle DAG of versions, allows for Git-style version control on a SQL database. Dolt, the world's first version controlled SQL database, is working proof of the effectiveness of combining a Commit Graph with Prolly Trees.

How do we turn the table data storage described above into a SQL database? We must make all the contents of the database Prolly Trees and hash the roots together. The data in tables is by far the most complex tree. But in Dolt, it's Prolly Trees all the way down.

Table data is stored in a Prolly Tree as described above: a map of primary keys to the values in the remaining columns. Thus, each table version has an immutable hash of the data in the table. In SQL, tables also have schema. Table schema is stored in a much simpler Prolly Tree and given a content address. The combination of the roots of the schema and data Prolly Trees forms the content address of the table.

The content addresses of the tables in the database are then hashed to form the content address of the database.
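
To make this composition concrete, here is a hedged Go sketch of how content addresses could roll up from Prolly Tree roots to a table address and then to a database address. The hashing and layout details are illustrative choices, not Dolt's exact on-disk format.

```go
package storage

import "crypto/sha512"

type addr [20]byte

// hashOf hashes the given byte slices together and truncates to 20 bytes,
// mirroring the content addressing used throughout this document.
func hashOf(parts ...[]byte) addr {
	h := sha512.New()
	for _, p := range parts {
		h.Write(p)
	}
	var a addr
	copy(a[:], h.Sum(nil)[:20])
	return a
}

// tableAddress combines the roots of a table's schema and data Prolly Trees.
func tableAddress(schemaRoot, dataRoot addr) addr {
	return hashOf(schemaRoot[:], dataRoot[:])
}

// databaseAddress combines the content addresses of all tables in the database.
func databaseAddress(tableAddrs []addr) addr {
	parts := make([][]byte, 0, len(tableAddrs))
	for i := range tableAddrs {
		parts = append(parts, tableAddrs[i][:])
	}
	return hashOf(parts...)
}
```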

Finally, we lay these database roots out in a commit graph of database versions. Any subtree whose root has the same content address across versions is only stored once, accomplishing structural sharing.

And, thus, you end up with Dolt's storage engine.

Read more about the Dolt commit graph in the Commit Graph section above.



    Block Store

Storing information is the bedrock of every database in existence. The logical structures by which you store that information are where a lot of database discussion occurs. In Dolt's case, Prolly Trees allow for some of its key properties like structural sharing and fast diffs. For many databases, Dolt included, the abstraction of how those data structures are written to disk is somewhat secondary. Dolt uses a custom content-addressed block store to store data on disk.

    Core Access Patterns

    In our Prolly Tree documentation, you'll find a lot of pictures that look like this.

    Prolly Tree

    It is important to remember that this is a representation of a structured object. The green entity is a list of addresses which point to other entities. All of these entities have addresses, such as t1ms3n, and the values are byte arrays. In Dolt, we call this byte array value a "Chunk". Every piece of information in a Dolt database is stored as Chunks.

    This is such a fundamental truth, it's worth repeating in bold: Every piece of information in a Dolt database is stored as Chunks.

    Chunks are byte buffers. Their address is the SHA512 checksum of the byte buffer truncated to 20 bytes. Content addressable systems have a few characteristics which make them interesting and useful:

    1. Strict Hierarchy of Information. For chunks which refer to other chunks, it's impossible to create chunks out of order because you can't refer to a chunk without its address. It's impossible to have cycles in your data.

    2. Chunk corruption can be detected. If you read a chunk from disk and its checksum doesn't match you know that something is wrong.

    3. Idempotent Persistence. If something on the write path fails and you need to write it again, there is no risk of corruption given (1) and (2).

Merkle Trees and DAGs depend on content addressed storage, and Dolt embraces them fully.
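
A minimal Go sketch of that addressing scheme, assuming only what is stated above (the SHA512 checksum of the bytes, truncated to 20 bytes); the base32 rendering is an illustrative choice for display, not a statement about Dolt's exact encoding.

```go
package main

import (
	"crypto/sha512"
	"encoding/base32"
	"fmt"
)

// chunkAddress computes a Chunk's address: the SHA512 checksum of its bytes,
// truncated to 20 bytes.
func chunkAddress(chunk []byte) [20]byte {
	sum := sha512.Sum512(chunk)
	var addr [20]byte
	copy(addr[:], sum[:20])
	return addr
}

func main() {
	addr := chunkAddress([]byte("some chunk bytes"))
	// Print a compact textual form of the 20-byte address.
	fmt.Println(base32.StdEncoding.EncodeToString(addr[:]))
}
```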

    Given that all Dolt data is stored as Chunks, we want to access these chunks by two primary operations:

    1. Contains: We need a fast way to determine if a given storage file has the chunk in question. This comes up when we have multiple files and we don't actually know where the given chunk is. We need to be able to do this without accessing the disk excessively.

    2. Retrieve: Get the data out.

    Both of these patterns have batch equivalents.

    By keeping this level of simplicity, the storage system doesn't need to know anything about the contents of the Chunks. The serialization and deserialization of application data is performed at a layer above the storage system.

    Noms Block Store

Noms is foundational to Dolt's existence. The Noms team started out using a key-value store to store Chunks, but it proved to be suboptimal. So, they developed the Noms Block Store (NBS). The NBS file format lives on as the only way to store Dolt data currently.

    NBS stores Chunks in files which have the following structure:

    1. Footer: Fixed size end of the file which tells us primarily how many Chunks there are stored in the entire file.

    2. Index: A deterministically sized (based on the footer) block of the file which contains all address information.

    3. ChunkRecords: zero-addressed Chunk Data.

    The index is where most of the complexity resides. The index is made up of three sections:

1. Prefix Map: This is the first 8 bytes of each Chunk address, coupled with an ordinal. The Prefixes are sorted so that we can do a binary search on the map to find the set of addresses in the file which have the given prefix. This binary search greatly reduces the search space needed to find the location of the requested chunk. The ordinal is used to indicate the offset into Lengths and Suffixes.

    2. Lengths: This tracks the length of the Chunk at each ordinal.

    3. Address Suffixes: The address at that ordinal, minus the 8 bytes of the prefix which are redundant.

    Contains

    The algorithm to determine if the storage file contains a given address:

    1. Strip the first 8 bytes of the address, and call that the Prefix.

2. Perform a binary search on the Prefix Map; once a match is found, search the vicinity for additional matches.

    3. For the set of matched Prefixes, grab the Address Suffix at the associated Ordinal. This results in the suffix set.

4. For the set of suffixes, compare each to the last twelve bytes of the address you are looking for. If one matches, the Chunk exists in the file.

    Given that we are using SHA512 to calculate our addresses, we assume there is a decent distribution of addresses. Therefore, the time intensive portion of this algorithm is the binary search on the Prefix Map, which is O(log2(N)), where N is the number of chunks.
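
Here is a hedged Go sketch of that Contains lookup. The index layout below is a simplified in-memory stand-in for the real NBS index; the field names and types are illustrative, not Dolt's actual code.

```go
package nbs

import (
	"bytes"
	"encoding/binary"
	"sort"
)

// index is a simplified view of the NBS Index described above.
type index struct {
	prefixes []uint64   // sorted first 8 bytes of each address
	ordinals []uint32   // ordinal for each prefix entry
	suffixes [][12]byte // remaining 12 address bytes, indexed by ordinal
	lengths  []uint32   // chunk lengths, indexed by ordinal
}

// contains reports whether a 20-byte address is present in the storage file,
// returning the ordinal of the matching Chunk if so.
func (idx *index) contains(addr [20]byte) (ordinal uint32, ok bool) {
	prefix := binary.BigEndian.Uint64(addr[:8])

	// Binary search for the first entry with this prefix.
	i := sort.Search(len(idx.prefixes), func(i int) bool {
		return idx.prefixes[i] >= prefix
	})

	// Check every entry sharing the prefix against the 12-byte suffix.
	for ; i < len(idx.prefixes) && idx.prefixes[i] == prefix; i++ {
		ord := idx.ordinals[i]
		if bytes.Equal(idx.suffixes[ord][:], addr[8:]) {
			return ord, true
		}
	}
	return 0, false
}
```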

    In the following example, there are 3 keys: ABC, LMN, and TUV. For each, the first character is considered the Prefix, and the second two are the Suffix.

    Let's determine if the storage file contains ABC. First we need to read the footer, which tells us there are 3 chunks in this file. Using that, we calculate the offsets for the index from the end of the file. Using a seek operation, we load and parse this information for the duration of the process.

    Now that we have the Prefix Map in memory, we pull the prefix, A off of our desired address ABC. The Prefix Map contains A,L,T, so doing a binary search we determine that the index 0 contains the prefix.

This example contains no collisions in the Prefix Map, but in the general case we find all matches and gather the Ordinals. In this case, the Ordinal is 1. The Ordinal is the index into the Lengths and Suffixes.

    Now, we compare the Suffix found at that location with the Suffix we are looking for, BC, and since they match, we know that this file contains the ABC Chunk.

    Retrieve

Retrieving the object requires additional steps. Using the Contains algorithm, we get the Ordinal of the Chunk found, then:

    1. Sum all values in the Lengths array up to the Ordinal. This is the byte offset into ChunkRecords.

2. Perform a disk seek to that offset in the ChunkRecords portion of the file, and read the Chunk's bytes using the length at that Ordinal.

    3. Snappy Decompress the byte array, and you are done!

Note that step (1) has an O(N) operation in it, but we calculate those values at the time we load the Index initially. Loading the Index is an O(N) operation, but we only do it once.
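
Continuing the hedged sketch from the Contains section, the offset arithmetic might look like the following. The function name is hypothetical, and a real implementation would precompute these prefix sums when the Index is loaded.

```go
package nbs

// chunkRange returns the byte offset into ChunkRecords and the length of the
// Chunk at the given ordinal, given the Lengths array from the Index. The
// caller would then seek to that offset, read length bytes, and
// Snappy-decompress them.
func chunkRange(lengths []uint32, ordinal uint32) (offset uint64, length uint32) {
	for i := uint32(0); i < ordinal; i++ {
		offset += uint64(lengths[i]) // sum the lengths of earlier chunks
	}
	return offset, lengths[ordinal]
}
```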

    To continue our example, to retrieve the value for ABC, using the Ordinal 1 from before, we find the sum of all lengths before the Ordinal in question. In our example, we don't have many things to sum up since our Ordinal is 1. I should have thought of that before I made these pretty pictures!

The goal of this step is to get the offset and length of the Chunk. By summing all of the lengths of Chunks which come before the one we care about, we get the offset. Then the length is simply the value in the Ordinal position. In our case, offset is 1, and length is 4.

    The value for Chunk ABC is 44 61 74 61! Well, technically we'd need to decompress that, but let's leave that for another time. Looking up LMN and TUV is an exercise left to the reader.

    To recap, here is a gif!

    The critical piece here is that the Index does most of the heavy lifting. It's loaded into memory at server start up, and as a result we confine all of our lookup code to that memory. It's only after we've determined the full offset of the Chunk that we venture into that yellow block, which requires a disk read.


    Prolly Trees

    "Prolly Tree" is short for "Probabilistic B-tree". "Prolly Tree" was coined by the good folks who built Noms, who as far as we can tell invented the data structure. We here at DoltHub have immense respect for their pioneering work, without which Dolt would not exist.

    Prolly Tree

    A Prolly Tree is a data structure closely related to a B-tree. Prolly Trees are generally useful but have proven particularly effective as the basis of the storage engine for version controlled databases.

    Motivation

    Let's say you need a data structure with the following properties:

    1. B-tree like performance - Specifically on reads and writes.

    2. Fast Diffs - The ability to compare two versions efficiently.

    3. Structural Sharing - Any given portion of the data shared between versions is only stored once.

    A Prolly Tree delivers these properties.

| Operation | B-Trees | Prolly Trees |
| --- | --- | --- |
| 1 Random Read | logk(n) | logk(n) |
| 1 Random Write | logk(n) | (1+k/w)*logk(n) |
| Ordered scan of one item with size z | z/k | z/k |
| Calculate diff of size d | n | d |
| Structural sharing | ❌ | ✅ |

n: total leaf data in tree, k: average block size, w: window width

    As you can see a Prolly Tree approximates B-tree performance on reads and writes while also offering the ability to compute differences in time proportional to the size of differences rather than the total size of the data. Prolly Trees can also structurally share portions of the tree between versions due to the content-addressed nature of their intermediary storage.

    Let's dive a bit deeper and show you how.

    How Prolly Trees work

    Prolly Trees are a variant of B-trees so let's first review some key B-tree concepts and then dive into how Prolly Trees work in light of those concepts.

    B-tree Review

A B-tree is a data structure that maps keys to values. A B-tree stores key-value pairs in leaf nodes in sorted order. Internal nodes of a B-tree store pointers to children nodes and key delimiters; everything reachable from a pointer to a child node falls within a range of key values corresponding to the key delimiters before and after that pointer within the internal node.

    A B-tree is optimized for a tradeoff between write and read performance. It's more expensive to maintain than dropping tuples into a heap without any ordering constraints, but when data is stored in a B-tree it's much quicker to seek to a given key and to do an in-order traversal of the keys with their values.

    However, finding the differences between two B-trees requires scanning both trees and comparing all values, an operation that scales with the size of the tree. This operation is slow for large trees.

Also, writes to B-trees are not history independent: the order of the writes changes the internal structure of the tree. Thus, storage cannot be easily shared between two versions of the same tree.

    Building a Prolly Tree

    The easiest way to understand Prolly trees is to walk through, step-by-step, how we build one. You start with a map of keys to values. Using the above B-tree example, we have keys between 1 and 21. The values don't really matter. Just imagine them dangling off the keys.

    1. Sort: Sort the map by its key value, so that it is laid out in order.

2. Determine Chunk Boundaries: Use a seed, the size of the current chunk, the key value and a strong hash function to calculate whether the current entry represents a new chunk boundary. Any time the hash value is below a target value, form a chunk boundary and start a new block (this step is sketched in code just after this list). Here's the chunking step on our leaf nodes:

3. Hash each Chunk: You now have the leaf nodes of the Prolly Tree. Compute the content address of each block by applying a strong hash function to its contents. You store the contents in a content addressed block store. Here our blocks have been addressed and stored in the block store:

4. Finished?: If the length of your list of content addresses is 1, then you are done. This is the content address of your tree.

5. Build a New Map: Otherwise, form the next layer of your Prolly Tree by creating a map from the highest key value in each chunk to the content address of that chunk. Here's what the entries for the first internal level of the tree would look like for our example:

6. Return to Step 2: Use this new map as the input to step 2. The algorithm terminates when the condition in step 4 is reached.

    Following these steps results in the following Prolly Tree.
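
A hedged Go sketch of the chunking step described above: walk the sorted keys, hash each one, and cut a chunk boundary whenever the hash falls below a target. The hash function and target here are toy stand-ins, and unlike Dolt's real chunker this sketch does not fold in a seed or the current chunk size.

```go
package prolly

import "hash/fnv"

// boundary decides whether the current chunk should end after this key.
// A real implementation would mix in a seed and the current chunk size
// (see "Controlling Chunk Size" below).
func boundary(key []byte) bool {
	h := fnv.New64a()
	h.Write(key)
	const target = uint64(1) << 58 // roughly a 1-in-64 chance per key
	return h.Sum64() < target
}

// chunkKeys splits already-sorted keys into chunks, forming the leaf layer.
func chunkKeys(sortedKeys [][]byte) [][][]byte {
	var chunks [][][]byte
	var current [][]byte
	for _, k := range sortedKeys {
		current = append(current, k)
		if boundary(k) {
			chunks = append(chunks, current)
			current = nil
		}
	}
	if len(current) > 0 {
		chunks = append(chunks, current)
	}
	return chunks
}
```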

    Modifying a Prolly Tree

The magic of Prolly Trees is not seen on construction, but on modification. As you'll see, modifying a Prolly Tree changes a minimal number of content addresses. Minimal modification means more structural sharing of contents because fewer content addresses change. It is also the basis of fast diff, which is accomplished by comparing content addresses.

    Update a Value

If we update a value, we walk the tree by key to the leaf node holding the value. We then create the new leaf chunk by directly modifying the existing value in a copy of the existing chunk (ie. copy on write). After the edit, we recalculate the content address of the chunk. We then walk up the tree recalculating each internal content address up to the root of the tree.

Note, this is only true for fixed-size value updates. If you change the length of a value, then we do have to re-chunk. Changes to or from NULL values, and changes to strings, are not fixed-length updates.

    Insert Keys

    We can insert keys at the beginning, middle, or end of the key space. It's helpful to visualize each type of insert.

When we insert a key, we walk the tree by key to the leaf node where the key belongs. We then edit the chunk, calculate whether to split the chunk or not, and then recalculate the content address. Note, a key insert or delete has a small probability of changing an existing chunk boundary. If the computed hash values for the keys, combined with the size of the chunk at those keys, happen to choose a different chunk boundary, the chunk will be split at the new boundary. Finally, we walk up the tree recalculating each content address up to the root of the tree.

In Dolt's Prolly Tree implementation, chunks are set to be on average 4 kilobytes. This means that you can imagine a probability of roughly 1/4096, or 0.02%, of triggering a chunk boundary shift when changing a single byte. In practice, it's quite a bit more complicated. As we'll discuss later, the size of the chunk is considered when computing a chunk boundary split. So, the larger the chunk, the greater the probability the chunk will be split on the addition of a new key. Conversely, for small chunks, the probability of a chunk split is very small. This means that every edit to a table in Dolt writes a minimum of 4KB multiplied by the depth of the tree, with some small probability of writing multiple chunks.

    At the Beginning

    When you insert a key at the beginning of the key space, the tree is modified along the left edge.

    At the end

    When you insert a key at the end of the key space, the tree is modified along the right edge.

    In the middle

    When you insert a key in the middle of the key space, the tree is modified along a spline.

    Delete a key

    When you delete a key, the tree is modified under the same rules as an insert.

    Properties

    History Independence

    A key property of a Prolly tree that enables fast diff and structural sharing is history independence. No matter which order you insert, update, or delete values, the Prolly tree is the same. This is best seen through example.

Consider a map with 4 integer keys, (1, 2, 3, 4), pointing at the same values. Beforehand, we know the combination of keys (1, 2) will trigger a chunk boundary. We can know this because chunk boundaries are based on the size of the chunk and its contents. No matter which order we insert, update, or delete, we will end up with a tree with two leaf nodes, (1, 2) and (3, 4). This also happens to be true if the values at these keys are different. The tree will have the same shape but the chunks will have different addresses.

    Let's say we insert the chunks in sequential order.

    Then, let's say we insert the keys in reverse order.

    As you can see, we end up with the same Prolly Tree no matter which order we insert the values. It's a fun exercise to try and come up with a sequence of inserts, updates, and deletes that result in a different tree that contains the same values. It's fun because you can't. The Prolly Tree algorithm always spits out the same tree.

    Fast Diff

    Given history independence, the Prolly Tree difference calculation becomes quite simple. If the hash of the root chunk is the same the entire subtree is the same. Thus, one must just walk the tree down to the leaves, ignoring hashes that are equal. Hashes that are different represent the difference.

Let's see how this algorithm works in practice. Recall the Prolly Tree where we updated the value at key 9. Let's compare it to our originally built Prolly Tree.

    Start by comparing the root hashes. They are necessarily different.

    Then walk both chunks identifying the subtree hashes that are different. In this case, a single hash is different.

    Follow the pointer to the next layer and compare those chunks, finding the hashes that are different. If you are in a leaf node, produce the values that are different.

    As you can see this algorithm scales with the size of the differences, not the size of the tree. This makes finding small differences in even large trees very fast.
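
A hedged Go sketch of that diff walk, using an in-memory node type as a stand-in for chunks fetched from the block store. For brevity the sketch ignores keys present in only one of the two trees; the point is that identical content addresses cut off entire subtrees.

```go
package prolly

import "bytes"

type node struct {
	hash     [20]byte
	children map[string]*node  // delimiter key -> child; nil for leaf nodes
	values   map[string][]byte // key -> value; populated for leaf nodes
}

// diffTrees calls emit for each key whose value differs between a and b,
// skipping any subtree whose root hash is identical.
func diffTrees(a, b *node, emit func(key string)) {
	if a.hash == b.hash {
		return // identical content addresses imply identical subtrees
	}
	if a.children == nil && b.children == nil { // both are leaves
		for k, v := range a.values {
			if w, ok := b.values[k]; !ok || !bytes.Equal(v, w) {
				emit(k)
			}
		}
		return
	}
	for k, childA := range a.children {
		if childB, ok := b.children[k]; ok {
			diffTrees(childA, childB, emit) // descend only where hashes differ
		}
	}
}
```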

    Structural Sharing

    Recall that Prolly trees are stored in a content addressed block store where the content address forms the lookup key for the block.

    Thus, any blocks that share the same content address are only stored in the block store once. When they need to be retrieved they are retrieved via content address. This means that any blocks shared across versions will only be in the block store once.

As noted earlier, in Dolt's Prolly Tree implementation, the block size is set to be on average 4 kilobytes. Thus, a single change will cause a change in the block store of, on average, 4 kilobytes times the depth of the tree.

    The Nitty Gritty Details

    Now that you have the general details, let's dive into some details highlighting some of the things we learned iterating on the Noms implementation of Prolly Trees and finally landing on Dolt's stable Prolly Tree implementation.

    Controlling Chunk Size

    The original Noms implementation of Prolly Trees was susceptible to a chunk size problem. You would end up with many small chunks and a few large ones. The chunk size was a geometric distribution with an average size of 4 kilobytes.

This is the pattern one would expect from repeated rolls of a rolling hash function. Let's imagine you have a six-sided die. You walk along the road picking up pebbles. For each pebble, you put it in a bag and roll the die. If the die shows 6 you get a new bag. The average number of pebbles in each bag is 6, but you will mostly have bags with just a few pebbles and a few bags with >10 pebbles. The pebbles in the Noms case were the key-value byte stream and the die was the rolling hash function.

Large chunks create a particular problem, especially on the read path, because a randomly chosen key is more likely to live in a large chunk. Within each chunk a binary search is performed to find the key you want. The larger the chunk, the slower this is: the search is log2(n) in the chunk size. So, you really want to keep a normally distributed chunk size.

To fix this, Dolt's Prolly Tree implementation considers the chunk size when deciding whether to make a boundary. Given a target probability density function (PDF), Dolt uses its associated cumulative distribution function (CDF) to decide the probability of splitting a chunk of size x. Specifically, we use the formula (CDF(end) - CDF(start)) / (1 - CDF(start)) to calculate the target probability. So, if the current chunk size is 2000 bytes, the probability of triggering a boundary by appending a 64 byte key-value pair is (CDF(2064) - CDF(2000)) / (1 - CDF(2000)). In Dolt, we want our chunk size normally distributed around 4 kilobytes and by using the above approach, we get that.
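
A hedged Go sketch of that calculation, assuming a normal target distribution; the mean and spread below are illustrative choices, not Dolt's tuned values.

```go
package prolly

import "math"

// normalCDF is the cumulative distribution function of a normal distribution
// with the given mean and standard deviation.
func normalCDF(x, mean, stddev float64) float64 {
	return 0.5 * (1 + math.Erf((x-mean)/(stddev*math.Sqrt2)))
}

// splitProbability implements (CDF(end) - CDF(start)) / (1 - CDF(start)):
// the probability of cutting a boundary when a chunk grows from start bytes
// to end bytes.
func splitProbability(start, end float64) float64 {
	const mean, stddev = 4096, 1024 // target chunks around 4KB (illustrative spread)
	cs := normalCDF(start, mean, stddev)
	ce := normalCDF(end, mean, stddev)
	return (ce - cs) / (1 - cs)
}
```

For the example above, splitProbability(2000, 2064) would give the chance that appending a 64 byte key-value pair to a 2000 byte chunk triggers a boundary.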

    Only Consider Keys

Noms' original Prolly Tree implementation considered keys and values when deciding when to make a chunk boundary. Dolt's implementation only considers keys.

    Dolt's Prolly Trees have the advantage of backing a SQL database. SQL databases define schema with fixed types. Fixed types have maximum sizes. Thus, any updates to values can be done in place because the values will be the same size. If we compute the rolling hash on only keys, any update to values is guaranteed not to shift the chunk boundary. This is a desirable property that improves update performance.

Moreover, Noms' rolling hash function, buzhash, performed poorly for byte streams with low entropy. Tables with ordered keys, specifically time series data where very little changes at each sample, were problematic. This would again lead to very large chunks as no chunk boundary was triggered because most of the byte stream considered by the rolling hash function was the same. By considering only keys, which again are necessarily unique, Dolt's hash function creates chunk boundaries more evenly.

One subtlety of this change is that Dolt now chunks to an average number of key-value pairs rather than an average size of 4 kilobytes, but this difference disappears when used in concert with chunk size consideration in the probability of creating chunk boundaries.

    Less Flexible Block Store

Prolly Trees themselves are agnostic to the underlying block store. So if you only care about the Prolly Tree data structure, you can skip this section.

Noms' block store encoded the types of the data in the store. Dolt's block store is built for a SQL database with fixed schema. Thus, type information is stored out of band of the block store and a more specific, less flexible layout of data on disk is used. This new, type-less layout improves Dolt read and write performance.

    Explaining Algorithmic Performance

    Recall this table:

| Operation | B-Trees | Prolly Trees |
| --- | --- | --- |
| 1 Random Read | logk(n) | logk(n) |
| 1 Random Write | logk(n) | (1+k/w)*logk(n) |
| Ordered scan of one item with size z | z/k | z/k |
| Calculate diff of size d | n | d |
| Structural sharing | ❌ | ✅ |

n: total leaf data in tree, k: average block size, w: window width

    We now understand what n, k, and w are in the context of B-Trees and Prolly Trees. As you can see, B-Trees and Prolly Trees offer similar read performance. Prolly Trees pay a slight performance penalty on writes due to the small probability of a chunk split. However, Prolly Trees can produce differences in time proportional to the size of the differences, rather than the size of the tree.

    How Prolly Trees are used in Dolt

In Dolt, all data in the database is stored in Prolly Trees.

For table data, a map of primary key to data columns is stored in a Prolly Tree. Similarly, secondary indexes are maps of index values to the primary key identifying each row. Schemas are stored as Prolly Trees to make calculating the root of the database easy. Keyless tables are implemented with every column acting as part of the key and a count of the number of duplicate rows as the value.
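
As an illustration of those layouts, here is a hedged Go sketch with maps standing in for serialized Prolly Tree key-value tuples; the type names are hypothetical.

```go
package layouts

// Table data: primary key -> the remaining column values of the row.
type tableData map[int64][]any

// Secondary index: indexed column value (plus primary key for uniqueness)
// -> the primary key of the row it points at.
type secondaryIndex map[string]int64

// Keyless table: the entire serialized row acts as the key, mapped to a
// count of duplicate rows.
type keylessTable map[string]uint64
```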

    Prolly Trees In Practice

This all looks good on paper. How do Prolly Trees work in practice? On a standard suite of sysbench performance tests, Dolt is approximately as fast as MySQL.

    Upon profiling, we find any performance differences to be unrelated to Prolly Trees. Performance differences come from:

    1. Dolt is implemented in Golang. MySQL is implemented in C.

    2. MySQL's SQL analyzer is sometimes faster than Dolt's because it is more mature.

    3. MySQL does fewer transformations on data than Dolt to get it into the necessary wire format.

Dolt can compute diffs in time proportional to the size of the differences. Dolt structurally shares data across versions.
