updated readme
tomfran committed Oct 27, 2023
1 parent e5120a1 commit 0c61094
Showing 3 changed files with 533 additions and 19 deletions.
24 changes: 5 additions & 19 deletions README.md
@@ -15,14 +15,12 @@ An implementation of the Log-Structured Merge Tree (LSM tree) data structure in
5. [Possible future improvements](#possible-improvements)
6. [References](#references)

## Console
### Console

To interact with a toy tree you can use `./gradlew run -q` to spawn a console.

![console.png](misc%2Fconsole.png)

---

# Architecture

Architecture overview, from SSTables, which are the disk-resident portion of the database, Skip Lists, used
@@ -86,8 +84,6 @@ To save space, all integers are stored
in [variable-length encoding](https://nlp.stanford.edu/IR-book/html/htmledition/variable-byte-codes-1.html),
and offsets in the index are stored as [deltas](https://en.wikipedia.org/wiki/Delta_encoding).

---

## Skip-List

A [skip-list](https://en.wikipedia.org/wiki/Skip_list) is a probabilistic data structure that allows fast search,
@@ -112,8 +108,6 @@ we are looking for. Then we move down to the next level and repeat the process u
Insertions, deletions, and updates are done by first locating the element, then performing
the operation on the node. All of them have an average time complexity of `O(log(n))`.

---

## Tree

Having defined SSTables and Skip Lists we can obtain the final structure as a combination of the two.
@@ -157,16 +151,14 @@ tables in lookups.
Note that this style of compaction is not standard: more sophisticated techniques exist, but for the sake of
this project this simple level-like compaction works well.

---

# Benchmarks

I am using [JMH](https://openjdk.java.net/projects/code-tools/jmh/) to run the benchmarks.
The results were obtained on an AMD Ryzen™ 5 4600H with 16GB of RAM and a 512GB SSD.

To run them use `./gradlew jmh`.

## SSTable
### SSTable

- Negative access: the key is not present in the table, hence the Bloom filter will likely stop the search;
- Random access: the key is present in the table, the order of the keys is random.
@@ -179,7 +171,7 @@ c.t.l.sstable.SSTableBenchmark.randomAccess thrpt 5 7989.945 ± 40
```

## Bloom filter
### Bloom filter

- Add: add keys to a 1M keys Bloom filter with 0.01 false positive rate;
- Contains: test whether the keys are present in the Bloom filter.
@@ -191,7 +183,7 @@ c.t.l.bloom.BloomFilterBenchmark.contains thrpt 5 3567392.634 ± 220377
```

## Skip-List
### Skip-List

- Get: get keys from a 100k keys skip-list;
- Add/Remove: add and remove keys from a 100k keys skip-list.
@@ -204,7 +196,7 @@ c.t.l.memtable.SkipListBenchmark.get thrpt 5 487265.620 ± 8201
```

## Tree
### Tree

- Get: get elements from a tree with 1M keys;
- Add: add 1M distinct elements to a tree with a memtable size of 2^18
@@ -216,8 +208,6 @@ c.t.l.tree.LSMTreeGetBenchmark.get thrpt 5 9426.951 ± 241
```

---

## Possible improvements

There is certainly room for improvement in this project:
@@ -231,15 +221,11 @@ There is certainly space for improvement on this project:

I don't have the time to do all of this; perhaps the first two points will be handled in the future.

---

## References

- [Database Internals](https://www.databass.dev/) by Alex Petrov, specifically chapters about Log-Structured Storage and
File Formats;
- [A Skip List Cookbook](https://api.drum.lib.umd.edu/server/api/core/bitstreams/17176ef8-8330-4a6c-8b75-4cd18c570bec/content)
by William Pugh.

---

_If you found this useful or interesting do not hesitate to ask clarifying questions or get in touch!_
264 changes: 264 additions & 0 deletions out.html
@@ -0,0 +1,264 @@
<p>An implementation of the Log-Structured Merge Tree (LSM tree) data
structure in Java.</p>
<p><strong>Table of Contents</strong></p>
<ol type="1">
<li><a href="#architecture">Architecture</a>
<ol type="1">
<li><a href="#sstable">SSTable</a></li>
<li><a href="#skip-list">Skip-List</a></li>
<li><a href="#tree">Tree</a></li>
</ol></li>
<li><a href="#benchmarks">Benchmarks</a>
<ol type="1">
<li><a href="#sstable-1">SSTable</a></li>
<li><a href="#skip-list-1">Skip-List</a></li>
<li><a href="#tree-1">Tree</a></li>
</ol></li>
<li><a href="#possible-improvements">Possible future
improvements</a></li>
<li><a href="#references">References</a></li>
</ol>
<p>To interact with a toy tree you can use <code>./gradlew run -q</code>
to spawn a console.</p>
<figure>
<img src="misc%2Fconsole.png" alt="console.png" />
<figcaption aria-hidden="true">console.png</figcaption>
</figure>
<hr />
<h1 data-number="1" id="architecture"><span
class="header-section-number">1</span> Architecture</h1>
<p>This section gives an architecture overview: SSTables, which are the
disk-resident portion of the database; Skip Lists, used as in-memory
buffers; and finally the combination of the two to provide insertion,
lookup and deletion primitives.</p>
<h2 data-number="1.1" id="sstable"><span
class="header-section-number">1.1</span> SSTable</h2>
<p>A Sorted String Table (SSTable) is a collection of files storing
key-value pairs sorted by key. It is used as persistent storage for the
LSM tree.</p>
<h3 data-number="1.1.1" id="components"><span
class="header-section-number">1.1.1</span> Components</h3>
<ul>
<li><em>Data</em>: key-value pairs in sorted order by key, stored in a
file;</li>
<li><em>Sparse index</em>: a subset of the keys, each stored with the
offset of the corresponding key-value pair in the data file;</li>
<li><em>Bloom filter</em>: a <a
href="https://en.wikipedia.org/wiki/Bloom_filter">probabilistic data
structure</a> used to test whether a key is in the SSTable.</li>
</ul>
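<p>As a rough illustration of the Bloom filter component, here is a
minimal sketch. The class name <code>BloomSketch</code>, the double-hashing
scheme and the FNV-1a helper are assumptions made for this example and are
not taken from the project; only the idea of <code>m</code> bits,
<code>k</code> hash functions and a backing long array comes from the
text.</p>

```java
import java.util.Arrays;

/** Illustrative Bloom filter; hashing scheme and names are assumptions. */
final class BloomSketch {
    private final long[] bits; // bit set backed by a long array
    private final int m;       // number of bits
    private final int k;       // number of hash functions

    BloomSketch(int m, int k) {
        if (m <= 0 || k <= 0) throw new IllegalArgumentException("m and k must be positive");
        this.m = m;
        this.k = k;
        this.bits = new long[(m + 63) / 64];
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int position(byte[] key, int i) {
        int h1 = Arrays.hashCode(key);
        int h2 = fnv1a(key);
        return Math.floorMod(h1 + i * h2, m);
    }

    // 32-bit FNV-1a, used here only as a second, independent-looking hash.
    private static int fnv1a(byte[] key) {
        int h = 0x811c9dc5;
        for (byte b : key) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }

    void add(byte[] key) {
        for (int i = 0; i < k; i++) {
            int p = position(key, i);
            bits[p >>> 6] |= 1L << (p & 63);
        }
    }

    /** false means the key is definitely absent; true means it may be present. */
    boolean mightContain(byte[] key) {
        for (int i = 0; i < k; i++) {
            int p = position(key, i);
            if ((bits[p >>> 6] & (1L << (p & 63))) == 0) return false;
        }
        return true;
    }
}
```

<p>A <code>false</code> answer lets a lookup skip the table entirely,
while a <code>true</code> answer still requires checking the index and
data.</p>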
<h3 data-number="1.1.2" id="key-lookup"><span
class="header-section-number">1.1.2</span> Key lookup</h3>
<p>The basic idea is to use the sparse index to find the key-value pair
in the data file. The steps are:</p>
<ol type="1">
<li>Use the Bloom filter to test whether the key might be in the
table;</li>
<li>If the key might be present, use binary search on the index to find
the greatest indexed key that is less than or equal to the key;</li>
<li>Scan the data from the position found in the previous step to find
the key-value pair. The search can stop when we see a key greater than
the one we are looking for, or when we reach the end of the
table.</li>
</ol>
<p>The search is as lazy as possible, meaning that we read the minimum
amount of data from disk. For instance, if the length of the next key is
smaller than the length of the key we are looking for, we can skip the
whole key-value pair.</p>
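<p>The three lookup steps can be sketched in memory as follows. This is
not the project's code: the class <code>SSTableLookupSketch</code>, the
arrays standing in for the data and index files, and the
<code>Predicate</code> standing in for the Bloom filter are all
assumptions made to keep the example self-contained.</p>

```java
import java.util.function.Predicate;

/** In-memory model of an SSTable lookup; all names are illustrative. */
final class SSTableLookupSketch {
    private final String[] keys;         // data "file": keys in sorted order
    private final String[] values;       // values aligned with keys
    private final String[] indexKeys;    // sparse index: every step-th key
    private final int[] indexOffsets;    // position in keys[] of each index entry
    private final Predicate<String> filter; // stand-in for the Bloom filter

    SSTableLookupSketch(String[] keys, String[] values, int step, Predicate<String> filter) {
        this.keys = keys;
        this.values = values;
        this.filter = filter;
        int n = (keys.length + step - 1) / step;
        this.indexKeys = new String[n];
        this.indexOffsets = new int[n];
        for (int i = 0; i < n; i++) {
            indexOffsets[i] = i * step;
            indexKeys[i] = keys[i * step];
        }
    }

    String get(String key) {
        // Step 1: the filter can rule the key out without touching the data.
        if (!filter.test(key)) return null;

        // Step 2: binary search for the greatest index key <= key.
        int lo = 0, hi = indexKeys.length - 1, start = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (indexKeys[mid].compareTo(key) <= 0) {
                start = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (start < 0) return null; // key is smaller than every key in the table

        // Step 3: scan forward, stopping at the first greater key or at the end.
        for (int i = indexOffsets[start]; i < keys.length; i++) {
            int c = keys[i].compareTo(key);
            if (c == 0) return values[i];
            if (c > 0) break;
        }
        return null;
    }
}
```

<p>With a sparse index the scan in step 3 touches at most a bounded run
of entries after the index position, which is the trade-off between index
size and scan length.</p>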
<h3 data-number="1.1.3" id="persistence"><span
class="header-section-number">1.1.3</span> Persistence</h3>
<p>A table is persisted to disk when it is created. A base filename is
defined, and three files are present:</p>
<ul>
<li><code>&lt;base_filename&gt;.data</code>: data file;</li>
<li><code>&lt;base_filename&gt;.index</code>: index file;</li>
<li><code>&lt;base_filename&gt;.bloom</code>: bloom filter file.</li>
</ul>
<p><strong>Data format</strong></p>
<ul>
<li><code>n</code>: number of key-value pairs;</li>
<li><code>&lt;key_len_1, value_len_1, key_1, value_1, ... key_len_n, value_len_n, key_n, value_n&gt;</code>:
key-value pairs.</li>
</ul>
<p><strong>Index format</strong></p>
<ul>
<li><code>s</code>: number of entries in the whole table;</li>
<li><code>n</code>: number of entries in the index;</li>
<li><code>o_1, o_2 - o_1, ..., o_n - o_(n-1)</code>: offsets of the
key-value pairs in the data file, skipping the first one;</li>
<li><code>s_1, s_2, ..., s_n</code>: remaining keys after a sparse index
entry, used to exit from search;</li>
<li><code>&lt;key_len_1, key_1, ... key_len_n, key_n&gt;</code>: keys in
the index.</li>
</ul>
<p><strong>Filter format</strong></p>
<ul>
<li><code>m</code>: number of bits in the bloom filter;</li>
<li><code>k</code>: number of hash functions;</li>
<li><code>n</code>: size of underlying long array;</li>
<li><code>b_1, b_2, ..., b_n</code>: bits of the bloom filter.</li>
</ul>
<p>To save space, all integers are stored in <a
href="https://nlp.stanford.edu/IR-book/html/htmledition/variable-byte-codes-1.html">variable-length
encoding</a>, and offsets in the index are stored as <a
href="https://en.wikipedia.org/wiki/Delta_encoding">deltas</a>.</p>
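<p>To make the space saving concrete, here is a minimal sketch of
variable-length integers combined with delta-encoded offsets. The byte
layout (7 data bits per byte, high bit set when more bytes follow) is an
assumption and may differ from the layout the project actually uses;
<code>VarintDeltaSketch</code> and its method names are hypothetical.</p>

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

/** Illustrative varint + delta encoding; the exact byte layout is an assumption. */
final class VarintDeltaSketch {
    // Write a non-negative int, 7 data bits per byte; high bit set means "more bytes follow".
    static void writeVarint(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    static int readVarint(ByteBuffer in) {
        int result = 0;
        int shift = 0;
        while (true) {
            int b = in.get() & 0xFF;
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
            shift += 7;
        }
    }

    // Store the count, then the first offset followed by successive differences.
    static byte[] encodeOffsets(int[] sortedOffsets) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarint(out, sortedOffsets.length);
        int prev = 0;
        for (int o : sortedOffsets) {
            writeVarint(out, o - prev);
            prev = o;
        }
        return out.toByteArray();
    }

    // Rebuild absolute offsets by summing the deltas back up.
    static int[] decodeOffsets(byte[] bytes) {
        ByteBuffer in = ByteBuffer.wrap(bytes);
        int n = readVarint(in);
        int[] offsets = new int[n];
        int prev = 0;
        for (int i = 0; i < n; i++) {
            prev += readVarint(in);
            offsets[i] = prev;
        }
        return offsets;
    }
}
```

<p>Because offsets grow monotonically, the deltas stay small and usually
fit in one or two bytes, whereas a fixed-width encoding would spend four
bytes on every entry.</p>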
<hr />
<h2 data-number="1.2" id="skip-list"><span
class="header-section-number">1.2</span> Skip-List</h2>
<p>A <a href="https://en.wikipedia.org/wiki/Skip_list">skip-list</a> is
a probabilistic data structure that allows fast search, insertion and
deletion of elements in a sorted sequence.</p>
<p>In the LSM tree, it is used as an in-memory data structure to store
key-value pairs in sorted order by key. Once the skip-list reaches a
certain size, it is flushed to disk as an SSTable.</p>
<h3 data-number="1.2.1" id="operations-details"><span
class="header-section-number">1.2.1</span> Operations details</h3>
<p>The idea of a skip list is similar to a classic linked list. We have
nodes with forward pointers, but also levels. We can think about a level
as a fast lane between nodes. By carefully constructing them at
insertion time, searches are faster, as they can use higher levels to
skip unwanted nodes.</p>
<p>Given <code>n</code> elements, a skip list has about
<code>log(n)</code> levels, with the first level containing all the
elements. Each level up holds roughly half the elements of the level
below it.</p>
<p>To locate an element, we start from the top level and move forward
until we find an element greater than the one we are looking for. Then
we move down to the next level and repeat the process until we find the
element.</p>
<p>Insertions, deletions, and updates are done by first locating the
element, then performing the operation on the node. All of them have an
average time complexity of <code>O(log(n))</code>.</p>
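<p>The search and insertion procedures described above can be sketched
as follows. This is a textbook-style skip list written for illustration,
not the project's implementation: <code>SkipListSketch</code>,
<code>MAX_LEVEL</code> and the fixed random seed are assumptions.</p>

```java
import java.util.Random;

/** Textbook-style skip list with String keys; names and limits are illustrative. */
final class SkipListSketch {
    private static final int MAX_LEVEL = 16;

    private static final class Node {
        final String key;
        String value;
        final Node[] next; // next[i] is the forward pointer on level i

        Node(String key, String value, int height) {
            this.key = key;
            this.value = value;
            this.next = new Node[height];
        }
    }

    private final Node head = new Node(null, null, MAX_LEVEL);
    private final Random random = new Random(42);
    private int levels = 1; // number of levels currently in use

    // Each extra level is granted with probability 1/2.
    private int randomLevel() {
        int lvl = 1;
        while (lvl < MAX_LEVEL && random.nextBoolean()) lvl++;
        return lvl;
    }

    String get(String key) {
        Node cur = head;
        // Start at the top level; move right while the next key is smaller, then drop down.
        for (int i = levels - 1; i >= 0; i--) {
            while (cur.next[i] != null && cur.next[i].key.compareTo(key) < 0) {
                cur = cur.next[i];
            }
        }
        Node candidate = cur.next[0];
        return (candidate != null && candidate.key.equals(key)) ? candidate.value : null;
    }

    void put(String key, String value) {
        Node[] update = new Node[MAX_LEVEL]; // last node visited on each level
        Node cur = head;
        for (int i = levels - 1; i >= 0; i--) {
            while (cur.next[i] != null && cur.next[i].key.compareTo(key) < 0) {
                cur = cur.next[i];
            }
            update[i] = cur;
        }
        Node candidate = cur.next[0];
        if (candidate != null && candidate.key.equals(key)) {
            candidate.value = value; // update in place
            return;
        }
        int lvl = randomLevel();
        for (int i = levels; i < lvl; i++) update[i] = head;
        if (lvl > levels) levels = lvl;
        Node node = new Node(key, value, lvl);
        for (int i = 0; i < lvl; i++) {
            node.next[i] = update[i].next[i];
            update[i].next[i] = node;
        }
    }
}
```

<p>Walking level 0 from <code>head</code> yields the keys in sorted
order, which is what makes flushing to an SSTable a simple sequential
write.</p>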
<hr />
<h2 data-number="1.3" id="tree"><span
class="header-section-number">1.3</span> Tree</h2>
<p>Having defined SSTables and Skip Lists we can obtain the final
structure as a combination of the two. The main idea is to use the
latter as an in-memory buffer, while the former efficiently stores
flushed buffers.</p>
<h3 data-number="1.3.1" id="insertion"><span
class="header-section-number">1.3.1</span> Insertion</h3>
<p>Each insert goes directly to a Memtable, which is a Skip List under
the hood, so the response time is quite fast. Once the Memtable grows
past a size threshold, it is made immutable by appending it to the
<em>immutable memtables LIFO list</em> and is replaced with a new
mutable one.</p>
<p>The immutable memtable list is asynchronously consumed by a
background thread, which takes the next available memtable and creates a
disk-resident SSTable with its content.</p>
<h3 data-number="1.3.2" id="lookup"><span
class="header-section-number">1.3.2</span> Lookup</h3>
<p>While looking for a key, we proceed as follows:</p>
<ol type="1">
<li>Look into the in-memory buffer: if the key was written recently it
is likely here. If not present, continue;</li>
<li>Look into the immutable memtables list, iterating from the most
recent to the oldest. If not present, continue;</li>
<li>Look into the disk tables, iterating from the most recent to the
oldest. If not present, return null.</li>
</ol>
<h3 data-number="1.3.3" id="deletions"><span
class="header-section-number">1.3.3</span> Deletions</h3>
<p>To delete a key, we do not need to remove all its copies from the
on-disk tables; we just need a special value called a
<em>tombstone</em>. Hence a deletion is the same as an insertion, but
with the value set to null. While looking for a key, if we encounter a
null value we simply return null as the result.</p>
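<p>Insertion, lookup and deletion can be tied together in a small
in-memory model. Several simplifications here are assumptions: a
<code>TreeMap</code> stands in for both the Skip List and the on-disk
SSTables, flushing is an explicit <code>flushOne()</code> call rather
than a background thread, and the oldest immutable memtable is flushed
first. The class and method names are hypothetical.</p>

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.TreeMap;

/** In-memory model of the tree; TreeMap stands in for skip lists and SSTables. */
final class LsmTreeSketch {
    private final int memtableLimit;
    private TreeMap<String, String> memtable = new TreeMap<>();
    // Newest first: immutable memtables waiting to be flushed.
    private final Deque<TreeMap<String, String>> immutables = new ArrayDeque<>();
    // Newest first: stand-ins for disk-resident tables.
    private final Deque<TreeMap<String, String>> tables = new ArrayDeque<>();

    LsmTreeSketch(int memtableLimit) {
        this.memtableLimit = memtableLimit;
    }

    void put(String key, String value) {
        memtable.put(key, value);
        if (memtable.size() >= memtableLimit) {
            // Freeze the full memtable and start a fresh mutable one.
            immutables.addFirst(memtable);
            memtable = new TreeMap<>();
        }
    }

    // A deletion is an insertion of a null value (the tombstone).
    void delete(String key) {
        put(key, null);
    }

    // Done by a background thread in the described design; explicit here for simplicity.
    void flushOne() {
        TreeMap<String, String> oldest = immutables.pollLast();
        if (oldest != null) tables.addFirst(oldest);
    }

    String get(String key) {
        // containsKey distinguishes "tombstone found" from "key not in this map".
        if (memtable.containsKey(key)) return memtable.get(key);
        for (TreeMap<String, String> m : immutables) {
            if (m.containsKey(key)) return m.get(key);
        }
        for (TreeMap<String, String> t : tables) {
            if (t.containsKey(key)) return t.get(key);
        }
        return null;
    }
}
```

<p>Because every source is searched from newest to oldest, a tombstone
written after a value hides that value without any on-disk table being
rewritten.</p>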
<h3 data-number="1.3.4" id="sstable-compaction"><span
class="header-section-number">1.3.4</span> SSTable Compaction</h3>
<p>The most expensive operation while looking for a key is certainly the
disk search, and this is why bloom filters are crucial for negative
lookup on SSTables. But no bloom filter can save us if too many tables
are available to search, hence we need <em>compaction</em>.</p>
<p>When flushing a Memtable, we create an SSTable of level one. When the
first level reaches a certain threshold, all its tables are merged into
a level-two table, and so on. This permits us to save storage and query
fewer tables in lookups.</p>
<p>Note that this style of compaction is not standard: more
sophisticated techniques exist, but for the sake of this project this
simple level-like compaction works well.</p>
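<p>A minimal sketch of this threshold-based merging is shown below. It
is an assumption-laden model rather than the project's code:
<code>TreeMap</code> stands in for an SSTable, tombstones (null values)
are kept during merges, and the names <code>CompactionSketch</code>,
<code>addFlushedTable</code> and <code>threshold</code> are
hypothetical.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Threshold-based level merging; TreeMap stands in for an SSTable. */
final class CompactionSketch {
    private final int threshold;
    // levels.get(i) holds the tables of level i + 1, newest first.
    final List<List<TreeMap<String, String>>> levels = new ArrayList<>();

    CompactionSketch(int threshold) {
        if (threshold < 2) throw new IllegalArgumentException("threshold must be at least 2");
        this.threshold = threshold;
    }

    // A flushed memtable always enters the first level.
    void addFlushedTable(TreeMap<String, String> table) {
        addToLevel(0, table);
    }

    private void addToLevel(int level, TreeMap<String, String> table) {
        while (levels.size() <= level) levels.add(new ArrayList<>());
        List<TreeMap<String, String>> tables = levels.get(level);
        tables.add(0, table);
        if (tables.size() >= threshold) {
            // Merge the whole level into one table and push it to the next level.
            TreeMap<String, String> merged = merge(tables);
            tables.clear();
            addToLevel(level + 1, merged);
        }
    }

    // Apply tables from oldest to newest so that newer values overwrite older ones.
    static TreeMap<String, String> merge(List<TreeMap<String, String>> newestFirst) {
        TreeMap<String, String> merged = new TreeMap<>();
        for (int i = newestFirst.size() - 1; i >= 0; i--) {
            merged.putAll(newestFirst.get(i));
        }
        return merged;
    }
}
```

<p>Merging collapses duplicate keys into their newest value, which is
where the storage saving comes from, and leaves fewer tables for a lookup
to visit.</p>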
<hr />
<h1 data-number="2" id="benchmarks"><span
class="header-section-number">2</span> Benchmarks</h1>
<p>I am using <a
href="https://openjdk.java.net/projects/code-tools/jmh/">JMH</a> to run
the benchmarks. The results were obtained on an AMD Ryzen™ 5 4600H with
16GB of RAM and a 512GB SSD.</p>
<p>To run them use <code>./gradlew jmh</code>.</p>
<h2 data-number="2.1" id="sstable-1"><span
class="header-section-number">2.1</span> SSTable</h2>
<ul>
<li>Negative access: the key is not present in the table, hence the
Bloom filter will likely stop the search;</li>
<li>Random access: the key is present in the table, the order of the
keys is random.</li>
</ul>
<pre><code>
Benchmark                                       Mode  Cnt        Score       Error  Units
c.t.l.sstable.SSTableBenchmark.negativeAccess  thrpt    5  3316202.976 ± 32851.546  ops/s
c.t.l.sstable.SSTableBenchmark.randomAccess    thrpt    5     7989.945 ±    40.689  ops/s
</code></pre>
<h2 data-number="2.2" id="bloom-filter"><span
class="header-section-number">2.2</span> Bloom filter</h2>
<ul>
<li>Add: add keys to a 1M keys Bloom filter with 0.01 false positive
rate;</li>
<li>Contains: test whether the keys are present in the Bloom
filter.</li>
</ul>
<pre><code>Benchmark                                   Mode  Cnt        Score        Error  Units
c.t.l.bloom.BloomFilterBenchmark.add       thrpt    5  3190753.307 ±  74744.764  ops/s
c.t.l.bloom.BloomFilterBenchmark.contains  thrpt    5  3567392.634 ± 220377.613  ops/s
</code></pre>
<h2 data-number="2.3" id="skip-list-1"><span
class="header-section-number">2.3</span> Skip-List</h2>
<ul>
<li>Get: get keys from a 100k keys skip-list;</li>
<li>Add/Remove: add and remove keys from a 100k keys skip-list.</li>
</ul>
<pre><code>
Benchmark                                    Mode  Cnt       Score      Error  Units
c.t.l.memtable.SkipListBenchmark.addRemove  thrpt    5  430239.471 ± 4825.990  ops/s
c.t.l.memtable.SkipListBenchmark.get        thrpt    5  487265.620 ± 8201.227  ops/s
</code></pre>
<h2 data-number="2.4" id="tree-1"><span
class="header-section-number">2.4</span> Tree</h2>
<ul>
<li>Get: get elements from a tree with 1M keys;</li>
<li>Add: add 1M distinct elements to a tree with a memtable size of
2^18</li>
</ul>
<pre><code>Benchmark                            Mode  Cnt       Score       Error  Units
c.t.l.tree.LSMTreeAddBenchmark.add  thrpt    5  540788.751 ± 54491.134  ops/s
c.t.l.tree.LSMTreeGetBenchmark.get  thrpt    5    9426.951 ±   241.190  ops/s
</code></pre>
<hr />
<h2 data-number="2.5" id="possible-improvements"><span
class="header-section-number">2.5</span> Possible improvements</h2>
<p>There is certainly room for improvement in this project:</p>
<ol type="1">
<li>Blocked bloom filters: a variant of the classic array-like bloom
filter that is more cache-efficient;</li>
<li>Search fingers in the Skip list: the idea is to keep a pointer to
the last search, and start from there with subsequent queries;</li>
<li>Proper level compaction in the LSM tree;</li>
<li>Write-ahead log for insertions: without it, a crash makes all the
in-memory writes disappear;</li>
<li>Proper recovery: handle crashes and reboots, using existing SSTables
and the write-ahead log.</li>
</ol>
<p>I don’t have the time to do all of this; perhaps the first two points
will be handled in the future.</p>
<hr />
<h2 data-number="2.6" id="references"><span
class="header-section-number">2.6</span> References</h2>
<ul>
<li><a href="https://www.databass.dev/">Database Internals</a> by Alex
Petrov, specifically chapters about Log-Structured Storage and File
Formats;</li>
<li><a
href="https://api.drum.lib.umd.edu/server/api/core/bitstreams/17176ef8-8330-4a6c-8b75-4cd18c570bec/content">A
Skip List Cookbook</a> by William Pugh.</li>
</ul>
<hr />
<p><em>If you found this useful or interesting do not hesitate to ask
clarifying questions or get in touch!</em></p>