updated readme

tomfran · Oct 27, 2023 · 0c61094
1 parent e5120a1
commit 0c61094
Showing 3 changed files with 533 additions and 19 deletions.
diff --git a/README.md b/README.md
@@ -15,14 +15,12 @@ An implementation of the Log-Structured Merge Tree (LSM tree) data structure in
 5. [Possible future improvements](#possible-improvements)
 6. [References](#references)
 
-## Console
+### Console
 
 To interact with a toy tree you can use `./gradlew run -q` to spawn a console.
 
 ![console.png](misc%2Fconsole.png)
 
----
-
 # Architecture
 
 Architecture overview, from SSTables, which are the disk-resident portion of the database, Skip Lists, used
@@ -86,8 +84,6 @@ To save space, all integers are stored
 in [variable-length encoding](https://nlp.stanford.edu/IR-book/html/htmledition/variable-byte-codes-1.html),
 and offsets in the index are stored as [deltas](https://en.wikipedia.org/wiki/Delta_encoding).
 
----
-
 ## Skip-List
 
 A [skip-list](https://en.wikipedia.org/wiki/Skip_list) is a probabilistic data structure that allows fast search,
@@ -112,8 +108,6 @@ we are looking for. Then we move down to the next level and repeat the process u
 Insertions, deletions, and updates are done by first locating the element, then performing
 the operation on the node. All of them have an average time complexity of `O(log(n))`.
 
----
-
 ## Tree
 
 Having defined SSTables and Skip Lists we can obtain the final structure as a combination of the two.
@@ -157,16 +151,14 @@ tables in lookups.
 Note that this style of compaction is not standard; there are various sophisticated techniques, but for the sake of
 this project this simple level-like compaction works wonders.
 
----
-
 # Benchmarks
 
 I am using [JMH](https://openjdk.java.net/projects/code-tools/jmh/) to run benchmarks,
 the results are obtained on AMD Ryzen™ 5 4600H with 16GB of RAM and 512GB SSD.
 
 To run them use `./gradlew jmh`.
 
-## SSTable
+### SSTable
 
 - Negative access: the key is not present in the table, hence the Bloom filter will likely stop the search;
 - Random access: the key is present in the table, the order of the keys is random.
@@ -179,7 +171,7 @@ c.t.l.sstable.SSTableBenchmark.randomAccess    thrpt    5     7989.945 ±     40
 
 ```
 
-## Bloom filter
+### Bloom filter
 
 - Add: add keys to a 1M keys Bloom filter with 0.01 false positive rate;
 - Contains: test whether the keys are present in the Bloom filter.
@@ -191,7 +183,7 @@ c.t.l.bloom.BloomFilterBenchmark.contains      thrpt    5  3567392.634 ± 220377
 
 ```
 
-## Skip-List
+### Skip-List
 
 - Get: get keys from a 100k keys skip-list;
 - Add/Remove: add and remove keys from a 100k keys skip-list.
@@ -204,7 +196,7 @@ c.t.l.memtable.SkipListBenchmark.get           thrpt    5   487265.620 ±   8201
 
 ```
 
-## Tree
+### Tree
 
 - Get: get elements from a tree with 1M keys;
 - Add: add 1M distinct elements to a tree with a memtable size of 2^18.
@@ -216,8 +208,6 @@ c.t.l.tree.LSMTreeGetBenchmark.get             thrpt    5     9426.951 ±    241
 
 ```
 
----
-
 ## Possible improvements
 
 There is certainly space for improvement on this project:
@@ -231,15 +221,11 @@ There is certainly space for improvement on this project:
 
 I don't have the time to do all of this; perhaps the first two points will be handled in the future.
 
----
-
 ## References
 
 - [Database Internals](https://www.databass.dev/) by Alex Petrov, specifically chapters about Log-Structured Storage and
   File Formats;
 - [A Skip List Cookbook](https://api.drum.lib.umd.edu/server/api/core/bitstreams/17176ef8-8330-4a6c-8b75-4cd18c570bec/content)
   by William Pugh.
 
----
-
 _If you found this useful or interesting do not hesitate to ask clarifying questions or get in touch!_ 
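
The README states that all integers are stored in variable-length encoding and that index offsets are stored as deltas. The following is a minimal sketch of that idea, not the project's code: the class `VarintDeltaSketch` and its methods are hypothetical, and the byte layout (7 payload bits per byte, low bits first, high bit set when more bytes follow) is an assumption, since the diff does not show the layout the project actually uses.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper for illustration; not a class from the repository.
final class VarintDeltaSketch {

    // Writes an int using 7 payload bits per byte (assumed layout).
    // The high bit is set on every byte except the last one.
    static void writeVarint(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Encodes ascending offsets as o_1, o_2 - o_1, ..., o_n - o_(n-1).
    static byte[] encodeOffsets(int[] offsets) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int offset : offsets) {
            writeVarint(out, offset - previous);
            previous = offset;
        }
        return out.toByteArray();
    }

    // Reverses encodeOffsets, given how many offsets were written.
    static int[] decodeOffsets(byte[] bytes, int count) {
        int[] offsets = new int[count];
        int position = 0;
        int previous = 0;
        for (int i = 0; i < count; i++) {
            int value = 0;
            int shift = 0;
            int current;
            do {
                current = bytes[position++] & 0xFF;
                value |= (current & 0x7F) << shift;
                shift += 7;
            } while ((current & 0x80) != 0);
            previous += value; // undo the delta step
            offsets[i] = previous;
        }
        return offsets;
    }
}
```

Because deltas between neighbouring offsets are small, most of them fit in one or two bytes, whereas a fixed-width `int` always takes four.
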
diff --git a/out.html b/out.html
@@ -0,0 +1,264 @@
+<p>An implementation of the Log-Structured Merge Tree (LSM tree) data
+structure in Java.</p>
+<p><strong>Table of Contents</strong></p>
+<ol type="1">
+<li><a href="#architecture">Architecture</a>
+<ol type="1">
+<li><a href="#sstable">SSTable</a></li>
+<li><a href="#skip-list">Skip-List</a></li>
+<li><a href="#tree">Tree</a></li>
+</ol></li>
+<li><a href="#Benchmarks">Benchmarks</a>
+<ol type="1">
+<li><a href="#sstable-1">SSTable</a></li>
+<li><a href="#skip-list-1">Skip-List</a></li>
+<li><a href="#tree-1">Tree</a></li>
+</ol></li>
+<li><a href="#possible-improvements">Possible future
+improvements</a></li>
+<li><a href="#references">References</a></li>
+</ol>
+<p>To interact with a toy tree you can use <code>./gradlew run -q</code>
+to spawn a console.</p>
+<figure>
+<img src="misc%2Fconsole.png" alt="console.png" />
+<figcaption aria-hidden="true">console.png</figcaption>
+</figure>
+<hr />
+<h1 data-number="1" id="architecture"><span
+class="header-section-number">1</span> Architecture</h1>
+<p>Architecture overview, from SSTables, which are the disk-resident
+portion of the database, Skip Lists, used as memory buffers, and finally
+to the combination of the two to create insertion, lookup and deletion
+primitives.</p>
+<h2 data-number="1.1" id="sstable"><span
+class="header-section-number">1.1</span> SSTable</h2>
+<p>Sorted String Table (SSTable) is a collection of files modelling
+key-value pairs in sorted order by key. It is used as a persistent
+storage for the LSM tree.</p>
+<h3 data-number="1.1.1" id="components"><span
+class="header-section-number">1.1.1</span> Components</h3>
+<ul>
+<li><em>Data</em>: key-value pairs in sorted order by key, stored in a
+file;</li>
+<li><em>Sparse index</em>: sparse index containing key and offset of the
+corresponding key-value pair in the data;</li>
+<li><em>Bloom filter</em>: a <a
+href="https://en.wikipedia.org/wiki/Bloom_filter">probabilistic data
+structure</a> used to test whether a key is in the SSTable.</li>
+</ul>
+<h3 data-number="1.1.2" id="key-lookup"><span
+class="header-section-number">1.1.2</span> Key lookup</h3>
+<p>The basic idea is to use the sparse index to find the key-value pair
+in the data file. The steps are:</p>
+<ol type="1">
+<li>Use the Bloom filter to test whether the key might be in the
+table;</li>
+<li>If the key might be present, use binary search on the index to find
+the maximum lower bound of the key;</li>
+<li>Scan the data from the position found in the previous step to find
+the key-value pair. The search can stop when we are seeing a key greater
+than the one we are looking for, or when we reach the end of the
+table.</li>
+</ol>
+<p>The search is as lazy as possible, meaning that we read the minimum
+amount of data from disk. For instance, if the next key length is
+smaller than the one we are looking for, we can skip the whole key-value
+pair.</p>
+<h3 data-number="1.1.3" id="persistence"><span
+class="header-section-number">1.1.3</span> Persistence</h3>
+<p>A table is persisted to disk when it is created. A base filename is
+defined, and three files are present:</p>
+<ul>
+<li><code>&lt;base_filename&gt;.data</code>: data file;</li>
+<li><code>&lt;base_filename&gt;.index</code>: index file;</li>
+<li><code>&lt;base_filename&gt;.bloom</code>: bloom filter file.</li>
+</ul>
+<p><strong>Data format</strong></p>
+<ul>
+<li><code>n</code>: number of key-value pairs;</li>
+<li><code>&lt;key_len_1, value_len_1, key_1, value_1, ... key_n, value_n&gt;</code>:
+key-value pairs.</li>
+</ul>
+<p><strong>Index format</strong></p>
+<ul>
+<li><code>s</code>: number of entries in the whole table;</li>
+<li><code>n</code>: number of entries in the index;</li>
+<li><code>o_1, o_2 - o_1, ..., o_n - o_n-1</code>: offsets of the
+key-value pairs in the data file, skipping the first one;</li>
+<li><code>s_1, s_2, ..., s_n</code>: remaining keys after a sparse index
+entry, used to exit from search;</li>
+<li><code>&lt;key_len_1, key_1, ... key_len_n, key_n&gt;</code>: keys in
+the index.</li>
+</ul>
+<p><strong>Filter format</strong></p>
+<ul>
+<li><code>m</code>: number of bits in the bloom filter;</li>
+<li><code>k</code>: number of hash functions;</li>
+<li><code>n</code>: size of underlying long array;</li>
+<li><code>b_1, b_2, ..., b_n</code>: bits of the bloom filter.</li>
+</ul>
+<p>To save space, all integers are stored in <a
+href="https://nlp.stanford.edu/IR-book/html/htmledition/variable-byte-codes-1.html">variable-length
+encoding</a>, and offsets in the index are stored as <a
+href="https://en.wikipedia.org/wiki/Delta_encoding">deltas</a>.</p>
+<hr />
+<h2 data-number="1.2" id="skip-list"><span
+class="header-section-number">1.2</span> Skip-List</h2>
+<p>A <a href="https://en.wikipedia.org/wiki/Skip_list">skip-list</a> is
+a probabilistic data structure that allows fast search, insertion and
+deletion of elements in a sorted sequence.</p>
+<p>In the LSM tree, it is used as an in-memory data structure to store
+key-value pairs in sorted order by key. Once the skip-list reaches a
+certain size, it is flushed to disk as an SSTable.</p>
+<h3 data-number="1.2.1" id="operations-details"><span
+class="header-section-number">1.2.1</span> Operations details</h3>
+<p>The idea of a skip list is similar to a classic linked list. We have
+nodes with forward pointers, but also levels. We can think about a level
+as a fast lane between nodes. By carefully constructing them at
+insertion time, searches are faster, as they can use higher levels to
+skip unwanted nodes.</p>
+<p>Given <code>n</code> elements, a skip list has <code>log(n)</code>
+levels, the first level containing all the elements. By increasing the
+level, the number of elements is cut roughly by half.</p>
+<p>To locate an element, we start from the top level and move forward
+until we find an element greater than the one we are looking for. Then
+we move down to the next level and repeat the process until we find the
+element.</p>
+<p>Insertions, deletions, and updates are done by first locating the
+element, then performing the operation on the node. All of them have an
+average time complexity of <code>O(log(n))</code>.</p>
+<hr />
+<h2 data-number="1.3" id="tree"><span
+class="header-section-number">1.3</span> Tree</h2>
+<p>Having defined SSTables and Skip Lists we can obtain the final
+structure as a combination of the two. The main idea is to use the
+latter as an in-memory buffer, while the former efficiently stores
+flushed buffers.</p>
+<h3 data-number="1.3.1" id="insertion"><span
+class="header-section-number">1.3.1</span> Insertion</h3>
+<p>Each insert goes directly to a Memtable, which is a Skip List under
+the hood, so the response time is quite fast. There exists a threshold,
+over which the mutable structure is made immutable by appending it to
+the <em>immutable memtables LIFO list</em> and replaced with a new
+mutable list.</p>
+<p>The immutable memtable list is asynchronously consumed by a
+background thread, which takes the next available list and creates a
+disk-resident SSTable with its content.</p>
+<h3 data-number="1.3.2" id="lookup"><span
+class="header-section-number">1.3.2</span> Lookup</h3>
+<p>While looking for a key, we proceed as follows:</p>
+<ol type="1">
+<li>Look into the in-memory buffer, if the key is recently written it is
+likely here, if not present continue;</li>
+<li>Look into the immutable memtables list, iterating from the most
+recent to the oldest, if not present continue;</li>
+<li>Look into disk tables, iterating from the most recent one to the
+oldest, if not present return null.</li>
+</ol>
+<h3 data-number="1.3.3" id="deletions"><span
+class="header-section-number">1.3.3</span> Deletions</h3>
+<p>To delete a key, we do not need to delete all its replicas from the
+on-disk tables; we just need a special value called <em>tombstone</em>.
+Hence a deletion is the same as an insertion, but with a value set to
+null. While looking for a key, if we encounter a null value we simply
+return null as a result.</p>
+<h3 data-number="1.3.4" id="sstable-compaction"><span
+class="header-section-number">1.3.4</span> SSTable Compaction</h3>
+<p>The most expensive operation while looking for a key is certainly the
+disk search, and this is why bloom filters are crucial for negative
+lookup on SSTables. But no bloom filter can save us if too many tables
+are available to search, hence we need <em>compaction</em>.</p>
+<p>When flushing a Memtable, we create an SSTable of level one. When the
+first level reaches a certain threshold, all its tables are merged into
+a level-two table, and so on. This permits us to save storage and query
+fewer tables in lookups.</p>
+<p>Note that this style of compaction is not standard; there are various
+sophisticated techniques, but for the sake of this project this simple
+level-like compaction works wonders.</p>
+<hr />
+<h1 data-number="2" id="benchmarks"><span
+class="header-section-number">2</span> Benchmarks</h1>
+<p>I am using <a
+href="https://openjdk.java.net/projects/code-tools/jmh/">JMH</a> to run
+benchmarks, the results are obtained on AMD Ryzen™ 5 4600H with 16GB of
+RAM and 512GB SSD.</p>
+<p>To run them use <code>./gradlew jmh</code>.</p>
+<h2 data-number="2.1" id="sstable-1"><span
+class="header-section-number">2.1</span> SSTable</h2>
+<ul>
+<li>Negative access: the key is not present in the table, hence the
+Bloom filter will likely stop the search;</li>
+<li>Random access: the key is present in the table, the order of the
+keys is random.</li>
+</ul>
+<pre><code>
+Benchmark                                       Mode  Cnt        Score        Error  Units
+c.t.l.sstable.SSTableBenchmark.negativeAccess  thrpt    5  3316202.976 ±  32851.546  ops/s
+c.t.l.sstable.SSTableBenchmark.randomAccess    thrpt    5     7989.945 ±     40.689  ops/s
+</code></pre>
+<h2 data-number="2.2" id="bloom-filter"><span
+class="header-section-number">2.2</span> Bloom filter</h2>
+<ul>
+<li>Add: add keys to a 1M keys Bloom filter with 0.01 false positive
+rate;</li>
+<li>Contains: test whether the keys are present in the Bloom
+filter.</li>
+</ul>
+<pre><code>Benchmark                                       Mode  Cnt        Score        Error  Units
+c.t.l.bloom.BloomFilterBenchmark.add           thrpt    5  3190753.307 ±  74744.764  ops/s
+c.t.l.bloom.BloomFilterBenchmark.contains      thrpt    5  3567392.634 ± 220377.613  ops/s
+</code></pre>
+<h2 data-number="2.3" id="skip-list-1"><span
+class="header-section-number">2.3</span> Skip-List</h2>
+<ul>
+<li>Get: get keys from a 100k keys skip-list;</li>
+<li>Add/Remove: add and remove keys from a 100k keys skip-list.</li>
+</ul>
+<pre><code>
+Benchmark                                       Mode  Cnt        Score        Error  Units
+c.t.l.memtable.SkipListBenchmark.addRemove     thrpt    5   430239.471 ±   4825.990  ops/s
+c.t.l.memtable.SkipListBenchmark.get           thrpt    5   487265.620 ±   8201.227  ops/s
+</code></pre>
+<h2 data-number="2.4" id="tree-1"><span
+class="header-section-number">2.4</span> Tree</h2>
+<ul>
+<li>Get: get elements from a tree with 1M keys;</li>
+<li>Add: add 1M distinct elements to a tree with a memtable size of
+2^18.</li>
+</ul>
+<pre><code>Benchmark                                       Mode  Cnt        Score        Error  Units
+c.t.l.tree.LSMTreeAddBenchmark.add             thrpt    5   540788.751 ±  54491.134  ops/s
+c.t.l.tree.LSMTreeGetBenchmark.get             thrpt    5     9426.951 ±    241.190  ops/s
+</code></pre>
+<hr />
+<h2 data-number="2.5" id="possible-improvements"><span
+class="header-section-number">2.5</span> Possible improvements</h2>
+<p>There is certainly space for improvement on this project:</p>
+<ol type="1">
+<li>Blocked bloom filters: a variant of the classic array-like bloom
+filter that is more cache efficient;</li>
+<li>Search fingers in the Skip list: the idea is to keep a pointer to
+the last search, and start from there with subsequent queries;</li>
+<li>Proper level compaction in the LSM tree;</li>
+<li>Write-ahead log for the insertions: without it, a crash makes all
+the in-memory writes disappear;</li>
+<li>Proper recovery: handle crashes and reboots, using existing SSTables
+and the write-ahead log.</li>
+</ol>
+<p>I don’t have the time to do all of this; perhaps the first
+two points will be handled in the future.</p>
+<hr />
+<h2 data-number="2.6" id="references"><span
+class="header-section-number">2.6</span> References</h2>
+<ul>
+<li><a href="https://www.databass.dev/">Database Internals</a> by Alex
+Petrov, specifically chapters about Log-Structured Storage and File
+Formats;</li>
+<li><a
+href="https://api.drum.lib.umd.edu/server/api/core/bitstreams/17176ef8-8330-4a6c-8b75-4cd18c570bec/content">A
+Skip List Cookbook</a> by William Pugh.</li>
+</ul>
+<hr />
+<p><em>If you found this useful or interesting do not hesitate to ask
+clarifying questions or get in touch!</em></p>
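
The lookup order and tombstone handling described in the Tree section of the rendered README can be sketched roughly as follows. This is a single-threaded illustration under stated assumptions, not the repository's implementation: `TinyLsmSketch` and all of its members are hypothetical, `TreeMap` stands in for both the skip list and the on-disk SSTables, and `flushOne` plays the role of the background flushing thread.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.TreeMap;

// Hypothetical sketch; not the project's LSM tree class.
final class TinyLsmSketch {
    private final int memtableLimit;
    // TreeMap stands in for the skip list; it permits null values (tombstones).
    private TreeMap<String, String> memtable = new TreeMap<>();
    // Immutable memtables, most recent first.
    private final Deque<TreeMap<String, String>> immutables = new ArrayDeque<>();
    // Stand-ins for disk tables, most recent first.
    private final Deque<TreeMap<String, String>> tables = new ArrayDeque<>();

    TinyLsmSketch(int memtableLimit) {
        this.memtableLimit = memtableLimit;
    }

    void put(String key, String value) {
        memtable.put(key, value);
        if (memtable.size() >= memtableLimit) {
            // Freeze the full memtable and start a fresh mutable one.
            immutables.addFirst(memtable);
            memtable = new TreeMap<>();
        }
    }

    // A deletion is an insertion whose value is the tombstone (null).
    void delete(String key) {
        put(key, null);
    }

    // Stand-in for the background thread: flush the oldest immutable memtable.
    void flushOne() {
        TreeMap<String, String> oldest = immutables.pollLast();
        if (oldest != null) {
            tables.addFirst(oldest);
        }
    }

    // Checks memtable, then immutable memtables, then tables, newest first.
    // The first structure containing the key wins, even if it holds a tombstone.
    String get(String key) {
        if (memtable.containsKey(key)) {
            return memtable.get(key);
        }
        for (TreeMap<String, String> frozen : immutables) {
            if (frozen.containsKey(key)) {
                return frozen.get(key);
            }
        }
        for (TreeMap<String, String> table : tables) {
            if (table.containsKey(key)) {
                return table.get(key);
            }
        }
        return null;
    }
}
```

Using `containsKey` before `get` is what lets a tombstone in a newer structure hide an older value: the search stops at the first structure that knows the key, so stale copies in older tables are never consulted.
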