diff --git a/README.md b/README.md index 1ba82f1..fca67c8 100644 --- a/README.md +++ b/README.md @@ -15,14 +15,12 @@ An implementation of the Log-Structured Merge Tree (LSM tree) data structure in 5. [Possible future improvements](#possible-improvements) 6. [References](#references) -## Console +### Console To interact with a toy tree you can use `./gradlew run -q` to spawn a console. ![console.png](misc%2Fconsole.png) ---- - # Architecture Architecture overview, from SSTables, which are the disk-resident portion of the database, Skip Lists, used @@ -86,8 +84,6 @@ To save space, all integers are stored in [variable-length encoding](https://nlp.stanford.edu/IR-book/html/htmledition/variable-byte-codes-1.html), and offsets in the index are stored as [deltas](https://en.wikipedia.org/wiki/Delta_encoding). ---- - ## Skip-List A [skip-list](https://en.wikipedia.org/wiki/Skip_list) is a probabilistic data structure that allows fast search, @@ -112,8 +108,6 @@ we are looking for. Then we move down to the next level and repeat the process u Insertions, deletions, and updates are done by first locating the element, then performing the operation on the node. All of them have an average time complexity of `O(log(n))`. ---- - ## Tree Having defined SSTables and Skip Lists we can obtain the final structure as a combination of the two. @@ -157,8 +151,6 @@ tables in lookups. Note that this style of compaction is not standard, there are various sophisticated techniques, but for the sake of this project this simple level-like compaction works wonders. ---- - # Benchmarks I am using [JMH](https://openjdk.java.net/projects/code-tools/jmh/) to run benchmarks, @@ -166,7 +158,7 @@ the results are obtained on AMD Ryzen™ 5 4600H with 16GB of RAM and 512GB SSD. To run them use `./gradlew jmh`. 
-## SSTable +### SSTable - Negative access: the key is not present in the table, hence the Bloom filter will likely stop the search; - Random access: the key is present in the table, the order of the keys is random. @@ -179,7 +171,7 @@ c.t.l.sstable.SSTableBenchmark.randomAccess thrpt 5 7989.945 ± 40 ``` -## Bloom filter +### Bloom filter - Add: add keys to a 1M keys Bloom filter with 0.01 false positive rate; - Contains: test whether the keys are present in the Bloom filter. @@ -191,7 +183,7 @@ c.t.l.bloom.BloomFilterBenchmark.contains thrpt 5 3567392.634 ± 220377 ``` -## Skip-List +### Skip-List - Get: get keys from a 100k keys skip-list; - Add/Remove: add and remove keys from a 100k keys skip-list. @@ -204,7 +196,7 @@ c.t.l.memtable.SkipListBenchmark.get thrpt 5 487265.620 ± 8201 ``` -## Tree +### Tree - Get: get elements from a tree with 1M keys; - Add: add 1M distinct elements to a tree with a memtable size of 2^18 @@ -216,8 +208,6 @@ c.t.l.tree.LSMTreeGetBenchmark.get thrpt 5 9426.951 ± 241 ``` ---- - ## Possible improvements There is certainly space for improvement on this project: @@ -231,8 +221,6 @@ There is certainly space for improvement on this project: I don't have the practical time to do all of this, perhaps the first two points will be handled in the future. ---- - ## References - [Database Internals](https://www.databass.dev/) by Alex Petrov, specifically chapters about Log-Structured Storage and @@ -240,6 +228,4 @@ I don't have the practical time to do all of this, perhaps the first two points - [A Skip List Cookbook](https://api.drum.lib.umd.edu/server/api/core/bitstreams/17176ef8-8330-4a6c-8b75-4cd18c570bec/content) by William Pugh. ---- - _If you found this useful or interesting do not hesitate to ask clarifying questions or get in touch!_ diff --git a/out.html b/out.html new file mode 100644 index 0000000..9c6e7c0 --- /dev/null +++ b/out.html @@ -0,0 +1,264 @@ +

An implementation of the Log-Structured Merge Tree (LSM tree) data +structure in Java.

+

Table of Contents

+
    +
  1. Architecture
    1. SSTable
    2. Skip-List
    3. Tree
  2. Benchmarks
    1. SSTable
    2. Skip-List
    3. Tree
  3. Possible future +improvements
  4. References
+

To interact with a toy tree you can use ./gradlew run -q +to spawn a console.

+
+console.png + +
+
+

1 Architecture

+

An overview of the architecture: SSTables, which are the disk-resident +portion of the database; Skip Lists, used as in-memory buffers; and finally +the combination of the two to provide insertion, lookup, and deletion +primitives.

+

1.1 SSTable

+

Sorted String Table (SSTable) is a collection of files modelling +key-value pairs in sorted order by key. It is used as a persistent +storage for the LSM tree.

+

1.1.1 Components

+ +

1.1.2 Key lookup

+

The basic idea is to use the sparse index to find the key-value pair +in the data file. The steps are:

+
    +
  1. Use the Bloom filter to test whether the key might be in the +table;
  2. If the key might be present, use binary search on the index to find +the greatest indexed key that is less than or equal to it;
  3. Scan the data from the position found in the previous step to find +the key-value pair. The search can stop when we see a key greater +than the one we are looking for, or when we reach the end of the +table.
+

The search is as lazy as possible, meaning that we read the minimum +amount of data from disk. For instance, if the length of the next key is +smaller than that of the key we are looking for, we can skip the whole key-value +pair.
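
The lookup steps above can be sketched in Java as follows. This is a minimal in-memory illustration, not the project's code: the class name `SparseIndexLookupSketch`, the `sampleEvery` parameter, the arrays standing in for the data file, and the single-hash `BitSet` filter standing in for the real Bloom filter are all assumptions made for the example.

```java
import java.util.BitSet;

/** In-memory stand-in for an SSTable: sorted entries, a sparse index and a membership filter. */
class SparseIndexLookupSketch {
    private static final int FILTER_BITS = 1024;

    private final String[] keys;      // sorted keys, standing in for the data file
    private final String[] values;    // values[i] belongs to keys[i]
    private final int[] indexOffsets; // positions of the sampled keys (the sparse index)
    private final BitSet filter = new BitSet(FILTER_BITS); // assumption: one hash function only

    SparseIndexLookupSketch(String[] sortedKeys, String[] values, int sampleEvery) {
        this.keys = sortedKeys;
        this.values = values;
        int samples = (sortedKeys.length + sampleEvery - 1) / sampleEvery;
        this.indexOffsets = new int[samples];
        for (int i = 0; i < samples; i++) indexOffsets[i] = i * sampleEvery;
        for (String k : sortedKeys) filter.set(slot(k));
    }

    private static int slot(String key) {
        return Math.floorMod(key.hashCode(), FILTER_BITS);
    }

    String get(String key) {
        // Step 1: the filter rules out keys that are definitely absent.
        if (!filter.get(slot(key))) return null;

        // Step 2: binary search the sparse index for the last sampled key <= key.
        int lo = 0, hi = indexOffsets.length - 1, start = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (keys[indexOffsets[mid]].compareTo(key) <= 0) {
                start = indexOffsets[mid];
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (start < 0) return null; // key is smaller than every key in the table

        // Step 3: scan forward, stopping at the first key greater than the target.
        for (int i = start; i < keys.length; i++) {
            int cmp = keys[i].compareTo(key);
            if (cmp == 0) return values[i];
            if (cmp > 0) break;
        }
        return null;
    }

    public static void main(String[] args) {
        String[] k = {"apple", "banana", "cherry", "date", "fig", "grape"};
        String[] v = {"1", "2", "3", "4", "5", "6"};
        SparseIndexLookupSketch table = new SparseIndexLookupSketch(k, v, 3);
        System.out.println(table.get("date")); // 4
        System.out.println(table.get("kiwi")); // null
    }
}
```

With `sampleEvery = 3` the index holds only `apple` and `date`, so a lookup for `fig` lands on `date` and scans two entries instead of the whole table.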

+

1.1.3 Persistence

+

A table is persisted to disk when it is created. A base filename is +defined, and three files are present:

+ +

Data format

+ +

Index format

+ +

Filter format

+ +

To save space, all integers are stored in variable-length +encoding, and offsets in the index are stored as deltas.
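
To make the space saving concrete, here is a small Java sketch of variable-length integers combined with delta-encoded offsets. It assumes a little-endian base-128 layout (7 payload bits per byte, high bit as a continuation flag); the project's exact byte layout may differ, and the names `VarintDeltaSketch`, `writeVarint`, `readVarint`, `encodeOffsets` and `decodeOffsets` are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

class VarintDeltaSketch {
    /** Appends value as a base-128 varint: high bit set on every byte except the last. */
    static void writeVarint(ByteArrayOutputStream out, long value) {
        if (value < 0) throw new IllegalArgumentException("negative values are not handled in this sketch");
        while (value >= 0x80) {
            out.write((int) (value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write((int) value);
    }

    /** Reads one varint starting at pos[0] and advances pos[0] past it. */
    static long readVarint(byte[] buf, int[] pos) {
        long result = 0;
        int shift = 0;
        while (true) {
            byte b = buf[pos[0]++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
            shift += 7;
        }
    }

    /** Encodes non-decreasing offsets as varint deltas from the previous offset. */
    static byte[] encodeOffsets(long[] offsets) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long prev = 0;
        for (long off : offsets) {
            writeVarint(out, off - prev);
            prev = off;
        }
        return out.toByteArray();
    }

    /** Rebuilds the absolute offsets by summing the deltas back up. */
    static long[] decodeOffsets(byte[] buf, int count) {
        long[] offsets = new long[count];
        int[] pos = {0};
        long prev = 0;
        for (int i = 0; i < count; i++) {
            prev += readVarint(buf, pos);
            offsets[i] = prev;
        }
        return offsets;
    }

    public static void main(String[] args) {
        long[] offsets = {0, 130, 1000, 1005, 70000};
        byte[] encoded = encodeOffsets(offsets);
        System.out.println(encoded.length + " bytes instead of " + offsets.length * Long.BYTES);
        System.out.println(Arrays.toString(decodeOffsets(encoded, offsets.length)));
    }
}
```

Deltas between neighbouring offsets are small numbers, so most of them fit in one or two bytes even when the absolute offsets grow large.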

+
+

1.2 Skip-List

+

A skip-list is +a probabilistic data structure that allows fast search, insertion and +deletion of elements in a sorted sequence.

+

In the LSM tree, it is used as an in-memory data structure to store +key-value pairs in sorted order by key. Once the skip-list reaches a +certain size, it is flushed to disk as an SSTable.

+

1.2.1 Operations details

+

The idea of a skip list is similar to a classic linked list. We have +nodes with forward pointers, but also levels. We can think about a level +as a fast lane between nodes. By carefully constructing them at +insertion time, searches are faster, as they can use higher levels to +skip unwanted nodes.

+

Given n elements, a skip list has about log(n) +levels, with the first level containing all the elements. Each time we go up a +level, the number of elements is cut roughly in half.

+

To locate an element, we start from the top level and move forward +until we find an element greater than the one we are looking for. Then +we move down to the next level and repeat the process until we find the +element.

+

Insertions, deletions, and updates are done by first locating the +element, then performing the operation on the node. All of them have an +average time complexity of O(log(n)).
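
As a rough illustration of the search and insertion described above, here is a minimal skip list sketch in Java. The class name `SkipListSketch`, the integer keys, the `MAX_LEVEL` cap of 16 and the fixed random seed are assumptions made for this example, not details taken from the project.

```java
import java.util.Random;

class SkipListSketch {
    private static final int MAX_LEVEL = 16;

    private static final class Node {
        final int key;
        String value;
        final Node[] next; // next[i] is the successor on level i

        Node(int key, String value, int levels) {
            this.key = key;
            this.value = value;
            this.next = new Node[levels];
        }
    }

    private final Node head = new Node(Integer.MIN_VALUE, null, MAX_LEVEL); // sentinel, its key is never compared
    private final Random random = new Random(42);
    private int levels = 1;

    /** Each extra level is granted with probability 1/2, so level i holds about n / 2^i nodes. */
    private int randomLevel() {
        int lvl = 1;
        while (lvl < MAX_LEVEL && random.nextBoolean()) lvl++;
        return lvl;
    }

    String get(int key) {
        Node cur = head;
        // Start on the top level, move right while the next key is smaller, then drop one level.
        for (int i = levels - 1; i >= 0; i--) {
            while (cur.next[i] != null && cur.next[i].key < key) cur = cur.next[i];
        }
        Node candidate = cur.next[0];
        return (candidate != null && candidate.key == key) ? candidate.value : null;
    }

    void put(int key, String value) {
        Node[] update = new Node[MAX_LEVEL]; // last node visited on each level
        Node cur = head;
        for (int i = levels - 1; i >= 0; i--) {
            while (cur.next[i] != null && cur.next[i].key < key) cur = cur.next[i];
            update[i] = cur;
        }
        Node candidate = cur.next[0];
        if (candidate != null && candidate.key == key) {
            candidate.value = value; // update in place
            return;
        }
        int lvl = randomLevel();
        for (int i = levels; i < lvl; i++) update[i] = head; // newly opened levels start at the head
        if (lvl > levels) levels = lvl;
        Node node = new Node(key, value, lvl);
        for (int i = 0; i < lvl; i++) {
            node.next[i] = update[i].next[i];
            update[i].next[i] = node;
        }
    }

    public static void main(String[] args) {
        SkipListSketch list = new SkipListSketch();
        list.put(3, "c");
        list.put(1, "a");
        list.put(2, "b");
        System.out.println(list.get(2)); // b
        System.out.println(list.get(9)); // null
    }
}
```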

+
+

1.3 Tree

+

Having defined SSTables and Skip Lists we can obtain the final +structure as a combination of the two. The main idea is to use the +latter as an in-memory buffer, while the former efficiently stores +flushed buffers.

+

1.3.1 Insertion

+

Each insert goes directly to a Memtable, which is a Skip List under +the hood, so the response time is quite fast. Once the Memtable grows past a size threshold, +it is made immutable by appending it to +the immutable memtables LIFO list, and it is replaced with a new +mutable one.

+

The immutable memtable list is asynchronously consumed by a +background thread, which takes the next available memtable and creates a +disk-resident SSTable with its content.
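
The rotate-and-flush flow described above could look roughly like the sketch below. It is an illustration under several assumptions: `ConcurrentSkipListMap` from the JDK stands in for the project's own skip list, a `TreeMap` stands in for an on-disk SSTable, the oldest immutable memtable is flushed first, and all names (`MemtableFlushSketch`, `flushOldest`, and so on) are hypothetical.

```java
import java.util.List;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class MemtableFlushSketch {
    private final int threshold;
    private ConcurrentSkipListMap<String, String> mutable = new ConcurrentSkipListMap<>();
    // Newest immutable memtable at the head, oldest at the tail.
    private final ConcurrentLinkedDeque<ConcurrentSkipListMap<String, String>> immutables =
            new ConcurrentLinkedDeque<>();
    // Stand-in for on-disk tables: each flushed memtable becomes a sorted map that is never modified.
    private final List<TreeMap<String, String>> sstables = new CopyOnWriteArrayList<>();
    private final ExecutorService flusher = Executors.newSingleThreadExecutor();

    MemtableFlushSketch(int threshold) {
        this.threshold = threshold;
    }

    synchronized void put(String key, String value) {
        mutable.put(key, value);
        if (mutable.size() >= threshold) {
            immutables.addFirst(mutable);            // freeze the full memtable
            mutable = new ConcurrentSkipListMap<>(); // new writes go to a fresh one
            flusher.execute(this::flushOldest);      // flush in the background
        }
    }

    private void flushOldest() {
        ConcurrentSkipListMap<String, String> oldest = immutables.peekLast();
        if (oldest == null) return;
        sstables.add(new TreeMap<>(oldest)); // "write" the table first...
        immutables.pollLast();               // ...then drop the in-memory copy
    }

    /** Stops accepting flush tasks and waits for the pending ones to finish. */
    void close() {
        flusher.shutdown();
        try {
            flusher.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    int flushedTables() { return sstables.size(); }

    int pendingMemtables() { return immutables.size(); }

    synchronized int mutableSize() { return mutable.size(); }

    public static void main(String[] args) {
        MemtableFlushSketch tree = new MemtableFlushSketch(3);
        for (int i = 0; i < 7; i++) tree.put("k" + i, "v" + i);
        tree.close();
        System.out.println(tree.flushedTables() + " tables flushed, "
                + tree.mutableSize() + " entry still in memory");
    }
}
```

The table is added to `sstables` before the memtable is removed from `immutables`, so at every moment a key is visible in at least one of the two places.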

+

1.3.2 Lookup

+

While looking for a key, we proceed as follows:

+
    +
  1. Look into the in-memory buffer: if the key was recently written it is +likely here. If it is not present, continue;
  2. Look into the immutable memtables list, iterating from the most +recent to the oldest. If it is not present, continue;
  3. Look into the disk tables, iterating from the most recent one to the +oldest. If it is not present, return null.
+
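
The lookup order above can be sketched as a walk over layers from newest to oldest. In this hedged example plain `Map` instances stand in for the memtable, the immutable memtables and the disk tables; the class and field names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;

class LayeredLookupSketch {
    final Map<String, String> memtable = new TreeMap<>();
    // Both deques keep the newest table at the head, so iteration goes newest to oldest.
    final Deque<Map<String, String>> immutableMemtables = new ArrayDeque<>();
    final Deque<Map<String, String>> diskTables = new ArrayDeque<>();

    String get(String key) {
        // 1. The mutable in-memory buffer holds the most recent writes.
        if (memtable.containsKey(key)) return memtable.get(key);
        // 2. Immutable memtables that are waiting to be flushed.
        for (Map<String, String> table : immutableMemtables) {
            if (table.containsKey(key)) return table.get(key);
        }
        // 3. Disk tables; the first hit is the most recent version of the key.
        for (Map<String, String> table : diskTables) {
            if (table.containsKey(key)) return table.get(key);
        }
        return null;
    }

    public static void main(String[] args) {
        LayeredLookupSketch tree = new LayeredLookupSketch();
        Map<String, String> oldTable = new TreeMap<>();
        oldTable.put("a", "old");
        Map<String, String> newTable = new TreeMap<>();
        newTable.put("a", "new");
        tree.diskTables.addFirst(oldTable);
        tree.diskTables.addFirst(newTable);
        System.out.println(tree.get("a")); // new
    }
}
```

Because the search stops at the first layer that contains the key, older versions of the same key are never read.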

1.3.3 Deletions

+

To delete a key, we do not need to delete all its replicas from the +on-disk tables; we just need a special value called a tombstone. +Hence a deletion is the same as an insertion, but with the value set to +null. While looking for a key, if we encounter a null value we simply +return null as a result.
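
A tiny two-layer sketch shows why a tombstone is enough. It assumes, as the text says, that the tombstone is a null value; the class name `TombstoneSketch` and the use of `HashMap` for both layers are choices made only for this example.

```java
import java.util.HashMap;
import java.util.Map;

class TombstoneSketch {
    private final Map<String, String> memtable = new HashMap<>(); // newer layer
    private final Map<String, String> sstable;                    // older layer, treated as read-only

    TombstoneSketch(Map<String, String> flushed) {
        this.sstable = flushed;
    }

    void put(String key, String value) {
        memtable.put(key, value);
    }

    /** Deleting writes a tombstone (null) in the newer layer and leaves the older layer untouched. */
    void delete(String key) {
        memtable.put(key, null);
    }

    String get(String key) {
        // containsKey distinguishes "tombstone stored here" from "never written here".
        if (memtable.containsKey(key)) return memtable.get(key);
        return sstable.get(key);
    }

    public static void main(String[] args) {
        Map<String, String> flushed = new HashMap<>();
        flushed.put("a", "1");
        TombstoneSketch tree = new TombstoneSketch(flushed);
        System.out.println(tree.get("a")); // 1
        tree.delete("a");
        System.out.println(tree.get("a")); // null
    }
}
```

Simply removing the key from the newer layer would not work here: the lookup would fall through to the older table and the stale value would reappear.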

+

1.3.4 SSTable Compaction

+

The most expensive operation while looking for a key is certainly the +disk search, and this is why Bloom filters are crucial for negative +lookups on SSTables. But no Bloom filter can save us if there are too many tables +to search, hence we need compaction.

+

When flushing a Memtable, we create an SSTable of level one. When the +first level reaches a certain threshold, all its tables are merged into +a level-two table, and so on. This permits us to save storage and query +fewer tables in lookups.
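
Merging the tables of one level into a single table can be sketched as below. This is a simplified in-memory illustration: a real implementation would stream-merge the sorted files rather than load them, and the names `CompactionSketch`, `compact` and the `dropTombstones` flag are hypothetical. Dropping tombstones is only safe when no older table can still hold the deleted key, which is why it is left as an explicit choice here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CompactionSketch {
    /** Merges tables ordered oldest to newest; for duplicate keys the newest value wins. */
    static TreeMap<String, String> compact(List<? extends Map<String, String>> oldestToNewest,
                                           boolean dropTombstones) {
        TreeMap<String, String> merged = new TreeMap<>();
        for (Map<String, String> table : oldestToNewest) {
            merged.putAll(table); // later (newer) tables overwrite earlier ones
        }
        if (dropTombstones) {
            merged.entrySet().removeIf(e -> e.getValue() == null);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> older = new TreeMap<>();
        older.put("a", "1");
        older.put("b", "1");
        Map<String, String> newer = new TreeMap<>();
        newer.put("b", "2");
        newer.put("c", null); // tombstone
        List<Map<String, String>> level = new ArrayList<>();
        level.add(older);
        level.add(newer);
        System.out.println(compact(level, true)); // {a=1, b=2}
    }
}
```

Two tables with four entries in total shrink to one table with two entries, which is where both the storage saving and the shorter lookups come from.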

+

Note that this style of compaction is not standard: there are various +more sophisticated techniques, but for the sake of this project this simple +level-like compaction works wonders.

+
+

2 Benchmarks

+

I am using JMH to run +benchmarks; the results were obtained on an AMD Ryzen™ 5 4600H with 16GB of +RAM and a 512GB SSD.

+

To run them use ./gradlew jmh.

+

2.1 SSTable

+ +

+Benchmark                                       Mode  Cnt        Score        Error  Units
+c.t.l.sstable.SSTableBenchmark.negativeAccess  thrpt    5  3316202.976 ±  32851.546  ops/s
+c.t.l.sstable.SSTableBenchmark.randomAccess    thrpt    5     7989.945 ±     40.689  ops/s
+
+

2.2 Bloom filter

+ +
Benchmark                                       Mode  Cnt        Score        Error  Units
+c.t.l.bloom.BloomFilterBenchmark.add           thrpt    5  3190753.307 ±  74744.764  ops/s
+c.t.l.bloom.BloomFilterBenchmark.contains      thrpt    5  3567392.634 ± 220377.613  ops/s
+
+

2.3 Skip-List

+ +

+Benchmark                                       Mode  Cnt        Score        Error  Units
+c.t.l.memtable.SkipListBenchmark.addRemove     thrpt    5   430239.471 ±   4825.990  ops/s
+c.t.l.memtable.SkipListBenchmark.get           thrpt    5   487265.620 ±   8201.227  ops/s
+
+

2.4 Tree

+ +
Benchmark                                       Mode  Cnt        Score        Error  Units
+c.t.l.tree.LSMTreeAddBenchmark.add             thrpt    5   540788.751 ±  54491.134  ops/s
+c.t.l.tree.LSMTreeGetBenchmark.get             thrpt    5     9426.951 ±    241.190  ops/s
+
+
+

2.5 Possible improvements

+

There is certainly space for improvement on this project:

+
    +
  1. Blocked Bloom filters: a variant of the classic array-like Bloom +filter that is more cache efficient;
  2. Search fingers in the Skip list: the idea is to keep a pointer to +the last search, and start from there with subsequent queries;
  3. Proper level compaction in the LSM tree;
  4. Write-ahead log for the insertions: without this, a crash makes all +the in-memory writes disappear;
  5. Proper recovery: handle crashes and reboots, using existing SSTables +and the write-ahead log.
+

I don’t have the practical time to do all of this; perhaps the first +two points will be handled in the future.

+
+

2.6 References

+ +
+

If you found this useful or interesting do not hesitate to ask +clarifying questions or get in touch!

diff --git a/out.md b/out.md new file mode 100644 index 0000000..9c6e7c0 --- /dev/null +++ b/out.md @@ -0,0 +1,264 @@ +

An implementation of the Log-Structured Merge Tree (LSM tree) data +structure in Java.

+

Table of Contents

+
    +
  1. Architecture
    1. SSTable
    2. Skip-List
    3. Tree
  2. Benchmarks
    1. SSTable
    2. Skip-List
    3. Tree
  3. Possible future +improvements
  4. References
+

To interact with a toy tree you can use ./gradlew run -q +to spawn a console.

+
+console.png + +
+
+

1 Architecture

+

An overview of the architecture: SSTables, which are the disk-resident +portion of the database; Skip Lists, used as in-memory buffers; and finally +the combination of the two to provide insertion, lookup, and deletion +primitives.

+

1.1 SSTable

+

Sorted String Table (SSTable) is a collection of files modelling +key-value pairs in sorted order by key. It is used as a persistent +storage for the LSM tree.

+

1.1.1 Components

+ +

1.1.2 Key lookup

+

The basic idea is to use the sparse index to find the key-value pair +in the data file. The steps are:

+
    +
  1. Use the Bloom filter to test whether the key might be in the +table;
  2. If the key might be present, use binary search on the index to find +the greatest indexed key that is less than or equal to it;
  3. Scan the data from the position found in the previous step to find +the key-value pair. The search can stop when we see a key greater +than the one we are looking for, or when we reach the end of the +table.
+

The search is as lazy as possible, meaning that we read the minimum +amount of data from disk. For instance, if the length of the next key is +smaller than that of the key we are looking for, we can skip the whole key-value +pair.

+

1.1.3 Persistence

+

A table is persisted to disk when it is created. A base filename is +defined, and three files are present:

+ +

Data format

+ +

Index format

+ +

Filter format

+ +

To save space, all integers are stored in variable-length +encoding, and offsets in the index are stored as deltas.

+
+

1.2 Skip-List

+

A skip-list is +a probabilistic data structure that allows fast search, insertion and +deletion of elements in a sorted sequence.

+

In the LSM tree, it is used as an in-memory data structure to store +key-value pairs in sorted order by key. Once the skip-list reaches a +certain size, it is flushed to disk as an SSTable.

+

1.2.1 Operations details

+

The idea of a skip list is similar to a classic linked list. We have +nodes with forward pointers, but also levels. We can think about a level +as a fast lane between nodes. By carefully constructing them at +insertion time, searches are faster, as they can use higher levels to +skip unwanted nodes.

+

Given n elements, a skip list has about log(n) +levels, with the first level containing all the elements. Each time we go up a +level, the number of elements is cut roughly in half.

+

To locate an element, we start from the top level and move forward +until we find an element greater than the one we are looking for. Then +we move down to the next level and repeat the process until we find the +element.

+

Insertions, deletions, and updates are done by first locating the +element, then performing the operation on the node. All of them have an +average time complexity of O(log(n)).

+
+

1.3 Tree

+

Having defined SSTables and Skip Lists we can obtain the final +structure as a combination of the two. The main idea is to use the +latter as an in-memory buffer, while the former efficiently stores +flushed buffers.

+

1.3.1 Insertion

+

Each insert goes directly to a Memtable, which is a Skip List under +the hood, so the response time is quite fast. Once the Memtable grows past a size threshold, +it is made immutable by appending it to +the immutable memtables LIFO list, and it is replaced with a new +mutable one.

+

The immutable memtable list is asynchronously consumed by a +background thread, which takes the next available memtable and creates a +disk-resident SSTable with its content.

+

1.3.2 Lookup

+

While looking for a key, we proceed as follows:

+
    +
  1. Look into the in-memory buffer: if the key was recently written it is +likely here. If it is not present, continue;
  2. Look into the immutable memtables list, iterating from the most +recent to the oldest. If it is not present, continue;
  3. Look into the disk tables, iterating from the most recent one to the +oldest. If it is not present, return null.
+

1.3.3 Deletions

+

To delete a key, we do not need to delete all its replicas from the +on-disk tables; we just need a special value called a tombstone. +Hence a deletion is the same as an insertion, but with the value set to +null. While looking for a key, if we encounter a null value we simply +return null as a result.

+

1.3.4 SSTable Compaction

+

The most expensive operation while looking for a key is certainly the +disk search, and this is why Bloom filters are crucial for negative +lookups on SSTables. But no Bloom filter can save us if there are too many tables +to search, hence we need compaction.

+

When flushing a Memtable, we create an SSTable of level one. When the +first level reaches a certain threshold, all its tables are merged into +a level-two table, and so on. This permits us to save storage and query +fewer tables in lookups.

+

Note that this style of compaction is not standard: there are various +more sophisticated techniques, but for the sake of this project this simple +level-like compaction works wonders.

+
+

2 Benchmarks

+

I am using JMH to run +benchmarks; the results were obtained on an AMD Ryzen™ 5 4600H with 16GB of +RAM and a 512GB SSD.

+

To run them use ./gradlew jmh.

+

2.1 SSTable

+ +

+Benchmark                                       Mode  Cnt        Score        Error  Units
+c.t.l.sstable.SSTableBenchmark.negativeAccess  thrpt    5  3316202.976 ±  32851.546  ops/s
+c.t.l.sstable.SSTableBenchmark.randomAccess    thrpt    5     7989.945 ±     40.689  ops/s
+
+

2.2 Bloom filter

+ +
Benchmark                                       Mode  Cnt        Score        Error  Units
+c.t.l.bloom.BloomFilterBenchmark.add           thrpt    5  3190753.307 ±  74744.764  ops/s
+c.t.l.bloom.BloomFilterBenchmark.contains      thrpt    5  3567392.634 ± 220377.613  ops/s
+
+

2.3 Skip-List

+ +

+Benchmark                                       Mode  Cnt        Score        Error  Units
+c.t.l.memtable.SkipListBenchmark.addRemove     thrpt    5   430239.471 ±   4825.990  ops/s
+c.t.l.memtable.SkipListBenchmark.get           thrpt    5   487265.620 ±   8201.227  ops/s
+
+

2.4 Tree

+ +
Benchmark                                       Mode  Cnt        Score        Error  Units
+c.t.l.tree.LSMTreeAddBenchmark.add             thrpt    5   540788.751 ±  54491.134  ops/s
+c.t.l.tree.LSMTreeGetBenchmark.get             thrpt    5     9426.951 ±    241.190  ops/s
+
+
+

2.5 Possible improvements

+

There is certainly space for improvement on this project:

+
    +
  1. Blocked Bloom filters: a variant of the classic array-like Bloom +filter that is more cache efficient;
  2. Search fingers in the Skip list: the idea is to keep a pointer to +the last search, and start from there with subsequent queries;
  3. Proper level compaction in the LSM tree;
  4. Write-ahead log for the insertions: without this, a crash makes all +the in-memory writes disappear;
  5. Proper recovery: handle crashes and reboots, using existing SSTables +and the write-ahead log.
+

I don’t have the practical time to do all of this; perhaps the first +two points will be handled in the future.

+
+

2.6 References

+ +
+

If you found this useful or interesting do not hesitate to ask +clarifying questions or get in touch!