Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ReshardingV3 memtrie #574

Merged
merged 4 commits into from
Nov 14, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Binary file added neps/assets/nep-0568/NEP-HybridMemTrie.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added neps/assets/nep-0568/NEP-SplitState.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
65 changes: 61 additions & 4 deletions neps/nep-0568.md
Original file line number Diff line number Diff line change
Expand Up @@ -69,8 +69,41 @@ post-processing, as long as the chain's view reflects a fully resharded state.
must be correctly distributed between the child shards.
* ShardId Semantics: The shard identifiers will become abstract identifiers
where today they are number in the 0..num_shards range.
* Congestion Info: CongestionInfo in the chunk header would be recalculated for the child
shards at the resharding boundary. Proof must be compatible with Stateless Validation.

### State Storage - Mem Trie
### State Storage - MemTrie

MemTrie is the in-memory representation of the trie that the runtime uses for all trie accesses. This is kept in sync with the Trie representation in state.

As of today it isn't mandatory for nodes to have MemTrie feature enabled but going forward, with ReshardingV3, all nodes would require to have MemTrie enabled for resharding to happen successfully.

For the purposes of resharding, we need an efficient way to split the MemTrie into two child tries based on the boundary account. This splitting happens at the epoch boundary when the new epoch is expected to have the two child shards. The set of requirements around MemTrie splitting are:
* MemTrie splitting needs to be "instant", i.e. happen efficiently within the span of one block. The child tries need to be available for the processing of the next block in the new epoch.
* MemTrie splitting needs to be compatible with stateless validation, i.e. we need to generate a proof that the memtrie split proposed by the chunk producer is correct.
* The proof generated for splitting the MemTrie needs to be compatible with the limits of the size of state witness that we send to all chunk validators. This prevents us from doing things like iterating through all trie keys for delayed receipts etc.

With ReshardingV3 design, there's no protocol change to the structure of MemTries, however the implementation constraints required us to introduce the concept of a Frozen MemTrie. More details are in the [implementation](#state-storage---memtrie-1) section below.

Based on the requirements above, we came up with an algorithm to efficiently split the parent trie into two child tries. Trie entries can be divided into three categories based on whether the trie keys have an `account_id` prefix and based on the total number of such trie keys. Splitting of these keys is handled in different ways.

#### TrieKey with AccountID prefix

This category includes most of the trie keys like `TrieKey::Account`, `TrieKey::ContractCode`, `TrieKey::PostponedReceipt`, etc. For these keys, we can efficiently split the trie based on the boundary account trie key. Note that we only need to read all the intermediate nodes that form a part of the split key. In the example below, if "pass" is the split key, we access all the nodes along the path of `root` -> `p` -> `a` -> `s` -> `s`, while not needing to touch any of the other intermediate nodes like `o` -> `s` -> `t` in key "post". The accessed nodes form a part of the state witness as those are the only nodes that the validators would need to verify that the resharding split is correct. This limits the size of the witness to effectively O(depth) of trie for each trie key in this category.

![Splitting Trie diagram](assets/nep-0568/NEP-SplitState.png)

#### Singleton TrieKey

This category includes the trie keys `TrieKey::DelayedReceiptIndices`, `TrieKey::PromiseYieldIndices`, `TrieKey::BufferedReceiptIndices`. Notably, these are just a single entry (or O(num_shard) entries) in the trie and hence are small enough to read and modify for the children tries efficiently.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we say how these keys are split in practice? Or what happens to them in state witness?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean what exact values do we push into the children for these keys? Yes, we should do that, but I think it would be better as part of the delayed receipts, promise yield and buffered receipts section (which I plan to add soon).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, sounds good to me


#### Indexed TrieKey

This category includes the trie keys `TrieKey::DelayedReceipt`, `TrieKey::PromiseYieldTimeout` and `TrieKey::BufferedReceipt`. The number of entries for these keys can potentially be arbitrarily large and it's not feasible to iterate through all the entries. In pre-stateless validation world, where we didn't care about state witness size limits, for ReshardingV2 we could just iterate over all delayed receipts and split them into the respective child shards.

For ReshardingV3, these are handled by either of the two strategies
- `TrieKey::DelayedReceipt` and `TrieKey::PromiseYieldTimeout` are handled by duplicating entries across both child shards as each entry could belong to either of the child shards. More details in the [Delayed Receipts](#delayed-receipt-handling) and [Promise Yield](#promiseyield-receipt-handling) sections below.
- `TrieKey::BufferedReceipt` are independent of the account_id and therefore can be sent to either of the child shards, but not both. We copy the buffered receipts and the associated metadata to the child shard with the lower index. More details in the [Buffered Receipts](#buffered-receipt-handling) section below.

### State Storage - Flat State

Expand Down Expand Up @@ -137,9 +170,9 @@ supporting smooth transitions without altering storage structures directly.

### Cross Shard Traffic

### Receipt Handling - Delayed, Postponed, PromiseYield

### Receipt Handling - Buffered
### Delayed Receipt Handling
### PromiseYield Receipt Handling
### Buffered Receipt Handling

### ShardId Semantics

Expand All @@ -160,6 +193,30 @@ In this NEP, we propose updating the ShardId semantics to allow for arbitrary id

The section should return to the examples given in the previous section, and explain more fully how the detailed proposal makes those examples work.]
```
### State Storage - MemTrie

The current implementation of MemTrie uses a pool of memory (`STArena`) to allocate and deallocate nodes and internal pointers in this pool to reference child nodes. MemTries, unlike the State representation of Trie, do not work with the hash of the nodes but internal memory pointers directly. Additionally, MemTries are not thread safe and one MemTrie exists per shard.

As described in [MemTrie](#state-storage---memtrie) section above, we need an efficient way to split the MemTrie into two child MemTries within a span of 1 block. What makes this challenging is that the current implementation of MemTrie is not thread safe and can not be shared across two shards.

The naive way to create two MemTries for the child shards would be to iterate through all the entries of the parent MemTrie and fill in these values into the child MemTries. This however is prohibitively time consuming.

The solution to this problem was to introduce the concept of Frozen MemTrie (with a `FrozenArena`) which is a cloneable, read-only, thread-safe snapshot of a MemTrie. We can call the `freeze` method on an existing MemTrie that converts it into a Frozen MemTrie. Note that this process consumes the original MemTrie and we can no longer allocate and deallocate nodes to it.

Along with `FrozenArena`, we also introduce a `HybridArena` which is effectively a base made of `FrozenArena` with a top layer of `STArena` where we support allocating and deallocating new nodes into the MemTrie. Newly allocated nodes can reference/point to nodes in the `FrozenArena`. We use this Hybrid MemTrie as a temporary MemTrie while the flat storage is being constructed in the background.

While Frozen MemTries provide the benefits of being compatible with instant resharding, they come at the cost of memory consumption. Once a MemTrie is frozen, since it doesn't support deallocation of memory, it continues to consume as much memory as it did at the time of freezing. In case a node is tracking only one of the child shards, a Frozen MemTrie would continue to use the same amount of memory as the parent trie. Due to this, Hybrid MemTries are only a temporary solution and we rebuild the MemTrie for the children once the post-processing step for Flat Storage is completed.

Additionally, a node would have to support 2x the memory footprint of a single trie as after resharding, we would have two copies of the trie in memory, one from the temporary Hybrid MemTrie in use for block production, and other from the background MemTrie that would be under construction. Once the background MemTrie is fully constructed and caught up with the latest block, we do an in-place swap of the Hybrid MemTrie with the new child MemTrie and deallocate the memory from the Hybrid MemTrie.

During a resharding event, at the boundary of the epoch, when we need to split the parent shard into the two child shards, we do the following steps:
1. Freeze the parent MemTrie arena to create a read-only frozen arena that represents a snapshot of the state as of the time of freezing, i.e. after postprocessing last block of epoch. Note that we no longer require the parent MemTrie in runtime going forward.
2. We cheaply clone the Frozen MemTrie for both the child MemTries to use. Note that this doesn't clone the parent arena memory, but just increases the refcount.
3. We then create a new MemTrie with HybridArena for each of the children. The base of the MemTrie is the read-only FrozenArena while all new node allocations happens on a dedicated STArena memory pool for each child MemTrie. This is the temporary MemTrie that we use while Flat Storage is being built in the background.
4. Once the Flat Storage is constructed in the post processing step of resharding, we use that to load a new MemTrie and catchup to the latest block.
5. After the new child MemTrie has caught up to the latest block, we do an in-place swap in Client and discard the Hybrid MemTrie.

![Hybrid MemTrie diagram](assets/nep-0568/NEP-HybridMemTrie.png)

### State Storage - State mapping

Expand Down
Loading