diff --git a/neps/assets/nep-0568/NEP-HybridMemTrie.png b/neps/assets/nep-0568/NEP-HybridMemTrie.png
new file mode 100644
index 000000000..013c761f8
Binary files /dev/null and b/neps/assets/nep-0568/NEP-HybridMemTrie.png differ
diff --git a/neps/assets/nep-0568/NEP-SplitState.png b/neps/assets/nep-0568/NEP-SplitState.png
new file mode 100644
index 000000000..a160a5949
Binary files /dev/null and b/neps/assets/nep-0568/NEP-SplitState.png differ
diff --git a/neps/nep-0568.md b/neps/nep-0568.md
index 98615acf0..9f536e9bc 100644
--- a/neps/nep-0568.md
+++ b/neps/nep-0568.md
@@ -69,8 +69,41 @@ post-processing, as long as the chain's view reflects a fully resharded state.
   must be correctly distributed between the child shards.
 * ShardId Semantics: The shard identifiers will become abstract identifiers where today they are
   number in the 0..num_shards range.
+* Congestion Info: CongestionInfo in the chunk header would be recalculated for the child
+  shards at the resharding boundary. The proof must be compatible with Stateless Validation.
 
-### State Storage - Mem Trie
+### State Storage - MemTrie
+
+MemTrie is the in-memory representation of the trie that the runtime uses for all trie accesses. It is kept in sync with the Trie representation in state.
+
+As of today, it isn't mandatory for nodes to have the MemTrie feature enabled, but going forward, with ReshardingV3, all nodes will be required to have MemTrie enabled for resharding to happen successfully.
+
+For the purposes of resharding, we need an efficient way to split the MemTrie into two child tries based on the boundary account. This splitting happens at the epoch boundary when the new epoch is expected to have the two child shards. The requirements for MemTrie splitting are:
+* MemTrie splitting needs to be "instant", i.e. happen efficiently within the span of one block. The child tries need to be available for the processing of the next block in the new epoch.
+* MemTrie splitting needs to be compatible with stateless validation, i.e. we need to generate a proof that the MemTrie split proposed by the chunk producer is correct.
+* The proof generated for splitting the MemTrie needs to fit within the size limits of the state witness that we send to all chunk validators. This prevents us from doing things like iterating through all trie keys for delayed receipts etc.
+
+With the ReshardingV3 design, there's no protocol change to the structure of MemTries; however, the implementation constraints required us to introduce the concept of a Frozen MemTrie. More details are in the [implementation](#state-storage---memtrie-1) section below.
+
+Based on the requirements above, we came up with an algorithm to efficiently split the parent trie into two child tries. Trie entries can be divided into three categories based on whether the trie keys have an `account_id` prefix and on the total number of such trie keys. Each category is split in a different way.
+
+#### TrieKey with AccountID prefix
+
+This category includes most of the trie keys like `TrieKey::Account`, `TrieKey::ContractCode`, `TrieKey::PostponedReceipt`, etc. For these keys, we can efficiently split the trie based on the boundary account trie key. Note that we only need to read the intermediate nodes that lie along the path of the split key. In the example below, if "pass" is the split key, we access all the nodes along the path of `root` -> `p` -> `a` -> `s` -> `s`, while not needing to touch any of the other intermediate nodes like `o` -> `s` -> `t` in key "post". The accessed nodes become part of the state witness, as they are the only nodes that the validators need in order to verify that the resharding split is correct. This limits the size of the witness to effectively O(depth) of the trie for each trie key in this category.
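
The path-only access described above can be illustrated with a minimal sketch. It assumes a toy byte-keyed trie with reference-counted subtrees; `Node`, `insert`, `split`, `get`, and `demo_trie` are hypothetical names chosen for illustration and are not the nearcore MemTrie API. Subtrees off the split path are shared by pointer rather than visited, so the `visited` counter grows only with the depth of the split key.

```rust
use std::cmp::Ordering;
use std::collections::BTreeMap;
use std::rc::Rc;

/// Toy trie node (hypothetical; not the nearcore MemTrie node layout).
#[derive(Default, Clone)]
struct Node {
    value: Option<u32>,
    children: BTreeMap<u8, Rc<Node>>,
}

impl Node {
    fn is_empty(&self) -> bool {
        self.value.is_none() && self.children.is_empty()
    }
}

fn insert(node: &mut Node, key: &[u8], value: u32) {
    match key.split_first() {
        None => node.value = Some(value),
        Some((b, rest)) => {
            let child = node.children.entry(*b).or_default();
            insert(Rc::make_mut(child), rest, value);
        }
    }
}

fn get(node: &Node, key: &[u8]) -> Option<u32> {
    match key.split_first() {
        None => node.value,
        Some((b, rest)) => node.children.get(b).and_then(|c| get(c, rest)),
    }
}

/// Splits `node` into (keys < split_key, keys >= split_key).
/// Only nodes on the split path are visited; all other subtrees are
/// shared with the parent by bumping a reference count.
fn split(node: &Node, split_key: &[u8], visited: &mut usize) -> (Node, Node) {
    *visited += 1;
    let (b, rest) = match split_key.split_first() {
        Some(parts) => parts,
        // End of the split key: this whole subtree is >= split_key.
        None => return (Node::default(), node.clone()),
    };
    let mut left = Node::default();
    let mut right = Node::default();
    // A value stored here has a key that is a strict prefix of the
    // split key, so it sorts before the split key.
    left.value = node.value;
    for (byte, child) in &node.children {
        match byte.cmp(b) {
            Ordering::Less => { left.children.insert(*byte, Rc::clone(child)); }
            Ordering::Greater => { right.children.insert(*byte, Rc::clone(child)); }
            Ordering::Equal => {
                let (l, r) = split(child, rest, visited);
                if !l.is_empty() { left.children.insert(*byte, Rc::new(l)); }
                if !r.is_empty() { right.children.insert(*byte, Rc::new(r)); }
            }
        }
    }
    (left, right)
}

/// Builds a small trie containing the keys used in the example above.
fn demo_trie() -> Node {
    let mut root = Node::default();
    for (i, key) in ["alice", "pass", "past", "post"].iter().enumerate() {
        insert(&mut root, key.as_bytes(), i as u32 + 1);
    }
    root
}
```

Calling `split(&demo_trie(), b"pass", &mut visited)` visits only the nodes `root`, `p`, `a`, `s`, `s`; the subtree for "post" is attached to the right child without being traversed. In a real proof, the visited nodes would be the ones recorded in the state witness.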
+
+![Splitting Trie diagram](assets/nep-0568/NEP-SplitState.png)
+
+#### Singleton TrieKey
+
+This category includes the trie keys `TrieKey::DelayedReceiptIndices`, `TrieKey::PromiseYieldIndices`, `TrieKey::BufferedReceiptIndices`. Notably, each of these is just a single entry (or O(num_shard) entries) in the trie and hence is small enough to read and modify for the child tries efficiently.
+
+#### Indexed TrieKey
+
+This category includes the trie keys `TrieKey::DelayedReceipt`, `TrieKey::PromiseYieldTimeout` and `TrieKey::BufferedReceipt`. The number of entries for these keys can potentially be arbitrarily large and it's not feasible to iterate through all the entries. In the pre-stateless-validation world, where we didn't care about state witness size limits, for ReshardingV2 we could just iterate over all delayed receipts and split them into the respective child shards.
+
+For ReshardingV3, these are handled by one of two strategies:
+- `TrieKey::DelayedReceipt` and `TrieKey::PromiseYieldTimeout` are handled by duplicating entries across both child shards, as each entry could belong to either of the child shards. More details in the [Delayed Receipts](#delayed-receipt-handling) and [Promise Yield](#promiseyield-receipt-handling) sections below.
+- `TrieKey::BufferedReceipt` entries are independent of the account_id and therefore can be sent to either of the child shards, but not both. We copy the buffered receipts and the associated metadata to the child shard with the lower index. More details in the [Buffered Receipts](#buffered-receipt-handling) section below.
 
 ### State Storage - Flat State
 
@@ -137,9 +170,9 @@ supporting smooth transitions without altering storage structures directly.
 ### Cross Shard Traffic
 
-### Receipt Handling - Delayed, Postponed, PromiseYield
-
-### Receipt Handling - Buffered
+### Delayed Receipt Handling
+### PromiseYield Receipt Handling
+### Buffered Receipt Handling
 
 ### ShardId Semantics
 
@@ -160,6 +193,30 @@ In this NEP, we propose updating the ShardId semantics to allow for arbitrary id
 The section should return to the examples given in the previous section, and explain more fully how the detailed proposal makes those examples work.]
 ```
+### State Storage - MemTrie
+
+The current implementation of MemTrie uses a pool of memory (`STArena`) to allocate and deallocate nodes, and uses internal pointers into this pool to reference child nodes. MemTries, unlike the State representation of Trie, do not work with the hashes of nodes but with internal memory pointers directly. Additionally, MemTries are not thread safe and one MemTrie exists per shard.
+
+As described in the [MemTrie](#state-storage---memtrie) section above, we need an efficient way to split the MemTrie into two child MemTries within the span of one block. What makes this challenging is that the current implementation of MemTrie is not thread safe and cannot be shared across two shards.
+
+The naive way to create two MemTries for the child shards would be to iterate through all the entries of the parent MemTrie and fill in these values into the child MemTries. This, however, is prohibitively time-consuming.
+
+The solution to this problem was to introduce the concept of a Frozen MemTrie (with a `FrozenArena`), which is a cloneable, read-only, thread-safe snapshot of a MemTrie. We can call the `freeze` method on an existing MemTrie to convert it into a Frozen MemTrie. Note that this process consumes the original MemTrie and we can no longer allocate or deallocate nodes in it.
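
The freeze semantics described above can be sketched as follows. This is an illustration only, under the assumption that a frozen arena is a reference-counted, immutable view of the node pool; `MutableArena` is a hypothetical stand-in for `STArena`, and the `FrozenArena` shown here is a toy type, not the nearcore implementation. Taking `self` by value in `freeze` models the fact that the original arena is consumed.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a mutable, single-owner arena such as `STArena`.
struct MutableArena {
    nodes: Vec<String>,
}

impl MutableArena {
    fn new() -> Self {
        MutableArena { nodes: Vec::new() }
    }

    /// Allocates a node and returns its index within the arena.
    fn alloc(&mut self, node: &str) -> usize {
        self.nodes.push(node.to_string());
        self.nodes.len() - 1
    }

    /// Consumes the arena: after this call the compiler rejects any
    /// further allocation on the original value.
    fn freeze(self) -> FrozenArena {
        FrozenArena { nodes: Arc::new(self.nodes) }
    }
}

/// Toy read-only snapshot. Cloning only bumps a reference count, and the
/// shared data is immutable, so clones can be used from several threads.
#[derive(Clone)]
struct FrozenArena {
    nodes: Arc<Vec<String>>,
}

impl FrozenArena {
    fn get(&self, id: usize) -> Option<&str> {
        self.nodes.get(id).map(String::as_str)
    }

    /// Number of live handles that share the same underlying memory.
    fn ref_count(&self) -> usize {
        Arc::strong_count(&self.nodes)
    }
}
```

In this sketch, giving each child shard its own `clone()` of the frozen arena costs a reference-count increment rather than a copy of the node pool. The flip side is that the memory stays allocated for as long as any clone is alive.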
+
+Along with `FrozenArena`, we also introduce a `HybridArena`, which is effectively a base made of a `FrozenArena` with a top layer of `STArena` where we support allocating and deallocating new nodes in the MemTrie. Newly allocated nodes can reference/point to nodes in the `FrozenArena`. We use this Hybrid MemTrie as a temporary MemTrie while the flat storage is being constructed in the background.
+
+While Frozen MemTries provide the benefit of being compatible with instant resharding, they come at the cost of memory consumption. Once a MemTrie is frozen, since it doesn't support deallocation of memory, it continues to consume as much memory as it did at the time of freezing. In case a node is tracking only one of the child shards, a Frozen MemTrie would continue to use the same amount of memory as the parent trie. Due to this, Hybrid MemTries are only a temporary solution and we rebuild the MemTrie for the children once the post-processing step for Flat Storage is completed.
+
+Additionally, a node would have to support 2x the memory footprint of a single trie since, after resharding, we would have two copies of the trie in memory: one from the temporary Hybrid MemTrie in use for block production, and the other from the background MemTrie that would be under construction. Once the background MemTrie is fully constructed and caught up with the latest block, we do an in-place swap of the Hybrid MemTrie with the new child MemTrie and deallocate the memory from the Hybrid MemTrie.
+
+During a resharding event, at the boundary of the epoch, when we need to split the parent shard into the two child shards, we do the following steps:
+1. Freeze the parent MemTrie arena to create a read-only frozen arena that represents a snapshot of the state as of the time of freezing, i.e. after postprocessing the last block of the epoch. Note that we no longer require the parent MemTrie in runtime going forward.
+2. We cheaply clone the Frozen MemTrie for both the child MemTries to use.
Note that this doesn't clone the parent arena memory, but just increases the refcount.
+3. We then create a new MemTrie with a HybridArena for each of the children. The base of the MemTrie is the read-only FrozenArena, while all new node allocations happen on a dedicated STArena memory pool for each child MemTrie. This is the temporary MemTrie that we use while Flat Storage is being built in the background.
+4. Once the Flat Storage is constructed in the post-processing step of resharding, we use that to load a new MemTrie and catch up to the latest block.
+5. After the new child MemTrie has caught up to the latest block, we do an in-place swap in Client and discard the Hybrid MemTrie.
+
+![Hybrid MemTrie diagram](assets/nep-0568/NEP-HybridMemTrie.png)
 
 ### State Storage - State mapping