Feat/lazy load deployment #2519

Closed

Conversation

@joske joske commented Jul 12, 2024

Motivation

We've seen that RSS size is correlated with the number of deployments loaded (and staying) in memory. This PR keeps at most MAX_STACKS programs in memory (a limit chosen arbitrarily at 1000).
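
For illustration, here is a minimal sketch of the caching idea, assuming an lru::LruCache keyed by program ID (the lru dependency is added by this PR); the type and method names are hypothetical, not the actual PR code:

use std::hash::Hash;
use std::num::NonZeroUsize;
use lru::LruCache;

/// Upper bound on the number of program stacks kept in memory (chosen arbitrarily).
const MAX_STACKS: usize = 1000;

/// Hypothetical cache of lazily loaded stacks, keyed by program ID.
pub struct StackCache<K: Hash + Eq, V: Clone> {
    cache: LruCache<K, V>,
}

impl<K: Hash + Eq, V: Clone> StackCache<K, V> {
    pub fn new() -> Self {
        Self { cache: LruCache::new(NonZeroUsize::new(MAX_STACKS).unwrap()) }
    }

    /// Returns the cached stack, or loads it from storage and caches it,
    /// evicting the least recently used entry once MAX_STACKS is reached.
    pub fn get_or_load(&mut self, key: K, load: impl FnOnce() -> V) -> V {
        if let Some(stack) = self.cache.get(&key) {
            return stack.clone();
        }
        let stack = load();
        self.cache.put(key, stack.clone());
        stack
    }
}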

Test Plan

Tested by running a local devnet, deploying programs, and executing them; then restarting the network and executing those programs again. The stacks were lazily loaded from storage and execution worked. Also tested executing programs that are not deployed.

@joske joske force-pushed the feat/lazy-load-deployment branch 2 times, most recently from a334232 to 61872f3 Compare July 15, 2024 07:44
@raychu86 (Contributor)

Would love some unit tests for this as well! An easy way to do this would be to use #[cfg(any(test, feature = "test"))] to set MAX_STACKS to a small value for testing purposes only.
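
A rough sketch of that suggestion (the test-only value of 4 is purely illustrative):

#[cfg(not(any(test, feature = "test")))]
const MAX_STACKS: usize = 1000;

// Keep the cache tiny under test so eviction and lazy reloading are actually exercised.
#[cfg(any(test, feature = "test"))]
const MAX_STACKS: usize = 4;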

Comment on lines 267 to 271
// try to lazy load the stack
#[cfg(feature = "rocks")]
let store = ConsensusStore::<N, ConsensusDB<N>>::open(storage_mode);
#[cfg(not(feature = "rocks"))]
let store = ConsensusStore::<N, ConsensusMemory<N>>::open(storage_mode);
Contributor

It may be easier to either have C: ConsensusStorage as a generic or to pass in storage somehow.

It might break some APIs, but it feels cleaner than reloading the storage at each instance.
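
A toy illustration of the shape being suggested here (deliberately not the real snarkVM types): the storage is injected once as a generic parameter, so the lazy-load path never needs to re-open it:

trait Storage {
    fn get_program(&self, id: &str) -> Option<String>;
}

struct Process<S: Storage> {
    store: S,
}

impl<S: Storage> Process<S> {
    // The stack can be lazily loaded through the injected store,
    // so no store handle has to be opened per lookup.
    fn load_stack(&self, program_id: &str) -> Option<String> {
        self.store.get_program(program_id)
    }
}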

Contributor Author

Yes, this was the original idea, but adding the generic cascaded everywhere... Will look into that.

Contributor Author

This cascades too much. The performance hit of opening storage is about 2s (on a 15GB ledger). That's not negligible, but hopefully we don't need to keep that many programs in memory.

Contributor Author

found a way to avoid opening storage every time 🎉
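
One common way to get that effect (a guess at the approach, not necessarily what this PR does) is to open the store once and memoize the handle, e.g. with std::sync::OnceLock:

use std::sync::OnceLock;

// Hypothetical: a process-wide handle, initialized on first use and reused afterwards.
static STORE: OnceLock<String> = OnceLock::new();

fn store() -> &'static String {
    // The closure runs only once; later callers reuse the already-opened handle,
    // avoiding the ~2s cost of opening storage on every lazy load.
    STORE.get_or_init(|| "stand-in for the opened ConsensusStore".to_string())
}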

@joske joske force-pushed the feat/lazy-load-deployment branch 3 times, most recently from 2e79b48 to ab3fb4b Compare July 30, 2024 14:48
@joske joske marked this pull request as ready for review July 30, 2024 14:58
@joske joske requested a review from raychu86 July 30, 2024 14:59
@joske joske force-pushed the feat/lazy-load-deployment branch from ab3fb4b to dc9340b Compare July 31, 2024 07:14
@vicsn vicsn requested a review from niklaslong July 31, 2024 10:45
@kpandl kpandl commented Sep 4, 2024

When running the snarkOS ./devnet.sh script (with the commits of this PR attached to snarkVM) with e.g. 10 validators (and no clients), the nodes fail to load the ledger and the network doesn't start.

Typically, the logs look similar to the following. Nine of the validators log:

DEBUG snarkvm_synthesizer_process: Opening storage
thread 'main' panicked at cli/src/commands/start.rs:176:84:
Failed to parse the node: Failed to load ledger (run 'snarkos clean' and try again)

IO error: While lock file: /Users/user/dev/snarkOS/.ledger-0-0/LOCK: Resource temporarily unavailable

The remaining validator logs:

INFO snarkvm_ledger: Loading the ledger from storage...
thread 'main' panicked at cli/src/commands/start.rs:176:84:
Failed to parse the node: Failed to load ledger (run 'snarkos clean' and try again)
Mismatching network ID or storage mode in the database

@joske joske commented Sep 4, 2024

lock file: /Users/user/dev/snarkOS/.ledger-0-0/LOCK: Resource temporarily unavailable

This is a bit strange. Do all nodes have this?

@kpandl kpandl commented Sep 4, 2024

Yes, all nodes had it. On AWS, validator 0 worked; all other validators got stuck on this:

2024-09-03T14:50:51.110986Z DEBUG Opening storage
2024-09-03T14:50:51.172513Z DEBUG Opened storage
2024-09-03T14:51:08.816863Z INFO Loading the ledger from storage...

(the --no-dev-txs flag was set)

@joske joske commented Sep 4, 2024

Did you put --dev 0 on all of them by accident?

    }
}

impl<N: Network> Process<N> {
    /// Initializes a new process.
    #[inline]
    pub fn load() -> Result<Self> {
        // Assumption: this is only called in test code.
        Process::load_from_storage(Some(aleo_std::StorageMode::Development(0)))
Contributor

Noting this may be the source of our troubles, and we may want to enclose the function in #[cfg(any(test, feature = "test"))] if possible.
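
A sketch of that suggestion, reusing the signature quoted in the diff above (a fragment of the same impl<N: Network> Process<N> block, not necessarily how the PR ends up resolving it):

#[cfg(any(test, feature = "test"))]
#[inline]
pub fn load() -> Result<Self> {
    // Gated to test builds, so the Development(0) storage mode cannot leak into production paths.
    Process::load_from_storage(Some(aleo_std::StorageMode::Development(0)))
}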

Contributor Author

Yes this code needs to be updated.

Contributor Author

marking as draft again.

@vicsn vicsn (Contributor) commented Sep 10, 2024

4aa125b looks good.

If enclosing fn load() in something like #[cfg(any(test, feature = "test"))] is too difficult (perhaps because it's called both by tests and benchmarks?), another, safer approach could be to get rid of fn load() altogether. It should take you just one search-and-replace to call Process::load_from_storage(Some(aleo_std::StorageMode::Development(0))) directly, and that keeps the aleo_std::StorageMode::Development(0) logic contained within tests and benchmarks. That makes a future developer less likely to make mistakes. UPDATE: See new comment below.

Contributor

I realise also that the current approach can lead to deadlocks on the storage when running tests, because all the tests will try to access the same database... Looks like we also didn't run CI on this PR yet (you can ask @ljedrz how he has been running CI with a separate -ci branch).

New proposal: what if we let fn load use store: None? Just like load_from_storage, it should load the credits.aleo program because the tests need it. To avoid adding too many lines of code, you can still abstract the similar logic into a shared function.
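
In code, the proposal would amount to something like this (a sketch only, mirroring the load_from_storage call quoted earlier):

pub fn load() -> Result<Self> {
    // No on-disk store is opened; credits.aleo would still be loaded,
    // just as load_from_storage does, because the tests need it.
    Process::load_from_storage(None)
}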

Contributor Author

We cannot use None, as this breaks many tests.

Contributor

Clear. Then we should rename load to load_testing_only.

@joske joske marked this pull request as draft September 9, 2024 19:02
@@ -0,0 +1,92 @@
// Copyright (C) 2019-2023 Aleo Systems Inc.

@@ -93,9 +96,12 @@ package = "snarkvm-utilities"
path = "../../utilities"
version = "=0.16.19"

[dependencies.tracing]
Contributor

Hate to nit, but these are ordered alphabetically; this entry should be moved after [dependencies.serde_json].

@@ -78,6 +78,9 @@ package = "snarkvm-ledger-store"
path = "../../ledger/store"
version = "=0.16.19"

[dependencies.lru]
Contributor

Same nit: all the non-Aleo dependencies are below.

@vicsn vicsn marked this pull request as ready for review October 1, 2024 12:54
@vicsn vicsn mentioned this pull request Oct 7, 2024
@vicsn vicsn commented Oct 7, 2024

Closed by #2553

@vicsn vicsn closed this Oct 7, 2024