
Limit parallel processing for clients #3358

Open
vicsn wants to merge 7 commits into mainnet-staging

Conversation

@vicsn (Contributor) commented on Jul 9, 2024

Motivation

Clients run out of memory because there is no limit on how many transactions or solutions they verify in parallel. This PR proposes to queue them (just like the validator does in Consensus) and limit how much verification runs in parallel.

We could do cleverer things to increase processing speed, such as checking how many constraints incoming transactions have or awaiting on a channel to start verifying as soon as capacity frees up, but the focus for now is simplicity and safety.

Even though it was recently suggested that clients should have at least 128 GiB of memory, the current implementation uses "only" up to 30 GiB for transaction verification. The right number is up for debate.
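
For illustration, a minimal sketch of the queueing idea, assuming Tokio; the names, the permit-based design, and the capacity are assumptions for this sketch, not code taken from the PR:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

/// Sketch only: each incoming transaction or solution waits for a permit
/// before it is verified, so at most `capacity` verifications run in
/// parallel and memory use stays bounded.
struct VerificationQueue {
    permits: Arc<Semaphore>,
}

impl VerificationQueue {
    fn new(capacity: usize) -> Self {
        Self { permits: Arc::new(Semaphore::new(capacity)) }
    }

    async fn verify<T, F>(&self, verify_fn: F) -> T
    where
        F: FnOnce() -> T + Send + 'static,
        T: Send + 'static,
    {
        // Wait until one of the bounded permits is free.
        let _permit = self.permits.acquire().await.expect("semaphore closed");
        // Run the heavy verification on a blocking thread; the permit is
        // released when `_permit` is dropped at the end of this call.
        tokio::task::spawn_blocking(verify_fn)
            .await
            .expect("verification task panicked")
    }
}
```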

Test Plan

CI passed
Ran a local network and shot some executions at it.
In canary, more serious concurrent traffic can be shot at the network.

Related PRs

Potentially closes: #3341
Replaces: #2970

@vicsn (Contributor, Author) commented on Jul 9, 2024

An open design question is how much memory should be reserved for worst case transaction + solution verification.

Summoning @evanmarshall @zosorock @Meshiest. Provable previously recommended that clients have 128 GiB of RAM, but I understand you and others want to run with less. So my questions to you:

  1. What do you think is a sufficient default amount of RAM which should be assumed to be available for transaction/solution verification?
  2. How badly do you want an additional feature flag which lets you increase the amount of RAM used for transaction/solution verification?

@Meshiest (Contributor) commented on Jul 9, 2024

My node-operation experience isn't very diverse, so I don't have a good estimate for 1, and I personally haven't run into anything that would need the feature flag from 2.

The main limiter we've observed in client sync is core count. 16 cores were barely able to keep up with block production on canary after the 15 TPS tests. Upgrading a client VM from 16 to 32 cores massively increased sync speed. Our servers with more than 64 cores were powerhouses when syncing. RAM seemed less important in our tests, though we weren't using large programs or verifying any solutions.

@raychu86 (Contributor) previously approved these changes on Jul 10, 2024 and left a comment

The code quality and logic LGTM.

Dialing in a reasonable number for these bounds will likely be an iterative process.

In theory we could also use the sysinfo crate to fetch the total memory of the machine that's running the node.
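
A minimal sketch of that idea, assuming a recent sysinfo release (older versions require the SystemExt trait and report KiB rather than bytes):

```rust
use sysinfo::System;

/// Sketch: read total system memory via sysinfo. total_memory() returns
/// bytes in recent sysinfo releases but KiB in older ones, so the unit
/// depends on the version in use.
fn total_memory_gib() -> u64 {
    let mut sys = System::new();
    sys.refresh_memory();
    sys.total_memory() / (1024 * 1024 * 1024)
}
```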

@ljedrz (Collaborator) previously approved these changes on Jul 10, 2024 and left a comment

LGTM with one open point

@vicsn dismissed stale reviews from ljedrz and raychu86 via 799f157 on July 11, 2024
@vicsn (Contributor, Author) commented on Jul 11, 2024

@raychu86 sorry, I added a separate execution queue; I couldn't let a simple 200x improvement in max throughput slide (assuming available compute): 9049f3e
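
Roughly, the separate-queue idea could look like the sketch below, reusing the VerificationQueue sketch from the description. The permit counts are made up, and the split rests on the assumption that executions are far cheaper to verify than deployments:

```rust
/// Sketch only: separate bounds so cheap executions are not stuck waiting
/// behind memory-heavy deployment verification (numbers are illustrative).
struct TransactionQueues {
    deployments: VerificationQueue,
    executions: VerificationQueue,
}

impl TransactionQueues {
    fn new() -> Self {
        Self {
            deployments: VerificationQueue::new(5),   // memory-heavy, few at a time
            executions: VerificationQueue::new(1000), // much cheaper per item
        }
    }
}
```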

> In theory we could also use the sysinfo crate to fetch the total memory of the machine that's running the node.

Yes, I thought about it, but I think this dynamic behaviour would complicate infra planning too much; if users have very diverse preferences, there should instead be an --available-ram flag or something similar.
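
Such a flag is hypothetical (not something this PR adds); with clap it could be wired up roughly like this:

```rust
use clap::Parser;

/// Hypothetical client options; the flag name and default are illustrative.
#[derive(Parser, Debug)]
struct ClientOptions {
    /// RAM (in GiB) the client may reserve for transaction/solution verification.
    #[clap(long, default_value_t = 30)]
    available_ram: u64,
}

fn main() {
    let options = ClientOptions::parse();
    println!("Reserving {} GiB for verification", options.available_ram);
}
```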

@vicsn requested review from ljedrz and raychu86 on July 11, 2024
@raychu86 (Contributor) previously approved these changes on Jul 15, 2024 and left a comment

LGTM

Review thread on node/src/client/router.rs (outdated, resolved)
@raychu86 (Contributor) commented

@vicsn Is this PR still relevant?


Successfully merging this pull request may close these issues.

[Bug] OOM kill because sudden memory increase and requests