Splitstore race condition due to caching and reverts #12420

Open
Stebalien opened this issue Aug 28, 2024 · 3 comments
Labels
kind/bug Kind: Bug

Comments

@Stebalien
Member

The splitstore may remove important state given the following sequence of events:

  1. Client syncs to tipset A at height X.
  2. Client switches to tipset B at height X.
  3. Splitstore starts garbage collecting.
  4. Client switches back to tipset A at height X.

In step 4, the client will not re-execute tipset A because its state will already be in the cache, so the state for tipset A will not get re-written. The splitstore will then fail to keep the state from tipset A because (a) it was not reachable from tipset B and (b) it was not written after garbage collection started.

This can lead to corrupted datastores with missing blocks, leading to state mismatches and sync failures when the splitstore is enabled.
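
The four steps above can be replayed in a toy model. This is only a sketch under simplifying assumptions, not Lotus code: `toyNode`, `switchTo`, `beginCompaction`, `finishCompaction`, and `replay` are hypothetical names, each tipset has a single state root, and compaction keeps only what is reachable from the head at start plus anything written while it runs.

```go
package main

import "fmt"

// toyNode is a hypothetical stand-in for a node with a state cache and a
// compacting blockstore. It is not the real splitstore.
type toyNode struct {
	blocks     map[string]bool   // state roots currently stored
	stateCache map[string]string // tipset key -> cached state root
	compacting bool
	protected  map[string]bool // roots considered live by the running compaction
}

func newToyNode() *toyNode {
	return &toyNode{
		blocks:     map[string]bool{},
		stateCache: map[string]string{},
		protected:  map[string]bool{},
	}
}

// put stores a state root; writes made during compaction are protected.
func (n *toyNode) put(root string) {
	n.blocks[root] = true
	if n.compacting {
		n.protected[root] = true
	}
}

// switchTo makes tipset the head and returns its state root.
// On a cache hit nothing is executed, so nothing is written.
func (n *toyNode) switchTo(tipset string) string {
	if root, ok := n.stateCache[tipset]; ok {
		return root
	}
	root := "state-of-" + tipset
	n.put(root)
	n.stateCache[tipset] = root
	return root
}

// beginCompaction marks only the state reachable from head as live.
func (n *toyNode) beginCompaction(head string) {
	n.compacting = true
	n.protected = map[string]bool{n.stateCache[head]: true}
}

// finishCompaction deletes every root that was never marked live.
func (n *toyNode) finishCompaction() {
	for root := range n.blocks {
		if !n.protected[root] {
			delete(n.blocks, root)
		}
	}
	n.compacting = false
}

// replay runs steps 1-4 and reports whether the final head's state survived.
func replay() (root string, present bool) {
	n := newToyNode()
	n.switchTo("A")        // 1. sync to tipset A
	n.switchTo("B")        // 2. switch to tipset B
	n.beginCompaction("B") // 3. garbage collection starts
	root = n.switchTo("A") // 4. switch back to A: cache hit, no write
	n.finishCompaction()
	return root, n.blocks[root]
}

func main() {
	root, present := replay()
	fmt.Printf("head state %s present after compaction: %v\n", root, present)
	// prints: head state state-of-A present after compaction: false
}
```

In this model the node ends up with tipset A as its head while A's state root has been deleted, which mirrors the missing-block symptom described above.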

Stebalien added the kind/bug label on Aug 28, 2024
@rjan90
Contributor

rjan90 commented Sep 3, 2024

2024-09-03

During the triage we discussed if we could drop the cache (maybe once a day). But we need to investigate whether this is feasible. @ZenGround0, you have a lot of knowledge about the splitstore; do you know if this would be okay? Also, once we schedule more time to tackle this issue, it may be worth pairing up with someone else so we can share knowledge about the splitstore.

@Stebalien
Member Author

> During the triage we discussed if we could drop the cache (maybe once a day).

Specifically, drop the state cache for all tipsets not on the canonical chain at the start of compaction. That way we have to recompute their state when switching to them, ensuring the splitstore sees that their state is live.
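
A minimal sketch of that eviction step, assuming the state cache can be modeled as a map from tipset key to state root and that the canonical chain is available as a set of tipset keys. `dropNonCanonical` and both map shapes are hypothetical and do not correspond to existing Lotus APIs.

```go
package main

import "fmt"

// dropNonCanonical removes the cached state of every tipset that is not on
// the canonical chain and returns how many entries were evicted.
// Hypothetical helper: intended to run at the start of compaction.
func dropNonCanonical(stateCache map[string]string, canonical map[string]bool) int {
	dropped := 0
	for tipset := range stateCache {
		if !canonical[tipset] {
			delete(stateCache, tipset) // deleting during range is safe in Go
			dropped++
		}
	}
	return dropped
}

func main() {
	cache := map[string]string{"A": "state-of-A", "B": "state-of-B"}
	canonical := map[string]bool{"B": true} // B is the current head

	n := dropNonCanonical(cache, canonical)
	_, hasA := cache["A"]
	fmt.Println(n, hasA) // prints: 1 false
}
```

With tipset A evicted, a later switch back to A misses the cache and has to recompute its state, so the resulting writes happen while compaction is running and the splitstore can treat them as live.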

@ZenGround0
Contributor

@rjan90 I could probably help speed this up by pairing, and would like to do that. I am rusty, though, so I will need time to understand what's going on and the proposed solution.

Projects
Status: 📌 Triage