memdb: retain old version nodes of ART to satisfy snapshot read #1503

Open. you06 wants to merge 4 commits into base: master.

@you06 you06 commented Nov 19, 2024

ref pingcap/tidb#57425

Changes

Snapshot iterators always read from a snapshot of the MemBuffer, but writes between Next calls can alter the structure of the ART (see the "Bug explanation" section for details), which can cause a snapshot iterator to return incorrect results.

This PR introduces a counter for active snapshots. When the counter is greater than 0, it indicates that old versions need to be retained for snapshot reads. In such cases, we store freed nodes in unused slices and delay the actual free operation to prevent them from being reused.
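The counter-plus-delayed-free idea can be sketched roughly as follows. This is a minimal illustration, not the PR's code: only the name `snapshotInc` appears in the diff under review, while `nodeAddr`, `allocator`, `snapshotDec`, `alloc`, `freeNode`, and the exact moment at which deferred slots become reusable are assumptions made for the example.

```go
package main

// nodeAddr is a hypothetical handle to a node slot in an arena.
type nodeAddr uint32

// allocator is an illustrative stand-in for the ART node allocator.
type allocator struct {
	next      nodeAddr   // next never-used address
	free      []nodeAddr // slots that may be handed out again
	unused    []nodeAddr // slots freed while a snapshot was active
	snapshots int        // number of active snapshot iterators
}

// snapshotInc records that one more snapshot iterator is alive.
func (a *allocator) snapshotInc() { a.snapshots++ }

// snapshotDec (assumed name) records that a snapshot iterator was closed.
// Once no snapshot remains, the deferred slots become reusable.
func (a *allocator) snapshotDec() {
	if a.snapshots == 0 {
		return
	}
	a.snapshots--
	if a.snapshots == 0 {
		a.free = append(a.free, a.unused...)
		a.unused = a.unused[:0]
	}
}

// alloc prefers a recycled slot and otherwise hands out a fresh address.
func (a *allocator) alloc() nodeAddr {
	if n := len(a.free); n > 0 {
		addr := a.free[n-1]
		a.free = a.free[:n-1]
		return addr
	}
	addr := a.next
	a.next++
	return addr
}

// freeNode delays the real free while any snapshot is active, so the
// old version of the node stays readable and is not overwritten.
func (a *allocator) freeNode(addr nodeAddr) {
	if a.snapshots > 0 {
		a.unused = append(a.unused, addr)
		return
	}
	a.free = append(a.free, addr)
}
```

With this shape, a node freed between `snapshotInc` and the matching `snapshotDec` is never handed out by `alloc`, so an iterator that still holds its address keeps seeing the old contents.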

Add a counter for

Bug explanation

1. The snapshot iterator scans to node 1.
  │                                 │ 
  │        iterator range           │ 
  ◄─────────────────────────────────► 
  │                                 │ 
  │                                 │ 
  │            ┌────────┐           │ 
  │            │        │           │ 
  │            │  root  │           │ 
  │            │        │           │ 
  │            └────┬───┘           │ 
  │                 │               │ 
  │         ┌───────┴───────┐       │ 
  │         │               │       │ 
  │    ┌────▼───┐      ┌────▼───┐   │ 
  │    │        │      │        │   │ 
  │    │    1   │      │    2   │   │ 
  │    │        │      │        │   │ 
  │    └─────▲──┘      └────────┘   │ 
  │          │                      │ 
  │          │                      │ 
  │          │                      │ 
  │          │                      │ 
  │          │                      │ 
  │    ┌─────┴───┐                  │ 
  │    │ snapshot│                  │ 
  │    │ iterator│                  │ 
  │    └─────────┘                  │ 
  │                                 │ 
2. Node 1 grows to a larger capacity (becoming node 3) due to incoming writes, and the freed node 1 is reused.
 │                                 │                
 │        iterator range           │                
 ◄─────────────────────────────────►                
 │                                 │                
 │                                 │                
 │            ┌────────┐           │                
 │            │        │           │                
 │            │  root  │           │                
 │            │        │           │                
 │            └────┬───┘           │                
 │                 │               │                
 │         ┌───────┴───────┬───────┼─────────┐      
 │         │               │       │         │      
 │    ┌────▼───┐      ┌────▼───┐   │    ┌────▼───┐  
 │    │        │      │        │   │    │        │  
 │    │    3   │      │    2   │   │    │    1   │  
 │    │        │      │        │   │    │        │  
 │    └────────┘      └────────┘   │    └─────▲──┘  
 │                                 │          │     
 │                                 │          │     
 │                                 │          │     
 │                                 │          │     
 │                                 │          │     
 │                                 │    ┌─────┴───┐ 
 │                                 │    │ snapshot│ 
 │                                 │    │ iterator│ 
 │                                 │    └─────────┘ 
 │                                 │                
3. The iterator returns keys outside the given range on the following Next call.
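The three steps can be replayed with a toy arena to show why immediate reuse is dangerous. This is a simplified sketch under stated assumptions: `arena`, `alloc`, `release`, and `staleRead` are hypothetical names, and a node is modeled as a plain slice of keys rather than a real ART node.

```go
package main

// arena is a toy node store; each slot holds the keys of one node.
type arena struct {
	slots [][]string
	free  []int // slot indexes that can be reused immediately
}

// alloc stores keys in a recycled slot if one exists, else in a new slot.
func (a *arena) alloc(keys ...string) int {
	if n := len(a.free); n > 0 {
		id := a.free[n-1]
		a.free = a.free[:n-1]
		a.slots[id] = keys
		return id
	}
	a.slots = append(a.slots, keys)
	return len(a.slots) - 1
}

// release frees a slot right away, with no regard for live iterators.
func (a *arena) release(id int) { a.free = append(a.free, id) }

// staleRead replays the bug and returns what the iterator ends up seeing.
func staleRead() []string {
	a := &arena{}

	// Step 1: the iterator is positioned on node 1.
	node1 := a.alloc("a1", "a2")
	cursor := node1

	// Step 2: a write grows node 1 into a larger node and frees the old slot.
	a.alloc("a1", "a2", "a3")
	a.release(node1)

	// A later write reuses the freed slot for a key outside the range.
	a.alloc("z9")

	// Step 3: the iterator dereferences its stale cursor.
	return a.slots[cursor]
}
```

Here `staleRead` returns the keys of the unrelated node that took over the slot, which mirrors the out-of-range keys described above.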

@ti-chi-bot ti-chi-bot bot added dco-signoff: yes Indicates the PR's author has signed the dco. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 19, 2024
Signed-off-by: you06 <[email protected]>
internal/unionstore/art/art_snapshot.go
@@ -62,7 +63,8 @@ func (t *ART) SnapshotIterReverse(k, lowerBound []byte) *SnapIter {
Iterator: inner,
cp: t.getSnapshot(),
}
for !it.setValue() && it.valid {
it.tree.allocator.snapshotInc()

Ditto.

@ekexium ekexium left a comment

The fix looks OK. I just have some questions:

  1. In your example, which variable in the iterator points to node 1? Is it nodes in baseIter?
  2. Why are we only fixing SnapshotIter? What about the ART Iterator? Is it currently safe in the implementation but not guaranteed by design?
  3. Are there any other internal structure changes that could lead to iterator invalidation, other than the freed nodes? I don't see other cases at first glance. Have you verified this as well?

Signed-off-by: you06 <[email protected]>

you06 commented Nov 19, 2024

  • In your example, which variable in the iterator is pointing to node-1? Is it nodes in baseIter?

Yes, nodes[len(nodes) - 1] is node1 in the example.

  • Why are we only fixing for SnapshotIter? What about the ART Iterator?

The SnapshotIter is defined to always read from the snapshot, so we need to protect it against later writes.

For Iterator, there should be no writes during iteration; otherwise the result makes no sense.

Is it currently safe in implementation but not guaranteed by design?

Yes, I did not expect such usage in TiDB. In the long term, we may deprecate SnapshotIter and replace it with SnapshotScan, which returns all the rows in one call.

  • Are there any other internal structure changes that could lead to iterator invalidation, other than the freed nodes? I don't see other cases at first glance. Have you verified this as well?

I don't see any other cases either. GetSnapshotValue can filter out newly added keys or versions.

@cfzjywxk cfzjywxk left a comment

Rest LGTM

h := db.Staging()
defer db.Release(h)

iter := db.SnapshotIter([]byte{0, 0}, []byte{0, 255})

Better to Close the iter in the test, following the required usage pattern.
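A sketch of why the Close call matters once an active-snapshot counter exists: if Close is what releases the snapshot, a leaked iterator keeps old nodes retained indefinitely. The types below (`counter`, `snapIter`, `newSnapIter`, `scan`) are hypothetical stand-ins, and the assumption that Close decrements the counter is not confirmed by the diff shown here.

```go
package main

// counter is a hypothetical stand-in for the allocator's snapshot count.
type counter struct{ active int }

// snapIter is a hypothetical iterator that pins a snapshot while open.
type snapIter struct {
	c      *counter
	closed bool
}

// newSnapIter registers the iterator as an active snapshot reader.
func newSnapIter(c *counter) *snapIter {
	c.active++
	return &snapIter{c: c}
}

// Close releases the snapshot exactly once, even if called repeatedly.
func (it *snapIter) Close() {
	if it.closed {
		return
	}
	it.closed = true
	it.c.active--
}

// scan shows the usage pattern: open, defer Close, then iterate.
// It returns the number of active snapshots observed while iterating.
func scan(c *counter) int {
	it := newSnapIter(c)
	defer it.Close()
	return c.active
}
```

Pairing every iterator creation with a deferred Close keeps the counter balanced on every return path, including early returns from a failing test assertion.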

@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Nov 19, 2024

ti-chi-bot bot commented Nov 19, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-11-19 14:06:12.450469174 +0000 UTC m=+969934.641338172: ☑️ agreed by cfzjywxk.

ti-chi-bot bot commented Nov 19, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cfzjywxk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the approved label Nov 19, 2024

cfzjywxk commented Nov 19, 2024

@you06
Please also ensure the read-with-write test cases are covered in the PR for tidb repo.
