Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement row group skipping for the default engine parquet readers #362

Merged
merged 33 commits into from
Oct 9, 2024

Conversation

scovich
Copy link
Collaborator

@scovich scovich commented Sep 27, 2024

Previous PR #357 implemented the logic of stats-based skipping for a parquet reader, but in abstract form that doesn't actually depend on parquet footers. With that in place, we can now wire up the kernel default parquet readers to use row group skipping.

Also fixes #380.

Copy link

codecov bot commented Sep 27, 2024

Codecov Report

Attention: Patch coverage is 85.61525% with 83 lines in your changes missing coverage. Please review.

Project coverage is 77.66%. Comparing base (1c4b9ce) to head (4a77f3a).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/engine/parquet_row_group_skipping.rs 52.79% 26 Missing and 50 partials ⚠️
kernel/src/engine/default/parquet.rs 50.00% 5 Missing ⚠️
kernel/src/snapshot.rs 88.88% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #362      +/-   ##
==========================================
+ Coverage   77.06%   77.66%   +0.59%     
==========================================
  Files          47       49       +2     
  Lines        9524    10079     +555     
  Branches     9524    10079     +555     
==========================================
+ Hits         7340     7828     +488     
- Misses       1790     1805      +15     
- Partials      394      446      +52     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

let log_iter = self.snapshot.log_segment.replay(
engine,
commit_read_schema,
checkpoint_read_schema,
self.predicate.clone(),
Copy link
Collaborator Author

@scovich scovich Sep 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: This was an existing bug -- passing a query-level filter to the metadata file reads.

(We probably should add a meta-skipping unit test for replay?)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opened #380

Copy link
Collaborator Author

@scovich scovich Oct 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already done -- #381 added three new replay_for_XXX tests, which this PR updates to account for the expected row group skipping.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah sorry! thanks!!

.map_ok(|batch| (batch, true));

let parquet_client = engine.get_parquet_handler();
// TODO change predicate to: predicate AND add.path not null
Copy link
Collaborator Author

@scovich scovich Sep 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed these two TODO because the P&M query also invokes this code path, and filtering by add.path NOT NULL is just plain incorrect. Also, see the code comment at the other replay call site for why it doesn't make sense to pass add.path IS NOT NULL as a row group skipping filter.

kernel/tests/read.rs Outdated Show resolved Hide resolved
kernel/src/engine/parquet_row_group_skipping.rs Outdated Show resolved Hide resolved
kernel/tests/read.rs Outdated Show resolved Hide resolved
kernel/tests/read.rs Outdated Show resolved Hide resolved
@scovich scovich removed the merge hold Don't allow the PR to merge label Oct 7, 2024
Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks. Just a couple of small nits

ffi/src/engine_funcs.rs Outdated Show resolved Hide resolved
kernel/src/engine/parquet_row_group_skipping.rs Outdated Show resolved Hide resolved
kernel/src/engine/parquet_row_group_skipping.rs Outdated Show resolved Hide resolved
kernel/tests/read.rs Outdated Show resolved Hide resolved
Some(self.get_stats(col)?.null_count_opt()? as i64)
let nullcount = match self.get_stats(col) {
Some(s) => s.null_count_opt()? as i64,
None => self.get_rowcount_stat_value(),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a new find, exposed on accident by me hacking two more parts into the checkpoint so we could test transaction app id filtering (the "checkpoint" schema was truncated, which prevented the P&M query from skipping those parts)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the statistics() method on ColumnChunkMetadata returns None, that just means that there are no stats for that column, but doesn't necessarily imply that all values are null does it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, good catch. I didn't put the check deep enough. There are three levels of None here:

  • The column chunk doesn't even exist (infer nullcount = rowcount)
  • The column chunk doesn't have stats (should not infer anything clever)
  • The stats object lacks a particular stat

To make things even more "fun", we have the following warning in Statistics::null_count_opt 🤦:

this API returns Some(0) even if the null count was not present in the statistics

So I have two problems to work around now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both fixed.

kernel/src/expressions/mod.rs Outdated Show resolved Hide resolved
kernel/src/scan/mod.rs Show resolved Hide resolved
kernel/src/snapshot.rs Outdated Show resolved Hide resolved
// txn.appId BINARY 0 "3ae45b72-24e1-865a-a211-3..." / "3ae45b72-24e1-865a-a211-3..."
// txn.version INT64 0 "4390" / "4390"
#[test]
fn test_replay_for_metadata() {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An accidentally clever test :P
(see other comment)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

// checkpoint part when patitioned by `add.path` like the Delta spec requires. There's no
// point filtering by a particular app id, even if we have one, because people usually query
// for app ids that exist.
let meta_predicate = Expr::column("txn.appId").is_not_null();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code didn't previously attempt row group skipping for app ids. Now it does.

Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given the size of this PR i'll say this is a best-effort review but LGTM

let log_iter = self.snapshot.log_segment.replay(
engine,
commit_read_schema,
checkpoint_read_schema,
self.predicate.clone(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opened #380

Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a question about correctness, but otherwise lgtm

Some(self.get_stats(col)?.null_count_opt()? as i64)
let nullcount = match self.get_stats(col) {
Some(s) => s.null_count_opt()? as i64,
None => self.get_rowcount_stat_value(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the statistics() method on ColumnChunkMetadata returns None, that just means that there are no stats for that column, but doesn't necessarily imply that all values are null does it?

// txn.appId BINARY 0 "3ae45b72-24e1-865a-a211-3..." / "3ae45b72-24e1-865a-a211-3..."
// txn.version INT64 0 "4390" / "4390"
#[test]
fn test_replay_for_metadata() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

scovich added a commit that referenced this pull request Oct 9, 2024
In preparation for
#362 that
actually implements parquet row group skipping, here we make various
preparatory changes that can stand on their own:
* Plumb the predicates through to the parquet readers, so that they can
easily start using them
* Add and use a new `Expression::is_not_null` helper that does what it
says
* Factor out `replay_for_XXX` methods, so that log replay involving
push-down predicates can be tested independently.
* Don't involve <n>.json in log replay if <n>.checkpoint.parquet is
available

This should make both changes easier to review.
@scovich scovich requested a review from nicklan October 9, 2024 03:37
Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! two small suggestions that you can take or leave as you prefer.

!matches!(result, Some(false))
}

/// Returns `None` if the column doesn't exist and `Some(None)` if the column has no stats.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could define an enum for this rather than Some(None). Just a thought, I'm okay with both ways, and the Some(None) will have cleaner code in the method (at the cost of a more confusing return type)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh nice i think i commented on this too - maybe just Result? and we can have Err(MissingColumn) for a more understandable return type?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I avoided Result because that would be exceptions as control flow (this isn't actually an error, it's just a situation).

If this were public code, the enum might make sense. But for a private method, I don't think it's worth the cognitive overhead (both to define and use it) when one line of code comment can fully explain what's going on?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think it's fine as is since both the method and the call site have a comment.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sgtm!

kernel/src/engine/parquet_row_group_skipping.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another stamp with a few more questions :)

kernel/src/engine/parquet_row_group_skipping.rs Outdated Show resolved Hide resolved
kernel/src/engine/parquet_row_group_skipping.rs Outdated Show resolved Hide resolved
kernel/src/scan/mod.rs Show resolved Hide resolved
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to introduce testing for the E2E stats to skipping path? this test suite does all of the stats part it seems - are we relying on existing tests to make sure the remaining skipping based on these stats is correct?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's what the test in read.rs was doing before I deleted it... should I reinstate?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we think E2E path is covered without it? if not then I'm not opposed to one more test :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think test_data_row_group_skipping should cover it pretty well? It plumbs the predicate through starting from the scan builder. The only remaining coverage would be FFI and the toy table reader -- neither of which has any predicates that they could pass. If/when those get updated to support predicates at all, they should be able to test that the predicates actually work.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM!

@scovich scovich merged commit 42afc06 into delta-io:main Oct 9, 2024
13 checks passed
@scovich scovich deleted the use-row-group-skipping branch November 8, 2024 21:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add a meta-skipping unit test for log replay
5 participants