-
Notifications
You must be signed in to change notification settings - Fork 794
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Parquet RowSelection benchmark #6623
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly LGTM, thank you for building this!
parquet/benches/row_selector.rs
Outdated
let bools: Vec<bool> = (0..total_rows) | ||
.map(|_| rng.gen_bool(selection_ratio)) | ||
.collect(); | ||
let boolean_array = BooleanArray::from(bools); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like if we change generate_random_row_selection
like this:
fn generate_random_row_selection(total_rows: usize, selection_ratio: f64) -> RowSelection {
let mut rng = rand::thread_rng();
let bools: Vec<bool> = (0..total_rows)
.map(|_| rng.gen_bool(selection_ratio))
.collect();
let boolean_array = BooleanArray::from(bools);
- RowSelection::from_filters(&[boolean_array])
}
We can save duplicated code here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite get this, can you elaborate a bit more?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like this:
fn generate_random_boolean_array(total_rows: usize, selection_ratio: f64) -> BooleanArray {
let mut rng = rand::thread_rng();
let bools: Vec<bool> = (0..total_rows)
.map(|_| rng.gen_bool(selection_ratio))
.collect();
BooleanArray::from(bools)
}
// Generate two random RowSelections with approximately 1/3 of the rows selected.
let row_selection_a = RowSelection::from_filters(generate_random_boolean_array(total_rows, selection_ratio));
let row_selection_b = RowSelection::from_filters(generate_random_boolean_array(total_rows, selection_ratio));
// Benchmark the intersection of the two RowSelections.
c.bench_function("intersection", |b| {
b.iter(|| {
let intersection = row_selection_a.intersection(&row_selection_b);
criterion::black_box(intersection);
})
});
c.bench_function("union", |b| {
b.iter(|| {
let union = row_selection_a.union(&row_selection_b);
criterion::black_box(union);
})
});
c.bench_function("from_filters", |b| {
let boolean_array = generate_random_boolean_array(total_rows, selection_ratio);
b.iter(|| {
let array = boolean_array.clone();
let selection = RowSelection::from_filters(&[array]);
criterion::black_box(selection);
})
})
No a big issue, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see, makes sense!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you @XiangpengHao for building this!
Thank you for this, I'm sure you're aware and what you're trying to empirically demonstrate, but RowSelection is not designed for highly non-contiguous, e.g. random selections. It might be worth adding some benchmarks of long contiguous selections, as might arise when filtering sorted data |
🫶
yes, I think this is what @XiangpengHao is considering improving
I agree adding benchmarks for the case where RowSelection already does well would be valuable (to ensure we don't introduce regressions) |
Which issue does this PR close?
Part of #5523
Rationale for this change
As the first step of measure-then-build, we add some benchmarks.
The benchmark has 300_000 rows, and the selector will select 1/3 of the rows, this roughly matches with the
SearchPhase <> ''
predicate in many ClickBench queries.I added
intersection
,union
,from_filters
andand_then
because they are the most pronounced ones in the flamegraph.What changes are included in this PR?
Are there any user-facing changes?