Add Parquet RowSelection benchmark #6623

XiangpengHao · 2024-10-24T15:13:31Z

Which issue does this PR close?

Part of #5523

Rationale for this change

As the first step of measure-then-build, we add some benchmarks.

The benchmark has 300_000 rows, and the selector selects 1/3 of them. This roughly matches the SearchPhrase <> '' predicate in many ClickBench queries.

I added intersection, union, from_filters and and_then because they are the most pronounced ones in the flamegraph.
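As a rough illustration of what three of these operations compute, here is a minimal sketch on plain boolean masks (one entry per row, true meaning selected). It uses only the standard library and is not the parquet crate's run-length implementation; the function names merely mirror the RowSelection methods, and the equal-length checks are an assumption of this sketch.

```rust
/// Rows kept by both masks (mask-level view of `intersection`).
fn intersection(a: &[bool], b: &[bool]) -> Vec<bool> {
    assert_eq!(a.len(), b.len(), "masks must cover the same rows");
    a.iter().zip(b).map(|(&x, &y)| x && y).collect()
}

/// Rows kept by at least one mask (mask-level view of `union`).
fn union(a: &[bool], b: &[bool]) -> Vec<bool> {
    assert_eq!(a.len(), b.len(), "masks must cover the same rows");
    a.iter().zip(b).map(|(&x, &y)| x || y).collect()
}

/// Mask-level view of `and_then`: `second` has one entry per row that
/// `first` selected, and the result is expressed over the original rows.
fn and_then(first: &[bool], second: &[bool]) -> Vec<bool> {
    let selected = first.iter().filter(|&&keep| keep).count();
    assert_eq!(selected, second.len(), "second must cover the rows first selects");
    let mut inner = second.iter();
    first
        .iter()
        .map(|&keep| {
            if keep {
                // Consume the next decision only for rows `first` kept.
                *inner.next().expect("length checked above")
            } else {
                false
            }
        })
        .collect()
}
```

The difference between `intersection` and `and_then` is the coordinate system of the second argument: original rows for the former, already-selected rows for the latter.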

What changes are included in this PR?

Are there any user-facing changes?

Xuanwo

Mostly LGTM, thank you for building this!

Xuanwo · 2024-10-24T15:28:16Z

parquet/benches/row_selector.rs

+        let bools: Vec<bool> = (0..total_rows)
+            .map(|_| rng.gen_bool(selection_ratio))
+            .collect();
+        let boolean_array = BooleanArray::from(bools);


It looks like if we change generate_random_row_selection like this:

fn generate_random_row_selection(total_rows: usize, selection_ratio: f64) -> RowSelection {
    let mut rng = rand::thread_rng();
    let bools: Vec<bool> = (0..total_rows)
        .map(|_| rng.gen_bool(selection_ratio))
        .collect();
    let boolean_array = BooleanArray::from(bools);
-    RowSelection::from_filters(&[boolean_array])
}

We can save duplicated code here.

I don't quite get this, can you elaborate a bit more?

Something like this:

fn generate_random_boolean_array(total_rows: usize, selection_ratio: f64) -> BooleanArray {
    let mut rng = rand::thread_rng();
    let bools: Vec<bool> = (0..total_rows)
        .map(|_| rng.gen_bool(selection_ratio))
        .collect();
    BooleanArray::from(bools)
}

// Generate two random RowSelections with approximately 1/3 of the rows selected.
// from_filters takes a slice of BooleanArray, hence the &[...] wrapper.
let row_selection_a =
    RowSelection::from_filters(&[generate_random_boolean_array(total_rows, selection_ratio)]);
let row_selection_b =
    RowSelection::from_filters(&[generate_random_boolean_array(total_rows, selection_ratio)]);

// Benchmark the intersection of the two RowSelections.
c.bench_function("intersection", |b| {
    b.iter(|| {
        let intersection = row_selection_a.intersection(&row_selection_b);
        criterion::black_box(intersection);
    })
});

c.bench_function("union", |b| {
    b.iter(|| {
        let union = row_selection_a.union(&row_selection_b);
        criterion::black_box(union);
    })
});

c.bench_function("from_filters", |b| {
    let boolean_array = generate_random_boolean_array(total_rows, selection_ratio);
    b.iter(|| {
        let array = boolean_array.clone();
        let selection = RowSelection::from_filters(&[array]);
        criterion::black_box(selection);
    })
});

Not a big issue, though.

Oh I see, makes sense!

Xuanwo

LGTM, thank you @XiangpengHao for building this!

tustvold · 2024-10-25T08:59:48Z

Thank you for this. I'm sure you're aware, and it may be what you're trying to empirically demonstrate, but RowSelection is not designed for highly non-contiguous, e.g. random selections. It might be worth adding some benchmarks of long contiguous selections, as might arise when filtering sorted data.
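To make the point about non-contiguous selections concrete, here is a minimal sketch that run-length encodes a boolean mask and compares how many runs result from a scattered pattern versus a contiguous one at the same 1/3 selectivity. The `Run` type and all function names are hypothetical and only loosely modeled on a run-length representation; this is not the parquet crate's code, and the scattered pattern is a deterministic stand-in for a random one.

```rust
/// A run of consecutive rows that are all selected or all skipped.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Run {
    skip: bool,
    row_count: usize,
}

/// Run-length encodes a mask where `true` means "row is selected".
fn runs_from_mask(mask: &[bool]) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    for &selected in mask {
        let skip = !selected;
        if let Some(last) = runs.last_mut() {
            if last.skip == skip {
                // Same kind as the previous row: extend the current run.
                last.row_count += 1;
                continue;
            }
        }
        // First row, or the kind changed: start a new run.
        runs.push(Run { skip, row_count: 1 });
    }
    runs
}

/// Every third row selected: 1/3 selectivity, maximally scattered.
fn scattered_mask(total_rows: usize) -> Vec<bool> {
    (0..total_rows).map(|i| i % 3 == 0).collect()
}

/// First third of the rows selected: same selectivity, one block.
fn contiguous_mask(total_rows: usize) -> Vec<bool> {
    (0..total_rows).map(|i| i < total_rows / 3).collect()
}
```

With 300_000 rows, the scattered mask encodes to 200_000 runs while the contiguous mask encodes to 2, so any operation whose cost grows with the number of runs behaves very differently on the two inputs.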

alamb · 2024-10-25T13:29:30Z

🫶

but RowSelection is not designed for highly non-contiguous, e.g. random selections.

yes, I think this is what @XiangpengHao is considering improving

It might be worth adding some benchmarks of long contiguous selections, as might arise when filtering sorted data

I agree adding benchmarks for the case where RowSelection already does well would be valuable (to ensure we don't introduce regressions)
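One possible way to produce input for such a benchmark is sketched below: a generator that alternates fixed-length selected and skipped blocks, so selectivity stays at 1/3 while the selection remains contiguous within each block. The names `blocky_mask` and `selectivity` and the block lengths are hypothetical choices, not part of this PR.

```rust
/// Builds a mask that alternates `select_len` selected rows with
/// `skip_len` skipped rows. Hypothetical helper for illustration.
fn blocky_mask(total_rows: usize, select_len: usize, skip_len: usize) -> Vec<bool> {
    let period = select_len + skip_len;
    assert!(period > 0, "block lengths must not both be zero");
    (0..total_rows).map(|row| row % period < select_len).collect()
}

/// Fraction of rows the mask selects (0.0 for an empty mask).
fn selectivity(mask: &[bool]) -> f64 {
    if mask.is_empty() {
        return 0.0;
    }
    let selected = mask.iter().filter(|&&keep| keep).count();
    selected as f64 / mask.len() as f64
}
```

For example, `blocky_mask(300_000, 1_000, 2_000)` keeps the 300_000 rows and 1/3 selectivity of the existing benchmark; the resulting `Vec<bool>` could then go through `BooleanArray::from` and `RowSelection::from_filters` in the same way the random input does.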

add benchmark

5b565cf

github-actions bot added the parquet label (Changes to the parquet crate) Oct 24, 2024

add and_then benchmark

f069274

XiangpengHao changed the title ~~Add RowSelection benchmark~~ Add Parquet RowSelection benchmark Oct 24, 2024

fix ci

8472b0a

Xuanwo approved these changes Oct 24, 2024

update bench

4bc92c8

Xuanwo approved these changes Oct 24, 2024

tustvold merged commit 56525ef into apache:master Oct 25, 2024
17 checks passed

XiangpengHao deleted the row-selector2 branch October 25, 2024 15:23
