Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Parquet RowSelection benchmark #6623

Merged
merged 4 commits into from
Oct 25, 2024
Merged

Conversation

XiangpengHao
Copy link
Contributor

@XiangpengHao XiangpengHao commented Oct 24, 2024

Which issue does this PR close?

Part of #5523

Rationale for this change

As the first step of measure-then-build, we add some benchmarks.

The benchmark has 300_000 rows, and the selector will select 1/3 of the rows, this roughly matches with the SearchPhase <> '' predicate in many ClickBench queries.

I added intersection, union, from_filters and and_then because they are the most pronounced ones in the flamegraph.

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Oct 24, 2024
@XiangpengHao XiangpengHao changed the title Add RowSelection benchmark Add Parquet RowSelection benchmark Oct 24, 2024
Copy link
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM, thank you for building this!

let bools: Vec<bool> = (0..total_rows)
.map(|_| rng.gen_bool(selection_ratio))
.collect();
let boolean_array = BooleanArray::from(bools);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like if we change generate_random_row_selection like this:

fn generate_random_row_selection(total_rows: usize, selection_ratio: f64) -> RowSelection {
    let mut rng = rand::thread_rng();
    let bools: Vec<bool> = (0..total_rows)
        .map(|_| rng.gen_bool(selection_ratio))
        .collect();
    let boolean_array = BooleanArray::from(bools);
-    RowSelection::from_filters(&[boolean_array])
}

We can save duplicated code here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite get this, can you elaborate a bit more?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like this:

fn generate_random_boolean_array(total_rows: usize, selection_ratio: f64) -> BooleanArray {
    let mut rng = rand::thread_rng();
    let bools: Vec<bool> = (0..total_rows)
        .map(|_| rng.gen_bool(selection_ratio))
        .collect();
    BooleanArray::from(bools)
}

// Generate two random RowSelections with approximately 1/3 of the rows selected.
let row_selection_a = RowSelection::from_filters(generate_random_boolean_array(total_rows, selection_ratio));
let row_selection_b = RowSelection::from_filters(generate_random_boolean_array(total_rows, selection_ratio));

// Benchmark the intersection of the two RowSelections.
c.bench_function("intersection", |b| {
    b.iter(|| {
        let intersection = row_selection_a.intersection(&row_selection_b);
        criterion::black_box(intersection);
    })
});

c.bench_function("union", |b| {
    b.iter(|| {
        let union = row_selection_a.union(&row_selection_b);
        criterion::black_box(union);
    })
});

c.bench_function("from_filters", |b| {
    let boolean_array = generate_random_boolean_array(total_rows, selection_ratio);
    b.iter(|| {
            let array = boolean_array.clone();
            let selection = RowSelection::from_filters(&[array]);
            criterion::black_box(selection);
    })
})

No a big issue, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, makes sense!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated!

Copy link
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you @XiangpengHao for building this!

@tustvold tustvold merged commit 56525ef into apache:master Oct 25, 2024
17 checks passed
@tustvold
Copy link
Contributor

Thank you for this, I'm sure you're aware and what you're trying to empirically demonstrate, but RowSelection is not designed for highly non-contiguous, e.g. random selections. It might be worth adding some benchmarks of long contiguous selections, as might arise when filtering sorted data

@alamb
Copy link
Contributor

alamb commented Oct 25, 2024

🫶

but RowSelection is not designed for highly non-contiguous, e.g. random selections.

yes, I think this is what @XiangpengHao is considering improving

It might be worth adding some benchmarks of long contiguous selections, as might arise when filtering sorted data

I agree adding benchmarks for the case where RowSelection already does well would be valuable (to ensure we don't introduce regressions)

@XiangpengHao XiangpengHao deleted the row-selector2 branch October 25, 2024 15:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants