fix parquet infer statistics for BinaryView types #12575

Merged
merged 1 commit on Sep 22, 2024

XiangpengHao
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

Currently, if the file/stats format is BinaryArray and we attempt to coerce it into BinaryViewArray, the stats will be NULL because of a type mismatch.

This PR fixes it.
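The failure mode can be illustrated with a toy model. This is a minimal sketch using only the Rust standard library, not the code changed in this PR: `BinaryStats`, `BinaryViewStats`, `min_as_view_strict`, and `min_as_view_coercing` are hypothetical stand-ins, and the assumption is that the lookup fails because the stats arrive in a different concrete type than the one requested.

```rust
use std::any::Any;

// Hypothetical stand-ins for statistics stored as BinaryArray (file type)
// and BinaryViewArray (table type). They are NOT the Arrow types.
#[derive(Debug)]
struct BinaryStats(Vec<Vec<u8>>);

#[derive(Debug)]
struct BinaryViewStats(Vec<Vec<u8>>);

// Strict lookup: only accepts stats that are already in the view type.
// Stats read in the plain binary type fail the downcast, so the result
// is None (the "NULL stats" symptom described above).
fn min_as_view_strict(stats: &dyn Any) -> Option<Vec<u8>> {
    let view = stats.downcast_ref::<BinaryViewStats>()?;
    view.0.iter().min().cloned()
}

// Coercing lookup: if the stats are in the plain binary type, convert
// them to the view type first, then compute the minimum.
fn min_as_view_coercing(stats: &dyn Any) -> Option<Vec<u8>> {
    if let Some(view) = stats.downcast_ref::<BinaryViewStats>() {
        return view.0.iter().min().cloned();
    }
    let plain = stats.downcast_ref::<BinaryStats>()?;
    let view = BinaryViewStats(plain.0.clone());
    view.0.iter().min().cloned()
}
```

With stats held as `BinaryStats`, the strict lookup yields `None` while the coercing lookup returns the lexicographically smallest value, which is the behavior the fix aims for.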

What changes are included in this PR?

Related to #12092 and #6906

In particular, consider this query (simplified Q21 from clickbench):

SELECT MIN("URL") FROM hits;

DataFusion is able to completely remove ParquetExec based on the statistics, as shown in the physical plan below:

Q21: SELECT MIN("URL") FROM hits;
=== Optimized logical plan ===
Aggregate: groupBy=[[]], aggr=[[min(hits.URL)]]
  TableScan: hits projection=[URL]

=== Physical plan with metrics ===
ProjectionExec: expr=[ as min(hits.URL)], metrics=[output_rows=1, elapsed_compute=6.662µs]
  PlaceholderRowExec, metrics=[]

Without proper statistics, we will need to scan the entire column, which explains the slowdown.
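The effect of missing statistics on the plan can be sketched as follows. This is a simplified illustration under stated assumptions, not DataFusion's optimizer: `Plan` and `plan_min` are hypothetical names, and the minimum statistic is modeled as a plain `Option`.

```rust
// Hypothetical plan shapes: either answer from statistics alone
// (comparable to the PlaceholderRowExec plan above) or scan the column.
#[derive(Debug, PartialEq)]
enum Plan {
    Constant(Vec<u8>),
    FullScan,
}

// Choose a plan for `SELECT MIN(col)`: a known minimum statistic lets the
// scan be skipped; a missing (NULL) statistic forces a full column scan.
fn plan_min(min_stat: Option<&[u8]>) -> Plan {
    match min_stat {
        Some(value) => Plan::Constant(value.to_vec()),
        None => Plan::FullScan,
    }
}
```

In this model, a NULL statistic caused by the type mismatch would send the query down the `FullScan` branch even though the file metadata already holds the answer.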

Are these changes tested?

Are there any user-facing changes?

Contributor

@alamb alamb left a comment

Thank you @XiangpengHao and @dmitrybugakov -- onwards 🚀

@alamb alamb merged commit 64a3896 into apache:main Sep 22, 2024
24 checks passed
bgjackma pushed a commit to bgjackma/datafusion that referenced this pull request Sep 25, 2024
Labels
core Core DataFusion crate
3 participants