parquet::column::reader::GenericColumnReader::skip_records still decompresses most data #6454
Comments
Have you enabled the page index?
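For context, a minimal sketch of what enabling the page index looks like with the Arrow reader (the file path is illustrative):

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?; // illustrative path

    // Load the page/offset index from the footer so that row skipping
    // can seek past whole pages instead of decompressing them.
    let options = ArrowReaderOptions::new().with_page_index(true);
    let reader = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?
        .build()?;

    for batch in reader {
        println!("read {} rows", batch?.num_rows());
    }
    Ok(())
}
```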
Indeed. Or have you enabled v2 page headers? The issue seems to be that when skipping rows (...) I don't think pages are decompressed twice... it's just a result of the two paths through (...)
I think the documentation on https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowFilter.html is also instructive (a usage sketch follows below). Even if all the decoding is working properly, I think the arrow reader may well decode certain pages twice. It is one of my theories about why pushing filters down doesn't always make things faster, but I have not had time to look into it in more detail.
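A hedged sketch of what such a RowFilter looks like (the file path, column index, and predicate are illustrative; assumes an INT64 leaf column 0):

```rust
use std::fs::File;

use arrow::array::Int64Array;
use arrow::compute::kernels::cmp::gt;
use parquet::arrow::arrow_reader::{ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter};
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?; // illustrative path
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // The predicate decodes only leaf column 0; pages whose rows survive
    // the filter are decoded again when the output projection is
    // materialized, which is one way the same page can be decoded twice.
    let mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);
    let predicate = ArrowPredicateFn::new(mask, |batch| {
        gt(batch.column(0), &Int64Array::new_scalar(10))
    });

    let reader = builder
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .build()?;

    for batch in reader {
        println!("kept {} rows", batch?.num_rows());
    }
    Ok(())
}
```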
See also #5523. Although I suspect in this case the issue is a lack of page index information for whatever reason.
Describe the bug
I noticed this while investigating apache/datafusion#7845 (comment).
The suggestion from @jayzhan211 and @alamb was that setting `datafusion.execution.parquet.pushdown_filters` to `true` should improve performance of queries like this, but it seems to make them slower. I think the reason is that data is being decompressed twice (or data is being decompressed that shouldn't be); here's a screenshot from samply running on this code:
(You can view this flamegraph properly here)
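For reference, a minimal sketch of how that option is enabled programmatically (the table name, file path, and query are illustrative; assumes the `datafusion` and `tokio` crates):

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Off by default; pushes row filters down into the Parquet scan.
    let config = SessionConfig::new()
        .set_bool("datafusion.execution.parquet.pushdown_filters", true);
    let ctx = SessionContext::new_with_config(config);

    // Illustrative table and query.
    ctx.register_parquet("t", "data.parquet", ParquetReadOptions::default())
        .await?;
    ctx.sql("SELECT count(*) FROM t WHERE name = 'x'")
        .await?
        .show()
        .await?;
    Ok(())
}
```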
You can see that there are two blocks of decompression work; the second one is associated with `parquet::column::reader::GenericColumnReader::skip_records` and happens after the first decompression chunk, once running the query has completed. In particular you can see that there's a `read_new_page()` call in `parquet::column::reader::GenericColumnReader::skip_records` (line 335) that's taking a lot of time. My question is: could this second round of decompression be avoided?
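To make the call site concrete, here is a minimal sketch of driving `skip_records` through the low-level column reader (the file path, column index, and INT64 physical type are assumptions):

```rust
use std::fs::File;

use parquet::column::reader::get_typed_column_reader;
use parquet::data_type::Int64Type;
use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?; // illustrative path
    let reader = SerializedFileReader::new(file)?;
    let row_group = reader.get_row_group(0)?;

    // Typed reader over the first column; without page index information,
    // skip_records may still call read_new_page() to find page boundaries.
    let mut col = get_typed_column_reader::<Int64Type>(row_group.get_column_reader(0)?);
    let skipped = col.skip_records(1_000)?;
    println!("skipped {skipped} records");
    Ok(())
}
```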
To Reproduce
Clone https://github.com/samuelcolvin/batson-perf, comment out one of the modes, compile with profiling enabled (`cargo build --profile profiling`), then run under samply (`samply record ./target/profiling/batson-perf`).
Expected behavior
I would expect `datafusion.execution.parquet.pushdown_filters = true` to be faster; I think the reason it's not is that the data is decompressed twice.
Additional context
apache/datafusion#7845 (comment)