
parquet::column::reader::GenericColumnReader::skip_records still decompresses most data #6454

Open · samuelcolvin opened this issue Sep 25, 2024 · 4 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@samuelcolvin

Describe the bug

I noticed this while investigating apache/datafusion#7845 (comment).

The suggestion from @jayzhan211 and @alamb was that setting datafusion.execution.parquet.pushdown_filters to true should improve the performance of queries like this, but it seems to make them slower.
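For reference, a minimal sketch (not from the issue, and assuming a recent DataFusion version) of enabling that setting from Rust; the SQL equivalent is SET datafusion.execution.parquet.pushdown_filters = true:

```rust
use datafusion::prelude::{SessionConfig, SessionContext};

fn main() {
    // Enable pushing filters down into the Parquet decoder (off by default).
    let config = SessionConfig::new()
        .set_bool("datafusion.execution.parquet.pushdown_filters", true);
    let _ctx = SessionContext::new_with_config(config);
    // ...register the Parquet file and run the query as usual.
}
```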

I think the reason is that data is being decompressed twice (or data is being decompressed that shouldn't be). Here's a screenshot from samply running on this code:

[image: samply flamegraph]

(You can view this flamegraph properly here)

You can see that there are two blocks of decompression work; the second is associated with parquet::column::reader::GenericColumnReader::skip_records and happens after the first chunk of decompression and after running the query has completed.

In particular you can see that there's a read_new_page() call in parquet::column::reader::GenericColumnReader::skip_records (line 335) that's taking a lot of time:

[image: flamegraph detail showing read_new_page() in skip_records]

My question is: could this second round of decompression be avoided?

To Reproduce

Clone https://github.com/samuelcolvin/batson-perf, comment out one of the modes, compile with profiling enabled (cargo build --profile profiling), then run under samply (samply record ./target/profiling/batson-perf).

Expected behavior

I would expect datafusion.execution.parquet.pushdown_filters = true to be faster; I think the reason it's not is that the data is being decompressed twice.

Additional context

apache/datafusion#7845 (comment)

@tustvold

Have you enabled the page index?

@tustvold added the enhancement label and removed the bug label on Sep 25, 2024
@etseidl commented Sep 25, 2024

> Have you enabled the page index?

Indeed. Or have you enabled V2 page headers? The issue seems to be that when skipping rows (skip_records defines a record as rep_level == 0, i.e. a row), the number of rows per page is not known in advance, so to figure out the number of levels to skip, the repetition levels need to be decoded for every page. For V1 pages, unfortunately, the level information is compressed along with the page data, so the entire page needs decompressing to calculate the number of rows. If either V2 page headers or the page index were enabled, the number of rows per page would be known without any decompression, so entire pages could be skipped with very little effort (the continue at L330 above).
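For illustration, a rough sketch (not from this issue; the file name is a placeholder) of the two options mentioned above using the parquet crate's Arrow APIs: asking the reader to load the page index, and writing files with V2 data page headers:

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};
use parquet::file::properties::{WriterProperties, WriterVersion};

fn main() -> parquet::errors::Result<()> {
    // Reading: load the page index so skip_records can skip whole pages
    // using the offset index instead of decompressing them to count rows.
    let file = File::open("data.parquet")?;
    let options = ArrowReaderOptions::new().with_page_index(true);
    let reader = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?
        .build()?;
    for batch in reader {
        let _batch = batch?;
    }

    // Writing: V2 data page headers record the number of rows per page (and
    // keep the level data uncompressed), so readers can skip pages cheaply.
    let _props = WriterProperties::builder()
        .set_writer_version(WriterVersion::PARQUET_2_0)
        .build();

    Ok(())
}
```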

I don't think pages are decompressed twice...it's just a result of the two paths through ParquetRecordBatchReader::next (call skip_records until enough records have been skipped, then switch over to read_records).

@alamb commented Sep 25, 2024

I think the documentation on https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowFilter.html is also instructive. Even if all the decoding is working properly, I think the arrow reader may well decode certain pages twice. It is one of my theories about why pushing filters down doesn't always make things faster, but I have not had time to look into it in more detail.
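For context, a hedged sketch (not from the issue; the file and predicate column are hypothetical, and it assumes the arrow crate is also available) of the RowFilter API described in that documentation. The filter column is decoded once to evaluate the predicate and may be decoded again when producing the output batches, which is the potential double decode referred to above:

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::{ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter};
use parquet::arrow::ProjectionMask;

fn main() -> parquet::errors::Result<()> {
    let file = File::open("data.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Predicate that only reads leaf column 0 and keeps non-null rows.
    let mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);
    let predicate = ArrowPredicateFn::new(mask, |batch| {
        arrow::compute::is_not_null(batch.column(0))
    });

    // Pages of column 0 are decoded to evaluate the predicate, and may be
    // decoded again if column 0 is also part of the final projection.
    let reader = builder
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .build()?;
    for batch in reader {
        let _batch = batch?;
    }
    Ok(())
}
```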

@tustvold

See also #5523

Although I suspect in this case the issue is a lack of page index information for whatever reason
