Add ParquetMetaDataBuilder
#6466
Conversation
Force-pushed from 1d78ac4 to a2818bf.
If you are planning to eventually deprecate the non-builder ParquetMetaData::new*, then this would also need to be switched to the builder.
let mut filtered_row_groups = Vec::<RowGroupMetaData>::new();
for (i, rg_meta) in row_groups.into_iter().enumerate() {

// Filter row groups based on the predicates
I think the cleanup of this code (which is modifying the ParquetMetaData) is the best example of why having this API makes sense -- it makes one fewer copy, and I think it is also quite a bit clearer.
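The copy saving comes from taking the row groups out of the builder by value, filtering them by move, and setting the survivors back. The following is a self-contained sketch of that pattern, not the crate's actual API: MetaDataBuilder, take_row_groups, set_row_groups, and the String stand-in for RowGroupMetaData are all hypothetical simplifications.

```rust
// Sketch (not the crate's real API) of filtering row groups without clones:
// take ownership, filter by move, put the survivors back.
#[derive(Debug)]
struct MetaDataBuilder {
    row_groups: Vec<String>, // stand-in for Vec<RowGroupMetaData>
}

impl MetaDataBuilder {
    /// Takes ownership of the row groups, leaving the builder's list empty.
    fn take_row_groups(&mut self) -> Vec<String> {
        std::mem::take(&mut self.row_groups)
    }

    /// Replaces the row groups wholesale.
    fn set_row_groups(mut self, row_groups: Vec<String>) -> Self {
        self.row_groups = row_groups;
        self
    }
}

/// Keeps only row groups passing a (hypothetical) predicate,
/// moving each surviving element instead of cloning it.
fn filter_builder(mut builder: MetaDataBuilder) -> MetaDataBuilder {
    let filtered: Vec<String> = builder
        .take_row_groups()
        .into_iter()
        .filter(|rg| rg.as_str() != "drop")
        .collect();
    builder.set_row_groups(filtered)
}

fn main() {
    let builder = MetaDataBuilder {
        row_groups: vec!["rg0".into(), "drop".into(), "rg1".into()],
    };
    let builder = filter_builder(builder);
    println!("{:?}", builder.row_groups); // ["rg0", "rg1"]
}
```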
}

/// Adds a row group to the metadata
pub fn add_row_group(mut self, row_group: RowGroupMetaData) -> Self {
These methods follow the existing convention of add and set used by the other Builders in this crate, such as
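A minimal illustration of that naming convention, with made-up field names and simplified stand-in types (this is not the crate's actual builder): set_* replaces a field wholesale, while add_* appends one element to a collection.

```rust
// Hypothetical mini-builder showing the `set_*` / `add_*` convention.
#[derive(Debug, Default)]
struct Builder {
    num_rows: i64,
    row_groups: Vec<String>, // stand-in for Vec<RowGroupMetaData>
}

impl Builder {
    /// `set_*`: replace the field with the given value.
    fn set_num_rows(mut self, num_rows: i64) -> Self {
        self.num_rows = num_rows;
        self
    }

    /// `add_*`: append a single element to a collection field.
    fn add_row_group(mut self, row_group: String) -> Self {
        self.row_groups.push(row_group);
        self
    }
}

fn main() {
    let b = Builder::default()
        .set_num_rows(42)
        .add_row_group("rg0".to_string())
        .add_row_group("rg1".to_string());
    println!("{} {}", b.num_rows, b.row_groups.len()); // 42 2
}
```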
That is a good point -- I wasn't planning to deprecate the functions, though we could argue that deprecating
Thank you for the review @wiedld
I love this. ❤️
        }
    }

    if options.enable_page_index {
        let mut columns_indexes = vec![];
        let mut offset_indexes = vec![];

-       for rg in &mut filtered_row_groups {
+       for rg in metadata_builder.row_groups().iter() {
I think we can build the metadata here (with the filtered row groups), pass it into ParquetMetaDataReader, and then load the page indexes into the metadata. Let me give that a try.
if options.enable_page_index {
    let mut reader = ParquetMetaDataReader::new_with_metadata(metadata_builder.build())
        .with_page_indexes(options.enable_page_index);
    reader.read_page_indexes(&chunk_reader)?;
    metadata_builder = ParquetMetaDataBuilder::new_from_metadata(reader.finish()?);
}
I forgot to do this in #6450.
I don't quite follow why this is needed. What scenario does it help with? (I can write a test to cover it.)
I mean replace

if options.enable_page_index {
    let mut columns_indexes = vec![];
    let mut offset_indexes = vec![];
    for rg in metadata_builder.row_groups().iter() {
        let column_index = index_reader::read_columns_indexes(&chunk_reader, rg.columns())?;
        let offset_index = index_reader::read_offset_indexes(&chunk_reader, rg.columns())?;
        columns_indexes.push(column_index);
        offset_indexes.push(offset_index);
    }
    metadata_builder = metadata_builder
        .set_column_index(Some(columns_indexes))
        .set_offset_index(Some(offset_indexes));
}

with the above code snippet from my earlier comment. This should be a bit more efficient, since read_page_indexes will fetch the necessary bytes from the file in a single read, rather than 2 reads per row group.
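A back-of-envelope sketch of the coalescing idea: given the (offset, length) byte ranges of every index structure, one covering range can be fetched in a single read instead of issuing reads per row group. The function name and values here are made up for illustration and are not part of the parquet crate.

```rust
// Hypothetical sketch: coalesce many index byte ranges into one read.

/// Returns the (start, end) of a single byte range covering all inputs.
fn covering_range(ranges: &[(u64, u64)]) -> (u64, u64) {
    let start = ranges.iter().map(|(off, _)| *off).min().unwrap();
    let end = ranges.iter().map(|(off, len)| off + len).max().unwrap();
    (start, end)
}

fn main() {
    // (offset, length) per index structure; values are invented.
    let ranges = [(100, 20), (140, 30), (200, 10)];
    let (start, end) = covering_range(&ranges);
    println!("one read of {} bytes instead of {} reads", end - start, ranges.len());
}
```

Reading the covering range once trades a little over-read in the gaps for far fewer I/O round trips, which is usually a win for remote object stores.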
I implemented something slightly different: since there is already a ParquetMetaDataReader created at the beginning of the function, I made the change to simply read it when needed.
One thing that might be different is that the current code may only read the column index/page index for row groups that passed the "predicates", but the ParquetMetaDataReader reads the index for all the row groups.
That being said, no tests fail, so I am not sure if it is a real problem or not.
Hmm, this worries me a bit, since the column and offset indexes will have more row groups represented than are in the ParquetMetaData. The split path from before would only read the page indexes for the remaining row groups.
We could prune the page indexes at the same time we're pruning the row groups.
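The pruning suggested above amounts to filtering the row groups and their per-row-group page-index entries in lockstep, so position i stays aligned on both sides. A self-contained sketch with stand-in types (the prune function and its simplified types are hypothetical, not the crate's API):

```rust
// Sketch: drop the page-index entry at position i whenever row group i
// is filtered out, keeping the two vectors aligned.

/// Filters row groups and their per-row-group page indexes together.
fn prune(
    row_groups: Vec<i64>,            // stand-in: row count per row group
    page_indexes: Vec<&'static str>, // stand-in: one index per row group
) -> (Vec<i64>, Vec<&'static str>) {
    row_groups
        .into_iter()
        .zip(page_indexes)
        .filter(|(num_rows, _)| *num_rows > 0) // hypothetical predicate
        .unzip()
}

fn main() {
    let (rgs, idxs) = prune(vec![10, 0, 25], vec!["idx0", "idx1", "idx2"]);
    println!("{:?} {:?}", rgs, idxs); // [10, 25] ["idx0", "idx2"]
}
```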
> We could prune the page indexes at the same time we're pruning the row groups.

Yeah, that probably mirrors the intent most closely. How about I back out c0432e6 and we can address improving this code as a follow-on PR (along with tests)?
I deprecated
This reverts commit c0432e6.
Which issue does this PR close?

Closes #6465

Rationale for this change

At the moment it is
See #6465 for more rationale

What changes are included in this PR?

- Add ParquetMetaDataBuilder
- Change code that modifies ParquetMetaData to use ParquetMetaDataBuilder (which requires fewer clones) so it might be marginally faster

Are there any user-facing changes?