Add ParquetMetaDataBuilder
#6466
Conversation
Force-pushed from 1d78ac4 to a2818bf.
If you are planning to eventually deprecate the non-builder ParquetMetaData::new*, then this would also need to be switched to the builder.
let mut filtered_row_groups = Vec::<RowGroupMetaData>::new();
for (i, rg_meta) in row_groups.into_iter().enumerate() {

// Filter row groups based on the predicates
I think the cleanup of this code (which is modifying the ParquetMetaData) is the best example of why having this API makes sense -- it makes one fewer copy, and I think it is also quite a bit clearer.
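The copy saving comes from taking the row groups out of the builder by value, filtering them by move, and setting the survivors back. The following is a self-contained sketch of that pattern, not the crate's actual API: MetaDataBuilder, take_row_groups, set_row_groups, and the String stand-in for RowGroupMetaData are all hypothetical simplifications.

```rust
// Sketch (not the crate's real API) of filtering row groups without clones:
// take ownership, filter by move, put the survivors back.
#[derive(Debug)]
struct MetaDataBuilder {
    row_groups: Vec<String>, // stand-in for Vec<RowGroupMetaData>
}

impl MetaDataBuilder {
    /// Takes ownership of the row groups, leaving the builder's list empty.
    fn take_row_groups(&mut self) -> Vec<String> {
        std::mem::take(&mut self.row_groups)
    }

    /// Replaces the row groups wholesale.
    fn set_row_groups(mut self, row_groups: Vec<String>) -> Self {
        self.row_groups = row_groups;
        self
    }
}

/// Keeps only row groups passing a (hypothetical) predicate,
/// moving each surviving element instead of cloning it.
fn filter_builder(mut builder: MetaDataBuilder) -> MetaDataBuilder {
    let filtered: Vec<String> = builder
        .take_row_groups()
        .into_iter()
        .filter(|rg| rg.as_str() != "drop")
        .collect();
    builder.set_row_groups(filtered)
}

fn main() {
    let builder = MetaDataBuilder {
        row_groups: vec!["rg0".into(), "drop".into(), "rg1".into()],
    };
    let builder = filter_builder(builder);
    println!("{:?}", builder.row_groups); // ["rg0", "rg1"]
}
```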
}

/// Adds a row group to the metadata
pub fn add_row_group(mut self, row_group: RowGroupMetaData) -> Self {
These methods follow the existing convention of add and set used by the other Builders in this crate, such as
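A minimal illustration of that naming convention, with made-up field names and simplified stand-in types (this is not the crate's actual builder): set_* replaces a field wholesale, while add_* appends one element to a collection.

```rust
// Hypothetical mini-builder showing the `set_*` / `add_*` convention.
#[derive(Debug, Default)]
struct Builder {
    num_rows: i64,
    row_groups: Vec<String>, // stand-in for Vec<RowGroupMetaData>
}

impl Builder {
    /// `set_*`: replace the field with the given value.
    fn set_num_rows(mut self, num_rows: i64) -> Self {
        self.num_rows = num_rows;
        self
    }

    /// `add_*`: append a single element to a collection field.
    fn add_row_group(mut self, row_group: String) -> Self {
        self.row_groups.push(row_group);
        self
    }
}

fn main() {
    let b = Builder::default()
        .set_num_rows(42)
        .add_row_group("rg0".to_string())
        .add_row_group("rg1".to_string());
    println!("{} {}", b.num_rows, b.row_groups.len()); // 42 2
}
```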
That is a good point -- I wasn't planning to deprecate the functions, though we could argue that deprecating
Thank you for the review @wiedld
I love this. ❤️
        }
    }

    if options.enable_page_index {
        let mut columns_indexes = vec![];
        let mut offset_indexes = vec![];

-       for rg in &mut filtered_row_groups {
+       for rg in metadata_builder.row_groups().iter() {
I think we can build the metadata here (with the filtered row groups), pass it into ParquetMetaDataReader, and then load the page indexes into the metadata. Let me give that a try.
if options.enable_page_index {
    let mut reader = ParquetMetaDataReader::new_with_metadata(metadata_builder.build())
        .with_page_indexes(options.enable_page_index);
    reader.read_page_indexes(&chunk_reader)?;
    metadata_builder = ParquetMetaDataBuilder::new_from_metadata(reader.finish()?);
}
I forgot to do this in #6450.
I don't quite follow why this is needed. What scenario does it help with? (I can write a test to cover it.)
I mean replace

if options.enable_page_index {
    let mut columns_indexes = vec![];
    let mut offset_indexes = vec![];
    for rg in metadata_builder.row_groups().iter() {
        let column_index = index_reader::read_columns_indexes(&chunk_reader, rg.columns())?;
        let offset_index = index_reader::read_offset_indexes(&chunk_reader, rg.columns())?;
        columns_indexes.push(column_index);
        offset_indexes.push(offset_index);
    }
    metadata_builder = metadata_builder
        .set_column_index(Some(columns_indexes))
        .set_offset_index(Some(offset_indexes));
}

with the above code snippet from my earlier comment. This should be a bit more efficient, since read_page_indexes will fetch the necessary bytes from the file in a single read, rather than 2 reads per row group.
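A back-of-envelope sketch of the coalescing idea: given the (offset, length) byte ranges of every index structure, one covering range can be fetched in a single read instead of issuing reads per row group. The function name and values here are made up for illustration and are not part of the parquet crate.

```rust
// Hypothetical sketch: coalesce many index byte ranges into one read.

/// Returns the (start, end) of a single byte range covering all inputs.
fn covering_range(ranges: &[(u64, u64)]) -> (u64, u64) {
    let start = ranges.iter().map(|(off, _)| *off).min().unwrap();
    let end = ranges.iter().map(|(off, len)| off + len).max().unwrap();
    (start, end)
}

fn main() {
    // (offset, length) per index structure; values are invented.
    let ranges = [(100, 20), (140, 30), (200, 10)];
    let (start, end) = covering_range(&ranges);
    println!("one read of {} bytes instead of {} reads", end - start, ranges.len());
}
```

Reading the covering range once trades a little over-read in the gaps for far fewer I/O round trips, which is usually a win for remote object stores.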
I implemented something slightly different: since there is already a ParquetMetaDataReader created at the beginning of the function, I made the change to simply read it when needed.
One thing that might be different is that the current code may only read the column index/page index for row groups that passed the "predicates", but the ParquetMetaDataReader reads the index for all the row groups.
That being said, no tests fail, so I am not sure if it is a real problem or not.
Hmm, this worries me a bit, since the column and offset indexes will have more row groups represented than are in the ParquetMetaData. The split path from before would only read the page indexes for the remaining row groups.
We could prune the page indexes at the same time we're pruning the row groups.
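The pruning suggested above amounts to filtering the row groups and their per-row-group page-index entries in lockstep, so position i stays aligned on both sides. A self-contained sketch with stand-in types (the prune function and its simplified types are hypothetical, not the crate's API):

```rust
// Sketch: drop the page-index entry at position i whenever row group i
// is filtered out, keeping the two vectors aligned.

/// Filters row groups and their per-row-group page indexes together.
fn prune(
    row_groups: Vec<i64>,            // stand-in: row count per row group
    page_indexes: Vec<&'static str>, // stand-in: one index per row group
) -> (Vec<i64>, Vec<&'static str>) {
    row_groups
        .into_iter()
        .zip(page_indexes)
        .filter(|(num_rows, _)| *num_rows > 0) // hypothetical predicate
        .unzip()
}

fn main() {
    let (rgs, idxs) = prune(vec![10, 0, 25], vec!["idx0", "idx1", "idx2"]);
    println!("{:?} {:?}", rgs, idxs); // [10, 25] ["idx0", "idx2"]
}
```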
> We could prune the page indexes at the same time we're pruning the row groups.

Yeah, that probably mirrors the intent most closely. How about I back out c0432e6 and we can address improving this code as a follow-on PR (along with tests)?
I deprecated
This reverts commit c0432e6.
Which issue does this PR close?

Closes #6465

Rationale for this change

At the moment it is
See #6465 for more rationale

What changes are included in this PR?

- Add ParquetMetaDataBuilder
- Change code that modifies ParquetMetaData to use ParquetMetaDataBuilder (which requires fewer clones) so it might be marginally faster

Are there any user-facing changes?