Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read Parquet metadata via suffix requests #5979

Open
kylebarron opened this issue Jun 28, 2024 · 5 comments · May be fixed by #6157
Open

Read Parquet metadata via suffix requests #5979

kylebarron opened this issue Jun 28, 2024 · 5 comments · May be fixed by #6157
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@kylebarron
Copy link
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

In high-latency environments, like in WebAssembly in users' browsers, minimizing the number of sequential requests can be a significant performance improvement. Now that #5222 has been merged, object_store now has the ability to fetch a suffix byte range of files. It would be great to be able to integrate this with parquet to reduce the number of individual requests required.

Describe the solution you'd like

It seems that parquet::arrow::async_reader::MetadataLoader::load can be refactored to use an initial suffix request instead of needing to know the file size.

Or, perhaps, there should be a new MetadataLoader::load_suffix method, so that implementations can choose to use load if they already know the file size, and use load_suffix if they don't.

Related to this, the ParquetObjectReader::new API requires an ObjectMeta, which requires knowing the file size. It would be great to be able to construct ParquetObjectReader with only the store and the object_store::path::Path. (I'm trying to construct ParquetObjectReader with a fake file length, passing my own ArrowReaderMetadata #5583, but I haven't figured out if that works yet)

Describe alternatives you've considered

Make an extra HEAD request instead of using suffix requests.

Additional context

@kylebarron kylebarron added the enhancement Any new improvement worthy of a entry in the changelog label Jun 28, 2024
@alamb
Copy link
Contributor

alamb commented Jul 1, 2024

Or, perhaps, there should be a new MetadataLoader::load_suffix method, so that implementations can choose to use load if they already know the file size, and use load_suffix if they don't.

It seems to me that load_suffix would always be preferred (as it requires less information), except for when the underlying store doesn't support suffix requests.

@kylebarron
Copy link
Contributor Author

I think the API of AsyncFileReader might have to change? Right now the signature of get_bytes takes a range: core::ops::range::Range<usize>, and presumably that would have to change to something like object_store::GetRange for the concept of a suffix range?

If it does need a breaking API change, does that mean we shouldn't start a PR until 52.2.0 is tagged and released?

@alamb
Copy link
Contributor

alamb commented Jul 5, 2024

If it does need a breaking API change, does that mean we shouldn't start a PR until 52.2.0 is tagged and released?

It is up to you, but I think it would be better to get the PR up and ready (and we can merge it when main opens for 53.0.0 work

@alamb
Copy link
Contributor

alamb commented Jul 5, 2024

Potentially related: #6002

@alamb
Copy link
Contributor

alamb commented Jul 17, 2024

We now have a 53.0.0-dev branch where we are merging PRs with breaking API changes, in case you are still interested in submitting a PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants