Write to parquet file? #6

Open
caligo-erik opened this issue Apr 28, 2024 · 13 comments

Comments

@caligo-erik

Is it possible to write to parquet file using this library? (quickly checked the code, didn't see any write function).

@platypii
Collaborator

No plans for writing parquet files at this time.

I could be convinced otherwise, but generally I feel that if you are creating parquet files, you are more likely to be in a backend environment, so it makes sense to use existing parquet libraries in languages like Python, C++, or Rust.

What I really want with this library is to make it easy to view parquet data in the browser, since there was no good library for decoding parquet files in JavaScript that was lightweight and could handle remote files efficiently.

You might like the work of @kylebarron on parquet-wasm. Hope you find what you need!

@kylebarron

> lightweight and could handle remote files efficiently.

Generally agree that "webassembly" and "lightweight" are not synonyms, but there's no technical blocker to handling remote files efficiently in parquet-wasm. In the latest release you're able to fetch individual row groups or columns from a Parquet file without downloading the entire file. And we could implement something like pyarrow's filters param, I just haven't taken the time to fully implement that yet.
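As a rough illustration of what a pyarrow-style `filters` option could do at the row-group level, here is a minimal sketch in plain JavaScript. It assumes each row group exposes min/max statistics per column; the object shapes and the names `rowGroupMayMatch` and `selectRowGroups` are hypothetical and are not the API of parquet-wasm, hyparquet, or pyarrow.

```javascript
// Hypothetical shape: each row group carries min/max statistics per column.
// Real Parquet metadata is richer; this is only an illustration.
function rowGroupMayMatch(stats, filter) {
  const s = stats[filter.column]
  // Without statistics we cannot rule the row group out.
  if (!s) return true
  switch (filter.op) {
    case '==': return s.min <= filter.value && filter.value <= s.max
    case '<': return s.min < filter.value
    case '<=': return s.min <= filter.value
    case '>': return s.max > filter.value
    case '>=': return s.max >= filter.value
    default: return true // unknown operator: be conservative and keep it
  }
}

// Returns the indices of row groups worth fetching; all filters are ANDed.
function selectRowGroups(rowGroups, filters) {
  const selected = []
  rowGroups.forEach((rg, i) => {
    if (filters.every(f => rowGroupMayMatch(rg.stats, f))) selected.push(i)
  })
  return selected
}

// Made-up statistics for three row groups.
const rowGroups = [
  { stats: { year: { min: 2010, max: 2013 } } },
  { stats: { year: { min: 2014, max: 2019 } } },
  { stats: { year: { min: 2020, max: 2024 } } },
]
const wanted = selectRowGroups(rowGroups, [{ column: 'year', op: '>=', value: 2015 }])
console.log(wanted)
```

With these made-up statistics, only row groups 1 and 2 would need to be fetched. Rows inside a selected row group still have to be checked individually, since statistics can only exclude a row group, never confirm that every row matches.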

@caligo-erik
Author

caligo-erik commented Apr 29, 2024

> No plans for writing parquet files at this time.
>
> I could be convinced otherwise, but generally I feel that if you are creating parquet files, you are more likely to be in a backend environment, so it makes sense to use existing parquet libraries in languages like Python, C++, or Rust.
>
> What I really want with this library is to make it easy to view parquet data in the browser, since there was no good library for decoding parquet files in JavaScript that was lightweight and could handle remote files efficiently.
>
> You might like the work of @kylebarron on parquet-wasm. Hope you find what you need!

I'm creating an offline application (with local JS server and web application using Electron) that stores transactional data locally in the backend/server, and then uploads it to S3 to be analyzed with cloud-native tools such as Athena, QuickSight etc.

I'm looking for a lightweight library to read and write Parquet files, and your library ticks all the boxes except for the write function.
I've checked other libraries, but most aren't maintained.

@kylebarron

You can use WebAssembly in Electron, so parquet-wasm should work out of the box.

@caligo-erik
Author

caligo-erik commented Apr 29, 2024 via email

@platypii
Collaborator

> Generally agree that "webassembly" and "lightweight" are not synonyms, but there's no technical blocker to handling remote files efficiently in parquet-wasm.

parquet-wasm has 5+ megabytes of wasm file, hyparquet is sub-100k of javascript. Loading can be much faster especially for time to first render.

Because hyparquet is not a compiled wasm blob, there is no need for transferring data across the wasm boundary, and no cold-start time for loading the wasm vm. Also, I've done some optimizations for the web: for example, if you are fetching a bunch of columns in a row group, it will fetch the data in just one HTTP request instead of multiple round trips. I'm guessing that parquet-wasm, if you can implement ranged-gets, probably doesn't coalesce the requests to save round trip time?

Huge respect for your work Kyle, I love reading your blog about parquet stuff. Definitely not knocking parquet-wasm! Just pointing out the reasons I built hyparquet. :)

@kylebarron

> Just pointing out the reasons I built hyparquet. :)

That's very fair! I think it's valuable to have a pure-JavaScript implementation!

My own bias is that Parquet is an absolutely perfect place for WebAssembly, because Parquet is such a complex spec with such a long tail of complexities. It's not that I don't want a pure-JS implementation; rather my own conclusion was that implementing a stable pure-JS Parquet implementation that supports all encodings and compressions would be an absolutely massive engineering effort. Most previous JS Parquet implementations were eventually abandoned.

Whereas there are a ton of people building databases in Rust, so the Parquet implementation is stable, fast, and loads into a binary representation. Perhaps it's a use case where the benefits of WebAssembly outweigh the costs.

So take encouragement with a hint of skepticism 🙂. If you're able to implement a stable pure-JS Parquet reader, it'll be really impressive!

> parquet-wasm has 5+ megabytes of wasm file, hyparquet is sub-100k of javascript. Loading can be much faster especially for time to first render.

1.2MB brotli-compressed 😉, but yes. We might have different use cases: you might care more about time to first render, whereas I'm more focused on handling large datasets, where 1.2MB is very small compared to the data savings from Parquet.

> I'm guessing that parquet-wasm, if you can implement ranged-gets, probably doesn't coalesce the requests to save round trip time?

It does. Multiple ranges are coalesced by default. The coalesce size is currently 1MB and not configurable though.
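For readers unfamiliar with range coalescing, a minimal sketch of the idea follows. It assumes "coalesce size" means the largest gap between two byte ranges that will still be merged into one request; that reading, and the names `coalesceRanges` and `toRangeHeader`, are assumptions for illustration and not the actual code of parquet-wasm or hyparquet.

```javascript
// Merge byte ranges whose gap is at most maxGap bytes.
// Ranges are { start, end } with end exclusive.
function coalesceRanges(ranges, maxGap) {
  const sorted = [...ranges].sort((a, b) => a.start - b.start)
  const merged = []
  for (const r of sorted) {
    const last = merged[merged.length - 1]
    if (last && r.start - last.end <= maxGap) {
      // Close enough (or overlapping): extend the previous range.
      last.end = Math.max(last.end, r.end)
    } else {
      merged.push({ start: r.start, end: r.end })
    }
  }
  return merged
}

// HTTP Range headers use inclusive end offsets.
function toRangeHeader(range) {
  return `bytes=${range.start}-${range.end - 1}`
}

const ONE_MB = 1024 * 1024
// Made-up column chunk offsets within a file.
const columnChunks = [
  { start: 4, end: 500000 },
  { start: 600000, end: 900000 },
  { start: 5000000, end: 5200000 },
]
const rangeHeaders = coalesceRanges(columnChunks, ONE_MB).map(toRangeHeader)
console.log(rangeHeaders)
```

Here the first two chunks are 100,000 bytes apart, so they collapse into a single request, while the third is more than 1MB away and gets its own. The trade-off is downloading some unneeded bytes in the gap in exchange for one fewer round trip.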

@kylebarron

Also note that the people in loaders.gl are also building a pure-TypeScript Parquet implementation, which I think was forked from parquets. It might be worth reaching out to them.

@caligo-erik
Author

Thanks for the additional information.

The application server - the only project managing data - doesn't know anything about any UI or client libraries, that's why I'd rather stick with a JS/TS library to read/write Parquet files.
[image attachment]

@severo

severo commented Apr 30, 2024

> 1.2MB brotli-compressed 😉, but yes. We might have different use cases: you might care more about time to first render, whereas I'm more focused on handling large datasets, where 1.2MB is very small compared to the data savings from Parquet.

it's particularly valuable when we're interested only in reading the metadata.

@kylebarron

> it's particularly valuable when we're interested only in reading the metadata.

Your use case involves reading the metadata only... but not the data?

@platypii
Collaborator

Oh if we're talking compressed size, then hyparquet is 24.1kb compressed 😉

@severo

severo commented May 2, 2024

> Your use case involves reading the metadata only... but not the data?

Yes, we just launched a Parquet metadata viewer: https://huggingface.co/datasets/HuggingFaceFW/fineweb/tree/main/data/CC-MAIN-2013-20?show_file_info=data%2FCC-MAIN-2013-20%2F000_00000.parquet

It's powered by hyparquet!
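
To show why a metadata-only reader can stay cheap, here is a minimal sketch of locating the footer. In an unencrypted Parquet file the last 8 bytes are a 4-byte little-endian metadata length followed by the magic bytes `PAR1`, so the metadata can be found from the file tail without reading any data pages. The function name `metadataRange` is hypothetical and this is not hyparquet's code; decoding the Thrift-encoded metadata itself is out of scope.

```javascript
// Given the last 8 bytes of a Parquet file (as an ArrayBuffer) and the total
// file size, return the byte range { start, end } of the metadata, end exclusive.
function metadataRange(tail, fileSize) {
  if (tail.byteLength !== 8) throw new Error('expected exactly 8 bytes')
  const view = new DataView(tail)
  const magic = String.fromCharCode(
    view.getUint8(4), view.getUint8(5), view.getUint8(6), view.getUint8(7)
  )
  if (magic !== 'PAR1') throw new Error('missing PAR1 magic at end of file')
  const length = view.getUint32(0, true) // little-endian metadata length
  const end = fileSize - 8
  const start = end - length
  // The file also begins with a 4-byte magic, so metadata cannot start before byte 4.
  if (start < 4) throw new Error('metadata length exceeds file size')
  return { start, end }
}

// Build a fake 8-byte tail: metadata length 1000, then "PAR1".
const tail = new ArrayBuffer(8)
const tailView = new DataView(tail)
tailView.setUint32(0, 1000, true)
'PAR1'.split('').forEach((c, i) => tailView.setUint8(4 + i, c.charCodeAt(0)))

const metaRange = metadataRange(tail, 10000)
console.log(metaRange)
```

For a 10,000-byte file with 1,000 bytes of metadata, the range is bytes 8992 up to (but not including) 9992. In practice a reader might fetch a larger tail in one request so that the length and the metadata arrive together.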
