
Produce _common_metadata and _metadata for parquet data sets #418

Open
yymao opened this issue Jan 27, 2021 · 0 comments
yymao commented Jan 27, 2021

As @cwwalter mentioned, having _common_metadata and _metadata files in the parquet dataset folder can potentially speed up load time for Dask and Spark.

We should be able to add the steps to generate these metadata files in write_gcr_to_parquet.py. Alternatively, we can write a post-processing script to generate them, which would also work for existing datasets.

Ref: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_metadata.html
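A minimal sketch of the post-processing approach, following the pyarrow docs linked above. The directory layout (a flat folder of `*.parquet` files) and the function name are illustrative assumptions, not taken from write_gcr_to_parquet.py:

```python
# Sketch: add _common_metadata and _metadata to an existing parquet dataset.
# Assumes a flat directory of .parquet files sharing one schema; the helper
# name and paths here are hypothetical, for illustration only.
import glob
import os

import pyarrow.parquet as pq


def write_dataset_metadata(dataset_dir):
    """Generate _common_metadata and _metadata for all parquet files in dataset_dir."""
    paths = sorted(glob.glob(os.path.join(dataset_dir, "*.parquet")))
    schema = pq.read_schema(paths[0])

    # Collect per-file row-group metadata, recording each file's path
    # relative to the dataset root so readers can locate the data files.
    metadata_collector = []
    for path in paths:
        md = pq.read_metadata(path)
        md.set_file_path(os.path.relpath(path, dataset_dir))
        metadata_collector.append(md)

    # _common_metadata holds only the shared schema.
    pq.write_metadata(schema, os.path.join(dataset_dir, "_common_metadata"))
    # _metadata additionally holds the row-group metadata for every file.
    pq.write_metadata(
        schema,
        os.path.join(dataset_dir, "_metadata"),
        metadata_collector=metadata_collector,
    )
```

For the write-time approach, `pq.write_to_dataset` accepts the same `metadata_collector` keyword, so the footers can be collected while writing instead of re-reading them afterwards.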
