Use pyarrow to write parquet files #124

Closed
wants to merge 1 commit into from

Conversation

jaychia

@jaychia jaychia commented May 3, 2024

Addresses: #123

On SCALE_FACTOR=10:

  • Before the change, the Lineitem Parquet file has 20k row groups
  • After the change, the Lineitem Parquet file has only 57 row groups

This does affect benchmark results quite a bit, depending on how resilient each Parquet reader implementation is to poorly written Parquet files like these. However, I think that for the sake of having a benchmark that represents the expected/common case, we should write the Parquet files properly!

@ritchie46
Member

I think we should fix the culprit upstream instead of applying a band-aid here.

@jaychia
Author

jaychia commented May 3, 2024

That makes sense, @ritchie46. I'll close this PR in favor of an upstream fix.

@jaychia jaychia closed this May 3, 2024
2 participants