
Flink: TestMetadataTableReadableMetrics relies on Hardcoded File Sizes #11465

Open

RussellSpitzer opened this issue Nov 4, 2024 · 3 comments
Labels: good first issue (Good for newcomers), improvement (PR that improves existing functionality)

Comments

@RussellSpitzer (Member)

Feature Request / Improvement

TestMetadataTableReadableMetrics currently hardcodes the expected sizes in the metrics rows rather than checking the sizes from the underlying data. This means the test needs to be updated every time the Parquet version (or compression, or similar) changes. See b8c2b20

Ideally we would either change this so that the only expected values we check are those that do not depend on the Parquet version, or change the test to check against the actual values.

See #11462 for an instance where this is complicating things
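One possible direction is a rough sketch like the following (not the actual test code; the class and method names here are made up for illustration): read the expected per-column sizes from the footer of the Parquet file the test wrote, instead of keeping them as hardcoded constants.

```java
// Sketch only: derive expected per-column sizes from the Parquet footer of the
// written data file, so the expected values track whatever the writer produced.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class ParquetColumnSizes {

  // Sums the on-disk (compressed) size of each column's chunks across all row
  // groups; these are the values the test currently hardcodes per Parquet version.
  public static Map<String, Long> columnSizes(String parquetFilePath) throws IOException {
    Map<String, Long> sizes = new HashMap<>();
    HadoopInputFile input =
        HadoopInputFile.fromPath(new Path(parquetFilePath), new Configuration());
    try (ParquetFileReader reader = ParquetFileReader.open(input)) {
      for (BlockMetaData rowGroup : reader.getFooter().getBlocks()) {
        for (ColumnChunkMetaData column : rowGroup.getColumns()) {
          sizes.merge(column.getPath().toDotString(), column.getTotalSize(), Long::sum);
        }
      }
    }
    return sizes;
  }
}
```

The test could then build its expected readable_metrics rows from this map (or stop asserting on the size fields entirely), so a Parquet or compression upgrade no longer breaks it.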

Query engine

None

Willingness to contribute

  • I can contribute this improvement/feature independently
  • I would be willing to contribute this improvement/feature with guidance from the Iceberg community
  • I cannot contribute this improvement/feature at this time
@RussellSpitzer added the "improvement" and "good first issue" labels on Nov 4, 2024
@davidyuan1223

can we use the sql select column_sizes from table.files to get the right size?

@pvary (Contributor) commented Nov 5, 2024

can we use the sql select column_sizes from table.files to get the right size?

I would prefer @RussellSpitzer's suggestion to directly check the parquet file sizes. Otherwise we might end up using the same abstraction to get the expected data and the test data.

@davidyuan1223

can we use the sql select column_sizes from table.files to get the right size?

I would prefer @RussellSpitzer's suggestion to directly check the parquet file sizes. Otherwise we might end up using the same abstraction to get the expected data and the test data.

Maybe you are right. I have a question about this link: https://iceberg.apache.org/docs/1.6.0/spark-queries/?h=readable_metrics#files. Under "Inspecting tables -- Files", the SQL SELECT * FROM prod.db.table.files; shows that a Parquet file may contain multiple columns. If we get the file size, how do we know the column-level size?
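For reference, column_sizes in the files metadata table is already a per-column map (column field id -> bytes), not a single file size, and the same information is reachable through the Java API. A minimal sketch (the table variable and the column name are placeholders, not part of the existing test):

```java
// Sketch only: column_sizes is keyed by the column's field id, so the per-column
// size is available directly; no need to split up the file-level size.
// `table` is assumed to be an org.apache.iceberg.Table loaded elsewhere.
import java.io.IOException;
import java.util.Map;

import org.apache.iceberg.DataFile;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

class ColumnSizeLookup {

  static void printColumnSize(Table table, String columnName) throws IOException {
    int fieldId = table.schema().findField(columnName).fieldId();
    try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
      for (FileScanTask task : tasks) {
        DataFile file = task.file();
        Map<Integer, Long> columnSizes = file.columnSizes(); // field id -> size in bytes
        System.out.printf("%s: %s -> %d bytes%n",
            file.path(), columnName, columnSizes.get(fieldId));
      }
    }
  }
}
```

That said, per the comment above, for the test itself reading the sizes straight from the Parquet footer avoids comparing Iceberg's metadata against values produced by the same abstraction.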
