Add support to read parquet file metadata through deephaven #6126

Open
malhotrashivam opened this issue Sep 25, 2024 · 2 comments
Labels: feature request, parquet, s3


@malhotrashivam
Contributor

This would help with remote debugging and with understanding the structure of Parquet files.
We could follow an API similar to DuckDB's: https://duckdb.org/docs/data/parquet/overview

  • read_parquet
  • parquet_file_metadata
  • parquet_kv_metadata
  • parquet_schema
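
As background for what such metadata functions would have to read: a Parquet file ends with the Thrift-serialized FileMetaData, followed by a 4-byte little-endian footer length and the magic bytes PAR1. Below is a minimal sketch, using only the JDK, of locating that footer. The class and method names are hypothetical and not part of Deephaven or any Parquet library, and it only handles the plain PAR1 trailer.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not a Deephaven or Parquet API.
public class ParquetFooterLocator {
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);
    // Trailer layout: 4-byte little-endian footer length, then the 4-byte magic.
    private static final int TRAILER_SIZE = 4 + MAGIC.length;

    /** Returns the length in bytes of the serialized FileMetaData footer. */
    public static int footerLength(RandomAccessFile file) throws IOException {
        long size = file.length();
        // Smallest possible file: leading magic plus the trailer.
        if (size < MAGIC.length + TRAILER_SIZE) {
            throw new IOException("File too small to be Parquet: " + size + " bytes");
        }
        byte[] tail = new byte[TRAILER_SIZE];
        file.seek(size - TRAILER_SIZE);
        file.readFully(tail);
        // The last four bytes must be the magic.
        for (int i = 0; i < MAGIC.length; i++) {
            if (tail[4 + i] != MAGIC[i]) {
                throw new IOException("Missing trailing PAR1 magic");
            }
        }
        int length = ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
        // The footer must fit between the leading magic and the trailer.
        if (length < 0 || length > size - MAGIC.length - TRAILER_SIZE) {
            throw new IOException("Implausible footer length: " + length);
        }
        return length;
    }

    /** Returns the file offset at which the serialized footer starts. */
    public static long footerOffset(long fileSize, int footerLength) {
        return fileSize - TRAILER_SIZE - footerLength;
    }
}
```

The bytes from footerOffset(...) up to the trailer are what a Thrift decoder would turn into the FileMetaData that parquet_file_metadata, parquet_kv_metadata, and parquet_schema report on.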
@malhotrashivam added the feature request, parquet, and s3 labels on Sep 25, 2024
@malhotrashivam added this to the Backlog milestone on Sep 25, 2024
@malhotrashivam self-assigned this on Sep 25, 2024
@malhotrashivam
Contributor Author

A workaround that @rcaudy suggested in the meantime:

If you have a raw source table in Groovy, you should be able to:

  1. Call .initialize() on it.
  2. Get its columnSourceManager field.
  3. Get the Table returned by the column source manager's (CSM's) locationTable().
  4. Get the key-value metadata for each file by applying update("KV = ((io.deephaven.parquet.table.location.ParquetTableLocation) _TableLocation).getParquetKey().getMetadata().getFileMetaData().getKeyValueMetaData()") to that table.

@devinrsmith
Member

It may be useful to write a small standalone utility that prints the FileMetaData as JSON; I've found this snippet helpful:

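        // fileMetaData is assumed to be the Thrift-generated org.apache.parquet.format.FileMetaData.
        // TMemoryBuffer, TSimpleJSONProtocol, and TException are Thrift classes (possibly under a
        // shaded package, depending on the classpath); StandardCharsets is java.nio.charset.
        try (final TMemoryBuffer buffer = new TMemoryBuffer(128)) {
            // Serialize the Thrift struct as human-readable JSON into the in-memory buffer.
            fileMetaData.write(new TSimpleJSONProtocol(buffer));
            buffer.flush();
            System.out.println(buffer.toString(StandardCharsets.UTF_8));
        } catch (TException e) {
            // Best-effort debug output: report the failure instead of silently dropping it.
            System.err.println("Could not serialize FileMetaData: " + e);
        }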
