Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2492: Add binary protocol extensions #254

Merged
merged 11 commits into from
Sep 19, 2024

Conversation

alkis
Copy link
Contributor

@alkis alkis commented May 29, 2024

Specify a backwards/forward compatible way to extend any Thrift struct in Parquet.

ref Parquet binary protocol extensions

Jira

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@alkis alkis force-pushed the t3-metadata-experimentation branch from 2555aa8 to 5f12691 Compare June 1, 2024 09:34
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@alkis alkis force-pushed the t3-metadata-experimentation branch 3 times, most recently from 5ef488c to d187886 Compare June 6, 2024 16:16
@alkis alkis changed the title [T3] Add extension points for all thrift messages [PARQUET-2492] Add extension points for all thrift messages Jun 6, 2024
@alkis alkis changed the title [PARQUET-2492] Add extension points for all thrift messages PARQUET-2492: Add extension points for all thrift messages Jun 6, 2024
@alkis alkis force-pushed the t3-metadata-experimentation branch from d187886 to 056429a Compare June 6, 2024 16:37
@alkis alkis marked this pull request as ready for review June 7, 2024 06:46
@alkis
Copy link
Contributor Author

alkis commented Jun 7, 2024

I marked this ready for review. I have tested offline that this method works as expected and has virtually no impact in parse speed of the original FileMetaData thrift message.

How can we move this forward?

README.md Outdated Show resolved Hide resolved
@alkis alkis force-pushed the t3-metadata-experimentation branch from 056429a to 7994102 Compare June 12, 2024 05:48
README.md Outdated Show resolved Hide resolved
Copy link
Member

@julienledem julienledem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @alkis, I left a few comment.
Minor nit: It is nice to add commits to the PR when addressing comments so that we can see the history of when the comment were added. When amending the same commit and force pushing like here, we can't see the history.

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@alkis alkis force-pushed the t3-metadata-experimentation branch from a892a65 to 4b550f6 Compare July 19, 2024 05:16
@alkis alkis force-pushed the t3-metadata-experimentation branch from 8708ad3 to 77c6de7 Compare August 6, 2024 13:04
@alkis alkis force-pushed the t3-metadata-experimentation branch from 77c6de7 to e2ef70e Compare August 6, 2024 13:11
README.md Outdated Show resolved Hide resolved
ExtensionExamples.md Outdated Show resolved Hide resolved
ExtensionExamples.md Outdated Show resolved Hide resolved
@alkis
Copy link
Contributor Author

alkis commented Aug 15, 2024

Hi, can we perhaps find another term than "extension" for this feature? The reason is that we may also want to add support for extension types to Parquet (where "extension types" fits naturally with the corresponding feature in Apache Arrow), which might breed confusion when talking about "Parquet extensions". https://lists.apache.org/thread/lcrjoymddllxf8wvq33ddm5fd0tk9mh7

(but perhaps I'm being overly cautious here, what do you think?)

It would be beneficial to avoid confusion. I can't think of a great name for it though. If we name them "experiments" it will make them look risky and can reduce adoption when we use them for migration. "plugins" - I don't like it either. Got any suggestions?

For the extensions in the discussion, I do not have a lot of context: what's the advantage of such extensions vs adding the logical types to the spec?

@pitrou
Copy link
Member

pitrou commented Aug 15, 2024

For the extensions in the discussion, I do not have a lot of context: what's the advantage of such extensions vs adding the logical types to the spec?

The point is to decouple the Thrift / Format spec from the definition of new extension (logical) types. It allows third-party definitions of such types and gradual standardization thereof (either inside the Apache Parquet project itself, or by consensus inside a subcommunity). It also makes it easier for the implementations of said types to live outside of the core Parquet implementations, which is desirable for complex and/or highly-specific domains such as with GeoParquet.

(personally, I would rather Parquet C++ didn't have to reimplement logic for geospatial data :-))

@pitrou
Copy link
Member

pitrou commented Aug 15, 2024

It would be beneficial to avoid confusion. I can't think of a great name for it though. If we name them "experiments" it will make them look risky and can reduce adoption when we use them for migration. "plugins" - I don't like it either. Got any suggestions?

"Thrift protocol escapes"? "Thrift binary appends"?

@alkis
Copy link
Contributor Author

alkis commented Aug 15, 2024

The point is to decouple the Thrift / Format spec from the definition of new extension (logical) types. It allows third-party definitions of such types and gradual standardization thereof (either inside the Apache Parquet project itself, or by consensus inside a subcommunity). It also makes it easier for the implementations of said types to live outside of the core Parquet implementations, which is desirable for complex and/or highly-specific domains such as with GeoParquet.

(personally, I would rather Parquet C++ didn't have to reimplement logic for geospatial data :-))

Makes sense. Should we name these extensions type-extensions or third-party-types or external-types? Should we name the extensions in this PR format-extensions (not sure if this is specific enough though).

@pitrou
Copy link
Member

pitrou commented Aug 15, 2024

The reason to prefer "extension types" over "third-party types" or "external types" is that at some point some of them might get standardized inside Parquet, like Arrow does.

Though we could also dictate a policy that standardizing an extension type is done by creating a new logical type. @wgtmac

@alkis
Copy link
Contributor Author

alkis commented Aug 15, 2024

@pitrou wdyt about calling the extensions in this PR "Metadata Extensions" vs the other ones "Type Extensions"? Would that clear it?

Bonus is that if/when we add flatbuffers the extension points for flatbuffers will also fall under "Metadata Extensions".

@wgtmac
Copy link
Member

wgtmac commented Aug 16, 2024

The reason to prefer "extension types" over "third-party types" or "external types" is that at some point some of them might get standardized inside Parquet, like Arrow does.

Though we could also dictate a policy that standardizing an extension type is done by creating a new logical type. @wgtmac

Agreed. I think we can simply follow the rule of Arrow's extension type. A naive proposal would be adding a custom key-value metadata field to struct SchemaElement. WDYT?

@pitrou
Copy link
Member

pitrou commented Aug 19, 2024

@pitrou wdyt about calling the extensions in this PR "Metadata Extensions" vs the other ones "Type Extensions"? Would that clear it?

Perhaps "Binary protocol extensions" or "Unparsed protocol extensions"? "Metadata" is usually vague.

@alkis alkis changed the title PARQUET-2492: Add extension points for all thrift messages PARQUET-2492: Add binary protocol extensions (any thrift message) Aug 19, 2024
@alkis
Copy link
Contributor Author

alkis commented Aug 19, 2024

Perhaps "Binary protocol extensions" or "Unparsed protocol extensions"? "Metadata" is usually vague.

Qualified as "Binary Protocol Extensions".

@alkis
Copy link
Contributor Author

alkis commented Aug 26, 2024

Friendly ping. Since there are no other comments can we merge this?

@pitrou
Copy link
Member

pitrou commented Aug 26, 2024

Doesn't this require a vote before merging?

@alkis alkis changed the title PARQUET-2492: Add binary protocol extensions (any thrift message) PARQUET-2492: Add binary protocol extensions Aug 27, 2024
@alkis
Copy link
Contributor Author

alkis commented Aug 27, 2024

Doesn't this require a vote before merging?

Started one here: https://lists.apache.org/thread/x3472kldrq5kjnld9ztj1jozz25f40hg

@wgtmac
Copy link
Member

wgtmac commented Aug 29, 2024

[INFO] Rat check: Summary over all files. Unapproved: 1, unknown: 1, generated: 0, approved: 28 licenses.
Warning:  Files with unapproved licenses:
  BinaryProtocolExtensions.md

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  11.651 s
[INFO] Finished at: 2024-08-22T15:38:07Z
[INFO] ------------------------------------------------------------------------
Error:  Failed to execute goal org.apache.rat:apache-rat-plugin:0.16.1:check (default) on project parquet-format: Too many files with unapproved license: 1 See RAT report in: /home/runner/work/parquet-format/parquet-format/target/rat.txt -> [Help 1]
Error:  
Error:  To see the full stack trace of the errors, re-run Maven with the -e switch.
Error:  Re-run Maven using the -X switch to enable full debug logging.
Error:  
Error:  For more information about the errors and possible solutions, please read the following articles:
Error:  [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
Error: Process completed with exit code 1.

It seems that license header should be added before merging.


If/when the encoding is ratified, it is added to the official specification as an additional type in `Encodings` at which point the extension is no longer necessary, nor the duplicated data in the row group.

## Appending extensions to thrift
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered directly adding a 32767: optional binary reserved_extension field to FileMetaData and ColumnMetaData to make it easier for implementations to append data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I have considered it. The advantage of not adding it is that readers will not materialize the extension string. This means all but the readers that care about the extension will not use additional memory or incur extra allocation because of it.

@alkis
Copy link
Contributor Author

alkis commented Aug 30, 2024

It seems that license header should be added before merging.

Added.

@wgtmac
Copy link
Member

wgtmac commented Sep 8, 2024

As the vote has passed, I will merge it if no objection or feedback received before Sep 12.

@alkis
Copy link
Contributor Author

alkis commented Sep 16, 2024

@wgtmac are we merging this?

@wgtmac
Copy link
Member

wgtmac commented Sep 17, 2024

@wgtmac are we merging this?

I want to make sure @pitrou is happy with the change.

@alkis
Copy link
Contributor Author

alkis commented Sep 18, 2024

I want to make sure @pitrou is happy with the change.

He had one suggestion which I accepted.

@wgtmac wgtmac merged commit 737ea12 into apache:master Sep 19, 2024
3 checks passed
@alkis alkis deleted the t3-metadata-experimentation branch September 20, 2024 07:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants