PARQUET-2492: Add binary protocol extensions #254
I marked this ready for review. I have tested offline that this method works as expected and has virtually no impact on the parse speed of the original. How can we move this forward?
Thank you @alkis, I left a few comments.
Minor nit: It is nice to add commits to the PR when addressing comments so that we can see the history of when the comments were added. When amending the same commit and force-pushing, as was done here, we can't see the history.
It would be beneficial to avoid confusion. I can't think of a great name for it though. If we name them "experiments" it will make them look risky and can reduce adoption when we use them for migration. "plugins" - I don't like it either. Got any suggestions? For the extensions in the discussion, I do not have a lot of context: what's the advantage of such extensions vs adding the logical types to the spec?
The point is to decouple the Thrift / Format spec from the definition of new extension (logical) types. It allows third-party definitions of such types and gradual standardization thereof (either inside the Apache Parquet project itself, or by consensus inside a subcommunity). It also makes it easier for the implementations of said types to live outside of the core Parquet implementations, which is desirable for complex and/or highly-specific domains such as with GeoParquet. (personally, I would rather Parquet C++ didn't have to reimplement logic for geospatial data :-))
"Thrift protocol escapes"? "Thrift binary appends"?
Makes sense. Should we name these extensions type-extensions or third-party-types or external-types? Should we name the extensions in this PR format-extensions (not sure if this is specific enough though).
The reason to prefer "extension types" over "third-party types" or "external types" is that at some point some of them might get standardized inside Parquet, like Arrow does. Though we could also dictate a policy that standardizing an extension type is done by creating a new logical type. @wgtmac
@pitrou wdyt about calling the extensions in this PR "Metadata Extensions" vs the other ones "Type Extensions"? Would that clear it? Bonus is that if/when we add flatbuffers the extension points for flatbuffers will also fall under "Metadata Extensions".
Agreed. I think we can simply follow the rule of Arrow's extension type. A naive proposal would be adding a custom key-value metadata field to
Perhaps "Binary protocol extensions" or "Unparsed protocol extensions"? "Metadata" is usually vague.
Qualified as "Binary Protocol Extensions".
Friendly ping. Since there are no other comments can we merge this?
Doesn't this require a vote before merging?
Started one here: https://lists.apache.org/thread/x3472kldrq5kjnld9ztj1jozz25f40hg
It seems that a license header should be added before merging.
If/when the encoding is ratified, it is added to the official specification as an additional type in `Encodings` at which point the extension is no longer necessary, nor the duplicated data in the row group.

## Appending extensions to thrift
Have you considered directly adding a `32767: optional binary reserved_extension` field to `FileMetaData` and `ColumnMetaData` to make it easier for implementations to append data?
Yes, I have considered it. The advantage of not adding it is that readers will not materialize the extension string. This means that only the readers that care about the extension use additional memory or incur an extra allocation because of it.
Added.
As the vote has passed, I will merge it if no objections or feedback are received before Sep 12.
Co-authored-by: Antoine Pitrou
@wgtmac are we merging this?
He had one suggestion which I accepted.
Specify a backwards/forward compatible way to extend any Thrift struct in Parquet.
ref Parquet binary protocol extensions