Recommendations and requirements for how to best contribute to Parquet. We strive to obey these as best as possible. As always, thanks for contributing--we hope these guidelines make it easier and shed some light on our approach and processes. If you believe there should be a change or exception to these rules please bring it up for discussion on the developer mailing list ([email protected]).
- `master` has the latest stable changes
- Submit pull requests against the `master` branch
- Try not to pollute your pull request with unintended changes--keep them simple and small
By contributing your code, you agree to license your contribution under the terms of the APLv2: https://github.com/apache/parquet-format/blob/master/LICENSE
Note: This section applies to actual functional changes to the specification. Fixing typos and grammar, and clarifying concepts in ways that would not change the semantics of the specification, can be done as long as a committer feels comfortable merging them. When in doubt, starting a discussion on the dev mailing list is encouraged.
The general steps for adding features to the format are as follows:
- Design/scoping: The goal of this phase is to identify the design goals of a feature and provide some demonstration that the feature meets those goals. This phase starts with a discussion of changes on the developer mailing list ([email protected]). Depending on the scope and goals of the feature, it can be useful to provide additional artifacts as part of the discussion. These artifacts can include a design document, a draft pull request to make the discussion concrete, and/or a prototype implementation to demonstrate the viability of implementation. This step is complete when there is lazy consensus. Part of the consensus is whether it is sufficient to provide two working implementations as outlined in step 2, or if demonstration of the feature with a downstream query engine is necessary to justify the feature (e.g. demonstrating performance improvements in the Apache Arrow C++ Dataset library, the Apache DataFusion query engine, or any other open source engine).
- Completeness: The goal of this phase is to ensure the feature is viable and that there is no ambiguity in its specification, by demonstrating compatibility between implementations. Once a change has lazy consensus, two implementations of the feature demonstrating interoperability must also be provided. One implementation MUST be `parquet-java`. It is preferred that the second implementation be `parquet-cpp` or `parquet-rs`; however, at the discretion of the PMC, any open source Parquet implementation may be acceptable. Implementations whose contributors actively participate in the community (e.g. keep their feature matrix up-to-date on the Parquet website) are more likely to be considered. If discussed as a requirement in step 1 above, demonstration of integration with a query engine is also required for this step. The implementations must be made available publicly, and they should be fit for inclusion (for example, they were submitted as a pull request against the target repository and committers gave positive reviews). Reports on the benefits from closed source implementations are welcome and can help lend weight to a feature's desirability, but are not sufficient for acceptance of a new feature.
Unless otherwise discussed, it is expected that the implementations will be developed from their respective main branches (i.e. backporting is not required), to demonstrate that the feature is mergeable into each implementation.
- Ratification: After the first two steps are complete, a formal vote is held on [email protected] to officially ratify the feature. After the vote passes, the format change is merged into the `parquet-format` repository, and it is expected the changes from step 2 will also be merged soon after (implementations should not be merged until the addition has been merged to `parquet-format`).
- To the greatest extent possible, changes should have an option for forward compatibility (old readers can still read files). The compatibility and feature enablement section below provides more details on expectations for changes that break compatibility.
- New encodings should be fully specified in this repository and not rely on an external dependency for implementation (i.e. `parquet-format` is the source of truth for the encoding). If an encoding does require an external dependency, then the external dependency must have its own specification separate from its implementation.
- New compression mechanisms should have a pure Java implementation that can be used as a dependency in `parquet-java`; exceptions may be discussed on the mailing list to see if a non-native Java implementation is acceptable.
The Parquet PMC aims to do releases of the format package only as needed when new features are introduced. If multiple new features are being proposed simultaneously, some features might be consolidated into the same release. Guidance is provided below on when implementations should enable features added to the specification. Due to confusion in the past over Parquet versioning, it is not expected that there will be a 3.x release of the specification in the foreseeable future.
For the purposes of this discussion we classify features into the following buckets:
- Backward compatible. A file written under an older version of the format should be readable under a newer version of the format.
- Forward compatible. A file written under a newer version of the format with the feature enabled can be read under an older version of the format, but some metadata might be missing or performance might be suboptimal. Simply phrased, forward compatible means all data can be read back in an older version of the format. New logical types are considered forward compatible despite the loss of semantic meaning.
- Forward incompatible. A file written under a newer version of the format with the feature enabled cannot be read under an older version of the format (e.g. adding and using a new compression algorithm). It is expected any feature in this category will provide a signal to older readers, so they can unambiguously determine that they cannot properly read the file (e.g. via adding a new value to an existing enum; see the sketch below).
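The following is a minimal, self-contained Java sketch of the signaling idea in the last item; the class, codec ids, and error handling are purely illustrative and are not `parquet-java` code. An older reader maps the codec id found in the file metadata onto the set of codecs it knows about, and reports an unambiguous error when the id comes from a newer format version.

```java
import java.util.Optional;

// Illustrative only: a hypothetical "older reader" that refuses to read data
// compressed with a codec id it does not recognize, instead of failing in an
// ambiguous way or producing garbage.
public class CodecCheckExample {

  // Codec ids known to this reader; a newer format release may define ids
  // that are missing from this enum.
  enum KnownCodec {
    UNCOMPRESSED(0), SNAPPY(1), GZIP(2), ZSTD(6);

    final int id;

    KnownCodec(int id) { this.id = id; }

    static Optional<KnownCodec> fromId(int id) {
      for (KnownCodec c : values()) {
        if (c.id == id) return Optional.of(c);
      }
      return Optional.empty();
    }
  }

  static KnownCodec requireReadable(int codecIdFromFooter) {
    // An unrecognized id is the "signal": the reader knows it cannot decode
    // the data and says so clearly.
    return KnownCodec.fromId(codecIdFromFooter)
        .orElseThrow(() -> new UnsupportedOperationException(
            "Unknown codec id " + codecIdFromFooter
                + "; this reader cannot read the file"));
  }

  public static void main(String[] args) {
    System.out.println(requireReadable(1));   // SNAPPY: readable

    try {
      requireReadable(42);                    // hypothetical id from a newer format
    } catch (UnsupportedOperationException e) {
      System.out.println(e.getMessage());     // unambiguous failure
    }
  }
}
```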
New features are intended to be widely beneficial to users of Parquet, and therefore it is hoped third-party implementations will adopt them quickly after they are introduced. It is assumed that writing new parts of the format, and especially forward incompatible features, will be gated behind a feature flag that defaults to "off", with the feature turned on by default at some future point (reading of the new feature will typically be enabled without configuration or default to on); a minimal sketch of this convention follows the list below. Some amount of lead time is desirable to ensure a critical mass of Parquet implementations support a feature, to avoid compatibility issues across the ecosystem. Therefore, the Parquet PMC gives the following recommendations for managing features:
- Backward compatibility is the concern of implementations, but given the ubiquity of Parquet and the length of time it has been used, libraries should support reading older versions of the format to the greatest extent possible.
- Forward compatible features/changes may be enabled and used by default in implementations once the parquet-format release containing those changes has been formally published. For features that may pose a significant performance regression to older format readers, libraries should consider delaying default enablement until 1 year after the release of the parquet-java implementation that contains the feature implementation.
- Forward incompatible features/changes should not be turned on by default until 2 years after the parquet-java implementation containing the feature is released. It is recommended that changing the default value for a forward incompatible feature flag should be clearly advertised to consumers (e.g. via a major version release if using Semantic Versioning, or highlighted in release notes).
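The write-side convention described before this list (new format constructs gated behind a flag that defaults to "off", flipped on only after the recommended lead time) can be sketched as follows. All class and option names here are hypothetical and are not part of any Parquet library's API.

```java
// Hypothetical writer options illustrating "off by default" for a new,
// forward incompatible feature; not the API of any real Parquet library.
public class FeatureFlagExample {

  static final class WriterOptions {
    // A forward incompatible feature defaults to "off"; users opt in once they
    // know their readers can handle the new construct. A library would flip
    // this default only after the lead times recommended above.
    private boolean newEncodingEnabled = false;

    WriterOptions enableNewEncoding() {
      this.newEncodingEnabled = true;
      return this;
    }

    boolean isNewEncodingEnabled() {
      return newEncodingEnabled;
    }
  }

  public static void main(String[] args) {
    WriterOptions defaults = new WriterOptions();
    System.out.println("default:  " + defaults.isNewEncodingEnabled()); // false

    WriterOptions optIn = new WriterOptions().enableNewEncoding();
    System.out.println("opted in: " + optIn.isNewEncodingEnabled());    // true

    // Reading the new encoding, by contrast, would typically be supported
    // unconditionally as soon as a reader implements it.
  }
}
```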
For forward compatible changes which have a high chance of performance regression for older readers, and for forward incompatible changes, implementations should clearly document the compatibility issues. Additionally, while it is up to maintainers of individual open-source implementations to make the best decision to serve their ecosystem, they are encouraged to start enabling features by default along the same timelines as `parquet-java`. `parquet-java` will wait to enable features by default until the most conservative timelines outlined above have been exceeded. This timeline is an attempt to balance ensuring new features make their way into the ecosystem and avoiding breaking compatibility for readers that are slower to adopt new standards. We encourage earlier adoption of new features when an organization using Parquet can guarantee that all readers of the Parquet files they produce can read a new feature.
After turning a feature on by default, implementations are encouraged to keep a configuration option to turn the feature off. A recommendation for full deprecation will be made in a future iteration of this document.
For features released prior to October 2024, target dates for each of these categories will be updated as part of the `parquet-java` 2.0 release process, based on a collected feature compatibility matrix.
For each release of `parquet-java` or `parquet-format` that influences this guidance, it is expected that exact dates will be added to `parquet-format` to provide clarity to implementors (e.g. when `parquet-java` 2.X.X is released, any new format features it uses will be updated with concrete dates). As part of `parquet-format` releases, the compatibility matrix will be updated to contain the release date in the format. Implementations are also encouraged to provide implementation date/release version information when updating the feature matrix.
End users of software are generally encouraged to consult the feature matrix and vendor documentation before enabling features that are not yet widely adopted.