Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proof-of-concept Parquet GEOMETRY logical type implementation #43977

Open
wants to merge 58 commits into
base: main
Choose a base branch
from

Conversation

Kontinuation
Copy link
Member

@Kontinuation Kontinuation commented Sep 6, 2024

Rationale for this change

This is a continuation of #43196

In apache/parquet-format#240 a GEOMETRY logical type for Parquet is proposed with a proof-of-concept Java implementation ( apache/parquet-java#1379 ). This is a PR to explore what an implementation would look like in C++.

What changes are included in this PR?

We are still in progress of completing all necessary changes to integrate geometry logical type support to the C++ implementation.

  • [DONE] Adding geometry logical type
  • [DONE] Adding geometry column statistics
  • [DONE] Support reading/writing parquet files containing geometry columns

Are these changes tested?

The tests added only cover very basic use cases. Comprehensive tests will be added in future commits.

Are there any user-facing changes?

Yes! (And will eventually be documented)

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Sep 12, 2024
@Kontinuation Kontinuation marked this pull request as ready for review September 16, 2024 02:45
@Kontinuation Kontinuation changed the title WIP: Proof-of-concept Parquet GEOMETRY logical type implementation Proof-of-concept Parquet GEOMETRY logical type implementation Sep 16, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Sep 16, 2024
@Kontinuation
Copy link
Member Author

Please ping me when ready for review again. Thanks!

@wgtmac I have added ColumnIndex and covering support. Please help review this PR when you have time, thank you.

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just finished reviewing it for the 2nd pass. Thanks for the great work!

My main concern is the difference with Java PoC, which generates min/max values in the statistics and page index as if the GEOMETRY column is a pure BYTE_ARRAY column. Otherwise we need to revise the spec to add a lot of exceptions for geometry type. WDYT?

cpp/src/parquet/thrift_internal.h Outdated Show resolved Hide resolved
cpp/src/parquet/column_reader.cc Outdated Show resolved Hide resolved
cpp/src/parquet/CMakeLists.txt Show resolved Hide resolved
Geometry(std::string crs, LogicalType::GeometryEdges::edges edges,
LogicalType::GeometryEncoding::geometry_encoding encoding,
std::string metadata)
: LogicalType::Impl(LogicalType::Type::GEOMETRY, SortOrder::UNKNOWN),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
: LogicalType::Impl(LogicalType::Type::GEOMETRY, SortOrder::UNKNOWN),
: LogicalType::Impl(LogicalType::Type::GEOMETRY, SortOrder::UNSIGNED),

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SortOrder::UNSIGNED is the default sort order of BYTE_ARRAY type. Could we just use this so you don't have to change a line in column_writer.cc. The good thing is that ColumnIndex of geometry type can also be generated automatically, though the min/max values are derived from their binary values and useless. This is the same practice used in the Java PoC impl.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is life-saving. Now all the special handling of geometry statistics for unknown sort order has gone away.

cpp/src/parquet/types.cc Outdated Show resolved Hide resolved
cpp/src/parquet/reader_test.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
out.mmax = maxes[3];

if (coverings_.empty()) {
// Generate coverings from bounding box if coverings is not present
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When will coverings_ be empty? Is it the default behavior? I'm not sure if we need to check if the edges is planar since bbox is not accurate for spherical edges. BTW, if we don't have a good implementation for coverings, I think we can just ignore it for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for generating coverings from the bounding box when assembling the encoded representation of the geometry statistics. I've added a member called generate_covering_ to make it more explicit.


class GeometryStatisticsImpl;

class PARQUET_EXPORT GeometryStatistics {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding theses!

return std::static_pointer_cast<TypedStatistics<DType>>(Statistics::Make(
descr, encoded_min, encoded_max, num_values, null_count, distinct_count,
has_min_max, has_null_count, has_distinct_count, pool));
int64_t distinct_count, const EncodedGeometryStatistics& geometry_statistics,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not directly add const EncodedGeometryStatistics* geometry_statistics = NULLPTR to the end of the existing function signature?

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Sep 19, 2024
@Kontinuation
Copy link
Member Author

I just finished reviewing it for the 2nd pass. Thanks for the great work!

My main concern is the difference with Java PoC, which generates min/max values in the statistics and page index as if the GEOMETRY column is a pure BYTE_ARRAY column. Otherwise we need to revise the spec to add a lot of exceptions for geometry type. WDYT?

I've changed the min/max statistics of geometry columns to be the WKB representation of lower-left and upper-right corners in the last commit according to apache/iceberg#10981 (comment).

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick change! I've left some minor comments.

Comment on lines 247 to 251
EncodedGeometryStatistics encoded_geometry_stats;
if (stats.__isset.geometry_stats) {
encoded_geometry_stats = FromThrift(stats.geometry_stats);
}
page_statistics.set_geometry(encoded_geometry_stats);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
EncodedGeometryStatistics encoded_geometry_stats;
if (stats.__isset.geometry_stats) {
encoded_geometry_stats = FromThrift(stats.geometry_stats);
}
page_statistics.set_geometry(encoded_geometry_stats);
page_statistics.set_geometry(FromThrift(stats.geometry_stats));

return std::static_pointer_cast<TypedStatistics<DType>>(Statistics::Make(
descr, encoded_min, encoded_max, num_values, null_count, distinct_count,
has_min_max, has_null_count, has_distinct_count, pool));
int64_t distinct_count, const EncodedGeometryStatistics& geometry_statistics,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, we'd better not to add another overload. If there is a compelling reason to do so, we can add a ARROW_DEPRECATED macro to the old one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants