
PARQUET-2471: Add geometry logical type #240

Open · wants to merge 16 commits into base: master

Conversation

@wgtmac (Member) commented May 10, 2024

Apache Iceberg is adding geospatial support: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI. It would be good if Apache Parquet could support a geometry type natively.

@jiayuasu (Member)

@wgtmac Thanks for the work. On the other hand, I'd like to highlight that GeoParquet (https://github.com/opengeospatial/geoparquet/tree/main) has been around for a while and a lot of geospatial software has started to support reading and writing it.

Is the ultimate goal of this PR to merge the GeoParquet spec into Parquet completely, or might it end up creating a new spec that is not compatible with GeoParquet?

@jiayuasu (Member)

Geo Iceberg does not need to conform to GeoParquet because people should not directly use a Parquet reader to read Iceberg Parquet files anyway. So that's a separate story.

@wgtmac (Member Author) commented May 11, 2024

Is the ultimate goal of this PR to merge the GeoParquet spec into Parquet completely, or might it end up creating a new spec that is not compatible with GeoParquet?

@jiayuasu That's why I've asked about the possibility of direct compliance with the GeoParquet spec in the Iceberg design doc. I don't intend to create a new spec. Instead, it would be good if the proposal here can meet the requirements of both Iceberg and GeoParquet, or share the common parts to make the conversion between Iceberg Parquet and GeoParquet lightweight. We do need advice from the GeoParquet community to make that possible.

@szehon-ho left a comment

From the Iceberg side, I am excited about this. I think defining the type formally in parquet-format will make geospatial interop easier in the long run, and it will also unlock row-group filtering, for example for Iceberg's add_files with Parquet files. Perhaps there can be conversion utils for GeoParquet if we go ahead with this, and I'd definitely like to see what they think.

I'm new on the Parquet side, so I had some questions.

@wgtmac marked this pull request as ready for review on May 11, 2024 16:13
@wgtmac changed the title from "WIP: Add geometry logical type" to "PARQUET-2471: Add geometry logical type" on May 11, 2024
@pitrou (Member) commented May 15, 2024

@paleolimbot is quite knowledgeable on the topic and could probably give useful feedback.

@pitrou (Member) commented May 15, 2024

I wonder if purely informative metadata really needs to be represented as Thrift types. When we define canonical extension types in Arrow, metadata is generally serialized as a standalone JSON string.

Doing so in Parquet as well would lighten the maintenance workload on the serialization format, and would also allow easier evolution of geometry metadata to support additional information.

Edit: this seems to be the approach adopted by GeoParquet as well.

@paleolimbot (Member) left a comment

I wonder if purely informative metadata really needs to be represented as Thrift types. When we define canonical extension types in Arrow, metadata is generally serialized as a standalone JSON string.

In reading this I do wonder if there should just be an extension mechanism here instead of attempting to enumerate all possible encodings in this repo. The people that are engaged and working on implementations are the right people to engage here, which is why GeoParquet and GeoArrow have been successful (we've engaged the people who care about this, and they are generally not paying attention to apache/parquet-format nor apache/arrow).

There are a few things that this PR solves in a way that might not be possible using EXTENSION, one of which is column statistics. It would be nice to have some geo-specific things there (although maybe that can also be part of the extension mechanism). Another thing that comes up frequently is where to put a spatial index (rtree)... I don't think there's any good place for that at the moment.

It would be nice to allow the format to be extended in a way that does not depend on schema-level metadata... the things we do in the GeoParquet standard (store bounding boxes, refer to columns by name) become stale with the ways that schema metadata are typically propagated through projections and concatenations.

@wgtmac (Member Author) commented May 17, 2024

I wonder if purely informative metadata really needs to be represented as Thrift types. When we define canonical extension types in Arrow, metadata is generally serialized as a standalone JSON string.

Doing so in Parquet as well would lighten the maintenance workload on the serialization format, and would also allow easier evolution of geometry metadata to support additional information.

Edit: this seems to be the approach adopted by GeoParquet as well.

@pitrou Yes, that might be an option. Then we can perhaps use the same JSON string defined in the Iceberg doc. @jiayuasu @szehon-ho WDYT?

EDIT: I think we can remove those informative attributes like subtype, orientation, and edges. Perhaps encoding can be removed as well if we only support WKB. dimension is something that we must be aware of because we need to build the bbox, which depends on whether the coordinates are represented as xy, xyz, xym, or xyzm.
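
To illustrate that last point, here is a minimal sketch (hypothetical helper, not code from this PR) of how a writer might accumulate bbox values while scanning coordinates, with the tracked ranges depending on the declared dimension:

def update_bbox(bbox, coords, dimension="xy"):
    # bbox is a dict of running extremes; coords is an iterable of tuples
    # ordered as the dimension string implies (e.g. (x, y, z, m) for "xyzm").
    has_z = "z" in dimension
    has_m = "m" in dimension
    for c in coords:
        bbox["xmin"] = min(bbox.get("xmin", c[0]), c[0])
        bbox["xmax"] = max(bbox.get("xmax", c[0]), c[0])
        bbox["ymin"] = min(bbox.get("ymin", c[1]), c[1])
        bbox["ymax"] = max(bbox.get("ymax", c[1]), c[1])
        if has_z:
            bbox["zmin"] = min(bbox.get("zmin", c[2]), c[2])
            bbox["zmax"] = max(bbox.get("zmax", c[2]), c[2])
        if has_m:
            m = c[3] if has_z else c[2]
            bbox["mmin"] = min(bbox.get("mmin", m), m)
            bbox["mmax"] = max(bbox.get("mmax", m), m)
    return bbox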

@wgtmac (Member Author) commented May 17, 2024

Another thing that comes up frequently is where to put a spatial index (rtree)

I think this could be something similar to the page index or bloom filter in Parquet, which are stored somewhere between row groups or before the footer. It could be at the row group level or the file level as well.

It would be nice to allow the format to be extended in a way that does not depend on schema-level metadata.

I think we really need your advice here. If you could rethink the design of GeoParquet, how could it do better if the Parquet format had some geospatial knowledge? @paleolimbot @jiayuasu

@paleolimbot (Member)

If you could rethink the design of GeoParquet, how could it do better if the Parquet format had some geospatial knowledge?

The main reason that the schema-level metadata had to exist is that there was no way to put anything custom at the column level to give geometry-aware readers extra metadata about the column (CRS being the main one) and global column statistics (bbox). Bounding boxes at the feature level (worked around as a separate column) are the second somewhat ugly thing, which gives reasonable row group statistics for many things people might want to store. It seems like this PR would solve most of that.

I am not sure that a new logical type will catch on to the extent that GeoParquet will, although I'm new to this community and I may be very wrong. The GeoParquet working group is enthusiastic and encodings/strategies for storing/querying geospatial datasets in a data lake context are evolving rapidly. Even though it is a tiny bit of a hack, using extra columns and schema-level metadata to encode these things is very flexible and lets implementations be built on top of a number of underlying readers/underlying versions of the Parquet format.

@wgtmac (Member Author) commented May 18, 2024

@paleolimbot I'm happy to see the fast evolution of the GeoParquet spec. I don't think the addition of a geometry type aims to replace or deprecate anything from GeoParquet. Instead, GeoParquet can simply ignore the new type for now, or leverage the built-in bbox if beneficial. For additional (informative) attributes of the geometry type, if some of them are stable and it makes sense to store them natively in Parquet column metadata, then perhaps we can work together to make that happen? I think the main goal of this addition is to enhance interoperability of geospatial data across systems while keeping the effort to convert to GeoParquet small.

@Kontinuation (Member)

Another thing that comes up frequently is where to put a spatial index (rtree)

I think this could be something similar to the page index or bloom filter in Parquet, which are stored somewhere between row groups or before the footer. It could be at the row group level or the file level as well.

The bounding-box based sort order defined for the geometry logical type is already good enough for performing row-level and page-level data skipping. A spatial index such as an R-tree may not be suitable for Parquet. I am aware that flatgeobuf has an optional static packed Hilbert R-tree index, but for that index to be effective, flatgeobuf supports random access of records and does not support compression. The minimal granularity of reading data in Parquet files is the data page, and pages are usually compressed, so it is impossible to access records within pages randomly.
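
As a rough illustration of the kind of skipping this enables (hypothetical helper names, not from any Parquet implementation), a reader could compare each row group's stored bbox against the query window and skip everything that does not overlap:

def bbox_intersects(a, b):
    # a and b are (xmin, ymin, xmax, ymax) tuples in the same planar CRS
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def row_groups_to_read(row_group_bboxes, query_bbox):
    # keep only row groups whose stored bbox overlaps the query window;
    # the rest can be skipped without decoding any WKB values
    return [i for i, bb in enumerate(row_group_bboxes)
            if bbox_intersects(bb, query_bbox)]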

@paleolimbot (Member)

I'm happy to see the fast evolution of the GeoParquet spec. I don't think the addition of a geometry type aims to replace or deprecate anything from GeoParquet.

I agree! I think first-class geometry support is great and I'm happy to help wherever I can. I see GeoParquet as a way for existing spatial libraries to leverage Parquet, and it is not well-suited to Parquet-native things like Iceberg (although others working on GeoParquet may have a different view).

Extension mechanisms are nice because they allow an external community to hash out the discipline-specific details where these evolve at an orthogonal rate to that of the format (e.g., GeoParquet), which generally results in buy-in. I'm not familiar with the speed at which the changes proposed here can evolve (or how long it generally takes readers to implement them), but if @pitrou's suggestion of encoding the type information or statistics in serialized form makes it easier for this to evolve it could provide some of that benefit.

A spatial index such as an R-tree may not be suitable for Parquet

I also agree here (but it did come up a lot of times in the discussions around GeoParquet). I think developers of Parquet-native workflows are well aware that there are better formats for random access.

@paleolimbot (Member)

I think we really need your advice here. If you could rethink the design of GeoParquet, how could it do better if the Parquet format had some geospatial knowledge?

I opened up opengeospatial/geoparquet#222 to collect some thoughts on this...we discussed it at our community call and I think we mostly just never considered that the Parquet standard would be interested in supporting a first-class data type. I've put my thoughts there but I'll let others add their own opinions.

@jorisvandenbossche (Member)

Just to ensure my understanding is correct:

  • This is proposing to add a new logical type annotating the BYTE_ARRAY physical type. For readers that expect just such a BYTE_ARRAY column (e.g. existing GeoParquet implementations), is that compatible if the column starts having a logical type as well? (Although I assume this might depend on how the specific Parquet reader implementation deals with an unknown logical type, i.e. whether it errors out or automatically falls back to the physical type.)
  • For such "legacy" readers (just reading the WKB values from a binary column), the only thing that actually changes (apart from the logical type annotation) are the values of the statistics? Now, I assume that right now no GeoParquet reader is using the statistics of the binary column, because the physical statistics for BYTE_ARRAY ("unsigned byte-wise comparison") are essentially useless when those binary blobs represent WKB geometries (see the sketch below). So again that should probably not give any compatibility issues?
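
A tiny sketch of why byte-wise statistics say nothing useful about WKB values (illustrative only; the helper below builds little-endian 2D WKB points by hand):

import struct

def wkb_point(x, y):
    # byte order 1 = little-endian, geometry type 1 = Point, then x and y doubles
    return struct.pack("<BIdd", 1, 1, x, y)

a = wkb_point(-170.0, 10.0)
b = wkb_point(10.0, -80.0)
# Unsigned byte-wise min/max of these blobs depends on the raw IEEE-754 bytes,
# not on where the points are, so such statistics cannot drive spatial filtering.
print(min(a, b) == a)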

@jorisvandenbossche (Member)

Although I assume this might depend on how the specific Parquet reader implementation deals with an unknown logical type, i.e. whether it errors out or automatically falls back to the physical type.

To answer this part myself: at least for the Parquet C++ implementation, it seems an error is raised for unknown logical types, and it doesn't fall back to the physical type. So that does complicate the compatibility story...

@wgtmac (Member Author) commented May 21, 2024

@jorisvandenbossche I think your concern makes sense. It would be a bug if parquet-cpp fails due to an unknown logical type, and we need to fix that. I also have concerns about the new ColumnOrder and need to do some testing. Adding a new logical type should not break anything for legacy readers.

@jornfranke commented May 21, 2024

Apache Iceberg is adding geospatial support: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI. It would be good if Apache Parquet could support a geometry type natively.

No one has really worked on the geo integration into Iceberg for some time: apache/iceberg#2586

@szehon-ho

No one has really worked on the geo integration into Iceberg for some time: apache/iceberg#2586

Yes, there is now a concrete proposal (apache/iceberg#10260), and the plan currently is to bring it up in the next community sync.

@cholmes commented May 23, 2024

Thanks for doing this @wgtmac - it's awesome to see this proposal! I helped initiate GeoParquet, and hope we can fully support your effort.

@paleolimbot I'm happy to see the fast evolution of the GeoParquet spec. I don't think the addition of a geometry type aims to replace or deprecate anything from GeoParquet. Instead, GeoParquet can simply ignore the new type for now, or leverage the built-in bbox if beneficial.

That makes sense, but I think we're also happy to have GeoParquet replaced! As long as it can 'scale up' to meet all the crazy things that hard-core geospatial people need, while also being accessible to everyone else. If Parquet had had geospatial types from the start, we wouldn't have started GeoParquet. We spent a lot of time and effort trying to get the right balance between making it easy to implement for those who don't care about the complexity of geospatial (edges, coordinate reference systems, epochs, winding), while also having the right options to handle it for those who do. My hope has been that the decisions we made there will make it easier to add geospatial support to any new format - for example, a 'geo-ORC' could use the same fields and options that we added.

For additional (informative) attributes of the geometry type, if some of them are stable and it makes sense to store them natively in Parquet column metadata, then perhaps we can work together to make that happen? I think the main goal of this addition is to enhance interoperability of geospatial data across systems while keeping the effort to convert to GeoParquet small.

Sounds great! Happy to have GeoParquet be a place to 'try out' things. But I think ideally the surface area of 'GeoParquet' would be very minimal or not even exist, and that Parquet would just be the ideal format to store geospatial data in. And I think if we can align well between this proposal and GeoParquet that should be possible.

@paleolimbot (Member)

I like the latest change! (i.e., keep the bbox in thrift, reduce the future coverings options to just WKB for now). The bounding box will always be applicable and it makes sense to keep it in thrift.

@jorisvandenbossche (Member)

Another quick note: we should mention something about the coordinate axis order being x/y (lon/lat, easting/northing). The text from geoparquet is here: https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#coordinate-axis-order

@mkaravel left a comment

I have a couple of questions regarding intended semantics.

*/
enum Edges {
PLANAR = 0;
SPHERICAL = 1;

What does this really mean? That we only allow either Cartesian geometries, or geometries embedded on a sphere? How would one indicate geometries embedded on a spheroid (like WGS84)?


@mkaravel

If one has a column for Cartesian geometries, use Edges.Planar. If one has a column for geometries on a sphere (a.k.a. a Geography type), use Edges.Spherical.


What about geodesic edges? Do we disallow them? How would you represent a geometry in WGS84?


Isn't this duplicating the information provided by the CRS?


What about geodesic edges?

If producing and/or storing data with geodesic edges becomes a widespread use case, another enum value could be added. Until that is the case, spherical edges tessellated with enough vertices to reduce the error to an acceptable level are almost certainly sufficient for any practical use case.

Isn't this duplicating the information provided by the CRS?

The coordinate system (i.e., context for individual coordinate tuples) and how the dataset author intended a line to be drawn between any two coordinates are orthogonal concepts: GIS systems, for better or worse, are frequently used to "simplify" datasets and can do things like model the very long segment of the US-Canada border along the 49th parallel as a single line segment. Because there is quite a lot of data that was authored in this way (and few ways to produce data with spherical edges, even though this makes more sense in a global context), we need a way to express that intent here.
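
For intuition, a small sketch of the difference (illustrative only, hypothetical helpers): the same pair of vertices implies a different path depending on the edge interpolation.

import math

def planar_midpoint(lon1, lat1, lon2, lat2):
    # straight line in lon/lat space
    return ((lon1 + lon2) / 2, (lat1 + lat2) / 2)

def spherical_midpoint(lon1, lat1, lon2, lat2):
    # midpoint of the great-circle arc, via averaged unit vectors
    def to_xyz(lon, lat):
        lon, lat = math.radians(lon), math.radians(lat)
        return (math.cos(lat) * math.cos(lon), math.cos(lat) * math.sin(lon), math.sin(lat))
    ax, ay, az = to_xyz(lon1, lat1)
    bx, by, bz = to_xyz(lon2, lat2)
    x, y, z = (ax + bx) / 2, (ay + by) / 2, (az + bz) / 2
    return (math.degrees(math.atan2(y, x)), math.degrees(math.atan2(z, math.hypot(x, y))))

# A long east-west segment at 49°N (roughly the border example above):
print(planar_midpoint(-139, 49, -95, 49))     # (-117.0, 49.0)
print(spherical_midpoint(-139, 49, -95, 49))  # about (-117.0, 51.1): the arc bulges north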


The coordinate system (i.e., context for individual coordinate tuples) and how the dataset author intended a line to be drawn between any two coordinates are orthogonal concepts

So the intent is to express the interpolation method between points? Should the enumeration have "interpolation" or something similar in its name to make that intent clearer?


I think the word interpolation would be great to include in the documentation! EdgeInterpolation is a bit of a mouthful and I prefer Edges (but with no strong feelings either way 🙂 )

@wgtmac (Member Author)

I also prefer adding documentation about interpolation to Edges but keeping the name simple.

* but it is recommended that the writer always generates bounding box statistics,
* regardless of whether the geometries are planar or spherical.
*/
struct BoundingBox {

For spherical edges, are there any restrictions as to what is considered a valid bounding box?

I would even ask the same for Cartesian edges, although it is less ambiguous here. For example, what happens if xmax is smaller than xmin? Is it up to the reader to decide what to do?


@mkaravel

The bounding box statistics will always be calculated by taking the min/max of the X and Y values. For Cartesian geometries, this is not a problem at all because there are no antimeridian or orientation issues in this case.

For spherical geometries, this is a problem, which is why the spec says that it is not safe to use them for spherical geometries.

In this proposal, (1) there is another type of statistics called covering that attempts to solve this issue, and (2) there is another field holding string metadata which can be used to indicate orientation (clockwise / counter-clockwise), as in the GeoParquet metadata: https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L46

However, in this proposal the main focus is geometry, not geography. The geography problem (a.k.a. spherical geometries) will be the focus of the next phase.
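
To make the antimeridian issue concrete, a tiny illustration (hypothetical helper, not part of the spec): a short segment crossing longitude ±180 gets a naive min/max bbox that spans almost the whole globe.

def naive_bbox(coords):
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))

# a segment only ~20 degrees long, but crossing the antimeridian
print(naive_bbox([(170.0, 10.0), (-170.0, 12.0)]))
# -> (-170.0, 10.0, 170.0, 12.0), i.e. 340 degrees wide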


The bounding box statistics will always be calculated by taking the min/max of the X and Y values.

For geometries on a sphere or a spheroid this may no longer produce a bounding box if you restrict to just the vertices of the geometries.


However, in this proposal the main focus is geometry, not geography. The geography problem (a.k.a. spherical geometries) will be the focus of the next phase.

Is GEOGRAPHY going to be a different logical type? Maybe that is not such a bad idea, in which case the current proposal may want to focus just on Cartesian space and nothing else.


While PostGIS uses a distinct type for geography, this is not necessarily a good idea that other formats should reproduce (I think it was due to historical constraints). Any geometry associated with a geographic CRS may be considered "geography". And as said by mkaravel, naive computation of min/max values will not work in the latter case.

I suggest removing the sentence saying "it is recommended that the writer always generates bounding box statistics, regardless of whether the geometries are planar or spherical". I would even avoid saying that the values are min/max, or say that this is the case in a Cartesian coordinate system but not necessarily in other coordinate systems.

In particular, I suggest keeping open the possibility that "min" > "max". This is the only approach I know of that works with the antimeridian. Bounding boxes in the EPSG database are already expressed that way. Some standards such as GGXF also mandate it. It makes calculation of union and intersection more complicated, but libraries only need to define union and intersection methods in one place and use them consistently.

I'm not saying that the specification should support "min" > "max" now, just suggesting to keep this door open.


While PostGIS uses a distinct type for geography, this is not necessarily a good idea that other formats should reproduce (I think it was due to historical constraints)

I think consolidating GEOGRAPHY and GEOMETRY in the same type helps simplify things and more succinctly describe what the difference is between the two. Considering them as separate types is helpful in type systems where parameterized types are difficult or impossible to express (e.g., Postgres).

I'm not saying that the specification should support "min" > "max" now, just suggesting to keep this door open.

I don't think this is a good idea... dataset authors should split anything that crosses the antimeridian into two polygons to satisfy this constraint (which would allow Parquet implementations to do filter pushdown without ever having to inspect the crs key). The "door open" is the "covering", whose payload is arbitrary key/value pairs and could be used to include this type of specification in the future if it can be demonstrated to be useful in this context.

I suggest removing the sentence saying "it is recommended that the writer always generates bounding box statistics, regardless of whether the geometries are planar or spherical".

I agree with this... I think it's fine if they exist, but saying that writers should or must generate them is confusing (since they shouldn't be used).


dataset authors should split anything that crosses the antimeridian into two polygons

Not necessarily. This approach introduces other problems (e.g. it creates an artificial edge), and "min" > "max" is a good alternative which avoids this problem. Some dataset authors already use the latter (e.g. EPSG) and may not agree that they should change. This approach is also already used by some OGC/ISO standards.

which would allow Parquet implementations to do filter pushdown without ever having to inspect the crs key

It is not necessary to inspect the CRS key if the specification says that every "min" > "max" case is a wraparound. It is not even necessary to know whether the wraparound is 360° or any other value. It works in a truly generic way. The only requirement is to have more advanced implementations of the contains and intersects methods than the naive ones. This is only mathematics; it could be documented in the specification, and it would be truly generic (not specific to geographic data).


Thank you for the background!

I still think this is a better fit for a dedicated (future) Covering kind/payload that works with spherical edges such that the basic level of support can be mostly ignorant of geometry-specific concepts (but happy to hear other opinions).


This is not a geometry-specific concept. The idea is to reinterpret the "min" and "max" values so that they are no longer min/max, but "start value" and "end value". Then add the following rules:

  • The interior is everything from the start value to the end value, going in the direction of increasing values.
  • If start > end, then we go from start value to positive infinity, wrap around to negative infinity and continue until the end value.

That's all. No geometric concept, no need to know that the wraparound happens at ±180°. It works as long as all coordinates are inside a consistent range of validity. It may be [−180 … +180]° of longitude, or [0 … 360]°, or anything else at the user's choice (actually defined by the CRS). As long as all coordinates are guaranteed to be inside the same range, the above algorithm does not need to know that range.

This proposal works for all kinds of wraparound: not only longitudes, but also climatological data (e.g. average temperatures of January, February … December, then back to January with no particular year associated with those months), radar, etc. By contrast, if we choose to add a Geography type for handling longitude wraparound, then we would also need a Radar type for data in a polar coordinate system, a ClimateCalendar type for a temporal coordinate system, etc. This is not scalable.

The above needs are not hypothetical. By coincidence, I am facing this very week the problem of using climatological data in GIS systems that do not understand temporal wraparound. Those data are common at WMO, NATO and other organizations. The "min" > "max" proposal addresses the general case. The Geography type does not.
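
A small sketch of the interval logic described above (illustrative only, not proposed spec text): treating min/max as start/end with wraparound only changes the contains/intersects tests.

def interval_contains(start, end, value):
    if start <= end:
        return start <= value <= end
    # wraparound: [start, +inf) plus (-inf, end]
    return value >= start or value <= end

def intervals_intersect(a_start, a_end, b_start, b_end):
    # two (possibly wraparound) intervals overlap iff one contains the other's start
    return (interval_contains(a_start, a_end, b_start)
            or interval_contains(b_start, b_end, a_start))

# a bbox x-range crossing the antimeridian: from 170°E around to 170°W
print(intervals_intersect(170.0, -170.0, -175.0, -160.0))  # True
print(intervals_intersect(170.0, -170.0, -160.0, -150.0))  # False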


The bounding box statistics will always be calculated by taking the min/max of the X and Y values. For Cartesian geometries, this is not a problem at all because there are no antimeridian or orientation issues in this case.

For spherical geometries, this is a problem, which is why the spec says that it is not safe to use them for spherical geometries.

FYI @dmitrykoval. We were discussing earlier that some engines still have a way to use bounding boxes even for spherical edges (there are ways to convert?).

* Add the new suggestion according to the meeting with Snowflake

* Refine the description according to the suggestion
@wgtmac (Member Author) commented Sep 18, 2024

Update: Two PoC implementations (apache/parquet-java#2971 and apache/arrow#43977) are approaching their final stages. The current design has proven to work as expected. In the meantime, the Iceberg community is waiting for this feature to be added to its v3 spec. Therefore, it's time to finalize the spec.

Could you help review it again with your geospatial expertise? @cholmes @jorisvandenbossche @desruisseaux @zhangfengcdt @jiayuasu @paleolimbot @szehon-ho @Kontinuation

At the same time, it would be great if the Parquet PMC members could help review it as well: @pitrou @emkornfield @gszadovszky @julienledem @shangxinli @rdblue @danielcweeks

Thanks in advance!

@jiayuasu (Member) left a comment

The PR looks good to me now!

@paleolimbot (Member) left a comment

I think this PR is looking very good! Thank you to all, but especially @wgtmac, @jiayuasu, and @cholmes for the continued support, implementation work, and review for all of this!

I've offered a few (optional!) changes to make some of these definitions more precise, but I'm also happy to propose those clarifications later on if they're not helpful at this stage.

Comment on lines +555 to +557
* Additional informative metadata.
* GeoParquet could offload its column metadata in a JSON-encoded UTF-8 string:
* https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L46

Suggested change
* Additional informative metadata.
* GeoParquet could offload its column metadata in a JSON-encoded UTF-8 string:
* https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L46
* Additional informative metadata formatted as a JSON-encoded UTF-8 string provided
* for future informative metadata that may be useful to include but is not strictly
* required by the low-level Parquet implementation to implement filter pushdown.

...or some other clarification to specifically indicate what is or is not allowed here at the time of this writing?


Can this metadata be anything other than a simple key-value mapping? If it's just the latter, perhaps this can be an optional list<KeyValue> key_value_metadata as in other places?


I would be in favour of a list<KeyValue> here (which would make it easier to access the values without explicitly requiring a JSON parser), but I also don't mind a single JSON object either (as long as we make it clear that it's supposed to be a JSON object).


The list<KeyValue> would certainly be easier (and probably faster) to process by existing readers.


+1 for list, as it sounds like that better matches how things work here. I believe JSON was used because that was the way to specify this in the user space that we have.

Comment on lines +479 to +482
* To be clear, we follow the same rule of WKB and coordinate axis order from
* GeoParquet [1][2]. It is the ISO WKB supporting XY, XYZ, XYM, XYZM and the
* standard geometry types (Point, LineString, Polygon, MultiPoint,
* MultiLineString, MultiPolygon, and GeometryCollection).

If it is helpful to make this more self-contained, the specifics are:

Suggested change
* To be clear, we follow the same rule of WKB and coordinate axis order from
* GeoParquet [1][2]. It is the ISO WKB supporting XY, XYZ, XYM, XYZM and the
* standard geometry types (Point, LineString, Polygon, MultiPoint,
* MultiLineString, MultiPolygon, and GeometryCollection).
* To be clear, we follow the same rule of WKB and coordinate axis order from
* GeoParquet [1][2]. Geometries SHOULD be encoded as ISO WKB [3][4]
* supporting XY, XYZ, XYM, XYZM and the standard geometry types
* Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon,
* and GeometryCollection). Coordinate order is always (x, y) where x is
* easting or longitude and y is northing or latitude. This ordering explicitly
* overrides the axis order as specified in the CRS following the GeoPackage
* specification [5].

Comment on lines +487 to +488
* [1] https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L92
* [2] https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L155

Suggested change
* [1] https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L92
* [2] https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L155
* [1] https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L92
* [2] https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L155
* [3] https://portal.ogc.org/files/?artifact_id=18241
* [4] https://www.iso.org/standard/60343.html
* [5] https://www.geopackage.org/spec130/#gpb_spec

Comment on lines +241 to +243
* Interpretation for edges of GEOMETRY logical type, i.e. whether the edge
* between points represent a straight cartesian line or the shortest line on
* the sphere. It applies to all non-point geometry objects.

If a more precise wording is useful here, it might be:

Suggested change
* Interpretation for edges of GEOMETRY logical type, i.e. whether the edge
* between points represent a straight cartesian line or the shortest line on
* the sphere. It applies to all non-point geometry objects.
* Interpretation for edges of elements of a GEOMETRY logical type. In other
* words, whether a point between two vertices should be interpolated in
* its XY dimensions as if it were a Cartesian line connecting the two
* vertices (planar) or the shortest spherical arc between the longitude
* and latitude represented by the two vertices (spherical). This value
* applies to all non-point geometry objects and is independent of the
* coordinate reference system.
*
* Because most systems currently assume planar edges and do not support
* spherical edges, planar should be used as the default value.

Comment on lines +254 to +255
* an edge of geographic coordinates crosses the antimeridian). In addition, it can
* also be used to provide vendor-agnostic coverings like S2 or H3 grids.

Suggested change
* an edge of geographic coordinates crosses the antimeridian). In addition, it can
* also be used to provide vendor-agnostic coverings like S2 or H3 grids.
* an edge of geographic coordinates crosses the antimeridian). It may be
* extended in future versions to provide vendor-agnostic coverings like
* vectors of cells on a discrete global grid (e.g., S2 or H3 cells).

* is not valid).
*
* Please refer to links below for more detail:
* [1] https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary
@szehon-ho commented Sep 19, 2024

Just a minor point: the origin of WKB and the geo class hierarchy seems to be https://www.ogc.org/standard/sfa/ and these other ones are derivative.


Yeah, I really wish OGC published these as html documents so we could link directly into the relevant section, instead of just saying go to https://portal.ogc.org/files/?artifact_id=25355 and look at chapter 8.

* This is the preferred encoding for maximum portability. It also supports
* GeometryStatistics to be set in the column chunk and page index.
*
* [1] https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L92
@szehon-ho commented Sep 19, 2024

Same comment as above; also, not sure if we can reference the other one from this one to eliminate the repetition.

@szehon-ho left a comment

Thanks for driving this and the lengthy process of building consensus; I also left a few minor points (on the comments) for consideration. The Iceberg community is also looking forward to this to enhance its geo spec development: apache/iceberg#10981

* between points represent a straight cartesian line or the shortest line on
* the sphere. It applies to all non-point geometry objects.
*/
enum Edges {

Nit: can we avoid the plural? Perhaps make it EdgeShape or EdgeKind.


We use the same definition from GeoParquet (https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md) and we try to be consistent with GeoParquet.


Pity. If you want to store a collection of Edges, how do you name it?


That said, I understand it's probably not a practical concern :-)


Hmm... why would someone want to store a collection of edges? For each geometry column, the Edges field has only one possible value.


I would be in favour of a more descriptive enum name (EdgeKind or EdgeInterpolation?), but keeping the attribute of the type consistent with GeoParquet (e.g., it's still geom_type.edges, the enum class just happens to be EdgeInterpolation)


I'd say EdgeInterpolation is the better name than EdgeKind. +1 for keeping the attribute of the type the same as GeoParquet for consistency and simplicity, but makes sense to me that the enum name could be more descriptive.

* Edges of geometry type.
* Please refer to the definition of Edges for more detail.
*/
2: required Edges edges;

The plural here is a bit confusing. My first hunch is that it's a list of elements... but, no, it's a single enum named Edges.

@cholmes left a comment

Looks great! Added a few minor comments and +1'ed most everything @paleolimbot added. But great work on this all around.


/**
* A type of covering. Currently accepted values: "WKB".
*/
1: required string kind;

Is 'kind' a typical thing in Parquet? If not, I'd call this 'type' - 'kind' sounds weird to me, but I'm not sure if 'type' has a very specific meaning for Parquet. This is totally just a stylistic thing, so feel free to ignore. But note that in the description we say 'a type of covering', not 'the kind of covering'. 'Kind' seems 'fuzzier' to me, suggesting that it's WKB or something close to it, while 'type' seems to indicate just one.

1: required string kind;
/**
* A payload specific to kind. Below are the supported values:
* - WKB: well-known binary of a POLYGON or MULTI-POLYGON that completely

Suggested change
* - WKB: well-known binary of a POLYGON or MULTI-POLYGON that completely
* - WKB: well-known binary [1] of a POLYGON or MULTI-POLYGON that completely

Not sure this formatting works, but this is the first mention of WKB. I originally just added the link here because I'm not sure that the name 'well-known binary' is actually that well known. Then I realized that the same link appears below (currently line 325), but I'm not sure if the [1] does auto-linking, or if this comment in the code should explain its own [1] link.


*/
1: required GeometryEncoding encoding;
/**
* Edges of geometry type.

Suggested change
* Edges of geometry type.
* Interpretation for edges of elements of a GEOMETRY logical type, i.e. whether the interpolation
* between points along an edge represents a straight cartesian line or the shortest line on the sphere.

In the other comments there seemed to be a desire to keep the 'edges' name here, but this documentation is so short that it seemed useful to at least give a short overview before referring people to a more in-depth definition.

