Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add from_wkt/to_wkt functions #50

Merged
merged 14 commits into from
Oct 25, 2024

Conversation

jorisvandenbossche
Copy link
Collaborator

This adds vectorized from_wkt/to_wkt functions.

In addition to the geoarrow IO in #49, this is still useful to have so users can convert to/from WKT with plain numpy arrays as well instead of having to go through Arrow (and use some arrow library).

src/io.cpp Outdated
#else
if (planar || oriented) {
throw std::invalid_argument(
"planar and oriented options are only available with s2geography >= 0.2");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"planar and oriented options are only available with s2geography >= 0.2");
"planar and oriented options are only available with s2geography >= 1.2");

(per above if?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I have it correct now.. In any case, the s2geography version is certainly 0.2 (https://github.com/paleolimbot/s2geography/blob/9ebd7b106ba2ae30046cb4f3e9561a8bbdb18b9b/CMakeLists.txt#L6).

I initially did the #if version check wrong, see de19f0e for correcting it.

We want that both version of 0.2 or 1.1 result in True, with the current version being 0.1 returning False, so that means major >= 1 OR minor >= 2, I think?

Copy link
Owner

@benbovy benbovy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a few comments, although overall it looks already good!

I'm not sure about the cost in performance of constructing (and destroying) new s2geography::WKTReader (or s2geography::WKTWriter) objects within the vectorized inner loops, but perhaps it would be preferable to handle that out of the loop (e.g., using a Functor)? I don't think it would be very useful to allow array values for planar and/or oriented.

src/spherely.pyi Show resolved Hide resolved

to_wkt: _VFunc_Nin1_Nout1[Literal["to_wkt"], str, object]

def from_wkt(
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically this is also a fully vectorized function, i.e., oriented and planar also accept array-like objects, right?

Or do we want to restrict oriented and planar to accepting each a single value? (and use a Functor in order to handle those arguments out of the vectorized call loop like for the predicate functions)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used a lambda in the m.def so to only vectorize the main input data argument, and not oriented/planar.

src/io.cpp Outdated Show resolved Hide resolved
src/io.cpp Outdated
If set to True, the edges linestrings and polygons are assumed to
be planar. In that case, additional points will be added to the line
while creating the geography objects, to ensure every point is
within 100m of the original line.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to make this distance threshold configurable? (Note: this can be addressed later).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we should (eg bigquery uses a default of 10m, I think), but was also thinking that this could be added later (although it is not actually difficult to add, because I already hardcode the 100m. I can maybe add this while refactoring to reuse the reader/writer object and to not vectorize those additional arguments)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a tessellate_tol_m keyword, mimicking R's s2 (https://r-spatial.github.io/s2/reference/s2_geog_point.html)

tests/test_io.py Outdated Show resolved Hide resolved
Comment on lines +62 to +64
# if we force to not orient, we get an error
with pytest.raises(RuntimeError, match="Inconsistent loop orientations detected"):
spherely.from_wkt(polygon_with_bad_hole_wkt, oriented=True)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, is there a great performance gain when setting oriented=True and when we know that all ring are consistently oriented? I haven't checked the implementation in s2geography (and not much in s2geometry), but I assume that the orientations are checked anyway if such case may trigger raising an error.

Maybe are there other reasons of setting oriented=True? I.e., when do we want to avoid implicit re-ordering of the rings?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might also later uniformize across the library the type of error raised (in #51 a ValueError is raised when a polygon has invalid edges).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know (without trying out) if there is a performance difference, but I think that the oriented=True does allow you to create different geographies? (if you have one that is larger than half of the globe, the default constructor would always normalize it to represent the smaller polygon)

And I can also imagine that just checking that the orientation of all loops are the same is still cheaper than a full normalization (but that is indeed something to check)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might also later uniformize across the library the type of error raised (in #51 a ValueError is raised when a polygon has invalid edges).

Yeah, in this case the error comes from s2geography (in theory I could catch the error and re-throw it with a different class, because those RuntimeErrors are a bit strange)

tests/test_io.py Outdated Show resolved Hide resolved
@jorisvandenbossche
Copy link
Collaborator Author

Updated this now #51 is merged, I think this should be ready for a final review / merge.

Copy link
Owner

@benbovy benbovy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one minor suggestion but otherwise it is good to go!

Thanks @jorisvandenbossche !

src/io.cpp Outdated
ensure every point is within 100m of the original line.
By default (False), it is assumed that the edges are spherical
(i.e. represent the shortest path on the sphere between two points).
tessellate_tol_m : float, default 100.0
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
tessellate_tol_m : float, default 100.0
tessellate_tol : float, default 100.0

(nit suggestion: _m makes the parameter name slightly less readable IMHO and its meaning may not be obvious at a first glance without checking the docs anyway... _meters would be more meaningful but would also make the parameter name too long).

Copy link
Collaborator Author

@jorisvandenbossche jorisvandenbossche Oct 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That keyword name was taken from R s2 (https://r-spatial.github.io/s2/reference/s2_geog_point.html), but I am certainly fine with dropping the _m as well. I agree that from just seeing the keyword name (without reading the explanation for what it stands), it is not necessarily clear that this stands for "meters", making it an unnecessary suffix.

(to be honest, I am also not convinced that "tessellation" is the best name. BigQuery uses it, but if you google for it, it is typically about dividing a plane into multiple shapes that nicely cover it without overlap, e.g. https://en.wikipedia.org/wiki/Edge_tessellation and https://en.wikipedia.org/wiki/Tessellation)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I'm more familiar with the latter meaning (dividing a surface into multiple shapes), but I'm fine with the name here.

Another name could be densify_pts_tol or densify_tol (pyproj uses densify_pts for similar problems).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that in the pyproj case it is (I assume) a uniform/regular densification, while s2 does a non-uniform (only adding points where needed to preserve the tolerance)

I renamed the keyword to tessellate_tolerance (making it a bit longer, but also more readable I think than the "tol", and then it is also consistent with the C++ argument)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to merge this, so I can fixup conflicts with the geoarrow PR. But happy to change the keyword name again in a quick follow-up.

@jorisvandenbossche jorisvandenbossche merged commit 4e39dee into benbovy:main Oct 25, 2024
15 checks passed
@jorisvandenbossche jorisvandenbossche deleted the from_wkt branch October 25, 2024 15:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants