-
-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Polygon support #819
Add Polygon support #819
Conversation
- New PolygonsArray/LinesArray/PointsArray extension arrays - New Canvas.polygons() method to rasterize polygons - Updates to Canvas.points and Canvas.lines to accept new geometry arrays.
This looks great! Particularly glad to see the conversion functions from geopandas. My main question is about all the extension array work and how that overlaps with the work that is going on within geopandas itself. Do we want to continue maintaining these extension arrays? Do we push to merge the implementations with geopandas? Do we work with a project like awkward-array to get a shared implementation across all the different projects? |
This work is currently living in Datashader, but one of the motivations was to have (somewhere, not sure yet!) an implementation of Shapely-like processing in Dask and Pandas that is fully independent of the geospatial stack that GeoPandas uses, i.e. Shapely, GEOS, GDAL, etc. Being independent of that stack makes it much more flexible and easier to deploy on a variety of systems, and so I don't think we should be trying to move this code into GeoPandas itself. That said, it would be possible to abstract some of it into something that this code and GeoPandas can both depend on, but that doesn't rely on the legacy stack. I don't know how feasible that is, but it does sound desirable! In the meantime, we'll have the limitation that we don't have a standard way to persist these objects, which means that for I/O we are still tied to the legacy stack. :-( |
As for awkward-array, it definitely looks interesting and having it be externally maintained is very attractive, but I haven't yet seen anything significant it would add for these use cases. E.g. I don't see any persistence format in https://github.com/scikit-hep/awkward-1.0 yet, and the operations to be supported here don't seem to have much overlap with those proposed for awkward... |
Yes, that was my hope, duplicating these things all over the place does not seem ideal (even if it's fine for the time being).
Right there are quite a few issues that are unsolved which is why I would like to push towards some shared implementation that will solve all these issues, and awkward-array seems like the solution most likely to get there particularly if we encourage it in some form. It has a community of hundreds of folks which will be relying on it for their daily work.
Quoting the README: "Interoperability with ROOT, Arrow, Parquet and Pandas, as well as planned interoperability with Numba and Dask.", with parquet and arrow both being persistence formats. In the longer term goals it also states: "Translation to and from Apache Arrow and Parquet in C++."
This is absolutely true but the idea as I understood it is that it would allow implementing numba CPU and GPU kernels on top of the arrays, which could then be exposed as functions or on custom array subclasses. |
At this point those are all just plans, though. E.g. they mention Parquet export, but Parquet doesn't have any support for ragged arrays that I know of, so presumably they only mean export of non-ragged arrays to Parquet. Extending Parquet (via Arrow/Fletcher?) to support ragged data is a separate task and necessary with or without Awkward, as far as I can see.
How would that differ from what is already implemented here? Doesn't this ExtensionArray already support those? |
Yes, I think this would definitely be a longer term plan.
Good question I don't have an answer to. |
This seems very unlikely to me since almost all their data has ragged arrays in some form, so parquet support without that would be entirely useless for them. |
Maybe that's a question you could raise with that community, because Parquet is very explicitly a columnar storage format (it's the first line on their homepage!), and a Google search for parquet+ragged only comes up with awkward, not anything in Parquet. |
Hmm. I can't tell if that would work here or not. What we need is columnar, but with one chunk per row, and each row lining up with other Pandas columns containing e.g. one scalar per row. I can see how this column works, but not how it maps to other columns in the same Parquet file. |
Looks like it does come back organized that way when your file above is read into Pandas, resulting in an object column:
So maybe it will work, just not getting an object column in our case. |
Cross reference #720 (comment) regarding extension array serialization. For the geometry constructs, I think we only really need one level of nesting, which is what the current |
Great to hear! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, at the code level. Let's have an audio chat to go over the approach and open issues, so that I can be sure I understand the detailed implications of the implementation (e.g. what happens to a polygon that gets partially aggregated into multiple pixels), the implications of using paired coordinates, possible storage formats, detecting that this special Inf convention is in use, etc.
@@ -11,7 +11,7 @@ | |||
from .pipeline import Pipeline # noqa (API import) | |||
from . import transfer_functions as tf # noqa (API import) | |||
from . import data_libraries # noqa (API import) | |||
|
|||
from . import geom # noqa (API import) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The module name geom
could be confusing here, as we have both spatial
and geo
already. The contents of geo.py
should probably be moved into the other two, so let's ignore that one. spatial
is full of functions for working with aggregate arrays, whereas geom
is full of functions for doing the aggregations. geom
vs. spatial
doesn't convey that distinction, to me. I can't immediately think of a better name, though.
Btw, I'm totally not objecting to the idea that this implementation is the one we continue to use internally. I'm just saying that it'd be a shame if it was used exclusively by datashader and lots of other people end up duplicating the work. So I'm just suggesting that we should either move it out of datashader eventually OR switch to using other implementation (but only if it makes sense to do so). |
Oh, I definitely don't want the underlying ragged array support to stay in Datashader, so we're on the same page there. It's here until we figure out what to do with it, to avoid committing to a particular implementation until we are really ready for that. The ideal is if it can migrate to a suitable project already maintained by others, and barring that I think it would need to be in its own project that can have a life outside of Datashader. |
That's a really good reason to stick with the +Inf/-Inf convention that lets us use flat arrays rather than lists of arrays. Good thing we went with that; whew! |
Yeah, I agree that it would make sense to eventually move these geometry focused extension arrays into a separate package. But, I'm not familiar enough with the communities around GeoPandas to know whether there are any existing packages that would be a good fit. |
@jonmmease , can you add an example above of plotting the same data as filled polygons, polygon outlines, and isolated points? The first two are illustrated, but not the third. |
I'm pretty sure @jorisvandenbossche will be able to help guide us in that... |
Sure, I updated the notebook at https://anaconda.org/jonmmease/datashader_polygons_pr/notebook with examples of lines and points. |
Perfect, thanks! I figured there might need to be some small fix to support that. :-) |
Co-Authored-By: James A. Bednar <[email protected]>
Thanks for pinging me!
It already has become clear in the above discussion I think, but to be explicit: parquet notably does support ragged arrays (or at least the kind of ragged arrays we are dealing with here with a regular nesting). In "parquet parlance" it is called "nested" data types, and specifically a "list" type. As I also mentioned on one of the other issues, I think this is a good reason to choose a memory format that maps to one of arrow's types (another reason is that this then will directly work on the GPU as well, as RAPIDS is using arrow memory). Note that it would actually be possible to not use the inf/inf approach, but to use a second level of nesting like a "list of list of float array" (the data itself stays a flat array, but you get an extra set of offsets), which still perfectly maps to arrow/parquet types. Now, I don't know the code that uses this format for generating the glyphs, and it might be that the inf approach gives a better performance?
Yes, that is something I am actively working on right now, and normally, with the next arrow version, it should be possible to have a perfect roundtrip to arrow/parquet for a custom ExtensionArray. |
(and then some comments regarding the geospatial aspects)
I am also not aware of any existing packages in this space in Python. But, I think it would certainly be useful to have this functionality somewhere (and not semi-hidden in datashader). Where the best place would be, not directly sure, will think a bit more about this (short term it is of course perfectly fine to have it here). At the moment, I think that GeoPandas would rather keep focus on geometries in sense of what is supported by GEOS, and not necessarily representations based on ragged arrays. Regarding performance:
We are working on better performance in GeoPandas in the PyGEOS project (https://github.com/pygeos/pygeos/), so for comparisons, that is also an interesting target. I took some comparisons from Jon's notebook and redid the comparison of the geopandas method but now with pygeos:
I didn't compare with this PR on my laptop, but compared to geopandas, there is also a considerable speed up with pygeos. The numba-solution based on the ragged array representation in this PR still gives a slightly bigger speed-up, but with pygeos the difference gets much smaller. For those conversions between GEOS-based geopandas and ragged-array based ExtensionArrays here: to have the best possible performance for those conversions, it might be useful to see if certain aspects can be implemented in PyGEOS. |
Thanks for taking the time to weigh in @jorisvandenbossche!
Oh cool, I didn't realize that you could have multiple nested levels like that. The way things work right now is that the +inf/-inf values are used to indicate whether the polygon that follows is in counter clockwise (filled) or clockwise (hole) order. I'll think a bit on whether there are any particular pros/cons to these representations.
Awesome, thanks for taking that on!
Yes, I saw your announcement for PyGEOS and that looks like a great step forward in terms of performance. And based on your benchmarking, it looks like working at the level of PyGEOS for the geopandas <-> ragged conversions could be very beneficial.
By parallelizing the numba implementation of
@jbednar @jorisvandenbossche Here are some notes I started on what a new project might look like that extracts and generalizes the geometric/spatial concepts that Datashader needs. There would certainly be a long tail to implement the various GEOS features on the ragged representation in numba, but I expect it could be useful before everything is fully fleshed out. Let me know if you have any other general thoughts on this approach and how it fits in the ecosystem. Thanks! |
I thought the It could be similar to how eg GeoJSON does it with nested lists: inside the list for a single geometry, the first list is the exterior, and any following lists are holes, if they are present. And for a MultiPolygon, you then get a list of such nested lists. You get a lot of (virtual) nesting then (but the actual coordinates are still just a flat array), where you have offsets to the beginning of each MultiPolgyon, and then offsets to beginning of each polygon inside that, and then offsets to the parts (exterior / holes).
Yes, that is possible. There is still some work to get there (releasing the GIL, I have have POC branch that does this for some operations), but I did a quick experiment on it: https://jorisvandenbossche.github.io/talks/2019_FOSS4GBE_pygeos/#32
Thanks! I will try to comment on that issue shortly. |
Yeah, that's correct. In this PR I'm using the winding number fill algorithm and assuming that the main polygon is CCW and holes are CW. I run the shapely polygons through
For this approach, are you picturing that the extension array itself would store the polygons and holes as nested Arrow ListArrays, rather than numpy arrays? Or just that this format could be used for parquet serialization? In either case, it's not clear to me how to get the offsets back out of a pyarrow import pyarrow as pa
la = pa.ListArray.from_arrays(offsets=[0, 3, 5], values=[1, 2, 3, 4, 5])
la
There's a la.values
But I don't see how to get the Apart from this implementation question, it would be pretty compelling if we could come up with a canonical ragged representation of homogenous geometry arrays and if PyGEOS could efficiently create these arrays directly from geopandas geometry arrays. |
Ahh, never mind. I see that I can get at it through the chunk's buffers. |
Ah, yes, indeed. From a quick profile, it seems that 2/3rd of the time is coming from
Yes, just that it can be used for serialization, if you use numpy arrays with the same layout (so without depending directly on pyarrow). Eg if we store the arrays here in a PolygonArray, we could easily construct the arrow array with
Yes, but that is not very user friendly I think. We should probably improve the interface in pyarrow to have easier access to the offsets. I opened https://issues.apache.org/jira/browse/ARROW-7031 for this. |
Thanks for opening those issues @jorisvandenbossche!
Ok, yeah. That makes sense. I've also been studying @xhochy's fletcher pyarrow extension array, and I'm wondering if it would make sense to ditch our It would make the polygon logic here dependent on fletcher and pyarrow, so that would need to be considered. @jbednar do you have any opinion on the code reuse / dependency tradeoff in this case? |
@jonmmease , I'll leave that to your discretion; I don't have any inside knowledge of it other than that I believe @mrocklin proposed using fletcher originally for this purpose. I don't see any problematic dependencies... |
I don't know to what extent you want In the end, for much of the functionality you implement in datashader, you need access to the actual data (flat array + offsets): eg looping through the inner lists in numba to create glyphs. Of course, directly using pyarrow might make the implementation of the PolygonArray a bit simpler. |
Fletcher does text things. Perhaps I was referring to using Pandas Extension Arrays (of which fletcher is an example) ? |
In addition the text functionality, fletcher also has a base Now, I am not familiar enough with fletcher's development to know it is already stable enough to use as dependency. |
@xhochy, if you have a moment to chime in. At this point in time, do you have an opinion on whether you would generally encourage or discourage a library from subclassing |
I'm very excited about this work. I was playing around with applying this to a project that fell by the wayside a couple of years ago. Here are some initial results: This dataset has just over 800K polygons. As you can see converting the dataset from GeoPandas takes a significant amount of time. @jonmmease you mentioned that this may be because of the running the geometries through shapely to ensure orientation is correct. I wonder if that check could be optional, and turned off for data that has been preprocessed. I'm less familiar with the other formats being discussed, but if a faster IO approach could be supported that would be great! |
Hi @sdc50, thanks for chiming and sharing what you're seeing. Yeah, I think we could have something like an Apart from data import time, what are you're impressions of the rendered result and the performance? Do you have anything to compare it to? Do the speckled locations toward the lower left make sense? Also, you should be able to parallelize the aggregation step using Dask if you build a Dask DataFrame from you're initial pandas Thanks! |
To make that concrete, here are the additional conda packages that get installed when you make an datashader environment with pyvarrow, beyond what you get with datashader itself:
A few of those are no big deal, and already likely to be in a real environment anyway (snappy is used in many Datashader examples, bzip2 and lz4-c are useful for anyone reading input files, and some others look straightforward), but others do look substantial and potentially problematic. That said, only a fraction of our users need the ragged array support, and so we can certainly consider requiring |
The rendered result look great, and is much faster than any other tool I've used (ArcGIS) takes several minutes to visualize this dataset and it has to re-render anytime you do anything to the map. Rending this in a notebook with GeoViews takes hours/days (if it doesn't crash). The speckled appearance is accurate for the data. In fact the whole thing is much more speckled than it appears with the blue shade color map. This render helps to show how speckled the whole area is: It would now be useful to be able to zoom in to explore the data better. Is there a way to plot this with Bokeh so the zoom/pan controls would be available. I'd also like to be able to overlay this on some map tile. I've done that with other datasets using GeoViews/HoloViews, but can't work out how to do that with this data type. |
Once there is agreement that the data format for this is finalized I'll be opening a HoloViews PR to add support for this glyph type, so you'll be able to be able to use the usual approaches. |
@jorisvandenbossche, quick question on |
Thanks for taking a look at
Also I see that @jonmmease has started prototyping https://github.com/jonmmease/spatialpandas, I guess this on top of |
Hi @xhochy, thanks for taking the time. That was very helpful. As I mentioned in #826 and holoviz/spatialpandas#2, after a bunch of prototyping I decided not to subclass |
That's great news! Thanks for taking the time to try this out and report back.
I think the best way forward here is probably to build import export capabilities on top of (or in) the PyGEOS library. For the time being, we can at least add the option to disable the |
This PR is now superseded by #826 and holoviz/spatialpandas#2. With this approach the extension arrays are built on nested pyarrow Efficient serialization to parquet already works, and reading from parquet will be possible when such support is added to pandas. Thanks for the discussion all. I'm going to close this PR and we can continue on #826 regarding polygon rendering and on holoviz/spatialpandas#2 regarding the geometric extension arrays. |
Overview
This PR adds support for rasterizing Polygons
Closes #181.
For example usage, see notebook at https://anaconda.org/jonmmease/datashader_polygons_pr/notebook
GeomArray ExtensionArrays
In order to rasterize polygons efficiently, we need a data structure that can store an array of polygon definitions in a form that is directly accessible to numba.
To accomplish this, I added a new
RaggedArray
(see #687) subclass calledPolygonsArray
. Each element of this array can store one or more polygons with holes. So elements of aPolygonArray
are roughly equivalent to a shapelyPolygon
orMultiPolygon
. The entirePolygonsArray
is roughly equivalent to a geopandasGeometryArray
ofPolygon
/MultiPolygon
elements.The new
PolygonsArray
pandas extension array could eventually grow to support many of the operations supported by the geopandasGeoSeries
. The advantage would be that these operations could be implemented in vectorized form using numba for potentially improved performance, and Dask DataFrame support would largely come for free.To demonstrate the potential, I added
length
andarea
properties to the PolygonArray class. These operations are ~8x faster than the equivalentGeoSeries
operations, and they could also be naturally parallelized using Dask for large datasets.Canvas methods
New
Canvas.polygons()
method has been added to rasterize polygons, and theCanvas.line()
method has been updated to support these new geometry arrays, making it easy to draw polygon outlines.Examples
For code and timing, see https://anaconda.org/jonmmease/datashader_polygons_pr/notebook
cc @jbednar @philippjfr