-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
remove geopandas and fiona + bugfixes #833
Conversation
The geometries are actually used in a couple places (that's why the GitHub Actions tests are failing). t-route uses the geometries to retrieve lat/lon information for nexus points (which are intended to be passed to the model engine via BMI functions at some point to then pass along to coastal models) and waterbodies (to mimic the LAKEOUT files that WRF-Hydro creates, which include waterbody lat/lon). There might be additional places where these packages are used, but I'd need to take a closer look. So, if we want to remove geopandas and fiona, two possible options are:
|
@@ -29,14 +28,19 @@ def find_layer_name(layers, pattern): | |||
if re.search(pattern, layer, re.IGNORECASE): | |||
return layer | |||
return None | |||
|
|||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor whitespace diff here...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, I must not have had black installed
crs = geometry_columns[0][1] | ||
|
||
if has_spatial_metadata: | ||
# select everything from the layer, + the midpoint of it's bounding box |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the midpoint of its bounding box.
I test the performance, there is a good improvement using sql |
I didn't want to litter the code with too many comments, so I'll leave some notes here:
Geopackages property fields in the geopackages aren't always indexed: This doesn't seem to be relevant here, but if you wanted to select a subset of the entries by id: SELECT * FROM divides WHERE id=wb-9999 Takes ~1s on the conus gpkg. If there's an index on id, it takes <30ms CREATE INDEX "diid" ON "divides" ( "id" ASC ) I overlooked this on my gpkg and lost an afternoon to performance testing different access types |
Geopandas
geopandas (more specifically it's fiona dependency) causes issues when wheels aren't available and it attempts to build from source on arm64/aarch64. building fiona is currently preventing NGIAB from building and has caused issues with arm NGIAB in the past.
As far as I can tell it's just used to read non-geometric data from geopackages and non-geometric data from geojson files.
The geometry is an unpleasant blob that would need to be decoded something like this, but it doesn't appear to be used anywhere so we could probably not bother select it from the geopackage in the first place to save some time and memory
Similarly, geojson is read into a normal pandas df, the geometry is less mangled than the geopackage but still not a proper geometry object. The geometry doesn't seem to be used anywhere, so I just left some comments explaining the format if anyone in the future does need to use it.
I haven't done performance tests on this, but it should also be faster than geopandas.
Some similar work
#604
#568
Removals
Changes
Testing & Screenshots
It worked on a small simulation generated with ngiab_data_preprocess, running inside NGIAB
Additional Bugfixes
In order to test this using geopackages generated by ngiab_data_preprocess (v20.1 of the hydrofabric) I had to modify the regex and pandas table merging.
Regex
The original regex would match both flowpaths and flowpath_attributes as flowpaths.
matched_layers: {'flowpaths': 'flowpath_attributes', 'flowpath_attributes': 'flowpath_attributes', 'lakes': None, 'nexus': 'nexus', 'network': None}
adding $ to check for the end of line fixed this.
table merging
I noticed this while debugging the regex issue; trying to merge two dataframes when either one of them didn't have the "id" column was causing errors. Now it checks to see what combination of the flowpaths and flowpath_attributes exist and merge accordingly.