Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(datasets): Replace geopandas.GeoJSONDataset with geopandas.GenericDataset #812

Open
wants to merge 29 commits into
base: main
Choose a base branch
from

Conversation

harm-matthias-harms
Copy link

@harm-matthias-harms harm-matthias-harms commented Aug 21, 2024

Description

Geopandas supports reading parquet and feather files since version 0.8.0. These offer performance improvements over classical file types, such as .geojson or .shp.zip (supported by geopandas.GeoJSONDataset). We have used a private implementation for some time, but want to contribute them to kedro-datasets. This PR closes #196.

Development notes

Extended the implementation of geopandas.GeoJSONDataset with an optional file_formatparameter which is needed to use custom read_* and to_* methods. Replaced the geopandas.GeoJSONDataset with geopandas.GenericDataset.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

@harm-matthias-harms harm-matthias-harms force-pushed the feature/add_geopandas_parquet_dataset branch from 711093b to af6a1d3 Compare August 21, 2024 13:41
@harm-matthias-harms
Copy link
Author

If the implementation is okay, I can add the feather dataset in the same PR or open a new one.

@harm-matthias-harms harm-matthias-harms marked this pull request as ready for review August 21, 2024 14:34
Copy link
Member

@astrojuanlu astrojuanlu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this contribution @harm-matthias-harms ! One question: do you think we could have some sort of geopandas.GenericDataset, like we do for pandas and Polars? (Just so that we avoid proliferation of datasets)

@harm-matthias-harms
Copy link
Author

Thanks for this contribution @harm-matthias-harms ! One question: do you think we could have some sort of geopandas.GenericDataset, like we do for pandas and Polars? (Just so that we avoid proliferation of datasets)

To be honest, that would be my preferred solution as well. The current approach feels like a lot of duplicated boilerplate. I really like the Polars dataset and will try to modify the GeoJSON dataset tomorrow to make it work more generally.

@harm-matthias-harms harm-matthias-harms marked this pull request as draft August 21, 2024 15:53
@harm-matthias-harms harm-matthias-harms changed the title feat(datasets): Add geopandas ParquetDataset feat(datasets): Add file_format to geopandas.GeoJSONDataset Aug 22, 2024
@harm-matthias-harms harm-matthias-harms force-pushed the feature/add_geopandas_parquet_dataset branch from 0982d10 to 42ae4a2 Compare August 22, 2024 10:07
ankatiyar and others added 11 commits August 22, 2024 12:10
Fix deltatable tests

Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Harm Matthias Harms <[email protected]>
Signed-off-by: Harm Matthias Harms <[email protected]>
Signed-off-by: Harm Matthias Harms <[email protected]>
Signed-off-by: Harm Matthias Harms <[email protected]>
Signed-off-by: Harm Matthias Harms <[email protected]>
Signed-off-by: Harm Matthias Harms <[email protected]>
@harm-matthias-harms harm-matthias-harms force-pushed the feature/add_geopandas_parquet_dataset branch from 8e79972 to 04ff6a2 Compare August 22, 2024 10:10
Signed-off-by: Harm Matthias Harms <[email protected]>
kedro-datasets/kedro_datasets/geopandas/geojson_dataset.py Outdated Show resolved Hide resolved

def __init__( # noqa: PLR0913
self,
*,
filepath: str,
file_format: str = "file",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to set a default value, especially because this ensures backward compatibility

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this anymore because of backward compatibility, but it's more or less the generic method of geopandas, and it saves some boilerplate in most cases.

@harm-matthias-harms harm-matthias-harms marked this pull request as ready for review August 22, 2024 10:45
Copy link
Member

@astrojuanlu astrojuanlu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @harm-matthias-harms! This looks like a step in the right direction.

My worry is that GeoJSONDataset is no longer GeoJSON-exclusive. So I'm wondering if we should more or less take the code you created here, but put it in a geopandas.GenericGeoDataset, and leave geopandas.GeoJSONDataset alone.

Signed-off-by: Harm Matthias Harms <[email protected]>
Signed-off-by: Harm Matthias Harms <[email protected]>
@harm-matthias-harms
Copy link
Author

My worry is that GeoJSONDataset is no longer GeoJSON-exclusive. So I'm wondering if we should more or less take the code you created here, but put it in a geopandas.GenericGeoDataset, and leave geopandas.GeoJSONDataset alone.

@astrojuanlu I see two options:

  1. GeoJSON is also not GeoJSON exclusive today because I can read many more formats than geojson with read_file.
  2. We create a GenericDataset (this fixes the naming and offers the same functionality and more ) and deprecate the GeoJSONDataset to remove it in some future major versions. Then we don't have to maintain the same dataset twice in the future.

Please tell me how you would like to proceed here.

@astrojuanlu
Copy link
Member

GeoJSON is also not GeoJSON exclusive today because I can read many more formats than geojson with read_file.

Hah, true... I see this wasn't discussed in kedro-org/kedro#190 originally.

I'm actually in favour of having a new dataset with a more generic name and deprecate the old one, but wondering if the churn is worth the advantages. Any thoughts @merelcht @noklam ?

@merelcht
Copy link
Member

merelcht commented Sep 3, 2024

I'm actually in favour of having a new dataset with a more generic name and deprecate the old one, but wondering if the churn is worth the advantages. Any thoughts @merelcht @noklam ?

We can just remove the old one and add a new one with a generic name in a breaking release. I don't think it's worth the effort of doing a non-breaking release with a deprecation warning first and then remove the old one in the next breaking release just for this one dataset.

We could do a TSC vote if we feel the nature of the dataset changes too much.

@astrojuanlu
Copy link
Member

Let's proceed that way then 👍🏼 @harm-matthias-harms are you willing to update the PR accordingly? Namely:

just remove the old one and add a new one with a generic name in a breaking release.

@harm-matthias-harms
Copy link
Author

@astrojuanlu I updated the PR. Since I merged main the tests have been failing. It looks like a fsspec problem. I don't know much about ffspec, maybe you have a good idea of how to fix this.

Last successful run: https://github.com/kedro-org/kedro-plugins/actions/runs/10593320929/job/29354563096?pr=812
First run that failed: https://github.com/kedro-org/kedro-plugins/actions/runs/10719091250/job/29722424013?pr=812

@harm-matthias-harms harm-matthias-harms changed the title feat(datasets): Add file_format to geopandas.GeoJSONDataset feat(datasets): Replace geopandas.GeoJSONDataset with geopandas.GenericDataset Sep 9, 2024
@astrojuanlu
Copy link
Member

I do not know.

Does geopandas handle remote filepaths? Like, if we do geopandas.read_{self._file_format}("s3://...") directly without

https://github.com/harm-matthias-harms/kedro-plugins/blob/9e88c5ee0284865b1256a5667bd6f7b691ddb3f7/kedro-datasets/kedro_datasets/geopandas/generic_dataset.py#L155-L156

will it work?

@harm-matthias-harms
Copy link
Author

@astrojuanlu This was also a thought I had, but that currently makes more tests fail, and it may be better to have all datasets similar. I don't know why the CI fails.

pytest kedro_datasets/geopandas/generic_dataset.py --doctest-modules --doctest-continue-on-failure and pytest tests/geopandas/test_generic_dataset.py both pass locally with python 3.12. It looks like I have to investigate the CI a bit more. I hope I find some time at the end of the week for that.

@astrojuanlu
Copy link
Member

Thanks for your patience! Let us know how it goes

@merelcht
Copy link
Member

@astrojuanlu This was also a thought I had, but that currently makes more tests fail, and it may be better to have all datasets similar. I don't know why the CI fails.

pytest kedro_datasets/geopandas/generic_dataset.py --doctest-modules --doctest-continue-on-failure and pytest tests/geopandas/test_generic_dataset.py both pass locally with python 3.12. It looks like I have to investigate the CI a bit more. I hope I find some time at the end of the week for that.

It has to do with the docstring part and seems to affect python 3.10 specifically. I'm able to replicate the error locally, but haven't found what's wrong yet.

Signed-off-by: Harm Matthias Harms <[email protected]>
Signed-off-by: Harm Matthias Harms <[email protected]>
@harm-matthias-harms
Copy link
Author

@astrojuanlu I figured out the problem... I also bumped geopandas to v1. They use pyogrio as the new default over fiona. pyogrio does not yet support an alternate file system like fsspec (geopandas/pyogrio#430). There are workarounds using buffers, but the simplest way would be to change the used engine to fiona until pyogrio supports fsspec.

Signed-off-by: Harm Matthias Harms <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support parquet and feather datasets from GeoPandas
4 participants