feat(datasets): Replace geopandas.GeoJSONDataset with geopandas.GenericDataset #812

Open · wants to merge 29 commits into base: main
Commits (29) · showing changes from all commits
f63d517
ci: Fix `pandas.DeltaTableDataset` tests (#811)
ankatiyar Aug 21, 2024
bb20cf8
feat(datasets): Add geopandas ParquetDataset
harm-matthias-harms Aug 21, 2024
6026742
Add release notes
harm-matthias-harms Aug 21, 2024
d54c181
Add parquet dataset to docs
harm-matthias-harms Aug 21, 2024
3fc3af4
Fix typo in tests
harm-matthias-harms Aug 21, 2024
94d02e7
Fix pylint type
harm-matthias-harms Aug 21, 2024
fe5c31d
Discard changes to kedro-datasets/docs/source/api/kedro_datasets.rst
harm-matthias-harms Aug 22, 2024
032f259
Discard changes to kedro-datasets/kedro_datasets/geopandas/__init__.py
harm-matthias-harms Aug 22, 2024
15fa4ba
Extend geojson dataset to support more file types
harm-matthias-harms Aug 22, 2024
04fd6f8
Discard changes to kedro-datasets/tests/pandas/test_deltatable_datase…
harm-matthias-harms Aug 22, 2024
04ff6a2
Update RELEASE.md
harm-matthias-harms Aug 22, 2024
a8dbb0c
Add test for unsupported file format
harm-matthias-harms Aug 22, 2024
9d3ff7f
Merge branch 'main' into feature/add_geopandas_parquet_dataset
harm-matthias-harms Aug 22, 2024
7106376
Cleanup GeoJSONDataset
harm-matthias-harms Aug 28, 2024
e4d69e9
Fix lint
harm-matthias-harms Aug 28, 2024
72be3c8
Merge branch 'main' into feature/add_geopandas_parquet_dataset
harm-matthias-harms Aug 28, 2024
2b58924
Merge branch 'main' into feature/add_geopandas_parquet_dataset
harm-matthias-harms Sep 5, 2024
1335db1
Replace GeoJSONDataset by GenericDataset
harm-matthias-harms Sep 5, 2024
b743845
Update pyproject.toml
harm-matthias-harms Sep 5, 2024
1e540f4
Update RELEASE.md
harm-matthias-harms Sep 5, 2024
e84ac93
Use new default fs args
harm-matthias-harms Sep 5, 2024
9e88c5e
Merge branch 'main' into feature/add_geopandas_parquet_dataset
harm-matthias-harms Sep 10, 2024
0fd94c8
Fix pattern in test
harm-matthias-harms Sep 11, 2024
da204fa
Merge branch 'main' into feature/add_geopandas_parquet_dataset
harm-matthias-harms Sep 11, 2024
6b2a225
Use fiona for python < 3.11
harm-matthias-harms Sep 13, 2024
db1b446
Install fiona dependency for python < 3.11
harm-matthias-harms Sep 13, 2024
f1bda0e
Revert fiona test
harm-matthias-harms Sep 13, 2024
0dab99f
Use fiona because pyogrio doesnt support fsspec
harm-matthias-harms Sep 13, 2024
978ad6c
Format file
harm-matthias-harms Sep 13, 2024
3 changes: 3 additions & 0 deletions kedro-datasets/RELEASE.md
@@ -16,10 +16,13 @@
* Refactored all datasets to set `fs_args` defaults in the same way as `load_args` and `save_args` and not have hardcoded values in the save methods.

## Breaking Changes
* Replaced the `geopandas.GeoJSONDataset` with `geopandas.GenericDataset` to support parquet and feather file formats.

## Community contributions
Many thanks to the following Kedroids for contributing PRs to this release:
* [Brandon Meek](https://github.com/bpmeek)
* [yury-fedotov](https://github.com/yury-fedotov)
* [harm-matthias-harms](https://github.com/harm-matthias-harms)


# Release 4.1.0
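To illustrate the breaking change noted in the RELEASE.md entry above, here is a minimal sketch of how existing `GeoJSONDataset` usage maps onto the new `GenericDataset`. The file paths are hypothetical, and parquet support assumes `pyarrow` is available:

```python
import geopandas as gpd
from shapely.geometry import Point

from kedro_datasets.geopandas import GenericDataset

data = gpd.GeoDataFrame(
    {"col1": [1, 2], "col2": [4, 5]},
    geometry=[Point(1, 1), Point(2, 4)],
)

# Previously: GeoJSONDataset(filepath="data/test.geojson")
# Now: file_format defaults to "file", which routes through
# geopandas.read_file / GeoDataFrame.to_file, so GeoJSON still works.
geojson_dataset = GenericDataset(filepath="data/test.geojson")
geojson_dataset.save(data)

# New in this PR: parquet (and feather) via the file_format argument.
parquet_dataset = GenericDataset(filepath="data/test.parquet", file_format="parquet")
parquet_dataset.save(data)
reloaded = parquet_dataset.load()
```

In a YAML catalog the migration is correspondingly small: the dataset `type` changes to `geopandas.GenericDataset`, with an optional `file_format` key for non-default formats.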
2 changes: 1 addition & 1 deletion kedro-datasets/docs/source/api/kedro_datasets.rst
@@ -17,7 +17,7 @@ kedro_datasets
dask.ParquetDataset
databricks.ManagedTableDataset
email.EmailMessageDataset
geopandas.GeoJSONDataset
geopandas.GenericDataset
holoviews.HoloviewsWriter
huggingface.HFDataset
huggingface.HFTransformerPipelineDataset
31 changes: 0 additions & 31 deletions kedro-datasets/kedro_datasets/geopandas/README.md

This file was deleted.

6 changes: 3 additions & 3 deletions kedro-datasets/kedro_datasets/geopandas/__init__.py
@@ -1,12 +1,12 @@
"""``GeoJSONDataset`` is an ``AbstractVersionedDataset`` to save and load GeoJSON files."""
"""``AbstractDataset`` implementations that produce geopandas GeoDataFrames."""

from typing import Any

import lazy_loader as lazy

# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
GeoJSONDataset: Any
GenericDataset: Any

__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"geojson_dataset": ["GeoJSONDataset"]}
__name__, submod_attrs={"generic_dataset": ["GenericDataset"]}
)
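For context, a small sketch of what the `lazy_loader.attach` call above provides: importing the subpackage stays cheap, and `generic_dataset` (and therefore geopandas itself) is only imported on first attribute access. This assumes the default lazy behaviour of `lazy_loader` and that geopandas is installed:

```python
import sys

import kedro_datasets.geopandas as geo

# The submodule has not been imported yet.
assert "kedro_datasets.geopandas.generic_dataset" not in sys.modules

# First attribute access triggers lazy.attach's __getattr__, which imports
# generic_dataset and returns the GenericDataset class.
dataset_cls = geo.GenericDataset
assert "kedro_datasets.geopandas.generic_dataset" in sys.modules
```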
kedro-datasets/kedro_datasets/geopandas/geojson_dataset.py → kedro-datasets/kedro_datasets/geopandas/generic_dataset.py
@@ -1,7 +1,8 @@
"""GeoJSONDataset loads and saves data to a local geojson file. The
"""GenericDataset loads and saves data to a local file. The
underlying functionality is supported by geopandas, so it supports all
allowed geopandas (pandas) options for loading and saving geojson files.
"""

from __future__ import annotations

import copy
@@ -18,30 +19,35 @@
get_protocol_and_path,
)

# pyogrio currently supports no alternate file handlers https://github.com/geopandas/pyogrio/issues/430
gpd.options.io_engine = "fiona"

NON_FILE_SYSTEM_TARGETS = ["postgis"]


class GeoJSONDataset(
class GenericDataset(
AbstractVersionedDataset[
gpd.GeoDataFrame, Union[gpd.GeoDataFrame, dict[str, gpd.GeoDataFrame]]
]
):
"""``GeoJSONDataset`` loads/saves data to a GeoJSON file using an underlying filesystem
"""``GenericDataset`` loads/saves data to a file using an underlying filesystem
(eg: local, S3, GCS).
The underlying functionality is supported by geopandas, so it supports all
allowed geopandas (pandas) options for loading and saving GeoJSON files.
allowed geopandas (pandas) options for loading and saving files.

Example:

.. code-block:: pycon

>>> import geopandas as gpd
>>> from kedro_datasets.geopandas import GeoJSONDataset
>>> from kedro_datasets.geopandas import GenericDataset
>>> from shapely.geometry import Point
>>>
>>> data = gpd.GeoDataFrame(
... {"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]},
... geometry=[Point(1, 1), Point(2, 4)],
... )
>>> dataset = GeoJSONDataset(filepath=tmp_path / "test.geojson", save_args=None)
>>> dataset = GenericDataset(filepath=tmp_path / "test.geojson")
>>> dataset.save(data)
>>> reloaded = dataset.load()
>>>
@@ -50,35 +56,41 @@ class GeoJSONDataset(
"""

DEFAULT_LOAD_ARGS: dict[str, Any] = {}
DEFAULT_SAVE_ARGS = {"driver": "GeoJSON"}
DEFAULT_SAVE_ARGS: dict[str, Any] = {}
DEFAULT_FS_ARGS: dict[str, Any] = {"open_args_save": {"mode": "wb"}}

def __init__( # noqa: PLR0913
self,
*,
filepath: str,
file_format: str = "file",
load_args: dict[str, Any] | None = None,
save_args: dict[str, Any] | None = None,
version: Version | None = None,
credentials: dict[str, Any] | None = None,
fs_args: dict[str, Any] | None = None,
metadata: dict[str, Any] | None = None,
) -> None:
"""Creates a new instance of ``GeoJSONDataset`` pointing to a concrete GeoJSON file
"""Creates a new instance of ``GenericDataset`` pointing to a concrete file
on a specific filesystem fsspec.

Args:

filepath: Filepath in POSIX format to a GeoJSON file prefixed with a protocol like
filepath: Filepath in POSIX format to a file prefixed with a protocol like
`s3://`. If prefix is not provided `file` protocol (local filesystem) will be used.
The prefix should be any protocol supported by ``fsspec``.
Note: `http(s)` doesn't support versioning.
load_args: GeoPandas options for loading GeoJSON files.
file_format: String which is used to match the appropriate load/save method on a best
effort basis. For example, if 'parquet' is passed, `geopandas.read_parquet` and
`geopandas.DataFrame.to_parquet` will be identified. An error will be raised unless
at least one matching `read_{file_format}` or `to_{file_format}` method is
identified. Defaults to 'file'.
load_args: GeoPandas options for loading files.
Here you can find all available arguments:
https://geopandas.org/en/stable/docs/reference/api/geopandas.read_file.html
save_args: GeoPandas options for saving geojson files.
save_args: GeoPandas options for saving files.
Here you can find all available arguments:
https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.to_file.html
The default_save_arg driver is 'GeoJSON', all others preserved.
version: If specified, should be an instance of
``kedro.io.core.Version``. If its ``load`` attribute is
None, the latest version will be loaded. If its ``save``
@@ -94,6 +106,9 @@ def __init__(  # noqa: PLR0913
metadata: Any arbitrary metadata.
This is ignored by Kedro, but may be consumed by users or external plugins.
"""

self._file_format = file_format.lower()

_fs_args = copy.deepcopy(fs_args) or {}
_fs_open_args_load = _fs_args.pop("open_args_load", {})
_fs_open_args_save = _fs_args.pop("open_args_save", {})
@@ -114,28 +129,57 @@ def __init__(  # noqa: PLR0913
glob_function=self._fs.glob,
)

self._load_args = copy.deepcopy(self.DEFAULT_LOAD_ARGS)
if load_args is not None:
self._load_args.update(load_args)

self._save_args = copy.deepcopy(self.DEFAULT_SAVE_ARGS)
if save_args is not None:
self._save_args.update(save_args)
# Handle default load and save and fs arguments
self._load_args = {**self.DEFAULT_LOAD_ARGS, **(load_args or {})}
self._save_args = {**self.DEFAULT_SAVE_ARGS, **(save_args or {})}
self._fs_open_args_load = {
**self.DEFAULT_FS_ARGS.get("open_args_load", {}),
**(_fs_open_args_load or {}),
}
self._fs_open_args_save = {
**self.DEFAULT_FS_ARGS.get("open_args_save", {}),
**(_fs_open_args_save or {}),
}

_fs_open_args_save.setdefault("mode", "wb")
self._fs_open_args_load = _fs_open_args_load
self._fs_open_args_save = _fs_open_args_save
def _ensure_file_system_target(self) -> None:
# Fail fast if provided a known non-filesystem target
if self._file_format in NON_FILE_SYSTEM_TARGETS:
raise DatasetError(
f"Cannot create a dataset of file_format '{self._file_format}' as it "
f"does not support a filepath target/source."
)

def _load(self) -> gpd.GeoDataFrame | dict[str, gpd.GeoDataFrame]:
self._ensure_file_system_target()

load_path = get_filepath_str(self._get_load_path(), self._protocol)
with self._fs.open(load_path, **self._fs_open_args_load) as fs_file:
return gpd.read_file(fs_file, **self._load_args)
load_method = getattr(gpd, f"read_{self._file_format}", None)
if load_method:
with self._fs.open(load_path, **self._fs_open_args_load) as fs_file:
return load_method(fs_file, **self._load_args)
raise DatasetError(
f"Unable to retrieve 'geopandas.read_{self._file_format}' method, please ensure that your "
"'file_format' parameter has been defined correctly as per the GeoPandas API "
"https://geopandas.org/en/stable/docs/reference/io.html"
)

def _save(self, data: gpd.GeoDataFrame) -> None:
self._ensure_file_system_target()

save_path = get_filepath_str(self._get_save_path(), self._protocol)
with self._fs.open(save_path, **self._fs_open_args_save) as fs_file:
data.to_file(fs_file, **self._save_args)
self.invalidate_cache()
save_method = getattr(data, f"to_{self._file_format}", None)
if save_method:
with self._fs.open(save_path, **self._fs_open_args_save) as fs_file:
# KEY ASSUMPTION - first argument is path/buffer/io
save_method(fs_file, **self._save_args)
self.invalidate_cache()
else:
raise DatasetError(
f"Unable to retrieve 'geopandas.DataFrame.to_{self._file_format}' method, please "
"ensure that your 'file_format' parameter has been defined correctly as "
"per the GeoPandas API "
"https://geopandas.org/en/stable/docs/reference/io.html"
)

def _exists(self) -> bool:
try:
@@ -147,6 +191,7 @@ def _exists(self) -> bool:
def _describe(self) -> dict[str, Any]:
return {
"filepath": self._filepath,
"file_format": self._file_format,
"protocol": self._protocol,
"load_args": self._load_args,
"save_args": self._save_args,
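The `_load`/`_save` implementations above resolve I/O methods by name on a best-effort basis. A standalone sketch of that `getattr`-based dispatch, using a hypothetical local path and assuming geopandas with parquet support is installed:

```python
import geopandas as gpd
from shapely.geometry import Point

file_format = "parquet"
data = gpd.GeoDataFrame({"col1": [1, 2]}, geometry=[Point(1, 1), Point(2, 4)])

# Save: look up GeoDataFrame.to_<file_format> on the data object.
save_method = getattr(data, f"to_{file_format}", None)
if save_method is None:
    raise ValueError(f"GeoDataFrame has no 'to_{file_format}' method")
save_method("data.parquet")  # first positional argument is the path/buffer

# Load: look up geopandas.read_<file_format> at module level.
load_method = getattr(gpd, f"read_{file_format}", None)
if load_method is None:
    raise ValueError(f"geopandas has no 'read_{file_format}' method")
reloaded = load_method("data.parquet")
print(reloaded.head())
```

The module-level `gpd.options.io_engine = "fiona"` in the diff exists because, per the linked pyogrio issue and the commit message above, pyogrio currently cannot read from the fsspec file objects the dataset opens, so `read_file` needs the fiona engine.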
8 changes: 4 additions & 4 deletions kedro-datasets/pyproject.toml
@@ -40,8 +40,8 @@ dask = ["kedro-datasets[dask-parquetdataset, dask-csvdataset]"]
databricks-managedtabledataset = ["kedro-datasets[spark-base,pandas-base,delta-base,hdfs-base,s3fs-base]"]
databricks = ["kedro-datasets[databricks-managedtabledataset]"]

geopandas-geojsondataset = ["geopandas>=0.6.0, <1.0", "pyproj~=3.0"]
geopandas = ["kedro-datasets[geopandas-geojsondataset]"]
geopandas-genericdataset = ["geopandas>=0.8.0, <2.0", "fiona >=1.8, <2.0"]
geopandas = ["kedro-datasets[geopandas-genericdataset]"]

holoviews-holoviewswriter = ["holoviews~=1.13.0"]
holoviews = ["kedro-datasets[holoviews-holoviewswriter]"]
@@ -213,8 +213,9 @@ test = [
"deltalake>=0.10.0",
"dill~=0.3.1",
"filelock>=3.4.0, <4.0",
"fiona >=1.8, <2.0",
"gcsfs>=2023.1, <2023.3",
"geopandas>=0.6.0, <1.0",
"geopandas>=0.8.0, <2.0",
"hdfs>=2.5.8, <3.0",
"holoviews>=1.13.0",
"ibis-framework[duckdb,examples]",
@@ -242,7 +243,6 @@ test = [
"pyarrow>=1.0; python_version < '3.11'",
"pyarrow>=7.0; python_version >= '3.11'", # Adding to avoid numpy build errors
"pyodbc~=5.0",
"pyproj~=3.0",
"pyspark>=3.0; python_version < '3.11'",
"pyspark>=3.4; python_version >= '3.11'",
"pytest-cov~=3.0",