refactor(datasets): deprecate "DataSet" type names #328

Merged Sep 20, 2023 · 37 commits

Commits:
5e47870 refactor(datasets): deprecate "DataSet" type names (api) [deepyaman, Sep 4, 2023]
581cdd7 refactor(datasets): deprecate "DataSet" type names (biosequence) [deepyaman, Sep 5, 2023]
039150b refactor(datasets): deprecate "DataSet" type names (dask) [deepyaman, Sep 5, 2023]
07cdf9e refactor(datasets): deprecate "DataSet" type names (databricks) [deepyaman, Sep 5, 2023]
3bcdd5e refactor(datasets): deprecate "DataSet" type names (email) [deepyaman, Sep 10, 2023]
0dac811 refactor(datasets): deprecate "DataSet" type names (geopandas) [deepyaman, Sep 10, 2023]
f83b0f2 refactor(datasets): deprecate "DataSet" type names (holoviews) [deepyaman, Sep 10, 2023]
31ac073 refactor(datasets): deprecate "DataSet" type names (json) [deepyaman, Sep 10, 2023]
893ec01 refactor(datasets): deprecate "DataSet" type names (matplotlib) [deepyaman, Sep 10, 2023]
f10ad8c refactor(datasets): deprecate "DataSet" type names (networkx) [deepyaman, Sep 11, 2023]
d1c0aea refactor(datasets): deprecate "DataSet" type names (pandas) [deepyaman, Sep 11, 2023]
88b061d refactor(datasets): deprecate "DataSet" type names (pandas.csv_dataset) [deepyaman, Sep 16, 2023]
8b00739 refactor(datasets): deprecate "DataSet" type names (pandas.deltatable…) [deepyaman, Sep 16, 2023]
6ddd9b4 refactor(datasets): deprecate "DataSet" type names (pandas.excel_data…) [deepyaman, Sep 16, 2023]
fbf79e7 refactor(datasets): deprecate "DataSet" type names (pandas.feather_da…) [deepyaman, Sep 16, 2023]
46d9d13 refactor(datasets): deprecate "DataSet" type names (pandas.gbq_dataset) [deepyaman, Sep 16, 2023]
403e4c0 refactor(datasets): deprecate "DataSet" type names (pandas.generic_da…) [deepyaman, Sep 16, 2023]
dd72155 refactor(datasets): deprecate "DataSet" type names (pandas.hdf_dataset) [deepyaman, Sep 16, 2023]
2c87fd9 refactor(datasets): deprecate "DataSet" type names (pandas.json_dataset) [deepyaman, Sep 16, 2023]
f6bbda8 refactor(datasets): deprecate "DataSet" type names (pandas.parquet_da…) [deepyaman, Sep 16, 2023]
f06a8bf refactor(datasets): deprecate "DataSet" type names (pandas.sql_dataset) [deepyaman, Sep 16, 2023]
7d0c3ef refactor(datasets): deprecate "DataSet" type names (pandas.xml_dataset) [deepyaman, Sep 16, 2023]
dc83db3 refactor(datasets): deprecate "DataSet" type names (pickle) [deepyaman, Sep 17, 2023]
c9ca8e6 refactor(datasets): deprecate "DataSet" type names (pillow) [deepyaman, Sep 17, 2023]
6a24fcf refactor(datasets): deprecate "DataSet" type names (plotly) [deepyaman, Sep 17, 2023]
b04041b refactor(datasets): deprecate "DataSet" type names (polars) [deepyaman, Sep 18, 2023]
4ee74cb refactor(datasets): deprecate "DataSet" type names (redis) [deepyaman, Sep 18, 2023]
ce0f92f refactor(datasets): deprecate "DataSet" type names (snowflake) [deepyaman, Sep 18, 2023]
6eac94c refactor(datasets): deprecate "DataSet" type names (spark) [deepyaman, Sep 18, 2023]
fc2dff7 refactor(datasets): deprecate "DataSet" type names (svmlight) [deepyaman, Sep 18, 2023]
98bd275 refactor(datasets): deprecate "DataSet" type names (tensorflow) [deepyaman, Sep 18, 2023]
188a4b6 refactor(datasets): deprecate "DataSet" type names (text) [deepyaman, Sep 18, 2023]
45be7f0 refactor(datasets): deprecate "DataSet" type names (tracking) [deepyaman, Sep 18, 2023]
8f5f942 refactor(datasets): deprecate "DataSet" type names (video) [deepyaman, Sep 18, 2023]
ae77754 refactor(datasets): deprecate "DataSet" type names (yaml) [deepyaman, Sep 18, 2023]
2f00ba4 chore(datasets): ignore TensorFlow coverage issues [deepyaman, Sep 18, 2023]
8b2076a Merge branch 'main' into refactor/rename-data-set [deepyaman, Sep 19, 2023]
43 changes: 43 additions & 0 deletions kedro-datasets/docs/source/kedro_datasets.rst
@@ -12,47 +12,90 @@ kedro_datasets
:template: autosummary/class.rst

kedro_datasets.api.APIDataSet
kedro_datasets.api.APIDataset
kedro_datasets.biosequence.BioSequenceDataSet
kedro_datasets.biosequence.BioSequenceDataset
kedro_datasets.dask.ParquetDataSet
kedro_datasets.dask.ParquetDataset
kedro_datasets.databricks.ManagedTableDataSet
kedro_datasets.databricks.ManagedTableDataset
kedro_datasets.email.EmailMessageDataSet
kedro_datasets.email.EmailMessageDataset
kedro_datasets.geopandas.GeoJSONDataSet
kedro_datasets.geopandas.GeoJSONDataset
kedro_datasets.holoviews.HoloviewsWriter
kedro_datasets.json.JSONDataSet
kedro_datasets.json.JSONDataset
kedro_datasets.matplotlib.MatplotlibWriter
kedro_datasets.networkx.GMLDataSet
kedro_datasets.networkx.GMLDataset
kedro_datasets.networkx.GraphMLDataSet
kedro_datasets.networkx.GraphMLDataset
kedro_datasets.networkx.JSONDataSet
kedro_datasets.networkx.JSONDataset
kedro_datasets.pandas.CSVDataSet
kedro_datasets.pandas.CSVDataset
kedro_datasets.pandas.DeltaTableDataSet
kedro_datasets.pandas.DeltaTableDataset
kedro_datasets.pandas.ExcelDataSet
kedro_datasets.pandas.ExcelDataset
kedro_datasets.pandas.FeatherDataSet
kedro_datasets.pandas.FeatherDataset
kedro_datasets.pandas.GBQQueryDataSet
kedro_datasets.pandas.GBQQueryDataset
kedro_datasets.pandas.GBQTableDataSet
kedro_datasets.pandas.GBQTableDataset
kedro_datasets.pandas.GenericDataSet
kedro_datasets.pandas.GenericDataset
kedro_datasets.pandas.HDFDataSet
kedro_datasets.pandas.HDFDataset
kedro_datasets.pandas.JSONDataSet
kedro_datasets.pandas.JSONDataset
kedro_datasets.pandas.ParquetDataSet
kedro_datasets.pandas.ParquetDataset
kedro_datasets.pandas.SQLQueryDataSet
kedro_datasets.pandas.SQLQueryDataset
kedro_datasets.pandas.SQLTableDataSet
kedro_datasets.pandas.SQLTableDataset
kedro_datasets.pandas.XMLDataSet
kedro_datasets.pandas.XMLDataset
kedro_datasets.pickle.PickleDataSet
kedro_datasets.pickle.PickleDataset
kedro_datasets.pillow.ImageDataSet
kedro_datasets.pillow.ImageDataset
kedro_datasets.plotly.JSONDataSet
kedro_datasets.plotly.JSONDataset
kedro_datasets.plotly.PlotlyDataSet
kedro_datasets.plotly.PlotlyDataset
kedro_datasets.polars.CSVDataSet
kedro_datasets.polars.CSVDataset
kedro_datasets.polars.GenericDataSet
kedro_datasets.polars.GenericDataset
kedro_datasets.redis.PickleDataSet
kedro_datasets.redis.PickleDataset
kedro_datasets.snowflake.SnowparkTableDataSet
kedro_datasets.snowflake.SnowparkTableDataset
kedro_datasets.spark.DeltaTableDataSet
kedro_datasets.spark.DeltaTableDataset
kedro_datasets.spark.SparkDataSet
kedro_datasets.spark.SparkDataset
kedro_datasets.spark.SparkHiveDataSet
kedro_datasets.spark.SparkHiveDataset
kedro_datasets.spark.SparkJDBCDataSet
kedro_datasets.spark.SparkJDBCDataset
kedro_datasets.spark.SparkStreamingDataSet
kedro_datasets.spark.SparkStreamingDataset
kedro_datasets.svmlight.SVMLightDataSet
kedro_datasets.svmlight.SVMLightDataset
kedro_datasets.tensorflow.TensorFlowModelDataSet
kedro_datasets.tensorflow.TensorFlowModelDataset
kedro_datasets.text.TextDataSet
kedro_datasets.text.TextDataset
kedro_datasets.tracking.JSONDataSet
kedro_datasets.tracking.JSONDataset
kedro_datasets.tracking.MetricsDataSet
kedro_datasets.tracking.MetricsDataset
kedro_datasets.video.VideoDataSet
kedro_datasets.video.VideoDataset
kedro_datasets.yaml.YAMLDataSet
kedro_datasets.yaml.YAMLDataset
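
Both spellings are listed in the API reference for the duration of the deprecation window, and the old name is a true alias rather than a copy. A quick sanity check (a sketch, assuming a kedro-datasets build that contains this change):

```python
import warnings

from kedro_datasets.pandas import CSVDataset  # new spelling; imports silently

with warnings.catch_warnings():
    # The old spelling still imports, but emits a DeprecationWarning.
    warnings.simplefilter("ignore", DeprecationWarning)
    from kedro_datasets.pandas import CSVDataSet

# The deprecated name resolves to the very same class object.
assert CSVDataSet is CSVDataset
```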
9 changes: 6 additions & 3 deletions kedro-datasets/kedro_datasets/api/__init__.py
@@ -1,14 +1,17 @@
"""``APIDataSet`` loads the data from HTTP(S) APIs
"""``APIDataset`` loads the data from HTTP(S) APIs
and returns them into either as string or json Dict.
It uses the python requests library: https://requests.readthedocs.io/en/latest/
"""
from __future__ import annotations

from typing import Any

import lazy_loader as lazy

# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
APIDataSet: Any
APIDataSet: type[APIDataset]
APIDataset: Any

__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"api_dataset": ["APIDataSet"]}
__name__, submod_attrs={"api_dataset": ["APIDataSet", "APIDataset"]}
)
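
For reference, `lazy.attach` installs a module-level `__getattr__`/`__dir__` pair (PEP 562) so the submodule is only imported on first attribute access. A simplified sketch of the mechanism, not the actual lazy_loader implementation (this `attach` is a hypothetical stand-in):

```python
import importlib


def attach(package_name, submod_attrs):
    """Map attribute names to the submodule that defines them; import lazily."""
    attr_to_module = {
        attr: mod for mod, attrs in submod_attrs.items() for attr in attrs
    }

    def __getattr__(name):
        if name in attr_to_module:
            submod = importlib.import_module(f"{package_name}.{attr_to_module[name]}")
            # getattr here may be routed through the submodule's own
            # __getattr__, which is what fires the deprecation warning
            # for the old "DataSet" spellings.
            return getattr(submod, name)
        raise AttributeError(f"module {package_name!r} has no attribute {name!r}")

    def __dir__():
        return sorted(attr_to_module)

    return __getattr__, __dir__, sorted(attr_to_module)
```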
105 changes: 63 additions & 42 deletions kedro-datasets/kedro_datasets/api/api_dataset.py
@@ -1,20 +1,20 @@
"""``APIDataSet`` loads the data from HTTP(S) APIs.
"""``APIDataset`` loads the data from HTTP(S) APIs.
It uses the python requests library: https://requests.readthedocs.io/en/latest/
"""
import json as json_ # make pylint happy
import warnings
from copy import deepcopy
from typing import Any, Dict, List, Tuple, Union

import requests
from requests import Session, sessions
from requests.auth import AuthBase

from .._io import AbstractDataset as AbstractDataSet
from .._io import DatasetError as DataSetError
from kedro_datasets._io import AbstractDataset, DatasetError


class APIDataSet(AbstractDataSet[None, requests.Response]):
"""``APIDataSet`` loads/saves data from/to HTTP(S) APIs.
class APIDataset(AbstractDataset[None, requests.Response]):
"""``APIDataset`` loads/saves data from/to HTTP(S) APIs.
It uses the python requests library: https://requests.readthedocs.io/en/latest/

Example usage for the `YAML API <https://kedro.readthedocs.io/en/stable/data/\
@@ -23,7 +23,7 @@ class APIDataSet(AbstractDataSet[None, requests.Response]):
.. code-block:: yaml

usda:
type: api.APIDataSet
type: api.APIDataset
url: https://quickstats.nass.usda.gov
params:
key: SOME_TOKEN,
@@ -33,39 +33,42 @@ class APIDataSet(AbstractDataSet[None, requests.Response]):
agg_level_desc: STATE,
year: 2000

Example usage for the `Python API <https://kedro.readthedocs.io/en/stable/data/\
advanced_data_catalog_usage.html>`_: ::
Example usage for the
`Python API <https://kedro.readthedocs.io/en/stable/data/\
advanced_data_catalog_usage.html>`_:
::

>>> from kedro_datasets.api import APIDataSet
>>> from kedro_datasets.api import APIDataset
>>>
>>>
>>> data_set = APIDataSet(
>>> url="https://quickstats.nass.usda.gov",
>>> load_args={
>>> "params": {
>>> "key": "SOME_TOKEN",
>>> "format": "JSON",
>>> "commodity_desc": "CORN",
>>> "statisticcat_des": "YIELD",
>>> "agg_level_desc": "STATE",
>>> "year": 2000
>>> }
>>> },
>>> credentials=("username", "password")
>>> )
>>> data = data_set.load()

``APIDataSet`` can also be used to save output on a remote server using HTTP(S)
methods. ::
>>> dataset = APIDataset(
... url="https://quickstats.nass.usda.gov",
... load_args={
... "params": {
... "key": "SOME_TOKEN",
... "format": "JSON",
... "commodity_desc": "CORN",
... "statisticcat_des": "YIELD",
... "agg_level_desc": "STATE",
... "year": 2000
... }
... },
... credentials=("username", "password")
... )
>>> data = dataset.load()

``APIDataset`` can also be used to save output on a remote server using HTTP(S)
methods.
::

>>> example_table = '{"col1":["val1", "val2"], "col2":["val3", "val4"]}'

>>> data_set = APIDataSet(
method = "POST",
url = "url_of_remote_server",
save_args = {"chunk_size":1}
)
>>> data_set.save(example_table)
>>>
>>> dataset = APIDataset(
... method = "POST",
... url = "url_of_remote_server",
... save_args = {"chunk_size":1}
... )
>>> dataset.save(example_table)

On initialisation, we can specify all the necessary parameters in the save args
dictionary. The default HTTP(S) method is POST but PUT is also supported. Two
@@ -74,7 +77,7 @@ class APIDataSet(AbstractDataSet[None, requests.Response]):
used if the input of save method is a list. It will divide the request into chunks
of size `chunk_size`. For example, here we will send two requests each containing
one row of our example DataFrame.
If the data passed to the save method is not a list, ``APIDataSet`` will check if it
If the data passed to the save method is not a list, ``APIDataset`` will check if it
can be loaded as JSON. If true, it will send the data unchanged in a single request.
Otherwise, the ``_save`` method will try to dump the data in JSON format and execute
the request.
@@ -99,7 +102,7 @@ def __init__(
credentials: Union[Tuple[str, str], List[str], AuthBase] = None,
metadata: Dict[str, Any] = None,
) -> None:
"""Creates a new instance of ``APIDataSet`` to fetch data from an API endpoint.
"""Creates a new instance of ``APIDataset`` to fetch data from an API endpoint.

Args:
url: The API URL endpoint.
@@ -179,9 +182,9 @@ def _execute_request(self, session: Session) -> requests.Response:
response = session.request(**self._request_args)
response.raise_for_status()
except requests.exceptions.HTTPError as exc:
raise DataSetError("Failed to fetch data", exc) from exc
raise DatasetError("Failed to fetch data", exc) from exc
except OSError as exc:
raise DataSetError("Failed to connect to the remote server") from exc
raise DatasetError("Failed to connect to the remote server") from exc

return response

@@ -190,7 +193,7 @@ def _load(self) -> requests.Response:
with sessions.Session() as session:
return self._execute_request(session)

raise DataSetError("Only GET method is supported for load")
raise DatasetError("Only GET method is supported for load")

def _execute_save_with_chunks(
self,
@@ -214,10 +217,10 @@ def _execute_save_request(self, json_data: Any) -> requests.Response:
response = requests.request(**self._request_args)
response.raise_for_status()
except requests.exceptions.HTTPError as exc:
raise DataSetError("Failed to send data", exc) from exc
raise DatasetError("Failed to send data", exc) from exc

except OSError as exc:
raise DataSetError("Failed to connect to the remote server") from exc
raise DatasetError("Failed to connect to the remote server") from exc
return response

def _save(self, data: Any) -> requests.Response:
@@ -227,9 +230,9 @@ def _save(self, data: Any) -> requests.Response:

return self._execute_save_request(json_data=data)

raise DataSetError("Use PUT or POST methods for save")
raise DatasetError("Use PUT or POST methods for save")

def _exists(self) -> bool:
with sessions.Session() as session:
response = self._execute_request(session)
return response.ok


_DEPRECATED_CLASSES = {
"APIDataSet": APIDataset,
}


def __getattr__(name):
if name in _DEPRECATED_CLASSES:
alias = _DEPRECATED_CLASSES[name]
warnings.warn(
f"{repr(name)} has been renamed to {repr(alias.__name__)}, "
f"and the alias will be removed in Kedro-Datasets 2.0.0",
DeprecationWarning,
stacklevel=2,
)
return alias
raise AttributeError(f"module {repr(__name__)} has no attribute {repr(name)}")
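
The net effect for users: the old name still resolves, but each access goes through this `__getattr__` and warns, and the object returned is the new class itself. A minimal demonstration (a sketch, assuming a fresh interpreter session):

```python
import warnings

import kedro_datasets.api as api

new_cls = api.APIDataset  # preferred name; resolves silently

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_cls = api.APIDataSet  # deprecated alias; triggers the warning above

assert old_cls is new_cls
assert any(issubclass(w.category, DeprecationWarning) for w in caught)
# caught[0].message reads: "'APIDataSet' has been renamed to 'APIDataset',
# and the alias will be removed in Kedro-Datasets 2.0.0"
```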
10 changes: 7 additions & 3 deletions kedro-datasets/kedro_datasets/biosequence/__init__.py
@@ -1,11 +1,15 @@
"""``AbstractDataSet`` implementation to read/write from/to a sequence file."""
"""``AbstractDataset`` implementation to read/write from/to a sequence file."""
from __future__ import annotations

from typing import Any

import lazy_loader as lazy

# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
BioSequenceDataSet: Any
BioSequenceDataSet: type[BioSequenceDataset]
BioSequenceDataset: Any

__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"biosequence_dataset": ["BioSequenceDataSet"]}
__name__,
submod_attrs={"biosequence_dataset": ["BioSequenceDataSet", "BioSequenceDataset"]},
)
43 changes: 31 additions & 12 deletions kedro-datasets/kedro_datasets/biosequence/biosequence_dataset.py
@@ -1,6 +1,7 @@
"""BioSequenceDataSet loads and saves data to/from bio-sequence objects to
"""BioSequenceDataset loads and saves data to/from bio-sequence objects to
file.
"""
import warnings
from copy import deepcopy
from pathlib import PurePosixPath
from typing import Any, Dict, List
@@ -9,29 +10,29 @@
from Bio import SeqIO
from kedro.io.core import get_filepath_str, get_protocol_and_path

from .._io import AbstractDataset as AbstractDataSet
from kedro_datasets._io import AbstractDataset


class BioSequenceDataSet(AbstractDataSet[List, List]):
r"""``BioSequenceDataSet`` loads and saves data to a sequence file.
class BioSequenceDataset(AbstractDataset[List, List]):
r"""``BioSequenceDataset`` loads and saves data to a sequence file.

Example:
::

>>> from kedro_datasets.biosequence import BioSequenceDataSet
>>> from kedro_datasets.biosequence import BioSequenceDataset
>>> from io import StringIO
>>> from Bio import SeqIO
>>>
>>> data = ">Alpha\nACCGGATGTA\n>Beta\nAGGCTCGGTTA\n"
>>> raw_data = []
>>> for record in SeqIO.parse(StringIO(data), "fasta"):
>>> raw_data.append(record)
... raw_data.append(record)
>>>
>>> data_set = BioSequenceDataSet(filepath="ls_orchid.fasta",
>>> load_args={"format": "fasta"},
>>> save_args={"format": "fasta"})
>>> data_set.save(raw_data)
>>> sequence_list = data_set.load()
>>> dataset = BioSequenceDataset(filepath="ls_orchid.fasta",
... load_args={"format": "fasta"},
... save_args={"format": "fasta"})
>>> dataset.save(raw_data)
>>> sequence_list = dataset.load()
>>>
>>> assert raw_data[0].id == sequence_list[0].id
>>> assert raw_data[0].seq == sequence_list[0].seq
@@ -52,7 +53,7 @@ def __init__(
metadata: Dict[str, Any] = None,
) -> None:
"""
Creates a new instance of ``BioSequenceDataSet`` pointing
Creates a new instance of ``BioSequenceDataset`` pointing
to a concrete filepath.

Args:
@@ -137,3 +138,21 @@ def invalidate_cache(self) -> None:
"""Invalidate underlying filesystem caches."""
filepath = get_filepath_str(self._filepath, self._protocol)
self._fs.invalidate_cache(filepath)


_DEPRECATED_CLASSES = {
"BioSequenceDataSet": BioSequenceDataset,
}


def __getattr__(name):
if name in _DEPRECATED_CLASSES:
alias = _DEPRECATED_CLASSES[name]
warnings.warn(
f"{repr(name)} has been renamed to {repr(alias.__name__)}, "
f"and the alias will be removed in Kedro-Datasets 2.0.0",
DeprecationWarning,
stacklevel=2,
)
return alias
raise AttributeError(f"module {repr(__name__)} has no attribute {repr(name)}")
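
For downstream projects the migration is a pure rename; constructor signatures are unchanged. Before/after, based on the docstring example above:

```python
# Before (still works through the deprecation window, but warns):
#     from kedro_datasets.biosequence import BioSequenceDataSet
# After:
from kedro_datasets.biosequence import BioSequenceDataset

dataset = BioSequenceDataset(
    filepath="ls_orchid.fasta",
    load_args={"format": "fasta"},
    save_args={"format": "fasta"},
)
```

Projects that want CI to flag any remaining old spellings can escalate the warning to an error, e.g. `warnings.filterwarnings("error", message=".* has been renamed .*", category=DeprecationWarning)`.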