
[DataCatalog2.0]: KedroDataCatalog #4151

Merged
merged 176 commits into from
Sep 24, 2024
Conversation

@ElenaKhaustova (Contributor) commented Sep 9, 2024

Description

In this PR we add a new catalog, KedroDataCatalog, which uses the DataCatalogConfigResolver and addresses #3995.

Please see the suggested order of work in #3995 (comment) and the comment below: #4151 (comment)

This PR is done on top of #4160 and relies on CatalogProtocol.

For the reviewers: this PR does not include unit tests for KedroDataCatalog; they will be added after the initial feedback.

Development notes

  • We kept some of the old DataCatalog API to avoid multiple if/else branches depending on the catalog type used in context, runner and session
  • Removed _FrozenDatasets; datasets are now accessed as properties
  • Added a "get dataset by name" feature: a dedicated method and access by key
  • Added a feature to iterate over the datasets
  • Simplified add_feed_dict() and renamed it to add_raw_data()
  • Moved dataset initialisation out of the from_config() method
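The access API described in these notes can be sketched as a minimal illustration. This is a hypothetical simplification, not the merged implementation: the class name SketchCatalog and the plain-dict storage are stand-ins for the real KedroDataCatalog in kedro/io/kedro_data_catalog.py.

```python
from __future__ import annotations

from typing import Any, Iterator


class SketchCatalog:
    """Hypothetical sketch of the access API listed in the development notes."""

    def __init__(self, datasets: dict[str, Any] | None = None) -> None:
        # Datasets live in a plain dict rather than a _FrozenDatasets wrapper
        self._datasets: dict[str, Any] = dict(datasets or {})

    def get_dataset(self, ds_name: str) -> Any:
        # Dedicated "get dataset by name" method
        if ds_name not in self._datasets:
            raise KeyError(f"Dataset '{ds_name}' not found in the catalog")
        return self._datasets[ds_name]

    def __getitem__(self, ds_name: str) -> Any:
        # Access by key, mirroring get_dataset()
        return self.get_dataset(ds_name)

    def __iter__(self) -> Iterator[str]:
        # Iterate over dataset names
        yield from self._datasets

    def add_raw_data(self, data: dict[str, Any]) -> None:
        # Simplified replacement for add_feed_dict(): register raw objects directly
        self._datasets.update(data)
```

Note that, as discussed further down in this thread, the dunder methods were later pulled out of the implementation pending the interface proposal in #4175.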

To test KedroDataCatalog, modify your settings.py and run commands as usual:

# settings.py
from kedro.io import KedroDataCatalog

DATA_CATALOG_CLASS = KedroDataCatalog

kedro run
kedro catalog list/rank/resolve/create

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Elena Khaustova <[email protected]>
@ElenaKhaustova (Contributor, Author):

Update for the reviewers

Since there is a proposal for a new interface (#4175), we removed some methods (__getitem__, __setitem__, __iter__) from the current implementation, as they will most probably change during the implementation of the proposal.

If the proposal doesn't go through, they'll be added in a separate PR, so we do not block the current one.

@merelcht (Member) left a comment:

I've done a first review and added (mostly nit) comments. I'll do another review and have a proper look at the tests tomorrow.

from kedro.utils import _format_rich, _has_rich_handler


class KedroDataCatalog:
Member:

I think we should do it here, since it will make sure we adhere to it. While it's optional, it's beneficial at no cost to us. Other implementers do not need to extend it, though.

DatasetNotFoundError: When a dataset with the given name
is not in the collection and does not match any patterns.
"""
ds_config = self._config_resolver.resolve_dataset_pattern(ds_name)
Member:
Shouldn't we resolve only if ds_name not in self._datasets? Or is it just to make the code a bit simpler?

@ElenaKhaustova (Contributor, Author):
Yeah, the point was to prevent you from complaining about nested ifs 😅 I moved it inside the condition now.

@idanov (Member) Sep 24, 2024:

No, it was just a question, either is fine :)


dataset = self._datasets.get(ds_name, None)

if dataset is None:
Member:
Could we rearrange this in such a way that we fail first and then continue with the successful path? Currently the flow is as follows:

  • resolve the dataset pattern
  • if not part of the materialised datasets, add from config
  • get the dataset
  • if the dataset does not exist (basically if it cannot be resolved nor existing), go with error scenario
  • otherwise continue with non-error scenario

I think we can make the flow a bit less zig-zaggy.

@ElenaKhaustova (Contributor, Author):

Well, we can only fail after we try to resolve. Otherwise, you get one more layer of if, as the logic needs to go inside the if fail [] else [] scenario.

Now the logic is like this:

  • if not part of the materialised datasets, resolve the dataset pattern
  • if resolved, add from config
  • get the dataset
  • if the dataset does not exist (basically if it cannot be resolved nor exists), go with the error scenario
  • otherwise, continue with a non-error scenario
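The five steps above can be sketched as a standalone function. This is a hypothetical simplification of the flow being discussed, not the merged code: the dict stands in for the catalog's materialised datasets, resolve_pattern stands in for the resolver call, and storing the config directly stands in for creating a dataset via from_config.

```python
from __future__ import annotations

from typing import Any


class DatasetNotFoundError(KeyError):
    """Raised when a dataset is neither materialised nor resolvable."""


def get_dataset(datasets: dict[str, Any], resolver: Any, ds_name: str) -> Any:
    """Sketch of the lookup order: materialised datasets first, then patterns."""
    # 1. If the dataset is not materialised yet, try to resolve it from a pattern
    if ds_name not in datasets:
        ds_config = resolver.resolve_pattern(ds_name)
        # 2. If a pattern matched, materialise the dataset from its config
        if ds_config is not None:
            datasets[ds_name] = ds_config  # stand-in for AbstractDataset.from_config(...)
    # 3. Fetch the dataset; fail if it could be neither found nor resolved
    dataset = datasets.get(ds_name)
    if dataset is None:
        raise DatasetNotFoundError(
            f"Dataset '{ds_name}' not found in the catalog and no pattern matched"
        )
    # 4. Otherwise continue with the non-error scenario
    return dataset
```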

Member:

What I meant is that we should have as a first line something like:

if ds_name not in self._datasets and self._config_resolver.match_pattern():
    ...

and then continue with the error scenario, and then go on with everything else. It'll be much easier to follow this way. By the way, while checking whether this is possible, I saw a problem in the resolver: it can fail even if it matches, but that should not happen.

elif isinstance(config, str) and "}" in config:
    try:
        config = config.format_map(resolved_vars.named)
    except KeyError as exc:
        raise DatasetError(
            f"Unable to resolve '{config}' from the pattern '{pattern}'. Keys used in the configuration "
            f"should be present in the dataset factory pattern."
        ) from exc

This ☝️ should have been checked at config_resolver init time; basically, we should not allow creating a config_resolver with unresolvable configs, or adding invalid configs that cannot be resolved.

Also there are other changes, e.g. resolve_dataset_pattern should be just resolve_pattern, similar to all the other public methods there, which never include the word dataset (rightfully). Hopefully the resolver API is not released yet, so we can change it now.

@ElenaKhaustova (Contributor, Author) Sep 24, 2024:

  1. Checking for a pattern match is not enough, as the config could already be resolved at init time. The resolve_pattern method encapsulates this so as not to expose the logic outside the config_resolver: we ask the config_resolver to provide a config for a pattern without worrying about how that happens inside. I don't think we need to move any resolution logic (including any specific checks) to the catalog level.
  2. The suggestion about config resolution makes sense to me. We can move this validation to init time and simplify the resolution method, but we would do that in a separate PR, as it doesn't touch the catalog and will be done at the level of the config resolver.
  3. The method was renamed.

@idanov (Member) Sep 24, 2024:

  1. Why wouldn't it be enough? We are checking only for failing scenarios, what other failing scenarios would there be apart from not matching a concrete dataset or a pattern? Could we expect other failures of resolution?

In any case, this is a minor thing, let's merge it in as it is and then we can always come back and simplify it.

@merelcht (Member) left a comment:
Some final small comments, but otherwise I'm very happy with how this looks! Great work @ElenaKhaustova ⭐ 🌟

This reverts commit 5208321.


@ElenaKhaustova ElenaKhaustova merged commit 53280bd into main Sep 24, 2024
41 checks passed
@ElenaKhaustova ElenaKhaustova deleted the 3995-data-catalog-2.0 branch September 24, 2024 14:06
6 participants