[DataCatalog2.0]: KedroDataCatalog
#4151
Conversation
Signed-off-by: Elena Khaustova <[email protected]>
Update for the reviewers: since there is a proposal on the new interface (#4175), we removed some methods ( If the proposal doesn't go through, they'll be added in a separate PR, so we do not block the current one.
I've done a first review and added (mostly nit) comments. I'll do another review and have a proper look at the tests tomorrow.
kedro/io/kedro_data_catalog.py (outdated)

    from kedro.utils import _format_rich, _has_rich_handler
    ...
    class KedroDataCatalog:
I think we should do it here, since it will make sure we adhere to it. While it's optional, it's beneficial at no cost to us. Other implementers do not need to extend it, though.
kedro/io/kedro_data_catalog.py (outdated)

        DatasetNotFoundError: When a dataset with the given name
            is not in the collection and does not match patterns.
    """
    ds_config = self._config_resolver.resolve_dataset_pattern(ds_name)
Shouldn't we resolve only if ds_name is not in self._datasets? Or is it just to make the code a bit simpler?
Yeah, the point was to prevent you from complaining about nested ifs 😅 I moved it inside the condition now.
No, it was just a question, either is fine :)
    dataset = self._datasets.get(ds_name, None)

    if dataset is None:
Could we rearrange this in such a way that we fail first and then continue with the successful path? Currently the flow is as follows:
- resolve the dataset pattern
- if not part of the materialised datasets, add from config
- get the dataset
- if the dataset does not exist (basically if it cannot be resolved nor existing), go with error scenario
- otherwise continue with non-error scenario
I think we can make the flow a bit less zig-zaggy.
Well, we can only fail after we try to resolve. Otherwise, you get one more layer of if, as the logic needs to go inside the if fail [] else [] branches.
Now the logic is like this:
- if not part of the materialised datasets, resolve the dataset pattern
- if resolved, add from config
- get the dataset
- if the dataset does not exist (basically if it cannot be resolved nor exists), go with the error scenario
- otherwise, continue with a non-error scenario
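The five steps above can be sketched as a minimal standalone snippet (the class, the `_datasets` mapping, and the resolver callable are illustrative stand-ins, not the actual KedroDataCatalog implementation):

```python
class MiniCatalog:
    """Toy stand-in for the catalog, illustrating the lookup flow."""

    def __init__(self, datasets, resolver):
        self._datasets = datasets  # materialised datasets: name -> object
        self._resolver = resolver  # callable: name -> config dict, or None

    def get_dataset(self, ds_name):
        # 1. Only if the dataset is not materialised, try to resolve a pattern.
        if ds_name not in self._datasets:
            ds_config = self._resolver(ds_name)
            # 2. If a pattern matched, add the resolved entry (the real
            #    catalog would construct a dataset object from the config).
            if ds_config is not None:
                self._datasets[ds_name] = ds_config
        # 3. Get the dataset.
        dataset = self._datasets.get(ds_name, None)
        # 4. Neither materialised nor resolvable: the error scenario.
        if dataset is None:
            raise KeyError(f"Dataset '{ds_name}' not found and no pattern matched")
        # 5. Otherwise, the non-error scenario.
        return dataset
```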
What I meant is that we should have as a first line something like:
if ds_name not in self._datasets and self._config_resolver.match_pattern():
...
and then continue with the error scenario, and then go on with everything else. It'll be much easier to follow this way. By the way, while checking whether this is possible, I saw a problem in the resolver: it can fail even if it matches, but that should not happen.
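Under that suggestion, a fail-first version of the lookup might look roughly like this (the function and parameter names are assumptions for illustration, not the real catalog API):

```python
def get_dataset_fail_first(datasets, match_pattern, resolve_pattern, ds_name):
    """Illustrative 'fail first' lookup: reject unknown names up front,
    then the rest of the function is the single happy path."""
    # Error scenario first: not materialised and no factory pattern matches.
    if ds_name not in datasets and not match_pattern(ds_name):
        raise KeyError(f"Dataset '{ds_name}' not found and no pattern matched")
    # Happy path: materialise from the resolved pattern if needed.
    if ds_name not in datasets:
        datasets[ds_name] = resolve_pattern(ds_name)
    return datasets[ds_name]
```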
kedro/kedro/io/catalog_config_resolver.py, lines 149 to 156 in 5147dfb:

    elif isinstance(config, str) and "}" in config:
        try:
            config = config.format_map(resolved_vars.named)
        except KeyError as exc:
            raise DatasetError(
                f"Unable to resolve '{config}' from the pattern '{pattern}'. Keys used in the configuration "
                f"should be present in the dataset factory pattern."
            ) from exc
This ☝️ should have been checked at config_resolver init time: basically, we should not allow creating a config_resolver with unresolvable configs, or adding invalid configs that cannot be resolved.
There are also other changes, e.g. resolve_dataset_pattern should be just resolve_pattern, similar to all the other public methods there, which never include the word dataset (rightfully). Hopefully the resolver API is not released yet, so we can change it now.
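The init-time validation being proposed could look something like this sketch, which uses the standard-library `string.Formatter` to check that every placeholder used in a config value also appears in the factory pattern; the helper name and signature are hypothetical, not the actual CatalogConfigResolver API:

```python
import string


def validate_pattern_config(pattern: str, config: dict) -> None:
    """Fail at init time if a config value uses a placeholder that the
    dataset factory pattern cannot supply (hypothetical helper)."""
    # Placeholders provided by the pattern, e.g. "{name}_data" -> {"name"}.
    allowed = {f for _, f, _, _ in string.Formatter().parse(pattern) if f}
    for value in config.values():
        if isinstance(value, str):
            used = {f for _, f, _, _ in string.Formatter().parse(value) if f}
            missing = used - allowed
            if missing:
                raise ValueError(
                    f"Unable to resolve '{value}' from the pattern '{pattern}': "
                    f"keys {sorted(missing)} are not present in the pattern."
                )
```

Calling this for every pattern/config pair when the resolver is constructed (and whenever a config is added) would make a later format_map KeyError impossible, so the resolve step itself no longer needs the try/except.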
- Checking for a pattern match is not enough, as the config could already have been resolved at init time. The resolve_pattern method encapsulates this so as not to expose the logic outside the config_resolver: we ask the config_resolver to provide a config for a pattern without worrying about how that happens inside. I don't think we need to move any resolution logic (including any specific checks) to the catalog level.
- The suggestion about config resolution makes sense to me. We can move this validation to init time and simplify the resolution method, but I would do that in a separate PR, as it doesn't touch the catalog and will be done at the level of the config resolver.
- The method was renamed.
- Why wouldn't it be enough? We are only checking for failing scenarios; what other failing scenarios would there be apart from not matching a concrete dataset or a pattern? Could we expect other resolution failures?
- ✅
- ✅
In any case, this is a minor thing; let's merge it as is, and we can always come back and simplify later.
Some final small comments, but otherwise I'm very happy with how this looks! Great work @ElenaKhaustova ⭐ 🌟
This reverts commit 5208321.
Description

In this PR we add a new catalog, KedroDataCatalog, which uses the DataCatalogConfigResolver and addresses:
- _FrozenDatasets public API #3926
- DataCatalog #3931

Please see the suggested order of work in #3995 (comment) and the comment below: #4151 (comment).

This PR is done on top of #4160 and relies on CatalogProtocol.

For the reviewers: this PR does not include unit tests for KedroDataCatalog; they'll be added after the initial feedback.

Development notes
- Kept the DataCatalog API to avoid multiple if/else branches depending on the catalog type used in context, runner and session
- Removed _FrozenDatasets; datasets are accessed as properties
- add_feed_dict() simplified and renamed to add_raw_data()
- from_config() method

To test

To test KedroDataCatalog, modify your settings.py and run commands as usual:
- kedro run
- kedro catalog list/rank/resolve/create
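For example, assuming the catalog implementation is swapped via the DATA_CATALOG_CLASS setting (check the Kedro docs for your version; both the setting name and the import path here are assumptions), settings.py would contain something like:

```python
# settings.py -- illustrative sketch; assumes DATA_CATALOG_CLASS is the hook
# for overriding the catalog implementation in this Kedro version.
from kedro.io import KedroDataCatalog

DATA_CATALOG_CLASS = KedroDataCatalog
```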
Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
- RELEASE.md file