How to dynamically pass save_args to kedro catalog? #910
IamSandeepGunda started this conversation in Idea
-
Hi @IamSandeepGunda - this isn't native functionality; the only way to do this today is to define a custom dataset which follows this pattern. It would be easiest to subclass the …
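A minimal sketch of that idea, assuming Kedro 0.17.x, where SparkDataSet keeps the catalog save_args in self._save_args and performs the write in _save; the class name and the convention of nodes returning a (dataframe, extra_save_args) tuple are illustrative, not part of Kedro itself:

```python
from typing import Any, Dict, Tuple

from kedro.extras.datasets.spark import SparkDataSet
from pyspark.sql import DataFrame


class DynamicSaveArgsSparkDataSet(SparkDataSet):
    """SparkDataSet variant whose nodes return (dataframe, extra_save_args).

    Anything the node supplies in extra_save_args (e.g. a replaceWhere
    predicate computed from the meta_reload dataset) is merged over the
    save_args declared in the catalog just before the write happens.
    """

    def _save(self, data: Tuple[DataFrame, Dict[str, Any]]) -> None:
        df, extra_save_args = data
        # Node-supplied args take precedence over the static catalog ones.
        self._save_args.update(extra_save_args or {})
        super()._save(df)
```

A node feeding such a dataset would then return something like (df, {"replaceWhere": f"DATE_ID > '{start_date}'"}), with start_date taken from the meta_reload input, and the catalog entry would point its type at this class instead of the stock SparkDataSet.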
-
Hi team, I'm on Kedro 0.17.3.
I'm trying to write Delta tables in Kedro. Changing the file format to delta writes the data as Delta tables, with mode set to overwrite.
Previously, a node in the raw layer (meta_reload) created a dataset that determines the start date of the incremental load for each dataset. Each downstream node uses that raw dataset to filter its working dataset, applies the transformation logic, and writes partitioned parquet tables incrementally.
But now, with only the file format changed to delta and mode set to overwrite, the current incremental data overwrites all of the past data instead of just the affected partitions. So I need to use the replaceWhere option in save_args in the catalog. How would I determine the start date for replaceWhere in the catalog, when I need to read the meta_reload raw dataset to get that date? Is there a way to dynamically pass save_args from inside the node?
```yaml
my_dataset:
  type: my_project.io.pyspark.SparkDataSet
  filepath: "s3://${bucket_de_pipeline}/${data_environment_project}/${data_environment_intermediate}/my_dataset/"
  file_format: delta
  layer: intermediate
  save_args:
    mode: "overwrite"
    replaceWhere: "DATE_ID > xyz"  ## what I want to implement dynamically
    partitionBy: [ "DATE_ID" ]
```
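For reference, a rough plain-PySpark equivalent of the write that the replaceWhere line above is meant to express (the function and its arguments are only a sketch; start_date stands in for whatever meta_reload produced):

```python
from pyspark.sql import DataFrame


def save_incremental(df: DataFrame, start_date: str, path: str) -> None:
    # Overwrite only the partitions newer than start_date, not the whole table.
    (
        df.write.format("delta")
        .mode("overwrite")
        .option("replaceWhere", f"DATE_ID > '{start_date}'")
        .partitionBy("DATE_ID")
        .save(path)
    )
```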
I also added a StackOverflow question here.