Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for Upsert functionality in ibis.TableDataset #835

Open
vishu1994 opened this issue Sep 14, 2024 · 1 comment
Open

Support for Upsert functionality in ibis.TableDataset #835

vishu1994 opened this issue Sep 14, 2024 · 1 comment

Comments

@vishu1994
Copy link

Description

In ETL pipelines, updating the existing records in data warehouses is a critical requirement. Currently, the ibis.TableDataset connector in Kedro does not support Upsert() into Ibis backends like Postgres in our case. After discussions in the community , I found that ibis doesn't offers Upsert() natively for any of the backends .

Context

Why is this change important to me?

We are developing ETL pipelines in our organization, and updating existing records in sql backend like Postgres is an essential requirement. At present, without support for upsert(), we must bypass the Kedro DataCatalog and rely on external ORM tools to handle native data storage operations, such as SQLAlchemy , dataset etc .

How would I use it?

Supporting data upsert() in ibis.TableDataset would allow us to maintain a clean and consistent pipeline, avoiding the need for custom load operations within nodes. This would simplify the workflow and allow Kedro to manage the complete I/O process.

How can it benefit other users?

By enabling this feature, users could avoid writing custom loading logic for update operations, thereby keeping their pipelines cleaner and more efficient. This would enhance Kedro's usability in scenarios where heavy I/O operations are involved, particularly for teams working with data warehouses or similar storage backends.

@deepyaman
Copy link
Member

Makes sense! Cleanly supporting this will definitely require ibis-project/ibis#5391 (will see if this can be prioritized on the Ibis side in the coming quarter, as there has definitely been a number of requests for this functionality now); otherwise, will need to do something hacky with raw_sql in the interim.

In the Kedro world (and Spark, pandas, etc.), it seems like a write mode is the most common way to express this behavior (see #542 (comment), https://github.com/kedro-org/kedro-plugins/blob/kedro-datasets-4.1.0/kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py#L34 for some Kedro-Datasets examples). As such, I would propose we wait for #834 to establish this functionality, and then exposing the future upsert implementation from Ibis should be straightforward. 🤞

If somebody needs this functionality before Ibis implements it, we can probably find a way to hack it in with raw_sql, as mentioned above, or at least share instructions for creating a custom version of this dataset to do so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants