
feat(datasets): Added the Experimental ExternalTableDataset for Databricks #885

Merged

Conversation

@MinuraPunchihewa (Contributor) commented Oct 11, 2024

Description

This PR adds the ExternalTableDataset to support interactions with external tables in Databricks (Unity Catalog).

Fixes #817

Development notes

The ExternalTableDataset has been implemented by extending the BaseTableDataset that was added here.
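
For context, a minimal, self-contained sketch of the subclassing pattern is shown below. This is not the real BaseTableDataset or ExternalTableDataset from kedro-datasets; the base class is a simplified stand-in and method names such as `_validate_location` are illustrative assumptions only.

```python
# Illustrative sketch only: a simplified stand-in for the subclassing pattern,
# not the actual BaseTableDataset/ExternalTableDataset implementation.
from __future__ import annotations

from kedro.io import AbstractDataset


class BaseTableDataset(AbstractDataset):
    """Stand-in for the shared table logic (identifiers, format, location)."""

    def __init__(
        self,
        *,
        table: str,
        database: str = "default",
        catalog: str | None = None,
        format: str = "delta",
        location: str | None = None,
    ) -> None:
        self._table = table
        self._database = database
        self._catalog = catalog
        self._format = format
        self._location = location

    def _describe(self) -> dict:
        return {
            "catalog": self._catalog,
            "database": self._database,
            "table": self._table,
            "format": self._format,
            "location": self._location,
        }

    def _load(self):  # Spark access intentionally omitted in this sketch
        raise NotImplementedError

    def _save(self, data) -> None:  # Spark access intentionally omitted in this sketch
        raise NotImplementedError


class ExternalTableDataset(BaseTableDataset):
    """External (Unity Catalog) table variant with extra validation."""

    def _validate_location(self) -> None:
        # Hypothetical check: overwriting a non-Delta external table
        # requires an explicit storage location (see the test discussion below).
        if self._format != "delta" and self._location is None:
            raise ValueError("A location must be provided for non-Delta external tables.")
```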

These changes have been tested:

  1. Manually, by running the code against a Unity Catalog enabled workspace.
  2. Via the existing and newly added unit tests.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

@MinuraPunchihewa marked this pull request as ready for review October 11, 2024 12:03
@MinuraPunchihewa (Contributor Author)

Hey @noklam, @ankatiyar,
I've added the ExternalTableDataset as an experimental dataset here. Is there anything else I need to do, like adding tests?

@noklam (Contributor) commented Oct 11, 2024

@MinuraPunchihewa Tests are very welcome, although we don't have a Databricks cluster to run them on, unless there is a way to run them locally.

@MinuraPunchihewa (Contributor Author) commented Oct 11, 2024

> @MinuraPunchihewa Tests are very welcome, although we don't have a Databricks cluster to run them on, unless there is a way to run them locally.

@noklam I just added some tests to cover the logic specific to ExternalTableDataset (outside of BaseTableDataset). I also made a few other changes to the existing tests, such as removing some duplicate code and moving all of the fixtures to conftest.py.

@noklam (Contributor) commented Oct 14, 2024

FAILED tests/databricks/test_base_table_dataset.py::TestBaseTableDataset::test_save_overwrite

Is this an expected failure?

@MinuraPunchihewa (Contributor Author)

> FAILED tests/databricks/test_base_table_dataset.py::TestBaseTableDataset::test_save_overwrite

> Is this an expected failure?

@noklam It has been my experience that when overwriting data in an external table of a format other than Delta, the location has to be provided. Since I am using an external location fixture here, which requires an environment variable called DATABRICKS_EXTERNAL_LOCATION that does not exist in the test environment, yes, this is expected to fail.

I should have specified a different format though. I just made a commit with that change.

Can you explain how these tests are run? A Spark environment needs to be available for them to run at all, right?
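
For illustration, a fixture of the kind described could look roughly like the sketch below; the fixture name and the skip fallback are assumptions, not the actual conftest.py contents (the real fixture evidently lets the test fail when the variable is missing).

```python
# Hypothetical sketch of an external-location fixture; the real conftest.py may differ.
import os

import pytest


@pytest.fixture
def external_location() -> str:
    """Return the storage location for non-Delta external tables from the environment."""
    location = os.environ.get("DATABRICKS_EXTERNAL_LOCATION")
    if location is None:
        # Skipping avoids a hard failure when the variable is not set (e.g. in CI),
        # at the cost of silently not exercising the overwrite path.
        pytest.skip("DATABRICKS_EXTERNAL_LOCATION is not set in this environment")
    return location
```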

@noklam (Contributor) commented Oct 14, 2024

Since this is an experimental dataset, the tests will go here: https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets_experimental/tests

This means the tests will not be run on every commit in CI. If you look at the other Spark tests, we run them in local mode. With Databricks it's unclear whether it is possible to run in local mode, especially with the platform's Unity Catalog (which is different from the open source version).
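
For comparison, a local-mode Spark session for tests generally looks like the sketch below. This is a generic pattern, not the exact fixture used in kedro-plugins, and Databricks-specific features such as Unity Catalog are not available in this mode.

```python
# Generic local-mode Spark session fixture, similar in spirit to how the other
# Spark tests run without a cluster; Databricks/Unity Catalog features are not
# available here, which is why these tests cannot be exercised the same way.
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark_session():
    spark = (
        SparkSession.builder.master("local[2]")  # run Spark inside the test process
        .appName("kedro-datasets-experimental-tests")
        .getOrCreate()
    )
    yield spark
    spark.stop()
```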

@MinuraPunchihewa (Contributor Author)

> Since this is an experimental dataset, the tests will go here: https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets_experimental/tests
>
> This means the tests will not be run on every commit in CI. If you look at the other Spark tests, we run them in local mode. With Databricks it's unclear whether it is possible to run in local mode, especially with the platform's Unity Catalog (which is different from the open source version).

Got it. I just moved the tests; I had to copy the conftest.py over as well.

@noklam (Contributor) commented Oct 18, 2024

Tests are now passing and this looks good to me.

@merelcht (Member) left a comment

I haven't tried this out, but the code looks good to me!

Can you please add this addition to the release notes along with your name under contributors? Similar to how it looks here: https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/RELEASE.md#major-features-and-improvements-1

Comment on lines 63 to 73
names_and_ages@spark:
  type: databricks.ExternalTableDataset
  format: parquet
  table: names_and_ages
names_and_ages@pandas:
  type: databricks.ExternalTableDataset
  format: parquet
  table: names_and_ages
  dataframe_type: pandas

Nit: can you add some spaces between the entries? It's a bit hard to read where one catalog entry starts and ends like this.

@MinuraPunchihewa (Contributor Author)

@merelcht I've made the requested changes. Do let me know if I've updated the RELEASE document correctly.

@merelcht (Member)

> @merelcht I've made the requested changes. Do let me know if I've updated the RELEASE document correctly.

I've moved it to the section for the upcoming release, because 5.1.0 is already out 🙂 Thanks again! I'll make sure this gets merged.

@merelcht (Member)

Oh actually @MinuraPunchihewa I just realised this dataset wasn't added to https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/pyproject.toml#L176 or https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/pyproject.toml#L285. Can you update the pyproject.toml with the necessary dependencies if any?

@ankatiyar (Contributor)

> Oh actually @MinuraPunchihewa I just realised this dataset wasn't added to https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/pyproject.toml#L176 or https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/pyproject.toml#L285. Can you update the pyproject.toml with the necessary dependencies if any?

Also here: https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/docs/source/api/kedro_datasets_experimental.rst :)

@MinuraPunchihewa (Contributor Author)

Hey guys,
Apologies, I missed this. I've made the changes now.
@merelcht @ankatiyar

@merelcht enabled auto-merge (squash) October 21, 2024 15:08
@merelcht merged commit 5158b3e into kedro-org:main Oct 21, 2024
12 checks passed
tdhooghe pushed a commit to tdhooghe/kedro-plugins that referenced this pull request Oct 21, 2024
…ricks (kedro-org#885)

* added the experimental ExternalTableDataset
* fixed lint issues
* added the missing location attr to the docstring
* removed unused code from the tests for ManagedTableDataset
* added tests for ExternalTableDataset
* moved all fixtures to conftest.py
* updated the format for the test for save_overwrite to Parquet
* moved tests to the kedro_datasets_experimental pkg
* updated the release doc
* added the dataset to the documentation
* listed the dependencies for the dataset

---------

Signed-off-by: Minura Punchihewa <[email protected]>
Signed-off-by: tdhooghe <[email protected]>