Contribute changes to Kedro that are tested on Databricks
Many Kedro users deploy their projects to Databricks, a cloud-based platform for data engineering and data science. We encourage contributions to extend and improve the experience for Kedro users on Databricks; this guide explains how to efficiently test your locally modified version of Kedro on Databricks as part of a build-and-test development cycle.
This page is for contributors who are developing changes to Kedro and need to test them on Databricks. If you are a Kedro user working on an individual or team project and need more information about workflows, consult the documentation pages for developing a Kedro project on Databricks.
You will need the following to follow this guide:
- Python version >=3.8.
- An activated Python virtual environment into which you have installed the Databricks CLI with authentication for your workspace.
- Access to a Databricks workspace with an existing cluster.
- GNU make.
- git.
- A local clone of the Kedro git repository.
The development workflow for Kedro on Databricks is similar to the general Kedro workflow: you develop and test your changes locally. The main difference arises when you manually test your changes on Databricks, because you need to build a wheel file and deploy it to a cluster before you can test it.
To make developing Kedro for Databricks easier, Kedro comes with a Makefile target named databricks-build that automates building a wheel file and installing it on your Databricks cluster, which saves development time.
Before you use make databricks-build, you must set up the Databricks CLI.
Next, create an environment variable with the ID of the cluster you are using to test your Kedro build. You can find the ID by running the Databricks CLI command databricks clusters list and looking for the cluster ID to the left of the name of your chosen cluster, for instance:
$ databricks clusters list
1234-567890-abcd1234 General Cluster TERMINATED
0987-654321-9876xywz Kedro Test Cluster TERMINATED
In this case, the cluster ID of Kedro Test Cluster is 0987-654321-9876xywz.
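If you prefer not to copy the ID by hand, the lookup can be scripted. The following is a minimal sketch, assuming the listing has the whitespace-separated layout shown above with the ID in the first column; other versions of the Databricks CLI may print a different layout. The function name cluster_id_for is hypothetical, and the sample listing stands in for real CLI output so that the sketch runs without a workspace.

```shell
# Hypothetical helper: print the ID (first column) of the first listing line
# that contains the given cluster name. Reads the listing on stdin.
cluster_id_for() {
  awk -v name="$1" 'index($0, name) > 0 { print $1; exit }'
}

# Sample listing copied from the example above (not a live CLI call).
sample_listing='1234-567890-abcd1234 General Cluster TERMINATED
0987-654321-9876xywz Kedro Test Cluster TERMINATED'

# Prints the ID of the matching cluster.
printf '%s\n' "$sample_listing" | cluster_id_for "Kedro Test Cluster"

# Against a real workspace, the helper could be fed from the CLI instead:
# databricks clusters list | cluster_id_for "Kedro Test Cluster"
```

Matching on a substring of the line keeps the sketch short, but it returns only the first match, so use a name that is unique in your workspace.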
Once you have determined the cluster ID, you must export it to an environment variable named DATABRICKS_CLUSTER_ID:
# Linux or macOS
export DATABRICKS_CLUSTER_ID=<your-cluster-id>
# Windows (PowerShell)
$Env:DATABRICKS_CLUSTER_ID = '<your-cluster-id>'
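Because the build depends on this variable, it can help to check it before starting. The following is a minimal sketch for Linux or macOS shells; the function name require_cluster_id is hypothetical and is not part of Kedro or the Databricks CLI.

```shell
# Hypothetical guard: succeed only if DATABRICKS_CLUSTER_ID is set and non-empty.
require_cluster_id() {
  if [ -z "${DATABRICKS_CLUSTER_ID:-}" ]; then
    echo "DATABRICKS_CLUSTER_ID is not set; export it first" >&2
    return 1
  fi
  echo "Using cluster $DATABRICKS_CLUSTER_ID"
}

# Example: run the build only when the variable is present.
# require_cluster_id && make databricks-build
```

Failing early with a clear message is cheaper than discovering a missing variable partway through a build.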
With the setup complete, you can use make databricks-build. In your terminal, navigate to the root directory of your Kedro development repository and run:
make databricks-build
You should see a stream of messages being written to your terminal. Behind the scenes, databricks-build does the following:
- Builds a wheel file of your modified version of Kedro.
- Uninstalls any library on your Databricks cluster with the same wheel file name.
- Uploads your updated wheel file to DBFS (Databricks File System).
- Queues your updated wheel file for installation.
- Restarts your cluster to apply the changes.
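The steps above can be sketched as a dry run that only prints commands and executes nothing. This is not the actual implementation behind databricks-build: the function name plan_databricks_build, the wheel build command, the dist/ output directory, and the exact CLI invocations are assumptions based on the legacy Databricks CLI syntax used elsewhere on this page. The DBFS directory is taken from the example library listing on this page, and the version 0.0.0 is a placeholder.

```shell
# Hypothetical dry run: print one command per step, run nothing.
# $1 = cluster ID, $2 = wheel file name
plan_databricks_build() {
  cluster_id=$1
  wheel=$2
  dbfs_path="dbfs:/tmp/kedro-builds/$wheel"

  # 1. Build the wheel (assumed command; needs the "build" package).
  echo "python -m build --wheel"
  # 2. Uninstall any library with the same wheel file name.
  echo "databricks libraries uninstall --cluster-id $cluster_id --whl $dbfs_path"
  # 3. Upload the new wheel to DBFS (assumes the wheel is written to dist/).
  echo "databricks fs cp --overwrite dist/$wheel $dbfs_path"
  # 4. Queue the wheel for installation.
  echo "databricks libraries install --cluster-id $cluster_id --whl $dbfs_path"
  # 5. Restart the cluster so that the change takes effect.
  echo "databricks clusters restart --cluster-id $cluster_id"
}

plan_databricks_build "0987-654321-9876xywz" "kedro-0.0.0-py3-none-any.whl"
```

Printing the plan rather than executing it makes the order of operations easy to review: the uninstall precedes the upload, and the restart comes last because queued library changes are applied when the cluster restarts.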
Note that your cluster will be unavailable while it restarts. You can poll the status of the cluster using the Databricks CLI:
# Linux or macOS
databricks clusters get --cluster-id $DATABRICKS_CLUSTER_ID | grep state
# Windows (PowerShell)
databricks clusters get --cluster-id $Env:DATABRICKS_CLUSTER_ID | Select-String state
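Rather than rerunning the command by hand, you could wrap it in a polling loop. The following is a minimal sketch for Linux or macOS shells, assuming that databricks clusters get prints JSON with one field per line and a "state" field whose value is RUNNING once the restart has finished. The function names cluster_state, fetch_cluster_json, and wait_until_running are hypothetical.

```shell
# Print the value of the "state" field from cluster JSON read on stdin.
# The quoted key does not match related keys such as "state_message".
cluster_state() {
  sed -n 's/^[[:space:]]*"state":[[:space:]]*"\([A-Z_]*\)".*/\1/p' | head -n 1
}

# Fetch the cluster description. Redefine this function to adapt the
# sketch to another CLI version or to test it without a workspace.
fetch_cluster_json() {
  databricks clusters get --cluster-id "$DATABRICKS_CLUSTER_ID"
}

# Poll until the cluster reports RUNNING.
# $1 = maximum number of attempts (default 60), $2 = seconds between attempts (default 10)
wait_until_running() {
  max_tries=${1:-60}
  delay=${2:-10}
  i=0
  while [ "$i" -lt "$max_tries" ]; do
    state=$(fetch_cluster_json | cluster_state)
    echo "cluster state: ${state:-unknown}"
    if [ "$state" = "RUNNING" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example: wait up to ten minutes, then continue with verification.
# wait_until_running 60 10 && echo "cluster is ready"
```

The attempt limit keeps the loop from running forever if the cluster fails to start.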
Once the cluster has restarted, you should verify that your modified version of Kedro has been correctly installed. Run databricks libraries list --cluster-id <your-cluster-id>. If installation was successful, you should see output similar to the following:
{
"cluster_id": "<your-cluster-id>",
"library_statuses": [
{
"library": {
"whl": "dbfs:/tmp/kedro-builds/kedro-<version>-py3-none-any.whl"
},
"status": "INSTALLED",
"is_library_for_all_clusters": false
}
]
}
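This check can also be scripted. The following is a minimal sketch, assuming the output has the one-field-per-line JSON layout shown above, with the "status" line following the "whl" line of the same library. The function name kedro_wheel_status is hypothetical, and the line-based parsing is a convenience for this layout only; a JSON-aware tool would be more robust.

```shell
# Hypothetical helper: print the status of the first library whose wheel
# path contains "kedro-". Reads the libraries listing JSON on stdin.
kedro_wheel_status() {
  awk '
    /"whl":/ && /kedro-/ { in_kedro = 1 }
    in_kedro && /"status":/ {
      gsub(/[",]/, "", $2)   # strip quotes and the trailing comma
      print $2
      exit
    }
  '
}

# Example against a real workspace:
# databricks libraries list --cluster-id "$DATABRICKS_CLUSTER_ID" | kedro_wheel_status
```

A result other than INSTALLED, or no output at all, suggests that the cluster is still installing the wheel or that the installation did not succeed.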
Any runs of a Kedro project on this cluster will now reflect your latest local changes to Kedro. You can now test your changes to Kedro by using your cluster to run a Kedro project.