Databricks integrations #2823
Conversation
…o feature/databricks-orchestrator # Conflicts: # src/zenml/integrations/__init__.py
…/zenml into feature/databricks-orchestrator
The Databricks integration now requires the "databricks-sdk" package instead of "workflows_authoring_toolkit".
…zenml-io/zenml into feature/databricks-orchestrator
…d, and client_secret
✅ There are no secrets present in this pull request anymore. If these secrets were true positives and are still valid, we highly recommend you revoke them.
Walkthrough: The latest updates introduce Databricks Orchestrator integration into the ZenML framework, enabling distributed computing capabilities. This includes comprehensive configurations, model deployment, data preprocessing, and inference features. New files and modifications cover orchestrator settings, deployment processes, data pipelines, and utility functions to support scalable and efficient ML workflows on Databricks.
LLM Finetuning template updates
src/zenml/steps/base_step.py
Outdated
obj = source_utils.load(source)
logger.info("Loading step from source: %s", source)

if prefix := os.environ.get("ZENML_DATABRICKS_SOURCE_PREFIX"):
This is definitely not code that should be in our base step implementation.
Yeah, I guess this is fixed now by adding the installed wheel path to `sys.path`.
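The fix described in the comment above can be sketched as follows. This is an illustrative stand-in, not ZenML's actual code, and the wheel path shown is a hypothetical example:

```python
import sys

# Hypothetical location where the uploaded wheel's contents end up on the
# Databricks cluster; the real path depends on how the library is installed.
wheel_path = "/databricks/python/lib/site-packages"

# Prepending the wheel's install location to sys.path lets plain imports
# resolve the pipeline code, so no source-prefix environment variable is
# needed in the base step implementation.
if wheel_path not in sys.path:
    sys.path.insert(0, wheel_path)
```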
)
env_arg = ",".join(env_vars)

arguments.extend(["--env", env_arg])
Is there no way to pass environment variables to a databricks job? This seems very ugly, and potentially runs into issues with a too long argument string.
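The pattern under discussion can be sketched end to end. The variable names and values here are hypothetical; the sketch also illustrates the reviewer's concern in comments:

```python
# Hypothetical environment variables to forward into the Databricks job.
env_vars = ["ZENML_SERVER_URL=https://zenml.example.com", "ZENML_LOG_LEVEL=DEBUG"]

# Collapse them into one comma-separated value passed as a single `--env`
# CLI argument, as in the diff above. Caveats raised by the reviewer: values
# containing commas would break this encoding, and many large variables can
# exceed OS argument-length limits.
env_arg = ",".join(env_vars)
arguments = ["--entrypoint", "run_step"]  # hypothetical preceding arguments
arguments.extend(["--env", env_arg])

# Receiving side: split the single argument back into a mapping.
decoded = dict(pair.split("=", 1) for pair in env_arg.split(","))
```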
src/zenml/entrypoints/entrypoint.py
Outdated
@@ -39,12 +40,23 @@ def main() -> None:
# is not wrapped in a function or an `if __name__== "__main__":` check)
constants.SHOULD_PREVENT_PIPELINE_EXECUTION = True

source_utils.set_custom_source_root(source_root="custom_source_root")
Why is this needed?
Because we are running from a notebook (Databricks apparently runs the code from the notebook environment), ZenML detects that there is no `zenml init`.
So setting a custom source root is a solution that worked.
…client_id, and client_secret
Images automagically compressed by Calibre's image-actions ✨ Compression reduced images by 49.2%, saving 1,009.30 KB.
246 images did not require optimisation. Update required: Update image-actions configuration to the latest version before 1/1/21. See README for instructions.
So im just going to review the docs, and not the code.
I synced this to gitbook here. There's a few obvious mistakes:
- there is no mention in the toc so the pages don't appear
- there is an example committed for some reason that shouldn't be there?
Apart from that, see comments. Great job overall, this is very exciting!
# Databricks Orchestrator

[Databricks](https://www.databricks.com/) is a unified data analytics platform that combines the best of data warehouses and data lakes to offer an integrated solution for big data processing and machine learning. It provides a collaborative environment for data scientists, data engineers, and business analysts to work together on data projects. Databricks is built on top of Apache Spark, offering optimized performance and scalability for big data workloads.
I think Spark is nowadays just one component of databricks and I think as the orchestrator has nothing to do with spark, there's no need to mention it
The Databricks orchestrator in ZenML leverages the concept of wheel packages. When you run a pipeline with the Databricks orchestrator, ZenML creates a Python wheel package from your project. This wheel package contains all the necessary code and dependencies for your pipeline.

Once the wheel package is created, ZenML uploads it to Databricks and uses the Databricks SDK to create a job definition. This job definition includes information about the pipeline steps and ensures that each step is executed only after its upstream steps have successfully completed.
A diagram here would be nice, also for marketing. Maybe the flow of what happens.
Good point, will create one.
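The docs passage above describes encoding step ordering in a Databricks job definition. A hedged sketch of what such a definition could look like, written as a payload in the shape of the public Databricks Jobs API (`tasks`, `depends_on`, `python_wheel_task`); the job name, package name, entry point, and step names are all hypothetical, and this is not ZenML's actual output:

```python
# Illustrative Jobs API-style payload: each pipeline step becomes one task,
# and `depends_on` ensures a step runs only after its upstream steps succeed.
job_definition = {
    "name": "zenml-pipeline",  # hypothetical job name
    "tasks": [
        {
            "task_key": "load_data",
            "python_wheel_task": {
                "package_name": "my_zenml_project",  # assumed wheel name
                "entry_point": "entrypoint",
                "parameters": ["--step_name", "load_data"],
            },
        },
        {
            "task_key": "train_model",
            # train_model only starts after load_data completes successfully.
            "depends_on": [{"task_key": "load_data"}],
            "python_wheel_task": {
                "package_name": "my_zenml_project",
                "entry_point": "entrypoint",
                "parameters": ["--step_name", "train_model"],
            },
        },
    ],
}
```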
The Databricks job is also configured with the cluster settings needed to run. This includes specifying the version of Spark to use, the number of workers, the node type, and other configuration options.
why spark?
It's how Databricks forces you to do it: all their compute options are Spark-based, and you can't really select otherwise.
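For concreteness, the cluster settings mentioned in the docs could look like the following, written in the shape of the Jobs API `new_cluster` field. The field names follow the public Databricks REST API; the concrete values are placeholders, not recommendations:

```python
# Illustrative cluster settings: every Databricks runtime is Spark-based,
# so a Spark runtime version must be chosen even for non-Spark workloads.
new_cluster = {
    "spark_version": "15.3.x-scala2.12",  # Databricks runtime (placeholder)
    "node_type_id": "Standard_D4s_v5",    # cloud-specific instance type (placeholder)
    "num_workers": 2,                     # worker count for the job cluster
}
```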
#### Enabling CUDA for GPU-backed hardware

Note that if you wish to use this orchestrator to run steps on a GPU, you will need to follow [the instructions on this page](../../how-to/training-with-gpus/training-with-gpus.md) to ensure that it works. This requires some extra settings customization and is essential for CUDA to deliver the GPU's full acceleration.
I'd like a mention of how to specify a GPU via the settings.
src/zenml/entrypoints/entrypoint.py
Outdated
if isinstance(
    args.entrypoint_config_source, str
) and args.entrypoint_config_source.endswith(
    "DatabricksEntrypointConfiguration"
):
    source_utils.set_custom_source_root(source_root=os.getcwd())
Definitely, this is something that should be in the DatabricksEntrypointConfiguration
Describe changes
I implemented/fixed _ to achieve _.
Pre-requisites
Please ensure you have done the following:
The latest commits are based on `develop` and the open PR is targeting `develop`. If your branch wasn't based on `develop`, read the Contribution guide on rebasing your branch to `develop`.
Types of changes
Summary by CodeRabbit
New Features
Documentation