
Introduce PySpark Session support (enables adapter usage for job clusters) #862

Closed
dkruh1 wants to merge 12 commits

Conversation

@dkruh1 commented Dec 3, 2024

Resolves: dbt-spark Issue #272, dbt-databricks Issue #575

Description

Summary
This PR introduces support for defining a PySpark-based connection when using the adapter. This enhancement allows dbt to run as part of a running Databricks job cluster, expanding its usage beyond SQL warehouses or all-purpose clusters.

Background
The Spark session functionality referenced here was first discussed in dbt-spark Issue #272. Specifically for Databricks, the issue was raised in dbt-databricks Issue #575.

Key Features

  1. PySpark-Based Connection:
    A new environment variable, DBT_DATABRICKS_SESSION_CONNECTION, has been introduced.

    • When this variable is set to True, a new DatabricksSessionConnectionManager is initialized.
    • This manager assumes that the dbt code is executing in the context of an existing Spark session, making it possible to integrate with running Databricks job clusters (see the first sketch after this list).
  2. Testing:

    • A new pytest matrix option called session_support was introduced for the unit tests. When session support is enabled, the DBT_DATABRICKS_SESSION_CONNECTION env var is set to true and the unit tests run against the new DatabricksSessionConnectionManager (see the second sketch after this list).
  3. Functional Testing:
    Functional tests were conducted using a Databricks notebook.

    • The notebook programmatically triggered dbt while ensuring the DBT_DATABRICKS_SESSION_CONNECTION variable was set to True.

    • These tests confirmed that dbt works seamlessly within a running Spark session.

    • Example notebook code:

      ```python
      import os

      from dbt.cli.main import dbtRunner

      os.environ["DBT_DATABRICKS_SESSION_CONNECTION"] = "True"

      res = dbtRunner().invoke([
          "run",
          "--profiles-dir", "/Workspace/Users/[email protected]/dbt-dbx-session-test/",
          "--project-dir", "/Workspace/Users/[email protected]/dbt-models",
          "--target", "prod",
          "--select", "model_to_execute",
      ])
      ```
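The implementation itself is not reproduced in this description, so here is a minimal sketch of the idea behind item 1, assuming the connection manager simply reuses the Spark session that the job cluster already provides. The helper names below are illustrative, not the actual code added by this PR.

```python
# Illustrative sketch only -- not the implementation in this PR.
# Core idea: reuse the cluster's active Spark session instead of
# opening a new SQL Warehouse / all-purpose cluster connection.
import os

from pyspark.sql import SparkSession


def use_session_connection() -> bool:
    # The PR gates the new behavior behind this environment variable.
    return os.getenv("DBT_DATABRICKS_SESSION_CONNECTION", "").lower() == "true"


def get_active_spark_session() -> SparkSession:
    # Hypothetical helper: assumes dbt is invoked from code that already
    # runs on the cluster (a notebook or job task), so a session exists.
    session = SparkSession.getActiveSession()
    if session is None:
        raise RuntimeError(
            "DBT_DATABRICKS_SESSION_CONNECTION is set, but no active "
            "Spark session was found on this cluster."
        )
    return session
```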

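Likewise, a minimal sketch of how a session_support toggle like the one in item 2 could be wired into pytest; the option name and fixture are assumptions, not the PR's actual test plumbing.

```python
# conftest.py -- illustrative only; the real matrix wiring may differ.
import pytest


def pytest_addoption(parser):
    # Hypothetical flag standing in for the "session_support" matrix entry.
    parser.addoption(
        "--session-support",
        action="store_true",
        default=False,
        help="Run unit tests against DatabricksSessionConnectionManager",
    )


@pytest.fixture(autouse=True)
def session_connection_env(request, monkeypatch):
    # Flip the env var so the adapter picks the session connection manager.
    if request.config.getoption("--session-support"):
        monkeypatch.setenv("DBT_DATABRICKS_SESSION_CONNECTION", "true")
```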
Why This Matters

  • Enables running dbt within existing Spark sessions, providing more flexibility for advanced Databricks workflows.
  • Expands the range of cluster types supported by dbt.
  • Supports integration with Databricks job clusters, ensuring compatibility with real-world use cases.

Next Steps

  • Document this feature for users who may need it.
  • Verify compatibility with additional Databricks environments as needed.

Checklist

  • [x] I have run this code in development and it appears to resolve the stated issue
  • [x] This PR includes tests, or tests are not required/relevant for this PR
  • [x] I have updated the CHANGELOG.md and added information about my change to the "dbt-databricks next" section.

@benc-db (Collaborator) commented Dec 3, 2024

@dkruh1, we need to discuss internally if we want to take this feature. I appreciate the effort, and I understand why this feature would be valuable to users, but we need to decide whether we want to take on the maintenance burden of an additional connection mechanism. Will get back to you shortly.

@benc-db (Collaborator) commented Dec 5, 2024

@dkruh1 after discussion, we will not be taking this feature at this time. We are focused on ensuring that dbt-databricks provides the best experience for interacting with SQL Warehouses and serverless compute. As this is OSS, you are free to fork our repo and use your implementation that way.

@benc-db closed this Dec 5, 2024