
Introduce PySpark Session support (enables adapter usage for job clusters) #862

Closed
dkruh1 wants to merge 12 commits

Conversation

@dkruh1 commented Dec 3, 2024

Resolves: dbt-spark Issue #272, dbt-databricks Issue #575

Description

Summary
This PR introduces support for defining a PySpark-based connection when using the adapter. This enhancement allows dbt to run as part of a running Databricks job cluster, expanding its usage beyond SQL warehouses or all-purpose clusters.

Background
The Spark session functionality referenced here was first discussed in dbt-spark Issue #272. Specifically for Databricks, the issue was raised in dbt-databricks Issue #575.

Key Features

  1. PySpark-Based Connection:
    A new environment variable, DBT_DATABRICKS_SESSION_CONNECTION, has been introduced.

    • When this variable is set to True, a new DatabricksSessionConnectionManager is initialized.
    • This manager assumes that the dbt code is executing in the context of an existing Spark session, making it possible to integrate with running Databricks job clusters (see the first sketch after this list).
  2. Testing:

    • A new pytest matrix option called session_support was introduced for the unit tests. When session support is enabled, the DBT_DATABRICKS_SESSION_CONNECTION env var is set to true and the unit tests run against the new DatabricksSessionConnectionManager (see the second sketch after this list).
  3. Functional Testing:
    Functional tests were conducted using a Databricks notebook.

    • The notebook programmatically triggered dbt while ensuring the DBT_DATABRICKS_SESSION_CONNECTION variable was set to True.

    • These tests confirmed that dbt works seamlessly within a running Spark session.

    • Example notebook code:

      ```python
      import os

      from dbt.cli.main import dbtRunner

      os.environ["DBT_DATABRICKS_SESSION_CONNECTION"] = "True"

      res = dbtRunner().invoke([
          "run",
          "--profiles-dir", "/Workspace/Users/[email protected]/dbt-dbx-session-test/",
          "--project-dir", "/Workspace/Users/[email protected]/dbt-models",
          "--target", "prod",
          "--select", "model_to_execute",
      ])
      ```
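The implementation itself is not reproduced in this description, so here is a minimal sketch of the idea behind item 1, assuming the connection manager simply reuses the Spark session that the job cluster already provides. The helper names below are illustrative, not the actual code added by this PR.

```python
# Illustrative sketch only -- not the implementation in this PR.
# Core idea: reuse the cluster's active Spark session instead of
# opening a new SQL Warehouse / all-purpose cluster connection.
import os

from pyspark.sql import SparkSession


def use_session_connection() -> bool:
    # The PR gates the new behavior behind this environment variable.
    return os.getenv("DBT_DATABRICKS_SESSION_CONNECTION", "").lower() == "true"


def get_active_spark_session() -> SparkSession:
    # Hypothetical helper: assumes dbt is invoked from code that already
    # runs on the cluster (a notebook or job task), so a session exists.
    session = SparkSession.getActiveSession()
    if session is None:
        raise RuntimeError(
            "DBT_DATABRICKS_SESSION_CONNECTION is set, but no active "
            "Spark session was found on this cluster."
        )
    return session
```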

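Likewise, a minimal sketch of how a session_support toggle like the one in item 2 could be wired into pytest; the option name and fixture are assumptions, not the PR's actual test plumbing.

```python
# conftest.py -- illustrative only; the real matrix wiring may differ.
import pytest


def pytest_addoption(parser):
    # Hypothetical flag standing in for the "session_support" matrix entry.
    parser.addoption(
        "--session-support",
        action="store_true",
        default=False,
        help="Run unit tests against DatabricksSessionConnectionManager",
    )


@pytest.fixture(autouse=True)
def session_connection_env(request, monkeypatch):
    # Flip the env var so the adapter picks the session connection manager.
    if request.config.getoption("--session-support"):
        monkeypatch.setenv("DBT_DATABRICKS_SESSION_CONNECTION", "true")
```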
Why This Matters

  • Enables running dbt within existing Spark sessions, providing more flexibility for advanced Databricks workflows.
  • Expands the range of cluster types supported by dbt.
  • Supports integration with Databricks job clusters, ensuring compatibility with real-world use cases.

Next Steps

  • Document this feature for users who may need it.
  • Verify compatibility with additional Databricks environments as needed.

Checklist

  • [x] I have run this code in development and it appears to resolve the stated issue
  • [x] This PR includes tests, or tests are not required/relevant for this PR
  • [x] I have updated the CHANGELOG.md and added information about my change to the "dbt-databricks next" section.

@benc-db (Collaborator) commented Dec 3, 2024

@dkruh1, we need to discuss internally if we want to take this feature. I appreciate the effort, and I understand why this feature would be valuable to users, but we need to decide whether we want to take on the maintenance burden of an additional connection mechanism. Will get back to you shortly.

@benc-db (Collaborator) commented Dec 5, 2024

@dkruh1 after discussion, we will not be taking this feature at this time. We are focused on ensuring that dbt-databricks provides the best experience for interacting with SQL Warehouses and serverless compute. As this is OSS, you are free to fork our repo and use your implementation that way.

@benc-db closed this Dec 5, 2024