Introduce PySpark Session support (enables the adapter usage for job clusters) #862
+172
−9
Resolves: [dbt-spark Issue #272], [dbt-databricks Issue #575]
Description
Pull Request Description
Summary
This PR introduces support for defining a PySpark-based connection when using the adapter. This enhancement allows dbt to run as part of a running Databricks job cluster, expanding its usage beyond SQL warehouses or all-purpose clusters.
Background
The Spark session functionality referenced here was first discussed in [dbt-spark Issue #272]. For Databricks specifically, the issue was raised in [dbt-databricks Issue #575].
Key Features
PySpark-Based Connection:
A new environment variable, `DBT_DATABRICKS_SESSION_CONNECTION`, has been introduced. When it is set to `True`, a new `DatabricksSessionConnectionManager` is initialized.
Testing:
Functional Testing:
Functional tests were conducted using a Databricks notebook.
The notebook programmatically triggered dbt while ensuring the `DBT_DATABRICKS_SESSION_CONNECTION` variable was set to `True`.
These tests confirmed that dbt works within a running Spark session.
Example notebook code:
```python
import os

from dbt.cli.main import dbtRunner

# Enable the session-based connection before invoking dbt.
os.environ["DBT_DATABRICKS_SESSION_CONNECTION"] = "True"
res = dbtRunner().invoke([
    "run",
    "--profiles-dir", "/Workspace/Users/[email protected]/dbt-dbx-session-test/",
    "--project-dir", "/Workspace/Users/[email protected]/dbt-models",
    "--target", "prod",
    "--select", "model_to_execute",
])
```
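For reviewers, the environment-variable switch described under Key Features could be sketched roughly as below. This is only an illustration, not the code in this PR: the helper names `session_connection_enabled` and `pick_connection_manager` are hypothetical, the fallback name `DatabricksConnectionManager` is an assumption, and the case-insensitive parsing of `True` may differ from what the adapter actually does.

```python
import os

# Variable name taken from the PR description.
ENV_VAR = "DBT_DATABRICKS_SESSION_CONNECTION"


def session_connection_enabled(environ=None) -> bool:
    """Return True when the session-connection flag is set to "true"."""
    env = os.environ if environ is None else environ
    # Assumption: the comparison is case-insensitive; the real adapter
    # may parse the value differently.
    return env.get(ENV_VAR, "").strip().lower() == "true"


def pick_connection_manager(environ=None) -> str:
    """Hypothetical selector returning the name of the manager to initialize."""
    if session_connection_enabled(environ):
        return "DatabricksSessionConnectionManager"
    # Assumed name for the default, non-session manager.
    return "DatabricksConnectionManager"
```

Passing a plain dict instead of `os.environ` keeps the sketch easy to exercise outside a Databricks notebook.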
Why This Matters
Next Steps
Checklist
I have updated `CHANGELOG.md` and added information about my change to the "dbt-databricks next" section.