From 4859b01be1d183edf17b78640e3a9ba0bc152fed Mon Sep 17 00:00:00 2001 From: MACHIN Date: Wed, 21 Aug 2024 11:49:51 +0100 Subject: [PATCH 01/10] updating Databricks/RStudio connection guidance based on new advice from Wojtek --- ADA/databricks_rstudio_sql_warehouse.qmd | 258 ++++++++++++++++++----- 1 file changed, 204 insertions(+), 54 deletions(-) diff --git a/ADA/databricks_rstudio_sql_warehouse.qmd b/ADA/databricks_rstudio_sql_warehouse.qmd index 9ffc6e2..9ee2951 100644 --- a/ADA/databricks_rstudio_sql_warehouse.qmd +++ b/ADA/databricks_rstudio_sql_warehouse.qmd @@ -1,105 +1,255 @@ --- - title: "Setup Databricks SQL Warehouse with RStudio" + title: "Connecting to Databricks from RStudio" --- -

The following instructions set up an ODBC connection between your laptop and your DataBricks SQL warehouse, which can then be used in R/RStudio to query data using an ODBC based package. -SQL Warehouses are able to run SQL and can be used within the DataBricks environment, or through RStudio to run SQL code. The `sparklyr` package can also be used from within RStudio (only) as it converts the request for data to Spark SQL behind the scenes.

+The following instructions will help you to set up a connection between your laptop and your Databricks SQL warehouse, which can then be used in RStudio to query data. + +You can use data from the SQL warehouse in two different ways: + +- In the SQL editor or in notebooks via the Databricks environment +- In RStudio, similarly to the way that you might currently use SQL Server + +The `sparklyr` package can also be used from within RStudio as it converts the request for data to Spark SQL behind the scenes. + +--- + +## Pre-requisites + +Before you start, you must have: + +- Access to the Databricks platform + +- Access to data in a SQL Warehouse on Databricks + +- R and RStudio downloaded and installed + +- The `ODBC` package installed in RStudio + +If you do not have access to Databricks or a SQL Warehouse within Databricks, you can request this using [a service request form](https://dfe.service-now.com/serviceportal?id=sc_cat_item&sys_id=74bc3be81b212d504f999978b04bcb0b). + +If you do not have R or RStudio, you can find them both in the Software Centre. Note that you need both R **and** RStudio installed. + +--- + +## Process + +There are three steps to complete before your connection can be established. These are: + +- Setting up an ODBC driver on your laptop to enable a connection between your laptop and Databricks +- Modifying your .Renviron file to establish a connection between RStudio and Databricks +- Adding connection code to your existing scripts in RStudio + +Each of these steps is described in more detail in the sections below. 
+
+---
+
+### Setting up the ODBC driver
+
+---
+
+#### Install the 'Simba Spark ODBC' driver from the Software Centre
+
+---
+
+- Open the Software Centre via the start menu
+
+- In the 'Applications' tab, click 'Simba Spark ODBC Driver 64-bit'
+
+::: {align="center"}
+![](../images/databricks-software-centre.png)
+:::
+
+- Click install
+
+------------------------------------------------------------------------
+
+#### Get connection details for the SQL Warehouse from Databricks
+
+------------------------------------------------------------------------
+
+To set up the connection you will need a few details from the SQL Warehouse within Databricks that you wish to connect to.
+
+- Log in to Databricks
+
+- Click on the 'SQL Warehouses' tab in the sidebar
+
+::: {align="center"}
+![](../images/databricks-SQL-warehouses.png)
+:::
+
+- Click on the name of the warehouse you want to connect to, and click the 'Connection Details' tab
+
+- Make a note of the 'Server hostname', 'Port', and 'HTTP Path'
------------------------------------------------------------------------
-# Pre-requisites
+#### Get a personal access token from Databricks for authentication
------------------------------------------------------------------------
-You must have:
+A personal access token is a security measure that acts as an identifier to let Databricks know who is accessing information from the SQL warehouse. Access tokens are usually set for a limited amount of time, so they will need renewing periodically.
+
+- In Databricks, click on your email address in the top right corner, then click 'User settings'
+
+- Go to the 'Developer' tab in the side bar.
Next to 'Access tokens', click the 'Manage' button + +::: {align="center"} +![](../images/databricks-access-tokens.png) +::: + +- Click the 'Generate new token' button + +- Name the token, then click 'Generate' -- Access to Databricks -- Access to a SQL Warehouse on DataBricks -- R and RStudio downloaded +::: callout-note +Note that access tokens will only last as long as the value for the 'Lifetime (days)' field. After this period the token will expire, and you will need to create a new one to re-authenticate. +::: + +- Make a note of the 'Databricks access token' it has given you + +::: callout-warning +It is very important that you immediately copy the access token that you are given, as you will not be able to see it through Databricks again. If you lose this access token before pasting it into RStudio then you must generate a new access token to replace it. +::: ------------------------------------------------------------------------ -# Downloading an ODBC driver +#### Set up ODBC connection from your laptop ------------------------------------------------------------------------ -1. Install the 'Simba Spark ODBC' driver from the software centre. +We now have all the information we need to setup a connection between our laptop and Databricks: - i) Open the Software Centre via the start menu. +- in the start menu, search 'ODBC' and open 'ODBC Data Sources (64-bit)' - ii) In the 'Applications' tab, click 'Simba Spark ODBC Driver 64-bit'. ![](../images/databricks-software-centre.png) +- on the 'User DSN' tab click the 'Add...' button - iii) Click install. +- in the 'Create New Data Source' window, select 'Simba Spark ODBC Driver' and click 'Finish' -2. Get connection details for the SQL Warehouse from Databricks. To set up the connection you will need a few details from a SQL Warehouse within DataBricks. 
+- you will see the 'Simba Spark ODBC Driver DSN Setup' window shown below: - i) Login to [Databricks](https://adb-6882499576863257.17.azuredatabricks.net/?o=6882499576863257) +![](../images/odbc-driver-settings.png){fig-align="center"} - ii) Click on the 'SQL Warehouses' tab in the sidebar. ![](../images/databricks-SQL-warehouses.png) +- enter a 'Data Source Name' - iii) Click on the name of the warehouse you want to connect to, and click the 'connection details' tab. +::: {.callout-note} +Choose a short and sensible data source name and note it down as this is what you will use to connect to Databricks through RStudio. +::: - iv) Make a note of the 'Server hostname', 'Port', and 'HTTP Path'. +- enter a 'Description' -3. Get a personal access token from Databricks for authentication. +::: {.callout-note} +As you can set up more than one cluster on Databricks, use the description to make clear which cluster this connection is for. The description shown below describes that this connection is using an 8 core cluster on Databricks Runtime Environment 13. +::: - i) In Databricks, click on your email address in the top right corner, then click 'User settings'. +Set the remaining options to the settings below: - ii) Go to the 'Developer' tab in the side bar. Next to 'Access tokens', click the 'Manage' button. ![](../images/databricks-access-tokens.png) +- enter the 'Server Hostname' for your cluster in the 'Host(s)' field (you noted this down in step 2) +- in the Port section, remove the default number and use the Port number you noted in step 2 +- set the Authentication Mechanism to 'User Name and Password' +- enter the word 'token' into the 'User Name:' field, then enter your 'Databricks access token' in the 'Password' field +- change the Thrift Transport option to HTTP\ +- click the 'HTTP Options...' button and enter the 'HTTP Path' of your Databricks cluster, then click 'OK' +- click the 'SSL Options...' 
button and tick the 'Enable SSL' box, then click the 'OK' button +- click the 'Test' button to verify the connection has worked. You should see the following message: - iii) Click the 'Generate new token' button. - iv) Name the token, then click 'Generate'. *Note that access tokens will only last as long as the value for the 'Lifetime (days)' field. After this period the token will expire, and you will need to create a new one to re-authenticate.* +![](../images/databricks-test-connection.png){fig-align="center"} - v) Make a note of the 'Databricks access token' it has given you. **It is important to copy this somewhere as you will not be able to see it through Databricks again.** -4. Setup ODBC connection from your laptop. We now have all the information we need to setup a connection between our laptop and DataBricks. +::: {.callout-note} +If you get an error here, repeat steps 5.e.i -- 5.e.ix again and ensure all the values are correct. +::: - i) In the start menu, search 'ODBC' and open 'ODBC Data Sources (64-bit)'. +- click the 'OK' button to exit the 'Test Results' window, then the 'OK' button in the 'Simba Spark ODBC Driver DSN Setup' window - ii) On the 'User DSN' tab click the 'Add...' button. +------------------------------------------------------------------------ - iii) In the 'Create New Data Source' window, select 'Simba Spark ODBC Driver' and click 'Finish'. +### Establishing an RStudio connection using environment variables + +------------------------------------------------------------------------ - iv) In the 'Simba Spark ODBC Driver DSN Setup' window, +The `ODBC` package in RStudio allows you to connect to Databricks by creating and modifying four environment variables in your .Renviron file. - a. Enter a 'Data Source Name' and 'Description'. Choose a short and sensible data source name and note it down as this is what you will use to connect to Databricks through RStudio. 
*As you can set up more than one cluster on Databricks, use the description to make clear which cluster this connection is for. The description shown below describes that this connection is using an 8 core cluster on Databricks Runtime Environment 13.* - b. Set the remaning options to the settings below. - - Enter the 'Server Hostname' for your cluster in the 'Host(s):' field (you noted this down in step 2). - - In the Port section, remove the default number and use the Port number you noted in step 2. - - Set the Authentication Mechanism to 'User Name and Password'. - - Enter the word 'token' into the 'User Name:' field, then enter your 'Databricks access token' in the 'Password:' field. - - Change the Thrift Transport option to HTTP. - - Click the 'HTTP Options...' button and enter the 'HTTP Path' of your Databricks cluster, then click 'Okay'. - - Click the 'SSL Options...' button and tick the 'Enable SSL' box, then click the 'OK' button. - - ![](../images/odbc-driver-settings.png) - c. Click the 'Test' button to verify the connection has worked. You should see the following message. *If you get an error here, repeat steps 5.e.i -- 5.e.ix again and ensure all the values are correct.* +To set the environment variables, call `usethis::edit_r_environ()`. You will then need to enter the following information: - ![](../images/databricks-test-connection.png) +``` +DATABRICKS_HOST=adb-5037484389568426.6.azuredatabricks.net +DATABRICKS_SQL_WAREHOUSE_ID= +DATABRICKS_TOKEN= +DATABRICKS_CLUSTER_ID= +``` - d. Click the 'OK' button to exit the 'Test Results' window, then the 'OK' button in the 'Simba Spark ODBC Driver DSN Setup' window. +Once you have entered the details, save and close your .Renviron file. -5. Connect through RStudio. Watch the below video and view the [ADA_RStudio_connect GitHub repo](https://github.com/dfe-analytical-services/ADA_RStudio_connect) for methods on connecting to Databricks and querying data from RStudio. 
+::: callout-note +Everyone in your team that wishes to connect to the SQL Warehouse in Databricks and run your code must set up their .Renviron file individually, otherwise their connection will fail. +::: +The sections below describe where to find the information needed for each of the four environment variables. ------------------------------------------------------------------------ -# Pulling data into R studio from Databricks +#### Databricks host + +------------------------------------------------------------------------ + +The Databricks host is the instance of Databricks that you want to connect to. It's the URL that you see in your browser bar when you're on the Databricks site and should end in "azuredatabricks.net". + +------------------------------------------------------------------------ + +#### Databricks SQL Warehouse ID + +------------------------------------------------------------------------ + +The Databricks SQL warehouse ID is the warehouse containing data that you would like to use. To get the warehouse ID, follow these steps: + +- click 'SQL Warehouses' under the 'SQL' section of the left hand menu on Databricks +- click on the warehouse name that you'd like to get the ID for +- the warehouse ID is in brackets next to the warehouse name on the 'Overview' tab ------------------------------------------------------------------------ -Once you have set up an ODBC connection as detailed above, you can then use that connection to pull data directly from Databricks into R Studio. Charlotte recorded a video demonstrating two possible methods of how to do this. The recording is embedded below: +#### Databricks Token + +------------------------------------------------------------------------ + +The Databricks token is the personal access token we generated in the [Get a personal access token from Databricks for authentication](/ADA/databricks_rstudio_sql_warehouse.html#get-a-personal-access-token-from-databricks-for-authentication) section. 
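Because the token acts as your identity, it should never be printed or committed in full. As a small sketch (not part of the patch itself; the token below is a fake example), you could confirm a token-like value is present without revealing it:

```r
# Sketch only: confirm a token loaded without printing it in full.
# The value here is a fake example, not a real Databricks token.
token <- "dapi0123456789abcdef0123456789abcdef"
masked <- paste0(substr(token, 1, 4), strrep("*", nchar(token) - 4))
print(masked)  # first four characters followed by asterisks
```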
+ +------------------------------------------------------------------------ + +#### Cluster ID + +------------------------------------------------------------------------ + +The Cluster ID is the cluster that you'll be using to run your code. + +To obtain this, follow these steps: + +- click 'Compute' in the Databricks menu, then click on your cluster's name +- in the address bar, copy the string of characters between `clusters` and `configuration` in the URL. This is your cluster ID. + +------------------------------------------------------------------------ + +### Pulling data into RStudio from Databricks + +------------------------------------------------------------------------ + +Now that you have established a connection between Databricks and your laptop, and between Databricks and RStudio, you can add code to your existing scripts to pull data into RStudio for analysis. If you have connected to other SQL databases before, this code will look quite familiar to you. + +To access the data, we will make use of the `ODBC` package. You can find documentation about this package [on the Posit website](https://solutions.posit.co/connections/db/r-packages/odbc/). + +To pull in your data, include the following code in your R Script: + -
- -
+``` {r databricks_connect, eval=FALSE} +library(odbc) -A template of all of the code used in the above video can be found in the [ADA_RStudio_connect GitHub repo](https://github.com/dfe-analytical-services/ADA_RStudio_connect). +con <- DBI::dbConnect( + odbc::databricks(), + httpPath = Sys.getenv("DATABRICKS_SQL_WAREHOUSE_ID") +) -Key takeaways from the video and example code: +odbcListObjects(con) +``` -* The main change here compared to connecting to SQL databases is the connection method. The installation and setup of the ODBC driver are all done pre-code, and the only part of the code that will need updating is your connection (usually your con variable). -* If your existing code was pulling in tables from SQL via the `RODBC` package or the `dbplyr` package, then this code should in theory run with minimal edits needed. -* If you were writing tables back into SQL from R, this is where your code may need the most edits. -* If your code is stored in a repo where multiple analysts contribute to and run the code, in order for the code to run for everyone you will all need to individually install the ODBC driver and **give it the same name** so that when the `con` variable is called, the name used in the code matches everyone's individual driver and runs for everyone. If this is the case, please **add a note about this to your repo's readme file** to help your future colleagues. 
From 3f8bcad0a9061101a768cb8272c81d45da21c5f0 Mon Sep 17 00:00:00 2001 From: MACHIN Date: Thu, 22 Aug 2024 12:21:29 +0100 Subject: [PATCH 02/10] removing "Set up ODBC connection from your laptop" and other formatting edits --- ADA/databricks_rstudio_sql_warehouse.qmd | 191 ++++++++++------------- 1 file changed, 81 insertions(+), 110 deletions(-) diff --git a/ADA/databricks_rstudio_sql_warehouse.qmd b/ADA/databricks_rstudio_sql_warehouse.qmd index 9ee2951..d63a92c 100644 --- a/ADA/databricks_rstudio_sql_warehouse.qmd +++ b/ADA/databricks_rstudio_sql_warehouse.qmd @@ -35,7 +35,7 @@ If you do not have R or RStudio, you can find them both in the Software Centre. There are three steps to complete before your connection can be established. These are: -- Setting up an ODBC driver on your laptop to enable a connection between your laptop and Databricks +- Installing an ODBC driver on your laptop to enable a connection between your laptop and Databricks - Modifying your .Renviron file to establish a connection between RStudio and Databricks - Adding connection code to your existing scripts in RStudio @@ -45,15 +45,17 @@ Each of these steps is described in more detail in the sections below. ### Setting up the ODBC driver +An ODBC driver is required for the `ODBC` package in R to work - you must install it before attempting to use the package to connect to your data. + --- -#### Install the 'Simba Spark ODBC' driver from the Software Centre +#### Install the Simba Spark ODBC driver from the Software Centre --- - Open the Software Centre via the start menu -- In the 'Applications' tab, click 'Simba Spark ODBC Driver 64-bit' +- In the 'Applications' tab, click `Simba Spark ODBC Driver 64-bit` ::: {align="center"} ![](../images/databricks-software-centre.png) @@ -63,107 +65,6 @@ Each of these steps is described in more detail in the sections below. 
------------------------------------------------------------------------ -#### Get connection details for the SQL Warehouse from Databricks - ------------------------------------------------------------------------- - -To set up the connection you will need a few details from the SQL Warehouse within Databricks that you wish to connect to. - -- Log in to Databricks - -- Click on the 'SQL Warehouses' tab in the sidebar - -::: {align="center"} -![](../images/databricks-SQL-warehouses.png) -::: - -- Click on the name of the warehouse you want to connect to, and click the 'Connection Details' tab - -- Make a note of the 'Server hostname', 'Port', and 'HTTP Path' - ------------------------------------------------------------------------- - -#### Get a personal access token from Databricks for authentication - ------------------------------------------------------------------------- - -A personal access token is is a security measure that acts as an identifier to let Databricks know who is accessing information from the SQL warehouse. Access tokens are usually set for a limited amount of time, so they will need renewing periodically. - -- In Databricks, click on your email address in the top right corner, then click 'User settings' - -- Go to the 'Developer' tab in the side bar. Next to 'Access tokens', click the 'Manage' button - -::: {align="center"} -![](../images/databricks-access-tokens.png) -::: - -- Click the 'Generate new token' button - -- Name the token, then click 'Generate' - -::: callout-note -Note that access tokens will only last as long as the value for the 'Lifetime (days)' field. After this period the token will expire, and you will need to create a new one to re-authenticate. -::: - -- Make a note of the 'Databricks access token' it has given you - -::: callout-warning -It is very important that you immediately copy the access token that you are given, as you will not be able to see it through Databricks again. 
If you lose this access token before pasting it into RStudio then you must generate a new access token to replace it. -::: - ------------------------------------------------------------------------- - -#### Set up ODBC connection from your laptop - ------------------------------------------------------------------------- - -We now have all the information we need to setup a connection between our laptop and Databricks: - -- in the start menu, search 'ODBC' and open 'ODBC Data Sources (64-bit)' - -- on the 'User DSN' tab click the 'Add...' button - -- in the 'Create New Data Source' window, select 'Simba Spark ODBC Driver' and click 'Finish' - -- you will see the 'Simba Spark ODBC Driver DSN Setup' window shown below: - -![](../images/odbc-driver-settings.png){fig-align="center"} - -- enter a 'Data Source Name' - -::: {.callout-note} -Choose a short and sensible data source name and note it down as this is what you will use to connect to Databricks through RStudio. -::: - -- enter a 'Description' - -::: {.callout-note} -As you can set up more than one cluster on Databricks, use the description to make clear which cluster this connection is for. The description shown below describes that this connection is using an 8 core cluster on Databricks Runtime Environment 13. -::: - -Set the remaining options to the settings below: - -- enter the 'Server Hostname' for your cluster in the 'Host(s)' field (you noted this down in step 2) -- in the Port section, remove the default number and use the Port number you noted in step 2 -- set the Authentication Mechanism to 'User Name and Password' -- enter the word 'token' into the 'User Name:' field, then enter your 'Databricks access token' in the 'Password' field -- change the Thrift Transport option to HTTP\ -- click the 'HTTP Options...' button and enter the 'HTTP Path' of your Databricks cluster, then click 'OK' -- click the 'SSL Options...' 
button and tick the 'Enable SSL' box, then click the 'OK' button -- click the 'Test' button to verify the connection has worked. You should see the following message: - - -![](../images/databricks-test-connection.png){fig-align="center"} - - -::: {.callout-note} -If you get an error here, repeat steps 5.e.i -- 5.e.ix again and ensure all the values are correct. -::: - -- click the 'OK' button to exit the 'Test Results' window, then the 'OK' button in the 'Simba Spark ODBC Driver DSN Setup' window - ------------------------------------------------------------------------- - ### Establishing an RStudio connection using environment variables ------------------------------------------------------------------------ @@ -173,7 +74,7 @@ The `ODBC` package in RStudio allows you to connect to Databricks by creating an To set the environment variables, call `usethis::edit_r_environ()`. You will then need to enter the following information: ``` -DATABRICKS_HOST=adb-5037484389568426.6.azuredatabricks.net +DATABRICKS_HOST= DATABRICKS_SQL_WAREHOUSE_ID= DATABRICKS_TOKEN= DATABRICKS_CLUSTER_ID= @@ -193,7 +94,7 @@ The sections below describe where to find the information needed for each of the ------------------------------------------------------------------------ -The Databricks host is the instance of Databricks that you want to connect to. It's the URL that you see in your browser bar when you're on the Databricks site and should end in "azuredatabricks.net". +The Databricks host is the instance of Databricks that you want to connect to. It's the URL that you see in your browser bar when you're on the Databricks site and should end in "azuredatabricks.net" (ignore anything after this section of the URL). 
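As an illustration of "ignore anything after this section of the URL" (the address below is a made-up example, not a real workspace), the bare host could be isolated in R like so:

```r
# Sketch: trim a full browser URL (fictional example) down to the bare host.
url <- "https://adb-1234567890123456.7.azuredatabricks.net/?o=1234567890123456"
host <- sub("^https?://", "", url)  # drop the scheme
host <- sub("/.*$", "", host)       # drop the path and query string
print(host)  # "adb-1234567890123456.7.azuredatabricks.net"
```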
------------------------------------------------------------------------
@@ -209,11 +110,35 @@
-#### Databricks Token
+#### Databricks token
------------------------------------------------------------------------
-The Databricks token is the personal access token we generated in the [Get a personal access token from Databricks for authentication](/ADA/databricks_rstudio_sql_warehouse.html#get-a-personal-access-token-from-databricks-for-authentication) section.
+The Databricks token is a personal access token.
+
+A personal access token is a security measure that acts as an identifier to let Databricks know who is accessing information from the SQL warehouse. Access tokens are usually set for a limited amount of time, so they will need renewing periodically.
+
+- In Databricks, click on your email address in the top right corner, then click 'User settings'
+
+- Go to the 'Developer' tab in the side bar. Next to 'Access tokens', click the 'Manage' button
+
+::: {align="center"}
+![](../images/databricks-access-tokens.png)
+:::
+
+- Click the 'Generate new token' button
+
+- Name the token, then click 'Generate'
+
+::: callout-note
+Note that access tokens will only last as long as the value for the 'Lifetime (days)' field. After this period the token will expire, and you will need to create a new one to re-authenticate.
+:::
+
+- Make a note of the 'Databricks access token' it has given you
+
+::: callout-warning
+It is very important that you immediately copy the access token that you are given, as you will not be able to see it through Databricks again. If you lose this access token before pasting it into RStudio then you must generate a new access token to replace it.
+::: ------------------------------------------------------------------------ @@ -238,10 +163,20 @@ Now that you have established a connection between Databricks and your laptop, a To access the data, we will make use of the `ODBC` package. You can find documentation about this package [on the Posit website](https://solutions.posit.co/connections/db/r-packages/odbc/). -To pull in your data, include the following code in your R Script: +You can connect to your data in two ways, both of which are described below. + +--- + +#### Connecting via the SQL Warehouse + +--- + +This uses the `ODBC` package and can interact with tables in the Delta Lake. +Include the following code in your R Script: -``` {r databricks_connect, eval=FALSE} + +``` {r databricks_connect_sql, eval=FALSE} library(odbc) con <- DBI::dbConnect( @@ -252,4 +187,40 @@ con <- DBI::dbConnect( odbcListObjects(con) ``` +--- + +#### Connecting via a compute cluster + +--- + +Compute clusters can interact with Volumes in Databricks, unlike the SQL Warehouse. For compute clusters, you need to use `sparklyr` rather than the `ODBC` package. 
+ +Include the following code in your R Script: + +``` {r databricks_connect_cluster, eval=FALSE} +library(tidyverse) +library(sparklyr) +library(dbplyr) + +sc <- spark_connect( + cluster_id = Sys.getenv("DATABRICKS_CLUSTER_ID"), + method = "databricks_connect" +) + +tbl( + sc, + I("catalog_10_gold.information_schema.catalogs") +) %>% + head() +``` + +You can read CSV files from a Volume into a dataframe using the following code: + +``` {r databricks_connect_sparklyr, eval=FALSE} +df <- spark_read_csv( + sc, + "/Volumes/catalog_40_copper/folder/file.csv" +) + +``` From d872290cf6766a0168b1a6a22e9e756dd7077cb9 Mon Sep 17 00:00:00 2001 From: MACHIN Date: Thu, 22 Aug 2024 13:45:44 +0100 Subject: [PATCH 03/10] corrections in response to PR comments --- ADA/databricks_rstudio_sql_warehouse.qmd | 70 ++++++------------------ 1 file changed, 16 insertions(+), 54 deletions(-) diff --git a/ADA/databricks_rstudio_sql_warehouse.qmd b/ADA/databricks_rstudio_sql_warehouse.qmd index d63a92c..278b15a 100644 --- a/ADA/databricks_rstudio_sql_warehouse.qmd +++ b/ADA/databricks_rstudio_sql_warehouse.qmd @@ -9,8 +9,6 @@ You can use data from the SQL warehouse in two different ways: - In the SQL editor or in notebooks via the Databricks environment - In RStudio, similarly to the way that you might currently use SQL Server -The `sparklyr` package can also be used from within RStudio as it converts the request for data to Spark SQL behind the scenes. - --- ## Pre-requisites @@ -80,6 +78,11 @@ DATABRICKS_TOKEN= DATABRICKS_CLUSTER_ID= ``` +::: callout-important +Each value should be entered inside quotation marks, like this: +DATABRICKS_SQL_WAREHOUSE_ID="/sql/1.0/warehouses/abcdefgh123" +::: + Once you have entered the details, save and close your .Renviron file. 
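After restarting R, a quick sanity check along these lines (using the variable names from the .Renviron example above; this is a sketch, not part of the official guidance) can confirm the values were picked up:

```r
# Sketch: report which of the expected .Renviron variables are still unset.
# Sys.getenv() returns "" for variables that have not been defined.
required <- c("DATABRICKS_HOST", "DATABRICKS_SQL_WAREHOUSE_ID",
              "DATABRICKS_TOKEN", "DATABRICKS_CLUSTER_ID")
unset <- required[Sys.getenv(required) == ""]
if (length(unset) > 0) {
  message("Still unset: ", paste(unset, collapse = ", "))
} else {
  message("All Databricks environment variables are set.")
}
```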
::: callout-note
@@ -106,7 +109,9 @@ The Databricks SQL warehouse ID is the warehouse containing data that you would
- click 'SQL Warehouses' under the 'SQL' section of the left hand menu on Databricks
- click on the warehouse name that you'd like to get the ID for
-- the warehouse ID is in brackets next to the warehouse name on the 'Overview' tab
+- the warehouse ID is the 'HTTP Path' in the 'Connection details' tab
+- the ID should start with something similar to "/sql/1.0/warehouses/"
+
------------------------------------------------------------------------
@@ -150,8 +155,10 @@ The Cluster ID is the cluster that you'll be using to run your code.
To obtain this, follow these steps:
-- click 'Compute' in the Databricks menu, then click on your cluster's name
-- in the address bar, copy the string of characters between `clusters` and `configuration` in the URL. This is your cluster ID.
+- click 'Compute' in the left-hand menu, then click on the name of your cluster
+- under the 'Configuration' tab, click on 'Advanced options' at the bottom of the page
+- click on the 'JDBC/ODBC' tab
+- in 'HTTP Path', copy everything after the last '/'. This is your cluster ID
------------------------------------------------------------------------
@@ -163,16 +170,6 @@ Now that you have established a connection between Databricks and your laptop, a
To access the data, we will make use of the `ODBC` package. You can find documentation about this package [on the Posit website](https://solutions.posit.co/connections/db/r-packages/odbc/).
-You can connect to your data in two ways, both of which are described below.
-
----
-
-#### Connecting via the SQL Warehouse
-
----
-
-This uses the `ODBC` package and can interact with tables in the Delta Lake.
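The "copy everything after the last /" step in the cluster ID instructions above can also be done programmatically. This is an illustrative sketch using a made-up HTTP path, not a value from a real cluster:

```r
# Sketch: take everything after the final "/" of a JDBC/ODBC HTTP path.
http_path <- "sql/protocolv1/o/1234567890123456/0123-456789-abcdefgh"  # fake example
cluster_id <- sub(".*/", "", http_path)  # greedy match strips up to the last "/"
print(cluster_id)  # "0123-456789-abcdefgh"
```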
- Include the following code in your R Script: @@ -180,47 +177,12 @@ Include the following code in your R Script: library(odbc) con <- DBI::dbConnect( - odbc::databricks(), - httpPath = Sys.getenv("DATABRICKS_SQL_WAREHOUSE_ID") + odbc::databricks(), + httpPath = Sys.getenv("DATABRICKS_SQL_WAREHOUSE_ID"), + workspace = Sys.getenv("DATABRICKS_HOST") ) odbcListObjects(con) ``` ---- - -#### Connecting via a compute cluster - ---- - -Compute clusters can interact with Volumes in Databricks, unlike the SQL Warehouse. For compute clusters, you need to use `sparklyr` rather than the `ODBC` package. - -Include the following code in your R Script: - -``` {r databricks_connect_cluster, eval=FALSE} -library(tidyverse) -library(sparklyr) -library(dbplyr) - -sc <- spark_connect( - cluster_id = Sys.getenv("DATABRICKS_CLUSTER_ID"), - method = "databricks_connect" -) - -tbl( - sc, - I("catalog_10_gold.information_schema.catalogs") -) %>% - head() -``` - -You can read CSV files from a Volume into a dataframe using the following code: - -``` {r databricks_connect_sparklyr, eval=FALSE} - -df <- spark_read_csv( - sc, - "/Volumes/catalog_40_copper/folder/file.csv" -) - -``` +--- \ No newline at end of file From 3885974d0e30cd8598f86cbc4ecd108a805a0755 Mon Sep 17 00:00:00 2001 From: MACHIN Date: Fri, 23 Aug 2024 08:39:27 +0100 Subject: [PATCH 04/10] adding more context from Nick's branch and removing personal clusters section --- ADA/databricks_rstudio_sql_warehouse.qmd | 46 +++++++++++++++++------- 1 file changed, 34 insertions(+), 12 deletions(-) diff --git a/ADA/databricks_rstudio_sql_warehouse.qmd b/ADA/databricks_rstudio_sql_warehouse.qmd index 278b15a..f9be01c 100644 --- a/ADA/databricks_rstudio_sql_warehouse.qmd +++ b/ADA/databricks_rstudio_sql_warehouse.qmd @@ -2,12 +2,34 @@ title: "Connecting to Databricks from RStudio" --- -The following instructions will help you to set up a connection between your laptop and your Databricks SQL warehouse, which can then be used in RStudio to 
query data. +The following instructions will help you to set up a connection between your laptop and your Databricks SQL warehouse or personal cluster, which can then be used in RStudio to query data. -You can use data from the SQL warehouse in two different ways: +You can use data from Databricks in two different ways: - In the SQL editor or in notebooks via the Databricks environment -- In RStudio, similarly to the way that you might currently use SQL Server +- In RStudio via an ODBC connection, similarly to the way that you might currently use SQL Server + +------------------------------------------------------------------------ + +## Compute resources + +When your data is moved to Databricks, it will be stored in the Unity Catalog and you will need to use a compute resource to access it from other software such as RStudio. + +A compute resource allows you to run your code using cloud computing power instead of using your laptop's processing power. This means that using compute resources can allow your code to run faster than it would if you ran it locally, as it is like using the processing resources of multiple computers at once. On this page, we will be referring to the use of SQL Warehouses as the compute resource to run your code. + +------------------------------------------------------------------------ + +#### SQL Warehouse + +--- + +A SQL Warehouse is a SQL-only compute option which is quick to start and optimised for SQL querying. Although the name "warehouse" suggests storage, a SQL Warehouse in Databricks is actually a virtual computing resource that allows you to interact with Databricks by connecting to your data and running code. + +This option is recommended if you only require SQL functionality in Databricks and is ideal if you already have existing RAP pipelines set up using SQL scripts in a git repo. It works in a similar way to how many people currently use SQL Server Management Studio (SSMS). 
+ +SQL Warehouses do not support R, Python or Scala code. Currently they also do not support widgets within Databricks notebooks. If you want to use compute resources to run R or Python code, then you will need to use a personal cluster. There is guidance on the use of personal clusters on the [Using personal clusters with Databricks](ADA/databricks_rstudio_personal_cluster.html) page. + +SQL Warehouses enable you to access tables in the Unity Catalog, but not volumes within the Unity Catalog. Volumes are storage areas for files (e.g. .txt files or .csv files) rather than tables. You can learn more about volumes on [the Databricks documentation site](https://docs.databricks.com/en/volumes/index.html). To access a volume, you will also need to use a personal cluster. --- @@ -21,7 +43,7 @@ Before you start, you must have: - R and RStudio downloaded and installed -- The `ODBC` package installed in RStudio +- The `ODBC` and `DBI` packages installed in RStudio If you do not have access to Databricks or a SQL Warehouse within Databricks, you can request this using [a service request form](https://dfe.service-now.com/serviceportal?id=sc_cat_item&sys_id=74bc3be81b212d504f999978b04bcb0b). @@ -73,7 +95,7 @@ To set the environment variables, call `usethis::edit_r_environ()`. You will the ``` DATABRICKS_HOST= -DATABRICKS_SQL_WAREHOUSE_ID= +DATABRICKS_SQL_PATH= DATABRICKS_TOKEN= DATABRICKS_CLUSTER_ID= ``` @@ -83,7 +105,7 @@ Each value should be entered inside quotation marks, like this: DATABRICKS_SQL_WAREHOUSE_ID="/sql/1.0/warehouses/abcdefgh123" ::: -Once you have entered the details, save and close your .Renviron file. +Once you have entered the details, save and close your .Renviron file and restart R (Session > Restart R). ::: callout-note Everyone in your team that wishes to connect to the SQL Warehouse in Databricks and run your code must set up their .Renviron file individually, otherwise their connection will fail. 
@@ -101,11 +123,11 @@ The Databricks host is the instance of Databricks that you want to connect to. I ------------------------------------------------------------------------ -#### Databricks SQL Warehouse ID +#### Databricks SQL Path ------------------------------------------------------------------------ -The Databricks SQL warehouse ID is the warehouse containing data that you would like to use. To get the warehouse ID, follow these steps: +The Databricks SQL Path is the location of the warehouse containing data that you would like to use. To get the warehouse ID, follow these steps: - click 'SQL Warehouses' under the 'SQL' section of the left hand menu on Databricks - click on the warehouse name that you'd like to get the ID for @@ -166,20 +188,20 @@ To obtain this, follow these steps: ------------------------------------------------------------------------ -Now that you have established a connection between Databricks and your laptop, and between Databricks and RStudio, you can add code to your existing scripts to pull data into RStudio for analysis. If you have connected to other SQL databases before, this code will look quite familiar to you. +Now that you have enabled ODBC connections on your laptop, and enabled a connection between Databricks and RStudio, you can add code to your existing scripts to pull data into RStudio for analysis. If you have connected to other SQL databases before, this code will look quite familiar to you. -To access the data, we will make use of the `ODBC` package. You can find documentation about this package [on the Posit website](https://solutions.posit.co/connections/db/r-packages/odbc/). +To access the data, we will make use of the `ODBC` package. You can find documentation about this package [on the Posit website](https://solutions.posit.co/connections/db/r-packages/odbc/). You will also need to have the `DBI` package installed. 
Include the following code in your R Script: ``` {r databricks_connect_sql, eval=FALSE} library(odbc) +library(DBI) con <- DBI::dbConnect( odbc::databricks(), - httpPath = Sys.getenv("DATABRICKS_SQL_WAREHOUSE_ID"), - workspace = Sys.getenv("DATABRICKS_HOST") + httpPath = Sys.getenv("DATABRICKS_SQL_PATH") ) odbcListObjects(con) From 842d8db720cca02acfac2f77e42067b28773ec1e Mon Sep 17 00:00:00 2001 From: MACHIN Date: Fri, 23 Aug 2024 08:39:57 +0100 Subject: [PATCH 05/10] missed one tiny change! --- ADA/databricks_rstudio_sql_warehouse.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ADA/databricks_rstudio_sql_warehouse.qmd b/ADA/databricks_rstudio_sql_warehouse.qmd index f9be01c..8d6fd5c 100644 --- a/ADA/databricks_rstudio_sql_warehouse.qmd +++ b/ADA/databricks_rstudio_sql_warehouse.qmd @@ -102,7 +102,7 @@ DATABRICKS_CLUSTER_ID= ::: callout-important Each value should be entered inside quotation marks, like this: -DATABRICKS_SQL_WAREHOUSE_ID="/sql/1.0/warehouses/abcdefgh123" +DATABRICKS_SQL_PATH="/sql/1.0/warehouses/abcdefgh123" ::: Once you have entered the details, save and close your .Renviron file and restart R (Session > Restart R). 
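After restarting R, a quick way to confirm that the new .Renviron values have been picked up is to print them back from R. This is a small sketch; the variable names match the .Renviron entries described above:

```r
# Each of these should return the value you entered in .Renviron.
# An empty string ("") means the variable was not picked up - check
# the file for typos, save it, and restart R again.
Sys.getenv("DATABRICKS_HOST")
Sys.getenv("DATABRICKS_SQL_PATH")
Sys.getenv("DATABRICKS_TOKEN")
```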
From c4eba3eae25ce0b63cb26d2ae2232290bcd201da Mon Sep 17 00:00:00 2001 From: MACHIN Date: Fri, 30 Aug 2024 13:55:45 +0100 Subject: [PATCH 06/10] final comments addressed / personal cluster info removed --- ADA/databricks_rstudio_sql_warehouse.qmd | 29 ++++-------------------- 1 file changed, 5 insertions(+), 24 deletions(-) diff --git a/ADA/databricks_rstudio_sql_warehouse.qmd b/ADA/databricks_rstudio_sql_warehouse.qmd index 8d6fd5c..aec9351 100644 --- a/ADA/databricks_rstudio_sql_warehouse.qmd +++ b/ADA/databricks_rstudio_sql_warehouse.qmd @@ -89,7 +89,7 @@ An ODBC driver is required for the `ODBC` package in R to work - you must instal ------------------------------------------------------------------------ -The `ODBC` package in RStudio allows you to connect to Databricks by creating and modifying four environment variables in your .Renviron file. +The `ODBC` package in RStudio allows you to connect to Databricks by creating and modifying three environment variables in your .Renviron file. To set the environment variables, call `usethis::edit_r_environ()`. You will then need to enter the following information: @@ -97,14 +97,8 @@ To set the environment variables, call `usethis::edit_r_environ()`. You will the DATABRICKS_HOST= DATABRICKS_SQL_PATH= DATABRICKS_TOKEN= -DATABRICKS_CLUSTER_ID= ``` -::: callout-important -Each value should be entered inside quotation marks, like this: -DATABRICKS_SQL_PATH="/sql/1.0/warehouses/abcdefgh123" -::: - Once you have entered the details, save and close your .Renviron file and restart R (Session > Restart R). ::: callout-note @@ -123,11 +117,13 @@ The Databricks host is the instance of Databricks that you want to connect to. 
I ------------------------------------------------------------------------ -#### Databricks SQL Path +#### Databricks SQL Warehouse Path ------------------------------------------------------------------------ -The Databricks SQL Path is the location of the warehouse containing data that you would like to use. To get the warehouse ID, follow these steps: +As described in the [SQL Warehouses section](/ADA/ada.html#sql-warehouse), in Databricks, SQL Warehouses are a way to gain access to your data in the Unity Catalog. They run queries and return the results either to the user or to a table. + +To get the Warehouse ID, follow these steps: - click 'SQL Warehouses' under the 'SQL' section of the left hand menu on Databricks - click on the warehouse name that you'd like to get the ID for @@ -169,21 +165,6 @@ It is very important that you immediately copy the access token that you are giv ------------------------------------------------------------------------ -#### Cluster ID - ------------------------------------------------------------------------- - -The Cluster ID is the cluster that you'll be using to run your code. - -To obtain this, follow these steps: - -- click Compute in the left and menu, and click on the name of your cluster -- under the configuration tab, click on "Advanced options" at the bottom of the page -- click on the JDBC/ODBC tab -- in HTTP path, copy everything after the last /. 
This is your Cluster ID - ------------------------------------------------------------------------- - ### Pulling data into RStudio from Databricks ------------------------------------------------------------------------ From 3e2061233e06fd373b9312653e63f936b5347258 Mon Sep 17 00:00:00 2001 From: MACHIN Date: Fri, 30 Aug 2024 13:58:07 +0100 Subject: [PATCH 07/10] changing page title for clarity --- ADA/databricks_rstudio_sql_warehouse.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ADA/databricks_rstudio_sql_warehouse.qmd b/ADA/databricks_rstudio_sql_warehouse.qmd index aec9351..e963d7d 100644 --- a/ADA/databricks_rstudio_sql_warehouse.qmd +++ b/ADA/databricks_rstudio_sql_warehouse.qmd @@ -1,5 +1,5 @@ --- - title: "Connecting to Databricks from RStudio" + title: "Using a SQL Warehouse to access Databricks data from RStudio" --- The following instructions will help you to set up a connection between your laptop and your Databricks SQL warehouse or personal cluster, which can then be used in RStudio to query data. From 72fc66757a7d62d46f77ef0b13166db9c293d436 Mon Sep 17 00:00:00 2001 From: MACHIN Date: Fri, 13 Sep 2024 11:50:54 +0100 Subject: [PATCH 08/10] removing missed reference to personal cluster --- ADA/databricks_rstudio_sql_warehouse.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ADA/databricks_rstudio_sql_warehouse.qmd b/ADA/databricks_rstudio_sql_warehouse.qmd index e963d7d..e2f2c3f 100644 --- a/ADA/databricks_rstudio_sql_warehouse.qmd +++ b/ADA/databricks_rstudio_sql_warehouse.qmd @@ -2,7 +2,7 @@ title: "Using a SQL Warehouse to access Databricks data from RStudio" --- -The following instructions will help you to set up a connection between your laptop and your Databricks SQL warehouse or personal cluster, which can then be used in RStudio to query data. 
+The following instructions will help you to set up a connection between your laptop and your Databricks SQL warehouse, which can then be used in RStudio to query data.
 
 You can use data from Databricks in two different ways:
 
From 7a9a0050b5e98d3b391a64b154e4a18f323788d1 Mon Sep 17 00:00:00 2001
From: MACHIN
Date: Fri, 13 Sep 2024 13:45:58 +0100
Subject: [PATCH 09/10] addressing review comments from Charlotte

---
 ADA/databricks_rstudio_sql_warehouse.qmd | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/ADA/databricks_rstudio_sql_warehouse.qmd b/ADA/databricks_rstudio_sql_warehouse.qmd
index e2f2c3f..05c61ed 100644
--- a/ADA/databricks_rstudio_sql_warehouse.qmd
+++ b/ADA/databricks_rstudio_sql_warehouse.qmd
@@ -4,10 +4,14 @@
 
 The following instructions will help you to set up a connection between your laptop and your Databricks SQL warehouse, which can then be used in RStudio to query data.
 
+::: callout-note
+Please note: This guidance should be followed if you wish to run SQL scripts from RStudio against data held in tables in Databricks. If you wish to run R scripts or access data held in a volume instead, you will need a personal compute cluster. The guidance for setting up a personal cluster that works with RStudio can be found on our [set up Databricks personal compute cluster with RStudio](/ADA/databricks_rstudio_personal_cluster.html) page. You can learn more about compute resources on our [Databricks fundamentals](/ADA/databricks_fundamentals.html) page. 
+::: + You can use data from Databricks in two different ways: - In the SQL editor or in notebooks via the Databricks environment -- In RStudio via an ODBC connection, similarly to the way that you might currently use SQL Server +- In RStudio via an ODBC connection, similarly to the way that you might currently use data stored in a SQL Server ------------------------------------------------------------------------ @@ -25,11 +29,11 @@ A compute resource allows you to run your code using cloud computing power inste A SQL Warehouse is a SQL-only compute option which is quick to start and optimised for SQL querying. Although the name "warehouse" suggests storage, a SQL Warehouse in Databricks is actually a virtual computing resource that allows you to interact with Databricks by connecting to your data and running code. -This option is recommended if you only require SQL functionality in Databricks and is ideal if you already have existing RAP pipelines set up using SQL scripts in a git repo. It works in a similar way to how many people currently use SQL Server Management Studio (SSMS). +This option is recommended if you only require SQL functionality in Databricks and is ideal if you already have existing RAP pipelines set up using SQL scripts in a git repo. -SQL Warehouses do not support R, Python or Scala code. Currently they also do not support widgets within Databricks notebooks. If you want to use compute resources to run R or Python code, then you will need to use a personal cluster. There is guidance on the use of personal clusters on the [Using personal clusters with Databricks](ADA/databricks_rstudio_personal_cluster.html) page. +SQL Warehouses do not support R, Python or Scala code. Currently they also do not support widgets within Databricks notebooks. If you want to use compute resources to run widgets or R or Python code, then you will need to use a personal cluster. 
There is guidance on the use of personal clusters on the [Using personal clusters with Databricks](ADA/databricks_rstudio_personal_cluster.html) page.
+SQL Warehouses do not support R, Python or Scala code. Currently they also do not support widgets within Databricks notebooks. If you want to use compute resources to run widgets or R or Python code, then you will need to use a personal cluster. There is guidance on the use of personal clusters on the [Using personal clusters with Databricks](ADA/databricks_rstudio_personal_cluster.html) page.
 
-SQL Warehouses enable you to access tables in the Unity Catalog, but not volumes within the Unity Catalog. Volumes are storage areas for files (e.g. .txt files or .csv files) rather than tables. You can learn more about volumes on [the Databricks documentation site](https://docs.databricks.com/en/volumes/index.html). To access a volume, you will also need to use a personal cluster.
+SQL Warehouses enable you to access tables in the Unity Catalog, but not volumes within the Unity Catalog. Volumes are storage areas for files (e.g. .txt files or .csv files) rather than tables. You can learn more about volumes on [the Databricks documentation site](https://docs.databricks.com/en/volumes/index.html) or on our [Databricks fundamentals](ADA/databricks_fundamentals.html) page. To access a volume, you will also need to use a personal cluster.
 
 ---
 
@@ -154,7 +158,7 @@ A personal access token is a security measure that acts as an identifier to l
 - Name the token, then click 'Generate'
 
 ::: callout-note
-Note that access tokens will only last as long as the value for the 'Lifetime (days)' field. After this period the token will expire, and you will need to create a new one to re-authenticate.
+Note that access tokens will only last as long as the value for the 'Lifetime (days)' field. After this period the token will expire, and you will need to create a new one to re-authenticate. Access tokens also expire if they are unused after 90 days. For this reason, we recommend setting the Lifetime value to be 90 days or less. 
::: - Make a note of the 'Databricks access token' it has given you From 440caeee2269dc738292706ecc398ee0786eb383 Mon Sep 17 00:00:00 2001 From: MACHIN Date: Fri, 13 Sep 2024 15:59:34 +0100 Subject: [PATCH 10/10] Git --- ADA/databricks_rstudio_sql_warehouse.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ADA/databricks_rstudio_sql_warehouse.qmd b/ADA/databricks_rstudio_sql_warehouse.qmd index 5722dc7..cf02173 100644 --- a/ADA/databricks_rstudio_sql_warehouse.qmd +++ b/ADA/databricks_rstudio_sql_warehouse.qmd @@ -31,7 +31,7 @@ A compute resource allows you to run your code using cloud computing power inste A SQL Warehouse is a SQL-only compute option which is quick to start and optimised for SQL querying. Although the name "warehouse" suggests storage, a SQL Warehouse in Databricks is actually a virtual computing resource that allows you to interact with Databricks by connecting to your data and running code. -This option is recommended if you only require SQL functionality in Databricks and is ideal if you already have existing RAP pipelines set up using SQL scripts in a git repo. +This option is recommended if you only require SQL functionality in Databricks and is ideal if you already have existing RAP pipelines set up using SQL scripts in a Git repo. SQL Warehouses do not support R, Python or Scala code. Currently they also do not support widgets within Databricks notebooks. If you want to use compute resources to run widgets or R or Python code, then you will need to use a personal cluster. There is guidance on the use of personal clusters on the [Using personal clusters with Databricks](ADA/databricks_rstudio_personal_cluster.html) page.
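
Once the connection is set up as described above, data can be pulled into a dataframe with standard `DBI` functions. The sketch below assumes the environment variables described earlier are set; the catalog, schema and table names are placeholders that you should replace with tables you have access to in the Unity Catalog:

```r
library(odbc)
library(DBI)

con <- DBI::dbConnect(
  odbc::databricks(),
  httpPath = Sys.getenv("DATABRICKS_SQL_PATH")
)

# catalog_name.schema_name.table_name is a placeholder - substitute
# a table from your own area of the Unity Catalog
df <- DBI::dbGetQuery(
  con,
  "SELECT * FROM catalog_name.schema_name.table_name LIMIT 10"
)

DBI::dbDisconnect(con)
```

`dbGetQuery()` runs the SQL on the warehouse and returns the result as a dataframe, so only the rows you ask for are transferred to your laptop.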