diff --git a/ADA/databricks_fundamentals.qmd b/ADA/databricks_fundamentals.qmd
index 6e3e0ca..aa634f9 100644
--- a/ADA/databricks_fundamentals.qmd
+++ b/ADA/databricks_fundamentals.qmd
@@ -200,51 +200,4 @@ All compute options can be used both within the Databricks platform and be conne
 - [Setup Databricks SQL Warehouse with RStudio](/ADA/databricks_rstudio_sql_warehouse.html)
 - [Setup Databricks Personal Compute cluster with RStudio](/ADA/databricks_rstudio_personal_cluster.html)
 
----
-
-### Creating a personal compute resource
-
-------------------------------------------------------------------------
-
-1. To create your own personal compute resource click the 'Create with DfE Personal Compute' button on the compute page\
-
-![](/images/ada-compute-personal.png)
-
-2. You'll then be presented with a screen to configure the cluster. There are 2 options here under the performance section which you will want to pay attention to; Databricks runtime version, and Node type\
-    \
-    **Databricks runtime version** - This is the version of the Databricks software that will be present on your compute resource. Generally it is recommended you go with the latest LTS (long term support) version. At the time of writing this is '15.4 LTS'\
-    \
-    **Node type** - This option determines how powerful your cluster is and there are 2 options available by default:\
-
-    - Standard 14GB 4-Core Nodes\
-    - Large 28GB 8-Core Nodes\
-    \
-    If you require a larger personal cluster this can be requested by the ADA team.\
-    \
-    ![](/images/ada-compute-personal-create.png)
-
-3. Click the 'Create compute' button at the bottom of the page. This will create your personal cluster and begin starting it up. This usually takes around 5 minutes\
-    \
-    ![](/images/ada-compute-personal-create-button.png)
-
-4. Once the cluster is up and running the icon under the 'State' header on the 'Compute' page will appear as a green tick\
-    \
-    ![](/images/ada-compute-ready.png)
-
-::: callout-note
-## Clusters will shut down after being idle for an hour
-
-Use of compute resources are charged by the hour, and so personal clusters have been set to shut down after being unused for an hour in order to prevent unnecessary cost to the Department.
-:::
-
-::: callout-important
-## Packages and libraries
-
-As mentioned above compute resources have no storage of their own. This means that if you install libraries or packages onto a cluster they will only remain installed until the cluster is stopped. Once re-started those libraries will need to be installed again.
-
-An alternative to this is to specify packages/libraries to be installed on the cluster at start up. To do this click the name of your cluster from the 'Compute' page, then go to the 'Libraries' tab and click the 'Install new' button.
-
-Certain packages are installed by default on personal cluster and do not need to be installed manually. The specific packages installed are based on the Databricks Runtime (DBR) version your cluster is set up with. A comprehensive list of packages included in each DBR is available in the [Databricks documentation](https://learn.microsoft.com/en-us/azure/databricks/release-notes/runtime/).
-:::
-
 Once you have a compute resource you can begin using Databricks. You can do this either through connecting to Databricks through RStudio, or you can begin coding in the Databricks platforms using scripts, or [Notebooks](/ADA/databricks_notebooks.html).
diff --git a/ADA/databricks_rstudio_personal_cluster.qmd b/ADA/databricks_rstudio_personal_cluster.qmd
index 64b5c14..faa7f8b 100644
--- a/ADA/databricks_rstudio_personal_cluster.qmd
+++ b/ADA/databricks_rstudio_personal_cluster.qmd
@@ -4,107 +4,234 @@

-The following instructions set up an ODBC connection between your laptop and your DataBricks cluster, which can then be used in R/RStudio to query data using an ODBC based package or `sparklyr`. Personal clusters are able to run SQL, R, python and scala. They can be used within the DataBricks environment, or through R studio and can be set up yourself if you don't have access to a SQL warehouse or shared cluster.
+The following instructions set up an ODBC connection between your laptop and your Databricks cluster, which can then be used in RStudio to query data using an ODBC-based package or `sparklyr`. Personal clusters are able to run SQL, R, Python and Scala. They can be used within the Databricks environment or through RStudio, and you can set one up yourself if you don't have access to a SQL warehouse or shared cluster.

------------------------------------------------------------------------
+::: callout-note
+
+Please note: This guidance should be followed if you wish to run R scripts from RStudio against data held in tables in Databricks, or if you wish to work with a file held in a Databricks volume. You can read more about volumes on our [Databricks fundamentals page](/ADA/databricks_fundamentals.html#volumes). If you only need to run SQL scripts against Databricks data, then we would suggest setting up a [SQL warehouse compute connection](/ADA/databricks_rstudio_sql_warehouse.html) instead.
+:::
+
+
+You can use data from Databricks with R code in two different ways:
-# Pre-requisites
+
+- In scripts or notebooks via the Databricks environment
+- In RStudio via an ODBC connection
 
 ------------------------------------------------------------------------
 
+## Pre-requisites
+
 You must have:
 
-- Access to Databricks
-- Access to a personal cluster on DataBricks
+- Access to Databricks and the data you'll be working with
+- Access to a personal cluster on Databricks
 - R and RStudio downloaded
 
 ------------------------------------------------------------------------
 
-# Downloading an ODBC driver
+## Compute resources
+
+When your data is moved to Databricks, it will be stored in the Unity Catalog and you will need to use a compute resource to access it from other software such as RStudio.
+
+A compute resource allows you to run your code using cloud computing power instead of your laptop's processing power. This can allow your code to run faster than it would locally, as it is like using the processing power of multiple computers at once. On this page, we will be referring to the use of personal clusters as the compute resource to run your code.
+
+------------------------------------------------------------------------
+
+### Personal clusters
+
+---
+
+A personal cluster is a compute resource that supports the use of multiple code languages (R, SQL, Scala and Python). You can create your own personal cluster within the Databricks interface.
+
+When you set up your personal cluster, you will be asked to select a runtime for that cluster. Different runtimes allow you to use different features and package versions. Certain packages are installed by default on personal clusters and do not need to be installed manually. The specific packages installed are based on the Databricks Runtime (DBR) version your cluster is set up with. A comprehensive list of packages included in each DBR is available in the [Databricks documentation](https://learn.microsoft.com/en-us/azure/databricks/release-notes/runtime/).
+
+Compute resources, including personal clusters, have no storage of their own. This means that if you install libraries or packages onto a cluster, they will only remain installed until the cluster is stopped. Once it is restarted, those libraries will need to be installed again.
+
+An alternative to this is to specify packages/libraries to be installed on the cluster at start up. To do this, click the name of your cluster on the 'Compute' page, then go to the 'Libraries' tab and click the 'Install new' button.
+
+::: callout-note
+## Clusters will shut down after being idle for an hour
+
+Use of compute resources is charged by the hour, so personal clusters have been set to shut down after being unused for an hour in order to prevent unnecessary cost to the Department.
+:::
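+
+If you are unsure whether a package is already available on your cluster's runtime, a quick check run from an R notebook or script attached to the cluster can save you an unnecessary install. The snippet below is a minimal sketch using base R only; `dplyr` is just an example package name.
+
+``` r
+# Run on the cluster itself: report the R version shipped with the DBR
+R.version.string
+
+# TRUE if the package is already installed on the cluster's runtime
+"dplyr" %in% rownames(installed.packages())
+```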
+
+---
+
+## Process
+
+There are four steps to complete before your connection can be established. These are:
+
+- Creating a personal compute resource (if you do not already have one)
+- Installing an ODBC driver on your laptop to enable a connection between your laptop and Databricks
+- Modifying your .Renviron file to establish a connection between RStudio and Databricks
+- Adding connection code to your existing scripts in RStudio
+
+---
+
+### Creating a personal compute resource
+
+------------------------------------------------------------------------
+
-1. Install the 'Simba Spark ODBC' driver from the software centre.
+1. To create your own personal compute resource, click the 'Create with DfE Personal Compute' button on the compute page\
+
-    i) Open the Software Centre via the start menu.
+![](/images/ada-compute-personal.png)
+
-    ii) In the 'Applications' tab, click 'Simba Spark ODBC Driver 64-bit'. ![](../images/databricks-software-centre.png)
+2. You'll then be presented with a screen to configure the cluster. There are 2 options here under the performance section which you will want to pay attention to: Databricks runtime version, and Node type\
+    \
+    **Databricks runtime version** - This is the version of the Databricks software that will be present on your compute resource. Generally, it is recommended that you go with the latest LTS (long term support) version. At the time of writing this is '15.4 LTS'\
+    \
+    **Node type** - This option determines how powerful your cluster is, and there are 2 options available by default:\
-    iii) Click install.
+
+    - Standard 14GB 4-Core Nodes\
+    - Large 28GB 8-Core Nodes\
+    \
+    If you require a larger personal cluster, this can be requested from the ADA team.\
+    \
+    ![](/images/ada-compute-personal-create.png)
+
-2. Get connection details for the cluster from Databricks. To set up the connection you will need a few details from your cluster within DataBricks.
+3. Click the 'Create compute' button at the bottom of the page. This will create your personal cluster and begin starting it up. This usually takes around 5 minutes\
+    \
+    ![](/images/ada-compute-personal-create-button.png)
+
-    i) Login to [Databricks](https://adb-6882499576863257.17.azuredatabricks.net/?o=6882499576863257)
+4. Once the cluster is up and running, the icon under the 'State' header on the 'Compute' page will appear as a green tick\
+    \
+    ![](/images/ada-compute-ready.png)
-    ii) Click on the 'Compute' tab in the sidebar. ![](../images/databricks-compute.png)
-    iii) Click on the name of the cluster you want to connect to, and click the 'Advanced options' at the bottom of the cluster page.
+
+------------------------------------------------------------------------
-    iv) Click the 'JDBC/ODBC' tab under 'Advanced options'
+
+## Setting up the ODBC driver
-    v) Make a note of the 'Server hostname', 'Port', and 'HTTP Path'.
+
+::: callout-important
+If you have previously set up an ODBC connection, or followed the [set up Databricks SQL Warehouse with RStudio](/ADA/databricks_rstudio_sql_warehouse.html) guidance, then you can skip this step.
+:::
-3. Get a personal access token from Databricks for authentication.
+
+- Open the Software Centre via the start menu
-    i) In Databricks, click on your email address in the top right corner, then click 'User settings'.
+
+- In the 'Applications' tab, click `Simba Spark ODBC Driver 64-bit`
-    ii) Go to the 'Developer' tab in the side bar. Next to 'Access tokens', click the 'Manage' button.
-
-        ![](../images/databricks-access-tokens.png)
+
+::: {align="center"}
+![](../images/databricks-software-centre.png)
+:::
-    iii) Click the 'Generate new token' button.
+
+- Click install
-    iv) Name the token, then click 'Generate'. *Note that access tokens will only last as long as the value for the 'Lifetime (days)' field. After this period the token will expire, and you will need to create a new one to re-authenticate.*
-    v) Make a note of the 'Databricks access token' it has given you. **It is important to copy this somewhere as you will not be able to see it through Databricks again.**
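+
+Once the driver is installed, you can optionally check that R can see it. This is a minimal sanity check using the `odbc` package's driver listing; the exact driver name shown may vary slightly between versions.
+
+``` r
+# List the ODBC drivers available on this machine;
+# 'Simba Spark ODBC Driver' should appear after installation
+odbc::odbcListDrivers()
+```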
+
+------------------------------------------------------------------------
-4. Setup ODBC connection from your laptop. We now have all the information we need to setup a connection between our laptop and DataBricks.
+
+### Establishing an RStudio connection using environment variables
-    i) In the start menu, search 'ODBC' and open 'ODBC Data Sources (64-bit)'.
+
+------------------------------------------------------------------------
-    ii) On the 'User DSN' tab click the 'Add...' button.
+
+The `ODBC` package in RStudio connects to Databricks using three environment variables, which you create and modify in your .Renviron file.
-    iii) In the 'Create New Data Source' window, select 'Simba Spark ODBC Driver' and click 'Finish'.
+
+::: callout-note
+If you have previously established a connection between a SQL Warehouse and RStudio, then some of these variables will already be in your .Renviron file.
+:::
+
+To set the environment variables, call `usethis::edit_r_environ()`. You will then need to enter the following information:
-    iv) In the 'Simba Spark ODBC Driver DSN Setup' window,
+
+```
+DATABRICKS_HOST=
+DATABRICKS_CLUSTER_PATH=
+DATABRICKS_TOKEN=
+
-        a. Enter a 'Data Source Name' and 'Description'. Choose a short and sensible data source name and note it down as this is what you will use to connect to Databricks through RStudio. *As you can set up more than one cluster on Databricks, use the description to make clear which cluster this connection is for. The description shown below describes that this connection is using an 8 core cluster on Databricks Runtime Environment 13.*
-        b. Set the remaning options to the settings below.
-            - Enter the 'Server Hostname' for your cluster in the 'Host(s):' field (you noted this down in step 2).
-            - In the Port section, remove the default number and use the Port number you noted in step 2.
-            - Set the Authentication Mechanism to 'User Name and Password'.
-            - Enter the word 'token' into the 'User Name:' field, then enter your 'Databricks access token' in the 'Password:' field.
-            - Change the Thrift Transport option to HTTP.
-            - Click the 'HTTP Options...' button and enter the 'HTTP Path' of your Databricks cluster, then click 'Okay'.
-            - Click the 'SSL Options...' button and tick the 'Enable SSL' box, then click the 'OK' button.
-
-            ![](../images/odbc-driver-settings.png)
-        c. Click the 'Test' button to verify the connection has worked. You should see the following message. *If you get an error here, repeat steps 5.e.i -- 5.e.ix again and ensure all the values are correct.*
+
+```
-
-        ![](../images/databricks-test-connection.png)
+
+Once you have entered the details, save and close your .Renviron file and restart R (Session > Restart R).
-
-        d. Click the 'OK' button to exit the 'Test Results' window, then the 'OK' button in the 'Simba Spark ODBC Driver DSN Setup' window.
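+
+As an illustration only, a filled-in .Renviron follows the general shape below. These values are made-up placeholders rather than real credentials; substitute the values you gather in the sections that follow.
+
+```
+DATABRICKS_HOST=https://adb-0000000000000000.0.azuredatabricks.net
+DATABRICKS_CLUSTER_PATH=0000-000000-abcdefgh
+DATABRICKS_TOKEN=dapi00000000000000000000000000000000
+```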
+::: callout-note
+Everyone in your team that wishes to connect to the data in Databricks and run your code must set up their .Renviron file individually, otherwise their connection will fail.
+:::
+
-5. Connect through RStudio. Watch the below video and view the [ADA_RStudio_connect GitHub repo](https://github.com/dfe-analytical-services/ADA_RStudio_connect) for methods on connecting to Databricks and querying data from RStudio.
+The sections below describe where to find the information needed for each of the environment variables.
 
 ------------------------------------------------------------------------
 
-# Pulling data into R studio from Databricks
+#### Databricks host
 
 ------------------------------------------------------------------------
 
-Once you have set up an ODBC connection as detailed above, you can then use that connection to pull data directly from Databricks into R Studio. Charlotte recorded a video demonstrating two possible methods of how to do this. The recording is embedded below:
+The Databricks host is the instance of Databricks that you want to connect to. It's the URL that you see in your browser bar when you're on the Databricks site, and it should end in "azuredatabricks.net" (ignore anything after this section of the URL).
+
+---
+
+#### Databricks cluster path
+
+---
+
+In Databricks, go to Compute in the left-hand menu, and click on the name of your personal cluster:
+
 ::: {align="center"}
-
+![](../images/databricks-cluster-id.png)
 :::
+
+------------------------------------------------------------------------
+
+#### Databricks token
+
+------------------------------------------------------------------------
+
+The Databricks token is a personal access token.
+
+A personal access token is a security measure that acts as an identifier to let Databricks know who is requesting data from the cluster. Access tokens are usually set for a limited amount of time, so they will need renewing periodically.
+
+- In Databricks, click on your email address in the top right corner, then click 'User settings'
+
+- Go to the 'Developer' tab in the side bar. Next to 'Access tokens', click the 'Manage' button
+
+::: {align="center"}
+![](../images/databricks-access-tokens.png)
+:::
+
+- Click the 'Generate new token' button
+
+- Name the token, then click 'Generate'
+
+::: callout-note
+Note that access tokens will only last as long as the value for the 'Lifetime (days)' field. After this period the token will expire, and you will need to create a new one to re-authenticate. Access tokens also expire if they are unused for 90 days. For this reason, we recommend setting the Lifetime value to 90 days or less.
+:::
+
+- Make a note of the 'Databricks access token' it has given you
+
+::: callout-warning
+It is very important that you immediately copy the access token that you are given, as you will not be able to see it through Databricks again. If you lose this access token before pasting it into RStudio, then you must generate a new access token to replace it.
+:::
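+
+Before moving on, it is worth confirming that all three variables are actually being picked up from your .Renviron after the restart. A minimal check using base R only is sketched below; each call should return a non-empty value.
+
+``` r
+# These should print the values you saved, not empty strings
+Sys.getenv("DATABRICKS_HOST")
+Sys.getenv("DATABRICKS_CLUSTER_PATH")
+
+# Avoid printing the token itself; just confirm it is set
+nzchar(Sys.getenv("DATABRICKS_TOKEN"))
+```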
+
+
+------------------------------------------------------------------------
+
+### Pulling data into RStudio from Databricks
+
+------------------------------------------------------------------------
+
+Now that you have enabled ODBC connections on your laptop and established a connection between Databricks and RStudio, you can add code to your existing scripts to pull data into RStudio for analysis. If you have connected to databases before, this code will look quite familiar to you.
+
+To access the data, we will make use of the `sparklyr` package. You will also need to have the `dbplyr` package installed.
+
+Include the following code in your R script:
+
+``` r
+library(tidyverse)
+library(sparklyr)
+library(dbplyr)
+
+# Connect to your personal cluster, identified by the cluster ID
+# stored in your .Renviron file
+sc <- spark_connect(
+  cluster_id = Sys.getenv("DATABRICKS_CLUSTER_PATH"),
+  method = "databricks_connect"
+)
-A template of all of the code used in the above video can be found in the [ADA_RStudio_connect GitHub repo](https://github.com/dfe-analytical-services/ADA_RStudio_connect).
+
+# Reference a table and preview the first few rows;
+# I() passes the qualified catalog.schema.table name through as-is
+tbl(
+  sc,
+  I("catalog_10_gold.information_schema.catalogs")
+) |>
+  head()
+
+```
-Key takeaways from the video and example code:
-
-- The main change here compared to connecting to SQL databases is the connection method. The installation and setup of the ODBC driver are all done pre-code, and the only part of the code that will need updating is your connection (usually your con variable).
-- If your existing code was pulling in tables from SQL via the `RODBC` package or the `dbplyr` package, then this code should in theory run with minimal edits needed.
-- If you were writing tables back into SQL from R, this is where your code may need the most edits.
-- If your code is stored in a repo where multiple analysts contribute to and run the code, in order for the code to run for everyone you will all need to individually install the ODBC driver and **give it the same name** so that when the `con` variable is called, the name used in the code matches everyone's individual driver and runs for everyone. If this is the case, please **add a note about this to your repo's readme file** to help your future colleagues.
\ No newline at end of file
diff --git a/ADA/databricks_rstudio_sql_warehouse.qmd b/ADA/databricks_rstudio_sql_warehouse.qmd
index b7da6d5..c329ee6 100644
--- a/ADA/databricks_rstudio_sql_warehouse.qmd
+++ b/ADA/databricks_rstudio_sql_warehouse.qmd
@@ -77,6 +77,10 @@ An ODBC driver is required for the `ODBC` package in R to work - you must instal
 
 ---
 
+::: callout-important
+If you have previously set up an ODBC connection, or followed the [set up Databricks personal compute cluster with RStudio](/ADA/databricks_rstudio_personal_cluster.html) guidance, then you can skip this step.
+:::
+
 - Open the Software Centre via the start menu
 
 - In the 'Applications' tab, click `Simba Spark ODBC Driver 64-bit`
diff --git a/images/databricks-cluster-id.png b/images/databricks-cluster-id.png
new file mode 100644
index 0000000..7b40ce7
Binary files /dev/null and b/images/databricks-cluster-id.png differ