Welcome to w261 - Machine Learning at Scale
In this class, on top of learning about the Machine Learning models used in industry, you will be using production-grade technology and infrastructure-deployment best practices. For about two thirds of the class, you will work in an orchestrated environment on Google Cloud. For the last third, you will get the opportunity to use Databricks on Azure.
A read-only GitHub repository will be used as a source of code for Homework and Live Session Labs.
While authenticated to GitHub, please navigate to github/personal_tokens to obtain a personal access token. You will need it for the automation script mentioned below. Add a note such as `w261 GCP` or similar to keep track of this token. Lastly, check the `repo` box, which provides full control of private repositories; all the boxes underneath it will automatically show as checked. Please be aware that you will only be able to see and copy this token once, so you may want to save a local copy of it on your Windows/Mac machine.
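As a quick illustration only (the organization and repository names below are placeholders, not the class repo), a personal access token can stand in for a password when cloning over HTTPS:

```
# Clone a repository over HTTPS using a personal access token.
# <TOKEN>, <ORG>, and <REPO> are placeholders; substitute your own values.
git clone https://<TOKEN>@github.com/<ORG>/<REPO>.git
```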
Google Cloud is a state-of-the-art platform for Big Data analytics and orchestration. The service used in w261 is Dataproc, the main Cloud API for orchestrating Big Data Hadoop and Spark clusters.
Dataproc offers a plug-and-play kind of cluster orchestration for Hadoop and Spark. JupyterLab comes out of the box through the GoogleUserContent front end, which is highly secure and keeps us from exposing our VM with an external IP address.
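To give a sense of what the automation script assembles under the covers, here is a minimal sketch of creating a Dataproc cluster with the Jupyter component enabled. The cluster name, region, and sizing below are assumptions for illustration; the class script takes care of the real configuration:

```
# Sketch: a small Dataproc cluster with JupyterLab exposed via the Component Gateway.
# Name, region, worker count, and idle timeout are placeholders, not the class defaults.
gcloud dataproc clusters create w261-example \
    --region=us-central1 \
    --optional-components=JUPYTER \
    --enable-component-gateway \
    --num-workers=2 \
    --max-idle=2h
```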
Google offers $300 in credits for new accounts. Log in to the GCP Console and take advantage of this offer by clicking the top banner that shows the promotion. You can create a Gmail account if you don't have an email account that is part of the Google Suite. You must have an account with a Billing Account set up before running the automated orchestration.
Note: Accepting this offer involves providing a credit card, but it will not be charged automatically once the credits are depleted. You will have an opportunity to decide whether to continue before GCP charges your credit card.
For this class, we will be using a single automation script that will help us navigate some of the complexity of the cloud and compute world.
The first step is to open the GCP Console and click the terminal icon `>_` in the top blue bar.
This opens a panel at the bottom of the screen, which is your Cloud Shell. It is serverless compute with 5 GB of allocated storage, and it is a great tool that acts as a bridge between all the components we will be using in w261. From here, using the automation script, you will be able to deploy clusters, load data into Buckets, and pull code from the main repo. The best part of Cloud Shell is that it's free.
Running the automation script on Cloud Shell guarantees having the appropriate dependencies, packages, and environment.
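Before running the automation script, a couple of standard Cloud Shell commands (optional, not part of the class setup) confirm which project and buckets your session is pointed at:

```
# Show the project this Cloud Shell session is configured to use.
gcloud config get-value project

# List the Cloud Storage buckets visible to your account.
gsutil ls
```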
The script prepares a Google Project with all the artifacts needed to work in a secure environment with Dataproc. Please take a look at the documentation in Create Dataproc Cluster for a look inside the orchestration under the covers.
Run the following command and follow the prompts:

```
gsutil cat gs://w261-hw-data/w261_env.sh | bash -euo pipefail
```
This script will take longer to run the first time, before you have deployed any cluster. Once all the components are deployed, subsequent runs will skip the orchestration and create clusters on demand directly, although the script will always check that all components are installed. To run the script, follow the prompts: after you run the command line above, press Q to exit the Welcome screen and begin running the actual script. You will have to respond y to the first question (Do you want to proceed?) and then respond to some of the following questions. Please run the script again until you see in the prompts that a cluster was successfully created.
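If you want to confirm from Cloud Shell that a cluster exists, you can also list your Dataproc clusters directly. This is an optional check; adjust the region to the one the script used:

```
# List Dataproc clusters in a region; the region below is a placeholder.
gcloud dataproc clusters list --region=us-central1
```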
You can see your clusters in GCP Dataproc. If you don't see your cluster, switch to the `w261-student` project in the top blue GCP bar. Remember you will be consuming credits on a per-second basis. The orchestration was put together with this in mind, and following best practices, $300 should be more than enough.
It's up to you whether you want to delete the cluster directly or let the `max-idle` feature kick in (you will select that every time you create a cluster: 1h, 2h, 3h, 6h, 12h, 24h).
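If you prefer to tear a cluster down manually instead of waiting for `max-idle`, the standard Dataproc delete command looks like this (the cluster name and region are placeholders; use your own):

```
# Delete a Dataproc cluster immediately to stop consuming credits.
gcloud dataproc clusters delete w261-example --region=us-central1
```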
- Once you open JupyterLab, navigate to the root folder, where you will see two folders: `GCS` and `Local Disk`. We will work on `Local Disk` for HW1 and HW2 and all the first Labs before turning to Spark. The automation script makes sure the files are properly loaded as long as you have run the script at least once.
- When working on a Notebook, get the full path where the notebook is located, and then add a new cell at the very top like this one:

  ```
  %cd /full/path/to/the/notebook
  ```
- To get the data for the HWs, comment out the previous commands that pulled the data (such as `!curl`, `!wget`, and similar) and add a new cell that obtains the data from the GCS Data Bucket created by the first automation script:

  ```
  !mkdir -p data/
  !gsutil cp gs://<your-data-bucket>/main/Assignments/HW2/data/* data/
  ```

  Feel free to explore where the data is for a specific HW with `gsutil ls gs://<your-data-bucket>/main/Assignments/HW*`. If you don't remember your GCS Data Bucket, run `gsutil ls` to get a list of the Buckets in your account.
- For Hadoop, the new location of the `JAR_FILE` is:

  ```
  JAR_FILE = '/usr/lib/hadoop/hadoop-streaming-3.2.2.jar'
  ```
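  As an illustration only (the mapper, reducer, and HDFS paths below are hypothetical placeholders, not taken from a specific HW), `JAR_FILE` is typically passed to a Hadoop Streaming job from a notebook cell like this:

  ```
  # Sketch of a Hadoop Streaming run; mapper.py, reducer.py, and the paths are placeholders.
  !hadoop jar {JAR_FILE} -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input {HDFS_DIR}/input -output {HDFS_DIR}/output
  ```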
- For debugging, go to Dataproc -> Clusters -> Web Interfaces and look for:
  - MapReduce Job History for Hadoop job logs.
  - Spark History Server for Spark job logs.
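  If you prefer the command line, the aggregated logs for a finished job can also be pulled with YARN from a terminal on the master node (the application ID below is a placeholder; take the real one from the job output or the YARN UI):

  ```
  # Fetch aggregated logs for a completed YARN application.
  yarn logs -applicationId application_1234567890123_0001
  ```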
- In Jupyter, when running `mkdir`, use `-p` to make sure you create the entire path in case the inner folders don't exist:

  ```
  !hdfs dfs -mkdir -p {HDFS_DIR}
  ```
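  A typical follow-up (assuming the HW data was already copied into a local `data/` folder as shown above) is to load that data into the HDFS directory and confirm it landed there:

  ```
  # Copy local files into HDFS and list the directory to verify.
  !hdfs dfs -put data/* {HDFS_DIR}/
  !hdfs dfs -ls {HDFS_DIR}
  ```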
- Spark UI for Current Notebook
  - The Spark UI for current jobs and the Notebook can be accessed via SSH directly into the Master Node.
  - Open the Cloud Shell.
  - Get the zone where your Master node is located. Adjust the name of your instance; you can also assign the value directly if it is already known.

    ```
    ZONE=$(gcloud compute instances list --filter="name~w261" --format "value(zone)")
    ```

  - SSH into the VM using your Cloud Shell. It can also be done from your local terminal, or from the Google Cloud SDK if running Windows. Adjust the name of your instance if different.

    ```
    gcloud compute ssh w261-m --ssh-flag "-L 8080:localhost:42229" --zone $ZONE
    ```

  - Click the `Web Preview` button at the top right of the Cloud Shell panel. We mapped this port to 8080, which is the default port number that `Web Preview` uses.
  - By default, Dataproc runs the Spark UI on port `42229`. Adjust accordingly if using a different port. To get the port number, open a new cell and run the variable `spark` (if a SparkSession is already established); you'll see the UI link. Hover over the link and note the port number.
  - Keep the Cloud Shell alive by running `sleep 1800`, or whatever number you feel comfortable with, to keep the tunnel open.
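  - If you'd rather read the UI address programmatically instead of hovering over the link, PySpark exposes it on the SparkContext. A minimal sketch, assuming an active SparkSession named `spark`:

    ```
    # Print the full Spark UI URL (host and port) for the current application.
    print(spark.sparkContext.uiWebUrl)
    ```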