Standardize your data science development environment with this simple Docker image.
Feel free to use this repository as template to customize a stack for your own team.
This guide will help beginners set up a common data science tech stack on Anaconda3, Jupyter and Databricks from scratch.
I included Databricks, to demonstrate how to do integrate cloud-scale data science workflows which most organizations typically have.
Make sure to install a git client on your machine.
Chances are your editor or IDE of choice already has a git plugin. We recommend using that to streamline your workflow.
If you prefer to work on the command line or a standalone GUI here's the link to git clients for Windows and MacOS users.
If this is the first time you installed Git, make sure to set up your global user config. These settings will appear in all of your commits. Please replace the command below with your company email and name.
git config --global user.email "[email protected]"
git config --global user.name "James Faeldon"
Once your Git client is set up, clone this project.
git clone https:///github.com/faeldon/data-science-stack
Install Docker Desktop For Windows and MacOS uses. We need this to create the stack from scratch.
The installation would require a DockerHub account. Just signup for free, if you don't have an existing account.
If you have successfully installed Docker Desktop, then the docker
command should be available from your Terminal (MacOS) or Command
Prompt (Windows).
You can run the following command to verify your Docker installation.
docker info
If you're setting up Databricks you need to create an access token.
Make sure you have access to a Databricks workspace and a cluster.
Log in to databricks and follow these instructions.
Create a file called databricks.env
inside the root of the project
directory containing your databricks setup. An example is shown below.
Make sure to replace the token value with your access token from
step 3. Jupyter will use this config to connect to your Databricks
workspace.
DATABRICKS_HOST=southeastasia.azuredatabricks.net
DATABRICKS_PORT=8787
DATABRICKS_CLUSTER_ID=0604-041034-yip666
DATABRICKS_ORG_ID=12312312312312312
DATABRICKS_TOKEN=<put_your_token_here>
Run the docker-compose
command below.
docker-compose up
The build process would take several minutes to download and install packages the first time it runs. After the Docker image is created on your local machine, the next time you run the above command should be quick.
The image is persisted on your local machine and can be used across
different projects. You can run docker images
to check stored images
on your machine.
docker images
Most of the libraries we need are pre-installed and configured. If you need to install other 3rd-party packages feel free to edit the Dockerfile.
At the very end of the output log of step 5 shows a link to the Jupyter notebook -- similar to the example below.
[C 10:34:15.520 NotebookApp]
To access the notebook, open this file in a browser:
file:///root/.local/share/jupyter/runtime/nbserver-6-open.html
Or copy and paste one of these URLs:
http://(33f93e2264e5 or 127.0.0.1):8888/?token=71a0e2ea6efbdbbf3dca75e647a601ba93190c5be56fef5f
Open your browser on http://127.0.0.1:8888 and input the token string.
For example in the above log the token string is
71a0e2ea6efbdbbf3dca75e647a601ba93190c5be56fef5f
.
All the files inside this project will be available on the Jupyter workspace.
Hit CTRL+C on the running Docker container anytime you want to stop the server.
If you want tinker inside the docker container for debugging or customizing the image.
docker run -it faeldon/data-science-stack /bin/bash
Here is another useful command to test if your Databricks connection is set up properly.
docker run -it faeldon/data-science-stack /bin/bash -c "databricks-connect test"
Contributions are always welcome, no matter how large or small. Before contributing, please read the code of conduct.