Build a Pipeline
This page describes how to author pipelines and components, and submit them to the Pipeline system to run.
We suggest using the JupyterHub instance installed in the same cluster as the Pipeline system to author pipelines and components. Steps:
- Follow the instructions to deploy a Pipeline cluster, then run kubectl proxy to connect to the cluster. You should see the Pipelines UI.
- Click "Notebooks" in the left menu. If this is your first visit to JupyterHub, sign in with any user name (the password can be blank). Then click the "Spawn" button to create a new instance. After a few minutes, you will see the Jupyter UI.
- Download the notebooks from https://github.com/kubeflow/pipelines/tree/master/samples/notebooks, then upload them from the Jupyter UI (in Jupyter, go to the tree view and find the "Upload" button in the top right area).
- Open the uploaded notebooks and make sure you are using Python 3 (the Python version appears at the top right corner of the notebook view). You can now run the notebooks.
Note: The notebooks don't work on Jupyter instances outside the cluster, because the Python library they use communicates with the Pipeline system through in-cluster service names.
Notebooks:
- [KubeFlow Pipeline Using TFX OSS Components](https://github.com/kubeflow/pipelines/blob/master/samples/notebooks/KubeFlow%20Pipeline%20Using%20TFX%20OSS%20Components.ipynb): demonstrates building a machine learning pipeline from TFX components. The pipeline includes a TFDV step to infer the schema, a TFT preprocessor, a TensorFlow trainer, a TFMA analyzer, and a model deployer that deploys the trained model to tf-serving in the same cluster. It also shows how to build a Python 3 based component inside the notebook, including building a Docker container.
- Lightweight Python components: demonstrates building simple Python 3 based components and using them in a pipeline with fast iteration. Going this route, building a component does not require building a Docker container, so it is faster; however, the container image may not be self-contained because the source code is not built into it.
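As a sketch of the lightweight route: the component logic is an ordinary Python 3 function, which you can iterate on locally before wrapping it into a pipeline step. The `add` function below is illustrative, not taken from the sample notebook, and the kfp wrapping call named in the comment is an assumption about your SDK version:

```python
# Illustrative lightweight component logic: a plain Python 3 function.
# With the kfp SDK, a function like this can be turned into a pipeline
# step without building a Docker container (e.g. via
# kfp.components.func_to_container_op in newer SDK releases -- check
# your installed SDK for the exact API).
def add(a: float, b: float) -> float:
    """Return the sum of two numbers. Keep any imports inside the
    function body, since only the function's source is shipped."""
    return a + b

# The same function runs locally, which is what makes iteration fast.
print(add(2.5, 4.0))
```

Because no container build is involved, the edit-run loop is just re-executing the notebook cell that defines the function.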
If you prefer the traditional command-line experience, you can also set things up yourself. For now, however, you won't be able to use the Python SDK to submit pipelines to the cluster, nor to build container images with the SDK. The DSL compiler works as usual.
Python 3.5 or above is required. If you don't have Python 3 set up, we suggest the following steps to install Miniconda.
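Before installing anything, you can check whether an existing interpreter already meets the requirement with a quick standard-library check:

```python
import sys

# The Pipeline SDK requires Python 3.5 or above.
ok = sys.version_info >= (3, 5)
print(sys.version.split()[0], "(OK)" if ok else "(too old; install Miniconda)")
```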
In a Debian/Ubuntu/Cloud shell environment:
apt-get update; apt-get install -y wget bzip2
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
In a Windows environment, download the installer and make sure you select the "Add Miniconda to my PATH environment variable" option during the installation.
In a Mac environment, download the installer and run the following command:
bash Miniconda3-latest-MacOSX-x86_64.sh
Create a clean Python 3 environment:
conda create --name mlpipeline python=3.6
source activate mlpipeline
If the conda command is not found, be sure to add the Miniconda path:
export PATH=MINICONDA_PATH/bin:$PATH
Run the following:
pip3 install https://storage.googleapis.com/ml-pipeline/release/0.1.1/kfp.tar.gz --upgrade
After successful installation, the command dsl-compile should be available on your PATH.
The pipelines are written in Python, but they must be compiled to an intermediate representation before being submitted to the Kubeflow Pipelines service:
dsl-compile --py [path/to/py/file] --output [path/to/output/tar.gz]
For example:
dsl-compile --py [ML_REPO_DIRECTORY]/samples/basic/sequential.py --output [ML_REPO_DIRECTORY]/samples/basic/sequential.tar.gz
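The .py file passed to the compiler is a plain Python script that declares the pipeline with the kfp DSL. Below is a minimal sketch in the style of samples/basic/sequential.py; the step names, image, and commands are illustrative rather than copied from the sample, and the kfp import is guarded so the sketch also runs where the SDK is not installed:

```python
# Minimal sketch of what a pipeline .py file contains. Names, image,
# and commands are illustrative. The kfp import is guarded so this
# file can be inspected even without the SDK installed.
try:
    import kfp.dsl as dsl
except ImportError:
    dsl = None
    print("kfp SDK not installed; see the pip3 install step above.")

if dsl is not None:
    @dsl.pipeline(name='sequential',
                  description='Two steps that run one after the other.')
    def sequential_pipeline():
        echo1 = dsl.ContainerOp(
            name='echo-1',
            image='library/bash:4.4.23',
            command=['sh', '-c'],
            arguments=['echo "step one"'])
        echo2 = dsl.ContainerOp(
            name='echo-2',
            image='library/bash:4.4.23',
            command=['sh', '-c'],
            arguments=['echo "step two"'])
        echo2.after(echo1)  # enforce the sequential ordering
```

Running dsl-compile against a file like this produces the .tar.gz archive that the next step uploads.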
Upload the generated .tar.gz file through the Kubeflow Pipelines UI.
See how to build your own pipeline components.