This repository has been archived by the owner on Nov 5, 2022. It is now read-only.

ML Pipeline Generator is a tool for generating end-to-end pipelines composed of GCP components so that any customer can easily migrate their local ML models onto GCP and start realizing the benefits of the cloud quickly.


GoogleCloudPlatform/ml-pipeline-generator-python


ML Pipeline Generator


ML Pipeline Generator is a tool for generating end-to-end pipelines composed of GCP components so that users can easily migrate their local ML models onto GCP and start realizing the benefits of the Cloud quickly.

The following ML frameworks will be supported:

  1. TensorFlow (TF)
  2. Scikit-learn (SKL)
  3. XGBoost (XGB)

The following backends are currently supported for model training:

  1. Google Cloud AI Platform
  2. AI Platform Pipelines (managed Kubeflow Pipelines)

Setup

GCP credentials

gcloud auth login
gcloud auth application-default login
gcloud config set project [PROJECT_ID]
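As a quick sanity check after running the commands above, the following sketch looks for an Application Default Credentials file. It assumes the standard lookup order (the `GOOGLE_APPLICATION_CREDENTIALS` environment variable first, then the well-known file that `gcloud auth application-default login` writes on Linux and macOS). The helper name `find_adc_file` is hypothetical and not part of ml-pipeline-gen, and the check is deliberately simplified: it ignores Windows paths and the metadata server.

```python
from pathlib import Path
from typing import Mapping, Optional


def find_adc_file(env: Mapping[str, str], home: Path) -> Optional[Path]:
    """Return the ADC credential file that would be picked up, or None.

    Hypothetical helper for illustration only; it covers the environment
    variable and the Linux/macOS well-known location, nothing else.
    """
    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit:
        # An explicit path takes precedence; a missing file is reported as None.
        path = Path(explicit)
        return path if path.is_file() else None

    # Location written by `gcloud auth application-default login` on Linux/macOS.
    well_known = home / ".config" / "gcloud" / "application_default_credentials.json"
    return well_known if well_known.is_file() else None
```

Calling `find_adc_file(os.environ, Path.home())` (after `import os`) returns a path when credentials are in place and `None` when the login step still needs to be run.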

Enabling required APIs

The tool requires the following Google Cloud APIs to be enabled:

  1. Compute Engine
  2. AI Platform Training and Prediction
  3. Cloud Storage

Enable the APIs above by following their links, or run the command below to enable them for your project.

gcloud services enable ml.googleapis.com \
compute.googleapis.com \
storage-component.googleapis.com
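If you prefer to script this step, the same command can be assembled from Python. This is a minimal sketch, assuming `gcloud` is on your `PATH`; the function names and the `my-project` ID used below are illustrative and not part of ml-pipeline-gen. It defaults to a dry run that only returns the command string.

```python
import shlex
import subprocess
from typing import List, Sequence

# Service names taken from the gcloud command above.
REQUIRED_APIS = (
    "ml.googleapis.com",
    "compute.googleapis.com",
    "storage-component.googleapis.com",
)


def build_enable_command(project_id: str,
                         apis: Sequence[str] = REQUIRED_APIS) -> List[str]:
    """Build the argument list for `gcloud services enable`."""
    return ["gcloud", "services", "enable", *apis, "--project", project_id]


def enable_apis(project_id: str, dry_run: bool = True) -> str:
    """Return the command as a string; execute it only when dry_run is False."""
    cmd = build_enable_command(project_id)
    if not dry_run:
        # Raises CalledProcessError if gcloud exits with a non-zero status.
        subprocess.run(cmd, check=True)
    return shlex.join(cmd)  # shlex.join requires Python 3.8+
```

For example, `print(enable_apis("my-project"))` shows the command without running it; pass `dry_run=False` to actually enable the APIs.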

Python environment

python3 -m venv venv
source ./venv/bin/activate
pip install ml-pipeline-gen

Kubeflow Pipelines

Create a Kubeflow Pipelines instance on AI Platform Pipelines. Once the instance is provisioned, note down the hostname (Dashboard URL).

End-to-end tutorial notebook

You can view the notebook here; it runs in a local Jupyter notebook, on Cloud AI Platform, and in Colab. It walks through how a typical user would use this tool.

Cloud AI Platform Demo

This demo uses the scikit-learn model in examples/sklearn/model/sklearn_model.py to create a training module that runs on Cloud AI Platform (CAIP). First, make a copy of the sklearn example directory.

cp -r examples/sklearn sklearn-demo
cd sklearn-demo

Create a config.yaml by using the config.yaml.example template. See the docs for details on the config parameters. Once the config file is filled out, run the demo.

python demo.py

Running this demo uses the config file to generate a trainer/ module that is compatible with CAIP.

Kubeflow Pipelines Demo

This demo orchestrates training and prediction using a TensorFlow model in examples/kfp/model/tf_model.py over Kubeflow Pipelines (hosted on AI Platform Pipelines). First, make a copy of the kfp/ example directory.

cp -r examples/kfp kfp-demo
cd kfp-demo

Create a config.yaml by using the config.yaml.example template. See the docs for details on the config parameters. Once the config file is filled out, run the demo.

python demo.py

Running this demo uses the config file to generate a trainer/ module that is compatible with CAIP. It also generates orchestration/pipeline.py, which compiles a Kubeflow Pipelines pipeline.

Note: If you're using a GKE cluster without Workload Identity configured, the tool provisions Workload Identity for the cluster, which changes the dashboard URL. If this occurs, update your config.yaml with the new Kubeflow Pipelines URL and rerun the demo.

Tests

The tests use unittest, Python's built-in unit testing framework. Running python -m unittest performs test discovery to find all tests within this project. To run tests at a more granular level, pass a directory to unittest discover. Read more about unittest here.

Unit tests:

python -m unittest discover -s tests/unit

Integration tests:

python -m unittest discover -s tests/integration

Input args

The following input args are included by default. Overwrite them by adding them as inputs in the config file.

| Arg | Description |
| --- | --- |
| train_path | Dir or bucket containing train data. |
| eval_path | Dir or bucket containing eval data. |
| model_dir | Dir or bucket to save model files. |
| batch_size | Number of rows of data to be fed into the model each iteration. |
| max_steps | The maximum number of iterations to train the model for. |
| learning_rate | Multiplier that controls how much the weights of our network are adjusted with respect to the loss gradient. |
| export_format | File format expected by the exported model at inference time. |
| save_checkpoints_steps | Number of steps to run before saving a model checkpoint. |
| keep_checkpoint_max | Number of model checkpoints to keep. |
| log_step_count_steps | Number of steps to run before logging training performance. |
| eval_steps | Number of steps to use to evaluate the model. |
| early_stopping_steps | Number of steps with no loss decrease before stopping early. |
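To make the step-based args concrete, here is a toy simulation of how max_steps, log_step_count_steps, save_checkpoints_steps, keep_checkpoint_max, and early_stopping_steps might interact during training. It is one reading of the descriptions above, not the code that the generated trainer/ module runs; the function and the `RunSummary` type are hypothetical, and the losses are supplied by the caller rather than computed by a model.

```python
from typing import Iterable, List, NamedTuple


class RunSummary(NamedTuple):
    steps_run: int
    logged_steps: List[int]
    checkpoint_steps: List[int]
    stopped_early: bool


def simulate_training(losses: Iterable[float],
                      max_steps: int,
                      log_step_count_steps: int,
                      save_checkpoints_steps: int,
                      keep_checkpoint_max: int,
                      early_stopping_steps: int) -> RunSummary:
    """Replay a sequence of per-step losses under the step-based settings."""
    best_loss = float("inf")
    steps_without_improvement = 0
    logged: List[int] = []
    checkpoints: List[int] = []
    steps_run = 0
    stopped_early = False

    for step, loss in enumerate(losses, start=1):
        if step > max_steps:
            break  # Hard cap on the number of training iterations.
        steps_run = step

        if step % log_step_count_steps == 0:
            logged.append(step)
        if step % save_checkpoints_steps == 0:
            checkpoints.append(step)
            # Retain only the most recent checkpoints.
            checkpoints = checkpoints[-keep_checkpoint_max:]

        if loss < best_loss:
            best_loss = loss
            steps_without_improvement = 0
        else:
            steps_without_improvement += 1
            if steps_without_improvement >= early_stopping_steps:
                stopped_early = True
                break

    return RunSummary(steps_run, logged, checkpoints, stopped_early)
```

With losses that stop improving after the third step and early_stopping_steps set to 3, the simulation halts at step 6 even though max_steps is far larger, and only the last keep_checkpoint_max checkpoints remain.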

Contribute

To modify the behavior of the library, install ml-pipeline-gen in editable mode with the dev extras by running the following from the root of your local clone:

pip install -e ".[dev]"
