NOTE: This repository has been archived by the owner on May 29, 2024. It is now read-only.

ml-ops-poc

Table of Contents

  • Summary
  • References
  • Features
  • Getting started

Summary

Repository showcasing ML Ops practices with Kubeflow and MLflow

References

Features

  • Deployment of Azure Kubernetes Service (AKS) clusters
  • Kubeflow operator or MLflow Helm chart installations in deployed AKS clusters
  • CD workflow for on-demand AKS deployments and Kubeflow operator or MLflow Helm chart installations
  • CD workflow for on-demand deployments of an Azure Storage Account container (for storing Terraform state files)
  • CD workflow for on-demand Azure Container Registry deployments in order to store internal Docker images
  • CI workflow for building internal Docker images and uploading them to an Azure Container Registry
  • CD workflows for internal Helm chart installations in deployed AKS clusters
  • devcontainer.json with the necessary tooling for local development
  • Python (PyTorch or TensorFlow) application for ML training and inference purposes, plus Jupyter notebooks
    • Simple feedforward neural network trained on the MNIST dataset to map input images to their corresponding digit classes
    • CNN architecture training and inference on the COCO dataset for image-classification applications (NOTE: compute- and storage-intensive; read the Download the COCO dataset images comments on preferred hardware specs)
    • (OPTIONAL) Transformer architecture training based on pre-trained models for chatbot applications
  • Dockerizing the Python (PyTorch or TensorFlow) applications for ML training and inference
  • Helm charts with K8s manifests for ML jobs using the Training Operator CRDs
  • Installation of the Training Operator and applying sample TFJob and PyTorchJob K8s manifests
  • Demonstration of model training and model deployment through automation workflows
  • (OPTIONAL) MLflow experiments for the machine learning lifecycle
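
The feedforward MNIST classifier mentioned above can be sketched as follows. This is a minimal illustration, not the repository's exact model: the hidden-layer size and class name are assumptions.

```python
# Minimal sketch of a feedforward network mapping 28x28 MNIST images
# to 10 digit classes. Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class MnistMlp(nn.Module):
    """Simple feedforward network: (N, 1, 28, 28) image batch -> (N, 10) logits."""
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Flatten(),              # (N, 1, 28, 28) -> (N, 784)
            nn.Linear(28 * 28, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 10),     # logits for digits 0-9
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)

if __name__ == "__main__":
    model = MnistMlp()
    batch = torch.randn(4, 1, 28, 28)  # stand-in for a batch of MNIST images
    logits = model(batch)
    print(logits.shape)                # torch.Size([4, 10])
```

In training, the logits would typically be fed to `nn.CrossEntropyLoss` together with the integer digit labels.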

NOTE: Steps 4 to 7 in the digits-recognizer-kubeflow GH repository are not showcased here. These sections focus on saving the ML model in MinIO once it has been successfully built and trained, and on serving the trained model through KServe's inference HTTP service. The relevant files are:

Getting started

GitHub workflows are utilized in this repository. Once the workflows described in the Preconditions and Deploy an AKS cluster and install the Kubeflow or MLflow components sections have been executed successfully, all the listed resource groups should be visible in the Azure Portal UI:

Deployed resource groups

Preconditions

  1. Deploy an Azure Storage Account including a container for Terraform backends through the terraform.yml workflow with the INFRASTRUCTURE_OPERATIONS option storage-account-backend-deploy

Deploy an AKS cluster and install the Kubeflow or MLflow components

  1. Deploy an AKS cluster through the terraform.yml workflow with the INFRASTRUCTURE_OPERATIONS option k8s-service-deploy. An Azure Container Registry is part of the deployment and stores internal Docker images
  2. Optional: Install the ML Ops tools to an existing Kubernetes cluster through the terraform.yml workflow with the INFRASTRUCTURE_OPERATIONS option ml-ops-tools-install

NOTE:

  • Set all the required GitHub secrets for the above workflows
  • In order to access the deployed AKS cluster locally, launch the devcontainer and retrieve the necessary kube config as shown in the GitHub workflow step titled Download the ~/.kube/config

kubeflow

To access the Kubeflow dashboard after installing kustomize and the Kubeflow components, identify the relevant pods and forward a local port to the istio-ingressgateway service:

kubectl get pods -A
kubectl port-forward -n <namespace>  <pod-name> <local-port>:<server-port>
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80

Finally, open http://localhost:8080 in a browser of choice and log in with the default user's credentials. The default email address is user@example.com and the default password is 12341234.

kubeflow-dashboard

Jupyter notebooks

When creating the Jupyter notebook instance consider the following data volume:

Jupyter instance data volume

The volumes that were created appear as follows:

Jupyter instance created volumes

The Jupyter instance that was created appears as follows:

Created Jupyter instance

NOTE: You can check the status of the Jupyter instance pods:

Check jupyter instance pods

Once connected to a Jupyter instance, make sure to clone this Git repository (HTTPS URL: https://github.com/MGTheTrain/ml-ops-ftw.git):

Clone git repository

You should then have the repository cloned in your workspace:

Cloned git repository in jupyter instance

Execute a Jupyter notebook to either train the model or perform inference (it is preferable to begin with mnist-trainnig.ipynb; the others are either resource-intensive or not yet implemented):

Run jupyter notebook example

Applying TFJob or PyTorchJob k8s manifests

After successfully installing the Kubeflow Training Operator, apply some sample K8s ML training jobs, e.g. for PyTorch and for TensorFlow:

# PyTorch
kubectl create -f https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml
# TensorFlow
kubectl create -f https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/tensorflow/simple.yaml
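
The upstream samples are instances of the Training Operator's PyTorchJob and TFJob CRDs. A minimal PyTorchJob manifest has roughly the following shape; the image, command, and replica counts here are illustrative assumptions, not the exact contents of the upstream file:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-simple
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch   # the container must be named "pytorch"
              image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
              command: ["python3", "/opt/pytorch-mnist/mnist.py", "--epochs=1"]
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
              command: ["python3", "/opt/pytorch-mnist/mnist.py", "--epochs=1"]
```

The operator watches for these resources and launches one pod per replica, wiring up the distributed-training environment variables between Master and Worker.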

mlflow

To access the MLflow dashboard after installing the MLflow Helm chart, execute the following command:

kubectl port-forward -n ml-ops-ftw <mlflow pod name> 5000:5000

and visit localhost:5000 in a browser of choice.

mlflow-dashboard

Destroy the AKS cluster or uninstall the ML tools

  1. Optional: Uninstall only the ML tools from an existing Kubernetes cluster through the terraform.yml workflow with the INFRASTRUCTURE_OPERATIONS option ml-ops-tools-uninstall
  2. Destroy the AKS cluster through the terraform.yml workflow with the INFRASTRUCTURE_OPERATIONS option k8s-service-destroy