v1.1.0
We are excited to announce the release of AI on GKE v1.1! This release brings several new features, improvements, and bug fixes to enhance your experience with running AI workloads on Google Kubernetes Engine (GKE).
Highlights
AI on GKE Quick Starts
Get started with popular AI frameworks and tools using new quick start guides for RAG, Ray, and Jupyter notebooks on GKE.
RAG
Retrieval Augmented Generation (RAG) is a technique used to give Large Language Models (LLMs) additional context related to a prompt. RAG has many benefits, including providing external information (e.g., from knowledge repositories) and introducing “grounding”, which helps the LLM generate an appropriate response.
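To make the pattern concrete, here is a self-contained toy sketch of the retrieve-then-generate flow; the in-memory documents and word-overlap “retriever” are illustrative stand-ins for the embedding model and vector store a real stack would use:

```python
# Toy sketch of the RAG flow: retrieve relevant context, then prepend it to
# the prompt so the model's answer is grounded in that context.
DOCS = [
    "GKE supports GPU and TPU node pools for AI workloads.",
    "pgvector stores embeddings inside Cloud SQL for PostgreSQL.",
    "Ray schedules distributed Python tasks across a cluster.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    """Rank documents by naive word overlap with the question."""
    words = set(question.lower().split())
    return sorted(DOCS, key=lambda d: -len(words & set(d.lower().split())))[:k]

def grounded_prompt(question: str) -> str:
    """Build the augmented prompt an LLM would receive."""
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(grounded_prompt("How are embeddings stored for RAG?"))
```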
The new quick start deploys a RAG stack on a new or existing GKE cluster using open source tools and frameworks such as Ray, LangChain, Hugging Face Text Generation Inference (TGI), and Jupyter notebooks. The model used for inference is Mistral-7B. The solution uses the Cloud Storage FUSE CSI driver to load the input dataset quickly and the Cloud SQL pgvector extension to store the generated vector embeddings for RAG. It includes features such as authenticated access to your application via Identity-Aware Proxy, sensitive data protection, and text moderation. See the README to get started.
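For a flavor of the pgvector piece, below is a minimal sketch of storing and querying embeddings in a pgvector-enabled PostgreSQL database. The connection details, table schema, and four-dimensional embedding are placeholders; the quick start provisions and wires up the real Cloud SQL instance for you.

```python
import psycopg2  # pip install psycopg2-binary

# Hypothetical connection; the quick start provisions the real Cloud SQL
# instance and credentials.
conn = psycopg2.connect(host="127.0.0.1", dbname="rag", user="rag", password="...")
cur = conn.cursor()

# One-time setup: enable pgvector and create a table for document chunks.
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute(
    "CREATE TABLE IF NOT EXISTS chunks ("
    "id bigserial PRIMARY KEY, text text, embedding vector(4))"
)

# Store a chunk with its embedding (real stacks use hundreds of dimensions).
emb = [0.1, 0.2, 0.3, 0.4]
cur.execute(
    "INSERT INTO chunks (text, embedding) VALUES (%s, %s::vector)",
    ("GKE supports TPU node pools.", str(emb)),
)

# Retrieve the chunks nearest to a query embedding (L2 distance).
cur.execute(
    "SELECT text FROM chunks ORDER BY embedding <-> %s::vector LIMIT 3",
    (str(emb),),
)
print(cur.fetchall())
conn.commit()
```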
Ray
Ray is an open-source framework for easily scaling Python applications across multiple nodes in a cluster. It provides a simple API for building distributed, parallelized applications, especially for machine learning.
KubeRay enables Ray to be deployed on Kubernetes: you get Ray’s unified, Pythonic experience alongside the enterprise reliability and scale of GKE’s managed Kubernetes. Together, they offer scalability, fault tolerance, and ease of use for building, deploying, and managing distributed applications.
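As a taste of that API, the following minimal example fans a function out across a Ray cluster as parallel tasks; it runs unchanged on a laptop or on a cluster deployed with the quick start below:

```python
import ray

ray.init()  # starts Ray locally, or connects to a configured cluster

@ray.remote
def square(x: int) -> int:
    """Runs as a distributed task on whichever node has free CPU."""
    return x * x

# Fan out 8 tasks across the cluster and gather the results.
futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]
```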
The new quick start deploys KubeRay on a new or existing GKE cluster along with a sample Ray cluster. See the README to get started.
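Once the sample Ray cluster is running, one way to send it work is Ray’s job submission API. This sketch assumes you have port-forwarded the Ray dashboard to localhost; the service name and entrypoint are placeholders:

```python
from ray.job_submission import JobSubmissionClient

# Assumes the Ray dashboard has been port-forwarded locally, e.g.:
#   kubectl port-forward svc/<your-ray-head-service> 8265:8265
client = JobSubmissionClient("http://127.0.0.1:8265")

# Submit an entrypoint to run on the cluster; Ray tracks status and logs.
job_id = client.submit_job(
    entrypoint='python -c "import ray; ray.init(); print(ray.cluster_resources())"'
)
print(f"Submitted {job_id}; follow it with: ray job logs --follow {job_id}")
```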
Jupyter
JupyterHub is a powerful, multi-tenant, server-based web application that allows users to interact with and collaborate on Jupyter notebooks. Users can create custom computing environments, with custom images and computational resources, in which to run their notebooks. “Zero to JupyterHub for Kubernetes” (z2jh) is a Helm chart for installing JupyterHub on Kubernetes that provides numerous configuration options for complex user scenarios.
The new quick start solution sets up JupyterHub on GKE. Running your Jupyter notebooks and JupyterHub on GKE provides a way to prototype your distributed, compute-intensive ML applications with security and scalability built in as core elements of the platform. See the README to get started.
Ray on GKE guide
Dive deeper into running Ray workloads on GKE with comprehensive guides and tutorials covering various use cases and best practices. See the Ray on GKE README to get started. We’ve also included a new user guide for TPU multihost and Multislice support with Ray.
Inference Benchmarks
Evaluate and compare the performance of different AI models and frameworks on GKE using the newly added inference benchmarks. The benchmarks cover popular LLMs such as Gemma, Llama 2, and Falcon, as well as other models available on Hugging Face, and support multiple model servers, including Text Generation Inference (TGI) and Triton with TensorRT-LLM. You can measure the performance of these models and model servers on various GPU types in GKE. To get started, refer to the README.
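As a rough illustration of what the benchmarks measure, the sketch below times requests against a TGI endpoint. The URL, prompt, and request count are placeholders; the actual benchmarks automate this kind of measurement across models, servers, and GPU types.

```python
import statistics
import time

import requests

URL = "http://tgi.example.internal/generate"  # placeholder TGI Service URL
PROMPT = "Explain Kubernetes in one sentence."

latencies = []
for _ in range(10):
    start = time.perf_counter()
    resp = requests.post(
        URL,
        json={"inputs": PROMPT, "parameters": {"max_new_tokens": 64}},
        timeout=120,
    )
    resp.raise_for_status()
    latencies.append(time.perf_counter() - start)

print(f"p50 latency: {statistics.median(latencies):.2f}s")
print(f"mean latency: {statistics.mean(latencies):.2f}s")
```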
Guides, Tutorials and Examples
LLM Guides
We’ve introduced the following guides for serving LLMs on GKE:
- Guide to Serving Mistral 7B-Instruct v0.1 on GKE Utilizing NVIDIA L4 GPUs
- Guide to Serving the Mixtral 8x7B Model on GKE Utilizing NVIDIA L4 GPUs
- RAG with Weaviate and Vertex AI
GKE ML Platform
Introducing the first MVP in the GKE ML Platform Solution, featuring:
- An opinionated GKE platform for AI/ML workloads
- A sample deployment of Ray
- Infrastructure automated through Terraform, with GitOps for cluster configuration management
- Parallel data processing using Ray, accelerating the notebook-to-cluster experience
- A sample data processing script for a publicly available dataset using Ray (see the sketch after this list)
- Resources:
  - Automated Deployment via Terraform: github.com/GoogleCloudPlatform/ai-on-gke/tree/main/best-practices/ml-platform
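For a hint of what the Ray-based data processing looks like, here is a minimal Ray Data sketch. The input path and column name are hypothetical; the sample script in the repo is the authoritative version and targets a publicly available dataset.

```python
import numpy as np
import ray

ray.init()  # from a platform notebook, this connects to the Ray cluster

# Hypothetical input path and column name.
ds = ray.data.read_csv("gs://your-bucket/reviews.csv")

def clean_batch(batch: dict) -> dict:
    """Normalize a text column; batches arrive as dicts of NumPy arrays."""
    batch["text"] = np.array([t.strip().lower() for t in batch["text"]])
    return batch

# map_batches fans the transformation out across the cluster in parallel.
cleaned = ds.map_batches(clean_batch, batch_format="numpy")
print(cleaned.take(2))
```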
TPU Provisioner
This release introduces the TPU Provisioner, a controller that automatically provisions new TPU node pools based on the requirements of pending pods and deprovisions them when they are no longer in use. See the README for how to get started.
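For illustration, the sketch below creates the kind of pending pod the provisioner reacts to: the TPU resource request and node selectors describe the node pool that needs to exist. The accelerator type, topology, and image are illustrative values; see the README for the selectors the controller actually honors.

```python
from kubernetes import client, config

config.load_kube_config()

# A pending pod like this is what the provisioner reacts to: the node
# selectors and TPU resource request describe the node pool it should create.
# The accelerator type, topology, and image are illustrative values only.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="tpu-training-job"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        node_selector={
            "cloud.google.com/gke-tpu-accelerator": "tpu-v4-podslice",
            "cloud.google.com/gke-tpu-topology": "2x2x1",
        },
        containers=[
            client.V1Container(
                name="train",
                image="python:3.11",
                command=["python", "-c", "print('runs once the TPUs arrive')"],
                resources=client.V1ResourceRequirements(
                    requests={"google.com/tpu": "4"},
                    limits={"google.com/tpu": "4"},
                ),
            )
        ],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```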
Bug fixes and improvements
- Reorganized folders in the ai-on-gke repo
- E2E tests for all quick start deployments are now running on Google Cloud Build
- Introduced the modules directory, which contains common Terraform modules shared across our deployments
- Renamed the gke-platform directory to infrastructure and added new features and capabilities