THIS IS NOT AN OFFICIAL GOOGLE PRODUCT, USE WITH CAUTION!
The repository contains code used for learning purposes; there may be breaking changes or unoptimized parts. The current version is 0.2.0, an alpha release.
This repository covers all the code required to deploy a Retail YOLO model, built with the ultralytics framework, under Torchserve onto Vertex AI as an autoscalable endpoint.
References:
- Deploying Torchserve on Vertex AI
- Vertex AI - custom container requirements
- YOLOv8 Retail - under GPL-3.0 License
There is also the option of a TensorRT container served via a FastAPI WebSocket, aimed either at deployment on an instance with a GPU (PoC-style) or at NVIDIA Triton Inference Server (production-ready, in which case the WebSocket server would become a separate microservice); see the `./ws` directory.
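For illustration, a minimal sketch of such a WebSocket inference endpoint; the `detect` helper and the `/infer` route are placeholders, not the actual `./ws` code:

```python
# ws_sketch.py - illustrative only, not the actual ./ws implementation
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()


def detect(frame_bytes: bytes) -> dict:
    """Placeholder for the model call.

    In the real service this would run the TensorRT engine (or query
    Triton) and return boxes, scores and classes for the frame."""
    raise NotImplementedError


@app.websocket("/infer")
async def infer(ws: WebSocket):
    await ws.accept()
    try:
        while True:
            # Each message is one binary-encoded frame (e.g. JPEG bytes)
            frame = await ws.receive_bytes()
            await ws.send_json(detect(frame))
    except WebSocketDisconnect:
        pass
```

Such a sketch could be served locally with `uvicorn ws_sketch:app` for quick experiments.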
Requirements:
- Python >3.11 with `google-cloud-aiplatform` installed
- The `gcloud` tool installed, with the project configured and the Vertex AI and Cloud Build APIs enabled
- The packages listed in `requirements.txt`
- Tools such as `curl`, `jq` and `base64` for sending requests using the Bash scripts
- The items listed in Deployment and Inference
- `docker`, ideally with NVIDIA capabilities (`nvidia-docker2`)
These are used during the local development phase; the final artifact is the custom Torchserve container, ready to be deployed on Vertex AI.
- `mar.sh` generates a model archive file for Torchserve
- `run-container.sh` runs a local deployment of the custom container
- `test-ts.sh` sends requests to the local Torchserve deployment
- `Dockerfile` contains the manifest of the Torchserve container that runs the YOLOv8 model trained on the SKU110K dataset
- `handler.py` is the custom handler that overrides the default Torchserve `BaseHandler` (a stripped-down sketch is shown after this list)
- `cloud-build.sh` is a Bash script that submits a Cloud Build job and returns the custom container image to be deployed onto Vertex AI
- `deploy.py` is a Python script that deploys the custom container onto Vertex AI:
  - it uploads the model to the Vertex AI Model Registry
  - it creates a new Vertex AI Endpoint
  - it deploys the uploaded model to the Endpoint with an accelerator (an NVIDIA T4 was used in this example)
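For reference, a stripped-down sketch of a handler overriding the Torchserve `BaseHandler` hooks; the input keys, the 640x640 resize and the output handling below are illustrative assumptions, the repository's `handler.py` is the source of truth:

```python
# Illustrative TorchServe custom handler; not the repository's handler.py.
import io

import torch
from PIL import Image
from torchvision.transforms.functional import to_tensor
from ts.torch_handler.base_handler import BaseHandler


class YoloHandler(BaseHandler):
    def preprocess(self, data):
        # Each request row carries raw image bytes under "data" or "body"
        # (the real handler also deals with base64-encoded JSON payloads).
        images = []
        for row in data:
            payload = row.get("data") or row.get("body")
            img = Image.open(io.BytesIO(payload)).convert("RGB").resize((640, 640))
            images.append(to_tensor(img))
        return torch.stack(images)

    def inference(self, data, *args, **kwargs):
        with torch.no_grad():
            return self.model(data.to(self.device))

    def postprocess(self, data):
        # One JSON-serializable entry per image in the batch; assumes the
        # model returns a tensor whose first dimension is the batch.
        return [out.detach().cpu().tolist() for out in data]
```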
Running the scripts below requires a deployed endpoint with the model being served. Before running them, amend the `ENDPOINT_ID` and `PROJECT_ID` configuration variables.
- `infer.py` is a Python script for running inference on images or videos using the deployed Vertex AI Endpoint and plotting the results. It uses OpenCV for playing the videos and annotating the images; it requires updating the `image_paths` list and optionally the `VIDEO_FOOTAGE` variable (a minimal request sketch is shown after this list)
- `infer.sh` is a Bash script for sending sample requests (single image, batch) to the deployed Vertex AI Endpoint
- `retail-yolo.pt` is the YOLOv8 model fine-tuned on the SKU110K retail dataset
- `res.json` is a sample response from the deployed Vertex AI Endpoint
- `sample_request.json` is a sample request for running single-image inference
- `sample_request_batch.json` is a sample request for running batch inference
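As an illustration of what `infer.py`/`infer.sh` do, a minimal single-image request via the `google-cloud-aiplatform` SDK could look like the sketch below; the region and the `"data"` instance key are assumptions, check `sample_request.json` for the exact format the handler expects:

```python
# Sketch: calling the deployed Vertex AI Endpoint with one base64 image.
import base64

from google.cloud import aiplatform

PROJECT_ID = "my-project"     # from the Vertex AI console
ENDPOINT_ID = "1234567890"    # from the Vertex AI console
LOCATION = "us-central1"      # placeholder region

aiplatform.init(project=PROJECT_ID, location=LOCATION)
endpoint = aiplatform.Endpoint(ENDPOINT_ID)

with open("shelf.jpg", "rb") as f:
    payload = base64.b64encode(f.read()).decode("utf-8")

# The "data" key is an assumption; mirror sample_request.json in practice.
response = endpoint.predict(instances=[{"data": payload}])
print(response.predictions)
```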
Be sure to:
- Ensure Python 3.11 is installed and install the required packages
- Ensure the Google Cloud APIs are enabled
- Run `pytest` to check that the custom handler behaves correctly; the `test_handler.py` script checks the intermediate output between the `preprocess`, `inference` and `postprocess` overridden Torchserve methods of the custom handler. It has flags that can be amended to save intermediate output in `.pkl` format so that it can be inspected and debugged during development (a sketch of such a check is shown after this list).
- Run `run-container.sh` to build the custom container and `test-ts.sh` to check that requests go through; the `test-ts.sh` script checks file upload requests, single image requests and batch image requests.
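A sketch of the kind of unit check `test_handler.py` performs; the class name, sample file path and expected tensor shape are illustrative assumptions, not the actual test code:

```python
# test_handler_sketch.py - illustrative pytest-style check of the preprocess hook.
import torch

from handler import YoloHandler  # assumes the handler class lives in handler.py


def test_preprocess_returns_batched_tensor():
    handler = YoloHandler()
    with open("tests/sample.jpg", "rb") as f:  # hypothetical fixture image
        request = [{"body": f.read()}]
    batch = handler.preprocess(request)
    assert isinstance(batch, torch.Tensor)
    assert batch.ndim == 4  # (batch, channels, height, width)
```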
Ensure the config params in the `cloud-build.sh` and `deploy.py` scripts are correct, then run `./cloud-build.sh && python deploy.py`
This takes roughly 10-15 minutes and results in an endpoint ready to serve traffic.
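The core of the deployment step with the `google-cloud-aiplatform` SDK looks roughly like the sketch below; the image URI, routes, port, region, machine type and replica counts are placeholders, the actual values live in `cloud-build.sh` and `deploy.py`:

```python
# Sketch of deploy.py: upload the custom Torchserve container as a Model,
# create an Endpoint and deploy the model onto it with a T4 accelerator.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholders

model = aiplatform.Model.upload(
    display_name="retail-yolo-torchserve",
    serving_container_image_uri="us-central1-docker.pkg.dev/my-project/repo/retail-yolo:latest",
    serving_container_predict_route="/predictions/retail-yolo",  # assumption
    serving_container_health_route="/ping",                      # Torchserve health route
    serving_container_ports=[8080],
)

endpoint = aiplatform.Endpoint.create(display_name="retail-yolo-endpoint")

model.deploy(
    endpoint=endpoint,
    machine_type="n1-standard-8",        # placeholder; pick a type that supports T4 attachment
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=2,                 # autoscaling upper bound (assumption)
    traffic_percentage=100,
)
```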
The deployment can then be tested using the `infer.sh` and `infer.py` scripts; see the Vertex AI console to grab the `ENDPOINT_ID` and `PROJECT_ID` that are required for inference.
The Python `infer.py` script usage is `python infer.py --visualize --use_mask` for running on the images in the `image_paths` list and `python infer.py --visualize --track` for running on video footage.
Vertex AI also exposes a gRPC API that runs over HTTP/2 and supports streaming binary objects.
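For example, the same endpoint can be reached through the gRPC-based `PredictionServiceClient` using `raw_predict`; the region and IDs below are placeholders and this is a sketch rather than part of the repository:

```python
# Sketch: hitting the endpoint through the gRPC PredictionService
# instead of the REST predict route.
from google.api import httpbody_pb2
from google.cloud import aiplatform_v1

client = aiplatform_v1.PredictionServiceClient(
    client_options={"api_endpoint": "us-central1-aiplatform.googleapis.com"}
)

endpoint = client.endpoint_path(
    project="my-project", location="us-central1", endpoint="1234567890"
)

with open("sample_request.json", "rb") as f:
    body = httpbody_pb2.HttpBody(content_type="application/json", data=f.read())

response = client.raw_predict(endpoint=endpoint, http_body=body)
print(response.data)
```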
In order to achieve a 99%+ detection rate, I would highly recommend fine-tuning the YOLO model on a specialized dataset.
It also makes sense to introduce additional classes, such as 'empty spot' and 'improperly placed product', alongside the 'retail product' class.
The model already achieves good results, and fine-tuning is not a complex task: even with ~100 data points in the fine-tuning dataset, a major improvement for a specific store should be achievable.
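A minimal fine-tuning sketch with the ultralytics API, assuming a hypothetical `store_dataset.yaml` describing the store-specific images; the hyperparameters and the export format are assumptions to be matched to whatever `handler.py` actually loads:

```python
# Sketch: fine-tuning the checkpoint on a small store-specific dataset.
from ultralytics import YOLO

model = YOLO("retail-yolo.pt")      # start from the SKU110K-finetuned weights
model.train(
    data="store_dataset.yaml",      # ~100 annotated images of the target store (hypothetical)
    epochs=50,
    imgsz=640,
)
model.export(format="torchscript")  # re-export in the format the handler expects (assumption)
```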
There are also useful features such as Benchmarking and Batch Prediction available through the Vertex AI console.
Inference takes about 30-50 ms per frame on an n2-standard-8 machine with an NVIDIA Tesla T4. This can be further improved by compressing the frames (sending large frames adds significant latency depending on network conditions) and by using TensorRT to export an optimized model engine for accelerated inference that leverages the CUDA Tensor Cores.
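As a sketch of the frame-compression idea, the payload size can be cut by resizing and re-encoding frames as JPEG before base64-encoding them; the 640x640 size and quality value are assumptions:

```python
# Sketch: shrinking a frame before sending it to the endpoint.
# Lower JPEG quality / smaller resolution cuts request size and network latency.
import base64

import cv2

frame = cv2.imread("shelf.jpg")
frame = cv2.resize(frame, (640, 640))  # match the model input size (assumption)
ok, buf = cv2.imencode(".jpg", frame, [int(cv2.IMWRITE_JPEG_QUALITY), 80])
payload = base64.b64encode(buf.tobytes()).decode("utf-8")
# `payload` then goes into the request instance, as in sample_request.json
```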