
TorchServe v0.10.0 Release Notes

Released by @lxning on 15 Mar.

This is the release of TorchServe v0.10.0.

Highlights include

  • Extended support for PyTorch 2.x inference
  • C++ backend
  • GenAI fast series torch.compile showcase examples
  • Token authentication support for enhanced security

C++ Backend

TorchServe presented the experimental C++ backend at the PyTorch Conference 2022. Like the Python backend, the C++ backend runs as a separate process and uses the BaseHandler to define the APIs for customizing the handler. With a backend and handler written in pure C++, it is now possible to deploy PyTorch models without any Python overhead. This release officially promoted the C++ backend from the experimental branch to the master branch and included additional examples and Docker images for development.

torch.compile

With the launch of PT2 Inference at the PyTorch Conference 2023, we have added several key examples showcasing out-of-the-box speedups with torch.compile and AOT compile. Since there is no new development being done on TorchScript, starting with this release TorchServe is preparing a migration path for customers to switch from TorchScript to torch.compile.
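
The sketch below is a minimal, generic illustration of that migration path, compiling the same eager model with both torch.jit.script and torch.compile; the toy model is a placeholder rather than one of the release's examples (which use densenet161 and ResNet-18).

```python
import torch
import torch.nn as nn

# Illustrative toy model; the release's examples use densenet161 / ResNet-18.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).eval()
example = torch.randn(8, 16)

# TorchScript path (no new development planned upstream).
scripted = torch.jit.script(model)

# torch.compile path: the same eager model, compiled lazily on first call.
compiled = torch.compile(model, backend="inductor")

with torch.inference_mode():
    out_ts = scripted(example)
    out_pt2 = compiled(example)  # first call triggers compilation
    assert torch.allclose(out_ts, out_pt2, atol=1e-5)
```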

GenAI torch.compile series

This release showcases the fast series of GenAI models - GPTFast, SegmentAnythingFast, and DiffusionFast - which deliver 3-10x speedups using torch.compile and native PyTorch optimizations.

Cold start problem solution

To address the cold start problem, an example is included that shows how torch._export.aot_load (an experimental API) can be used to load a pre-compiled model. TorchServe has also started benchmarking models with torch.compile and tracking their performance relative to TorchScript.
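
Below is a minimal sketch of this pattern, assuming PyTorch 2.2's experimental torch._export.aot_compile and torch._export.aot_load APIs; the toy model and device are illustrative and not the release's ResNet-18 example.

```python
import torch

# Illustrative toy model; the release's example pre-compiles ResNet-18.
model = torch.nn.Linear(16, 4).eval()
example_inputs = (torch.randn(8, 16),)

# Ahead-of-time compile to a shared library (done once, offline);
# a fixed output path can be set via the `options` argument.
so_path = torch._export.aot_compile(model, example_inputs)

# Later (e.g. in a handler's initialize()), load the pre-compiled artifact
# instead of compiling at startup, avoiding the cold start cost.
loaded = torch._export.aot_load(so_path, device="cpu")
with torch.inference_mode():
    print(loaded(*example_inputs).shape)
```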

The new TorchServe C++ backend also includes torch.compile and AOTInductor examples for ResNet50, BERT, and Llama2.

  1. torch.compile
    a. Example torch.compile with image classifier model densenet161 #2915 @agunapal
    b. Example torch._export.aot_compile with image classification model ResNet-18 #2832 #2906 #2932 #2948 @agunapal
    c. Example torch inductor fx graph caching with image classification model densenet161 #2925 @agunapal (see the caching sketch after this list)

  2. C++ AOTInductor
    a. Example AOT Inductor with Llama2 #2913 @mreso
    b. Example AOT Inductor with ResNet-50 #2944 @lxning
    c. Example AOT Inductor with BERTSequenceClassification #2931 @lxning
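
For item 1c, the following is a minimal sketch of enabling Inductor's FX graph cache with torch.compile; the toy model is illustrative, and the cache can alternatively be enabled through the TORCHINDUCTOR_FX_GRAPH_CACHE environment variable.

```python
import torch
import torch._inductor.config as inductor_config

# Enable Inductor's FX graph cache so a fresh process can reuse previously
# compiled artifacts instead of recompiling from scratch.
inductor_config.fx_graph_cache = True

model = torch.nn.Linear(16, 4).eval()  # placeholder for densenet161
compiled = torch.compile(model, backend="inductor")

with torch.inference_mode():
    compiled(torch.randn(8, 16))  # first call populates the cache
```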

Gen AI

  • Supported sequence batching for stateful inference in gRPC bi-directional streaming #2513 @lxning
  • The fast series Gen AI models using torch.compile and native PyTorch optimizations.
  • Example Mistral 7B with vLLM #2781 @agunapal
  • Example PyTorch native tensor parallel with Llama2 with continuous batching #2709 @mreso @HamidShojanazeri
  • Supported Inferentia2 (inf2) Neuronx transformer continuous batching for both no-code and advanced customers, with a Llama2-70B example #2803 #3016 @lxning
  • Example DeepSpeed MII FastGen with Llama2-13B #2779 @lxning

Security

TorchServe has implemented token authentication for the management and inference APIs. This is an optional config and can be enabled using the torchserve-endpoint-plugin, which can be downloaded from Maven. This further strengthens TorchServe's capability as a secure model serving solution. The security features of TorchServe are documented in the project's security documentation.

Apple Silicon Support

TorchServe is now supported on Apple Silicon Macs. The current support is CPU only. We have also posted an RFC for the deprecation of x86 Mac support.

KServe Updates

When serving large models, model loading can take some time even after the pod is running: TorchServe is up, but a worker is not ready until the model has been loaded. To address this, TorchServe now sets the model ready status in KServe only after the model has been loaded on the workers. TorchServe also includes native open inference protocol support in gRPC; this is an experimental feature.
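
As an illustration of how a client can observe the ready status, the sketch below polls the KServe v1 REST protocol's model readiness endpoint; the service hostname and model name are placeholders, not part of the release.

```python
import requests  # assumes the `requests` package is available

# Hypothetical service address; substitute your InferenceService hostname.
BASE_URL = "http://torchserve.default.example.com"
MODEL_NAME = "mnist"

# KServe v1 protocol readiness endpoint: `ready` becomes true only after
# TorchServe has finished loading the model on its workers.
resp = requests.get(f"{BASE_URL}/v1/models/{MODEL_NAME}", timeout=5)
resp.raise_for_status()
print(resp.json())  # e.g. {"name": "mnist", "ready": true}
```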

  • Supported native KServe open inference protocol in gRPC #2609 @andyi2it
  • Refactored TorchServe configuration in KServe #2995 @sgaist
  • Improved KServe protocol version handling #2957 @sgaist
  • Updated KServe test script to return model version #2973 @agunapal
  • Set model status using TorchServe API in KServe #1878 @byeongjokim
  • Supported no-archive model archiver in KServe #2839 @agunapal
  • How to deploy MNIST using KServe with minikube #2718 @agunapal

Metrics Updates

To extend backwards compatibility for metrics, auto-detection of backend metrics now gives handlers the flexibility to publish custom model metrics without explicitly specifying them in the metrics configuration file. A customized script for collecting system metrics is also now supported.
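
As a minimal sketch of what this auto-detection enables, the hypothetical handler below emits a custom counter through the context metrics API without declaring it in the metrics configuration file; the handler class and metric name are illustrative.

```python
# Hypothetical custom handler; the class and metric names are illustrative.
from ts.torch_handler.base_handler import BaseHandler


class MyHandler(BaseHandler):
    def preprocess(self, data):
        # With backend metric auto-detection, this counter is published even
        # though it is not declared in the metrics configuration file.
        self.context.metrics.add_counter("PreprocessCallCount", value=1)
        return super().preprocess(data)
```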

Improvements and Bug Fixes

Documentation

Platform Support

Ubuntu 20.04, MacOS 10.14+, Windows 10 Pro, Windows Server 2019, and Windows Subsystem for Linux (Windows Server 2019, WSLv1, Ubuntu 18.04). TorchServe now requires Python 3.8 and above, and JDK 17.

GPU Support Matrix

| TorchServe version | PyTorch version | Python | Stable CUDA | Experimental CUDA |
|---|---|---|---|---|
| 0.10.0 | 2.2.1 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
| 0.9.0 | 2.1 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
| 0.8.0 | 2.0 | >=3.8, <=3.11 | CUDA 11.7, CUDNN 8.5.0.96 | CUDA 11.8, CUDNN 8.7.0.84 |
| 0.7.0 | 1.13 | >=3.7, <=3.10 | CUDA 11.6, CUDNN 8.3.2.44 | CUDA 11.7, CUDNN 8.5.0.96 |

Inferentia2 Support Matrix

| TorchServe version | PyTorch version | Python | Neuron SDK |
|---|---|---|---|
| 0.10.0 | 1.13 | >=3.8, <=3.11 | 2.16+ |
| 0.9.0 | 1.13 | >=3.8, <=3.11 | 2.13.2+ |