Release TorchServe v0.7.0 Release Notes · pytorch/serve

This is the release of TorchServe v0.7.0.

New Examples

HF + Better Transformer integration #2002 @HamidShojanazeri

Better Transformer / Flash Attention & Xformer Memory Efficient provides out-of-the-box performance with major speedups for PyTorch Transformer encoders. It has been integrated into the TorchServe HF Transformer example; please read more about this integration here.

The main speedups in Better Transformer come from exploiting sparsity on padded inputs and from kernel fusions. As a result, you will see the biggest gains with larger workloads, such as sequences with more padding and larger batch sizes.

In our benchmarks on P3 instances with 4 V100 GPUs, using TorchServe benchmarking workloads, throughput improved significantly at larger batch sizes: a 45.5% increase with batch size 8, 50.8% with batch size 16, 45.2% with batch size 32, and 47.2% with batch size 64, compared with a 17.2% increase with batch size 4. These numbers can vary based on your workload (batch size, padding percentage) and your hardware. Please look up some other benchmarks in the blog post.
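Since the gains depend on how much of a batch is padding, it can help to estimate the padding percentage of your own workload before benchmarking. The following is a minimal sketch, assuming each sequence is padded to the longest one in its batch; the function name `padding_fraction` and the sample lengths are hypothetical and not part of TorchServe.

```python
from typing import List


def padding_fraction(seq_lengths: List[int]) -> float:
    """Fraction of token slots that are padding when every sequence
    in the batch is padded to the length of the longest one."""
    if not seq_lengths:
        raise ValueError("seq_lengths must not be empty")
    longest = max(seq_lengths)
    total_slots = longest * len(seq_lengths)
    if total_slots == 0:
        return 0.0
    real_tokens = sum(seq_lengths)
    return (total_slots - real_tokens) / total_slots


if __name__ == "__main__":
    # Hypothetical token counts for one batch, purely for illustration.
    example_lengths = [12, 48, 128, 30]
    print(f"padding: {padding_fraction(example_lengths):.1%}")
```

A batch whose sequences all have the same length reports 0% padding, which is the case where the sparsity-related part of the speedup would be expected to matter least.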

torch.compile() support #1960 @msaroufim

We've added experimental support for PT 2.0, in the form of torch.compile() support within TorchServe. To use it, supply a compile.json file when archiving your model to specify which backend you want. We also enable mode=reduce-overhead by default, which is well suited to the smaller batch sizes that are more common for inference. For now we recommend GPUs with tensor cores, such as the A10G or A100, since you're likely to see the greatest speedups there.
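As a rough illustration of the compile.json step, the sketch below writes a small JSON file naming a backend. The `{"pt2": ...}` key layout and the helper name `write_compile_config` are assumptions made for this example; please confirm the exact schema against the examples/pt2 README for your TorchServe version. `inductor` is used only as a sample torch.compile backend name.

```python
import json
import tempfile
from pathlib import Path


def write_compile_config(directory: str, backend: str = "inductor") -> Path:
    """Write a compile.json that names the torch.compile backend.

    ASSUMPTION: the {"pt2": <backend>} layout is illustrative only;
    check the examples/pt2 README for the schema TorchServe expects.
    """
    path = Path(directory) / "compile.json"
    path.write_text(json.dumps({"pt2": backend}) + "\n")
    return path


if __name__ == "__main__":
    # Write into a throwaway directory so the demo leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        config_path = write_compile_config(tmp)
        print(config_path.read_text())
```

The resulting file would then be bundled with the model at archive time, for instance through the `--extra-files` option of torch-model-archiver.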

On training we've seen speedups ranging from 30% to 2x (https://pytorch.org/get-started/pytorch-2.0/), but we haven't run any performance benchmarks for inference yet. Until then, we recommend you continue using other runtimes such as TensorRT or IPEX for accelerated inference, which we highlight in our performance_guide.md. There are a few important caveats to consider when using torch.compile: changes in batch size will cause recompilations, so make sure to use a small batch size; starting a model incurs additional overhead because it must be compiled first; and you'll likely still see the largest speedups with TensorRT.

However, we hope that adding this support will make it easier for you to benchmark and try out PT 2.0. Learn more here: https://github.com/pytorch/serve/tree/master/examples/pt2
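When benchmarking a compiled model yourself, it is worth reporting the first request separately, because one-time costs such as compilation would otherwise distort the average. Below is a minimal timing sketch using only the standard library; `first_vs_steady_latency` and `FakeCompiledModel` are hypothetical names, and the fake model merely sleeps once to stand in for compile overhead.

```python
import time
from typing import Callable, Dict, Sequence


def first_vs_steady_latency(
    predict: Callable[[object], object],
    inputs: Sequence[object],
) -> Dict[str, float]:
    """Time the first call separately from the remaining calls.

    The first call absorbs any one-time cost (compilation, warm-up),
    so it is reported on its own rather than averaged in.
    """
    if len(inputs) < 2:
        raise ValueError("need at least two inputs")

    start = time.perf_counter()
    predict(inputs[0])
    first = time.perf_counter() - start

    start = time.perf_counter()
    for item in inputs[1:]:
        predict(item)
    steady = (time.perf_counter() - start) / (len(inputs) - 1)

    return {"first_call_s": first, "steady_state_s": steady}


class FakeCompiledModel:
    """Stand-in for a real model: sleeps once to mimic compile cost."""

    def __init__(self) -> None:
        self._compiled = False

    def __call__(self, x: object) -> object:
        if not self._compiled:
            time.sleep(0.05)  # pretend one-time compilation
            self._compiled = True
        return x


if __name__ == "__main__":
    stats = first_vs_steady_latency(FakeCompiledModel(), list(range(5)))
    print(stats)
```

To measure a real model, pass its inference callable as `predict` and keep the batch size fixed across inputs, since a change in batch size may trigger another compilation. On a GPU you would also need to synchronize the device before reading the clock, which this sketch does not do.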

Dependency Upgrades

Support Python 3.10 #2031 @agunapal
Support PyTorch 1.13 and CUDA 11.7 #1980 @agunapal
Update docker default from Ubuntu 18.04 to Ubuntu 20.04 (LTS) #1970 @LuigiCerone

Improvements

KServe upgrade to 0.9 #1860 @jagadeesh
Added pyyaml for python venv #2014 @lxning
Added HG BERT better transformer benchmark #2024 @lxning

Documentation

Fixed response time unit #2015 @lxning

Platform Support

Ubuntu 16.04, Ubuntu 18.04, MacOS 10.14+, Windows 10 Pro, Windows Server 2019, Windows Subsystem for Linux (Windows Server 2019, WSLv1, Ubuntu 18.04). TorchServe now requires Python 3.8 or above and JDK 17.

GPU Support

Torch 1.13 + CUDA 11.7
Torch 1.11 + CUDA 10.2, 11.3, 11.6
Torch 1.9.0 + CUDA 11.1
Torch 1.8.1 + CUDA 9.2
