
v2.0.0

Released by @OlivierDehaene on 12 Apr 16:44 · commit c38a7d7

TGI is back to Apache 2.0!

Highlights

  • The license was reverted to Apache 2.0.
  • CUDA graphs are now used by default; they substantially improve latency on high-end nodes.
  • Llava-next was added. It is the second multimodal model available in TGI, after Idefics (see the request sketch after this list).
  • Cohere Command R+ support. TGI is the fastest open-source backend for Command R+.
  • FP8 support.
  • The vocabulary is now shared across all Medusa heads, greatly improving latency and memory use.
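As a sketch of what a multimodal request looks like once a Llava-next model is being served (the image URL and prompt below are placeholders), TGI accepts images embedded in the input through Markdown image syntax:

curl 127.0.0.1:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs":"![](https://example.com/photo.png)Describe this image.","parameters":{"max_new_tokens":64}}'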

Try out Command R+ with Medusa heads on 4xA100s with:

model=text-generation-inference/commandrplus-medusa
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.0 \
    --model-id $model --speculate 3 --num-shard 4
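Once the container reports it is ready, you can query the server over the standard /generate endpoint; the prompt and max_new_tokens value below are illustrative:

curl 127.0.0.1:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs":"What is speculative decoding?","parameters":{"max_new_tokens":64}}'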

What's Changed

New Contributors

Full Changelog: v1.4.5...v2.0.0