Releases: huggingface/text-generation-inference

v1.2.0

30 Nov 14:19

What's Changed

New Contributors

Full Changelog: v1.1.1...v1.2.0

v1.1.1

16 Nov 17:37

What's Changed

New Contributors

Full Changelog: v1.1.0...v1.1.1

v1.1.0

28 Sep 08:34

Notable changes

  • Support for Mistral models (#1071)
  • AWQ quantization (#1019)
  • EETQ quantization (#1068)
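
The new quantization backends are selected at launch time through the existing `--quantize` flag (e.g. `--quantize awq` for a pre-quantized AWQ checkpoint, or `--quantize eetq` for on-the-fly int8 quantization). As a minimal sketch, the request below queries such a server through TGI's standard `/generate` endpoint; the model id, host, and port are illustrative assumptions.

```python
import requests

# Sketch: assumes a TGI >= 1.1.0 server is already running, e.g. launched
# with `--model-id mistralai/Mistral-7B-v0.1 --quantize eetq`; the host
# and port below are placeholders.
resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "What is quantization?",
        "parameters": {"max_new_tokens": 64},
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```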

What's Changed

New Contributors

Full Changelog: v1.0.3...v1.1.0

v1.0.3

29 Aug 12:29 · 5485c14

What's Changed

CodeLlama support.

Full Changelog: v1.0.2...v1.0.3

v1.0.2

23 Aug 10:55 · c4422e5

What's Changed

New Contributors

Full Changelog: v1.0.1...v1.0.2

v1.0.1

14 Aug 09:24 · 09eca64

Notable changes

  • More GPTQ support
  • Rope scaling (linear + dynamic)
  • Bitsandbytes 4bits (both modes)
  • Added more documentation
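
The 4-bit bitsandbytes modes are likewise chosen at launch time, e.g. `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4`; clients then query the server as usual. Below is a minimal sketch using the repository's `text-generation` Python client; the server address is a placeholder.

```python
from text_generation import Client

# Sketch: assumes a v1.0.1 server launched with 4-bit bitsandbytes
# quantization (NF4 or FP4 mode); the URL is a placeholder.
client = Client("http://127.0.0.1:8080")
response = client.generate(
    "Explain rotary position embedding scaling in one sentence.",
    max_new_tokens=48,
)
print(response.generated_text)
```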

What's Changed

New Contributors

Full Changelog: v1.0.0...v1.0.1

v1.0.0

28 Jul 15:47 · 3ef5ffb

License change

We are releasing TGI v1.0 under a new license: HFOIL 1.0.
All prior versions of TGI remain licensed under Apache 2.0, the last Apache 2.0 version being 0.9.4.

HFOIL stands for Hugging Face Optimized Inference License, and it has been specifically designed for our optimized inference solutions. While the source code remains accessible, HFOIL is not a true open source license because we added a restriction: to sell a hosted or managed service built on top of TGI, we now require a separate agreement.
You can consult the new license here.

What does this mean for you?

This change in source code licensing has no impact on the overwhelming majority of our user community, who use TGI for free. Additionally, both our Inference Endpoints customers and those of our commercial partners remain unaffected.

However, it will restrict non-partnered cloud service providers from offering TGI v1.0+ as a service without requesting a license.

To elaborate further:

  • If you are an existing user of TGI prior to v1.0, your current version is still Apache 2.0 and you can use it commercially without restrictions.

  • If you are using TGI for personal use or research purposes, the HFOIL 1.0 restrictions do not apply to you.

  • If you are using TGI for commercial purposes as part of an internal company project (that will not be sold to third parties as a hosted or managed service), the HFOIL 1.0 restrictions do not apply to you.

  • If you integrate TGI into a hosted or managed service that you sell to customers, then consider requesting a license to upgrade to v1.0 and later versions - you can email us at [email protected] with information about your service.

For more information, see: #726.

Full Changelog: v0.9.4...v1.0.0

v0.9.4

27 Jul 17:29 · 9f18f4c

Features

  • server: automatic max_batch_total_tokens for flash attention models #630
  • router: ngrok edge #642
  • server: Add trust_remote_code to quantize script by @ChristophRaab #647
  • server: Add exllama GPTQ CUDA kernel support #553 #666
  • server: Directly load GPTBigCode to specified device by @Atry in #618
  • server: add cuda memory fraction #659
  • server: Using quantize_config.json instead of GPTQ_BITS env variables #671
  • server: support new falcon config #712
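
As a rough launch sketch tying a few of these flags together: `--cuda-memory-fraction` (#659) caps how much of each GPU TGI may claim, and for flash attention models leaving `--max-batch-total-tokens` unset lets the server derive it from the remaining memory (#630). The model id below is an illustrative placeholder.

```python
import subprocess

# Sketch: assumes `text-generation-launcher` is on PATH and a CUDA GPU is
# available; the model id is a placeholder.
subprocess.run(
    [
        "text-generation-launcher",
        "--model-id", "tiiuae/falcon-7b",  # new falcon config support (#712)
        "--cuda-memory-fraction", "0.8",   # claim at most 80% of each GPU (#659)
        "--port", "8080",
        # --max-batch-total-tokens is omitted on purpose: for flash attention
        # models it is now inferred automatically (#630)
    ],
    check=True,
)
```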

Fix

  • server: llama v2 GPTQ #648
  • server: fix non-parameter tensors in the quantize script (bigcode/starcoder was an example) #661
  • server: use mem_get_info to get kv cache size #664
  • server: fix exllama buffers #689
  • server: fix quantization python requirements #708

New Contributors

Full Changelog: v0.9.3...v0.9.4

v0.9.3

18 Jul 16:53 · 5e6ddfd

Highlights

  • server: add support for flash attention v2
  • server: add support for llamav2
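
Both highlights are server-side changes that are transparent to clients. A quick way to smoke-test a Llama 2 deployment is a streaming request with the repository's `text-generation` Python client; the model id and address below are placeholders.

```python
from text_generation import Client

# Sketch: assumes a v0.9.3 server launched with a Llama 2 checkpoint
# (e.g. `--model-id meta-llama/Llama-2-7b-hf`); the URL is a placeholder.
client = Client("http://127.0.0.1:8080")
for response in client.generate_stream("Tell me a short joke.", max_new_tokens=32):
    if not response.token.special:
        print(response.token.text, end="", flush=True)
print()
```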

Features

  • launcher: add debug logs
  • server: rework the quantization to support all models

Full Changelog: v0.9.2...v0.9.3

v0.9.2

14 Jul 14:36 · c58a0c1

Features

  • server: harden the choice of which weights to save on disk
  • server: better errors for warmup and TP
  • server: Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE
  • server: implement sharding for non-divisible vocab_size
  • launcher: add arg validation and drop subprocess
  • router: explicit warning if revision is not set
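
A minimal sketch of the GPTQ environment variables (model id and values are placeholders; v0.9.4 above later replaced these variables with the model's quantize_config.json, see #671):

```python
import os
import subprocess

# Sketch: in this release GPTQ parameters are passed through environment
# variables; the model id and values below are placeholders.
env = {**os.environ, "GPTQ_BITS": "4", "GPTQ_GROUPSIZE": "128"}
subprocess.run(
    [
        "text-generation-launcher",
        "--model-id", "my-org/my-gptq-model",
        "--quantize", "gptq",
        "--port", "8080",
    ],
    env=env,
    check=True,
)
```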

Fix

  • server: fix RW code (it's remote code, so architecture checking can't determine which weights to keep)
  • server: fix T5 weight names
  • server: add missing logger import to t5_modeling.py by @akowalsk
  • server: bug fixes for GPTQ_BITS environment variable passthrough by @ssmi153
  • server: GPTQ env vars: catch the correct type of error by @ssmi153
  • server: blacklist local files

New Contributors

Full Changelog: v0.9.1...v0.9.2