Releases: huggingface/text-generation-inference
v1.2.0
What's Changed
- fix: do not leak inputs on error by @OlivierDehaene in #1228
- Fix missing `trust_remote_code` flag for AutoTokenizer in utils.peft by @creatorrr in #1270 (see the tokenizer sketch after this list)
- Load PEFT weights from local directory by @tleyden in #1260
- chore: update to torch 2.1.0 by @OlivierDehaene in #1182
- Fix IDEFICS dtype by @vakker in #1214
- Exllama v2 by @Narsil in #1211
- Add RoCm support by @fxmarty in #1243
- Let each model resolve their own default dtype. by @Narsil in #1287
- Make GPTQ test less flaky by @Narsil in #1295
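The utils.peft fix in #1270 matters whenever a tokenizer ships custom code: without the flag being forwarded, loading such a repo fails. A minimal sketch of the call the fix forwards; the model id is a placeholder:

```python
# Minimal sketch: forwarding trust_remote_code when loading a tokenizer,
# as the utils.peft fix in #1270 now does. The repo id is hypothetical.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "some-org/model-with-custom-tokenizer",  # placeholder repo id
    trust_remote_code=True,  # required for repos that ship custom tokenizer code
)
```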
New Contributors
- @creatorrr made their first contribution in #1270
- @tleyden made their first contribution in #1260
- @vakker made their first contribution in #1214
Full Changelog: v1.1.1...v1.2.0
v1.1.1
What's Changed
- Fix launcher.md by @mishig25 in #1075
- Update launcher.md to wrap code blocks by @mishig25 in #1076
- Fixing eetq dockerfile. by @Narsil in #1081
- Fix window_size_left for flash attention v1 by @peterlowrance in #1089
- raise exception on invalid images by @leot13 in #999
- [Doc page] Fix launcher page highlighting by @mishig25 in #1080
- Handling bloom prefix. by @Narsil in #1090
- Update idefics_image_processing.py by @Narsil in #1091
- fixed command line arguments in docs by @Fluder-Paradyne in #1092
- Adding titles to CLI doc. by @Narsil in #1094
- Receive base64 encoded images for idefics. by @Narsil in #1096
- Modify the default for `max_new_tokens`. by @Narsil in #1097 (see the request sketch after this list)
- fix: type hint typo in tokens.py by @vejvarm in #1102
- Fixing GPTQ exllama kernel usage. by @Narsil in #1101
- Adding yarn support. by @Narsil in #1099
- Hotfixing idefics base64 parsing. by @Narsil in #1103
- Prepare for v1.1.1 by @Narsil in #1100
- Remove some content from the README in favour of the documentation by @osanseviero in #958
- Fix link in preparing_model.md by @mishig25 in #1140
- Fix calling cuda() on load_in_8bit by @mmngays in #1153
- Fix: Replace view() with reshape() in neox_modeling.py to resolve RuntimeError by @Mario928 in #1155
- fix: EETQLinear with bias in layers.py by @SidaZh in #1176
- fix: remove useless token by @rtrompier in #1179
- #1049 CI by @OlivierDehaene in #1178
- Fix link to quantization page in preparing_model.md by @aasthavar in #1187
- feat: paged attention v2 by @OlivierDehaene in #1183
- feat: remove flume by @OlivierDehaene in #1184
- Adding the video -> moving the architecture picture lower by @Narsil in #1239
- Narsil patch 1 by @Narsil in #1241
- Update README.md by @Narsil in #1242
- Fix link in quantization guide by @osanseviero in #1246
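For context on #1097: `max_new_tokens` is a request-level parameter on TGI's `/generate` route, so the changed server default only applies when a request omits it. A hedged sketch of an explicit request, assuming a server listening on localhost:8080:

```python
# Hedged sketch: setting max_new_tokens explicitly on TGI's /generate route,
# so the server default changed in #1097 never comes into play.
# Assumes a TGI server is listening on localhost:8080.
import requests

response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is Deep Learning?",
        "parameters": {"max_new_tokens": 20},
    },
)
print(response.json()["generated_text"])
```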
New Contributors
- @peterlowrance made their first contribution in #1089
- @leot13 made their first contribution in #999
- @Fluder-Paradyne made their first contribution in #1092
- @vejvarm made their first contribution in #1102
- @mmngays made their first contribution in #1153
- @Mario928 made their first contribution in #1155
- @SidaZh made their first contribution in #1176
- @rtrompier made their first contribution in #1179
- @aasthavar made their first contribution in #1187
Full Changelog: v1.1.0...v1.1.1
v1.1.0
What's Changed
- Fix f180 by @Narsil in #951
- Fix Falcon weight mapping for H2O.ai checkpoints by @Vinno97 in #953
- Fixing top_k tokens when k ends up < 0 by @Narsil in #966
- small fix on idefics by @VictorSanh in #954
- chore(client): Support Pydantic 2 by @JelleZijlstra in #900
- docs: typo in streaming.js by @revolunet in #971
- Disabling exllama on old compute. by @Narsil in #986
- sync text-generation version from 0.3.0 to 0.6.0 with pyproject.toml by @yzbx in #950
- Fix exllama wrongfully loading by @maximelaboisson in #990
- add transformers gptq support by @flozi00 in #963
- Fix call vs forward. by @Narsil in #993
- fit for baichuan models by @XiaoBin1992 in #981
- Fix missing arguments in Galactica's from_pb by @Vinno97 in #1022
- Fixing t5 loading. by @Narsil in #1042
- Add AWQ quantization inference support (#1019) by @Narsil in #1054
- Fix GQA llama + AWQ by @Narsil in #1061
- support local model config file by @zhangsibo1129 in #1058
- fix discard_names bug in safetensors conversion by @zhangsibo1129 in #1052
- Install curl to be able to perform more advanced healthchecks by @oOraph in #1033
- Fix position ids logic instantiation of idefics vision part by @VictorSanh in #1064
- Fix top_n_tokens returning non-log probs for some models by @Vinno97 in #1023
- Support eetq weight only quantization by @Narsil in #1068
- Remove the stripping of the prefix space (and any other mangling that tokenizers might do). by @Narsil in #1065
- Complete FastLinear.load parameters in OPTDecoder initialization by @zhangsibo1129 in #1060
- feat: add mistral model by @OlivierDehaene in #1071
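Once a server is running with a Mistral checkpoint (#1071), generation goes through the same client path as any other model. A sketch using the `text_generation` Python client; the endpoint URL is an assumption:

```python
# Sketch: querying a TGI server (e.g. one serving the Mistral model added
# in #1071) with the text_generation Python client. Endpoint is assumed.
from text_generation import Client

client = Client("http://127.0.0.1:8080")
response = client.generate("What is Deep Learning?", max_new_tokens=32)
print(response.generated_text)
```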
New Contributors
- @VictorSanh made their first contribution in #954
- @JelleZijlstra made their first contribution in #900
- @revolunet made their first contribution in #971
- @yzbx made their first contribution in #950
- @maximelaboisson made their first contribution in #990
- @XiaoBin1992 made their first contribution in #981
- @sywangyi made their first contribution in #1034
- @zhangsibo1129 made their first contribution in #1058
Full Changelog: v1.0.3...v1.1.0
v1.0.3
What's Changed
Codellama.
- Upgrade version number in docs. by @Narsil in #910
- Added gradio example to docs by @merveenoyan in #867
- Supporting code llama. by @Narsil in #918
- Fixing the lora adaptation on docker. by @Narsil in #935
- Rebased #617 by @Narsil in #868
- New release. by @Narsil in #941
Full Changelog: v1.0.2...v1.0.3
v1.0.2
What's Changed
- Have snippets in Python/JavaScript in quicktour by @osanseviero in #809
- Added two more features in readme.md file by @sawanjr in #831
- Fix rope dynamic + factor by @Narsil in #822
- fix: LlamaTokenizerFast to AutoTokenizer at flash_llama.py by @dongs0104 in #619
- README edit -- running the service with no GPU or CUDA support by @pminervini in #773
- Fix `tokenizers==0.13.4`. by @Narsil in #838
- Update README.md by @adarshxs in #848
- Fixing watermark. by @Narsil in #851
- Misc minor improvements for InferenceClient docs by @osanseviero in #852
- "Fix" for rw-1b. by @Narsil in #860
- Upgrading versions of python client. by @Narsil in #862
- Adding Idefics multi modal model. by @Narsil in #842
- Add streaming guide by @osanseviero in #858
- Adding small benchmark script. by @Narsil in #881
New Contributors
- @sawanjr made their first contribution in #831
- @dongs0104 made their first contribution in #619
- @pminervini made their first contribution in #773
- @adarshxs made their first contribution in #848
Full Changelog: v1.0.1...v1.0.2
v1.0.1
Notable changes:
- More GPTQ support
- Rope scaling (linear + dynamic)
- Bitsandbytes 4bits (both modes)
- Added more documentation
What's Changed
- Local gptq support. by @Narsil in #738
- Fix typing in `Model.generate_token` by @jaywonchung in #733
- Adding Rope scaling. by @Narsil in #741
- chore: fix typo in mpt_modeling.py by @eltociear in #737
- fix(server): Failing quantize config after local read. by @Narsil in #743
- Typo fix. by @Narsil in #746
- fix typo for dynamic rotary by @flozi00 in #745
- add FastLinear import by @zspo in #750
- Automatically map deduplicated safetensors weights to their original values (#501) by @Narsil in #761
- feat(server): Add native support for PEFT Lora models by @Narsil in #762
- This should prevent the PyTorch overriding. by @Narsil in #767
- fix build tokenizer in quantize and remove duplicate import by @zspo in #768
- Merge BNB 4bit. by @Narsil in #770
- Fix dynamic rope. by @Narsil in #783
- Fixing non 4bits quantization. by @Narsil in #785
- Update init.py by @Narsil in #794
- Llama change. by @Narsil in #793
- Setup for doc-builder and docs for TGI by @merveenoyan in #740
- Use destructuring in router arguments to avoid '.0' by @ivarflakstad in #798
- Fix gated docs by @osanseviero in #805
- Minor docs style fixes by @osanseviero in #806
- Added CLI docs and rename docker launch by @merveenoyan in #799
- [docs] Build docs only when doc files change by @mishig25 in #812
- Added ChatUI Screenshot to Docs by @merveenoyan in #823
- Upgrade transformers (fix protobuf==3.20 issue) by @Narsil in #795
- Added streaming for InferenceClient by @merveenoyan in #821
- Version 1.0.1 by @Narsil in #836
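The InferenceClient streaming added in #821 lives in huggingface_hub. A minimal sketch, assuming a local TGI endpoint:

```python
# Minimal sketch: token streaming via huggingface_hub's InferenceClient,
# the feature added in #821. The endpoint URL is an assumption.
from huggingface_hub import InferenceClient

client = InferenceClient(model="http://127.0.0.1:8080")
for token in client.text_generation(
    "What is Deep Learning?", max_new_tokens=32, stream=True
):
    print(token, end="", flush=True)
```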
New Contributors
- @jaywonchung made their first contribution in #733
- @eltociear made their first contribution in #737
- @flozi00 made their first contribution in #745
- @zspo made their first contribution in #750
- @ivarflakstad made their first contribution in #798
- @osanseviero made their first contribution in #805
- @mishig25 made their first contribution in #812
Full Changelog: v1.0.0...v1.0.1
v1.0.0
License change
We are releasing TGI v1.0 under a new license: HFOIL 1.0.
All prior versions of TGI remain licensed under Apache 2.0, the last Apache 2.0 version being version 0.9.4.
HFOIL stands for Hugging Face Optimized Inference License, and it has been specifically designed for our optimized inference solutions. While the source code remains accessible, HFOIL is not a true open source license because we added a restriction: to sell a hosted or managed service built on top of TGI, we now require a separate agreement.
You can consult the new license here.
What does this mean for you?
This change in source code licensing has no impact on the overwhelming majority of our user community, who use TGI for free. Our Inference Endpoints customers and those of our commercial partners likewise remain unaffected.
However, it will restrict non-partnered cloud service providers from offering TGI v1.0+ as a service without requesting a license.
To elaborate further:
- If you are an existing user of TGI prior to v1.0, your current version is still Apache 2.0 and you can use it commercially without restrictions.
- If you are using TGI for personal use or research purposes, the HFOIL 1.0 restrictions do not apply to you.
- If you are using TGI for commercial purposes as part of an internal company project (that will not be sold to third parties as a hosted or managed service), the HFOIL 1.0 restrictions do not apply to you.
- If you integrate TGI into a hosted or managed service that you sell to customers, then consider requesting a license to upgrade to v1.0 and later versions - you can email us at [email protected] with information about your service.
For more information, see: #726.
Full Changelog: v0.9.4...v1.0.0
v0.9.4
Features
- server: auto max_batch_total_tokens for flash att models #630
- router: ngrok edge #642
- server: Add trust_remote_code to quantize script by @ChristophRaab #647
- server: Add exllama GPTQ CUDA kernel support #553 #666
- server: Directly load GPTBigCode to specified device by @Atry in #618
- server: add cuda memory fraction #659
- server: Using `quantize_config.json` instead of GPTQ_BITS env variables #671 (see the sketch after this list)
- server: support new falcon config #712
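On #671: the server now reads quantization parameters from the `quantize_config.json` shipped alongside GPTQ weights rather than from `GPTQ_BITS` / `GPTQ_GROUPSIZE` environment variables. A hedged sketch of such a file, using the common GPTQ field names; the exact keys depend on how the checkpoint was quantized:

```python
# Hedged sketch: writing a minimal quantize_config.json like the one the
# server now reads (#671) instead of GPTQ_BITS / GPTQ_GROUPSIZE env vars.
# Field names follow the common GPTQ convention; treat them as assumptions.
import json

quantize_config = {
    "bits": 4,         # replaces the GPTQ_BITS env variable
    "group_size": 128  # replaces the GPTQ_GROUPSIZE env variable
}
with open("quantize_config.json", "w") as f:
    json.dump(quantize_config, f, indent=2)
```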
Fix
- server: llama v2 GPTQ #648
- server: Fixing non parameters in quantize script (`bigcode/starcoder` was an example) #661
- server: use mem_get_info to get kv cache size #664
- server: fix exllama buffers #689
- server: fix quantization python requirements #708
New Contributors
- @ChristophRaab made their first contribution in #647
- @fxmarty made their first contribution in #648
- @Atry made their first contribution in #618
Full Changelog: v0.9.3...v0.9.4
v0.9.3
Highlights
- server: add support for flash attention v2
- server: add support for llamav2
Features
- launcher: add debug logs
- server: rework the quantization to support all models
Full Changelog: v0.9.2...v0.9.3
v0.9.2
Features
- server: harden a bit the weights choice to save on disk
- server: better errors for warmup and TP
- server: Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE
- server: Implements sharding for non divisible `vocab_size`
- launcher: add arg validation and drop subprocess
- router: explicit warning if revision is not set
Fix
- server: Fixing RW code (it's remote code so the Arch checking doesn't work to see which weights to keep)
- server: T5 weights names
- server: Adding logger import to t5_modeling.py by @akowalsk
- server: Bug fixes for GPTQ_BITS environment variable passthrough by @ssmi153
- server: GPTQ Env vars: catch correct type of error by @ssmi153
- server: blacklist local files
New Contributors
- @akowalsk made their first contribution in #585
- @ssmi153 made their first contribution in #590
- @gary149 made their first contribution in #611
Full Changelog: v0.9.1...v0.9.2