Releases: huggingface/text-generation-inference
v1.2.0
What's Changed
- fix: do not leak inputs on error by @OlivierDehaene in #1228
- Fix missing `trust_remote_code` flag for AutoTokenizer in utils.peft by @creatorrr in #1270 (see the tokenizer sketch after this list)
- Load PEFT weights from local directory by @tleyden in #1260
- chore: update to torch 2.1.0 by @OlivierDehaene in #1182
- Fix IDEFICS dtype by @vakker in #1214
- Exllama v2 by @Narsil in #1211
- Add RoCm support by @fxmarty in #1243
- Let each model resolve their own default dtype. by @Narsil in #1287
- Make GPTQ test less flaky by @Narsil in #1295
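The utils.peft fix in #1270 matters whenever a tokenizer ships custom code: without the flag being forwarded, loading such a repo fails. A minimal sketch of the call the fix forwards; the model id is a placeholder:

```python
# Minimal sketch: forwarding trust_remote_code when loading a tokenizer,
# as the utils.peft fix in #1270 now does. The repo id is hypothetical.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "some-org/model-with-custom-tokenizer",  # placeholder repo id
    trust_remote_code=True,  # required for repos that ship custom tokenizer code
)
```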
New Contributors
- @creatorrr made their first contribution in #1270
- @tleyden made their first contribution in #1260
- @vakker made their first contribution in #1214
Full Changelog: v1.1.1...v1.2.0
v1.1.1
What's Changed
- Fix launcher.md by @mishig25 in #1075
- Update launcher.md to wrap code blocks by @mishig25 in #1076
- Fixing eetq dockerfile. by @Narsil in #1081
- Fix window_size_left for flash attention v1 by @peterlowrance in #1089
- raise exception on invalid images by @leot13 in #999
- [Doc page] Fix launcher page highlighting by @mishig25 in #1080
- Handling bloom prefix. by @Narsil in #1090
- Update idefics_image_processing.py by @Narsil in #1091
- fixed command line arguments in docs by @Fluder-Paradyne in #1092
- Adding titles to CLI doc. by @Narsil in #1094
- Receive base64 encoded images for idefics. by @Narsil in #1096
- Modify the default for `max_new_tokens`. by @Narsil in #1097 (see the request sketch after this list)
- fix: type hint typo in tokens.py by @vejvarm in #1102
- Fixing GPTQ exllama kernel usage. by @Narsil in #1101
- Adding yarn support. by @Narsil in #1099
- Hotfixing idefics base64 parsing. by @Narsil in #1103
- Prepare for v1.1.1 by @Narsil in #1100
- Remove some content from the README in favour of the documentation by @osanseviero in #958
- Fix link in preparing_model.md by @mishig25 in #1140
- Fix calling cuda() on load_in_8bit by @mmngays in #1153
- Fix: Replace view() with reshape() in neox_modeling.py to resolve RuntimeError by @Mario928 in #1155
- fix: EETQLinear with bias in layers.py by @SidaZh in #1176
- fix: remove useless token by @rtrompier in #1179
- #1049 CI by @OlivierDehaene in #1178
- Fix link to quantization page in preparing_model.md by @aasthavar in #1187
- feat: paged attention v2 by @OlivierDehaene in #1183
- feat: remove flume by @OlivierDehaene in #1184
- Adding the video -> moving the architecture picture lower by @Narsil in #1239
- Narsil patch 1 by @Narsil in #1241
- Update README.md by @Narsil in #1242
- Fix link in quantization guide by @osanseviero in #1246
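For context on #1097: `max_new_tokens` is a request-level parameter on TGI's `/generate` route, so the changed server default only applies when a request omits it. A hedged sketch of an explicit request, assuming a server listening on localhost:8080:

```python
# Hedged sketch: setting max_new_tokens explicitly on TGI's /generate route,
# so the server default changed in #1097 never comes into play.
# Assumes a TGI server is listening on localhost:8080.
import requests

response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is Deep Learning?",
        "parameters": {"max_new_tokens": 20},
    },
)
print(response.json()["generated_text"])
```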
New Contributors
- @peterlowrance made their first contribution in #1089
- @leot13 made their first contribution in #999
- @Fluder-Paradyne made their first contribution in #1092
- @vejvarm made their first contribution in #1102
- @mmngays made their first contribution in #1153
- @Mario928 made their first contribution in #1155
- @SidaZh made their first contribution in #1176
- @rtrompier made their first contribution in #1179
- @aasthavar made their first contribution in #1187
Full Changelog: v1.1.0...v1.1.1
v1.1.0
What's Changed
- Fix f180 by @Narsil in #951
- Fix Falcon weight mapping for H2O.ai checkpoints by @Vinno97 in #953
- Fixing top_k tokens when k ends up < 0 by @Narsil in #966
- small fix on idefics by @VictorSanh in #954
- chore(client): Support Pydantic 2 by @JelleZijlstra in #900
- docs: typo in streaming.js by @revolunet in #971
- Disabling exllama on old compute. by @Narsil in #986
- sync text-generation version from 0.3.0 to 0.6.0 with pyproject.toml by @yzbx in #950
- Fix exllama wrongfully loading by @maximelaboisson in #990
- add transformers gptq support by @flozi00 in #963
- Fix call vs forward. by @Narsil in #993
- fit for baichuan models by @XiaoBin1992 in #981
- Fix missing arguments in Galactica's from_pb by @Vinno97 in #1022
- Fixing t5 loading. by @Narsil in #1042
- Add AWQ quantization inference support (#1019) by @Narsil in #1054
- Fix GQA llama + AWQ by @Narsil in #1061
- support local model config file by @zhangsibo1129 in #1058
- fix discard_names bug in safetensors conversion by @zhangsibo1129 in #1052
- Install curl to be able to perform more advanced healthchecks by @oOraph in #1033
- Fix position ids logic instantiation of idefics vision part by @VictorSanh in #1064
- Fix top_n_tokens returning non-log probs for some models by @Vinno97 in #1023
- Support eetq weight only quantization by @Narsil in #1068
- Remove the stripping of the prefix space (and any other mangling that tokenizers might do). by @Narsil in #1065
- Complete FastLinear.load parameters in OPTDecoder initialization by @zhangsibo1129 in #1060
- feat: add mistral model by @OlivierDehaene in #1071
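Once a server is running with a Mistral checkpoint (#1071), generation goes through the same client path as any other model. A sketch using the `text_generation` Python client; the endpoint URL is an assumption:

```python
# Sketch: querying a TGI server (e.g. one serving the Mistral model added
# in #1071) with the text_generation Python client. Endpoint is assumed.
from text_generation import Client

client = Client("http://127.0.0.1:8080")
response = client.generate("What is Deep Learning?", max_new_tokens=32)
print(response.generated_text)
```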
New Contributors
- @VictorSanh made their first contribution in #954
- @JelleZijlstra made their first contribution in #900
- @revolunet made their first contribution in #971
- @yzbx made their first contribution in #950
- @maximelaboisson made their first contribution in #990
- @XiaoBin1992 made their first contribution in #981
- @sywangyi made their first contribution in #1034
- @zhangsibo1129 made their first contribution in #1058
Full Changelog: v1.0.3...v1.1.0
v1.0.3
What's Changed
Codellama.
- Upgrade version number in docs. by @Narsil in #910
- Added gradio example to docs by @merveenoyan in #867
- Supporting code llama. by @Narsil in #918
- Fixing the lora adaptation on docker. by @Narsil in #935
- Rebased #617 by @Narsil in #868
- New release. by @Narsil in #941
Full Changelog: v1.0.2...v1.0.3
v1.0.2
What's Changed
- Have snippets in Python/JavaScript in quicktour by @osanseviero in #809
- Added two more features in readme.md file by @sawanjr in #831
- Fix rope dynamic + factor by @Narsil in #822
- fix: LlamaTokenizerFast to AutoTokenizer at flash_llama.py by @dongs0104 in #619
- README edit -- running the service with no GPU or CUDA support by @pminervini in #773
- Fix `tokenizers==0.13.4`. by @Narsil in #838
- Update README.md by @adarshxs in #848
- Fixing watermark. by @Narsil in #851
- Misc minor improvements for InferenceClient docs by @osanseviero in #852
- "Fix" for rw-1b. by @Narsil in #860
- Upgrading versions of python client. by @Narsil in #862
- Adding Idefics multi modal model. by @Narsil in #842
- Add streaming guide by @osanseviero in #858
- Adding small benchmark script. by @Narsil in #881
New Contributors
- @sawanjr made their first contribution in #831
- @dongs0104 made their first contribution in #619
- @pminervini made their first contribution in #773
- @adarshxs made their first contribution in #848
Full Changelog: v1.0.1...v1.0.2
v1.0.1
Notable changes:
- More GPTQ support
- Rope scaling (linear + dynamic)
- Bitsandbytes 4bits (both modes)
- Added more documentation
What's Changed
- Local gptq support. by @Narsil in #738
- Fix typing in `Model.generate_token` by @jaywonchung in #733
- Adding Rope scaling. by @Narsil in #741
- chore: fix typo in mpt_modeling.py by @eltociear in #737
- fix(server): Failing quantize config after local read. by @Narsil in #743
- Typo fix. by @Narsil in #746
- fix typo for dynamic rotary by @flozi00 in #745
- add FastLinear import by @zspo in #750
- Automatically map deduplicated safetensors weights to their original values (#501) by @Narsil in #761
- feat(server): Add native support for PEFT Lora models by @Narsil in #762
- This should prevent the PyTorch overriding. by @Narsil in #767
- fix build tokenizer in quantize and remove duplicate import by @zspo in #768
- Merge BNB 4bit. by @Narsil in #770
- Fix dynamic rope. by @Narsil in #783
- Fixing non 4bits quantization. by @Narsil in #785
- Update init.py by @Narsil in #794
- Llama change. by @Narsil in #793
- Setup for doc-builder and docs for TGI by @merveenoyan in #740
- Use destructuring in router arguments to avoid '.0' by @ivarflakstad in #798
- Fix gated docs by @osanseviero in #805
- Minor docs style fixes by @osanseviero in #806
- Added CLI docs and rename docker launch by @merveenoyan in #799
- [docs] Build docs only when doc files change by @mishig25 in #812
- Added ChatUI Screenshot to Docs by @merveenoyan in #823
- Upgrade transformers (fix protobuf==3.20 issue) by @Narsil in #795
- Added streaming for InferenceClient by @merveenoyan in #821
- Version 1.0.1 by @Narsil in #836
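The InferenceClient streaming added in #821 lives in huggingface_hub. A minimal sketch, assuming a local TGI endpoint:

```python
# Minimal sketch: token streaming via huggingface_hub's InferenceClient,
# the feature added in #821. The endpoint URL is an assumption.
from huggingface_hub import InferenceClient

client = InferenceClient(model="http://127.0.0.1:8080")
for token in client.text_generation(
    "What is Deep Learning?", max_new_tokens=32, stream=True
):
    print(token, end="", flush=True)
```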
New Contributors
- @jaywonchung made their first contribution in #733
- @eltociear made their first contribution in #737
- @flozi00 made their first contribution in #745
- @zspo made their first contribution in #750
- @ivarflakstad made their first contribution in #798
- @osanseviero made their first contribution in #805
- @mishig25 made their first contribution in #812
Full Changelog: v1.0.0...v1.0.1
v1.0.0
License change
We are releasing TGI v1.0 under a new license: HFOIL 1.0.
All prior versions of TGI remain licensed under Apache 2.0, the last Apache 2.0 version being version 0.9.4.
HFOIL stands for Hugging Face Optimized Inference License, and it has been specifically designed for our optimized inference solutions. While the source code remains accessible, HFOIL is not a true open source license because we added a restriction: to sell a hosted or managed service built on top of TGI, we now require a separate agreement.
You can consult the new license here.
What does this mean for you?
This change in source code licensing has no impact on the overwhelming majority of our user community, who use TGI for free. Our Inference Endpoints customers and those of our commercial partners likewise remain unaffected.
However, it will restrict non-partnered cloud service providers from offering TGI v1.0+ as a service without requesting a license.
To elaborate further:
- If you are an existing user of TGI prior to v1.0, your current version is still Apache 2.0 and you can use it commercially without restrictions.
- If you are using TGI for personal use or research purposes, the HFOIL 1.0 restrictions do not apply to you.
- If you are using TGI for commercial purposes as part of an internal company project (that will not be sold to third parties as a hosted or managed service), the HFOIL 1.0 restrictions do not apply to you.
- If you integrate TGI into a hosted or managed service that you sell to customers, then consider requesting a license to upgrade to v1.0 and later versions - you can email us at [email protected] with information about your service.
For more information, see: #726.
Full Changelog: v0.9.4...v1.0.0
v0.9.4
Features
- server: auto max_batch_total_tokens for flash att models #630
- router: ngrok edge #642
- server: Add trust_remote_code to quantize script by @ChristophRaab #647
- server: Add exllama GPTQ CUDA kernel support #553 #666
- server: Directly load GPTBigCode to specified device by @Atry in #618
- server: add cuda memory fraction #659
- server: Using `quantize_config.json` instead of GPTQ_BITS env variables #671 (see the sketch after this list)
- server: support new falcon config #712
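On #671: the server now reads quantization parameters from the `quantize_config.json` shipped alongside GPTQ weights rather than from `GPTQ_BITS` / `GPTQ_GROUPSIZE` environment variables. A hedged sketch of such a file, using the common GPTQ field names; the exact keys depend on how the checkpoint was quantized:

```python
# Hedged sketch: writing a minimal quantize_config.json like the one the
# server now reads (#671) instead of GPTQ_BITS / GPTQ_GROUPSIZE env vars.
# Field names follow the common GPTQ convention; treat them as assumptions.
import json

quantize_config = {
    "bits": 4,         # replaces the GPTQ_BITS env variable
    "group_size": 128  # replaces the GPTQ_GROUPSIZE env variable
}
with open("quantize_config.json", "w") as f:
    json.dump(quantize_config, f, indent=2)
```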
Fix
- server: llama v2 GPTQ #648
- server: Fixing non parameters in quantize script (`bigcode/starcoder` was an example) #661
- server: use mem_get_info to get kv cache size #664
- server: fix exllama buffers #689
- server: fix quantization python requirements #708
New Contributors
- @ChristophRaab made their first contribution in #647
- @fxmarty made their first contribution in #648
- @Atry made their first contribution in #618
Full Changelog: v0.9.3...v0.9.4
v0.9.3
Highlights
- server: add support for flash attention v2
- server: add support for llamav2
Features
- launcher: add debug logs
- server: rework the quantization to support all models
Full Changelog: v0.9.2...v0.9.3
v0.9.2
Features
- server: harden a bit the weights choice to save on disk
- server: better errors for warmup and TP
- server: Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE
- server: Implements sharding for non divisible `vocab_size`
- launcher: add arg validation and drop subprocess
- router: explicit warning if revision is not set
Fix
- server: Fixing RW code (it's remote code so the Arch checking doesn't work to see which weights to keep)
- server: T5 weights names
- server: Adding logger import to t5_modeling.py by @akowalsk
- server: Bug fixes for GPTQ_BITS environment variable passthrough by @ssmi153
- server: GPTQ Env vars: catch correct type of error by @ssmi153
- server: blacklist local files
New Contributors
- @akowalsk made their first contribution in #585
- @ssmi153 made their first contribution in #590
- @gary149 made their first contribution in #611
Full Changelog: v0.9.1...v0.9.2