Releases: huggingface/text-generation-inference
v2.4.0
Notable changes
- Experimental prefill chunking (`PREFILL_CHUNKING=1`); see the launch sketch below
- Experimental FP8 KV cache support
- Greatly decrease latency for large batches (> 128 requests)
- Faster MoE kernels and support for GPTQ-quantized MoE
- Faster implementation of MLLama
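
Both experimental features are enabled at launch time. Below is a minimal, hedged launch sketch: `PREFILL_CHUNKING=1` comes from these notes, the model id is purely an example, and the `--kv-cache-dtype` flag name is an assumption to be verified against `text-generation-launcher --help`.

```python
import subprocess

# Sketch: launching TGI v2.4.0 with the experimental features from this release.
# PREFILL_CHUNKING=1 is the env var named in the release notes;
# --kv-cache-dtype is assumed to be the FP8 KV cache switch (verify with --help).
subprocess.run(
    [
        "docker", "run", "--gpus", "all", "--shm-size", "1g", "-p", "8080:80",
        "-e", "PREFILL_CHUNKING=1",                        # experimental prefill chunking
        "ghcr.io/huggingface/text-generation-inference:2.4.0",
        "--model-id", "meta-llama/Llama-3.1-8B-Instruct",  # example model only
        "--kv-cache-dtype", "fp8_e4m3fn",                  # assumed flag for the FP8 KV cache
    ],
    check=True,
)
```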
What's Changed
- nix: remove unused `_server.nix` file by @danieldk in #2538
- chore: Add old V2 backend by @OlivierDehaene in #2551
- Remove duplicated `RUN` in `Dockerfile` by @alvarobartt in #2547
- Micro cleanup. by @Narsil in #2555
- Hotfixing main by @Narsil in #2556
- Add support for scalar FP8 weight scales by @danieldk in #2550
- Add `DenseMoELayer` and wire it up in Mixtral/Deepseek V2 by @danieldk in #2537
- Update the link to the Ratatui organization by @orhun in #2546
- Simplify crossterm imports by @orhun in #2545
- Adding note for private models in quick-tour document by @ariG23498 in #2548
- Hotfixing main. by @Narsil in #2562
- Cleanup Vertex + Chat by @Narsil in #2553
- More tensor cores. by @Narsil in #2558
- remove LORA_ADAPTERS_PATH by @nbroad1881 in #2563
- Add LoRA adapters support for Gemma2 by @alvarobartt in #2567
- Fix build with `--features google` by @alvarobartt in #2566
- Improve support for GPUs with capability < 8 by @danieldk in #2575
- flashinfer: pass window size and dtype by @danieldk in #2574
- Remove compute capability lazy cell by @danieldk in #2580
- Update architecture.md by @ulhaqi12 in #2577
- Update ROCM libs and improvements by @mht-sharma in #2579
- Add support for GPTQ-quantized MoE models using MoE Marlin by @danieldk in #2557
- feat: support phi3.5 moe by @drbh in #2479
- Move flake back to tgi-nix `main` by @danieldk in #2586
- MoE Marlin: support `desc_act` for `groupsize != -1` by @danieldk in #2590
- nix: experimental support for building a Docker container by @danieldk in #2470
- Mllama flash version by @Narsil in #2585
- Max token capacity metric by @Narsil in #2595
- CI (2592): Allow LoRA adapter revision in server launcher by @drbh in #2602
- Unroll notify error into generate response by @drbh in #2597
- New release 2.3.1 by @Narsil in #2604
- Revert "Unroll notify error into generate response" by @drbh in #2605
- nix: example of local package overrides during development by @danieldk in #2607
- Add basic FP8 KV cache support by @danieldk in #2603
- Fp8 Cache condition by @flozi00 in #2611
- enable mllama in intel platform by @sywangyi in #2610
- Upgrade minor rust version (Fixes rust build compilation cache) by @Narsil in #2617
- Add support for fused MoE Marlin for AWQ by @danieldk in #2616
- nix: move back to the tgi-nix main branch by @danieldk in #2620
- CI (2599): Update ToolType input schema by @drbh in #2601
- nix: add black and isort to the closure by @danieldk in #2619
- AMD CI by @Narsil in #2589
- feat: allow tool calling to respond without a tool by @drbh in #2614
- Update documentation to most recent stable version of TGI. by @Vaibhavs10 in #2625
- Intel ci by @Narsil in #2630
- Fixing intel Supports windowing. by @Narsil in #2637
- Small fixes for supported models by @osanseviero in #2471
- Cpu perf by @Narsil in #2596
- Clarify gated description and quicktour by @osanseviero in #2631
- update ipex to fix incorrect output of mllama in cpu by @sywangyi in #2640
- feat: enable pytorch xpu support for non-attention models by @dvrogozh in #2561
- Fixing linters. by @Narsil in #2650
- Rollback to `ChatRequest` for Vertex AI Chat instead of `VertexChat` by @alvarobartt in #2651
- Fp8 e4m3_fnuz support for rocm by @mht-sharma in #2588
- feat: prefill chunking by @OlivierDehaene in #2600
- Support `e4m3fn` KV cache by @danieldk in #2655
- Simplify the `attention` function by @danieldk in #2609
- fix tgi-entrypoint wrapper in docker file: exec instead of spawning a child process by @oOraph in #2663
- fix: prefer inplace softmax to avoid copy by @drbh in #2661
- Break cycle between the attention implementations and KV cache by @danieldk in #2627
- CI job. Gpt awq 4 by @Narsil in #2665
- Make handling of FP8 scales more consisent by @danieldk in #2666
- Test Marlin MoE with `desc_act=true` by @danieldk in #2622
- break when there's nothing to read by @sywangyi in #2582
- Add `impureWithCuda` dev shell by @danieldk in #2677
- Make moe-kernels and marlin-kernels mandatory in CUDA installs by @danieldk in #2632
- feat: natively support Granite models by @OlivierDehaene in #2682
- feat: allow any supported payload on /invocations by @OlivierDehaene in #2683
- flashinfer: reminder to remove contiguous call in the future by @danieldk in #2685
- Fix Phi 3.5 MoE tests by @danieldk in #2684
- Add support for FP8 KV cache scales by @danieldk in #2628
- Fixing "deadlock" when python prompts for trust_remote_code by always by @Narsil in #2664
- [TENSORRT-LLM] - Implement new looper thread based backend by @mfuntowicz in #2357
- Fixing rocm gptq by using triton code too (renamed cuda into triton). by @Narsil in #2691
- Fixing mt0 test. by @Narsil in #2692
- Add support for stop words in TRTLLM by @mfuntowicz in #2678
- Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels by @danieldk in #2688
New Contributors
- @alvarobartt made their first contribution in https://github.com/huggingface/...
v2.3.1
Important changes
- Added support for Mllama (Llama 3.2 vision models): flash attention, unpadded
- FP8 performance improvements
- MoE performance improvements
- BREAKING CHANGE: when using tools, models could previously answer with a `notify_error` tool call containing the error; they will now output a regular generation instead (see the example request below)
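
To illustrate the breaking change, here is a hedged sketch of a tool-calling request against TGI's OpenAI-compatible chat route. The server URL and the tool definition are made up for the example; before this release a failed tool selection could come back as a `notify_error` tool call, afterwards the same request yields ordinary generated text.

```python
import requests

# Sketch: tool-calling request against a local TGI deployment (URL and tool are examples).
payload = {
    "model": "tgi",
    "messages": [{"role": "user", "content": "What is the weather in Paris today?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, for illustration only
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",
}
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=60)
message = resp.json()["choices"][0]["message"]
# Pre-2.3.1: a failure could surface as a `notify_error` tool call.
# Post-2.3.1: the model falls back to regular text in message["content"].
print(message.get("tool_calls") or message["content"])
```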
What's Changed
- nix: remove unused `_server.nix` file by @danieldk in #2538
- chore: Add old V2 backend by @OlivierDehaene in #2551
- Remove duplicated `RUN` in `Dockerfile` by @alvarobartt in #2547
- Micro cleanup. by @Narsil in #2555
- Hotfixing main by @Narsil in #2556
- Add support for scalar FP8 weight scales by @danieldk in #2550
- Add `DenseMoELayer` and wire it up in Mixtral/Deepseek V2 by @danieldk in #2537
- Update the link to the Ratatui organization by @orhun in #2546
- Simplify crossterm imports by @orhun in #2545
- Adding note for private models in quick-tour document by @ariG23498 in #2548
- Hotfixing main. by @Narsil in #2562
- Cleanup Vertex + Chat by @Narsil in #2553
- More tensor cores. by @Narsil in #2558
- remove LORA_ADAPTERS_PATH by @nbroad1881 in #2563
- Add LoRA adapters support for Gemma2 by @alvarobartt in #2567
- Fix build with `--features google` by @alvarobartt in #2566
- Improve support for GPUs with capability < 8 by @danieldk in #2575
- flashinfer: pass window size and dtype by @danieldk in #2574
- Remove compute capability lazy cell by @danieldk in #2580
- Update architecture.md by @ulhaqi12 in #2577
- Update ROCM libs and improvements by @mht-sharma in #2579
- Add support for GPTQ-quantized MoE models using MoE Marlin by @danieldk in #2557
- feat: support phi3.5 moe by @drbh in #2479
- Move flake back to tgi-nix `main` by @danieldk in #2586
- MoE Marlin: support `desc_act` for `groupsize != -1` by @danieldk in #2590
- nix: experimental support for building a Docker container by @danieldk in #2470
- Mllama flash version by @Narsil in #2585
- Max token capacity metric by @Narsil in #2595
- CI (2592): Allow LoRA adapter revision in server launcher by @drbh in #2602
- Unroll notify error into generate response by @drbh in #2597
- New release 2.3.1 by @Narsil in #2604
New Contributors
- @alvarobartt made their first contribution in #2547
- @orhun made their first contribution in #2546
- @ariG23498 made their first contribution in #2548
- @ulhaqi12 made their first contribution in #2577
- @mht-sharma made their first contribution in #2579
Full Changelog: v2.3.0...v2.3.1
v2.3.0
Important changes
- Renamed `HUGGINGFACE_HUB_CACHE` to `HF_HOME`. This is done to harmonize environment variables across the HF ecosystem, so data locations in the Docker image moved from `/data/models-...` to `/data/hub/models-...` (see the sketch below).
- Prefix caching by default! To help with long-running queries, TGI will use prefix caching and reuse pre-existing queries in the KV cache in order to speed up TTFT. This should be totally transparent for most users; however, it required an intense rewrite of the internals, so bugs can potentially exist. We also changed the kernels from paged_attention to `flashinfer` (with `flashdecoding` as a fallback for some specific models that aren't supported by flashinfer).
- Lots of performance improvements with Marlin and quantization.
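
A minimal sketch of what the cache-path change means in practice, assuming the usual Docker volume mount; the host path and model id are illustrative only, and the `/data/hub/models--...` layout is taken from the note above.

```python
import subprocess

# Sketch: same volume mount as before, but downloaded weights now land under /data/hub/.
subprocess.run(
    [
        "docker", "run", "--gpus", "all", "-p", "8080:80",
        "-v", "/opt/tgi-cache:/data",   # host path is just an example
        "ghcr.io/huggingface/text-generation-inference:2.3.0",
        "--model-id", "mistralai/Mistral-7B-Instruct-v0.3",  # example model only
    ],
    check=True,
)
# Before 2.3.0 the weights were cached under /data/models--mistralai--Mistral-7B-Instruct-v0.3;
# after the HF_HOME switch they live under /data/hub/models--mistralai--Mistral-7B-Instruct-v0.3.
```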
What's Changed
- chore: update to torch 2.4 by @OlivierDehaene in #2259
- fix crash in multi-modal by @sywangyi in #2245
- fix of use of unquantized weights in cohere GQA loading, also enable … by @sywangyi in #2291
- Split up `layers.marlin` into several files by @danieldk in #2292
- fix: refactor adapter weight loading and mapping by @drbh in #2193
- Using g6 instead of g5. by @Narsil in #2281
- Some small fixes for the Torch 2.4.0 update by @danieldk in #2304
- Fixing idefics on g6 tests. by @Narsil in #2306
- Fix registry name by @XciD in #2307
- Support tied embeddings in 0.5B and 1.5B Qwen2 models by @danieldk in #2313
- feat: add ruff and resolve issue by @drbh in #2262
- Run ci api key by @ErikKaum in #2315
- Install Marlin from standalone package by @danieldk in #2320
- fix: reject grammars without properties by @drbh in #2309
- patch-error-on-invalid-grammar by @ErikKaum in #2282
- fix: adjust test snapshots and small refactors by @drbh in #2323
- server quantize: store quantizer config in standard format by @danieldk in #2299
- Rebase TRT-llm by @Narsil in #2331
- Handle GPTQ-Marlin loading in `GPTQMarlinWeightLoader` by @danieldk in #2300
- Pr 2290 ci run by @drbh in #2329
- refactor usage stats by @ErikKaum in #2339
- enable HuggingFaceM4/idefics-9b in intel gpu by @sywangyi in #2338
- Fix cache block size for flash decoding by @danieldk in #2351
- Unify attention output handling by @danieldk in #2343
- fix: attempt forward on flash attn2 to check hardware support by @drbh in #2335
- feat: include local lora adapter loading docs by @drbh in #2359
- fix: return the out tensor rather then the functions return value by @drbh in #2361
- feat: implement a templated endpoint for visibility into chat requests by @drbh in #2333
- feat: prefer stop over eos_token to align with openai finish_reason by @drbh in #2344
- feat: return the generated text when parsing fails by @drbh in #2353
- fix: default num_ln_in_parallel_attn to one if not supplied by @drbh in #2364
- fix: prefer original layernorm names for 180B by @drbh in #2365
- fix: fix num_ln_in_parallel_attn attribute name typo in RWConfig by @almersawi in #2350
- add gptj modeling in TGI #2366 (CI RUN) by @drbh in #2372
- Fix the prefix for OPT model in opt_modelling.py #2370 (CI RUN) by @drbh in #2371
- Pr 2374 ci branch by @drbh in #2378
- fix EleutherAI/gpt-neox-20b does not work in tgi by @sywangyi in #2346
- Pr 2337 ci branch by @drbh in #2379
- fix: prefer hidden_activation over hidden_act in gemma2 by @drbh in #2381
- Update Quantization docs and minor doc fix. by @Vaibhavs10 in #2368
- Pr 2352 ci branch by @drbh in #2382
- Add FlashInfer support by @danieldk in #2354
- Add experimental flake by @danieldk in #2384
- Using HF_HOME instead of CACHE to get token read in addition to models. by @Narsil in #2288
- flake: add fmt and clippy by @danieldk in #2389
- Update documentation for Supported models by @Vaibhavs10 in #2386
- flake: use rust-overlay by @danieldk in #2390
- Using an enum for flash backens (paged/flashdecoding/flashinfer) by @Narsil in #2385
- feat: add guideline to chat request and template by @drbh in #2391
- Update flake for 9.0a capability in Torch by @danieldk in #2394
- nix: add router to the devshell by @danieldk in #2396
- Upgrade fbgemm by @Narsil in #2398
- Adding launcher to build. by @Narsil in #2397
- Fixing import exl2 by @Narsil in #2399
- Cpu dockerimage by @sywangyi in #2367
- Add support for prefix caching to the v3 router by @danieldk in #2392
- Keeping the benchmark somewhere by @Narsil in #2401
- feat: validate template variables before apply and improve sliding wi… by @drbh in #2403
- fix: allocate tmp based on sgmv kernel if available by @drbh in #2345
- fix: improve completions to send a final chunk with usage details by @drbh in #2336
- Updating the flake. by @Narsil in #2404
- Pr 2395 ci run by @drbh in #2406
- fix: include create_exllama_buffers and set_device for exllama by @drbh in #2407
- nix: incremental build of the launcher by @danieldk in #2410
- Adding more kernels to flake. by @Narsil in #2411
- add numa to improve cpu inference perf by @sywangyi in #2330
- fix: adds causal to attention params by @drbh in #2408
- nix: partial incremental build of the router by @danieldk in #2416
- Upgrading exl2. by @Narsil in #2415
- More fixes trtllm by @mfuntowicz in #2342
- nix: build router incrementally by @danieldk in #2422
- Fixing exl2 and other quanize tests again. by @Narsil in #2419
- Upgrading the tests to match the current workings. by @Narsil in #2423
- nix: try to reduce the number of Rust rebuilds by @danieldk in https://github.com/huggingface/text-generation-inference/pull/...
v2.2.0
Notable changes
- Llama 3.1 support (including 405B, FP8 support in a lot of mixed configurations, FP8, AWQ, GPTQ, FP8+FP16).
- Gemma2 softcap support
- Deepseek v2 support.
- Lots of internal reworks/cleanup (allowing for cool features)
- Lots of AWQ/GPTQ work with marlin kernels (everything should be faster by default)
- Flash decoding support (`FLASH_DECODING=1` environment variable), which will probably enable some nice improvements in the future; see the launch sketch below
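
A hedged launch sketch combining the points above: a GPTQ-quantized Llama 3.1 model with flash decoding opted in via `FLASH_DECODING=1` (the env var named in these notes). The model id is only an example, and `--quantize gptq` is the launcher option as I understand it, so verify against `text-generation-launcher --help`.

```python
import subprocess

# Sketch: GPTQ-quantized model with flash decoding enabled (flag and model hedged, see above).
subprocess.run(
    [
        "docker", "run", "--gpus", "all", "--shm-size", "1g", "-p", "8080:80",
        "-e", "FLASH_DECODING=1",                      # opt in to flash decoding
        "ghcr.io/huggingface/text-generation-inference:2.2.0",
        "--model-id", "hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4",  # example repo
        "--quantize", "gptq",                          # assumed quantization option
    ],
    check=True,
)
```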
What's Changed
- Preparing patch release. by @Narsil in #2186
- Adding "longrope" for Phi-3 (#2172) by @amihalik in #2179
- Refactor dead code - Removing all `flash_xxx.py` files. by @Narsil in #2166
- Fix Starcoder2 after refactor by @danieldk in #2189
- GPTQ CI improvements by @danieldk in #2151
- Consistently take `prefix` in model constructors by @danieldk in #2191
- fix dbrx & opt model prefix bug by @icyxp in #2201
- hotfix: Fix number of KV heads by @danieldk in #2202
- Fix incorrect cache allocation with multi-query by @danieldk in #2203
- Falcon/DBRX: get correct number of key-value heads by @danieldk in #2205
- add doc for intel gpus by @sywangyi in #2181
- fix: python deserialization by @jaluma in #2178
- update to metrics 0.23.0 or could work with metrics-exporter-promethe… by @sywangyi in #2190
- feat: use model name as adapter id in chat endpoints by @drbh in #2128
- Fix nccl regression on PyTorch 2.3 upgrade by @fxmarty in #2099
- Fix buildx cache + change runner type by @glegendre01 in #2176
- Fixed README ToC by @vinkamath in #2196
- Updating the self check by @Narsil in #2209
- Move quantized weight handling out of the `Weights` class by @danieldk in #2194
- Add support for FP8 on compute capability >=8.0, <8.9 by @danieldk in #2213
- fix: append DONE message to chat stream by @drbh in #2221
- [fix] Modifying base in yarn embedding by @SeongBeomLEE in #2212
- Use symmetric quantization in the `quantize` subcommand by @danieldk in #2120
- feat: simple mistral lora integration tests by @drbh in #2180
- fix custom cache dir by @ErikKaum in #2226
- fix: Remove bitsandbytes installation when running cpu-only install by @Hugoch in #2216
- Add support for AWQ-quantized Idefics2 by @danieldk in #2233
- `server quantize`: expose groupsize option by @danieldk in #2225
- Remove stray `quantize` argument in `get_weights_col_packed_qkv` by @danieldk in #2237
- fix(server): fix cohere by @OlivierDehaene in #2249
- Improve the handling of quantized weights by @danieldk in #2250
- Hotfix: fix of use of unquantized weights in Gemma GQA loading by @danieldk in #2255
- Hotfix: various GPT-based model fixes by @danieldk in #2256
- Hotfix: fix MPT after recent refactor by @danieldk in #2257
- Hotfix: pass through model revision in `VlmCausalLM` by @danieldk in #2258
- usage stats and crash reports by @ErikKaum in #2220
- add usage stats to toctree by @ErikKaum in #2260
- fix: adjust default tool choice by @drbh in #2244
- Add support for Deepseek V2 by @danieldk in #2224
- re-push to internal registry by @XciD in #2242
- Add FP8 release test by @danieldk in #2261
- feat(fp8): use fbgemm kernels and load fp8 weights directly by @OlivierDehaene in #2248
- fix(server): fix deepseekv2 loading by @OlivierDehaene in #2266
- Hotfix: fix of use of unquantized weights in Mixtral GQA loading by @icyxp in #2269
- legacy warning on text_generation client by @ErikKaum in #2271
- fix(ci): test new instances by @XciD in #2272
- fix(server): fix fp8 weight loading by @OlivierDehaene in #2268
- Softcapping for gemma2. by @Narsil in #2273
- use proper name for ci by @XciD in #2274
- Fixing mistral nemo. by @Narsil in #2276
- fix(l4): fix fp8 logic on l4 by @OlivierDehaene in #2277
- Add support for repacking AWQ weights for GPTQ-Marlin by @danieldk in #2278
- [WIP] Add support for Mistral-Nemo by supporting head_dim through config by @shaltielshmid in #2254
- Preparing for release. by @Narsil in #2285
- Add support for Llama 3 rotary embeddings by @danieldk in #2286
- hotfix: pin numpy by @danieldk in #2289
New Contributors
- @jaluma made their first contribution in #2178
- @vinkamath made their first contribution in #2196
- @ErikKaum made their first contribution in #2226
- @Hugoch made their first contribution in #2216
- @XciD made their first contribution in #2242
- @shaltielshmid made their first contribution in #2254
Full Changelog: v2.1.1...v2.2.0
v2.1.1
Main changes
- Bugfixes
- Added FlashDecoding support (beta): use `FLASH_DECODING=1` to run TGI with flash decoding (large speedups on long queries; see the example request below). #1940
- Use Marlin over GPTQ kernels for faster GPTQ inference #2111
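
Flash decoding mainly pays off on long prompts, so a representative request is simply a `/generate` call with a long input. A rough sketch; the server URL and prompt are placeholders.

```python
import requests

# Sketch: long-prompt request, the case where flash decoding helps most.
long_document = "lorem ipsum " * 2000  # stand-in for a long context
resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": f"Summarize the following document:\n{long_document}",
        "parameters": {"max_new_tokens": 200},
    },
    timeout=300,
)
print(resp.json()["generated_text"])
```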
What's Changed
- Fixing the CI to also run in release when it's a tag ? by @Narsil in #2138
- fix microsoft/Phi-3-mini-4k-instruct crash in batch.slots[batch.slot_… by @sywangyi in #2148
- Fixing clippy. by @Narsil in #2149
- fix: use weights from base_layer by @drbh in #2141
- feat: download lora adapter weights from launcher by @drbh in #2140
- Use GPTQ-Marlin for supported GPTQ configurations by @danieldk in #2111
- fix AttributeError: 'MixtralLayer' object has no attribute 'mlp' by @icyxp in #2123
- refine get xpu free memory/enable Qwen2/gemma2/gemma/phi in intel platform by @sywangyi in #2132
- fix: prefer serde structs over custom functions by @drbh in #2127
- Fixing test. by @Narsil in #2152
- GH router. by @Narsil in #2153
- Fixing baichuan override. by @Narsil in #2158
- [Major Change][Undecided yet] Move to FlashDecoding instead of PagedAttention kernel. by @Narsil in #1940
- Fixing graph capture for flash decoding. by @Narsil in #2163
- fix FlashDecoding change's regression in intel platform by @sywangyi in #2161
- fix: use the base layers weight in mistral rocm by @drbh in #2155
- Fixing rocm. by @Narsil in #2164
- Ci test by @glegendre01 in #2124
- Hotfixing qwen2 and starcoder2 (which also get clamping). by @Narsil in #2167
- feat: improve update_docs for openapi schema by @drbh in #2169
- Fixing the dockerfile warnings. by @Narsil in #2173
- Fixing missing `object` field for regular completions. by @Narsil in #2175
New Contributors
Full Changelog: v2.1.0...v2.1.1
v2.1.0
Notable changes
- New models: Gemma2
- Multi-LoRA adapters: you can now run multiple LoRAs on the same TGI deployment #2010 (see the sketch below)
- Faster GPTQ inference and Marlin support (up to 2x speedup)
- Reworked the entire scheduling logic (better block allocations, allowing further speedups in new releases)
- Lots of ROCm support and bugfixes
- Lots of new contributors! Thanks a lot for these contributions
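
Multi-LoRA is exercised per request by naming the adapter to apply. The sketch below assumes the adapters were registered at startup and that the request-level field is `adapter_id`; both the adapter names and that parameter name are assumptions to double-check against the multi-LoRA documentation.

```python
import requests

# Sketch: route two requests to different LoRA adapters on a single deployment.
# "adapter_id" and the adapter names are assumptions, used here for illustration only.
for adapter in ("predibase/customer_support", "predibase/dbpedia"):
    resp = requests.post(
        "http://localhost:8080/generate",
        json={
            "inputs": "Hello, which adapter answered this?",
            "parameters": {"max_new_tokens": 64, "adapter_id": adapter},
        },
        timeout=60,
    )
    print(adapter, "->", resp.json()["generated_text"])
```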
What's Changed
- OpenAI function calling compatible support by @phangiabao98 in #1888
- Fixing types. by @Narsil in #1906
- Types. by @Narsil in #1909
- Fixing signals. by @Narsil in #1910
- Removing some unused code. by @Narsil in #1915
- MI300 compatibility by @fxmarty in #1764
- Add TGI monitoring guide through Grafana and Prometheus by @fxmarty in #1908
- Update grafana template by @fxmarty in #1918
- Fix TunableOp bug by @fxmarty in #1920
- Fix TGI issues with ROCm by @fxmarty in #1921
- Fixing the download strategy for ibm-fms by @Narsil in #1917
- ROCm: make CK FA2 default instead of Triton by @fxmarty in #1924
- docs: Fix grafana dashboard url by @edwardzjl in #1925
- feat: include token in client test like server tests by @drbh in #1932
- Creating doc automatically for supported models. by @Narsil in #1929
- fix: use path inside of speculator config by @drbh in #1935
- feat: add train medusa head tutorial by @drbh in #1934
- reenable xpu for tgi by @sywangyi in #1939
- Fixing some legacy behavior (big swapout of serverless on legacy stuff). by @Narsil in #1937
- Add completion route to client and add stop parameter where it's missing by @thomas-schillaci in #1869
- Improving the logging system. by @Narsil in #1938
- Fixing codellama loads by using purely `AutoTokenizer`. by @Narsil in #1947
- Fix seeded output. by @Narsil in #1949
- Fix (flash) Gemma prefix and enable tests by @danieldk in #1950
- Fix GPTQ for models which do not have float16 at the default dtype (simpler) by @danieldk in #1953
- Processor config chat template by @drbh in #1954
- fix small typo and broken link by @MoritzLaurer in #1958
- Upgrade to Axum 0.7 and Hyper 1.0 (Breaking change: disabled ngrok tunneling). by @Narsil in #1959
- Fix (non-container) pytest stdout buffering-related lock-up by @danieldk in #1963
- Fixing the text part from tokenizer endpoint. by @Narsil in #1967
- feat: adjust attn weight loading logic by @drbh in #1975
- Add support for exl2-quantized models by @danieldk in #1965
- Update documentation version to 2.0.4 by @fxmarty in #1980
- Purely refactors paged/attention into `layers/attention` and make hardware differences more obvious with 1 file per hardware. by @Narsil in #1986
- Fixing exl2 scratch buffer. by @Narsil in #1990
- single char ` addition for docs by @nbroad1881 in #1989
- Fixing GPTQ imports. by @Narsil in #1994
- reable xpu, broken by gptq and setuptool upgrade by @sywangyi in #1988
- router: send the input as chunks to the backend by @danieldk in #1981
- Fix Phi-2 with `tp>1` by @danieldk in #2003
- fix: update triton implementation reference by @emmanuel-ferdman in #2002
- feat: add SchedulerV3 by @OlivierDehaene in #1996
- Support GPTQ models with column-packed up/gate tensor by @danieldk in #2006
- Making `make install` work better by default. by @Narsil in #2004
- Hotfixing `make install`. by @Narsil in #2008
- Do not initialize scratch space when there are no ExLlamaV2 layers by @danieldk in #2015
- feat: move allocation logic to rust by @OlivierDehaene in #1835
- Fixing rocm. by @Narsil in #2021
- Fix GPTQWeight import by @danieldk in #2020
- Update version on init.py to 0.7.0 by @andimarafioti in #2017
- Add support for Marlin-quantized models by @danieldk in #2014
- marlin: support tp>1 when group_size==-1 by @danieldk in #2032
- marlin: improve build by @danieldk in #2031
- Internal runner ? by @Narsil in #2023
- Xpu gqa by @sywangyi in #2013
- server: use chunked inputs by @danieldk in #1985
- ROCm and sliding windows fixes by @fxmarty in #2033
- Add Phi-3 medium support by @danieldk in #2039
- feat(ci): add trufflehog secrets detection by @McPatate in #2038
- fix(ci): remove unnecessary permissions by @McPatate in #2045
- Update LLMM1 bound by @fxmarty in #2050
- Support chat response format by @drbh in #2046
- fix(server): fix OPT implementation by @OlivierDehaene in #2061
- fix(layers): fix SuRotaryEmbedding by @OlivierDehaene in #2060
- PR #2049 CI run by @drbh in #2054
- implement Open Inference Protocol endpoints by @drbh in #1942
- Add support for GPTQ Marlin by @danieldk in #2052
- Update the link for qwen2 by @xianbaoqian in #2068
- Adding architecture document by @tengomucho in #2044
- Support different image sizes in prefill in VLMs by @danieldk in #2065
- Contributing guide & Code of Conduct by @LysandreJik in #2074
- fix build.rs watch files by @zirconium-n in #2072
- Set maximum grpc message receive size to 2GiB by @danieldk in #2075
- CI: Tailscale improvements by @glegendre01 in #2079
- CI: pass pre-commit hooks again by @danieldk in #2084
- feat: rotate tests ci token by @drbh in #2091
- Support exl2-quantized Qwen2 models by @danieldk in #2085
- Factor out sharding of packed tensors by @...
v2.0.4
Main changes
What's Changed
- OpenAI function calling compatible support by @phangiabao98 in #1888
- Fixing types. by @Narsil in #1906
- Types. by @Narsil in #1909
- Fixing signals. by @Narsil in #1910
- Removing some unused code. by @Narsil in #1915
- MI300 compatibility by @fxmarty in #1764
- Add TGI monitoring guide through Grafana and Prometheus by @fxmarty in #1908
- Update grafana template by @fxmarty in #1918
- Fix TunableOp bug by @fxmarty in #1920
- Fix TGI issues with ROCm by @fxmarty in #1921
- Fixing the download strategy for ibm-fms by @Narsil in #1917
- ROCm: make CK FA2 default instead of Triton by @fxmarty in #1924
- docs: Fix grafana dashboard url by @edwardzjl in #1925
- feat: include token in client test like server tests by @drbh in #1932
- Creating doc automatically for supported models. by @Narsil in #1929
- fix: use path inside of speculator config by @drbh in #1935
- feat: add train medusa head tutorial by @drbh in #1934
- reenable xpu for tgi by @sywangyi in #1939
- Fixing some legacy behavior (big swapout of serverless on legacy stuff). by @Narsil in #1937
- Add completion route to client and add stop parameter where it's missing by @thomas-schillaci in #1869
- Improving the logging system. by @Narsil in #1938
- Fixing codellama loads by using purely `AutoTokenizer`. by @Narsil in #1947
New Contributors
- @phangiabao98 made their first contribution in #1888
- @edwardzjl made their first contribution in #1925
- @thomas-schillaci made their first contribution in #1869
Full Changelog: v2.0.3...v2.0.4
v2.0.3
Important changes
- Add: Support for the Falcon2 by @Nilabhra in #1886
- New speculation method MLPSpeculator. by @JRosenkranz in #1865
- Pali gemma modeling by @drbh in #1895
What's Changed
- Fix: "Fixing" double BOS for mistral too. by @Narsil in #1843
- Adding scripts to prepare load data. by @Narsil in #1841
- Remove misleading warning (not that important nowadays anyway). by @Narsil in #1848
- feat: prefer huggingface_hub in docs and show image api by @drbh in #1844
- Updating Phi3 (long context). by @Narsil in #1849
- Add router name to /info endpoint by @Wauplin in #1854
- Upgrading to rust 1.78. by @Narsil in #1851
- update xpu docker image and use public ipex whel by @sywangyi in #1860
- Refactor layers. by @Narsil in #1866
- Granite support? by @Narsil in #1882
- Add: Support for the Falcon2 11B architecture by @Nilabhra in #1886
- MLPSpeculator. by @JRosenkranz in #1865
- Fixing truncation. by @Narsil in #1890
- Correct 'using guidance' link by @brandon-lockaby in #1892
- Add GPT-2 with flash attention by @danieldk in #1889
- Removing accepted ids in the regular info logs, downgrade to debug. by @Narsil in #1898
- feat: add deprecation warning to clients by @drbh in #1855
- [Bug Fix] Update torch import reference in bnb quantization by @DhruvSrikanth in #1902
- Pali gemma modeling by @drbh in #1895
New Contributors
- @Nilabhra made their first contribution in #1886
- @brandon-lockaby made their first contribution in #1892
- @danieldk made their first contribution in #1889
- @DhruvSrikanth made their first contribution in #1902
Full Changelog: v2.0.2...v2.0.3
v2.0.2
Tl;dr
- New models (idefics2, phi3)
- Cleaner VLM support in the OpenAI-compatible layer (see the example below)
- Upgraded to pytorch 2.3.0
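
With the cleaner VLM support, image inputs can be passed through the OpenAI-compatible chat route using the standard `image_url` content part. A hedged sketch, assuming an Idefics2 deployment; the server URL and image URL are placeholders.

```python
import requests

# Sketch: multimodal chat request through the OpenAI-compatible layer of a VLM deployment.
payload = {
    "model": "tgi",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},  # placeholder image
        ],
    }],
    "max_tokens": 64,
}
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])
```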
What's Changed
- Make `--cuda-graphs 0` work as expected (bis) by @fxmarty in #1768
- fix typos in docs and add small clarifications by @MoritzLaurer in #1790
- Add attribute descriptions for `GenerateParameters` by @Wauplin in #1798
- feat: allow null eos and bos tokens in config by @drbh in #1791
- Phi3 support by @Narsil in #1797
- Idefics2. by @Narsil in #1756
- fix: avoid frequency and repetition penalty on padding tokens by @drbh in #1765
- Adding support for `HF_HUB_OFFLINE` support in the router. by @Narsil in #1789
- feat: improve temperature logic in chat by @drbh in #1749
- Updating the benchmarks so everyone uses openai compat layer. by @Narsil in #1800
- Update guidance docs to reflect grammar support in API by @dr3s in #1775
- Use the generation config. by @Narsil in #1808
- 2nd round of benchmark modifications (tiny adjustements to avoid overloading the host). by @Narsil in #1816
- Adding new env variables for TPU backends. by @Narsil in #1755
- add intel xpu support for TGI by @sywangyi in #1475
- Blunder by @Narsil in #1815
- Fixing qwen2. by @Narsil in #1818
- Dummy CI run. by @Narsil in #1817
- Changing the waiting_served_ratio default (stack more aggressively by default). by @Narsil in #1820
- Better graceful shutdown. by @Narsil in #1827
- Add the missing `tool_prompt` parameter to Python client by @maziyarpanahi in #1825
- Small CI cleanup. by @Narsil in #1801
- Add reference to TPU support by @brandonroyal in #1760
- fix: use get_speculate to the number of layers by @OlivierDehaene in #1737
- feat: add how it works section by @drbh in #1773
- Fixing frequency penalty by @martinigoyanes in #1811
- feat: add vlm docs and simple examples by @drbh in #1812
- Handle images in chat api by @drbh in #1828
- chore: update torch by @OlivierDehaene in #1730
- (chore): torch 2.3.0 by @Narsil in #1833
New Contributors
- @MoritzLaurer made their first contribution in #1790
- @dr3s made their first contribution in #1775
- @maziyarpanahi made their first contribution in #1825
- @brandonroyal made their first contribution in #1760
- @martinigoyanes made their first contribution in #1811
Full Changelog: v2.0.1...v2.0.2