## Notable changes
- Experimental prefill chunking (`PREFILL_CHUNKING=1`)
- Experimental FP8 KV cache support
- Greatly decrease latency for large batches (> 128 requests)
- Faster MoE kernels and support for GPTQ-quantized MoE
- Faster implementation of MLLama
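
The two experimental features above are opt-in. Below is a minimal sketch of launching the server with them enabled: `PREFILL_CHUNKING=1` is the environment variable named in these notes and `text-generation-launcher` is the TGI launcher binary, while the `--kv-cache-dtype fp8_e4m3fn` flag and the model id are assumptions used here for illustration, not confirmed by this changelog.

```python
import os
import subprocess

# Copy the current environment and enable experimental prefill chunking.
env = os.environ.copy()
env["PREFILL_CHUNKING"] = "1"  # named in the release notes above

# Launch the TGI launcher; --kv-cache-dtype is an assumed flag for the
# experimental FP8 KV cache support, and the model id is only an example.
subprocess.run(
    [
        "text-generation-launcher",
        "--model-id", "meta-llama/Llama-3.1-8B-Instruct",
        "--kv-cache-dtype", "fp8_e4m3fn",
    ],
    env=env,
    check=True,
)
```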
## What's Changed
- nix: remove unused `_server.nix` file by @danieldk in #2538
- chore: Add old V2 backend by @OlivierDehaene in #2551
- Remove duplicated `RUN` in `Dockerfile` by @alvarobartt in #2547
- Micro cleanup. by @Narsil in #2555
- Hotfixing main by @Narsil in #2556
- Add support for scalar FP8 weight scales by @danieldk in #2550
- Add `DenseMoELayer` and wire it up in Mixtral/Deepseek V2 by @danieldk in #2537
- Update the link to the Ratatui organization by @orhun in #2546
- Simplify crossterm imports by @orhun in #2545
- Adding note for private models in quick-tour document by @ariG23498 in #2548
- Hotfixing main. by @Narsil in #2562
- Cleanup Vertex + Chat by @Narsil in #2553
- More tensor cores. by @Narsil in #2558
- remove LORA_ADAPTERS_PATH by @nbroad1881 in #2563
- Add LoRA adapters support for Gemma2 by @alvarobartt in #2567
- Fix build with `--features google` by @alvarobartt in #2566
- Improve support for GPUs with capability < 8 by @danieldk in #2575
- flashinfer: pass window size and dtype by @danieldk in #2574
- Remove compute capability lazy cell by @danieldk in #2580
- Update architecture.md by @ulhaqi12 in #2577
- Update ROCM libs and improvements by @mht-sharma in #2579
- Add support for GPTQ-quantized MoE models using MoE Marlin by @danieldk in #2557
- feat: support phi3.5 moe by @drbh in #2479
- Move flake back to tgi-nix `main` by @danieldk in #2586
- MoE Marlin: support `desc_act` for `groupsize != -1` by @danieldk in #2590
- nix: experimental support for building a Docker container by @danieldk in #2470
- Mllama flash version by @Narsil in #2585
- Max token capacity metric by @Narsil in #2595
- CI (2592): Allow LoRA adapter revision in server launcher by @drbh in #2602
- Unroll notify error into generate response by @drbh in #2597
- New release 2.3.1 by @Narsil in #2604
- Revert "Unroll notify error into generate response" by @drbh in #2605
- nix: example of local package overrides during development by @danieldk in #2607
- Add basic FP8 KV cache support by @danieldk in #2603
- Fp8 Cache condition by @flozi00 in #2611
- enable mllama in intel platform by @sywangyi in #2610
- Upgrade minor rust version (Fixes rust build compilation cache) by @Narsil in #2617
- Add support for fused MoE Marlin for AWQ by @danieldk in #2616
- nix: move back to the tgi-nix main branch by @danieldk in #2620
- CI (2599): Update ToolType input schema by @drbh in #2601
- nix: add black and isort to the closure by @danieldk in #2619
- AMD CI by @Narsil in #2589
- feat: allow tool calling to respond without a tool by @drbh in #2614
- Update documentation to most recent stable version of TGI. by @Vaibhavs10 in #2625
- Intel ci by @Narsil in #2630
- Fixing intel Supports windowing. by @Narsil in #2637
- Small fixes for supported models by @osanseviero in #2471
- Cpu perf by @Narsil in #2596
- Clarify gated description and quicktour by @osanseviero in #2631
- update ipex to fix incorrect output of mllama in cpu by @sywangyi in #2640
- feat: enable pytorch xpu support for non-attention models by @dvrogozh in #2561
- Fixing linters. by @Narsil in #2650
- Rollback to `ChatRequest` for Vertex AI Chat instead of `VertexChat` by @alvarobartt in #2651
- Fp8 e4m3_fnuz support for rocm by @mht-sharma in #2588
- feat: prefill chunking by @OlivierDehaene in #2600
- Support `e4m3fn` KV cache by @danieldk in #2655
- Simplify the `attention` function by @danieldk in #2609
- fix tgi-entrypoint wrapper in docker file: exec instead of spawning a child process by @oOraph in #2663
- fix: prefer inplace softmax to avoid copy by @drbh in #2661
- Break cycle between the attention implementations and KV cache by @danieldk in #2627
- CI job. Gpt awq 4 by @Narsil in #2665
- Make handling of FP8 scales more consistent by @danieldk in #2666
- Test Marlin MoE with `desc_act=true` by @danieldk in #2622
- break when there's nothing to read by @sywangyi in #2582
- Add `impureWithCuda` dev shell by @danieldk in #2677
- Make moe-kernels and marlin-kernels mandatory in CUDA installs by @danieldk in #2632
- feat: natively support Granite models by @OlivierDehaene in #2682
- feat: allow any supported payload on /invocations by @OlivierDehaene in #2683
- flashinfer: reminder to remove contiguous call in the future by @danieldk in #2685
- Fix Phi 3.5 MoE tests by @danieldk in #2684
- Add support for FP8 KV cache scales by @danieldk in #2628
- Fixing "deadlock" when python prompts for trust_remote_code by always by @Narsil in #2664
- [TENSORRT-LLM] - Implement new looper thread based backend by @mfuntowicz in #2357
- Fixing rocm gptq by using triton code too (renamed cuda into triton). by @Narsil in #2691
- Fixing mt0 test. by @Narsil in #2692
- Add support for stop words in TRTLLM by @mfuntowicz in #2678
- Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels by @danieldk in #2688
## New Contributors
- @alvarobartt made their first contribution in #2547
- @orhun made their first contribution in #2546
- @ariG23498 made their first contribution in #2548
- @ulhaqi12 made their first contribution in #2577
- @mht-sharma made their first contribution in #2579
- @dvrogozh made their first contribution in #2561
**Full Changelog**: v2.3.0...v2.4