TGI does not support FP8 quantized models on ROCm #2654
Comments
Thanks for reporting! I updated the title to reflect that this issue only occurs on ROCm. It looks like we have to expand the shapes when dispatching to Torch scaled mm (for CUDA we don't use the Torch implementation but Marlin/fbgemm, depending on the compute capability).
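For reference, below is a minimal sketch of what a per-tensor-scaled FP8 matmul through `torch._scaled_mm` looks like. This is not TGI's actual code: `torch._scaled_mm` is a private PyTorch API whose signature and return type have changed across releases, the helper name and dtypes are illustrative, and the ROCm-specific scale-shape expansion mentioned above is only noted in a comment.

```python
import torch


def fp8_linear(x_fp8: torch.Tensor, w_fp8: torch.Tensor,
               x_scale: torch.Tensor, w_scale: torch.Tensor) -> torch.Tensor:
    """Illustrative FP8 GEMM via torch._scaled_mm (not TGI's implementation).

    x_fp8:  (M, K) activations in an FP8 dtype (e.g. float8_e4m3fn on CUDA,
            float8_e4m3fnuz on ROCm).
    w_fp8:  (N, K) weights in the same FP8 dtype.
    x_scale, w_scale: per-tensor dequantization scales as tensors.
    """
    # The second operand must be column-major, hence the transpose, and the
    # scales must be passed as tensors rather than Python floats. On ROCm the
    # backend can be stricter about the scale shapes, which is the "expand the
    # shapes" handling referred to in the comment above.
    return torch._scaled_mm(
        x_fp8,
        w_fp8.t(),
        scale_a=x_scale,
        scale_b=w_scale,
        out_dtype=torch.bfloat16,
    )
```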
Any chance you could try […]?
@danieldk Deployed TGI with […]
@danieldk I have deployed […]

TGI launch script: […]
System Info
TGI Docker Image: ghcr.io/huggingface/text-generation-inference:sha-11d7af7-rocm
MODEL: meta-llama/Llama-3.1-405B-Instruct-FP8
Hardware used:
Intel® Xeon® Platinum 8470 2G, 52C/104T, 16GT/s, 105M Cache, Turbo, HT (350W) [x2]
AMD MI300X GPU OAM 192GB 750W GPUs [x8]
64GB RDIMM, 4800MT/s Dual Rank [x32]
Hardware provided by: hotaisle
Deployed using: dstack
Information
Tasks
Reproduction
Steps to reproduce
Output: error_with_quantized_model.txt
Expected behavior
TGI should load and serve Llama-3.1-405B-Instruct-FP8 on ROCm.
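Once the server starts successfully, a simple generation request should confirm the expected behavior. A minimal sketch, assuming the TGI endpoint is reachable at http://localhost:8080 (the URL and prompt are illustrative, not taken from the report):

```python
from huggingface_hub import InferenceClient

# Endpoint URL is an assumption for illustration; adjust to your deployment.
client = InferenceClient("http://localhost:8080")
print(client.text_generation("What is FP8 quantization?", max_new_tokens=64))
```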