TGI does not support FP8 quantized models on ROCm #2654

Open
1 of 4 tasks
Bihan opened this issue Oct 16, 2024 · 5 comments


Bihan commented Oct 16, 2024

System Info

TGI Docker Image: ghcr.io/huggingface/text-generation-inference:sha-11d7af7-rocm
MODEL: meta-llama/Llama-3.1-405B-Instruct-FP8

Hardware used:
Intel® Xeon® Platinum 8470 2G, 52C/104T, 16GT/s, 105M Cache, Turbo, HT (350W) [x2]
AMD MI300X GPU OAM 192GB 750W GPUs [x8]
64GB RDIMM, 4800MT/s Dual Rank [x32]

Hardware provided by: hotaisle

Deployed using: dstack

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Steps to reproduce

  1. Provision the machine using the above-mentioned Docker image.
  2. Run text-generation-launcher --port 8000 --num-shard 8 --sharded true

Output
error_with_quantized_model.txt

Expected behavior

meta-llama/Llama-3.1-405B-Instruct-FP8 should load and serve requests.

danieldk changed the title from "TGI does not support FP8 quantized models" to "TGI does not support FP8 quantized models on ROCm" on Oct 17, 2024
danieldk (Member) commented:

Thanks for reporting! I updated the title to reflect that this issue only occurs on ROCm. It looks like we have to expand the shapes when dispatching to Torch scaled mm (for CUDA we don't use the Torch implementation but Marlin/fbgemm depending on the compute capability).
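
For illustration only, a minimal sketch of the shape-expansion idea described above. This is not TGI's actual code: the helper name and the 16-element alignment requirement are assumptions, and torch._scaled_mm's exact signature varies between PyTorch releases, so the call itself is elided.

    import torch

    def pad_last_dim_fp8(t: torch.Tensor, multiple: int = 16) -> torch.Tensor:
        """Zero-pad the last dimension of an FP8 tensor to a multiple of `multiple`.

        Hypothetical helper: the idea is to expand both matmul operands along
        their shared dimension before dispatching to torch._scaled_mm, which
        rejects unaligned shapes. FP8 kernel coverage is limited, so the
        padding goes through a uint8 view (both dtypes are one byte per
        element, and a zero byte decodes to 0.0).
        """
        pad = (-t.shape[-1]) % multiple
        if pad == 0:
            return t
        padded = torch.nn.functional.pad(t.view(torch.uint8), (0, pad))
        return padded.view(t.dtype)

    # Both operands would be padded like this before calling torch._scaled_mm;
    # the exact call is omitted because its signature differs across PyTorch
    # versions.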

danieldk (Member) commented:

Any chance you could try docker pull ghcr.io/huggingface/text-generation-inference:latest-rocm? ROCm FP8 support was improved yesterday:

#2588


Bihan commented Oct 17, 2024

> Any chance you could try docker pull ghcr.io/huggingface/text-generation-inference:latest-rocm? ROCm FP8 support was improved yesterday:
>
> #2588

@danieldk Yes sure.


Bihan commented Oct 17, 2024

@danieldk Deployed TGI with neuralmagic/Meta-Llama-3-70B-Instruct-FP8 and it worked.


tjtanaa commented Oct 22, 2024

@danieldk I have deployed meta-llama/Llama-3.1-405B-Instruct-FP8; however, when I send a lot of load, I get the following error:

 ERROR text_generation_router::server: router/src/server.rs:638: Incomplete generation stream

TGI launch script:

ROCM_USE_FLASH_ATTN_V2_TRITON=false TRUST_REMOTE_CODE=true text-generation-launcher --port 8000 --num-shard 8 --sharded true --max-concurrent-requests 1024 --max-total-tokens 131072 --max-input-tokens 131000 --model-id /app/model/models--meta-llama--Llama-3.1-405B-Instruct-FP8/snapshots/64a54b704768dfd589a3e4ac05d546052f67f4fd/ 
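
For reference, a rough sketch of the kind of concurrent load that triggers the error. Assumptions: the server is reachable at localhost:8000, requests go to the standard TGI /generate endpoint, and the concurrency and token counts are arbitrary values chosen for illustration.

    import asyncio
    import httpx

    async def one_request(client: httpx.AsyncClient, i: int):
        # Fire a single generation request; many of these in flight at once is
        # roughly the load pattern that surfaced "Incomplete generation stream".
        resp = await client.post(
            "http://localhost:8000/generate",
            json={
                "inputs": f"Request {i}: summarize the history of GPUs.",
                "parameters": {"max_new_tokens": 512},
            },
            timeout=600,
        )
        return resp.status_code

    async def main(concurrency: int = 512):
        async with httpx.AsyncClient() as client:
            results = await asyncio.gather(
                *(one_request(client, i) for i in range(concurrency)),
                return_exceptions=True,
            )
        print(results)

    asyncio.run(main())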
