TGI does not support FP8 quantized models on ROCm #2654

Open
1 of 4 tasks
Bihan opened this issue Oct 16, 2024 · 5 comments


Bihan commented Oct 16, 2024

System Info

TGI Docker Image: ghcr.io/huggingface/text-generation-inference:sha-11d7af7-rocm
MODEL: meta-llama/Llama-3.1-405B-Instruct-FP8

Hardware used:
Intel® Xeon® Platinum 8470 2G, 52C/104T, 16GT/s, 105M Cache, Turbo, HT (350W) [x2]
AMD MI300X GPU OAM 192GB 750W GPUs [x8]
64GB RDIMM, 4800MT/s Dual Rank [x32]

Hardware provided by: hotaisle

Deployed using: dstack

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Steps to reproduce

  1. Provision the machine using the above-mentioned Docker image.
  2. Run text-generation-launcher --port 8000 --num-shard 8 --sharded true

Output
error_with_quantized_model.txt

Expected behavior

meta-llama/Llama-3.1-405B-Instruct-FP8 should load and serve requests.

danieldk changed the title from "TGI does not support FP8 quantized models" to "TGI does not support FP8 quantized models on ROCm" on Oct 17, 2024
danieldk (Member) commented:

Thanks for reporting! I updated the title to reflect that this issue only occurs on ROCm. It looks like we have to expand the shapes when dispatching to Torch scaled mm (for CUDA we don't use the Torch implementation but Marlin/fbgemm depending on the compute capability).
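
For illustration only, a minimal sketch of the shape-expansion idea described above. This is not TGI's actual code: the helper name and the 16-element alignment requirement are assumptions, and torch._scaled_mm's exact signature varies between PyTorch releases, so the call itself is elided.

    import torch

    def pad_last_dim_fp8(t: torch.Tensor, multiple: int = 16) -> torch.Tensor:
        """Zero-pad the last dimension of an FP8 tensor to a multiple of `multiple`.

        Hypothetical helper: the idea is to expand both matmul operands along
        their shared dimension before dispatching to torch._scaled_mm, which
        rejects unaligned shapes. FP8 kernel coverage is limited, so the
        padding goes through a uint8 view (both dtypes are one byte per
        element, and a zero byte decodes to 0.0).
        """
        pad = (-t.shape[-1]) % multiple
        if pad == 0:
            return t
        padded = torch.nn.functional.pad(t.view(torch.uint8), (0, pad))
        return padded.view(t.dtype)

    # Both operands would be padded like this before calling torch._scaled_mm;
    # the exact call is omitted because its signature differs across PyTorch
    # versions.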

danieldk (Member) commented:

Any chance you could try docker pull ghcr.io/huggingface/text-generation-inference:latest-rocm? ROCm FP8 support was improved yesterday:

#2588


Bihan commented Oct 17, 2024

> Any chance you could try docker pull ghcr.io/huggingface/text-generation-inference:latest-rocm? ROCm FP8 support was improved yesterday:
>
> #2588

@danieldk Yes sure.


Bihan commented Oct 17, 2024

@danieldk Deployed TGI with neuralmagic/Meta-Llama-3-70B-Instruct-FP8 and it worked.


tjtanaa commented Oct 22, 2024

@danieldk I have deployed meta-llama/Llama-3.1-405B-Instruct-FP8; however, when I send a lot of load, I get the following error:

 ERROR text_generation_router::server: router/src/server.rs:638: Incomplete generation stream

TGI launch script:

ROCM_USE_FLASH_ATTN_V2_TRITON=false TRUST_REMOTE_CODE=true text-generation-launcher --port 8000 --num-shard 8 --sharded true --max-concurrent-requests 1024 --max-total-tokens 131072 --max-input-tokens 131000 --model-id /app/model/models--meta-llama--Llama-3.1-405B-Instruct-FP8/snapshots/64a54b704768dfd589a3e4ac05d546052f67f4fd/ 
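
For reference, a rough sketch of the kind of concurrent load that triggers the error. Assumptions: the server is reachable at localhost:8000, requests go to the standard TGI /generate endpoint, and the concurrency and token counts are arbitrary values chosen for illustration.

    import asyncio
    import httpx

    async def one_request(client: httpx.AsyncClient, i: int):
        # Fire a single generation request; many of these in flight at once is
        # roughly the load pattern that surfaced "Incomplete generation stream".
        resp = await client.post(
            "http://localhost:8000/generate",
            json={
                "inputs": f"Request {i}: summarize the history of GPUs.",
                "parameters": {"max_new_tokens": 512},
            },
            timeout=600,
        )
        return resp.status_code

    async def main(concurrency: int = 512):
        async with httpx.AsyncClient() as client:
            results = await asyncio.gather(
                *(one_request(client, i) for i in range(concurrency)),
                return_exceptions=True,
            )
        print(results)

    asyncio.run(main())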
