Redrafter fp8 support #2607

Closed

darraghdog opened this issue Dec 22, 2024 · 1 comment

darraghdog commented Dec 22, 2024

I am deploying Qwen/QwQ-32B-Preview in a 4 x L4 (24 GB per card) environment. I have an fp8 quantised model (I used the llmapi to quantise it) which fits in about 32 GB, leaving the remaining memory for the KV cache. I see ReDrafter supports fp8 in the support matrix.

I have an fp32 redrafter which was trained on the bf16 version of the base model. I would like to convert the quantised fp8 base model (modelopt format) and the fp32 redrafter together. However, I see that the convert script only accepts a base model in fp16/fp32/bf16 (link). A bf16 base model would allocate too much memory and leave little remaining for the KV cache.
I am wondering whether it should be possible to make this work with an fp8 base model (already quantised to fp8 as below); I am happy to modify the conversion script as needed, and a rough sketch of the idea is below.
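The kind of change I have in mind, purely as an illustration (this is not the actual convert script; the paths and the assumption that the drafter weights sit in a single safetensors file are made up), is to keep the already-quantised fp8 base checkpoint untouched and only cast the fp32 drafter weights to the base checkpoint's unquantised weight dtype before merging:

    # Hypothetical sketch only - not the real convert script. Assumes the drafter
    # weights are stored in one safetensors file; the fp8 base checkpoint is left as-is.
    import json
    import torch
    from safetensors.torch import load_file, save_file

    base_dir = "/workspace/trtllm/Qwen-QwQ-32B-Preview_FP8_KVFP8_tp4"
    with open(f"{base_dir}/config.json") as f:
        base_cfg = json.load(f)

    # "dtype" in the modelopt config is the dtype of the unquantised tensors
    # (float16 here), which is what the drafter weights should match.
    target_dtype = {"float16": torch.float16, "bfloat16": torch.bfloat16}[base_cfg["dtype"]]

    drafter = load_file("/workspace/trtllm/redrafter_fp32/model.safetensors")
    drafter = {name: tensor.to(target_dtype) for name, tensor in drafter.items()}
    save_file(drafter, "/workspace/trtllm/redrafter_cast/model.safetensors")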

Quantisation params of base model

# modified from https://nvidia.github.io/TensorRT-LLM/llm-api-examples/llm_quantization.html 
python3 scripts/quant_llm_api_dist_01.py \
    --model_in_path "/workspace/trtllm/Qwen-QwQ-32B-Preview/" \
    --model_out_path "/workspace/trtllm/Qwen-QwQ-32B-Preview_FP8_KVFP8_tp4/" \
    --quant_algo "FP8" --tp_size 4 --calib_dataset "demo/qwq_cot" \
    --calib_batches 512 --calib_seq_length 1024 --max_batch_size 8 --fp8_kv_cache 
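For context, the script above is a thin wrapper around the LLM API quantisation example from the linked docs. Roughly, it does the following (QuantConfig/CalibConfig argument names are taken from that example; my wrapper's internals may differ slightly):

    # Rough equivalent of what quant_llm_api_dist_01.py does via the LLM API.
    from tensorrt_llm import LLM
    from tensorrt_llm.llmapi import CalibConfig, QuantAlgo, QuantConfig

    quant_config = QuantConfig(
        quant_algo=QuantAlgo.FP8,
        kv_cache_quant_algo=QuantAlgo.FP8,   # --fp8_kv_cache
    )
    calib_config = CalibConfig(
        calib_dataset="demo/qwq_cot",        # --calib_dataset
        calib_batches=512,                   # --calib_batches
        calib_max_seq_length=1024,           # --calib_seq_length
    )
    llm = LLM(
        model="/workspace/trtllm/Qwen-QwQ-32B-Preview/",
        tensor_parallel_size=4,              # --tp_size 4
        quant_config=quant_config,
        calib_config=calib_config,
    )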

fp8 base model config

{
    "producer": {
        "name": "modelopt",
        "version": "0.19.0"
    },
    "architecture": "QWenForCausalLM",
    "dtype": "float16",
    "logits_dtype": "float16",
    "num_hidden_layers": 64,
    "num_attention_heads": 40,
    "num_key_value_heads": 8,
    "hidden_size": 5120,
    "norm_epsilon": 1e-05,
    "vocab_size": 152064,
    "max_position_embeddings": 32768,
    "hidden_act": "silu",
    "use_parallel_embedding": true,
    "embedding_sharding_dim": 0,
    "quantization": {
        "quant_algo": "FP8",
        "kv_cache_quant_algo": "FP8",
        "exclude_modules": [ ....
darraghdog (Author) commented

Closing this, as it's working OK now.
