qserve is slower then awq int4 for llama2-7b on H100 #2509

Open
anaivebird opened this issue Nov 28, 2024 · 4 comments
Assignees
Labels
Performance Issue about performance number triaged Issue has been triaged by maintainers

Comments

@anaivebird

anaivebird commented Nov 28, 2024

System Info

  • GPU: NVIDIA H100 80G
  • TensorRT-LLM branch main
  • TensorRT-LLM commit: 535c9cc

performance results

qserve result:

Successful Request 359
Request_Gen_Token_Len 1024
Batch Size 64
Avg_Input_Token_Len 1737.53
Avg_Gen_Token_Len 1000.3
Elapse_Time (s) 226.188
Time_to_First_Token_AVG (s) 9.957
Time_to_First_Token_P99 (s) 30.965
Time_per_Output_Token_AVG (s) 0.029
Time_per_Output_Token_P99 (s) 0.03
Latency_P90 (s) 57.549
Latency_P95 (s) 58.187
Latency_P99 (s) 61.007
Latency_AVG (s) 34.043
Token QPS (token/s) 1587.65
Service QPS (req/s) 1.59

Successful Request 208
Request_Gen_Token_Len 1024
Batch Size 128
Avg_Input_Token_Len 1802.95
Avg_Gen_Token_Len 994.21
Elapse_Time (s) 135.085
Time_to_First_Token_AVG (s) 36.664
Time_to_First_Token_P99 (s) 62.527
Time_per_Output_Token_AVG (s) 0.028
Time_per_Output_Token_P99 (s) 0.045
Latency_P90 (s) 88.988
Latency_P95 (s) 90.888
Latency_P99 (s) 92.339
Latency_AVG (s) 33.051
Token QPS (token/s) 1530.85
Service QPS (req/s) 1.54

awq result:

Successful Request 369
Request_Gen_Token_Len 1024
Batch Size 64
Avg_Input_Token_Len 1726.56
Avg_Gen_Token_Len 952.3
Elapse_Time (s) 212.125
Time_to_First_Token_AVG (s) 8.244
Time_to_First_Token_P99 (s) 29.357
Time_per_Output_Token_AVG (s) 0.029
Time_per_Output_Token_P99 (s) 0.062
Latency_P90 (s) 53.352
Latency_P95 (s) 55.721
Latency_P99 (s) 58.419
Latency_AVG (s) 31.806
Token QPS (token/s) 1656.56
Service QPS (req/s) 1.74

Successful Request 177
Request_Gen_Token_Len 1024
Batch Size 128
Avg_Input_Token_Len 1804.7
Avg_Gen_Token_Len 931.08
Elapse_Time (s) 105.276
Time_to_First_Token_AVG (s) 30.793
Time_to_First_Token_P99 (s) 59.689
Time_per_Output_Token_AVG (s) 0.028
Time_per_Output_Token_P99 (s) 0.072
Latency_P90 (s) 72.126
Latency_P95 (s) 86.212
Latency_P99 (s) 88.854
Latency_AVG (s) 24.425
Token QPS (token/s) 1565.43
Service QPS (req/s) 1.68
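
The Token QPS and Service QPS figures above look derivable from the other reported fields. The sketch below assumes Token QPS = successful requests × average generated length / elapsed time, and Service QPS = successful requests / elapsed time. These formulas are inferred from the numbers, not taken from the benchmark tool's documentation, and `RUNS` and `derived_qps` are illustrative names.

```python
# Reported fields copied from the four benchmark runs above.
RUNS = {
    ("qserve", 64): dict(requests=359, avg_gen_len=1000.3, elapsed_s=226.188,
                         token_qps=1587.65, service_qps=1.59),
    ("qserve", 128): dict(requests=208, avg_gen_len=994.21, elapsed_s=135.085,
                          token_qps=1530.85, service_qps=1.54),
    ("awq", 64): dict(requests=369, avg_gen_len=952.3, elapsed_s=212.125,
                      token_qps=1656.56, service_qps=1.74),
    ("awq", 128): dict(requests=177, avg_gen_len=931.08, elapsed_s=105.276,
                       token_qps=1565.43, service_qps=1.68),
}


def derived_qps(requests, avg_gen_len, elapsed_s):
    """Return (token_qps, service_qps) under the assumed formulas."""
    return requests * avg_gen_len / elapsed_s, requests / elapsed_s


for (engine, batch), run in RUNS.items():
    tok, svc = derived_qps(run["requests"], run["avg_gen_len"], run["elapsed_s"])
    print(f"{engine:7s} bs={batch:<3d} "
          f"token_qps={tok:8.2f} (reported {run['token_qps']}) "
          f"service_qps={svc:.2f} (reported {run['service_qps']})")
```

If the assumed formula holds, Token QPS scales with the average generated length, so runs with different `Avg_Gen_Token_Len` (1000.3 for qserve versus 952.3 for awq at batch size 64) and different request counts are not strictly like-for-like.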

build commands:

# qserve engine build

git clone https://github.com/mit-han-lab/deepcompressor
cd deepcompressor
git checkout lmquant-v0.0.0-deprecated
export PATH="/root/miniconda3/bin:$PATH"
source activate base
conda env create -f environment.yml -n lmquant
conda activate lmquant
poetry install
cd /root/deepcompressor/projects/llm
nohup python -m lmquant.llm.run \
    configs/llm.yaml configs/qoq/g128.yaml \
    --model-name llama2-7b --model-path /root/llama2-7b \
    --smooth-xw-alpha 0 --smooth-xw-beta 1 \
    --smooth-yx-alpha 0.5 --smooth-yx-beta 0 \
    --save-model &


cd /app/tensorrt_llm/examples/llama
export TRTLLM_DISABLE_UNIFIED_CONVERTER=1
python convert_checkpoint.py --model_dir /root/llama2-7b \
                             --output_dir /root/trtllm-llama2-7b  \
                             --dtype float16  \
                             --quant_ckpt_path  /root/quant-llama2-7b \
                             --use_qserve  \
                             --per_group  \
                             --tp_size 1

trtllm-build --checkpoint_dir /root/trtllm-llama2-7b \
            --output_dir /root/engine-llama2-7b \
            --gemm_plugin auto


# awq int4 engine build

convert_script=../llama/convert_checkpoint.py
quantize_script=../quantization/quantize.py
model_dir=/root/llama2-7b
output_dir=/root/awq-llama2-7b
tp=1
python3 ../quantization/quantize.py --model_dir ${model_dir} \
                                   --dtype float16 \
                                   --qformat int4_awq \
                                   --awq_block_size 128 \
                                   --output_dir $output_dir/llama-checkpoint-awq-int4-${tp}gpu/ \
                                   --calib_size 128 \
                                   --batch_size 1 \
                                   --calib_max_seq_length 2048

trtllm-build --checkpoint_dir $output_dir/llama-checkpoint-awq-int4-${tp}gpu/ \
             --output_dir $output_dir/llama-trt-engine-awq-int4-${tp}gpu/ \
             --gemm_plugin float16 \
             --use_paged_context_fmha enable \
             --max_num_tokens 13120 \
             --max_seq_len 4096 \
             --max_batch_size 128
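
The two `trtllm-build` invocations above do not pass the same options. A minimal sketch for listing the differences follows; the flag values are copied from the commands above, path arguments are left out, and `diff_flags` is a hypothetical helper. Flags that are not passed fall back to `trtllm-build` defaults, which are not shown in this thread, and nothing here establishes whether any of these differences contributes to the measured gap.

```python
# Options passed to trtllm-build in the two command lines above
# (path arguments --checkpoint_dir and --output_dir omitted).
QSERVE_BUILD = {"--gemm_plugin": "auto"}
AWQ_BUILD = {
    "--gemm_plugin": "float16",
    "--use_paged_context_fmha": "enable",
    "--max_num_tokens": "13120",
    "--max_seq_len": "4096",
    "--max_batch_size": "128",
}


def diff_flags(a, b):
    """Return {flag: (value_in_a, value_in_b)} for flags whose values differ.

    A value of None means the flag was not passed in that command.
    """
    return {
        flag: (a.get(flag), b.get(flag))
        for flag in sorted(set(a) | set(b))
        if a.get(flag) != b.get(flag)
    }


for flag, (qserve_value, awq_value) in diff_flags(QSERVE_BUILD, AWQ_BUILD).items():
    print(f"{flag:26s} qserve={qserve_value!s:8s} awq={awq_value}")
```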

@anaivebird anaivebird changed the title qserve with tensorrt-llm is slower and awq int4 for llama2-7b qserve group 128 with tensorrt-llm is slower and awq int4 for llama2-7b Nov 28, 2024
@anaivebird anaivebird changed the title qserve group 128 with tensorrt-llm is slower and awq int4 for llama2-7b qserve is slower then awq int4 for llama2-7b on H100 Nov 29, 2024
@anaivebird
Author

anaivebird commented Nov 29, 2024

Both per-channel and per-group qserve are slower than awq:

| Batch size | qserve per group (token/s) | qserve per channel (token/s) | awq (token/s) |
| --- | --- | --- | --- |
| 4 | not tested | 514.54 | 602.91 |
| 64 | 1587.65 | 1675.41 | 1656.56 |
| 128 | 1530.85 | 1660.44 | 1565.43 |
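
To put the figures above on a common scale, the following sketch computes each qserve variant's Token QPS relative to awq. The values are copied from this comment; `TOKEN_QPS` and `relative_to_awq` are illustrative names, and a negative percentage means slower than awq.

```python
# Token QPS (token/s) by batch size, copied from the comment above.
# None marks the configuration that was not tested.
TOKEN_QPS = {
    4: {"qserve_per_group": None, "qserve_per_channel": 514.54, "awq": 602.91},
    64: {"qserve_per_group": 1587.65, "qserve_per_channel": 1675.41, "awq": 1656.56},
    128: {"qserve_per_group": 1530.85, "qserve_per_channel": 1660.44, "awq": 1565.43},
}


def relative_to_awq(row):
    """Percent difference of each qserve variant versus awq (negative = slower)."""
    return {
        name: None if value is None else 100.0 * (value / row["awq"] - 1.0)
        for name, value in row.items()
        if name != "awq"
    }


for batch_size, row in TOKEN_QPS.items():
    for name, pct in relative_to_awq(row).items():
        shown = "not tested" if pct is None else f"{pct:+.2f}%"
        print(f"bs={batch_size:<3d} {name:18s} {shown}")
```

On these numbers, per-group qserve trails awq by roughly 4% at batch size 64 and 2% at batch size 128, while per-channel qserve trails by roughly 15% at batch size 4 but is slightly ahead at batch sizes 64 and 128.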

@bobboli
Collaborator

bobboli commented Dec 2, 2024

Hi,
Currently, the QServe kernels do not fully utilize the hardware features of the Hopper architecture. You could try Ampere or Ada cards if available.

@hello-11 hello-11 added the Performance Issue about performance number label Dec 2, 2024
@hello-11 hello-11 added the triaged Issue has been triaged by maintainers label Dec 10, 2024
@KKwanhee

KKwanhee commented Jan 4, 2025

I compared qserve with AWQ on small batch sizes, and qserve is still slower even on an A100 GPU. Is this because it hasn’t been fully optimized yet?

@bobboli
Collaborator

bobboli commented Jan 6, 2025

> I compared qserve with AWQ on small batch sizes, and qserve is still slower even on an A100 GPU. Is this because it hasn’t been fully optimized yet?

At small batch sizes, throughput is mainly determined by the bit-width of the weights, so QServe should be close to AWQ there. QServe (w4a8) demonstrates advantages over AWQ (w4a16) in Time-To-First-Token, as well as in throughput at large batch sizes.
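
The bit-width argument can be illustrated with a back-of-the-envelope estimate. The sketch below assumes that a small-batch decode step is bounded by streaming the weights from GPU memory once, and it ignores the KV cache, activations, quantization scales, and kernel efficiency. `N_PARAMS` is the nominal "7B" count and `BANDWIDTH` is an illustrative round number, not a quoted hardware spec; all names are hypothetical.

```python
def weight_bytes(n_params: float, weight_bits: int) -> float:
    """Bytes needed to store n_params weights at the given bit-width."""
    return n_params * weight_bits / 8


def min_decode_step_s(n_params: float, weight_bits: int,
                      mem_bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one decode step if it only streamed the weights once."""
    return weight_bytes(n_params, weight_bits) / mem_bandwidth_bytes_per_s


N_PARAMS = 7e9      # assumption: nominal parameter count of a "7B" model
BANDWIDTH = 2.0e12  # assumption: illustrative memory bandwidth in bytes/s

# Weight bit-width per scheme; activation width does not enter this bound.
SCHEMES = {"fp16 (w16a16)": 16, "awq (w4a16)": 4, "qserve (w4a8)": 4}

for name, bits in SCHEMES.items():
    gib = weight_bytes(N_PARAMS, bits) / 2**30
    ms = 1e3 * min_decode_step_s(N_PARAMS, bits, BANDWIDTH)
    print(f"{name:14s} weights={gib:6.2f} GiB  min step={ms:.2f} ms")
```

Under these assumptions the w4a16 and w4a8 bounds are identical because both schemes stream 4-bit weights; the 8-bit activations only pay off when the workload is compute-bound, such as prefill or large batches.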
