Is it possible for streaming llm to support the QWEN2 model? #1872
lai-serena started this conversation in General
My machine runs Ubuntu with 2 NVIDIA Tesla A100 40G GPUs. I use vLLM to run the Qwen2-7B-Instruct model, and it reaches an inference speed of about 34.74 tokens/s. I read in the paper that StreamingLLM can improve inference performance by 46%, so I would like to run StreamingLLM with Qwen2. After I ran this command:
trtllm-build --checkpoint_dir /Qwen2-7B-Instruct-checkpoint --output_dir /Qwen2-7B-Instruct-checkpoint-1gpu --gemm_plugin float16 --streamingllm enable
I got the result shown above. Another question: have you ever compared the inference performance of TensorRT-LLM with vLLM?
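For reference, this is roughly how I run the vLLM baseline (a minimal sketch assuming the OpenAI-compatible server entrypoint and tensor parallelism across both GPUs; my actual benchmark script may differ):

```
# vLLM baseline on 2 x A100 40G, sharding the model across both GPUs.
# Sampling settings and prompts are omitted, so treat this only as a sketch.
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2-7B-Instruct \
    --tensor-parallel-size 2 \
    --dtype float16
```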
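On the TensorRT-LLM side, my understanding (please correct me if wrong) is that after building with --streamingllm enable, the attention sink and sliding-window sizes still have to be set when the engine is run, e.g. with the shared examples/run.py script. The flag names, values, and paths below are only my assumption from the StreamingLLM documentation, not something I have verified:

```
# Assumed runtime step for StreamingLLM: keep a few "sink" tokens and a
# bounded attention window while generating with the built engine.
# Paths and values here are placeholders.
python3 examples/run.py \
    --engine_dir /Qwen2-7B-Instruct-checkpoint-1gpu \
    --tokenizer_dir ./Qwen2-7B-Instruct \
    --max_output_len 128 \
    --sink_token_length 4 \
    --max_attention_window_size 2048
```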