Is it possible for streaming llm to support the QWEN2 model? #1872
lai-serena started this conversation in General
My machine runs Ubuntu with 2 NVIDIA Tesla A100 40G GPUs. I use vLLM to run the Qwen2-7B-Instruct model, and it reaches an inference speed of about 34.74 tokens/s. I read in the paper that StreamingLLM can improve inference performance by 46%, so I would like to run StreamingLLM with Qwen2. After I ran this command:
trtllm-build --checkpoint_dir /Qwen2-7B-Instruct-checkpoint --output_dir /Qwen2-7B-Instruct-checkpoint-1gpu --gemm_plugin float16 --streamingllm enable
I got the result shown above. Another question: have you ever compared the inference performance of TensorRT-LLM with vLLM?
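For reference, this is roughly how I run the vLLM baseline (a minimal sketch assuming the OpenAI-compatible server entrypoint and tensor parallelism across both GPUs; my actual benchmark script may differ):

```
# vLLM baseline on 2 x A100 40G, sharding the model across both GPUs.
# Sampling settings and prompts are omitted, so treat this only as a sketch.
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2-7B-Instruct \
    --tensor-parallel-size 2 \
    --dtype float16
```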
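On the TensorRT-LLM side, my understanding (please correct me if wrong) is that after building with --streamingllm enable, the attention sink and sliding-window sizes still have to be set when the engine is run, e.g. with the shared examples/run.py script. The flag names, values, and paths below are only my assumption from the StreamingLLM documentation, not something I have verified:

```
# Assumed runtime step for StreamingLLM: keep a few "sink" tokens and a
# bounded attention window while generating with the built engine.
# Paths and values here are placeholders.
python3 examples/run.py \
    --engine_dir /Qwen2-7B-Instruct-checkpoint-1gpu \
    --tokenizer_dir ./Qwen2-7B-Instruct \
    --max_output_len 128 \
    --sink_token_length 4 \
    --max_attention_window_size 2048
```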