support Qwen2-VL #2183
Comments
Hi, where can I find the PR files for Qwen2-VL used by TensorRT-LLM?
Any updates?
Hi, the work is in progress; I'll update it ASAP.
Any updates?
Any updates?
Any updates?
@sunnyqgg Is there a clear timeline for completing the model? Thanks.
Hi all, the code is under review and almost done; it'll be public soon.
Hi, is there any update yet?
Any updates?
How is the progress?
It's supported; please see examples/multimodal for more info.
Hi, Qwen2-VL can run successfully, but compared to directly importing transformers, there is no significant improvement in latency or GPU memory usage. Is this expected?
I've encountered the same situation. For the Qwen2-VL 2B model, TRT-LLM is more than twice as slow as vLLM.
Hi @LugerW-A
@sunnyqgg thanks for your contribution!
@kaiyux Hi, could you help make these changes public? Thanks a lot.
@sunnyqgg @kaiyux Hi, currently tensorrtllm_backend does not support the Qwen2-VL model. Is there a solution for this? Or can you tell us how to add support to tensorrtllm_backend? Thanks!
Any updates this week, @sunnyqgg?
Hi, Thanks.
@sunnyqgg Do you mean tensorrtllm_backend already supports the Qwen2-VL model? Is that tensorrtllm_backend v0.15.0?
Hi @fan-niu, Thanks.
@sunnyqgg I found reduced accuracy and output errors with two images on Qwen2-VL-7B, but a single image is fine. message:
hf output:
trtllm output:
The VisionAttentionOpt code can be changed as follows; then multi-image requests work correctly. However, I found that performance is lower than vLLM under many concurrent requests, while at 1~2 concurrency it is better than vLLM.
Hi, please use the latest code, which became public today. For multi-batch accuracy, please change the attention_mask_vit in tensorrt_llm/runtime/multimodal_model_runner.py.
Please let me know if there are any other issues. Thanks.
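For readers hitting the same multi-image accuracy problem, here is a minimal sketch of what a per-image (block-diagonal) vision attention mask looks like; it mirrors the cu_seqlens-based masking in the Hugging Face Qwen2-VL vision tower. The function name and shapes below are illustrative assumptions, not the exact attention_mask_vit patch.

```python
import torch

# Sketch (assumed fix, not the exact patch): build a block-diagonal mask
# from cumulative per-image patch counts (cu_seqlens), so patches of one
# image never attend to patches of another image in the same batch.
def build_vit_attention_mask(cu_seqlens, seq_length):
    mask = torch.zeros(1, seq_length, seq_length, dtype=torch.bool)
    for i in range(1, len(cu_seqlens)):
        s, e = cu_seqlens[i - 1], cu_seqlens[i]
        mask[..., s:e, s:e] = True  # each image attends only within itself
    return mask

# Two images with 4 and 6 patches -> two diagonal True blocks.
print(build_vit_attention_mask(torch.tensor([0, 4, 10]), 10).int())
```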
I think this is indeed the cause of the bug. By debugging, I found that Qwen2-VL uses VisionSdpaAttention by default. Using it instead of VisionAttention also fixes the issue.
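For reference, one way to make the Hugging Face model use VisionSdpaAttention is to request the SDPA implementation at load time. This is a sketch against the public transformers API; the checkpoint name is just an example, and where the swap happens in the TRT-LLM export path is my assumption.

```python
from transformers import Qwen2VLForConditionalGeneration

# Requesting the SDPA attention implementation selects VisionSdpaAttention
# in the vision tower (it is also the default in recent transformers).
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",  # example checkpoint
    attn_implementation="sdpa",
    torch_dtype="auto",
)
```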
System Info
Qwen2-VL adds a new M-RoPE (Multimodal Rotary Position Embedding) feature; please support it.
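For context, M-RoPE drives different groups of rotary channels with separate temporal, height, and width position ids. Below is a minimal sketch of the idea; the mrope_section split is an assumption based on the released Qwen2-VL configs, not TensorRT-LLM code.

```python
import torch

def mrope_cos_sin(position_ids, head_dim=128, mrope_section=(16, 24, 24),
                  theta=1_000_000.0):
    # position_ids: (3, seq_len) -- one row each for time, height, width.
    inv_freq = 1.0 / theta ** (torch.arange(0, head_dim, 2).float() / head_dim)
    freqs = position_ids.float()[..., None] * inv_freq      # (3, seq, dim/2)
    # Each channel group takes frequencies from its own axis, then concat.
    chunks = torch.split(freqs, list(mrope_section), dim=-1)
    merged = torch.cat([c[i] for i, c in enumerate(chunks)], dim=-1)
    emb = torch.cat((merged, merged), dim=-1)                # (seq, head_dim)
    return emb.cos(), emb.sin()

# For text-only tokens all three axes share the same position ids.
cos, sin = mrope_cos_sin(torch.arange(8).repeat(3, 1))
print(cos.shape)  # torch.Size([8, 128])
```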
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
qwen2-vl open source model
Expected behavior
TensorRT-LLM supports Qwen2-VL.
actual behavior
TensorRT-LLM does not support Qwen2-VL.
additional notes
no