[Model] Remove intermediate states copying in Mllama #11295
Conversation
Signed-off-by: Jan Kaniecki <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
Do you have benchmark results for this optimization? I think we still need a copy operation here:
@heheda12345 I observe a stable improvement in prompt time when using Gaudi devices. It depends on the hidden states size, but I always get a few ms boost with this change.
Can you show how you did the benchmarking and your exact numbers here?
@heheda12345 Sorry for the long response time, I was offline for the last 2.5 weeks. I checked the change's impact using benchmark_throughput.py from vllm/benchmarks. Using the following command:
I was able to see a ~1.5% gain in all metrics reported by the test.
I think this is not a very serious problem on Gaudi, and I'm not sure whether it can yield a similar performance gain on other platforms.
Mllama model - This PR changes the way intermediate hidden states are kept inside the encoder part of the model. Appending tensors to a tuple and stacking them at the end can cause memory copies on different devices. This solution avoids the tuple and uses a pre-allocated tensor instead.
This can help improve encoder speed by avoiding tensor stacking (which triggers a memcopy).
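For illustration, here is a minimal sketch of the two patterns. The function names, shapes, and layer structure below are hypothetical and simplified; this is not the actual vLLM Mllama encoder code.

```python
# Hypothetical sketch of the pattern described above, not the actual
# vLLM Mllama encoder implementation.
import torch
import torch.nn as nn


def encode_with_stack(layers: nn.ModuleList, x: torch.Tensor,
                      keep: list[int]) -> torch.Tensor:
    """Old pattern: collect kept hidden states in a list/tuple, then stack."""
    kept = []
    for i, layer in enumerate(layers):
        x = layer(x)
        if i in keep:
            kept.append(x)
    # torch.stack materializes one more copy of every kept hidden state.
    return torch.stack(kept, dim=-1)


def encode_preallocated(layers: nn.ModuleList, x: torch.Tensor,
                        keep: list[int]) -> torch.Tensor:
    """PR's pattern (sketched): write into a pre-allocated output tensor."""
    out = x.new_empty(*x.shape, len(keep))
    slot = 0
    for i, layer in enumerate(layers):
        x = layer(x)
        if i in keep:
            # Each kept state is written once into its slot; no final stack.
            out[..., slot] = x
            slot += 1
    return out
```

Whether this wins in practice depends on the backend; the gain reported in this thread (~1.5% throughput) was measured on Gaudi devices.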