
[Model] Remove intermediate states copying in Mllama #11295

Open
jkaniecki wants to merge 1 commit into main
Conversation

@jkaniecki (Contributor) commented Dec 18, 2024

Mllama model - This PR changes the way intermediate hidden states are kept inside the encoder part of the model. Appending tensors to a tuple and stacking them at the end can cause memory copies on different devices. This solution avoids the tuple and writes into a pre-allocated tensor instead.
That can help improve encoder speed by avoiding the tensor stacking step (which triggers a memcopy).
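For context, here is a minimal sketch (not the actual vLLM Mllama code; class, method, and attribute names are illustrative) contrasting the old pattern, which accumulates intermediate hidden states in a Python tuple and stacks them at the end, with the pre-allocated tensor approach this PR proposes:

import torch
import torch.nn as nn


class EncoderSketch(nn.Module):
    # Illustrative encoder that records selected intermediate hidden states.
    def __init__(self, layers: nn.ModuleList, output_hidden_states_idx: list[int]):
        super().__init__()
        self.layers = layers
        self.output_hidden_states_idx = output_hidden_states_idx

    def forward_tuple(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Old path: append to a Python tuple and stack at the end.
        # torch.stack allocates a fresh tensor and copies every saved state into it.
        encoder_states = ()
        for i, layer in enumerate(self.layers):
            if i in self.output_hidden_states_idx:
                encoder_states = encoder_states + (hidden_states,)
            hidden_states = layer(hidden_states)
        return torch.stack(encoder_states, dim=0)

    def forward_preallocated(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # New path: write each saved state directly into a pre-allocated tensor
        # with index_copy_, so no stacking pass is needed afterwards.
        encoder_states = hidden_states.new_empty(
            (len(self.output_hidden_states_idx), *hidden_states.shape))
        slot = 0
        for i, layer in enumerate(self.layers):
            if i in self.output_hidden_states_idx:
                idx = torch.tensor([slot], device=hidden_states.device)
                encoder_states.index_copy_(0, idx, hidden_states.unsqueeze(0))
                slot += 1
            hidden_states = layer(hidden_states)
        return encoder_states

The stack call in the old path materializes a new tensor and copies all saved states into it, which is the memcopy the description refers to; the new path only pays the per-layer index_copy_ that is discussed below.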


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@heheda12345 (Collaborator)

Do you have benchmark results for this optimization? I think we still need a copy operation here:

encoder_states.index_copy_(0, hidden_states_idx, hidden_states.unsqueeze(0))

@jkaniecki (Contributor, Author) commented Dec 20, 2024

> Do you have benchmark results for this optimization? I think we still need a copy operation here:
>
> encoder_states.index_copy_(0, hidden_states_idx, hidden_states.unsqueeze(0))

@heheda12345 I observe a stable improvement in prompt time when using Gaudi devices. It depends on the hidden states size, but I always get a boost of a few milliseconds with this change.

@heheda12345 (Collaborator)

Can you show how you did the benchmarking and your exact numbers here?

@jkaniecki (Contributor, Author)

@heheda12345 Sorry for the long response time, I was offline for the last 2.5 weeks. I checked the change's impact using benchmark_throughput.py from vllm/benchmarks, with the following command:

python benchmark_throughput.py --model Meta-Llama-3.2-11B-Vision-Instruct --max-model-len 2048 --dataset datasets/sharegpt4v_instruct_gpt4-vision_cap100k.json --num-prompts 1000 --max-num-seqs 128 --output-len 4

I was able to see a ~1.5% gain in all metrics reported by the test.
I made the following assumptions:

  • I reduced the output length to 4 to expose the change's impact on prompt time in the final outcome (the encoder is only used during the prompt phase)
  • I used the sharegpt4v_instruct_gpt4-vision_cap100k dataset available here
  • To make benchmark_throughput.py work with mllama I needed to adjust the test script by adding the proper prompt pattern (the original comment attaches a screenshot of that change; a sketch of what such an adjustment could look like follows below this list)
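The attached screenshot is not reproduced here. For illustration only, a hypothetical sketch of the kind of adjustment described in the last bullet, assuming the image-prompt template used in vLLM's Llama 3.2 Vision examples; the template string, constant, and helper name are assumptions, not the author's actual change:

# Hypothetical sketch: wrap each sampled text prompt in an Mllama image-prompt
# pattern so the vision encoder actually runs during the prompt phase.
# The template below is an assumption based on vLLM's Llama 3.2 Vision examples;
# the real benchmark_throughput.py change may look different.
MLLAMA_PROMPT_PATTERN = "<|image|><|begin_of_text|>{prompt}"


def to_mllama_prompt(prompt: str) -> str:
    """Prefix a plain-text prompt with the Mllama image placeholder."""
    return MLLAMA_PROMPT_PATTERN.format(prompt=prompt)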

@heheda12345 (Collaborator)

I think that is not a very serious problem on Gaudi, and I'm not sure whether it can have a similar performance gain on other platforms.
