
[Model] Remove intermediate states copying in Mllama #11295

Open
jkaniecki wants to merge 1 commit into main
Conversation

@jkaniecki (Contributor) commented Dec 18, 2024

Mllama model - This PR changes the way intermediate hidden states are kept inside the encoder part of the model. Appending tensors to a tuple and stacking them at the end can cause memory copies on different devices. This solution avoids the tuple and writes into a pre-allocated tensor instead.
That can help improve encoder speed by avoiding the tensor stacking step (which triggers a memcopy).
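For context, here is a minimal sketch (not the actual vLLM Mllama code; class, method, and attribute names are illustrative) contrasting the old pattern, which accumulates intermediate hidden states in a Python tuple and stacks them at the end, with the pre-allocated tensor approach this PR proposes:

import torch
import torch.nn as nn


class EncoderSketch(nn.Module):
    # Illustrative encoder that records selected intermediate hidden states.
    def __init__(self, layers: nn.ModuleList, output_hidden_states_idx: list[int]):
        super().__init__()
        self.layers = layers
        self.output_hidden_states_idx = output_hidden_states_idx

    def forward_tuple(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Old path: append to a Python tuple and stack at the end.
        # torch.stack allocates a fresh tensor and copies every saved state into it.
        encoder_states = ()
        for i, layer in enumerate(self.layers):
            if i in self.output_hidden_states_idx:
                encoder_states = encoder_states + (hidden_states,)
            hidden_states = layer(hidden_states)
        return torch.stack(encoder_states, dim=0)

    def forward_preallocated(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # New path: write each saved state directly into a pre-allocated tensor
        # with index_copy_, so no stacking pass is needed afterwards.
        encoder_states = hidden_states.new_empty(
            (len(self.output_hidden_states_idx), *hidden_states.shape))
        slot = 0
        for i, layer in enumerate(self.layers):
            if i in self.output_hidden_states_idx:
                idx = torch.tensor([slot], device=hidden_states.device)
                encoder_states.index_copy_(0, idx, hidden_states.unsqueeze(0))
                slot += 1
            hidden_states = layer(hidden_states)
        return encoder_states

The stack call in the old path materializes a new tensor and copies all saved states into it, which is the memcopy the description refers to; the new path only pays the per-layer index_copy_ that is discussed below.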


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@heheda12345 (Collaborator)

Do you have benchmark results for this optimization? I think we still need a copy operation here:

encoder_states.index_copy_(0, hidden_states_idx, hidden_states.unsqueeze(0))

@jkaniecki (Contributor, Author) commented Dec 20, 2024

> Do you have benchmark results for this optimization? I think we still need a copy operation here:
>
> encoder_states.index_copy_(0, hidden_states_idx, hidden_states.unsqueeze(0))

@heheda12345 I observe a stable improvement in prompt time when using Gaudi devices. It depends on the hidden states size, but I always get a boost of a few milliseconds with this change.

@heheda12345 (Collaborator)

Can you show how you did the benchmarking and your exact numbers here?

@jkaniecki (Contributor, Author)

@heheda12345 Sorry for the long response time, I was offline for the last 2.5 weeks. I checked the change's impact using benchmark_throughput.py from vllm/benchmarks, with the following command:

python benchmark_throughput.py --model Meta-Llama-3.2-11B-Vision-Instruct --max-model-len 2048 --dataset datasets/sharegpt4v_instruct_gpt4-vision_cap100k.json --num-prompts 1000 --max-num-seqs 128 --output-len 4

I was able to see a ~1.5% gain in all metrics reported by the test.
I made the following assumptions:

  • I reduced the output length to 4 to expose the change's impact on prompt time in the final outcome (the encoder is only used during the prompt phase)
  • I used the sharegpt4v_instruct_gpt4-vision_cap100k dataset available here
  • To make benchmark_throughput.py work with mllama I needed to adjust the test script by adding the proper prompt pattern (the original comment attaches a screenshot of that change; a sketch of what such an adjustment could look like follows below this list)
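The attached screenshot is not reproduced here. For illustration only, a hypothetical sketch of the kind of adjustment described in the last bullet, assuming the image-prompt template used in vLLM's Llama 3.2 Vision examples; the template string, constant, and helper name are assumptions, not the author's actual change:

# Hypothetical sketch: wrap each sampled text prompt in an Mllama image-prompt
# pattern so the vision encoder actually runs during the prompt phase.
# The template below is an assumption based on vLLM's Llama 3.2 Vision examples;
# the real benchmark_throughput.py change may look different.
MLLAMA_PROMPT_PATTERN = "<|image|><|begin_of_text|>{prompt}"


def to_mllama_prompt(prompt: str) -> str:
    """Prefix a plain-text prompt with the Mllama image placeholder."""
    return MLLAMA_PROMPT_PATTERN.format(prompt=prompt)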

@heheda12345 (Collaborator)

I think that is not a very serious problem on Gaudi, and I'm not sure whether it can have a similar performance gain on other platforms.
