Inference Batching in vLLM #5247
HimanshuJanbandhu
started this conversation in
General
Replies: 1 comment
-
The current bottleneck for LLM inference speed is memory I/O, i.e. bringing the model weights and data from VRAM to the compute cores (e.g. CUDA cores) through HBM. Running in a batch means that if you have two requests at the same time, you only have to load the weights from VRAM once and serve both requests.
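A rough back-of-envelope sketch of that argument (the 14 GB weight size and 2 TB/s HBM bandwidth below are assumed example numbers, not figures from this thread): every decode step has to stream the full weights from HBM no matter how many sequences are in flight, so the weight-read cost per generated token shrinks as the batch grows.

```python
# Illustrative numbers only, not measurements from this discussion.
weights_gb = 14          # e.g. a ~7B-parameter model in fp16 (assumed)
hbm_bw_gb_s = 2000       # e.g. A100-class HBM bandwidth (assumed)

# Time spent just reading the weights once per decode step.
step_time_s = weights_gb / hbm_bw_gb_s   # ~7 ms

for batch_size in (1, 2, 8, 32):
    tokens_per_step = batch_size          # one new token per sequence per step
    per_token_ms = step_time_s / tokens_per_step * 1000
    print(f"batch={batch_size:2d}: ~{per_token_ms:.2f} ms of weight I/O per generated token")
```

The weight load is paid once per step regardless of batch size, so amortizing it over more sequences is where the batching speedup comes from, at least until the workload becomes compute-bound.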
-
So I used vLLM with batched inference.
I saw that batch inference took, on average, considerably more time per prompt than individual prompts did.
This Medium post claims that batch inferencing should take less time, but doesn't provide any proof:
https://medium.com/@wearegap/a-brief-introduction-to-optimized-batched-inference-with-vllm-deddf5423d0c
Is there any reason why vLLM should be faster with batch inferencing? If yes, why?
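For reference, a minimal sketch of the kind of comparison being described; the model name, prompt set, and sampling settings are placeholders I've assumed, not values from the original post:

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")    # placeholder model for illustration
params = SamplingParams(max_tokens=128)
prompts = [f"Question {i}: explain KV caching." for i in range(32)]

# One request at a time.
start = time.perf_counter()
for p in prompts:
    llm.generate([p], params)
sequential_s = time.perf_counter() - start

# All requests submitted together, letting vLLM batch them internally.
start = time.perf_counter()
llm.generate(prompts, params)
batched_s = time.perf_counter() - start

print(f"sequential: {sequential_s:.1f}s, batched: {batched_s:.1f}s")
print(f"per-prompt: {sequential_s/len(prompts):.2f}s vs {batched_s/len(prompts):.2f}s")
```

Submitting all prompts in a single generate() call lets vLLM schedule them together, so each weight load is shared across requests rather than repeated per prompt.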