Inference Batching in vLLM #5247
HimanshuJanbandhu
started this conversation in
General
Replies: 1 comment
-
The current bottleneck for LLM inference speed is memory I/O, i.e. bringing the model weights and data from VRAM to the compute cores (e.g. CUDA cores) through HBM. Running in a batch means that if you have two requests at the same time, you only have to load the weights from VRAM once and serve both requests.
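A rough back-of-envelope sketch of that argument (the 14 GB weight size and 2 TB/s HBM bandwidth below are assumed example numbers, not figures from this thread): every decode step has to stream the full weights from HBM no matter how many sequences are in flight, so the weight-read cost per generated token shrinks as the batch grows.

```python
# Illustrative numbers only, not measurements from this discussion.
weights_gb = 14          # e.g. a ~7B-parameter model in fp16 (assumed)
hbm_bw_gb_s = 2000       # e.g. A100-class HBM bandwidth (assumed)

# Time spent just reading the weights once per decode step.
step_time_s = weights_gb / hbm_bw_gb_s   # ~7 ms

for batch_size in (1, 2, 8, 32):
    tokens_per_step = batch_size          # one new token per sequence per step
    per_token_ms = step_time_s / tokens_per_step * 1000
    print(f"batch={batch_size:2d}: ~{per_token_ms:.2f} ms of weight I/O per generated token")
```

The weight load is paid once per step regardless of batch size, so amortizing it over more sequences is where the batching speedup comes from, at least until the workload becomes compute-bound.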
-
So I used vLLM with batched inference.
I saw that batch inference took, on average, considerably more time per prompt than individual prompts did.
This Medium post claims that batch inferencing should take less time, but doesn't provide any proof:
https://medium.com/@wearegap/a-brief-introduction-to-optimized-batched-inference-with-vllm-deddf5423d0c
Is there any reason why vLLM should be faster with batch inferencing? If yes, why?
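For reference, a minimal sketch of the kind of comparison being described; the model name, prompt set, and sampling settings are placeholders I've assumed, not values from the original post:

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")    # placeholder model for illustration
params = SamplingParams(max_tokens=128)
prompts = [f"Question {i}: explain KV caching." for i in range(32)]

# One request at a time.
start = time.perf_counter()
for p in prompts:
    llm.generate([p], params)
sequential_s = time.perf_counter() - start

# All requests submitted together, letting vLLM batch them internally.
start = time.perf_counter()
llm.generate(prompts, params)
batched_s = time.perf_counter() - start

print(f"sequential: {sequential_s:.1f}s, batched: {batched_s:.1f}s")
print(f"per-prompt: {sequential_s/len(prompts):.2f}s vs {batched_s/len(prompts):.2f}s")
```

Submitting all prompts in a single generate() call lets vLLM schedule them together, so each weight load is shared across requests rather than repeated per prompt.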