Replies: 3 comments 3 replies
-
Hi @imoneoi, thanks for the question. Yes, vLLM has some CPU-side overheads that can reduce its GPU utilization. For example, the tokenizer can become a performance bottleneck, especially when the slow tokenizer is used. The sampler can also be slow, especially when requests use different sampling parameters (e.g., some use nucleus sampling and others use beam search). FastAPI may also add overhead when the request rate is high. For now, it's difficult to tell which of these is causing the slowdown. Thanks again for reporting it. We will continue to investigate and optimize these performance issues.
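As a quick sanity check on the tokenizer side, something like the minimal sketch below (the model name is just a placeholder; it assumes the tokenizer is available through Hugging Face transformers) tells you whether the fast Rust-backed tokenizer is active and roughly how long tokenization takes:

```python
# Minimal sketch: check whether the fast (Rust) tokenizer is in use and
# roughly how long a batch of encodes takes. The model name is a placeholder.
import time
from transformers import AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # placeholder; substitute your model

tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)
print("fast tokenizer:", tokenizer.is_fast)  # False means the slow Python tokenizer is in use

prompts = ["Hello, how are you today?"] * 1000
start = time.perf_counter()
tokenizer(prompts)
print(f"tokenized {len(prompts)} prompts in {time.perf_counter() - start:.3f}s")
```

If `is_fast` is False or the batch takes a long time, the tokenizer is a likely suspect; otherwise the sampler or the HTTP layer are worth looking at.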
-
I have a similar issue. I'm using vLLM with Qwen2.5 32B GGUF q4 via the OpenAI endpoints for chat completion. I know this is experimental, but I am VRAM-poor and need this kind of model. (OK, I'm testing in the cloud on short-lived machines, but I'm looking for something usable beyond testing without ridiculous cost.) I've been increasing the number of GPUs, expecting token generation time to roughly halve each time I double the GPU count. The GPUs hit around 80-95% usage (which is good), but beyond a certain number of GPUs (depending on the GPU model) the performance gains hit a wall.
The worst part is that no matter how many vCPUs I provision, it always bottlenecks at 100% usage of a single (real) core. I've tried with 24 vCPUs (12 real cores): same result. This is a single chat request, not parallel calls. I've played with every option I can find related to concurrency, parallelism, and threads. My guess is a slow tokenizer, but also one that can't use multithreading? See the monitoring sketch below for how I'm confirming the single-core saturation.
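For what it's worth, this is roughly how I watch per-core CPU usage from outside the server while a single request is in flight (a minimal sketch; assumes psutil is installed, and the sample count and interval are placeholders):

```python
# Rough sketch: sample per-core CPU usage while a request is running,
# to confirm whether the load is pinned to one core. Values are placeholders.
import psutil

SAMPLES = 10
INTERVAL = 1.0  # seconds between samples

for _ in range(SAMPLES):
    per_core = psutil.cpu_percent(interval=INTERVAL, percpu=True)
    busiest = max(per_core)
    print(f"busiest core: {busiest:5.1f}%   all cores: {per_core}")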
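```

When one core stays pinned near 100% while the rest sit idle, attaching a sampling profiler such as py-spy to the server process should show whether the time is going to the tokenizer, the sampler, or the HTTP layer.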
-
I too have a similar issue with the OpenAI API server. Whatever is eating up the CPU, I would hope it could be spread across the available CPU threads.
-
I run the OpenAI API server with a LLaMA-based model and 128 parallel requests, but I see only about 50% GPU utilization (per nvidia-smi). Is that normal, or is it due to some overhead such as the tokenizer?
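In case it matters, here is a minimal sketch of how the utilization could be averaged over a window rather than read from a single nvidia-smi snapshot (assumes the pynvml bindings are installed; the GPU index and sampling duration are placeholders):

```python
# Minimal sketch: average GPU utilization over a window instead of relying
# on one nvidia-smi snapshot. GPU index and sample count are placeholders.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0; adjust as needed

samples = []
for _ in range(60):  # ~60 seconds of sampling
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    samples.append(util.gpu)
    time.sleep(1.0)

print(f"mean GPU utilization: {sum(samples) / len(samples):.1f}%")
pynvml.nvmlShutdown()
```

If the average really stays around 50% while requests are queued, the bottleneck is more likely the CPU-side work described above than a lack of parallel requests.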