Response streaming takes a while to start but then goes fast #9093
-
Hey, I'm playing around with llama.cpp because I want to run inference on a small LLM (a fine-tune of Google's gemma-2b, Q4_K_M) on a CPU. I'm trying to optimize latency and don't quite get why it takes a while to start streaming the response, and then goes blazing fast afterwards. When I send a new prompt it takes around 15 seconds just for the first output token to be produced, but subsequent output tokens each take less than 1 second. Shouldn't the first decoding step require the same amount of computation as the following ones? What is the initial startup time due to? I've put together a quick notebook to show the issue, the last two cells demonstrate it: https://colab.research.google.com/drive/11BTDHA3Qs1SoRKf06pguugdu-33lQZrB?usp=sharing
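A quick way to see where the 15 seconds go is to time the first streamed chunk separately from the gaps between the following chunks. Here's a minimal sketch using llama-cpp-python (the model path and prompt are illustrative, not from the notebook):

```python
import time
from llama_cpp import Llama

# Model path is illustrative; point it at your gemma-2b Q4_K_M GGUF file.
llm = Llama(model_path="gemma-2b.Q4_K_M.gguf", verbose=False)

prompt = "Explain in one sentence why the sky is blue."

start = time.perf_counter()
last = start
ttft = None
gaps = []

for chunk in llm(prompt, max_tokens=64, stream=True):
    now = time.perf_counter()
    if ttft is None:
        ttft = now - start        # prompt processing + first decode step
    else:
        gaps.append(now - last)   # pure per-token decode latency
    last = now

print(f"Time to first token: {ttft:.2f}s")
if gaps:
    print(f"Average per-token latency afterwards: {sum(gaps) / len(gaps):.3f}s")
```

If the first number dominates and the rest are small and roughly constant, the startup cost is the prompt phase rather than the first decode step.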
-
My bet is that it's the prompt processing that takes 15 seconds, not the first token generation.
Yes, the TTFT (time to first token) depends on the time it takes to compute the KV cache for the input tokens. This can be done very efficiently if the hardware can perform a lot of matrix multiplications in parallel. The `--batch-size` param also lets you control how many tokens are processed in parallel during this stage. But since you're using a CPU, not many operations can be done in parallel, so the token/sec speed for prompt processing is not significantly higher than the generation speed.
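In llama-cpp-python this flag surfaces as the `n_batch` constructor argument. A minimal sketch (the path and values are illustrative, and on CPU a larger batch helps far less than it would on a GPU because there is little parallelism to exploit):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-2b.Q4_K_M.gguf",  # illustrative path
    n_batch=512,                        # prompt tokens evaluated per forward pass (--batch-size)
    n_threads=8,                        # set to your physical core count
)
```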
In reality, the big servers that run ChatGPT / Claude.ai / etc. can do a massive amount of…