Response streaming takes a while to start but then goes fast #9093
-
Hey, I'm playing around with llama.cpp because I want to run inference on a small LLM (a fine-tune of Google's gemma-2b, Q4_K_M) on a CPU. I'm trying to optimize latency and don't quite get why it takes a while to start streaming the response, and then goes blazing fast afterwards. When I send a new prompt it takes around 15 seconds just for the first output token to be produced, but subsequent output tokens each take less than 1 second. Shouldn't the first decoding step require the same amount of computation as the following ones? What is the initial startup time due to? I've put together a quick notebook to show the issue, the last two cells demonstrate it: https://colab.research.google.com/drive/11BTDHA3Qs1SoRKf06pguugdu-33lQZrB?usp=sharing
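A quick way to see where the 15 seconds go is to time the first streamed chunk separately from the gaps between the following chunks. Here's a minimal sketch using llama-cpp-python (the model path and prompt are illustrative, not from the notebook):

```python
import time
from llama_cpp import Llama

# Model path is illustrative; point it at your gemma-2b Q4_K_M GGUF file.
llm = Llama(model_path="gemma-2b.Q4_K_M.gguf", verbose=False)

prompt = "Explain in one sentence why the sky is blue."

start = time.perf_counter()
last = start
ttft = None
gaps = []

for chunk in llm(prompt, max_tokens=64, stream=True):
    now = time.perf_counter()
    if ttft is None:
        ttft = now - start        # prompt processing + first decode step
    else:
        gaps.append(now - last)   # pure per-token decode latency
    last = now

print(f"Time to first token: {ttft:.2f}s")
if gaps:
    print(f"Average per-token latency afterwards: {sum(gaps) / len(gaps):.3f}s")
```

If the first number dominates and the rest are small and roughly constant, the startup cost is the prompt phase rather than the first decode step.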
-
My bet is that it's the prompt processing that takes 15 seconds, not the first token generation.
Yes, the TTFT (time to first token) depends on the time it takes to compute the KV cache for the input tokens. This can be done very efficiently if the hardware can perform a lot of matrix multiplications in parallel. The `--batch-size` param also lets you control how many tokens are processed in parallel during this stage. But since you're using a CPU, not many operations can be done in parallel, so the token/sec speed for prompt processing is not significantly higher than the generation speed.
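In llama-cpp-python this flag surfaces as the `n_batch` constructor argument. A minimal sketch (the path and values are illustrative, and on CPU a larger batch helps far less than it would on a GPU because there is little parallelism to exploit):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-2b.Q4_K_M.gguf",  # illustrative path
    n_batch=512,                        # prompt tokens evaluated per forward pass (--batch-size)
    n_threads=8,                        # set to your physical core count
)
```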
In reality, the big servers that run ChatGPT / Claude.ai / etc. can do a massive amount of…