Response streaming takes a while to start but then goes fast #9093

Closed Answered by ngxson
GoldenGooose asked this question in Q&A

Well, I guess most of the TTFT delay is the KV cache at this point; I didn't know that was so compute-intensive.

Yes, the TTFT depends on the time it takes to calculate the KV cache for the input tokens. This can be done very efficiently if the hardware can do a lot of matrix multiplications in parallel. The --batch-size param also allows you to control how many tokens are processed in parallel during this stage.

But since you're using a CPU, not many operations can be done in parallel, so you will observe that the token/sec speed for prompt processing is not significantly higher than the generation speed.
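To illustrate the point above, here is a minimal sketch (not llama.cpp's actual code; all names and sizes are made up for illustration) of why prompt processing can be batched while generation cannot: every prompt token's K and V projection is known up front, so they can go through one large matrix multiplication, whereas generated tokens only exist one at a time.

```python
import numpy as np

# Hypothetical toy dimensions, just for demonstration.
rng = np.random.default_rng(0)
d_model = 64
n_prompt = 8

W_k = rng.standard_normal((d_model, d_model))          # K projection weights
prompt_embeddings = rng.standard_normal((n_prompt, d_model))

# Prefill (prompt processing): all prompt tokens projected in ONE matmul.
# This is the part that parallel hardware (and a larger --batch-size)
# can accelerate.
K_batched = prompt_embeddings @ W_k

# Generation-style processing: one token at a time, one small matmul each.
# Mathematically identical, but no opportunity for batch parallelism.
K_sequential = np.stack([prompt_embeddings[i] @ W_k for i in range(n_prompt)])

assert np.allclose(K_batched, K_sequential)
```

On a GPU the single batched matmul runs far faster than the loop; on a CPU with limited parallelism the two approaches end up close in speed, which is why prompt processing throughput is not much higher than generation throughput there.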

In reality, the big servers that run ChatGPT / Claude.ai / etc. can do a massive amount of…

Answer selected by GoldenGooose