
Does the KV cache persist across multiple requests sharing a prefix? #8860

Answered by ggerganov
andysalerno asked this question in Q&A

If I have a large prompt whose first 4k tokens are always the same, and I make multiple requests in a row using that same prefix, will llama.cpp (specifically the server binary) reuse the KV cache for the prefix?

Yes, make sure to set cache_prompt = true in the requests to enable this feature
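For illustration, here is a minimal sketch of what such requests could look like against the llama.cpp server's /completion endpoint. The server URL is an assumption (a local server on the default port 8080); with cache_prompt set, the server can reuse the KV cache for the part of the prompt that matches the previous request in the same slot, so only the dynamic suffix is re-processed.

```python
import json
import urllib.request

def build_request(shared_prefix: str, dynamic_part: str) -> dict:
    """Build a /completion payload that opts in to prompt caching."""
    return {
        "prompt": shared_prefix + dynamic_part,
        "n_predict": 64,
        "cache_prompt": True,  # reuse KV cache for the matching prefix
    }

def send(payload: dict, url: str = "http://localhost:8080/completion") -> dict:
    """POST the payload to a llama.cpp server (assumed to be running locally)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Two requests sharing the same 4k-token prefix; the second should skip
# most of the prompt-processing work for the shared part.
first = build_request("<same 4k-token prefix>", " question 1")
second = build_request("<same 4k-token prefix>", " question 2")
```

Note that caching works on the longest common prefix, so the shared part must be byte-identical between requests.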

Alternatively, if the answer is yes: what if I have two prompts that alternate, i.e. two different prefixes A and B, and the pattern of requests is always (A + dynamic), then (B + dynamic), then (A + dynamic), and so on? Will the KV cache for both the A and B prefixes remain in memory (VRAM allowing), so the input-processing stage can be mostly skipped?

You can achieve this by using 2 parallel …
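The answer above is truncated; assuming it refers to the server's parallel slots, one could start the server with two slots (e.g. `llama-server -m model.gguf -np 2`) so it keeps two independent KV caches, and pin each prefix to its own slot. The sketch below assumes an "id_slot" request field for slot pinning; verify the field name against your server version's documentation.

```python
def build_slot_request(prefix: str, dynamic_part: str, slot: int) -> dict:
    """Build a /completion payload pinned to a fixed server slot."""
    return {
        "prompt": prefix + dynamic_part,
        "cache_prompt": True,  # reuse this slot's cached prefix
        "id_slot": slot,       # pin the request to one slot (assumed field name)
    }

# Alternate prefixes A and B, each on its own slot, so that neither
# prefix's KV cache is evicted by the other:
alternating = [
    build_slot_request("<prefix A>", " question 1", slot=0),
    build_slot_request("<prefix B>", " question 2", slot=1),
    build_slot_request("<prefix A>", " question 3", slot=0),
]
```

With this layout, each slot only ever sees one prefix, so every request after the first per slot should reuse that slot's cached prefix.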

Answer selected by andysalerno