Does the KV cache persist across multiple requests sharing a prefix? #8860
-
If I have a large prompt whose first 4k tokens are always the same, and I make multiple requests in a row using that same prefix, will llama.cpp keep the KV cache for that prefix across requests so prompt processing can be skipped? Alternatively, if the answer is yes: what if I have two prompts that alternate, i.e. two different prefixes A and B, and the pattern of requests is always (A + dynamic), then (B + dynamic), then (A + dynamic), and so on? Will the KV cache for both the A and B prefixes remain in memory (if VRAM allows) so the input-processing stage can be mostly skipped?

I'm aware there's an issue open to implement PagedAttention, which I believe would achieve the above, though my understanding is that PagedAttention is particularly useful for concurrent batched requests, whereas I'm mostly interested in sequential requests (that just happen to have large prefixes in common).
-
Yes, make sure to set `cache_prompt = true` in the requests to enable this feature.

You can achieve this by using 2 parallel slots (…), so each prefix can remain cached in its own slot.

PagedAttention is unrelated to this functionality.
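As a rough sketch of what such requests might look like (field names like `cache_prompt` and `id_slot` are based on the llama-server HTTP API; treat the exact names and the `build_request` helper as assumptions to check against your server version):

```python
import json

# Hypothetical helper: build a llama-server /completion payload that reuses
# the cached prefix. Pinning each prefix to its own slot lets A and B each
# keep their own KV cache (assumes the server was started with 2 parallel
# slots; field names may differ across versions).
def build_request(prefix: str, dynamic: str, slot: int) -> str:
    payload = {
        "prompt": prefix + dynamic,
        "cache_prompt": True,  # reuse the KV cache for the common prefix
        "id_slot": slot,       # keep prefixes A and B in separate slots
        "n_predict": 128,
    }
    return json.dumps(payload)

# Alternate between the two prefixes, each pinned to its own slot:
req_a = build_request("<4k-token prefix A>", " question 1", slot=0)
req_b = build_request("<4k-token prefix B>", " question 2", slot=1)
```

On the second A-request, only the dynamic suffix should need prompt processing, since the prefix tokens already match slot 0's cache.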
-
Hi @ggerganov, regarding the two long documents A and B the author described: can you explain how to use slot saving and restoring?
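A sketch of how slot save/restore could be driven over HTTP. The endpoint shape (`/slots/{id}?action=save`) and the `filename` field are assumptions based on the server's slot API, and the server would need a slot-save path enabled; verify against your version's documentation:

```python
import json
import urllib.request

# Hypothetical server address for illustration.
BASE = "http://localhost:8080"

def slot_action(slot_id: int, action: str, filename: str) -> urllib.request.Request:
    # Build (but do not send) the request, so this sketch is self-contained.
    body = json.dumps({"filename": filename}).encode()
    return urllib.request.Request(
        f"{BASE}/slots/{slot_id}?action={action}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# After processing prefix A in slot 0, persist its KV cache to disk:
save_req = slot_action(0, "save", "prefix_a.bin")
# Later, before the next A-request, restore it back into slot 0:
restore_req = slot_action(0, "restore", "prefix_a.bin")
# (Send with urllib.request.urlopen(save_req) against a running server.)
```

This lets a prefix's KV cache survive even when its slot is reused for another prefix in between.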