Does the KV cache persist across multiple requests sharing a prefix? #8860
-
If I have a large prompt whose first 4k tokens are always the same, and I make multiple requests in a row using that same prefix, will llama.cpp keep the KV cache for that prefix across requests so prompt processing can be skipped? Alternatively, if the answer is yes: what if I have two prompts that alternate, i.e. two different prefixes A and B, and the pattern of requests is always (A + dynamic), then (B + dynamic), then (A + dynamic), and so on? Will the KV cache for both the A and B prefixes remain in memory (if VRAM allows) so the input-processing stage can be mostly skipped?

I'm aware there's an issue open to implement PagedAttention, which I believe would achieve the above, though my understanding is that PagedAttention is particularly useful for concurrent batched requests, whereas I'm mostly interested in sequential requests (that just happen to have large prefixes in common).
-
Yes, make sure to set `cache_prompt = true` in the requests to enable this feature.

You can achieve this by using 2 parallel slots (…), so each prefix can remain cached in its own slot.

PagedAttention is unrelated to this functionality.
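As a rough sketch of what such requests might look like (field names like `cache_prompt` and `id_slot` are based on the llama-server HTTP API; treat the exact names and the `build_request` helper as assumptions to check against your server version):

```python
import json

# Hypothetical helper: build a llama-server /completion payload that reuses
# the cached prefix. Pinning each prefix to its own slot lets A and B each
# keep their own KV cache (assumes the server was started with 2 parallel
# slots; field names may differ across versions).
def build_request(prefix: str, dynamic: str, slot: int) -> str:
    payload = {
        "prompt": prefix + dynamic,
        "cache_prompt": True,  # reuse the KV cache for the common prefix
        "id_slot": slot,       # keep prefixes A and B in separate slots
        "n_predict": 128,
    }
    return json.dumps(payload)

# Alternate between the two prefixes, each pinned to its own slot:
req_a = build_request("<4k-token prefix A>", " question 1", slot=0)
req_b = build_request("<4k-token prefix B>", " question 2", slot=1)
```

On the second A-request, only the dynamic suffix should need prompt processing, since the prefix tokens already match slot 0's cache.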
-
Hi @ggerganov, regarding the two long documents A and B the author described: can you explain how to use slot saving and restoring?
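A sketch of how slot save/restore could be driven over HTTP. The endpoint shape (`/slots/{id}?action=save`) and the `filename` field are assumptions based on the server's slot API, and the server would need a slot-save path enabled; verify against your version's documentation:

```python
import json
import urllib.request

# Hypothetical server address for illustration.
BASE = "http://localhost:8080"

def slot_action(slot_id: int, action: str, filename: str) -> urllib.request.Request:
    # Build (but do not send) the request, so this sketch is self-contained.
    body = json.dumps({"filename": filename}).encode()
    return urllib.request.Request(
        f"{BASE}/slots/{slot_id}?action={action}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# After processing prefix A in slot 0, persist its KV cache to disk:
save_req = slot_action(0, "save", "prefix_a.bin")
# Later, before the next A-request, restore it back into slot 0:
restore_req = slot_action(0, "restore", "prefix_a.bin")
# (Send with urllib.request.urlopen(save_req) against a running server.)
```

This lets a prefix's KV cache survive even when its slot is reused for another prefix in between.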