
[RFC][Core] Cache policy framework #11928

Draft
wants to merge 10 commits into base: main
Conversation

ShawnD200
Contributor

@ShawnD200 ShawnD200 commented Jan 10, 2025

[Background]
I started working on KV cache management after reading Simon's Roadmap Q4 2024 post (#9006), which listed a sparse KV cache framework. After digging into it for some time, I realized that sliding window is already a sparse KV cache, although it is implemented (incompletely) inside the default cache management. So I think it is better to generalize the idea of a cache policy and design an interface that any cache management scheme can implement.

[Motivation]
CachePolicy is a general interface for managing KV cache allocations, motivated by the need for more sophisticated cache management. By implementing CachePolicy, one can define a policy that dictates how cache space is used when new tokens arrive, e.g., sliding window, sparse cache, and so on.

The idea is actually general enough that it is already reality, whether or not this is usually stated: the default cache behavior practices a simple policy that always allocates new slots/blocks to append new tokens. CachePolicy should therefore fit naturally into the current system. As an interface, it is meant to be implemented by any cache management idea, primarily ones aimed at saving scarce memory, and it makes such ideas easy to realize.

It is important to note that CachePolicy is not a BlockAllocator: it asks an allocator for new blocks (and returns blocks to it). Its role is clearly defined and confined, so it should not be confused with, and is decoupled from, any other component. Nonetheless, a specific policy implementation may not be able to work with another feature; for example, it is known that CachePolicySlidingWindow and prefix caching cannot be combined. I think that restriction is technically unnecessary (IIUC), but it is checked and avoided at the configuration stage, so nothing behaves differently and nothing breaks.

[Design]
In terms of design, CachePolicy has two straightforward methods: add_tokens_prefill and add_tokens_decode (which can be unified, as suggested by Cody). Their single job is to give slots/blocks to new tokens. The given slots/blocks may be new, in which case they are requested from the allocator, or they may be old ones drawn entirely from the blocks already at the policy's disposal; it is the policy that selects victims, evicts them, and places the new tokens in their positions.
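A minimal sketch of what such an interface could look like (only the two method names come from this PR; the constructor arguments and return types are my assumptions for illustration):

```python
from abc import ABC, abstractmethod
from typing import List


class CachePolicy(ABC):
    """Decides which slots/blocks the new tokens of a sequence occupy.

    The policy is not an allocator: it only asks a block allocator for new
    blocks (and returns blocks to it), and it may instead reuse blocks it
    already holds by evicting the tokens stored there.
    """

    def __init__(self, block_allocator, block_size: int):
        self.allocator = block_allocator      # assumed allocator handle
        self.block_size = block_size

    @abstractmethod
    def add_tokens_prefill(self, token_ids: List[int]) -> List[int]:
        """Assign slots to the prompt tokens; returns their slot mapping."""

    @abstractmethod
    def add_tokens_decode(self, token_ids: List[int]) -> List[int]:
        """Assign slots to newly generated tokens, possibly by evicting
        old slots instead of allocating new blocks."""
```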

CachePolicy is lightweight. It employs two data structures to keep track of blocks and tokens, a PhysicalBlockTable and a VirtualBlockTable, building them as processing proceeds and using them to manage the cache space. The PhysicalBlockTable (previously BlockList) holds the physical blocks, while the VirtualBlockTable maps tokens to blocks and back via slot mappings and token mappings, respectively.
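As a rough illustration of those two structures (the field names below are my guesses from the description, not the actual code):

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PhysicalBlockTable:
    """Physical blocks currently held by the policy (previously BlockList)."""
    block_ids: List[int] = field(default_factory=list)


@dataclass
class VirtualBlockTable:
    """Two-way mapping between token positions and physical slots.

    slot_mapping:  token position -> slot (block_id * block_size + offset)
    token_mapping: slot           -> token position currently stored there
    """
    slot_mapping: Dict[int, int] = field(default_factory=dict)
    token_mapping: Dict[int, int] = field(default_factory=dict)
```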

[Status]
Using CachePolicy, I implemented the first two concrete policies, covering what is used today. CachePolicyBase behaves the same as the default, while CachePolicySlidingWindow maintains a sliding-window context. Instead of the two behaviors being mixed together as they are today, the CachePolicy interface lets sliding window be implemented cleanly and completely. By complete, I mean sliding window now takes effect in prefill and is adhered to in decode: once a full window of blocks is held, no new block is ever allocated.
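To illustrate that decode-time guarantee (no new block once a full window is held), here is a tiny, self-contained sketch of the wrap-around slot assignment; it is only the idea, not this PR's implementation:

```python
def sliding_window_slots(start_pos: int, num_new_tokens: int,
                         window_size: int, block_size: int) -> list[int]:
    """Slot indices for new tokens under a sliding-window policy.

    Once enough blocks to cover `window_size` tokens are held, a new token
    wraps around and overwrites the slot of the oldest token, so no further
    block allocation is ever needed.
    """
    window_blocks = -(-window_size // block_size)   # ceil division
    capacity = window_blocks * block_size           # slots in a full window
    return [pos % capacity
            for pos in range(start_pos, start_pos + num_new_tokens)]

# Example: window_size=4, block_size=2 -> capacity of 4 slots.
# The 5th token (position 4) reuses slot 0 instead of getting a new block:
# sliding_window_slots(4, 1, window_size=4, block_size=2) == [0]
```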

I have lightly tested it on Qwen2.5 with basic usage and am posting it early for comments and suggestions.

Future work:

  1. Unit test cases (for the framework itself and for integration)
  2. A sparse cache policy that selects which tokens to keep and which to discard, typically drawing on existing research.
  3. Per-layer cache seems to be of interest; I will see whether and how to cover it.

Thanks.

@simon-mo @comaniac @WoosukKwon @cadedaniel @alexm-neuralmagic

Add CachePolicySlidingWindow

Adapt seq_len


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@ShawnD200 ShawnD200 marked this pull request as draft January 10, 2025 11:06
@comaniac
Collaborator

This is a big change so I'd expect an RFC to propose the motivation, overall design and impact (how does it work with the current features such as chunked prefill, prefix caching, etc).

Meanwhile, some comments based on the current PR description:

  • We want to eliminate the concept of prefill and decode in the engine as much as possible, mainly because they can be unified by a single pattern: (query_length=X, kv_cache=Y). Prefill is (query_length=N, kv_cache=None), and decode is (query_length=1, kv_cache=N). For prefix caching and chunked prefill, it would be (query_length=M, kv_cache=K). Thus, add_tokens_prefill is not a desired method.
  • Better to think about how this proposal works with vLLM v1, given that we are going to make vLLM v1 as default (hopefully) soon. In the vLLM v1 architecture, there's no concept of prefill decode anymore, and chunked prefill / prefix caching are first class citizens.

@ShawnD200 ShawnD200 changed the title from [Core] Cache policy framework to [RFC][Core] Cache policy framework Jan 11, 2025
@ShawnD200
Contributor Author

Thanks for the review.

This is a big change so I'd expect an RFC to propose the motivation, overall design and impact (how does it work with the current features such as chunked prefill, prefix caching, etc).

Sure. I briefly introduced it in the PR text.

Meanwhile, some comments based on the current PR description:

  • We want to eliminate the concept of prefill and decode in the engine as much as possible, mainly because they can be unified by a single pattern: (query_length=X, kv_cache=Y). Prefill is (query_length=N, kv_cache=None), and decode is (query_length=1, kv_cache=N). For prefix caching and chunked prefill, it would be (query_length=M, kv_cache=K). Thus, add_tokens_prefill is not a desired method.

Absolutely, they can be unified.
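For illustration, a unified entry point could look roughly like this (the names are hypothetical, not this PR's code):

```python
from typing import NamedTuple


class Step(NamedTuple):
    query_length: int        # new tokens to place in this step
    num_cached_tokens: int   # tokens already in the KV cache


# Prefill, decode, and chunked prefill / prefix caching all become one call:
prefill = Step(query_length=512, num_cached_tokens=0)      # (N, None)
decode = Step(query_length=1, num_cached_tokens=512)       # (1, N)
chunked = Step(query_length=256, num_cached_tokens=128)    # (M, K)


def add_tokens(policy, step: Step) -> list:
    """Single policy entry point replacing add_tokens_prefill/add_tokens_decode."""
    return policy.add_tokens(step.query_length, step.num_cached_tokens)
```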

  • Better to think about how this proposal works with vLLM v1, given that we are going to make vLLM v1 as default (hopefully) soon. In the vLLM v1 architecture, there's no concept of prefill decode anymore, and chunked prefill / prefix caching are first class citizens.

I am not so familiar with v1. But in principle CachePolicy is not such a big change: it is basically the previous BlockTable, and PhysicalBlockTable is the previous BlockList. What I did is abstract and formalize the idea that cache management can follow different policies, which is implicitly there anyway, and then "reimplement" sliding window as a cache policy on top of the framework. Beyond that, there should be no difference from the current system.

With respect to prefix caching, IIUC it sits behind the allocator for block sharing and is therefore entirely opaque to CachePolicy; the same goes for chunked prefill. I will learn about v1 and see how this fits in.
