
[RFC][Core] Cache policy framework #11928

Draft
wants to merge 10 commits into base: main
Conversation

ShawnD200
Contributor

@ShawnD200 ShawnD200 commented Jan 10, 2025

[Background]
I started working on KV cache management after reading Simon's Roadmap Q4 2024 post (#9006), which listed a sparse KV cache framework. After digging into it for some time, I realized that sliding window is already a sparse KV cache, although it is implemented (incompletely) inside the default cache management. So I think it is better to generalize the idea of a cache policy and design an interface that any cache management scheme can implement.

[Motivation]
CachePolicy is a general interface for managing KV cache allocations, motivated by the need for more sophisticated cache management. By implementing CachePolicy, one can define a policy that dictates how cache space is used when new tokens arrive, e.g., sliding window, sparse cache, and so on.

The idea is actually general enough that it is already reality, whether or not this is usually stated: the default cache behavior practices a simple policy that always allocates new slots/blocks to append new tokens. CachePolicy should therefore fit naturally into the current system. As an interface, it is meant to be implemented by any cache management idea, primarily ones aimed at saving scarce memory, and it makes such ideas easy to realize.

It is important to note that CachePolicy is not a BlockAllocator: it asks an allocator for new blocks (and returns blocks to it). Its role is clearly defined and confined, so it should not be confused with, and is decoupled from, any other component. Nonetheless, a specific policy implementation may not be able to work with another feature; for example, it is known that CachePolicySlidingWindow and prefix caching cannot be combined. I think that restriction is technically unnecessary (IIUC), but it is checked and avoided at the configuration stage, so nothing behaves differently and nothing breaks.

[Design]
In terms of design, CachePolicy has two straightforward methods: add_tokens_prefill and add_tokens_decode (which can be unified, as suggested by Cody). Their single job is to give slots/blocks to new tokens. The given slots/blocks may be new, in which case they are requested from the allocator, or they may be old ones drawn entirely from the blocks already at the policy's disposal; it is the policy that selects victims, evicts them, and places the new tokens in their positions.
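A minimal sketch of what such an interface could look like (only the two method names come from this PR; the constructor arguments and return types are my assumptions for illustration):

```python
from abc import ABC, abstractmethod
from typing import List


class CachePolicy(ABC):
    """Decides which slots/blocks the new tokens of a sequence occupy.

    The policy is not an allocator: it only asks a block allocator for new
    blocks (and returns blocks to it), and it may instead reuse blocks it
    already holds by evicting the tokens stored there.
    """

    def __init__(self, block_allocator, block_size: int):
        self.allocator = block_allocator      # assumed allocator handle
        self.block_size = block_size

    @abstractmethod
    def add_tokens_prefill(self, token_ids: List[int]) -> List[int]:
        """Assign slots to the prompt tokens; returns their slot mapping."""

    @abstractmethod
    def add_tokens_decode(self, token_ids: List[int]) -> List[int]:
        """Assign slots to newly generated tokens, possibly by evicting
        old slots instead of allocating new blocks."""
```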

CachePolicy is lightweight. It employs two data structures to keep track of blocks and tokens, a PhysicalBlockTable and a VirtualBlockTable, building them as processing proceeds and using them to manage the cache space. The PhysicalBlockTable (previously BlockList) holds the physical blocks, while the VirtualBlockTable maps tokens to blocks and back via slot mappings and token mappings, respectively.
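As a rough illustration of those two structures (the field names below are my guesses from the description, not the actual code):

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PhysicalBlockTable:
    """Physical blocks currently held by the policy (previously BlockList)."""
    block_ids: List[int] = field(default_factory=list)


@dataclass
class VirtualBlockTable:
    """Two-way mapping between token positions and physical slots.

    slot_mapping:  token position -> slot (block_id * block_size + offset)
    token_mapping: slot           -> token position currently stored there
    """
    slot_mapping: Dict[int, int] = field(default_factory=dict)
    token_mapping: Dict[int, int] = field(default_factory=dict)
```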

[Status]
Using CachePolicy, I implemented the first two concrete policies, covering what is used today. CachePolicyBase behaves the same as the default, while CachePolicySlidingWindow maintains a sliding-window context. Instead of the two behaviors being mixed together as they are today, the CachePolicy interface lets sliding window be implemented cleanly and completely. By complete, I mean sliding window now takes effect in prefill and is adhered to in decode: once a full window of blocks is held, no new block is ever allocated.
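To illustrate that decode-time guarantee (no new block once a full window is held), here is a tiny, self-contained sketch of the wrap-around slot assignment; it is only the idea, not this PR's implementation:

```python
def sliding_window_slots(start_pos: int, num_new_tokens: int,
                         window_size: int, block_size: int) -> list[int]:
    """Slot indices for new tokens under a sliding-window policy.

    Once enough blocks to cover `window_size` tokens are held, a new token
    wraps around and overwrites the slot of the oldest token, so no further
    block allocation is ever needed.
    """
    window_blocks = -(-window_size // block_size)   # ceil division
    capacity = window_blocks * block_size           # slots in a full window
    return [pos % capacity
            for pos in range(start_pos, start_pos + num_new_tokens)]

# Example: window_size=4, block_size=2 -> capacity of 4 slots.
# The 5th token (position 4) reuses slot 0 instead of getting a new block:
# sliding_window_slots(4, 1, window_size=4, block_size=2) == [0]
```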

I have lightly tested it on Qwen2.5 with basic usage and am posting it early for comments and suggestions.

Future work:

  1. Unit test cases (for the framework itself and for integration)
  2. A sparse cache policy that selects which tokens to keep and which to discard, typically drawing on existing research.
  3. Per-layer cache seems to be of interest; I will see whether and how to cover it.

Thanks.

@simon-mo @comaniac @WoosukKwon @cadedaniel @alexm-neuralmagic

Add CachePolicySlidingWindow

Adapt seq_len


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@ShawnD200 ShawnD200 marked this pull request as draft January 10, 2025 11:06
@comaniac
Collaborator

This is a big change so I'd expect an RFC to propose the motivation, overall design and impact (how does it work with the current features such as chunked prefill, prefix caching, etc).

Meanwhile, some comments based on the current PR description:

  • We want to eliminate the concept of prefill and decode in the engine as much as possible, mainly because they can be unified by a single pattern: (query_length=X, kv_cache=Y). Prefill is (query_length=N, kv_cache=None), and decode is (query_length=1, kv_cache=N). For prefix caching and chunked prefill, it would be (query_length=M, kv_cache=K). Thus, add_tokens_prefill is not a desired method.
  • Better to think about how this proposal works with vLLM v1, given that we are going to make vLLM v1 as default (hopefully) soon. In the vLLM v1 architecture, there's no concept of prefill decode anymore, and chunked prefill / prefix caching are first class citizens.

@ShawnD200 ShawnD200 changed the title from [Core] Cache policy framework to [RFC][Core] Cache policy framework Jan 11, 2025
@ShawnD200
Contributor Author

Thanks for the review.

This is a big change so I'd expect an RFC to propose the motivation, overall design and impact (how does it work with the current features such as chunked prefill, prefix caching, etc).

Sure. I briefly introduced it in the PR text.

Meanwhile, some comments based on the current PR description:

  • We want to eliminate the concept of prefill and decode in the engine as much as possible, mainly because they can be unified by a single pattern: (query_length=X, kv_cache=Y). Prefill is (query_length=N, kv_cache=None), and decode is (query_length=1, kv_cache=N). For prefix caching and chunked prefill, it would be (query_length=M, kv_cache=K). Thus, add_tokens_prefill is not a desired method.

Absolutely, they can be unified.
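For illustration, a unified entry point could look roughly like this (the names are hypothetical, not this PR's code):

```python
from typing import NamedTuple


class Step(NamedTuple):
    query_length: int        # new tokens to place in this step
    num_cached_tokens: int   # tokens already in the KV cache


# Prefill, decode, and chunked prefill / prefix caching all become one call:
prefill = Step(query_length=512, num_cached_tokens=0)      # (N, None)
decode = Step(query_length=1, num_cached_tokens=512)       # (1, N)
chunked = Step(query_length=256, num_cached_tokens=128)    # (M, K)


def add_tokens(policy, step: Step) -> list:
    """Single policy entry point replacing add_tokens_prefill/add_tokens_decode."""
    return policy.add_tokens(step.query_length, step.num_cached_tokens)
```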

  • Better to think about how this proposal works with vLLM v1, given that we are going to make vLLM v1 as default (hopefully) soon. In the vLLM v1 architecture, there's no concept of prefill decode anymore, and chunked prefill / prefix caching are first class citizens.

I am not so familiar with v1. But in principle CachePolicy is not such a big change: it is basically the previous BlockTable, and PhysicalBlockTable is the previous BlockList. What I did is abstract and formalize the idea that cache management can follow different policies, which is implicitly there anyway, and then "reimplement" sliding window as a cache policy on top of the framework. Beyond that, there should be no difference from the current system.

With respect to prefix caching, IIUC it sits behind the allocator for block sharing and is therefore entirely opaque to CachePolicy; the same goes for chunked prefill. I will learn about v1 and see how this fits in.
