[RFC][Core] Cache policy framework #11928
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
This is a big change, so I'd expect an RFC to propose the motivation, overall design, and impact (how does it work with current features such as chunked prefill, prefix caching, etc.?). Meanwhile, some comments based on the current PR description:
Thanks for the review.
Sure. I briefly introduced it in the PR text.
Absolutely, they can be unified.
I am not so familiar with v1. But in principle CachePolicy is not so big a change: it is basically the previous BlockTable, and PhysicalBlockTable is the previous BlockList. What I did was abstract and formalize the idea that cache management can have different policies, which is implicitly there already, and then "reimplement" SlidingWindow as a cache policy on top of that framework. Beyond this, there should be no difference compared to the current system. With respect to Prefix Caching, IIUC it sits behind the allocator for block sharing and is therefore entirely opaque to CachePolicy, and the same goes for Chunked Prefill. I will learn about v1 and see how to fit in.
[Background]
I started working on KV cache management when I read Simon's Roadmap Q4 2024 post (#9006), which listed a sparse KV cache framework. After digging into it for some time, I realized that sliding window is already a sparse KV cache, although it is implemented (incompletely) inside the default cache management. So I think it would be better to generalize the idea of a cache policy and design an interface that any cache management scheme can implement.
[Motivation]
CachePolicy is a general interface for managing KV cache allocations, motivated by the need for more sophisticated cache management. By implementing CachePolicy, one can define a policy that dictates how cache space is used when new tokens arrive, such as sliding window, sparse cache, etc.
In fact, the idea is general enough that it already exists implicitly, whether or not it has been noted: the default cache behavior simply practices a trivial policy that always allocates new slots/blocks to append new tokens. CachePolicy should therefore fit naturally into the current system. As an interface, it is meant to be implemented by any cache management idea, primarily to save scarce memory, and it makes such ideas easy to implement.
It is important to note that CachePolicy is not a BlockAllocator; it asks an allocator for new blocks (and returns blocks to it). Its role is clearly defined and confined, and it is decoupled from every other component. Nonetheless, a specific policy implementation may not work with another feature, e.g., it is known that CachePolicySlidingWindow and prefix caching cannot be combined, a restriction which I think is technically unnecessary (IIUC) but is checked and rejected at configuration time, so nothing behaves differently and nothing breaks.
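To make that boundary concrete, here is a minimal sketch (illustrative names only, not the exact classes in this PR) of the only surface a policy would see from an allocator; anything behind it, such as block sharing for prefix caching, stays opaque to the policy:

```python
# Illustrative sketch of the policy/allocator boundary described above.
# The policy only ever calls allocate()/free(); what the allocator does
# internally (e.g. sharing blocks for prefix caching) is invisible to it.
from typing import Protocol


class AllocatorLike(Protocol):
    def allocate(self) -> int:
        """Hand out a physical block id."""
        ...

    def free(self, block_id: int) -> None:
        """Take a physical block back."""
        ...
```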
[Design]
CachePolicy basically has two straightforward methods: add_tokens_prefill and add_tokens_decode (which can be unified, as Cody suggested). Both do the same essential job: assign slots/blocks to new tokens. The assigned slots/blocks may be new, in which case they are requested from the allocator, or they may be old ones drawn entirely from the blocks already at the policy's disposal; the policy selects victims, evicts them, and places the new tokens in their positions.
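As a rough illustration of these two entry points, here is a minimal, self-contained sketch under simplified assumptions: a hypothetical integer-block allocator and the deliberately trivial always-append behavior of the default policy (these are not the PR's classes; the real behavior is richer):

```python
# Hypothetical sketch only: a toy allocator plus a trivial always-append policy
# exposing the two entry points named above.
from typing import List


class SimpleAllocator:
    """Hypothetical allocator: hands out integer block ids."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class AlwaysAppendPolicy:
    """Trivial policy: every new token gets a slot in the current or a fresh block."""

    def __init__(self, allocator: SimpleAllocator, block_size: int) -> None:
        self.allocator = allocator
        self.block_size = block_size
        self.blocks: List[int] = []  # physical blocks in use, in order
        self.num_tokens = 0

    def add_tokens_prefill(self, token_ids: List[int]) -> None:
        self._append(len(token_ids))

    def add_tokens_decode(self, num_new_tokens: int = 1) -> None:
        self._append(num_new_tokens)

    def _append(self, n: int) -> None:
        for _ in range(n):
            if self.num_tokens % self.block_size == 0:
                # No free slot left in the last block: ask the allocator for a new one.
                self.blocks.append(self.allocator.allocate())
            self.num_tokens += 1


policy = AlwaysAppendPolicy(SimpleAllocator(num_blocks=8), block_size=4)
policy.add_tokens_prefill(list(range(10)))  # prefill 10 tokens -> 3 blocks
policy.add_tokens_decode()                  # decode 1 more token -> still 3 blocks
print(policy.blocks, policy.num_tokens)     # e.g. [7, 6, 5] 11
```

A policy that evicts (such as the sliding-window one below) would differ only inside `_append`: instead of always requesting a block, it could pick a victim among `self.blocks` and reuse its slots.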
CachePolicy is lightweight. It employs two data structures to keep track of blocks and tokens, a PhysicalBlockTable and a VirtualBlockTable, building them as the request progresses and using them to manage the cache space. The PhysicalBlockTable (previously BlockList) holds the physical blocks, while the VirtualBlockTable maps tokens to blocks and back, via slot mappings and token mappings respectively.
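A toy sketch of the bookkeeping this implies, assuming a slot is simply `block_id * block_size + offset` (the actual PhysicalBlockTable/VirtualBlockTable in the PR are richer than this):

```python
# Illustrative bookkeeping only: one list of physical block ids, plus
# token<->slot mappings kept on the "virtual" side.
from typing import Dict, List

BLOCK_SIZE = 4

physical_block_table: List[int] = []   # physical block ids, in policy order
token_to_slot: Dict[int, int] = {}     # virtual side: token index -> slot
slot_to_token: Dict[int, int] = {}     # virtual side: slot -> token index


def place_token(token_index: int, block_pos: int, offset: int) -> None:
    """Record that token `token_index` lives at (block, offset)."""
    slot = physical_block_table[block_pos] * BLOCK_SIZE + offset
    token_to_slot[token_index] = slot
    slot_to_token[slot] = token_index


physical_block_table.extend([7, 2])    # pretend the allocator gave us blocks 7 and 2
place_token(0, block_pos=0, offset=0)  # first token -> slot 28
place_token(5, block_pos=1, offset=1)  # sixth token -> slot 9
print(token_to_slot)                   # {0: 28, 5: 9}
```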
[Status]
Using CachePolicy, I implemented the first two concrete policies, matching what is used today. CachePolicyBase behaves the same as the default, while CachePolicySlidingWindow maintains a sliding-window context. Instead of mixing these two behaviors together as they are now, the CachePolicy interface lets the sliding window be implemented cleanly and completely. By complete, I mean the sliding window now takes effect in prefill and is adhered to in decode: once a full window of blocks has been allocated, no new block is ever allocated again.
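To illustrate that decode behavior, here is a hedged, self-contained sketch of a window-bounded policy (hypothetical names and a made-up allocator, not the PR's CachePolicySlidingWindow): once a full window of blocks exists, new tokens reuse the oldest slots instead of requesting new blocks:

```python
# Hypothetical sketch of window-bounded decode: the window is a ring of slots,
# so after it fills, the oldest slot is evicted and reused and the allocator
# is never asked for another block.
import itertools
from typing import List


class CountingAllocator:
    """Hypothetical allocator that hands out increasing block ids."""

    def __init__(self) -> None:
        self._ids = itertools.count()

    def allocate(self) -> int:
        return next(self._ids)


class SlidingWindowSketch:
    def __init__(self, window_blocks: int, block_size: int, allocator: CountingAllocator) -> None:
        self.capacity = window_blocks * block_size  # max slots kept in the window
        self.block_size = block_size
        self.allocator = allocator
        self.blocks: List[int] = []
        self.num_tokens = 0

    def add_token_decode(self) -> int:
        """Assign a slot to one new decoded token and return that slot."""
        if self.num_tokens < self.capacity:
            if self.num_tokens % self.block_size == 0:
                # Still filling the window: request a fresh block.
                self.blocks.append(self.allocator.allocate())
            slot = self.num_tokens
        else:
            # Window full: evict the oldest token and reuse its slot,
            # so no new block is ever allocated past this point.
            slot = self.num_tokens % self.capacity
        self.num_tokens += 1
        return slot


policy = SlidingWindowSketch(window_blocks=2, block_size=4, allocator=CountingAllocator())
slots = [policy.add_token_decode() for _ in range(12)]
print(policy.blocks)  # [0, 1]  -- never grows beyond the window
print(slots)          # [0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3]
```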
I have lightly tested it on Qwen2.5 with basic usage, and I am posting it early for comments and suggestions.
Future work:
Thanks.
@simon-mo @comaniac @WoosukKwon @cadedaniel @alexm-neuralmagic