ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification

This repository provides the implementation of our paper "ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification". ZipCache is an adaptive mixed-precision quantization method for the KV cache of large language models (LLMs).
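
To give a feel for what mixed-precision quantization of a cache means in general, here is a toy sketch in pure Python. It is not the code in this repository: the saliency scores are simply passed in, and the function names, the `salient_ratio` parameter, and the 4-bit/2-bit split are illustrative assumptions rather than ZipCache's actual settings.

```python
from typing import List, Sequence, Tuple


def quantize_dequantize(values: Sequence[float], bits: int) -> List[float]:
    """Uniform asymmetric quantization to `bits` bits, followed by dequantization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant vector is represented exactly; avoid dividing by zero.
        return [float(lo) for _ in values]
    levels = (1 << bits) - 1          # number of quantization steps
    scale = (hi - lo) / levels        # width of one step
    return [round((v - lo) / scale) * scale + lo for v in values]


def mixed_precision_quantize(
    cache: Sequence[Sequence[float]],   # one row of values per token
    saliency: Sequence[float],          # one score per token (assumed to be given)
    salient_ratio: float = 0.5,         # hypothetical fraction of tokens kept at high precision
    high_bits: int = 4,                 # hypothetical bit-width for salient tokens
    low_bits: int = 2,                  # hypothetical bit-width for the remaining tokens
) -> Tuple[List[List[float]], List[int]]:
    """Quantize each token's row with a bit-width chosen by its saliency rank."""
    num_tokens = len(cache)
    num_salient = max(1, round(salient_ratio * num_tokens))
    # Rank tokens from most to least salient and keep the top ones at high precision.
    ranked = sorted(range(num_tokens), key=lambda i: saliency[i], reverse=True)
    salient = set(ranked[:num_salient])
    bit_widths = [high_bits if i in salient else low_bits for i in range(num_tokens)]
    dequantized = [quantize_dequantize(row, b) for row, b in zip(cache, bit_widths)]
    return dequantized, bit_widths
```

Calling `mixed_precision_quantize(cache, saliency)` returns the dequantized rows together with the bit-width assigned to each token, so the reconstruction error of salient and non-salient tokens can be compared directly.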

Getting Started

Follow the steps below to set up and run ZipCache.

Step 1: Setup

Create a virtual environment and install dependencies as specified by requirements.txt. Then install flash_attn and zipcache as follows:

pip install packaging ninja
pip install flash-attn --no-build-isolation
pip install -e .
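
After installation, a quick sanity check can confirm that the packages are visible to the interpreter before running the demo. The helper below is a minimal sketch, not part of this repository: `check_modules` is a hypothetical name, and the import names `torch` and `zipcache` are assumptions (the first about what requirements.txt installs, the second about the module name exposed by `pip install -e .`).

```python
import importlib.util
from typing import Dict, Iterable


def check_modules(names: Iterable[str]) -> Dict[str, bool]:
    """Report, for each top-level module name, whether it can be found without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Example usage (module names are assumptions, adjust them to your environment):
# for name, found in check_modules(["torch", "flash_attn", "zipcache"]).items():
#     print(f"{name}: {'ok' if found else 'MISSING'}")
```

Using `find_spec` rather than a plain `import` keeps the check fast and avoids loading CUDA extensions just to see whether they are installed.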

Step 2: Download Pretrained Models

Download a pretrained LLaMA model from Hugging Face and set MODEL_PATH in zipcache_generation_demo.py to the location of the downloaded model.
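
Before editing MODEL_PATH, it can help to verify that the download is complete. The following sketch assumes that MODEL_PATH points to a local directory in the usual Hugging Face layout (a `config.json` plus weight files ending in `.safetensors` or `.bin`); `missing_model_files` is a hypothetical helper and does not exist in this repository.

```python
from pathlib import Path
from typing import List


def missing_model_files(model_dir: str) -> List[str]:
    """Return a list of problems found in a local model directory (empty if none)."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"{model_dir} is not a directory"]
    problems = []
    # The model configuration is expected at the top level of the directory.
    if not (root / "config.json").is_file():
        problems.append("config.json")
    # Accept either weight format; sharded checkpoints also match these patterns.
    has_weights = any(root.glob("*.safetensors")) or any(root.glob("*.bin"))
    if not has_weights:
        problems.append("weight files (*.safetensors or *.bin)")
    return problems
```

An empty list means the directory looks usable as MODEL_PATH; otherwise the returned entries name what appears to be missing.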

Step 3: Inference with ZipCache

python3 zipcache_generation_demo.py

BibTeX

If you find this work useful for your research, please consider citing:

@article{he2024zipcache,
  title={ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification},
  author={He, Yefei and Zhang, Luoming and Wu, Weijia and Liu, Jing and Zhou, Hong and Zhuang, Bohan},
  journal={arXiv preprint arXiv:2405.14256},
  year={2024}
}
