
[KV-Cache] Make k_scale, v_scale as attributes of self_attn using HFCache #148

Merged
merged 21 commits into main from kv-cache on Sep 25, 2024

Conversation

@horheynm (Member) commented Aug 31, 2024

Fixes #132

Automatically set k_scale and v_scale as attributes of the self_attn module.

Before:
Compute the input, weight, and output quantization parameters and create them as attributes of the leaf module (e.g. self_attn.k_proj.output_scale).
Then copy the attribute up to the self_attn layer (e.g. self_attn.k_proj.output_scale -> self_attn.k_scale) and delete the attribute from the child, as sketched below.
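A minimal sketch of that copy-and-delete step, assuming standard PyTorch modules; the helper name hoist_kv_scales and the buffer handling are illustrative, not the library's actual code:

```python
import torch

def hoist_kv_scales(self_attn: torch.nn.Module) -> None:
    """Pre-PR flow (hypothetical sketch): copy <proj>.output_scale up to
    self_attn.<k|v>_scale, then drop it from the child projection."""
    for proj_name, scale_name in (("k_proj", "k_scale"), ("v_proj", "v_scale")):
        proj = getattr(self_attn, proj_name)
        if hasattr(proj, "output_scale"):
            # Register the scale on the parent attention module ...
            self_attn.register_buffer(scale_name, proj.output_scale.detach().clone())
            # ... and delete the attribute from the leaf module.
            delattr(proj, "output_scale")
```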

After:
Use HF's cache object. When self_attn's forward is called, use the cache to compute the k_scale and v_scale of the kv cache, and wrap the forward call to automatically populate the scales onto self_attn (sketched below).
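A rough sketch of the wrapper idea under simplified assumptions: QuantizedKVCache, its update() method, the FP8-style max scaling, and the past_key_value kwarg routing are illustrative stand-ins for HF's QuantizedCache API, not the exact implementation in this PR.

```python
import functools
import torch

class QuantizedKVCache:
    """Hypothetical cache that observes K/V tensors and records their scales."""

    def __init__(self):
        self.k_scale = None
        self.v_scale = None

    def update(self, key_states: torch.Tensor, value_states: torch.Tensor):
        # Per-tensor max scaling shown purely for illustration (fp8-e4m3 max = 448).
        self.k_scale = key_states.abs().max() / 448.0
        self.v_scale = value_states.abs().max() / 448.0
        return key_states, value_states

def wrap_attention_forward(self_attn: torch.nn.Module, cache: QuantizedKVCache) -> None:
    """Wrap self_attn.forward so the cache sees K/V and the resulting
    scales are published as attributes of self_attn."""
    original_forward = self_attn.forward

    @functools.wraps(original_forward)
    def wrapped(*args, **kwargs):
        # Route the cache through the attention call, then copy the
        # observed scales back onto the attention module.
        kwargs["past_key_value"] = cache
        out = original_forward(*args, **kwargs)
        self_attn.k_scale = cache.k_scale
        self_attn.v_scale = cache.v_scale
        return out

    self_attn.forward = wrapped
```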

Note:
Previously, the scales were computed by wrap_module_forward_quantized, which does out = output_fq(weight_fq(input_fq(x))), where *_fq is fake quantization.

Now the cache computes the k_scale/v_scale of the kv cache, not the output activations.
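For reference, a toy fake-quantize helper in the spirit of the old *_fq chain; per-tensor symmetric int8 scaling is an illustrative assumption, not the library's scheme:

```python
import torch

def fake_quantize(x: torch.Tensor, qmin: int = -128, qmax: int = 127) -> torch.Tensor:
    # Quantize then immediately dequantize, so downstream code sees float tensors
    # that carry the quantization error.
    scale = x.abs().max().clamp(min=1e-12) / qmax
    return (x / scale).round().clamp(qmin, qmax) * scale
```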

@horheynm horheynm marked this pull request as draft August 31, 2024 03:32
@horheynm horheynm requested review from Satrat and mgoin August 31, 2024 17:33
@horheynm horheynm self-assigned this Aug 31, 2024
@horheynm horheynm marked this pull request as ready for review August 31, 2024 17:33
Files with review comments (all threads resolved):
src/compressed_tensors/quantization/cache.py
src/compressed_tensors/quantization/lifecycle/apply.py
src/compressed_tensors/quantization/lifecycle/forward.py
tests/test_quantization/test_cache.py
@mgoin (Member) left a comment:

I tested on a few different models and it works well for producing checkpoints that work with vLLM, thanks George!

@mgoin mgoin merged commit 74f1aa6 into main Sep 25, 2024
1 check passed
@mgoin mgoin deleted the kv-cache branch September 25, 2024 15:12
Labels: none yet
Projects: none yet
Development

Successfully merging this pull request may close these issues.

[Feature] Change kv_cache_scheme to HF QuantizedCache rather than Linear.output_scale
3 participants