[KV-Cache] Make k_scale, v_scale as attributes of self_attn using HFCache #148
Conversation
horheynm
commented
Aug 31, 2024
Satrat
suggested changes
Sep 3, 2024
mgoin
reviewed
Sep 3, 2024
mgoin
reviewed
Sep 24, 2024
mgoin
approved these changes
Sep 25, 2024
I tested on a few different models and it works well for producing checkpoints that work with vLLM, thanks George!
FIX #132

Automatically set `k_scale`/`v_scale` as attributes of the `self_attn` module.

Before:
Compute the input, weight, and output quantization params and attach them as attributes of the leaf module (e.g. `self_attn.k_proj.output_scale`). Then copy the attribute up to the `self_attn` layer (e.g. `self_attn.k_proj.output_scale` -> `self_attn.k_scale`) and delete it from the child.
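The pre-change flow can be sketched as a small helper. The helper name `migrate_kv_scales` is hypothetical; `k_proj`/`v_proj`, `output_scale`, and `k_scale`/`v_scale` are the attribute names from the description:

```python
import torch
from torch import nn

def migrate_kv_scales(self_attn: nn.Module) -> None:
    """Sketch of the old flow: hoist the output scales computed on the
    k_proj/v_proj leaf modules up to the parent attention module, then
    delete them from the children."""
    self_attn.k_scale = self_attn.k_proj.output_scale
    self_attn.v_scale = self_attn.v_proj.output_scale
    del self_attn.k_proj.output_scale
    del self_attn.v_proj.output_scale
```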
After:
Use HF's cache object. When `self_attn`'s forward is called, the cache computes the `k_scale` and `v_scale` of the kv cache, and the forward call is wrapped to automatically populate those scales onto `self_attn`.
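A minimal sketch of the new flow, assuming a cache-like object that observes key/value states on `update` and derives scales from a running absolute max. The class and method names are illustrative, not the actual HF/compressed-tensors API:

```python
import torch

class KVScaleCacheSketch:
    """Observes key/value states as they pass through attention and tracks
    a running abs-max, from which fp8-style k/v scales are derived."""

    FP8_E4M3_MAX = 448.0  # largest representable float8_e4m3fn value

    def __init__(self) -> None:
        self.k_absmax = torch.tensor(0.0)
        self.v_absmax = torch.tensor(0.0)

    def update(self, key_states: torch.Tensor, value_states: torch.Tensor):
        # Track the largest magnitude seen so far across all forward calls.
        self.k_absmax = torch.maximum(self.k_absmax, key_states.abs().amax())
        self.v_absmax = torch.maximum(self.v_absmax, value_states.abs().amax())
        return key_states, value_states

    @property
    def k_scale(self) -> torch.Tensor:
        return self.k_absmax / self.FP8_E4M3_MAX

    @property
    def v_scale(self) -> torch.Tensor:
        return self.v_absmax / self.FP8_E4M3_MAX
```

A wrapped `self_attn.forward` would then copy the cache's `k_scale`/`v_scale` onto the module after each call, so the scales land on `self_attn` in the saved checkpoint.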
Note:
Previously, the scales were computed by `wrap_module_forward_quantized`, which does `out = output_fq(weight_fq(input_fq(x)))`, where `*_fq` is fake quantization. Now the cache computes the k/v scales of the kv cache itself, not the output activations.
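As a reference point for the note above, fake quantization (quantize then immediately dequantize, staying in the original dtype) can be sketched as follows; the symmetric rounding scheme and fp8 range here are illustrative assumptions, not the library's exact implementation:

```python
import torch

def fake_quantize(x: torch.Tensor, scale: torch.Tensor,
                  qmin: float = -448.0, qmax: float = 448.0) -> torch.Tensor:
    """Round x onto the quantized grid defined by `scale`, then map back.
    The output keeps x's dtype but carries the quantization error."""
    q = torch.clamp(torch.round(x / scale), qmin, qmax)
    return q * scale
```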