linear:int4 quantization regression testing #1362
Comments
Nice catch. Now that we have the 1B models, we might be able to switch over to one of those instead of stories for a more representative model (or change the groupsize). As for padding vs. not padding, I'll need to think about that before I make a call.
We should update the test so that this doesn't happen: https://github.com/pytorch/torchchat/blob/main/.github/workflows/pull.yml#L335
We currently apply a uniform transformation to all layers, so we are limited to the GCD of all layer sizes. For language-llama (I can't believe it's almost a year ago that we brought up LLM support with the translation model), that GCD was small, so we need a solution. We sort of have several options ("sort of" because we may need more than one for a full solution). Because I didn't want to put too much on the engineer doing the quantization, I just did the bi-partition into output/!output node_type; that also makes padding attractive, because you can get something running without getting a PhD in quantization!
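To make the constraint concrete, here is a minimal sketch of why a single uniform groupsize is bounded by the GCD of every linear layer's in_features; the helper name and candidate list are hypothetical, not torchchat code:

```python
import math
from functools import reduce

import torch.nn as nn

def largest_uniform_groupsize(model: nn.Module, candidates=(256, 128, 64, 32)) -> int:
    # A uniform int4 transform must use one groupsize that divides every
    # nn.Linear's in_features, so it can be at most the GCD of them all.
    in_features = [m.in_features for m in model.modules() if isinstance(m, nn.Linear)]
    gcd = reduce(math.gcd, in_features)
    # Prefer the largest "nice" groupsize that still divides the GCD.
    for g in candidates:
        if gcd % g == 0:
            return g
    return gcd  # may be tiny, which is exactly the problem described above
```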
Let's make sure this does not explode the runtime, or the tests will become onerous (at which point developers start ignoring them or arguing for fewer tests). Also, to get the 1B model we need HF tokens, which is why periodic tests with llama models fail today, as per #1351.
🚀 The feature, motivation and pitch
In the past, we padded int4 quantization when the group size did not evenly divide the layer dimensions, to make things work. Since we decided to remove the padding, int4 quantization is now simply skipped for layers with a non-multiple group size. Among other things, this means int4 quantization is no longer tested, because the stories model's dimensions are not a multiple of 256.
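For context, a minimal sketch of the divisibility check being described; the helper name is illustrative and not torchchat's actual API:

```python
import torch

def int4_quantizable(weight: torch.Tensor, groupsize: int = 256) -> bool:
    # Group-wise int4 quantization slices each weight row into chunks of
    # `groupsize` elements. Without padding, a layer whose in_features is
    # not a multiple of the groupsize is simply skipped.
    in_features = weight.shape[1]
    return in_features % groupsize == 0
```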
Some options:
Alternatives
Put padding back into int4 quantization.
Yes, it's not ideal, but then again, silently skipping quantization isn't either. In my experience, just making things work increases utility for end users. If there's real concern about performance (int4 quantization with padding may still beat no quantization!), pad and issue a warning to users.
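A minimal sketch of what pad-and-warn could look like, assuming the weight is zero-padded along in_features before group quantization; the helper name is hypothetical, and the matching activation padding in the forward pass is omitted:

```python
import warnings

import torch
import torch.nn.functional as F

def pad_weight_for_int4(weight: torch.Tensor, groupsize: int = 256) -> torch.Tensor:
    # Zero-pad the last dim (in_features) up to the next multiple of groupsize
    # so group quantization always applies, at the cost of a little extra
    # compute/memory for the padded columns.
    in_features = weight.shape[-1]
    remainder = in_features % groupsize
    if remainder == 0:
        return weight
    pad = groupsize - remainder
    warnings.warn(
        f"int4 quantization: padding in_features from {in_features} to "
        f"{in_features + pad} to satisfy groupsize={groupsize}; "
        "expect a small performance/memory overhead."
    )
    return F.pad(weight, (0, pad))
```

Any padding like this would also have to be reflected at inference time, e.g. by padding the activations or folding the padding into the quantized layer's in_features.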
Additional context
No response
RFC (Optional)
No response