-
The Non-linear quantization one is kind of interesting.
-
It would be great to have more choice, of course. But basically, I'd go for the obvious here: the best compromise for each range of situations. In this case, the improved Q2_K (pre-SOTA) and the Q3_K_S are competing with each other. About the code, you guys know best, but I guess you apply a principle of "lowest maintenance": factorize what can be factorized, so that anything common between the quants lives in shared code, leaving the smallest amount of separate code for each quant's specifics. (Don't flame me, what's obvious for you guys is sophisticated for me!) About the IQ quants, I'll pass on commenting because it's beyond my pay grade lol, but IQ2_XS is great, and the same general principles of course apply!
-
I think 2) is what I have been thinking about for some time but never got around to trying in practice:
So what if we had an entirely different datatype, something exponential for example, which still allows encoding outliers somehow, but spends only one bit of the precision on that. In 3-bit, that could be ... Another idea would be to use some kind of entropy coding instead (or even on top of that): since we know that not all weights are likely to be outliers, once we've already seen N_MAX outliers, we can reuse those codes to encode "ordinary" weights (-127 meaning -3, for example). But I'm not sure how practical it would be to do this in a shader, so this is just a very abstract sketch. Maybe we could also store the offset of the first outlier and use that together with the counter? Also, I'm not sure what the actual scale of the different outliers is, but from my intuitive understanding, they should have more or less the same "power" compared to ordinary weights. Forgive me if I'm mumbling nonsense, I am not an expert.
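To make the idea a bit more concrete, here is one possible reading of it as a minimal sketch: reserve one of the 2^3 codes for an "outlier" level and space the remaining codes roughly exponentially. The codebook values and the layout are made-up assumptions for illustration, not an existing `ggml` type.

```cpp
// Illustrative only: a hypothetical 3-bit non-uniform codebook where one code
// is reserved for an "outlier" level and the remaining codes are spaced
// roughly exponentially. Not an existing ggml type.
#include <array>
#include <cstdint>
#include <cstdio>

// 8 possible codes for a 3-bit quant. Code 7 is the dedicated outlier level.
static constexpr std::array<float, 8> kCodebook = {
    -4.0f, -2.0f, -1.0f, 0.0f, 1.0f, 2.0f, 4.0f,  // "ordinary" levels
    16.0f                                          // outlier level
};

// De-quantize one 3-bit code with a per-block scale.
inline float dequant3(uint8_t code, float scale) {
    return scale * kCodebook[code & 7];
}

int main() {
    for (uint8_t code = 0; code < 8; ++code) {
        std::printf("code %d -> %.1f\n", code, dequant3(code, 1.0f));
    }
}
```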
-
This is not quite true for the quants in `llama.cpp`. Outliers are lost in, e.g., quantization schemes based on clustering.
-
@TheBloke Any reason why there are no imatrix and new 2-bit quants yet? It seems you're still uploading the old ones.
-
In another discussion around quantization, somebody brought up the recent AQLM paper. I spent some time making a more detailed comparison between the results of this paper and the quantization types available in `llama.cpp`. Now back to the actual comparison.
The graph shows the quantization error as a function of bits-per-weight (bpw) used. Please note the logarithmic scale of the y-axis, chosen to be able to represent the large variation in quantization error as one goes from 2- to 4-bit quants. I have taken the GPTQ, QuIP#, and SpQR results from the AQLM paper (so don't blame me if they are wrong). GPTQ is shown with orange circles. It is clearly far behind the competition, so I'll not discuss it further (although, even today, one comes across claims around the Internet that GPTQ is the SOTA when it comes to quantization). The SpQR (blue triangles down, available at 3 and 4 bpw) and QuIP# (magenta triangles up, available at 2 and 4 bpw) results are respectable and quite similar to each other, but clearly not SOTA, lying above the black line with squares representing the published k- and i-quants. The AQLM results, shown with red circles/dashed line, are clearly the new SOTA at 2 and 3 bpw, so congratulations to the AQLM authors! At 4 bpw it is not quite as clear. There is no 4.0-bit k- or i-quantization, and ...
-
I am very confused by the recent changes in quants. There are not enough tests and no clear guidelines on how to use imatrices. Combining quants times imatrices times tests leads to a combinatorial explosion; I can't possibly test everything, but I've done what I can with a test script. It runs MMLU, WinoGrande, ARC and perplexity against almost all quants, pretty-prints the results and compares them against Q8_0 and the TheBloke quants computed about a month ago. The 'X' indicates whether the result is statistically significant, if I didn't screw up the math. It seems that multiple-choice tests are not very helpful because the results simply do not reach statistical significance. Perplexity is better, but it's unclear how that translates into real-world performance. I wonder what could be done. If you are looking for something specific, I have all the output from llama.cpp for the tests, quants, and imatrix preparations. Hope this helps.
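As a rough illustration of the kind of significance check involved (my own sketch, not the commenter's actual math): for a multiple-choice benchmark, two quantizations can be compared with a two-proportion z-test, which shows why small accuracy differences on a few thousand questions rarely reach significance.

```cpp
// Sketch: two-proportion z-test for comparing multiple-choice accuracy of two
// quantizations on the same benchmark. Illustrative only.
#include <cmath>
#include <cstdio>

// Returns the two-sided p-value for the difference between two accuracies.
double two_proportion_p_value(int correct_a, int total_a, int correct_b, int total_b) {
    double pa = double(correct_a) / total_a;
    double pb = double(correct_b) / total_b;
    double p  = double(correct_a + correct_b) / (total_a + total_b);   // pooled accuracy
    double se = std::sqrt(p * (1.0 - p) * (1.0 / total_a + 1.0 / total_b));
    double z  = (pa - pb) / se;
    return std::erfc(std::fabs(z) / std::sqrt(2.0));                   // two-sided p-value
}

int main() {
    // Example: a 0.4% accuracy difference on a ~1,500-question benchmark
    // does not come close to significance.
    double p = two_proportion_p_value(1005, 1500, 999, 1500);
    std::printf("p-value = %.3f\n", p);  // ~0.8 -> not significant
}
```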
-
https://en.wikipedia.org/wiki/Chebyshev_polynomials might be of interest, especially if you ever need to optimize the parameters (i.e., the standard polynomial basis coefficients aren't independent, and altering, say, ...).
Another interesting thing to look at are Padé approximants: https://en.wikipedia.org/wiki/Pad%C3%A9_approximant. The wiki page doesn't really do a good job of explaining their purpose, but this does: https://www.youtube.com/watch?v=szMaPkJEMrw
Finally, if you've never heard of it, the book The End of Error: Unum Computing is a fascinating read. Likely not that helpful directly, but it makes a good case for variable-length floating point types, and might be food for thought! :)
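To connect this to the cubic mapping discussed in the original post, here is a purely illustrative sketch of evaluating the same 3rd-order de-quantization in the monomial basis and in the Chebyshev basis T0..T3; the coefficients are arbitrary examples, not fitted values.

```cpp
// Sketch: the same cubic de-quantization expressed in the monomial basis and
// in the Chebyshev basis T0..T3. Coefficients are arbitrary examples.
#include <cstdio>

// Monomial form: x = a*q^3 + b*q^2 + c*q + d (Horner's rule).
float dequant_monomial(float q, float a, float b, float c, float d) {
    return ((a * q + b) * q + c) * q + d;
}

// Chebyshev form: x = c0*T0(t) + c1*T1(t) + c2*T2(t) + c3*T3(t),
// with t the quant value mapped into [-1, 1].
float dequant_chebyshev(float t, float c0, float c1, float c2, float c3) {
    float T2 = 2.0f * t * t - 1.0f;        // T2(t) = 2t^2 - 1
    float T3 = (4.0f * t * t - 3.0f) * t;  // T3(t) = 4t^3 - 3t
    return c0 + c1 * t + c2 * T2 + c3 * T3;
}

int main() {
    // Map a 4-bit quant q in [0, 15] to t in [-1, 1] before evaluating.
    for (int q = 0; q < 16; ++q) {
        float t = (2.0f * q - 15.0f) / 15.0f;
        std::printf("q=%2d  x=%.4f\n", q, dequant_chebyshev(t, 0.0f, 1.0f, 0.0f, 0.1f));
    }
}
```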
-
Hi all, thank you for your previous discussions and insights on vector quantization and the support for VQ-based weights in llama.cpp. We've recently developed a method called VPTQ (Vector Post-Training Quantization), which you can explore here: https://github.com/microsoft/VPTQ. Here's a brief overview of VPTQ: the method quantizes weights into index and vector components that form lookup tables.
Questions:
Looking forward to your insights and suggestions. Thanks!
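For readers unfamiliar with lookup-table (vector) quantization in general, here is a minimal sketch of codebook-based de-quantization; the layout, sizes, and names are illustrative assumptions, not the actual VPTQ format.

```cpp
// Sketch of generic vector-quantized de-quantization: each group of `dim`
// consecutive weights is stored as a single index into a shared codebook of
// vectors. Layout and names are illustrative, not VPTQ's.
#include <cstdint>
#include <vector>

struct VQTensor {
    int dim;                         // vector length per codebook entry (e.g. 8)
    std::vector<float>    codebook;  // num_entries * dim de-quantized values
    std::vector<uint16_t> indices;   // one index per group of `dim` weights
};

// Reconstruct the full weight row from indices + codebook.
std::vector<float> dequantize(const VQTensor & t) {
    std::vector<float> out;
    out.reserve(t.indices.size() * t.dim);
    for (uint16_t idx : t.indices) {
        const float * v = t.codebook.data() + size_t(idx) * t.dim;
        out.insert(out.end(), v, v + t.dim);
    }
    return out;
}
```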
-
I normally don't hang around here anymore, but given that I started this thread, I decided to chime in. Thank you, @matt-c1, for sorting out that the token embedding and output tensors are left as `fp16`.
It is OK, but is it SOTA as claimed? Here is a graph that shows the quantization error. So, is VPTQ quantization SOTA?
Is it OK to leave token embedding and output tensors as `fp16`?
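To illustrate why leaving those two tensors in `fp16` matters for the claimed bits per weight, here is a back-of-the-envelope sketch with made-up tensor sizes, not numbers from the VPTQ paper.

```cpp
// Effective bits-per-weight when token embedding and output tensors stay in
// fp16 while the rest of the model is quantized. Tensor sizes are made up.
#include <cstdio>

int main() {
    const double total_params = 7.0e9;   // hypothetical 7B model
    const double fp16_params  = 0.5e9;   // token embedding + output kept in fp16
    const double quant_bpw    = 2.0;     // nominal bpw of the quantized tensors

    const double bits = (total_params - fp16_params) * quant_bpw + fp16_params * 16.0;
    const double effective_bpw = bits / total_params;

    // (6.5e9*2 + 0.5e9*16) / 7e9 = 3.0 bpw, i.e. a nominal "2-bit" model is ~3 bpw.
    std::printf("effective bpw = %.2f\n", effective_bpw);
}
```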
-
Hi @ikawrakow, we understand that adding a new quantization data type is a very difficult decision, depending on factors such as continued support for the quantization method, the maintainability of the software system, and so on. Currently, VPTQ only uses indices packed into int32 and a lookup table in fp16/bf16 (in the embedding operator). I would like to ask: if I want to support this kind of quantization method for VPTQ in llama.cpp, even on my own fork, which approach should I take:
Which approach would you prefer, and which one is more likely to be merged into the main branch? Thanks!
-
Hello @ikawrakow and the rest of the readers here. I am glad to see that you now have your own fork. You have meant so much to the llama.cpp project! Your ingenious contributions were very exciting to try. What do you think about QTIP? Sources: https://www.reddit.com/r/LocalLLaMA/comments/1ggwrx6/new_quantization_method_qtip_quantization_with/ Do you think this could be useful, or is it another project claiming SOTA results? Curious to hear your thoughts. It might also be interesting for your fork. I assume that people can still borrow code from your fork for llama.cpp? I am not sure why you felt the need to make your own fork in the first place, but I presume it has to do with having more control over what gets implemented? Once again, I sincerely appreciate all of the great and innovative work you've done so far!
-
In addition to the `IQ2_XXS`, `IQ2_XS`, `Q2_K_S` (and now `Q3_K_S` via PR #5060) that were recently added to `llama.cpp`, I have experimented with a number of other quantization types in a private development repository. Before embarking on a journey to add some of those to `llama.cpp`, I think it is useful to discuss if this will be considered a welcome addition.
To get the discussion going, in what follows I give a brief summary of what these additional quants bring to the table:
1. Row-wise quantization
All existing `llama.cpp` quantization types utilize a block-wise structure: either blocks of 32 quants (`Q4_0`, `Q4_1`, `Q5_0`, `Q5_1`, `Q8_0`), or blocks of 16 or 32 quants in super-blocks of 256 for the k-quants. Each super-block of 256 quants has 1 or 2 floating-point scales that convert the quants to actual model weights. My experiments show that the increase in quantization error from going from super-block scales to row-wise scales is very minor. Hence, one can go to scales per tensor row. There are two main benefits (a layout sketch follows this list):
- The current `llama.cpp` solution to a situation where k-quants cannot be used (e.g., when the tensor row size is not a multiple of 256) is to replace the quant type with one of `Q4_0`, `Q4_1`, `Q5_0`, `Q5_1`, `Q8_0`, resulting in a larger model or a lower quality quantization. Row-wise scales remove this restriction.
- One saves the bits spent on block scales (compared to `Q4_K`, so about 1.5%).
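As a rough sketch of the layout difference, here are two hypothetical structs; the first mirrors the spirit of `Q4_K`, the second a row-wise variant. Field names and sizes are illustrative, not actual `ggml` types.

```cpp
// Hypothetical layouts, for illustration only (not actual ggml structs).
#include <cstdint>

// Block-wise 4-bit layout in the spirit of Q4_K: each super-block of 256
// quants carries its own scales, and a tensor row is a sequence of these.
struct block_q4_superblock {
    uint16_t d;           // fp16 super-block scale
    uint16_t dmin;        // fp16 super-block min
    uint8_t  scales[12];  // packed 6-bit scales/mins for 8 sub-blocks of 32
    uint8_t  qs[128];     // 256 x 4-bit quants, two per byte
};

// Row-wise 4-bit layout: a single scale/min pair for the whole tensor row,
// so the number of columns no longer needs to be a multiple of 256.
struct row_q4 {
    float           d;    // per-row scale
    float           min;  // per-row min
    const uint8_t * qs;   // n_cols x 4-bit quants, two per byte
};
```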
2. Non-linear quantization
All existing `llama.cpp` quantization types use a linear mapping between quants and de-quantized weights (i.e., `x = a * q` or `x = a * q + b`, where `x` are the de-quantized model weights, `q` are the quants, and `a, b` are block-wise quantization constants). The non-linear quants that I have experimented with use a 3rd order relation, i.e., `x = a q^3 + b q^2 + c q + d`. The key benefit of doing so is that one can achieve very similar quantization quality to the k-quants with larger blocks, thus saving precious bits and reducing quantized model size. As an example, a 3rd order non-linear quantization with 4.125 bpw is comparable to `Q4_K`, which uses 4.5 bpw, for almost a 10% reduction in quantized model size. This comes at the expense of slightly lower performance (typically a few percent). But when the model does not fit into the available GPU, one can squeeze a few more layers onto the GPU that way, which can more than offset the slightly lower kernel performance. Oh, and why a 3rd order polynomial? I can give a more detailed explanation in the PR if it comes to that.
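A minimal sketch of what such non-linear de-quantization could look like; the block size, packing, and coefficient storage are assumptions for illustration, not the actual implementation.

```cpp
// Sketch of 3rd-order non-linear de-quantization: x = a*q^3 + b*q^2 + c*q + d,
// with one set of coefficients per (large) block. Layout is illustrative only.
#include <cstdint>

struct nl_block {
    float   a, b, c, d;   // per-block polynomial coefficients
    uint8_t qs[128];      // 256 x 4-bit quants, two per byte
};

void dequantize_nl(const nl_block & blk, float * out /* 256 floats */) {
    for (int i = 0; i < 128; ++i) {
        const int q0 = blk.qs[i] & 0x0F;
        const int q1 = blk.qs[i] >> 4;
        // Horner's rule for the cubic mapping.
        out[2*i + 0] = ((blk.a * q0 + blk.b) * q0 + blk.c) * q0 + blk.d;
        out[2*i + 1] = ((blk.a * q1 + blk.b) * q1 + blk.c) * q1 + blk.d;
    }
}
```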
3. k-means clustering quantization
k-means clustering is what is used in, e.g., SqueezeLLM. Basically, instead of scaling quants block-wise into the range provided by a given number of bits, one performs (weighted) k-means clustering on all weights in a tensor row, thus mapping weights to clusters (with the number of clusters defined by the bpw one wants to spend); a sketch of the clustering step follows below. In this way one can have a "true" `N`-bit quantization (using only `2^N` bytes per tensor row for the cluster means in addition to the `N` bpw for the quants). k-means clustering is a tricky business and the final outcome strongly depends on the details of the clustering algorithm and the model weights used. My implementation is different from SqueezeLLM and does slightly worse on LLaMA-v1-7B (PPL = 6.04 vs their 6.03) but much better for LLaMA-v2-7B with PPL = 5.91 vs their 5.96. (I don't have their PPL values for other models. PR #3093, which would add SqueezeLLM support to `llama.cpp` if accepted, is `ARM_NEON` only, so it takes a long time to run perplexities, and I only did it for the 7B LLaMAs.) This type of quantization is never as good as k-quants or non-linear quants, but it does squeeze out a few more bits from a quantized model.
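A compact sketch of the kind of (weighted) 1D k-means / Lloyd iteration involved; initialization, weighting, and convergence handling are simplified relative to any real implementation.

```cpp
// Sketch: weighted k-means (Lloyd's algorithm) in 1D, clustering the weights
// of one tensor row into 2^N centroids. Simplified for illustration.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// x: row weights (assumed non-empty), w: per-weight importance, k: number of clusters (2^N).
std::vector<float> weighted_kmeans_1d(const std::vector<float> & x,
                                      const std::vector<float> & w,
                                      int k, int iters = 20) {
    // Naive initialization: spread centroids uniformly over [min, max].
    float lo = x[0], hi = x[0];
    for (float v : x) { lo = std::min(lo, v); hi = std::max(hi, v); }
    std::vector<float> c(k);
    for (int j = 0; j < k; ++j) c[j] = lo + (hi - lo) * (j + 0.5f) / k;

    std::vector<int> assign(x.size());
    for (int it = 0; it < iters; ++it) {
        // Assignment step: nearest centroid for each weight.
        for (size_t i = 0; i < x.size(); ++i) {
            int best = 0;
            for (int j = 1; j < k; ++j)
                if (std::fabs(x[i] - c[j]) < std::fabs(x[i] - c[best])) best = j;
            assign[i] = best;
        }
        // Update step: importance-weighted mean of each cluster.
        std::vector<double> num(k, 0.0), den(k, 0.0);
        for (size_t i = 0; i < x.size(); ++i) {
            num[assign[i]] += double(w[i]) * x[i];
            den[assign[i]] += w[i];
        }
        for (int j = 0; j < k; ++j)
            if (den[j] > 0) c[j] = float(num[j] / den[j]);
    }
    return c;  // cluster means; the quants are the per-weight indices in `assign`
}
```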
Conclusion
Just for fun, below is a copy-paste of the `ggml_type` enum from my development repo. Obviously I would never add all of these to `llama.cpp`/`ggml`, but only pick a few select types that offer the best model-size-vs-quality tradeoff, if the consensus is that this would be valuable.