Replies: 22 comments · 15 replies
-
I can also confirm that calibrating using 8,000 tokens from the clean calibration dataset instead of 90,000 tokens is still worse than using 8k tokens from the random dataset. The random data had less deviation & lower ppl, & is closer to the base model for both the pretrain data perplexity & the lyrical perplexity. (I used 128 context for both. Lower context seems better for calibration.)
-
Fascinating. Would a big list of "relevant" tokens be even more ideal than a random selection? Like, say, the 10K or 100K most common English words?
-
IMatrix conditions: Mixtral 8x7b instruct v0.1, Q8_0 base, wiki.train.raw, 512 ctx, partial GPU offload. Quantized from the f16 GGUF with the above imatrix at various chunk lengths.

Final estimate: PPL = 4.6288 +/- 0.02509 (mod3ks)

Note: mod3ks has q6_K attn_output and attn_q weights, Q8_0 attn_v; the ffn_ tensors are all q3_K.

EDIT: I did a 40-chunk run of the PennTreeBank (PTB) dataset for the imatrix, and the subsequent quant's perplexity on wiki.test.raw looks to be within the margin of error of that. Might have to try a longer run.
-
After experimenting a bit, I got the lowest ppl for both test cases (the pretraining-esque data and the lyrics) by using roughly 20k near-random tokens at 256 context length when calculating the imatrix. Here is what I settled on:
-
A lot more experimentation is needed; this looks interesting. Will we soon find a collection of imatrix downloads on HF? :)
-
This phenomenon is reproducible in exllamav2 as well! I quantized Mistral 7B to 3bpw in exllamav2, and tested perplexity on a small dataset from kalomaze and a much bigger dataset of novel-style fiction and Vicuna-formatted chats.

With the default exllama quantization on diverse data:
- test_ppl perplexity: 10.0170
- stories + chats perplexity: 11.4470

With kalomaze's "20K of random words" file used to quantize Mistral 7B:
- test_ppl perplexity: 9.9101
- stories + chats perplexity: 11.3409

@turboderp @lonestriker you may be interested in this as well.
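For reference on how these numbers are derived: perplexity is just the exponentiated mean negative log-likelihood over the evaluated tokens. A minimal, generic sketch of that calculation (not exllamav2's or llama.cpp's actual evaluation code; the log-probability values below are made up):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood
    (natural log) over all evaluated tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical per-token log-probabilities from two quantized models
# evaluated on the same text (illustrative values only).
default_quant = [-2.31, -2.45, -2.02, -2.28]
random_words_quant = [-2.29, -2.41, -2.01, -2.25]

print(perplexity(default_quant))       # higher -> worse fit to the test text
print(perplexity(random_words_quant))  # lower  -> better fit
```

On that scale, the ~1% perplexity improvements reported above correspond to a small but consistent reduction in average per-token loss.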
-
Did some more testing, but with Mistral 7b instruct v0.2.

All imatrix calculations were done on the f16 GGUF for Mistral 7b instruct v0.2. PTB was sourced from here.

I did notice a discrepancy in the Hellaswag scores depending on whether the run was on my P40 or my 3060 Ti (8GB); I compiled with force_MMQ. The PPL scores were effectively identical regardless of GPU.

PPL on wiki.test.raw

Hellaswag
-
@TheBloke Maybe you'd be interested in this.
-
@kalomaze What is the outcome of using your random tokens for quantization types that need more guidance from the importance matrix? Such as
-
Kind of related: for multiple consecutive matrix multiplications you can rearrange rows/columns without affecting the results. But this changes which values end up in a block together, so after quantization the results do change. When I tested this idea it turned out that a random order is essentially best, and that sorting the data in such a way that large values end up in the same block makes PPL worse. You could potentially try to optimize the order, but I discarded the idea because I expected this to overfit. But if random tokens work well as input then maybe this is viable after all?
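A small numpy sketch of the basic version of that invariance (my own illustration, not the experiment described above): permuting the shared inner dimension of both operands leaves the product unchanged in exact arithmetic, but it changes which weights end up in the same quantization block, and therefore which values share a block scale.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 16))   # e.g. activations
B = rng.normal(size=(16, 8))   # e.g. weights

# Permute the shared (inner) dimension of both operands identically.
p = rng.permutation(16)
assert np.allclose(A @ B, A[:, p] @ B[p, :])  # product is unchanged

# But block-wise quantization sees a different grouping: toy 4-value
# blocks along the inner dimension now get different max-abs scales.
def block_scales(col, block=4):
    return np.abs(col).reshape(-1, block).max(axis=1)

print(block_scales(B[:, 0]))  # scales with the original row order
print(block_scales(B[p, 0]))  # different scales after reordering
```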
-
I have seen quite a few comments here and elsewhere talking about importance matrix over-fitting, so I feel I need to add some context. I don't think the approach I have implemented for

Long story short: before jumping to conclusions that one imatrix approach is better than another, you need a much more extensive evaluation than one quantization type and the perplexity of two quite small test datasets.
-
I have answered my own question. For Mistral-7B and
So, Winogrande is the same, PPL and HellaSwag are better with imatrix from
-
Someone did a pretty advanced analysis on this topic; you can find it here:
-
What's blocking us from using all of the model's known tokens, in a few combinations, as the importance matrix input? (Except cost, since it's going to be 32k * combinations and change.)
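To put a rough number on that cost (assuming a 32k-token vocabulary and 512-token chunks, the default imatrix context length; the shuffle count here is arbitrary):

```python
def full_vocab_cost(vocab_size=32_000, ctx=512, n_shuffles=4):
    # One pass that contains every token id exactly once needs
    # vocab_size tokens, i.e. vocab_size / ctx chunks; every extra
    # shuffled "combination" repeats that cost.
    tokens = vocab_size * n_shuffles
    return tokens, tokens // ctx

print(full_vocab_cost())  # (128000, 250): 250 chunks for 4 shuffled passes
```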
-
Here are some KL-divergence data for various quants of Mistral-7B-Instruct-v0.1. I used @Ttl's llama_kl.py script with the full 330K tokens from wiki.test.raw. For what it's worth, I think perplexity is the wrong metric to optimise for when you're quantizing models, since it doesn't measure deviation from the unquantized model. No data for IQ2_XS/IQ2_XXS, because they don't work on ROCm and are way too slow on the CPU.
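For anyone who wants to reproduce the idea behind that metric: the number being averaged is the per-token KL divergence between the full-precision model's next-token distribution and the quantized model's, evaluated on the same text. A rough sketch of the computation (my own, not @Ttl's actual llama_kl.py; the logits below are synthetic placeholders):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_token_kl(base_logits, quant_logits, eps=1e-10):
    """Mean KL(P_base || P_quant) over token positions.

    Both arguments have shape (n_tokens, vocab_size): the logits produced
    by the fp16 reference model and the quantized model on the same input."""
    p = softmax(base_logits)
    q = softmax(quant_logits)
    kl = (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)
    return kl.mean()

# Synthetic stand-ins; in practice these come from running both models
# over e.g. wiki.test.raw and dumping the logits.
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 32_000))
quant = base + 0.01 * rng.normal(size=(4, 32_000))
print(mean_token_kl(base, quant))
```

Unlike perplexity, this is zero only when the quantized model reproduces the reference distribution exactly, which is why it captures deviation from the unquantized model rather than raw fit to the test text.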
-
My intuition tells me that we should use the dataset the model was trained on, which is impractical, but maybe the calculation is doable with the fine-tuning dataset... brb, I'm gonna go quantize some dolphins.
-
@kalomaze @ikawrakow
But I can certainly say 100 chunks aren't enough: there is a huge PPL difference between 100 chunks and 10,000 chunks.
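For context on what those chunk counts mean in terms of calibration data (assuming the default 512-token imatrix context; the actual context length of that run isn't stated):

```python
def calibration_tokens(n_chunks, ctx=512):
    # Each imatrix chunk is one forward pass over ctx tokens, so the
    # total calibration data seen is chunks * context length.
    return n_chunks * ctx

print(calibration_tokens(100))     # 51,200 tokens
print(calibration_tokens(10_000))  # 5,120,000 tokens
```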
-
As for myself, to speed up testing, I ran an experiment on Sheared Llama 2 1.3b (which apparently shares the features of the Llama 2 architecture consistently and properly despite being a shrink of Llama 2 7b, including in terms of rope sweet spots) with a small matrix. In the end I picked -c 25 with -chunks 32 (the poor man's matrix; any lower value is bad, and any higher value up to -c 768 -chunks 100 isn't decisively better), based on wiki.train.raw, because I needed to quantize a 70b model.

The experiment with various ctx values and chunk counts gave me this:

- princeton-nlp_Sheared-LLaMA-1.3B-AR-b1924-IQ2_XS.iMatrix_Wiki_c768_ch100.gguf | - | wikitext | 11.5608 | 512

And on 70b, this small matrix c032-ch025 lowers the perplexity by:

- More than 3% in Rope 8 on Q2_K: WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-AR-b1924-Q2_K.gguf, -, wikitext, 6.2489, 512
- More than 2% in Rope 4 on Q2_K: WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-AR-b1924-Q2_K.gguf, -, wikitext, 4.8859, 512
- More than 1.5% in Rope 2 on Q2_K: WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-AR-b1924-Q2_K.gguf, -, wikitext, 4.5030, 512
- More than 1% with Rope 8 on Q3_K_S: WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-AR-b1924-Q3_K_S.gguf, -, wikitext, 5.6127, 512
-
There is an indication that there is some randomness involved in the performance of the quants, no matter if measured via PPL or KL. An imatrix dataset that is the least similar to the test set can still lead to the generation of a quant that scores the best in tests, yet only with a low probability. Similarly, the most suitable imatrix dataset can generate the worst quant results, also only with a low probability. Finding a way to reduce those random outliers would make the testing process simpler, and thereby help to determine if something like a "best on average" imatrix dataset can be achieved.
-
I don't think an "average" approach is the right solution, especially since the "average" will likely turn out to be both a great and a bad solution depending on the model it's used on. Given how unreliable creating a well-working imatrix is, I believe we would need a tool that automates it for each model. Those imatrix results should then be given standardized, short file names that show what perplexity gain they provide and how many runs they included, so we can have GGUF files and imatrix files that directly indicate how good they are. Example:
-
I sometimes encounter something like these.
-
Still getting this warning
-
So, I mentioned before that I was concerned that wikitext-style calibration data, or data that lacked diversity in general, could potentially be worse for importance matrix calculations compared to more "random" data. My reasoning was that it might otherwise "overfit" to a particular style of data, and outlier model activations would be more prone to being quantized away.
If we are judging based on perplexity, then it seems I was correct.
But, it doesn't stop there. Out-of-domain data actually gets worse in terms of perplexity as a result of a large volume of "clean" data being used for calibration over the "random" data.
I also evaluated the perplexity of some song lyrics (about 2,500 tokens worth, in smaller batches) which were not in the calibration datasets.
Very interesting stuff. @ikawrakow
Also, here's a sample of what the incoherent calibration data looks like:
And what the pretrain-style data looks like, for comparison:
Here is the data I used to calibrate:
8k_random_data.txt
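For anyone curious, here is a minimal sketch of one way such a near-random calibration file could be generated. This is purely illustrative: it is not the script used to produce 8k_random_data.txt, and `wordlist.txt`, the word count, and the words-per-line value are all placeholders.

```python
import random

def make_random_calibration(wordlist_path, out_path, n_words=20_000, seed=0):
    """Write a calibration file of randomly sampled words as a long
    whitespace-separated stream with occasional line breaks."""
    rng = random.Random(seed)
    with open(wordlist_path, encoding="utf-8") as f:
        words = [w.strip() for w in f if w.strip()]
    sampled = [rng.choice(words) for _ in range(n_words)]
    lines = [" ".join(sampled[i:i + 16]) for i in range(0, n_words, 16)]
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))

make_random_calibration("wordlist.txt", "random_calibration.txt")
```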