Replies: 22 comments · 15 replies
-
I can also confirm that calibrating using 8,000 tokens from the clean calibration dataset instead of 90,000 tokens is still worse than using 8k tokens from the random dataset. The random data had less deviation & lower ppl, & is closer to the base model for both the pretrain data perplexity & the lyrical perplexity. (I used 128 context for both. Lower context seems better for calibration.)
-
Fascinating. Would a big list of "relevant" tokens be even more ideal than a random selection? Like, say, the 10K or 100K most common English words?
-
IMatrix conditions: Mixtral 8x7b instruct v0.1, Q8_0 base, wiki.train.raw, 512 ctx, partial GPU offload. Quantized from the f16 GGUF with the above imatrix at various chunk lengths.

Final estimate: PPL = 4.6288 +/- 0.02509 (mod3ks)

Note: mod3ks has q6_K attn_output and attn_q weights, Q8_0 attn_v; the ffn_ tensors are all q3_K.

EDIT: I did a 40-chunk run of the PennTreeBank (PTB) dataset for the imatrix, and the subsequent quant's perplexity on wiki.test.raw looks to be within the margin of error of that. Might have to try a longer run.
-
After experimenting a bit, I got the lowest ppl for both test cases (the pretraining-esque data and the lyrics) by using roughly 20k near-random tokens at 256 context length when calculating the imatrix. Here is what I settled on:
-
A lot more experimentation is needed; this looks interesting. Will we soon find a collection of imatrix downloads on HF? :)
-
This phenomenon is reproducible in exllamav2 as well! I quantized Mistral 7B to 3bpw in exllamav2, and tested perplexity on a small dataset from kalomaze and a much bigger dataset of novel-style fiction and Vicuna-formatted chats.

With the default exllama quantization on diverse data:
- test_ppl perplexity: 10.0170
- stories + chats perplexity: 11.4470

With kalomaze's "20K of random words" file used to quantize Mistral 7B:
- test_ppl perplexity: 9.9101
- stories + chats perplexity: 11.3409

@turboderp @lonestriker you may be interested in this as well.
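For reference on how these numbers are derived: perplexity is just the exponentiated mean negative log-likelihood over the evaluated tokens. A minimal, generic sketch of that calculation (not exllamav2's or llama.cpp's actual evaluation code; the log-probability values below are made up):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood
    (natural log) over all evaluated tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical per-token log-probabilities from two quantized models
# evaluated on the same text (illustrative values only).
default_quant = [-2.31, -2.45, -2.02, -2.28]
random_words_quant = [-2.29, -2.41, -2.01, -2.25]

print(perplexity(default_quant))       # higher -> worse fit to the test text
print(perplexity(random_words_quant))  # lower  -> better fit
```

On that scale, the ~1% perplexity improvements reported above correspond to a small but consistent reduction in average per-token loss.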
-
Did some more testing, but with Mistral 7b instruct v0.2.

All imatrix calculations were done on the f16 GGUF for Mistral 7b instruct v0.2. PTB was sourced from here.

I did notice a discrepancy in the Hellaswag scores depending on whether the run was on my P40 or my 3060 Ti (8GB); I compiled with force_MMQ. The PPL scores were effectively identical regardless of GPU.

PPL on wiki.test.raw

Hellaswag
-
@TheBloke Maybe you'd be interested in this.
-
@kalomaze What is the outcome of using your random tokens for quantization types that need more guidance from the importance matrix? Such as
-
Kind of related: for multiple consecutive matrix multiplications you can rearrange rows/columns without affecting the results. But this changes which values end up in a block together, so after quantization the results do change. When I tested this idea it turned out that a random order is essentially best, and that sorting the data in such a way that large values end up in the same block makes PPL worse. You could potentially try to optimize the order, but I discarded the idea because I expected this to overfit. But if random tokens work well as input then maybe this is viable after all?
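A small numpy sketch of the basic version of that invariance (my own illustration, not the experiment described above): permuting the shared inner dimension of both operands leaves the product unchanged in exact arithmetic, but it changes which weights end up in the same quantization block, and therefore which values share a block scale.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 16))   # e.g. activations
B = rng.normal(size=(16, 8))   # e.g. weights

# Permute the shared (inner) dimension of both operands identically.
p = rng.permutation(16)
assert np.allclose(A @ B, A[:, p] @ B[p, :])  # product is unchanged

# But block-wise quantization sees a different grouping: toy 4-value
# blocks along the inner dimension now get different max-abs scales.
def block_scales(col, block=4):
    return np.abs(col).reshape(-1, block).max(axis=1)

print(block_scales(B[:, 0]))  # scales with the original row order
print(block_scales(B[p, 0]))  # different scales after reordering
```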
-
I have seen quite a few comments here and elsewhere talking about importance matrix over-fitting, so I feel I need to add some context. I don't think the approach I have implemented for

Long story short: before jumping to conclusions that one imatrix approach is better than another, you need a much more extensive evaluation than one quantization type and the perplexity of two quite small test datasets.
-
I have answered my own question. For Mistral-7B and
So, Winogrande is the same, PPL and HellaSwag are better with imatrix from
-
Someone did a pretty advanced analysis on this topic; you can find it here:
-
What's blocking us from using all of the model's known tokens, in a few combinations, as the importance matrix input? (Except cost, since it's going to be 32k * combinations and change.)
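To put a rough number on that cost (assuming a 32k-token vocabulary and 512-token chunks, the default imatrix context length; the shuffle count here is arbitrary):

```python
def full_vocab_cost(vocab_size=32_000, ctx=512, n_shuffles=4):
    # One pass that contains every token id exactly once needs
    # vocab_size tokens, i.e. vocab_size / ctx chunks; every extra
    # shuffled "combination" repeats that cost.
    tokens = vocab_size * n_shuffles
    return tokens, tokens // ctx

print(full_vocab_cost())  # (128000, 250): 250 chunks for 4 shuffled passes
```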
-
Here are some KL-divergence data for various quants of Mistral-7B-Instruct-v0.1. I used @Ttl's llama_kl.py script with the full 330K tokens from wiki.test.raw. For what it's worth, I think perplexity is the wrong metric to optimise for when you're quantizing models, since it doesn't measure deviation from the unquantized model. No data for IQ2_XS/IQ2_XXS, because they don't work on ROCm and are way too slow on the CPU.
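For anyone who wants to reproduce the idea behind that metric: the number being averaged is the per-token KL divergence between the full-precision model's next-token distribution and the quantized model's, evaluated on the same text. A rough sketch of the computation (my own, not @Ttl's actual llama_kl.py; the logits below are synthetic placeholders):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_token_kl(base_logits, quant_logits, eps=1e-10):
    """Mean KL(P_base || P_quant) over token positions.

    Both arguments have shape (n_tokens, vocab_size): the logits produced
    by the fp16 reference model and the quantized model on the same input."""
    p = softmax(base_logits)
    q = softmax(quant_logits)
    kl = (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)
    return kl.mean()

# Synthetic stand-ins; in practice these come from running both models
# over e.g. wiki.test.raw and dumping the logits.
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 32_000))
quant = base + 0.01 * rng.normal(size=(4, 32_000))
print(mean_token_kl(base, quant))
```

Unlike perplexity, this is zero only when the quantized model reproduces the reference distribution exactly, which is why it captures deviation from the unquantized model rather than raw fit to the test text.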
-
My intuition tells me that we should use the dataset the model was trained on, which is impractical, but maybe the calculation is doable with the fine-tuning dataset... brb, I'm gonna go quantize some dolphins.
-
@kalomaze @ikawrakow
But I can certainly say 100 chunks aren't enough: there is a huge PPL difference between 100 chunks and 10,000 chunks.
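For context on what those chunk counts mean in terms of calibration data (assuming the default 512-token imatrix context; the actual context length of that run isn't stated):

```python
def calibration_tokens(n_chunks, ctx=512):
    # Each imatrix chunk is one forward pass over ctx tokens, so the
    # total calibration data seen is chunks * context length.
    return n_chunks * ctx

print(calibration_tokens(100))     # 51,200 tokens
print(calibration_tokens(10_000))  # 5,120,000 tokens
```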
-
As for myself, to speed up testing, I ran an experiment on Sheared Llama 2 1.3b (which apparently shares the features of the Llama 2 architecture consistently and properly despite being a shrink of Llama 2 7b, including in terms of rope sweet spots) with a small matrix. In the end I picked -c 25 with -chunks 32 (the poor man's matrix; any lower value is bad, and any higher value up to -c 768 -chunks 100 isn't decisively better), based on wiki.train.raw, because I needed to quantize a 70b model.

The experiment with various ctx values and chunk counts gave me this:

- princeton-nlp_Sheared-LLaMA-1.3B-AR-b1924-IQ2_XS.iMatrix_Wiki_c768_ch100.gguf | - | wikitext | 11.5608 | 512

And on 70b, this small matrix c032-ch025 lowers the perplexity by:

- More than 3% in Rope 8 on Q2_K: WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-AR-b1924-Q2_K.gguf, -, wikitext, 6.2489, 512
- More than 2% in Rope 4 on Q2_K: WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-AR-b1924-Q2_K.gguf, -, wikitext, 4.8859, 512
- More than 1.5% in Rope 2 on Q2_K: WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-AR-b1924-Q2_K.gguf, -, wikitext, 4.5030, 512
- More than 1% with Rope 8 on Q3_K_S: WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant-AR-b1924-Q3_K_S.gguf, -, wikitext, 5.6127, 512
-
There is an indication that there is some randomness involved in the performance of the quants, no matter if measured via PPL or KL. An imatrix dataset that is the least similar to the test set can still lead to the generation of a quant that scores the best in tests, yet only with a low probability. Similarly, the most suitable imatrix dataset can generate the worst quant results, also only with a low probability. Finding a way to reduce those random outliers would make the testing process simpler, and thereby help to determine if something like a "best on average" imatrix dataset can be achieved.
-
I don't think an "average" approach is the right solution, especially since the "average" will likely turn out to be both a great and a bad solution depending on the model it's used on. Given how unreliable creating a well-working imatrix is, I believe we would need a tool that automates it for each model. Those imatrix results should then be given standardized, short file names that show what perplexity gain they provide and how many runs they included, so we can have GGUF files and imatrix files that directly indicate how good they are. Example:
-
I sometimes encounter something like these.
-
Still getting this warning
-
So, I mentioned before that I was concerned that wikitext-style calibration data, or data that lacked diversity in general, could potentially be worse for importance matrix calculations compared to more "random" data. My reasoning was that it might otherwise "overfit" to a particular style of data, and outlier model activations would be more prone to being quantized away.
If we are judging based on perplexity, then it seems I was correct.
But, it doesn't stop there. Out-of-domain data actually gets worse in terms of perplexity as a result of a large volume of "clean" data being used for calibration over the "random" data.
I also evaluated the perplexity of some song lyrics (about 2,500 tokens worth, in smaller batches) which were not in the calibration datasets.
Very interesting stuff. @ikawrakow
Also, here's a sample of what the incoherent calibration data looks like:
And what the pretrain-style data looks like, for comparison:
Here is the data I used to calibrate:
8k_random_data.txt
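For anyone curious, here is a minimal sketch of one way such a near-random calibration file could be generated. This is purely illustrative: it is not the script used to produce 8k_random_data.txt, and `wordlist.txt`, the word count, and the words-per-line value are all placeholders.

```python
import random

def make_random_calibration(wordlist_path, out_path, n_words=20_000, seed=0):
    """Write a calibration file of randomly sampled words as a long
    whitespace-separated stream with occasional line breaks."""
    rng = random.Random(seed)
    with open(wordlist_path, encoding="utf-8") as f:
        words = [w.strip() for w in f if w.strip()]
    sampled = [rng.choice(words) for _ in range(n_words)]
    lines = [" ".join(sampled[i:i + 16]) for i in range(0, n_words, 16)]
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))

make_random_calibration("wordlist.txt", "random_calibration.txt")
```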