Is llama 3 more prone to damage from quantization? #6901
-
Llama 3 would need the fixed quantizer to be calibrated; we don't know whether GQA or the new tokenizer causes these results. The fixed quantization gives these results on wiki-test-raw (F16 vs. Q4_0/Q8_0):

- Llama 3 8B: 10.7% difference (+0.4)
- Mistral 7B: 2.6% difference (+0.2)
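For reference, the relative difference quoted here is just the percentage increase of the quantized model's perplexity over the F16 baseline. A minimal sketch of that calculation, with hypothetical PPL values standing in for the actual wiki-test-raw measurements:

```python
def ppl_increase(ppl_f16: float, ppl_quant: float) -> float:
    """Relative perplexity increase of a quant over the F16 baseline, in percent."""
    return (ppl_quant - ppl_f16) / ppl_f16 * 100.0

# Hypothetical values for illustration only -- not real measurements.
print(f"{ppl_increase(6.20, 6.86):.1f}%")  # ~10.6% degradation
print(f"{ppl_increase(5.70, 5.85):.1f}%")  # ~2.6% degradation
```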
-
Very relevant here.
-
For those interested, I measured the perplexity (PPL) of all quants using the base model.
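As an aside, the basic effect being measured here — information lost in the quantize/dequantize round trip — can be illustrated with a simplified, symmetric block-wise 4-bit scheme. This is only a sketch and not the actual ggml Q4_0 or IQ4_XS formats:

```python
import numpy as np

def fake_quant_4bit(x: np.ndarray, block_size: int = 32) -> np.ndarray:
    """Simulate a symmetric block-wise 4-bit quantize/dequantize round trip.

    Simplified illustration only; not the real ggml Q4_0 / IQ4_XS layout.
    """
    x = x.reshape(-1, block_size)
    scale = np.abs(x).max(axis=1, keepdims=True) / 7.0   # one scale per block
    scale[scale == 0] = 1.0                               # avoid division by zero
    q = np.clip(np.round(x / scale), -7, 7)               # 4-bit integer codes
    return (q * scale).reshape(-1)                        # dequantized weights

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
err = np.abs(fake_quant_4bit(w) - w)
print(f"mean abs error: {err.mean():.4f}, max abs error: {err.max():.4f}")
```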
-
So I've run a small test. With a full 4096-token context, I wrote an elaborate instruction in my system prompt describing how my character should act in certain situations, which requires quite a bit of logic.
With Llama 3 8B Instruct at FP16, the model successfully connected the dots. I then repeated the test with the same sampler settings on an IQ4_XS quant of Llama 3 8B Instruct, and it failed every time.
The same applied to 70B: a quantized 70B was unable to pass this test most of the time, while the FP16 8B model's success rate was much higher. My impression is that quantization significantly reduces attention to early parts of the context, which includes the system prompt.
With other small models like Mistral 7B and Solar, I didn't notice this severe degradation at all. A Solar model at IQ4_XS does a better job in this test than Llama 3 8B Instruct at IQ4_XS.
I also came across this reddit thread, which seems to confirm my suspicion: https://www.reddit.com/r/LocalLLaMA/comments/1cci5w6/quantizing_llama_3_8b_seems_more_harmful_compared/
Has anyone else noticed this or run similar tests? I'm also wondering whether the bf16->quant conversion is partly to blame and whether Llama 3 suffers from it more than other models.
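On the bf16 question, the one thing I can say for certain is that bf16 and fp16 trade precision differently: bf16 keeps fp32's exponent range with fewer mantissa bits, while fp16 has more mantissa bits but a much narrower exponent range, so a bf16->fp16 step can clip extreme values before quantization even starts. A tiny PyTorch sketch of that representational difference (whether it actually matters for Llama 3's weights is exactly the open question):

```python
import torch

# Values that bf16 can represent but fp16 cannot: fp16 overflows above ~65504
# and flushes magnitudes below ~6e-8 (its smallest subnormal) to zero.
x = torch.tensor([1.0e5, 3.0e38, 1.0e-40], dtype=torch.bfloat16)
print(x)                    # all three survive in bf16
print(x.to(torch.float16))  # -> inf, inf, 0.0 after conversion to fp16
```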