Is llama 3 more prone to damage from quantization? #6901
-
Llama 3 would need the fixed quantizer to be calibrated; we don't know whether GQA or the new tokenizer causes these results. The fixed quantization gives these results on wiki-test-raw (F16 vs. Q4_0/Q8_0):

- Llama 3 8B: 10.7% difference (+0.4)
- Mistral 7B: 2.6% difference (+0.2)
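For reference, the relative difference quoted here is just the percentage increase of the quantized model's perplexity over the F16 baseline. A minimal sketch of that calculation, with hypothetical PPL values standing in for the actual wiki-test-raw measurements:

```python
def ppl_increase(ppl_f16: float, ppl_quant: float) -> float:
    """Relative perplexity increase of a quant over the F16 baseline, in percent."""
    return (ppl_quant - ppl_f16) / ppl_f16 * 100.0

# Hypothetical values for illustration only -- not real measurements.
print(f"{ppl_increase(6.20, 6.86):.1f}%")  # ~10.6% degradation
print(f"{ppl_increase(5.70, 5.85):.1f}%")  # ~2.6% degradation
```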
-
Very relevant here.
-
For those interested, I measured the perplexity (PPL) of all quants using the base model.
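As an aside, the basic effect being measured here — information lost in the quantize/dequantize round trip — can be illustrated with a simplified, symmetric block-wise 4-bit scheme. This is only a sketch and not the actual ggml Q4_0 or IQ4_XS formats:

```python
import numpy as np

def fake_quant_4bit(x: np.ndarray, block_size: int = 32) -> np.ndarray:
    """Simulate a symmetric block-wise 4-bit quantize/dequantize round trip.

    Simplified illustration only; not the real ggml Q4_0 / IQ4_XS layout.
    """
    x = x.reshape(-1, block_size)
    scale = np.abs(x).max(axis=1, keepdims=True) / 7.0   # one scale per block
    scale[scale == 0] = 1.0                               # avoid division by zero
    q = np.clip(np.round(x / scale), -7, 7)               # 4-bit integer codes
    return (q * scale).reshape(-1)                        # dequantized weights

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
err = np.abs(fake_quant_4bit(w) - w)
print(f"mean abs error: {err.mean():.4f}, max abs error: {err.max():.4f}")
```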
-
So I've run a small test. With a full 4096-token context, I wrote an elaborate instruction in my system prompt describing how my character should act in certain situations, which requires quite a bit of logic.
With Llama 3 8B Instruct at FP16, the model successfully connected the dots. I then repeated the test with the same sampler settings on an IQ4_XS quant of Llama 3 8B Instruct, and it failed every time.
The same applied to 70B: a quantized 70B was unable to pass this test most of the time, while the FP16 8B model's success rate was much higher. My impression is that quantization significantly reduces attention to early parts of the context, which includes the system prompt.
With other small models like Mistral 7B and Solar, I didn't notice this severe degradation at all. A Solar model at IQ4_XS does a better job in this test than Llama 3 8B Instruct at IQ4_XS.
I also came across this reddit thread, which seems to confirm my suspicion: https://www.reddit.com/r/LocalLLaMA/comments/1cci5w6/quantizing_llama_3_8b_seems_more_harmful_compared/
Has anyone else noticed this or run similar tests? I'm also wondering whether the bf16->quant conversion is partly to blame and whether Llama 3 suffers from it more than other models.
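On the bf16 question, the one thing I can say for certain is that bf16 and fp16 trade precision differently: bf16 keeps fp32's exponent range with fewer mantissa bits, while fp16 has more mantissa bits but a much narrower exponent range, so a bf16->fp16 step can clip extreme values before quantization even starts. A tiny PyTorch sketch of that representational difference (whether it actually matters for Llama 3's weights is exactly the open question):

```python
import torch

# Values that bf16 can represent but fp16 cannot: fp16 overflows above ~65504
# and flushes magnitudes below ~6e-8 (its smallest subnormal) to zero.
x = torch.tensor([1.0e5, 3.0e38, 1.0e-40], dtype=torch.bfloat16)
print(x)                    # all three survive in bf16
print(x.to(torch.float16))  # -> inf, inf, 0.0 after conversion to fp16
```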