I've been wondering if the bias in quantized blocks is maybe more important than the scale, since we have all these RMS norms in the model anyway. Anecdotal evidence points to Q6_K (which only has a scale) sometimes being worse than Q5_K and Q4_K (which have a bias too). Has anyone given this some thought?
Or, a more dynamic approach: what if during quantization we analyze the block values and then store just a single value per block, and use a bit to decide whether to apply it as a scale or bias depending on the distribution of values in the block?
ChatGPT helped me write down the algorithm:
Proposal: Adaptive Scale/Bias Quantization Scheme
The goal is to analyze each block of values during quantization and determine whether applying a scale or a bias is more appropriate, adapting dynamically to the data distribution for better accuracy.
Block Analysis:
For each block of values x, calculate:
x_min: the minimum value in the block.
x_max: the maximum value in the block.
mean: the average of the values in the block.
Compute the range of the block:
range = x_max - x_min
Calculate how centered the mean is relative to the block:
centeredness = absolute(mean - (x_min + x_max) / 2)
Decision Making:
Compare centeredness against a chosen threshold T to decide whether to use a bias or a scale (a C sketch of both steps follows this list):
If centeredness is greater than T, the block values are not centered around zero, so a bias is more appropriate.
Otherwise, use a scale.
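A minimal C sketch of the analysis and decision steps above (the function name, signature, and caller-supplied threshold T are illustrative, not an existing API):

```c
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

// Analyze one block of n values (n > 0 assumed) and decide the mode:
// returns true for bias mode, false for scale mode.
static bool block_prefers_bias(const float * x, size_t n, float T,
                               float * out_mean, float * out_range) {
    float x_min = x[0];
    float x_max = x[0];
    float sum   = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        if (x[i] < x_min) x_min = x[i];
        if (x[i] > x_max) x_max = x[i];
        sum += x[i];
    }
    const float mean  = sum / (float) n;
    const float range = x_max - x_min;
    // Distance of the mean from the midpoint of the block's value range.
    const float centeredness = fabsf(mean - (x_min + x_max) / 2.0f);

    *out_mean  = mean;
    *out_range = range;
    return centeredness > T; // above threshold -> store a bias
}
```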
Storing the Scale/Bias:
If using a bias, store the mean (mean) as a 16-bit float, and set the least significant bit (LSB) to 1 to indicate it's a bias.
If using a scale, store the range (range) as a 16-bit float, and set the LSB to 0 to indicate it's a scale.
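The mode bit could be packed into the LSB of the raw fp16 bit pattern, something like this (names are illustrative; the fp32→fp16 conversion itself, e.g. GGML_FP32_TO_FP16 in ggml, is omitted):

```c
#include <stdbool.h>
#include <stdint.h>

// 'f16_bits' is the raw 16-bit encoding of either the mean (bias mode)
// or the range (scale mode).
static uint16_t tag_scale_or_bias(uint16_t f16_bits, bool is_bias) {
    return (uint16_t) ((f16_bits & ~(uint16_t) 1) | (is_bias ? 1 : 0));
}

static bool is_bias_mode(uint16_t f16_bits) {
    return (f16_bits & 1) != 0;
}
```

Overwriting the mantissa LSB costs at most one ULP of fp16 precision on the stored value, which should be negligible next to the block quantization error itself.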
Quantization Process:
When transforming each value x in the block:
If the LSB is 1 (indicating bias mode), quantize using:
x_quantized = (x - mean) / range
If the LSB is 0 (indicating scale mode), quantize using:
x_quantized = x / range
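A sketch of the forward transform as written (the zero-range guard is mine). One open point: bias mode still divides by range, but only the mean is stored, so dequantization would need some convention for recovering the range:

```c
#include <stdbool.h>
#include <stddef.h>

// Quantize one block of n values following the two formulas above.
static void quantize_block(const float * x, float * q, size_t n,
                           bool bias_mode, float mean, float range) {
    const float r = range != 0.0f ? range : 1.0f; // guard constant blocks
    for (size_t i = 0; i < n; ++i) {
        q[i] = bias_mode ? (x[i] - mean) / r  // LSB == 1: bias mode
                         : x[i] / r;          // LSB == 0: scale mode
    }
}
```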
This method lets the quantization process adaptively choose between scaling and biasing for each block, improving the handling of blocks whose values aren't centered around zero. Since the mode bit is folded into the LSB of the stored 16-bit value, the adaptive choice comes at no extra storage cost per block.