With the new addition of INT8 mixed-precision training, there are now 2 implementations of scaled INT8 matmul (INT8 matmul + dequant): `intmm_triton.py` and `int8_mm.py`.

I have identified the key differences:

|  | `intmm_triton.py` | `int8_mm.py` |
| --- | --- | --- |
| Scale step | `acc_i32 x scale` | Scale is cast to fp32: `acc_i32.to(f32) x scale.to(f32)` |
| Autotune configs | Different autotune configs | Different autotune configs |

Ideally we should only keep 1. The tedious part is to validate that there is no accuracy or speed regression, regardless of which final implementation we adopt.

(As an aside: main/torchao/quantization/utils.py contains a lot of util q/dq ops that call the more versatile quant primitive ops (`quantize_affine` / `dequantize_affine` / `choose_qparams_affine`); many of these are convenience functions that hold the configurations for these quant primitive ops, e.g. dtype, block_size, symmetric/asymmetric, quant_min/quant_max, eps, etc.)
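To make the scale-step difference in the table above concrete, here is a minimal plain-PyTorch sketch of the two epilogue styles. This is not the actual Triton code from either file; it only mimics the numerics when the scale is a low-precision dtype such as bf16:

```python
import torch

torch.manual_seed(0)
# INT32 accumulator of an INT8 matmul; values easily exceed bf16's 8-bit mantissa
acc_i32 = torch.randint(-(2**15), 2**15, (4, 4), dtype=torch.int32)
scale = (torch.rand(4, 1) * 0.01).to(torch.bfloat16)  # row-wise scale in bf16

# intmm_triton.py-style: multiply without an explicit fp32 cast
# (PyTorch type promotion turns this into a bf16 multiply)
out_low_prec = acc_i32 * scale

# int8_mm.py-style: cast both operands to fp32 first
out_fp32 = acc_i32.to(torch.float32) * scale.to(torch.float32)

print(out_low_prec.dtype, out_fp32.dtype)             # torch.bfloat16 torch.float32
print((out_low_prec.float() - out_fp32).abs().max())  # non-zero rounding difference
```

Whether this kind of rounding difference matters for the existing INT8 recipes is exactly the accuracy validation mentioned above.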
Here are the places that use `intmm_triton.py` -> Basically, ensure INT8 dynamic quantization for the Llama and SAM benchmarks doesn't regress.
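For reference, a rough sketch of the INT8 dynamic quantization pattern that this path serves. The kernel call is replaced with plain-PyTorch reference math, and the real call sites in torchao may differ in details (shapes, transposes, dtypes):

```python
import torch

def int8_dynamic_linear_ref(x: torch.Tensor, w_int8: torch.Tensor, w_scale: torch.Tensor):
    # Dynamically quantize activations per row (symmetric INT8).
    x_scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / 127
    x_int8 = torch.clamp(torch.round(x / x_scale), -128, 127).to(torch.int8)

    # int_scaled_matmul currently takes a scale for A only (see the question
    # below): INT8 matmul with INT32 accumulation, dequantized with A's scale...
    acc_i32 = x_int8.to(torch.int32) @ w_int8.to(torch.int32).t()
    y = acc_i32 * x_scale

    # ...so the weight (B) scale has to be applied outside the kernel.
    return y * w_scale  # per-output-channel weight scale, shape (1, N)

x = torch.randn(8, 16)
w_int8 = torch.randint(-128, 128, (32, 16), dtype=torch.int8)
w_scale = torch.rand(1, 32) * 0.02
print(int8_dynamic_linear_ref(x, w_int8, w_scale).shape)  # torch.Size([8, 32])
```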
Here are the places that use `int8_mm.py`, same as for `intmm_triton.py` above -> Ensure INT8 mixed-precision training doesn't regress.
Another question: is it ok to change the `int_scaled_matmul()` signature to accept scales for both A and B instead of only for A?
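For concreteness, here is one way the extended signature could look. This is a hypothetical plain-PyTorch sketch of the proposal (not an existing torchao API), with `scale_b` optional so current A-only callers keep working:

```python
import torch
from typing import Optional

def int_scaled_matmul(a_int8: torch.Tensor, b_int8: torch.Tensor,
                      scale_a: torch.Tensor,
                      scale_b: Optional[torch.Tensor] = None) -> torch.Tensor:
    # Reference semantics only; a real implementation would fuse this
    # epilogue into the Triton kernel instead of running it in eager mode.
    acc_i32 = a_int8.to(torch.int32) @ b_int8.to(torch.int32)
    out = acc_i32.to(torch.float32) * scale_a.to(torch.float32)  # (M, 1) row-wise scale of A
    if scale_b is not None:
        out = out * scale_b.to(torch.float32)                    # (1, N) column-wise scale of B
    return out
```

Fusing the B scale into the kernel epilogue like this could let the mixed-precision training path and the dynamic quantization path share a single kernel, which is essentially the unification discussed above.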