[AWQ] Cast `fns.quantile()` result to float32 (#3044)
### Changes

Cast the `fns.quantile()` result to float32 inside the AWQ algorithm.

### Reason for changes

`fns.quantile()` for the NumPy backend returns an `np.float64` value. In AWQ it is used as the lower bound of a clip operation, so the clipped result is float64 as well. Through a chain reaction, this leads to weights and activations being converted to float64. As I understand, processing in float64 is not necessary, and it increases running time.

Below are measurements of compression time with AWQ enabled before and after the changes.

| Model           | develop (sec.) | branch (sec.) |
|-----------------|----------------|---------------|
| tiny-llama-1.1b | 123            | 109 (-11%)    |
| phi3_mini-3.7b  | 487            | 419 (-14%)    |
| llama3-8b       | 1091           | 912 (-16%)    |
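The dtype chain reaction described above can be illustrated with a minimal sketch in plain NumPy. It does not use NNCF's `fns` wrappers and is not the code from this change: the array names (`scales`, `weights`, `lower`) are hypothetical, and the explicit cast to float64 before `np.quantile` only stands in for a quantile that yields a float64 result.

```python
import numpy as np

# Hypothetical float32 data standing in for AWQ scales and weights.
scales = np.linspace(0.01, 1.0, 8, dtype=np.float32).reshape(2, 4)
weights = np.ones((2, 4), dtype=np.float32)

# Assumption: the quantile is computed in float64, mimicking a backend
# that returns a float64 result. Shape is (2, 1) because of keepdims.
lower = np.quantile(scales.astype(np.float64), 0.1, axis=1, keepdims=True)

# Using the float64 bound promotes the float32 array to float64 ...
clipped = np.maximum(scales, lower)
# ... and every later operation inherits the wider dtype.
scaled = weights * clipped

# Casting the bound to float32 first keeps the whole chain in float32.
clipped_f32 = np.maximum(scales, lower.astype(np.float32))
scaled_f32 = weights * clipped_f32

print(clipped.dtype, scaled.dtype)          # promoted chain
print(clipped_f32.dtype, scaled_f32.dtype)  # chain after the cast
```

The sketch uses an array-valued bound so that the promotion does not depend on how a given NumPy version treats scalars in mixed-dtype operations.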