[Website][Paper][Nunchaku Inference System]
Diffusion models have proven highly effective at generating high-quality images. However, as these models grow larger, they require significantly more memory and suffer from higher latency, posing substantial challenges for deployment. In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits. At such an aggressive level, both weights and activations are highly sensitive to quantization, and conventional post-training quantization methods for large language models, such as smoothing, become insufficient. To overcome this limitation, we propose SVDQuant, a new 4-bit quantization paradigm. Unlike smoothing, which redistributes outliers between weights and activations, our approach absorbs these outliers using a low-rank branch. We first migrate the outliers from the activations to the weights, then employ a high-precision low-rank branch to absorb the weight outliers via Singular Value Decomposition (SVD). This process eases quantization on both sides. However, naively running the low-rank branch independently incurs significant overhead due to the extra data movement of activations, negating the quantization speedup. To address this, we co-design an inference engine, Nunchaku, that fuses the kernels of the low-rank branch into those of the low-bit branch to cut off redundant memory access. It can also seamlessly support off-the-shelf low-rank adapters (LoRAs) without requantization. Extensive experiments on SDXL, PixArt-Sigma, and FLUX.1 validate the effectiveness of SVDQuant in preserving image quality. We reduce the memory usage of the 12B FLUX.1 models by 3.6×, achieving a 3.5× speedup over the 4-bit weight-only quantized baseline on a 16GB laptop RTX 4090 GPU, paving the way for more interactive applications on PCs.
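To make the decomposition concrete, below is a minimal PyTorch sketch of the core idea. It is a simplification, not the library's implementation: it skips the activation-to-weight outlier migration step, and `quantize_int4_sym` is an illustrative per-tensor symmetric quantizer stand-in (deepcompressor's actual quantizers use finer-grained scales).

```python
import torch

def quantize_int4_sym(x: torch.Tensor) -> torch.Tensor:
    """Naive per-tensor symmetric 4-bit fake-quantization (illustrative only)."""
    scale = x.abs().max() / 7.0
    return torch.clamp(torch.round(x / scale), min=-8, max=7) * scale

def svdquant_linear(W: torch.Tensor, rank: int = 32):
    """Split W into a 16-bit low-rank branch plus a 4-bit residual."""
    # The top-`rank` singular components absorb the weight outliers...
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    L1 = U[:, :rank] * S[:rank]            # (out_features, rank), high precision
    L2 = Vh[:rank, :]                      # (rank, in_features), high precision
    # ...leaving a residual with a much smaller dynamic range to quantize.
    R_q = quantize_int4_sym(W - L1 @ L2)
    return L1, L2, R_q

# Forward pass: the low-rank branch runs in high precision, the residual in 4 bits.
W = torch.randn(512, 512)
X = torch.randn(4, 512)
L1, L2, R_q = svdquant_linear(W)
Y = X @ L2.T @ L1.T + X @ R_q.T
print((Y - X @ W.T).abs().max())  # approximation error from the 4-bit residual
```

Because the top singular components carry most of the weight magnitude, the residual has a much narrower distribution than the original weights and loses far less accuracy under 4-bit quantization, while the rank-32 branch adds only a small amount of high-precision compute.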
We use FLUX.1-schnell as an example. To evaluate the similarity metrics, we first need to prepare the reference images generated by the unquantized model by running the following command:
```bash
python -m deepcompressor.app.diffusion.ptq configs/model/flux.1-schnell.yaml --output-dirname reference
```
In this command,
- `configs/model/flux.1-schnell.yaml` specifies the model configurations, including the evaluation setups.
- By setting the flag `--output-dirname` to `reference`, the output directory is automatically redirected to the `ref_root` specified in the evaluation configuration.
Before quantizing diffusion models, we randomly sample 128 prompts from COCO Captions 2014 to generate the calibration dataset by running the following command:
```bash
python -m deepcompressor.app.diffusion.dataset.collect.calib \
    configs/model/flux.1-schnell.yaml configs/collect/qdiff.yaml
```
In this command, `configs/collect/qdiff.yaml` specifies the calibration dataset configurations, including:
- the path to the prompt YAML file (i.e., `--collect-prompt-path prompts/qdiff.yaml`),
- the number of prompts to be sampled (i.e., `--collect-num-samples 128`), and
- the root directory of the calibration datasets (which should be consistent with the quantization configuration).
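These options can also be overridden directly on the command line. For example, to sample only 64 prompts instead of 128 (using the `--collect-num-samples` flag listed above):

```bash
python -m deepcompressor.app.diffusion.dataset.collect.calib \
    configs/model/flux.1-schnell.yaml configs/collect/qdiff.yaml \
    --collect-num-samples 64
```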
The following command will perform INT4 SVDQuant and evaluate the quantized model on 1024 samples from MJHQ-30K:
```bash
python -m deepcompressor.app.diffusion.ptq \
    configs/model/flux.1-schnell.yaml configs/svdquant/int4.yaml \
    --eval-benchmarks MJHQ --eval-num-samples 1024
```
In this command,
- The positional arguments are configuration files, which are loaded in order. `configs/svdquant/int4.yaml` contains the quantization configurations specialized for INT4 SVDQuant. Please make sure all configuration files are under a subfolder of the working directory where you run the command.
- All configurations can be set directly in either the YAML files or on the command line. Please refer to `configs/__default__.yaml` and `python -m deepcompressor.app.diffusion.ptq -h` for the full list of options.
- The default evaluation datasets are 1024 samples from MJHQ and DCI.
- If you would like to save the quantized model checkpoint, add `--save-model true` or `--save-model /PATH/TO/CHECKPOINT/DIR` to the command, as shown in the example below.
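For example, the following command runs the same INT4 SVDQuant job as above and additionally saves the quantized checkpoint (the path is a placeholder for a directory of your choice):

```bash
python -m deepcompressor.app.diffusion.ptq \
    configs/model/flux.1-schnell.yaml configs/svdquant/int4.yaml \
    --eval-benchmarks MJHQ --eval-num-samples 1024 \
    --save-model /PATH/TO/CHECKPOINT/DIR
```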
We provide SVDQuant quantized model checkpoints in Nunchaku for your reference. Please refer to Nunchaku for further deployment on GPU systems.
Below are the quality and similarity metrics evaluated on 5,000 samples from the MJHQ-30K dataset (IR denotes ImageReward). Our 4-bit results outperform other 4-bit baselines, effectively preserving the visual quality of the 16-bit models.
| Model | Precision | Method | FID (↓) | IR (↑) | LPIPS (↓) | PSNR (↑) |
|---|---|---|---|---|---|---|
| FLUX.1-dev (50 Steps) | BF16 | -- | 20.3 | 0.953 | -- | -- |
| | INT W8A8 | Ours | 20.4 | 0.948 | 0.089 | 27.0 |
| | W4A16 | NF4 | 20.6 | 0.910 | 0.272 | 19.5 |
| | INT W4A4 | Ours | 19.86 | 0.932 | 0.254 | 20.1 |
| | FP W4A4 | Ours | 21.0 | 0.933 | 0.247 | 20.2 |
| FLUX.1-schnell (4 Steps) | BF16 | -- | 19.2 | 0.938 | -- | -- |
| | INT W8A8 | Ours | 19.2 | 0.966 | 0.120 | 22.9 |
| | W4A16 | NF4 | 18.9 | 0.943 | 0.257 | 18.2 |
| | INT W4A4 | Ours | 18.4 | 0.969 | 0.292 | 17.5 |
| | FP W4A4 | Ours | 19.9 | 0.956 | 0.279 | 17.5 |
| PixArt-Sigma (20 Steps) | FP16 | -- | 16.6 | 0.944 | -- | -- |
| | INT W8A8 | ViDiT-Q | 15.7 | 0.944 | 0.137 | 22.5 |
| | INT W8A8 | Ours | 16.3 | 0.955 | 0.109 | 23.7 |
| | INT W4A8 | ViDiT-Q | 37.3 | 0.573 | 0.611 | 12.0 |
| | INT W4A4 | Ours | 20.1 | 0.898 | 0.394 | 16.2 |
| | FP W4A4 | Ours | 18.3 | 0.946 | 0.326 | 17.4 |
If you find `deepcompressor` useful or relevant to your research, please kindly cite our paper:
```bibtex
@article{li2024svdquant,
  title={SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models},
  author={Li*, Muyang and Lin*, Yujun and Zhang, Zhekai and Cai, Tianle and Li, Xiuyu and Guo, Junxian and Xie, Enze and Meng, Chenlin and Zhu, Jun-Yan and Han, Song},
  journal={arXiv preprint arXiv:2411.05007},
  year={2024}
}
```