Replies: 2 comments 2 replies
-
Greetings, @lucala! That all depends on what exactly you want to achieve here. Check out the FAQ answer as well. NNCF won't make the .pth weight representations of the quantized model smaller, since that would lose information from the checkpoint that the user might want to train further later. As for .onnx - we recently added a way to get a "stripped" version of the model, where our own NNCF quantizers are replaced by torch-native ones, and even these are exported to .onnx as primitive QuantizeLinear/DequantizeLinear pairs, without the non-local graph optimizations that would actually quantize the impacted weights. The .onnx versions do contain the weights quantized to the respective quantization levels, only stored in the FP32 domain. If you use OpenVINO's Model Optimizer to get the IR representation of the model, you will see the footprint reduction.
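A minimal sketch of that flow, assuming the training-time compression API where `compression_ctrl` is the controller returned by `create_compressed_model`; the file names and output directory are placeholders:

```python
# Sketch: export an NNCF-quantized model to ONNX, then build the OpenVINO IR.
# Assumes `compression_ctrl` is the NNCF compression controller from
# create_compressed_model(); adjust names/paths to your own setup.
import subprocess

# The ONNX export still carries FP32 weight tensors plus
# QuantizeLinear/DequantizeLinear pairs describing the quantization.
compression_ctrl.export_model("quantized_model.onnx")

# Converting with Model Optimizer is where the footprint reduction
# mentioned above shows up in the resulting IR files.
subprocess.run(
    ["mo", "--input_model", "quantized_model.onnx", "--output_dir", "ir"],
    check=True,
)
```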
-
@vshampor I followed this tutorial, using NNCF to quantize and fine-tune my model, converting it to ONNX and then to OpenVINO. The resulting .xml file is indeed quite small, comparable in size to dynamically quantized PyTorch weights. Running inference with OpenVINO gives the expected results using the weights stored in the .xml file. Is this the correct way to go about it?
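For reference, a minimal sketch of running the converted IR with the OpenVINO Python runtime; the model file name and input shape are placeholders:

```python
# Sketch: inference on the converted IR with the OpenVINO runtime.
# "model.xml" and the input shape are placeholders for the actual model.
import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")       # the matching .bin with weights is loaded automatically
compiled = core.compile_model(model, "CPU")

dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)
result = compiled([dummy_input])[compiled.output(0)]
print(result.shape)
```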
-
I'm trying to quantize my model so that the model weights file becomes smaller. With PyTorch's post-training quantization (PTQ), the weights get stored as INT8, so the file takes up roughly 4x less space.
Quantizing with NNCF seems to be working fine, but when I inspect the resulting model weights they still appear to be floating point, and the weights file is just as large as the original.
Any idea how to handle this?
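For illustration, a minimal sketch of the PyTorch dynamic-quantization size comparison described above, using a toy model as a stand-in for the real network:

```python
# Sketch: compare checkpoint sizes before/after PyTorch dynamic quantization.
# The toy model is a placeholder; the ~4x reduction holds for Linear-heavy models.
import os
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))

# Dynamic post-training quantization: Linear weights are stored as INT8.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

torch.save(model.state_dict(), "fp32.pth")
torch.save(quantized.state_dict(), "int8.pth")

print("FP32:", os.path.getsize("fp32.pth"), "bytes")
print("INT8:", os.path.getsize("int8.pth"), "bytes")
```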