Replies: 2 comments 2 replies
-
Greetings, @lucala! That all depends on what exactly you want to achieve here. Check out the FAQ answer as well. NNCF won't make the .pth weight representations of the quantized model smaller, since that would lose information from the checkpoint that the user might want to train further later. As for .onnx - we recently added a way to get a "stripped" version of the model, where our own NNCF quantizers are replaced by torch-native ones, and even these are exported to .onnx as primitive QuantizeLinear/DequantizeLinear pairs, without the non-local graph optimizations that would actually quantize the impacted weights. The .onnx versions do contain the weights quantized to the respective quantization levels, only stored in the FP32 domain. If you use OpenVINO's Model Optimizer to get the IR representation of the model, you will see the footprint reduction.
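A minimal sketch of that flow, assuming the training-time compression API where `compression_ctrl` is the controller returned by `create_compressed_model`; the file names and output directory are placeholders:

```python
# Sketch: export an NNCF-quantized model to ONNX, then build the OpenVINO IR.
# Assumes `compression_ctrl` is the NNCF compression controller from
# create_compressed_model(); adjust names/paths to your own setup.
import subprocess

# The ONNX export still carries FP32 weight tensors plus
# QuantizeLinear/DequantizeLinear pairs describing the quantization.
compression_ctrl.export_model("quantized_model.onnx")

# Converting with Model Optimizer is where the footprint reduction
# mentioned above shows up in the resulting IR files.
subprocess.run(
    ["mo", "--input_model", "quantized_model.onnx", "--output_dir", "ir"],
    check=True,
)
```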
-
@vshampor I followed this tutorial, using NNCF to quantize and fine-tune my model, converting it to ONNX and then to OpenVINO. The resulting .xml file is indeed quite small, comparable in size to dynamically quantized PyTorch weights. Running inference with OpenVINO gives the expected results using the weights stored in the .xml file. Is this the correct way to go about it?
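For reference, a minimal sketch of running the converted IR with the OpenVINO Python runtime; the model file name and input shape are placeholders:

```python
# Sketch: inference on the converted IR with the OpenVINO runtime.
# "model.xml" and the input shape are placeholders for the actual model.
import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")       # the matching .bin with weights is loaded automatically
compiled = core.compile_model(model, "CPU")

dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)
result = compiled([dummy_input])[compiled.output(0)]
print(result.shape)
```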
-
I'm trying to quantize my model so that the model weights file becomes smaller. With PyTorch's post-training quantization (PTQ), the weights get stored as INT8, so the file takes up roughly 4x less space.
Quantizing with NNCF seems to be working fine, but when I inspect the resulting model weights they still appear to be floating point, and the weights file is just as large as the original.
Any idea how to handle this?
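For illustration, a minimal sketch of the PyTorch dynamic-quantization size comparison described above, using a toy model as a stand-in for the real network:

```python
# Sketch: compare checkpoint sizes before/after PyTorch dynamic quantization.
# The toy model is a placeholder; the ~4x reduction holds for Linear-heavy models.
import os
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))

# Dynamic post-training quantization: Linear weights are stored as INT8.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

torch.save(model.state_dict(), "fp32.pth")
torch.save(quantized.state_dict(), "int8.pth")

print("FP32:", os.path.getsize("fp32.pth"), "bytes")
print("INT8:", os.path.getsize("int8.pth"), "bytes")
```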