v2.8.0

@KodiaqQ KodiaqQ released this 24 Jan 13:06
· 2334 commits to develop since this release

Post-training Quantization:

Breaking changes:

  • The nncf.quantize signature has changed: mode: Optional[nncf.QuantizationMode] = None is now the third argument, inserted between the existing calibration_dataset and preset arguments. Calls that pass preset positionally must be updated.
  • (Common) nncf.common.quantization.structs.QuantizationMode has been renamed to nncf.common.quantization.structs.QuantizationScheme.
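
The effect of the nncf.quantize signature change on existing call sites can be illustrated with a minimal sketch. It does not import NNCF: quantize_stub, QuantizationMode and QuantizationPreset below are hypothetical stand-ins that only mirror the argument order described above, and the default values in the stub are assumptions.

```python
from enum import Enum
from typing import Any, Dict, Optional


class QuantizationMode(Enum):
    # Stand-in for nncf.QuantizationMode (only the FP8 values named in these notes).
    FP8_E4M3 = "fp8_e4m3"
    FP8_E5M2 = "fp8_e5m2"


class QuantizationPreset(Enum):
    # Stand-in for nncf.QuantizationPreset.
    PERFORMANCE = "performance"
    MIXED = "mixed"


def quantize_stub(
    model: Any,
    calibration_dataset: Any,
    mode: Optional[QuantizationMode] = None,      # new third argument in 2.8.0
    preset: Optional[QuantizationPreset] = None,  # previously third, now fourth
) -> Dict[str, Any]:
    """Hypothetical stand-in that only records how its arguments were bound."""
    return {"mode": mode, "preset": preset}


# A call written for the old order passes the preset positionally,
# so it is now bound to `mode` instead of `preset`.
positional = quantize_stub("model", "dataset", QuantizationPreset.MIXED)

# Passing the preset by keyword is unaffected by the new argument order.
by_keyword = quantize_stub("model", "dataset", preset=QuantizationPreset.MIXED)
```

Under these assumptions, the safest migration is to pass everything after calibration_dataset by keyword, which keeps call sites valid regardless of argument order.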

General:

  • (OpenVINO) Changed default OpenVINO opset from 9 to 13.

Features:

  • (OpenVINO) Added 4-bit data-aware weights compression. An optional dataset parameter has been added to nncf.compress_weights() for this purpose; it can be used to minimize accuracy degradation of compressed models (note that this option increases the compression time).
  • (PyTorch) Added support for PyTorch models with shared weights and custom PyTorch modules in nncf.compress_weights(). The weights compression algorithm for PyTorch models is now based on tracing the model graph. The dataset parameter is now required in nncf.compress_weights() for the compression of PyTorch models.
  • (Common) Renamed nncf.CompressWeightsMode.INT8 to nncf.CompressWeightsMode.INT8_ASYM and introduced nncf.CompressWeightsMode.INT8_SYM, which can be used efficiently with dynamic 8-bit quantization of activations.
    The original nncf.CompressWeightsMode.INT8 enum value is now deprecated.
  • (OpenVINO) Added support for quantizing the ScaledDotProductAttention operation from OpenVINO opset 13.
  • (OpenVINO) Added FP8 quantization support via the nncf.QuantizationMode.FP8_E4M3 and nncf.QuantizationMode.FP8_E5M2 enum values, enabled by passing one of these values as the optional mode argument to nncf.quantize. Currently, OpenVINO supports inference of FP8-quantized models in reference mode only: this brings no performance benefits, but it can be used for accuracy projections.
  • (Common) Post-training Quantization with Accuracy Control: nncf.quantize_with_accuracy_control() has been extended with an optional restore_mode parameter that reverts weights to int8 instead of the original precision.
    This parameter helps to reduce the size of the quantized model and improve its performance.
    By default, it is disabled and model weights are reverted to the original precision in nncf.quantize_with_accuracy_control().
  • (Common) Added an all_layers: Optional[bool] = None argument to nncf.compress_weights to indicate whether embeddings and last layers of the model should be compressed to a primary precision. This is relevant to 4-bit quantization only.
    By default, the backup precision is assigned for the embeddings and last layers.
  • (Common) Added a sensitivity_metric: Optional[nncf.parameters.SensitivityMetric] = None argument to nncf.compress_weights for finer control over the sensitivity metric used to assign quantization precision to layers.
    It defaults to the weight quantization error if no dataset is provided for weight compression, and to the maximum variance of the layers' inputs multiplied by the inverted 8-bit quantization noise if a dataset is provided.
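
To make the distinction between the INT8_SYM and INT8_ASYM modes more concrete, here is a minimal sketch of generic symmetric and asymmetric 8-bit weight quantization. It illustrates the textbook schemes only and is not NNCF's implementation: the function names, the signed [-128, 127] range for the symmetric case, and the per-tensor granularity are all assumptions made for illustration.

```python
from typing import List, Tuple


def quantize_sym(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric signed 8-bit: w ~ q * scale, with q in [-128, 127] and no zero point."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale


def quantize_asym(weights: List[float]) -> Tuple[List[int], float, int]:
    """Asymmetric unsigned 8-bit: w ~ (q - zero_point) * scale, with q in [0, 255]."""
    # The range is extended to include 0.0 so that zero stays exactly representable.
    lo, hi = min(min(weights), 0.0), max(max(weights), 0.0)
    scale = (hi - lo) / 255 if hi > lo else 1.0
    zero_point = round(-lo / scale)
    q = [max(0, min(255, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize_sym(q: List[int], scale: float) -> List[float]:
    return [v * scale for v in q]


def dequantize_asym(q: List[int], scale: float, zero_point: int) -> List[float]:
    return [(v - zero_point) * scale for v in q]


def max_abs_error(original: List[float], restored: List[float]) -> float:
    return max(abs(a - b) for a, b in zip(original, restored))


# One-sided weights: the symmetric grid spends half of its levels on unused negatives.
weights = [0.0, 0.1, 0.4, 1.0]

q_sym, sym_scale = quantize_sym(weights)
q_asym, asym_scale, zero_point = quantize_asym(weights)

sym_error = max_abs_error(weights, dequantize_sym(q_sym, sym_scale))
asym_error = max_abs_error(weights, dequantize_asym(q_asym, asym_scale, zero_point))
```

In this sketch the asymmetric variant gets a finer step size on skewed weights at the cost of storing and applying a zero point, while the symmetric variant needs only a scale, which is one plausible reason a symmetric mode pairs well with dynamically quantized activations.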

Fixes:

  • (OpenVINO) Models with embeddings (e.g. gpt-2, stable-diffusion-v1-5, stable-diffusion-v2-1, opt-6.7b, falcon-7b, bloomz-7b1) are now quantized more accurately.
  • (PyTorch) nncf.strip(..., do_copy=True) now actually returns a stripped deep copy of the model object.
  • (PyTorch) Post-hooks can now be set up on operations that return torch.return_type (such as torch.max).
  • (PyTorch) Improved dynamic graph tracing for various tensor operations from the torch namespace.
  • (PyTorch) More robust handling of models with disjoint traced graphs when applying PTQ.

Improvements:

  • Reformatted the tutorials section in the top-level README.md for better readability.

Deprecations/Removals:

  • (Common) The original nncf.CompressWeightsMode.INT8 enum value is now deprecated.
  • (PyTorch) The Git patch for integration with the HuggingFace transformers repository is marked as deprecated and will be removed in a future release.
    Developers are advised to use optimum-intel instead.
  • Dockerfiles in the NNCF Git repository are deprecated and will be removed in a future release.