This is a step-by-step tutorial on how to integrate the NNCF package into the existing project. The use case implies that the user already has a training pipeline that reproduces training of the model in the floating point precision and pretrained model. The task is to prepare this model for accelerated inference by simulating the compression at train time. The instructions below use certain "helper" functions of the NNCF which abstract away most of the framework specifics and make the integration easier in most cases. As an alternative, you can always use the NNCF internal objects and methods as described in the architectural overview.
A JSON configuration file is used for easier setup of the parameters of compression to be applied to your model. See configuration file description or the sample configuration files packaged with the example scripts for reference.
NNCF enables compression-aware training by being integrated into the regular training pipelines. The framework is designed so that the modifications to your original training code are minor.
-
Add the imports required for NNCF:
import torch import nncf.torch # Important - must be imported before any other external package that depends on torch from nncf import NNCFConfig, create_compressed_model, load_state
NOTE (PyTorch): Due to the way NNCF works within the PyTorch backend,
import nncf
must be done before any other import oftorch
in your package or in third-party packages that your code utilizes, otherwise the compression may be applied incompletely. -
Load the NNCF JSON configuration file that you prepared during Step 1:
nncf_config = NNCFConfig.from_json("nncf_config.json") # Specify a path to your own NNCF configuration file in place of "nncf_config.json"
-
(Optional) For certain algorithms such as quantization it is highly recommended to initialize the algorithm by passing training data via
nncf_config
prior to starting the compression fine-tuning properly:from nncf import register_default_init_args nncf_config = register_default_init_args(nncf_config, train_loader, criterion=criterion)
Training data loaders should be attached to the NNCFConfig object as part of a library-defined structure.
register_default_init_args
is a helper method that registers the necessary structures for all available initializations (currently quantizer range and precision initialization) by taking data loader, criterion and criterion function (for sophisticated calculation of loss different from direct call of the criterion with 2 arguments: model outputs and targets).The initialization expects that the model is called with its first argument equal to the dataloader output. If your model has more complex input arguments you can create your own data loader implementing
nncf.common.initialization.dataloader.NNCFDataLoader
interface to return a tuple of (single model input , the rest of the model inputs as a kwargs dict). -
Right after you create an instance of the original model and load its weights, wrap the model by making the following call
compression_ctrl, compressed_model = create_compressed_model(model, nncf_config)
The
create_compressed_model
function parses the loaded configuration file and returns two objects.compression_ctrl
is a "controller" object that can be used during compressed model training to adjust certain parameters of the compression algorithm (according to a scheduler, for instance), or to gather statistics related to your compression algorithm (such as the current level of sparsity in your model). -
(Optional) Wrap your model with
DataParallel
orDistributedDataParallel
classes for multi-GPU training. If you useDistributedDataParallel
, add the following call afterwards:compression_ctrl.distributed()
in case the compression algorithms that you use need special adjustments to function in the distributed mode.
-
In the training loop, make the following changes:
-
After inferring the model, take a compression loss and add it (using the
+
operator) to the common loss, for example cross-entropy loss:compression_loss = compression_ctrl.loss() loss = cross_entropy_loss + compression_loss
-
Call the scheduler
step()
before each training iteration:compression_ctrl.scheduler.step()
-
Call the scheduler
epoch_step()
before each training epoch:compression_ctrl.scheduler.epoch_step()
-
NOTE: For a real-world example of how these changes should be introduced, take a look at the examples published in the NNCF repository.
At this point, the NNCF is fully integrated into your training pipeline. You can run it as usual and monitor your original model's metrics and/or compression algorithm metrics and balance model metrics quality vs. level of compression.
Important points you should consider when training your networks with compression algorithms:
- Turn off the
Dropout
layers (and similar ones likeDropConnect
) when training a network with quantization or sparsity - It is better to turn off additional regularization in the loss function (for example, L2 regularization via
weight_decay
) when training the network with RB sparsity, since it already imposes an L0 regularization term.
After the compressed model has been fine-tuned to acceptable accuracy and compression stages, you can export it. There are two ways to export a model:
-
Call the compression controller's
export_model
method to properly export the model with compression specifics into ONNX.compression_ctrl.export_model("./compressed_model.onnx")
The exported ONNX file may contain special, non-ONNX-standard operations and layers to leverage full compressed/low-precision potential of the OpenVINO toolkit. In some cases it is possible to export a compressed model with ONNX standard operations only (so that it can be run using
onnxruntime
, for example) - this is the case for the 8-bit symmetric quantization and sparsity/filter pruning algorithms. Refer to compression algorithm documentation for details. Also, this method is limited to the supported formats for export. -
Call the compression controller's
strip
method, to properly get the model without NNCF specific nodes for training compressed model, after that you can trace the model via inference in framework operations. It gives more flexibility to deploy model after optimization. As well as this method also allows you to connect third-party inference solutions, like OpenVINO.inference_model = compression_ctrl.strip() # To ONNX format import torch torch.onnx.export(inference_model, dummy_input, './compressed_model.onnx') # To OpenVINO format from openvino.tools import mo ov_model = mo.convert_model(inference_model, example_input=example_input)
The complete information about compression is defined by a compressed model and a compression state.
The model characterizes the weights and topology of the network. The compression state - how to restore the setting of
compression layers in the model and how to restore the compression schedule and the compression loss.
The latter can be obtained by compression_ctrl.get_compression_state()
on saving and passed to the
create_compressed_model
helper function by the optional compression_state
argument on loading.
The compressed model should be loaded once it's created.
Saving and loading of the compressed model and compression state is framework-specific and can be done in an arbitrary way. NNCF provides one possible way of doing it with helper functions in samples.
To save the best compressed checkpoint use compression_ctrl.compression_stage()
to distinguish between 3 possible
levels of compression: UNCOMPRESSED
, PARTIALLY_COMPRESSED
and FULLY_COMPRESSED
. It is useful in case of staged
compression. Model may achieve the best accuracy on earlier stages of compression - tuning without compression or with
intermediate compression rate, but still fully compressed model with lower accuracy should be considered as the best
compressed one. UNCOMPRESSED
means that no compression is applied for the model, for instance, in case of stage
quantization - when all quantization are disabled, or in case of sparsity - when current sparsity rate is zero.
PARTIALLY_COMPRESSED
stands for the compressed model which haven't reached final compression ratio yet, e.g. magnitude
sparsity algorithm has learnt masking of 30% weights out of 51% of target rate. The controller returns
FULLY_COMPRESSED
compression stage when it finished scheduling and tuning hyper parameters of the compression
algorithm, for example when rb-sparsity method sets final target sparsity rate for the loss.
# save part
compression_ctrl, compress_model = create_compressed_model(model, nncf_config)
checkpoint = tf.train.Checkpoint(model=compress_model,
compression_state=TFCompressionState(compression_ctrl),
...)
# save checkpoint in a preferable way
# using checkpoint manager
checkpoint_manager = tf.train.CheckpointManager(checkpoint, path_to_checkpoint)
checkpoint_manager.save()
# or via the corresponding callback
callbacks = []
callbacks.append(CheckpointManagerCallback(checkpoint, ckpt_dir))
...
compress_model.fit(..., callbacks=callbacks)
# load part
checkpoint = tf.train.Checkpoint(compression_state=TFCompressionStateLoader())
checkpoint.restore(path_to_checkpoint)
compression_state = checkpoint.compression_state.state
compression_ctrl, compress_model = create_compressed_model(model, nncf_config, compression_state)
checkpoint = tf.train.Checkpoint(model=compress_model,
...)
checkpoint.restore(path_to_checkpoint)
Since the compression state is a dictionary of Python JSON-serializable objects, we convert it to JSON
string within tf.train.Checkpoint
. There are 2 helper classes: TFCompressionState
- for saving compression state and
TFCompressionStateLoader
- for loading.
Deprecated API
# save part
compression_ctrl, compressed_model = create_compressed_model(model, nncf_config)
checkpoint = {
'state_dict': compressed_model.state_dict(),
'scheduler_state': compression_ctrl.scheduler.get_state(),
'compression_stage': compression_ctrl.compression_stage(),
...
}
torch.save(checkpoint, path)
# load part
resuming_checkpoint = torch.load(path)
state_dict = resuming_checkpoint['state_dict']
compression_ctrl, compressed_model = create_compressed_model(model, nncf_config, resuming_state_dict=state_dict)
compression_ctrl.scheduler.load_state(resuming_checkpoint['scheduler_state'])
New API
# save part
compression_ctrl, compressed_model = create_compressed_model(model, nncf_config)
checkpoint = {
'state_dict': compressed_model.state_dict(),
'compression_state': compression_ctrl.get_compression_state(),
...
}
torch.save(checkpoint, path)
# load part
resuming_checkpoint = torch.load(path)
compression_state = resuming_checkpoint['compression_state']
compression_ctrl, compressed_model = create_compressed_model(model, nncf_config, compression_state=compression_state)
state_dict = resuming_checkpoint['state_dict']
# load model in a preferable way
load_state(compressed_model, state_dict, is_resume=True)
# or when execution mode on loading is the same as on saving:
# save and load in a single GPU mode or save and load in the (Distributed)DataParallel one, not in a mixed way
compressed_model.load_state_dict(state_dict)
You can save the compressed_model
object torch.save
as usual: via state_dict
and load_state_dict
methods.
Alternatively, you can use the nncf.load_state
function on loading. It will attempt to load a PyTorch state dict into
a model by first stripping the irrelevant prefixes, such as module.
or nncf_module.
, from both the checkpoint and
the model layer identifiers, and then do the matching between the layers.
Depending on the value of the is_resume
argument, it will then fail if an exact match could not be made
(when is_resume == True
), or load the matching layer parameters and print a warning listing the mismatches
(when is_resume == False
). is_resume=False
is most commonly used if you want to load the starting weights from an
uncompressed model into a compressed model and is_resume=True
is used when you want to evaluate a compressed
checkpoint or resume compressed checkpoint training without changing the compression algorithm parameters.
The compression state can be directly pickled by torch.save
as well, since it is a dictionary of Python objects.
In the previous releases of the NNCF, model can be loaded without compression state information
by saving the model state dictionary compressed_model.state_dict
and loading it via nncf.load_state
and
compressed_model.load_state_dict
methods or using optional resuming_state_dict
argument of the
create_compressed_model
.
This way of loading is deprecated, and we highly recommend to not use this way as it does not guarantee the exact loading
of compression model state for algorithms with sophisticated initialization - e.g. HAWQ and AutoQ.
Also in this case, keep in mind that in order to load the resulting checkpoint file the compressed_model
object should
have the same structure with regard to PyTorch module and parameters as it was when the checkpoint was saved.
In practice this means that you should use the same compression algorithms (i.e. the same NNCF configuration file) when
loading a compressed model checkpoint.
After a create_compressed_model
call, the NNCF log directory will contain visualizations of internal representations for the original, uncompressed model (original_graph.dot
) and for the model with the compression algorithms applied (compressed_graph.dot
).
These graphs form the basis for NNCF analyses of your model.
Below is the example of a LeNet network's original_graph.dot
visualization:
Same model's compressed_graph.dot
visualization for symmetric INT8 quantization:
Visualize these .dot files using Graphviz and browse through the visualization to validate that this representation correctly reflects your model structure.
Each node represents a single PyTorch function call - see NNCFArchitecture.md section on graph tracing for details.
In case you need to exclude some parts of the model from being considered in one algorithm or another, you can use the labels of the compressed_graph.dot
nodes (excluding the numerical ID in the beginning) and specify these (globally or per-algorithm) within the corresponding specific sections in configuration file
Regular expression matching is also possible for easier exclusion of certain node groups.
For instance, below is the same LeNet INT8 model as above, but with "ignored_scopes": ["{re}.*RELU.*", "LeNet/NNCFConv2d[conv2]"]
:
Notice that all RELU operation outputs and the second convolution's weights are no longer quantized.
With no target model code modifications, NNCF only supports native PyTorch modules with respect to trainable parameter (weight) compressed, such as torch.nn.Conv2d
If your model contains a custom, non-PyTorch standard module with trainable weights that should be compressed, you can register it using the @nncf.register_module
decorator:
import nncf
@nncf.register_module(ignored_algorithms=[...])
class MyModule(torch.nn.Module):
def __init__(self, ...):
self.weight = torch.nn.Parameter(...)
# ...
If registered module should be ignored by specific algorithms use ignored_algorithms
parameter of decorator.
In the example above, the NNCF-compressed models that contain instances of MyModule
will have the corresponding modules extended with functionality that will allow NNCF to quantize, sparsify or prune the weight
parameter of MyModule
before it takes part in MyModule
's forward
calculation.
NNCF has the capability to apply the model compression algorithms while satisfying the user-defined accuracy constraints. This is done by executing an internal custom accuracy-aware training loop, which also helps to automate away some of the manual hyperparameter search related to model training such as setting the total number of epochs, the target compression rate for the model, etc. There are two supported training loops. The first one is called Early Exit Training, which aims to finish fine-tuning when the accuracy drop criterion is reached. The second one is more sophisticated. It is targeted for the automated discovery of the compression rate for the model given that it satisfies the user-specified maximal tolerable accuracy drop due to compression. Its name is Adaptive Compression Level Training. Both training loops could be run with either PyTorch or TensorFlow backend with the same user interface(except for the TF case where the Keras API is used for training).
The following function is required to create the accuracy-aware training loop. One has to pass the NNCFConfig
object and the compression controller (that is returned upon compressed model creation, see above).
from nncf.common.accuracy_aware_training import create_accuracy_aware_training_loop
training_loop = create_accuracy_aware_training_loop(nncf_config, compression_ctrl, uncompressed_model_accuracy)
In order to properly instantiate the accuracy-aware training loop, the user has to specify the 'accuracy_aware_training' section. This section fully depends on what Accuracy-Aware Training loop is being used. For more details about config of Adaptive Compression Level Training refer to Adaptive Compression Level Training documentation and Early Exit Training refer to Early Exit Training documentation.
The training loop is launched by calling its run
method. Before the start of the training loop, the user is expected to define several functions related to the training of the model and pass them as arguments to the run
method of the training loop instance:
def train_epoch_fn(compression_ctrl, model, **kwargs):
'''
A function that takes the compression controller and the
compressed model instance and trains it for a single epoch.
Optional arguments, that are packed into the kwargs dictionary
unless explicitly specified as arguments, are the following:
`optimizer` - the optimizer instance, if it is defined in the training
pipeline; `lr_scheduler` - the learning rate scheduler instance, in case
it is defined in the training pipeline and used during single-epoch training,
e.g. to adjust the learning rate on every iteration; `epoch` - current epoch
count (in case it is needed for logging). Other entities used in this function
are expected to be supplied as nonlocal variables of the outer function scope
(i.e. `train_epoch_fn` to be used as a closure, see training samples for examples).
Note that all of the NNCF-related integration code should be in place inside of this
function to properly execute compression-aware training.
'''
def validate_fn(model, **kwargs):
'''
A function that takes the model, runs its evaluation on the validation set
and returns the float value of the target accuracy metric. Defined similarly
to the `train_epoch_fn` above. The optional argument is `epoch` - current epoch
number, if required for logging.
'''
def configure_optimizers_fn():
'''
An (optional) function that instantiates an optimizer and a learning rate
scheduler before fine-tuning. Should be registered in all of the cases when
an explicit optimizer/LR scheduler object is used for training. The `optimizer`
and `lr_scheduler` arguments should be defined in the `train_epoch_fn` function
in this case as well. The `configure_optimizers_fn` should return a tuple consisting
of an optimizer instance and an LR scheduler instance (replace with None if the latter
is not applicable).
'''
def dump_checkpoint_fn(model, compression_controller, accuracy_aware_runner, save_dir):
'''
An (optional) function that allows a user to define how to save the model's checkpoint.
Training loop will call this function instead own dump_checkpoint function and pass
`model`, `compression_controller`, `accuracy_aware_runner` and `save_dir` to it as arguments.
The user can save the states of the objects according to their own needs.
`save_dir` is a directory that Accuracy-Aware pipeline created to store log information.
'''
Once the above functions are defined, you could pass them to the run
method of the earlier created training loop :
model = training_loop.run(
model,
train_epoch_fn=train_epoch_fn,
validate_fn=validate_fn,
configure_optimizers_fn=configure_optimizers_fn,
dump_checkpoint_fn=dump_checkpoint_fn)
The above call executes the accuracy-aware training loop and return the compressed model. For more details on how to use the accuracy-aware training loop functionality of NNCF, please refer to its documentation.
See a PyTorch example for Quantization + Filter Pruning Adaptive Compression scenario on CIFAR10 and ResNet18 config.