From 232b43543a7c7098798a9791282cdc2f9677614a Mon Sep 17 00:00:00 2001
From: Nikita Savelyev
Date: Fri, 12 Jul 2024 15:04:43 +0200
Subject: [PATCH] Lower weight compression memory footprint by sorting weights
 according to their size (#2803)

### Changes

Sort weights for compression:
```
all_weight_params = sorted(all_weight_params, key=lambda wp: wp.num_weights, reverse=True)
```

### Reason for changes

During weight compression, the memory footprint gradually increases as new low-bit constants are created. At the same time, there are temporary spikes in the memory footprint during compressed weight computation. For example, here:
```
if invert_scale:
    scale = fns.power(scale, -1)
    compressed_weights = weight * scale
else:
    compressed_weights = weight / scale
if zero_point is not None:
    compressed_weights += zero_point.astype(weight.dtype)
compressed_weights = fns.round(compressed_weights)
compressed_weights = fns.clip(compressed_weights, level_low, level_high).astype(dtype)
```
Multiple temporary full-precision arrays have to be created here. They are garbage-collected afterwards, but, as noted above, they cause temporary spikes in the memory footprint.

Taking this into account, it makes sense to compress large constants first, while few low-bit constants are taking up memory yet. Embedding matrices benefit from this the most. A minimal sketch of this effect is given after the diff below.

Please see the memory figures below. They were obtained during 8-bit weight compression and gathered with [memory_logger.py](https://github.com/openvinotoolkit/nncf/pull/2801) using `memory_type=SYSTEM_NORMALIZED`.

| Backend | Model | Before | After |
|---------|-------|--------|-------|
| OV | qwen2-7b | ![system-normalized_memory_usage](https://github.com/openvinotoolkit/nncf/assets/23343961/0fd9a423-71e0-474b-9e44-8eb3accd464c) | ![system-normalized_memory_usage](https://github.com/openvinotoolkit/nncf/assets/23343961/ae53b89a-a5a6-4cf4-bad8-91b853602227) |
| PT | qwen2-7b | ![system-normalized_memory_usage](https://github.com/openvinotoolkit/nncf/assets/23343961/2fe7e377-36ee-4d67-9d57-3b5563a8e349) | ![system-normalized_memory_usage](https://github.com/openvinotoolkit/nncf/assets/23343961/1f89e657-cd1a-4af1-ae5a-8275fdd72679) |

For example, for the qwen2-7b OV model the peak footprint drops from ~12 GB to ~7 GB. The much lower values for the OV backend compared to PT are because OV models are read using mmap, which avoids allocating memory for the whole full-precision model.

### Related tickets

144501
---
 nncf/quantization/algorithms/weight_compression/algorithm.py | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/nncf/quantization/algorithms/weight_compression/algorithm.py b/nncf/quantization/algorithms/weight_compression/algorithm.py
index ec3bfdce893..a448d185f45 100644
--- a/nncf/quantization/algorithms/weight_compression/algorithm.py
+++ b/nncf/quantization/algorithms/weight_compression/algorithm.py
@@ -403,6 +403,9 @@ def apply(
             backend_entity=self._backend_entity,
         )
 
+        # Sort weight params to start compression with the bigger constants. This lowers peak memory footprint.
+        all_weight_params = sorted(all_weight_params, key=lambda wp: wp.num_weights, reverse=True)
+
         # Compress model using weight compression parameters
         transformed_model = self._backend_entity.transform_model(
             model,
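
A minimal, self-contained sketch of the ordering effect referenced above (plain Python, not NNCF code; the sizes, the 4:1 compression ratio, and the 2x scratch factor are illustrative assumptions). The temporary full-precision spike for each weight sits on top of whatever compressed outputs have accumulated so far, so it is cheapest to take the biggest spike while little has accumulated:
```
def peak_footprint(weight_sizes, compressed_ratio=0.25, scratch_factor=2.0):
    """Peak memory over one compression pass in the given order.

    Assumes each weight of size s temporarily needs ~scratch_factor * s of
    full-precision scratch memory and leaves behind a compressed copy of
    size compressed_ratio * s.
    """
    accumulated = 0.0  # memory held by already-compressed constants
    peak = 0.0
    for s in weight_sizes:
        # Temporary spike while this weight is being compressed
        peak = max(peak, accumulated + scratch_factor * s)
        accumulated += compressed_ratio * s
    return peak

sizes = [1000, 500, 100, 100]  # one embedding-like matrix plus smaller weights
print(peak_footprint(sorted(sizes)))                # ascending:  2175.0
print(peak_footprint(sorted(sizes, reverse=True)))  # descending: 2000.0
```
Descending order always pairs the largest scratch spike with the smallest accumulated total, which is what the added `sorted(..., reverse=True)` line achieves.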