Lower weight compression memory footprint by sorting weights according to their size (#2803)

### Changes

Sort weights for compression:
```python
all_weight_params = sorted(all_weight_params, key=lambda wp: wp.num_weights, reverse=True)
```
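As a minimal illustration (using a hypothetical stand-in for NNCF's weight-parameter objects, which is assumed here only to carry a `num_weights` attribute), the sort simply orders the parameters largest-first:

```python
from dataclasses import dataclass


@dataclass
class WeightParam:  # hypothetical stand-in for NNCF's weight compression parameters
    name: str
    num_weights: int  # total number of elements in the weight tensor


params = [
    WeightParam("lm_head", 544_997_376),
    WeightParam("mlp.down_proj", 67_371_008),
    WeightParam("embed_tokens", 544_997_376),
    WeightParam("self_attn.q_proj", 12_845_056),
]

# Largest constants first, so they are compressed before memory
# fills up with already-created low-bit constants.
params = sorted(params, key=lambda wp: wp.num_weights, reverse=True)
print([p.name for p in params])
```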

### Reason for changes

During weights compression, memory footprint gradually increases when
new low-bit constants are created. At the same time there are temporary
spikes in memory footprint which happen during compressed weight
computation. For example, here:
```python
if invert_scale:
    scale = fns.power(scale, -1)
    compressed_weights = weight * scale
else:
    compressed_weights = weight / scale
if zero_point is not None:
    compressed_weights += zero_point.astype(weight.dtype)
compressed_weights = fns.round(compressed_weights)
compressed_weights = fns.clip(compressed_weights, level_low, level_high).astype(dtype)
```
Multiple temporary full-precision arrays need to be created here, and although they get garbage-collected afterwards, they produce temporary spikes in the memory footprint. Taking this into account, it makes sense to compress the large constants first, while there are not yet many low-bit constants taking up memory. Embedding matrices are affected the most by this.
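To see why the order matters, here is a rough back-of-the-envelope simulation. The 4x compression ratio and the size of the temporary spike are illustrative assumptions, not measured values: the spike for each weight is proportional to its size, while already-compressed constants keep accumulating, so handling the largest weights first makes their spikes occur while little compressed data is resident:

```python
def peak_footprint(weight_sizes, compressed_ratio=0.25, spike_factor=2.0):
    """Simulated peak memory while compressing weights one by one.

    compressed_ratio: assumed size of a compressed constant relative to
        full precision (8-bit from fp32 -> ~0.25)
    spike_factor: assumed size of the temporary full-precision arrays
        relative to the weight being compressed
    """
    resident = 0.0  # memory held by already-created low-bit constants
    peak = 0.0
    for w in weight_sizes:
        # Temporary spike while this weight's compressed value is computed
        peak = max(peak, resident + spike_factor * w)
        # The low-bit constant stays allocated afterwards
        resident += compressed_ratio * w
    return peak


sizes = [100, 50, 10, 5, 1]  # illustrative weight sizes
ascending = peak_footprint(sorted(sizes))
descending = peak_footprint(sorted(sizes, reverse=True))
print(ascending, descending)
```

Under these assumptions the descending order yields the lower simulated peak, since the biggest temporary spike happens before any compressed constants accumulate.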

Please see the memory figures below. They were obtained during 8-bit weights
compression and gathered with [memory_logger.py](#2801),
memory_type=SYSTEM_NORMALIZED.

| Backend | Model | Before | After |
|---------|-------|--------|-------|
| OV | qwen2-7b | ![system-normalized_memory_usage](https://github.com/openvinotoolkit/nncf/assets/23343961/0fd9a423-71e0-474b-9e44-8eb3accd464c) | ![system-normalized_memory_usage](https://github.com/openvinotoolkit/nncf/assets/23343961/ae53b89a-a5a6-4cf4-bad8-91b853602227) |
| PT | qwen2-7b | ![system-normalized_memory_usage](https://github.com/openvinotoolkit/nncf/assets/23343961/2fe7e377-36ee-4d67-9d57-3b5563a8e349) | ![system-normalized_memory_usage](https://github.com/openvinotoolkit/nncf/assets/23343961/1f89e657-cd1a-4af1-ae5a-8275fdd72679) |

For example, for the qwen2-7b OV model the peak footprint drops from ~12 GB to ~7 GB.

The much lower values for the OV backend compared to PT are because OV models
are read using mmap, which avoids allocating memory for the whole
full-precision model.
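As a generic illustration of the mmap effect (this uses Python's standard `mmap` module on a throwaway file, not OpenVINO's actual model-reading path): a memory-mapped file exposes its bytes without eagerly copying the whole content into a process buffer, and pages are only faulted in as they are touched:

```python
import mmap
import os
import tempfile

# Create a throwaway "weights" file (illustrative, not a real model)
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"\x01" * (1 << 20))  # 1 MiB of fake weight data

with open(path, "rb") as f:
    # length=0 maps the whole file; no 1 MiB buffer is copied up front
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    chunk = mm[:16]  # only the touched pages are brought into memory
    mm.close()

os.remove(path)
print(len(chunk))
```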

### Related tickets

144501
nikita-savelyevv authored Jul 12, 2024
1 parent 6926cf1 commit 232b435
Showing 1 changed file with 3 additions and 0 deletions.
3 changes: 3 additions & 0 deletions nncf/quantization/algorithms/weight_compression/algorithm.py
```diff
@@ -403,6 +403,9 @@ def apply(
             backend_entity=self._backend_entity,
         )

+        # Sort weight params to start compression with the bigger constants. This lowers peak memory footprint.
+        all_weight_params = sorted(all_weight_params, key=lambda wp: wp.num_weights, reverse=True)
+
         # Compress model using weight compression parameters
         transformed_model = self._backend_entity.transform_model(
             model,
```
