Lower weight compression memory footprint by sorting weights according to their size (#2803)

### Changes

Sort weights for compression:
```python
all_weight_params = sorted(all_weight_params, key=lambda wp: wp.num_weights, reverse=True)
```
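As a minimal illustration (using a hypothetical stand-in for NNCF's weight-parameter objects, which is assumed here only to carry a `num_weights` attribute), the sort simply orders the parameters largest-first:

```python
from dataclasses import dataclass


@dataclass
class WeightParam:  # hypothetical stand-in for NNCF's weight compression parameters
    name: str
    num_weights: int  # total number of elements in the weight tensor


params = [
    WeightParam("lm_head", 544_997_376),
    WeightParam("mlp.down_proj", 67_371_008),
    WeightParam("embed_tokens", 544_997_376),
    WeightParam("self_attn.q_proj", 12_845_056),
]

# Largest constants first, so they are compressed before memory
# fills up with already-created low-bit constants.
params = sorted(params, key=lambda wp: wp.num_weights, reverse=True)
print([p.name for p in params])
```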

### Reason for changes

During weights compression, memory footprint gradually increases when
new low-bit constants are created. At the same time there are temporary
spikes in memory footprint which happen during compressed weight
computation. For example, here:
```python
if invert_scale:
    scale = fns.power(scale, -1)
    compressed_weights = weight * scale
else:
    compressed_weights = weight / scale
if zero_point is not None:
    compressed_weights += zero_point.astype(weight.dtype)
compressed_weights = fns.round(compressed_weights)
compressed_weights = fns.clip(compressed_weights, level_low, level_high).astype(dtype)
```
Multiple temporary full-precision arrays need to be created here, and although they get garbage-collected afterwards, they produce temporary spikes in the memory footprint. Taking this into account, it makes sense to compress the large constants first, while there are not yet many low-bit constants taking up memory. Embedding matrices are affected the most by this.
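To see why the order matters, here is a rough back-of-the-envelope simulation. The 4x compression ratio and the size of the temporary spike are illustrative assumptions, not measured values: the spike for each weight is proportional to its size, while already-compressed constants keep accumulating, so handling the largest weights first makes their spikes occur while little compressed data is resident:

```python
def peak_footprint(weight_sizes, compressed_ratio=0.25, spike_factor=2.0):
    """Simulated peak memory while compressing weights one by one.

    compressed_ratio: assumed size of a compressed constant relative to
        full precision (8-bit from fp32 -> ~0.25)
    spike_factor: assumed size of the temporary full-precision arrays
        relative to the weight being compressed
    """
    resident = 0.0  # memory held by already-created low-bit constants
    peak = 0.0
    for w in weight_sizes:
        # Temporary spike while this weight's compressed value is computed
        peak = max(peak, resident + spike_factor * w)
        # The low-bit constant stays allocated afterwards
        resident += compressed_ratio * w
    return peak


sizes = [100, 50, 10, 5, 1]  # illustrative weight sizes
ascending = peak_footprint(sorted(sizes))
descending = peak_footprint(sorted(sizes, reverse=True))
print(ascending, descending)
```

Under these assumptions the descending order yields the lower simulated peak, since the biggest temporary spike happens before any compressed constants accumulate.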

Please see the memory figures below. They were obtained during 8-bit weights
compression and gathered with [memory_logger.py](#2801),
memory_type=SYSTEM_NORMALIZED.

| Backend | Model | Before | After |
|---------|-------|--------|-------|
| OV | qwen2-7b | ![system-normalized_memory_usage](https://github.com/openvinotoolkit/nncf/assets/23343961/0fd9a423-71e0-474b-9e44-8eb3accd464c) | ![system-normalized_memory_usage](https://github.com/openvinotoolkit/nncf/assets/23343961/ae53b89a-a5a6-4cf4-bad8-91b853602227) |
| PT | qwen2-7b | ![system-normalized_memory_usage](https://github.com/openvinotoolkit/nncf/assets/23343961/2fe7e377-36ee-4d67-9d57-3b5563a8e349) | ![system-normalized_memory_usage](https://github.com/openvinotoolkit/nncf/assets/23343961/1f89e657-cd1a-4af1-ae5a-8275fdd72679) |

For example, for the qwen2-7b OV model the peak footprint drops from ~12 GB to ~7 GB.

The much lower values for the OV backend compared to PT are because OV models
are read using mmap, which avoids allocating memory for the whole
full-precision model.
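As a generic illustration of the mmap effect (this uses Python's standard `mmap` module on a throwaway file, not OpenVINO's actual model-reading path): a memory-mapped file exposes its bytes without eagerly copying the whole content into a process buffer, and pages are only faulted in as they are touched:

```python
import mmap
import os
import tempfile

# Create a throwaway "weights" file (illustrative, not a real model)
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"\x01" * (1 << 20))  # 1 MiB of fake weight data

with open(path, "rb") as f:
    # length=0 maps the whole file; no 1 MiB buffer is copied up front
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    chunk = mm[:16]  # only the touched pages are brought into memory
    mm.close()

os.remove(path)
print(len(chunk))
```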

### Related tickets

144501
nikita-savelyevv authored Jul 12, 2024
1 parent 6926cf1 commit 232b435
Showing 1 changed file with 3 additions and 0 deletions.
3 changes: 3 additions & 0 deletions nncf/quantization/algorithms/weight_compression/algorithm.py
```diff
@@ -403,6 +403,9 @@ def apply(
             backend_entity=self._backend_entity,
         )

+        # Sort weight params to start compression with the bigger constants. This lowers peak memory footprint.
+        all_weight_params = sorted(all_weight_params, key=lambda wp: wp.num_weights, reverse=True)
+
         # Compress model using weight compression parameters
         transformed_model = self._backend_entity.transform_model(
             model,
```
