From 232b43543a7c7098798a9791282cdc2f9677614a Mon Sep 17 00:00:00 2001
From: Nikita Savelyev
Date: Fri, 12 Jul 2024 15:04:43 +0200
Subject: [PATCH] Lower weight compression memory footprint by sorting weights
 according to their size (#2803)

### Changes

Sort weights for compression:
```
all_weight_params = sorted(all_weight_params, key=lambda wp: wp.num_weights, reverse=True)
```

### Reason for changes

During weight compression, the memory footprint gradually increases as new low-bit constants are created. At the same time, there are temporary spikes in the memory footprint during compressed weight computation. For example, here:
```
if invert_scale:
    scale = fns.power(scale, -1)
    compressed_weights = weight * scale
else:
    compressed_weights = weight / scale
if zero_point is not None:
    compressed_weights += zero_point.astype(weight.dtype)
compressed_weights = fns.round(compressed_weights)
compressed_weights = fns.clip(compressed_weights, level_low, level_high).astype(dtype)
```
Multiple temporary full-precision arrays have to be created here. They are garbage-collected afterwards, but, as noted above, they cause temporary spikes in the memory footprint.

Taking this into account, it makes sense to compress large constants first, while few low-bit constants are taking up memory yet. Embedding matrices benefit from this the most. A minimal sketch of this effect is given after the diff below.

Please see the memory figures below. They were obtained during 8-bit weight compression and gathered with [memory_logger.py](https://github.com/openvinotoolkit/nncf/pull/2801) using `memory_type=SYSTEM_NORMALIZED`.

| Backend | Model | Before | After |
|---------|-------|--------|-------|
| OV | qwen2-7b | ![system-normalized_memory_usage](https://github.com/openvinotoolkit/nncf/assets/23343961/0fd9a423-71e0-474b-9e44-8eb3accd464c) | ![system-normalized_memory_usage](https://github.com/openvinotoolkit/nncf/assets/23343961/ae53b89a-a5a6-4cf4-bad8-91b853602227) |
| PT | qwen2-7b | ![system-normalized_memory_usage](https://github.com/openvinotoolkit/nncf/assets/23343961/2fe7e377-36ee-4d67-9d57-3b5563a8e349) | ![system-normalized_memory_usage](https://github.com/openvinotoolkit/nncf/assets/23343961/1f89e657-cd1a-4af1-ae5a-8275fdd72679) |

For example, for the qwen2-7b OV model the peak footprint drops from ~12 GB to ~7 GB. The much lower values for the OV backend compared to PT are because OV models are read using mmap, which avoids allocating memory for the whole full-precision model.

### Related tickets

144501
---
 nncf/quantization/algorithms/weight_compression/algorithm.py | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/nncf/quantization/algorithms/weight_compression/algorithm.py b/nncf/quantization/algorithms/weight_compression/algorithm.py
index ec3bfdce893..a448d185f45 100644
--- a/nncf/quantization/algorithms/weight_compression/algorithm.py
+++ b/nncf/quantization/algorithms/weight_compression/algorithm.py
@@ -403,6 +403,9 @@ def apply(
             backend_entity=self._backend_entity,
         )
 
+        # Sort weight params to start compression with the bigger constants. This lowers peak memory footprint.
+        all_weight_params = sorted(all_weight_params, key=lambda wp: wp.num_weights, reverse=True)
+
         # Compress model using weight compression parameters
         transformed_model = self._backend_entity.transform_model(
             model,
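
A minimal, self-contained sketch of the ordering effect referenced above (plain Python, not NNCF code; the sizes, the 4:1 compression ratio, and the 2x scratch factor are illustrative assumptions). The temporary full-precision spike for each weight sits on top of whatever compressed outputs have accumulated so far, so it is cheapest to take the biggest spike while little has accumulated:
```
def peak_footprint(weight_sizes, compressed_ratio=0.25, scratch_factor=2.0):
    """Peak memory over one compression pass in the given order.

    Assumes each weight of size s temporarily needs ~scratch_factor * s of
    full-precision scratch memory and leaves behind a compressed copy of
    size compressed_ratio * s.
    """
    accumulated = 0.0  # memory held by already-compressed constants
    peak = 0.0
    for s in weight_sizes:
        # Temporary spike while this weight is being compressed
        peak = max(peak, accumulated + scratch_factor * s)
        accumulated += compressed_ratio * s
    return peak

sizes = [1000, 500, 100, 100]  # one embedding-like matrix plus smaller weights
print(peak_footprint(sorted(sizes)))                # ascending:  2175.0
print(peak_footprint(sorted(sizes, reverse=True)))  # descending: 2000.0
```
Descending order always pairs the largest scratch spike with the smallest accumulated total, which is what the added `sorted(..., reverse=True)` line achieves.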