why training ...? #63

Open
dreamyou070 opened this issue Jul 13, 2024 · 1 comment

Comments

@dreamyou070

In the paper, you identified the unimportant SD blocks/layers.
In that case, you may not have to retrain the model
(because if you remove an unimportant block/layer, the performance is almost preserved).

Can you explain why you retrain after erasing the unimportant blocks?

Thanks!

@bokyeong1015 (Member) commented Jul 14, 2024

Hi,

For low pruning ratios (which remove a small number of blocks), retraining may not be necessary, or light retraining (such as LoRA) would be enough. Refer to the example of mid-block removal without retraining in our paper.
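
As a rough illustration of that mid-block case, here is a minimal sketch assuming the Hugging Face diffusers `UNet2DConditionModel` layout (this is not the exact code in our repo): the mid block is swapped for an identity-like module and the pipeline is run without any retraining.

```python
# Minimal sketch (assumption, not this repo's exact code): bypass the SD U-Net
# mid block and generate without retraining, using Hugging Face diffusers.
import torch
from diffusers import StableDiffusionPipeline


class BypassMidBlock(torch.nn.Module):
    """Identity-like stand-in: returns the incoming hidden states unchanged."""

    def forward(self, hidden_states, *args, **kwargs):
        return hidden_states


pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
)
# Replace the (less important) mid block; its output shape equals its input shape,
# so an identity module is shape-compatible with the following up blocks.
pipe.unet.mid_block = BypassMidBlock()
pipe = pipe.to("cuda")

image = pipe("a photo of an astronaut riding a horse").images[0]
image.save("mid_block_removed.png")
```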

For high pruning ratios (which remove a large number of blocks, including outer blocks), retraining is essential to compensate for the loss of information and to achieve satisfactory results.

Specifically, for structured pruning, we think that severe compression to achieve significant efficiency gains often necessitates heavy retraining.

  • Nevertheless, retraining a pruned network yields faster and better convergence than training a network of the same size from scratch with random weights.

These observations are further supported in our subsequent work, Shortened LLaMA:

  • Low pruning ratio: light LoRA retraining would be enough.
  • High pruning ratio: full-parameter finetuning on the pretraining corpus is necessary for good results (a rough sketch of both recipes follows below).
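
To make these two recipes concrete, here is a rough sketch assuming the `peft` and `transformers` libraries and a LLaMA-style pruned checkpoint; the checkpoint path, target-module names, and LoRA hyperparameters are illustrative placeholders, not the exact settings used in Shortened LLaMA.

```python
# Rough sketch (illustrative placeholders, not the exact Shortened LLaMA settings):
# light LoRA retraining vs. full-parameter finetuning of a pruned model.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

pruned_model = AutoModelForCausalLM.from_pretrained("path/to/pruned_model")  # placeholder path

use_lora = True  # low pruning ratio -> LoRA; high pruning ratio -> full finetuning

if use_lora:
    # Low pruning ratio: train only small LoRA adapters on the attention projections.
    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative choice
        lora_dropout=0.05,
        task_type=TaskType.CAUSAL_LM,
    )
    model = get_peft_model(pruned_model, lora_config)
    model.print_trainable_parameters()
else:
    # High pruning ratio: unfreeze all parameters and finetune on the pretraining corpus.
    model = pruned_model
    for param in model.parameters():
        param.requires_grad = True

# `model` is then passed to a standard training loop (e.g., a transformers Trainer).
```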
