In the paper, you identified the unimportant SD blocks/layers.
In that case, it seems you might not need to retrain the model
(since erasing an unimportant block/layer almost preserves the performance).
Can you explain why you train again after erasing the unimportant blocks?
Thanks!
For low pruning ratios (which remove a small number of blocks), retraining may not be necessary, or light retraining (such as LoRA) may be enough. See the example of mid-block removal without retraining in our paper; a code sketch of this case follows below.
For high pruning ratios (which remove a large number of blocks, including outer blocks), retraining is essential to compensate for the loss of information and to achieve satisfactory results.
Specifically, for structured pruning, we think that severe compression to achieve significant efficiency gains often necessitates heavy retraining.
Nevertheless, retraining a pruned network converges faster and reaches better results than training a network of the same size from scratch with random weights.
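To make the no-retraining case concrete, here is a rough sketch of dropping the U-Net mid block from a Stable Diffusion checkpoint with diffusers. The model id and the pass-through module are illustrative assumptions, not the paper's actual pruning code:

```python
import torch
from diffusers import StableDiffusionPipeline


class PassThroughBlock(torch.nn.Module):
    """Stand-in block that returns hidden states unchanged, ignoring extra arguments."""

    def forward(self, hidden_states, *args, **kwargs):
        return hidden_states


# Any SD v1.x checkpoint should behave similarly; this model id is just an example.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Erase the mid block by replacing it with a pass-through module.
# (Recent diffusers versions may also accept pipe.unet.mid_block = None; check your version.)
pipe.unet.mid_block = PassThroughBlock()

# Sampling still works without any retraining; for this particular removal the
# output quality stays close to the original.
image = pipe("a photo of an astronaut riding a horse",
             num_inference_steps=25).images[0]
image.save("sd_no_midblock.png")
```

Removing more blocks (including the outer down/up blocks) with the same trick degrades quality noticeably, which is exactly where retraining becomes necessary.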
These observations are further supported in our subsequent work, Shortened LLaMA:
Low pruning ratio: light LoRA retraining is usually enough (see the sketch after this list).
High pruning ratio: full-parameter finetuning on the pretraining corpus is necessary for good results.
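As an illustration of the light-retraining option (a minimal sketch, not the exact recipe from Shortened LLaMA), LoRA adapters can be attached to a depth-pruned checkpoint with Hugging Face peft; the checkpoint path and target module names below are placeholder assumptions:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder path to a depth-pruned LLaMA-style checkpoint saved after block removal.
pruned_model_dir = "path/to/pruned-llama"
model = AutoModelForCausalLM.from_pretrained(pruned_model_dir)

# Light LoRA retraining: train only low-rank adapters on the attention projections.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # a common choice for LLaMA-style models
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights is trainable

# Train `model` with a standard causal-LM loop or the transformers Trainer.
# At high pruning ratios, skip the adapters and finetune all parameters on a
# pretraining-style corpus instead.
```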