why training ...? #63

Open
dreamyou070 opened this issue Jul 13, 2024 · 1 comment

Comments

@dreamyou070

In the paper, you identified the unimportant SD blocks/layers.
In that case, you may not have to retrain the model
(because if you remove an unimportant block/layer, the performance is almost preserved).

Can you explain why you retrain after erasing the unimportant blocks?

Thanks!

@bokyeong1015 (Member) commented Jul 14, 2024

Hi,

For low pruning ratios (which remove a small number of blocks), retraining may not be necessary, or light retraining (such as LoRA) would be enough. Refer to the example of mid-block removal without retraining in our paper.
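
As a rough illustration of that mid-block case, here is a minimal sketch assuming the Hugging Face diffusers `UNet2DConditionModel` layout (this is not the exact code in our repo): the mid block is swapped for an identity-like module and the pipeline is run without any retraining.

```python
# Minimal sketch (assumption, not this repo's exact code): bypass the SD U-Net
# mid block and generate without retraining, using Hugging Face diffusers.
import torch
from diffusers import StableDiffusionPipeline


class BypassMidBlock(torch.nn.Module):
    """Identity-like stand-in: returns the incoming hidden states unchanged."""

    def forward(self, hidden_states, *args, **kwargs):
        return hidden_states


pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
)
# Replace the (less important) mid block; its output shape equals its input shape,
# so an identity module is shape-compatible with the following up blocks.
pipe.unet.mid_block = BypassMidBlock()
pipe = pipe.to("cuda")

image = pipe("a photo of an astronaut riding a horse").images[0]
image.save("mid_block_removed.png")
```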

For high pruning ratios (which remove a large number of blocks, including outer blocks), retraining is essential to compensate for the loss of information and to achieve satisfactory results.

Specifically, for structured pruning, we think that severe compression to achieve significant efficiency gains often necessitates heavy retraining.

  • Nevertheless, retraining a pruned network yields faster and better convergence than training a network of the same size from scratch with random weights.

These observations are further supported in our subsequent work, Shortened LLaMA:

  • Low pruning ratio: light LoRA retraining would be enough.
  • High pruning ratio: full-parameter finetuning on the pretraining corpus is necessary for good results (a rough sketch of both recipes follows below).
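
To make these two recipes concrete, here is a rough sketch assuming the `peft` and `transformers` libraries and a LLaMA-style pruned checkpoint; the checkpoint path, target-module names, and LoRA hyperparameters are illustrative placeholders, not the exact settings used in Shortened LLaMA.

```python
# Rough sketch (illustrative placeholders, not the exact Shortened LLaMA settings):
# light LoRA retraining vs. full-parameter finetuning of a pruned model.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

pruned_model = AutoModelForCausalLM.from_pretrained("path/to/pruned_model")  # placeholder path

use_lora = True  # low pruning ratio -> LoRA; high pruning ratio -> full finetuning

if use_lora:
    # Low pruning ratio: train only small LoRA adapters on the attention projections.
    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative choice
        lora_dropout=0.05,
        task_type=TaskType.CAUSAL_LM,
    )
    model = get_peft_model(pruned_model, lora_config)
    model.print_trainable_parameters()
else:
    # High pruning ratio: unfreeze all parameters and finetune on the pretraining corpus.
    model = pruned_model
    for param in model.parameters():
        param.requires_grad = True

# `model` is then passed to a standard training loop (e.g., a transformers Trainer).
```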
