[ENHANCEMENT] Enabling LR scaling for a specific layer (ex. down-projection...) during pretraining
Is your feature request related to a problem? Please describe.
It's currently not possible to scale the learning rate of a specific layer (other than the head) without adding a new dedicated argument such as head-lr-mult.
Describe the solution you'd like
This PR enables scaling the learning rate of a given layer during pretraining: the layer name is passed via scale-lr-layer and the multiplier via lr-multiplier, reusing the existing internal scale_lr_cond and lr_mult logic.
Describe alternatives you've considered
This implementation generalizes the existing use of this feature, currently limited to the LM head during finetuning, by making it possible to specify both the name of the target layer and the LR multiplier. It also extends its use to pretraining. When no layer is specified, the scale_lr_cond argument is None and no LR scaling is applied.
Proposed implementation
Here is the PR.
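A minimal, self-contained sketch of the intended behaviour (illustrative only, not the PR's actual code; the helper names, toy parameter names, and base LR below are assumptions):

```python
def build_scale_lr_cond(scale_lr_layer):
    """Return a (name, param) -> bool predicate, or None when no layer is given."""
    if scale_lr_layer is None:
        return None  # no layer specified -> scale_lr_cond stays None, no LR scaling
    return lambda name, param: scale_lr_layer in name


def assign_lrs(named_params, base_lr, scale_lr_cond, lr_mult):
    """Mimic the existing scale_lr_cond / lr_mult logic used when building optimizer param groups."""
    lrs = {}
    for name, param in named_params:
        scaled = scale_lr_cond is not None and scale_lr_cond(name, param)
        lrs[name] = base_lr * (lr_mult if scaled else 1.0)
    return lrs


# e.g. --scale-lr-layer linear_fc2 --lr-multiplier 0.286
toy_params = [
    ("decoder.layers.0.mlp.linear_fc1.weight", None),
    ("decoder.layers.0.mlp.linear_fc2.weight", None),
]
cond = build_scale_lr_cond("linear_fc2")
print(assign_lrs(toy_params, base_lr=3e-4, scale_lr_cond=cond, lr_mult=0.286))
# linear_fc1 keeps 3e-4, linear_fc2 gets 3e-4 * 0.286
```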
Additional context
MuP and several interesting papers that followed (e.g. Depth-MuP) suggest, among other techniques such as scaling layers' outputs and initializations, using different LRs depending on width in order to enhance feature learning and prevent output layers from dominating the learning process. Combined with proper initializations and output scaling, this provides a stable setup, especially for sweeping and scaling hyperparameters for pretraining.
A GPT-like model typically has an FFN factor > 1 (3.5 for Llama 3.1 70B), which suggests that the down-projection (linear_fc2 in Megatron) requires a lower LR, theoretically LR x 1/ffn_factor.
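For illustration, the multiplier this suggests for Llama 3.1 70B (hidden/FFN sizes are the model's public config values; the base LR is a made-up placeholder, not from this issue):

```python
hidden_size = 8192                            # Llama 3.1 70B hidden size
ffn_hidden_size = 28672                       # Llama 3.1 70B FFN intermediate size
ffn_factor = ffn_hidden_size / hidden_size    # 3.5
base_lr = 3e-4                                # placeholder base LR
linear_fc2_lr = base_lr / ffn_factor          # theoretical down-projection LR
print(ffn_factor, linear_fc2_lr)              # 3.5  ~8.57e-05
```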
This way, we don't have to add a new argument (e.g. downproj-lr-mult) each time we want to test LR scaling for a particular layer (e.g. linear_fc2).
P.S.:
Scaling layers' outputs (before residual connections), as introduced in Depth-MuP to account for depth scaling, will be suggested in a separate PR. Same for initializations.