Hi,
I was using the SGDW implementation in this repo, and I wonder whether something is wrong with this line:
pytorch-optimizer/torch_optimizer/sgdw.py, line 121 (commit 910b414)
Let the weight decay be $\lambda$ and the learning rate be $\mu_t$. If I understand it correctly, this line of code applies weight decay as

$$\theta_t \leftarrow \tilde{\theta}_t - \lambda \mu_t,$$

where (following the notation in the paper) $\tilde{\theta}_t$ denotes the parameter after the momentum step. That is, the scalar $\lambda \mu_t$ is subtracted from every parameter.
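For concreteness, here is a minimal sketch of what a line of this shape does (a paraphrase with hypothetical names `p`, `lr`, and `weight_decay`, not the exact code from the repo):

```python
import torch

p = torch.nn.Parameter(torch.tensor([2.0, -2.0]))
lr, weight_decay = 0.1, 0.01

# The suspect pattern: add the *scalar* -lr * weight_decay to every
# element, shifting all parameters by the same constant. Note that a
# negative parameter actually grows in magnitude.
p.data.add_(weight_decay, alpha=-lr)
print(p.data)  # tensor([ 1.9990, -2.0010])
```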
But it should be

$$\theta_t \leftarrow \tilde{\theta}_t - \lambda \mu_t \theta_{t-1},$$

as in the paper: the decay term must be proportional to the parameter itself.
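A minimal sketch of the intended decoupled update, using the same hypothetical names (strictly, the paper decays $\theta_{t-1}$, the pre-step parameter):

```python
import torch

p = torch.nn.Parameter(torch.tensor([2.0, -2.0]))
lr, weight_decay = 0.1, 0.01

# Decoupled weight decay: theta <- theta - lr * weight_decay * theta,
# i.e. every parameter is shrunk toward zero in proportion to its value.
# An equivalent in-place form is p.data.add_(p.data, alpha=-lr * weight_decay).
p.data.mul_(1 - lr * weight_decay)
print(p.data)  # tensor([ 1.9980, -1.9980])
```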
This results in poor training performance compared to SGD with the same set of optimization hyper-parameters.
Thanks!
Regards, Liu