AdamG: Towards Stability of Parameter-free Optimization #264
Comments
@Vectorrent thanks for the suggestion! I just implemented the AdamG optimizer based on the pseudo-code in the paper, here #265.
That was quick! Thanks a lot, I'll be testing ASAP. I love this library 🙂
Not trying to nitpick... but in the research, the authors set From the "setup" section: "Unless otherwise specified, all Adam and Adam-type parameter-free optimizers are paired with a cosine learning rate scheduler. I.e., the default value of ηk in AdamG, D-Adapt Adam and Prodigy Adam is set to 1 with extra cosine annealing decay strategy..."
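For readers who want to see what "a base value of 1 with cosine annealing" looks like numerically, here is a minimal sketch in plain Python. It uses the standard closed-form cosine annealing formula; the function name `cosine_annealed_lr`, the `min_lr` floor of 0, and the absence of warmup are assumptions made for illustration, not details confirmed by the paper.

```python
import math


def cosine_annealed_lr(step: int, total_steps: int,
                       base_lr: float = 1.0, min_lr: float = 0.0) -> float:
    """Return the cosine-annealed multiplier at `step`.

    Hypothetical helper: base_lr=1.0 mirrors the quoted default of eta_k = 1;
    min_lr=0.0 and the lack of warmup are assumptions, not taken from the paper.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that steps outside [0, total_steps] stay at the endpoints.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print the multiplier at a few points of a 100-step run.
    for s in (0, 25, 50, 75, 100):
        print(s, cosine_annealed_lr(s, 100))
```

The multiplier starts at `base_lr`, reaches half of it at the midpoint, and decays to `min_lr` at the final step. In a parameter-free optimizer this value only scales whatever step size the optimizer estimates on its own.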
yeap, afaik, they used the default learning rate of 1.0. umm... actually, I have no intuition about the learning rate of this optimizer now, however, I guess the main reason they used in short, the absolute value of maybe we could find some intuitions from other optimizers like prodigy, d-adaptation repos
I don't have much intuition here, either. Given the fact that Prodigy and DAdapt methods also use a LR of
Prodigy recommends NEVER changing the learning rate: "We recommend using lr=1. (default) for all networks. If you want to force the method to estimate a smaller or larger learning rate, it is better to change the value of d_coef (1.0 by default). Values of d_coef above 1, such as 2 or 10, will force a larger estimate of the learning rate; set it to 0.5 or even 0.1 if you want a smaller learning rate." I suppose the golden step in AdamG acts like the
I agree with you
https://arxiv.org/abs/2405.04376
I've been experimenting with parameter-free optimizers lately (like Prodigy), and came upon AdamG:
Hyperparameter tuning, particularly the selection of an appropriate learning rate in adaptive gradient training methods, remains a challenge. To tackle this challenge, in this paper, we propose a novel parameter-free optimizer, AdamG (Adam with the golden step size), designed to automatically adapt to diverse optimization problems without manual tuning. The core technique underlying AdamG is our golden step size derived for the AdaGrad-Norm algorithm, which is expected to help AdaGrad-Norm preserve the tuning-free convergence and approximate the optimal step size in expectation w.r.t. various optimization scenarios. To better evaluate tuning-free performance, we propose a novel evaluation criterion, reliability, to comprehensively assess the efficacy of parameter-free optimizers in addition to classical performance criteria. Empirical results demonstrate that compared with other parameter-free baselines, AdamG achieves superior performance, which is consistently on par with Adam using a manually tuned learning rate across various optimization tasks.
I was able to hack together a version of AdamG in TFJS, and it performs fairly well! But I am not at all sure if my version is mathematically sound.
Would love to see an implementation of AdamG in PyTorch! As far as I'm aware, this code does not exist anywhere else. I'm opening a feature request here for posterity, though I might get around to implementing this in a PR myself some day.