
AdamG: Towards Stability of Parameter-free Optimization #264

Closed · Vectorrent opened this issue Aug 13, 2024 · 6 comments · Fixed by #265

Labels: feature request

Comments

@Vectorrent
Contributor

https://arxiv.org/abs/2405.04376

I've been experimenting with parameter-free optimizers lately (like Prodigy), and came upon AdamG:

Hyperparameter tuning, particularly the selection of an appropriate learning rate in adaptive gradient training methods, remains a challenge. To tackle this challenge, in this paper, we propose a novel parameter-free optimizer, AdamG (Adam with the golden step size), designed to automatically adapt to diverse optimization problems without manual tuning. The core technique underlying AdamG is our golden step size derived for the AdaGrad-Norm algorithm, which is expected to help AdaGrad-Norm preserve the tuning-free convergence and approximate the optimal step size in expectation w.r.t. various optimization scenarios. To better evaluate tuning-free performance, we propose a novel evaluation criterion, reliability, to comprehensively assess the efficacy of parameter-free optimizers in addition to classical performance criteria. Empirical results demonstrate that compared with other parameter-free baselines, AdamG achieves superior performance, which is consistently on par with Adam using a manually tuned learning rate across various optimization tasks.

I was able to hack together a version of AdamG in TFJS, and it performs fairly well! But I am not at all sure if my version is mathematically sound.

Would love to see an implementation of AdamG in PyTorch! So far as I'm aware, this code does not exist anywhere else. I'm opening a feature request here for posterity, though I might get around to implementing it myself someday.
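
For anyone curious, here's roughly how I understand the core idea (the "golden step size" applied to AdaGrad-Norm), sketched in plain PyTorch. The structure and names below are my own guesses from the abstract, not the paper's pseudo-code, so take it with a grain of salt:

```python
import torch

def adagrad_norm_golden(params, closure, p=0.2, q=0.24, steps=100, eps=1e-8):
    """AdaGrad-Norm where the hand-tuned lr is replaced by the 'golden'
    numerator s(v) = p * v**q (my reading of the abstract, not the paper's code)."""
    v = torch.zeros(1)  # accumulated squared gradient norm
    for _ in range(steps):
        loss = closure()  # closure runs forward + backward and returns the loss
        grads = [prm.grad for prm in params]
        v = v + sum((g * g).sum() for g in grads)
        step = p * v.pow(q) / (v.sqrt() + eps)  # s(v) / sqrt(v) instead of lr / sqrt(v)
        with torch.no_grad():
            for prm, g in zip(params, grads):
                prm.sub_(step * g)
                prm.grad = None
    return loss
```

The full AdamG then folds that numerator into an Adam-style update with first/second moments, which is the part I'm least confident I reproduced correctly in my TFJS hack.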

@kozistr
Owner

kozistr commented Aug 13, 2024

@Vectorrent thanks for the suggestion!

I just implemented the AdamG optimizer based on the pseudo-code in the paper; see #265.
If you have any suggestions or reviews, feel free to check it out and leave a comment :)

[image: AdamG pseudo-code from the paper]

  • your implementation looks good to me, except the paper used beta1 = 0.95 and q = 0.25!

  • I missed this line: "Note that we use the numerator function s(x) = 0.2x^0.24 for all optimization tasks, and the final formula slightly differs from our theoretical derivation, p → 1/2, q → 1/4."
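
For reference, a quick side-by-side of the theoretical vs. empirical numerator values (the little helper below is only for illustration; it's not code from the PR):

```python
def numerator(v: float, p: float, q: float) -> float:
    """The golden-step-size numerator s(x) = p * x**q from the paper."""
    return p * v ** q

for v in (1e2, 1e4, 1e6):
    theoretical = numerator(v, p=0.5, q=0.25)  # p -> 1/2, q -> 1/4 from the derivation
    empirical = numerator(v, p=0.2, q=0.24)    # s(x) = 0.2 * x**0.24 used in the experiments
    print(f"v={v:.0e}  theoretical={theoretical:.3f}  empirical={empirical:.3f}")
```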

@Vectorrent
Contributor Author

That was quick! Thanks a lot, I'll be testing ASAP. I love this library 🙂

@Vectorrent
Contributor Author

Not trying to nitpick... but in the paper, the authors set η_k = 1. That's the learning rate/step size, right? Do you think it would be better to set the default LR to 1.0 here as well, @kozistr?

From the "setup" section:

Unless otherwise specified, all Adam and Adam-type parameter-free optimizers are paired with a cosine learning rate scheduler. I.e., the default value of ηk in AdamG, D-Adapt Adam and Prodigy Adam is set to 1 with extra cosine annealing decay strategy...
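
For concreteness, the setup they describe would look something like this (assuming #265 exposes the optimizer as AdamG; the import and kwargs here are my guesses, not the final API):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR
from pytorch_optimizer import AdamG  # assuming #265 lands under this name

model = torch.nn.Linear(10, 2)
optimizer = AdamG(model.parameters(), lr=1.0)        # eta_k = 1, as in the paper's setup
scheduler = CosineAnnealingLR(optimizer, T_max=100)  # the "extra cosine annealing decay strategy"

for epoch in range(100):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 10)).pow(2).mean()   # dummy objective, just for illustration
    loss.backward()
    optimizer.step()
    scheduler.step()
```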

@kozistr
Owner

kozistr commented Aug 13, 2024


Yeah, AFAIK they used a default learning rate of 1.0.

Honestly, I don't have much intuition about this optimizer's learning rate yet. I guess the main reason they used 1.0 is for a fair comparison with previous works, and since AdamG is a parameter-free, scale-free optimizer, the assumption is that you don't need to tune parameters (e.g. lr) empirically.

In short, an absolute value of 1.0 looks high for training, but it could still be a proper step size for the update. Of course, it needs more observation.

Maybe we could find some intuition in other optimizers like the Prodigy and D-Adaptation repos.

@Vectorrent
Contributor Author

I don't have much intuition here, either. Given that Prodigy and D-Adapt methods also use an LR of 1.0, I'd dare say these would be more appropriate defaults for AdamG:

lr = 1.0
p = 0.2
q = 0.24

Prodigy recommends NEVER changing the learning rate:

We recommend using lr=1. (default) for all networks. If you want to force the method to estimate a smaller or larger learning rate, it is better to change the value of d_coef (1.0 by default). Values of d_coef above 1, such as 2 or 10, will force a larger estimate of the learning rate; set it to 0.5 or even 0.1 if you want a smaller learning rate.

I suppose the golden step size in AdamG acts like the d_coef in Prodigy; it's what scales the learning rate and makes the optimizer adaptive.
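
To put rough numbers on that analogy (purely my mental model, reusing the AdaGrad-Norm-style form s(v)/sqrt(v) from the abstract; none of this is code from the repo):

```python
import math

def effective_step(v: float, lr: float = 1.0, p: float = 0.2, q: float = 0.24) -> float:
    """Effective multiplier on the gradient: lr * s(v) / sqrt(v), with s(v) = p * v**q."""
    return lr * p * v ** q / math.sqrt(v)

for v in (1.0, 1e2, 1e4, 1e6):
    print(f"accumulated v={v:.0e} -> effective step {effective_step(v):.4f}")
```

Since q < 0.5, the effective step shrinks on its own as gradient mass accumulates, so lr can stay pinned at 1.0; that's why it feels analogous to Prodigy's d_coef to me.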

@kozistr
Owner

kozistr commented Aug 13, 2024


I agree with you
