
AdamG: Towards Stability of Parameter-free Optimization #264

Closed · Vectorrent opened this issue Aug 13, 2024 · 6 comments · Fixed by #265

Labels: feature request

Comments

@Vectorrent
Contributor

https://arxiv.org/abs/2405.04376

I've been experimenting with parameter-free optimizers lately (like Prodigy), and came upon AdamG:

Hyperparameter tuning, particularly the selection of an appropriate learning rate in adaptive gradient training methods, remains a challenge. To tackle this challenge, in this paper, we propose a novel parameter-free optimizer, AdamG (Adam with the golden step size), designed to automatically adapt to diverse optimization problems without manual tuning. The core technique underlying AdamG is our golden step size derived for the AdaGrad-Norm algorithm, which is expected to help AdaGrad-Norm preserve the tuning-free convergence and approximate the optimal step size in expectation w.r.t. various optimization scenarios. To better evaluate tuning-free performance, we propose a novel evaluation criterion, reliability, to comprehensively assess the efficacy of parameter-free optimizers in addition to classical performance criteria. Empirical results demonstrate that compared with other parameter-free baselines, AdamG achieves superior performance, which is consistently on par with Adam using a manually tuned learning rate across various optimization tasks.

I was able to hack together a version of AdamG in TFJS, and it performs fairly well! But I am not at all sure if my version is mathematically sound.

Would love to see an implementation of AdamG in PyTorch! So far as I'm aware, this code does not exist anywhere else. I'm opening a feature request here for posterity, though I might get around to implementing it myself someday.
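
For anyone curious, here's roughly how I understand the core idea (the "golden step size" applied to AdaGrad-Norm), sketched in plain PyTorch. The structure and names below are my own guesses from the abstract, not the paper's pseudo-code, so take it with a grain of salt:

```python
import torch

def adagrad_norm_golden(params, closure, p=0.2, q=0.24, steps=100, eps=1e-8):
    """AdaGrad-Norm where the hand-tuned lr is replaced by the 'golden'
    numerator s(v) = p * v**q (my reading of the abstract, not the paper's code)."""
    v = torch.zeros(1)  # accumulated squared gradient norm
    for _ in range(steps):
        loss = closure()  # closure runs forward + backward and returns the loss
        grads = [prm.grad for prm in params]
        v = v + sum((g * g).sum() for g in grads)
        step = p * v.pow(q) / (v.sqrt() + eps)  # s(v) / sqrt(v) instead of lr / sqrt(v)
        with torch.no_grad():
            for prm, g in zip(params, grads):
                prm.sub_(step * g)
                prm.grad = None
    return loss
```

The full AdamG then folds that numerator into an Adam-style update with first/second moments, which is the part I'm least confident I reproduced correctly in my TFJS hack.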

@kozistr
Owner

kozistr commented Aug 13, 2024

@Vectorrent thanks for the suggestion!

I just implemented the AdamG optimizer based on the pseudo-code in the paper; see #265.
If you have any suggestions or reviews, feel free to check it out and leave a comment :)

[image: AdamG pseudo-code from the paper]

  • your implementation looks good to me, except the paper used beta1 = 0.95 and q = 0.25!

  • I missed this line: "Note that we use the numerator function s(x) = 0.2x^0.24 for all optimization tasks, and the final formula slightly differs from our theoretical derivation, p → 1/2, q → 1/4."
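
For reference, a quick side-by-side of the theoretical vs. empirical numerator values (the little helper below is only for illustration; it's not code from the PR):

```python
def numerator(v: float, p: float, q: float) -> float:
    """The golden-step-size numerator s(x) = p * x**q from the paper."""
    return p * v ** q

for v in (1e2, 1e4, 1e6):
    theoretical = numerator(v, p=0.5, q=0.25)  # p -> 1/2, q -> 1/4 from the derivation
    empirical = numerator(v, p=0.2, q=0.24)    # s(x) = 0.2 * x**0.24 used in the experiments
    print(f"v={v:.0e}  theoretical={theoretical:.3f}  empirical={empirical:.3f}")
```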

@Vectorrent
Contributor Author

That was quick! Thanks a lot, I'll be testing ASAP. I love this library 🙂

@Vectorrent
Contributor Author

Not trying to nitpick... but in the paper, the authors set η_k = 1. That's the learning rate/step size, right? Do you think it would be better to set the default LR to 1.0 here as well, @kozistr?

From the "setup" section:

Unless otherwise specified, all Adam and Adam-type parameter-free optimizers are paired with a cosine learning rate scheduler. I.e., the default value of ηk in AdamG, D-Adapt Adam and Prodigy Adam is set to 1 with extra cosine annealing decay strategy...
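
For concreteness, the setup they describe would look something like this (assuming #265 exposes the optimizer as AdamG; the import and kwargs here are my guesses, not the final API):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR
from pytorch_optimizer import AdamG  # assuming #265 lands under this name

model = torch.nn.Linear(10, 2)
optimizer = AdamG(model.parameters(), lr=1.0)        # eta_k = 1, as in the paper's setup
scheduler = CosineAnnealingLR(optimizer, T_max=100)  # the "extra cosine annealing decay strategy"

for epoch in range(100):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 10)).pow(2).mean()   # dummy objective, just for illustration
    loss.backward()
    optimizer.step()
    scheduler.step()
```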

@kozistr
Owner

kozistr commented Aug 13, 2024


Yeah, AFAIK they used a default learning rate of 1.0.

Honestly, I don't have much intuition about this optimizer's learning rate yet. I guess the main reason they used 1.0 is for a fair comparison with previous works, and since AdamG is a parameter-free, scale-free optimizer, the assumption is that you don't need to tune parameters (e.g. lr) empirically.

In short, an absolute value of 1.0 looks high for training, but it could still be a proper step size for the update. Of course, it needs more observation.

Maybe we could find some intuition in other optimizers like the Prodigy and D-Adaptation repos.

@Vectorrent
Contributor Author

I don't have much intuition here, either. Given that Prodigy and D-Adapt methods also use an LR of 1.0, I'd dare say these would be more appropriate defaults for AdamG:

lr = 1.0
p = 0.2
q = 0.24

Prodigy recommends NEVER changing the learning rate:

We recommend using lr=1. (default) for all networks. If you want to force the method to estimate a smaller or larger learning rate, it is better to change the value of d_coef (1.0 by default). Values of d_coef above 1, such as 2 or 10, will force a larger estimate of the learning rate; set it to 0.5 or even 0.1 if you want a smaller learning rate.

I suppose the golden step size in AdamG acts like the d_coef in Prodigy; it's what scales the learning rate and makes the optimizer adaptive.
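
To put rough numbers on that analogy (purely my mental model, reusing the AdaGrad-Norm-style form s(v)/sqrt(v) from the abstract; none of this is code from the repo):

```python
import math

def effective_step(v: float, lr: float = 1.0, p: float = 0.2, q: float = 0.24) -> float:
    """Effective multiplier on the gradient: lr * s(v) / sqrt(v), with s(v) = p * v**q."""
    return lr * p * v ** q / math.sqrt(v)

for v in (1.0, 1e2, 1e4, 1e6):
    print(f"accumulated v={v:.0e} -> effective step {effective_step(v):.4f}")
```

Since q < 0.5, the effective step shrinks on its own as gradient mass accumulates, so lr can stay pinned at 1.0; that's why it feels analogous to Prodigy's d_coef to me.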

@kozistr
Owner

kozistr commented Aug 13, 2024


I agree with you
