Configurable step size instead of hard-coded default values for adafactor #535

lamthuy · 2023-09-23T23:16:19Z

The current implementation of Adafactor is consistent with the default hyperparameter choices in the paper. In particular, in the _get_lr function at

pytorch-optimizer/torch_optimizer/adafactor.py

Line 85 in 19c3e41

def _get_lr(self, param_group: ParamGroup, param_state: State) -> float:

We can see that if relative_step is True, the learning rate supplied by the user is ignored; instead, a time-dependent learning rate is defined as:

if param_group["relative_step"]:
    min_step = (
        1e-6 * param_state["step"]
        if param_group["warmup_init"]
        else 1e-2
    )
    rel_step_sz = min(min_step, 1.0 / math.sqrt(param_state["step"]))

That means the learning rate is min(1e-6*t, 1/sqrt(t)) if warmup_init is True and min(1e-2, 1/sqrt(t)) otherwise, where t is the step count. The hard-coded values 1e-6 and 1e-2 are not always optimal; the best values are data-dependent. I would suggest changing those lines to:

if param_group["relative_step"]:
    min_step = (
        param_group["lr"] * param_state["step"]
        if param_group["warmup_init"]
        else param_group["lr"]
    )
    rel_step_sz = min(min_step, 1.0 / math.sqrt(param_state["step"]))

That would let users configure those hyperparameters via the input learning rate.
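To make the difference concrete, here is a minimal standalone sketch of the two schedules. The function names rel_step_hardcoded and rel_step_configurable are hypothetical and are not part of torch_optimizer; they only mirror the relative-step branch quoted above, and they assume lr is a positive float.

```python
import math


def rel_step_hardcoded(step: int, warmup_init: bool) -> float:
    """Relative step size using the constants currently hard-coded in _get_lr."""
    min_step = 1e-6 * step if warmup_init else 1e-2
    return min(min_step, 1.0 / math.sqrt(step))


def rel_step_configurable(step: int, warmup_init: bool, lr: float) -> float:
    """Proposed variant (hypothetical helper): lr replaces 1e-6 / 1e-2."""
    min_step = lr * step if warmup_init else lr
    return min(min_step, 1.0 / math.sqrt(step))


if __name__ == "__main__":
    # Compare both schedules at a few step counts, without warmup.
    for t in (1, 100, 10_000, 1_000_000):
        print(
            t,
            rel_step_hardcoded(t, warmup_init=False),
            rel_step_configurable(t, warmup_init=False, lr=1e-3),
        )
```

In this sketch, passing lr=1e-6 with warmup_init=True, or lr=1e-2 with warmup_init=False, reproduces the current behaviour, so the existing constants could remain the defaults.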
