Apply scale loss when performing accum_freq #957

Open · AshStuff opened this issue Oct 14, 2024 · 1 comment
@AshStuff

In this line

backward(total_loss, scaler)

We accumulate the gradients and perform the optimizer step only after accumulating gradients for accum_freq steps.

I am wondering whether we need to divide total_loss by accum_freq to scale the loss properly.
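For context, here is a minimal sketch in plain PyTorch (not open_clip's actual training loop) of why the division can matter: without it, the accumulated gradient is the sum over the accum_freq micro-batches rather than their mean, which effectively multiplies the learning rate by accum_freq.

```python
import torch

# Minimal sketch (plain PyTorch, not open_clip's loop): gradient accumulation
# with the loss divided by accum_freq so the accumulated gradient approximates
# the mean gradient of one large batch rather than the sum over micro-batches.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()
accum_freq = 4  # micro-batches per optimizer step

optimizer.zero_grad()
for _ in range(accum_freq):
    x, y = torch.randn(8, 10), torch.randn(8, 1)
    loss = loss_fn(model(x), y)
    # Without this division, gradients from the accum_freq backward passes
    # sum up, scaling the effective step size by accum_freq.
    (loss / accum_freq).backward()
optimizer.step()
```

Note this equivalence to large-batch training is only approximate for a contrastive loss like CLIP's, since the loss itself depends on which samples share a batch.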

@rwightman
Collaborator

@AshStuff Dupe of #761. I have not used the grad accum myself; it would be worth someone with the resources doing a small-scale experiment, running with and without scaling, to see if there is any difference in behaviour... though I would have thought that would have been tested against normal training before being added in the first place. Perhaps not.
