I was trying to use my second-order optimizer ESGD-M with BatchL2Grad in order to collect information on within-batch gradient variance to estimate stochastic noise (think OpenAI's gradient noise scale paper), and I kept OOMing after maybe six epochs of MNIST training. ESGD-M does a Hessian-vector product internally (not using BackPACK machinery, just plain autograd), so it needs the user to specify create_graph=True. I assume that when I use it with BackPACK, something is leaking references to past computational graphs; normally these graphs are garbage collected without issue.
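For context, the Hessian-vector product pattern described above can be sketched in plain PyTorch. This is a minimal illustration of the autograd mechanism involved (a toy linear model and random vector, not the actual ESGD-M or BackPACK code): backward through the loss with create_graph=True, then differentiate the resulting gradients a second time.

```python
import torch

# Toy setup: a small model and a scalar loss (stand-ins for the real training code).
model = torch.nn.Linear(4, 1)
x = torch.randn(8, 4)
loss = model(x).pow(2).mean()

params = list(model.parameters())

# First backward pass; create_graph=True retains the graph so the
# gradients themselves can be differentiated.
grads = torch.autograd.grad(loss, params, create_graph=True)

# Hessian-vector product against a random vector v, via a second
# differentiation of the gradients.
v = [torch.randn_like(g) for g in grads]
hvp = torch.autograd.grad(grads, params, grad_outputs=v)

# Once `grads` and `hvp` go out of scope, the retained graph should be
# freed by normal garbage collection; the leak reported here suggests
# something else is keeping a reference to it alive.
```

If the graph created by create_graph=True were being held elsewhere (e.g. by an extension hooked into the backward pass), each training step would retain another full graph, which matches the steadily growing memory usage until OOM.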
Thank you,
Katherine Crowson
Apparently, if I tell ESGD-M to do a Hessian-vector product every step, instead of every ten steps for compute efficiency, I don't OOM anymore. Normally the graphs made with create_graph=True are freed on their own on the steps where ESGD-M doesn't do an HVP, but BackPACK seems to be hanging onto them somewhere.