Using MetaAdam results in 'RuntimeError: Trying to backward through the graph a second time' #191
-
Hi! I would like to build on top of your Meta-Gradient RL example with the MetaAdam optimizer. However, if I simply replace the MetaSGD optimizer in your code with MetaAdam, I get `RuntimeError: Trying to backward through the graph a second time`. Here is the full code to reproduce the problem:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

import torchopt


def test_gamma():
    class Rollout:
        @staticmethod
        def get():
            out = torch.empty(5, 2)
            out[:, 0] = torch.randn(5)
            out[:, 1] = 0.1 * torch.ones(5)
            label = torch.arange(0, 10)
            return out.view(10, 1), F.one_hot(label, 10)

        @staticmethod
        def rollout(trajectory, gamma):
            out = [trajectory[-1]]
            for i in reversed(range(9)):
                out.append(trajectory[i] + gamma[i] * out[-1].clone().detach_())
            out.reverse()
            return torch.hstack(out).view(10, 1)

    class ValueNetwork(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc = nn.Linear(10, 1)

        def forward(self, x):
            return self.fc(x)

    torch.manual_seed(0)
    inner_iters = 1
    outer_iters = 10000

    net = ValueNetwork()
    inner_optimizer = torchopt.MetaAdam(net, lr=5e-1, moment_requires_grad=False)
    gamma = torch.zeros(9, requires_grad=True)
    meta_optimizer = torchopt.SGD([gamma], lr=5e-1)
    net_state = torchopt.extract_state_dict(net)

    for i in range(outer_iters):
        for _ in range(inner_iters):
            trajectory, state = Rollout.get()
            backup = Rollout.rollout(trajectory, torch.sigmoid(gamma))
            pred_value = net(state.float())
            loss = F.mse_loss(pred_value, backup)
            inner_optimizer.step(loss)

        trajectory, state = Rollout.get()
        pred_value = net(state.float())
        backup = Rollout.rollout(trajectory, torch.ones_like(gamma))
        loss = F.mse_loss(pred_value, backup)

        meta_optimizer.zero_grad()
        loss.backward()
        meta_optimizer.step()

        torchopt.recover_state_dict(net, net_state)
        torchopt.stop_gradient(net)

        if i % 100 == 0:
            with torch.no_grad():
                print(f'epoch {i} | gamma: {torch.sigmoid(gamma)}')


if __name__ == '__main__':
    test_gamma()
```

Is this a bug or am I missing something? Thanks for your help!
-
Hi @dierkes-j, thanks for raising this. You need to extract your model state at the beginning of each outer loop, and also recover the model state at the end of each inner loop. Here is the suggestion:

```diff
import torch
import torch.nn as nn
import torch.nn.functional as F

import torchopt


def test_gamma():
    class Rollout:
        @staticmethod
        def get():
            out = torch.empty(5, 2)
            out[:, 0] = torch.randn(5)
            out[:, 1] = 0.1 * torch.ones(5)
            label = torch.arange(0, 10)
            return out.view(10, 1), F.one_hot(label, 10)

        @staticmethod
        def rollout(trajectory, gamma):
            out = [trajectory[-1]]
            for i in reversed(range(9)):
                out.append(trajectory[i] + gamma[i] * out[-1].clone().detach_())
            out.reverse()
            return torch.hstack(out).view(10, 1)

    class ValueNetwork(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc = nn.Linear(10, 1)

        def forward(self, x):
            return self.fc(x)

    torch.manual_seed(0)
    inner_iters = 1
    outer_iters = 10000

    net = ValueNetwork()
    inner_optimizer = torchopt.MetaAdam(net, lr=5e-1, moment_requires_grad=False)
    gamma = torch.zeros(9, requires_grad=True)
    meta_optimizer = torchopt.SGD([gamma], lr=5e-1)
-   net_state = torchopt.extract_state_dict(net)

    for i in range(outer_iters):
+       net_state = torchopt.extract_state_dict(net)
        for _ in range(inner_iters):
            trajectory, state = Rollout.get()
            backup = Rollout.rollout(trajectory, torch.sigmoid(gamma))
            pred_value = net(state.float())
            loss = F.mse_loss(pred_value, backup)
            inner_optimizer.step(loss)
+       torchopt.recover_state_dict(net, net_state)

        trajectory, state = Rollout.get()
        pred_value = net(state.float())
        backup = Rollout.rollout(trajectory, torch.ones_like(gamma))
        loss = F.mse_loss(pred_value, backup)

        meta_optimizer.zero_grad()
        loss.backward()
        meta_optimizer.step()

-       torchopt.recover_state_dict(net, net_state)
        torchopt.stop_gradient(net)

        if i % 100 == 0:
            with torch.no_grad():
                print(f'epoch {i} | gamma: {torch.sigmoid(gamma)}')


if __name__ == '__main__':
    test_gamma()
```

Here are the results:

```console
$ python3 test_gamma.py
epoch 0 | gamma: tensor([0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000])
epoch 100 | gamma: tensor([0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000])
epoch 200 | gamma: tensor([0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000])
epoch 300 | gamma: tensor([0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000])
epoch 400 | gamma: tensor([0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000])
epoch 500 | gamma: tensor([0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000])
epoch 600 | gamma: tensor([0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000])
epoch 700 | gamma: tensor([0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000])
epoch 800 | gamma: tensor([0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000])
epoch 900 | gamma: tensor([0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000])
epoch 1000 | gamma: tensor([0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000])
...
```
-
Hi @XuehaiPan, thanks for your fast reply and help! I am a little bit confused about your reply. Doesn't moving the line `net_state = torchopt.extract_state_dict(net)` into the outer loop effectively change nothing here? From my understanding, I would simply omit the extraction and recovering of the network state altogether.
-
I see, that makes sense to me! Thanks again for your help :)
Yes, you are correct.
The problem is that you should create the inner optimizer at the beginning of each outer loop. In your code snippet, the inner optimizer is shared across multiple outer-loop optimizations. You should either extract/recover and detach the state of the inner optimizer, just as you do for the network parameters, or recreate a new inner optimizer on each outer iteration.
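For concreteness, here is a minimal, untested sketch of the two options, reusing the `net`, `gamma`, `Rollout`, `inner_iters`, `outer_iters`, and `meta_optimizer` defined in the script above. It assumes that `torchopt.extract_state_dict`, `torchopt.recover_state_dict`, and `torchopt.stop_gradient` also accept a `MetaOptimizer` (as in the TorchOpt MAML tutorial); please double-check against the TorchOpt version you are using.

```python
# Option 1: extract/recover and detach the inner-optimizer state, mirroring
# what is already done for the network parameters.
inner_optimizer = torchopt.MetaAdam(net, lr=5e-1, moment_requires_grad=False)
net_state = torchopt.extract_state_dict(net)
optim_state = torchopt.extract_state_dict(inner_optimizer)  # assumes MetaOptimizer support

for i in range(outer_iters):
    for _ in range(inner_iters):
        trajectory, state = Rollout.get()
        backup = Rollout.rollout(trajectory, torch.sigmoid(gamma))
        loss = F.mse_loss(net(state.float()), backup)
        inner_optimizer.step(loss)  # differentiable inner update

    trajectory, state = Rollout.get()
    backup = Rollout.rollout(trajectory, torch.ones_like(gamma))
    outer_loss = F.mse_loss(net(state.float()), backup)

    meta_optimizer.zero_grad()
    outer_loss.backward()
    meta_optimizer.step()

    # Reset the network and the inner-optimizer moments, and cut the autograd
    # history so the next outer iteration builds a fresh graph instead of
    # backwarding through the old (already freed) one.
    torchopt.recover_state_dict(net, net_state)
    torchopt.recover_state_dict(inner_optimizer, optim_state)
    torchopt.stop_gradient(net)
    torchopt.stop_gradient(inner_optimizer)  # assumes MetaOptimizer support

# Option 2: simply recreate the inner optimizer at the top of each outer
# iteration, so its Adam moments never reference a previous graph.
for i in range(outer_iters):
    inner_optimizer = torchopt.MetaAdam(net, lr=5e-1, moment_requires_grad=False)
    ...  # same inner/outer updates as above, without the optimizer extract/recover
```

Either way, the key point is that after calling `backward()` on the outer loss, nothing that still carries the old graph (network parameters or Adam moments) may be reused in the next outer iteration.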