
Lookahead - RuntimeError: Expected all tensors to be on the same device #306

Open
atonyo11 opened this issue Dec 6, 2024 · 4 comments
Labels: bug (Something isn't working)

atonyo11 commented Dec 6, 2024

Describe the bug

My program runs fine with optim.Adam. After wrapping the optimizer with Lookahead, the error below is raised.

To Reproduce

  • OS: Linux
  • PyTorch version: 2.0.1
  • Python version: 3.9
  • Reproducible code:

    self.optimizer = Lookahead(
        optim.Adam(
            model.parameters(),
            lr=self.optim_dict['base_lr'],
            weight_decay=self.optim_dict['weight_decay'],
        ),
        k=5,
        alpha=0.5,
    )

Log

    scaler.step(optimizer.optimizer)
  File "/private/.conda/envs/project1/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 374, in step
    retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
  File "/private/.conda/envs/project1/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 290, in _maybe_opt_step
    retval = optimizer.step(*args, **kwargs)
  File "/private/.conda/envs/project1/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/private/.conda/envs/project1/lib/python3.9/site-packages/pytorch_optimizer/optimizer/lookahead.py", line 137, in step
    self.update(group)
  File "/private/.conda/envs/project1/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/private/.conda/envs/project1/lib/python3.9/site-packages/pytorch_optimizer/optimizer/lookahead.py", line 116, in update
    p.mul_(self.alpha).add_(slow, alpha=1.0 - self.alpha)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

atonyo11 added the bug label on Dec 6, 2024
kozistr (Owner) commented Dec 6, 2024

@atonyo11 hi. Could you please share a specific example to reproduce it? It'd be good to fix the code based on your usage. I checked the implementation and tested it with the example below, but I can't reproduce it.

It seems like the params of the Adam optimizer are on the GPU, but the params of Lookahead aren't. I may be wrong, but I assume you might have loaded your optimizer states on a different device or something similar.
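
For illustration, the failing line in lookahead.py is just an in-place update that mixes the fast parameter with its cached slow copy, so it fails whenever the two live on different devices. A minimal sketch of the same error (not your code, just the mechanism, and it assumes a CUDA device is available):

import torch

p = torch.zeros(3, device='cuda')  # fast weights, already moved to the GPU
slow = torch.zeros(3)              # cached slow weights, still on the CPU
p.mul_(0.5).add_(slow, alpha=0.5)  # RuntimeError: Expected all tensors to be on the same device

Here is the full example I tested with: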

import os

import torch
from torch import nn, utils
from torch.optim import Adam

from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor

import lightning.pytorch as pl

from pytorch_optimizer import Lookahead


class LitAutoEncoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
        self.decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

    def training_step(self, batch, batch_idx):
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return Lookahead(Adam(self.parameters(), lr=1e-3), k=5, alpha=0.5)

train_dataset = MNIST(os.getcwd(), train=True, download=True, transform=ToTensor())
train_loader = utils.data.DataLoader(train_dataset)

autoencoder = LitAutoEncoder()
autoencoder.train()
autoencoder.cuda()

trainer = pl.Trainer(
    limit_train_batches=100,
    max_epochs=1,
    accelerator='auto',
    logger=True,
)

trainer.fit(autoencoder, train_loader)

atonyo11 (Author) commented Dec 6, 2024

@kozistr Thank you for your quick reply.

I am working with this code:
https://github.com/hulianyuyy/CorrNet/blob/main/utils/optimizer.py

kozistr (Owner) commented Dec 8, 2024

Hi. Could you explain in more detail how to reproduce it? I tested various scenarios as far as I could, but I still can't reproduce the device mismatch by loading from a checkpoint or by calling the optimizer on its own. (I might be missing something.)
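
In the meantime, it might help to print the device of every tensor in the optimizer state right before the failing scaler.step() call. A generic sketch (hypothetical, not taken from your repo; it only assumes the optimizer exposes a standard state_dict()):

import torch

def report_devices(obj, prefix='state_dict'):
    # Recursively print the device of every tensor in a (nested) optimizer state dict.
    if torch.is_tensor(obj):
        print(prefix, obj.device)
    elif isinstance(obj, dict):
        for k, v in obj.items():
            report_devices(v, f'{prefix}.{k}')
    elif isinstance(obj, (list, tuple)):
        for i, v in enumerate(obj):
            report_devices(v, f'{prefix}[{i}]')

report_devices(optimizer.state_dict())  # any entry printed as 'cpu' is the culprit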

However, I found that this could happen when you resume training after loading the optimizer states (both Adam and Lookahead) through the repo you mentioned: Lookahead's state would still be on the CPU, because that state is currently not saved and loaded, and its device is determined only when the Lookahead optimizer is initialized.

In short, I just made a modification so that the Lookahead optimizer state is also saved and loaded; all you need to do is save and load the optimizer state like below.

optimizer = ...

torch.save(optimizer.state_dict(), 'opt.ckpt')
optimizer.load_state_dict(torch.load('opt.ckpt', map_location='cuda'))

You can check the modified implementation here.

Hope this helps with your issue; please let me know if you still have a problem.
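
As a side note, since the slow weights are created on whatever device the model parameters are on when Lookahead is constructed, it is also worth checking the ordering in your training script: if the mismatch comes from wrapping the optimizer before moving the model to the GPU, building the optimizer afterwards should avoid the error as well. A sketch under that assumption (the hyperparameter values are just placeholders):

model = model.cuda()  # move the model first, so the slow weights are cloned from GPU tensors
optimizer = Lookahead(
    optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4),
    k=5,
    alpha=0.5,
)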

atonyo11 (Author) commented Dec 8, 2024

I just run the program from scratch, with no pretrained checkpoint loaded:
python main.py --config ./config/baseline.yaml --device 0
