
PyTorchLightningPruningCallback messes with Multiworker Dataloaders #154

Open
mspils opened this issue Oct 11, 2023 · 3 comments
Labels: bug (Something isn't working)

mspils commented Oct 11, 2023

Expected behavior

When using the PyTorchLightningPruningCallback, a pruned trial should finish cleanly, without errors.

Environment

  • Optuna version: 3.2.0
  • Python version: 3.8.10
  • OS: Linux-5.4.0-159-generic-x86_64-with-glibc2.17
  • Other libraries and their versions:
    • PyTorch Lightning 2.0.3
    • Torch 2.0.1

Error messages, stack traces, or logs

[I 2023-10-11 17:40:48,579] Trial 5 finished with value: 0.08356545865535736 and parameters: {'learning_rate': 0.0020429196484991327, 'n_layers': 1, 'n_units_l0': 4}. Best is trial 4 with value: 0.078277587890625.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]

  | Name  | Type       | Params
-------------------------------------
0 | model | Sequential | 49    
-------------------------------------
49        Trainable params
0         Non-trainable params
49        Total params
0.000     Total estimated model params size (MB)
Epoch 1: 100%|██████████| 10/10 [00:00<00:00, 237.30it/s, v_num=61]
[I 2023-10-11 17:40:49,436] Trial 6 pruned. Trial was pruned at epoch 1.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]

  | Name  | Type       | Params
-------------------------------------
0 | model | Sequential | 16    
-------------------------------------
16        Trainable params
0         Non-trainable params
16        Total params
0.000     Total estimated model params size (MB)
Epoch 1: 100%|██████████| 10/10 [00:00<00:00, 205.02it/s, v_num=62]
[I 2023-10-11 17:40:50,429] Trial 7 pruned. Trial was pruned at epoch 1.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]

  | Name  | Type       | Params
-------------------------------------
0 | model | Sequential | 13    
-------------------------------------
13        Trainable params
0         Non-trainable params
13        Total params
0.000     Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7fb7eb82d820>
Traceback (most recent call last):
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1478, in __del__
    self._shutdown_workers()
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1461, in _shutdown_workers
    if w.is_alive():
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
Epoch 1: 100%|██████████| 10/10 [00:00<00:00, 20.64it/s, v_num=62]
Epoch 0:   0%|          | 0/10 [00:00<?, ?it/s]
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7fb7eb82d820>
Traceback (most recent call last):
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1478, in __del__
    self._shutdown_workers()
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1461, in _shutdown_workers
    if w.is_alive():
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

(The same "Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ ...>" traceback, ending in "AssertionError: can only test a child process", is printed several more times, interleaved with repeated "Epoch 1: 100%| ... | 10/10 [00:00<00:00, ~11.7it/s, v_num=62]" progress-bar lines.)
Epoch 1: 100%|██████████| 10/10 [00:00<00:00, 214.93it/s, v_num=63]
[I 2023-10-11 17:40:51,374] Trial 8 pruned. Trial was pruned at epoch 1.

Steps to reproduce

  1. Run the following code (you may need to change DEVICES and ACCELERATOR if you do not have multiple GPUs).
  2. Wait until a trial is pruned.
from typing import List, Optional

import pytorch_lightning as pl
#import lightning.pytorch as pl
import optuna
import torch
from lightning.pytorch.callbacks import Callback
from optuna.integration import PyTorchLightningPruningCallback
from torch import nn, optim
from torch.utils.data import DataLoader

torch.set_float32_matmul_precision('high')
BATCHSIZE = 1024
EPOCHS = 50
ACCELERATOR = 'cuda'
DEVICES = [1]


class OptunaPruningCallback(PyTorchLightningPruningCallback, Callback):
    """Custom Optuna pruning callback, because CUDA/Lightning do not play well with the default one.

    Mixes ``lightning.pytorch.callbacks.Callback`` into the Optuna integration callback.
    """

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

class ToyDataSet(torch.utils.data.Dataset):
    def __init__(self, count):
        super().__init__()
        self.x = torch.rand(count, dtype=torch.float32)
        self.y = torch.rand(count, dtype=torch.float32)
        self.count = count

    def __len__(self) -> int:
        return self.count

    def __getitem__(self, idx):
        if idx >= len(self):
            raise IndexError(f"Index {idx} is out of range, dataset has length {len(self)}")

        return self.x[idx], self.y[idx]

class LightningNet(pl.LightningModule):
    def __init__(self, output_dims,learning_rate) -> None:
        super().__init__()
        layers = []
        input_dim = 1
        for output_dim in output_dims:
            layers.append(nn.Linear(input_dim, output_dim))
            layers.append(nn.ReLU())
            input_dim = output_dim
        layers.append(nn.Linear(input_dim, 1))

        self.model = nn.Sequential(*layers)
        self.save_hyperparameters()

    def forward(self, data: torch.Tensor) -> torch.Tensor:
        return self.model(data)

    def training_step(self, batch: List[torch.Tensor], batch_idx: int) -> torch.Tensor:
        x,y = batch
        x = x.view(-1,1)
        y_hat = self(x)[:,0]
        loss = nn.functional.mse_loss(y_hat,y)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch: List[torch.Tensor], batch_idx: int) -> None:
        x,y = batch
        x = x.view(-1,1)
        y_hat = self(x)[:,0]
        val_loss = nn.functional.mse_loss(y_hat,y)
        self.log("val_loss", val_loss, sync_dist=True)

    def configure_optimizers(self) -> optim.Optimizer:
        return optim.Adam(self.model.parameters(),self.hparams.learning_rate)

    def setup(self, stage: Optional[str] = None) -> None:
        self.dataset_train = ToyDataSet(10000)
        self.dataset_val = ToyDataSet(1000)

    def train_dataloader(self) -> DataLoader:
        return DataLoader(self.dataset_train, batch_size=BATCHSIZE, shuffle=True, pin_memory=True, num_workers=8, persistent_workers=True)

    def val_dataloader(self) -> DataLoader:
        return DataLoader(self.dataset_val, batch_size=BATCHSIZE, shuffle=False, pin_memory=True, num_workers=8, persistent_workers=True)


def objective(trial: optuna.trial.Trial) -> float:
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    n_layers = trial.suggest_int("n_layers", 1, 2)
    output_dims = [trial.suggest_int(f"n_units_l{i}", 4, 64, log=True) for i in range(n_layers)]

    model = LightningNet(output_dims,learning_rate)

    trainer = pl.Trainer(
        logger=True,
        enable_checkpointing=False,
        max_epochs=EPOCHS,
        accelerator=ACCELERATOR,
        devices=DEVICES,
        callbacks=[PyTorchLightningPruningCallback(trial, monitor="val_loss")],
        #callbacks=[OptunaPruningCallback(trial, monitor="val_loss")],
    )

    trainer.fit(model)

    return trainer.callback_metrics["val_loss"].item()


if __name__ == "__main__":
    study = optuna.create_study(
        direction="minimize",
        pruner= optuna.pruners.HyperbandPruner(min_resource=1, max_resource='auto', reduction_factor=3, bootstrap_count=0),
        load_if_exists=True)

    study.optimize(objective, n_trials=100)

Additional context (optional)

When optimizing a study with Optuna using the PyTorchLightningPruningCallback, it is possible for pruned trials not to finish properly: DataLoaders with multiple workers are not shut down correctly and may even interfere with later trials. At the very least, the logged v_nums are sometimes out of order.
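
As a possible workaround sketch (untested, just an idea; the helper name objective_with_cleanup is made up for illustration): catch the optuna.TrialPruned exception inside the objective, drop the trainer/model references so the persistent DataLoader worker iterators can be cleaned up in the parent process, and only then re-raise so Optuna still records the trial as pruned.

import gc

def objective_with_cleanup(trial: optuna.trial.Trial) -> float:
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    n_layers = trial.suggest_int("n_layers", 1, 2)
    output_dims = [trial.suggest_int(f"n_units_l{i}", 4, 64, log=True) for i in range(n_layers)]

    model = LightningNet(output_dims, learning_rate)
    trainer = pl.Trainer(
        logger=True,
        enable_checkpointing=False,
        max_epochs=EPOCHS,
        accelerator=ACCELERATOR,
        devices=DEVICES,
        callbacks=[PyTorchLightningPruningCallback(trial, monitor="val_loss")],
    )

    try:
        trainer.fit(model)
    except optuna.TrialPruned:
        # Drop references to the trainer and model so the multiprocessing
        # DataLoader iterators are garbage-collected (and their workers shut
        # down) in the parent process before the exception propagates, then
        # re-raise so Optuna still marks the trial as pruned.
        del trainer, model
        gc.collect()
        raise

    return trainer.callback_metrics["val_loss"].item()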

mspils added the bug (Something isn't working) label on Oct 11, 2023
HideakiImamura (Member) commented:

@mspils Does this problem still occur with the latest Optuna v3.4?

mspils (Author) commented Nov 21, 2023

Yes and no. It crashes, which is probably an improvement:


[W 2023-11-21 13:45:48,635] Trial 0 failed with parameters: {'learning_rate': 0.009733867742024538, 'n_layers': 1, 'n_units_l0': 12} because of the following error: RuntimeError('DataLoader worker (pid(s) 3999530) exited unexpectedly').
Traceback (most recent call last):
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1132, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/threading.py", line 306, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3999530) is killed by signal: Aborted. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/optuna/study/_optimize.py", line 200, in _run_trial
    value_or_values = func(trial)
  File "optuna_issue.py", line 108, in objective
    trainer.fit(model)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1035, in _run_stage
    self.fit_loop.run()
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 136, in run
    self.advance(data_fetcher)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 202, in advance
    batch, _, __ = next(data_fetcher)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/loops/fetchers.py", line 127, in __next__
    batch = super().__next__()
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/loops/fetchers.py", line 56, in __next__
    batch = next(self.iterator)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/utilities/combined_loader.py", line 326, in __next__
    out = next(self._iterator)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/utilities/combined_loader.py", line 74, in __next__
    out[i] = next(self.iterators[i])
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1328, in _next_data
    idx, data = self._get_data()
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1284, in _get_data
    success, data = self._try_get_data()
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1145, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 3999530) exited unexpectedly
[W 2023-11-21 13:45:48,646] Trial 0 failed with value None.
Traceback (most recent call last):
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1132, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/threading.py", line 306, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3999530) is killed by signal: Aborted. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "optuna_issue.py", line 119, in <module>
    study.optimize(objective, n_trials=100)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/optuna/study/study.py", line 451, in optimize
    _optimize(
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/optuna/study/_optimize.py", line 66, in _optimize
    _optimize_sequential(
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/optuna/study/_optimize.py", line 163, in _optimize_sequential
    frozen_trial = _run_trial(study, func, catch)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/optuna/study/_optimize.py", line 251, in _run_trial
    raise func_err
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/optuna/study/_optimize.py", line 200, in _run_trial
    value_or_values = func(trial)
  File "optuna_issue.py", line 108, in objective
    trainer.fit(model)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1035, in _run_stage
    self.fit_loop.run()
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 136, in run
    self.advance(data_fetcher)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 202, in advance
    batch, _, __ = next(data_fetcher)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/loops/fetchers.py", line 127, in __next__
    batch = super().__next__()
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/loops/fetchers.py", line 56, in __next__
    batch = next(self.iterator)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/utilities/combined_loader.py", line 326, in __next__
    out = next(self._iterator)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/utilities/combined_loader.py", line 74, in __next__
    out[i] = next(self.iterators[i])
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1328, in _next_data
    idx, data = self._get_data()
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1284, in _get_data
    success, data = self._try_get_data()
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1145, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 3999530) exited unexpectedly
Epoch 0:   0%|          | 0/10 [00:00<?, ?it/s]                                         

youyinnn commented:

Same issue here.

nzw0301 transferred this issue from optuna/optuna on Aug 19, 2024