KeyError in epoch_best_postprocessing_or_default() #432

Open
cifkao opened this issue Aug 1, 2024 · 3 comments
cifkao commented Aug 1, 2024

I'm trying to run the benchmark but it crashes on the dcase2016_task2 task. After training for what seems like 229 epochs, at the prediction stage, I get a KeyError trying to access the postprocessing parameters at epoch 240:

predict - dcase2016_task2 - 2024-08-01 09:19:18,874 - 874 -  result: [0.1666666716337204, 29, {"batch_size": 1024, "check_val_every_n_epoch": 10, "dropout": 0.1, "embedding_norm": "<class 'torch.nn.modules.linear.Identity'>", "hidden_dim": 1024, "hidden_layers": 2, "hidden_norm": "<class 'torch.nn.modules.batchnorm.BatchNorm1d'>", "initialization": "<function xavier_uniform_ at 0x7fd89f389830>", "lr": 0.0032, "max_epochs": 500, "norm_after_activation": false, "optim": "<class 'torch.optim.adam.Adam'>", "patience": 20}, [["median_filter_ms", 250], ["min_duration", 125]]]
Grid Point Summary: [0.19771863520145416, 39, {"batch_size": 1024, "check_val_every_n_epoch": 10, "dropout": 0.1, "embedding_norm": "<class 'torch.nn.modules.linear.Identity'>", "hidden_dim": 1024, "hidden_layers": 2, "hidden_norm": "<class 'torch.nn.modules.batchnorm.BatchNorm1d'>", "initialization": "<function xavier_normal_ at 0x7fd89f3898c0>", "lr": 0.00032, "max_epochs": 500, "norm_after_activation": false, "optim": "<class 'torch.optim.adam.Adam'>", "patience": 20}, [["median_filter_ms", 250], ["min_duration", 125]]]
Grid Point Summary: [0.19354838132858276, 59, {"batch_size": 1024, "check_val_every_n_epoch": 10, "dropout": 0.1, "embedding_norm": "<class 'torch.nn.modules.linear.Identity'>", "hidden_dim": 1024, "hidden_layers": 2, "hidden_norm": "<class 'torch.nn.modules.batchnorm.BatchNorm1d'>", "initialization": "<function xavier_uniform_ at 0x7fd89f389830>", "lr": 0.00032, "max_epochs": 500, "norm_after_activation": false, "optim": "<class 'torch.optim.adam.Adam'>", "patience": 20}, [["median_filter_ms", 250], ["min_duration", 125]]]
Grid Point Summary: [0.1901140660047531, 269, {"batch_size": 1024, "check_val_every_n_epoch": 10, "dropout": 0.1, "embedding_norm": "<class 'torch.nn.modules.linear.Identity'>", "hidden_dim": 1024, "hidden_layers": 1, "hidden_norm": "<class 'torch.nn.modules.batchnorm.BatchNorm1d'>", "initialization": "<function xavier_normal_ at 0x7fd89f3898c0>", "lr": 0.00032, "max_epochs": 500, "norm_after_activation": false, "optim": "<class 'torch.optim.adam.Adam'>", "patience": 20}, [["median_filter_ms", 250], ["min_duration", 125]]]
Grid Point Summary: [0.18285714089870453, 139, {"batch_size": 1024, "check_val_every_n_epoch": 10, "dropout": 0.1, "embedding_norm": "<class 'torch.nn.modules.linear.Identity'>", "hidden_dim": 1024, "hidden_layers": 1, "hidden_norm": "<class 'torch.nn.modules.batchnorm.BatchNorm1d'>", "initialization": "<function xavier_normal_ at 0x7fd89f3898c0>", "lr": 0.0001, "max_epochs": 500, "norm_after_activation": false, "optim": "<class 'torch.optim.adam.Adam'>", "patience": 20}, [["median_filter_ms", 250], ["min_duration", 125]]]
Grid Point Summary: [0.1807909607887268, 69, {"batch_size": 1024, "check_val_every_n_epoch": 10, "dropout": 0.1, "embedding_norm": "<class 'torch.nn.modules.linear.Identity'>", "hidden_dim": 1024, "hidden_layers": 1, "hidden_norm": "<class 'torch.nn.modules.batchnorm.BatchNorm1d'>", "initialization": "<function xavier_normal_ at 0x7fd89f3898c0>", "lr": 0.001, "max_epochs": 500, "norm_after_activation": false, "optim": "<class 'torch.optim.adam.Adam'>", "patience": 20}, [["median_filter_ms", 250], ["min_duration", 125]]]
Grid Point Summary: [0.1732580065727234, 29, {"batch_size": 1024, "check_val_every_n_epoch": 10, "dropout": 0.1, "embedding_norm": "<class 'torch.nn.modules.linear.Identity'>", "hidden_dim": 1024, "hidden_layers": 2, "hidden_norm": "<class 'torch.nn.modules.batchnorm.BatchNorm1d'>", "initialization": "<function xavier_uniform_ at 0x7fd89f389830>", "lr": 0.001, "max_epochs": 500, "norm_after_activation": false, "optim": "<class 'torch.optim.adam.Adam'>", "patience": 20}, [["median_filter_ms", 250], ["min_duration", 125]]]
Grid Point Summary: [0.1666666716337204, 29, {"batch_size": 1024, "check_val_every_n_epoch": 10, "dropout": 0.1, "embedding_norm": "<class 'torch.nn.modules.linear.Identity'>", "hidden_dim": 1024, "hidden_layers": 2, "hidden_norm": "<class 'torch.nn.modules.batchnorm.BatchNorm1d'>", "initialization": "<function xavier_uniform_ at 0x7fd89f389830>", "lr": 0.0032, "max_epochs": 500, "norm_after_activation": false, "optim": "<class 'torch.optim.adam.Adam'>", "patience": 20}, [["median_filter_ms", 250], ["min_duration", 125]]]
Grid Point Summary: [0.16030533611774445, 19, {"batch_size": 1024, "check_val_every_n_epoch": 10, "dropout": 0.1, "embedding_norm": "<class 'torch.nn.modules.linear.Identity'>", "hidden_dim": 1024, "hidden_layers": 2, "hidden_norm": "<class 'torch.nn.modules.batchnorm.BatchNorm1d'>", "initialization": "<function xavier_normal_ at 0x7fd89f3898c0>", "lr": 0.0032, "max_epochs": 500, "norm_after_activation": false, "optim": "<class 'torch.optim.adam.Adam'>", "patience": 20}, [["median_filter_ms", 250], ["min_duration", 125]]]
grid: 8it [1:59:58, 899.87s/it]
predict - dcase2016_task2 - 2024-08-01 09:19:18,874 - 874 - Best Grid Point Validation Score: 0.19771863520145416  Grid Point HyperParams: {'batch_size': 1024, 'check_val_every_n_epoch': 10, 'dropout': 0.1, 'embedding_norm': <class 'torch.nn.modules.linear.Identity'>, 'hidden_dim': 1024, 'hidden_layers': 2, 'hidden_norm': <class 'torch.nn.modules.batchnorm.BatchNorm1d'>, 'initialization': <function xavier_normal_ at 0x7fd89f3898c0>, 'lr': 0.00032, 'max_epochs': 500, 'norm_after_activation': False, 'optim': <class 'torch.optim.adam.Adam'>, 'patience': 20}
split: 0it [00:00, ?it/s]
100%|██████████| 84000/84000 [00:00<00:00, 140181.00it/s]
100%|██████████| 84000/84000 [00:01<00:00, 59876.66it/s]
Getting embeddings for split ['test'], which has 84000 instances.
You are using a CUDA device ('NVIDIA RTX A6000') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Restoring states from the checkpoint path at logs/embeddings/mymodel/dcase2016_task2-hear2021-full/lightning_logs/version_4/checkpoints/epoch=39-step=10320.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [3]
Loaded model weights from checkpoint at logs/embeddings/mymodel/dcase2016_task2-hear2021-full/lightning_logs/version_4/checkpoints/epoch=39-step=10320.ckpt
/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:229: PossibleUserWarning: The dataloader, test_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 48 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  category=PossibleUserWarning,
  0%|                                                                                                                                                                                                               | 0/6 [2:00:03<?, ?it/s]
Traceback (most recent call last):
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ondrej/proj/sandbox/heareval/src/heareval/heareval/predictions/runner.py", line 181, in <module>
    runner()
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/ondrej/proj/sandbox/heareval/src/heareval/heareval/predictions/runner.py", line 148, in runner
    logger=logger,
  File "/home/ondrej/proj/sandbox/heareval/src/heareval/heareval/predictions/task_predictions.py", line 1411, in task_predictions
    in_memory=in_memory,
  File "/home/ondrej/proj/sandbox/heareval/src/heareval/heareval/predictions/task_predictions.py", line 1106, in task_predictions_test
    ckpt_path=grid_point.model_path, dataloaders=test_dataloader
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 795, in test
    self, self._test_impl, model, dataloaders, ckpt_path, verbose, datamodule
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 842, in _test_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
    results = self._run_stage()
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1188, in _run_stage
    return self._run_evaluate()
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1228, in _run_evaluate
    eval_loop_results = self._evaluation_loop.run()
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/pytorch_lightning/loops/loop.py", line 206, in run
    output = self.on_run_end()
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 180, in on_run_end
    self._evaluation_epoch_end(self._outputs)
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 288, in _evaluation_epoch_end
    self.trainer._call_lightning_module_hook(hook_name, output_or_outputs)
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1356, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/ondrej/proj/sandbox/heareval/src/heareval/heareval/predictions/task_predictions.py", line 305, in test_epoch_end
    self._score_epoch_end("test", outputs)
  File "/home/ondrej/proj/sandbox/heareval/src/heareval/heareval/predictions/task_predictions.py", line 467, in _score_epoch_end
    postprocessing_cached = self.epoch_best_postprocessing_or_default(epoch)
  File "/home/ondrej/proj/sandbox/heareval/src/heareval/heareval/predictions/task_predictions.py", line 431, in epoch_best_postprocessing_or_default
    return self.epoch_best_postprocessing[epoch]
KeyError: 240
Testing DataLoader 0: 100%|██████████| 83/83 [00:02<00:00, 34.43it/s]

I'm using a conda environment. I have pytorch-lightning==1.9.5, torch==1.13.1 and scikit-learn==1.0.2.


theMoro commented Aug 14, 2024

I have the same problem. Have you already solved it? :)


theMoro commented Aug 17, 2024

I have now found the problem and a solution to it.

The code tries to set the current_epoch attribute of the PyTorch Lightning Trainer by calling:

trainer.fit_loop.current_epoch = grid_point.epoch

With the installed pytorch-lightning version this assignment has no effect on what self.current_epoch later returns. To get the desired outcome, change this line to:

trainer.fit_loop.epoch_progress.current.completed = grid_point.epoch

This actually changes the value you get when calling self.current_epoch in _score_epoch_end (line 464).

Another solution would be to store the epoch in a separate attribute on the trainer and then read that attribute back where you need it.
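For anyone who wants to see the failure mode in isolation: the sketch below is illustrative only (the dict contents and default are made up, not the actual heareval internals), but it shows why a wrong self.current_epoch at test time turns into a KeyError on the per-epoch postprocessing cache, and how a defensive .get() fallback would behave instead:

```python
# Illustrative sketch of the KeyError (not the actual heareval code).
# Postprocessing parameters are cached per *validated* epoch; if the
# trainer reports an epoch that was never validated (240 here), a bare
# dict lookup raises KeyError: 240.
default_postprocessing = (("median_filter_ms", 250), ("min_duration", 125))

epoch_best_postprocessing = {
    29: (("median_filter_ms", 250), ("min_duration", 125)),
    39: (("median_filter_ms", 250), ("min_duration", 62)),
}

def epoch_best_postprocessing_or_default(epoch):
    # The failing code path boils down to:
    #     return epoch_best_postprocessing[epoch]   # KeyError for epoch=240
    # A defensive variant falls back to the default for unseen epochs:
    return epoch_best_postprocessing.get(epoch, default_postprocessing)

print(epoch_best_postprocessing_or_default(39))   # cached epoch
print(epoch_best_postprocessing_or_default(240))  # unseen epoch, falls back
```

Note that this fallback only masks the symptom; the real fix is making self.current_epoch report the right epoch, as described above.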


cifkao commented Aug 20, 2024

Thanks @theMoro, that fixed the problem for me!
