TOTALVI.train() Model training has an invalid value error #3028

Closed
raozuming opened this issue Oct 23, 2024 · 3 comments

raozuming commented Oct 23, 2024

scvi.model.TOTALVI.setup_mudata(
    mdata,
    rna_layer="counts" if rna_use_raw else None,
    protein_layer="counts" if protein_use_raw else None,
    modalities={
        "rna_layer": "multiomics" if self._use_hvg else "rna",
        "protein_layer": "protein",
    })

total_vi = scvi.model.TOTALVI(mdata, **kwags)
total_vi.train(accelerator=accelerator, devices=devices, **train_kwargs)
.conda/envs/py310/lib/python3.10/site-packages/mudata/_core/mudata.py:931: UserWarning: Cannot join columns with the same name because var_names are intersecting.
  warnings.warn(
INFO     Computing empirical prior initialization for protein background.
INFO: GPU available: True (cuda), used: True
INFO:lightning.pytorch.utilities.rank_zero:GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO:lightning.pytorch.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO: HPU available: False, using: 0 HPUs
INFO:lightning.pytorch.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
.conda/envs/py310/lib/python3.10/site-packages/lightning/pytorch/core/optimizer.py:316: The lr scheduler dict contains the key(s) ['monitor'], but the keys will be ignored. You need to call `lr_scheduler.step()` manually in manual optimization.
.conda/envs/py310/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py:298: The number of training batches (5) is smaller than the logging interval Trainer(log_every_n_steps=10). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
Epoch 1/400:   0%|                                                                                                                                                                                                   | 0/400 [00:00<?, ?it/s]Traceback (most recent call last):
  File "saw_multianalysis/multiAnalysis.py", line 410, in <module>
    run()
  File "saw_multianalysis/multiAnalysis.py", line 405, in run
    main(rna_path, protein_path, bin_size, protein_list, out_dir, convert_py_bool(use_gpu), gpu, num_threads,
  File "saw_multianalysis/multiAnalysis.py", line 309, in main
    totalVI(rna_data, protein_data, prefix, proteins, out_dir, use_gpu, gpu, num_threads, report)
  File "saw_multianalysis/multiAnalysis.py", line 175, in totalVI
    total_vi = ms_data.tl.total_vi(
  File ".conda/envs/py310/lib/python3.10/site-packages/stereo/algorithm/total_vi.py", line 142, in main
    total_vi.train(accelerator=accelerator, devices=devices, **train_kwargs)
  File ".conda/envs/py310/lib/python3.10/site-packages/scvi/model/_totalvi.py", line 313, in train
    return runner()
  File ".conda/envs/py310/lib/python3.10/site-packages/scvi/train/_trainrunner.py", line 96, in __call__
    self.trainer.fit(self.training_plan, self.data_splitter)
  File ".conda/envs/py310/lib/python3.10/site-packages/scvi/train/_trainer.py", line 201, in fit
    super().fit(*args, **kwargs)
  File ".conda/envs/py310/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
    call._call_and_handle_interrupt(
  File ".conda/envs/py310/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 47, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File ".conda/envs/py310/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File ".conda/envs/py310/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 981, in _run
    results = self._run_stage()
  File ".conda/envs/py310/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run_stage
    self.fit_loop.run()
  File ".conda/envs/py310/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
    self.advance()
  File ".conda/envs/py310/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File ".conda/envs/py310/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 141, in run
    self.on_advance_end(data_fetcher)
  File ".conda/envs/py310/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 295, in on_advance_end
    self.val_loop.run()
  File ".conda/envs/py310/lib/python3.10/site-packages/lightning/pytorch/loops/utilities.py", line 178, in _decorator
    return loop_run(self, *args, **kwargs)
  File ".conda/envs/py310/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 135, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
  File ".conda/envs/py310/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 396, in _evaluation_step
    output = call._call_strategy_hook(trainer, hook_name, *step_args)
  File ".conda/envs/py310/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 319, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File ".conda/envs/py310/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 411, in validation_step
    return self.lightning_module.validation_step(*args, **kwargs)
  File ".conda/envs/py310/lib/python3.10/site-packages/scvi/train/_trainingplans.py", line 360, in validation_step
    _, _, scvi_loss = self.forward(batch, loss_kwargs=self.loss_kwargs)
  File ".conda/envs/py310/lib/python3.10/site-packages/scvi/train/_trainingplans.py", line 278, in forward
    return self.module(*args, **kwargs)
  File ".conda/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File ".conda/envs/py310/lib/python3.10/site-packages/scvi/module/base/_decorators.py", line 41, in auto_transfer_args
    return fn(self, *args, **kwargs)
  File ".conda/envs/py310/lib/python3.10/site-packages/scvi/module/base/_base_module.py", line 208, in forward
    return _generic_forward(
  File ".conda/envs/py310/lib/python3.10/site-packages/scvi/module/base/_base_module.py", line 752, in _generic_forward
    generative_outputs = module.generative(**generative_inputs, **generative_kwargs)
  File ".conda/envs/py310/lib/python3.10/site-packages/scvi/module/base/_decorators.py", line 41, in auto_transfer_args
    return fn(self, *args, **kwargs)
  File ".conda/envs/py310/lib/python3.10/site-packages/scvi/module/_totalvae.py", line 405, in generative
    px_, py_, log_pro_back_mean = self.decoder(
  File ".conda/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File ".conda/envs/py310/lib/python3.10/site-packages/scvi/nn/_base_components.py", line 886, in forward
    log_pro_back_mean = Normal(py_["back_alpha"], py_["back_beta"]).rsample()
  File ".conda/envs/py310/lib/python3.10/site-packages/torch/distributions/normal.py", line 56, in __init__
    super().__init__(batch_shape, validate_args=validate_args)
  File ".conda/envs/py310/lib/python3.10/site-packages/torch/distributions/distribution.py", line 62, in __init__
    raise ValueError(
ValueError: Expected parameter scale (Tensor of shape (126, 119)) of distribution Normal(loc: torch.Size([126, 119]), scale: torch.Size([126, 119])) to satisfy the constraint GreaterThan(lower_bound=0.0), but found invalid values:
tensor([[0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00, 0.0000e+00,
         0.0000e+00],
        [1.9011e-01, 6.1955e-02, 5.6041e-02,  ..., 1.2564e-01, 1.7372e-01,
         8.7467e-02],
        [1.8586e-04, 4.1208e-07, 1.2568e-07,  ..., 3.0242e-05, 5.3610e-07,
         1.3509e-06],
        ...,
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00, 0.0000e+00,
         0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00, 0.0000e+00,
         0.0000e+00],
        [1.3216e-19, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00, 0.0000e+00,
         0.0000e+00]], device='cuda:0')

Versions:

python: 3.10.14
scvi-tools: 1.2.0
scanpy: 1.9.6

raozuming added the bug label Oct 23, 2024
raozuming changed the title from "TOTALVI.train() Model training shows NAN" to "TOTALVI.train() Model training has an invalid value error" Oct 23, 2024
@canergen
Member

Hi, thanks for reporting the issue. Indeed, no epsilon is added to the scale in the decoder, so it can become exactly zero. I created a PR with a fix for this. Can you install from this branch and check that it's fixed? https://github.com/scverse/scvi-tools/tree/can_totalvi_decoder_eps
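
For readers following along, here is a minimal sketch of the failure mode and of the kind of fix described above. It assumes the scale is produced by exponentiating an unconstrained decoder output in float32; the thread does not show the decoder code, so that is an assumption. The helper `positive_scale` and the value `EPS = 1e-8` are hypothetical and are not taken from the actual PR.

```python
import numpy as np

EPS = 1e-8  # hypothetical value, not taken from the actual patch


def positive_scale(log_scale: np.ndarray, eps: float = EPS) -> np.ndarray:
    """Map an unconstrained output to a strictly positive scale."""
    # exp() alone can underflow to exactly 0.0 in float32;
    # adding a small epsilon keeps the result above zero.
    return np.exp(log_scale.astype(np.float32)) + np.float32(eps)


# Unconstrained outputs; the last one is very negative.
log_scale = np.array([-1.7, -15.0, -200.0], dtype=np.float32)

raw = np.exp(log_scale)           # last entry underflows to 0.0
safe = positive_scale(log_scale)  # every entry stays > 0

print(raw > 0)
print(safe > 0)
```

A scale of exactly zero is what the `GreaterThan(lower_bound=0.0)` constraint in the traceback rejects, so any strictly positive floor avoids that particular exception. It does not by itself guarantee a finite loss.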

@raozuming
Author

@canergen
Thank you. With the fix, training no longer raises the exception, but it now stops with the following message: "Monitored metric elbo_validation = nan is not finite. Previous best value was inf. Signaling Trainer to stop." In addition, the results of the protein-related differential analysis are all 0.00000000. Are the results abnormal because the data quality is too poor? The log is as follows:

.conda/envs/py310/lib/python3.10/site-packages/docrep/decorators.py:43: SyntaxWarning: 'param_categorical_covariate_keys' is not a valid key!
  doc = func(self, args[0].__doc__, *args[1:], **kwargs)
.conda/envs/py310/lib/python3.10/site-packages/mudata/_core/mudata.py:931: UserWarning: Cannot join columns with the same name because var_names are intersecting.
warnings.warn(
INFO Computing empirical prior initialization for protein background.
INFO: GPU available: False, used: False
INFO:lightning.pytorch.utilities.rank_zero:GPU available: False, used: False
INFO: TPU available: False, using: 0 TPU cores
INFO:lightning.pytorch.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO: HPU available: False, using: 0 HPUs
INFO:lightning.pytorch.utilities.rank_zero:HPU available: False, using: 0 HPUs
.conda/envs/py310/lib/python3.10/site-packages/lightning/pytorch/core/optimizer.py:316: The lr scheduler dict contains the key(s) ['monitor'], but the keys will be ignored. You need to call lr_scheduler.step() manually in manual optimization.
.conda/envs/py310/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py:298: The number of training batches (5) is smaller than the logging interval Trainer(log_every_n_steps=10). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
Epoch 1/400: 0%|▎ | 1/400 [00:01<10:32, 1.59s/it, v_num=1, train_loss_step=1.99e+4, train_loss_epoch=7.67e+4]
Monitored metric elbo_validation = nan is not finite. Previous best value was inf. Signaling Trainer to stop.
[2024-10-28 16:08:55][Stereo][67371][MainThread][139743647491904][st_pipeline][41][INFO]: start to run normalize_total...
[2024-10-28 16:08:55][Stereo][67371][MainThread][139743647491904][st_pipeline][44][INFO]: normalize_total end, consume time 0.0092s.
[2024-10-28 16:08:55][Stereo][67371][MainThread][139743647491904][st_pipeline][41][INFO]: start to run log1p...
[2024-10-28 16:08:55][Stereo][67371][MainThread][139743647491904][st_pipeline][44][INFO]: log1p end, consume time 0.0046s.
[2024-10-28 16:08:55][Stereo][67371][MainThread][139743647491904][st_pipeline][41][INFO]: start to run neighbors...
[2024-10-28 16:09:06][Stereo][67371][MainThread][139743647491904][st_pipeline][44][INFO]: neighbors end, consume time 10.7429s.
[2024-10-28 16:09:06][Stereo][67371][MainThread][139743647491904][st_pipeline][41][INFO]: start to run umap...
completed 0 / 500 epochs
completed 50 / 500 epochs
completed 100 / 500 epochs
completed 150 / 500 epochs
completed 200 / 500 epochs
completed 250 / 500 epochs
completed 300 / 500 epochs
completed 350 / 500 epochs
completed 400 / 500 epochs
completed 450 / 500 epochs
[2024-10-28 16:09:09][Stereo][67371][MainThread][139743647491904][st_pipeline][44][INFO]: umap end, consume time 3.8732s.
[2024-10-28 16:09:09][Stereo][67371][MainThread][139743647491904][st_pipeline][41][INFO]: start to run leiden...
[2024-10-28 16:09:10][Stereo][67371][MainThread][139743647491904][st_pipeline][44][INFO]: leiden end, consume time 0.1685s.
DE...: 0%| | 0/27 [00:00<?, ?it/s].conda/envs/py310/lib/python3.10/site-packages/numpy/core/_methods.py:118: RuntimeWarning: overflow encountered in reduce
ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
.conda/envs/py310/lib/python3.10/site-packages/scvi/model/base/_differential.py:313: RuntimeWarning: divide by zero encountered in log2
return np.log2(x + pseudocounts) - np.log2(y + pseudocounts)
.conda/envs/py310/lib/python3.10/site-packages/scvi/model/base/_differential.py:313: RuntimeWarning: invalid value encountered in subtract
return np.log2(x + pseudocounts) - np.log2(y + pseudocounts)
.conda/envs/py310/lib/python3.10/site-packages/numpy/core/_methods.py:118: RuntimeWarning: invalid value encountered in reduce
ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
.conda/envs/py310/lib/python3.10/site-packages/numpy/core/_methods.py:152: RuntimeWarning: invalid value encountered in reduce
arrmean = umr_sum(arr, axis, dtype, keepdims=True, where=where)
.conda/envs/py310/lib/python3.10/site-packages/numpy/core/_methods.py:173: RuntimeWarning: invalid value encountered in subtract
x = asanyarray(arr - arrmean)
DE...: 4%|███████▏ | 1/27 [00:04<02:07, 4.89s/it].conda/envs/py310/lib/python3.10/site-packages/numpy/core/_methods.py:118: RuntimeWarning: overflow encountered in reduce
ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
.conda/envs/py310/lib/python3.10/site-packages/scvi/model/base/_differential.py:313: RuntimeWarning: divide by zero encountered in log2
return np.log2(x + pseudocounts) - np.log2(y + pseudocounts)
.conda/envs/py310/lib/python3.10/site-packages/scvi/model/base/_differential.py:313: RuntimeWarning: invalid value encountered in subtract
return np.log2(x + pseudocounts) - np.log2(y + pseudocounts)
.conda/envs/py310/lib/python3.10/site-packages/numpy/core/_methods.py:118: RuntimeWarning: invalid value encountered in reduce
ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
.conda/envs/py310/lib/python3.10/site-packages/numpy/core/_methods.py:152: RuntimeWarning: invalid value encountered in reduce
arrmean = umr_sum(arr, axis, dtype, keepdims=True, where=where)
.conda/envs/py310/lib/python3.10/site-packages/numpy/core/_methods.py:173: RuntimeWarning: invalid value encountered in subtract
x = asanyarray(arr - arrmean)
DE...: 7%|██████████████▎ | 2/27 [00:10<02:10, 5.20s/it].conda/envs/py310/lib/python3.10/site-packages/numpy/core/_methods.py:118: RuntimeWarning: overflow encountered in reduce
ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
.conda/envs/py310/lib/python3.10/site-packages/scvi/model/base/_differential.py:313: RuntimeWarning: divide by zero encountered in log2
return np.log2(x + pseudocounts) - np.log2(y + pseudocounts)
.conda/envs/py310/lib/python3.10/site-packages/scvi/model/base/_differential.py:313: RuntimeWarning: invalid value encountered in subtract
return np.log2(x + pseudocounts) - np.log2(y + pseudocounts)
.conda/envs/py310/lib/python3.10/site-packages/numpy/core/_methods.py:118: RuntimeWarning: invalid value encountered in reduce
ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
.conda/envs/py310/lib/python3.10/site-packages/numpy/core/_methods.py:152: RuntimeWarning: invalid value encountered in reduce
arrmean = umr_sum(arr, axis, dtype, keepdims=True, where=where)
.conda/envs/py310/lib/python3.10/site-packages/numpy/core/_methods.py:173: RuntimeWarning: invalid value encountered in subtract
x = asanyarray(arr - arrmean)
DE...: 11%|█████████████████████▌ | 3/27 [00:15<02:07, 5.33s/it].conda/envs/py310/lib/python3.10/site-packages/scvi/model/base/_differential.py:313: RuntimeWarning: divide by zero encountered in log2
return np.log2(x + pseudocounts) - np.log2(y + pseudocounts)
.conda/envs/py310/lib/python3.10/site-packages/scvi/model/base/_differential.py:313: RuntimeWarning: invalid value encountered in subtract
return np.log2(x + pseudocounts) - np.log2(y + pseudocounts)
.conda/envs/py310/lib/python3.10/site-packages/numpy/core/_methods.py:118: RuntimeWarning: invalid value encountered in reduce
ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
.conda/envs/py310/lib/python3.10/site-packages/numpy/core/_methods.py:152: RuntimeWarning: invalid value encountered in reduce
arrmean = umr_sum(arr, axis, dtype, keepdims=True, where=where)
.conda/envs/py310/lib/python3.10/site-packages/numpy/core/_methods.py:173: RuntimeWarning: invalid value encountered in subtract
x = asanyarray(arr - arrmean)
DE...: 15%|████████████████████████████▋ | 4/27 [00:23<02:28, 6.46s/it].conda/envs/py310/lib/python3.10/site-packages/numpy/core/_methods.py:118: RuntimeWarning: overflow encountered in reduce
ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
.conda/envs/py310/lib/python3.10/site-packages/scvi/model/base/_differential.py:313: RuntimeWarning: divide by zero encountered in log2
return np.log2(x + pseudocounts) - np.log2(y + pseudocounts)
.conda/envs/py310/lib/python3.10/site-packages/scvi/model/base/_differential.py:313: RuntimeWarning: invalid value encountered in subtract
return np.log2(x + pseudocounts) - np.log2(y + pseudocounts)
.conda/envs/py310/lib/python3.10/site-packages/numpy/core/_methods.py:118: RuntimeWarning: invalid value encountered in reduce
ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
.conda/envs/py310/lib/python3.10/site-packages/numpy/core/_methods.py:152: RuntimeWarning: invalid value encountered in reduce
arrmean = umr_sum(arr, axis, dtype, keepdims=True, where=where)
.conda/envs/py310/lib/python3.10/site-packages/numpy/core/_methods.py:173: RuntimeWarning: invalid value encountered in subtract
x = asanyarray(arr - arrmean)
DE...: 19%|███████████████████████████████████▉ | 5/27 [00:30<02:25, 6.62s/it].conda/envs/py310/lib/python3.10/site-packages/numpy/core/_methods.py:118: RuntimeWarning: overflow encountered in reduce
ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
.conda/envs/py310/lib/python3.10/site-packages/scvi/model/base/_differential.py:313: RuntimeWarning: divide by zero encountered in log2
return np.log2(x + pseudocounts) - np.log2(y + pseudocounts)
.conda/envs/py310/lib/python3.10/site-packages/scvi/model/base/_differential.py:313: RuntimeWarning: invalid value encountered in subtract
return np.log2(x + pseudocounts) - np.log2(y + pseudocounts)
.conda/envs/py310/lib/python3.10/site-packages/numpy/core/_methods.py:118: RuntimeWarning: invalid value encountered in reduce
ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
.conda/envs/py310/lib/python3.10/site-packages/numpy/core/_methods.py:152: RuntimeWarning: invalid value encountered in reduce
arrmean = umr_sum(arr, axis, dtype, keepdims=True, where=where)
.conda/envs/py310/lib/python3.10/site-packages/numpy/core/_methods.py:173: RuntimeWarning: invalid value encountered in subtract
x = asanyarray(arr - arrmean)
DE...: 22%|███████████████████████████████████████████ | 6/27 [00:35<02:07, 6.09s/it].conda/envs/py310/lib/python3.10/site-packages/numpy/core/_methods.py:118: RuntimeWarning: overflow encountered in reduce
ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
.conda/envs/py310/lib/python3.10/site-packages/scvi/model/base/_differential.py:313: RuntimeWarning: divide by zero encountered in log2
return np.log2(x + pseudocounts) - np.log2(y + pseudocounts)
.conda/envs/py310/lib/python3.10/site-packages/numpy/core/_methods.py:173: RuntimeWarning: invalid value encountered in subtract
x = asanyarray(arr - arrmean)
DE...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 27/27 [03:24<00:00, 7.57s/it]
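
As a side note on the `divide by zero encountered in log2` and `invalid value encountered in subtract` warnings in the log above, the sketch below shows one way NumPy produces them from the quoted expression: when both inputs and the pseudocount are exactly zero. Whether that is what happened in this run is an assumption, and the helper name `lfc` and the toy values are hypothetical.

```python
import numpy as np


def lfc(x: np.ndarray, y: np.ndarray, pseudocounts: float) -> np.ndarray:
    """Log2 fold change, same expression as the line quoted in the warnings."""
    # Silence the RuntimeWarnings here so the resulting values are easy to inspect.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.log2(x + pseudocounts) - np.log2(y + pseudocounts)


x = np.array([0.0, 0.0, 4.0])
y = np.array([0.0, 2.0, 1.0])

no_pseudo = lfc(x, y, pseudocounts=0.0)     # log2(0) = -inf, and -inf - (-inf) = nan
with_pseudo = lfc(x, y, pseudocounts=1e-5)  # all entries finite

print(no_pseudo)
print(np.isfinite(with_pseudo).all())
```

Non-finite fold changes like these would be consistent with a model that stopped after one epoch with a NaN validation ELBO, but the log alone does not confirm the cause.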

@canergen
Member

This looks like an error that was fixed in scvi-tools 1.2.0. Can you please confirm that the installation completed correctly and that the installed version is scvi-tools 1.2.0? Happy to have a look at the data. I don’t think it’s about poor quality, but it might be sparsity in one protein.
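
The installed version can be printed with `print(scvi.__version__)`. For the sparsity question, a rough check along the following lines may help. It is only a sketch on a toy matrix: the helper names `protein_sparsity_report` and `all_zero_cells` and the protein names are hypothetical, and with real data the dense protein count matrix would have to be extracted first (for example from the "protein" modality used in the `setup_mudata` call above, which is an assumption about the data layout).

```python
import numpy as np


def protein_sparsity_report(counts, names):
    """Return the fraction of zero counts for each protein (column)."""
    counts = np.asarray(counts)
    frac_zero = (counts == 0).mean(axis=0)
    return {name: float(frac) for name, frac in zip(names, frac_zero)}


def all_zero_cells(counts):
    """Return indices of cells (rows) whose protein counts are all zero."""
    return np.flatnonzero(np.asarray(counts).sum(axis=1) == 0)


# Toy cells x proteins matrix; the second protein is never detected.
counts = np.array(
    [
        [5, 0, 2],
        [3, 0, 0],
        [0, 0, 0],
        [7, 0, 1],
    ]
)
names = ["protA", "protB", "protC"]  # hypothetical protein names

report = protein_sparsity_report(counts, names)
empty = all_zero_cells(counts)

print(report)
print(empty)
```

Proteins with a zero fraction of 1.0 and cells with no protein counts at all are natural first candidates to inspect or filter before retraining.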
