Distributed training with train_vocoder_gan.py error #365

JRMeyer (Maintainer) · 2021-03-07T09:21:15Z

>>> dyson
[March 4, 2021, 12:04pm]

Hello everyone,

So far, I've successfully trained a model with Tacotron 2, and the
speech synthesized with Universal FullBand-MelGAN sounds okay. To
further improve the quality and make the synthesized voice sound more
similar to the original speaker, I decided to train my own vocoder using
the same dataset.

But when I use the following command:

`CUDA_VISIBLE_DEVICES='0,1,2' OMP_NUM_THREADS=1 python TTS/bin/distribute.py --script train_vocoder_gan.py --config_path config_vocoder_PWgan.json`

I got the following output:

Traceback (most recent call last):
  File '/home/ldai/projects/TTS/TTS/bin/train_vocoder_gan.py', line 654, in <module>
    main(args)
  File '/home/ldai/projects/TTS/TTS/bin/train_vocoder_gan.py', line 559, in main
    epoch)
  File '/home/ldai/projects/TTS/TTS/bin/train_vocoder_gan.py', line 114, in train
    y_hat = model_G(c_G)
  File '/home/ldai/anaconda3/envs/mozillatts/lib/python3.6/site-packages/torch/nn/modules/module.py', line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File '/home/ldai/anaconda3/envs/mozillatts/lib/python3.6/site-packages/torch/nn/parallel/distributed.py', line 606, in forward
    if self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).

I didn't get such an error when using train_tacotron.py with the same
command.
Any suggestions?

Library versions:
python=3.6.12
torch=1.7.1

[This is an archived TTS discussion thread from discourse.mozilla.org/t/distributed-training-with-train-vocoder-gan-py-error]
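For readers landing on this thread, option (1) from the error message can be tried in isolation. The following is only a minimal sketch, assuming PyTorch with the `gloo` backend is available on CPU; `ToyGenerator` and `train_two_steps` are hypothetical names made up for illustration and are not part of the TTS code. The thread does not show where `train_vocoder_gan.py` wraps `model_G`, so the equivalent change there is left to the reader.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


class ToyGenerator(nn.Module):
    """Hypothetical stand-in for a generator with a layer that forward() skips."""

    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)  # has parameters but never contributes to the loss

    def forward(self, x):
        return self.used(x)


def train_two_steps():
    # Single-process "gloo" group on CPU, only so that DDP can be constructed here.
    init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
    dist.init_process_group(
        backend="gloo", init_method=f"file://{init_file}", rank=0, world_size=1
    )
    try:
        # Option (1) from the error message: let DDP look for unused parameters.
        model = DDP(ToyGenerator(), find_unused_parameters=True)
        losses = []
        for _ in range(2):  # the reported error is raised in a forward pass after a prior iteration
            model.zero_grad()
            loss = model(torch.randn(2, 4)).sum()
            loss.backward()
            losses.append(loss.item())
        return losses
    finally:
        dist.destroy_process_group()
```

Calling `train_two_steps()` runs two forward/backward passes through the wrapped model. Note that `find_unused_parameters=True` makes DDP traverse the autograd graph on every iteration, so it adds some overhead; option (2), making every output of `forward` participate in the loss, avoids that cost when it is feasible.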


1 reply

OmiWakode Mar 31, 2022

I also got a similar error during multi-GPU training of a custom VITS model. Did you solve this error?

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/train_tts_env/lib/python3.8/site-packages/trainer/trainer.py", line 1461, in fit
    self._fit()
  File "/home/ubuntu/anaconda3/envs/train_tts_env/lib/python3.8/site-packages/trainer/trainer.py", line 1449, in _fit
    self.test_run()
  File "/home/ubuntu/anaconda3/envs/train_tts_env/lib/python3.8/site-packages/trainer/trainer.py", line 1385, in test_run
    self.model.test_log(test_outputs, self.dashboard_logger, self.training_assets, self.total_steps_done)
  File "/home/ubuntu/anaconda3/envs/train_tts_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'test_log'
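This second traceback looks like a different problem from the one in the original post: `DistributedDataParallel` wraps the model and keeps it under its `.module` attribute, so custom methods such as `test_log` are not found on the wrapper itself. Below is a minimal sketch of the usual unwrapping pattern, written in plain Python so it runs without a GPU. `FakeDDP`, `ToyModel`, and `unwrap` are hypothetical names for illustration only; whether a given version of the `trainer` package already does this internally is not something the thread confirms.

```python
class FakeDDP:
    """Stand-in for a wrapper like DistributedDataParallel (not the real class).

    Like the real wrapper, it stores the wrapped model as `.module` and does
    not forward the model's custom methods.
    """

    def __init__(self, module):
        self.module = module


class ToyModel:
    """Hypothetical model exposing a custom method, as in the traceback."""

    def test_log(self, outputs):
        return f"logged {len(outputs)} outputs"


def unwrap(model):
    # Return the underlying model if `model` is a wrapper exposing `.module`,
    # otherwise return it unchanged. Assumption: the bare model has no
    # attribute of its own named `module`.
    return getattr(model, "module", model)


wrapped = FakeDDP(ToyModel())
print(unwrap(wrapped).test_log([1, 2, 3]))   # works on the wrapped model
print(unwrap(ToyModel()).test_log([]))       # also works on a bare model
```

With real PyTorch code, an `isinstance(model, torch.nn.parallel.DistributedDataParallel)` check is a stricter alternative to the `getattr` shortcut used here.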
