[Archived] Distributed training with train vocoder gan py error
>>> dyson
[March 4, 2021, 12:04pm]
Hello everyone,
So far, I've successfully trained a model with Tacotron 2, and the
speech synthesized with the Universal FullBand-MelGAN sounds okay. To
further improve the quality and make the synthesized voice sound more
similar to the original speaker, I decided to train my own vocoder using
the same dataset.
But when I use the following command:
CUDA_VISIBLE_DEVICES='0,1,2' OMP_NUM_THREADS=1 python TTS/bin/distribute.py --script train_vocoder_gan.py --config_path config_vocoder_PWgan.json
I got the following output:
Traceback (most recent call last):
File '/home/ldai/projects/TTS/TTS/bin/train_vocoder_gan.py', line 654, in <module>
main(args)
File '/home/ldai/projects/TTS/TTS/bin/train_vocoder_gan.py', line 559, in main
epoch)
File '/home/ldai/projects/TTS/TTS/bin/train_vocoder_gan.py', line 114, in train
y_hat = model_G(c_G)
File '/home/ldai/anaconda3/envs/mozillatts/lib/python3.6/site-packages/torch/nn/modules/module.py', line 727, in _call_impl
result = self.forward(*input, **kwargs)
File '/home/ldai/anaconda3/envs/mozillatts/lib/python3.6/site-packages/torch/nn/parallel/distributed.py', line 606, in forward
if self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
I didn't get such an error when using train_tacotron.py with the same
command.
Any suggestions?
Library versions:
python=3.6.12
torch=1.7.1
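The error message above points at parameters that never contribute to the loss. As a rough way to check for that, the sketch below runs one forward/backward pass on a plain (non-distributed) module and lists the parameters that received no gradient. `ToyGenerator` and `unused_parameter_names` are hypothetical names made up for illustration; they are not part of TTS, and this thread does not establish which parameters (if any) are unused in the real vocoder model.

```python
import torch
import torch.nn as nn


class ToyGenerator(nn.Module):
    """Hypothetical stand-in for a generator; NOT the TTS vocoder model."""

    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)  # registered but never called in forward()

    def forward(self, x):
        return self.used(x)


def unused_parameter_names(model, example_input):
    """Return names of trainable parameters that got no gradient after one backward pass."""
    # Clear any stale gradients so that "grad is None" really means "not used".
    for p in model.parameters():
        p.grad = None
    # Assumption: the model returns a single tensor; reduce it to a scalar loss.
    loss = model(example_input).sum()
    loss.backward()
    return [
        name
        for name, p in model.named_parameters()
        if p.requires_grad and p.grad is None
    ]


if __name__ == "__main__":
    print(unused_parameter_names(ToyGenerator(), torch.randn(2, 4)))

    # If such parameters exist and cannot be removed, the option named in the
    # error message is passed when wrapping the model, for example:
    #   torch.nn.parallel.DistributedDataParallel(
    #       model, device_ids=[rank], find_unused_parameters=True)
    # where `model` and `rank` come from the training script (assumed here).
```

Running the check on the unwrapped module keeps it independent of the distributed setup, so it can be tried on a single GPU or on CPU before relaunching with `distribute.py`.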
[This is an archived TTS discussion thread from discourse.mozilla.org/t/distributed-training-with-train-vocoder-gan-py-error]