>>> GuyEP
[February 21, 2021, 11:11pm]

After about a day of poring over everything, I've been successful at getting a custom LJSpeech-style data set together, and am currently running CPU-based training using the default Tacotron config under /TTS/tts/configs/config.json. But reading through recent posts in the issue queue and here in the forums, I'm realizing that there's a lot I don't yet understand.

For one, what's the difference between a TTS model and a vocoder? I understand that both need to be trained -- are they trained separately, or together? In other words, after I'm done running this training with the Tacotron config, do I need to train all over again if I want to use MelGAN or WaveGrad? What does that process look like?

Right now I'm running training off the master branch; are there changes in the dev branch that would make this process better/faster/etc.?

Since I'm running this in WSL2, I don't have access to CUDA despite having an NVIDIA graphics card. Which component(s) need to talk to CUDA, and do you know of a way to make use of CUDA without an Insider build of Windows 10?
[This is an archived TTS discussion thread from discourse.mozilla.org/t/trying-to-understand-the-high-level-architecture]
> what's the difference between a TTS and a vocoder?
As I understand it, the TTS model takes text and generates an intermediate audio representation (visually, a mel spectrogram plus its underlying data). The vocoder takes that representation and translates it into the actual, audible waveform. Hence they are separate models with their own parameters.
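If it helps to see the split concretely, here's a minimal sketch using torchaudio rather than the TTS repo itself. The file name and audio parameters are placeholders (the values below are LJSpeech-ish defaults; match them to your config). It fakes stage 1 by computing a mel from real audio, then does stage 2 with Griffin-Lim, a crude, training-free classical stand-in for a neural vocoder:

```python
import torch
import torchaudio

SR, N_FFT, HOP, N_MELS = 22050, 1024, 256, 80

wav, sr = torchaudio.load("sample.wav")  # placeholder path, mono 22050 Hz clip

# Stage 1 output: what a TTS model like Tacotron predicts from text.
# Here we just compute it from audio to show the representation.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SR, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS)(wav)

# Stage 2: the vocoder's job is mel -> waveform. A neural vocoder
# (MelGAN, WaveGrad, ...) learns this mapping; Griffin-Lim only
# estimates phase iteratively and needs no training.
linear = torchaudio.transforms.InverseMelScale(
    n_stft=N_FFT // 2 + 1, n_mels=N_MELS, sample_rate=SR)(mel)
wav_out = torchaudio.transforms.GriffinLim(n_fft=N_FFT, hop_length=HOP)(linear)
torchaudio.save("reconstructed.wav", wav_out, SR)
```

A trained neural vocoder replaces the last few lines and sounds far better, but the interface is the same: mel in, waveform out. That's also why you can swap vocoders without retraining Tacotron, as long as the mel parameters match.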
> are they trained separately, or together?
I don't know for sure. I fiddled with Tacotron 2 separately and used the published vocoder for inference, and things came out okay-ish (given the amount of data I had and the number of steps I had put it through). I then fine-tuned the published vocoder with the same data set (ground truth), and some sentences sounded better while some sounded slightly worse -- but the voice did change to resemble the target voice. So you can definitely train them separately, but I am not sure what the optimal order is, or whether it is recommended to train them together; I had asked but never got an answer. There is a 'recipe' section somewhere on GitHub, you may want to try that.
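If you go the fine-tuning route, the key point is that the vocoder trains on (mel, wav) pairs extracted with the same audio settings as your TTS config -- otherwise it learns mismatched features. A hypothetical sketch of dumping those ground-truth features (paths and parameter values are placeholders, not the repo's actual extraction script):

```python
from pathlib import Path
import torch
import torchaudio

# Must mirror the "audio" section of your TTS config.
SR, N_FFT, HOP, N_MELS = 22050, 1024, 256, 80
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SR, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS)

for wav_path in Path("LJSpeech-1.1/wavs").glob("*.wav"):
    wav, sr = torchaudio.load(wav_path)
    assert sr == SR, f"resample {wav_path} to {SR} Hz first"
    # Save one mel tensor per clip as the vocoder's training target input.
    torch.save(to_mel(wav), wav_path.with_suffix(".mel.pt"))
```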
> am currently running CPU-based training using the default Tacotron
> config under /TTS/tts/configs/config.json
Yikes, good luck, hope you have a lot of time to spare.
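On the WSL2 question: as you say, CUDA passthrough in WSL2 currently needs an Insider build of Windows 10, and I don't know of a way around that. Training is the component that really needs the GPU; PyTorch will silently fall back to CPU if it can't see one, so it's worth a quick check before committing to a long run:

```python
# If this prints False under WSL2, training will run on CPU only.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```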