>>> GuyEP
[February 21, 2021, 11:11pm]

After about a day of poring over everything, I've been successful at getting a custom LJSpeech-style data set together, and am currently running CPU-based training using the default Tacotron config under /TTS/tts/configs/config.json. But reading through recent posts in the issue queue and here in the forums, I'm realizing that there's a lot I don't yet understand.

For one, what's the difference between a TTS model and a vocoder? I understand that both need to be trained -- are they trained separately, or together? In other words, after I'm done running this training with the Tacotron config, do I need to train all over again if I want to use MelGAN or WaveGrad? What does that process look like?

Right now I'm running training off the master branch; are there changes in the dev branch that would make this process better/faster/etc.?

Since I'm running this in WSL2, I don't have access to CUDA despite having an NVIDIA graphics card. Which component(s) need to talk to CUDA, and do you know of a way to make use of CUDA without an Insider build of Windows 10?
[This is an archived TTS discussion thread from discourse.mozilla.org/t/trying-to-understand-the-high-level-architecture]
> what's the difference between a TTS and a vocoder?
As I understand it, the TTS model takes text and generates an intermediate audio representation (visually, a mel spectrogram plus its underlying data). The vocoder takes that representation and translates it into the actual, audible waveform. Hence they are separate models with their own parameters.
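If it helps to see the split concretely, here's a minimal sketch using torchaudio rather than the TTS repo itself. The file name and audio parameters are placeholders (the values below are LJSpeech-ish defaults; match them to your config). It fakes stage 1 by computing a mel from real audio, then does stage 2 with Griffin-Lim, a crude, training-free classical stand-in for a neural vocoder:

```python
import torch
import torchaudio

SR, N_FFT, HOP, N_MELS = 22050, 1024, 256, 80

wav, sr = torchaudio.load("sample.wav")  # placeholder path, mono 22050 Hz clip

# Stage 1 output: what a TTS model like Tacotron predicts from text.
# Here we just compute it from audio to show the representation.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SR, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS)(wav)

# Stage 2: the vocoder's job is mel -> waveform. A neural vocoder
# (MelGAN, WaveGrad, ...) learns this mapping; Griffin-Lim only
# estimates phase iteratively and needs no training.
linear = torchaudio.transforms.InverseMelScale(
    n_stft=N_FFT // 2 + 1, n_mels=N_MELS, sample_rate=SR)(mel)
wav_out = torchaudio.transforms.GriffinLim(n_fft=N_FFT, hop_length=HOP)(linear)
torchaudio.save("reconstructed.wav", wav_out, SR)
```

A trained neural vocoder replaces the last few lines and sounds far better, but the interface is the same: mel in, waveform out. That's also why you can swap vocoders without retraining Tacotron, as long as the mel parameters match.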
> are they trained separately, or together?
I don't know for sure. I fiddled with Tacotron 2 separately and used the published vocoder for inference, and things came out okay-ish (given the amount of data I had and the number of steps I had put it through). I then fine-tuned the published vocoder with the same data set (ground truth), and some sentences sounded better while some sounded slightly worse -- but the voice did change to resemble the target voice. So you can definitely train them separately, but I am not sure what the optimal order is, or whether it is recommended to train them together; I had asked but never got an answer. There is a 'recipe' section somewhere on GitHub, you may want to try that.
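If you go the fine-tuning route, the key point is that the vocoder trains on (mel, wav) pairs extracted with the same audio settings as your TTS config -- otherwise it learns mismatched features. A hypothetical sketch of dumping those ground-truth features (paths and parameter values are placeholders, not the repo's actual extraction script):

```python
from pathlib import Path
import torch
import torchaudio

# Must mirror the "audio" section of your TTS config.
SR, N_FFT, HOP, N_MELS = 22050, 1024, 256, 80
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SR, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS)

for wav_path in Path("LJSpeech-1.1/wavs").glob("*.wav"):
    wav, sr = torchaudio.load(wav_path)
    assert sr == SR, f"resample {wav_path} to {SR} Hz first"
    # Save one mel tensor per clip as the vocoder's training target input.
    torch.save(to_mel(wav), wav_path.with_suffix(".mel.pt"))
```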
> am currently running CPU-based training using the default Tacotron
> config under /TTS/tts/configs/config.json
Yikes, good luck, hope you have a lot of time to spare.
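On the WSL2 question: as you say, CUDA passthrough in WSL2 currently needs an Insider build of Windows 10, and I don't know of a way around that. Training is the component that really needs the GPU; PyTorch will silently fall back to CPU if it can't see one, so it's worth a quick check before committing to a long run:

```python
# If this prints False under WSL2, training will run on CPU only.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```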