Replies: 19 comments
[Reply bodies from georroussos, julian.weber, and erogol were not preserved in the archive.]
-
>>> julian.weber
[August 9, 2020, 10:21am]
Hello,
Since training a vocoder takes time and compute, I'd like to train and
contribute a universal vocoder that works for most use cases.
I have compute, but I'm no expert on TTS and I'd like help choosing
hyperparameters and tuning the config file. This has been done with
WaveRNN before and it worked very well.
I'd like to do the same with faster inference speed, to cover more use
cases, by using either MelGAN or PWGAN on the same LibriTTS dataset.
According to my understanding, the sample rate of the dataset used to
train Tacotron doesn't really matter, because it shouldn't affect the mel
spectrogram (I'm not so sure about that); the only parameters that should
affect it are: [parameter list lost in archiving] (…should be fixable
without retraining).
And so, still according to my understanding, these are the parameters
that must be shared by all models that use the same vocoder.
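The idea of shared audio parameters can be sketched as a small compatibility check. This is only an illustration: the field names below (`sample_rate`, `num_mels`, `fft_size`, `hop_length`, `win_length`, `mel_fmin`, `mel_fmax`) are assumptions modelled on typical TTS audio configs, not taken from any particular release.

```python
# Sketch: verify that a TTS model and a vocoder agree on the audio
# parameters that shape the mel spectrogram. Field names are assumed.

SHARED_AUDIO_KEYS = [
    "sample_rate", "num_mels", "fft_size",
    "hop_length", "win_length", "mel_fmin", "mel_fmax",
]

def mismatched_audio_params(tts_audio: dict, vocoder_audio: dict) -> list:
    """Return the shared keys on which the two configs disagree."""
    return [
        key for key in SHARED_AUDIO_KEYS
        if tts_audio.get(key) != vocoder_audio.get(key)
    ]

tts_cfg = {"sample_rate": 22050, "num_mels": 80, "fft_size": 1024,
           "hop_length": 256, "win_length": 1024,
           "mel_fmin": 0.0, "mel_fmax": 8000.0}
vocoder_cfg = dict(tts_cfg, sample_rate=24000)  # deliberate mismatch

print(mismatched_audio_params(tts_cfg, vocoder_cfg))  # -> ['sample_rate']
```

A universal vocoder would then be usable with any TTS model whose config passes this check without retraining either side.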
The vocoder's output sample rate shouldn't matter too much, but I think
that 16 kHz instead of LibriTTS's 24 kHz should give a 33% boost in
inference performance. (I'm not so sure about that either, since PWGAN
and MelGAN are far more parallelised than WaveRNN.)
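The 33% figure comes straight from the sample counts: one second of audio at 16 kHz is a third fewer samples than at 24 kHz, which translates into a proportional speedup only under the rough assumption that inference cost scales linearly with the number of output samples (less true for parallel models like MelGAN/PWGAN). A quick sanity check:

```python
# Back-of-envelope: fewer output samples per second of speech.
sr_high = 24000  # LibriTTS native sample rate
sr_low = 16000   # proposed vocoder output rate

samples_saved = 1 - sr_low / sr_high
print(f"{samples_saved:.0%} fewer samples per second")  # prints "33% fewer samples per second"
```

For a strictly autoregressive vocoder like WaveRNN this maps almost directly to wall-clock time; for MelGAN/PWGAN the gain would likely be smaller.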
What do you think? Is it a good idea? Am I off in my understanding of
the TTS process?
[This is an archived TTS discussion thread from discourse.mozilla.org/t/training-a-universal-vocoder]