
Taking Tacotron2 output to wavenet vocoder #30

Closed
danshirron opened this issue Mar 6, 2018 · 19 comments

@danshirron

Has anyone had experience with the above?
I assume the audio hparams need to match between the two. My reasoning for LJSpeech:

Settings for the Tacotron-2 implementation (https://github.com/Rayhane-mamah/Tacotron-2):
num_mels=80
num_freq=1025; the wavenet code uses fft_size=1024, while in T2 num_freq=1025 implies fft_size=(1025-1)*2=2048. As far as I understand, I can keep this as is, since it all gets folded into the mel bands anyway.
sample_rate=22050 (matching the LJSpeech dataset)
frame_length_ms=46.44 (corresponds to wavenet's fft_size: 1024/22050 ≈ 46.44 ms)
frame_shift_ms=11.61 (corresponds to wavenet's hop_size=256: 256/22050 ≈ 11.61 ms)
preemphasis: not available in the r9y9 wavenet implementation
Others: T2 doesn't have fmin (125 in wavenet) or fmax (7600 in wavenet). Looking into the T2 code,
the spectrogram fmin is set to 0 and fmax to fsample/2 = 22050/2 = 11025 Hz. Since I'm using a pretrained wavenet model, I guess I'll need to change these params in the T2 code.

Any remarks, suggestions?
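The arithmetic relating the two codebases' settings can be sketched as a quick sanity check; the values below are the defaults discussed above (this is an illustration, not code from either repo):

```python
# Derive the time-domain frame parameters from the STFT settings and confirm
# they line up with the hparams quoted above.
sample_rate = 22050          # LJSpeech
hop_size = 256               # wavenet_vocoder default
fft_size = 1024              # wavenet_vocoder default

frame_shift_ms = hop_size / sample_rate * 1000    # ~11.61 ms
frame_length_ms = fft_size / sample_rate * 1000   # ~46.44 ms
num_freq = fft_size // 2 + 1                      # linear-frequency bins for this fft_size

print(round(frame_shift_ms, 2), round(frame_length_ms, 2), num_freq)
```

Note that T2's num_freq=1025 corresponds to fft_size=2048 by the same formula, which is why the two repos quote different fft sizes for the same frame timing.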

@neverjoe

neverjoe commented Mar 7, 2018

At the very least, you should adjust the upsampling layers to match your hop_size (equivalently, your frame_shift_ms). In my experience, freezing the other layers and retraining just the upsampling layers works.
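The constraint behind this advice is that the transposed-convolution upsampling scales must multiply out to hop_size, so each mel frame expands to exactly one hop of audio samples. A minimal sketch (the scale factors shown are an assumption, one common factorization of 256, not taken from this thread):

```python
# The product of the upsample scales must equal the mel hop size, so that
# one conditioning frame covers exactly hop_size audio samples.
hop_size = 256
upsample_scales = [4, 4, 4, 4]  # assumed factorization; 4*4*4*4 = 256

product = 1
for s in upsample_scales:
    product *= s

assert product == hop_size, "upsample_scales must multiply out to hop_size"
print(product)
```

If you change frame_shift_ms (and hence hop_size), you must pick a new factorization, which is why those layers need retraining.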

@imdatceleste

@danshirron: I just did that. I used the same training data for both and did NOT change much in hparams (apart from the sample rate, in my case, and the batch size). I generated a mel in T2, reshaped it (there is a bug in T2), and fed it to wavenet_vocoder. It works very nicely, though my results are not yet really good (incomprehensible audio from T2 as well as from wavenet). But the incomprehensible audio sounded identical, and the wavenet output is significantly smoother :)

@r9y9
Owner

r9y9 commented May 9, 2018

Good news: I just tried this and found it works nicely. I'm planning to make a notebook for a Tacotron2 + WaveNet text-to-speech demo that can be run on Google Colab. Two samples attached.

taco2.zip

@nikita-smetanin

@r9y9 The samples sound amazing. Could you please share the recipe for training these models? These don't sound like the default hparams.

@rafaelvalle

Samples sound great! Would love to see the Taco2 and Wavenet params.

@r9y9
Owner

r9y9 commented May 10, 2018

https://r9y9.github.io/wavenet_vocoder/ I uploaded Tacotron2 samples. You can find links to the hyperparameters on the page, but here you go:

The WaveNet params are the same as the defaults except for max_time_steps (I tried 10000 this time instead of 8000). I think 8000 should work too.

Training recipe:

Synthesis recipe: combine them sequentially.

Does this help you?:)

@nikita-smetanin

@r9y9 Great, thanks! I'll try to reproduce it. Do you mean WaveNet fine-tuning with teacher forcing, as you did in the r9y9/deepvoice3_pytorch#21 pull, or simple fine-tuning on some dataset?

@r9y9
Owner

r9y9 commented May 10, 2018

I meant just continuing training on the same dataset (LJSpeech) that the pretrained model was trained on. This is because there was a bug (#33) when I trained the pretrained model. I didn't use predicted mel-spectrograms for training WaveNet as I did in r9y9/deepvoice3_pytorch#21. That should improve quality, but I wanted to try the simpler case first.

@rafaelvalle

@r9y9 Did you train the Taco 2 model with ARPAbet?

@PetrochukM
Contributor

@r9y9 What was your loss at the end of training?

@r9y9
Owner

r9y9 commented May 12, 2018

@rafaelvalle I used https://github.com/Rayhane-mamah/Tacotron-2. I believe this uses ARPAbet for the network input.

@r9y9
Owner

r9y9 commented May 12, 2018

@PetrochukM Here is the final training curve (~300k steps) for the WaveNet model used in my demo:
[screenshot from 2018-05-12]

@neverjoe

@r9y9 Do you use np.interp to scale the mel features? I found that my results don't sound right.

@r9y9
Owner

r9y9 commented May 14, 2018

@neverjoe Check out https://colab.research.google.com/github/r9y9/Colaboratory/blob/master/Tacotron2_and_WaveNet_text_to_speech_demo.ipynb. It's the complete recipe used for my experiment. In it you will find:

  # Range [0, 4] was used for training Tacotron2 but WaveNet vocoder assumes [0, 1]
  c = np.interp(c, (0, 4), (0, 1))

@rafaelvalle

FYI, it's also possible to denormalize and then normalize back again...
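A sketch of what that denormalize-then-renormalize round trip could look like, under assumptions not stated in the thread: T2 is taken to normalize dB-scaled mels into [0, 4] with the usual min_level_db = -100, and the vocoder to expect the same mels in [0, 1]. The helper names are hypothetical:

```python
import numpy as np

MIN_LEVEL_DB = -100.0  # assumed; common default in both repos' lineage

def t2_denormalize(S, max_abs=4.0):
    # Map T2's [0, max_abs] range back to dB (relative to the reference level).
    return np.clip(S, 0.0, max_abs) / max_abs * -MIN_LEVEL_DB + MIN_LEVEL_DB

def wavenet_normalize(S_db):
    # Map dB back to the [0, 1] range the vocoder conditioning expects.
    return np.clip((S_db - MIN_LEVEL_DB) / -MIN_LEVEL_DB, 0.0, 1.0)

c = np.linspace(0.0, 4.0, 80 * 100).reshape(80, 100)  # toy T2-style mel
c01 = wavenet_normalize(t2_denormalize(c))
```

With these particular linear mappings the round trip reduces to c / 4, i.e. the same result as `np.interp(c, (0, 4), (0, 1))`; going through dB only matters when the two sides use different min levels or non-symmetric ranges.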

@stale

stale bot commented May 30, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label May 30, 2019
@stale stale bot closed this as completed Jun 6, 2019
@ahmed-fau

@r9y9 Is it required to use that line in the case of copy-synthesis too?
c = np.interp(c, (0, 4), (0, 1))
Or can I just use the mel computed by the preprocess.py script and feed it to the model directly?

@r9y9
Owner

r9y9 commented Jul 9, 2020

Not required.

@ahmed-fau

@r9y9 I am trying to use the WaveNet vocoder for copy-synthesis but am still unable to generate the target signal correctly; the steps for using a pretrained model are not as clear as in the TTS Colab notebook. I used the preprocess.py script with the wavallin option to create the mel of the signal and then fed it to the pretrained LJ model for implementation v0.1.1. However, the generated signal is pure noise. What do you think the problem is?

BTW: I inspected the value range of the mel obtained from my LJ signal and it was between [-5, 0.1]. Does this mean the mel is not normalized, and that I should use another preprocessing script to get it between [0, 1]?
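A value range like that is exactly the kind of mismatch a quick pre-flight check catches: the pretrained LJSpeech vocoder expects conditioning in [0, 1], so negative values indicate raw (un-normalized) log-mels. A minimal sketch; `check_mel_range` is a hypothetical helper, not part of the repo:

```python
import numpy as np

def check_mel_range(c, lo=0.0, hi=1.0, tol=1e-6):
    # Return True only if every mel value lies inside the range the
    # pretrained vocoder was conditioned on during training.
    c = np.asarray(c)
    return bool(c.min() >= lo - tol and c.max() <= hi + tol)

# Toy examples: one correctly normalized mel, one raw log-mel-like array.
mel_normalized = np.linspace(0.0, 1.0, 80)[None, :]
mel_raw = np.linspace(-5.0, 0.1, 80)[None, :]

print(check_mel_range(mel_normalized))  # in range: safe to feed the model
print(check_mel_range(mel_raw))         # out of range: renormalize first
```

Feeding out-of-range conditioning typically produces exactly the symptom described above: the model samples noise rather than speech.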
