Taking Tacotron2 output to wavenet vocoder #30
At the very least, you should adjust the upsample layers to match your hop_size or frame_shift_ms; in my experience, just freezing the other layers and retraining works.
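As a rough sketch of the constraint being described, assuming the `upsample_scales`-style hyperparameter used by r9y9/wavenet_vocoder (names and values here are illustrative, not the exact upstream defaults): the conditioning mel gets stretched by the product of the upsample factors, so that product has to equal the hop size.

```python
import numpy as np

hop_size = 256                   # audio frame shift in samples (11.61 ms at 22050 Hz)
upsample_scales = [4, 4, 4, 4]   # per-layer upsampling factors for the conditioning features

# The mel conditioning is upsampled by the product of the factors, so the
# product must match hop_size or the time axes will not line up.
assert int(np.prod(upsample_scales)) == hop_size
```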
@danshirron: I just did that. I used the same training data for both and did NOT change much in hparams (apart from the sample rate in my case, and the batch size). I generated a mel in T2, reshaped it (there is a bug in T2) and fed it to wavenet_vocoder. It works very nicely, though my results are not yet really good (incomprehensible audio from T2 as well as from wavenet). But the incomprehensible audio sounded identical, and the output of wavenet is significantly smoother :)
Good news: I just tried this and found it works nicely. I'm planning to make a notebook for a Tacotron2 + WaveNet text-to-speech demo that can be run on Google Colab. Two samples attached.
@r9y9 The samples sound amazing. Can you please share the recipe for how to train these models? This doesn't look like the default hparams.
Samples sound great! Would love to see the Taco2 and WaveNet params.
finally...! ref: #30 ref: r9y9/deepvoice3_pytorch#11 ref: Rayhane-mamah/Tacotron-2#30 (comment)
https://r9y9.github.io/wavenet_vocoder/ Uploaded Tacotron2 samples. You can find links to the hyperparameters on the page, but here you are:
The WaveNet params are the same as the defaults except for the training recipe.
Synthesis recipe: combine them sequentially. Does this help you? :)
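For what "combine them sequentially" looks like in practice, here is a minimal sketch with placeholder objects and function names (neither repo exposes exactly this API): Tacotron 2 predicts a mel spectrogram from text, and the WaveNet vocoder then generates the waveform conditioned on that mel.

```python
import numpy as np

# Placeholder pipeline, not the actual API of either repository; `tacotron2`
# and `wavenet_vocoder` stand in for whatever synthesis entry points your
# checkouts expose.
def text_to_speech(text, tacotron2, wavenet_vocoder):
    mel = tacotron2.synthesize(text)         # predicted mel spectrogram, shape (T, num_mels)
    mel = np.interp(mel, (0, 4), (0, 1))     # value-range fix if the two models normalize mels differently (see below)
    wav = wavenet_vocoder.synthesize(c=mel)  # autoregressive waveform generation conditioned on the mel
    return wav
```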
@r9y9 Great, thanks! I will try to reproduce it. Do you mean WaveNet fine-tuning with teacher-forcing like you did in the r9y9/deepvoice3_pytorch#21 pull, or simple fine-tuning on some dataset?
I meant just continuing training with the same dataset (LJSpeech) that the pretrained model was trained on. This is because there was a bug (#33) when I trained the pretrained model. I didn't use predicted mel-spectrograms for training WaveNet like I did in r9y9/deepvoice3_pytorch#21. That should improve quality, but I wanted to try the simpler case first.
@r9y9 Did you train the Taco 2 model with ARPAbet?
@r9y9 What was your loss at the end of training?
@rafaelvalle I used https://github.com/Rayhane-mamah/Tacotron-2. I believe this uses ARPAbet for the network input.
@PetrochukM The last training curve (~300k steps) for the WaveNet used in my demo.
@r9y9 Do you use …
@neverjoe Check out https://colab.research.google.com/github/r9y9/Colaboratory/blob/master/Tacotron2_and_WaveNet_text_to_speech_demo.ipynb. It's the complete recipe used for my experiment. From that you will find:

```python
# Range [0, 4] was used for training Tacotron2 but WaveNet vocoder assumes [0, 1]
c = np.interp(c, (0, 4), (0, 1))
```
FYI, it's also possible to denormalize and normalize back again...
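A minimal sketch of that denormalize/renormalize route, assuming the usual min_level_db = -100 convention and Tacotron-2's max_abs_value = 4 normalization range; the helper names are illustrative, not the exact functions in either repo, and under these assumptions the composition reduces to the same linear rescaling as the np.interp line above.

```python
import numpy as np

min_level_db = -100.0   # assumed; the common default in both codebases
max_abs_value = 4.0     # Tacotron-2 style normalization range [0, max_abs_value]

def t2_denormalize(c):
    # Undo Tacotron-2's [0, max_abs_value] normalization back to dB scale.
    return (np.clip(c, 0, max_abs_value) / max_abs_value) * -min_level_db + min_level_db

def wavenet_normalize(c_db):
    # Re-normalize the dB-scale mel to [0, 1] as the wavenet_vocoder preprocessing expects.
    return np.clip((c_db - min_level_db) / -min_level_db, 0, 1)

c = np.random.uniform(0, 4, size=(100, 80))   # stand-in for a Tacotron-2 mel in [0, 4]
c01 = wavenet_normalize(t2_denormalize(c))
assert np.allclose(c01, np.interp(c, (0, 4), (0, 1)))
```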
@r9y9 Is it required to use that line in the case of copy-synthesis too?
Not required.
@r9y9 I am trying to use the WaveNet vocoder for copy-synthesis, but I am still not able to generate the target signal correctly; the steps for using the pretrained model are not as clear as in the TTS colab notebook. I used the …
BTW: I inspected the value range of the mel obtained from my LJ signal and it was in the range [-5.0.1]. Does this mean that the mel is not normalized, and that I should use another preprocessing script to get it into [0, 1]?
Has anyone had experience with the above?
I guess the audio hparams need to be the same for both. My intuition for using LJSpeech:
Settings for the Tacotron 2 implementation (https://github.com/Rayhane-mamah/Tacotron-2); a quick sanity check of the conversions is sketched after this list:
num_mels=80
num_freq=1025; in the wavenet code fft_size=1024, while in T2 fft_size=(1025-1)*2=2048. As far as I understand I can keep this as is, since both end up projected onto the same mel bands anyway.
sample_rate=22050 (the native rate of the LJSpeech dataset)
frame_length_ms=46.44 (corresponds to wavenet's fft_size: 1024/22050 ≈ 46.44 ms).
frame_shift_ms=11.61 (corresponds to wavenet's hop_size=256: 256/22050 ≈ 11.61 ms).
preemphasis: not available in r9y9's wavenet implementation.
Others: in T2 I don't have fmin (125 in wavenet) or fmax (7600 in wavenet). Looking into the T2 code, the spectrogram fmin is set to 0 and fmax to sample_rate/2 = 22050/2 = 11025 Hz. Since I'm using a pre-trained wavenet model, I guess I'll need to change these params in the T2 code.
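As mentioned above, here is a small sanity check of these conversions; the values are taken from this thread and are assumptions about the two hparams files rather than the real upstream defaults.

```python
sample_rate = 22050                      # LJSpeech native rate

# T2 expresses window/shift in ms, wavenet_vocoder in samples.
def ms_to_samples(ms):
    return int(round(ms / 1000.0 * sample_rate))

assert ms_to_samples(11.61) == 256       # frame_shift_ms -> wavenet hop_size
assert ms_to_samples(46.44) == 1024      # frame_length_ms -> wavenet fft_size

# T2 expresses the FFT via num_freq bins, wavenet_vocoder via fft_size.
num_freq = 1025
t2_fft_size = (num_freq - 1) * 2         # 2048, vs. wavenet's fft_size = 1024

# Mel filterbank limits: T2 defaults vs. what the pretrained wavenet model assumes.
t2_fmin, t2_fmax = 0, sample_rate // 2   # 0 .. 11025 Hz
wavenet_fmin, wavenet_fmax = 125, 7600   # from the wavenet hparams quoted above
```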
Any remarks, suggestions?