Help needed. Trying to get vocoder working with output from an ML Tacotron #24
Comments
I've also tried the following and now I'm getting `RuntimeError: input.size(-1) must be equal to input_size. Expected 80, got 386`:

```python
mel_npy: array = numpy.load(os.path.join(cwd, "tmp.npy"))
mel_npy = mel_npy.reshape((1, mel_npy.shape[0], mel_npy.shape[1]))
mel_tensor: Tensor = torch.tensor(mel_npy).to("cuda")
print(mel_tensor.shape)  # torch.Size([1, 80, 386])
```
```
Traceback (most recent call last):
  File "/home/muksihs/git/Cherokee-TTS/tts-wrapper/uv.py", line 35, in <module>
    main()
  File "/home/muksihs/git/Cherokee-TTS/tts-wrapper/uv.py", line 28, in main
    wav, sr = vocoder.generate(mel_tensor)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/univoc/model.py", line 102, in generate
    mel, _ = self.rnn1(mel)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/torch/nn/modules/rnn.py", line 835, in forward
    self.check_forward_args(input, hx, batch_sizes)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/torch/nn/modules/rnn.py", line 229, in check_forward_args
    self.check_input(input, batch_sizes)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/torch/nn/modules/rnn.py", line 205, in check_input
    raise RuntimeError(
RuntimeError: input.size(-1) must be equal to input_size. Expected 80, got 386
```
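The traceback points at the vocoder's first RNN layer: it expects the 80 mel channels on the last axis, i.e. input shaped `(batch, frames, n_mels)`. A minimal sketch of the fix, assuming `mel_npy` is the raw `(80, 386)` array loaded from disk:

```python
# The RNN expects (batch, frames, n_mels); an (n_mels, frames) array
# therefore needs a transpose before the batch dimension is added.
mel_tensor = torch.from_numpy(mel_npy.T).unsqueeze(0).to("cuda")  # (1, 386, 80)
```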
I finally figured out it needed a transpose, but the generated wav is all silence?

```python
mel_npy: array = numpy.load(os.path.join(cwd, "tmp.npy")).transpose()
mel_npy = mel_npy.reshape((1, mel_npy.shape[0], mel_npy.shape[1]))
mel_tensor: Tensor = torch.tensor(mel_npy).to("cuda")
# generate waveform
with torch.no_grad():
    wav, sr = vocoder.generate(mel_tensor)
# save output
sf.write(os.path.join(cwd, "tmp.wav"), wav, sr)
```
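If the vocoder was trained on normalized log-mels, raw dB values (roughly -80 to 0) sit far outside the input range it saw during training, which can come out as silence. A quick range check before generating (my own sketch, not from the thread):

```python
# Unnormalized log-mels in dB typically span about -80 .. 0; a normalized
# input would lie in a much narrower range such as [-1, 0].
print(mel_npy.min(), mel_npy.max())
```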
The following seems to work. Definitely different sounding...

```python
mel_npy: array = numpy.load(os.path.join(cwd, "tmp.npy")).transpose()
top_db = 80
mel_npy = numpy.maximum(mel_npy, -top_db)
mel_npy = mel_npy / top_db
mel_tensor: Tensor = torch.FloatTensor(mel_npy).unsqueeze(0).to("cuda")
# generate waveform
with torch.no_grad():
    wav, sr = vocoder.generate(mel_tensor)
# save output
sf.write(os.path.join(cwd, "tmp.wav"), wav, sr)
```
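The clipping and division above map dB values into [-1, 0]. A small sanity check along those lines (my addition, not from the thread):

```python
# After numpy.maximum(mel_npy, -top_db) and division by top_db, every value
# should lie in [-1, 0]; values outside suggest the input wasn't in dB.
assert -1.0 <= mel_npy.min() and mel_npy.max() <= 0.0
```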
Hi @michael-conrad,

Apart from the normalization steps, the parameters used to extract the mel-spectrogram need to be the same as the ones used in this repo. From a cursory glance at https://github.com/Tomiinek/Multilingual_Text_to_Speech it looks like their model is trained on spectrograms from 22050Hz audio with a different hop-length and window-length to what I used here.

To fix this you have two options:

1. retrain the vocoder (with some minor modifications) using their spectrograms as the input; or
2. retrain the acoustic model at https://github.com/Tomiinek/Multilingual_Text_to_Speech to produce spectrograms with matching parameters.
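For reference, a minimal sketch of what a matching mel-extraction pipeline might look like with librosa. Every parameter value here (16 kHz sample rate, n_fft 2048, hop 200, window 800, fmin 50, pre-emphasis 0.97) is an assumption to verify against the vocoder's actual training config, not a value confirmed in this thread; the normalization mirrors the snippet earlier in the thread:

```python
import librosa
import numpy as np

# Assumed parameters -- check the vocoder repo's config before relying on these.
wav, sr = librosa.load("speech.wav", sr=16000)
wav = np.append(wav[0], wav[1:] - 0.97 * wav[:-1])  # pre-emphasis (assumed 0.97)

mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=2048, hop_length=200, win_length=800,
    n_mels=80, fmin=50, power=1,
)
logmel = librosa.amplitude_to_db(mel, top_db=80)  # clip dynamic range to 80 dB
logmel = np.maximum(logmel, -80) / 80             # normalize as in the thread's snippet
```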
I'm prepping to run another test with a fork of it. I'm looking in https://github.com/CherokeeLanguage/Cherokee-TTS/blob/master/params/params.py and trying to figure out what to change. I see there is a normalize setting; I think the script https://github.com/CherokeeLanguage/Cherokee-TTS/blob/master/data/prepare_spectrograms.py handles that. Should I figure out how to normalize in this file to match the vocoder?

```python
"""
******************** PARAMETERS OF AUDIO ********************
"""

sample_rate = 22050           # sample rate of source .wavs, used while computing spectrograms, MFCCs, etc.
num_fft = 1102                # number of frequency bins used during computation of spectrograms
num_mels = 80                 # number of mel bins used during computation of mel spectrograms
num_mfcc = 13                 # number of MFCCs, used just for MCD computation (during training)
stft_window_ms = 50           # size in ms of the Hann window of short-time Fourier transform, used during spectrogram computation
stft_shift_ms = 12.5          # shift of the window (or better said gap between windows) in ms
griffin_lim_iters = 60        # used if vocoding using Griffin-Lim algorithm (synthesize.py), greater value does not make much sense
griffin_lim_power = 1.5       # power applied to spectrograms before using GL
normalize_spectrogram = True  # if True, spectrograms are normalized before passing into the model, a per-channel normalization is used;
                              # statistics (mean and variance) are computed from dataset at the start of training
use_preemphasis = True        # if True, a preemphasis is applied to raw waveform before using them (spectrogram computation)
preemphasis = 0.97            # amount of preemphasis, used if use_preemphasis is True
```
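Converting those ms-based STFT settings into samples shows concretely what the vocoder would have to match (my own arithmetic, not from the thread):

```python
# Convert the ms-based STFT settings above into samples at 22050 Hz.
sample_rate = 22050
win_length = int(sample_rate * 50 / 1000)    # 1102 samples (matches num_fft)
hop_length = int(sample_rate * 12.5 / 1000)  # 275 samples

print(win_length, hop_length)  # 1102 275
```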
Hello,
I'm trying to figure out what I need to do so that my numpy array can be vocoded by the UniversalVocoder.
Attached is a sample npy file.
The output is from a modified https://github.com/Tomiinek/Multilingual_Text_to_Speech
tmp.npy.zip
wavernn-vocoded.zip