
Help needed. Trying to get vocoder working with output from an ML Tacotron #24

michael-conrad opened this issue Oct 15, 2021 · 5 comments

@michael-conrad

Hello,

I'm trying to figure out what I need to do so that my numpy array can be vocoded by the UniversalVocoder.

Attached is a sample npy file.

The output is from a modified https://github.com/Tomiinek/Multilingual_Text_to_Speech

import os

import numpy
import soundfile as sf
import torch

from univoc import Vocoder


def main():

    cwd: str = os.getcwd()

    # download pretrained weights (and optionally move to GPU)
    vocoder: Vocoder = Vocoder.from_pretrained(
            "https://github.com/bshall/UniversalVocoding/releases/download/v0.2/univoc-ljspeech-7mtpaq.pt").cuda()

    # load log-Mel spectrogram from file or from tts (see https://github.com/bshall/Tacotron for example)
    mel = numpy.load(os.path.join(cwd, "tmp.npy"))

    # generate waveform
    with torch.no_grad():
        wav, sr = vocoder.generate(mel)

    # save output
    sf.write(os.path.join(cwd, "tmp.wav"), wav, sr)


if __name__ == "__main__":
    main()
Traceback (most recent call last):
  File "/home/muksihs/git/Cherokee-TTS/tts-wrapper/uv.py", line 29, in <module>
    main()
  File "/home/muksihs/git/Cherokee-TTS/tts-wrapper/uv.py", line 22, in main
    wav, sr = vocoder.generate(mel)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/univoc/model.py", line 102, in generate
    mel, _ = self.rnn1(mel)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/torch/nn/modules/rnn.py", line 821, in forward
    max_batch_size = input.size(0) if self.batch_first else input.size(1)
TypeError: 'int' object is not callable

tmp.npy.zip
wavernn-vocoded.zip
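
For reference, the TypeError above is what happens when vocoder.generate receives a numpy array instead of a torch tensor: ndarray.size is a plain int attribute, so the RNN's input.size(0) call fails. A minimal sketch of the type fix (the unsqueeze(0) batch dimension is an assumption about the expected input shape):

    mel = torch.from_numpy(numpy.load(os.path.join(cwd, "tmp.npy"))).unsqueeze(0).cuda()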

@michael-conrad (Author)

I've also tried the following and now I'm getting "RuntimeError: input.size(-1) must be equal to input_size. Expected 80, got 386":

mel_npy: numpy.ndarray = numpy.load(os.path.join(cwd, "tmp.npy"))
# add a batch dimension: (80, 386) -> (1, 80, 386)
mel_npy = mel_npy.reshape((1, mel_npy.shape[0], mel_npy.shape[1]))
mel_tensor: Tensor = torch.tensor(mel_npy).to("cuda")
print(mel_tensor.shape)
# torch.Size([1, 80, 386])
Traceback (most recent call last):
  File "/home/muksihs/git/Cherokee-TTS/tts-wrapper/uv.py", line 35, in <module>
    main()
  File "/home/muksihs/git/Cherokee-TTS/tts-wrapper/uv.py", line 28, in main
    wav, sr = vocoder.generate(mel_tensor)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/univoc/model.py", line 102, in generate
    mel, _ = self.rnn1(mel)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/torch/nn/modules/rnn.py", line 835, in forward
    self.check_forward_args(input, hx, batch_sizes)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/torch/nn/modules/rnn.py", line 229, in check_forward_args
    self.check_input(input, batch_sizes)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/torch/nn/modules/rnn.py", line 205, in check_input
    raise RuntimeError(
RuntimeError: input.size(-1) must be equal to input_size. Expected 80, got 386
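
The shape check fails because the LSTM expects the 80 mel bins on the last axis, i.e. an input of shape (batch, frames, num_mels). The tensor above is (1, 80, 386), so the mel and time axes need to be swapped, along these lines (a sketch of what the next comment arrives at):

    mel_tensor = torch.tensor(mel_npy).transpose(1, 2).to("cuda")  # (1, 80, 386) -> (1, 386, 80)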

@michael-conrad (Author)

I finally figured out it needed a transpose, but the generated wav is all silence?

    # transpose so the mel bins are the last axis: (80, 386) -> (386, 80)
    mel_npy: numpy.ndarray = numpy.load(os.path.join(cwd, "tmp.npy")).transpose()
    # add a batch dimension -> (1, 386, 80)
    mel_npy = mel_npy.reshape((1, mel_npy.shape[0], mel_npy.shape[1]))
    mel_tensor: Tensor = torch.tensor(mel_npy).to("cuda")

    # generate waveform
    with torch.no_grad():
        wav, sr = vocoder.generate(mel_tensor)

        # save output
        sf.write(os.path.join(cwd, "tmp.wav"), wav, sr)
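
If the values going into the vocoder are raw decibels (roughly -80 to 0), they fall far outside the normalized range the model was presumably trained on, which would plausibly collapse the output to silence. A quick, purely illustrative sanity check:

    print(mel_tensor.min().item(), mel_tensor.max().item())  # expect roughly [-1, 0] after normalization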

@michael-conrad (Author)

The following seems to work. Definitely different sounding...

universalvocoding.zip

    mel_npy: numpy.ndarray = numpy.load(os.path.join(cwd, "tmp.npy")).transpose()
    # clip the log-mel at -80 dB and scale it into [-1, 0]
    top_db = 80
    mel_npy = numpy.maximum(mel_npy, -top_db)
    mel_npy = mel_npy / top_db
    mel_tensor: Tensor = torch.FloatTensor(mel_npy).unsqueeze(0).to("cuda")

    # generate waveform
    with torch.no_grad():
        wav, sr = vocoder.generate(mel_tensor)

        # save output
        sf.write(os.path.join(cwd, "tmp.wav"), wav, sr)

@bshall (Owner)

bshall commented Oct 18, 2021

Hi @michael-conrad,

Apart from the normalization steps, the parameters used to extract the mel-spectrogram need to be the same as the ones used in this repo. From a cursory glance at https://github.com/Tomiinek/Multilingual_Text_to_Speech it looks like their model is trained on spectrograms from 22050Hz audio with a different hop-length and window-length from what I used here.

To fix this you have two options: 1. retrain the vocoder (with some minor modifications) using their spectrograms as the input; or 2. retrain the acoustic model at https://github.com/Tomiinek/Multilingual_Text_to_Speech to produce spectrograms with matching parameters.
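
As a rough illustration of what matching parameters would involve, here is a sketch of log-mel extraction with librosa. Every value below (16 kHz audio, 2048-point FFT, 12.5 ms hop, 50 ms window, 80 mel bins, fmin of 50 Hz, top_db of 80, preemphasis of 0.97) is an assumption about this repo's training setup and should be checked against its actual preprocessing config:

    import librosa
    import numpy as np

    def extract_logmel(path: str) -> np.ndarray:
        # assumed front-end parameters; verify against the vocoder's config
        sr, n_fft, n_mels, fmin, top_db = 16000, 2048, 80, 50, 80
        hop_length = int(0.0125 * sr)  # 12.5 ms -> 200 samples at 16 kHz
        win_length = int(0.05 * sr)    # 50 ms -> 800 samples at 16 kHz

        wav, _ = librosa.load(path, sr=sr)
        wav = librosa.effects.preemphasis(wav, coef=0.97)
        mel = librosa.feature.melspectrogram(
            y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length,
            win_length=win_length, n_mels=n_mels, fmin=fmin, power=1,
        )
        # convert amplitude to dB, then clip and scale into [-1, 0] as in the comment above
        logmel = librosa.amplitude_to_db(mel, top_db=None)
        return (np.maximum(logmel, -top_db) / top_db).T  # (frames, n_mels)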

@michael-conrad (Author)

I'm prepping to run another test with a fork of it.

I'm looking at https://github.com/CherokeeLanguage/Cherokee-TTS/blob/master/params/params.py and trying to figure out what to change. I see there is a normalize setting, and I think the script https://github.com/CherokeeLanguage/Cherokee-TTS/blob/master/data/prepare_spectrograms.py handles that. Should I figure out how to normalize in that file to match the vocoder?

"""
    ******************** PARAMETERS OF AUDIO ********************
    """

    sample_rate = 22050  # sample rate of source .wavs, used while computing spectrograms, MFCCs, etc.
    num_fft = 1102  # number of frequency bins used during computation of spectrograms
    num_mels = 80  # number of mel bins used during computation of mel spectrograms
    num_mfcc = 13  # number of MFCCs, used just for MCD computation (during training)
    stft_window_ms = 50  # size in ms of the Hann window of short-time Fourier transform, used during spectrogram computation
    stft_shift_ms = 12.5  # shift of the window (or better said gap between windows) in ms
    griffin_lim_iters = 60  # used if vocoding using Griffin-Lim algorithm (synthesize.py), greater value does not make much sense
    griffin_lim_power = 1.5  # power applied to spectrograms before using GL
    normalize_spectrogram = True  # if True, spectrograms are normalized before passing into the model, a per-channel normalization is used
    # statistics (mean and variance) are computed from dataset at the start of training
    use_preemphasis = True  # if True, a preemphasis is applied to raw waveform before using them (spectrogram computation)
    preemphasis = 0.97  # amount of preemphasis, used if use_preemphasis is True
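
For comparison, the mismatch bshall points out can be tabulated directly; the vocoder-side values here are assumptions, not confirmed numbers. The window and shift agree in milliseconds, so the main differences are the sample rate, the FFT size, and the normalization scheme:

    # hypothetical side-by-side of the two front ends (vocoder values are assumed)
    tts = dict(sample_rate=22050, n_fft=1102, win=int(0.050 * 22050), hop=int(0.0125 * 22050))
    voc = dict(sample_rate=16000, n_fft=2048, win=int(0.050 * 16000), hop=int(0.0125 * 16000))
    for key in tts:
        print(f"{key}: {tts[key]} (tts) vs {voc[key]} (vocoder)")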
