Multi speaker TTS

This code is an implementation of the paper 'Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis', except that the WaveNet vocoder is not implemented. The algorithm is based on the following papers:

Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., ... & Le, Q. (2017). Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135.
Wan, L., Wang, Q., Papir, A., & Moreno, I. L. (2017). Generalized end-to-end loss for speaker verification. arXiv preprint arXiv:1710.10467.
Jia, Y., Zhang, Y., Weiss, R. J., Wang, Q., Shen, J., Ren, F., ... & Wu, Y. (2018). Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis. arXiv preprint arXiv:1806.04558.
Prenger, R., Valle, R., & Catanzaro, B. (2019, May). Waveglow: A flow-based generative network for speech synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 3617-3621). IEEE.

Structure

MSTTS_Structure

The model is divided into three parts that are trained independently of each other: speaker embedding, Tacotron 2, and vocoder. Two types of vocoder can be attached: the Tacotron 1 style and Waveglow.

Used dataset

The currently uploaded code is compatible with the following datasets. Datasets marked [O] were actually used to produce the uploaded results; datasets marked [X] are supported but were not used.

Speaker embedding

[X] VCTK: https://datashare.is.ed.ac.uk/handle/10283/2651
[X] LibriSpeech: http://www.openslr.org/12/
[O] VoxCeleb: http://www.robots.ox.ac.uk/~vgg/data/voxceleb/

Mel to spectrogram

[O] VCTK: https://datashare.is.ed.ac.uk/handle/10283/2651
[O] LibriSpeech: http://www.openslr.org/12/

Waveglow

[O] VCTK: https://datashare.is.ed.ac.uk/handle/10283/2651
Any voice wav files can be used.

Multi speaker TTS

[X] LJSpeech: https://keithito.com/LJ-Speech-Dataset/
[O] VCTK: https://datashare.is.ed.ac.uk/handle/10283/2651
[X] LibriSpeech: http://www.openslr.org/12/
[X] Tedlium: http://www.openslr.org/51/
[O] TIMIT: http://academictorrents.com/details/34e2b78745138186976cbc27939b1b34d18bd5b3

Instructions

Before proceeding, please set the pattern, inference, and checkpoint paths in 'Hyper_Parameter.py' according to your environment.

Training

Speaker embedding

Generate pattern

python -m Speaker_Embedding.Pattern_Generate [options]

option list:
-vctk <path>		Set the path of VCTK. VCTK's patterns are generated.
-ls <path>		Set the path of LibriSpeech. LibriSpeech's patterns are generated.
-vox1 <path>		Set the path of VoxCeleb1. VoxCeleb1's patterns are generated.
-vox2 <path>		Set the path of VoxCeleb2. VoxCeleb2's patterns are generated.
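As an illustration of how the options above combine with the module name, the following minimal sketch assembles the command line in Python. The helper `build_pattern_command` and the `/data/...` paths are hypothetical and only serve the example; it also assumes that several dataset options may be passed in one call, which the `[options]` placeholder suggests but does not guarantee.

```python
import sys


def build_pattern_command(module, dataset_paths):
    """Build the argument list for `python -m <module> -<flag> <path> ...`.

    `dataset_paths` maps an option name without the leading dash
    (for example 'vctk') to the dataset's root directory.
    """
    command = [sys.executable, "-m", module]
    for flag, path in dataset_paths.items():
        command.extend(["-" + flag, path])
    return command


# Placeholder paths: replace them with the dataset locations on your machine.
command = build_pattern_command(
    "Speaker_Embedding.Pattern_Generate",
    {"vctk": "/data/VCTK", "vox1": "/data/VoxCeleb1"},
)
print(" ".join(command[1:]))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the repository root, which is equivalent to typing the command in a shell.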

To verify the model during training, set the paths of the inference files by editing 'Speaker_Embedding_Inference_in_Train.txt'.

Run

python -m Speaker_Embedding.Speaker_Embedding

Mel to spectrogram

Generate pattern

python -m Taco1_Mel_to_Spect.Pattern_Generate [options]

option list:
-vctk <path>		Set the path of VCTK. VCTK's patterns are generated.
-ls <path>		Set the path of LibriSpeech. LibriSpeech's patterns are generated.

To verify the model during training, set the paths of the inference files by editing 'Mel_to_Spect_Inference_in_Train.txt'.

Run

python -m Taco1_Mel_to_Spect.Taco1_Mel_to_Spect

Waveglow

There is no pattern generation step. Waveglow uses wav files directly as patterns.

To verify the model during training, set the paths of the inference files by editing 'WaveGlow_Inference_File_Path_in_Train.txt'.

Run

python -m WaveGlow.WaveGlow

Multi speaker TTS

Generate pattern

python Pattern_Generate.py [options]

option list:
-lj <path>		Set the path of LJSpeech. LJSpeech's patterns are generated.
-vctk <path>		Set the path of VCTK. VCTK's patterns are generated.
-ls <path>		Set the path of LibriSpeech. LibriSpeech's patterns are generated.
-tl <path>		Set the path of Tedlium. Tedlium's patterns are generated.
-timit <path>		Set the path of TIMIT. TIMIT's patterns are generated.
-all		Save all patterns. The generator ignores the 'Use_Wav_Length_Range' hyper parameter. If this option is not set, only patterns matching 'Use_Wav_Length_Range' are generated.
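The relationship between the -all option and 'Use_Wav_Length_Range' can be pictured with the small sketch below. It is not the repository's code: the function `keep_pattern` is hypothetical, and it assumes that 'Use_Wav_Length_Range' is a (minimum, maximum) pair with inclusive bounds, expressed in the same unit as the measured wav length.

```python
def keep_pattern(wav_length, use_wav_length_range, save_all=False):
    """Decide whether a pattern is generated for one wav file.

    wav_length:            length of the wav file (unit is an assumption).
    use_wav_length_range:  assumed (minimum, maximum) pair, same unit.
    save_all:              True when the -all option is set.
    """
    if save_all:
        # -all: the length range is ignored and every pattern is kept.
        return True
    minimum, maximum = use_wav_length_range
    return minimum <= wav_length <= maximum


# Illustrative values only; the real range lives in 'Hyper_Parameter.py'.
print(keep_pattern(5.0, (1.0, 10.0)))
print(keep_pattern(20.0, (1.0, 10.0)))
print(keep_pattern(20.0, (1.0, 10.0), save_all=True))
```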

To verify the model during training, set the paths of the inference files and the sentences by editing 'Inference_Sentence_in_Train.txt'.

Run

python MSTTS_SV.py

Test

Run 'ipython' in the model's directory.

Run the following commands:

from MSTTS_SV import Tacotron2
new_Tacotron2 = Tacotron2(is_Training= False)
new_Tacotron2.Restore()

Set the list of speaker wav paths and the list of texts as in the following example:

path_List = [
    'E:/Multi_Speaker_TTS.Raw_Data/LJSpeech/wavs/LJ040-0143.wav',
    'E:/Multi_Speaker_TTS.Raw_Data/LibriSpeech/train/17/363/17-363-0039.flac',
    'E:/Multi_Speaker_TTS.Raw_Data/VCTK/wav48/p314/p314_020.wav',
    'E:/Multi_Speaker_TTS.Raw_Data/VCTK/wav48/p256/p256_001.wav'
    ]
text_List = [
    'He that has no shame has no conscience.',
    'Who knows much believes the less.',
    'Things are always at their best in the beginning.',
    'Please call Stella.'
    ]

※The two lists must have the same length.
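A quick check before inference can catch a length mismatch early. The helper below is a minimal sketch and is not part of the repository; the name `check_inference_inputs` is hypothetical.

```python
def check_inference_inputs(path_list, text_list):
    """Pair each speaker wav path with its text, or fail if the lengths differ."""
    if len(path_list) != len(text_list):
        raise ValueError(
            "path_List has {} items but text_List has {}".format(
                len(path_list), len(text_list)
            )
        )
    # Each text is synthesized with the voice of the wav at the same index.
    return list(zip(path_list, text_list))
```

Calling `check_inference_inputs(path_List, text_List)` on the lists above either returns the (path, text) pairs or raises a `ValueError` that names both lengths.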

Run the following command:

new_Tacotron2.Inference(
    path_List = path_List,
    text_List = text_List,
    file_Prefix = 'Result'
    )

Result

Speaker embedding

GS_10000

Mel to spectrogram

GS_12000 IDX_1

Waveglow

GS_915000 IDX_0

Currently, the performance of Waveglow is not good.

Multi speaker TTS

Exported wav files: WAV.zip

Trained checkpoint

https://drive.google.com/drive/folders/1wXrJY-gQTOs9yZ7nxvxPaAa6Wf8uF7zP?usp=sharing

Future work

Waveglow performance improvement