
SilverSulfide/multi_speaker_tts

 
 


Multi speaker TTS

This code is an implementation of the paper 'Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis', except that WaveNet is not used as the vocoder. The algorithm is based on the following papers:

Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., ... & Le, Q. (2017). Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135.
Wan, L., Wang, Q., Papir, A., & Moreno, I. L. (2017). Generalized end-to-end loss for speaker verification. arXiv preprint arXiv:1710.10467.
Jia, Y., Zhang, Y., Weiss, R. J., Wang, Q., Shen, J., Ren, F., ... & Wu, Y. (2018). Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis. arXiv preprint arXiv:1806.04558.
Prenger, R., Valle, R., & Catanzaro, B. (2019, May). Waveglow: A flow-based generative network for speech synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 3617-3621). IEEE.

Structure

MSTTS_Structure

The model is divided into three parts that are trained independently of each other: the speaker embedding, Tacotron 2, and the vocoder. Two types of vocoder can be attached: the Tacotron 1 style and Waveglow.
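The data flow between the three parts at inference time can be sketched roughly as follows. This is only an illustration: the function names (`run_pipeline`, `dummy_embed`, `dummy_mel`, `dummy_vocoder`), the list-based data types, and the placeholder values are assumptions made for this sketch and do not correspond to the repository's actual classes.

```python
from typing import Callable, List, Sequence

# Hypothetical data types for this sketch; the real parts are neural networks.
Embedding = List[float]
Mel = List[List[float]]  # frames x mel bins
Wave = List[float]


def run_pipeline(
    reference_wav: Sequence[float],
    text: str,
    embed_speaker: Callable[[Sequence[float]], Embedding],
    synthesize_mel: Callable[[str, Embedding], Mel],
    vocode: Callable[[Mel], Wave],
) -> Wave:
    """Chain the three independently trained parts."""
    embedding = embed_speaker(reference_wav)  # 1. speaker embedding
    mel = synthesize_mel(text, embedding)     # 2. Tacotron 2, conditioned on the speaker
    return vocode(mel)                        # 3. vocoder (Tacotron 1 style or Waveglow)


# Dummy stand-ins so the sketch runs without any trained model.
def dummy_embed(wav: Sequence[float]) -> Embedding:
    return [sum(wav) / max(len(wav), 1)] * 4


def dummy_mel(text: str, embedding: Embedding) -> Mel:
    return [[embedding[0]] * 2 for _ in text]  # one fake frame per character


def dummy_vocoder(mel: Mel) -> Wave:
    return [value for frame in mel for value in frame]


wave = run_pipeline([0.0, 1.0], "hi", dummy_embed, dummy_mel, dummy_vocoder)
print(len(wave))
```

Because each stage only depends on the output of the previous one, any stage can be retrained or swapped (for example, one vocoder for the other) without touching the rest.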

Used dataset

The currently uploaded code is compatible with the following datasets. An [O] mark to the left of a dataset name means that the dataset was actually used for the uploaded results; an [X] mark means that it is supported but was not used.

Speaker embedding

[X] VCTK: https://datashare.is.ed.ac.uk/handle/10283/2651
[X] LibriSpeech: http://www.openslr.org/12/
[O] VoxCeleb: http://www.robots.ox.ac.uk/~vgg/data/voxceleb/

Mel to Spectrogram

[O] VCTK: https://datashare.is.ed.ac.uk/handle/10283/2651
[O] LibriSpeech: http://www.openslr.org/12/

Waveglow

[O] VCTK: https://datashare.is.ed.ac.uk/handle/10283/2651
Any voice wav files can be used.

Multi speaker TTS

[X] LJSpeech: https://keithito.com/LJ-Speech-Dataset/
[O] VCTK: https://datashare.is.ed.ac.uk/handle/10283/2651
[X] LibriSpeech: http://www.openslr.org/12/
[X] Tedlium: http://www.openslr.org/12/
[O] TIMIT: http://academictorrents.com/details/34e2b78745138186976cbc27939b1b34d18bd5b3

Instruction

Before proceeding, please set the pattern, inference, and checkpoint paths in 'Hyper_Parameter.py' according to your environment.

Training

Speaker embedding

Generate pattern

python -m Speaker_Embedding.Pattern_Generate [options]

option list:
-vctk <path>		Set the path of VCTK. VCTK's patterns are generated.
-ls <path>		Set the path of LibriSpeech. LibriSpeech's patterns are generated.
-vox1 <path>		Set the path of VoxCeleb1. VoxCeleb1's patterns are generated.
-vox2 <path>		Set the path of VoxCeleb2. VoxCeleb2's patterns are generated.
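When several datasets are prepared at once, the command line can be assembled programmatically. The following is a minimal sketch, assuming that several dataset options can be combined in one call (the `[options]` placeholder suggests this, but it is not stated explicitly). The helper name `build_pattern_command` and the dataset paths are hypothetical.

```python
import sys
from typing import Dict, List


def build_pattern_command(module: str, dataset_paths: Dict[str, str]) -> List[str]:
    """Build the argument list for `python -m <module> -<flag> <path> ...`."""
    command = [sys.executable, "-m", module]
    for flag, path in dataset_paths.items():
        command += [f"-{flag}", path]
    return command


command = build_pattern_command(
    "Speaker_Embedding.Pattern_Generate",
    {"vctk": "/data/VCTK", "vox1": "/data/VoxCeleb1"},  # placeholder paths
)
print(" ".join(command))

# To actually run it from the repository root:
# import subprocess
# subprocess.run(command, check=True)
```

The same helper would work for the other pattern generators by changing the module name and the flags.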

Set the paths of the files used for verification inference during training by editing 'Speaker_Embedding_Inference_in_Train.txt'.

Run

python -m Speaker_Embedding.Speaker_Embedding

Mel to spectrogram

Generate pattern

python -m Taco1_Mel_to_Spect.Pattern_Generate [options]

option list:
-vctk <path>		Set the path of VCTK. VCTK's patterns are generated.
-ls <path>		Set the path of LibriSpeech. LibriSpeech's patterns are generated.

Set the paths of the files used for verification inference during training by editing 'Mel_to_Spect_Inference_in_Train.txt'.

Run

python -m Taco1_Mel_to_Spect.Taco1_Mel_to_Spect

Waveglow

There is no pattern generation step. Waveglow uses wav files directly as patterns.

Set the paths of the files used for verification inference during training by editing 'WaveGlow_Inference_File_Path_in_Train.txt'.

Run

python -m WaveGlow.WaveGlow

Multi speaker TTS

Generate pattern

python Pattern_Generate.py [options]

option list:
-lj <path>		Set the path of LJSpeech. LJSpeech's patterns are generated.
-vctk <path>		Set the path of VCTK. VCTK's patterns are generated.
-ls <path>		Set the path of LibriSpeech. LibriSpeech's patterns are generated.
-tl <path>		Set the path of Tedlium. Tedlium's patterns are generated.
-timit <path>		Set the path of TIMIT. TIMIT's patterns are generated.
-all		Save all patterns. The generator ignores the 'Use_Wav_Length_Range' hyper parameter. If this option is not set, only patterns matching 'Use_Wav_Length_Range' are generated.
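The effect of the '-all' option on length filtering can be illustrated with a small sketch. The function `select_wavs`, the use of seconds as the unit, and the inclusive bounds are assumptions made for illustration; the actual semantics of 'Use_Wav_Length_Range' are defined in 'Hyper_Parameter.py' and the generator code.

```python
from typing import Iterable, List, Tuple


def select_wavs(
    durations: Iterable[Tuple[str, float]],
    length_range: Tuple[float, float],
    save_all: bool = False,
) -> List[str]:
    """Return the files kept for pattern generation.

    durations:    (path, duration) pairs; the unit is assumed to be seconds.
    length_range: (min, max) bounds, assumed inclusive.
    save_all:     mimics '-all' by ignoring the length range.
    """
    low, high = length_range
    return [path for path, d in durations if save_all or low <= d <= high]


files = [("a.wav", 0.5), ("b.wav", 3.0), ("c.wav", 12.0)]
print(select_wavs(files, (1.0, 10.0)))
print(select_wavs(files, (1.0, 10.0), save_all=True))
```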

Set the inference file paths and sentences used for verification during training by editing 'Inference_Sentence_in_Train.txt'.

Run

python MSTTS_SV.py

Test

Run 'ipython' in the model's directory.

Run the following commands:

from MSTTS_SV import Tacotron2
new_Tacotron2 = Tacotron2(is_Training= False)
new_Tacotron2.Restore()

Set the speaker's Wav path list and text list like the following example:

path_List = [
    'E:/Multi_Speaker_TTS.Raw_Data/LJSpeech/wavs/LJ040-0143.wav',
    'E:/Multi_Speaker_TTS.Raw_Data/LibriSpeech/train/17/363/17-363-0039.flac',
    'E:/Multi_Speaker_TTS.Raw_Data/VCTK/wav48/p314/p314_020.wav',
    'E:/Multi_Speaker_TTS.Raw_Data/VCTK/wav48/p256/p256_001.wav'
    ]
text_List = [
    'He that has no shame has no conscience.',
    'Who knows much believes the less.',
    'Things are always at their best in the beginning.',
    'Please call Stella.'
    ]

※ The two lists must have the same length.
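A quick check before inference can catch mismatched lists early. This is a minimal sketch; `check_inference_inputs` is a hypothetical helper, not part of the repository, and the file-existence check is an extra convenience that the model itself may not require.

```python
import os
from typing import Sequence


def check_inference_inputs(
    path_List: Sequence[str],
    text_List: Sequence[str],
    require_files: bool = True,
) -> None:
    """Raise an error if the two lists cannot be paired one-to-one."""
    if len(path_List) != len(text_List):
        raise ValueError(
            f"path_List has {len(path_List)} items but text_List has {len(text_List)}"
        )
    if require_files:
        missing = [path for path in path_List if not os.path.isfile(path)]
        if missing:
            raise FileNotFoundError(f"missing speaker wav files: {missing}")


# Example with placeholder values; file existence is skipped here.
check_inference_inputs(["speaker.wav"], ["Please call Stella."], require_files=False)
```

Calling it with the real 'path_List' and 'text_List' (and the default 'require_files=True') before running inference would report a length mismatch or a missing file immediately.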

Run the following command:

new_Tacotron2.Inference(
    path_List = path_List,
    text_List = text_List,
    file_Prefix = 'Result'
    )

Result

Speaker embedding

GS_10000

Mel to spectrogram

GS_12000 IDX_1

Waveglow

GS_915000 IDX_0

Currently, the performance of Waveglow is not good.

Multi speaker TTS

Exported wav files: WAV.zip

Trained checkpoint

https://drive.google.com/drive/folders/1wXrJY-gQTOs9yZ7nxvxPaAa6Wf8uF7zP?usp=sharing

Future works

Waveglow performance improvement
