GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

Rongjie Huang, Yi Ren, Jinglin Liu, Chenye Cui, Zhou Zhao | Zhejiang University, Sea AI Lab

PyTorch implementation of GenerSpeech (NeurIPS'22): a text-to-speech model for high-fidelity zero-shot style transfer of out-of-domain (OOD) custom voices.


We provide our implementation and pretrained models in this repository.

Visit our demo page for audio samples.

News

  • December 2022: GenerSpeech (NeurIPS 2022) released on GitHub.

Key Features

  • Multi-level Style Transfer for expressive text-to-speech.
  • Enhanced model generalization to out-of-distribution (OOD) style reference.

Quick Start

We provide an example of how you can generate high-fidelity samples using GenerSpeech.

To try it on your own dataset, clone this repo on a local machine with an NVIDIA GPU, CUDA, and cuDNN, then follow the instructions below.

Supported Datasets and Pretrained Models

You can use the pretrained models we provide here, and the data here. Details of each folder are as follows:

| Model       | Dataset (16 kHz) | Description             |
|-------------|------------------|-------------------------|
| GenerSpeech | LibriTTS, ESD    | Acoustic model (config) |
| HIFI-GAN    | LibriTTS, ESD    | Neural vocoder          |
| Encoder     | /                | Emotion encoder         |

More supported datasets are coming soon.

Dependencies

A suitable conda environment named generspeech can be created and activated with:

conda env create -f environment.yaml
conda activate generspeech

Multi-GPU

By default, this implementation uses as many GPUs in parallel as torch.cuda.device_count() returns. To restrict which GPUs are used, set the CUDA_VISIBLE_DEVICES environment variable before running the training module, as the commands in this README do.
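As a rough illustration of how GPU selection through an environment variable behaves, the sketch below parses `CUDA_VISIBLE_DEVICES`, the variable used by the commands in this README. The helper name `visible_gpu_ids` is hypothetical and is not part of this repository; it only mirrors the standard CUDA convention that an unset variable exposes every GPU and an empty value hides all of them.

```python
import os


def visible_gpu_ids(env=None):
    """Return the GPU ids listed in CUDA_VISIBLE_DEVICES.

    Hypothetical helper for illustration only (not part of this repo).
    Returns None when the variable is unset (all GPUs are visible) and an
    empty list when it is set to an empty string (no GPU is visible).
    Ids are kept as strings because CUDA also accepts device UUIDs.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [token.strip() for token in raw.split(",") if token.strip()]


if __name__ == "__main__":
    ids = visible_gpu_ids()
    if ids is None:
        print("CUDA_VISIBLE_DEVICES is unset: all GPUs are visible")
    else:
        print(f"{len(ids)} GPU(s) visible: {ids}")
```

Running this before launching training is a quick way to confirm how many devices the process will be allowed to see.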

Inference (Zero-shot TTS)

Here we provide a speech synthesis pipeline using GenerSpeech.

  1. Prepare GenerSpeech (acoustic model): Download the checkpoint and place it at checkpoints/GenerSpeech
  2. Prepare HIFI-GAN (neural vocoder): Download the checkpoint and place it at checkpoints/trainset_hifigan
  3. Prepare Emotion Encoder: Download the checkpoint and place it at checkpoints/Emotion_encoder.pt
  4. Prepare dataset: Download the statistical files and place them at data/binary/training_set
  5. Prepare the reference audio at path/to/reference_audio (16 kHz): By default, GenerSpeech uses ASR + MFA to obtain the text-speech alignment from the reference.
CUDA_VISIBLE_DEVICES=$GPU python inference/GenerSpeech.py --config modules/GenerSpeech/config/generspeech.yaml  --exp_name GenerSpeech --hparams="text='here we go',ref_audio='assets/0011_001570.wav'"

Generated wav files are saved in infer_out by default.
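The preparation steps and the command above can be sketched in Python as a pre-flight check plus a command builder. Both function names (`missing_assets`, `build_inference_command`) are hypothetical and not part of this repository; the paths and flags are copied from this README. The `--hparams` string is assembled by plain formatting, so the sketch assumes the text and the audio path contain no single quotes or commas.

```python
from pathlib import Path

# Locations named in the preparation steps of this README.
REQUIRED_ASSETS = [
    "checkpoints/GenerSpeech",
    "checkpoints/trainset_hifigan",
    "checkpoints/Emotion_encoder.pt",
    "data/binary/training_set",
]


def missing_assets(repo_root="."):
    """Return the required paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in REQUIRED_ASSETS if not (root / rel).exists()]


def build_inference_command(text, ref_audio):
    """Build the argument list for the inference script shown above.

    Assumes text and ref_audio contain no single quotes or commas,
    since the hparams string is built without any escaping.
    """
    hparams = f"text='{text}',ref_audio='{ref_audio}'"
    return [
        "python", "inference/GenerSpeech.py",
        "--config", "modules/GenerSpeech/config/generspeech.yaml",
        "--exp_name", "GenerSpeech",
        f"--hparams={hparams}",
    ]


if __name__ == "__main__":
    missing = missing_assets(".")
    if missing:
        print("Missing before inference:", ", ".join(missing))
    else:
        print(" ".join(build_inference_command("here we go", "assets/0011_001570.wav")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, with `CUDA_VISIBLE_DEVICES` set in the environment if a specific GPU is wanted.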

Train your own model

Data Preparation and Configuration

  1. Set raw_data_dir, processed_data_dir, and binary_data_dir in the config file, and download the dataset to raw_data_dir.
  2. Check preprocess_cls in the config file. The dataset structure must match what the processor named by preprocess_cls expects; otherwise, rewrite the processor to fit your dataset. We provide a LibriTTS processor as an example in modules/GenerSpeech/config/generspeech.yaml.
  3. Download the global emotion encoder to emotion_encoder_path. For more details, please refer to this branch.
  4. Preprocess the dataset:
# Preprocess step: unify the file structure.
python data_gen/tts/bin/preprocess.py --config $path/to/config
# Align step: MFA alignment.
python data_gen/tts/bin/train_mfa_align.py --config $path/to/config
# Binarization step: Binarize data for fast IO.
CUDA_VISIBLE_DEVICES=$GPU python data_gen/tts/bin/binarize.py --config $path/to/config
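The three steps above run in a fixed order, and each depends on the output of the previous one. A minimal sketch of a driver that runs them in sequence and stops at the first failure is shown below. The function names (`build_commands`, `run_pipeline`) are hypothetical and not part of this repository; the script paths are the ones listed above. The sketch inherits the caller's environment, so `CUDA_VISIBLE_DEVICES` should be set beforehand if the binarization step needs a specific GPU.

```python
import subprocess
import sys

# (step name, script) in the order given in this README.
STEPS = [
    ("preprocess", "data_gen/tts/bin/preprocess.py"),
    ("align", "data_gen/tts/bin/train_mfa_align.py"),
    ("binarize", "data_gen/tts/bin/binarize.py"),
]


def build_commands(config_path, python=sys.executable):
    """Return (step name, argument list) pairs for the three steps."""
    return [
        (name, [python, script, "--config", config_path])
        for name, script in STEPS
    ]


def run_pipeline(config_path, runner=subprocess.run):
    """Run each step in order; check=True aborts on the first failure."""
    for name, cmd in build_commands(config_path):
        print(f"[{name}] {' '.join(cmd)}")
        runner(cmd, check=True)


if __name__ == "__main__":
    # Must be launched from the repository root so the script paths resolve.
    run_pipeline("modules/GenerSpeech/config/generspeech.yaml")
```

The `runner` parameter exists only so the sequence can be exercised without launching the real scripts, for example by passing a function that records the commands.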

You can also build a dataset via NATSpeech, which shares a common MFA data-processing procedure. We also provide our processed dataset (16 kHz LibriTTS + ESD).

Training GenerSpeech

CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config modules/GenerSpeech/config/generspeech.yaml  --exp_name GenerSpeech --reset

Inference using GenerSpeech

CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config modules/GenerSpeech/config/generspeech.yaml  --exp_name GenerSpeech --infer
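The training and inference commands above differ only in their final flag. The sketch below captures that in one hypothetical helper, `build_run_command`, which is not part of this repository; the script path, config path, experiment name, and the `--reset` and `--infer` flags are taken from the two commands above.

```python
DEFAULT_CONFIG = "modules/GenerSpeech/config/generspeech.yaml"

# Final flag used by each command in this README.
MODE_FLAGS = {"train": "--reset", "infer": "--infer"}


def build_run_command(mode, config=DEFAULT_CONFIG, exp_name="GenerSpeech"):
    """Build the tasks/run.py argument list for 'train' or 'infer'."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"mode must be one of {sorted(MODE_FLAGS)}, got {mode!r}")
    return [
        "python", "tasks/run.py",
        "--config", config,
        "--exp_name", exp_name,
        MODE_FLAGS[mode],
    ]


if __name__ == "__main__":
    for mode in ("train", "infer"):
        print(" ".join(build_run_command(mode)))
```

Keeping the same `exp_name` for both modes matches the commands above, where training and inference both use `GenerSpeech`.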

Acknowledgements

This implementation uses parts of the code from the following GitHub repositories: FastDiff and NATSpeech, as noted in our code.

Citations

If you find this code useful in your research, please cite our work:

@inproceedings{huanggenerspeech,
  title={GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech},
  author={Huang, Rongjie and Ren, Yi and Liu, Jinglin and Cui, Chenye and Zhao, Zhou},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}

Disclaimer

Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this requirement, you could be in violation of copyright laws.