This is a lightly modified fork of an open-source PyTorch implementation of FastSpeech 2. This fork contains additional configuration, instructions, etc., to support the Yiddish language. To cite this Yiddish project, please use:
@InProceedings{Webber_etal-2022,
  author    = {Jacob J. Webber and Samuel K. Lo and Isaac L. Bleaman},
  title     = {{REYD} -- The First {Yiddish} Text-to-Speech Dataset and System},
  booktitle = {Proceedings of {Interspeech} 2022},
  year      = {2022},
  doi       = {10.21437/Interspeech.2022-789}
}
Full details of the project can be found here.
If you want to generate samples without hassle, there is an interactive demo available here.
The original README of the FastSpeech 2 implementation is appended below our Yiddish-specific instructions.
To run training or inference, you will need to install the necessary dependencies:
pip install -r requirements.txt
Depending on your system, you may need to remove the pinned version numbers to get things to install smoothly. For training, make sure you have a CUDA-enabled build of PyTorch. You will also need the yiddish library from Isaac Bleaman:
pip install yiddish
Download pretrained models and pre-processed dataset:
wget --content-disposition https://figshare.com/ndownloader/articles/19350539/versions/1
unzip 19350539.zip
We provide a pre-processed version of our dataset. This consists of
- alignment TextGrids generated using the Montreal Forced Aligner. See here.
- FastSpeech 2-specific pre-processing steps (generating spectrograms, etc.). See the preprocessing section below.
Move the relevant files from the set-up section:
mv yiddish_textgrids*.zip preprocessed_data/
for orthography in 'yivo_respelled' 'yivo_original' 'hasidic'; do unzip preprocessed_data/yiddish_textgrids_${orthography}.zip; done
There are three different orthographies available. For more information about these see the paper.
export orthography=yivo_respelled # yivo_original hasidic
Run the training script:
python train.py -p config/${orthography}/preprocess.yaml -m config/${orthography}/model.yaml -t config/${orthography}/train.yaml
For synthesis we provide pretrained models, which are downloaded as part of the set-up section.
Move them into place as below; if you have trained your own model, this step is not needed.
unzip pretrained_models.zip
for orthography in 'yivo_respelled' 'yivo_original' 'hasidic'
do
mkdir -p output/ckpt/${orthography}
mkdir -p output/log/${orthography}
mkdir -p output/result/${orthography}
mv ${orthography}/100000.pth.tar output/ckpt/${orthography}/
rm -rf ${orthography}/
done
Speaker IDs are:
- 0: male, Lithuanian Yiddish
- 1: female, Lithuanian Yiddish
- 2: male, Polish Yiddish

Orthography is as above.
export s_id=0 # 1 2
export orthography=yivo_respelled # yivo_original hasidic
Set your text input:
export text="מעקט אָפּ דעם טעקסט און שרײַבט אַרײַן אַן אײגענעם"
Preprocess the input text (see the paper); note that ${text} is quoted so the whole sentence is passed as one argument:
text=$(python yiddish_preprocessing.py "${text}" -o ${orthography})
Run inference script:
python synthesize.py --text "${text}" --speaker_id ${s_id} --restore_step 100000 --mode single -p config/${orthography}/preprocess.yaml -m config/${orthography}/model.yaml -t config/${orthography}/train.yaml
ls output/result/${orthography}
This is a PyTorch implementation of Microsoft's text-to-speech system FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. This project is based on xcmyz's implementation of FastSpeech. Feel free to use/modify the code.
There are several versions of FastSpeech 2. This implementation is closer to version 1, which uses F0 values as the pitch features, whereas later versions use pitch spectrograms extracted by continuous wavelet transform as the pitch features.
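As a rough, self-contained illustration of what "F0 values as the pitch features" means, here is a toy per-frame F0 estimator based on autocorrelation. The repo itself relies on a dedicated pitch tracker; the function name, window size, and frequency bounds below are illustrative assumptions, not the repo's code.

```python
import numpy as np

def frame_f0_autocorr(frame, sr, fmin=70.0, fmax=400.0):
    """Toy F0 estimate for one frame: pick the autocorrelation peak
    within the lag range corresponding to [fmin, fmax] Hz."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

# Sanity check on a pure 120 Hz tone
sr = 22050
t = np.arange(1024) / sr
f0 = frame_f0_autocorr(np.sin(2 * np.pi * 120.0 * t), sr)
```

Running such an estimator frame by frame over an utterance yields the frame-level F0 contour that version 1 of the model conditions on.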
- 2021/7/8: Release the checkpoint and audio samples of a multi-speaker English TTS model trained on LibriTTS
- 2021/2/26: Support English and Mandarin TTS
- 2021/2/26: Support multi-speaker TTS (AISHELL-3 and LibriTTS)
- 2021/2/26: Support MelGAN and HiFi-GAN vocoder
Audio samples generated by this implementation can be found here.
You can install the Python dependencies with
pip3 install -r requirements.txt
You have to download the pretrained models and put them in output/ckpt/LJSpeech/, output/ckpt/AISHELL3/, or output/ckpt/LibriTTS/.
For English single-speaker TTS, run
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 900000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
For Mandarin multi-speaker TTS, try
python3 synthesize.py --text "大家好" --speaker_id SPEAKER_ID --restore_step 600000 --mode single -p config/AISHELL3/preprocess.yaml -m config/AISHELL3/model.yaml -t config/AISHELL3/train.yaml
For English multi-speaker TTS, run
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --restore_step 800000 --mode single -p config/LibriTTS/preprocess.yaml -m config/LibriTTS/model.yaml -t config/LibriTTS/train.yaml
The generated utterances will be put in output/result/.
Here is an example of synthesized mel-spectrogram of the sentence "Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition", with the English single-speaker TTS model.
Batch inference is also supported, try
python3 synthesize.py --source preprocessed_data/LJSpeech/val.txt --restore_step 900000 --mode batch -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
to synthesize all utterances in preprocessed_data/LJSpeech/val.txt.
The pitch, volume, and speaking rate of the synthesized utterances can be controlled by specifying the desired pitch, energy, and duration ratios. For example, one can increase the speaking rate by 20% and decrease the volume by 20% with
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 900000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml --duration_control 0.8 --energy_control 0.8
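Conceptually, duration control just rescales the predicted per-phoneme frame counts before the length regulator expands the hidden sequence, so a ratio below 1.0 yields fewer frames and faster speech. A minimal sketch of the idea (the exact transform and rounding in this repo may differ):

```python
import numpy as np

def control_durations(log_durations, d_control=1.0):
    """Scale predicted per-phoneme durations (log frame counts) by a
    speaking-rate ratio and round to integer frame counts."""
    frames = np.exp(np.asarray(log_durations))  # back to linear frames
    return np.clip(np.round(frames * d_control), 0, None).astype(int)

base = control_durations(np.log([5.0, 10.0, 8.0]))            # unchanged
fast = control_durations(np.log([5.0, 10.0, 8.0]), d_control=0.8)
```

Energy and pitch controls work analogously, scaling the predicted energy and F0 values rather than the frame counts.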
The supported datasets are
- LJSpeech: a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
- AISHELL-3: a Mandarin TTS dataset with 218 male and female speakers, roughly 85 hours in total.
- LibriTTS: a multi-speaker English dataset containing 585 hours of speech by 2456 speakers.
We take LJSpeech as an example hereafter.
First, run
python3 prepare_align.py config/LJSpeech/preprocess.yaml
to prepare the data for alignment.
As described in the paper, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences.
Alignments of the supported datasets are provided here.
You have to unzip the files into preprocessed_data/LJSpeech/TextGrid/.
After that, run the preprocessing script by
python3 preprocess.py config/LJSpeech/preprocess.yaml
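One job of this preprocessing step is to turn each aligned phoneme interval in the TextGrids into a duration measured in mel-spectrogram frames. A minimal sketch, assuming the LJSpeech defaults of a 22,050 Hz sampling rate and a 256-sample hop length (check preprocess.yaml for the actual values):

```python
def seconds_to_frames(start, end, sr=22050, hop=256):
    """Map an alignment interval (in seconds) to a count of
    mel-spectrogram frames. Rounding each boundary independently
    keeps adjacent phoneme durations summing to the total length."""
    return int(round(end * sr / hop)) - int(round(start * sr / hop))

frames = seconds_to_frames(0.0, 0.116)  # a ~116 ms phoneme
```

Because boundaries are rounded once and shared between neighbours, the durations of consecutive phonemes always add up to the utterance's total frame count.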
Alternatively, you can align the corpus yourself. Download the official MFA package and run
./montreal-forced-aligner/bin/mfa_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt english preprocessed_data/LJSpeech
or
./montreal-forced-aligner/bin/mfa_train_and_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt preprocessed_data/LJSpeech
to align the corpus and then run the preprocessing script.
python3 preprocess.py config/LJSpeech/preprocess.yaml
Train your model with
python3 train.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
The model takes fewer than 10k training steps (less than 1 hour on my GTX 1080 Ti GPU) to generate audio samples of acceptable quality, which is much more efficient than autoregressive models such as Tacotron 2.
Use
tensorboard --logdir output/log/LJSpeech
to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audio samples are shown.
- Following xcmyz's implementation, I use an additional Tacotron-2-styled Post-Net after the decoder, which is not used in the original FastSpeech 2.
- Gradient clipping is used in the training.
- In my experience, using phoneme-level pitch and energy prediction instead of frame-level prediction results in much better prosody, and normalizing the pitch and energy features also helps. Please refer to config/README.md for more details.
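Phoneme-level pitch and energy targets can be obtained by averaging the frame-level contour over each phoneme's aligned frames. A minimal sketch of that reduction (illustrative, not the repo's exact code; here zero-duration phonemes simply get 0):

```python
import numpy as np

def phoneme_average(frame_values, durations):
    """Collapse a frame-level contour (e.g. F0 or energy) to one
    value per phoneme by averaging over its aligned frames."""
    out, start = [], 0
    for d in durations:
        seg = frame_values[start:start + d]
        out.append(float(seg.mean()) if d > 0 else 0.0)
        start += d
    return np.array(out)

# Three phonemes spanning 2, 3, and 1 frames respectively
pitch = phoneme_average(np.array([1.0, 1.0, 2.0, 2.0, 2.0, 3.0]), [2, 3, 1])
```

The predictors are then trained against these per-phoneme values, which are expanded back to frame level by the length regulator at synthesis time.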
Please let me know if you find any mistakes in this repo, or any useful tips for training the FastSpeech 2 model.
- FastSpeech 2: Fast and High-Quality End-to-End Text to Speech, Y. Ren, et al.
- xcmyz's FastSpeech implementation
- TensorSpeech's FastSpeech 2 implementation
- rishikksh20's FastSpeech 2 implementation
@InProceedings{chien2021investigating,
  author    = {Chien, Chung-Ming and Lin, Jheng-Hao and Huang, Chien-yu and Hsu, Po-chun and Lee, Hung-yi},
  booktitle = {ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title     = {Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech},
  year      = {2021},
  pages     = {8588-8592},
  doi       = {10.1109/ICASSP39728.2021.9413880}
}