MB-iSTFT-VITS with Multilingual Implementations

This is a multilingual implementation of MB-iSTFT-VITS that supports conversion to various languages. MB-iSTFT-VITS reports about 4.1 times faster inference than the original VITS!
Preprocessed Japanese single-speaker training material is provided for the つくよみちゃんコーパス (tsukuyomi-chan corpus). Download the corpus and place its 100 .wav files in ./tsukuyomi_raw.

  • Currently supported: Japanese / Korean
  • Chinese, CJKE, and other languages are planned soon!

How to use

Python >= 3.6 (Python 3.7 is recommended)

Clone this repository

git clone https://github.com/queechy/MB-iSTFT-VITS-multilingual-emotion.git

Install requirements

pip install -r requirements.txt

You may need to install espeak first: apt-get install espeak

Create manifest data

Single speaker

"n_speakers" should be 0 in config.json

path/to/XXX.wav|transcript
  • Example
dataset/001.wav|こんにちは。

Multiple speakers

Speaker IDs should start from 0

path/to/XXX.wav|speaker id|transcript
  • Example
dataset/001.wav|0|こんにちは。
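
If your transcripts live in a separate file, a small script can assemble the manifest. Below is a minimal sketch; the transcripts.txt input format and all file names here are hypothetical, not part of this repository:

# build_filelist.py -- illustrative helper, not part of this repository.
# Assumes a hypothetical transcripts.txt whose lines look like: 001.wav<TAB>こんにちは。
import os

WAV_DIR = "dataset"     # directory containing the .wav files
SPEAKER_ID = None       # set to an int starting from 0 for the multi-speaker format

with open("transcripts.txt", encoding="utf-8") as src, \
     open("filelists/filelist_train.txt", "w", encoding="utf-8") as dst:
    for line in src:
        name, text = line.rstrip("\n").split("\t", 1)
        path = os.path.join(WAV_DIR, name)
        if SPEAKER_ID is None:
            dst.write(f"{path}|{text}\n")               # path/to/XXX.wav|transcript
        else:
            dst.write(f"{path}|{SPEAKER_ID}|{text}\n")  # path/to/XXX.wav|speaker id|transcript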

Preprocess

Preprocessed Japanese manifest data is provided in filelists/filelist_train2.txt.cleaned and filelists/filelist_val2.txt.cleaned.

# Single speaker
python preprocess.py --text_index 1 --filelists path/to/filelist_train.txt path/to/filelist_val.txt --text_cleaners 'japanese_cleaners'

# Multiple speakers
python preprocess.py --text_index 2 --filelists path/to/filelist_train.txt path/to/filelist_val.txt --text_cleaners 'japanese_cleaners'

If your speech files are not 22050 Hz / mono / PCM-16, resample the .wav files first.

python convert_to_22050.py --in_path path/to/original_wav_dir/ --out_path path/to/output_wav_dir/
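
convert_to_22050.py does this for you; for reference, a rough sketch of the equivalent conversion is shown below. It assumes librosa and soundfile, which are common choices but not necessarily what the script uses internally:

# Illustrative equivalent of convert_to_22050.py (assumes librosa and soundfile are installed).
import os
import librosa
import soundfile as sf

in_dir, out_dir = "original_wav_dir", "output_wav_dir"
os.makedirs(out_dir, exist_ok=True)
for name in os.listdir(in_dir):
    if not name.endswith(".wav"):
        continue
    audio, _ = librosa.load(os.path.join(in_dir, name), sr=22050, mono=True)  # resample + downmix
    sf.write(os.path.join(out_dir, name), audio, 22050, subtype="PCM_16")     # write 16-bit PCM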

Build monotonic alignment search

# Cython-version Monotonic Alignment Search
cd monotonic_align
mkdir monotonic_align
python setup.py build_ext --inplace
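
To confirm the extension compiled, a quick check from the repository root (this assumes the usual VITS-style layout where the package exposes maximum_path backed by the compiled core module):

# Run from the repository root; an ImportError means the Cython extension was not built.
from monotonic_align import maximum_path
print(maximum_path)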

Setting up the json file in configs

Model           How to set up the json file                                      Sample config file
iSTFT-VITS      "istft_vits": true, "upsample_rates": [8,8]                      ljs_istft_vits.json
MB-iSTFT-VITS   "subbands": 4, "mb_istft_vits": true, "upsample_rates": [4,4]    ljs_mb_istft_vits.json
MS-iSTFT-VITS   "subbands": 4, "ms_istft_vits": true, "upsample_rates": [4,4]    ljs_ms_istft_vits.json

For the tutorial, see configs/tsukuyomi_chan.json for a complete example.

  • If you have done preprocessing, set "cleaned_text" to true.
  • Change training_files and validation_files to the paths of the preprocessed manifest files.
  • Use the same text_cleaners you used in the preprocessing step (a config-patching sketch follows this list).
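
Putting the table and checklist together, here is a minimal sketch of patching a config programmatically. It assumes the standard VITS config layout with "data" and "model" sections; key placement may differ slightly in this repository, so compare against configs/tsukuyomi_chan.json:

# Illustrative config patch; assumes the usual VITS "data"/"model" layout.
import json

with open("configs/tsukuyomi_chan.json", encoding="utf-8") as f:
    cfg = json.load(f)

cfg["data"]["training_files"] = "filelists/filelist_train2.txt.cleaned"
cfg["data"]["validation_files"] = "filelists/filelist_val2.txt.cleaned"
cfg["data"]["text_cleaners"] = ["japanese_cleaners"]  # must match the preprocessing step
cfg["data"]["cleaned_text"] = True                    # manifests are already preprocessed
cfg["data"]["n_speakers"] = 0                         # single speaker

cfg["model"]["subbands"] = 4                          # MB-iSTFT-VITS settings from the table
cfg["model"]["mb_istft_vits"] = True
cfg["model"]["upsample_rates"] = [4, 4]

with open("configs/my_model.json", "w", encoding="utf-8") as f:
    json.dump(cfg, f, ensure_ascii=False, indent=2)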

Train

# Single speaker
python train_latest.py -c <config> -m <folder>

# Multiple speakers
python train_latest_ms.py -c <config> -m <folder>

To train MB-iSTFT-VITS on the Japanese tutorial corpus, run the script below. Training resumes automatically from the latest checkpoint.

python train_latest.py -c configs/tsukuyomi_chan.json -m tsukuyomi

After training, you can check inference audio with inference.ipynb
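
inference.ipynb is the authoritative walkthrough; condensed, VITS-style inference looks roughly like the sketch below. The checkpoint path is a placeholder and the helper names are assumed from the upstream VITS codebase:

# Condensed VITS-style inference sketch; see inference.ipynb for the real version.
import torch
import utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols

hps = utils.get_hparams_from_file("configs/tsukuyomi_chan.json")
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model)
_ = net_g.eval()
utils.load_checkpoint("logs/tsukuyomi/G_0.pth", net_g, None)  # placeholder checkpoint path

# Note: if hps.data.add_blank is true, the notebook also intersperses blank tokens here.
seq = torch.LongTensor(text_to_sequence("こんにちは。", hps.data.text_cleaners))
with torch.no_grad():
    x = seq.unsqueeze(0)
    x_lengths = torch.LongTensor([seq.size(0)])
    audio = net_g.infer(x, x_lengths, noise_scale=0.667, length_scale=1.0)[0][0, 0]
# `audio` is a 22050 Hz waveform tensor ready to save or play back.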

References

  • Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform (the MB-iSTFT-VITS paper this implementation is based on)
