A simple tool that makes the Montreal Forced Aligner easy to use.
As the speech research community grows rapidly, text-wav forced alignment has become essential for research on Text-to-Speech, Voice Conversion, and other speech-related fields. One simple and widely used approach is the Montreal Forced Aligner (MFA) [McAuliffe17]. Despite this need, beginners in speech research may find it hard to train MFA on their own datasets. For them, this repository offers the following operations and procedures needed to run MFA with little effort.
- Convert a whole dataset into the (wav, lab) pair structure that MFA requires (see the sketch after this list)
- Generate a phoneme dictionary from a pretrained G2P model provided by the official MFA documentation
- Train MFA using the formatted (wav, lab) paired dataset and the generated phoneme dictionary
- Validate and visualize the extracted TextGrid (alignment) files via a Jupyter notebook
- Provide text-wav alignments extracted from the Emotional Speech Dataset (ESD) [Zhou21]
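To make the first item concrete, below is a minimal sketch of the (wav, lab) pair structure MFA expects: one plain-text `.lab` transcript beside each `.wav`, sharing its stem and grouped by speaker directory. The helper name, paths, and transcript here are hypothetical illustrations, not the actual code in `main.py`.

```python
from pathlib import Path

def write_lab_files(corpus_dir: str, transcripts: dict[str, str]) -> None:
    """transcripts maps a wav path (relative to corpus_dir) to its transcript text."""
    root = Path(corpus_dir)
    for rel_wav, text in transcripts.items():
        # MFA pairs each wav with a .lab file of the same stem in the same directory.
        lab_path = (root / rel_wav).with_suffix(".lab")
        lab_path.parent.mkdir(parents=True, exist_ok=True)
        lab_path.write_text(text.strip() + "\n", encoding="utf-8")

# e.g. corpus/0011/0011_000001.wav -> corpus/0011/0011_000001.lab
write_lab_files("corpus", {"0011/0011_000001.wav": "The birch canoe slid on the smooth planks."})
```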
To run this program, please follow the procedure below.
- Install Anaconda and create an environment with python=3.9.
- Install MFA and download the ESD dataset (reference MFA commands are sketched after this list).
- Install prerequisite modules via pip:
```
pip install -r requirements.txt
```
- Edit `config.py` to point to your dataset.
- Run the formatter:
```
python main.py
```
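For reference, the underlying MFA 2.x commands that this pipeline relies on look roughly like the sketch below. The model name `english_us_arpa` and the file names are examples, and the exact argument order of `mfa g2p` and `mfa train` differs between MFA versions, so consult `mfa g2p --help` and `mfa train --help` for your installation.

```
# Install MFA itself (official conda-forge package)
conda install -c conda-forge montreal-forced-aligner

# Download a pretrained G2P model and generate a pronunciation dictionary
mfa model download g2p english_us_arpa
mfa g2p corpus_dir english_us_arpa dictionary.txt

# Validate the corpus, then train an acoustic model and export alignments
mfa validate corpus_dir dictionary.txt
mfa train corpus_dir dictionary.txt acoustic_model.zip --output_directory aligned_textgrids
```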
As a result of this tutorial, text-wav alignments extracted with MFA are uploaded in this repository.
Please refer to `visualise_alignment.ipynb`.
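Outside the notebook, a quick way to sanity-check an exported alignment is the third-party `textgrid` package (an assumption; the notebook itself may use a different library, and the file path below is hypothetical):

```python
import textgrid  # pip install textgrid

# Load an exported alignment; MFA typically exports "words" and "phones" tiers.
tg = textgrid.TextGrid.fromFile("aligned_textgrids/0011/0011_000001.TextGrid")
for tier in tg.tiers:
    if tier.name == "words":
        for interval in tier:
            if interval.mark:  # skip silence/empty intervals
                print(f"{interval.minTime:.2f}-{interval.maxTime:.2f}s: {interval.mark}")
```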
- Emotional Speech Dataset (ESD) [Zhou21] - a multi-speaker, multi-emotion dataset (English and Mandarin)
- Korean Single Speaker Dataset (KSS) - a Korean single female speaker dataset
- Korean Emotional Speech dataset (감정 음성합성 데이터셋) - a Korean single-speaker, multi-emotion dataset
- EmotionTTS OpenDB - a multipurpose dataset (in this repository, only the multi-speaker, multi-emotion subset is considered)
- Currently, only ESD is supported.
- Different emotions belonging to the same speaker are treated independently (i.e., utterances with emotion 'Angry' and utterances with emotion 'Sad' from the same speaker are treated as coming from different speakers). This is a simple remedy to reduce the complexity of the style (emotion) distribution; see the sketch after this list.
- Please note that the extracted alignments may not be accurate.
- Regarding the ESD dataset, only the English speakers are used.
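A minimal sketch of that remedy (the helper name and label format are hypothetical, not the exact logic in `main.py`):

```python
# Treat each (speaker, emotion) pair as its own pseudo-speaker so that
# emotion-dependent acoustics are modeled separately. Hypothetical helper.
def pseudo_speaker_id(speaker: str, emotion: str) -> str:
    return f"{speaker}_{emotion}"

pseudo_speaker_id("0011", "Angry")  # -> "0011_Angry"
pseudo_speaker_id("0011", "Sad")    # -> "0011_Sad"
```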
Please email [email protected]. Any suggestions or questions are appreciated. I hope this repository is helpful.