This is the implementation of our paper, in which we proposed an unsupervised speech (phoneme) recognition system that achieves a 33.1% phoneme error rate on TIMIT. The method trains a GAN-based model to perform unsupervised phoneme recognition, and further uses a set of HMMs that work in harmony with the GAN.
- tensorflow 1.13
- kaldi
- srilm (can be built with `kaldi/tools/install_srilm.sh`)
- librosa
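A minimal setup sketch, assuming a pip-based Python environment; only tensorflow 1.13 and librosa are installed here, while Kaldi and srilm are built separately from their own sources (the exact versions and environment layout are not prescribed by this repository):

```bash
# Sketch only: install the Python dependencies listed above.
# TensorFlow 1.13 requires Python 3.7 or earlier.
pip install "tensorflow==1.13.*" librosa

# Kaldi is compiled from source (see kaldi/INSTALL); srilm can then be
# built with the helper script mentioned above:
#   cd kaldi/tools && ./install_srilm.sh
```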
- Usage:
  - Modify `path.sh` with your paths to Kaldi and srilm (a sketch follows this list).
  - Modify `config.sh` with your code path and TIMIT path.
  - Run
    ```
    $ bash preprocess.sh
    ```
  - This script extracts the features and splits the dataset into train/test sets.
  - The data needed by the WFST decoder is also generated here.
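For reference, a hedged sketch of what `path.sh` could look like; the variable names `KALDI_ROOT` and `SRILM` below are assumptions based on common Kaldi setups, so keep whatever names the provided `path.sh` actually uses:

```bash
# path.sh (sketch only): point these at your local installations.
export KALDI_ROOT=/path/to/kaldi             # root of your Kaldi checkout/build
export SRILM=$KALDI_ROOT/tools/srilm         # srilm built via install_srilm.sh
export PATH=$KALDI_ROOT/src/bin:$SRILM/bin:$PATH
```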
- Usage:
  - Modify the experimental settings in `config.sh` (the main options are listed below).
  - Modify the GAN-based model's parameters in `src/GAN-based-model/config.yaml`.
  - Run
    ```
    $ bash run.sh
    ```
  - This script contains the training flow for the GAN-based model and the HMM model (a conceptual outline follows this list).
  - The GAN-based model generates the transcriptions used to train the HMM model.
  - The HMM model refines the phoneme boundaries used to train the GAN-based model.
  - Training with boundaries generated by GAS (`bnd_type=uns`) is unstable and may need several training attempts to reach satisfactory performance.
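For orientation only, a conceptual outline of the alternation described above; this is not the actual `run.sh`, and the loop count and step comments are purely illustrative:

```bash
#!/usr/bin/env bash
# Conceptual outline only -- the real run.sh drives the Kaldi recipes and the
# TensorFlow code with its own scripts and arguments.
NUM_ITERATIONS=3   # illustrative value, not a documented setting

for i in $(seq 1 "$NUM_ITERATIONS"); do
  echo "=== Iteration $i ==="
  # 1. Train the GAN-based model on the current phoneme boundaries and
  #    decode the training set into phoneme transcriptions.
  # 2. Train the HMMs on those transcriptions.
  # 3. Force-align with the trained HMMs to refine the phoneme boundaries
  #    used in the next GAN training round.
done
```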
- `bnd_type`: type of the initial phoneme boundaries (orc/uns).
- `setting`: matched or nonmatched case in our paper (match/nonmatch).
- `jobs`: number of parallel jobs (depends on your device).
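As an example, a hedged sketch of how these options might appear in `config.sh`; only the three variables above are documented, so keep any other variables (such as the code and TIMIT paths) exactly as provided in the repository:

```bash
# config.sh (sketch only): the options documented above.
bnd_type=uns      # orc = oracle boundaries, uns = boundaries from GAS
setting=match     # match / nonmatch, as defined in the paper
jobs=8            # number of parallel jobs; depends on your device
```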
Completely Unsupervised Speech Recognition By A Generative Adversarial Network Harmonized With Iteratively Refined Hidden Markov Models, Kuan-Yu Chen, Che-Ping Tsai, et al.
- The WFST decoder for the phoneme classifier.
- The training scripts for the unsupervised HMM.
Special thanks to Che-Ping Tsai (jackyyy0228) for the Kaldi parts! Special thanks to Sung-Feng Huang (b02901071) for the PyTorch version!