
Open-Set Speech Language Identification and the CU MultiLang Dataset

Mustafa Eyceoz, Justin Lee, Siddarth Pittie

Publication

Robust and Accessible: The CU MultiLang Dataset and Continuing Open-Set Speech LID

  • arXiv Publication URL
  • Please read the paper for a summary of both the new multi-language speech dataset and the goals, architecture, and results of our language identification system.

Previous Works

Modernizing Open-Set Speech Language Identification

Project Summary

Building an accessible, robust, and general solution for open-set speech language identification.

  • Capable of identifying known languages with high accuracy, while also recognizing and learning unknown languages on the fly without retraining the foundation TDNN model (a minimal sketch of such a TDNN embedding extractor follows this summary).
  • Highly portable: full-system inference runs on very lightweight hardware, and even full model training is feasible on ordinary developer hardware (a single-GPU, 32GB RAM system can train the model in a matter of hours).

Also building a diverse, high-coverage, open-source speech dataset spanning over 50 languages.

  • Used to make the system robust and general.
  • Covers most language families, with targeted diversity of speakers and dialects within each language.
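
The foundation model is a TDNN that maps variable-length speech segments to fixed-length language embeddings. Below is a minimal, hypothetical PyTorch sketch of such an embedding extractor, not the architecture implemented in train_tdnn.py; the layer sizes, feature dimension, and language count are placeholder assumptions.

import torch
import torch.nn as nn


class TDNNSketch(nn.Module):
    # Hypothetical layer sizes; feat_dim depends on the MFCC+pitch configs
    # and num_langs on the training set (both are placeholder assumptions).
    def __init__(self, feat_dim=43, embed_dim=512, num_langs=50):
        super().__init__()
        # A TDNN layer is a 1-D convolution over time; stacking layers with
        # growing dilation widens the temporal context frame by frame.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
        )
        # Statistics pooling collapses the time axis into a mean/std summary,
        # giving a fixed-length vector regardless of segment length.
        self.segment_layer = nn.Linear(512 * 2, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_langs)

    def forward(self, x):
        # x: (batch, feat_dim, num_frames) of MFCC+pitch features
        h = self.frame_layers(x)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        embedding = self.segment_layer(stats)
        # Logits over the known languages, plus the utterance embedding.
        return self.classifier(embedding), embedding


if __name__ == "__main__":
    dummy = torch.randn(8, 43, 400)  # e.g. 4-second segments at 100 frames/sec
    logits, emb = TDNNSketch()(dummy)
    print(logits.shape, emb.shape)  # torch.Size([8, 50]) torch.Size([8, 512])

In a sketch like this, the statistics-pooling step is what allows segments of any length to yield a fixed-size vector for downstream scoring.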

CU MultiLang Dataset

The full dataset can be accessed at the link below:

LID System Code Guide

Run demo

$ python3 full_system.py

Train TDNN with 5 hrs, 4 sec, 15 epochs

$ python3 train_tdnn.py 5 4 15 0.8

Test the tdnn-final-submission TDNN model

$ python3 test_tdnn.py ./saved-models/tdnn-final-submission.pickle

Save the outputs of the tdnn-final-submission TDNN model

$ python3 get_tdnn_outputs.py ./saved-models/tdnn-final-submission.pickle

Train the LDA and pLDA layers using the saved TDNN outputs

$ python3 train_lda_plda.py ./saved-tdnn-outputs
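
Taken together, the scripts above form a pipeline: train the TDNN, dump its outputs with get_tdnn_outputs.py, and fit the LDA/pLDA scoring layers on those outputs with train_lda_plda.py. The sketch below is a hypothetical illustration of the open-set decision rule such a scoring back end supports, not code from this repository: cosine similarity stands in for the pLDA log-likelihood-ratio score, and the function names, centroid dictionary, and threshold value are all placeholder assumptions.

import numpy as np

UNKNOWN_THRESHOLD = 0.5  # placeholder; a real system tunes this on held-out data


def cosine_score(embedding: np.ndarray, centroid: np.ndarray) -> float:
    """Stand-in scorer: cosine similarity between an utterance embedding and a
    language centroid (a real back end would use pLDA scoring instead)."""
    return float(np.dot(embedding, centroid) /
                 (np.linalg.norm(embedding) * np.linalg.norm(centroid)))


def classify_open_set(embedding: np.ndarray, lang_centroids: dict) -> str:
    # Score the embedding against every enrolled (known) language.
    scores = {lang: cosine_score(embedding, c) for lang, c in lang_centroids.items()}
    best_lang, best_score = max(scores.items(), key=lambda kv: kv[1])
    # If even the best match is weak, report "unknown" so the language can be
    # learned on the fly instead of being forced into an existing class.
    return best_lang if best_score >= UNKNOWN_THRESHOLD else "unknown"


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centroids = {"eng": rng.normal(size=512), "spa": rng.normal(size=512)}
    test_embedding = centroids["eng"] + 0.1 * rng.normal(size=512)
    print(classify_open_set(test_embedding, centroids))  # likely "eng"

The key design point is the fallback branch: when even the best-scoring known language is a poor match, the segment is flagged as unknown so it can be enrolled as a new language rather than misclassified as a known one.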

Feature Generation

MFCC + pitch features were generated using Kaldi; a sketch of loading them in Python appears at the end of this section.

  • MFCC and pitch conf files can be found in the mfcc-confs subdir
    • Originally found in kaldi/egs/tedlium/s5_r3/scripts/conf/mfcc.conf and kaldi/egs/tedlium/s5_r3/scripts/conf/pitch.conf
  • Also in that subdir is our modified version of the make_mfcc_pitch.sh script
    • Originally found in and runnable from kaldi/egs/tedlium/s5_r3/scripts/steps/make_mfcc_pitch.sh
    • Usage: make_mfcc_pitch.sh --nj 1 --cmd "$train_cmd" <language directory> <log directory> <mfcc_pitch output directory>
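
Once make_mfcc_pitch.sh has been run, the output directory contains Kaldi feature archives along with a .scp index (exact file names depend on the script configuration). The snippet below is a minimal sketch of reading them with the third-party kaldiio package; this is not necessarily how this repository loads features, and the path is a placeholder.

import kaldiio

# Placeholder path: make_mfcc_pitch.sh writes .ark feature archives plus a
# .scp index into the <mfcc_pitch output directory> given on the command line.
feats = kaldiio.load_scp("mfcc_pitch/feats.scp")  # lazy mapping: utterance id -> matrix

for utt_id in list(feats.keys())[:3]:
    mat = feats[utt_id]  # NumPy array, shape (num_frames, feat_dim)
    print(utt_id, mat.shape)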

Open-Source Citations

Language Data Sources

Basic PyTorch TDNN Reference

Python pLDA Reference
