Project: Automatic Speech Recognition (ASR)

We recommend using our project template.

Task

Implement and train a neural-network speech recognition system with CTC loss. You are free to choose any model you like. We recommend having a look at these papers:

Try to avoid using implementations available on the internet.
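Whichever model you choose, a CTC-trained network emits one label per audio frame, so evaluation needs a decoding step. The snippet below is a minimal sketch of greedy CTC decoding (collapse repeated labels, then drop blanks). The alphabet, the blank index, and the function name are assumptions made for illustration; they are not part of the project template.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank token


def ctc_greedy_decode(frame_ids, alphabet, blank=BLANK):
    """Collapse runs of identical frame labels, then remove blanks."""
    collapsed = [idx for idx, _ in groupby(frame_ids)]
    return "".join(alphabet[idx] for idx in collapsed if idx != blank)


# Hypothetical alphabet: index 0 is the blank, the rest are characters.
alphabet = ["", "a", "c", "t", " "]
# Per-frame argmax of the network output for a short utterance.
frames = [2, 2, 0, 1, 1, 0, 0, 3, 3, 3]
print(ctc_greedy_decode(frames, alphabet))  # cat
```

A blank between two identical labels keeps both of them, which is how CTC represents doubled letters.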


General requirements

Requirements:

  • The code should be stored in a public GitHub (or GitLab) repository

  • All the necessary packages should be listed in ./requirements.txt or be installed in a Dockerfile

  • All necessary resources (such as model checkpoints, LMs, and logs) should be downloadable with a script. Mention the script (or lines of code) in the README.md

  • You should implement all functions in test.py (for evaluation) so that one can reproduce your results

  • Basically, your test.py and train.py scripts should run without issues after running all commands in your installation guide.

  • Log everything that is useful: losses, data, learning rate, gradient norm, etc.

  • Provide the logs for the training of your final model from the start of the training. We strongly recommend using the W&B Reports feature.

  • Attach a brief report. It should cover:

    • How to reproduce your model? (example: train 50 epochs with config train_1.yaml and 50 epochs with train_2.yaml)
    • Attach training logs to show how fast your network trained
    • How did you train your final model?
    • What have you tried?
    • What worked and what didn't work?
    • What were the major challenges?

    Also attach a summary of all bonus tasks you've implemented.


Quality score

| Score | Dataset | CER (%) | WER (%) | Description |
|---|---|---|---|---|
| 1.0 | -- | -- | -- | At least you tried |
| 2.0 | LibriSpeech: test-clean | 50 | -- | Well, it's something |
| 3.0 | LibriSpeech: test-clean | 30 | -- | You can guess the target phrase if you try |
| 4.0 | LibriSpeech: test-clean | 20 | -- | It gets some words right |
| 5.0 | LibriSpeech: test-clean | -- | 40 | More than half of the words are looking fine |
| 6.0 | LibriSpeech: test-clean | -- | 30 | It's quite readable |
| 7.0 | LibriSpeech: test-clean | -- | 20 | Occasional mistakes |
| 8.0 | LibriSpeech: test-other | -- | 30 | Your network can handle somewhat noisy audio. |
| 8.5 | LibriSpeech: test-other | -- | 25 | Your network can handle somewhat noisy audio but it is still just close enough. |
| 9.0 | LibriSpeech: test-other | -- | 20 | Somewhat suitable for practical applications. |
| 10.0 | LibriSpeech: test-other | -- | 10 | Technically better than a human. Well done! |
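To check which row of the table a set of metrics reaches, a small lookup can help. The sketch below rests on two assumptions that the table does not state explicitly: each number is an upper bound in percent (a row is met when the metric is less than or equal to it), and each row is checked independently of the others. The function and key names are hypothetical.

```python
# Thresholds copied from the table: (score, metric key, maximum value in %).
THRESHOLDS = [
    (2.0, "cer_clean", 50), (3.0, "cer_clean", 30), (4.0, "cer_clean", 20),
    (5.0, "wer_clean", 40), (6.0, "wer_clean", 30), (7.0, "wer_clean", 20),
    (8.0, "wer_other", 30), (8.5, "wer_other", 25), (9.0, "wer_other", 20),
    (10.0, "wer_other", 10),
]


def quality_score(metrics):
    """Return the highest score whose threshold is met (assumes metric <= limit)."""
    score = 1.0  # the first row has no metric requirement
    for value, key, limit in THRESHOLDS:
        if key in metrics and metrics[key] <= limit:
            score = max(score, value)
    return score


# Metrics measured on test-clean only, in percent.
print(quality_score({"cer_clean": 12.0, "wer_clean": 28.5}))  # 6.0
```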

The dataset can be found here and on Kaggle.

Important

Use only the train partitions of LibriSpeech or Mozilla Common Voice, together with data augmentation techniques, to train your model.

To calculate the metrics, you can use the torchmetrics implementation of CER and WER.
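For reference, both metrics are an edit distance divided by the reference length: characters for CER, words for WER. The plain-Python sketch below only illustrates what is being measured and is not a replacement for the library implementation. The function names are hypothetical, and the target is assumed to be non-empty.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match at no cost)
            ))
        prev = curr
    return prev[-1]


def cer(target, prediction):
    """Character error rate as a fraction; assumes a non-empty target."""
    return edit_distance(target, prediction) / len(target)


def wer(target, prediction):
    """Word error rate as a fraction; assumes a non-empty target."""
    ref_words = target.split()
    return edit_distance(ref_words, prediction.split()) / len(ref_words)


target, prediction = "the cat sat", "the cat sit"
print(cer(target, prediction))  # one wrong character out of 11
print(wer(target, prediction))  # one wrong word out of 3
```

These functions return fractions, so multiply by 100 before comparing against the thresholds in the quality table.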

To save some coding time, you can use the HuggingFace datasets library. Look how easy it is:

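from datasets import load_dataset

# Config and split names depend on the dataset version; recent versions of
# "librispeech_asr" expose the "clean" config with a "train.360" split.
dataset = load_dataset("librispeech_asr", "clean", split="train.360")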

Optional tasks

  • Use an external language model for evaluation. The choice of an LM-fusion method is up to you. You may find this library helpful. Note: this part usually gives a very significant quality boost and can raise your score considerably, so we strongly recommend implementing it.
  • Use BPE tokens instead of characters. You can use SentencePiece, HuggingFace, or YouTokenToMe.
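For the language-model option above, one of the simplest fusion methods is n-best rescoring: take several hypotheses from beam search and re-rank them with a weighted sum of the acoustic score, the LM score, and a word-count bonus. The sketch below assumes this particular scheme; it is one possible choice, not the required one. The toy unigram table stands in for a real LM, and all names and weights are hypothetical.

```python
def rescore(hypotheses, lm_log_prob, alpha=0.5, beta=1.0):
    """Return the best hypothesis text under a fused score.

    hypotheses: list of (text, acoustic_log_prob) pairs, e.g. from beam search.
    lm_log_prob: callable returning the LM log-probability of a text.
    alpha, beta: LM weight and word-insertion bonus (tune them on a dev set).
    """
    def score(item):
        text, am_log_prob = item
        return am_log_prob + alpha * lm_log_prob(text) + beta * len(text.split())

    return max(hypotheses, key=score)[0]


# Toy stand-in for a real LM: a hypothetical table of unigram log-probabilities.
UNIGRAM = {"the": -1.0, "cat": -3.0, "sat": -3.5}
OOV_LOG_PROB = -10.0  # assumed penalty for out-of-vocabulary words


def toy_lm(text):
    return sum(UNIGRAM.get(word, OOV_LOG_PROB) for word in text.split())


nbest = [("the cat sad", -4.0), ("the cat sat", -4.5)]
print(rescore(nbest, toy_lm))             # the cat sat
print(rescore(nbest, toy_lm, alpha=0.0))  # the cat sad (LM switched off)
```

In this toy case the acoustic model alone prefers the misspelled hypothesis, and the LM term is what flips the ranking.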