Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web.
It enables transcription in multiple languages, as well as translation from those languages into English.
This notebook provides an easy-to-use interface for evaluating Whisper on audio recordings of text passages sampled from Hugging Face Datasets.
The notebook sets up a Gradio UI that allows the user to:
- Sample text passages from any Dataset hosted on the Hugging Face Hub
- Record an audio snippet narrating the text
- Transcribe the audio with Whisper
- Save the audio, the transcribed and reference text, and the word error rate to Comet for further evaluation and analysis (a rough sketch of the transcription and scoring step follows below)
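At its core, the transcription-and-scoring step amounts to running Whisper on the recording and comparing the output against the sampled passage. The following is a minimal sketch of that step, assuming the `openai-whisper` and `jiwer` packages; the model size, recording file name, and reference text are placeholders, not the notebook's exact code.

```python
import whisper  # openai-whisper
from jiwer import wer

# Load a Whisper checkpoint; "base" is a placeholder, any model size works.
model = whisper.load_model("base")

# Transcribe the recorded narration (hypothetical file name).
result = model.transcribe("recording.wav", language="en")
transcribed_text = result["text"]

# Score the transcription against the sampled reference passage.
reference_text = "The quick brown fox jumps over the lazy dog."
score = wer(reference_text, transcribed_text)
print(f"WER: {score:.3f}")
```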
To use the Evaluation UI, you will need a Comet account.
The UI has the following input components:
- The Dataset Name. This is the root name of the text dataset on the Hugging Face Hub.
- Subset. Certain datasets on the Hub are divided into subsets. Use this field to identify the subset; if the dataset has no subset, you can leave this field blank.
- Split. Which split of the dataset to sample from, e.g. `train`, `test`, etc.
- Seed. Set the random seed for sampling. If the seed isn't set, a random one is automatically generated.
- Audio. Once you have sampled a text passage, hit the record button to record yourself narrating the passage.
Finally, hit the Transcribe button. That's it! All the relevant data will be logged to Comet for further analysis.
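For reference, here is a minimal sketch of how the Dataset Name, Subset, Split, and Seed inputs could map onto sampling a passage with the Hugging Face `datasets` library; the dataset name, split, and text column below are purely illustrative, and this is not the notebook's exact code.

```python
import random
from datasets import load_dataset

def sample_passage(dataset_name, subset, split, seed=None, text_column="text"):
    # If no seed is given, generate one at random, mirroring the UI's behaviour.
    if seed is None:
        seed = random.randint(0, 2**32 - 1)
    # The subset (config name) is optional; pass None when the dataset has no subsets.
    dataset = load_dataset(dataset_name, subset or None, split=split)
    # Shuffle with the seed and take the first row as the sampled passage.
    sample = dataset.shuffle(seed=seed)[0]
    return sample[text_column], seed

# Example: sample a passage from an illustrative dataset's test split.
passage, seed = sample_passage("imdb", None, "test", seed=42)
```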
The Evaluation UI will log the following data to a Comet Experiment (a sketch of the corresponding logging calls follows the list).
You can check out an example project here.
- Model Type (tiny, base, small, medium, etc.)
- Beam Search Width
- Model Language
- Dataset Name
- Dataset Subset
- Column in the Dataset that contains the text
- Split (`train`, `test`, etc.) of the dataset
- Seed used to sample the text passage from the dataset
- Sample Text Length
- Word Error Rate Score
- Audio Snippet of the narrated text
- Reference/Sampled Text
- Transcribed Text
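As a rough illustration of what the logging step might look like with the `comet_ml` SDK, the sketch below logs placeholder values for each of the fields above; the project name, file name, and all values are assumptions rather than the notebook's exact code.

```python
from comet_ml import Experiment

# Placeholder values standing in for the outputs of the sampling and transcription steps.
reference_text = "The quick brown fox jumps over the lazy dog."
transcribed_text = "the quick brown fox jumps over the lazy dog"
wer_score = 0.0

experiment = Experiment(project_name="whisper-evaluation")  # project name is a placeholder

# Model and sampling configuration logged as parameters.
experiment.log_parameters({
    "model_type": "base",
    "beam_width": 5,
    "model_language": "en",
    "dataset_name": "imdb",
    "dataset_subset": None,
    "text_column": "text",
    "split": "test",
    "seed": 42,
    "sample_text_length": len(reference_text),
})

# Word error rate logged as a metric; the audio snippet and both texts logged alongside it.
experiment.log_metric("wer", wer_score)
experiment.log_audio("recording.wav")  # hypothetical recording file
experiment.log_text(reference_text, metadata={"type": "reference"})
experiment.log_text(transcribed_text, metadata={"type": "transcription"})

experiment.end()
```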