Metrics

This is a list of possible metrics for assessing STT providers, in no particular order. Some apply only to streams or only to files. Some are useful when analysing a single asset; others are meaningful only across a large and varied data set. All require further detail.

  1. Word Error Rate (see the first sketch after this list)
  2. Weighted WER (forthcoming?)
  3. NER (subjective)
  4. Timing accuracy
  5. Speaker change (voice changes)
  6. Speaker identification (words matched with speaker)
  7. Voice recognition (voice matched to real world person)
  8. Repeatability of results
  9. Punctuation: sentence boundaries, semantic phrases
  10. Capitalisation: sentence structure and proper nouns
  11. Ratio of processing to duration
  12. Stream vs file performance
  13. Accuracy of initial ‘partials’ (live)
  14. Latency of word recognition (live)
  15. Latency/accuracy ratio (where configurable)
  16. Growing results lookup distance (live)
  17. Tolerance of noise
  18. Performance on different accent groups
  19. Tolerance of uncommon vocabulary
  20. Some sense of how errors are distributed: are they peppered throughout, or clustered together?
  21. The gap between reported confidence and actual accuracy, both in aggregate (average confidence vs. measured accuracy) and at the per-word level (see the second sketch after this list). This is useful for automatically steering people towards passages that need correction, or for quickly ruling files in or out as suitable for ASR at all.
  22. How much performance improves when a custom vocabulary list is supplied (this is a bit fiddly, though, as it may end up tuned to quirks of particular speech engines)
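
Below is a minimal sketch of the Word Error Rate calculation from item 1: WER = (substitutions + deletions + insertions) / reference word count, computed via word-level edit distance. It assumes the reference and hypothesis transcripts have already been normalised (lowercased, punctuation stripped) and can be split on whitespace; the function name and that normalisation step are illustrative assumptions, not part of any provider's tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER between two pre-normalised, whitespace-tokenised transcripts."""
    ref = reference.split()
    hyp = hypothesis.split()

    # Dynamic-programming (Levenshtein) edit distance over words:
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ~0.167, one deletion
```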
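
The second sketch illustrates item 21: comparing the provider's mean reported confidence with measured per-word accuracy. It assumes the per-word confidences have already been aligned against the reference so each word carries a correct/incorrect label; that input shape is a hypothetical convenience, not a real provider output format.

```python
def confidence_accuracy_gap(words):
    """words: list of (confidence, is_correct) pairs for one asset.

    Returns mean reported confidence minus measured accuracy;
    a positive value means the engine is over-confident.
    """
    if not words:
        return 0.0
    mean_confidence = sum(conf for conf, _ in words) / len(words)
    accuracy = sum(1 for _, ok in words if ok) / len(words)
    return mean_confidence - accuracy


print(confidence_accuracy_gap([(0.9, True), (0.8, False), (0.95, True)]))  # ~0.217
```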