Metrics

This is a list of possible metrics for assessing STT providers, in no particular order. Some apply only to streams or only to files. Some are useful when analysing a single asset; others are meaningful only across a large and varied data set. All require further detail.

  1. Word Error Rate (see the first sketch after this list)
  2. Weighted WER (forthcoming?)
  3. NER (subjective)
  4. Timing accuracy
  5. Speaker change (voice changes)
  6. Speaker identification (words matched with speaker)
  7. Voice recognition (voice matched to real world person)
  8. Repeatability of results
  9. Punctuation: sentence boundaries, semantic phrases
  10. Capitalisation: sentence structure and proper nouns
  11. Ratio of processing to duration
  12. Stream vs file performance
  13. Accuracy of initial ‘partials’ (live)
  14. Latency of word recognition (live)
  15. Latency/accuracy ratio (where configurable)
  16. Growing results lookup distance (live)
  17. Tolerance of noise
  18. Performance on different accent groups
  19. Tolerance of uncommon vocabulary
  20. Some sense of how errors are distributed: are they peppered throughout, or clustered together?
  21. The gap between reported confidence and actual accuracy, both in aggregate (average confidence vs. measured accuracy) and at the per-word level (see the second sketch after this list). This is useful for automatically steering people towards passages that need correction, or for quickly ruling files in or out as suitable for ASR at all.
  22. How much performance improves when a custom vocabulary list is supplied (this is a bit fiddly, though, as it may end up tuned to quirks of particular speech engines)
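
Below is a minimal sketch of the Word Error Rate calculation from item 1: WER = (substitutions + deletions + insertions) / reference word count, computed via word-level edit distance. It assumes the reference and hypothesis transcripts have already been normalised (lowercased, punctuation stripped) and can be split on whitespace; the function name and that normalisation step are illustrative assumptions, not part of any provider's tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER between two pre-normalised, whitespace-tokenised transcripts."""
    ref = reference.split()
    hyp = hypothesis.split()

    # Dynamic-programming (Levenshtein) edit distance over words:
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ~0.167, one deletion
```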
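
The second sketch illustrates item 21: comparing the provider's mean reported confidence with measured per-word accuracy. It assumes the per-word confidences have already been aligned against the reference so each word carries a correct/incorrect label; that input shape is a hypothetical convenience, not a real provider output format.

```python
def confidence_accuracy_gap(words):
    """words: list of (confidence, is_correct) pairs for one asset.

    Returns mean reported confidence minus measured accuracy;
    a positive value means the engine is over-confident.
    """
    if not words:
        return 0.0
    mean_confidence = sum(conf for conf, _ in words) / len(words)
    accuracy = sum(1 for _, ok in words if ok) / len(words)
    return mean_confidence - accuracy


print(confidence_accuracy_gap([(0.9, True), (0.8, False), (0.95, True)]))  # ~0.217
```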