Comparing Speech Recognition Engines

Caleb Bassi edited this page Jan 1, 2021 · 1 revision

Google Cloud Speech-to-Text

Pros

  • is great for speech mode
    • uses context-based inference
    • has a large vocabulary
  • supports a lot of languages
  • is very lightweight on resources since it runs on a remote server

Cons

  • proprietary
  • isn't free
  • uses a remote server
    • requires an internet connection
    • adds noticeable latency
  • isn't great for command mode
    • doesn't allow you to specify a command graph
      • allows you to specify preferred phrases, which helps but isn't good enough
    • uses context-based inference
      • certain keywords have a harder time being picked up depending on the context
      • gives non-deterministic results based on the context
    • gets a little pricey with heavy use
  • at the mercy of Google since it's service-based
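
The preferred phrases mentioned above are what the Speech-to-Text v1 REST API calls speech contexts. Below is a minimal sketch of a request body, assuming the v1 `speech:recognize` JSON format. The helper name `build_recognize_body` and the sample phrases are hypothetical, and the sketch only builds the payload without sending it.

```python
import base64
import json


def build_recognize_body(audio_bytes, phrases, language_code="en-US"):
    """Build a JSON-serializable body for a v1 speech:recognize request.

    Hypothetical helper for illustration. Encoding and sample rate are
    omitted here; headerless audio would need them in "config" as well.
    """
    return {
        "config": {
            "languageCode": language_code,
            # Phrase hints bias recognition toward these phrases; they do not
            # restrict the output to them the way a command graph would.
            "speechContexts": [{"phrases": list(phrases)}],
        },
        "audio": {"content": base64.b64encode(audio_bytes).decode("ascii")},
    }


body = build_recognize_body(b"\x00\x01", ["scroll down", "new tab"])
print(json.dumps(body["config"], indent=2))
```

Because the phrases are only hints, the engine can still return words outside the list, which is why this falls short for command mode.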

Kaldi

An older but still up-to-date speech recognition engine built as a DNN-HMM hybrid.

Pros

  • there are a lot of models to choose from
  • works great for command mode
    • can dynamically set the grammar
  • the medium-size models work pretty well for speech mode
    • the medium-size models only use about 1 GB of memory
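
To illustrate what a dynamically set command grammar buys you, here is a toy sketch in plain Python. It is not Kaldi's API: Kaldi works with finite-state transducers, while this sketch uses a word-level trie, and the names `build_grammar` and `match` are hypothetical. The point is only that output is restricted to known commands and that the grammar can be swapped at runtime.

```python
def build_grammar(commands):
    """Turn a list of command phrases into a word-level trie."""
    root = {}
    for phrase in commands:
        node = root
        for word in phrase.split():
            node = node.setdefault(word, {})
        node[None] = phrase  # None marks the end of a complete command
    return root


def match(grammar, words):
    """Return the command spelled by `words`, or None if it is not in the grammar."""
    node = grammar
    for word in words:
        if word not in node:
            return None
        node = node[word]
    return node.get(None)


# The grammar can be rebuilt at any time, e.g. when the active application changes.
editor = build_grammar(["save file", "close tab", "close window"])
browser = build_grammar(["new tab", "close tab"])

print(match(editor, ["close", "window"]))   # close window
print(match(browser, ["close", "window"]))  # None
```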

Cons

  • the tiny models are missing too much vocabulary for speech mode
  • the large models take up a lot of memory, roughly 2 to 3.5 GB
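
The memory figures above suggest choosing a model size from the memory you can spare. Here is a rough sketch using only the numbers on this page. The function name `pick_model` and the 0.5 GB headroom are assumptions, and no memory figure is given here for the tiny models, so they serve as the fallback.

```python
# Approximate memory use in GB, taken from the figures on this page.
# The large figure uses the upper end of the 2 to 3.5 GB range.
MODEL_MEMORY_GB = {"medium": 1.0, "large": 3.5}


def pick_model(free_memory_gb, headroom_gb=0.5):
    """Pick the largest tier that fits; fall back to "tiny" otherwise."""
    usable = free_memory_gb - headroom_gb  # headroom is an assumed safety margin
    fitting = [name for name, gb in MODEL_MEMORY_GB.items() if gb <= usable]
    if not fitting:
        return "tiny"
    return max(fitting, key=MODEL_MEMORY_GB.get)


print(pick_model(2.0))  # medium
print(pick_model(8.0))  # large
print(pick_model(1.0))  # tiny
```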

wav2letter

A newer DNN speech recognition engine from Facebook.

Pros

Cons

  • seems a little unapproachable from an end-user perspective
    • seems mainly tailored to researchers
    • there doesn't seem to be a nice and simple Python API
  • impractical to train models as a regular user since it needs a lot of GPUs

Picovoice Cheetah

Pros

  • very lightweight

Cons

  • has a weird license
  • doesn't give great results

Mozilla DeepSpeech

A newer DNN speech recognition engine.

Pros

Cons

  • doesn't give the best results

PocketSphinx

Deprecated in favor of using Kaldi with a lightweight model like one of these.

Dragon Dictation

Pros

Cons

  • proprietary
  • only works on Windows and macOS
    • it's possible to run it in a VM or on another computer and stream the results, but this can be difficult to set up
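
One way the streaming part could look is newline-delimited JSON over a TCP socket. This is only a sketch of the transport: it doesn't show how to get text out of Dragon, the message format and the names `send_result` and `iter_results` are hypothetical, and a local socket pair stands in for the VM-to-host connection.

```python
import json
import socket


def send_result(sock, text):
    """Send one recognized phrase as a newline-terminated JSON message."""
    sock.sendall(json.dumps({"text": text}).encode("utf-8") + b"\n")


def iter_results(sock):
    """Yield recognized phrases until the sender closes the connection."""
    with sock.makefile("r", encoding="utf-8") as stream:
        for line in stream:
            yield json.loads(line)["text"]


# Demo with a local socket pair standing in for the VM-to-host connection.
vm_side, host_side = socket.socketpair()
send_result(vm_side, "open terminal")
send_result(vm_side, "list files")
vm_side.close()  # closing the sender ends the stream on the host side

received = list(iter_results(host_side))
host_side.close()
print(received)
```

In a real setup the VM side would connect to a listening socket on the host, and the difficulty mentioned above (networking between guest and host, keeping both ends running) still applies.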