Whisper Edge

Porting OpenAI Whisper speech recognition to edge devices with hardware ML accelerators, enabling always-on live voice transcription. Current work includes Jetson Nano and Coral Edge TPU.

Jetson Nano


Shopping cart

Part                                                  Price (2023)
NVIDIA Jetson Nano Developer Kit (4GB)                $149.00
ChanGeek CGS-M1 USB Microphone                        $16.99
Noctua NF-A4x10 5V Fan (or similar, recommended)      $13.95
D-Link DWA-181 Wi-Fi Adapter (or similar, optional)   $21.94

Model

The base.en version of Whisper seems to work best for the Jetson Nano:

  • base is the largest model size that fits into the 4GB of memory without modification.
  • Inference performance with base is ~10x real-time in isolation and ~1x real-time while recording concurrently.
  • Using the English-only .en version further improves WER (<5% on LibriSpeech test-clean); a minimal loading sketch follows this list.
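For reference, the core model usage with the openai-whisper Python package looks roughly like the sketch below (inside the container on the Nano, or on any machine with a recent Python); the file name is a placeholder, e.g. the test.wav produced in the Troubleshooting section.

import whisper

# Load the English-only base model (~139 MB download, fits in the Nano's 4 GB of memory).
model = whisper.load_model("base.en")

# Transcribe a short recording; "test.wav" is a placeholder file name.
result = model.transcribe("test.wav", language="en")
print(result["text"])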

Hack

Dilemma:

  • Whisper and some of its dependencies require Python 3.8.
  • The latest supported version of JetPack for Jetson Nano is 4.6.3, which is on Python 3.6.
  • No easy way to update Python to 3.8 without losing CUDA support for PyTorch.

Workaround:

Setup

USB Serial

Attach the Jetson Nano to your computer via USB and get a shell, e.g. with screen on Linux:

screen /dev/ttyUSB0 115200

Or with PuTTY on Windows.

You'll be prompted to log in with the default credentials:

login: alex
password: arribada

SSH

First, follow the developer kit setup instructions, connect the Wi-Fi adapter and the microphone to USB, and ideally install a fan. (Plugging in an Ethernet cable also helps to make the downloads faster.) Then, get a shell on the Jetson Nano, replacing the placeholder with its address on your network:

ssh alex@<jetson-nano-address>

Build

For the demo, the container should already be built. You can skip this step and proceed to Run.

We will use NVIDIA Docker containers to run inference. Get the source code and build the custom container:

git clone https://github.com/arribada/whisper-edge-demo.git whisper-edge-arribada
bash whisper-edge-arribada/build.sh

Run

Launch inference:

bash whisper-edge-arribada/run.sh

You should see console output similar to this:

I0317 00:42:23.979984 547488051216 stream.py:75] Loading model "base.en"...
100%|#######################################| 139M/139M [00:30<00:00, 4.71MiB/s]
I0317 00:43:14.232425 547488051216 stream.py:79] Warming model up...
I0317 00:43:55.164070 547488051216 stream.py:86] Starting stream...
I0317 00:44:19.775566 547488051216 stream.py:51]
I0317 00:44:22.046195 547488051216 stream.py:51] 
I0317 00:44:49.219501 547488051216 stream.py:51] Start speaking now to see the transcription!

Below is a sample script to read aloud for demoing the transcription in real time:

As the sun set, I couldn't help but admire the dolphins jumping out of the water, with seagulls flying overhead. 
It's a beautiful scene, but there's a problem on my mind: bycatch. 
You see, I'm a fisherman, and my family depends on our daily catch. 
But sometimes, our nets unintentionally trap dolphins, whales, and other creatures, instead of the sharks and seals we're targeting.

The demo will highlight the keywords programmed into it in green as they appear in the transcription.
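A minimal sketch of how such keyword highlighting can be done with ANSI escape codes is shown below; the keyword list and the highlight helper are illustrative, not taken from stream.py.

import re

# Illustrative keyword list; the actual keywords are configured in the demo.
KEYWORDS = ["dolphins", "whales", "bycatch"]

GREEN = "\033[32m"
RESET = "\033[0m"

def highlight(text):
    # Wrap each known keyword in ANSI green so it stands out in the console.
    for word in KEYWORDS:
        pattern = r"\b(" + re.escape(word) + r")\b"
        text = re.sub(pattern, GREEN + r"\1" + RESET, text, flags=re.IGNORECASE)
    return text

print(highlight("Our nets unintentionally trap dolphins, whales, and other creatures."))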

The stream.py script run inside the container accepts flags for different configurations (the defaults should work for the demo):

bash whisper-edge-arribada/run.sh --help

       USAGE: stream.py [flags]
flags:

stream.py:
  --channel_index: The index of the channel to use for transcription.
    (default: '0')
    (an integer)
  --chunk_seconds: The length in seconds of each recorded chunk of audio.
    (default: '10')
    (an integer)
  --input_device: The input device used to record audio.
    (default: 'plughw:2,0')
  --language: The language to use or empty to auto-detect.
    (default: 'en')
  --latency: The latency of the recording stream.
    (default: 'low')
  --model_name: The version of the OpenAI Whisper model to use.
    (default: 'base.en')
  --num_channels: The number of channels of the recorded audio.
    (default: '1')
    (an integer)
  --sample_rate: The sample rate of the recorded audio.
    (default: '16000')
    (an integer)

Try --helpfull to get a list of all flags.
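For orientation, here is a rough sketch of how these flags plausibly fit together in the recording-and-transcription loop; it assumes the sounddevice and numpy packages and is not the actual stream.py implementation.

import numpy as np
import sounddevice as sd
import whisper

# Default flag values from the help output above.
MODEL_NAME = "base.en"
LANGUAGE = "en"
INPUT_DEVICE = "plughw:2,0"
SAMPLE_RATE = 16000
NUM_CHANNELS = 1
CHANNEL_INDEX = 0
CHUNK_SECONDS = 10
LATENCY = "low"

model = whisper.load_model(MODEL_NAME)

# Record fixed-length chunks from the microphone and transcribe each one in turn.
with sd.InputStream(device=INPUT_DEVICE, samplerate=SAMPLE_RATE,
                    channels=NUM_CHANNELS, latency=LATENCY,
                    dtype="float32") as stream:
    while True:
        audio, _ = stream.read(CHUNK_SECONDS * SAMPLE_RATE)
        chunk = np.ascontiguousarray(audio[:, CHANNEL_INDEX])
        result = model.transcribe(chunk, language=LANGUAGE)
        print(result["text"].strip())

Flags can presumably be passed through run.sh in the same way as --help above, for example to change the input device or chunk length.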

Troubleshooting

To see if the microphone is working properly, use alsa-utils:

sudo apt-get -y install alsa-utils

# Is the USB device connected?
lsusb

# Is the correct recording device selected?
arecord -l

# Is the gain set properly?
alsamixer

# Does a test recording work?
arecord --format=S16_LE --duration=5 --rate=16000 --channels=1 --device=plughw:2,0 test.wav
