Possible to use for real-time / streaming tasks? #2

davidhariri · 2022-09-21T16:50:01Z

davidhariri
Sep 21, 2022

Is it possible to use whisper for streaming tasks (with syntax)? For example, would it be possible for whisper to be bound to a websocket of streaming PCM data packets?

jongwook · 2022-09-21T16:52:46Z

jongwook
Sep 21, 2022
Maintainer

It doesn't support real-time per se, but you could build something similar by e.g. incrementally transcribing the audio every second.
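
A minimal sketch of that idea, assuming 16-bit little-endian mono PCM packets at 16 kHz (for example, arriving over a websocket). The `transcribe` callable is a hypothetical hook that you would wire to a Whisper model yourself; it is not part of the whisper package, and the class name is made up for illustration.

```python
import struct
from typing import Callable, List, Optional

SAMPLE_RATE = 16_000  # Whisper models work on 16 kHz mono audio


def pcm16_to_float(packet: bytes) -> List[float]:
    """Convert little-endian 16-bit PCM bytes to floats in [-1.0, 1.0)."""
    count = len(packet) // 2
    ints = struct.unpack(f"<{count}h", packet[: count * 2])
    return [s / 32768.0 for s in ints]


class IncrementalTranscriber:
    """Buffers incoming audio and re-transcribes the whole buffer once per interval."""

    def __init__(self, transcribe: Callable[[List[float]], str], interval_s: float = 1.0):
        self.transcribe = transcribe  # hypothetical hook, e.g. a wrapper around a Whisper model
        self.interval = int(interval_s * SAMPLE_RATE)
        self.buffer: List[float] = []
        self.pending = 0  # samples received since the last transcription

    def feed(self, packet: bytes) -> Optional[str]:
        """Add one PCM packet; return fresh text once an interval has elapsed, else None."""
        samples = pcm16_to_float(packet)
        self.buffer.extend(samples)
        self.pending += len(samples)
        if self.pending < self.interval:
            return None
        self.pending = 0
        return self.transcribe(self.buffer)
```

Note that the buffer grows without bound here, so each pass gets slower as the audio gets longer; in practice you would trim or window it.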

16 replies

codeyourwayup Apr 1, 2023

how is the performance?

juanmc2005 Apr 13, 2023

I recently wrote a tutorial for streaming speaker-colored transcriptions combining whisper and diart (based on pyannote.audio models and supporting websockets).

I'm sure many improvements can be made 😃 Here's a demo:

good1.1.mp4

JennPlothow Aug 5, 2023

.js-local-storage-resumable
proton_recovery_phrase.txt

FOREVEREALIZE Sep 2, 2023

@JennPlothow
Hey, uh, that contains your Proton Mail recovery phrase, I believe. You shouldn't post that.

PabloLION Oct 9, 2023

Is the character @JennPlothow generated by someone with AI? Looks like someone is faking a Brazilian girl:

I saw this on his/her LinkedIn:
I will give you text content, you will rewrite it and output that in a re-worded version of my text. Reword the text to convey the same meaning using different words and sentence structures. Avoiding plagiarism, improving the flow and readability of the text, and ensuring that the re-written content is unique and original. Keep the tone the same.
Keep the meaning the same. Make sure the re-written content's number of characters is exactly the same as the original text's number of characters. Do not alter the original structure and formatting outlined in any way. Only give me the output and nothing else.
Now, using the concepts above, re-write the following text. Respond in Portuguese:

dangolbeeker · 2022-09-22T07:57:59Z

dangolbeeker
Sep 22, 2022

Hello everybody, I came across this API with the exact same thing in mind, and I created an organization for it: iInterpret. The vision is an app that can translate speech in real time for phone calls or in-person communication, maybe with a Bluetooth piece. This would be a blessing for some of my business in China, going from Mandarin to English seamlessly. If this interests you or anyone else reading this, please join the organization and feel free to reach out to me via [email protected]
https://github.com/Iinterpret

11 replies

Geczy Jun 15, 2023

@jojeyh can you share your repo

osasisorae Oct 16, 2023

@jojeyh I would appreciate seeing your repo.

rahulbansal16 Jul 19, 2024

If using React, I was able to accomplish this roughly using the voice activity detector npm module @ricky0123/vad-react. It breaks up speech segments based on VAD and then sends audio chunk to Whisper API. So when a speaker starts talking the npm module begins recording the audio, when speaker stops and there is silence it will call onSpeechEnd with audio chunk, even has utils function that encodes into wav before sending to Whisper API.

So how does the sentence between two silences affect accuracy? Does having shorter sentences decrease the accuracy?

@joorjeh

foomprep Jul 20, 2024

@rahulbansal16 It has been a while since I used this package, but I believe the npm module in question has a minimum duration required to accept a chunk as a valid speech segment; otherwise it treats it as a spurious detection and ignores it.

foomprep Jul 20, 2024

@rahulbansal16 In general, shorter audio segments in ASR will necessarily be less accurate as they have less signal (read context) to determine the right text. You might want to check out resources at https://github.com/ufal/whisper_streaming. Usually for a streaming situation you use a rolling window and update the text as you move the window.
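
A rough sketch of the rolling-window idea, not the implementation in ufal/whisper_streaming. It assumes 16 kHz float samples and a hypothetical `transcribe` callable that you supply. The `stable_prefix` helper is an illustrative assumption too: it treats the words that two consecutive passes agree on as unlikely to change, and everything after them as tentative.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

SAMPLE_RATE = 16_000


def stable_prefix(previous: str, current: str) -> str:
    """Leading words that two consecutive hypotheses agree on."""
    agreed = []
    for a, b in zip(previous.split(), current.split()):
        if a != b:
            break
        agreed.append(a)
    return " ".join(agreed)


class RollingWindowTranscriber:
    """Keeps only the last `window_s` seconds of audio and re-transcribes on every step."""

    def __init__(self, transcribe: Callable[[List[float]], str], window_s: float = 10.0):
        self.transcribe = transcribe  # hypothetical hook around your ASR model
        self.window: Deque[float] = deque(maxlen=int(window_s * SAMPLE_RATE))
        self.previous = ""

    def step(self, new_samples: List[float]) -> Tuple[str, str]:
        """Return (latest hypothesis, part that matches the previous hypothesis)."""
        self.window.extend(new_samples)  # oldest samples fall off automatically
        current = self.transcribe(list(self.window))
        stable = stable_prefix(self.previous, current)
        self.previous = current
        return current, stable
```

A UI could render the stable part normally and the rest of the hypothesis greyed out, replacing it on each step.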

bquast · 2022-09-22T08:23:22Z

bquast
Sep 22, 2022

Thank you, colleagues. Working at the International Telecommunication Union (the United Nations specialized agency for ICT), we would really like this for our meetings.

ps. the title of the org for me is quite hard to read, 'l then i'

1 reply

dangolbeeker Sep 22, 2022

Awesome, I look forward to working with you. I took your suggestion and have renamed it to myLinguist.
The name's not set in stone, so I'm open to other suggestions if you have a better name in mind.

ggerganov · 2022-10-02T16:44:04Z

ggerganov
Oct 2, 2022

Here is another attempt for real-time streaming:

rt_esl_csgo_1.mp4

This is using small.en on MacBook M1 Pro with 3 seconds audio step. Runs entirely on the CPU.
The repo is here: https://github.com/ggerganov/whisper.cpp

4 replies

chidiwilliams Oct 5, 2022

Such a cool demo. I made a similar real-time app (+ a GUI) by splitting the audio into chunks.

Demo: https://www.loom.com/share/564b753eb4d44b55b985b8abd26b55f7?t=34

Repo: https://github.com/chidiwilliams/buzz

ggerganov Oct 10, 2022

And here is another example with even more "real-time" transcription using base.en:

rt_esl_csgo_2.mp4

abdulhadyabas2 Mar 9, 2023

link code please.

odinho Mar 13, 2023

link code please.

It was in the main post.
Here you go: https://github.com/ggerganov/whisper.cpp/tree/master/examples/stream

saharmor · 2022-10-06T04:38:39Z

saharmor
Oct 6, 2022

I've built Whisper Playground for developers to easily build real-time speech2text web apps
https://github.com/saharmor/whisper-playground

Whisper.Playground.mp4

4 replies

ibnbayo Dec 22, 2022

"Building wheel for pyaudio (pyproject.toml) ... error"
This is the error I keep getting. I've tried installing it manually using brew, but it still didn't work.

deadlinecode Apr 25, 2023

@ibnbayo
You need to install the dev version of pyaudio/portaudio

shekharlilly Jul 5, 2023

no module named whisper. i'm using python 3.10.7

makaveli10 Jul 6, 2023

@shekharlilly try this, it works in chrome extension, Mozilla and also from microphone
https://github.com/collabora/whisper-live

chaima-bd · 2022-11-09T15:39:17Z

chaima-bd
Nov 9, 2022

I installed Whisper with 'pip install git+https://github.com/openai/whisper.git'.
But when I work in PyCharm or VS Code I get this error: "ModuleNotFoundError: No module named 'whisper'"
Any idea how to solve this problem, please?

1 reply

deadlinecode Apr 25, 2023

Maybe check your Python version.
3.10 worked for me.

appvoid · 2022-11-15T18:59:42Z

appvoid
Nov 15, 2022

You can check my project: https://github.com/appvoid/vosper
It uses vosk for VAD and user feedback, while using Whisper in the background for the actual transcription.

0 replies

nyadla-sys · 2022-12-06T21:52:42Z

nyadla-sys
Dec 6, 2022

Please see my project below, which uses the Whisper Tiny TFLite model to implement audio streaming.
Using the TFLite model to Stream

0 replies

zanjabil2502 · 2023-01-23T14:57:02Z

zanjabil2502
Jan 23, 2023

If you want to reduce transcription processing time when using Whisper for streaming, you can call the Whisper decoder directly to get only the transcription tokens, then decode them with the tokenizer.

The audio buffer from a streaming chunk is shorter than 30 seconds, and Whisper's transcribe function applies temperature, logprob, and other probability-based checks to get the best result. That process needs more iterations, which means it takes longer.

9 replies

zanjabil2502 Feb 25, 2023

You can use Whisper C-Translate. With 1 worker, this model has an RTF of 0.05-0.06 for the large model; the RTF will increase with multiple concurrent requests. For the large model and 5 minutes of audio, I get an RTF of 0.6 with 15 concurrent threads. Only 3.5 GB of GPU memory is used when you set 1 thread in the parameter, and CPU usage can reach 15 to 50 cores. You can limit it to a minimum of 2 cores; if you set only 1 core, the process will take longer than 30 seconds.
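
For anyone unfamiliar with the term: RTF (real-time factor) is processing time divided by audio duration, so values below 1 mean faster than real time. A small sketch of the arithmetic behind the figures quoted above (the function names are made up for illustration):

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF = time spent transcribing / duration of the audio."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s


def processing_time(rtf: float, audio_s: float) -> float:
    """Expected transcription time in seconds for a given RTF."""
    return rtf * audio_s


five_minutes = 5 * 60  # seconds of audio
single_worker_s = processing_time(0.05, five_minutes)   # roughly 15 seconds
fifteen_threads_s = processing_time(0.6, five_minutes)  # roughly 180 seconds
```

So a 5-minute file at RTF 0.05 takes on the order of 15 seconds, and at RTF 0.6 about 3 minutes, which is still faster than real time.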

zanjabil2502 Feb 25, 2023

This is the repo for Whisper C-Translate:
Whisper C-Translate

chobe Feb 25, 2023

This is very helpful. Thanks

Mijawel Mar 1, 2023

When you use CTranslate2 and Silero, are you simply waiting for an audio gap, then sending the audio to faster_whisper for transcription using the model.transcribe function?

Or are you using a different (maybe lower-level, like generate) function in CTranslate2 / faster_whisper?

zanjabil2502 Mar 2, 2023

@Mijawel I wait for an audio segment from Silero VAD. Silero VAD creates an audio segment from the audio buffer, and that segment is transcribed using the CTranslate2 model.transcribe.

For the VAD, I use the timestamp method from Silero VAD. Incoming audio chunks are buffered and fed into the VAD, which produces a list of timestamps. When the length of the list changes, you can take the first entry of the timestamp list and pass it to the transcription model.
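
One way to read the buffering scheme described above, as a sketch only. `detect_speech` and `transcribe` are hypothetical callables standing in for the VAD and the ASR model; `detect_speech` is assumed to return a list of `(start, end)` sample indices for the speech segments found in the buffer. The sketch also assumes that a second segment appearing means the first one is complete.

```python
from typing import Callable, List, Optional, Tuple

Segment = Tuple[int, int]  # (start, end) sample indices into the buffer


class VadSegmenter:
    """Buffers audio and transcribes the first speech segment once a later one appears."""

    def __init__(
        self,
        detect_speech: Callable[[List[float]], List[Segment]],  # hypothetical VAD hook
        transcribe: Callable[[List[float]], str],                # hypothetical ASR hook
    ):
        self.detect_speech = detect_speech
        self.transcribe = transcribe
        self.buffer: List[float] = []

    def feed(self, chunk: List[float]) -> Optional[str]:
        """Add a chunk; return text when a finished segment is available, else None."""
        self.buffer.extend(chunk)
        segments = self.detect_speech(self.buffer)
        if len(segments) < 2:
            return None  # the only segment (if any) may still be growing
        start, end = segments[0]
        text = self.transcribe(self.buffer[start:end])
        self.buffer = self.buffer[end:]  # drop the audio that was just transcribed
        return text
```

A real setup would probably also flush the last segment after a timeout, since this version only emits text once the speaker has started a new segment.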

chengsokdara · 2023-03-11T05:03:56Z

chengsokdara
Mar 11, 2023

This is my take for React.js

useWhisper React hook can now do real-time transcription.

Repo:
https://github.com/chengsokdara/use-whisper

Demo: (Whisper can't seem to understand my accent 😅)

use-whisper-real-time-transcription.mp4

2 replies

FerLuisxd Jun 11, 2023

This is using the openai whisper api, right?

mustafatofur Oct 14, 2023

@FerLuisxd you might want to consider using the Speech Recognition API for this. However, please note that not all browsers support it. You can find more information here: https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecognition

diyism · 2023-04-02T05:28:19Z

diyism
Apr 2, 2023

In essence, what we ultimately need is a real-time syllable recognition engine with mechanical-keyboard precision,
because we can send the syllable sequence to an LLM, for example ChatGPT-4,
and the LLM will give us the best understanding and translation of the syllable sequence in the world.

For example, for Mandarin, I get the syllables "ni3 hao3 ren2 men2 zong1 guo2 han4 zi4" and send them to ChatGPT-4,
and ChatGPT-4 can respond with the right characters based on the best semantic analysis.

If we have a concise real-time syllable recognition engine, LLMs will replace the entire speech recognition industry.

ref: "Transcribe to IPA" is very important for realtime interaction application #318 (comment)
ref: [Feature Request] a concise model only output IPA/Pinyin syllables k2-fsa/sherpa-ncnn#177

1 reply

diyism Apr 21, 2023

The post-app world requires a syllable-based recognition engine with mechanical-keyboard precision and a speed faster than human pronunciation.

Daniel Povey, the developer of Kaldi 2, believes that a "Proactive Mandarin Syllable Recognition Engine" can't be achieved, but I still feel it can.
(ref: k2-fsa/sherpa-ncnn#177 (comment))

ASR engines should focus more on the recognition of syllables,
while the task of analyzing vocabulary and sentences should be given to large language models.

Gldkslfmsd · 2023-04-05T07:17:34Z

Gldkslfmsd
Apr 5, 2023

Hi guys, I implemented real-time Whisper streaming for long audio in Python. Going to share it soon.

3 replies

Gldkslfmsd Apr 5, 2023

https://github.com/ufal/whisper_streaming

diyism Apr 6, 2023

I've tested your project with "python3 whisper_online.py nihao.wav --language zh --model small --min-chunk-size 0.5":
last processed 0.50 s, now is 2.82, the latency is 2.32
last processed 2.82 s, now is 5.20, the latency is 2.38
last processed 5.20 s, now is 8.05, the latency is 2.85
It seems the latency is too high.

Typically every syllable takes the same fixed 0.3 seconds in Mandarin,
and a syllable consists only of "a consonant + a vowel" or "a vowel", without any tail consonants.
There are only 1300 Mandarin syllables with tone in total.

lrq3000 Nov 9, 2023

This is a truly streaming model, not a hack in the wrapper. Here is the paper the authors wrote alongside the code release: https://www.researchgate.net/publication/372684083_Turning_Whisper_into_Real-Time_Transcription_System

ZQ-Dev8 · 2023-04-17T22:51:46Z

ZQ-Dev8
Apr 17, 2023

Has anyone seen or implemented a solution that can transcribe and translate from English into another language in real time or with a slight delay? Many of the projects here are great, but I'm not seeing the English -> other language functionality anywhere.

2 replies

zanjabil2502 Apr 18, 2023

Using faster-whisper from CTranslate2 and VAD from Silero.

jnnnnn Sep 18, 2023

Whisper was trained to output either the spoken language or an english translation. It was not trained to translate to any language.

https://openai.com/research/whisper

In order to translate to languages other than English, consider using a separate model for the translation, such as Firefox's translate (Project Bergamot):

https://github.com/jelmervdl/translatelocally-web-ext

Otherwise you will have to train your own whisper model (which others are doing, see https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=trending&search=whisper ).

gkorepanov · 2023-05-10T17:51:04Z

gkorepanov
May 10, 2023

Hi, I have made a small wrapper around the OpenAI Whisper API which adds a kind of "streaming" capability to the API
https://github.com/gkorepanov/whisper-stream

It can be useful if you want to use existing API instead of running your own Whisper instance.

It splits the input audio into chunks of 30s each and sends them one by one to the API, which leads to a much faster initial response and a streaming experience for use cases where speed is important. It can be extended fairly easily for audio streaming applications as well, though it will not be real-time (expect around 40s latency with this approach, or maybe less if you reduce the chunk size).
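
The chunking step can be sketched like this. It is an illustration of the idea rather than the code in the repo above, and it assumes the audio is already decoded into a sequence of samples. Each chunk has to be fully recorded before it can be sent, which is why latency in a live setting is at least one chunk length plus the API round trip.

```python
from typing import Iterator, Sequence


def split_into_chunks(
    samples: Sequence[float],
    chunk_s: float = 30.0,
    sample_rate: int = 16_000,
) -> Iterator[Sequence[float]]:
    """Yield consecutive chunks of at most `chunk_s` seconds; the last one may be shorter."""
    size = int(chunk_s * sample_rate)
    if size <= 0:
        raise ValueError("chunk size must be at least one sample")
    for start in range(0, len(samples), size):
        yield samples[start : start + size]
```

Each yielded chunk would then be encoded and uploaded in order, with the returned text appended to the transcript as it arrives.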

0 replies

nalbion · 2023-05-15T14:47:02Z

nalbion
May 15, 2023

I've created a streaming whisper_server which sends audio from your mic through Whisper and streams as Server Sent Events or gRPC
https://github.com/nalbion/whisper-server

2 replies

makaveli10 Jun 25, 2023

A nearly-live implementation to transcribe anything from audio in your browser to audio from your microphone.
https://github.com/collabora/whisper-live

saharmor Jun 26, 2023

@makaveli10 that's cool! Would love to team up and add VAD + real-time transcription using websockets to Whisper Playground

phineas-pta · 2023-07-13T11:34:39Z

phineas-pta
Jul 13, 2023

Has anyone compared the accuracy of Whisper vs wav2vec2 for live transcription? From my understanding, Whisper needs to pad audio to 30s, so 1-2s chunks may not be suitable; maybe wav2vec2 offers better accuracy for short chunks.
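
To make the padding point concrete: Whisper works on fixed 30-second windows of 16 kHz audio (480,000 samples), and shorter inputs are zero-padded up to that length. Below is a plain-Python sketch of that step, written so it runs without the library installed. The `speech_fraction` helper is only an illustration of how little of the window a short chunk fills.

```python
from typing import List, Sequence

SAMPLE_RATE = 16_000
CHUNK_SECONDS = 30
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples per window


def pad_or_trim(samples: Sequence[float], length: int = N_SAMPLES) -> List[float]:
    """Zero-pad (or cut) the audio so it is exactly `length` samples long."""
    if len(samples) >= length:
        return list(samples[:length])
    return list(samples) + [0.0] * (length - len(samples))


def speech_fraction(duration_s: float) -> float:
    """Share of the 30 s window that holds real audio rather than padding."""
    return min(duration_s, CHUNK_SECONDS) / CHUNK_SECONDS
```

A 1.5-second chunk fills only about 5% of the window, so the model still processes a full 30 seconds of input for very little speech.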

1 reply

glangford Jul 13, 2023

Haven't tried this myself but here is an excerpt from one comparison regarding short transcripts (not necessarily live), fyi

"Like wav2vec, Whisper also exhibits a substantial degradation in mean WER per file on Conversational AI, Phone call, and Meeting data indicating pathological behavior on a subset of small files. As far as the normalization scheme, we find that Whisper normalization produces far lower WERs on almost all domains and metrics. This result is qualitatively similar to the results of the original Whisper paper."

Benchmarking Top Open Source Speech Recognition Models: Whisper, Facebook wav2vec2, and Kaldi
https://deepgram.com/learn/benchmarking-top-open-source-speech-models

Alireza29675 · 2023-12-17T22:31:46Z

Alireza29675
Dec 17, 2023

If you need real-time Whisper transcription in the browser, check out my TypeScript package whisper-live. It's framework-agnostic, uses the OpenAI Whisper model for live transcription and is easy to integrate.

📦 Install with:

npm install whisper-live

More details here: https://github.com/Alireza29675/whisper-live

Happy to help if you have any questions!

0 replies

egorsmkv · 2024-07-17T06:35:09Z

egorsmkv
Jul 17, 2024

I have found a nice solution for streaming Whisper: https://arxiv.org/abs/2406.10052

0 replies

Kishlay-notabot · 2024-08-15T15:59:23Z

Kishlay-notabot
Aug 15, 2024

Hi! Is there any repo that does real-time transcription with any model (English or non-English) and uses VAD to split chunks? I can't seem to find one with Python as the core language.

2 replies

Gldkslfmsd Aug 15, 2024

yes - https://github.com/ufal/whisper_streaming

Kishlay-notabot Aug 15, 2024

This requires an input wav file. I want to do it using my laptop's microphone in real time.

Gldkslfmsd · 2024-08-15T16:29:54Z

Gldkslfmsd
Aug 15, 2024

It doesn't. Use the server; read the README.

2 replies

Kishlay-notabot Aug 15, 2024

I'll check that out, thanks

Kishlay-notabot Aug 16, 2024

@Gldkslfmsd I checked out whisperlive, vosper, whisper-server, whisper-streaming and whisper-playground.
Whisper-server uses OpenAI's API, but I am trying to do it with local inference. I think WhisperLive is what I need, but I have to try it out. I can't work out how Whisper connects to the frontend in the whisper-playground project. I'll try whisper-streaming and WhisperLive now.

paschaldev · 2024-09-05T21:33:40Z

paschaldev
Sep 5, 2024

You can try this with transformers.js. It works in browsers that support WebGPU (Chrome).

https://github.com/xenova/transformers.js/tree/v3/examples/webgpu-whisper

0 replies

LucasAssis00 · 2024-09-11T17:48:54Z

LucasAssis00
Sep 11, 2024

Hello, guys. Does anyone use Whisper in a project that transcribes small chunks of audio per turn? I was using the speech_recognition library to do something like this, but I need a trained Whisper model because it involves Portuguese medical jargon, and the default Whisper does not work so well even with the large model.
To do it asynchronously I am using pyaudio to record the wav files, but I cannot think of a way of merging it to do it synchronously.

2 replies

PabloLION Oct 7, 2024

I have a repo that splits audio, though the performance is not guaranteed. You can see if the code helps.

Slepetys Nov 13, 2024

Hi Lucas,
Check this: https://github.com/Dadangdut33/Speech-Translate, I am having good results with it.
It automatically splits the audio input, which you can select from either the microphone or the speaker, and transcribes it using faster-whisper or whisper. It is not difficult to change the source code to join the microphone and speaker audio into a single input, if needed.

What it does not implement is diarization, which can be done using the same building blocks as whisper-diarization.

All the best

Replies: 22 comments · 63 replies