Speaker-blind speech recognition #144

juanmc2005 · 2023-04-24T09:59:34Z

Depends on #143

Adding a streaming ASR pipeline required a major refactoring, which began in #143.
This PR continues that effort and adds a new type of pipeline that transcribes speech instead of segmenting it.
A default ASR model based on Whisper is provided, but the dependency is not mandatory.

Additional modifications were also needed to make Whisper compatible with batched inference.
Note that we do not condition Whisper on previous transcriptions here. I expected this to degrade transcription quality, but it proved rather robust in my experiments with a microphone and spontaneous speech in several languages (English, Spanish and French).
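
To illustrate why dropping the conditioning makes batching simpler: when no chunk depends on the text decoded from the previous one, chunks can be padded to a common length and grouped together. The sketch below is not the code in this PR. `pad_or_trim`, `make_batch` and the toy sizes are hypothetical names and values chosen for illustration, and plain Python lists stand in for tensors.

```python
from typing import List, Sequence


def pad_or_trim(chunk: Sequence[float], length: int) -> List[float]:
    """Zero-pad or truncate a chunk so it has exactly `length` samples."""
    samples = list(chunk[:length])
    return samples + [0.0] * (length - len(samples))


def make_batch(chunks: Sequence[Sequence[float]], length: int) -> List[List[float]]:
    """Stack independent chunks into a rectangular batch.

    Assumption: no chunk carries text context from another one,
    so each row can be prepared without looking at its neighbours.
    """
    return [pad_or_trim(chunk, length) for chunk in chunks]


# Toy example: three chunks of different lengths, target length 4
batch = make_batch([[0.1, 0.2], [0.3, 0.4, 0.5, 0.6, 0.7], []], length=4)
```

With conditioning enabled, each chunk would need the text produced for the previous one before it could be decoded, which would force sequential processing within a stream.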

The new Transcription pipeline can also use a segmentation model as a local VAD to skip non-voiced chunks. In my experiments, this worked better and faster than using Whisper's no_speech_prob.
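
The gating idea can be sketched as follows. This is an illustration rather than the pipeline's actual code: `is_voiced`, `select_voiced`, the 0.5 threshold and the minimum speech ratio are all assumptions, and the per-frame speech probabilities stand in for the output of a segmentation model.

```python
from typing import List, Sequence


def is_voiced(
    speech_probs: Sequence[float],
    threshold: float = 0.5,
    min_ratio: float = 0.1,
) -> bool:
    """Return True if a large enough share of frames reaches the speech threshold."""
    if not speech_probs:
        return False
    active = sum(1 for p in speech_probs if p >= threshold)
    return active / len(speech_probs) >= min_ratio


def select_voiced(
    chunk_probs: Sequence[Sequence[float]],
    threshold: float = 0.5,
    min_ratio: float = 0.1,
) -> List[int]:
    """Indices of the chunks that would be forwarded to the ASR model."""
    return [
        i for i, probs in enumerate(chunk_probs)
        if is_voiced(probs, threshold, min_ratio)
    ]


# Hypothetical per-frame speech probabilities for three chunks
chunk_probs = [
    [0.01, 0.02, 0.03, 0.02],  # silence
    [0.10, 0.80, 0.90, 0.70],  # mostly speech
    [0.05, 0.04, 0.60, 0.03],  # one speech frame out of four
]
voiced = select_voiced(chunk_probs)
```

Chunks that are filtered out never reach the ASR model, which is where the speed gain would come from in this sketch.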

Transcription is also compatible with diart.stream, diart.benchmark, diart.tune and diart.serve (hence diart.client too).

Still missing

README examples and possible restructuring

Changelog

TBD


BlokusPokus · 2024-04-18T13:22:56Z

Is this feature considered implemented?

juanmc2005 · 2024-04-19T12:30:07Z

@BlokusPokus it seemed to work the last time I tried, but I didn't merge it because I wanted to include a faster implementation of Whisper and needed to clean up the code. Feel free to try it out, but it's based on a pretty old version of the library. I need to find some time to update this PR. If you feel like it, it would be an amazing contribution!

GeorgeDeac · 2024-10-18T14:20:20Z

Yeah, we definitely need a faster-whisper / WhisperLive implementation. WhisperLive also integrated VAD, and I see it has some overlapping features.

juanmc2005 added 18 commits April 19, 2023 17:41

New feature: streaming voice activity detection. Pipeline name changes

bca2873

Merge branch 'develop' of github.com:juanmc2005/OnlineDiarization into feat/vad

5e44ad4

Update link in setup.cfg

7447061

Update code snippets in README

4985394

Add minor README modifications

540ad0a

Initial ASR implementation. Broken stuff

8cc9925

First working transcription pipeline. Using diarization is possible but a bit quirky

1ae4934

Reduce Whisper VRAM footprint (around 400Mb). Add fp16 option

d8d7342

Change whisper input type based on fp16 parameter

2cfc35d

Implement batched inference for whisper. Re-implement decoding.

a40112c

Minor changes in transcription arguments

e8196a7

Greatly improve transcription pipeline by adding optional VAD

07dd9ae

Move pipelines to diart.pipelines. Add torchmetrics as a dependency

0bf2522

Add websocket compatibility to transcription pipeline

42fe5f7

Transcription pipeline is now fully compatible with diart.stream

49616e5

Make transcription pipeline compatible with diart.benchmark and diart.tune. Fix major bug in Optimizer

babf49d

Rename base pipeline and config objects

6609e3c

Merge changes from branch feat/vad

4c1aeba

juanmc2005 added the bug, feature, API and refactoring labels Apr 24, 2023

juanmc2005 added this to the Version 0.8 milestone Apr 24, 2023

juanmc2005 added 6 commits April 24, 2023 12:39

New feature: streaming voice activity detection. Pipeline name changes

d19b044

Update link in setup.cfg

6caa4a4

Update code snippets in README

0993fe8

Add minor README modifications

95d4fae

Rename base pipeline and config objects

569c68f

Update branch with develop

eed864f

juanmc2005 mentioned this pull request Apr 26, 2023

Add speaker-aware transcription #147

Draft

juanmc2005 modified the milestones: Version 0.8, Version 0.9 Oct 11, 2023

juanmc2005 force-pushed the develop branch from b26e60c to 782ce49 Compare October 28, 2023 14:07

juanmc2005 removed this from the Version 0.9 milestone Nov 2, 2023

juanmc2005 marked this pull request as ready for review November 9, 2023 22:59

juanmc2005 force-pushed the develop branch from f531147 to 467997d Compare May 25, 2024 17:31
