VAD quality #53

matanox · 2020-07-03T00:45:17Z

The readme says:

The VAD that Google developed for the WebRTC project is reportedly one of the best available, being fast, modern and free.

However, I could not observe any promising accuracy at any aggressiveness level (0-3). Is this statement based on a benchmark or publication? Have you seen useful accuracy in your own setup using py-webrtcvad?
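For anyone who wants to reproduce this kind of comparison, here is a minimal sketch that runs the same audio through all four aggressiveness modes and reports the fraction of frames flagged as speech. It assumes 16-bit mono PCM at a sample rate py-webrtcvad accepts (8000, 16000, 32000 or 48000 Hz) and a frame length of 10, 20 or 30 ms. The helper names `frames` and `speech_ratio_by_mode` and the `make_vad` parameter are hypothetical, introduced here for illustration only; they are not part of py-webrtcvad.

```python
def frames(pcm: bytes, sample_rate: int, frame_ms: int = 30):
    """Yield consecutive fixed-length frames of 16-bit mono PCM.

    A trailing partial frame is dropped, since the VAD only accepts
    frames of exactly 10, 20 or 30 ms.
    """
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per sample
    for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        yield pcm[start:start + frame_bytes]


def speech_ratio_by_mode(pcm: bytes, sample_rate: int, frame_ms: int = 30,
                         make_vad=None) -> dict:
    """Return {mode: fraction of frames flagged as speech} for modes 0-3.

    make_vad is a hypothetical hook: any callable taking the mode and
    returning an object with is_speech(frame, sample_rate). It defaults
    to webrtcvad.Vad.
    """
    if make_vad is None:
        import webrtcvad  # third-party: pip install webrtcvad
        make_vad = webrtcvad.Vad

    ratios = {}
    for mode in range(4):
        vad = make_vad(mode)
        flags = [vad.is_speech(frame, sample_rate)
                 for frame in frames(pcm, sample_rate, frame_ms)]
        ratios[mode] = sum(flags) / len(flags) if flags else 0.0
    return ratios
```

If the four ratios barely differ on noisy recordings, or track loudness rather than speech, that would be consistent with the behaviour described above. This is only a rough sanity check, not a benchmark: it needs labelled audio to say anything about accuracy.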

matanox · 2020-09-27T20:56:39Z

I'm consistently getting results that seem no better than simple audio level detection, even with very good audio hardware that has integrated noise cancellation. It might be worth integrating a newer VAD model into this repository. Upgrading to the latest WebRTC model is one option, but WebRTC is not a project focused solely on VAD; it also does AGC and many other things. So if the goal is only to extract the VAD feature for Python, it may not make sense to keep basing this project on it.
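To make the "audio level detection" baseline concrete, here is a rough sketch of a per-frame level detector using only the standard library. It assumes 16-bit little-endian mono PCM; the function names and the -35 dBFS threshold are arbitrary choices made for illustration, not values taken from this project or from WebRTC.

```python
import math
import struct


def rms_dbfs(frame: bytes) -> float:
    """Return the RMS level of a 16-bit little-endian PCM frame in dBFS."""
    count = len(frame) // 2
    if count == 0:
        return float("-inf")
    samples = struct.unpack("<%dh" % count, frame[:count * 2])
    rms = math.sqrt(sum(s * s for s in samples) / count)
    # Digital silence has no finite level on a log scale.
    return 20 * math.log10(rms / 32768) if rms > 0 else float("-inf")


def level_detect(pcm: bytes, sample_rate: int, frame_ms: int = 30,
                 threshold_dbfs: float = -35.0) -> list:
    """Flag each full frame whose RMS level exceeds threshold_dbfs.

    The threshold is an assumed value and would need tuning per setup.
    """
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per sample
    return [rms_dbfs(pcm[i:i + frame_bytes]) > threshold_dbfs
            for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes)]
```

Comparing these per-frame flags with the VAD's output on the same frames is one way to check whether the VAD is adding anything beyond a loudness gate.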

matanox · 2020-09-27T21:17:50Z

I think the following paragraph from the CMU Sphinx project sums it up quite nicely in terms of what to expect:

The major issue with VAD is that speech signal is considered alone and the methods for arbitrary audio signal recognition are in a pretty initial stage. So you can’t distinguish speech from other sounds because you don’t know what other sounds are. Also, the theory of separation of overlapped signals is also in a very initial stage. So most of the modern VADs operate on stationary noise only and can not deal with complex noises and overlapped speech. Things like bird singing in the background can make things pretty complex.

On the face of it, this VAD model, like most others of its time, is okay at picking out speech against stationary noise, but has little power to determine whether a particular burst of sound is speech or something else entirely. Under that interpretation, it is useful for quiet rooms and for cutting voice segments out of almost noise-free recordings, but not so much for detecting speech "in the wild".

snakers4 · 2021-01-21T04:05:22Z

Please see the benchmarks in #68.
