VAD quality #53

Open
matanox opened this issue Jul 3, 2020 · 3 comments

Comments

@matanox

matanox commented Jul 3, 2020

The readme says:

The VAD that Google developed for the WebRTC project is reportedly one of the best available, being fast, modern and free.

However, I was unable to observe promising accuracy at any aggressiveness level (0-3). Is this statement based on any benchmark or publication? Have you seen useful accuracy in your own setup using py-webrtcvad?
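
For anyone who wants to reproduce this, a minimal sketch of sweeping the four aggressiveness modes over one recording could look like the following. It assumes 16-bit mono PCM at a sample rate the library accepts (8000, 16000, 32000 or 48000 Hz) and frames of 10, 20 or 30 ms. `frames` and `speech_ratio_by_mode` are hypothetical helper names, and the `vad_factory` parameter is an assumption added only so the loop can be exercised without the library installed.

```python
def frames(pcm: bytes, sample_rate: int, frame_ms: int = 30):
    """Yield consecutive fixed-length frames of 16-bit mono PCM."""
    n_bytes = sample_rate * frame_ms // 1000 * 2  # 2 bytes per sample
    for start in range(0, len(pcm) - n_bytes + 1, n_bytes):
        yield pcm[start:start + n_bytes]


def speech_ratio_by_mode(pcm, sample_rate=16000, frame_ms=30, vad_factory=None):
    """Return {mode: fraction of frames flagged as speech} for modes 0-3."""
    if vad_factory is None:
        import webrtcvad  # third-party package: pip install webrtcvad
        vad_factory = webrtcvad.Vad
    ratios = {}
    for mode in range(4):
        vad = vad_factory(mode)
        decisions = [vad.is_speech(frame, sample_rate)
                     for frame in frames(pcm, sample_rate, frame_ms)]
        ratios[mode] = sum(decisions) / len(decisions) if decisions else 0.0
    return ratios
```

Running this on a recording with known speech and non-speech segments gives a rough first impression of how much the mode setting changes the decisions; it is not a substitute for a labeled benchmark.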

@matanox

matanox commented Sep 27, 2020

I'm consistently getting results that seem no better than plain audio level detection, even with very good audio hardware that has integrated noise cancellation. It might be worth integrating a newer VAD model into this repository. Upgrading to the latest WebRTC model is one option, but WebRTC is not focused solely on VAD (it also does AGC and many other things), so if the goal is only to extract the VAD feature for Python, keeping this project based on WebRTC is not necessarily the best choice.
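
To make the "no better than audio level detection" comparison concrete, here is a rough sketch of the kind of baseline I mean: a fixed-threshold level detector plus a frame-by-frame agreement score. The function names and the -35 dBFS threshold are illustrative assumptions, not anything taken from py-webrtcvad.

```python
import math
import struct


def frame_rms(frame: bytes) -> float:
    """RMS amplitude of one frame of little-endian 16-bit mono PCM."""
    samples = struct.unpack("<%dh" % (len(frame) // 2), frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def level_detector(pcm, sample_rate=16000, frame_ms=30, threshold_db=-35.0):
    """Flag each frame whose level in dBFS exceeds a fixed threshold."""
    n_bytes = sample_rate * frame_ms // 1000 * 2
    decisions = []
    for start in range(0, len(pcm) - n_bytes + 1, n_bytes):
        rms = frame_rms(pcm[start:start + n_bytes])
        db = 20 * math.log10(rms / 32768.0) if rms > 0 else float("-inf")
        decisions.append(db > threshold_db)
    return decisions


def agreement(a, b):
    """Fraction of frames on which two decision lists agree."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0
```

If the per-frame VAD decisions agree with `level_detector` on nearly every frame of a noisy recording, the VAD is adding little beyond loudness for that material.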

@matanox

matanox commented Sep 27, 2020

I think the following paragraph from the CMU Sphinx project sums it up quite nicely in terms of what to expect:

The major issue with VAD is that speech signal is considered alone and the methods for arbitrary audio signal recognition are in a pretty initial stage. So you can’t distinguish speech from other sounds because you don’t know what other sounds are. Also, the theory of separation of overlapped signals is also in a very initial stage. So most of the modern VADs operate on stationary noise only and can not deal with complex noises and overlapped speech. Things like bird singing in the background can make things pretty complex.

On the face of it, this VAD model, like most others of its time, is okay at picking out speech against stationary noise, but has little power to determine whether a given burst of sound is speech or something else entirely. Under that interpretation, it is useful in quiet rooms and for cutting voice segments out of nearly noise-free recordings, but not so much for detecting speech "in the wild".
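
A toy illustration of that limitation, under loud assumptions: this is a simple noise-floor tracker over per-frame levels, not the algorithm WebRTC actually uses, and the names, the 10 dB margin and the 5% rise rate are made up for the example. It ignores a steady hum, yet flags any louder burst, whether that burst is speech or a bird.

```python
def adaptive_level_detector(frame_levels, margin_db=10.0, rise=1.05):
    """Flag frames whose level exceeds a slowly tracked noise floor by margin_db."""
    ratio = 10 ** (margin_db / 20.0)
    floor = None
    decisions = []
    for level in frame_levels:
        # The floor follows drops immediately but rises at most `rise` per frame.
        floor = level if floor is None else min(level, floor * rise)
        decisions.append(level > floor * ratio)
    return decisions


steady_hum = [300.0] * 10   # stationary background noise (arbitrary RMS units)
chirp = [6000.0] * 5        # a loud non-speech burst, e.g. a bird
print(adaptive_level_detector(steady_hum + chirp + steady_hum[:5]))
```

The hum frames come out as non-speech and the chirp frames as speech, which is exactly the failure mode: the detector knows the background, but nothing about what speech sounds like.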

@snakers4

Please see the benchmarks in #68.
