Quality benchmarks between audiotok / webrtcvad / silero-vad #68
Comments
You've sure done some thorough work here. Just as a sanity check: it looks like the deep neural network model is the only one worth using in the real world, does it not? I wonder in what ways the WebRTC VAD model is even useful for the WebRTC project itself...
Despite appearances, WebRTC VAD is not so bad. False positives and the lack of easy tuning, interpretable parameters, docs, and support are the main culprits. For this reason we just used the standard parameters. We may be wrong somewhere and it can probably be tuned better, but 95% of users will not bother.
It seems that Silero VAD and WebRTC VAD make different tradeoffs. WebRTC produces a VAD decision on 10 ms to 30 ms frames, whereas Silero produces a VAD decision on 150 ms to 250 ms frames. While it's true that short silences on the order of 30 ms aren't particularly meaningful, the resolution of a VAD decision may be. In some applications, it may not be acceptable to discover a transition between speech and silence up to 125 ms late. WebRTC is designed to provide decisions in low-latency streaming applications where a 100+ ms buffer is not acceptable. I'm happy to see implementations explore different tradeoffs in the design space. Looking at a PR curve alone, though, doesn't tell the full story.
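To make the frame-size tradeoff concrete, here is a rough back-of-the-envelope sketch. It assumes 16 kHz audio, that a decision only becomes available once the whole window has been buffered, and that the decision is attributed to the centre of the window, so a speech/silence boundary can be off by up to half a window. The function name `window_stats` is made up for illustration and is not part of either library.

```python
def window_stats(window_ms: float, sample_rate: int = 16000) -> dict:
    """Rough numbers for one VAD window size.

    Assumptions (not taken from either library):
    - a decision needs the full window to be buffered first;
    - the decision is attributed to the window centre, so a
      speech/silence boundary can be misplaced by up to half a window.
    """
    return {
        "samples": int(sample_rate * window_ms / 1000),
        "buffer_ms": window_ms,
        "max_boundary_error_ms": window_ms / 2,
    }


if __name__ == "__main__":
    # Frame sizes quoted in the comment above.
    for ms in (10, 30, 150, 250):
        print(ms, window_stats(ms))
```

Under these assumptions a 250 ms window corresponds to the "up to 125 ms late" figure, while a 30 ms frame keeps the boundary error at 15 ms.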
It is true that we cannot really go below 100 ms windows; there is just too much noise.
Also, the community provided some illustrative comparisons: https://github.com/snakers4/silero-vad#live-demonstration
Instruments
We have compared 3 easy-to-use off-the-shelf instruments for voice activity / audio activity detection: audiotok, webrtcvad, and silero-vad.
Caveats
- audiotok provides Audio Activity Detection, which probably just means detecting silence in layman's terms;
- silero-vad is geared towards speech detection (as opposed to noise or music);
- audiotok and webrtcvad use 30-50 ms chunks (we used the default values of 30 ms for webrtcvad and 50 ms for audiotok);
Methodology
Please refer here - https://github.com/snakers4/silero-vad#vad-quality-metrics-methodology
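The linked methodology is not reproduced here. As a generic illustration only, a precision-recall curve for a chunk-level detector could be computed roughly as below, assuming the detector outputs one speech probability per chunk and that per-chunk ground-truth labels are available. The function names and the toy data are hypothetical and are not taken from the silero-vad repository.

```python
from typing import List, Sequence, Tuple


def precision_recall(probs: Sequence[float], labels: Sequence[int],
                     threshold: float) -> Tuple[float, float]:
    """Precision and recall of 'speech' decisions at a single threshold."""
    preds = [p >= threshold for p in probs]
    tp = sum(1 for pred, y in zip(preds, labels) if pred and y == 1)
    fp = sum(1 for pred, y in zip(preds, labels) if pred and y == 0)
    fn = sum(1 for pred, y in zip(preds, labels) if not pred and y == 1)
    # Convention chosen here: an empty denominator yields 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


def pr_curve(probs: Sequence[float], labels: Sequence[int],
             thresholds: Sequence[float]) -> List[Tuple[float, float, float]]:
    """One (threshold, precision, recall) point per threshold."""
    return [(t, *precision_recall(probs, labels, t)) for t in thresholds]


if __name__ == "__main__":
    # Toy per-chunk probabilities and labels (1 = speech), made up for the example.
    probs = [0.9, 0.8, 0.4, 0.3, 0.1]
    labels = [1, 1, 0, 1, 0]
    for t, p, r in pr_curve(probs, labels, [0.2, 0.5, 0.85]):
        print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

A detector that only returns a binary decision per chunk gives a single point per configuration rather than a curve, so sweeping its settings would be the closest equivalent.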
Quality Benchmarks
Finished tests:
Portability and Speed
- webrtcvad was written in C++ around 2016, so theoretically it can be ported to many platforms;
- audiotok is written in plain Python, but I guess the algorithm itself can be ported;
- silero-vad is based on PyTorch and ONNX, so it has the same portability options that both of these frameworks offer (mobile, different backends for ONNX, Java and C++ inference APIs, graph conversion from ONNX);

This is by no means exhaustive research on the topic; please point out if anything is lacking.