Does not recognize speech after upgrading to V5.1 #515
Comments
I wouldn't call that "very clear speech", in general. Have you tried lowering the threshold?
Thanks for the response! I lowered the threshold (i.e. speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=sr, threshold=0.15)). At 0.45 and higher it does not recognize any speech. Compared to real-life applications (e.g. call centers, medical health support, etc.) this example is not challenging at all, hence the "very clear speech"...
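For reference, the effect of the threshold can be illustrated with a deliberately simplified, self-contained sketch of how per-chunk speech probabilities become segments. This is a toy version only, not the actual get_speech_timestamps logic (which also applies min-duration and padding rules); the chunk length is an assumption for illustration:

```python
def probs_to_segments(probs, threshold, chunk_s=0.032):
    """Toy segmenter: group consecutive chunks whose speech probability
    is >= `threshold` into (start_sec, end_sec) segments.
    Simplified illustration only -- the real get_speech_timestamps also
    applies min-speech-duration, min-silence-duration and padding logic."""
    segments, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i * chunk_s                 # speech onset
        elif p < threshold and start is not None:
            segments.append((start, i * chunk_s))  # speech offset
            start = None
    if start is not None:                        # speech runs to the end
        segments.append((start, len(probs) * chunk_s))
    return segments

# With a high threshold nothing passes; lowering it recovers the speech:
probs = [0.1, 0.2, 0.4, 0.4, 0.3, 0.1]
print(probs_to_segments(probs, 0.45))  # []
print(probs_to_segments(probs, 0.15))  # one segment covering chunks 1..4
```

This mirrors the behaviour reported above: if the model's probabilities sit below 0.45 for the whole file, the default-ish threshold yields nothing, while 0.15 does.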
Can you plot the speech probability vs. time for a sample audio for both v4 and v5? Have you specified other parameters, like min speech duration or min silence duration?
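A minimal sketch of how such a probability-vs-time curve can be collected, assuming `wav` is a 1-D torch tensor and `model` is a loaded Silero VAD model whose forward pass returns a speech probability per chunk (the 512-sample chunk size and the `reset_states` call follow my reading of the repo's streaming examples, so treat them as assumptions):

```python
def speech_prob_curve(wav, model, sampling_rate=16000, chunk_samples=512):
    """Return (times_sec, probs): one speech probability per fixed-size
    chunk, suitable for plotting probability vs. time and comparing
    model versions on the same audio."""
    times, probs = [], []
    model.reset_states()  # clear internal state between files (assumed API)
    for offset in range(0, len(wav) - chunk_samples + 1, chunk_samples):
        chunk = wav[offset:offset + chunk_samples]
        probs.append(model(chunk, sampling_rate).item())
        times.append(offset / sampling_rate)
    return times, probs

# Plotting (requires matplotlib), e.g.:
# import matplotlib.pyplot as plt
# t, p = speech_prob_curve(wav, model)
# plt.plot(t, p); plt.xlabel("time, s"); plt.ylabel("speech prob")
```

Running this once with a v4 model and once with a v5 model on the same file makes the difference discussed below directly visible.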
How can I run v4?
@yaronwinter Same problem for me. Have you rolled back to V4 successfully?
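One way a rollback can be sketched: torch.hub.load accepts a `repo:ref` spec, so pointing it at an older release tag of the repository should load the older model. The exact tag name ("v4.0") is an assumption here; check the repository's tags/releases:

```python
def silero_repo_spec(version_tag):
    """Build a torch.hub repo spec pinned to a release tag
    (the tag name, e.g. "v4.0", is an assumption -- check the repo)."""
    return f"snakers4/silero-vad:{version_tag}"

# The actual load requires torch and network access, so it is shown
# commented out:
# import torch
# model, utils = torch.hub.load(silero_repo_spec("v4.0"), "silero_vad",
#                               trust_repo=True)
print(silero_repo_spec("v4.0"))  # snakers4/silero-vad:v4.0
```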
It is always worth doing the following:
For this particular case v4.0 gives this probability chart:
Which is nice, but almost the whole audio is speech anyway, except for the starting bit. For the latest version it is:
Here choosing the proper hyperparameters is next to impossible, because almost the whole audio is speech and there are no breaks. To summarize: in this particular case the audio is mostly speech, and using v4.0 is probably better, because the newer model was not tuned on this kind of call data (the calls we typically have from call centers are much less noisy) and it most likely treats this speech as background speech. In any case, you have three models to choose from.
@snakers4 From my experiments, people should choose v3.1 or v4.0 for call-center audio for stable results. Anyway, thank you very much!
Looks like it depends on the audio quality. I do not really know how to handle this better. The same problem also applies to singing, music, murmur, background TV noises, parrot speech, etc.
Hey, you know what, I ran into the same issue with the V4 model, and I avoided it by using the V5 model. But it seems either model will have the same problem when processing certain data.
Right, it's the generic ML problem: any model performs best on data that is similar to its training set, and performance degrades for less similar data.
Tried the V5 model on my low-quality, noisy call records too. V4 definitely performs better, as it gives lower probability for background voices and higher for those speaking directly into the mic, which is good for my domain.
Discussed in #514
Originally posted by yaronwinter August 7, 2024
I have been using SileroVAD for a few months now. After upgrading to V5.1 it suddenly fails to recognize very clear speech.
I have tried using both the torch.hub method and direct usage of the package modules, and in both cases it did not recognize anything in a signal with very clear speech:
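For context, the two usage paths mentioned (torch.hub and the pip package) look roughly like this. The function and argument names follow my recollection of the repo's README, so treat the details as assumptions rather than the author's exact code (which was not included here):

```python
def load_via_torch_hub():
    """Path 1: torch.hub (downloads the repo on first call)."""
    import torch
    model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad",
                                  trust_repo=True)
    get_speech_timestamps = utils[0]  # first element of the utils tuple
    return model, get_speech_timestamps

def load_via_package():
    """Path 2: the pip package (`pip install silero-vad`)."""
    from silero_vad import load_silero_vad, get_speech_timestamps
    return load_silero_vad(), get_speech_timestamps

# Either way, usage would then be along the lines of:
# model, get_speech_timestamps = load_via_package()
# speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
```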
And here is the audio file:
call_13.mp4
I would appreciate any advice!
Thanks,
Yaron