
Does not recognize speech after upgrading to V5.1 #515

Closed
yaronwinter opened this issue Aug 7, 2024 · Discussed in #514 · 12 comments

Comments

@yaronwinter

Discussed in #514

Originally posted by yaronwinter August 7, 2024
I have been using SileroVAD for a few months now. After upgrading to V5.1 it suddenly fails to recognize very clear speech.
I have tried using both the torch.hub method and direct usage of the package modules, and in both cases it did not recognize anything in a signal with very clear speech:
[screenshots: torch_hub invocation and results]

And here is the audio file:

call_13.mp4

I would appreciate any advice!
Thanks,
Yaron

@x86Gr

x86Gr commented Aug 7, 2024

I wouldn't call that "very clear speech", in general. Have you tried lowering the threshold?

@yaronwinter
Author

Thanks for the response!
You refer to the 'threshold' parameter, right?

(i.e. speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=sr, threshold=0.15))

Well, at 0.45 and higher it does not recognize any speech.
When decreasing it gradually it recognizes a few short segments of speech (not satisfactory), and then, at 0.15, it detects the whole signal as speech...
It is strange, as before the upgrade to 5.1 it detected speech pretty well on this data, with no need for threshold tuning.

Compared to real-life applications (e.g. call centers, medical health support, etc.), this example is not challenging at all; hence the "very clear speech"...
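
A minimal sketch of such a threshold sweep, not from the thread itself; it assumes wav, model, sr and get_speech_timestamps are already set up as in the snippet above, and that the returned 'start'/'end' keys are in samples (the library default):

  # Hypothetical sweep: rerun get_speech_timestamps at several thresholds
  # and report how much of the signal is detected as speech at each one.
  for threshold in (0.45, 0.35, 0.25, 0.15):
      speech_timestamps = get_speech_timestamps(
          wav, model, sampling_rate=sr, threshold=threshold)
      total_samples = sum(ts['end'] - ts['start'] for ts in speech_timestamps)
      print(f"threshold={threshold}: {len(speech_timestamps)} segments, "
            f"{total_samples / sr:.1f}s detected as speech")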

@x86Gr

x86Gr commented Aug 11, 2024

Can you plot the speech probability vs. time for a sample audio for both v4 and v5? Have you specified other parameters like min speech duration, min silence duration, etc.?

@yaronwinter
Author

How can I run v4?
Up to v4, isn't torch hub the only option for getting the model and modules?
In fact I had not installed SileroVAD at all, but rather used torch hub to import the modules.
Only after it stopped detecting speech did I find that v5.1 had been released, but I haven't figured out how to go back to v4...

@leminhnguyen

@yaronwinter same problem for me, have you rolled back to V4 successfully?

@snakers4
Owner

How can I run v4?

#474
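
For reference, a minimal sketch of the rollback: pinning the repo to a version tag in the torch.hub call, the same approach shown with v4.0 in the next comment:

  import torch

  # Load the v4.0 model by pinning the repo tag instead of the default branch
  model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad:v4.0',
                                model='silero_vad',
                                force_reload=True)
  (get_speech_timestamps, save_audio, read_audio,
   VADIterator, collect_chunks) = utils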

@snakers4
Owner

It is always worth doing the following:

  • Plotting the probability chart (there is a parameter for this in the get_speech_timestamps function; a manual sketch follows this list);
  • Isolating the problematic audio;
  • Seeing if the problem is systemic, or just a matter of improper hyper-parameters.
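
A minimal sketch of plotting the probability chart by hand; the file name is hypothetical, and the 512-sample windows at 16 kHz match what the v5 model expects (v4.0 also accepts this chunk size):

  import torch
  import matplotlib.pyplot as plt

  model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                                model='silero_vad')
  read_audio = utils[2]

  sr = 16000
  wav = read_audio('call_13.wav', sampling_rate=sr)  # hypothetical file name

  window = 512  # samples per window at 16 kHz
  probs = []
  for i in range(0, len(wav) - window + 1, window):
      probs.append(model(wav[i:i + window], sr).item())
  model.reset_states()

  plt.plot([i * window / sr for i in range(len(probs))], probs)
  plt.xlabel('time, s')
  plt.ylabel('speech probability')
  plt.show()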

For this particular case v4.0 gives this probability chart:

  model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad:v4.0',
                                model='silero_vad',
                                force_reload=True,
                                onnx=USE_ONNX)

[probability chart for v4.0]

Which is nice, but almost the whole audio is speech except for the starting bit.

For the latest version it is:

  model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                                model='silero_vad',
                                force_reload=True,
                                onnx=USE_ONNX)

[probability chart for the latest version]

Here choosing the proper hyper-parameters is next to impossible, because almost the whole audio is speech and there are no breaks.

To summarize, in this particular case the whole audio is mostly speech, and using v4.0 is probably better: the latest model was not tuned on this kind of call-like data (the calls we handle in our call center are typically much less noisy), so it most likely treats this speech as background speech.

In any case you have three models to choose from: v3.1, v4.0 and latest.

@leminhnguyen

leminhnguyen commented Aug 13, 2024

@snakers4 From my experiments, people should choose v3.1 or v4.0 for call-center audio to get stable results. Anyway, thank you very much!!!

@snakers4
Owner

Looks like it depends on the audio quality.
In our case audio quality is typically higher, hence we were optimizing the background speech objective as well.

I do not really know how to handle this better.
If more edge cases come up, please open another issue.
Maybe we will think of something, e.g. a way to make the VAD run in several modes.

The same problem also applies to singing, music, murmur, background TV noise, parrot speech, etc.

@Simon-chai

Hey, you know what, I ran into the same issue with the V4 model and avoided it by using the V5 model. But it seems like any model will have the same problem when processing certain data.

@yaronwinter
Author

Right, it's the generic ML problem: any model performs best on data similar to its training set, and performance degrades on less similar data.
When I switched to V5 there was a massive decline in performance.
But more comprehensive tests afterwards showed that V5 also has advantages in some areas :(

@yuGAN6
Contributor

yuGAN6 commented Sep 12, 2024

Tried the V5 model on my low-quality noisy call records too. V4 definitely performs better, as it gives lower probability to background voices and higher to those speaking directly into the mic, which is good for my domain.
