
Does not recognize speech after upgrading to V5.1 #515

Closed
yaronwinter opened this issue Aug 7, 2024 · Discussed in #514 · 12 comments

Comments

@yaronwinter

Discussed in #514

Originally posted by yaronwinter August 7, 2024
I have been using SileroVAD for a few months now. After upgrading to V5.1 it suddenly fails to recognize very clear speech.
I have tried using both the torch.hub method and direct usage of the package modules, and in both cases it did not recognize anything in a signal with very clear speech:
[screenshots: torch_hub invocation and results]

And here is the audio file:

call_13.mp4

I would appreciate any advice!
Thanks,
Yaron

@x86Gr

x86Gr commented Aug 7, 2024

I wouldn't call that "very clear speech", in general. Have you tried lowering the threshold?

@yaronwinter
Author

Thanks for the response!
You refer to the 'threshold' parameter, right?

(i.e. speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=sr, threshold=0.15))

Well, at 0.45 and higher it does not recognize any speech.
When decreasing it gradually it recognizes a few short segments of speech (not satisfactory), and then, at 0.15, it detects the whole signal as speech...
It is strange, as before the upgrade to 5.1 it detected speech pretty well on this data, with no need for threshold tuning.

Compared to real-life applications (e.g. call centers, medical health support, etc.), this example is not challenging at all; hence the "very clear speech"...
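
A minimal sketch of such a threshold sweep, not from the thread itself; it assumes wav, model, sr and get_speech_timestamps are already set up as in the snippet above, and that the returned 'start'/'end' keys are in samples (the library default):

  # Hypothetical sweep: rerun get_speech_timestamps at several thresholds
  # and report how much of the signal is detected as speech at each one.
  for threshold in (0.45, 0.35, 0.25, 0.15):
      speech_timestamps = get_speech_timestamps(
          wav, model, sampling_rate=sr, threshold=threshold)
      total_samples = sum(ts['end'] - ts['start'] for ts in speech_timestamps)
      print(f"threshold={threshold}: {len(speech_timestamps)} segments, "
            f"{total_samples / sr:.1f}s detected as speech")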

@x86Gr

x86Gr commented Aug 11, 2024

Can you plot the speech probability vs. time for a sample audio for both v4 and v5? Have you specified other parameters like min speech duration, min silence duration, etc.?

@yaronwinter
Author

How can I run v4?
Up to v4, isn't torch hub the only option for getting the model and modules?
In fact I had not installed SileroVAD at all, but rather used torch hub to import the modules.
Only after it stopped detecting speech did I find that v5.1 had been released, but I haven't figured out how to go back to v4...

@leminhnguyen

@yaronwinter same problem for me, have you rolled back to V4 successfully?

@snakers4
Owner

How can I run v4?

#474
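
For reference, a minimal sketch of the rollback: pinning the repo to a version tag in the torch.hub call, the same approach shown with v4.0 in the next comment:

  import torch

  # Load the v4.0 model by pinning the repo tag instead of the default branch
  model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad:v4.0',
                                model='silero_vad',
                                force_reload=True)
  (get_speech_timestamps, save_audio, read_audio,
   VADIterator, collect_chunks) = utils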

@snakers4
Owner

It is always worth doing the following:

  • Plotting the probability chart (there is a parameter for this in the get_speech_timestamps function; a manual sketch follows this list);
  • Isolating the problematic audio;
  • Seeing if the problem is systemic, or just a matter of improper hyper-parameters.
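
A minimal sketch of plotting the probability chart by hand; the file name is hypothetical, and the 512-sample windows at 16 kHz match what the v5 model expects (v4.0 also accepts this chunk size):

  import torch
  import matplotlib.pyplot as plt

  model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                                model='silero_vad')
  read_audio = utils[2]

  sr = 16000
  wav = read_audio('call_13.wav', sampling_rate=sr)  # hypothetical file name

  window = 512  # samples per window at 16 kHz
  probs = []
  for i in range(0, len(wav) - window + 1, window):
      probs.append(model(wav[i:i + window], sr).item())
  model.reset_states()

  plt.plot([i * window / sr for i in range(len(probs))], probs)
  plt.xlabel('time, s')
  plt.ylabel('speech probability')
  plt.show()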

For this particular case v4.0 gives this probability chart:

  model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad:v4.0',
                                model='silero_vad',
                                force_reload=True,
                                onnx=USE_ONNX)

[probability chart for v4.0]

Which is nice, but almost the whole audio is speech except for the starting bit.

For the latest version it is:

  model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                                model='silero_vad',
                                force_reload=True,
                                onnx=USE_ONNX)

[probability chart for the latest version]

Here choosing the proper hyper-parameters is next to impossible, because almost the whole audio is speech and there are no breaks.

To summarize, in this particular case the whole audio is mostly speech, and using v4.0 is probably better: the latest model was not tuned on this kind of call-like data (the calls we handle in our call center are typically much less noisy), so it most likely treats this speech as background speech.

In any case you have three models to choose from: v3.1, v4.0 and latest.

@leminhnguyen

leminhnguyen commented Aug 13, 2024

@snakers4 From my experiments, people should choose v3.1 or v4.0 for call-center audio to get stable results. Anyway, thank you very much!!!

@snakers4
Owner

Looks like it depends on the audio quality.
In our case audio quality is typically higher, hence we were optimizing the background speech objective as well.

I do not really know how to handle this better.
If more edge cases come up, please open another issue.
Maybe we will think of something, e.g. a way to make the VAD run in several modes.

The same problem also applies to singing, music, murmur, background TV noise, parrot speech, etc.

@Simon-chai

Hey, you know what, I ran into the same issue with the V4 model and avoided it by using the V5 model. But it seems like any model will have the same problem when processing certain data.

@yaronwinter
Author

Right, it's the generic ML problem: any model performs best on data similar to its training set, and performance degrades on less similar data.
When I switched to V5 there was a massive decline in performance.
But more comprehensive tests afterwards showed that V5 also has advantages in some areas :(

@yuGAN6
Contributor

yuGAN6 commented Sep 12, 2024

Tried the V5 model on my low-quality noisy call records too. V4 definitely performs better, as it gives lower probability to background voices and higher to those speaking directly into the mic, which is good for my domain.
