Fix silence detection #175
base: main
Conversation
… long pause followed by continuous speech
Hi @aigerimmmm thanks for this submission, this is a hugely needed feature 🎉 Looks like a good start, please see the comments below and lmk what you think. Also, on the added files: is there a specific reason you need the new silent audio, rather than generating silence (all zeros) and appending it to the existing audio? We do that in a few tests you may want to reference. It would also be great to test the specific case where silence is above the threshold but the log probs for the full window are also above the threshold, so we keep the tokens rather than skip the window. Thanks again for working through this!
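The reviewer's suggestion of appending generated silence to existing audio could be sketched as below; the helper name and signature are assumptions for illustration, not existing WhisperKit test code:

```swift
// Hypothetical sketch (names are assumptions, not WhisperKit test code):
// instead of shipping a separate silent audio file, generate silence as
// all-zero samples and append it to the existing audio. WhisperKit audio
// is 16 kHz mono Float samples.
func appendSilence(to samples: [Float], seconds: Double, sampleRate: Int = 16000) -> [Float] {
    let silence = [Float](repeating: 0.0, count: Int(seconds * Double(sampleRate)))
    return samples + silence
}
```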
// MARK: - Helper Function

func loadAudioSamples(forResource resource: String, withExtension ext: String) -> [Float] {
Good idea 👍
func testSilentAudio() async throws {
    let whisperKit = try await WhisperKit(modelFolder: tinyModelPath(), verbose: true, logLevel: .debug)

    let silentAudioSamples: [Float] = loadAudioSamples(forResource: "silent_audio", withExtension: "mp3")
This can be [Float](repeating: 0.0, count: 16000)
Thanks for the suggestion! I've updated the testSilentAudio function to use [Float](repeating: 0.0, count: 16000) for silentAudioSamples.
@@ -157,6 +157,13 @@ final class TranscribeTask {
    try Task.checkCancellation()
    // Send to decoder to predict text tokens with fallback
    let decodingResult = try await decodeWithFallback(encoderSegment: encoderOutput, decodingOptions: options, callback: decodingCallback)

    // Check for silence detection
    if decodingResult.noSpeechProb > (options.noSpeechThreshold ?? 0.7) {
I believe this should also contain a check for the overall log prob of the window, see reference: https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/transcribe.py#L289
Thanks for the suggestion! I've added a check for the overall log probability of the window.
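The combined check the reviewer references from openai/whisper's transcribe loop could be sketched like this; parameter names are illustrative, not WhisperKit's actual API:

```swift
// Sketch of the combined check in openai/whisper's transcribe loop
// (parameter names are illustrative, not WhisperKit's actual API): a
// window is skipped as silent only when noSpeechProb exceeds the
// threshold AND the average log prob is below logProbThreshold; a
// confident decode keeps its tokens even if noSpeechProb is high.
func shouldSkipWindow(
    noSpeechProb: Float,
    avgLogProb: Float,
    noSpeechThreshold: Float? = 0.6,
    logProbThreshold: Float? = -1.0
) -> Bool {
    guard let noSpeechThreshold = noSpeechThreshold else { return false }
    var skip = noSpeechProb > noSpeechThreshold
    if let logProbThreshold = logProbThreshold, avgLogProb > logProbThreshold {
        // Decoder was confident about the text: keep the tokens
        skip = false
    }
    return skip
}
```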
if decodingResult.noSpeechProb > (options.noSpeechThreshold ?? 0.7) {
    print("Detected silence with noSpeechProb \(decodingResult.noSpeechProb), skipping segment.")
    // Skip processing for silent segments
    break
Instead of break, this should just increase the seek and continue the loop https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/transcribe.py#L293
You are so right, I've updated it to increase the seek and continue the loop.
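The difference between `break` and advancing the seek could be sketched as below; the loop and variable names are assumptions for illustration, not WhisperKit's actual code:

```swift
// Illustrative loop (variable names are assumptions, not WhisperKit's
// actual code) showing why `continue` is needed: a silent window should
// only advance the seek pointer, while `break` would drop everything
// after the first silence.
var seek = 0
let windowSamples = 16000 * 30          // 30 s windows at 16 kHz
let totalSamples = windowSamples * 3    // 90 s of audio
var processedWindows = 0
while seek < totalSamples {
    let isSilent = (seek == windowSamples) // pretend the second window is silent
    if isSilent {
        seek += windowSamples // skip past the silent window...
        continue              // ...and keep transcribing the rest
    }
    processedWindows += 1
    seek += windowSamples
}
// processedWindows is 2: the silent window was skipped, the rest kept
```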
@@ -827,7 +892,7 @@ open class TextDecoder: TextDecoding, WhisperMLModel {
    let decodingFallback = DecodingFallback(
        options: options,
        isFirstTokenLogProbTooLow: isFirstTokenLogProbTooLow,
        noSpeechProb: noSpeechProb,
Is this intended? If I'm reading this right, I think the noSpeechProb should be passed based on the actual probability of the no speech token at the first index of predicted tokens https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/decoding.py#L689-L693
Yes, you are right. I have made changes to ensure that noSpeechProb is calculated at the start of the transcript (SOT) token and passed to DecodingFallback. The commit is 3a3b512.
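The calculation being described (the no-speech probability read from the SOT step) could be sketched as below; the flat `[Float]` logits and the function name are assumptions for illustration, and the actual no-speech token id depends on the tokenizer:

```swift
import Foundation

// Sketch of deriving noSpeechProb the way openai/whisper does: softmax
// the logits from the SOT (start-of-transcript) step and read off the
// probability at the no-speech token index. The flat [Float] logits and
// function name are assumptions for illustration.
func noSpeechProb(fromLogits logits: [Float], noSpeechTokenIndex: Int) -> Float {
    let maxLogit = logits.max() ?? 0
    let exps = logits.map { exp($0 - maxLogit) } // numerically stable softmax
    let sum = exps.reduce(0, +)
    return exps[noSpeechTokenIndex] / sum
}
```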
Sources/WhisperKit/Core/Models.swift
Outdated
@@ -317,9 +319,10 @@ public struct DecodingOptions {
    compressionRatioThreshold: Float? = 2.4,
    logProbThreshold: Float? = -1.0,
    firstTokenLogProbThreshold: Float? = -1.5,
-   noSpeechThreshold: Float? = 0.6,
+   noSpeechThreshold: Float? = 0.7,
The default in openai/whisper is 0.6, can you stick with that value or is there a reason you're updating it here?
I've reverted the noSpeechThreshold value back to 0.6 to align with the default used in openai/whisper.
Sources/WhisperKit/Core/Models.swift
Outdated
    concurrentWorkerCount: Int = 0,
-   chunkingStrategy: ChunkingStrategy? = nil
+   chunkingStrategy: ChunkingStrategy? = nil,
+   ignorePrefillPromptForNoSpeechDetection: Bool = true
This is a bit verbose, and I'm also not sure exactly what its intended use is. Are you referencing existing code, or was this a need that came up from your testing? I can see a need for something that runs the forward pass for the SOT token when using the prefillData, but for the prefill prompt we would still want to check the no speech prob specifically at the SOT token, after all the prefill prompt tokens have been passed through to generate their KV caches. The best place to do that would be here:
if tokenIndex < intialPromptIndex {
Thank you for your suggestion! I've updated it to check the no speech probability after all prefill prompt tokens have been processed.
@@ -278,3 +278,37 @@ open class LanguageLogitsFilter: LogitsFiltering {
        return indexes
    }
}

@available(macOS 13, iOS 16, watchOS 10, visionOS 1, *)
open class SilenceLogitsFilter: LogitsFiltering {
On second look at the source, I'm not sure we need a filter for this, since that will suppress the log probs of the "predicted token" during inference. What you already have here https://github.com/argmaxinc/WhisperKit/pull/175/files#diff-5dd6579fc66020b1085535bce41d2c2cc399a0b2b8f0ba225fc89f39d9ebdbc8R402 is checking the specific no speech index, which does everything you would need already.
One thing that you could add to the filters is suppressing the no speech token in the SupressTokensFilter, similar to this: https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/decoding.py#L638-L640 (it also needs the remaining suppress tokens, but that can be fixed later).
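The suppression the reviewer describes could be sketched as below; a plain `[Float]` stands in for the `MLMultiArray` logits WhisperKit actually filters, and the function name is an assumption:

```swift
// Sketch of token suppression in the style of openai/whisper's
// SuppressTokens filter: set the logits of suppressed token ids to
// -infinity so they can never be sampled. A plain [Float] stands in
// for the MLMultiArray logits WhisperKit actually filters.
func suppressTokens(_ logits: [Float], suppress: [Int]) -> [Float] {
    var filtered = logits
    for index in suppress where filtered.indices.contains(index) {
        filtered[index] = -Float.infinity
    }
    return filtered
}
```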
Co-authored-by: Zach Nagengast <[email protected]>
Hey @aigerimmmm just checking in, how are you feeling about the changes?
Hi @ZachNagengast, I really appreciate all the suggestions and the detailed review, it's very helpful for me 👍 I've been working on it and I'm going to submit today.
…ility calculation for detectSilence
…and added test for testDetectSilenceHelperMethod
@aigerimmmm Checking in, is this ready for another review?
Hi @ZachNagengast thank you for the reply, yes I am ready anytime for another review.
PR for issue #27.