End segment after a silence does not get transcribed properly #26

cetiny · 2022-11-01T11:33:01Z

cetiny
Nov 1, 2022

I am manipulating the audio by adding two-second breaks for better transcription of podcasts.

This causes Whisper with stable-ts to ignore the last part of this audio file (from 05:40 onward). The transcription also has some mistakes before that point.
https://www.dropbox.com/s/akml8nt9wozglag/shifted_audio.wav?dl=0

Base Whisper, however, transcribes it correctly.

Maybe I should add silence suppression. Any ideas?

result2 = model.transcribe('weather.mp4', language='en', suppress_silence=True, ts_num=16, lower_quantile=0.05, lower_threshold=0.1)

My code as reference:

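import whisper
from stable_whisper import modify_model

# MODEL, LANGUAGE and DEVICE are constants defined elsewhere in the notebook.
def transcribeWhisper():
    args = {
        "verbose": False,
        "task": "transcribe",
        "language": LANGUAGE,
        "best_of": 5,
        "beam_size": 5,
    }

    model = whisper.load_model(MODEL, device=DEVICE)
    modify_model(model)  # patch the model with stable-ts
    result = model.transcribe("shifted_audio.wav", **args)
    return result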

Answered by jianfch

jianfch · 2022-11-02T03:43:19Z

jianfch
Nov 2, 2022
Maintainer

suppress_silence=True is default. The timestamp decoding logic was not properly implemented for beam search, but it should work properly in 280999c.

shifted_audio.mp4

The 2nd is with beam_size=5
The 3rd is with beam_size=None (i.e. greedy search)
It appears that silence suppression works better with greedy search.
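
To compare the two decoding modes on the same file, the keyword arguments could be built with a small helper. This is only a sketch: `build_transcribe_args` is a hypothetical name, not part of Whisper or stable-ts, and the keys simply mirror the dictionary from the original post.

```python
def build_transcribe_args(language, beam_size=None, best_of=5):
    """Return keyword arguments for model.transcribe().

    beam_size=None selects greedy search; an integer selects beam search.
    Hypothetical helper; the keys mirror the dictionary in the original post.
    """
    args = {
        "verbose": False,
        "task": "transcribe",
        "language": language,
        "best_of": best_of,
    }
    # Only pass beam_size when beam search is actually requested.
    if beam_size is not None:
        args["beam_size"] = beam_size
    return args


greedy_args = build_transcribe_args("en")             # greedy search
beam_args = build_transcribe_args("en", beam_size=5)  # beam search

# Usage, assuming `model` was loaded and patched as in the original post:
# result = model.transcribe("shifted_audio.wav", **greedy_args)
```

Running both variants on the same audio makes it easy to see whether the missing end segment depends on the search strategy.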


cetiny · 2022-11-02T10:36:37Z

cetiny
Nov 2, 2022
Author

Thank you for the quick fix; the video recording above looks great. However, I am somehow still getting the same results with the latest build (280999c), even when I set beam_size to None. Is this call correct?

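import whisper
from stable_whisper import modify_model

# MODEL, LANGUAGE and DEVICE are constants defined elsewhere in the notebook.
def transcribeWhisper():
    args = {
        "verbose": False,
        "task": "transcribe",
        "language": LANGUAGE,
        "best_of": 5,
        "beam_size": 5,
    }

    model = whisper.load_model(MODEL, device=DEVICE)
    modify_model(model)  # patch the model with stable-ts
    result = model.transcribe("shifted_audio.wav", **args)
    return result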

4 replies

jianfch Nov 2, 2022
Maintainer

That's odd. You can try a larger model. Or if that doesn't work then just disable silence suppression entirely, suppress_silence=True.

cetiny Nov 2, 2022
Author

Setting suppress_silence=True resolved the issue. I am using the "medium.en" model, which is the largest my GPU can handle. Thanks a lot!

jianfch Nov 2, 2022
Maintainer

> That's odd. You can try a larger model. Or if that doesn't work then just disable silence suppression entirely, suppress_silence=True.

That's good. One correction: suppress_silence is True by default. I meant to say: suppress_silence=False.

cetiny Nov 2, 2022
Author

Thank you. I was indeed running it with suppress_silence = True. But this magically resolved the issue. Now I get the desired result with or without arguments. Maybe Jupyter Notebook was using a cached version. Thanks a lot for your quick replies. You are doing a great job with stable-ts.
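
If a stale module in a long-running notebook kernel really was the cause (that is only a guess here), re-executing the module source after updating the file is one way to rule it out. A minimal sketch using the standard library; `fresh_import` is a hypothetical helper name:

```python
import importlib
import sys


def fresh_import(module_name):
    """Import module_name, re-executing its source if it is already loaded."""
    if module_name in sys.modules:
        # reload() re-runs the module code in place and returns the module.
        return importlib.reload(sys.modules[module_name])
    return importlib.import_module(module_name)


# Usage in a notebook, assuming stable_whisper.py sits on the import path:
# stable_whisper = fresh_import("stable_whisper")
```

Names that were imported with `from module import name` before the reload still point at the old objects, so restarting the kernel remains the more reliable option.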

cetiny · 2022-11-06T06:49:05Z

cetiny
Nov 6, 2022
Author

I am now getting a new error with another file when silence is suppressed. It works with suppress_silence = False. Interesting.

KeyError                                  Traceback (most recent call last)
Cell In [54], line 2
      1 # Transcribe
----> 2 transcriptionResult = transcribeWhisper()

Cell In [53], line 15, in transcribeWhisper()
     13 model = whisper.load_model(MODEL, device = DEVICE)
     14 modify_model(model)
---> 15 result = model.transcribe("shifted_audio.wav", suppress_silence = True, **args,)
     16 return result

File D:\Podscript\Transcribe\stable_whisper.py:1021, in transcribe_word_level(model, audio, verbose, temperature, compression_ratio_threshold, logprob_threshold, no_speech_threshold, condition_on_previous_text, stab, top_focus, ts_num, alpha, print_unstab, suppress_silence, suppress_middle, suppress_word_ts, remove_background, silence_threshold, prepend_punctuations, append_punctuations, audio_for_mask, **decode_options)
   1018     suppress_ts_mask = None
   1020 decode_options["prompt"] = all_tokens[prompt_reset_since:]
-> 1021 result, finalized_ts_tokens, ts_logits = decode_with_fallback(segment,
   1022                                                               suppress_ts_mask=suppress_ts_mask)
   1024 result = result[0]
   1025 tokens = torch.tensor(result.tokens)

File D:\Podscript\Transcribe\stable_whisper.py:882, in transcribe_word_level.<locals>.decode_with_fallback(segment, suppress_ts_mask)
    879     best_of = kwargs.get("best_of", None)
    881 options = DecodingOptions(**kwargs, temperature=t)
--> 882 results, ts_tokens, ts_logits_ = model.decode(segment, options, ts_num=ts_num, alpha=alpha,
    883                                               suppress_ts_mask=suppress_ts_mask,
    884                                               suppress_word_ts=suppress_word_ts)
    886 kwargs.pop("beam_size", None)  # no beam search for t > 0
    887 kwargs.pop("patience", None)  # no patience for t > 0

File ~\miniconda3\envs\ProjectPodscript\lib\site-packages\torch\autograd\grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
     24 @functools.wraps(func)
     25 def decorate_context(*args, **kwargs):
     26     with self.clone():
---> 27         return func(*args, **kwargs)

File D:\Podscript\Transcribe\stable_whisper.py:1469, in decode_word_level(model, mel, options, ts_num, alpha, suppress_ts_mask, suppress_word_ts)
   1466 if single:
   1467     mel = mel.unsqueeze(0)
-> 1469 result, ts = DecodingTaskWordLevel(model, options,
   1470                                    ts_num=ts_num,
   1471                                    alpha=alpha,
   1472                                    suppress_ts_mask=suppress_ts_mask,
   1473                                    suppress_word_ts=suppress_word_ts).run(mel)
   1475 if single:
   1476     result = result[0]

File ~\miniconda3\envs\ProjectPodscript\lib\site-packages\torch\autograd\grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
     24 @functools.wraps(func)
     25 def decorate_context(*args, **kwargs):
     26     with self.clone():
---> 27         return func(*args, **kwargs)

File D:\Podscript\Transcribe\stable_whisper.py:1392, in DecodingTaskWordLevel.run(self, mel)
   1389 sum_logprobs = sum_logprobs.reshape(n_audio, self.n_group)
   1391 # get the final candidates for each group, and slice between the first sampled token and EOT
-> 1392 tokens, sum_logprobs, ts = self.decoder.finalize(tokens, sum_logprobs)
   1393 tokens: List[List[Tensor]] = [
   1394     [t[self.sample_begin: (t == tokenizer.eot).nonzero()[0, 0]] for t in s] for s in tokens
   1395 ]
   1396 ts: List[List[Tensor]] = [[t[:, :tokens[i][j].shape[-1]] for j, t in enumerate(s)] for i, s in enumerate(ts)]

File D:\Podscript\Transcribe\stable_whisper.py:1275, in BeamSearchDecoderWordLevel.finalize(self, preceding_tokens, sum_logprobs)
   1273 seq_tuple = tuple(sequence)
   1274 sequences[seq_tuple] = sum_logprobs[i][j].item()
-> 1275 ts_[i][seq_tuple] = self.ts[i, j]
   1276 if len(sequences) >= self.beam_size:
   1277     break

KeyError: 0
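
As a stopgap for this kind of failure, the call could be wrapped so that it retries without silence suppression when a KeyError is raised. This is a sketch only: `transcribe_with_fallback` is a hypothetical helper, not part of stable-ts, and it assumes the callable accepts `suppress_silence` as a keyword argument, as the call in the traceback does.

```python
def transcribe_with_fallback(transcribe, audio_path, **options):
    """Try with silence suppression first; retry without it on KeyError.

    `transcribe` is any callable with the calling convention of
    model.transcribe. Hypothetical helper, not part of stable-ts.
    """
    # The helper controls this flag itself, so drop any caller-supplied value.
    options.pop("suppress_silence", None)
    try:
        return transcribe(audio_path, suppress_silence=True, **options)
    except KeyError:
        # Workaround only: the underlying bug is still present in this path.
        return transcribe(audio_path, suppress_silence=False, **options)


# Usage, assuming `model` and `args` as in transcribeWhisper():
# result = transcribe_with_fallback(model.transcribe, "shifted_audio.wav", **args)
```

Catching KeyError this broadly can hide unrelated bugs, so it is better treated as temporary.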

2 replies

jianfch Nov 6, 2022
Maintainer

It should be fixed in 4cbb0b8.

cetiny Nov 7, 2022
Author

Cool. Working now. Thank you again!
