How to Prevent English Contractions From Being Split in SRT Segments? #123

chislon · 2023-04-07T04:12:06Z

I've been testing subtitle output for hour-long files on version 2.3.

I'm noticing that English contractions are being split across different segments, and I'm not sure how to control this behavior.

Aside from the default regrouping, I'm trying to use the split_by_length function to trim the SRT subtitle output to 84 characters, but I'm finding that English contractions are sometimes split onto separate lines. I'm not sure if the problem is split_by_length not recognizing apostrophe characters.

I know it has to do with the regrouping function calls. So far I have this, which isn't too bad except for the contractions:

  (
      result
      .split_by_punctuation([('.', ' '), '。', '?', '？', ',', '，'])
      .split_by_gap(.5)
      .merge_by_punctuation(["'"])
      .merge_by_gap(.15, max_words=3)
      .split_by_punctuation([('.', ' '), '。', '?', '？'])
      .split_by_length(max_chars=84, lock=True)
  )

Problematic output as follows:

1093
01:13:16,118 --> 01:13:19,262
What is going on in Tom

1094
01:13:19,898 --> 01:13:25,358
's life.

or

339
00:18:30,722 --> 00:18:31,166
I

340
00:18:32,802 --> 00:18:33,702
've known you for a while now
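
For anyone who wants to check a finished SRT file for this symptom, here is a minimal sketch. It assumes the split always shows up as a cue whose text begins with an apostrophe, as in the two samples above. The helper name find_split_contractions is hypothetical and not part of stable-ts, and a cue that legitimately opens with a single-quoted phrase would be flagged too.

```python
import re

# Hypothetical helper (not part of stable-ts): flags subtitle cues whose text
# begins with an apostrophe, e.g. "'s life." or "'ve known you for a while now".
CUE_SEPARATOR = re.compile(r"\n\s*\n")

def find_split_contractions(srt_text):
    """Return (cue_index, text) pairs for cues that start with an apostrophe."""
    hits = []
    for block in CUE_SEPARATOR.split(srt_text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue  # skip cues without index, timing, and text lines
        index = lines[0].strip()
        text = " ".join(lines[2:]).strip()
        # Check both the ASCII apostrophe and the typographic one.
        if text.startswith(("'", "\u2019")):
            hits.append((index, text))
    return hits

sample = """1093
01:13:16,118 --> 01:13:19,262
What is going on in Tom

1094
01:13:19,898 --> 01:13:25,358
's life.
"""

print(find_split_contractions(sample))
```

Running it over the whole output file gives a quick count of how many cues are affected before and after changing the regrouping chain.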

Here's the whole script I've been working with:

import os
import tkinter as tk
from tkinter import messagebox
from tkinter.filedialog import askopenfilename
from datetime import datetime
from stable_whisper import load_model

MODEL_SIZE = "small"
LANGUAGE = "en"

def generate_subtitle(path):
    print(f"Transcribing to SRT with model: {MODEL_SIZE}")
    model = load_model(MODEL_SIZE)
    
    decode_options = dict(language = LANGUAGE)
    transcribe_options = dict(task="transcribe", **decode_options)
    
    result = model.transcribe(audio=path, condition_on_previous_text=False, vad=True, only_voice_freq=True, regroup=False, **transcribe_options)
    # Regroup the segments before export.
    (
        result
        .split_by_punctuation([('.', ' '), '。', '?', '？', ',', '，'])
        .split_by_gap(.5)
        .merge_by_punctuation(["'"])
        .merge_by_gap(.3, max_words=3)
        .split_by_punctuation([('.', ' '), '。', '?', '？'])
        .split_by_length(max_chars=84)
    )
    # Write segment-level (not word-level) subtitles next to the input file.
    result.to_srt_vtt(f"{path}.srt", word_level=False)



print("Script for Outputting SRT Subtitle File from Media File")

root = tk.Tk()
root.withdraw()

filename = askopenfilename(
    title="Choose audio file to generate subtitles for", 
    filetypes=[("Audio files", "*.mp3 *.wav *.aac *.m4a *.mka *.mp4 *.mkv")]
)

if not filename:
    messagebox.showwarning("Critical Problem", "Filename not specified, exit now.")
    quit()

if os.path.exists(f"{filename}.srt"):
    messagebox.showwarning("Critical Problem", "Output SRT already exists, remove it before running this, exit now.")
    quit()

print(f"Input filename: {filename}")
start_time=datetime.now()
print("Start Time =", start_time.strftime("%H:%M:%S"))

try:
    generate_subtitle(filename)
except Exception as e:
    messagebox.showerror("Error", str(e))
    quit()

end_time = datetime.now()
print(f"End Time: {end_time.strftime('%H:%M:%S')}")

time_taken = end_time - start_time
time_taken_minutes = int(time_taken.total_seconds() // 60)
time_taken_seconds = int(time_taken.total_seconds() % 60)
print(f"Time taken: {time_taken_minutes} minutes, {time_taken_seconds} seconds")

messagebox.showinfo("End of script", f"Finished execution, exit now. Time taken: {time_taken_minutes} minutes, {time_taken_seconds} seconds")

print("End of script")

quit()

Edit:
This seems to happen when Whisper operates in Chinese language mode and generates separate word tokens that split English contractions apart.

jianfch · 2023-04-07T06:44:30Z

jianfch
Apr 7, 2023
Maintainer

The later splits might have split it. Have you tried it with lock=True?

result.merge_by_punctuation(["'"], lock=True)

11 replies

jianfch Apr 8, 2023
Maintainer

Those are separate words for language=zh because words are not split by spaces for zh. But from the JSON you shared, it seems like they are part of the same segment.

9
00:00:35,580 --> 00:00:40,900
Tom and Mike were tirelessly in the studio recording and perfecting Tom's music.

11
00:00:48,120 --> 00:00:49,480
they'd do that as well.

chislon Apr 8, 2023
Author

I still ended up with contractions split across 2 segments in my results, such as "Tom" and "'s" in separate segments in my output. I wonder if one of the split functions on my result processed it that way, but merge_by_punctuation didn't seem to do the trick.

This isn't mission critical for me, I'm happy with the state of things. It's just strange behavior that maybe someone else could run into.

Thanks a lot for taking time to respond at all. I know it takes time.

jianfch Apr 8, 2023
Maintainer

If you use other split methods after that one, they might still split it even with lock=True on merge_by_punctuation(), because it does not lock words that are already in the same segment. To prevent this, you can merge everything into 1 segment with merge_all_segments() before any splits or merges.
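
To make that point concrete, here is a toy model in plain Python. It is only a sketch of the idea, assuming a lock is recorded solely for boundaries that a merge actually removes. The helpers merge_before_prefix, split_by_length, and render are hypothetical and do not mirror stable-ts internals.

```python
# Toy model only: these helpers are hypothetical and do not mirror stable-ts internals.
WORDS = ["What", " is", " going", " on", " in", " Tom", "'s", " life."]

def merge_before_prefix(words, breaks, prefix, lock=True):
    """Remove segment breaks that sit right before a word starting with `prefix`.

    Returns (remaining_breaks, locked_boundaries). Only boundaries that were
    actually merged get locked.
    """
    kept, locks = set(), set()
    for b in breaks:
        if words[b].startswith(prefix):
            if lock:
                locks.add(b)
        else:
            kept.add(b)
    return kept, locks

def split_by_length(words, breaks, locks, max_chars):
    """Add a break before any word that would push a segment past max_chars,
    unless that boundary is locked."""
    new_breaks, length = set(breaks), 0
    for i, word in enumerate(words):
        if i in new_breaks:
            length = 0
        elif length and length + len(word) > max_chars and i not in locks:
            new_breaks.add(i)
            length = 0
        length += len(word)
    return new_breaks

def render(words, breaks):
    """Join words into segment strings, starting a new one at each break."""
    segments, current = [], ""
    for i, word in enumerate(words):
        if i in breaks and current:
            segments.append(current.strip())
            current = ""
        current += word
    segments.append(current.strip())
    return segments

# Case A: "Tom" and "'s" already share a segment, so the merge finds nothing to lock.
breaks_a, locks_a = merge_before_prefix(WORDS, set(), "'")
case_a = render(WORDS, split_by_length(WORDS, breaks_a, locks_a, max_chars=23))

# Case B: a break exists before "'s", so the merge removes it and locks that boundary.
breaks_b, locks_b = merge_before_prefix(WORDS, {6}, "'")
case_b = render(WORDS, split_by_length(WORDS, breaks_b, locks_b, max_chars=23))

print(case_a)
print(case_b)
```

In case A the later length split is free to cut right before "'s", which reproduces the problematic output from the original post. In case B the locked boundary survives and the cut moves to the next word.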

jianfch Apr 8, 2023
Maintainer

result.merge_all_segments().split_by_punctuation([("","'")]).merge_by_punctuation([("", "'")], lock=True)

This should do the trick. If you use ["'"], it only looks for the ones ending with ' and not those starting with '. But in your case, it starts with '.
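
The difference between the two pattern forms can be sketched as follows. This is a simplified model of the rule described above, not stable-ts source. It assumes a tuple is read as (ending of the previous segment, beginning of the next segment), which matches how ('.', ' ') is used in the chain from the original post. The function boundary_matches is hypothetical.

```python
def boundary_matches(prev_text, next_text, pattern):
    """Simplified model (assumption, not library code) of how a pattern is
    checked against the boundary between two neighboring segments.

    str pattern: the previous segment must end with it.
    (ending, beginning) tuple: the previous segment must end with `ending`
    and the next segment must start with `beginning`.
    """
    if isinstance(pattern, str):
        return prev_text.endswith(pattern)
    ending, beginning = pattern
    return prev_text.endswith(ending) and next_text.startswith(beginning)

# "Tom" does not end with an apostrophe, so the plain string form misses it.
print(boundary_matches("Tom", "'s life.", "'"))        # False

# An empty ending always matches, so only the leading apostrophe is checked.
print(boundary_matches("Tom", "'s life.", ("", "'")))  # True
```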

Answer selected by chislon

chislon Apr 8, 2023
Author

Thanks, I'll try this!

chislon Apr 9, 2023
Author

I wasn't able to get the merge_all_segments(), split_by_punctuation([("","'")]), merge_by_punctuation([("", "'")], lock=True) sequence to work as expected. But the main thing is that merge_by_punctuation works as expected after commit a84a346. As a workaround, I'm calling merge_by_punctuation last while keeping split_by_length at a modestly low count to keep styling somewhat consistent.
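
A rough sketch of that ordering, for anyone landing here later. It reuses the method names and arguments from the chain in the original post and only moves merge_by_punctuation to the end. The wrapper regroup_merge_last is hypothetical, max_chars is left as a required argument because the exact count isn't stated, and the default apostrophe pattern is an assumption since the thread doesn't say which form was used in the final script.

```python
def regroup_merge_last(result, max_chars, apostrophe_patterns=("'",)):
    """Hypothetical wrapper: same regrouping steps as the original post, but with
    merge_by_punctuation called last so no later split can undo the merge.

    `result` is assumed to be the object returned by model.transcribe(..., regroup=False).
    """
    return (
        result
        .split_by_punctuation([('.', ' '), '。', '?', '？', ',', '，'])
        .split_by_gap(.5)
        .merge_by_gap(.15, max_words=3)
        .split_by_punctuation([('.', ' '), '。', '?', '？'])
        .split_by_length(max_chars=max_chars)
        # Assumption: pattern form is a guess; swap in [("", "'")] if needed.
        .merge_by_punctuation(list(apostrophe_patterns))
    )
```

Because the merge runs after the length split, a merged line can end up slightly longer than max_chars, which is why a lower split_by_length count helps here.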

chislon Apr 9, 2023
Author

Thanks again for putting this project together, it creates usable subtitles for what I'm trying to do even though accuracy is limited. It's definitely much more usable than out-of-the-box Whisper. This English contraction problem was bugging me a lot.
