Prevent clumping of sentences in a segment. #90

aleksa11010 · 2023-03-07T13:46:42Z

aleksa11010
Mar 7, 2023

Hello,

My current implementation sometimes clumps quick questions and answers into a single segment.

{ "Start": "0:00:00", "End": "0:00:15", "Speaker": "SPEAKER 1", "Text": " What brings you to see us today? I'm here for my scheduled appointment. OK, is this your first time here? Yeah, it is. OK, please take a seat. " },

There are two different speakers, but one answers within a very short time of the other. I tried setting logprob_threshold=-1.5 and no_speech_threshold=0.05, but that still returns multiple sentences in a single segment.

What is the best way to ensure each segment contains only one sentence?

Answered by aleksa11010


aleksa11010 · 2023-03-09T10:33:26Z

aleksa11010
Mar 9, 2023
Author

I managed to put together something that works well, though it is not 100% accurate:

import pysrt
import nltk.data

def process_subtitle_file(filename):
    subs = pysrt.open(filename)
    text = ' '.join([sub.text.strip() for sub in subs])
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

    sentences = tokenizer.tokenize(text)
    sentence_data = []

    first_start = subs[0].start
    last_end = subs[-1].end

    for i, sentence in enumerate(sentences):
        # Initialize variables to track the start and end times of the sentence
        sentence_start = None
        sentence_end = None
        sentence_found = False
    
        # Iterate through each subtitle item to find the start and end times of the sentence
        for j in range(len(subs)):
            sub = subs[j]
    
            # If the current subtitle item contains the full sentence, set the start and end times
            if sentence in sub.text:
                sentence_start = (sub.start.hours * 3600 +
                                      sub.start.minutes * 60 +
                                      sub.start.seconds +
                                      sub.start.milliseconds / 1000)
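                # Read the end time from the subtitle's own end timestamp; the
                # seconds and milliseconds components of duration alone would
                # omit any minutes or hours in a long subtitle
                sentence_end = (sub.end.hours * 3600 +
                                sub.end.minutes * 60 +
                                sub.end.seconds +
                                sub.end.milliseconds / 1000)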
                sentence_found = True
                if sentence_start is not None and sentence_end is not None:
                    sentence_data.append({
                        'text': sentence,
                        'start': sentence_start,
                        'end': sentence_end
                    })
    
            # If the current subtitle item doesn't contain the full sentence, append the next subtitle items
            elif sub.text in sentence:
                sentence_start = (sub.start.hours * 3600 +
                                      sub.start.minutes * 60 +
                                      sub.start.seconds +
                                      sub.start.milliseconds / 1000)
                combined_text = sub.text
                k = j + 1
    
                while not sentence_found and k < len(subs):
                    next_sub = subs[k]
                    combined_text += " " + next_sub.text
                    k += 1
    
                    if sentence in combined_text:
                        # The sentence finishes in the last subtitle appended,
                        # so the end time comes from next_sub, not from sub
                        sentence_end = (next_sub.end.hours * 3600 +
                                next_sub.end.minutes * 60 +
                                next_sub.end.seconds +
                                next_sub.end.milliseconds / 1000)
                        sentence_found = True
                        if sentence_start is not None and sentence_end is not None:
                            sentence_data.append({
                                'text': sentence,
                                'start': sentence_start,
                                'end': sentence_end
                            })
                        
                        break
    
            # If the full sentence is found, print the sentence and its start/end time
            if sentence_found:
                print(f"Sentence {i+1}: {sentence} (start: {sentence_start}, end: {sentence_end})")
                break


    return sentence_data
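
The same hours/minutes/seconds/milliseconds arithmetic is repeated several times above, so it could be pulled into a small helper. This is only a sketch: `to_seconds` is a hypothetical name, and the `SimpleNamespace` object is a stand-in for a pysrt timestamp so the snippet runs without a subtitle file.

```python
from types import SimpleNamespace


def to_seconds(t):
    """Convert a timestamp with hours/minutes/seconds/milliseconds fields to float seconds."""
    return t.hours * 3600 + t.minutes * 60 + t.seconds + t.milliseconds / 1000


# Stand-in for a subtitle timestamp (assumption: same four attributes as sub.start / sub.end)
stamp = SimpleNamespace(hours=0, minutes=1, seconds=15, milliseconds=500)
print(to_seconds(stamp))  # 75.5
```

With a helper like this, each start/end calculation in the function would shrink to a single call such as `to_seconds(sub.start)`.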


jianfch · 2023-03-11T00:59:42Z

jianfch
Mar 11, 2023
Maintainer

The latest Whisper version introduced reliable word-level timestamps (though stable-ts has not yet been updated to support that version). With reliable word-level timestamps, it should be as straightforward as splitting the segments at sentence-ending punctuation (e.g. periods and question marks).
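
A rough sketch of that splitting idea, not tied to any particular library version: it assumes each word is a dict with "word", "start" and "end" keys, which is the shape Whisper returns per segment when word timestamps are enabled. `split_into_sentences` is a hypothetical helper name, and the timestamps in the sample data are made up for illustration.

```python
def split_into_sentences(words, terminators=(".", "?", "!")):
    """Group word dicts into sentences, each with its own start and end time."""
    terminators = tuple(terminators)  # str.endswith needs a tuple, not a list
    groups, current = [], []
    for w in words:
        current.append(w)
        # A word ending in terminal punctuation closes the current sentence
        if w["word"].strip().endswith(terminators):
            groups.append(current)
            current = []
    if current:  # trailing words without final punctuation
        groups.append(current)
    return [
        {
            "text": "".join(w["word"] for w in g).strip(),
            "start": g[0]["start"],
            "end": g[-1]["end"],
        }
        for g in groups
    ]


# Illustrative words only; the times are invented, not real model output
words = [
    {"word": " Is", "start": 0.0, "end": 0.2},
    {"word": " this", "start": 0.2, "end": 0.4},
    {"word": " your", "start": 0.4, "end": 0.6},
    {"word": " first", "start": 0.6, "end": 0.9},
    {"word": " time?", "start": 0.9, "end": 1.1},
    {"word": " Yeah,", "start": 1.3, "end": 1.6},
    {"word": " it", "start": 1.6, "end": 1.8},
    {"word": " is.", "start": 1.8, "end": 2.0},
]

for sentence in split_into_sentences(words):
    print(sentence)
```

Splitting purely on punctuation will also break after abbreviations such as "Dr.", so a sentence tokenizer could be swapped in for the `endswith` check if that matters.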
