Prevent clumping of sentences in a segment. #90

Answered by aleksa11010
aleksa11010 asked this question in Q&A

I managed to create something that works well, though it is not 100% accurate:

import pysrt
import nltk.data

def process_subtitle_file(filename):
    subs = pysrt.open(filename)
    text = ' '.join([sub.text.strip() for sub in subs])
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

    sentences = tokenizer.tokenize(text)
    sentence_data = []

    first_start = subs[0].start
    last_end = subs[-1].end

    for i, sentence in enumerate(sentences):
        # Initialize variables to track the start and end times of the sentence
        sentence_start = None
        sentence_end = None
        sentence_found = False
    
        # Iterate through each subtitle item to find the s…
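
The snippet above is cut off, so the rest of the loop is not reproduced here. As a rough, separate illustration of the general idea (mapping each sentence back to the subtitle items it came from), here is a minimal stdlib-only sketch. It is not the code from the answer: the `sentence_timings` function and the `(start_ms, end_ms, text)` tuple layout are hypothetical, and a naive regex stands in for the NLTK punkt tokenizer. It records where each cue starts in the joined text and looks up the cues covering each sentence.

```python
import re
from bisect import bisect_right


def sentence_timings(cues):
    """Map sentences to cue times.

    cues: list of (start_ms, end_ms, text) tuples (hypothetical layout,
    standing in for pysrt subtitle items).
    Returns a list of (start_ms, end_ms, sentence) tuples.
    """
    # Join cue texts with single spaces, remembering the character
    # offset at which each cue begins in the joined string.
    offsets, parts, pos = [], [], 0
    for _, _, text in cues:
        text = text.strip()
        offsets.append(pos)
        parts.append(text)
        pos += len(text) + 1  # +1 for the joining space
    joined = " ".join(parts)

    results = []
    # Naive splitter (assumption, not punkt): a sentence ends at
    # . ! or ? followed by whitespace, or at the end of the text.
    pattern = re.compile(r"\S.*?(?:[.!?]+(?=\s|$)|$)", re.S)
    for match in pattern.finditer(joined):
        begin, end = match.span()
        # Cue containing the first and the last character of the sentence.
        first = bisect_right(offsets, begin) - 1
        last = bisect_right(offsets, end - 1) - 1
        results.append((cues[first][0], cues[last][1], match.group()))
    return results


if __name__ == "__main__":
    demo = [
        (0, 1000, "Hello there. How"),
        (1000, 2500, "are you doing"),
        (2500, 4000, "today? Fine."),
    ]
    for row in sentence_timings(demo):
        print(row)
```

Timings here are only as precise as the cue boundaries: two sentences that share a cue get overlapping ranges, which is one reason this kind of approach is not fully accurate.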

Answer selected by aleksa11010