This document is a short guide to three helper scripts used in the transcription and experiment workflow. For each script, it explains what the script does, how to run it, and when to call it in the overall workflow.
helper_determine_prompt_length_tokens.py
This script determines the token length of a given prompt using the Whisper transcription model. It helps ensure that prompt lengths do not exceed the token limits for transcription experiments.
To run the script with a specific prompt:
python -m whisper_chunk_transcribe.helper_determine_prompt_length_tokens "Your prompt text here"
When to call it:
- Before setting up experiments that require token length calculations for prompt templates.
- During experiment design, to ensure that prompts fit within the token limit of the transcription engine.
What it does:
- Uses the Whisper model to calculate the token length of a provided prompt.
- Outputs the token length so that it can be checked against the allowed limit.
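The length check described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function names and the whitespace stand-in tokenizer are hypothetical, and a real run would pass a Whisper tokenizer's encode function instead.

```python
from typing import Callable, List


def whitespace_encode(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return text.split()


def prompt_token_length(
    prompt: str, encode: Callable[[str], list] = whitespace_encode
) -> int:
    """Return the number of tokens that `encode` produces for `prompt`."""
    return len(encode(prompt))


def fits_limit(
    prompt: str, max_tokens: int, encode: Callable[[str], list] = whitespace_encode
) -> bool:
    """Return True when the prompt is no longer than `max_tokens` tokens."""
    return prompt_token_length(prompt, encode) <= max_tokens


# With the openai-whisper package installed, a real tokenizer could be used:
#   from whisper.tokenizer import get_tokenizer
#   encode = get_tokenizer(multilingual=True).encode
# Whether the helper script does exactly this is an assumption.

length = prompt_token_length("Your prompt text here")
print(length)  # 4 with the whitespace stand-in; a real tokenizer will differ
```

Passing the tokenizer as a parameter keeps the length check independent of any particular model, so the same check can be reused with whichever limit the experiment sets.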
helper_update_prompt_terms.py
This script updates the prompt terms for an experiment by retrieving terms from the database, calculating their token lengths, and building a prompt that fits within a specified token limit.
To run the script with an experiment ID and optional token limit:
python -m whisper_chunk_transcribe.helper_update_prompt_terms --experiment_id 1 --max_tokens 255
When to call it:
- After populating the experiment data but before running the experiment, to build the final prompts for any relevant test cases.
- During prompt creation, to ensure that the selected terms fit within the token limit for the transcription engine.
What it does:
- Retrieves prompt terms from the database for the specified experiment_id.
- Builds a prompt by adding terms that do not exceed the max_tokens limit.
- Outputs the final prompt and its token count.
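The prompt-building step above can be sketched as a greedy loop. This is an illustration under stated assumptions rather than the script's implementation: `build_prompt`, the comma separator, the skip-and-continue behavior for terms that do not fit, and the whitespace token counter are all hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def build_prompt(
    terms: Iterable[str],
    max_tokens: int,
    count_tokens: Callable[[str], int],
    separator: str = ", ",
) -> Tuple[str, int]:
    """Greedily add terms while the joined prompt stays within max_tokens.

    The whole candidate prompt is re-counted on every step, so tokens
    contributed by the separator are included in the total.
    """
    selected: List[str] = []
    for term in terms:
        candidate = separator.join(selected + [term])
        if count_tokens(candidate) <= max_tokens:
            selected.append(term)
        # Assumption: a term that does not fit is skipped, and later
        # (possibly shorter) terms are still tried.
    prompt = separator.join(selected)
    return prompt, count_tokens(prompt)


def count_words(text: str) -> int:
    """Stand-in counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


# Placeholder terms; real terms would come from the database for an experiment_id.
terms = ["Player One", "Team A", "Player Two Junior", "Coach"]
prompt, n_tokens = build_prompt(terms, max_tokens=5, count_tokens=count_words)
print(prompt, n_tokens)  # Player One, Team A, Coach 5
```

Re-counting the joined prompt is slower than summing per-term counts, but it avoids undercounting when the tokenizer merges or splits text around separators.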
helper_player_team_names.py
This script fetches player and team data from the ESPN API and upserts this information into the database for use in game-related transcription experiments.
To run the script, use the following command:
python -m whisper_chunk_transcribe.helper_player_team_names
When to call it:
- When preparing player and team data for sports-related transcription experiments.
- Before running experiments that require player and team information from specific games, particularly in sports domains like baseball, football, and basketball.
What it does:
- Fetches game data from the ESPN API using espn_id.
- Extracts player and team information for the home and away teams.
- Upserts the game, player, and team data into the database for use in experiments.
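The extract-and-upsert steps above can be sketched with an in-memory SQLite database. Everything here is assumed for illustration: the `game` dictionary is a hand-written placeholder and not the ESPN API's actual response format, the `players` table and function names are hypothetical, and the project's real database and schema may differ.

```python
import sqlite3

# Hypothetical payload shape; NOT the real ESPN API schema.
game = {
    "espn_id": 1001,
    "home": {"team": "Team A", "players": [{"id": 1, "name": "Player One"}]},
    "away": {"team": "Team B", "players": [{"id": 2, "name": "Player Two"}]},
}


def extract_players(game: dict) -> list:
    """Flatten home and away rosters into (player_id, name, team) rows."""
    rows = []
    for side in ("home", "away"):
        team = game[side]["team"]
        for player in game[side]["players"]:
            rows.append((player["id"], player["name"], team))
    return rows


def upsert_players(conn: sqlite3.Connection, rows: list) -> None:
    """Insert new players; update name and team when the id already exists."""
    conn.executemany(
        """
        INSERT INTO players (player_id, name, team)
        VALUES (?, ?, ?)
        ON CONFLICT(player_id) DO UPDATE SET
            name = excluded.name,
            team = excluded.team
        """,
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE players (player_id INTEGER PRIMARY KEY, name TEXT, team TEXT)"
)
upsert_players(conn, extract_players(game))
upsert_players(conn, extract_players(game))  # re-running adds no duplicates
print(conn.execute("SELECT COUNT(*) FROM players").fetchone()[0])  # 2
```

Because an upsert updates existing rows instead of failing or duplicating them, the helper can be re-run for the same game without manual cleanup.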
Call the scripts in the following order:
- First, use helper_determine_prompt_length_tokens.py to determine the token lengths of specific prompts, ensuring they fit within the transcription engine's limits.
- Next, use helper_update_prompt_terms.py to update and generate valid prompt terms for an experiment, ensuring that the final prompt fits within the token limit before running the actual transcription experiments.
- Finally, use helper_player_team_names.py to fetch and store player/team data for sports-related experiments.
This order ensures that all necessary data (players, teams, prompt terms) is correctly set up before executing the main transcription experiments.