Cog Whisper Diarization

Audio transcription and speaker diarization pipeline.

AI/ML Models used

Whisper Large v3 (CTranslate2 version, via faster-whisper==1.0.3)
pyannote.audio 3.3.1

Usage

Used at Audiogest and Spectropic
Or try at Replicate
Or deploy yourself on Replicate or any machine with a GPU

Deploy

Make sure you have cog installed
Accept pyannote/segmentation-3.0 user conditions
Accept pyannote/speaker-diarization-3.1 user conditions
Create a HuggingFace token at hf.co/settings/tokens.
Insert your own HuggingFace token in the setup function in predict.py
- (Be careful not to commit this token!)
Run cog build
Run cog predict -i file=@input.wav
- Or push to Replicate with cog push r8.im/<username>/<name>
Please follow instructions on cog.run if you run into issues
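
One way to keep the token out of version control is to read it from the environment instead of hardcoding it. The following is a minimal sketch, not the repository's code: the helper name load_hf_token and the variable name HF_TOKEN are assumptions, and it presumes the variable is available inside the container when setup runs.

```python
import os


def load_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the HuggingFace token from the environment.

    Both the helper name and the default variable name are hypothetical;
    they are not defined by this repository.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set {env_var} to a HuggingFace access token before running setup."
        )
    return token
```

The returned string could then be used wherever setup currently expects the pasted token, so the secret never appears in predict.py itself.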

Input

file_string: str: Provide the audio file as a Base64-encoded string.
file_url: str: Or provide a direct URL to the audio file.
file: Path: Or provide an audio file.
group_segments: bool: Group segments of the same speaker that are less than 2 seconds apart. Default is True.
num_speakers: int: Number of speakers. Leave empty to autodetect. Must be between 1 and 50.
translate: bool: Translate the speech into English.
language: str: Language of the spoken words as a language code like 'en'. Leave empty to autodetect the language.
prompt: str: Vocabulary: provide names, acronyms, and loanwords in a list. Use punctuation for best accuracy. Also used as the 'hotwords' parameter during transcription.
offset_seconds: int: Offset in seconds, used for chunked inputs. Default is 0.
transcript_output_format: str: Specify the format of the transcript output: individual words with timestamps, full text of segments, or a combination of both.
- Default is both.
- Options are words_only, segments_only, both.
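
The inputs above can be assembled into a single payload before calling the model. This is a hedged sketch: the helper build_input is hypothetical, it only covers the file_string and file_url variants, and the validation mirrors the constraints listed above rather than the actual checks in predict.py.

```python
import base64
from typing import Optional

# Allowed values taken from the transcript_output_format description above.
OUTPUT_FORMATS = {"words_only", "segments_only", "both"}


def build_input(
    audio_bytes: Optional[bytes] = None,
    file_url: Optional[str] = None,
    num_speakers: Optional[int] = None,
    language: Optional[str] = None,
    prompt: Optional[str] = None,
    group_segments: bool = True,
    offset_seconds: int = 0,
    transcript_output_format: str = "both",
) -> dict:
    """Build an input dict using the parameter names documented above."""
    if (audio_bytes is None) == (file_url is None):
        raise ValueError("Provide exactly one of audio_bytes or file_url.")
    if num_speakers is not None and not 1 <= num_speakers <= 50:
        raise ValueError("num_speakers must be between 1 and 50.")
    if transcript_output_format not in OUTPUT_FORMATS:
        raise ValueError(f"transcript_output_format must be one of {sorted(OUTPUT_FORMATS)}.")

    payload = {
        "group_segments": group_segments,
        "offset_seconds": offset_seconds,
        "transcript_output_format": transcript_output_format,
    }
    if audio_bytes is not None:
        # file_string expects the raw audio encoded as Base64 text.
        payload["file_string"] = base64.b64encode(audio_bytes).decode("ascii")
    else:
        payload["file_url"] = file_url
    # Optional fields are omitted when empty so the model can autodetect them.
    if num_speakers is not None:
        payload["num_speakers"] = num_speakers
    if language:
        payload["language"] = language
    if prompt:
        payload["prompt"] = prompt
    return payload
```

For example, build_input(file_url="https://example.com/meeting.wav", num_speakers=2) yields a dict that leaves language unset, so the language is autodetected.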

Output

segments: List[Dict]: List of segments with speaker, start and end time.
- Includes avg_logprob for each segment and probability for each word-level segment.
num_speakers: int: Number of speakers (detected, unless specified in input).
language: str: Language of the spoken words as a language code like 'en' (detected, unless specified in input).
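
As an illustration of consuming the output, the sketch below turns the segments list into speaker-labelled lines and flags segments with a low avg_logprob. The key names speaker, start, end, text, and avg_logprob are assumptions based on the description above, and the -1.0 threshold is an arbitrary example value, not a recommendation from this project.

```python
from typing import Dict, List


def format_transcript(segments: List[Dict], min_avg_logprob: float = -1.0) -> List[str]:
    """Render one line per segment: time range, speaker label, and text."""
    lines = []
    for seg in segments:
        # Segments below the (assumed) threshold are marked for manual review.
        low = seg.get("avg_logprob", 0.0) < min_avg_logprob
        flag = " [low confidence]" if low else ""
        start, end = float(seg["start"]), float(seg["end"])
        text = seg.get("text", "").strip()
        lines.append(f"[{start:.2f}-{end:.2f}] {seg['speaker']}: {text}{flag}")
    return lines


# Hypothetical sample shaped like the documented output, for illustration only.
sample = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 1.5, "text": " Hello there.", "avg_logprob": -0.2},
    {"speaker": "SPEAKER_01", "start": 1.8, "end": 3.0, "text": " Hi.", "avg_logprob": -1.4},
]
for line in format_transcript(sample):
    print(line)
```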

Thanks to

pyannote - Speaker diarization model
whisper - Speech recognition model
faster-whisper - Reimplementation of Whisper model for faster inference
cog - ML containerization framework
