Use Whisper for Speaker Diarization on Infant-directed Speech (IDS) #2401

sucv · 2024-10-22T10:26:57Z


Greetings everyone! I would like to ask for your advice on using Whisper for speaker diarization of infant-directed speech (IDS), where the priority is obtaining the timing of the parent's and the infant's vocalizations. (I am aware of existing speaker diarization tools such as Pyannote and NeMo, but in my trials so far they have not worked for IDS scenarios.)

IDS scenarios differ considerably from normal conversation. In IDS, the infant is too young to actually speak; they can only cry, coo, yell, etc. The pitch and loudness of the adult's voice in IDS are usually different from what they would use in, say, an office meeting. The vocalizations are also sometimes extremely imbalanced: the infant may produce very few utterances, or only short ones, in a long session. The original Whisper is a speech-to-text (STT) model and cannot recognize such IDS utterances. (Maybe I am wrong?)

Therefore, I am thinking about fine-tuning Whisper on my IDS data and then asking ChatGPT to classify the recognized transcriptions as parent or infant. This idea shows promising results when the input comes from older children who can speak and whose speech Whisper recognizes. I am going to generalize it to infants between 3 and 18 months old.

Below is the idea in detail. Normally, the audio-transcription pairs used to train Whisper look like:

/path/to/audio1.wav,"Hello, how are you?"
/path/to/audio2.wav,"I am fine, thank you."
...

In our case, we could prepare something like:

/path/to/audio1.wav,"Baby, look here!"
/path/to/audio2.wav,"non-verbal"
/path/to/audio3.wav,"Mom, mom"
...
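A minimal sketch of how a manifest in this format might be written and read back with Python's standard `csv` module. The helper names `write_manifest` and `read_manifest` and the `NON_VERBAL` constant are hypothetical, and the `"non-verbal"` tag is only the placeholder from the example above, not an established Whisper convention.

```python
import csv
import io

# Assumed placeholder transcript for cries, coos, and other non-speech sounds.
NON_VERBAL = "non-verbal"


def write_manifest(rows, fh):
    """Write (audio_path, transcript) pairs as CSV rows.

    The csv module quotes any transcript that contains a comma,
    so "Baby, look here!" stays in a single field.
    """
    writer = csv.writer(fh)
    for path, text in rows:
        writer.writerow([path, text])


def read_manifest(fh):
    """Read the CSV back into a list of (audio_path, transcript) tuples."""
    return [(path, text) for path, text in csv.reader(fh)]


rows = [
    ("/path/to/audio1.wav", "Baby, look here!"),
    ("/path/to/audio2.wav", NON_VERBAL),
    ("/path/to/audio3.wav", "Mom, mom"),
]

# An in-memory buffer stands in for a real file on disk.
buf = io.StringIO()
write_manifest(rows, buf)
buf.seek(0)
loaded = read_manifest(buf)
```

Using a CSV writer rather than manual string joining avoids stray or missing delimiters when a transcript itself contains commas.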

In this IDS example, audio2.wav could contain crying, cooing, etc. I hope that by fine-tuning Whisper on data like this, it can identify non-verbal utterances with timestamps. A downstream ChatGPT step would then be used, with some prompt engineering to provide examples and guidance.
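To make the downstream step concrete, here is a rough sketch of turning timestamped segments into speaker turns. It assumes the fine-tuned model yields segments with `start`, `end`, and `text` fields. The sample segments are made up for illustration, and `label_segment` is a hypothetical rule-based stand-in for the ChatGPT classifier, not the classifier itself.

```python
# Assumed placeholder transcript emitted for non-speech infant sounds.
NON_VERBAL = "non-verbal"


def label_segment(text):
    """Stand-in for the LLM step: the placeholder tag means infant.

    A real classifier would also need to catch verbal infant
    utterances such as "Mom, mom", which this rule labels as parent.
    """
    return "infant" if text.strip().lower() == NON_VERBAL else "parent"


def to_turns(segments, max_gap=0.5):
    """Merge consecutive same-speaker segments into turns.

    Segments are merged when the silence between them is at most
    max_gap seconds (an arbitrary illustrative threshold).
    """
    turns = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        speaker = label_segment(seg["text"])
        last = turns[-1] if turns else None
        if last and last["speaker"] == speaker and seg["start"] - last["end"] <= max_gap:
            last["end"] = max(last["end"], seg["end"])
        else:
            turns.append({"speaker": speaker, "start": seg["start"], "end": seg["end"]})
    return turns


# Hypothetical segments; times are in seconds.
segments = [
    {"start": 0.0, "end": 1.5, "text": "Baby, look here!"},
    {"start": 1.75, "end": 3.0, "text": "Look at the ball."},
    {"start": 3.5, "end": 4.25, "text": "non-verbal"},
    {"start": 6.0, "end": 7.0, "text": "Good job!"},
]
turns = to_turns(segments)
for turn in turns:
    print(f'{turn["speaker"]}: {turn["start"]:.2f}-{turn["end"]:.2f}')
```

The rule-based labeler only covers the easy case; the harder cases, where an infant produces recognizable words, are where the prompt-engineered classifier would be needed.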

Do you think my idea makes sense, or have I overlooked anything critical? Any input is much appreciated!

EtienneAb3d · 2024-10-22T11:09:18Z


You should have a look at this project. I think it's very close to yours:
https://github.com/YuanGongND/whisper-at
