Replies: 1 comment
-
You should have a look at this project. I think it's very close to yours: |
-
Greetings everyone! I would like to ask for your advice on using Whisper for speaker diarization of infant-directed speech (IDS), where the priority is obtaining the timing of the parent's and the infant's vocalizations. (I am aware of existing speaker diarization tools such as Pyannote and NeMo, but in my trial and error so far they have not worked for IDS scenarios.)
IDS scenarios differ substantially from normal conversation. In IDS, the infant is too young to actually speak; they can only cry, coo, yell, and so on. The pitch and loudness of the adult's voice in IDS are usually different from how they would speak in, say, an office meeting. The vocalizations can also be extremely imbalanced (for example, the infant may have very few utterances, or only short ones, in a long session). The original Whisper is an STT model and cannot recognize such IDS utterances. (Maybe I am wrong?)
Therefore, I am thinking about fine-tuning Whisper on my IDS data and then asking ChatGPT to classify the recognized transcription as parent or infant. This idea shows promising results when the input comes from older children who can speak and whose speech Whisper recognizes. I would like to generalize it to infants between 3 and 18 months old.
Below is the idea in detail. Normally, the audio-transcription pairs used to train Whisper look like:
In our case, we could prepare something like:
Here the second file could contain crying, cooing, etc. I hope that by fine-tuning Whisper on data like this, it can identify non-verbal utterances with timestamps. A downstream ChatGPT step would then be used, with some prompt engineering to supply examples and guidance.
Do you think my idea makes sense, or have I overlooked anything critical? Any input is much appreciated!
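To make the downstream step more concrete, here is a minimal sketch of how timestamped segments from a fine-tuned model might be turned into parent/infant timing. Everything in it is an assumption for illustration: the tags `[cry]`, `[coo]`, and `[yell]`, the `Segment` structure, and the `max_gap` merge threshold are all hypothetical, and the rule-based `classify_speaker` is only a stand-in for the ChatGPT classification step, not a replacement for it.

```python
from dataclasses import dataclass

# Hypothetical non-verbal tags that the fine-tuning transcripts might use.
NONVERBAL_TAGS = {"[cry]", "[coo]", "[yell]"}


@dataclass
class Segment:
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    text: str     # transcript text for this segment


def classify_speaker(text: str) -> str:
    """Stand-in for the ChatGPT step (assumption, not the real classifier).

    A segment made up only of non-verbal tags is attributed to the infant;
    anything else, including empty text, is attributed to the parent.
    """
    tokens = text.lower().split()
    if tokens and all(tok in NONVERBAL_TAGS for tok in tokens):
        return "infant"
    return "parent"


def speaker_turns(segments, max_gap=0.5):
    """Merge consecutive same-speaker segments separated by at most max_gap seconds.

    Returns a list of (speaker, start, end) tuples ordered by start time.
    """
    turns = []
    for seg in sorted(segments, key=lambda s: s.start):
        speaker = classify_speaker(seg.text)
        if turns and turns[-1][0] == speaker and seg.start - turns[-1][2] <= max_gap:
            # Extend the previous turn instead of starting a new one.
            turns[-1] = (speaker, turns[-1][1], max(turns[-1][2], seg.end))
        else:
            turns.append((speaker, seg.start, seg.end))
    return turns


# Made-up example segments, as a fine-tuned model might emit them.
segments = [
    Segment(0.0, 2.0, "Look at the ball!"),
    Segment(2.25, 3.0, "Do you want it?"),
    Segment(3.5, 4.5, "[coo]"),
    Segment(6.0, 8.0, "[cry] [cry]"),
]
turns = speaker_turns(segments)
for speaker, start, end in turns:
    print(f"{speaker}: {start:.2f}s - {end:.2f}s")
```

In practice the interesting cases are the ones this rule cannot handle, such as a parent talking over a cry within one segment, which is where the prompt-engineered classifier would have to do the real work.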