feat: More multilingual audio training data #143

hahuyhoang411 · 2024-12-04T21:50:51Z

Problem

Currently our audio training data is mostly Vietnamese, with some Thai and some Singlish.

Thanks jpc for sharing this dataset

hahuyhoang411 · 2024-12-06T22:34:52Z

This is the preprocessing code for the GigaSpeech2 dataset.

Load the dataset and map each audio file to its corresponding transcription using IDs.
Convert the dataset into a Hugging Face-compatible format for use with the current training code.
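For anyone picking this up, the two steps above could look roughly like the sketch below. It is not the actual preprocessing script. It assumes each split ships a tab-separated file of `segment_id<TAB>transcript` rows and audio files named `<segment_id>.wav`. The function names `pair_audio_with_text` and `to_hf_dataset`, the `id`/`audio`/`text` column names, and the 16 kHz sampling rate are all placeholders; adjust them to whatever the training code expects.

```python
import csv
from pathlib import Path


def pair_audio_with_text(audio_dir, tsv_path, ext=".wav"):
    """Return one {"id", "audio", "text"} record per ID present in both sources.

    Assumed layout (not confirmed in this thread): audio files are named
    <segment_id><ext>, and the TSV has rows of segment_id<TAB>transcript.
    """
    # Index audio files by stem (file name without extension).
    audio_by_id = {p.stem: str(p) for p in Path(audio_dir).rglob(f"*{ext}")}

    records = []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE keeps quote characters inside transcripts untouched.
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) < 2:
                continue  # skip blank or malformed rows
            seg_id, text = row[0], row[1]
            path = audio_by_id.get(seg_id)
            if path is not None:  # drop transcripts with no matching audio
                records.append({"id": seg_id, "audio": path, "text": text.strip()})
    return records


def to_hf_dataset(records, sampling_rate=16_000):
    """Wrap the records in a Hugging Face Dataset with a decoded audio column.

    The sampling rate is an assumption; use the rate the model is trained on.
    """
    # Imported lazily so the pairing step works without `datasets` installed.
    from datasets import Audio, Dataset

    ds = Dataset.from_list(records)
    # Casting to Audio makes `datasets` decode and resample files on access.
    return ds.cast_column("audio", Audio(sampling_rate=sampling_rate))
```

Usage would be something like `ds = to_hf_dataset(pair_audio_with_text("audio/", "train.tsv"))`, where both paths are hypothetical. Keeping the ID matching separate from the `datasets` conversion makes it easy to check how many transcripts have no matching audio before building the dataset.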

hahuyhoang411 added the help wanted label Dec 4, 2024

hahuyhoang411 added this to the Ichigo v0.6 milestone Dec 4, 2024

tikikun transferred this issue from janhq/WhisperSpeech Dec 11, 2024

github-project-automation bot added this to Jan & Cortex Dec 11, 2024

github-project-automation bot moved this to Investigating in Jan & Cortex Dec 11, 2024