feat: More multilingual audio training data #143

Open
hahuyhoang411 opened this issue Dec 4, 2024 · 1 comment
Labels
help wanted Extra attention is needed
Milestone

Comments

@hahuyhoang411
Contributor

Problem

Currently the audio training data is mostly Vietnamese, with some Thai and some Singlish.

Suggestion

Process the YouTube audio in this dataset: https://huggingface.co/datasets/espnet/yodas2/

Thanks to jpc for sharing this dataset.

@hahuyhoang411 hahuyhoang411 added the help wanted Extra attention is needed label Dec 4, 2024
@hahuyhoang411 hahuyhoang411 added this to the Ichigo v0.6 milestone Dec 4, 2024
@hahuyhoang411
Contributor Author

hahuyhoang411 commented Dec 6, 2024

Description

This is the preprocessing code for the GigaSpeech2 dataset.

Gist link

Data demo

Problem

  • The original GigaSpeech2 dataset contains over 7M raw Vietnamese audio files.
  • Audio files are stored in .tar.gz archives.
  • Transcriptions are saved separately in .tsv files.

Preprocessing Steps

  • Load the dataset and map each audio file to its corresponding transcription using IDs.
  • Convert the dataset into a Hugging Face-compatible format for use with the current training code.
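
The two steps above could be sketched roughly as follows. This is not the code from the gist, only a minimal illustration under stated assumptions: each `.tsv` row is `<segment ID><TAB><transcript>`, each audio member inside the `.tar.gz` is a `.wav` file named after its segment ID, and the function names `load_transcripts` and `iter_records` are hypothetical.

```python
import csv
import io
import tarfile


def load_transcripts(tsv_file):
    """Read a two-column TSV (segment ID, transcript) into a dict.

    Assumption: no header row, tab-separated, no quoting.
    """
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def iter_records(tar_file, transcripts, audio_ext=".wav"):
    """Yield one record per audio member that has a matching transcript.

    `tar_file` is a binary file object for a .tar.gz archive,
    e.g. open(path, "rb").
    """
    with tarfile.open(fileobj=tar_file, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile() or not member.name.endswith(audio_ext):
                continue
            # Assumption: the file name without its extension is the segment ID.
            seg_id = member.name.rsplit("/", 1)[-1][: -len(audio_ext)]
            text = transcripts.get(seg_id)
            if text is None:
                continue  # skip audio that has no transcript
            data = tar.extractfile(member).read()
            yield {
                "id": seg_id,
                "audio": {"bytes": data, "path": member.name},
                "text": text,
            }
```

The `{"bytes": ..., "path": ...}` layout is chosen because it matches what the Hugging Face `datasets` `Audio` feature accepts, so the records could then be passed to something like `Dataset.from_generator` and cast with `cast_column("audio", Audio())`. Whether that matches the format the current training code expects is an assumption here. Streaming records from a generator rather than building a full list is likely the safer choice at the scale of millions of files.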

@tikikun tikikun transferred this issue from janhq/WhisperSpeech Dec 11, 2024
@github-project-automation github-project-automation bot moved this to Investigating in Jan & Cortex Dec 11, 2024