dataset: Ichigo LLM's instruct dataset #121

hahuyhoang411 · 2024-11-19T16:56:15Z

Goal

Create a speech instruction fine-tuning dataset to make Ichigo better at conversation.

Tasklist

Gather Vietnamese + English text instruction datasets.
Clean prompts in the instruction dataset (e.g., using distilabel)
Optimize the pipeline
Draft research report
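
A minimal sketch of what the prompt-cleaning step could look like, assuming the goal is to drop prompts that cannot be read aloud sensibly. The heuristics (code fences, URLs, markup, a word limit) and the names `is_speakable` and `clean_prompts` are hypothetical illustrations, not the project's actual rules and not the distilabel API.

```python
import re

# Hypothetical heuristics: patterns that are awkward to synthesize as speech.
_UNSPEAKABLE = re.compile(r"```|https?://|<[^>]+>|[{}\[\]|\\]")


def is_speakable(prompt: str, max_words: int = 64) -> bool:
    """Return True if the prompt looks suitable for speech synthesis."""
    text = prompt.strip()
    if not text:
        return False
    if len(text.split()) > max_words:
        return False
    return _UNSPEAKABLE.search(text) is None


def clean_prompts(records):
    """Normalize whitespace and keep only records with speakable prompts."""
    kept = []
    for rec in records:
        prompt = " ".join(rec["prompt"].split())
        if is_speakable(prompt):
            kept.append({**rec, "prompt": prompt})
    return kept


if __name__ == "__main__":
    sample = [
        {"prompt": "  Xin chào,   bạn khỏe không? "},
        {"prompt": "Summarize https://example.com"},
        {"prompt": ""},
    ]
    print(clean_prompts(sample))
```

The thresholds are placeholders; an LLM-based cleaning pass could replace or follow a cheap filter like this one.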

bachvudinh · 2024-11-20T02:55:27Z

Based on my experience, I have concerns about the reliability of Text2Semantic (T2S). When I modified the T2S model parameters to stabilize the semantic tokens, the pipeline's processing time increased significantly compared to the standard Text2Speech + Speech2Semantic pipeline run without saving the audio. Therefore, I recommend we proceed with the Text2Speech + Speech2Semantic pipeline approach. cc @tuanlda78202
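
As a rough illustration of the recommended approach, the sketch below chains a text-to-speech step and a speech-to-semantic step while keeping the audio in memory. The function `text_to_semantic` and the `tts` / `s2s` callables are hypothetical stand-ins; the real models and their interfaces are not specified in this thread.

```python
from typing import Callable, Iterable, Iterator, List, Sequence

Waveform = Sequence[float]   # raw audio samples held in memory
SemanticTokens = List[int]   # discrete semantic token ids


def text_to_semantic(
    texts: Iterable[str],
    tts: Callable[[str], Waveform],
    s2s: Callable[[Waveform], SemanticTokens],
) -> Iterator[dict]:
    """Run Text2Speech then Speech2Semantic without writing audio to disk."""
    for text in texts:
        audio = tts(text)      # synthesized waveform stays in memory
        tokens = s2s(audio)    # tokenized immediately, audio is then discarded
        yield {"text": text, "tokens": tokens}


# Toy stand-ins so the sketch runs end to end; they are NOT real models.
def fake_tts(text: str) -> Waveform:
    return [float(ord(c)) for c in text]


def fake_s2s(audio: Waveform) -> SemanticTokens:
    return [int(x) % 512 for x in audio]


if __name__ == "__main__":
    for row in text_to_semantic(["xin chào"], fake_tts, fake_s2s):
        print(row)
```

Passing the two stages in as callables keeps the loop independent of which synthesizer or tokenizer is eventually chosen.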

tuanlda78202 · 2024-11-20T16:25:31Z

We can use viXTTS for speech synthesis; it's really good!

bachvudinh · 2024-12-12T20:24:37Z

Gathering all Vietnamese instruction data sources here:

Data Source	Number of Samples	Note
Viettel x Nvidia dataset	4.5M	Instruction data: 55.9% CoT, 25.7% QnA, and the rest other types.
Sailor2 dataset stage 1	TBD	TBD
Sailor2 dataset stage 2	TBD	TBD
Sailor2 dataset preference	TBD	TBD
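
For a rough sense of scale, the percentages in the Viettel x Nvidia row can be turned into approximate sample counts. This is a back-of-the-envelope sketch: it treats "4.5M" as exactly 4,500,000, and the variable names are illustrative only.

```python
TOTAL = 4_500_000  # "4.5M" from the table, treated as exact for illustration

# Shares expressed in tenths of a percent to keep the arithmetic in integers.
COT_PERMILLE = 559   # 55.9% CoT
QNA_PERMILLE = 257   # 25.7% QnA
OTHER_PERMILLE = 1000 - COT_PERMILLE - QNA_PERMILLE  # remainder: 18.4%

cot = TOTAL * COT_PERMILLE // 1000
qna = TOTAL * QNA_PERMILLE // 1000
other = TOTAL - cot - qna

print(f"CoT:   ~{cot:,}")     # ~2,515,500
print(f"QnA:   ~{qna:,}")     # ~1,156,500
print(f"Other: ~{other:,}")   # ~828,000
```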

dan-menlo · 2025-01-13T06:05:19Z

Add critique on Viettel x Nvidia dataset @bachvudinh

hahuyhoang411 mentioned this issue Nov 19, 2024

milestone: Ichigo v0.5 Multi-lingual #116

Open

7 tasks

hahuyhoang411 changed the title ~~Multi-lingual Instruct Speech Dataset Creation (Issue: )~~ task: Multi-lingual Instruct Speech Dataset Creation Nov 19, 2024

hahuyhoang411 assigned tuanlda78202 Nov 19, 2024

hahuyhoang411 added the P1: important Important feature / fix label Nov 19, 2024

bachvudinh self-assigned this Nov 20, 2024

hiento09 added this to Menlo Nov 22, 2024

github-project-automation bot moved this to Investigating in Menlo Nov 22, 2024

tikikun moved this from Investigating to In Progress in Menlo Nov 25, 2024

hahuyhoang411 added this to the Ichigo v0.5 - Multilingual milestone Nov 25, 2024

dan-menlo changed the title ~~task: Multi-lingual Instruct Speech Dataset Creation~~ task: Instruct Dataset Creation for Multilingual Speech Nov 27, 2024

hahuyhoang411 changed the title ~~task: Instruct Dataset Creation for Multilingual Speech~~ task: Instruct Dataset Creation for Multilingual Speech (Phase 2) Nov 27, 2024

hahuyhoang411 moved this from In Progress to Scheduled in Menlo Dec 1, 2024

hahuyhoang411 assigned hahuyhoang411 and unassigned tuanlda78202 Dec 12, 2024

dan-menlo changed the title ~~task: Instruct Dataset Creation for Multilingual Speech (Phase 2)~~ dataset: Ichigo LLM's instruct dataset Jan 13, 2025

