task: Train and test text2semantic under decoder only framework for ichigo v0.5 #145

tikikun · 2024-11-28T02:38:07Z

Motivation

Since ichigo v0.5 will support additional language that will make the traditional t2s obsolete. This is a good chance to introduce a t2s framework that we have full control over.

Goal

Be able to handle any arbitrary language

Current Ichigo approach (WhisperVQ) is only trained for 7 languages
We cannot find a ASR or STT module for every new language
Our approach: Text to Semantic (same semantic space as Ichigo speech embeddings)

Methodology

Teacher: Speech -> semantic (Whisper encoder)
Student: Text -> semantic (decoder, Qwen base model)

First step: Run base case first, with English first (before exploring other languages)

WhisperSpeech text-to-semantic model failed for our synthetic data pipeline, producing incompatible sound tokens that disrupted Ichigo’s comprehension.
- Proposed solution: a custom decoder-only text-to-semantic model (<3B parameters, similar to Qwen 2.5) with knowledge transfer from WhisperVQ and a more efficient architecture.
- Processed 10k English samples from MLS Eng 10k dataset (2.42M samples) using WhisperVQ for semantic token extraction, adding a <|text_to_semantic|> task token in user turns.
- Example dataset: Instruction data.
Modified Qwen 2.5 0.5B model:
- Introduced <|text_to_semantic|> task token and added 512 sound tokens + 3 special tokens (start, end, mask) to its vocabulary.
- Trained with instruction-based samples for text-to-semantic conversion.
- Embedded control tokens without modifying the tokenizer, scaled embedding layer and LM head to [152,192](https://github.com/QwenLM/Qwen/issues/419) tokens for training speed optimization.

What needed to be done:

Train t2s on a decoder model to test feasibility (we already have the data) under ichigo v0.4
Train t2s on decoder model on new quantizer after planning: Train our own Quantizer for multilingual speech to work with Whisper Encoder #146 is done

Experiments

WhisperVQ 7 lang checkpoint

Run ID	Date	Model Config	Dataset	Learning Rate	Batch Size	Steps	Loss	Hardware
exp-t2s-0.5B	2024-11-28	Full-Finetune	Instruction text to sound sementic token	1e-3	96	28810	1.6-1.7	~ 4 hours on 2xH100
exp-t2s-1.5B-1	2024-11-29	Full-Finetune	Instruction text to sound sementic token	1e-3	84	28810	2.64	~ 10 hours on 6xA6000
exp-t2s-1.5B-2	2024-11-30	Full-Finetune	Instruction text to sound sementic token	1e-4	84	28810	1.84	~ 10 hours on 6xA6000
exp-t2s-llama3.2-1B	2024-12-1	Full-Finetune	Instruction text to sound sementic token	1e-4	96	25208	1.73	~ 9 hours on 6xA6000
exp-t2s-llama3.2-1B-1	2024-12-2	Full-Finetune	Instruction text to sound sementic token	1.5e-4	192	12604	1.77	~ 6 hours on 6xA6000
exp-t2s-llama3.2-1B-2	2024-12-3	Full-Finetune	Instruction text to sound sementic token	1.5e-4	192	57930	1.44	~30 hours on 6xA6000
exp-t2s-llama3.2-1B-dedup	2024-12-5	Full-Finetune	Instruction deduplicated sound sementic token	1.5e-4	168	33000	1.73	~30 hours on 6xA6000
exp-t2s-llama3.2-1B-compress	2024-12-9	Full-Finetune	Instruction compressed sound sementic token	1.5e-4	168	44713	1.6-1.63	~33 hours on 6xA6000
exp-t2s-llama3.2-1B-compress-1	2024-12-10	Full-Finetune	Instruction compressed sound sementic token	3e-4	256	57930	1.54	~24 hours on 6xA6000
exp-t2s-llama3.2-3B-compress	2024-12-13	Full-Finetune	Instruction compressed sound sementic token	1.5e-4	192	40607	1.59	~60 hours on 6xA6000
exp-t2s-llama3.2-3B	2024-12-13	Full-Finetune	Instruction text to sound sementic token	3e-4	192	77241	1.36	~60 hours on 6xA6000

Ichigo Quantizer epoch 5 of phase 2

Run ID	Date	Model Config	Dataset	Learning Rate	Batch Size	Steps	Loss	Hardware
Ichigo-t2s-1B-vie	2024-12-24	Full-Finetune	Vivoice instruction compressed sound sementic token	1e-4	192	23115	1.42	~7.5 hours on 6xA6000
Ichigo-t2s-1B-vie+en-1	2024-12-25	Full-Finetune	Vivoice+MLS_Eng_10k instruction compressed sound sementic token	1e-4	288	Early stop at epoch 3	1.98	~22.5 hours on 6xA6000
Ichigo-t2s-1B-vie+en-2	2024-12-25	Full-Finetune	Vivoice+MLS_Eng_10k instruction compressed sound sementic token	1e-4	288	Early stop at epoch 3	On-going	~22.5 hours on 6xA6000
Ichigo-t2s-1B-vie+en-3	2024-12-25	Full-Finetune	Vivoice+Libris_r_flitered_112k instruction compressed sound sementic token	1e-4	288	On-going	On-going	~10 hours on 6xA6000

Test Results:

Sample Synthetic data generation (text to semantics) Using Different Repetition Settings

Experiment ID	Inference Result	Note
exp-t2s-llama3.2-1B-1	exp-t2s-llama3.2-1B-1-result	Model don't know who to generate <sound_end> token, have repetitive and hallucination problem.
exp-t2s-llama3.2-1B-2	exp-t2s-llama3.2-1B-2	Successfully train a Text to Semantic model, model can generate the sound token that Ichigo v0.4 can understand but The model performance heavily rely on repetition_penalty hyperparameter if i set this number to 1.00 the model have hallucination and reptition problem and ichigo can't understand
exp-t2s-llama3.2-1B-compress	exp-t2s-llama3.2-1B-compress	After 2 epochs, the training loss converged to 1.6, so we early stop the training. Despite the relatively high convergence loss, the model successfully demonstrated the ability to generate sound tokens when evaluated on the test set.

Benchmarking

Using WhisperVQ dequantize the tokens back to embedding and then using whisper model to decode this emebedding into text. Benchmark on LibrisSpeech Clean test set

Model	WER (%)
Synthetic-T2S-1B-compress-with-prompt	7.37
WhisperVQ 7 lang with prompt	5.33
Synthetic-T2S-1B-compress	11.52
WhisperVQ 7 lang	10.86
Ichigo-t2s-1B-vie+en-1-epoch2	23.43
Ichigo-t2s-1B-vie+en-1-epoch3	23.83
Ichigo-t2s-1B-vie+en-2-epoch2	46.83
Ichigo-t2s-1B-vie+en-2-epoch3	29.68
Ichigo-t2s-1B-vie+en-3-epoch2	7.25
Ichigo-t2s-1B-vie+en-3-epoch3	7.88
Ichigo Quantizer epoch 5 of phase 2	11.16

Using WhisperVQ dequantize the tokens back to embedding and then using whisper model to decode this emebedding into text. Benchmark on Bud500 test set

Model	WER (%)
Ichigo-t2s-1B-vie	6.09
Ichigo-t2s-1B-vie+en-1-epoch2	4.5
Ichigo-t2s-1B-vie+en-1-epoch3	6.31
Ichigo-t2s-1B-vie+en-2-epoch2	3.7
Ichigo-t2s-1B-vie+en-2-epoch3	3.47
Ichigo-t2s-1B-vie+en-3-epoch2	3.85
Ichigo-t2s-1B-vie+en-3-epoch2	3.52

Using AudioBench.

The text was updated successfully, but these errors were encountered:

bachvudinh · 2024-11-29T05:19:34Z

Goal

Be able to handle any arbitrary language

Current Ichigo approach (WhisperVQ) is only trained for 7 languages
We cannot find a ASR or STT module for every new language
Our approach: Text to Semantic (same semantic space as Ichigo speech embeddings)

Methodology

After extensive hyperparameter tuning, the WhisperSpeech text-to-semantic model proved inadequate for our synthetic data pipeline. The model's output resulted in incompatible sound tokens that broke Ichigo's comprehension capabilities.
To address this, we propose developing a custom text-to-semantic model based on a decoder-only architecture (similar to Qwen 2.5) with <3B parameters. This model will leverage knowledge transfer from WhisperVQ while maintaining a more efficient architecture that better aligns with our use case.

To develop our custom text-to-semantic model, I processed 10k English samples from the MLS Eng 10k (2.42M samples) dataset by tokenizing the raw speech using WhisperVQ to extract semantic tokens. I also add a special task token <|text_to_semantic|> in user turn. Here is a sample from Instruction data:

[ { "content": "<|text_to_sementic|>he telegraphed to general pemberton that he had learned sherman was between them with four divisions at clinton saying that it was important to reestablish communications that pemberton might be reenforced and directing him to come up in sherman's rear at once", "role": "user" }, { "content": "<|sound_start|><|sound_0209|><|sound_0134|><|sound_0134|><|sound_0134|><|sound_0241|><|sound_0222|><|sound_0239|><|sound_0329|><|sound_0197|><|sound_0115|><|sound_0409|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0235|><|sound_0487|><|sound_0487|><|sound_0459|><|sound_0459|><|sound_0405|><|sound_end|>", "role": "assistant" } ]

I then modified Qwen 2.5 0.5B by introducing a new task token <|text_to_semantic|> and incorporating 515 sound tokens into its vocabulary. The training data was structured as instruction-based samples designed to teach the model text-to-semantic token conversion. Following Qwen's methodology, we integrated control tokens into the embedding layer without tokenizer modification which, according to Qwen Authors, aim at optimizing training performance, we adjusted the embedding layer and LM head dimensions by a factor of 128, resulting in a final embedding size of 152,192 tokens.

bachvudinh · 2024-11-29T06:07:51Z

dan-homebrew · 2024-11-29T06:12:39Z

Need to add more details to this issue:

Goal: Be able to handle any arbitrary language
- Current Ichigo approach (WhisperVQ) is only trained for 7 languages
- We cannot find a ASR or STT module for every new language
- Our approach: Text to Semantic (same semantic space as Ichigo speech embeddings)

Please help me to align nomenclature etc. @tikikun's diagram above is very helpful.

hahuyhoang411 · 2024-12-01T12:56:09Z

I move the table to the top for better visualization cc @bachvudinh

PodsAreAllYouNeed · 2024-12-02T13:51:02Z

This task is a hybrid between Text-to-speech and speech-to-speech translation. It is quite hard because there is a one-to-many mapping between input text, and possible output token combinations.

Here are two papers that are using the same AR setting, but for slightly different tasks. I think it can be adapted.

AudioPALM: https://arxiv.org/pdf/2306.12925
Valle-E: https://arxiv.org/pdf/2301.02111

Specifically, I think we can use Valle-E's idea of using a phoneme conversion layer before sending the text into the AR model, this might bridge the gap to the semantic embeddings abit, making the AR model's job easier. We also need to somehow provide some auxiliary information about the expected acoustic ground-truth that we are using, otherwise, if we provide text-only to the AR model, there are too many possible correct answers, so across multiple samples the loss may conflict.

However, I think it will be hard to make this work. The AR model needs a better constraint.

My proposal

In the WhisperSpeech framework, the text-to-semantic model is the inverse of the whisper decoder. We need to involve the whisper decoder in the training.

Keep the same AR model structure
However, instead of trying to get the model to predict the whipserVQ codes, send continuous embeddings into the frozen whisper decoder. What we are trying to do is get the AR decoder model to trick the whisper decoder into thinking it is seeing output from the whisper encoder.
Compute the loss of the whisper decoder output to the original text.

You will meet a practical challenge, which is that while training this AR decoder model, its acting like its a NAR encoder model to the Whisper Decoder. There might be a smart way to solve this, but I can't think of one at the moment, or you can just use a NAR model.

Another (Simpler) Idea

If we really want an AR model trained using next token prediction, we must use WhisperVQ tokens in the current format and we don't want to add auxiliary information, we can try a simple intervention of grouping identical WhisperVQ tokens together. This way, the model is not penalized for getting the output length wrong.

i.e this original example:
<|sound_start|><|sound_0209|><|sound_0134|><|sound_0134|><|sound_0134|><|sound_0241|><|sound_0222|><|sound_0239|><|sound_0329|><|sound_0197|><|sound_0115|><|sound_0409|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0235|><|sound_0487|><|sound_0487|><|sound_0459|><|sound_0459|><|sound_0405|><|sound_end|>

get mapped to this:
<|sound_start|><|sound_0209|><|sound_0134|><|sound_0241|><|sound_0222|><|sound_0239|><|sound_0197|><|sound_0115|><|sound_0409|><|sound_0196|><|sound_0235|><|sound_0487|><|sound_0459|><|sound_0405|><|sound_end|>

This way the order of the token output matters, but the number of consecutively repeated tokens do not matter.
We can worry about upsampling the number of tokens as a separate problem. It might not matter to the decoder because the whole token sequence gets cross-attention anyway. The repeated tokens might not be adding that much information. During Fine-tuning we can apply a similar filtering to the WhisperVQ token stream, to see if the performance changes. Actually, if repeated tokens don't impact performance, then actually it makes inference even faster.

bachvudinh · 2024-12-04T06:54:57Z

Updated from Research sync 2024-12-4:

The current model have repetitive problem that generate endless sound semantic token and don't know how to generate <|sound_end|> token to stop.
The latest on-going runs that we scale the data to 3.7M samples, change the model to llama3.1 1B base version and doing 3 epoch. The loss metric stabilizes between 1.4-1.5 after the second epoch and shows no further improvement during the third epoch
@PodsAreAllYouNeed suggested that we should deduplicate sound token when making training data. The repeated tokens might not be adding that much information and its partition to how fast a word is pronounce. I will setup data and as soon as the exp-t2s-llama3.2-1B-2 experiment finish, I will start the training on the 6xA6000.

bachvudinh · 2024-12-05T04:24:12Z

cc @PodsAreAllYouNeed @tikikun
I do some testing with whisperVQ to see if the number of repeated semantic token proportion to how fast or slow a word are pronounce:

Audio Content	Audio Length	Speech Rate	Semantic Token Output	Tokens Length
Hello World	1s	Fast	`[207, 207, 207, 207, 207, 143, 153, 153, 13, 13, 13, 137, 228, 228, 1, 407, 1, 1, 222, 1, 1, 1, 1, 1, 11, 11, 207, 356, 348, 207, 356, 386, 207, 130]`	34
Hello World	3s	Slow	`[314, 336, 500, 207, 207, 153, 13, 208, 13, 13, 137, 228, 137, 228, 228, 228, 322, 378, 407, 407, 407, 378, 200, 200, 407, 378, 200, 378, 200, 200, 407, 200, 407, 78, 407, 378, 407, 378, 200, 407, 200, 200, 192, 407, 407, 407, 192, 192, 397, 397, 397, 479, 341, 479, 245, 245, 245, 245, 245, 35, 56, 245, 400, 376, 446, 378, 400, 192, 192, 400, 200, 200, 213, 207, 200, 207, 200, 207, 207, 207, 200, 508, 508, 508, 508, 207, 336, 348, 336, 508, 336, 508, 200]`	93
Hello	1s	Fast	`[207, 207, 207, 207, 143, 22, 207, 153, 179, 13, 13, 137, 228, 228, 322, 434, 434, 286, 286, 286, 286, 356, 207, 22, 382, 207]`	26
Hello	1s	Slow	`[207, 200, 153, 153, 153, 135, 13, 313, 313, 313, 313, 378, 378, 378, 378, 378, 378, 378, 378, 13, 392, 392, 228, 228, 378, 228, 275, 275, 275, 378, 275, 200, 227, 275, 200, 426, 407, 200, 434, 382, 382, 407, 382, 207, 207, 508]`	46

bachvudinh · 2024-12-09T01:13:45Z

What I tried to do:

Deduplicate the sound semantic token to get the shorter sound token sequence --> Efficient Training.
For example:
Input:

<|sound_start|><|sound_0207|><|sound_0148|><|sound_0000|><|sound_0249|><|sound_0249|><|sound_0177|>.  <|sound_0177|><|sound_0177|><|sound_0177|><|sound_0177|><|sound_0302|><|sound_0302|><|sound_0426|>.  <|sound_0426|><|sound_0081|><|sound_0090|><|sound_0491|><|sound_0327|><|sound_0327|><|sound_0153|>.  <|sound_0061|><|sound_0061|><|sound_0196|><|sound_0178|><|sound_0129|><|sound_0129|><|sound_0025|>.  <|sound_0070|><|sound_0114|><|sound_0159|><|sound_0159|><|sound_0159|><|sound_0010|><|sound_0316|>.  <|sound_0270|><|sound_0510|><|sound_0409|><|sound_0339|><|sound_0378|><|sound_0339|><|sound_0378|>.  <|sound_0407|><|sound_0407|><|sound_0404|><|sound_0404|><|sound_0235|><|sound_0322|><|sound_0322|>.  <|sound_0076|><|sound_0245|><|sound_0226|><|sound_0011|><|sound_0407|><|sound_0407|><|sound_0007|>.  <|sound_0007|><|sound_0007|><|sound_0187|><|sound_0316|><|sound_0300|><|sound_0316|><|sound_0316|>.  <|sound_0109|><|sound_0008|><|sound_0300|><|sound_0179|><|sound_0179|><|sound_0179|><|sound_0122|>.  <|sound_0122|><|sound_0158|><|sound_0158|><|sound_0245|><|sound_0080|><|sound_0080|><|sound_0461|>.  <|sound_0461|><|sound_0005|><|sound_0368|><|sound_0368|><|sound_0487|><|sound_0206|><|sound_0278|>.  <|sound_0278|><|sound_0002|><|sound_0002|><|sound_0102|><|sound_0329|><|sound_0444|><|sound_0114|>.  <|sound_0114|><|sound_0325|><|sound_0202|><|sound_0202|><|sound_0202|><|sound_0363|><|sound_0319|>.  <|sound_0161|><|sound_0147|><|sound_0397|><|sound_0325|><|sound_0081|><|sound_0081|><|sound_0319|>.  <|sound_0161|><|sound_0331|><|sound_0187|><|sound_0350|><|sound_0210|><|sound_0048|><|sound_0468|>.  <|sound_0468|><|sound_0125|><|sound_0271|><|sound_0271|><|sound_0157|><|sound_0344|><|sound_0007|>.  <|sound_0300|><|sound_0300|><|sound_0300|><|sound_0300|><|sound_0300|><|sound_0076|><|sound_0226|>.  <|sound_0498|><|sound_0498|><|sound_0330|><|sound_0468|><|sound_0468|><|sound_0468|><|sound_0434|>.  <|sound_0182|><|sound_end|>

output:

<|sound_start|><|sound_0207|><|sound_0148|><|sound_0000|><|sound_0249|><|sound_0177|><|sound_0302|>.  <|sound_0426|><|sound_0081|><|sound_0090|><|sound_0491|><|sound_0327|><|sound_0153|><|sound_0061|>.  <|sound_0196|><|sound_0178|><|sound_0129|><|sound_0025|><|sound_0070|><|sound_0114|><|sound_0159|>.  <|sound_0010|><|sound_0316|><|sound_0270|><|sound_0510|><|sound_0409|><|sound_0339|><|sound_0378|>.  <|sound_0339|><|sound_0378|><|sound_0407|><|sound_0404|><|sound_0235|><|sound_0322|><|sound_0076|>.  <|sound_0245|><|sound_0226|><|sound_0011|><|sound_0407|><|sound_0007|><|sound_0187|><|sound_0316|>.  <|sound_0300|><|sound_0316|><|sound_0109|><|sound_0008|><|sound_0300|><|sound_0179|><|sound_0122|>.  <|sound_0158|><|sound_0245|><|sound_0080|><|sound_0461|><|sound_0005|><|sound_0368|><|sound_0487|>.  <|sound_0206|><|sound_0278|><|sound_0002|><|sound_0102|><|sound_0329|><|sound_0444|><|sound_0114|>.  <|sound_0325|><|sound_0202|><|sound_0363|><|sound_0319|><|sound_0161|><|sound_0147|><|sound_0397|>.  <|sound_0325|><|sound_0081|><|sound_0319|><|sound_0161|><|sound_0331|><|sound_0187|><|sound_0350|>.  <|sound_0210|><|sound_0048|><|sound_0468|><|sound_0125|><|sound_0271|><|sound_0157|><|sound_0344|>.  <|sound_0007|><|sound_0300|><|sound_0076|><|sound_0226|><|sound_0498|><|sound_0330|><|sound_0468|>.  <|sound_0434|><|sound_0182|><|sound_end|>

I experimented with Llama 3.1 1B using deduplicated data for 3 epoch.

Result:

The loss did not go lower than 1.7 after 2 epoch (~ 1.73-1.8).
Model is unable to identify and learn the data pattern, possibly due to information loss caused by deduplication.

tikikun · 2024-12-09T01:56:29Z

Idea: Add duration tokens

Observations:

Loss did not converge as well after we de-dup, something must be lost
Repetition is the only difference, how many time a sound token is repeated
Continuous embedding is a by product of how Whisper Encoder process (each embedding is slightly different from other one even though they are almost the same)

Theories:

We need to be able to "add back the "repetition" information
After we make sure the information is kept the same, the loss behaviour should be relatively the same.

Implementation:

We create "<|dur_xxx|> (with x is an integer) to represent the repetition inside each repeated sound tokens group

Extra Information

Why some word might result in repetition?

At first glance it's tempting to think that the same information in a repetitive token (or embedding) might be redundant. But if we take a closer look at long and short vowels in English it might not be the case

Example:

Sheep - i
Ship - ɪ

The only way for you to discern the difference between Sheep and Ship sometimes in English speaking is only whether the i sound is long or short, or the duration part of it.

By de-duplicating the duration, everything becomes a short sound but your target training still sheep or sip (either long or short) make it impossible to really converge.

Hence, duration is the information that is left out when you de-dup.

PodsAreAllYouNeed · 2024-12-09T06:47:04Z

Related to the idea of the token-level duration token, we could potentially have a "global duration token" added as a context token either before or after the provided text input. This "global duration token" gives information to the t2s model about the length of semantic tokens it needs to generate. After training, this global duration token is also used to control the length of generation of the text, which controls the speaking speed. This is inspired by the "number of frames" mechanic found in the F5-TTS code generation https://github.com/SWivid/F5-TTS/blob/8898d05e374bcb8d3fc0b1286037e95df61f491f/src/f5_tts/infer/utils_infer.py#L449C1-L452C96

If TTS models need some global duration information in order to do the generation, then our text2semantic should also use the same kind of global information. we just need to encode it a little differently.

bachvudinh · 2024-12-10T02:55:38Z

Some result from the training runs:

The loss going lower than normal deduplication training ---> "repetition" information
The loss does not converge better than training using normal sound token ---> this maybe due to new added "duration token"

tikikun · 2024-12-11T02:44:19Z

Idea: Add duration tokens

Observations:

Loss did not converge as well after we de-dup, something must be lost

Repetition is the only difference, how many time a sound token is repeated

Continuous embedding is a by product of how Whisper Encoder process (each embedding is slightly different from other one even though they are almost the same)

Theories:

We need to be able to "add back the "repetition" information

After we make sure the information is kept the same, the loss behaviour should be relatively the same.

Implementation:

We create "<|dur_xxx|> (with x is an integer) to represent the repetition inside each repeated sound tokens group

Extra Information

Why some word might result in repetition?

At first glance it's tempting to think that the same information in a repetitive token (or embedding) might be redundant. But if we take a closer look at long and short vowels in English it might not be the case

Example:

Sheep - i

Ship - ɪ

The only way for you to discern the difference between Sheep and Ship sometimes in English speaking is only whether the i sound is long or short, or the duration part of it.

By de-duplicating the duration, everything becomes a short sound but your target training still sheep or sip (either long or short) make it impossible to really converge.

Hence, duration is the information that is left out when you de-dup.

This idea needs further validation on different generation len

tikikun · 2024-12-12T00:45:37Z

@bachvudinh please add validation on longer sequence

bachvudinh · 2024-12-12T05:50:27Z

Our data length distribution.

Testing the quality of text-to-semantic models using prompts of varying lengths:

Prompt length	Prompt	sound token length	Ichigo v0.4 Answer
30	Can you please tell a story about a cat who loves to play with a ball of yarn?	136	What a delightful request! Here's a tale about a feline friend who loves to play with a ball of yarn: Whiskers and the Yarn Ball In a cozy little house on a quiet street, there lived a sleek and curious cat named Whiskers. Whiskers was a ball of energy, always looking for ways to entertain herself. One day, while exploring the house, she stumbled upon a ball of yarn that had been left on the floor. The soft, fluffy yarn was like a treasure trove of wonder to Whiskers. She batted it around the room, chasing after it with glee. From that moment on, Whiskers was hooked. The yarn ball became her favorite toy, and she would spend hours playing with it every day. She'd chase it around the house, pouncing on it like a predator, and then gently batting it back and forth with her paw. The yarn ball was the perfect size, just big enough to be challenging but small enough to be manageable. As the days went by, Whiskers' love for the yarn ball only grew stronger. She'd carry it around in her mouth, proudly displaying her prize to anyone who would look. Her owners, the Smiths, would laugh at her antics, watching as she'd chase the yarn ball around the living room, leaping over furniture and dodging around chairs. But Whiskers' favorite game was "Yarn Ball Hide-and-Seek." She'd hide the yarn ball under a cushion, behind a bookshelf, or even in the laundry basket, and then use her keen senses to track it down. When she finally found it, she'd let out a triumphant meow, as if to say, "I've got it! I've got it!"
78	Summarize below paragraph: If we really want an AR model trained using next token prediction, we must use Whisper tokens in the current format and we don't want to add auxiliary information, we can try a simple intervention of grouping identical Whisper tokens together. This way, the model is not penalized for getting the output length wrong.	685	The text describes a method for training an AI model using next token prediction. The model is trained using a specific format of word tokens and does not require any auxiliary information. The model can be trained using a simple technique of grouping identical word tokens together. This approach does not penalize the model for the output length on the token prediction.
82	Imagine you are a renowned architect designing a sustainable skyscraper in Singapore. The building needs to incorporate green technology, vertical gardens, and innovative energy solutions. Describe your vision for this groundbreaking structure, including specific details about materials, renewable energy systems, and how it will benefit the local community. Consider aspects like natural lighting, rainwater harvesting, and biodiversity integration.	2013	The Singapore SkyScape is a futuristic skyscraper that embodies the principles of green technology, vertical gardens, and innovative energy solutions. The building's design is a masterpiece of modern architecture, with a sleek and aerodynamic shape that maximizes natural light and ventilation. The exterior of the building is clad in a unique, translucent material that allows for maximum light penetration while also providing excellent insulation and energy efficiency. The SkyScape's vertical gardens are a key feature of the building's design. The gardens are located on the exterior of the building, and are designed to provide a natural habitat for a variety of plant species. The gardens are irrigated using a state-of-the-art hydroponic system, which uses recycled water and minimizes water waste. The gardens also provide a natural air filtration system, helping to purify the air and improve the overall air quality in the surrounding area. The SkyScape's innovative energy solutions are another key feature of the building's design. The building is powered by a combination of solar panels, wind turbines, and geothermal energy. The solar panels are located on the roof of the building, and provide a significant portion of the building's energy needs. The wind turbines are located on the exterior of the building, and provide additional energy.
145	Summarize the below paragraph: The importance of effective communication in everyday life cannot be overstated. Clear and concise communication helps to build strong relationships, avoid misunderstandings, and convey ideas with precision. When individuals communicate effectively, they are able to express their thoughts, needs, and desires in a way that is easily understood by others. Good communication involves not only speaking, but also listening actively and responding thoughtfully. It requires being aware of nonverbal cues, such as body language and tone of voice, and adapting one's message to the audience and context. By doing so, individuals can navigate complex social situations, resolve conflicts, and achieve their goals more efficiently.	2015	Effective communication is crucial in everyday life as it helps build strong relationships, avoid misunderstandings, and convey ideas with precision. Clear and concise communication is key to achieving this, and it is essential to communicate effectively in every part of the country. While computers can communicate with each other, they are limited to simple social situations and cannot resolve conflicts with humans. Therefore, it is important to communicate effectively with humans to avoid conflicts and achieve our goals.
144	Repeat what i said below: The importance of effective communication in everyday life cannot be overstated. Clear and concise communication helps to build strong relationships, avoid misunderstandings, and convey ideas with precision. When individuals communicate effectively, they are able to express their thoughts, needs, and desires in a way that is easily understood by others. Good communication involves not only speaking, but also listening actively and responding thoughtfully. It requires being aware of nonverbal cues, such as body language and tone of voice, and adapting one's message to the audience and context. By doing so, individuals can navigate complex social situations, resolve conflicts, and achieve their goals more efficiently.	814	The importance of effective communication in everyday life cannot be overstated. Clear and concise communication helps to build strong relationships, avoid misunderstandings, and convey ideas with precision. When individuals communicate effectively and respond thoughtfully, they complete the communication process. The employee's information has been completed. The communication has been completed. The communication process has been completed.

bachvudinh · 2024-12-12T21:32:30Z

Result from exp-t2s-llama3.2-base-1B-compress on LibriSpeech clean test set:
- [Detail WER Result of Llama3.2-instruct-1B-compress]
- Detail WER Result of Llama3.2-base-1B-compress
  
  Model WER (%)
  
  llama3.2-instruct-38-raw-v2 300.52
  
  llama3.2-instruct-18-compress 131.75
  
  llama3.2-base-18-compress 101.03
  
  WhisperVQ 7 lang 10.86
  
  WhisperVQ English+Polish 5.81
Audio Bench result of Llama3.2-base-1B-compress:
- Open-hermes : 3.52. Details result can be found here: openhermes_audio_test_gpt4_judge_score.json.
- Alpaca : 3.52. Details result can be found here: alpaca_audio_test_gpt4_judge_score.json.
Updated result for llama3.1-3B-raw-semantic token(epoch 2):
- Librispeech : 300.52%. stats_1.csv
- Audio bench:
  Openhermes: 3.56. openhermes_audio_test_gpt4_judge_score_T2S.json
  Alpaca: 3.55. alpaca_audio_test_gpt4_judge_score_T2S.json
The Result for llama3.1-3B-raw-semantic token(3 epoch):
- Librispeech:
- Audio bench:
  Openhermes: 3.41.
  Alpaca: 3.49.

PodsAreAllYouNeed · 2024-12-15T03:05:02Z

A good point of reference for the results for the text-to-semantic model would be the F5-TTS model.

We can also use their methodology of Librispeech-PC test-clean to test. Also, we can send the text into F5-TTS to generate, then take the generated audio and put it back into the full whisperVQ+Decoder model for ASR, as an additional point of reference. (This only needs to be done once for each time we retrain the VQ)

Looking at the table from F5-TTS, it seems that a good target for the WER of the t2s -> decoder round-trip would be 2-3% on the LibriSpeech test-clean dataset.

bachvudinh · 2024-12-18T09:26:00Z

Text-to-Semantics Training Issue Resolution

Issue Identified: Initially suspected a failure in Text-to-Semantics model training.
Root Cause Discovery: During a 2-hour pair-coding session, identified that training data contained a terminal period (".") that was missing from evaluation data.
Solution: Add the terminal period during evaluation, which successfully restored T2S model functionality.
Key Learnings:
- Separator tokens are critical components in decoder model performance, reinforcing concepts from today's brown bag presentation.
- Small details in data formatting can have significant impacts on research outcomes.

Word Error Rate (WER) Comparison Between Real Semantic tokens and Synthetic tokens.

Detail result:
- stats.csv
- stats_with_prompt.csv

Model	WER (%)
Synthetic-T2S-1B-compress-with-prompt	7.37
WhisperVQ 7 lang with prompt	5.33
Synthetic-T2S-1B-compress	11.52
WhisperVQ 7 lang	10.86

Note: "With prompt" refers to adding a prompt to the Whisper decoder.

PodsAreAllYouNeed · 2024-12-19T06:52:02Z

i used stats.csv to plot a histogram of the delta between t2s and whispervq

The WER difference between t2s and whisperVQ is very small, in a majority of the samples, the delta it within 0.5WER. This is a very good sign that our t2s model is behaving in a similar manner to the whisper encoder

tikikun assigned tikikun and bachvudinh Nov 28, 2024

dan-homebrew changed the title ~~task: Train and test t2s under decoder only framework for ichigo v0.5~~ task: Train and test text2semantic under decoder only framework for ichigo v0.5 Nov 29, 2024

dan-homebrew transferred this issue from janhq/ichigo Nov 29, 2024

bachvudinh added this to the Ichigo v0.5 milestone Nov 29, 2024

janhq deleted a comment from bachvudinh Dec 1, 2024

github-project-automation bot moved this to Investigating in Jan & Cortex Dec 11, 2024

github-project-automation bot added this to Jan & Cortex Dec 11, 2024

tikikun transferred this issue from janhq/WhisperSpeech Dec 11, 2024

dan-homebrew modified the milestone: Ichigo v0.5 Dec 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

task: Train and test text2semantic under decoder only framework for ichigo v0.5 #145

task: Train and test text2semantic under decoder only framework for ichigo v0.5 #145

tikikun commented Nov 28, 2024 •

edited by bachvudinh

Loading

bachvudinh commented Nov 29, 2024 •

edited by hahuyhoang411

Loading

bachvudinh commented Nov 29, 2024

dan-homebrew commented Nov 29, 2024 •

edited

Loading

hahuyhoang411 commented Dec 1, 2024

PodsAreAllYouNeed commented Dec 2, 2024

bachvudinh commented Dec 4, 2024 •

edited

Loading

bachvudinh commented Dec 5, 2024 •

edited

Loading

bachvudinh commented Dec 9, 2024 •

edited

Loading

tikikun commented Dec 9, 2024 •

edited

Loading

PodsAreAllYouNeed commented Dec 9, 2024

bachvudinh commented Dec 10, 2024

tikikun commented Dec 11, 2024

Idea: Add duration tokens

Extra Information

Example:

tikikun commented Dec 12, 2024

bachvudinh commented Dec 12, 2024 •

edited

Loading

bachvudinh commented Dec 12, 2024 •

edited

Loading

PodsAreAllYouNeed commented Dec 15, 2024

bachvudinh commented Dec 18, 2024 •

edited

Loading

PodsAreAllYouNeed commented Dec 19, 2024

task: Train and test text2semantic under decoder only framework for ichigo v0.5 #145

task: Train and test text2semantic under decoder only framework for ichigo v0.5 #145

Comments

tikikun commented Nov 28, 2024 • edited by bachvudinh Loading

Motivation

Goal

Methodology

What needed to be done:

Experiments

Test Results:

Benchmarking

bachvudinh commented Nov 29, 2024 • edited by hahuyhoang411 Loading

Goal

Methodology

bachvudinh commented Nov 29, 2024

dan-homebrew commented Nov 29, 2024 • edited Loading

hahuyhoang411 commented Dec 1, 2024

PodsAreAllYouNeed commented Dec 2, 2024

My proposal

Another (Simpler) Idea

bachvudinh commented Dec 4, 2024 • edited Loading

bachvudinh commented Dec 5, 2024 • edited Loading

bachvudinh commented Dec 9, 2024 • edited Loading

tikikun commented Dec 9, 2024 • edited Loading

Idea: Add duration tokens

Extra Information

Example:

PodsAreAllYouNeed commented Dec 9, 2024

bachvudinh commented Dec 10, 2024

tikikun commented Dec 11, 2024

Idea: Add duration tokens

Extra Information

Example:

tikikun commented Dec 12, 2024

bachvudinh commented Dec 12, 2024 • edited Loading

bachvudinh commented Dec 12, 2024 • edited Loading

PodsAreAllYouNeed commented Dec 15, 2024

bachvudinh commented Dec 18, 2024 • edited Loading

Text-to-Semantics Training Issue Resolution

Word Error Rate (WER) Comparison Between Real Semantic tokens and Synthetic tokens.

PodsAreAllYouNeed commented Dec 19, 2024

tikikun commented Nov 28, 2024 •

edited by bachvudinh

Loading

bachvudinh commented Nov 29, 2024 •

edited by hahuyhoang411

Loading

dan-homebrew commented Nov 29, 2024 •

edited

Loading

bachvudinh commented Dec 4, 2024 •

edited

Loading

bachvudinh commented Dec 5, 2024 •

edited

Loading

bachvudinh commented Dec 9, 2024 •

edited

Loading

tikikun commented Dec 9, 2024 •

edited

Loading

bachvudinh commented Dec 12, 2024 •

edited

Loading

bachvudinh commented Dec 12, 2024 •

edited

Loading

bachvudinh commented Dec 18, 2024 •

edited

Loading