
task: Train and test text2semantic under decoder only framework for ichigo v0.5 #145

Open
2 tasks done
tikikun opened this issue Nov 28, 2024 · 18 comments

@tikikun
Collaborator

tikikun commented Nov 28, 2024

Motivation

Since Ichigo v0.5 will support additional languages, which will make the traditional t2s approach obsolete, this is a good chance to introduce a t2s framework that we have full control over.

Goal

Be able to handle any arbitrary language

  • Current Ichigo approach (WhisperVQ) is only trained for 7 languages
  • We cannot find an ASR or STT module for every new language
  • Our approach: Text to Semantic (same semantic space as Ichigo speech embeddings)

Methodology

  1. Teacher: Speech -> semantic (Whisper encoder)
  2. Student: Text -> semantic (decoder, Qwen base model)

First step: run the base case with English before exploring other languages.

  • WhisperSpeech text-to-semantic model failed for our synthetic data pipeline, producing incompatible sound tokens that disrupted Ichigo’s comprehension.
    • Proposed solution: a custom decoder-only text-to-semantic model (<3B parameters, similar to Qwen 2.5) with knowledge transfer from WhisperVQ and a more efficient architecture.
    • Processed 10k English samples from MLS Eng 10k dataset (2.42M samples) using WhisperVQ for semantic token extraction, adding a <|text_to_semantic|> task token in user turns.
    • Example dataset: Instruction data.
  • Modified Qwen 2.5 0.5B model:
    • Introduced <|text_to_semantic|> task token and added 512 sound tokens + 3 special tokens (start, end, mask) to its vocabulary.
    • Trained with instruction-based samples for text-to-semantic conversion.
    • Embedded the control tokens without modifying the tokenizer; scaled the embedding layer and LM head to [152,192](https://github.com/QwenLM/Qwen/issues/419) tokens to optimize training speed.

What needed to be done:

Experiments

| Run ID | Date | Model Config | Dataset | Learning Rate | Batch Size | Steps | Loss | Hardware |
|---|---|---|---|---|---|---|---|---|
| exp-t2s-0.5B | 2024-11-28 | Full-Finetune | Instruction text to sound semantic token | 1e-3 | 96 | 28810 | 1.6-1.7 | ~4 hours on 2xH100 |
| exp-t2s-1.5B-1 | 2024-11-29 | Full-Finetune | Instruction text to sound semantic token | 1e-3 | 84 | 28810 | 2.64 | ~10 hours on 6xA6000 |
| exp-t2s-1.5B-2 | 2024-11-30 | Full-Finetune | Instruction text to sound semantic token | 1e-4 | 84 | 28810 | 1.84 | ~10 hours on 6xA6000 |
| exp-t2s-llama3.2-1B | 2024-12-1 | Full-Finetune | Instruction text to sound semantic token | 1e-4 | 96 | 25208 | 1.73 | ~9 hours on 6xA6000 |
| exp-t2s-llama3.2-1B-1 | 2024-12-2 | Full-Finetune | Instruction text to sound semantic token | 1.5e-4 | 192 | 12604 | 1.77 | ~6 hours on 6xA6000 |
| exp-t2s-llama3.2-1B-2 | 2024-12-3 | Full-Finetune | Instruction text to sound semantic token | 1.5e-4 | 192 | 57930 | 1.44 | ~30 hours on 6xA6000 |
| exp-t2s-llama3.2-1B-dedup | 2024-12-5 | Full-Finetune | Instruction deduplicated sound semantic token | 1.5e-4 | 168 | 33000 | 1.73 | ~30 hours on 6xA6000 |
| exp-t2s-llama3.2-1B-compress | 2024-12-9 | Full-Finetune | Instruction compressed sound semantic token | 1.5e-4 | 168 | 44713 | 1.6-1.63 | ~33 hours on 6xA6000 |
| exp-t2s-llama3.2-1B-compress-1 | 2024-12-10 | Full-Finetune | Instruction compressed sound semantic token | 3e-4 | 256 | 57930 | 1.54 | ~24 hours on 6xA6000 |
| exp-t2s-llama3.2-3B-compress | 2024-12-13 | Full-Finetune | Instruction compressed sound semantic token | 1.5e-4 | 192 | 40607 | 1.59 | ~60 hours on 6xA6000 |
| exp-t2s-llama3.2-3B | 2024-12-13 | Full-Finetune | Instruction text to sound semantic token | 3e-4 | 192 | 77241 | 1.36 | ~60 hours on 6xA6000 |

| Run ID | Date | Model Config | Dataset | Learning Rate | Batch Size | Steps | Loss | Hardware |
|---|---|---|---|---|---|---|---|---|
| Ichigo-t2s-1B-vie | 2024-12-24 | Full-Finetune | Vivoice instruction compressed sound semantic token | 1e-4 | 192 | 23115 | 1.42 | ~7.5 hours on 6xA6000 |
| Ichigo-t2s-1B-vie+en-1 | 2024-12-25 | Full-Finetune | Vivoice+MLS_Eng_10k instruction compressed sound semantic token | 1e-4 | 288 | Early stop at epoch 3 | 1.98 | ~22.5 hours on 6xA6000 |
| Ichigo-t2s-1B-vie+en-2 | 2024-12-25 | Full-Finetune | Vivoice+MLS_Eng_10k instruction compressed sound semantic token | 1e-4 | 288 | Early stop at epoch 3 | On-going | ~22.5 hours on 6xA6000 |
| Ichigo-t2s-1B-vie+en-3 | 2024-12-25 | Full-Finetune | Vivoice+Libris_r_flitered_112k instruction compressed sound semantic token | 1e-4 | 288 | On-going | On-going | ~10 hours on 6xA6000 |

Test Results:

  • Sample synthetic data generation (text to semantics) using different repetition settings:

    | Experiment ID | Inference Result | Note |
    |---|---|---|
    | exp-t2s-llama3.2-1B-1 | exp-t2s-llama3.2-1B-1-result | The model doesn't know how to generate the <\|sound_end\|> token and has repetition and hallucination problems. |
    | exp-t2s-llama3.2-1B-2 | exp-t2s-llama3.2-1B-2 | Successfully trained a Text-to-Semantic model that generates sound tokens Ichigo v0.4 can understand, but performance relies heavily on the repetition_penalty hyperparameter: at 1.00 the model hallucinates and repeats, and Ichigo can't understand the output. |
    | exp-t2s-llama3.2-1B-compress | exp-t2s-llama3.2-1B-compress | After 2 epochs the training loss converged to 1.6, so we stopped training early. Despite the relatively high convergence loss, the model successfully generated sound tokens when evaluated on the test set. |

Benchmarking

  • Using WhisperVQ to dequantize the tokens back to embeddings, then using the Whisper model to decode these embeddings into text; benchmarked on the LibriSpeech Clean test set (a rough sketch of this round trip follows this list):

    | Model | WER (%) |
    |---|---|
    | Synthetic-T2S-1B-compress-with-prompt | 7.37 |
    | WhisperVQ 7 lang with prompt | 5.33 |
    | Synthetic-T2S-1B-compress | 11.52 |
    | WhisperVQ 7 lang | 10.86 |
    | Ichigo-t2s-1B-vie+en-1-epoch2 | 23.43 |
    | Ichigo-t2s-1B-vie+en-1-epoch3 | 23.83 |
    | Ichigo-t2s-1B-vie+en-2-epoch2 | 46.83 |
    | Ichigo-t2s-1B-vie+en-2-epoch3 | 29.68 |
    | Ichigo-t2s-1B-vie+en-3-epoch2 | 7.25 |
    | Ichigo-t2s-1B-vie+en-3-epoch3 | 7.88 |
    | Ichigo Quantizer epoch 5 of phase 2 | 11.16 |
  • Using WhisperVQ to dequantize the tokens back to embeddings, then using the Whisper model to decode these embeddings into text; benchmarked on the Bud500 test set:

    | Model | WER (%) |
    |---|---|
    | Ichigo-t2s-1B-vie | 6.09 |
    | Ichigo-t2s-1B-vie+en-1-epoch2 | 4.5 |
    | Ichigo-t2s-1B-vie+en-1-epoch3 | 6.31 |
    | Ichigo-t2s-1B-vie+en-2-epoch2 | 3.7 |
    | Ichigo-t2s-1B-vie+en-2-epoch3 | 3.47 |
    | Ichigo-t2s-1B-vie+en-3-epoch2 | 3.85 |
    | Ichigo-t2s-1B-vie+en-3-epoch2 | 3.52 |
  • Using AudioBench.
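For clarity, here is a rough sketch of the dequantize -> decode -> WER round trip used in the tables above. The `dequantize_tokens` and `decode_embeddings` callables are placeholders for the actual WhisperVQ and Whisper steps (their real APIs are not spelled out in this thread); only the WER computation with jiwer is concrete.

```python
# Rough sketch of the benchmark loop: sound tokens -> WhisperVQ dequantize ->
# Whisper decode -> text, scored with WER against the reference transcript.
# dequantize_tokens / decode_embeddings are placeholders, not real APIs.
import jiwer

def round_trip_wer(samples, dequantize_tokens, decode_embeddings):
    """samples: iterable of (sound_token_ids, reference_text) pairs."""
    refs, hyps = [], []
    for token_ids, reference in samples:
        embeddings = dequantize_tokens(token_ids)   # WhisperVQ codebook lookup
        hypothesis = decode_embeddings(embeddings)  # Whisper decoder -> text
        refs.append(reference.lower())
        hyps.append(hypothesis.lower())
    return 100.0 * jiwer.wer(refs, hyps)            # WER in percent
```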

@bachvudinh
Contributor

bachvudinh commented Nov 29, 2024

Goal

Be able to handle any arbitrary language

  • Current Ichigo approach (WhisperVQ) is only trained for 7 languages
  • We cannot find an ASR or STT module for every new language
  • Our approach: Text to Semantic (same semantic space as Ichigo speech embeddings)

Methodology

  • After extensive hyperparameter tuning, the WhisperSpeech text-to-semantic model proved inadequate for our synthetic data pipeline. The model's output resulted in incompatible sound tokens that broke Ichigo's comprehension capabilities.
    To address this, we propose developing a custom text-to-semantic model based on a decoder-only architecture (similar to Qwen 2.5) with <3B parameters. This model will leverage knowledge transfer from WhisperVQ while maintaining a more efficient architecture that better aligns with our use case.
  • To develop our custom text-to-semantic model, I processed 10k English samples from the MLS Eng 10k (2.42M samples) dataset by tokenizing the raw speech with WhisperVQ to extract semantic tokens. I also added a special task token <|text_to_semantic|> in the user turn. Here is a sample from the Instruction data:
    [ { "content": "<|text_to_sementic|>he telegraphed to general pemberton that he had learned sherman was between them with four divisions at clinton saying that it was important to reestablish communications that pemberton might be reenforced and directing him to come up in sherman's rear at once", "role": "user" }, { "content": "<|sound_start|><|sound_0209|><|sound_0134|><|sound_0134|><|sound_0134|><|sound_0241|><|sound_0222|><|sound_0239|><|sound_0329|><|sound_0197|><|sound_0115|><|sound_0409|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0235|><|sound_0487|><|sound_0487|><|sound_0459|><|sound_0459|><|sound_0405|><|sound_end|>", "role": "assistant" } ]
    
  • I then modified Qwen 2.5 0.5B by introducing a new task token <|text_to_semantic|> and incorporating 515 sound tokens into its vocabulary. The training data was structured as instruction-based samples designed to teach the model text-to-semantic token conversion. Following Qwen's methodology, we integrated the control tokens into the embedding layer without modifying the tokenizer; to optimize training performance, as the Qwen authors recommend, we padded the embedding layer and LM head to a multiple of 128, resulting in a final embedding size of 152,192 tokens.
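For reference, a minimal sketch of this vocabulary extension with Hugging Face transformers. The model id, the `<|sound_mask|>` spelling, and the use of `add_special_tokens` are assumptions (the thread notes the tokenizer itself was left unmodified in the actual setup); the padding to a multiple of 128 matches the 152,192 figure above.

```python
# Minimal sketch: extend a base LM with the sound/control tokens and pad the
# embedding matrix to a multiple of 128. Model id and the mask-token spelling
# are assumptions, so treat this as illustrative rather than the exact setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # assumed HF id of the base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

sound_tokens = [f"<|sound_{i:04d}|>" for i in range(512)]
control_tokens = ["<|sound_start|>", "<|sound_end|>", "<|sound_mask|>", "<|text_to_semantic|>"]

# Register the new tokens so they are never split by the tokenizer.
tokenizer.add_special_tokens({"additional_special_tokens": sound_tokens + control_tokens})

# Padding to a multiple of 128 reproduces the 152,192 embedding size mentioned above.
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=128)
```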

@dan-homebrew dan-homebrew changed the title task: Train and test t2s under decoder only framework for ichigo v0.5 task: Train and test text2semantic under decoder only framework for ichigo v0.5 Nov 29, 2024
@bachvudinh
Contributor

image

@dan-homebrew
Contributor

dan-homebrew commented Nov 29, 2024

Need to add more details to this issue:

  • Goal: Be able to handle any arbitrary language
    • Current Ichigo approach (WhisperVQ) is only trained for 7 languages
    • We cannot find an ASR or STT module for every new language
    • Our approach: Text to Semantic (same semantic space as Ichigo speech embeddings)

Please help me to align nomenclature etc. @tikikun's diagram above is very helpful.

@dan-homebrew dan-homebrew transferred this issue from janhq/ichigo Nov 29, 2024
@bachvudinh bachvudinh added this to the Ichigo v0.5 milestone Nov 29, 2024
@janhq janhq deleted a comment from bachvudinh Dec 1, 2024
@hahuyhoang411
Contributor

I moved the table to the top for better visibility, cc @bachvudinh

@PodsAreAllYouNeed

This task is a hybrid between text-to-speech and speech-to-speech translation. It is quite hard because there is a one-to-many mapping between input text and possible output token combinations.

Here are two papers that are using the same AR setting, but for slightly different tasks. I think it can be adapted.

AudioPaLM: https://arxiv.org/pdf/2306.12925
VALL-E: https://arxiv.org/pdf/2301.02111

Specifically, I think we can use VALL-E's idea of a phoneme conversion layer before sending the text into the AR model; this might bridge the gap to the semantic embeddings a bit, making the AR model's job easier (see the small g2p sketch below). We also need to somehow provide auxiliary information about the expected acoustic ground truth; otherwise, if we provide text only to the AR model, there are too many possible correct answers, so the loss may conflict across samples.

However, I think it will be hard to make this work. The AR model needs a better constraint.
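If we experiment with the phoneme front end, a minimal sketch using the g2p_en package (an assumption; any grapheme-to-phoneme converter would do) might look like this:

```python
# Sketch: convert input text to phonemes before feeding it to the AR model,
# in the spirit of the VALL-E-style front end suggested above. g2p_en is one
# possible g2p front end; swap in whatever converter fits the pipeline.
from g2p_en import G2p

g2p = G2p()
text = "he telegraphed to general pemberton"
phonemes = [p for p in g2p(text) if p != " "]  # drop word-boundary markers
print(phonemes)  # e.g. ['HH', 'IY1', 'T', 'EH1', 'L', ...]
```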

My proposal

In the WhisperSpeech framework, the text-to-semantic model is the inverse of the whisper decoder. We need to involve the whisper decoder in the training.

  1. Keep the same AR model structure
  2. However, instead of trying to get the model to predict the WhisperVQ codes, send continuous embeddings into the frozen Whisper decoder. What we are trying to do is get the AR decoder model to trick the Whisper decoder into thinking it is seeing output from the Whisper encoder.
  3. Compute the loss of the Whisper decoder output against the original text.

You will meet a practical challenge: while training this AR decoder model, it is acting like a NAR encoder model from the Whisper decoder's point of view. There might be a smart way to solve this, but I can't think of one at the moment; alternatively, you can just use a NAR model. (A rough sketch of the frozen-decoder objective follows.)
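A rough sketch of that objective, assuming the student emits continuous vectors matching Whisper's encoder hidden size (d_model); the student outputs are wrapped as `encoder_outputs` for a frozen Whisper model and scored with the usual text cross-entropy. This is illustrative, not a drop-in training loop.

```python
# Sketch of the "trick the frozen Whisper decoder" objective described above.
# Assumes student_embeddings has shape (1, frames, d_model) and matches the
# hidden size of the chosen Whisper checkpoint.
import torch
from transformers import WhisperForConditionalGeneration, WhisperTokenizer

whisper = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-base")
whisper.requires_grad_(False)  # Whisper stays frozen; only the student gets gradients

def decoder_reconstruction_loss(student_embeddings: torch.Tensor, target_text: str) -> torch.Tensor:
    labels = tokenizer(target_text, return_tensors="pt").input_ids
    if labels[0, 0].item() == whisper.config.decoder_start_token_id:
        labels = labels[:, 1:]  # the model re-adds the start token when shifting labels
    outputs = whisper(
        encoder_outputs=(student_embeddings,),  # bypass the real Whisper encoder
        labels=labels,                          # cross-entropy against the original text
    )
    return outputs.loss
```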

Another (Simpler) Idea

If we really want an AR model trained with next-token prediction, must keep WhisperVQ tokens in the current format, and don't want to add auxiliary information, we can try a simple intervention: group identical consecutive WhisperVQ tokens together. This way, the model is not penalized for getting the output length wrong.

i.e. this original example:
<|sound_start|><|sound_0209|><|sound_0134|><|sound_0134|><|sound_0134|><|sound_0241|><|sound_0222|><|sound_0239|><|sound_0329|><|sound_0197|><|sound_0115|><|sound_0409|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0235|><|sound_0487|><|sound_0487|><|sound_0459|><|sound_0459|><|sound_0405|><|sound_end|>

get mapped to this:
<|sound_start|><|sound_0209|><|sound_0134|><|sound_0241|><|sound_0222|><|sound_0239|><|sound_0197|><|sound_0115|><|sound_0409|><|sound_0196|><|sound_0235|><|sound_0487|><|sound_0459|><|sound_0405|><|sound_end|>

This way the order of the output tokens matters, but the number of consecutively repeated tokens does not.
We can worry about upsampling the number of tokens as a separate problem. It might not matter to the decoder, because the whole token sequence gets cross-attention anyway and the repeated tokens might not add that much information. During fine-tuning we can apply a similar filtering to the WhisperVQ token stream to see whether performance changes. If repeated tokens don't impact performance, this also makes inference faster. (A minimal sketch of this de-duplication follows.)
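A minimal sketch of that consecutive-duplicate grouping at the string level, assuming the <|sound_xxxx|> formatting used in this thread:

```python
# Sketch: collapse consecutive repeats of the same sound token, keeping their
# order but discarding how many times each one repeats.
import re
from itertools import groupby

SOUND_TOKEN = re.compile(r"<\|sound_(?:start|end|\d{4})\|>")

def dedup_sound_tokens(sequence: str) -> str:
    tokens = SOUND_TOKEN.findall(sequence)
    return "".join(token for token, _ in groupby(tokens))

original = (
    "<|sound_start|><|sound_0209|><|sound_0134|><|sound_0134|><|sound_0134|>"
    "<|sound_0241|><|sound_0196|><|sound_0196|><|sound_end|>"
)
print(dedup_sound_tokens(original))
# <|sound_start|><|sound_0209|><|sound_0134|><|sound_0241|><|sound_0196|><|sound_end|>
```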

@bachvudinh
Contributor

bachvudinh commented Dec 4, 2024

Updated from Research sync 2024-12-4:

  • The current model has a repetition problem: it generates endless sound semantic tokens and doesn't know how to emit the <|sound_end|> token to stop.
  • In the latest ongoing runs we scaled the data to 3.7M samples, switched to the Llama 3.1 1B base model, and trained for 3 epochs. The loss stabilizes between 1.4-1.5 after the second epoch and shows no further improvement during the third epoch.
  • @PodsAreAllYouNeed suggested deduplicating the sound tokens when building the training data: the repeated tokens might not add much information, and their count is proportional to how fast a word is pronounced. I will set up the data, and as soon as the exp-t2s-llama3.2-1B-2 experiment finishes, I will start the training on the 6xA6000.

@bachvudinh
Contributor

bachvudinh commented Dec 5, 2024

cc @PodsAreAllYouNeed @tikikun
I did some testing with WhisperVQ to see whether the number of repeated semantic tokens is proportional to how fast or slowly a word is pronounced:

| Audio Content | Audio Length | Speech Rate | Semantic Token Output | Tokens Length |
|---|---|---|---|---|
| Hello World | 1s | Fast | [207, 207, 207, 207, 207, 143, 153, 153, 13, 13, 13, 137, 228, 228, 1, 407, 1, 1, 222, 1, 1, 1, 1, 1, 11, 11, 207, 356, 348, 207, 356, 386, 207, 130] | 34 |
| Hello World | 3s | Slow | [314, 336, 500, 207, 207, 153, 13, 208, 13, 13, 137, 228, 137, 228, 228, 228, 322, 378, 407, 407, 407, 378, 200, 200, 407, 378, 200, 378, 200, 200, 407, 200, 407, 78, 407, 378, 407, 378, 200, 407, 200, 200, 192, 407, 407, 407, 192, 192, 397, 397, 397, 479, 341, 479, 245, 245, 245, 245, 245, 35, 56, 245, 400, 376, 446, 378, 400, 192, 192, 400, 200, 200, 213, 207, 200, 207, 200, 207, 207, 207, 200, 508, 508, 508, 508, 207, 336, 348, 336, 508, 336, 508, 200] | 93 |
| Hello | 1s | Fast | [207, 207, 207, 207, 143, 22, 207, 153, 179, 13, 13, 137, 228, 228, 322, 434, 434, 286, 286, 286, 286, 356, 207, 22, 382, 207] | 26 |
| Hello | 1s | Slow | [207, 200, 153, 153, 153, 135, 13, 313, 313, 313, 313, 378, 378, 378, 378, 378, 378, 378, 378, 13, 392, 392, 228, 228, 378, 228, 275, 275, 275, 378, 275, 200, 227, 275, 200, 426, 407, 200, 434, 382, 382, 407, 382, 207, 207, 508] | 46 |

@bachvudinh
Contributor

bachvudinh commented Dec 9, 2024

What I tried to do:

  • Deduplicate the sound semantic tokens to get a shorter sound token sequence --> more efficient training.
    For example:
    Input:

    <|sound_start|><|sound_0207|><|sound_0148|><|sound_0000|><|sound_0249|><|sound_0249|><|sound_0177|><|sound_0177|><|sound_0177|><|sound_0177|><|sound_0177|><|sound_0302|><|sound_0302|><|sound_0426|><|sound_0426|><|sound_0081|><|sound_0090|><|sound_0491|><|sound_0327|><|sound_0327|><|sound_0153|><|sound_0061|><|sound_0061|><|sound_0196|><|sound_0178|><|sound_0129|><|sound_0129|><|sound_0025|><|sound_0070|><|sound_0114|><|sound_0159|><|sound_0159|><|sound_0159|><|sound_0010|><|sound_0316|><|sound_0270|><|sound_0510|><|sound_0409|><|sound_0339|><|sound_0378|><|sound_0339|><|sound_0378|><|sound_0407|><|sound_0407|><|sound_0404|><|sound_0404|><|sound_0235|><|sound_0322|><|sound_0322|><|sound_0076|><|sound_0245|><|sound_0226|><|sound_0011|><|sound_0407|><|sound_0407|><|sound_0007|><|sound_0007|><|sound_0007|><|sound_0187|><|sound_0316|><|sound_0300|><|sound_0316|><|sound_0316|><|sound_0109|><|sound_0008|><|sound_0300|><|sound_0179|><|sound_0179|><|sound_0179|><|sound_0122|><|sound_0122|><|sound_0158|><|sound_0158|><|sound_0245|><|sound_0080|><|sound_0080|><|sound_0461|><|sound_0461|><|sound_0005|><|sound_0368|><|sound_0368|><|sound_0487|><|sound_0206|><|sound_0278|><|sound_0278|><|sound_0002|><|sound_0002|><|sound_0102|><|sound_0329|><|sound_0444|><|sound_0114|><|sound_0114|><|sound_0325|><|sound_0202|><|sound_0202|><|sound_0202|><|sound_0363|><|sound_0319|><|sound_0161|><|sound_0147|><|sound_0397|><|sound_0325|><|sound_0081|><|sound_0081|><|sound_0319|><|sound_0161|><|sound_0331|><|sound_0187|><|sound_0350|><|sound_0210|><|sound_0048|><|sound_0468|><|sound_0468|><|sound_0125|><|sound_0271|><|sound_0271|><|sound_0157|><|sound_0344|><|sound_0007|><|sound_0300|><|sound_0300|><|sound_0300|><|sound_0300|><|sound_0300|><|sound_0076|><|sound_0226|><|sound_0498|><|sound_0498|><|sound_0330|><|sound_0468|><|sound_0468|><|sound_0468|><|sound_0434|><|sound_0182|><|sound_end|>
    

    output:

    <|sound_start|><|sound_0207|><|sound_0148|><|sound_0000|><|sound_0249|><|sound_0177|><|sound_0302|><|sound_0426|><|sound_0081|><|sound_0090|><|sound_0491|><|sound_0327|><|sound_0153|><|sound_0061|><|sound_0196|><|sound_0178|><|sound_0129|><|sound_0025|><|sound_0070|><|sound_0114|><|sound_0159|><|sound_0010|><|sound_0316|><|sound_0270|><|sound_0510|><|sound_0409|><|sound_0339|><|sound_0378|><|sound_0339|><|sound_0378|><|sound_0407|><|sound_0404|><|sound_0235|><|sound_0322|><|sound_0076|><|sound_0245|><|sound_0226|><|sound_0011|><|sound_0407|><|sound_0007|><|sound_0187|><|sound_0316|><|sound_0300|><|sound_0316|><|sound_0109|><|sound_0008|><|sound_0300|><|sound_0179|><|sound_0122|><|sound_0158|><|sound_0245|><|sound_0080|><|sound_0461|><|sound_0005|><|sound_0368|><|sound_0487|><|sound_0206|><|sound_0278|><|sound_0002|><|sound_0102|><|sound_0329|><|sound_0444|><|sound_0114|><|sound_0325|><|sound_0202|><|sound_0363|><|sound_0319|><|sound_0161|><|sound_0147|><|sound_0397|><|sound_0325|><|sound_0081|><|sound_0319|><|sound_0161|><|sound_0331|><|sound_0187|><|sound_0350|><|sound_0210|><|sound_0048|><|sound_0468|><|sound_0125|><|sound_0271|><|sound_0157|><|sound_0344|><|sound_0007|><|sound_0300|><|sound_0076|><|sound_0226|><|sound_0498|><|sound_0330|><|sound_0468|><|sound_0434|><|sound_0182|><|sound_end|>
    
  • I experimented with Llama 3.1 1B using the deduplicated data for 3 epochs.

Result:

  • The loss did not go lower than 1.7 after 2 epochs (~1.73-1.8).
  • The model is unable to identify and learn the data pattern, possibly due to the information loss caused by deduplication.

@tikikun
Collaborator Author

tikikun commented Dec 9, 2024

Idea: Add duration tokens

Observations:

  • Loss did not converge as well after we de-duplicated, so some information must have been lost
  • Repetition is the only difference: how many times a sound token is repeated
  • Continuous embeddings are a by-product of how the Whisper encoder processes audio (each embedding is slightly different from the others even though they are almost the same)

Theories:

  • We need to be able to add back the "repetition" information
  • After we make sure the information content is kept the same, the loss behaviour should be relatively the same.

Implementation:

  • We create <|dur_xxx|> tokens (where xxx is an integer) to represent the number of repetitions inside each group of repeated sound tokens

Image

Extra Information

Why might some words result in repetition?

At first glance it's tempting to think that the information in a repeated token (or embedding) is redundant. But if we take a closer look at long and short vowels in English, that might not be the case.

Example:

  • Sheep - i
  • Ship - ɪ

Sometimes the only way to discern the difference between sheep and ship in spoken English is whether the i sound is long or short, i.e. its duration.

By de-duplicating, everything becomes a short sound, but your training target is still either sheep or ship (long or short vowel), which makes it impossible to really converge.

Hence, duration is the information that is left out when you de-dup.
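A minimal sketch of one possible encoding, where each de-duplicated sound token is followed by a <|dur_xx|> count; the exact token spelling and placement are assumptions, not the final training format:

```python
# Sketch: run-length encode the sound tokens as <|sound_xxxx|><|dur_xx|> pairs
# so the de-duplicated sequence keeps the duration information discussed above.
import re
from itertools import groupby

SOUND_TOKEN = re.compile(r"<\|sound_\d{4}\|>")

def add_duration_tokens(sequence: str) -> str:
    tokens = SOUND_TOKEN.findall(sequence)
    encoded = ["<|sound_start|>"]
    for token, run in groupby(tokens):
        encoded.append(f"{token}<|dur_{sum(1 for _ in run):02d}|>")
    encoded.append("<|sound_end|>")
    return "".join(encoded)

print(add_duration_tokens("<|sound_0196|><|sound_0196|><|sound_0196|><|sound_0235|>"))
# <|sound_start|><|sound_0196|><|dur_03|><|sound_0235|><|dur_01|><|sound_end|>
```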

@PodsAreAllYouNeed

Related to the idea of the token-level duration token, we could potentially add a "global duration token" as a context token either before or after the provided text input. This global duration token tells the t2s model how many semantic tokens it needs to generate. After training, it can also be used to control the generation length, which controls the speaking speed. This is inspired by the "number of frames" mechanic found in the F5-TTS code generation: https://github.com/SWivid/F5-TTS/blob/8898d05e374bcb8d3fc0b1286037e95df61f491f/src/f5_tts/infer/utils_infer.py#L449C1-L452C96

If TTS models need some global duration information in order to do the generation, then our text2semantic model should also use the same kind of global information; we just need to encode it a little differently.
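A small sketch of how such a token could be prepended to the text prompt; the bucket width and the <|dur_total_xxxx|> spelling are hypothetical:

```python
# Sketch of the "global duration token" idea: bucket the target number of sound
# tokens and expose it as a single context token in front of the text prompt.
def global_duration_token(num_sound_tokens: int, bucket: int = 25) -> str:
    return f"<|dur_total_{(num_sound_tokens // bucket) * bucket:04d}|>"

def build_prompt(text: str, num_sound_tokens: int) -> str:
    return f"<|text_to_semantic|>{global_duration_token(num_sound_tokens)}{text}"

print(build_prompt("hello world", 93))
# <|text_to_semantic|><|dur_total_0075|>hello world
```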

@bachvudinh
Contributor

Some results from the training runs:

  • The loss goes lower than in the plain deduplication training ---> the "repetition" information helps
  • The loss does not converge better than training on the normal sound tokens ---> this may be due to the newly added "duration" tokens
    image

@github-project-automation github-project-automation bot moved this to Investigating in Jan & Cortex Dec 11, 2024
@tikikun tikikun transferred this issue from janhq/WhisperSpeech Dec 11, 2024
@tikikun
Collaborator Author

tikikun commented Dec 11, 2024


The duration-token idea above needs further validation on different generation lengths.

@tikikun
Collaborator Author

tikikun commented Dec 12, 2024

@bachvudinh please add validation on longer sequences

@bachvudinh
Contributor

bachvudinh commented Dec 12, 2024

  • Our data length distribution.
    image

  • Testing the quality of text-to-semantic models using prompts of varying lengths:

    | Prompt length | Prompt | Sound token length | Ichigo v0.4 Answer |
    |---|---|---|---|
    | 30 | Can you please tell a story about a cat who loves to play with a ball of yarn? | 136 | What a delightful request! Here's a tale about a feline friend who loves to play with a ball of yarn: Whiskers and the Yarn Ball In a cozy little house on a quiet street, there lived a sleek and curious cat named Whiskers. Whiskers was a ball of energy, always looking for ways to entertain herself. One day, while exploring the house, she stumbled upon a ball of yarn that had been left on the floor. The soft, fluffy yarn was like a treasure trove of wonder to Whiskers. She batted it around the room, chasing after it with glee. From that moment on, Whiskers was hooked. The yarn ball became her favorite toy, and she would spend hours playing with it every day. She'd chase it around the house, pouncing on it like a predator, and then gently batting it back and forth with her paw. The yarn ball was the perfect size, just big enough to be challenging but small enough to be manageable. As the days went by, Whiskers' love for the yarn ball only grew stronger. She'd carry it around in her mouth, proudly displaying her prize to anyone who would look. Her owners, the Smiths, would laugh at her antics, watching as she'd chase the yarn ball around the living room, leaping over furniture and dodging around chairs. But Whiskers' favorite game was "Yarn Ball Hide-and-Seek." She'd hide the yarn ball under a cushion, behind a bookshelf, or even in the laundry basket, and then use her keen senses to track it down. When she finally found it, she'd let out a triumphant meow, as if to say, "I've got it! I've got it!" |
    | 78 | Summarize below paragraph: If we really want an AR model trained using next token prediction, we must use Whisper tokens in the current format and we don't want to add auxiliary information, we can try a simple intervention of grouping identical Whisper tokens together. This way, the model is not penalized for getting the output length wrong. | 685 | The text describes a method for training an AI model using next token prediction. The model is trained using a specific format of word tokens and does not require any auxiliary information. The model can be trained using a simple technique of grouping identical word tokens together. This approach does not penalize the model for the output length on the token prediction. |
    | 82 | Imagine you are a renowned architect designing a sustainable skyscraper in Singapore. The building needs to incorporate green technology, vertical gardens, and innovative energy solutions. Describe your vision for this groundbreaking structure, including specific details about materials, renewable energy systems, and how it will benefit the local community. Consider aspects like natural lighting, rainwater harvesting, and biodiversity integration. | 2013 | The Singapore SkyScape is a futuristic skyscraper that embodies the principles of green technology, vertical gardens, and innovative energy solutions. The building's design is a masterpiece of modern architecture, with a sleek and aerodynamic shape that maximizes natural light and ventilation. The exterior of the building is clad in a unique, translucent material that allows for maximum light penetration while also providing excellent insulation and energy efficiency. The SkyScape's vertical gardens are a key feature of the building's design. The gardens are located on the exterior of the building, and are designed to provide a natural habitat for a variety of plant species. The gardens are irrigated using a state-of-the-art hydroponic system, which uses recycled water and minimizes water waste. The gardens also provide a natural air filtration system, helping to purify the air and improve the overall air quality in the surrounding area. The SkyScape's innovative energy solutions are another key feature of the building's design. The building is powered by a combination of solar panels, wind turbines, and geothermal energy. The solar panels are located on the roof of the building, and provide a significant portion of the building's energy needs. The wind turbines are located on the exterior of the building, and provide additional energy. |
    | 145 | Summarize the below paragraph: The importance of effective communication in everyday life cannot be overstated. Clear and concise communication helps to build strong relationships, avoid misunderstandings, and convey ideas with precision. When individuals communicate effectively, they are able to express their thoughts, needs, and desires in a way that is easily understood by others. Good communication involves not only speaking, but also listening actively and responding thoughtfully. It requires being aware of nonverbal cues, such as body language and tone of voice, and adapting one's message to the audience and context. By doing so, individuals can navigate complex social situations, resolve conflicts, and achieve their goals more efficiently. | 2015 | Effective communication is crucial in everyday life as it helps build strong relationships, avoid misunderstandings, and convey ideas with precision. Clear and concise communication is key to achieving this, and it is essential to communicate effectively in every part of the country. While computers can communicate with each other, they are limited to simple social situations and cannot resolve conflicts with humans. Therefore, it is important to communicate effectively with humans to avoid conflicts and achieve our goals. |
    | 144 | Repeat what i said below: The importance of effective communication in everyday life cannot be overstated. Clear and concise communication helps to build strong relationships, avoid misunderstandings, and convey ideas with precision. When individuals communicate effectively, they are able to express their thoughts, needs, and desires in a way that is easily understood by others. Good communication involves not only speaking, but also listening actively and responding thoughtfully. It requires being aware of nonverbal cues, such as body language and tone of voice, and adapting one's message to the audience and context. By doing so, individuals can navigate complex social situations, resolve conflicts, and achieve their goals more efficiently. | 814 | The importance of effective communication in everyday life cannot be overstated. Clear and concise communication helps to build strong relationships, avoid misunderstandings, and convey ideas with precision. When individuals communicate effectively and respond thoughtfully, they complete the communication process. The employee's information has been completed. The communication has been completed. The communication process has been completed. |

@dan-homebrew dan-homebrew modified the milestone: Ichigo v0.5 Dec 12, 2024

@PodsAreAllYouNeed

A good point of reference for the text-to-semantic model's results would be the F5-TTS model.

We can also use their LibriSpeech-PC test-clean methodology for testing. Additionally, we can send the text into F5-TTS to generate audio, then feed the generated audio back into the full WhisperVQ + decoder model for ASR as an additional point of reference. (This only needs to be done once each time we retrain the VQ.)

Image

Looking at the table from F5-TTS, it seems that a good target for the WER of the t2s -> decoder round-trip would be 2-3% on the LibriSpeech test-clean dataset.

@bachvudinh
Contributor

bachvudinh commented Dec 18, 2024

Text-to-Semantics Training Issue Resolution

  • Issue Identified: Initially suspected a failure in Text-to-Semantics model training.

  • Root Cause Discovery: During a 2-hour pair-coding session, identified that training data contained a terminal period (".") that was missing from evaluation data.

  • Solution: Add the terminal period during evaluation, which successfully restored T2S model functionality (a tiny sketch of this normalization follows the Key Learnings list).

  • Key Learnings:

    • Separator tokens are critical components in decoder model performance, reinforcing concepts from today's brown bag presentation.
    • Small details in data formatting can have significant impacts on research outcomes.
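A tiny sketch of the evaluation-side normalization, assuming the prompts are plain text strings:

```python
# Sketch: make evaluation prompts end with the same terminal period the
# training transcripts had, so the separator the model expects is present.
def normalize_eval_prompt(text: str) -> str:
    text = text.strip()
    return text if text.endswith(".") else text + "."

assert normalize_eval_prompt("hello world") == "hello world."
```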

Word Error Rate (WER) Comparison Between Real Semantic tokens and Synthetic tokens.

| Model | WER (%) |
|---|---|
| Synthetic-T2S-1B-compress-with-prompt | 7.37 |
| WhisperVQ 7 lang with prompt | 5.33 |
| Synthetic-T2S-1B-compress | 11.52 |
| WhisperVQ 7 lang | 10.86 |

Note: "With prompt" refers to adding a prompt to the Whisper decoder.

@PodsAreAllYouNeed

I used stats.csv to plot a histogram of the WER delta between t2s and WhisperVQ.

image

The WER difference between t2s and WhisperVQ is very small; in the majority of samples the delta is within 0.5 WER. This is a very good sign that our t2s model is behaving in a similar manner to the Whisper encoder.
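For reference, a sketch of that plot, assuming stats.csv carries per-sample `t2s_wer` and `whispervq_wer` columns (the real column names may differ):

```python
# Sketch: histogram of the per-sample WER delta between the t2s model and
# WhisperVQ, as shown above. Column names in stats.csv are assumptions.
import pandas as pd
import matplotlib.pyplot as plt

stats = pd.read_csv("stats.csv")
delta = stats["t2s_wer"] - stats["whispervq_wer"]

plt.hist(delta, bins=50)
plt.xlabel("WER delta (t2s - WhisperVQ)")
plt.ylabel("Number of samples")
plt.title("Per-sample WER delta")
plt.show()
```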
