planning: Train our own Quantizer for multilingual speech to work with Whisper Encoder #146

Open · 2 of 4 tasks · Tracked by #116
hahuyhoang411 opened this issue Nov 19, 2024 · 13 comments

hahuyhoang411 (Contributor) commented Nov 19, 2024

Goal

Experiment with the WhisperVQ model to get better multilingual results. Hypothesis: the current codebook has only 512 codes, which is a small space in which to compress multilingual capability.

Learning Goals

  • Train our own Quantizer
  • Gain more understanding of quantization
  • Quantizer training leads to more reusable work
    • e.g., the same approach can quantize images
[Screenshot attached: 2024-11-27 15:25]

Tasklist

Experiments

  • We expand the quantizer codebook from 512 codes (original weights) to 1024 to better capture multilingual speech.
  • The first issue we encountered while training the quantizer model was a high KL loss. Detailed debugging notes are available in our worklog: GitHub Issue janhq/ichigo#144.
| Run ID | Date (dd/mm/yy) | Model Config | Dataset | Learning Rate | Batch Size | Steps | Total Loss | CE Loss | KL Loss | Commit Loss | Codebook Utilization | Runtime | Hardware | Note |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| expand-exp1 | 02/12/24 | codes=1024 (random init), max_token=200 | Bud500 | 1e-3 | 8 | 9999 | 15.83308 | 0.76369 | 15.06781 | 0.0015839 | 86% | 1h 30m | 1xA6000 | small batch size is not optimal |
| expand-exp2 | 02/12/24 | codes=1024 (random init), max_token=200 | Bud500 | 1e-3 | 42 | 1689 | 15.96285 | 0.83534 | 15.12557 | 0.001936 | 95% | 1h 5m | 1xA6000 | random init gives higher codebook utilization |
| expand-exp3 | 02/12/24 | codes=1024 (avg init), max_token=200 | Bud500 | 1e-3 | 42 | 4999 | 15.39613 | 0.80295 | 14.59104 | 0.0021388 | 84% | 3h 21m | 1xA6000 |  |
| kl-exp1 | 03/12/24 | codes=1024, KL=5, max_token=200 | Bud500 | 1e-3 | 42 | 782 | 78.53279 | 1.11499 | 15.48318 | 0.001883 | 83% | 31m 1s | 1xA6000 | changed KL factor, but KL loss is still high |
| kl-exp2 | 03/12/24 | codes=1024, KL=2, max_token=200 | Bud500 | 1e-3 | 42 | 2924 | 30.53683 | 0.91597 | 14.80941 | 0.002393 | 85% | 1h 57m | 1xA6000 | changed KL factor, but KL loss is still high |
| kl-exp3 | 03/12/24 | codes=1024, KL=1.5, max_token=200 | Bud500 | 1e-3 | 42 | 2939 | 23.34618 | 0.84657 | 14.99841 | 0.001992 | 86% | 1h 58m | 1xA6000 | changed KL factor, but KL loss is still high |
| kl-exp4 | 03/12/24 | codes=1024, KL=3, max_token=200 | Bud500 | 1e-3 | 42 | 3726 | 45.72793 | 1.0146 | 14.90375 | 0.002078 | 84% | 2h 43m | 1xA6000 | changed KL factor, but KL loss is still high |
| mask-logit-exp | 03/12/24 | codes=1024, KL=1, max_token=200 | Bud500 | 1e-3 | 42 | 2335 | 0.60911 | 0.29703 | 0.3107 | 0.0013738 | 86% | N/A | 1xA6000 | missing loss calculation (deprecated implementation) |
| expand-exp4 | 04/12/24 | codes=1024 (random init), max_token=50 | Bud500 | 1e-3 | 42 | 15099 | 4.71717 | 0.7786 | 3.93544 | 0.003314 | 96% | 7h 43m | 1xA6000 | lower token cap = fewer pad tokens = lower KL loss |
| expand-exp5 | 04/12/24 | codes=1024 (random init), max_token=20 | Bud500 | 1e-3 | 42 | 15099 | 2.75478 | 0.82511 | 1.92655 | 0.0031208 | 94% | 7h 32m | 1xA6000 | token cap = max tokens of data (20) = optimal KL loss |
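The notes on expand-exp4 and expand-exp5 (fewer pad tokens → lower KL loss) suggest that padded positions dominate the KL term. Below is a minimal sketch of one possible mitigation, masking padding before averaging the KL divergence; it assumes student/teacher logits of shape (batch, seq, vocab) and an available pad mask, and the function/tensor names are hypothetical rather than the actual ichigo-quantizer implementation.

```python
import torch
import torch.nn.functional as F

def masked_kl_loss(student_logits, teacher_logits, pad_mask, temperature=1.0):
    """KL(teacher || student), averaged over non-padded positions only.

    student_logits, teacher_logits: (batch, seq_len, vocab)
    pad_mask: (batch, seq_len), 1 where the position is real, 0 where it is padding.
    """
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # per-position KL divergence, shape (batch, seq_len)
    kl = (p_teacher * (p_teacher.clamp_min(1e-9).log() - log_p_student)).sum(dim=-1)
    # drop padded positions and normalise by the number of real tokens
    pad_mask = pad_mask.to(kl.dtype)
    return (kl * pad_mask).sum() / pad_mask.sum().clamp_min(1.0)
```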
hahuyhoang411 changed the title from "WhisperVQ Development (Issue: )" to "task: WhisperVQ Development" on Nov 19, 2024
bachvudinh self-assigned this on Nov 20, 2024
tuanlda78202 (Contributor) commented:

WhisperSpeech operates in two stages:

  • Reading: Converts text into phonetic representations.
  • Speaking: Enhances semantic-to-acoustic token generation with speaker embeddings and a vocoder.

For the "reading" stage, we leverage the off-the-shelf Whisper model, while training a quantizer on the continuous output embeddings. This pipeline includes dataset preparation, model training, and evaluation.

Evaluation Strategy:

  • The Word Error Rate (WER) is computed by decoding the quantized vectors back to audio, running the same Whisper model for ASR, and comparing the input and output tokens (a minimal sketch follows below).
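A minimal sketch of this round-trip evaluation, assuming hypothetical helpers (`encode`, `quantize`, `dequantize`, `transcribe`) that wrap the Whisper encoder, the trained codebook, and the Whisper decoder; only `jiwer.wer` is a real library call here.

```python
import jiwer  # real library for WER computation

def roundtrip_wer(samples, encode, quantize, dequantize, transcribe):
    """samples: iterable of (audio, reference_text) pairs.

    encode/quantize/dequantize/transcribe are hypothetical callables:
    Whisper encoder -> codebook indices -> reconstructed embeddings -> Whisper decoder text.
    """
    refs, hyps = [], []
    for audio, reference_text in samples:
        continuous = encode(audio)              # Whisper encoder embeddings
        codes = quantize(continuous)            # discrete codebook indices
        reconstructed = dequantize(codes)       # back to continuous space
        hyps.append(transcribe(reconstructed))  # Whisper decoder output text
        refs.append(reference_text)
    return jiwer.wer(refs, hyps)
```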

cc @bachvudinh

hahuyhoang411 (Contributor, Author) commented:

Please also check the proportions for mixing the datasets. cc @tuanlda78202

tuanlda78202 (Contributor) commented:

I checked the codebase that prepares training data for the WhisperSpeech quantizer training pipeline and found many bugs (outdated library versions, missing functions, etc.). I have finished fixing the dataset-preparation bugs and keeping it in sync with the training pipeline, and will test with MLS in the next few days.
cc @tikikun

bachvudinh self-assigned this on Nov 22, 2024
tuanlda78202 (Contributor) commented Nov 26, 2024

I just changed the WhisperVQ codebase to support HF Datasets directly (wrapped with WebDataset) and tested training on a few data samples. We will refactor the WhisperVQ codebase for better control and set up training later.
Modified repo: https://github.com/janhq/ichigo-quantizer

hahuyhoang411 (Contributor, Author) commented Nov 26, 2024

Would love to have the link to your modified repo (the current code base is locked). @tuanlda78202

dan-homebrew changed the title from "task: WhisperVQ Development" to "task: Learn how to train and work with WhisperVQ" on Nov 27, 2024
dan-homebrew changed the title from "task: Learn how to train and work with WhisperVQ" to "task: Train WhisperVQ to support multiple languages" on Nov 27, 2024
dan-homebrew changed the title from "task: Train WhisperVQ to support multiple languages" to "task: Train our own Quantizer for multilingual speech to work with Whisper Encoder" on Nov 27, 2024
dan-homebrew changed the title from "task: Train our own Quantizer for multilingual speech to work with Whisper Encoder" to "planning: Train our own Quantizer for multilingual speech to work with Whisper Encoder" on Nov 27, 2024
tikikun (Collaborator) commented Nov 27, 2024

Diagram for how the training process will be done


tikikun (Collaborator) commented Nov 27, 2024

Quantization in Generating Synthetic Semantics Embeddings for LLMs

This document outlines the benefits of using quantization, specifically Vector Quantization (VQ), for generating synthetic semantic embeddings in text-to-semantics models. The focus here is on scenarios where audio data is used as input to generate embeddings for downstream tasks, such as feeding audio-derived semantics into large language models (LLMs). Quantization offers a powerful alternative to continuous embeddings, providing improved control, interpretability, and scalability.


Advantages of Quantization in Generating Semantic Embeddings

1. Transferability and Discreteness

Quantization offers two critical advantages that make it highly suitable for generating semantic embeddings:

  • Transferability Across Models: Discrete embeddings created through quantization are independent of the model used to produce them. This independence means that once quantized embeddings are generated, they can easily be reused across different LLMs or other downstream applications without requiring retraining.

  • Discreteness of Representations: Continuous embeddings are high-dimensional and complex, while quantization transforms these into discrete tokens. These tokens are easier to manipulate and more interpretable, making them ideal for tasks like generating synthetic semantic embeddings that can be seamlessly used by LLMs.


2. Simplified Synthetic Data Generation with Discrete Semantic Tokens

Quantization simplifies the process of generating synthetic embeddings by converting raw audio inputs into discrete tokens that encapsulate their semantic meaning. This process involves two key components (a code sketch of the codebook lookup follows the list):

  • Codebook as a Semantic Vocabulary: A pre-trained codebook is used to map audio features into a finite set of discrete semantic tokens. Each token corresponds to a meaningful representation (e.g., "speech tone," "emotion," or "phoneme cluster") and provides a compact and interpretable way of encoding the underlying semantics.

  • Autoregressive or Generative Models: Generative models trained on these discrete tokens can predict token sequences from audio data. This enables the generation of high-quality synthetic embeddings that capture the semantics of the audio in a consistent and interpretable manner.
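As a concrete illustration of the codebook lookup described above, here is a minimal sketch of nearest-neighbour vector quantization in PyTorch; the names are illustrative and this is not the exact WhisperVQ implementation.

```python
import torch

def quantize_embeddings(embeddings, codebook):
    """Map continuous embeddings to their nearest codebook entries.

    embeddings: (num_frames, dim) continuous encoder outputs (e.g., from Whisper)
    codebook:   (num_codes, dim)  learned "semantic vocabulary"
    Returns (token_ids, quantized): discrete indices and their codebook vectors.
    """
    distances = torch.cdist(embeddings, codebook)  # (num_frames, num_codes) pairwise distances
    token_ids = distances.argmin(dim=-1)           # nearest code per frame
    quantized = codebook[token_ids]                # discrete semantic representation
    return token_ids, quantized
```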


Challenges with Continuous Embeddings

When using continuous embeddings to represent audio semantics, several challenges arise:

  1. Artifacts in Generated Data: Continuous embeddings often contain noise or irrelevant features, as they encode both essential semantic content and extraneous details. This can lead to artifacts or inaccuracies when these embeddings are used as input for LLMs.

  2. Difficulty in Control and Interpretability: Continuous embeddings are not inherently interpretable, making it hard to manipulate specific semantic aspects (e.g., isolating emotion from speech). This lack of control can result in synthetic embeddings that fail to capture the desired meaning accurately.

  3. Scalability Issues: Continuous embeddings require significant computational resources for processing and storage due to their high dimensionality, making them less scalable compared to discrete representations.


Practical Example: Using Audio Data to Generate Semantic Embeddings for LLMs

To better understand the benefits of quantization, consider the task of generating semantic embeddings from audio inputs (such as speech recordings) to be used as input for an LLM. Below, we compare two approaches: using continuous embeddings versus discrete tokens derived from a codebook.

Continuous Embeddings Approach

  1. A speech recording is processed to generate a continuous embedding—a high-dimensional vector representing features such as phonemes, pitch, tone, and emotion.
  2. While this embedding captures detailed information about the speech, it also includes extraneous noise and irrelevant features. For example, it might encode microphone characteristics or background sound.
  3. Feeding this continuous embedding directly into an LLM can result in inconsistencies or inaccuracies in semantic understanding due to the noisy and high-dimensional nature of the data.

Quantization with a Codebook

  1. The speech recording is processed through a quantizer that maps the audio features to discrete tokens based on a pre-trained codebook.
    • For example, "high-pitched phoneme," "neutral tone," and "positive sentiment" might be represented as discrete tokens like [P1], [T1], and [S1].
  2. These tokens are compact, interpretable, and noise-free representations of the audio's semantic content.
  3. The discrete tokens are then fed into an LLM as input, ensuring that the model focuses only on the core semantics of the audio without being distracted by irrelevant details.
  4. Additionally, synthetic embeddings can be generated by sampling token sequences from the codebook using a generative model. These synthetic embeddings can be used to augment training data or for other downstream tasks.

For example:

  • Imagine generating synthetic responses for a voice assistant like Alexa or Siri based on user speech input. Using continuous embeddings might result in responses that are overly sensitive to background noise or unintended emotional tones in the input. By contrast, using quantized discrete tokens ensures that only the core semantic intent (e.g., "requesting weather update") is captured and processed accurately by the LLM.
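Building on steps 3–4 above, the sketch below shows one way the discrete token ids could be rendered as special text tokens for an LLM prompt. The `<|sound_XXXX|>` naming and the helper are illustrative assumptions, not the actual Ichigo token format.

```python
def tokens_to_prompt(token_ids, instruction="Transcribe the following audio:"):
    """Render discrete semantic token ids as text an LLM tokenizer can ingest.

    The <|sound_XXXX|> format is a hypothetical special-token scheme; the real
    system may register different special tokens with the LLM tokenizer.
    """
    sound_tokens = "".join(f"<|sound_{i:04d}|>" for i in token_ids)
    return f"{instruction} {sound_tokens}"

# Example: ids [17, 512, 903] become
# "Transcribe the following audio: <|sound_0017|><|sound_0512|><|sound_0903|>"
```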

Summary: Why Quantization Matters for Audio-to-Semantics Conversion

Quantization provides clear advantages over continuous embeddings when generating semantic representations from audio data to be used as input for LLMs:

  1. Transferability and Reusability: Quantized embeddings are model-independent and can be seamlessly reused across different applications or LLMs.

  2. Improved Data Quality: Discrete tokens eliminate noise and artifacts inherent in continuous embeddings, ensuring more accurate and consistent semantic representations.

  3. Enhanced Interpretability and Control: The discrete nature of quantized tokens allows for easier manipulation and understanding of semantic content.

  4. Scalability: Discrete tokens require less computational overhead compared to high-dimensional continuous embeddings, making them more scalable for large-scale applications.

  5. High-Quality Synthetic Embedding Generation: By leveraging a pre-trained codebook and generative models, quantization enables the generation of synthetic semantic embeddings that are highly interpretable and aligned with desired meanings.

By adopting quantization in audio-to-semantics workflows, practitioners can ensure that LLMs receive cleaner, more interpretable inputs while also benefiting from enhanced scalability and control in synthetic data generation tasks.

tuanlda78202 (Contributor) commented Nov 27, 2024

Uploaded the refactored quantizer training pipeline for easier maintenance to the WhisperSpeech fork: janhq/WhisperSpeech

cc @tikikun @bachvudinh

tikikun (Collaborator) commented Dec 2, 2024

Pressing Issues

Some issues with the current training test:

  • Random Initialization: We should not randomly initialize the weights; instead, start from the current quantizer and expand its codebook with an averaged-weight initialization strategy (see the sketch at the end of this comment).
  • Learning Rate: The current learning rate is clearly not optimal.

@tuanlda78202 please make the amendments accordingly.
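A minimal sketch of one way to read the averaged-weight expansion: keep the 512 trained codes and initialize the new entries from their mean plus small noise. The function name and noise scale are assumptions, not the exact ichigo-quantizer change.

```python
import torch

def expand_codebook(old_codebook, new_size=1024, noise_scale=1e-3):
    """Grow a trained codebook instead of re-initializing it randomly.

    old_codebook: (old_size, dim) weights from the existing 512-code quantizer.
    New entries start from the mean of the trained codes plus small noise.
    """
    old_size, dim = old_codebook.shape
    new_codebook = torch.empty(new_size, dim, dtype=old_codebook.dtype)
    new_codebook[:old_size] = old_codebook                 # keep the trained codes
    mean = old_codebook.mean(dim=0, keepdim=True)
    extra = new_size - old_size
    new_codebook[old_size:] = mean + noise_scale * torch.randn(extra, dim)
    return new_codebook
```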

PodsAreAllYouNeed commented Dec 2, 2024

> Pressing Issues
>
> Some issues with the current training test:
>
> • Random Initialization: We should not randomly initialize the weights; instead, start from the current quantizer and expand its codebook with an averaged-weight initialization strategy.
> • Learning Rate: The current learning rate is clearly not optimal.
>
> @tuanlda78202 please make the amendments accordingly.

For new datasets we are adding to training, it would be good practice to check the baseline, i.e., what is the original WER of Whisper on Bud-500 without any quantization?

If the original WER is, say, 30 without any intervention from us, then a WER of 31.7 is great; the problem is Whisper itself, which we cannot fix, and no amount of tuning on the quantizer side will let us fix it.
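A minimal sketch of that baseline check, assuming Bud-500 is available as a Hugging Face dataset with `audio` and `transcription` columns; the dataset id, split, column names, and Whisper checkpoint are assumptions to adjust to the actual setup.

```python
import jiwer
from datasets import load_dataset
from transformers import pipeline

# Whisper checkpoint and dataset id are assumptions; swap in the real ones.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-medium")
ds = load_dataset("linhtran92/viet_bud500", split="test")

refs, hyps = [], []
for sample in ds.select(range(100)):  # small subset for a quick baseline estimate
    audio = sample["audio"]
    out = asr({"array": audio["array"], "sampling_rate": audio["sampling_rate"]})
    hyps.append(out["text"].lower().strip())
    refs.append(sample["transcription"].lower().strip())

print("Baseline WER (no quantization):", jiwer.wer(refs, hyps))
```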

tikikun (Collaborator) commented Dec 3, 2024

Update on yesterday's issues in the code:

  • @tuanlda78202 had a typo in the code; it was found and fixed. Training will start today.

hahuyhoang411 (Contributor, Author) commented:

Note on some current experiments: KL loss is really high (around 10).

github-project-automation bot moved this to Investigating in Jan & Cortex on Dec 11, 2024
tikikun transferred this issue from janhq/WhisperSpeech on Dec 11, 2024