planning: Train our own Quantizer for multilingual speech to work with Whisper Encoder #146

Open · 2 of 4 tasks · Tracked by #116
hahuyhoang411 opened this issue Nov 19, 2024 · 13 comments

hahuyhoang411 (Contributor) commented Nov 19, 2024

Goal

Experiment with the WhisperVQ model to get better multilingual results. Hypothesis: the current codebook has only 512 codes, which is a small space in which to compress multilingual capability.

Learning Goals

  • Train our own Quantizer
  • Gain more understanding of quantization
  • Quantizer training leads to more reusable work
    • e.g., the same approach can quantize images
[Screenshot attached: 2024-11-27 15:25]

Tasklist

Experiments

  • We expand the quantizer codebook from 512 codes (original weights) to 1024 to better capture multilingual speech.
  • The first issue we encountered while training the quantizer model was a high KL loss. Detailed debugging notes are available in our worklog: GitHub Issue janhq/ichigo#144.
| Run ID | Date (dd/mm/yy) | Model Config | Dataset | Learning Rate | Batch Size | Steps | Total Loss | CE Loss | KL Loss | Commit Loss | Codebook Utilization | Runtime | Hardware | Note |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| expand-exp1 | 02/12/24 | codes=1024 (random init), max_token=200 | Bud500 | 1e-3 | 8 | 9999 | 15.83308 | 0.76369 | 15.06781 | 0.0015839 | 86% | 1h 30m | 1xA6000 | small batch size is not optimal |
| expand-exp2 | 02/12/24 | codes=1024 (random init), max_token=200 | Bud500 | 1e-3 | 42 | 1689 | 15.96285 | 0.83534 | 15.12557 | 0.001936 | 95% | 1h 5m | 1xA6000 | random init gives higher codebook utilization |
| expand-exp3 | 02/12/24 | codes=1024 (avg init), max_token=200 | Bud500 | 1e-3 | 42 | 4999 | 15.39613 | 0.80295 | 14.59104 | 0.0021388 | 84% | 3h 21m | 1xA6000 |  |
| kl-exp1 | 03/12/24 | codes=1024, KL=5, max_token=200 | Bud500 | 1e-3 | 42 | 782 | 78.53279 | 1.11499 | 15.48318 | 0.001883 | 83% | 31m 1s | 1xA6000 | changed KL factor, but KL loss is still high |
| kl-exp2 | 03/12/24 | codes=1024, KL=2, max_token=200 | Bud500 | 1e-3 | 42 | 2924 | 30.53683 | 0.91597 | 14.80941 | 0.002393 | 85% | 1h 57m | 1xA6000 | changed KL factor, but KL loss is still high |
| kl-exp3 | 03/12/24 | codes=1024, KL=1.5, max_token=200 | Bud500 | 1e-3 | 42 | 2939 | 23.34618 | 0.84657 | 14.99841 | 0.001992 | 86% | 1h 58m | 1xA6000 | changed KL factor, but KL loss is still high |
| kl-exp4 | 03/12/24 | codes=1024, KL=3, max_token=200 | Bud500 | 1e-3 | 42 | 3726 | 45.72793 | 1.0146 | 14.90375 | 0.002078 | 84% | 2h 43m | 1xA6000 | changed KL factor, but KL loss is still high |
| mask-logit-exp | 03/12/24 | codes=1024, KL=1, max_token=200 | Bud500 | 1e-3 | 42 | 2335 | 0.60911 | 0.29703 | 0.3107 | 0.0013738 | 86% | N/A | 1xA6000 | missing loss calculation (deprecated implementation) |
| expand-exp4 | 04/12/24 | codes=1024 (random init), max_token=50 | Bud500 | 1e-3 | 42 | 15099 | 4.71717 | 0.7786 | 3.93544 | 0.003314 | 96% | 7h 43m | 1xA6000 | lower token cap = fewer pad tokens = lower KL loss |
| expand-exp5 | 04/12/24 | codes=1024 (random init), max_token=20 | Bud500 | 1e-3 | 42 | 15099 | 2.75478 | 0.82511 | 1.92655 | 0.0031208 | 94% | 7h 32m | 1xA6000 | token cap = max tokens of data (20) = optimal KL loss |
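The notes on expand-exp4 and expand-exp5 (fewer pad tokens → lower KL loss) suggest that padded positions dominate the KL term. Below is a minimal sketch of one possible mitigation, masking padding before averaging the KL divergence; it assumes student/teacher logits of shape (batch, seq, vocab) and an available pad mask, and the function/tensor names are hypothetical rather than the actual ichigo-quantizer implementation.

```python
import torch
import torch.nn.functional as F

def masked_kl_loss(student_logits, teacher_logits, pad_mask, temperature=1.0):
    """KL(teacher || student), averaged over non-padded positions only.

    student_logits, teacher_logits: (batch, seq_len, vocab)
    pad_mask: (batch, seq_len), 1 where the position is real, 0 where it is padding.
    """
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # per-position KL divergence, shape (batch, seq_len)
    kl = (p_teacher * (p_teacher.clamp_min(1e-9).log() - log_p_student)).sum(dim=-1)
    # drop padded positions and normalise by the number of real tokens
    pad_mask = pad_mask.to(kl.dtype)
    return (kl * pad_mask).sum() / pad_mask.sum().clamp_min(1.0)
```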
hahuyhoang411 changed the title from "WhisperVQ Development (Issue: )" to "task: WhisperVQ Development" on Nov 19, 2024
bachvudinh self-assigned this on Nov 20, 2024
tuanlda78202 (Contributor) commented:

WhisperSpeech operates in two stages:

  • Reading: Converts text into phonetic representations.
  • Speaking: Enhances semantic-to-acoustic token generation with speaker embeddings and a vocoder.

For the "reading" stage, we leverage the off-the-shelf Whisper model, while training a quantizer on the continuous output embeddings. This pipeline includes dataset preparation, model training, and evaluation.

Evaluation Strategy:

  • The Word Error Rate (WER) is computed by decoding the quantized vectors back to audio, running the same Whisper model for ASR, and comparing the input and output tokens (a minimal sketch follows below).
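A minimal sketch of this round-trip evaluation, assuming hypothetical helpers (`encode`, `quantize`, `dequantize`, `transcribe`) that wrap the Whisper encoder, the trained codebook, and the Whisper decoder; only `jiwer.wer` is a real library call here.

```python
import jiwer  # real library for WER computation

def roundtrip_wer(samples, encode, quantize, dequantize, transcribe):
    """samples: iterable of (audio, reference_text) pairs.

    encode/quantize/dequantize/transcribe are hypothetical callables:
    Whisper encoder -> codebook indices -> reconstructed embeddings -> Whisper decoder text.
    """
    refs, hyps = [], []
    for audio, reference_text in samples:
        continuous = encode(audio)              # Whisper encoder embeddings
        codes = quantize(continuous)            # discrete codebook indices
        reconstructed = dequantize(codes)       # back to continuous space
        hyps.append(transcribe(reconstructed))  # Whisper decoder output text
        refs.append(reference_text)
    return jiwer.wer(refs, hyps)
```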

cc @bachvudinh

hahuyhoang411 (Contributor, Author) commented:

Please also check the proportions for mixing the datasets. cc @tuanlda78202

tuanlda78202 (Contributor) commented:

I checked the codebase that prepares training data for the WhisperSpeech quantizer training pipeline and found many bugs (outdated library versions, missing functions, etc.). I have finished fixing the dataset-preparation bugs and keeping it in sync with the training pipeline, and will test with MLS in the next few days.
cc @tikikun

bachvudinh self-assigned this on Nov 22, 2024
tuanlda78202 (Contributor) commented Nov 26, 2024

I just changed the WhisperVQ codebase to support HF Datasets directly (wrapped with WebDataset) and tested training on a few data samples. We will refactor the WhisperVQ codebase for better control and set up training later.
Modified repo: https://github.com/janhq/ichigo-quantizer

hahuyhoang411 (Contributor, Author) commented Nov 26, 2024

Would love to have the link to your modified repo (the current code base is locked). @tuanlda78202

dan-homebrew changed the title from "task: WhisperVQ Development" to "task: Learn how to train and work with WhisperVQ" on Nov 27, 2024
dan-homebrew changed the title from "task: Learn how to train and work with WhisperVQ" to "task: Train WhisperVQ to support multiple languages" on Nov 27, 2024
dan-homebrew changed the title from "task: Train WhisperVQ to support multiple languages" to "task: Train our own Quantizer for multilingual speech to work with Whisper Encoder" on Nov 27, 2024
dan-homebrew changed the title from "task: Train our own Quantizer for multilingual speech to work with Whisper Encoder" to "planning: Train our own Quantizer for multilingual speech to work with Whisper Encoder" on Nov 27, 2024
tikikun (Collaborator) commented Nov 27, 2024

Diagram for how the training process will be done


tikikun (Collaborator) commented Nov 27, 2024

Quantization in Generating Synthetic Semantics Embeddings for LLMs

This document outlines the benefits of using quantization, specifically Vector Quantization (VQ), for generating synthetic semantic embeddings in text-to-semantics models. The focus here is on scenarios where audio data is used as input to generate embeddings for downstream tasks, such as feeding audio-derived semantics into large language models (LLMs). Quantization offers a powerful alternative to continuous embeddings, providing improved control, interpretability, and scalability.


Advantages of Quantization in Generating Semantic Embeddings

1. Transferability and Discreteness

Quantization offers two critical advantages that make it highly suitable for generating semantic embeddings:

  • Transferability Across Models: Discrete embeddings created through quantization are independent of the model used to produce them. This independence means that once quantized embeddings are generated, they can easily be reused across different LLMs or other downstream applications without requiring retraining.

  • Discreteness of Representations: Continuous embeddings are high-dimensional and complex, while quantization transforms these into discrete tokens. These tokens are easier to manipulate and more interpretable, making them ideal for tasks like generating synthetic semantic embeddings that can be seamlessly used by LLMs.


2. Simplified Synthetic Data Generation with Discrete Semantic Tokens

Quantization simplifies the process of generating synthetic embeddings by converting raw audio inputs into discrete tokens that encapsulate their semantic meaning. This process involves two key components (a code sketch of the codebook lookup follows the list):

  • Codebook as a Semantic Vocabulary: A pre-trained codebook is used to map audio features into a finite set of discrete semantic tokens. Each token corresponds to a meaningful representation (e.g., "speech tone," "emotion," or "phoneme cluster") and provides a compact and interpretable way of encoding the underlying semantics.

  • Autoregressive or Generative Models: Generative models trained on these discrete tokens can predict token sequences from audio data. This enables the generation of high-quality synthetic embeddings that capture the semantics of the audio in a consistent and interpretable manner.
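As a concrete illustration of the codebook lookup described above, here is a minimal sketch of nearest-neighbour vector quantization in PyTorch; the names are illustrative and this is not the exact WhisperVQ implementation.

```python
import torch

def quantize_embeddings(embeddings, codebook):
    """Map continuous embeddings to their nearest codebook entries.

    embeddings: (num_frames, dim) continuous encoder outputs (e.g., from Whisper)
    codebook:   (num_codes, dim)  learned "semantic vocabulary"
    Returns (token_ids, quantized): discrete indices and their codebook vectors.
    """
    distances = torch.cdist(embeddings, codebook)  # (num_frames, num_codes) pairwise distances
    token_ids = distances.argmin(dim=-1)           # nearest code per frame
    quantized = codebook[token_ids]                # discrete semantic representation
    return token_ids, quantized
```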


Challenges with Continuous Embeddings

When using continuous embeddings to represent audio semantics, several challenges arise:

  1. Artifacts in Generated Data: Continuous embeddings often contain noise or irrelevant features, as they encode both essential semantic content and extraneous details. This can lead to artifacts or inaccuracies when these embeddings are used as input for LLMs.

  2. Difficulty in Control and Interpretability: Continuous embeddings are not inherently interpretable, making it hard to manipulate specific semantic aspects (e.g., isolating emotion from speech). This lack of control can result in synthetic embeddings that fail to capture the desired meaning accurately.

  3. Scalability Issues: Continuous embeddings require significant computational resources for processing and storage due to their high dimensionality, making them less scalable compared to discrete representations.


Practical Example: Using Audio Data to Generate Semantic Embeddings for LLMs

To better understand the benefits of quantization, consider the task of generating semantic embeddings from audio inputs (such as speech recordings) to be used as input for an LLM. Below, we compare two approaches: using continuous embeddings versus discrete tokens derived from a codebook.

Continuous Embeddings Approach

  1. A speech recording is processed to generate a continuous embedding—a high-dimensional vector representing features such as phonemes, pitch, tone, and emotion.
  2. While this embedding captures detailed information about the speech, it also includes extraneous noise and irrelevant features. For example, it might encode microphone characteristics or background sound.
  3. Feeding this continuous embedding directly into an LLM can result in inconsistencies or inaccuracies in semantic understanding due to the noisy and high-dimensional nature of the data.

Quantization with a Codebook

  1. The speech recording is processed through a quantizer that maps the audio features to discrete tokens based on a pre-trained codebook.
    • For example, "high-pitched phoneme," "neutral tone," and "positive sentiment" might be represented as discrete tokens like [P1], [T1], and [S1].
  2. These tokens are compact, interpretable, and noise-free representations of the audio's semantic content.
  3. The discrete tokens are then fed into an LLM as input, ensuring that the model focuses only on the core semantics of the audio without being distracted by irrelevant details.
  4. Additionally, synthetic embeddings can be generated by sampling token sequences from the codebook using a generative model. These synthetic embeddings can be used to augment training data or for other downstream tasks.

For example:

  • Imagine generating synthetic responses for a voice assistant like Alexa or Siri based on user speech input. Using continuous embeddings might result in responses that are overly sensitive to background noise or unintended emotional tones in the input. By contrast, using quantized discrete tokens ensures that only the core semantic intent (e.g., "requesting weather update") is captured and processed accurately by the LLM.
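Building on steps 3–4 above, the sketch below shows one way the discrete token ids could be rendered as special text tokens for an LLM prompt. The `<|sound_XXXX|>` naming and the helper are illustrative assumptions, not the actual Ichigo token format.

```python
def tokens_to_prompt(token_ids, instruction="Transcribe the following audio:"):
    """Render discrete semantic token ids as text an LLM tokenizer can ingest.

    The <|sound_XXXX|> format is a hypothetical special-token scheme; the real
    system may register different special tokens with the LLM tokenizer.
    """
    sound_tokens = "".join(f"<|sound_{i:04d}|>" for i in token_ids)
    return f"{instruction} {sound_tokens}"

# Example: ids [17, 512, 903] become
# "Transcribe the following audio: <|sound_0017|><|sound_0512|><|sound_0903|>"
```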

Summary: Why Quantization Matters for Audio-to-Semantics Conversion

Quantization provides clear advantages over continuous embeddings when generating semantic representations from audio data to be used as input for LLMs:

  1. Transferability and Reusability: Quantized embeddings are model-independent and can be seamlessly reused across different applications or LLMs.

  2. Improved Data Quality: Discrete tokens eliminate noise and artifacts inherent in continuous embeddings, ensuring more accurate and consistent semantic representations.

  3. Enhanced Interpretability and Control: The discrete nature of quantized tokens allows for easier manipulation and understanding of semantic content.

  4. Scalability: Discrete tokens require less computational overhead compared to high-dimensional continuous embeddings, making them more scalable for large-scale applications.

  5. High-Quality Synthetic Embedding Generation: By leveraging a pre-trained codebook and generative models, quantization enables the generation of synthetic semantic embeddings that are highly interpretable and aligned with desired meanings.

By adopting quantization in audio-to-semantics workflows, practitioners can ensure that LLMs receive cleaner, more interpretable inputs while also benefiting from enhanced scalability and control in synthetic data generation tasks.

tuanlda78202 (Contributor) commented Nov 27, 2024

Uploaded the refactored quantizer training pipeline for easier maintenance to the WhisperSpeech fork: janhq/WhisperSpeech

cc @tikikun @bachvudinh

tikikun (Collaborator) commented Dec 2, 2024

Pressing Issues

Some issues with the current training test:

  • Random Initialization: We should not randomly initialize the weights; instead, start from the current quantizer and expand its codebook with an averaged-weight initialization strategy (see the sketch at the end of this comment).
  • Learning Rate: The current learning rate is clearly not optimal.

@tuanlda78202 please make the amendments accordingly.
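A minimal sketch of one way to read the averaged-weight expansion: keep the 512 trained codes and initialize the new entries from their mean plus small noise. The function name and noise scale are assumptions, not the exact ichigo-quantizer change.

```python
import torch

def expand_codebook(old_codebook, new_size=1024, noise_scale=1e-3):
    """Grow a trained codebook instead of re-initializing it randomly.

    old_codebook: (old_size, dim) weights from the existing 512-code quantizer.
    New entries start from the mean of the trained codes plus small noise.
    """
    old_size, dim = old_codebook.shape
    new_codebook = torch.empty(new_size, dim, dtype=old_codebook.dtype)
    new_codebook[:old_size] = old_codebook                 # keep the trained codes
    mean = old_codebook.mean(dim=0, keepdim=True)
    extra = new_size - old_size
    new_codebook[old_size:] = mean + noise_scale * torch.randn(extra, dim)
    return new_codebook
```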

PodsAreAllYouNeed commented Dec 2, 2024

> Pressing Issues
>
> Some issues with the current training test:
>
> • Random Initialization: We should not randomly initialize the weights; instead, start from the current quantizer and expand its codebook with an averaged-weight initialization strategy.
> • Learning Rate: The current learning rate is clearly not optimal.
>
> @tuanlda78202 please make the amendments accordingly.

For new datasets we are adding to training, it would be good practice to check the baseline, i.e., what is the original WER of Whisper on Bud-500 without any quantization?

If the original WER is, say, 30 without any intervention from us, then a WER of 31.7 is great; the problem is Whisper itself, which we cannot fix, and no amount of tuning on the quantizer side will let us fix it.
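A minimal sketch of that baseline check, assuming Bud-500 is available as a Hugging Face dataset with `audio` and `transcription` columns; the dataset id, split, column names, and Whisper checkpoint are assumptions to adjust to the actual setup.

```python
import jiwer
from datasets import load_dataset
from transformers import pipeline

# Whisper checkpoint and dataset id are assumptions; swap in the real ones.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-medium")
ds = load_dataset("linhtran92/viet_bud500", split="test")

refs, hyps = [], []
for sample in ds.select(range(100)):  # small subset for a quick baseline estimate
    audio = sample["audio"]
    out = asr({"array": audio["array"], "sampling_rate": audio["sampling_rate"]})
    hyps.append(out["text"].lower().strip())
    refs.append(sample["transcription"].lower().strip())

print("Baseline WER (no quantization):", jiwer.wer(refs, hyps))
```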

tikikun (Collaborator) commented Dec 3, 2024

Update on yesterday's issues in the code:

  • @tuanlda78202 had a typo in the code; it was found and fixed. Training will start today.

hahuyhoang411 (Contributor, Author) commented:

Note on some current experiments: KL loss is really high (around 10).

github-project-automation bot moved this to Investigating in Jan & Cortex on Dec 11, 2024
tikikun transferred this issue from janhq/WhisperSpeech on Dec 11, 2024