# planning: Train our own Quantizer for multilingual speech to work with Whisper Encoder #146
WhisperSpeech operates in two stages: a "reading" stage that converts speech into discrete semantic tokens, and a synthesis stage that turns those semantic tokens back into audio.
For the "reading" stage, we leverage the off-the-shelf Whisper model, while training a quantizer on the continuous output embeddings. This pipeline includes dataset preparation, model training, and evaluation. Evaluation Strategy:
cc @bachvudinh
Also check the proportions used when mixing the datasets. cc @tuanlda78202
Checked the codebase that prepares training data for the WhisperSpeech quantizer training pipeline and found many bugs (outdated library versions, missing functions, etc.). Finished fixing the dataset-preparation bugs and keeping it in sync with the training pipeline; will test with MLS in the next few days.
I just changed the WhisperVQ codebase to support HF Datasets directly (wrapped by WebDataset) and tested training with some data samples. We will refactor the WhisperVQ codebase for better control and set up training later.
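Since the fork is not public yet, the following is only a guess at what "supporting HF Datasets directly" might look like: a hedged sketch that streams an HF audio dataset into a training loop. The dataset name and field names are placeholders.

```python
# Sketch of feeding an HF audio dataset straight into the training loop via
# streaming; the dataset name below is a placeholder example (MLS German).
from datasets import Audio, load_dataset

ds = load_dataset("facebook/multilingual_librispeech", "german",
                  split="train", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

for sample in ds.take(2):
    audio = sample["audio"]["array"]       # 1-D float array at 16 kHz
    text = sample.get("transcript", "")    # text field name varies per dataset
    print(audio.shape, text[:40])
```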
Would love a link to your modified repo (the current codebase is locked). @tuanlda78202
## Quantization in Generating Synthetic Semantic Embeddings for LLMs

This document outlines the benefits of using quantization, specifically Vector Quantization (VQ), for generating synthetic semantic embeddings in text-to-semantics models. The focus is on scenarios where audio data is used as input to generate embeddings for downstream tasks, such as feeding audio-derived semantics into large language models (LLMs). Quantization offers a powerful alternative to continuous embeddings, providing improved control, interpretability, and scalability.

### Advantages of Quantization in Generating Semantic Embeddings

#### 1. Transferability and Discreteness

Quantization offers two critical advantages that make it highly suitable for generating semantic embeddings:

- Transferability: discrete tokens are indices into a shared codebook, so they can be passed between models and pipelines without carrying over a model-specific continuous vector space.
- Discreteness: each token is one of a finite set of codebook entries, which makes representations compact, interpretable, and easy to manipulate like ordinary text tokens.
#### 2. Simplified Synthetic Data Generation with Discrete Semantic Tokens

Quantization simplifies the process of generating synthetic embeddings by converting raw audio inputs into discrete tokens that encapsulate their semantic meaning. This process involves two key components:

- An encoder (here, the Whisper encoder) that maps raw audio into continuous embeddings.
- A learned codebook that maps each continuous embedding to its nearest entry, yielding a sequence of discrete semantic tokens.
### Challenges with Continuous Embeddings

When using continuous embeddings to represent audio semantics, several challenges arise:

- High dimensionality: the vectors are costly to store, transmit, and feed to downstream models.
- Model specificity: embeddings are tied to the encoder that produced them and transfer poorly across models.
- Low interpretability: individual dimensions carry no human-readable meaning.
- Noise sensitivity: small perturbations in the audio yield different embeddings for the same underlying semantics.
### Practical Example: Using Audio Data to Generate Semantic Embeddings for LLMs

To better understand the benefits of quantization, consider the task of generating semantic embeddings from audio inputs (such as speech recordings) to be used as input for an LLM. Below, we compare two approaches: continuous embeddings versus discrete tokens derived from a codebook.

#### Continuous Embeddings Approach

The encoder's raw floating-point vectors are passed to the LLM directly, which ties the LLM to that specific encoder's embedding space and makes the inputs opaque and expensive to store.
#### Quantization with a Codebook

Each continuous embedding is snapped to its nearest codebook entry, and only that entry's index (a discrete token) is passed downstream.
For example:
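A minimal, self-contained sketch of the codebook lookup (the codebook here is random and the sizes are illustrative, not the trained WhisperVQ weights):

```python
# Illustrative only: quantize continuous embeddings against a codebook
# by nearest-neighbor lookup; 512 entries of dimension 512 are assumptions.
import torch

codebook = torch.randn(512, 512)          # (num_entries, embedding_dim)
embeddings = torch.randn(1500, 512)       # continuous encoder output frames

# L2 distance from every frame to every codebook entry.
dists = torch.cdist(embeddings, codebook)
tokens = dists.argmin(dim=-1)             # discrete semantic token ids

print(tokens[:10])  # compact, reusable integer ids instead of float vectors
```

The LLM then consumes these integer ids like ordinary vocabulary tokens rather than raw floating-point vectors.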
### Summary: Why Quantization Matters for Audio-to-Semantics Conversion

Quantization provides clear advantages over continuous embeddings when generating semantic representations from audio data for LLMs: discrete tokens are transferable across models, interpretable as codebook entries, cheap to store and transmit, and straightforward to use in synthetic data generation.
By adopting quantization in audio-to-semantics workflows, practitioners can ensure that LLMs receive cleaner, more interpretable inputs while also benefiting from enhanced scalability and control in synthetic data generation tasks.
Uploaded the refactored quantizer training pipeline for easier maintenance to the WhisperSpeech fork: janhq/WhisperSpeech
For the evaluation we can utilize code from HF: |
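For instance, a minimal sketch with the `evaluate` library's WER metric (the transcripts below are placeholders; real predictions would come from decoding with and without the quantizer bottleneck):

```python
# Sketch of WER evaluation with Hugging Face's `evaluate` library.
# `predictions` would come from decoding with/without the quantizer bottleneck.
import evaluate

wer_metric = evaluate.load("wer")

references = ["the cat sat on the mat"]    # ground-truth transcripts
predictions = ["the cat sat on a mat"]     # model transcripts (placeholder)

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.3f}")  # 0.167 for this toy pair (1 error / 6 words)
```

Running the same metric on plain Whisper output gives the baseline to compare the quantized pipeline against.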
## Pressing Issues

Some issues with the current training test:
@tuanlda78202 please make the amendments accordingly.
For new datasets we are adding to the training, it would be good practice to check the baseline, i.e., what is the original WER of Whisper on Bud-500 without any quantization? If the original WER is around 30 without any intervention from us, then a WER of 31.7 is great, and the problem is Whisper itself, which we cannot fix; no amount of tuning on the quantizer side will fix it.
Update on yesterday's issues in the code:
Note on some current experiments: the KL loss is really high (10).
## Goal
Experiment with the WhisperVQ model to get better results on multilingual speech. Hypothesis: the current codebook has only 512 entries, which is a small space in which to compress multilingual capability.
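For intuition: a codebook with K entries carries at most log2(K) bits per token, so 512 entries give 9 bits/token while 2048 would give 11. Below is a toy, plain-PyTorch VQ layer with a configurable codebook size; it is a sketch for experimentation, not the actual WhisperVQ implementation.

```python
# Toy VQ layer with configurable codebook size, to probe the hypothesis that
# 512 entries are too few for multilingual speech. Not the WhisperVQ code.
import torch
import torch.nn as nn

class ToyVQ(nn.Module):
    def __init__(self, codebook_size: int = 2048, dim: int = 512):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, x: torch.Tensor):
        # x: (frames, dim) continuous embeddings from the encoder.
        dists = torch.cdist(x, self.codebook.weight)
        ids = dists.argmin(dim=-1)                   # discrete tokens
        quantized = self.codebook(ids)
        # Straight-through estimator so gradients reach the encoder.
        quantized = x + (quantized - x).detach()
        return quantized, ids

vq = ToyVQ(codebook_size=2048)  # 2048 entries = 11 bits/token vs 9 for 512
q, ids = vq(torch.randn(1500, 512))
print(q.shape, ids.shape)
```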
## Learning Goals

## Tasklist

## Experiments