Releases · janhq/ichigo
Ichigo Whisper v0.1
Ichigo Whisper v0.1 Release Notes:
- Introducing Ichigo Whisper v0.1
We are thrilled to announce our very first speech tokenizer, built upon the Whisper-medium model!
Ichigo Whisper is a lightweight (22M parameters), open-source speech tokenizer designed to optimize multilingual performance while maintaining strong English capabilities. Unlike continuous embedding models, Ichigo Whisper compresses speech into discrete tokens, enabling seamless integration with large language models (LLMs) for advanced speech understanding.
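To make "discrete tokens" concrete, the sketch below shows the core vector-quantization step: each continuous encoder frame is replaced by the index of its nearest codebook vector. This is a minimal illustration, not the Ichigo Whisper API; the `quantize` helper, the tensor shapes, and the randomly initialized codebook are assumptions, with the codebook size taken from the benchmark tables below.

```python
import torch

# Minimal vector-quantization sketch; illustrative only, not the Ichigo Whisper API.
CODEBOOK_SIZE = 2561   # codebook size reported in the benchmark tables below
EMBED_DIM = 1024       # Whisper-medium encoder hidden size

# A (here randomly initialized) codebook: one row per discrete speech token.
codebook = torch.randn(CODEBOOK_SIZE, EMBED_DIM)

def quantize(encoder_frames: torch.Tensor) -> torch.Tensor:
    """Map continuous encoder frames (T, D) to discrete token ids (T,)
    via nearest-neighbour lookup in the codebook."""
    distances = torch.cdist(encoder_frames, codebook)  # (T, CODEBOOK_SIZE)
    return distances.argmin(dim=-1)

# Example: 150 encoder frames become 150 sound-token ids an LLM can consume.
frames = torch.randn(150, EMBED_DIM)
sound_tokens = quantize(frames)
print(sound_tokens.shape, int(sound_tokens.min()), int(sound_tokens.max()))
```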
Performance Highlights:
1. Vietnamese
| Model | Codebook Size | Test Dataset | Test Samples | WER |
|---|---|---|---|---|
| Ichigo Whisper | 2561 | viVoice | 1000 | 11.36 |
| Whisper Medium | - | viVoice | 1000 | 18.64 |
2. English
| Model | Codebook Size | Test Dataset | Test Samples | WER |
|---|---|---|---|---|
| Ichigo Whisper | 2561 | LibriTTS-R | 1000 | 12.96 |
| Whisper Medium | - | LibriTTS-R | 1000 | 12.99 |
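For reference, the WER figures above are standard word error rates. Below is a minimal sketch of how such a score can be computed with the `jiwer` package; this only illustrates the metric and is not the evaluation script used for these tables.

```python
import jiwer

# Reference transcripts vs. model transcriptions (toy examples).
references = [
    "the quick brown fox jumps over the lazy dog",
    "speech tokenizers compress audio into discrete tokens",
]
hypotheses = [
    "the quick brown fox jumped over the lazy dog",
    "speech tokenizers compress audio into discrete tokens",
]

# WER = (substitutions + deletions + insertions) / reference word count.
error_rate = jiwer.wer(references, hypotheses)
print(f"WER: {error_rate:.2%}")
```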
Resources:
- Model Weights: Hugging Face: Ichigo Whisper v0.1
- Live Demo: Ichigo Whisper Online
v0.4 Ichigo!
Change log for Ichigo v0.4:
- Unified Training Pipeline: Consolidated Phase 2 and Phase 3 into a single-phase training approach.
- Training data enhancements:
  - Migrated speech noise data and speech multi-turn data from Phase 3 into Phase 2.
  - Introduced noise-augmented multi-turn conversations, synthesized by injecting noise turns into speech and text-only multi-turn datasets (see the sketch after this list).
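As a rough illustration of the noise-augmented synthesis above, the sketch below splices a distractor "noise" turn (random sound tokens) plus a short rejection reply into an existing conversation. The `<|sound_N|>` token format, the helper names, and the reply texts are assumptions for illustration, not the actual data pipeline.

```python
import random

# Illustrative sketch of noise-turn injection; not the actual Ichigo pipeline.
NOISE_REPLIES = [
    "I couldn't make that out, could you repeat it?",
    "That sounded like background noise rather than a question.",
]

def random_sound_tokens(length: int, codebook_size: int = 2561) -> str:
    """Fabricate a 'noise' user turn as a string of random sound-token ids."""
    return " ".join(f"<|sound_{random.randrange(codebook_size)}|>" for _ in range(length))

def inject_noise_turn(conversation: list[dict]) -> list[dict]:
    """Insert a noisy user turn plus a rejection reply at a random position
    between existing (user, assistant) pairs."""
    insert_at = random.randrange(0, len(conversation) // 2 + 1) * 2
    noise_pair = [
        {"role": "user", "content": random_sound_tokens(length=random.randint(20, 120))},
        {"role": "assistant", "content": random.choice(NOISE_REPLIES)},
    ]
    return conversation[:insert_at] + noise_pair + conversation[insert_at:]

# Example: a clean one-exchange conversation becomes a noise-augmented two-exchange one.
clean = [
    {"role": "user", "content": "<|sound_12|> <|sound_873|> <|sound_5|>"},
    {"role": "assistant", "content": "Sure, here's a summary of that clip."},
]
print(inject_noise_turn(clean))
```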
Performance Improvements vs v0.3:
- Enhanced Intelligence: Improved benchmark scores on MMLU (64.66).
- Extended Context Handling.
- Advanced Noise Management: More robust rejection of noisy environmental inputs.
- Improved Multi-turn Capabilities.
Model weights: https://huggingface.co/collections/homebrewltd/ichigo-v04-67317bde6dfdfdd55dddbc6e
Live demo at: https://ichigo.homebrew.ltd/
First release of Ichigo!
Model weights can be downloaded at:
Changelog: v0.2 vs v0.3
Overall Comparison
| Phase | Aspect | v0.2 | v0.3 |
|---|---|---|---|
| Pretraining | Data Size | 2.42M | 3.87M |
| | Data Source | parler-tts/mls_eng_10k | facebook/multilingual_librispeech |
| | Data Synthetic Pipeline | Using WhisperVQ (old checkpoint: whisper-vq-stoks-medium-en+pl.model) to tokenize English-only audio. | Using the latest checkpoint whisper-vq-stoks-v3-7lang.model for audio in 8 languages. |
| | Epoch | 1 | 1 |
| | Global batch size | 480 | 480 |
| | Learning Rate | 2e-4 | 2e-4 |
| | Warmup Steps | 80 | 50 |
| | Weight Decay | 0.005 | 0.005 |
| | Max length | 512 | 512 |
| | Precision | bf16 | bf16 |
| Instruction Phase | Data Size | 929K | 1.89M + 165k (phase 3) |
| | Preprocessing | Using rule-based methods to remove hard-to-pronounce prompts. | Using rule-based methods to filter out hard-to-pronounce prompts, and rephrasing certain LLM-generated responses to sound more natural and human-like. |
| | Data Synthetic Pipeline | Using the old text-to-speech checkpoint t2s-small-yt.model to generate audio, then whisper-vq-stoks-medium-en+pl.model to tokenize it. | Changed the t2s checkpoint to t2s-v1.1-small-en+pl.model and the WhisperVQ checkpoint to whisper-vq-stoks-v3-7lang.model (see the sketch after this table). |
| | Epoch | 5 | 1 |
| | Global batch size | 128 | 256 |
| | Gradient Acc Steps per device | 1 | 8 |
| | Learning Rate | 1e-4 | 7e-5, and 1.5e-5 for phase 3 |
| | Warmup Steps | 80 | 73, and 8 for phase 3 |
| | Weight Decay | 0.005 | 0.005 |
| | Max length | 1024 | 4096 |
| | Precision | bf16 | bf16 |
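As a reading aid for the "Data Synthetic Pipeline" rows above, here is a hedged sketch of the two-stage flow they describe: a text prompt is synthesized to audio with the text-to-speech checkpoint, and the audio is then quantized into sound tokens with the WhisperVQ checkpoint. The helper signatures and the sample format below are placeholders, not the repository's actual scripts.

```python
# Hedged sketch of the two-stage synthetic data flow from the table:
# text prompt -> TTS audio -> WhisperVQ sound tokens -> instruction sample.
# The helper callables and sample format are placeholders, not the repo's scripts.
from typing import Callable

def build_instruction_sample(
    prompt_text: str,
    answer_text: str,
    synthesize: Callable[[str], bytes],       # e.g. the t2s-v1.1-small-en+pl.model checkpoint
    tokenize: Callable[[bytes], list[int]],   # e.g. the whisper-vq-stoks-v3-7lang.model checkpoint
) -> dict:
    """Turn a (prompt, answer) text pair into a speech-instruction training sample."""
    audio = synthesize(prompt_text)                 # stage 1: text-to-speech
    sound_ids = tokenize(audio)                     # stage 2: WhisperVQ tokenization
    sound_str = "".join(f"<|sound_{i:04d}|>" for i in sound_ids)
    return {
        "messages": [
            {"role": "user", "content": sound_str},        # spoken prompt as discrete tokens
            {"role": "assistant", "content": answer_text}, # original text answer
        ]
    }
```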
Instruction Phase Data Task Types
| Task Type | v0.2 | v0.3 |
|---|---|---|
| Speech Multi-turn | None | 150k samples (mostly 2 turns; around 10k with >=4 turns) |
| Speech QA | 679k samples | 1.332M samples |
| Transcription | 250k samples (using a special token to denote a transcription task) | 400k samples (using 6 different prompts) |
| Noise Audio | None | 8k samples (using Qwen2.5-72B to generate diverse synthetic answers for randomly generated sound tokens, with lengths matching the distribution of the Speech QA prompts; the length-matched sampling is sketched after this table) |
| Text-only | None | 150k samples, including 100k multi-turn + 50k single-turn |
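The "Noise Audio" row describes single-turn samples whose inputs are random sound tokens with lengths drawn to match the Speech QA prompt length distribution, paired with LLM-generated rejection answers. A minimal sketch of the length-matched sampling follows; the answer-generation call is omitted and the helper names are assumptions.

```python
import random

# Hedged sketch of length-matched noise sampling for the "Noise Audio" row above.
# The rejection answers would come from an LLM (Qwen2.5-72B per the table); this
# only shows how random sound-token inputs can mirror the Speech QA prompt lengths.

def sample_noise_inputs(
    speech_qa_token_lengths: list[int],   # lengths of real Speech QA prompts
    n_samples: int,
    codebook_size: int = 2561,
) -> list[list[int]]:
    """Draw n_samples random sound-token sequences whose lengths follow the
    empirical length distribution of the Speech QA prompts."""
    inputs = []
    for _ in range(n_samples):
        length = random.choice(speech_qa_token_lengths)   # empirical length draw
        inputs.append([random.randrange(codebook_size) for _ in range(length)])
    return inputs

# Example: mimic a tiny length distribution and draw 3 noise inputs.
noise_inputs = sample_noise_inputs([40, 55, 120, 80], n_samples=3)
print([len(seq) for seq in noise_inputs])
```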