
Releases: janhq/ichigo

Ichigo Whisper v0.1

30 Dec 13:16
2124598

Ichigo Whisper v0.1 Release Notes:

  • πŸŽ‰ Introducing Ichigo Whisper v0.1
    We are thrilled to announce our very first speech tokenizer built upon the Whisper-medium model!
    Ichigo Whisper is a lightweight (22M parameters), open-source speech tokenizer designed to optimize performance for multilingual speech while maintaining strong English capabilities. Unlike continuous embedding models, Ichigo Whisper compresses speech into discrete tokens, enabling seamless integration with large language models (LLMs) for advanced speech understanding.
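The core idea of "compressing speech into discrete tokens" is vector quantization: each continuous encoder frame is snapped to its nearest entry in a learned codebook, and the entry's index becomes the token. A minimal sketch, assuming a 2561-entry codebook (the size reported below); the embedding dimension and frame count are illustrative, not taken from the model:

```python
# Illustrative VQ tokenization: map continuous frames to discrete codebook indices.
# The codebook size (2561) matches the release notes; dim=64 and 100 frames are
# made-up values for the sketch, and the codebook here is random, not learned.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(2561, 64))   # 2561 learned code vectors (random here)
frames = rng.normal(size=(100, 64))      # continuous encoder output for one clip

# Nearest-neighbour quantization: each frame becomes the index of its closest
# codebook entry, yielding a token sequence an LLM can consume like text tokens.
dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
tokens = dists.argmin(axis=1)            # shape (100,), ints in [0, 2561)
```

In the real model the codebook is trained jointly with the Whisper encoder; this sketch only shows the quantization step that produces the discrete token stream.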

πŸš€ Performance Highlights:

1. Vietnamese

| Model          | Codebook Size | Test Dataset | Test Samples | WER   |
|----------------|---------------|--------------|--------------|-------|
| Ichigo Whisper | 2561          | viVoice      | 1000         | 11.36 |
| Whisper Medium | -             | viVoice      | 1000         | 18.64 |

2. English

| Model          | Codebook Size | Test Dataset | Test Samples | WER   |
|----------------|---------------|--------------|--------------|-------|
| Ichigo Whisper | 2561          | LibriTTS-R   | 1000         | 12.96 |
| Whisper Medium | -             | LibriTTS-R   | 1000         | 12.99 |
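The WER figures above are word error rates: word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. A self-contained sketch of the standard computation:

```python
# Word error rate via word-level Levenshtein distance (dynamic programming).
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

# One deleted word out of six reference words -> WER = 1/6
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER is expressed here as a fraction; the tables above report it as a percentage.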

πŸ”— Resources:

v0.4 πŸ“ Ichigo!

11 Nov 04:22
28e5934

Change log for Ichigo v0.4:

  • Unified Training Pipeline: Consolidated Phase 2 and Phase 3 into a single-phase training approach.
  • Training Data Enhancements:
    • Migrated speech-noise data and speech multi-turn data from Phase 3 into Phase 2.
    • Introduced noise-augmented multi-turn conversations, synthesized by injecting noise turns into the speech and text-only multi-turn datasets.
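A minimal sketch of the noise-augmentation idea: take an existing multi-turn conversation and splice in a synthetic noise user turn paired with a rejection-style assistant reply. The turn schema, the placeholder sound tokens, and the rejection text are all illustrative assumptions, not the repo's actual data format:

```python
# Illustrative noise-turn injection for multi-turn training data.
# Conversation format and rejection reply are assumptions for this sketch.
import random

def inject_noise_turn(conversation, noise_tokens, seed=0):
    """Insert a noise user turn plus a rejection reply at a random even position,
    so the user/assistant alternation of the conversation is preserved."""
    rng = random.Random(seed)
    pos = rng.randrange(0, len(conversation) + 1, 2)
    noise_pair = [
        {"role": "user", "content": noise_tokens},
        {"role": "assistant", "content": "Sorry, I couldn't hear that clearly."},
    ]
    return conversation[:pos] + noise_pair + conversation[pos:]

convo = [
    {"role": "user", "content": "<sound_12><sound_87>"},       # speech turn
    {"role": "assistant", "content": "Hello! How can I help?"},
]
augmented = inject_noise_turn(convo, "<sound_999><sound_404>")
```

Training on such pairs teaches the model to explicitly reject unintelligible audio rather than hallucinate an answer.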

Performance Improvements vs v0.3:

  • Enhanced Intelligence: Improved benchmark scores on MMLU (64.66).
  • Extended Context Handling.
  • Advanced Noise Management: More robust rejection of noisy environmental inputs.
  • Improved Multi-turn Capabilities.

Model weights: https://huggingface.co/collections/homebrewltd/ichigo-v04-67317bde6dfdfdd55dddbc6e
Live demo: https://ichigo.homebrew.ltd/

First release of πŸ“ Ichigo!

05 Nov 07:47
28e5934

Model weight can be downloaded at:

Changelog: v0.2 vs v0.3

Overall Comparison

| Phase       | Aspect | v0.2 | v0.3 |
|-------------|--------|------|------|
| Pretraining | Data Size | 2.42M | 3.87M |
|             | Data Source | parler-tts/mls_eng_10k | facebook/multilingual_librispeech |
|             | Data Synthesis Pipeline | WhisperVQ (old checkpoint: whisper-vq-stoks-medium-en+pl.model) to tokenize English-only audio | Latest checkpoint whisper-vq-stoks-v3-7lang.model for 8-language audio |
|             | Epochs | 1 | 1 |
|             | Global Batch Size | 480 | 480 |
|             | Learning Rate | 2e-4 | 2e-4 |
|             | Warmup Steps | 80 | 50 |
|             | Weight Decay | 0.005 | 0.005 |
|             | Max Length | 512 | 512 |
|             | Precision | bf16 | bf16 |
| Instruction | Data Size | 929K | 1.89M + 165K (phase 3) |
|             | Preprocessing | Rule-based removal of hard-to-pronounce prompts | Rule-based filtering of hard-to-pronounce prompts, plus rephrasing certain LLM-generated responses to sound more natural and human-like |
|             | Data Synthesis Pipeline | Old text-to-speech checkpoint (t2s-small-yt.model) to generate audio, then whisper-vq-stoks-medium-en+pl.model to tokenize it | T2S checkpoint changed to t2s-v1.1-small-en+pl.model and WhisperVQ checkpoint to whisper-vq-stoks-v3-7lang.model |
|             | Epochs | 5 | 1 |
|             | Global Batch Size | 128 | 256 |
|             | Gradient Acc. Steps per Device | 1 | 8 |
|             | Learning Rate | 1e-4 | 7e-5 (1.5e-5 for phase 3) |
|             | Warmup Steps | 80 | 73 (8 for phase 3) |
|             | Weight Decay | 0.005 | 0.005 |
|             | Max Length | 1024 | 4096 |
|             | Precision | bf16 | bf16 |
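The Warmup Steps and Learning Rate rows above imply a schedule that ramps the learning rate up from zero to its peak over the warmup window. A minimal sketch of a linear warmup, assuming the v0.3 instruction-phase values (peak 7e-5, 73 warmup steps); the post-warmup behavior (flat here) is an assumption, since the release notes don't state the decay:

```python
# Linear learning-rate warmup sketch using the v0.3 instruction-phase values.
# What happens after warmup (constant here) is an assumption for illustration.
def lr_at(step: int, peak_lr: float = 7e-5, warmup_steps: int = 73) -> float:
    if step < warmup_steps:
        # Ramp linearly from peak_lr/warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr

print(lr_at(0), lr_at(72), lr_at(500))
```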

Instruction Phase Data Task Types

| Task Type | v0.2 | v0.3 |
|-----------|------|------|
| Speech Multi-turn | None | 150K samples (mostly 2 turns; ~10K with >=4 turns) |
| Speech QA | 679K samples | 1.332M samples |
| Transcription | 250K samples (using a special token to denote the transcription task) | 400K samples (using 6 different prompts) |
| Noise Audio | None | 8K samples (using Qwen2.5-72B to generate diverse synthetic answers for randomly generated sound tokens, with lengths matching the Speech QA prompt distribution) |
| Text-only | None | 150K samples: 100K multi-turn + 50K single-turn |

Performance