Support longer audio contexts #110
base: main
Conversation
Thanks Pat.
Your approach is valid, but I had a slightly different take on how we should approach this so that it can later allow for multiple (zero or more) audio tracks per sample.
Let's meet if my explanation is vague.
Results from eval on long audio contexts (ASR measured by WER, translation measured by BLEU): https://wandb.ai/fixie/ultravox/runs/0ws1m9us/overview
[Result screenshots: Spanish to English; Spanish to English (split in the time domain instead of log mel)]
Both ASR and translation on long audio perform comparably to the benchmarks.
Interesting. Do you have a sample output dataset I could take a look at?
Yeah, take a look here for the original dataset: https://huggingface.co/datasets/fixie-ai/covost2_long_audio and here for the output results: https://wandb.ai/fixie/ultravox/runs/2a27bgqx/files/output
Hmm, I listened to a few clips and I wonder if the merging is the right way to do this. The audio clips tend to be fairly different, with their own speaking rate and volume level, and the combined audio often just doesn't make much sense (e.g., ambiguous pronoun resolution), which perhaps explains the hit to the es-en ASR metrics. I wonder if we could use something like LibriSpeech and stitch some of the segmented clips back together so they are coherent. Or maybe just find an ASR dataset with longer samples - we probably don't need a ton of data here.
Yes, some speech datasets, like LibriSpeech and GigaSpeech, have meta-information that supports this. I somewhat prefer stitching segments back together over selecting longer samples, because it gives you a comparable (segmented) baseline for the former approach. We could find similar datasets for speech translation too. The synthesized data is noisier but also harder: if the model performs at the human level, there shouldn't be any significant difference between transcribing/translating one segment at a time or five segments at a time. The approach you suggested would produce more realistic datasets and could provide a measure of how the model benefits from longer context. I think both are valid and measure different aspects of the model.
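To make the stitching idea concrete, here is a minimal sketch of how consecutive LibriSpeech utterances from the same chapter could be concatenated back into longer, coherent samples. It assumes the Hugging Face librispeech_asr dataset layout (16 kHz audio arrays with speaker_id, chapter_id, and id fields); the 60-second target and the helper names are illustrative, not part of this PR.

import itertools

import numpy as np
from datasets import load_dataset

TARGET_SECONDS = 60  # illustrative target length for a stitched sample
SAMPLE_RATE = 16_000

# LibriSpeech utterance ids look like "1272-128104-0000" (speaker-chapter-index),
# so a string sort keeps utterances of the same chapter adjacent and in order.
ds = load_dataset("librispeech_asr", "clean", split="validation")
rows = sorted(ds, key=lambda r: r["id"])

def stitch(chapter_rows, target_seconds=TARGET_SECONDS):
    # Concatenate consecutive utterances until the target duration is reached.
    audio, texts = [], []
    for row in chapter_rows:
        audio.append(row["audio"]["array"])
        texts.append(row["text"])
        if sum(len(a) for a in audio) >= target_seconds * SAMPLE_RATE:
            yield np.concatenate(audio), " ".join(texts)
            audio, texts = [], []
    if audio:  # leftover shorter than the target
        yield np.concatenate(audio), " ".join(texts)

for _, chapter in itertools.groupby(rows, key=lambda r: (r["speaker_id"], r["chapter_id"])):
    for waveform, transcript in stitch(chapter):
        pass  # write the long sample out, e.g. to a new HF dataset

Because the stitched transcript is just the concatenation of the original segment transcripts, the same data can also be evaluated segment-by-segment as a baseline.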
OK, I can get behind that. I still think this warrants further investigation though:
It seems odd that combining 5x is much better than combining 10x, which suggests a potential problem in the merging. Also, did you run any numbers for the duplicate (rather than merge) approach?
Yeah, I ran some evals for the duplicate approach, and what ended up happening was that it would only transcribe the first utterance, so the WER was really high (around 0.82).
Hmm, that's kind of surprising. Perhaps the encoder hidden states end up the same when repeated audio is used?
ultravox/inference/infer_test.py (outdated)
@@ -60,6 +60,7 @@ def fake_generate(**kwargs):
     )
     self.model.device = "cpu"
     self.model.generate = mock.MagicMock(side_effect=fake_generate)
+    self.model.audio_tower_context_length = None

EXPECTED_TOKEN_IDS_START = [128000, 128006, 882, 128007]
Suggest adding a new test to this file to demonstrate the long-audio-context handling in the processor.
The Whisper encoder has a max context of 30 seconds of audio.
This PR enables our model to support longer contexts by splitting long audio into 30-second chunks (the last chunk may be shorter).
Note: this should also work in batch mode.
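For reference, a minimal sketch of the chunking idea described above (not the PR's actual processor code): split the waveform into 30-second pieces, run each through the Whisper encoder, and concatenate the hidden states along the time axis before they are projected into the LLM. The openai/whisper-small checkpoint and the helper name are assumptions for illustration; the feature extractor pads the final, shorter chunk to a full 30-second window.

import torch
from transformers import WhisperFeatureExtractor, WhisperModel

SAMPLE_RATE = 16_000
CHUNK_SAMPLES = 30 * SAMPLE_RATE  # Whisper's fixed 30 s context

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
encoder = WhisperModel.from_pretrained("openai/whisper-small").encoder

def encode_long_audio(waveform: torch.Tensor) -> torch.Tensor:
    # Split into 30 s chunks; the last chunk may be shorter and is padded
    # to a full window of log-mel features by the feature extractor.
    chunks = torch.split(waveform, CHUNK_SAMPLES)
    hidden_states = []
    with torch.no_grad():
        for chunk in chunks:
            features = feature_extractor(
                chunk.numpy(), sampling_rate=SAMPLE_RATE, return_tensors="pt"
            ).input_features
            hidden_states.append(encoder(features).last_hidden_state)
    # Concatenate chunk embeddings along the time axis before the projector/LLM.
    return torch.cat(hidden_states, dim=1)

# e.g. 75 s of audio -> three chunks (30 s, 30 s, 15 s) -> one long embedding
embeddings = encode_long_audio(torch.randn(75 * SAMPLE_RATE))

The per-chunk encoder calls could also be stacked into a single batch instead of a loop, which is why batch-mode inference should work the same way.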