
Support longer audio contexts #110

Open · liPatrick wants to merge 9 commits into main
Conversation

@liPatrick (Contributor) commented Sep 11, 2024:

The Whisper encoder has a max context of 30 seconds of audio.

This PR enables our model to support longer contexts by splitting long audio into 30-second chunks (the last chunk may be shorter).

Note: this should also work in batch mode.
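
For reference, a minimal sketch of the chunking idea (hypothetical names; not the actual PR implementation):

```python
import numpy as np

SAMPLE_RATE = 16_000        # Whisper operates on 16 kHz audio
MAX_CONTEXT_SECS = 30       # encoder max context
CHUNK_SAMPLES = SAMPLE_RATE * MAX_CONTEXT_SECS

def split_audio(waveform: np.ndarray) -> list[np.ndarray]:
    """Split a 1-D waveform into consecutive chunks of at most 30 s.

    The last chunk may be shorter than 30 s, matching the behavior
    described above.
    """
    return [
        waveform[start : start + CHUNK_SAMPLES]
        for start in range(0, len(waveform), CHUNK_SAMPLES)
    ]
```

Each chunk would then be encoded independently and the resulting embeddings concatenated along the sequence dimension, which is also what makes batch mode straightforward: every chunk is a fixed-size encoder input.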

@farzadab (Contributor) left a comment:
Thanks Pat.

Your approach is valid, but I had a slightly different take on how we'd want to approach this, so that it would later allow for multiple (zero or more) audio tracks per sample.

Let's meet if the explanation I gave is vague.

@liPatrick (Contributor, Author) commented Sep 17, 2024:

Results from evals on long audio contexts:

Note: ASR is measured by WER and translation by BLEU. combine-n means we concatenate n samples into one.
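
For concreteness, a minimal sketch of how a combine-n sample can be built (hypothetical helper; not the actual dataset-generation code):

```python
import numpy as np

def combine_n(clips: list[np.ndarray], transcripts: list[str], n: int):
    """Concatenate n audio clips into one sample and join their transcripts."""
    audio = np.concatenate(clips[:n])
    text = " ".join(transcripts[:n])
    return audio, text
```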

English to Chinese (https://wandb.ai/fixie/ultravox/runs/0ws1m9us/overview):
eval/covost2_long_audio-asr-combine-5-en_zh-CN.2k-asr:0.23348530515869995
eval/covost2_long_audio-asr-combine-10-en_zh-CN.2k-asr:0.23519305542996655
eval/covost2_long_audio-translate-combine-5-en_zh-CN.2k-bleu:17.7327965079455
eval/covost2_long_audio-translate-combine-10-en_zh-CN.2k-bleu:17.980528222436593
eval/covost2-asr-en_zh-CN.2k-asr:0.18523806908671536
eval/covost2-translate-en_zh-CN.2k-bleu:24.688760684909365

Spanish to English (https://wandb.ai/fixie/ultravox/runs/2a27bgqx/overview):
eval/covost2_long_audio-asr-combine-5-es_en.2k-asr:0.1345223909283106
eval/covost2_long_audio-asr-combine-10-es_en.2k-asr:0.20847972323659428
eval/covost2_long_audio-translate-combine-5-es_en.2k-bleu:34.29590491442046
eval/covost2_long_audio-translate-combine-10-es_en.2k-bleu:34.08795842954878
eval/covost2-asr-es_en.2k-asr:0.14595425715933116
eval/covost2-translate-es_en.2k-bleu:36.91120944829759

Spanish to English (split in the time domain instead of the log-mel domain):
eval/covost2_long_audio-asr-combine-5-es_en.2k-asr:0.13383817028637324
eval/covost2_long_audio-asr-combine-10-es_en.2k-asr:0.20998654622333268
eval/covost2_long_audio-translate-combine-5-es_en.2k-bleu:34.3932654555113
eval/covost2_long_audio-translate-combine-10-es_en.2k-bleu:34.02819219754704
eval/covost2-asr-es_en.2k-asr:0.12854122621564482
eval/covost2-translate-es_en.2k-bleu:36.75934943165827

Both ASR and translation on long audio perform comparably to the baselines.

@juberti (Contributor) commented Sep 18, 2024:

Interesting. Do you have a sample output dataset I could take a look at?

@liPatrick (Contributor, Author) commented Sep 18, 2024:

> Interesting. Do you have a sample output dataset I could take a look at?

Yeah, take a look here for the original dataset: https://huggingface.co/datasets/fixie-ai/covost2_long_audio

And here for the output results: https://wandb.ai/fixie/ultravox/runs/2a27bgqx/files/output

@juberti (Contributor) commented Sep 18, 2024:

Hmm, I listened to a few clips and I wonder if merging is the right way to do this. The clips tend to be fairly different, each with its own speaking rate and volume level, and the combined audio often just doesn't make much sense (e.g., ambiguous pronoun resolution), which perhaps explains the hit to the es-en ASR metrics.

I wonder if we could use something like LibriSpeech and stitch some of the segmented clips back together so they are coherent. Or maybe just find an ASR dataset with longer samples; we probably don't need a ton of data here.

@zqhuang211 (Contributor) commented:

> I wonder if we could use something like LibriSpeech and stitch some of the segmented clips back together so they are coherent. Or maybe just find an ASR dataset with longer samples; we probably don't need a ton of data here.

Yes, some speech datasets, like LibriSpeech and GigaSpeech, have meta-information that supports this. I'd prefer stitching segments back together over selecting longer samples, so that the former approach has a comparable (segmented) baseline. We could find similar datasets for speech translation too.

The synthesized data is noisier but also harder. If the model can perform at the human level, there shouldn't be any significant difference in performance whether it transcribes/translates one segment at a time or five segments at a time.

The approach you suggested would produce more realistic datasets and hopefully provide a measure of how much the model benefits from longer context.

I think both are valid and measure different aspects of the model.
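
To make the stitching idea concrete: LibriSpeech utterance IDs follow the pattern {speaker}-{chapter}-{utterance}, so consecutive segments from one chapter can be re-joined into a coherent longer clip. A rough sketch under that assumption (helper names and the sample schema are hypothetical):

```python
from collections import defaultdict
import numpy as np

def stitch_chapters(samples):
    """Group LibriSpeech utterances by (speaker, chapter) and re-join them.

    `samples` is assumed to be an iterable of dicts with keys
    "id" (e.g. "1089-134686-0003"), "audio" (np.ndarray), and "text".
    """
    chapters = defaultdict(list)
    for s in samples:
        speaker, chapter, utt = s["id"].split("-")
        chapters[(speaker, chapter)].append((int(utt), s))

    stitched = []
    for (speaker, chapter), utts in chapters.items():
        utts.sort(key=lambda x: x[0])  # restore original utterance order
        audio = np.concatenate([s["audio"] for _, s in utts])
        text = " ".join(s["text"] for _, s in utts)
        stitched.append({"audio": audio, "text": text})
    return stitched
```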

@juberti (Contributor) commented Sep 18, 2024:

OK, I can get behind that. I still think this warrants further investigation though:

eval/covost2-asr-es_en.2k-asr:0.14595425715933116
eval/covost2_long_audio-asr-combine-5-es_en.2k-asr:0.1345223909283106
eval/covost2_long_audio-asr-combine-10-es_en.2k-asr:0.20847972323659428

It seems odd that combining 5x is so much better than combining 10x; this suggests a potential problem in the merging.

Also, did you run any numbers for the duplicate (rather than merge) approach?

@liPatrick (Contributor, Author) commented:

Yeah, I ran some evals for the duplicate approach, and what ended up happening was that it would only transcribe the first utterance, so the WER was really high (around 0.82).

@juberti (Contributor) commented Sep 18, 2024:

Hmm, that's kind of surprising. Perhaps the encoder hidden states end up the same when repeated audio is used?
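
One quick way to probe that hypothesis (a sketch using an off-the-shelf HF Whisper checkpoint as a stand-in; the exact encoder ultravox uses may differ):

```python
import numpy as np
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
model = WhisperModel.from_pretrained("openai/whisper-small")

# A stand-in 30 s clip; with per-chunk encoding, a duplicated clip
# produces two identical sets of encoder hidden states.
clip = np.random.randn(16_000 * 30).astype(np.float32)
feats = fe(clip, sampling_rate=16_000, return_tensors="pt").input_features

with torch.no_grad():
    h1 = model.encoder(feats).last_hidden_state
    h2 = model.encoder(feats).last_hidden_state

# True: the encoder is deterministic, so repeated audio yields
# repeated embeddings for the LLM to attend over.
print(torch.allclose(h1, h2))
```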

```diff
@@ -60,6 +60,7 @@ def fake_generate(**kwargs):
         )
         self.model.device = "cpu"
         self.model.generate = mock.MagicMock(side_effect=fake_generate)
+        self.model.audio_tower_context_length = None


 EXPECTED_TOKEN_IDS_START = [128000, 128006, 882, 128007]
```
A Contributor commented on this diff:
Suggest adding a new test to this file to demonstrate the long-audio-context handling in the processor.
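
Something along these lines might work (a hypothetical sketch; the attribute name comes from the diff above, while `self.processor`, the `audio` keyword, and the `audio_values` output shape are all guesses about the real API):

```python
# Assumes `numpy as np` is imported at module scope in the test file.
def test_long_audio_is_chunked(self):
    # 45 s of 16 kHz audio exceeds the 30 s encoder context, so the
    # processor should split it into two chunks.
    long_audio = np.zeros(16_000 * 45, dtype=np.float32)
    out = self.processor(
        text="<|audio|>", audio=long_audio, sampling_rate=16_000
    )
    self.assertEqual(out["audio_values"].shape[0], 2)
```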
