Support longer audio contexts #110
base: main
Conversation
Thanks Pat.
Your approach is valid, but I had a slightly different take on how we should approach this so that it can later allow for multiple (zero or more) audio tracks per sample.
Let's meet if my explanation is vague.
Results from eval on long audio contexts (ASR measured by WER, translation measured by BLEU): https://wandb.ai/fixie/ultravox/runs/0ws1m9us/overview
[Result screenshots: Spanish to English; Spanish to English (split in the time domain instead of log mel)]
Both ASR and translation on long audio perform comparably to the benchmarks.
Interesting. Do you have a sample output dataset I could take a look at?
Yeah, take a look here for the original dataset: https://huggingface.co/datasets/fixie-ai/covost2_long_audio and here for the output results: https://wandb.ai/fixie/ultravox/runs/2a27bgqx/files/output
Hmm, I listened to a few clips and I wonder if the merging is the right way to do this. The audio clips tend to be fairly different, with their own speaking rate and volume level, and the combined audio often just doesn't make much sense (e.g., ambiguous pronoun resolution), which perhaps explains the hit to the es-en ASR metrics. I wonder if we could use something like LibriSpeech and stitch some of the segmented clips back together so they are coherent. Or maybe just find an ASR dataset with longer samples - we probably don't need a ton of data here.
Yes, some speech datasets, like LibriSpeech and GigaSpeech, have meta-information that supports this. I somewhat prefer stitching segments back together over selecting longer samples, because it gives you a comparable (segmented) baseline for the former approach. We could find similar datasets for speech translation too. The synthesized data is noisier but also harder: if the model performs at the human level, there shouldn't be any significant difference between transcribing/translating one segment at a time or five segments at a time. The approach you suggested would produce more realistic datasets and could provide a measure of how the model benefits from longer context. I think both are valid and measure different aspects of the model.
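To make the stitching idea concrete, here is a minimal sketch of how consecutive LibriSpeech utterances from the same chapter could be concatenated back into longer, coherent samples. It assumes the Hugging Face librispeech_asr dataset layout (16 kHz audio arrays with speaker_id, chapter_id, and id fields); the 60-second target and the helper names are illustrative, not part of this PR.

import itertools

import numpy as np
from datasets import load_dataset

TARGET_SECONDS = 60  # illustrative target length for a stitched sample
SAMPLE_RATE = 16_000

# LibriSpeech utterance ids look like "1272-128104-0000" (speaker-chapter-index),
# so a string sort keeps utterances of the same chapter adjacent and in order.
ds = load_dataset("librispeech_asr", "clean", split="validation")
rows = sorted(ds, key=lambda r: r["id"])

def stitch(chapter_rows, target_seconds=TARGET_SECONDS):
    # Concatenate consecutive utterances until the target duration is reached.
    audio, texts = [], []
    for row in chapter_rows:
        audio.append(row["audio"]["array"])
        texts.append(row["text"])
        if sum(len(a) for a in audio) >= target_seconds * SAMPLE_RATE:
            yield np.concatenate(audio), " ".join(texts)
            audio, texts = [], []
    if audio:  # leftover shorter than the target
        yield np.concatenate(audio), " ".join(texts)

for _, chapter in itertools.groupby(rows, key=lambda r: (r["speaker_id"], r["chapter_id"])):
    for waveform, transcript in stitch(chapter):
        pass  # write the long sample out, e.g. to a new HF dataset

Because the stitched transcript is just the concatenation of the original segment transcripts, the same data can also be evaluated segment-by-segment as a baseline.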
OK, I can get behind that. I still think this warrants further investigation though:
It seems odd that combining 5x is much better than combining 10x, which suggests a potential problem in the merging. Also, did you run any numbers for the duplicate (rather than merge) approach?
Yeah, I ran some evals for the duplicate approach, and what ended up happening was that it would only transcribe the first utterance, so the WER was really high (around 0.82).
Hmm, that's kind of surprising. Perhaps the encoder hidden states end up the same when repeated audio is used?
ultravox/inference/infer_test.py (outdated)
@@ -60,6 +60,7 @@ def fake_generate(**kwargs):
     )
     self.model.device = "cpu"
     self.model.generate = mock.MagicMock(side_effect=fake_generate)
+    self.model.audio_tower_context_length = None

EXPECTED_TOKEN_IDS_START = [128000, 128006, 882, 128007]
Suggest adding a new test to this file to demonstrate the long-audio-context handling in the processor.
The Whisper encoder has a max context of 30 seconds of audio.
This PR enables our model to support longer contexts by splitting long audio into 30-second chunks (the last chunk may be shorter).
Note: this should also work in batch mode.
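For reference, a minimal sketch of the chunking idea described above (not the PR's actual processor code): split the waveform into 30-second pieces, run each through the Whisper encoder, and concatenate the hidden states along the time axis before they are projected into the LLM. The openai/whisper-small checkpoint and the helper name are assumptions for illustration; the feature extractor pads the final, shorter chunk to a full 30-second window.

import torch
from transformers import WhisperFeatureExtractor, WhisperModel

SAMPLE_RATE = 16_000
CHUNK_SAMPLES = 30 * SAMPLE_RATE  # Whisper's fixed 30 s context

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
encoder = WhisperModel.from_pretrained("openai/whisper-small").encoder

def encode_long_audio(waveform: torch.Tensor) -> torch.Tensor:
    # Split into 30 s chunks; the last chunk may be shorter and is padded
    # to a full window of log-mel features by the feature extractor.
    chunks = torch.split(waveform, CHUNK_SAMPLES)
    hidden_states = []
    with torch.no_grad():
        for chunk in chunks:
            features = feature_extractor(
                chunk.numpy(), sampling_rate=SAMPLE_RATE, return_tensors="pt"
            ).input_features
            hidden_states.append(encoder(features).last_hidden_state)
    # Concatenate chunk embeddings along the time axis before the projector/LLM.
    return torch.cat(hidden_states, dim=1)

# e.g. 75 s of audio -> three chunks (30 s, 30 s, 15 s) -> one long embedding
embeddings = encode_long_audio(torch.randn(75 * SAMPLE_RATE))

The per-chunk encoder calls could also be stacked into a single batch instead of a loop, which is why batch-mode inference should work the same way.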