Add faster-whisper (ctranslate2) as option for Whisper annotation workflow #1017
Conversation
Would it be possible to combine whisper and faster-whisper into a single CLI/method, and then add faster-whisper as an optional flag (enabled by default)? The internals of the function calls can be kept separate, but from a user's perspective it makes more sense, since they have the same functionality. I think of it as two backends behind the same user-facing wrapper.
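A minimal sketch of that shape, with illustrative names rather than the PR's actual code, might look like this:

```python
# Hypothetical two-backends-one-wrapper design; names and signatures here
# are illustrative only, not the PR's actual code.
def transcribe_recording(audio_path, model_name="base", device="cpu", faster_whisper=True):
    """Return (start, end, text) tuples using either Whisper backend."""
    if faster_whisper:
        from faster_whisper import WhisperModel  # CTranslate2-based backend

        model = WhisperModel(model_name, device=device)
        segments, _info = model.transcribe(audio_path)
        return [(s.start, s.end, s.text) for s in segments]
    else:
        import whisper  # original OpenAI backend

        model = whisper.load_model(model_name, device=device)
        result = model.transcribe(audio_path)
        return [(s["start"], s["end"], s["text"]) for s in result["segments"]]
```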
Force-pushed from 1610802 to 0f5a2e1.
Thanks for the quick initial review! I combined whisper and faster-whisper into a single CLI/method with a flag that selects the backend.
Quick benchmark on mini-librispeech dev-clean-2 (RTX 2080 Ti):
- OpenAI Whisper:
- faster-whisper/CTranslate2:
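For context, a rough way to reproduce such a comparison, assuming both `openai-whisper` and `faster-whisper` are installed (the audio path and model size are placeholders):

```python
import time

import whisper
from faster_whisper import WhisperModel

AUDIO = "dev-clean-2/sample.flac"  # placeholder path into mini-librispeech

t0 = time.time()
whisper.load_model("base").transcribe(AUDIO)
print(f"openai-whisper: {time.time() - t0:.1f}s")

t0 = time.time()
model = WhisperModel("base", device="cuda", compute_type="float16")
segments, _info = model.transcribe(AUDIO)
list(segments)  # segments is a generator; consume it so decoding is actually timed
print(f"faster-whisper: {time.time() - t0:.1f}s")
```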
Thanks for a great contribution; this is super interesting. The code looks good to me. I would love to enable it by default.
I quickly compared the results between the old and new whisper implementations on a 60 s clip from AMI. In that clip, I noticed that faster-whisper tends to skip short, isolated, and noisy utterances such as "Okay" or "Thank you", probably due to VAD (which is OK, I guess). However, the time boundaries seem off when compared to the original implementation; please see the screenshot. Do you think it's possible to fix that? Maybe more accurate information is exposed somewhere in faster-whisper and it's just not being used here? Otherwise a lot of silence/non-speech is included in the supervisions. Note: the top plot is from the original Whisper, and the bottom plot is from faster-whisper.
I ran into a few issues running it; can you make the suggested changes below to fix them?
```diff
@@ -55,6 +56,36 @@ def workflows():
 @click.option(
     "-d", "--device", default="cpu", help="Device on which to run the inference."
 )
 @click.option(
```
Please change to:
```python
@click.option(
    "--faster-whisper/--normal-whisper",
    default=True,
    help="If True, use faster-whisper's implementation based on CTranslate2.",
)
```
Otherwise it can't be turned off.
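For anyone unfamiliar with the syntax: the `--faster-whisper/--normal-whisper` form is click's paired boolean flag, where the first name sets the value to True and the second to False, so `default=True` can still be overridden from the command line. A self-contained illustration:

```python
import click

@click.command()
@click.option(
    "--faster-whisper/--normal-whisper",
    default=True,
    help="If True, use faster-whisper's implementation based on CTranslate2.",
)
def run(faster_whisper):
    # Passing --normal-whisper on the command line flips this to False.
    click.echo(f"faster_whisper={faster_whisper}")

if __name__ == "__main__":
    run()
```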
```diff
 )
 @click.option(
     "--faster-whisper-compute-type",
     default="float16",
```
```diff
-    default="float16",
+    default="auto",
```
Otherwise it won't work on (some?) CPUs.
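Background on this suggestion: `"float16"` is a GPU-oriented compute type that most CPUs don't support, so model construction can fail there (or fall back, depending on the CTranslate2 version); `"auto"` lets CTranslate2 pick the fastest type the target device actually supports. A minimal sketch:

```python
from faster_whisper import WhisperModel

# "auto" resolves to the fastest compute type supported by the device,
# e.g. float16 on a recent GPU but int8/float32 on a typical CPU.
model = WhisperModel("base", device="cpu", compute_type="auto")
```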
```diff
     device_index=device_index,
     compute_type=compute_type,
     num_workers=num_workers,
     download_root=download_root,
```
Since the change that enables this option has not yet been released on pip, I suggest a small workaround here; otherwise it cannot be run:

```diff
+opt_kwargs = {}
+if download_root is not None:
+    opt_kwargs["download_root"] = download_root
 model = WhisperModel(
     model_name,
     device=device,
     device_index=device_index,
     compute_type=compute_type,
     num_workers=num_workers,
-    download_root=download_root,
+    **opt_kwargs,
 )
-model.logger.setLevel(logging.WARNING)
+if hasattr(model, "logger"):
+    model.logger.setLevel(logging.WARNING)
```

Note: I also suggested a check for `logger`; on my installation, `model` did not have the `logger` attribute defined.
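Assembled into one runnable snippet (the option values below are placeholders standing in for the CLI parameters):

```python
import logging

from faster_whisper import WhisperModel

# Placeholder values standing in for the CLI options.
model_name, device, device_index = "base", "cpu", 0
compute_type, num_workers, download_root = "auto", 1, None

# Only pass download_root when it is set: older faster-whisper releases
# on pip do not accept the keyword at all.
opt_kwargs = {}
if download_root is not None:
    opt_kwargs["download_root"] = download_root

model = WhisperModel(
    model_name,
    device=device,
    device_index=device_index,
    compute_type=compute_type,
    num_workers=num_workers,
    **opt_kwargs,
)

# Some installations expose a `logger` attribute on the model, some don't.
if hasattr(model, "logger"):
    model.logger.setLevel(logging.WARNING)
```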
Sorry for the delay; I've been quite busy. I'll pick this up shortly and address the requested changes.
@entn-at, any updates on this?
This PR adds a second Whisper annotation workflow that uses faster-whisper, a reimplementation of Whisper built on CTranslate2 (see https://github.com/entn-at/lhotse/tree/feature/whisper-ctranslate2). It's a lot faster and uses far less memory.
This implementation also obtains word start and end times. I'm still investigating whether they are generally accurate enough to be used as alignments.
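One way to evaluate that would be to attach the word times as Lhotse word alignments on the supervisions and inspect them; a rough sketch, assuming faster-whisper's `word_timestamps` output and Lhotse's `AlignmentItem`/`with_alignment` API (paths and IDs are placeholders):

```python
from faster_whisper import WhisperModel
from lhotse import SupervisionSegment
from lhotse.supervision import AlignmentItem

model = WhisperModel("base")
segments, _info = model.transcribe("utt.wav", word_timestamps=True)

supervisions = []
for i, seg in enumerate(segments):
    # One AlignmentItem per word: symbol, start time, and duration.
    words = [
        AlignmentItem(symbol=w.word.strip(), start=w.start, duration=w.end - w.start)
        for w in seg.words
    ]
    sup = SupervisionSegment(
        id=f"utt-{i}",
        recording_id="utt",
        start=seg.start,
        duration=seg.end - seg.start,
        text=seg.text.strip(),
    ).with_alignment(kind="word", alignment=words)
    supervisions.append(sup)
```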