Replies: 1 comment
-
I think if you use the tokenizer with predict_timestamps=True during training, it should be like this:
-
I have been experimenting with the helpful tutorial on tuning Whisper with PEFT:
peft/examples/int8_training/peft_bnb_whisper_large_v2_training.ipynb
I have come across something suspect: when I use the evaluation code in the notebook to evaluate my tuned model, I get much better results than when I evaluate the same model through a pipeline. I get a WER of 0.34 for the former versus 0.47 for the latter.
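For reference, the pipeline-based evaluation I compared against looks roughly like the sketch below. It is a minimal sketch of my setup, not code from the notebook: `processor`, `model`, `test_dataset`, and the "sentence" column name are placeholders for whatever you use locally.

```python
# Rough sketch of the pipeline evaluation path (placeholder names, not from the notebook).
import evaluate
from transformers import pipeline

wer_metric = evaluate.load("wer")

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=0,
)

predictions, references = [], []
for sample in test_dataset:
    # Assumes the audio array is already at the feature extractor's sampling rate
    # (16 kHz for Whisper).
    out = asr(sample["audio"]["array"])
    predictions.append(out["text"])
    references.append(sample["sentence"])

print("WER:", wer_metric.compute(predictions=predictions, references=references))
```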
I have tried all sorts of things, but I have narrowed the difference down to the following argument passed when calling model.generate():
decoder_input_ids=batch["labels"][:, :4].to("cuda")
I believe the first 3 elements of the labels are the usual special IDs specifying task, language, etc., and they are the same for all inputs. However, the 4th element differs for each input, and I believe it is actually the first token of the ground-truth transcript. Therefore the model is cheating by being given a prompt from the reference text.
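Here is a minimal sketch of what I mean, using one batch from the eval dataloader. The names (`processor`, `model`, `batch`) and the `max_new_tokens` value are placeholders from my setup, and the token interpretation in the comments is my reading of the labels, not something the notebook states.

```python
import torch

# In my batches, the first three label tokens are identical for every example
# (presumably <|startoftranscript|>, language, task), while the fourth already
# differs per example, i.e. it looks like the first transcript token.
print(processor.tokenizer.convert_ids_to_tokens(batch["labels"][0, :4].tolist()))
print(processor.tokenizer.convert_ids_to_tokens(batch["labels"][1, :4].tolist()))

with torch.no_grad():
    # Variant used in the notebook's evaluation: the 4th label token is handed
    # to the decoder, so generation starts with a one-token head start from the
    # ground truth.
    leaky_ids = model.generate(
        input_features=batch["input_features"].to("cuda"),
        decoder_input_ids=batch["labels"][:, :4].to("cuda"),
        max_new_tokens=255,
    )

    # What I would have expected: keep only the special-token prefix, so nothing
    # from the reference transcript reaches the decoder.
    no_leak_ids = model.generate(
        input_features=batch["input_features"].to("cuda"),
        decoder_input_ids=batch["labels"][:, :3].to("cuda"),
        max_new_tokens=255,
    )
```

If my reading of the labels is right, slicing to `:3` keeps only the tokens that are identical across examples, so the comparison with the pipeline becomes fair.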
@pacman100 Does this sound right? I am not sure if I have interpreted the input IDs correctly. If not, what is the reason for keeping the first 4 tokens from the labels during inference?
Thanks in advance for any responses!