Randomly getting error while generating word timestamps #59
You can try adjusting the loop in `align_words`:

```python
for start_seq, req_idx in start_seq_wise_req.items():
    # adding adjusted_num_frames
    adjusted_num_frames = [min(frame, MAX_TEXT_TOKEN_LENGTH) for frame in seq_lens[req_idx].detach().cpu().numpy()]
    res = self.aligner_model.align(
        ctranslate2.StorageView.from_array(features[req_idx]),
        start_sequence=list(start_seq),
        text_tokens=[text_tokens[_] for _ in req_idx],
        num_frames=adjusted_num_frames,
        median_filter_width=7
    )
```

and adjusting `data_collate_fn`:

```python
def data_collate_fn(self, batch):
    # adding max_seq_len_samples
    max_seq_len_samples = MAX_TEXT_TOKEN_LENGTH * (HOP_LENGTH * INPUT_STRIDE)
    if self.use_dynamic_time_axis:
        max_len = min(max([_[3] for _ in batch]) + self.dta_padding, N_SAMPLES, max_seq_len_samples)
    else:
        max_len = min(N_SAMPLES, max_seq_len_samples)
```

Let me know if that fixes anything @rahulmate
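The core of the suggested fix is a clamp: Whisper's decoder defines position encodings only up to a fixed maximum (448, per the error message), so every frame/length value handed to the aligner has to be capped at that limit. A minimal sketch of the idea, assuming `MAX_TEXT_TOKEN_LENGTH` is 448 (the name `clamp_num_frames` is hypothetical, for illustration only):

```python
MAX_TEXT_TOKEN_LENGTH = 448  # assumed value, matching the limit in the error message

def clamp_num_frames(seq_lens, max_len=MAX_TEXT_TOKEN_LENGTH):
    """Cap each sequence length so the aligner never indexes a position
    beyond the model's position-encoding table."""
    return [min(int(n), max_len) for n in seq_lens]

print(clamp_num_frames([120, 454, 448]))
```

Any value past the limit (such as the 454 in the traceback) is pulled back to 448; values already inside the table pass through unchanged.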
Thanks @aleksandr-smechov, the changes in the `align_words` function solved the issue. I haven't run a benchmark yet, but I will run one to check the timestamps. For the changes in `data_collate_fn`, I was getting an error with the TensorRT model tensor.
See shashikg#59 (comment) Error: No position encodings are defined for positions >= 448, but got position 454
For me, the above didn't solve anything. The issue I'm facing is that the model (large-v3) is hallucinating and repeating some phrases, which then increases the length of the chunk/tokens. Large-v2 didn't have this problem with this specific audio, but it did with some files that were fine with large-v3. Overall, I would say the TensorRT-LLM backend is showing more hallucinations than CTranslate2 is.
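A common heuristic for flagging this kind of repetitive hallucination is a compression-ratio check on the transcript: repeated phrases compress far better than natural speech. This is a sketch of that idea (openai/whisper uses a similar check with a default threshold of 2.4; the function name here is illustrative, not part of WhisperS2T's API):

```python
import zlib

def compression_ratio(text: str) -> float:
    """Ratio of raw to zlib-compressed length.

    Highly repetitive (likely hallucinated) output compresses well and
    yields a high ratio; varied natural text stays near 1.0."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))
```

Segments whose ratio exceeds the threshold could then be re-decoded with a different temperature or simply flagged for review.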
Code:

```python
model = whisper_s2t.load_model(model_identifier="large-v2", asr_options={'word_timestamps': True}, backend='TensorRT-LLM')
files = ['output.wav']
lang_codes = ['en']
tasks = ['transcribe']
initial_prompts = [None]
out = model.transcribe_with_vad(files,
                                lang_codes=lang_codes,
                                tasks=tasks,
                                initial_prompts=initial_prompts,
                                batch_size=16)
```
For the above code, it sometimes throws the error below for the same file. Is there any explanation for this?
```
RuntimeError                              Traceback (most recent call last)
Cell In[15], line 10
      8 initial_prompts = [None]
      9 start = time.time()
---> 10 out = model.transcribe_with_vad(files,
     11                                 lang_codes=lang_codes,
     12                                 tasks=tasks,
     13                                 initial_prompts=initial_prompts,
     14                                 batch_size=16)
     15 end = time.time()
     16 print(f"batch :: {16} time:: {end-start}")

File ~/temp_triton/triton_env/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator..decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/temp_triton/triton_env/lib/python3.10/site-packages/whisper_s2t/backends/init.py:171, in WhisperModel.transcribe_with_vad(self, audio_files, lang_codes, tasks, initial_prompts, batch_size)
    169 for signals, prompts, seq_len, seg_metadata, pbar_update in self.data_loader(audio_files, lang_codes, tasks, initial_prompts, batch_size=batch_size):
    170     mels, seq_len = self.preprocessor(signals, seq_len)
--> 171     res = self.generate_segment_batched(mels.to(self.device), prompts, seq_len, seg_metadata)
    173     for res_idx, _seg_metadata in enumerate(seg_metadata):
    174         responses[_seg_metadata['file_id']].append({**res[res_idx],
    175                                                     'start_time': round(_seg_metadata['start_time'], 3),
    176                                                     'end_time': round(_seg_metadata['end_time'], 3)})

File ~/temp_triton/triton_env/lib/python3.10/site-packages/whisper_s2t/backends/tensorrt/model.py:248, in WhisperModelTRT.generate_segment_batched(self, features, prompts, seq_lens, seg_metadata)
    246 text_tokens = [[_t for _t in x[0] if t < self.tokenizer.eot]+[self.tokenizer.eot] for x in result]
    247 sot_seqs = [tuple([-4:]) for _ in prompts]
--> 248 word_timings = self.align_words(features, texts, text_tokens, sot_seqs, seq_lens, seg_metadata)
    250 for _response, _word_timings in zip(response, word_timings):
    251     _response['word_timestamps'] = _word_timings

File ~/temp_triton/triton_env/lib/python3.10/site-packages/whisper_s2t/backends/tensorrt/model.py:200, in WhisperModelTRT.align_words(self, features, texts, text_tokens, sot_seqs, seq_lens, seg_metadata)
    198 token_alignments = [[] for _ in seg_metadata]
    199 for start_seq, req_idx in start_seq_wise_req.items():
--> 200     res = self.aligner_model.align(ctranslate2.StorageView.from_array(features[req_idx]),
    201                                    start_sequence=list(start_seq),
    202                                    text_tokens=[text_tokens[_] for _ in req_idx],
    203                                    num_frames=list(seq_lens[req_idx].detach().cpu().numpy()),
    204                                    median_filter_width=7)
    206     for _res, _req_idx in zip(res, req_idx):
    207         token_alignments[_req_idx] = _res

RuntimeError: No position encodings are defined for positions >= 448, but got position 454
```
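The error happens intermittently because the decoder occasionally emits more tokens than the model's position-encoding table holds (448 positions), typically when it hallucinates repetitions; the aligner then tries to index position 454 and fails. A hedged sketch of a guard that would keep the token list inside the table (`truncate_tokens` is a hypothetical helper, and the `eot` id is caller-supplied, not assumed):

```python
MAX_POSITIONS = 448  # limit named in the error message

def truncate_tokens(tokens, eot, max_positions=MAX_POSITIONS):
    """Truncate a token sequence to the position-encoding limit,
    preserving the end-of-text token so alignment still terminates."""
    if len(tokens) <= max_positions:
        return tokens
    return tokens[: max_positions - 1] + [eot]
```

Applying such a guard before calling the aligner trades the crash for silently dropped tail tokens, so it is a mitigation rather than a fix for the underlying hallucination.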