
Lost end of audio with jit_pretrained_streaming.py #1779

Open
kfmn opened this issue Oct 21, 2024 · 2 comments

Comments

@kfmn

kfmn commented Oct 21, 2024

Hi,

I trained a streaming zipformer transducer on my data and exported the model to JIT with export.py, using specific values of chunk_length and left_context_frames. Then I ran streaming decoding with jit_pretrained_streaming.py, and it seems this script does not decode the final part of the audio.

First of all, there is a misprint in

chunk = int(0.25 * args.sample_rate) # 0.2 second

The comment should say 0.25 second, not 0.2.

Next, features are generated chunk by chunk and are decoded whenever the condition in

while online_fbank.num_frames_ready - num_processed_frames >= T:
is satisfied.

But if, after the last call to greedy_search, this condition is no longer satisfied, the remaining computed features stay unprocessed. As a result, the decoding hypotheses are sometimes truncated.
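Here is a small, self-contained sketch of the arithmetic (the numbers are example values I made up for illustration, not taken from the script):

    # Toy illustration of how feature frames can be left undecoded at the end
    # of an utterance; example values only, the real T depends on the exported
    # model's chunk parameters.
    frame_shift_ms = 10        # typical Fbank frame shift
    num_frames_ready = 640     # e.g. about 6.4 s of audio after feature extraction
    chunk_length = 128         # frames consumed per decoding step
    T = 141                    # frames required before a step can run

    num_processed_frames = 0
    while num_frames_ready - num_processed_frames >= T:
        num_processed_frames += chunk_length   # one more chunk gets decoded

    leftover = num_frames_ready - num_processed_frames
    print(f"{leftover} frames (~{leftover * frame_shift_ms / 1000:.2f} s) are never decoded")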

@csukuangfj
Collaborator

> First of all, there is a misprint in

No, it is correct. You can select an arbitrary positive value for it.
Its sole purpose is to simulate how fast the data samples arrive.

This value has nothing to do with your model parameters.


> the remaining computed features stay unprocessed.
> As a result, the decoding hypotheses are sometimes truncated.

tail_padding = torch.zeros(int(0.3 * args.sample_rate), dtype=torch.float32)

We add tail padding here. You can use a larger tail padding if you find that the last chunk is not decoded.
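For example, a sketch of padding with one second of zeros instead of 0.3 s (the exact value is arbitrary; anything comfortably larger than your chunk works):

    tail_padding = torch.zeros(int(1.0 * args.sample_rate), dtype=torch.float32)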


By the way, please provide a concrete example with runnable code/script to reproduce your issue.

If you have only read the code without running it, I suggest running it first and then checking whether what you expect matches the actual result.

@kfmn
Author

kfmn commented Oct 29, 2024

I meant two things:

  1. In the line chunk = int(0.25 * args.sample_rate), the chunk length is set to 0.25 seconds, but the comment says 0.2 seconds; that is all I meant.
  2. I understand that increasing tail_padding solves the problem of lost frames in decoding. But the hardcoded length of 0.3 seconds does not fit longer decoding chunks well. For example, chunk_size = 64 corresponds to 128 feature frames; with encoder.pad_length added, this gives T = 141 frames. That is much more than the hardcoded 30 frames of tail_padding, so the last real (non-padded) frames can be lost. My suggestion is to make tail_padding depend on, and stay consistent with, chunk_size and encoder.pad_length (see the sketch below).
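For instance, a minimal sketch of what I mean, assuming the default 10 ms Fbank frame shift; chunk_size and pad_length below are illustrative values matching my setup, not a patch to the script:

    import torch

    # Illustrative values; in practice they would come from the exported model / CLI args.
    sample_rate = 16000
    frame_shift_ms = 10
    chunk_size = 64                  # encoder chunk size used at export time
    pad_length = 13                  # example value of encoder.pad_length
    chunk_length = 2 * chunk_size    # input feature frames per decoding chunk
    T = chunk_length + pad_length    # frames required before a chunk is decoded

    # Pad with at least T frames of silence instead of a fixed 0.3 s, so the
    # last real frames always satisfy the decoding condition.
    tail_padding = torch.zeros(
        int(T * frame_shift_ms / 1000 * sample_rate), dtype=torch.float32
    )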
