
Lost end of audio with jit_pretrained_streaming.py #1779

Open
kfmn opened this issue Oct 21, 2024 · 2 comments

Comments

@kfmn

kfmn commented Oct 21, 2024

Hi,

I trained a streaming zipformer transducer on my data and exported the model to JIT with export.py, using specific values of chunk_length and left_context_frames. Then I ran streaming decoding with jit_pretrained_streaming.py, and it seems this script does not decode the final part of the audio.

First of all, there is a misprint in

chunk = int(0.25 * args.sample_rate) # 0.2 second

The comment should say 0.25 second, not 0.2.

Next, features are generated chunk by chunk and are decoded whenever the condition in

while online_fbank.num_frames_ready - num_processed_frames >= T:
is satisfied.

But if, after the last call to greedy_search, this condition is no longer satisfied, the remaining computed features stay unprocessed. As a result, the decoding hypotheses are sometimes truncated.
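Here is a small, self-contained sketch of the arithmetic (the numbers are example values I made up for illustration, not taken from the script):

    # Toy illustration of how feature frames can be left undecoded at the end
    # of an utterance; example values only, the real T depends on the exported
    # model's chunk parameters.
    frame_shift_ms = 10        # typical Fbank frame shift
    num_frames_ready = 640     # e.g. about 6.4 s of audio after feature extraction
    chunk_length = 128         # frames consumed per decoding step
    T = 141                    # frames required before a step can run

    num_processed_frames = 0
    while num_frames_ready - num_processed_frames >= T:
        num_processed_frames += chunk_length   # one more chunk gets decoded

    leftover = num_frames_ready - num_processed_frames
    print(f"{leftover} frames (~{leftover * frame_shift_ms / 1000:.2f} s) are never decoded")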

@csukuangfj
Collaborator

> First of all, there is a misprint in

No, it is correct. You can select an arbitrary positive value for it.
Its sole purpose is to simulate how fast the data samples arrive.

This value has nothing to do with your model parameters.


> the remaining computed features stay unprocessed.
> As a result, the decoding hypotheses are sometimes truncated.

tail_padding = torch.zeros(int(0.3 * args.sample_rate), dtype=torch.float32)

We add tail padding here. You can use a larger tail padding if you find that the last chunk is not decoded.
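For example, a sketch of padding with one second of zeros instead of 0.3 s (the exact value is arbitrary; anything comfortably larger than your chunk works):

    tail_padding = torch.zeros(int(1.0 * args.sample_rate), dtype=torch.float32)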


By the way, please provide a concrete example with runnable code/script to reproduce your issue.

If you have only read the code without running it, I suggest running it first and then checking whether what you expect matches the actual result.

@kfmn
Author

kfmn commented Oct 29, 2024

I meant two things:

  1. In the line chunk = int(0.25 * args.sample_rate), the chunk length is set to 0.25 seconds, but the comment says 0.2 seconds; that is all I meant.
  2. I understand that increasing tail_padding solves the problem of lost frames in decoding. But the hardcoded length of 0.3 seconds does not fit longer decoding chunks well. For example, chunk_size = 64 corresponds to 128 feature frames; with encoder.pad_length added, this gives T = 141 frames. That is much more than the hardcoded 30 frames of tail_padding, so the last real (non-padded) frames can be lost. My suggestion is to make tail_padding depend on, and stay consistent with, chunk_size and encoder.pad_length (see the sketch below).
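For instance, a minimal sketch of what I mean, assuming the default 10 ms Fbank frame shift; chunk_size and pad_length below are illustrative values matching my setup, not a patch to the script:

    import torch

    # Illustrative values; in practice they would come from the exported model / CLI args.
    sample_rate = 16000
    frame_shift_ms = 10
    chunk_size = 64                  # encoder chunk size used at export time
    pad_length = 13                  # example value of encoder.pad_length
    chunk_length = 2 * chunk_size    # input feature frames per decoding chunk
    T = chunk_length + pad_length    # frames required before a chunk is decoded

    # Pad with at least T frames of silence instead of a fixed 0.3 s, so the
    # last real frames always satisfy the decoding condition.
    tail_padding = torch.zeros(
        int(T * frame_shift_ms / 1000 * sample_rate), dtype=torch.float32
    )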
