Inference latency #288
HifiGAN is essentially larger and heavier. You need to either find another checkpoint pretrained with the ISTFT decoder or train a new model from scratch yourself. You can also fine-tune on top of the LJ checkpoint, which is not recommended, although one of my friends managed to get reasonable results that way. As for your other questions: no, the dataset has no impact on latency. Only the parameters of your model matter, and chiefly the size of the decoder.
Thanks for your reply. We have two models: one trained on LibriTTS-R (360+100 hrs) data, and the other fine-tuned from this model with 20 minutes of audio samples for multiple speakers. We set max_len to 100 for the first and 400 for the second. The two models have an average latency difference of nearly 1.5 sec.
You're welcome.
Understood. But in our experiment, we checked the size of the decoder for both models mentioned above. It was the same for both: 217 MB. Yet the two models still have a latency difference of 1.5 seconds. Do you know of any other possible cause?
Also, one model is trained from scratch and the other one is fine-tuned. Will that make any difference? The number of model parameters and the model size are the same :/
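When two models with identical parameter counts show different latency, it can help to time each stage separately rather than the whole call. Below is a minimal sketch of such a harness using only the standard library. `fake_text_encoder` and `fake_decoder` are hypothetical stand-ins, not StyleTTS2 functions; you would substitute the real stages of your own inference script.

```python
import statistics
import time


def time_stage(fn, *args, warmup=2, runs=10):
    """Return the median wall-clock time of fn(*args) in seconds."""
    for _ in range(warmup):
        fn(*args)  # warm-up runs are discarded (caches, lazy initialisation)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Hypothetical stand-ins for real pipeline stages (assumption, not StyleTTS2 code).
def fake_text_encoder(text):
    return [ord(c) for c in text]


def fake_decoder(frames):
    return [f * 2 for f in frames for _ in range(100)]


text = "a" * 439  # same character count as the example in this thread
frames = fake_text_encoder(text)
report = {
    "text_encoder": time_stage(fake_text_encoder, text),
    "decoder": time_stage(fake_decoder, frames),
}
for name, seconds in report.items():
    print(f"{name}: {seconds * 1000:.3f} ms")
```

If the stages run on a GPU, kernel launches are asynchronous, so the device has to be synchronised (for example with `torch.cuda.synchronize()`) before each clock read, otherwise the numbers mostly reflect launch overhead.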
Unless you change the decoder, or use very short samples with LFInference, there should not be much latency overhead.
It's unusual that fine-tuning StyleTTS2 increases the checkpoint file size even though the number of parameters in the model remains the same. Has anyone identified the reason behind this size increase?
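A quick sanity check for the file-size question is to compare the checkpoint size on disk with what the weights alone should occupy. The sketch below assumes 4 bytes per parameter for fp32 weights; `extra_bytes` and its arguments are illustrative names, not part of StyleTTS2. A large positive gap would suggest the file stores something besides the weights (optimizer state is one possibility, but that is a guess, not something confirmed in this thread).

```python
import os

# Bytes used by one parameter for common dtypes.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def expected_weight_bytes(num_params, dtype="fp32"):
    """Size the raw weights would take, ignoring file-format overhead."""
    return num_params * BYTES_PER_PARAM[dtype]


def extra_bytes(checkpoint_path, num_params, dtype="fp32"):
    """Bytes in the checkpoint file beyond what the weights alone need."""
    return os.path.getsize(checkpoint_path) - expected_weight_bytes(num_params, dtype)
```

Running `extra_bytes` on both the from-scratch and the fine-tuned checkpoint with the same parameter count shows whether the growth comes from extra stored state rather than from the model itself.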
I was trying out the model with 439 characters and saw 5-6 sec of average latency on the LibriTTS dataset. Is there a way we can reduce the latency? (The decoder takes the most time.)
Also, I fine-tuned the model with a few samples from a new speaker and saw the latency increase by a further 600-700 ms. Is this expected?
Is the latency expected to increase if the dataset is larger (English only)?
Similarly, if we add more languages, will the model's inference latency increase?
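When comparing latency across models, one thing worth ruling out is that one model simply produces longer audio for the same text, since more output means more decoder work. Normalising by output duration gives the real-time factor (RTF). The sketch below assumes a 24 kHz output sample rate, and the numbers in the usage lines are made up for illustration, not measurements from this thread.

```python
def audio_seconds(num_samples, sample_rate=24000):
    """Duration of the generated waveform in seconds."""
    return num_samples / sample_rate


def real_time_factor(latency_s, num_samples, sample_rate=24000):
    """Synthesis time divided by audio duration; lower is faster."""
    return latency_s / audio_seconds(num_samples, sample_rate)


# Hypothetical numbers: model B is slower in absolute terms but also
# produces a longer waveform for the same input text.
rtf_a = real_time_factor(latency_s=5.5, num_samples=720_000)
rtf_b = real_time_factor(latency_s=6.6, num_samples=864_000)
print(f"model A RTF: {rtf_a:.4f}")
print(f"model B RTF: {rtf_b:.4f}")
```

If the two RTF values are close, the extra latency is explained by the longer output rather than by a slower model.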