Training wav2letter++ streaming convnets (TDS + CTC) #101
Hi Erik,
On Sat, Apr 4, 2020 at 4:16 PM Erik Ziegler ***@***.***> wrote:
First of all, I think your work is amazing and making all your models available is just so generous.
thank you :)
How many and what GPUs do you use for training? (The wav2letter guys said here that they used 32 GPUs for training the streaming convnets acoustic model, which sounds a little bit insane)
I used a single 1080Ti
How much RAM does the system need to have or is it primarily GPU work?
not sure how much is actually needed, the system I used has 64GB of RAM
How long did the training of your wav2letter model take?
my memory may be off here, but I think some 4-6 months
Are there any pitfalls when training for wav2letter?
well, as with most models the language model has a high impact on the final WER results. I also remember the code wasn't as robust as Kaldi back then, but I guess that should have improved by now
good luck with training your model! :)
guenter
Thank you for your reply and the insights! Since you mention it, what about your language models? How long did it take, say, for the large order-6 German LM? And if I have domain-specific words that I really want my speech recognition to know about, should I add examples of them to the speech corpora, or should I make sure that those words are well represented in the language model text corpora? Or both? Or should the language model text corpora be identical to the speech corpora text? Sorry for all the questions :D
Hi Erik. Your project sounds interesting. I have only one remark, about annotation quality, because this problem came up several times in this project and Guenter spent a lot of time correcting annotation problems in the speech corpora. So, if you include the latest Common Voice release, which is very new, I would be cautious and would try to spot problematic audio files and/or annotations. Just curious: what WERs are you expecting? Sven
@svenha you're right. Theoretically, the Common Voice dataset should already be reviewed by its users, but I don't know whether that actually ensures the data's quality. Regarding WER, I am not expecting anything in particular. I'll compare it to a microphone streaming implementation with a Kaldi model (the German Zamia Kaldi model) and see what feels better and more robust.
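For a comparison like the one described above, the two systems can be scored on the same recordings with word error rate. As a minimal sketch (not taken from either toolkit), WER is the word-level edit distance between reference and hypothesis, divided by the reference length:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over words,
    normalized by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] = edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        cur = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            substitution = prev[j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur[j] = min(substitution, deletion, insertion)
        prev = cur
    return prev[len(hyp)] / max(len(ref), 1)
```

For example, `wer("das ist ein test", "das ist kein test")` gives 0.25 (one substitution out of four reference words).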
Hi Erik,
On Sat, Apr 4, 2020 at 5:30 PM Erik Ziegler ***@***.***> wrote:
Maybe as you speak of it, what about your language models? How long did it take, say for the large order 6 german lm?
don't remember exactly, but not very long - maybe a few days at most
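The thread doesn't say which toolkit was used to build the order-6 LM; KenLM (`lmplz -o 6`) is a common choice in wav2letter setups. As a toy illustration only of what "order 6" means, here is a minimal sketch of the n-gram counting that any ARPA-style estimator starts from:

```python
from collections import Counter

def ngram_counts(sentences, order=6):
    """Count all n-grams up to `order`, with <s>/</s> sentence padding.
    Real LM toolkits add smoothing and backoff on top of these counts."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts
```

With order 6, every span of up to six consecutive tokens is counted, which is why corpus size and memory, rather than wall-clock time, tend to be the limiting factor.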
And if I have domain-specific words that I really want my speech recognition to know about, should I add examples of that to the speech corpora or should I make sure that those words are well represented in the language model text corpora? Or both?
more data is always good :)) ideally, you want recordings of all those domain-specific words in multiple contexts, by multiple speakers, using different microphones, environments etc., and of course your language models should cover these words in realistic contexts as well.
cheers,
guenter
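One quick way to act on the advice above is to check whether the LM text corpus actually contains the domain-specific words at all. The function below is a hypothetical helper, not part of any toolkit mentioned in the thread:

```python
def coverage(domain_words, corpus_sentences):
    """Return the fraction of domain words that occur at least once
    in the corpus, plus the list of missing words."""
    vocab = set()
    for sentence in corpus_sentences:
        vocab.update(sentence.lower().split())
    missing = [w for w in domain_words if w.lower() not in vocab]
    covered = 1 - len(missing) / max(len(domain_words), 1)
    return covered, missing
```

Words reported as missing would be out-of-vocabulary for a word-level LM and effectively unrecognizable, so they need to be added to the LM corpus (and ideally the lexicon) in realistic sentence contexts.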
@erksch did you have success training a streaming convnet on the Mozilla dataset? I will be attempting something similar.
Hey!
First of all, I think your work is amazing and making all your models available is just so generous.
I checked out your German wav2letter model, and as far as I can tell from your training config (w2l_config_conv_glu_train.cfg), the acoustic model is based on conv_glu with the ASG criterion from the original wav2letter paper. Facebook released its streaming_convnets version in January, which allows online speech recognition with streaming capability, and I would kill for a German model for that. Here is a link to the architecture file and the training config. I want to train the acoustic model with the hardware resources I have available and updated German speech corpora (like the most recent Common Voice release with 500 hours of German speech).
Regarding your experience in training a wav2letter model:
Thank you very much :)