Low hanging fruit: neural language model #132
This is a pretty feature-rich and efficient implementation of sub-word tokenizers (with training methods too): https://github.com/huggingface/tokenizers
Cool, thanks for the info!
I am doing this task.
A tokenizer has now been trained with https://github.com/huggingface/tokenizers (suggested by @pzelasko) on the LibriSpeech train_960 text, i.e. the transcripts from train-clean-100, train-clean-360 and train-other-500; librispeech-lm-norm.txt is not used yet. As a demo, "studying" is tokenized into the sequence ('st', '##ud', '##ying'). Next I am going to train a tokenizer on the full LibriSpeech text, i.e. the train_960 text (48 MB) plus librispeech-lm-norm.txt (4 GB). What kind of neural network should we try first for LM training once the data preparation is done?
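For reference, a minimal sketch of training a WordPiece tokenizer with that library on the LibriSpeech text; the file path and vocabulary size below are placeholders, not the exact settings used:

```python
from tokenizers import BertWordPieceTokenizer

# LibriSpeech transcripts are upper-case; lowercase=True normalizes them,
# which matches word-pieces like 'st', '##ud', '##ying'.
tokenizer = BertWordPieceTokenizer(lowercase=True)

tokenizer.train(
    files=["data/local/lm/train_960_text"],  # placeholder path to the 960h transcript text
    vocab_size=5000,                         # placeholder vocabulary size
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

tokenizer.save("wordpiece_tokenizer.json")   # serialize for later use

enc = tokenizer.encode("STUDYING")
print(enc.tokens)  # e.g. ['st', '##ud', '##ying'], depending on the trained vocab
```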
Looks cool! My two cents: it's probably worth starting with an RNN LM and eventually trying some autoregressive transformers like GPT-2 (small/medium size).
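Not anyone's actual code, just a minimal sketch of what such an RNN LM baseline could look like in PyTorch, trained with cross-entropy on next-piece prediction (vocabulary size, dimensions and the dummy batch are placeholders):

```python
import torch
import torch.nn as nn

class RnnLm(nn.Module):
    """Minimal LSTM language model over word-piece ids (sketch)."""

    def __init__(self, vocab_size: int, embed_dim: int = 512,
                 hidden_dim: int = 1024, num_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T) word-piece ids; returns (B, T, vocab_size) logits
        x = self.embed(tokens)
        y, _ = self.lstm(x)
        return self.out(y)

model = RnnLm(vocab_size=5000)
tokens = torch.randint(0, 5000, (4, 20))   # dummy batch of word-piece ids
logits = model(tokens[:, :-1])             # predict the next piece at each position
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1)
)
```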
We'll probably be evaluating this in batch mode, not word by word, so some kind of transformer would probably be good from an efficiency point of view, but for prototyping, anything is OK with me.
I suppose my main concern is to keep the code relatively simple, as compatible/similar as possible with our AM training code, and not have too many additional dependencies. But anything is OK with me as long as you keep making some kind of progress, as it will all increase your familiarity with the issues.
Just so we can see what you are doing script-wise, if you could make a PR to the repo it would be great. We don't have to worry too much about making the scripts too nice; snowfall is all supposed to be a draft.
There is another tokenizer that is used in torchtext: |
Guys,
I realized that there is some very low hanging fruit that could easily make our WERs state of the art, which is neural LM rescoring. An advantage of our framework-- possibly the key advantage-- is that the decoding part is very easy, so we can easily rescore large N-best lists with neural LMs.
In addition, it's quite easy to manipulate variable-length sequences, so things like training and using LMs should be a little easier than they otherwise would be.
Here's what I propose: as a relatively easy baseline that can be extended later, we can train a word-piece neural LM (I recommend word-pieces because the vocab size could otherwise be quite large, making the embedding matrices difficult to train). So we'll need:
(i) some mechanism to split up words into word-pieces,
(ii) data preparation for the LM training, which in the Librispeech case would, I assume, include the additional text training data that Librispeech comes with,
(iii) script to train the actual LM. I assume this would be quite similar to our conformer self-attention model, with a xent output (no forward-backward needed), except we'd use a different type of masking, i.e. a mask of size (B, T, T), because we need to limit it to only left-context (see the masking sketch after this list).
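A sketch of what the (B, T, T) left-context mask in (iii) could look like. The convention that True means "masked out", and the way padding is handled, are assumptions and may differ from the attention code in snowfall:

```python
import torch

def make_causal_mask(lengths: torch.Tensor, max_len: int) -> torch.Tensor:
    """Return a (B, T, T) boolean mask where True marks positions that must
    NOT be attended to: future tokens (right context) and padding."""
    # (T, T) upper triangle above the diagonal = future positions.
    future = torch.triu(torch.ones(max_len, max_len, dtype=torch.bool), diagonal=1)
    # (B, T): True where the position is beyond the sequence's real length.
    padding = torch.arange(max_len).unsqueeze(0) >= lengths.unsqueeze(1)
    # Broadcast to (B, T, T): mask a key position if it lies in the future
    # relative to the query position, or if it is padding.
    return future.unsqueeze(0) | padding.unsqueeze(1)

# Example: batch of two sequences with lengths 5 and 3, padded to T=5.
mask = make_causal_mask(torch.tensor([5, 3]), max_len=5)
```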
For decoding with the LM, we'd first do a decoding pass with our n-gram LM, get word sequences using our randomized N-best approach, get their scores from the neural LM, and then compute combined scores with the n-gram LM and neural LM scores interpolated 50-50 or something like that. [Note: converting the word sequences into word-piece sequences is very easy; we can just do it by indexing a ragged tensor and then removing an axis.]
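A sketch of that rescoring step, assuming the decoding code hands us an N-best list with per-hypothesis n-gram log-scores and we have computed neural LM log-probabilities for the same hypotheses (the function and variable names are hypothetical):

```python
import torch

def rescore_nbest(hyps, ngram_scores: torch.Tensor, nnlm_scores: torch.Tensor,
                  weight: float = 0.5):
    """Interpolate n-gram and neural LM scores (log domain, 50-50 by default)
    and return the best hypothesis for one utterance.

    hyps:         list of N word sequences
    ngram_scores: (N,) log-scores from the n-gram LM decoding
    nnlm_scores:  (N,) log-probabilities from the neural LM
    """
    total = (1.0 - weight) * ngram_scores + weight * nnlm_scores
    best = int(torch.argmax(total))
    return hyps[best]
```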
Dan