This repository has been archived by the owner on Oct 13, 2022. It is now read-only.

Low hanging fruit: neural language model #132

Open
danpovey opened this issue Mar 19, 2021 · 6 comments · May be fixed by #139

Comments

@danpovey
Contributor

Guys,

I realized that there is some very low hanging fruit that could easily make our WERs state of the art, which is neural LM rescoring. An advantage of our framework-- possibly the key advantage-- is that the decoding part is very easy, so we can easily rescore large N-best lists with neural LMs.
In addition, it's quite easy to manipulate variable-length sequences, so things like training and using LMs should be a little easier than they otherwise would be.

Here's what I propose: as a relatively easy baseline that can be extended later, we can train a word-piece neural LM (I recommend word-pieces because the vocab size could otherwise be quite large, making the embedding matrices difficult to train). So we'll need:
(i) some mechanism to split up words into word-pieces,
(ii) data preparation for the LM training, which in the Librispeech case would, I assume, include the additional text training data that Librispeech comes with,
(iii) a script to train the actual LM. I assume this would be quite similar to our conformer self-attention model, with a cross-entropy (xent) output (no forward-backward needed), except we'd use a different type of masking, i.e. a mask of size (B, T, T), because we need to limit attention to left-context only.
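For concreteness, here is a minimal sketch of the left-context-only masking in (iii), assuming a plain PyTorch transformer LM; names, sizes and the file layout are placeholders, not existing snowfall code:

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the real values would come from the word-piece vocab and config.
vocab_size, d_model, num_layers = 5000, 512, 6

embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8), num_layers=num_layers
)
proj = nn.Linear(d_model, vocab_size)

def lm_loss(token_ids: torch.Tensor) -> torch.Tensor:
    """token_ids: (B, T) word-piece IDs; returns the next-token cross-entropy loss."""
    B, T = token_ids.shape
    # Causal mask: position t may attend only to positions <= t (left context only).
    # nn.TransformerEncoder takes a (T, T) additive mask; a full (B, T, T) mask would
    # additionally let us mask out padding positions per utterance.
    causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
    x = embed(token_ids).transpose(0, 1)       # (T, B, C)
    h = encoder(x, mask=causal_mask)           # (T, B, C)
    logits = proj(h).transpose(0, 1)           # (B, T, V)
    # Predict token t+1 from positions 0..t.
    return nn.functional.cross_entropy(
        logits[:, :-1].reshape(-1, vocab_size), token_ids[:, 1:].reshape(-1)
    )
```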

For decoding with the LM, we'd first do a decoding with our n-gram LM, get word sequences using our randomized N-best approach, get their scores with our neural LM, then compute the final scores with the n-gram LM and neural LM scores interpolated 50-50 or something like that. [Note: converting the word sequences into word-piece sequences is very easy, we can just do it by indexing a ragged tensor and then removing an axis.]
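As a rough sketch of that interpolation step (illustrative score tensors and function names, not the actual snowfall decoding code), picking the best hypothesis of one utterance from its N-best list:

```python
import torch

def rescore_nbest(am_scores, ngram_lm_scores, nn_lm_scores, lm_scale=1.0, interp=0.5):
    """Each argument is a 1-D tensor of log-scores over the N-best hypotheses
    of one utterance.  Returns the index of the best hypothesis after mixing
    the n-gram and neural LM scores 50-50 (interp=0.5), as suggested above."""
    lm_scores = interp * ngram_lm_scores + (1.0 - interp) * nn_lm_scores
    total_scores = am_scores + lm_scale * lm_scores
    return int(torch.argmax(total_scores))
```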

Dan

@pzelasko
Collaborator

This is a pretty feature-rich and efficient implementation of sub-word tokenizers (with training methods too): https://github.com/huggingface/tokenizers
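For example, training a WordPiece model with that library might look roughly like this (the input file name, vocab size and output directory below are made up):

```python
import os
from tokenizers import BertWordPieceTokenizer

# Train a WordPiece tokenizer on the LibriSpeech transcripts.
# "train_960_text.txt" and vocab_size=5000 are placeholders.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(files=["train_960_text.txt"], vocab_size=5000, min_frequency=2)

os.makedirs("wordpiece_tokenizer", exist_ok=True)
tokenizer.save_model("wordpiece_tokenizer")  # writes vocab.txt into this directory

# Words missing from the vocab are split into pieces, e.g. 'st', '##ud', '##ying'.
print(tokenizer.encode("studying", add_special_tokens=False).tokens)
```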

@danpovey
Contributor Author

danpovey commented Mar 19, 2021 via email

@glynpu
Contributor

glynpu commented Mar 23, 2021

I am doing this task.

(i) some mechanism to split up words into word-pieces,

A tokenizer has now been trained using https://github.com/huggingface/tokenizers (suggested by @pzelasko) on the librispeech train_960_text, i.e. the text from train_clean_360, train_clean_100 and train_other_500 (librispeech-lm-norm.txt is not used yet). A demo is shown below.

[screenshot: tokenizer demo output]

As shown in the above screenshot, "studying" is tokenized into the sequence ('st', '##ud', '##ying').
@danpovey what do you think about this method?

Next: I am going to train a tokenizer on the full librispeech text, i.e. train_960_text (48MB) plus librispeech-lm-norm.txt (4GB).

What kind of neural network should we try first for LM training, once the data preparation is done?
@danpovey I found a reference from espnet, an RNNLM, but I am not sure if it is appropriate for this task.
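For reference, such an RNNLM over word-piece IDs is essentially just an LSTM with a next-token cross-entropy loss; a minimal sketch (hypothetical sizes, plain PyTorch, not the espnet code):

```python
import torch
import torch.nn as nn

class RnnLm(nn.Module):
    """Minimal LSTM language model over word-piece IDs (illustrative only)."""

    def __init__(self, vocab_size=5000, embed_dim=512, hidden_dim=1024, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (B, T); predict token t+1 from tokens 0..t.
        h, _ = self.lstm(self.embed(token_ids))
        logits = self.proj(h)                  # (B, T, V)
        return nn.functional.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            token_ids[:, 1:].reshape(-1),
        )
```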

@pzelasko
Collaborator

Looks cool! My two cents: it's probably worth starting with an RNNLM and eventually trying some autoregressive transformers like GPT-2 (small/medium size).

@danpovey
Contributor Author

We'll probably be evaluating this in batch mode, not word by word, so some kind of transformer would probably be good from an efficiency point of view, but for prototyping, anything is OK with me. I suppose my main concern is to keep the code relatively simple, as compatible/similar as possible with our AM training code, and not have too many additional dependencies. But anything is OK with me as long as you keep making some kind of progress, as it will all increase your familiarity with the issues.

Just so we can see what you are doing script-wise, if you could make a PR to the repo it would be great. We don't have to worry too much about making the scripts too nice; snowfall is all supposed to be a draft.

@csukuangfj
Collaborator

There is another tokenizer that is used in torchtext:
