This repository has been archived by the owner on Oct 13, 2022. It is now read-only.

Low hanging fruit: neural language model #132

Open
danpovey opened this issue Mar 19, 2021 · 6 comments · May be fixed by #139

Comments

@danpovey
Contributor

Guys,

I realized that there is some very low hanging fruit that could easily make our WERs state of the art, which is neural LM rescoring. An advantage of our framework-- possibly the key advantage-- is that the decoding part is very easy, so we can easily rescore large N-best lists with neural LMs.
In addition, it's quite easy to manipulate variable-length sequences, so things like training and using LMs should be a little easier than they otherwise would be.

Here's what I propose: as a relatively easy baseline that can be extended later, we can train a word-piece neural LM (I recommend word-pieces because the vocab size could otherwise be quite large, making the embedding matrices difficult to train). So we'll need:
(i) some mechanism to split up words into word-pieces,
(ii) data preparation for the LM training, which in the Librispeech case would, I assume, include the additional text training data that Librispeech comes with,
(iii) a script to train the actual LM. I assume this would be quite similar to our conformer self-attention model, with a cross-entropy (xent) output (no forward-backward needed), except we'd use a different type of masking, i.e. a mask of size (B, T, T), because we need to limit attention to left-context only.
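For concreteness, here is a minimal sketch of the left-context-only masking in (iii), assuming a plain PyTorch transformer LM; names, sizes and the file layout are placeholders, not existing snowfall code:

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the real values would come from the word-piece vocab and config.
vocab_size, d_model, num_layers = 5000, 512, 6

embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8), num_layers=num_layers
)
proj = nn.Linear(d_model, vocab_size)

def lm_loss(token_ids: torch.Tensor) -> torch.Tensor:
    """token_ids: (B, T) word-piece IDs; returns the next-token cross-entropy loss."""
    B, T = token_ids.shape
    # Causal mask: position t may attend only to positions <= t (left context only).
    # nn.TransformerEncoder takes a (T, T) additive mask; a full (B, T, T) mask would
    # additionally let us mask out padding positions per utterance.
    causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
    x = embed(token_ids).transpose(0, 1)       # (T, B, C)
    h = encoder(x, mask=causal_mask)           # (T, B, C)
    logits = proj(h).transpose(0, 1)           # (B, T, V)
    # Predict token t+1 from positions 0..t.
    return nn.functional.cross_entropy(
        logits[:, :-1].reshape(-1, vocab_size), token_ids[:, 1:].reshape(-1)
    )
```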

For decoding with the LM, we'd first do a decoding with our n-gram LM, get word sequences using our randomized N-best approach, get their scores with our neural LM, then compute the final scores with the n-gram LM and neural LM scores interpolated 50-50 or something like that. [Note: converting the word sequences into word-piece sequences is very easy, we can just do it by indexing a ragged tensor and then removing an axis.]
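As a rough sketch of that interpolation step (illustrative score tensors and function names, not the actual snowfall decoding code), picking the best hypothesis of one utterance from its N-best list:

```python
import torch

def rescore_nbest(am_scores, ngram_lm_scores, nn_lm_scores, lm_scale=1.0, interp=0.5):
    """Each argument is a 1-D tensor of log-scores over the N-best hypotheses
    of one utterance.  Returns the index of the best hypothesis after mixing
    the n-gram and neural LM scores 50-50 (interp=0.5), as suggested above."""
    lm_scores = interp * ngram_lm_scores + (1.0 - interp) * nn_lm_scores
    total_scores = am_scores + lm_scale * lm_scores
    return int(torch.argmax(total_scores))
```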

Dan

@pzelasko
Collaborator

This is a pretty feature-rich and efficient implementation of sub-word tokenizers (with training methods too): https://github.com/huggingface/tokenizers
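For example, training a WordPiece model with that library might look roughly like this (the input file name, vocab size and output directory below are made up):

```python
import os
from tokenizers import BertWordPieceTokenizer

# Train a WordPiece tokenizer on the LibriSpeech transcripts.
# "train_960_text.txt" and vocab_size=5000 are placeholders.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(files=["train_960_text.txt"], vocab_size=5000, min_frequency=2)

os.makedirs("wordpiece_tokenizer", exist_ok=True)
tokenizer.save_model("wordpiece_tokenizer")  # writes vocab.txt into this directory

# Words missing from the vocab are split into pieces, e.g. 'st', '##ud', '##ying'.
print(tokenizer.encode("studying", add_special_tokens=False).tokens)
```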

@danpovey
Contributor Author

danpovey commented Mar 19, 2021 via email

@glynpu
Contributor

glynpu commented Mar 23, 2021

I am doing this task.

(i) some mechanism to split up words into word-pieces,

A tokenizer has now been trained using https://github.com/huggingface/tokenizers (suggested by @pzelasko) on the librispeech train_960_text, i.e. the text from train_clean_360, train_clean_100 and train_other_500 (librispeech-lm-norm.txt is not used yet). A demo is shown below.

[screenshot: tokenizer demo output]

As shown in the above screenshot, "studying" is tokenized into the sequence ('st', '##ud', '##ying').
@danpovey what do you think about this method?

Next: I am going to train a tokenizer on the full librispeech text, i.e. train_960_text (48MB) plus librispeech-lm-norm.txt (4GB).

What kind of neural network should we try first for LM training, once the data preparation is done?
@danpovey I found a reference from espnet, an RNNLM, but I am not sure if it is appropriate for this task.
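For reference, such an RNNLM over word-piece IDs is essentially just an LSTM with a next-token cross-entropy loss; a minimal sketch (hypothetical sizes, plain PyTorch, not the espnet code):

```python
import torch
import torch.nn as nn

class RnnLm(nn.Module):
    """Minimal LSTM language model over word-piece IDs (illustrative only)."""

    def __init__(self, vocab_size=5000, embed_dim=512, hidden_dim=1024, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (B, T); predict token t+1 from tokens 0..t.
        h, _ = self.lstm(self.embed(token_ids))
        logits = self.proj(h)                  # (B, T, V)
        return nn.functional.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            token_ids[:, 1:].reshape(-1),
        )
```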

@pzelasko
Collaborator

Looks cool! My two cents: it's probably worth starting with an RNNLM and eventually trying some autoregressive transformers like GPT-2 (small/medium size).

@danpovey
Contributor Author

We'll probably be evaluating this in batch mode, not word by word, so some kind of transformer would probably be good from an efficiency point of view, but for prototyping, anything is OK with me. I suppose my main concern is to keep the code relatively simple, as compatible/similar as possible with our AM training code, and not have too many additional dependencies. But anything is OK with me as long as you keep making some kind of progress, as it will all increase your familiarity with the issues.

Just so we can see what you are doing script-wise, if you could make a PR to the repo it would be great. We don't have to worry too much about making the scripts too nice; snowfall is all supposed to be a draft.

@csukuangfj
Collaborator

There is another tokenizer that is used in torchtext:
