Model Name | Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training | Pre-Train Data | Batch Size |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BERT | Encoder-Only | Masked Language Model (MLM) + Next Sentence Prediction (NSP) | WordPiece | 30k tokens | Greedy decomposition of the token into sub-words until every piece is found in the vocabulary (see the sketch after the table). | Sum of: token embeddings (WordPiece) + segment embedding (learned) + absolute position embedding (learned) | Scaled dot-product self-attention (note: advised to pad inputs on the right rather than the left, since positional embeddings are absolute). | GeLU (avoids the dying-ReLU problem: a node can get stuck at 0 for negative inputs, stops learning, and cannot recover). | ~110M (Base), ~340M (Large) | | BookCorpus and English Wikipedia (~16 GB uncompressed) | Batch size 256, maximum sequence length 512 |
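To make the OOV handling concrete, here is a minimal sketch of WordPiece-style greedy longest-match-first decomposition. The toy vocabulary and function name are assumptions for illustration; the released BERT tokenizer uses a ~30k-piece vocabulary and marks non-initial sub-words with a "##" prefix.

```python
# Minimal sketch of WordPiece-style greedy decomposition (longest-match-first).
# The toy vocabulary below is an illustrative assumption, not BERT's real vocab.
def wordpiece_tokenize(word, vocab, max_chars=100):
    if len(word) > max_chars:
        return ["[UNK]"]
    pieces, start = [], 0
    while start < len(word):
        end, cur_piece = len(word), None
        # Greedily take the longest remaining prefix that is in the vocabulary.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces get the "##" prefix
            if piece in vocab:
                cur_piece = piece
                break
            end -= 1
        if cur_piece is None:  # nothing matched -> whole word becomes [UNK]
            return ["[UNK]"]
        pieces.append(cur_piece)
        start = end
    return pieces

toy_vocab = {"un", "afford", "##afford", "##able", "[UNK]"}
print(wordpiece_tokenize("unaffordable", toy_vocab))  # ['un', '##afford', '##able']
```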
Recall that GPT takes the original Transformer encoder-decoder model for neural machine translation and crafts a left-to-right, decoder-only architecture. This is tantamount to keeping only the self-attention mechanism proposed in the original paper (which had two kinds of attention: encoder-decoder cross-attention, plus self-attention in both the encoder and the decoder; GPT uses the decoder's masked self-attention). The main drawback, from the BERT authors' perspective, is the loss of information in GPT's uni-directional approach. Instead, they suggest an "encoder-only" approach where information is used from all directions (i.e. "bi-directional"). This means departing from the upper-right-triangle (causal) self-attention mask used in GPT and instead randomly masking tokens in the corpus and predicting them from the surrounding context, hence the name "masked language model" (detailed in the table). There are some other architectural departures as well, such as the activation function and the input embeddings.
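As a rough illustration of the difference (not the paper's code; sizes, ids, and variable names are assumptions), the sketch below builds a GPT-style causal mask and then applies BERT's 15% selection with the 80/10/10 corruption rule described in the paper.

```python
# Toy contrast between GPT's causal attention mask and BERT's masked-language-
# model corruption. Sizes, ids, and variable names are illustrative assumptions.
import torch

seq_len = 8

# GPT-style: block the upper-right triangle so position i only attends to j <= i.
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
# causal_mask[i, j] == True means attention from i to j is disallowed.

# BERT-style: attention is fully bi-directional; instead, 15% of tokens are
# selected for prediction, and of those 80% become [MASK], 10% become a random
# token, and 10% are left unchanged (the 80/10/10 rule from the paper).
VOCAB_SIZE, MASK_ID = 30000, 103          # [MASK] id assumed (103 in the uncased vocab)
token_ids = torch.randint(1000, VOCAB_SIZE, (seq_len,))

chosen = torch.rand(seq_len) < 0.15       # positions the model must predict
roll = torch.rand(seq_len)
corrupted = token_ids.clone()
corrupted[chosen & (roll < 0.8)] = MASK_ID                    # 80% -> [MASK]
random_ids = torch.randint(1000, VOCAB_SIZE, (seq_len,))
swap = chosen & (roll >= 0.8) & (roll < 0.9)                  # 10% -> random token
corrupted[swap] = random_ids[swap]
# the remaining 10% of chosen positions keep their original token

print(causal_mask)
print(token_ids, corrupted, chosen)
```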
This shows the pre-training task and the various branches of fine-tuning depending on the problem.
(from original paper)
This shows how the input representation is formed using 3 different embeddings.
(from original paper)
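A minimal sketch of that input representation, assuming illustrative sizes and module names: the three embeddings are simply summed (the real model also applies LayerNorm and dropout to the sum).

```python
# Minimal sketch of BERT's input representation: sum of WordPiece token,
# learned segment (sentence A/B), and learned absolute position embeddings.
# Sizes and class/argument names are illustrative assumptions.
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, max_len=512, n_segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)    # WordPiece ids -> vectors
        self.segment = nn.Embedding(n_segments, hidden)  # sentence A (0) / sentence B (1)
        self.position = nn.Embedding(max_len, hidden)    # learned absolute positions
        self.norm = nn.LayerNorm(hidden)                 # paper also adds dropout here

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        positions = positions.unsqueeze(0).expand_as(token_ids)
        summed = self.token(token_ids) + self.segment(segment_ids) + self.position(positions)
        return self.norm(summed)

emb = BertInputEmbeddings()
ids = torch.randint(0, 30522, (1, 16))
segs = torch.zeros(1, 16, dtype=torch.long)
print(emb(ids, segs).shape)  # torch.Size([1, 16, 768])
```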
This shows the different ways the model outputs are used as input into another "final" layer when fine-tuning for task-specific problems.
(from original paper)
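A hedged sketch of the two common head patterns (names and sizes are assumptions, and the encoder output is a random stand-in): sequence-level tasks read the final hidden state at the [CLS] position, while token-level tasks read every position's final hidden state.

```python
# Sketch of two typical fine-tuning heads on top of BERT's final hidden states.
import torch
import torch.nn as nn

hidden, n_classes, n_tags = 768, 2, 9

cls_head = nn.Linear(hidden, n_classes)   # sequence-level task (e.g. sentence-pair classification)
token_head = nn.Linear(hidden, n_tags)    # token-level task (e.g. NER tagging)

encoder_out = torch.randn(1, 16, hidden)  # stand-in for BERT's last-layer output

seq_logits = cls_head(encoder_out[:, 0, :])  # [CLS] is the first position
tag_logits = token_head(encoder_out)         # one prediction per token
print(seq_logits.shape, tag_logits.shape)    # (1, 2) and (1, 16, 9)
```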
This is buried in the appendix, but it is a pretty important ablation study showing the difference in accuracy between MLM and decoder-style (left-to-right) self-attention.
(from original paper)
Here we can see the trade-off between the number of layers, hidden size, and attention heads versus the decreases in perplexity / increases in accuracy.
(from original paper)