
# BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

## Summary

**Model Name:** BERT

**Model Type (Encoder-Decoder, etc.):** Encoder-only

**Pre-train Objective:**

- Masked Language Modeling (MLM): ~15% of tokens are chosen; of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. A shallow output head then predicts the original tokens at the chosen positions (see the sketch just after this summary).
- Next Sentence Prediction (NSP): a binary classification task that predicts whether two sequences follow each other in the corpus (useful for Q&A, etc.). Positive and negative pairs are sampled 50/50.
- The training loss is the mean of the MLM and NSP likelihoods.

**Tokenization:**

- WordPiece tokenization (see the original paper and the Hugging Face explanation).
- Token breakdown: a [CLS] token (useful for many-to-one fine-tuned tasks such as classification) + WordPiece tokens + a [SEP] token after each sentence.
- Maximum of 512 tokens.

**Vocab Size:** ~30k tokens

**OOV Handling:** Greedy decomposition of an out-of-vocabulary token into sub-words until every piece is found in the vocabulary.

**Embeddings:** Sum of token embeddings (WordPiece) + segment embeddings (learned) + absolute position embeddings.

**Attention:** Scaled dot-product self-attention (note: it is advised to pad inputs on the right rather than the left, since the position embeddings are absolute).

**Activations:** GELU (avoids the dying-ReLU problem, where a node can get stuck at 0 on negative inputs, stop learning, and never recover).

**Parameters:**

- BERT-base: 12 layers (transformer blocks), 12 attention heads, hidden size 768 -> ~110M parameters.
- BERT-large: 24 layers, 16 attention heads, hidden size 1024 -> ~340M parameters.
- Generally, embed_size (E) == hidden_size (H), the feed-forward size is 4H, and the number of attention heads is H/64. To see the math on how the total number of parameters is calculated, check out the comment on the BERT GitHub repository.

**Training:**

- Adam with L2 weight decay.
- The learning rate is warmed up over the first 10k steps to a peak of 1e-4, then linearly decayed.
- Models are pre-trained for ~1M updates.
- No layers are frozen; the same learning rate is used throughout.

**Pre-Train Data:** BookCorpus and English Wikipedia (~16 GB uncompressed)

**Batch Size:** 256 sequences, maximum length 512
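
As a concrete illustration of the MLM corruption rule above, here is a minimal Python sketch; the mask-token id, vocabulary size, and the use of -100 as an "ignore" label are assumptions for illustration, not values taken from the paper.

```python
import random

MASK_ID = 103          # hypothetical [MASK] id, chosen for illustration
VOCAB_SIZE = 30000     # ~30k WordPiece vocabulary

def mask_tokens(token_ids, mask_prob=0.15):
    """BERT-style MLM corruption: pick ~15% of positions, then
    80% -> [MASK], 10% -> random token, 10% -> left unchanged."""
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)   # -100 = position ignored by the loss
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok            # predict the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID                       # 80%: [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)  # 10%: random token
            # else: 10% keep the token unchanged
    return inputs, labels
```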

## TL;DR

Recall that GPT takes the original Transformer encoder-decoder model for neural translation and crafts a left-to-right, decoder-only architecture. This amounts to keeping only the self-attention mechanism of the decoder stack from the original paper (which used two attention types: encoder-decoder cross-attention, plus self-attention within both the encoder and the decoder; GPT keeps the decoder's masked self-attention). From the authors' perspective, the main drawback of GPT's uni-directional approach is that each position cannot use information from the tokens that follow it. They instead propose an encoder-only model in which every position attends to context from both directions ("bi-directional"). This means departing from GPT's upper-right-triangle (causal) self-attention mask and instead randomly masking tokens in the corpus and predicting them from the surrounding context, hence the name "masked language model" (detailed in the summary above; a small sketch contrasting the two masks follows). There are a few other architectural departures as well, such as the activation function and the input embeddings.
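
To make the mask difference concrete, here is a small illustrative sketch (plain NumPy, with an arbitrary sequence length; not code from the paper) contrasting GPT's causal mask with BERT's fully visible attention pattern:

```python
import numpy as np

seq_len = 5

# GPT-style causal mask: position i may attend only to positions <= i,
# i.e. the upper-right triangle is disallowed.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# BERT-style bidirectional pattern: every position attends to every position
# (padding aside); the "masking" happens on the input tokens instead.
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)

print(causal_mask.astype(int))
print(bidirectional_mask.astype(int))
```

Because every position can see every other position, BERT cannot be trained with a plain next-token objective; corrupting the inputs (MLM) is what prevents each token from trivially seeing itself.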

## Art

### Fig 1: Pre-training vs. Fine-Tuning

This shows the pre-training task and the various branches of fine-tuning depending on the problem.

fig 1

(from original paper)

### Fig 2: Input Embedding Construction

This shows how the input representation is formed by summing 3 different embeddings (a rough sketch follows the figure).

fig 2

(from original paper)
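
A rough sketch of the construction in Fig 2: the three embedding tables are looked up and summed element-wise. The sizes mirror BERT-base; the random initialization and example token ids are purely illustrative.

```python
import numpy as np

vocab_size, max_len, hidden = 30000, 512, 768   # BERT-base-like sizes

# Three learned lookup tables (randomly initialized here for illustration).
token_emb    = np.random.randn(vocab_size, hidden) * 0.02
segment_emb  = np.random.randn(2, hidden) * 0.02        # sentence A / B
position_emb = np.random.randn(max_len, hidden) * 0.02  # absolute positions

def embed(token_ids, segment_ids):
    """Input representation = token + segment + position embeddings, summed."""
    positions = np.arange(len(token_ids))
    return (token_emb[token_ids]
            + segment_emb[segment_ids]
            + position_emb[positions])

x = embed(np.array([101, 7592, 102]), np.array([0, 0, 0]))  # hypothetical ids
print(x.shape)  # (3, 768)
```

In the actual model, the summed representation is additionally passed through layer normalization and dropout before the first transformer block.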

### Fig 4: Different Fine-Tuning Constructions

This shows the different ways the model outputs are fed into a final, task-specific layer when fine-tuning for downstream problems (a rough sketch of the classification case follows the figure).

fig 4

(from original paper)
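
For the many-to-one case in Fig 4 (e.g. sentence classification), the final hidden state of the [CLS] token is fed into a small task-specific layer. A minimal NumPy sketch, with hypothetical shapes and an untrained weight matrix:

```python
import numpy as np

hidden, num_labels = 768, 2

# Hypothetical task head: a single linear layer over the final [CLS] vector.
W = np.random.randn(hidden, num_labels) * 0.02
b = np.zeros(num_labels)

def classify(sequence_output):
    """sequence_output: (seq_len, hidden) final-layer states from the encoder.
    The [CLS] token sits at position 0; its vector feeds the task head."""
    cls_vector = sequence_output[0]
    logits = cls_vector @ W + b
    exp = np.exp(logits - logits.max())   # softmax over labels
    return exp / exp.sum()

probs = classify(np.random.randn(128, hidden))  # fake encoder output
print(probs)
```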

### Fig 5: Left-to-Right versus MLM

This is buried in the appendix, but it is an important ablation study showing the difference in accuracy between the MLM objective and decoder-style (left-to-right) self-attention.

fig 5

(from original paper)

### Table 6: Model Size vs. Accuracy/Perplexity on Held-Out Data

Here we can see how increasing the number of layers, hidden size, and attention heads decreases held-out perplexity and increases accuracy (a back-of-the-envelope parameter-count sketch follows).

fig 6

(from original paper)
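
Using the rule of thumb from the summary (E == H, feed-forward size 4H, H/64 attention heads), a back-of-the-envelope count reproduces the ~110M / ~340M figures; ignoring biases and LayerNorm weights is a simplifying assumption for this estimate.

```python
def approx_params(vocab=30522, max_len=512, hidden=768, layers=12):
    """Rough BERT parameter count (weight matrices only; biases and
    LayerNorm parameters are small enough to ignore for an estimate)."""
    embeddings = (vocab + max_len + 2) * hidden      # token + position + segment
    attention = 4 * hidden * hidden                  # Q, K, V, and output projections
    feed_forward = 2 * hidden * (4 * hidden)         # H -> 4H -> H
    return embeddings + layers * (attention + feed_forward)

print(f"BERT-base:  ~{approx_params(hidden=768, layers=12) / 1e6:.0f}M")   # ~109M
print(f"BERT-large: ~{approx_params(hidden=1024, layers=24) / 1e6:.0f}M")  # ~334M
```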