Lecture: BERT and Other Pre-trained Language Models

ELMo

Dynamic (contextual) embeddings are taken not from a static embedding layer but from two unidirectional LSTM language models, one reading the sentence forward and one backward.
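
A minimal sketch of the idea in PyTorch (layer sizes and names are made up, and unlike real ELMo the two directions are not trained as separate language models here): the contextual vector for each token comes from the LSTM hidden states, not from the static embedding table.

```python
import torch
import torch.nn as nn

class TinyELMo(nn.Module):
    """Sketch: contextual embeddings from forward + backward LSTMs."""
    def __init__(self, vocab_size=10000, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)      # static vectors
        # bidirectional=True runs one forward and one backward LSTM per layer
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=2,
                            bidirectional=True, batch_first=True)

    def forward(self, token_ids):                  # (batch, seq_len)
        static = self.embed(token_ids)             # context-independent
        contextual, _ = self.lstm(static)          # (batch, seq_len, 2*hidden_dim)
        return contextual                          # context-dependent ("dynamic")

emb = TinyELMo()(torch.randint(0, 10000, (2, 7)))  # -> shape (2, 7, 512)
```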

Transformer

  • Multi-head self-attention (see the sketch after this list)
    • No locality bias: every token attends to every other token, an \(N \times N\) attention map
    • One input can carry multiple types of information, one per head
  • Feed-forward layers
  • Layer norm and residuals
  • Positional embeddings
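
A minimal sketch of one encoder block with the pieces listed above, assuming PyTorch (dimensions are made up; real models differ in details such as pre- vs. post-layer-norm):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Sketch: multi-head self-attention + feed-forward, each wrapped in residual + layer norm."""
    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        # every token attends to every other token (N x N), so no locality bias
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)         # residual + layer norm
        x = self.norm2(x + self.ff(x))       # residual + layer norm
        return x

x = torch.randn(2, 10, 256)     # positional embeddings would be added before the first block
print(EncoderBlock()(x).shape)  # torch.Size([2, 10, 256])
```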

BERT

A naive bidirectional LM would let each word see itself through the context, making next-word prediction trivial; BERT therefore trains with a masked language modeling (MLM) objective instead.

  • Masked LM: predict 15% of the tokens in each forward pass (see the masking sketch after this list):
    • 80% of them are replaced with [MASK]
    • 10% of them are kept the same
    • 10% of them are replaced with random words
  • Next sentence prediction (not so important compared to MLM, according to # On Losses for Modern Language Models), but it still looks helpful for NLI in their ablation study
  • Token embeddings + Segment embeddings + Position embeddings (summed per position; see the second sketch after this list)
  • One model to rule them all
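
A minimal sketch of the 80/10/10 masking scheme, assuming PyTorch (the [MASK] id, vocabulary size, and helper name are made up for illustration):

```python
import torch

def mask_tokens(token_ids, mask_id=103, vocab_size=30522, mlm_prob=0.15):
    """Sketch of BERT-style masking: returns corrupted inputs and MLM labels."""
    labels = token_ids.clone()
    is_target = torch.rand(token_ids.shape) < mlm_prob    # ~15% prediction targets
    labels[~is_target] = -100                              # ignored by the MLM loss

    corrupted = token_ids.clone()
    r = torch.rand(token_ids.shape)
    corrupted[is_target & (r < 0.8)] = mask_id             # 80% -> [MASK]
    random_ids = torch.randint(vocab_size, token_ids.shape)
    use_random = is_target & (r >= 0.8) & (r < 0.9)        # 10% -> random token
    corrupted[use_random] = random_ids[use_random]         # remaining 10% kept as-is
    return corrupted, labels

inputs, labels = mask_tokens(torch.randint(1000, 2000, (2, 12)))
```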
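
And a sketch of how the three embedding types are combined, again assuming PyTorch with made-up sizes: they are simply summed per position.

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    """Sketch: BERT input = token + segment + position embeddings, summed."""
    def __init__(self, vocab_size=30522, max_len=512, d_model=256):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.seg = nn.Embedding(2, d_model)         # sentence A vs. sentence B
        self.pos = nn.Embedding(max_len, d_model)   # learned absolute positions

    def forward(self, token_ids, segment_ids):      # both (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.seg(segment_ids) + self.pos(positions)
```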

RoBERTa

  • More data
  • Better and longer training

XLNet

  • Relative positional embedding
  • Permutation language model
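
A minimal sketch of the permutation-LM idea (this only shows the attention-masking part; real XLNet also needs two-stream attention and its relative positional encodings): sample a random factorization order and let each position attend only to positions that come earlier in that order.

```python
import torch

def permutation_attention_mask(seq_len):
    """Sketch: mask[i, j] = True means position i may attend to position j."""
    order = torch.randperm(seq_len)              # random factorization order
    rank = torch.empty(seq_len, dtype=torch.long)
    rank[order] = torch.arange(seq_len)          # rank[i] = where i sits in the order
    # i may attend to j iff j comes strictly earlier in the sampled order
    return rank.unsqueeze(1) > rank.unsqueeze(0)

mask = permutation_attention_mask(6)             # a new (6, 6) mask per sample
```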

ALBERT

  • Parameter sharing
  • Factorized embedding
  • Smaller but not faster
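
A minimal sketch of the two tricks, assuming PyTorch (sizes are made up): factorize the vocabulary embedding through a small bottleneck, and reuse one encoder block for all layers, which saves parameters but not compute.

```python
import torch
import torch.nn as nn

class TinyALBERT(nn.Module):
    """Sketch: factorized embedding + cross-layer parameter sharing."""
    def __init__(self, vocab_size=30000, emb_dim=128, d_model=768, n_layers=12):
        super().__init__()
        # factorized embedding: V x E plus E x H instead of a full V x H table
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.project = nn.Linear(emb_dim, d_model)
        # a single block whose parameters are shared across all layers
        self.block = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)
        self.n_layers = n_layers

    def forward(self, token_ids):
        x = self.project(self.embed(token_ids))
        for _ in range(self.n_layers):   # still n_layers of compute:
            x = self.block(x)            # "smaller but not faster"
        return x
```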

T5

  • Extensive ablation study

Electra

  • A discriminator trained to tell the generator's mask-filling predictions from the original tokens (replaced-token detection)
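
A minimal sketch of the discriminator's loss under that setup, assuming PyTorch (the logits are a hypothetical per-token real-vs-replaced score):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(disc_logits, original_ids, corrupted_ids):
    """Sketch of ELECTRA-style replaced-token detection.
    disc_logits: (batch, seq_len) score per token, higher = "replaced"."""
    labels = (original_ids != corrupted_ids).float()   # 1 = replaced by the generator
    # binary classification over *all* positions, not just the ~15% masked ones
    return F.binary_cross_entropy_with_logits(disc_logits, labels)
```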

Distillation

For small models, distilling from a large teacher works better than pre-training + fine-tuning the small model directly.
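
A minimal sketch of a standard distillation loss, assuming PyTorch (the temperature T and mixing weight alpha are made-up hyperparameters): the student matches the teacher's softened output distribution in addition to the hard labels.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Sketch: KL against the teacher's softened outputs + cross-entropy on hard labels."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```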