Reflections on word representations
So far we have used only one representation per word type: Lecture 1 Overview, Lecture 2 Word Vectors and Word Senses, and Lecture 12 Subword Models.
Randomly initialized and trained word embeddings work if the dataset is large enough, but pre-trained word vectors help because they are trained on much more data and cover a much larger vocabulary.
UNK: using the same UNK token for all OOV words makes them indistinguishable from each other (a character-level model can remedy that).
Word Sense Disambiguation: one word should have different representations for its different senses and aspects of meaning, which a single static vector cannot capture.
We already have a way to model a word in context: simply treat an LSTM's per-token hidden states as contextual representations, as sketched below.
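A minimal PyTorch sketch of both points, using random stand-ins for real pre-trained vectors and a toy vocabulary: load pre-trained word vectors into an embedding layer, run a biLSTM over a sentence, and read off the per-token hidden states as contextual representations.

```python
import torch
import torch.nn as nn

# Stand-in for real pre-trained vectors (e.g. GloVe) loaded from disk.
pretrained = torch.randn(10000, 300)
embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)  # fine-tunable

# A biLSTM encoder whose per-token hidden states depend on the whole sentence.
bilstm = nn.LSTM(input_size=300, hidden_size=256,
                 batch_first=True, bidirectional=True)

token_ids = torch.randint(0, 10000, (1, 6))      # one toy sentence of 6 word ids
hidden_states, _ = bilstm(embedding(token_ids))  # shape (1, 6, 512)
# hidden_states[:, i] is a context-dependent representation of token i.
```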
Pre-ELMo and ELMo
TagLM: combine char embeddings + word embeddings with pre-trained LM embeddings/hidden states as the tagger's input. Using only the LM embeddings wasn't great.
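A sketch of the TagLM-style input, assuming the three per-token vectors have already been computed; the point is simply the concatenation along the feature dimension before the sequence tagger sees them.

```python
import torch

# Hypothetical per-token features for one 6-token sentence (batch size 1).
char_emb = torch.randn(1, 6, 50)     # from a char-level CNN/RNN
word_emb = torch.randn(1, 6, 300)    # from pre-trained word vectors
lm_hidden = torch.randn(1, 6, 512)   # frozen hidden states of a pre-trained LM

# TagLM concatenates everything and feeds the result to the sequence tagger.
tagger_input = torch.cat([char_emb, word_emb, lm_hidden], dim=-1)  # (1, 6, 862)
```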
ELMo: a biLM (forward and backward models) with 2 biLSTM layers, using a learned weighted sum of all layers' hidden states instead of only one of them. The lower layer is better for lower-level syntax: POS tagging, dependency parsing, and NER; the higher layer is better for higher-level semantics: sentiment, SRL, QA, and NLI.
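A sketch of ELMo's layer mixing, assuming the biLM's layer outputs are already available as tensors: a softmax-normalized scalar weight per layer plus an overall scale \(\gamma\), both learned by the downstream task.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Task-learned weighted sum of biLM layer outputs (ELMo-style)."""
    def __init__(self, num_layers):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers))  # softmax-normalized weights
        self.gamma = nn.Parameter(torch.ones(1))        # overall task-specific scale

    def forward(self, layer_outputs):
        # layer_outputs: list of (batch, seq_len, dim) tensors, one per biLM layer
        weights = torch.softmax(self.s, dim=0)
        return self.gamma * sum(w * h for w, h in zip(weights, layer_outputs))

# Toy usage: token layer + 2 biLSTM layers, 6 tokens, 1024-dim states.
layers = [torch.randn(1, 6, 1024) for _ in range(3)]
elmo_vectors = ScalarMix(num_layers=3)(layers)  # (1, 6, 1024)
```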
ULMfit
Pre-train an LSTM LM once, then fine-tune it on each target task together with a classification head (ULMFiT adds tricks such as discriminative learning rates and gradual unfreezing).
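A rough sketch of this fine-tuning setup, with randomly initialized modules standing in for a real pre-trained encoder; of ULMFiT's tricks, only the discriminative learning rates are shown here.

```python
import torch
import torch.nn as nn

# Stand-ins for a pre-trained LM encoder and a freshly initialized classifier head.
emb = nn.Embedding(10000, 400)
encoder = nn.LSTM(400, 1150, num_layers=3, batch_first=True)  # would be loaded pre-trained
head = nn.Linear(1150, 2)                                     # new task-specific head

# Discriminative learning rates: earlier (more general) layers take smaller steps.
optimizer = torch.optim.Adam([
    {"params": emb.parameters(),     "lr": 1e-4},
    {"params": encoder.parameters(), "lr": 3e-4},
    {"params": head.parameters(),    "lr": 1e-3},
])

token_ids = torch.randint(0, 10000, (1, 20))
states, _ = encoder(emb(token_ids))
logits = head(states[:, -1])   # classify from the last time step's state
```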
Transformer
We need parallelizable computation, which RNNs can't provide because of their sequential dependence, and we still need attention.
As \(d_k\), the dimension of the queries and keys, gets larger, the variance of \(q^\top k\) also grows (roughly in proportion to \(d_k\)). Large scores push the softmax into regions where gradients become tiny, so the authors scale the dot product by \(\frac{1}{\sqrt{d_k}}\).
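A minimal sketch of scaled dot-product attention on toy tensors; the line that matters here is the division by \(\sqrt{d_k}\) before the softmax.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """softmax(q k^T / sqrt(d_k)) v, for (batch, positions, d_k) tensors."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # keep score variance ~1
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

q = torch.randn(1, 6, 64)
k = torch.randn(1, 6, 64)
v = torch.randn(1, 6, 64)
out = scaled_dot_product_attention(q, k, v)  # (1, 6, 64)
```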
BERT
MLM, Masked Language Modeling: predict masked tokens using bidirectional context. NSP, Next Sentence Prediction: predict whether the second sentence actually follows the first, with segment embeddings distinguishing the two.
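A sketch of the MLM corruption step on toy token ids, following the 80/10/10 recipe from the BERT paper; the `mask_id` value and the `-100` "ignore" label are assumptions matching common PyTorch conventions, not anything fixed by the paper.

```python
import torch

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15):
    """BERT-style masking sketch: pick ~15% of positions as prediction targets;
    of those, 80% become [MASK], 10% a random token, 10% are left unchanged."""
    labels = token_ids.clone()
    is_target = torch.rand(token_ids.shape) < mask_prob
    labels[~is_target] = -100                  # ignored by the MLM loss

    corrupted = token_ids.clone()
    roll = torch.rand(token_ids.shape)
    corrupted[is_target & (roll < 0.8)] = mask_id                    # 80% -> [MASK]
    random_ids = torch.randint(0, vocab_size, token_ids.shape)
    swap = is_target & (roll >= 0.8) & (roll < 0.9)                  # 10% -> random token
    corrupted[swap] = random_ids[swap]
    # the remaining 10% of targets keep their original token
    return corrupted, labels

tokens = torch.randint(5, 30000, (1, 12))
inputs, labels = mask_tokens(tokens, mask_id=103, vocab_size=30000)
```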
Fine-tuning for downstream tasks, as in Lecture 13 Contextual Word Embeddings.