Given a sequence of words, a language model computes a probability distribution over the next word: \(P(x^{t+1} \mid x^t, \ldots, x^1)\).
N-gram Language Models
An n-gram here is a sequence of \(N\) consecutive words, rather than sub-word tokens. Fixing \(N\) in advance means anything more than \(N-1\) words back is discarded, so distant but relevant context can never inform the prediction.
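Concretely, an \(N\)-gram model makes a Markov assumption, conditioning only on the preceding \(N-1\) words, and estimates the probability from corpus counts:

\[
P(x^{t+1} \mid x^t, \ldots, x^1) \approx P(x^{t+1} \mid x^t, \ldots, x^{t-N+2}) = \frac{\operatorname{count}(x^{t-N+2}, \ldots, x^{t+1})}{\operatorname{count}(x^{t-N+2}, \ldots, x^{t})}
\]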
Sparsity Problem:
- Count-based probabilities can be zero in the numerator: a word never seen after this context gets probability zero. Smoothing (e.g., add-one) helps.
- Count-based probabilities can be zero in the denominator: the context itself was never seen, so the estimate is undefined. Backing off to a shorter context (\(N\)-gram -> \((N-1)\)-gram, etc.) helps; see the sketch after this list.
- Both problems get worse as \(N\) grows, so \(N < 5\) is typical in practice.
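A minimal sketch of how counting, add-one smoothing, and backoff fit together. The toy corpus and the helper name `p_next` are illustrative assumptions, not part of the original notes:

```python
from collections import Counter

# Toy corpus and vocabulary (illustrative assumptions only).
corpus = "the cat sat on the mat the cat ate".split()
vocab = set(corpus)

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_next(w1, w2, w3):
    """Estimate P(w3 | w1, w2) with add-one smoothing; back off to a
    bigram estimate when the context (w1, w2) was never seen."""
    if bigrams[(w1, w2)] == 0:  # zero denominator: fall back to the (N-1)-gram
        return (bigrams[(w2, w3)] + 1) / (unigrams[w2] + len(vocab))
    return (trigrams[(w1, w2, w3)] + 1) / (bigrams[(w1, w2)] + len(vocab))

print(p_next("the", "cat", "sat"))  # seen trigram: relatively high probability
print(p_next("the", "dog", "sat"))  # unseen context: backed-off, smoothed estimate
```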
Neural Networks
Window-based neural language models:
- Still cannot use distant context: anything outside the fixed window is invisible to the model (a sketch follows this list)
- No sparsity problem
- Don’t need to store counts
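A rough sketch of a window-based neural LM, assuming PyTorch; the class name `WindowLM` and all sizes are arbitrary choices for illustration. Probabilities come from learned embeddings and weights, so there are no counts to store and no zero-count sparsity:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, window, hidden_dim = 1000, 64, 4, 128

class WindowLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(window * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context):                    # context: (batch, window) word ids
        e = self.embed(context)                    # (batch, window, embed_dim)
        h = torch.tanh(self.hidden(e.flatten(1)))  # concatenate the window embeddings
        return self.out(h)                         # logits over the next word

model = WindowLM()
logits = model(torch.randint(0, vocab_size, (2, window)))
print(logits.shape)  # torch.Size([2, 1000])
```

Note that `self.hidden` expects exactly `window * embed_dim` inputs, which is why the window cannot grow or shrink at inference time.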
Recurrent Neural Networks
- The same weight matrix is applied at each time step
- Can process input of any length
- Recurrent steps cannot be parallelized: each hidden state depends on the previous one (see the sketch below)
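A minimal sketch of the recurrence, again assuming PyTorch with illustrative sizes. The same `W_h` and `W_e` are reused at every step, which is what lets the loop run for any sequence length, and the loop itself shows why the steps are inherently sequential:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 64, 128

embed = nn.Embedding(vocab_size, embed_dim)
W_e = nn.Linear(embed_dim, hidden_dim, bias=False)  # input-to-hidden, shared
W_h = nn.Linear(hidden_dim, hidden_dim)             # hidden-to-hidden, shared
out = nn.Linear(hidden_dim, vocab_size)

tokens = torch.randint(0, vocab_size, (2, 10))      # (batch, seq_len)
h = torch.zeros(2, hidden_dim)
for t in range(tokens.size(1)):                     # step t needs h from step t-1
    h = torch.tanh(W_h(h) + W_e(embed(tokens[:, t])))
logits = out(h)                                     # distribution over the next word
print(logits.shape)                                 # torch.Size([2, 1000])
```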