Given a sequence of words, a language model computes a probability distribution over the next word: \(P(x^{t+1} \mid x^t, \ldots, x^1)\).
N-gram Language Models
An n-gram here is a sequence of \(N\) consecutive words, rather than sub-word tokens. Fixing \(N\) in advance means anything more than \(N-1\) words back is discarded, so distant but relevant context can never inform the prediction.
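Concretely, an \(N\)-gram model makes a Markov assumption, conditioning only on the preceding \(N-1\) words, and estimates the probability from corpus counts:

\[
P(x^{t+1} \mid x^t, \ldots, x^1) \approx P(x^{t+1} \mid x^t, \ldots, x^{t-N+2}) = \frac{\operatorname{count}(x^{t-N+2}, \ldots, x^{t+1})}{\operatorname{count}(x^{t-N+2}, \ldots, x^{t})}
\]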
Sparsity Problem:
- Count-based probabilities can be zero in the numerator: a word never seen after this context gets probability zero. Smoothing (e.g., add-one) helps.
- Count-based probabilities can be zero in the denominator: the context itself was never seen, so the estimate is undefined. Backing off to a shorter context (\(N\)-gram -> \((N-1)\)-gram, etc.) helps; see the sketch after this list.
- Both problems get worse as \(N\) grows, so \(N < 5\) is typical in practice.
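A minimal sketch of how counting, add-one smoothing, and backoff fit together. The toy corpus and the helper name `p_next` are illustrative assumptions, not part of the original notes:

```python
from collections import Counter

# Toy corpus and vocabulary (illustrative assumptions only).
corpus = "the cat sat on the mat the cat ate".split()
vocab = set(corpus)

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_next(w1, w2, w3):
    """Estimate P(w3 | w1, w2) with add-one smoothing; back off to a
    bigram estimate when the context (w1, w2) was never seen."""
    if bigrams[(w1, w2)] == 0:  # zero denominator: fall back to the (N-1)-gram
        return (bigrams[(w2, w3)] + 1) / (unigrams[w2] + len(vocab))
    return (trigrams[(w1, w2, w3)] + 1) / (bigrams[(w1, w2)] + len(vocab))

print(p_next("the", "cat", "sat"))  # seen trigram: relatively high probability
print(p_next("the", "dog", "sat"))  # unseen context: backed-off, smoothed estimate
```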
Neural Networks
Window-based neural language models:
- Still cannot use distant context: anything outside the fixed window is invisible to the model (a sketch follows this list)
- No sparsity problem
- Don’t need to store counts
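A rough sketch of a window-based neural LM, assuming PyTorch; the class name `WindowLM` and all sizes are arbitrary choices for illustration. Probabilities come from learned embeddings and weights, so there are no counts to store and no zero-count sparsity:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, window, hidden_dim = 1000, 64, 4, 128

class WindowLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(window * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context):                    # context: (batch, window) word ids
        e = self.embed(context)                    # (batch, window, embed_dim)
        h = torch.tanh(self.hidden(e.flatten(1)))  # concatenate the window embeddings
        return self.out(h)                         # logits over the next word

model = WindowLM()
logits = model(torch.randint(0, vocab_size, (2, window)))
print(logits.shape)  # torch.Size([2, 1000])
```

Note that `self.hidden` expects exactly `window * embed_dim` inputs, which is why the window cannot grow or shrink at inference time.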
Recurrent Neural Networks
- The same weight matrix is applied at each time step
- Can process input of any length
- Recurrent steps cannot be parallelized: each hidden state depends on the previous one (see the sketch below)
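A minimal sketch of the recurrence, again assuming PyTorch with illustrative sizes. The same `W_h` and `W_e` are reused at every step, which is what lets the loop run for any sequence length, and the loop itself shows why the steps are inherently sequential:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 64, 128

embed = nn.Embedding(vocab_size, embed_dim)
W_e = nn.Linear(embed_dim, hidden_dim, bias=False)  # input-to-hidden, shared
W_h = nn.Linear(hidden_dim, hidden_dim)             # hidden-to-hidden, shared
out = nn.Linear(hidden_dim, vocab_size)

tokens = torch.randint(0, vocab_size, (2, 10))      # (batch, seq_len)
h = torch.zeros(2, hidden_dim)
for t in range(tokens.size(1)):                     # step t needs h from step t-1
    h = torch.tanh(W_h(h) + W_e(embed(tokens[:, t])))
logits = out(h)                                     # distribution over the next word
print(logits.shape)                                 # torch.Size([2, 1000])
```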