Lecture 8 Translation, Seq2Seq, Attention

Machine Translation

  • Early Machine Translation: Rule-based and dictionary-based
  • Statistical Machine Translation: Find the best translation \(y\) of input \(x\) via \(\text{argmax}_yP(y|x) = \text{argmax}_yP(x|y)P(y)\), where \(P(x|y)\) is the translation model and \(P(y)\) is the language model (derivation sketched in the next section)

Statistical Machine Translation

  • \(P(x|y)\) is learned approximately by learning \(P(x, a|y)\) where \(a\) is an alignment
  • Decoding is the process of finding the highest-scoring translation while pruning low-probability branches of the search
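
Worked through from Bayes' rule (the denominator \(P(x)\) drops out because it does not depend on \(y\)), together with the alignment decomposition above:

\[
\text{argmax}_y P(y|x) = \text{argmax}_y \frac{P(x|y)P(y)}{P(x)} = \text{argmax}_y P(x|y)P(y), \qquad P(x|y) = \sum_a P(x, a|y)
\]

where \(a\) ranges over word alignments between source and target words.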

Neural Machine Translation

  • The encoder captures a representation of the source sentence in its last hidden state
  • That hidden state initializes the decoder, which takes a START token as its first input (see the sketch after this list)
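
A minimal PyTorch sketch of this encoder-decoder setup (the module names, GRU cells, and vocabulary/hidden sizes are illustrative assumptions, not the lecture's exact architecture):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal GRU encoder-decoder: the encoder's last hidden state
    initializes the decoder, which is fed the target shifted right
    (teacher forcing) during training."""
    def __init__(self, src_vocab, tgt_vocab, emb=256, hidden=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_in_ids):
        # Encode: keep only the final hidden state as the sentence representation.
        _, h_last = self.encoder(self.src_emb(src_ids))           # (1, B, H)
        # Decode: condition every step on that single hidden state.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in_ids), h_last)
        return self.out(dec_out)                                   # (B, T, tgt_vocab)

# Toy usage: batch of 2 source sentences (length 5) and decoder inputs (length 6).
model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (2, 5))
tgt_in = torch.randint(0, 1200, (2, 6))   # would begin with a START token in practice
logits = model(src, tgt_in)               # (2, 6, 1200)
```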

Sequence to Sequence

  • Many applications: Summarization, dialogue, parsing, and code generation
  • It can be viewed as a conditional language model
  • Training is done end-to-end with teacher forcing (the decoder is fed the gold previous token at each step); see the factorization below
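
As a conditional language model, the decoder factorizes the target probability one token at a time; teacher forcing means the gold prefix \(y_1, \ldots, y_{t-1}\) is used on the right-hand side during training rather than the model's own predictions:

\[
P(y|x) = \prod_{t=1}^{T} P(y_t \mid y_1, \ldots, y_{t-1}, x)
\]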

Decoding

  • Greedy decoding: Take each output token as the input for the next step; however, the per-step (local) argmax does not yield the global argmax
  • Beam search:
    • Keep track of K most probable translations
    • A hypothesis is complete once it generates an END token
    • Continue until \(t\) steps or \(n\) complete hypotheses
    • Longer hypotheses accumulate lower (more negative) log-probability scores, so normalize by length (a sketch follows this list)
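
A simplified beam-search sketch; the step function, start_id, end_id, and the length-normalized final score are illustrative assumptions about the decoder interface, not the lecture's exact formulation:

```python
import math
from typing import Callable, List, Tuple

def beam_search(step: Callable[[List[int]], List[Tuple[int, float]]],
                start_id: int, end_id: int,
                beam_size: int = 4, max_steps: int = 50,
                n_complete: int = 4) -> List[int]:
    """step(prefix) -> candidate (token_id, log_prob) pairs for the next token.
    Keeps the K most probable partial hypotheses; a hypothesis is finished when
    END is generated; stops after max_steps or n_complete finished hypotheses."""
    beams = [([start_id], 0.0)]        # (token sequence, summed log-probability)
    finished = []
    for _ in range(max_steps):
        candidates = []
        for seq, score in beams:
            for tok, logp in step(seq):
                candidates.append((seq + [tok], score + logp))
        # Keep only the K best partial hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == end_id:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if len(finished) >= n_complete or not beams:
            break
    finished.extend(beams)             # fall back to unfinished hypotheses if needed
    # Normalize by length so longer hypotheses are not unfairly penalized.
    return max(finished, key=lambda c: c[1] / len(c[0]))[0]
```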

Advantages

  • Better performance: more fluent output, better use of context, and better at learning phrase-level patterns
  • Simplicity

Disadvantages

  • Less interpretable
  • Harder to control (difficult to specify rules or constraints for the translation)

Evaluation

  • BLEU: geometric mean of modified n-gram precisions, multiplied by a brevity penalty for translations shorter than the reference
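
A rough sentence-level sketch of the computation (real evaluation uses corpus-level BLEU, usually with smoothing, e.g. via sacreBLEU):

```python
import math
from collections import Counter

def bleu(candidate: list, reference: list, max_n: int = 4) -> float:
    """Geometric mean of modified n-gram precisions times a brevity penalty."""
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n])
                              for i in range(len(candidate) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        # Clip candidate counts by reference counts (modified precision).
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if overlap == 0:
            return 0.0                 # any zero precision zeroes unsmoothed BLEU
        log_prec_sum += math.log(overlap / total) / max_n
    # Brevity penalty: punish candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_prec_sum)

print(round(bleu("the cat sat on the mat".split(),
                 "the cat sat on a mat".split()), 3))   # ~0.537
```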

Problems

  • OOV
  • Domain mismatch
  • Long-distance context
  • Biases
  • Commonsense
  • Hallucination, repetition

Attention

Vanilla seq2seq has an information bottleneck: the decoder must work from the last encoder hidden state alone.

At each decoding step, attention lets the model focus directly on relevant parts of the input: the single last hidden state is replaced by a weighted sum over all encoder hidden states, and these shortcut connections also help mitigate vanishing gradients. Attention additionally provides better interpretability, since the weights show which source words the model attends to. A minimal sketch follows.
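
One dot-product attention step during decoding (shapes and names are illustrative): the decoder state scores every encoder hidden state, the scores are softmax-normalized, and the attention output is the resulting weighted sum, combined with the decoder state before predicting the next token.

```python
import torch
import torch.nn.functional as F

def attention_step(s: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
    """s: decoder hidden state, shape (hidden,)
       H: encoder hidden states, shape (src_len, hidden)
       Returns the attention output concatenated with s."""
    scores = H @ s                      # e_i = s^T h_i, shape (src_len,)
    alpha = F.softmax(scores, dim=0)    # attention distribution over source positions
    context = alpha @ H                 # weighted sum of encoder hidden states
    return torch.cat([context, s])      # combined with s for the next-token prediction

# Toy usage: 7 source positions, hidden size 512.
H = torch.randn(7, 512)
s = torch.randn(512)
out = attention_step(s, H)              # shape (1024,)
```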

Variants

  • dot-product attention scores: \(e_i = s^Th_i\)
  • multiplicative attention: \(e_i = s^TWh_i\)
  • additive attention: \(e_i = v^T\text{tanh}(W_1h_i + W_2s)\)
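
The three scoring variants as one small module (a sketch; the dimension names and the variant flag are placeholder choices):

```python
import torch
import torch.nn as nn

class AttentionScores(nn.Module):
    """Computes scores e_i for all encoder states h_i given decoder state s,
    using one of the three variants listed above."""
    def __init__(self, d_h: int, d_s: int, d_attn: int, variant: str = "dot"):
        super().__init__()
        self.variant = variant
        self.W = nn.Linear(d_h, d_s, bias=False)        # multiplicative: s^T W h_i
        self.W1 = nn.Linear(d_h, d_attn, bias=False)    # additive
        self.W2 = nn.Linear(d_s, d_attn, bias=False)
        self.v = nn.Linear(d_attn, 1, bias=False)

    def forward(self, s: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        # s: (d_s,), H: (src_len, d_h) -> scores: (src_len,)
        if self.variant == "dot":          # requires d_h == d_s
            return H @ s
        if self.variant == "multiplicative":
            return self.W(H) @ s
        # additive: v^T tanh(W1 h_i + W2 s)
        return self.v(torch.tanh(self.W1(H) + self.W2(s))).squeeze(-1)
```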