Self-attention
            ┌─┐
            │E│
            └─┘
             ▲
             │
           ┌───┐
           │ a │◀──────────────┐
           └───┘               │
             ▲                 │
             │                 │
           ┌────┐              │
           │dot │              │
    ┌─────▶│prod│◀─────┐       │
    │      └────┘      │       │
    │                  │       │
 ┌─────┐           ┌─────┐     │
 │ W_k │           │ W_q │     │
 └─────┘           └─────┘     │
    ▲                  ▲       │
    │                  │       │
   ┌─┐                ┌─┐     ┌─┐
   │E│                │E│     │E│
   └─┘                └─┘     └─┘

\(W_k\) and \(W_q\) are specific to one attention head, so if we want to capture more relations, we need more attention heads.
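To make the diagram concrete, here is a minimal NumPy sketch of single-head dot-product self-attention: the attention weights \(a\) come from the softmax of the query–key dot products and reweight the embeddings \(E\), just as in the diagram (a full Transformer also applies a value projection \(W_v\), which the diagram omits). The function names, shapes, and the \(1/\sqrt{d}\) scaling are illustrative assumptions, not from the notes.

```python
# Minimal sketch of single-head dot-product self-attention (NumPy).
# As in the diagram, the weights "a" directly reweight E (no value projection).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(E, W_q, W_k):
    """E: (seq_len, d_model) embeddings; W_q, W_k: (d_model, d_head) projections."""
    Q = E @ W_q                              # queries
    K = E @ W_k                              # keys
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # dot product of every query with every key
    a = softmax(scores, axis=-1)             # attention weights "a" in the diagram
    return a @ E                             # each position: weighted sum over all E

# Each head has its own W_q and W_k; more heads can capture more relations.
rng = np.random.default_rng(0)
E = rng.normal(size=(4, 8))                  # 4 tokens, model dim 8
W_q, W_k = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
print(self_attention(E, W_q, W_k).shape)     # (4, 8)
```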
Advantages
- direct connections between any two positions
- direct modeling of context
- easy to parallelize
- models similarity by nature
- relative attention adds expressiveness for inputs such as images, music, or graphs (see the sketch after this list)
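As a rough illustration of the last point, the sketch below adds a learned bias indexed by the relative distance \(j - i\) to the query–key scores. The names (`relative_scores`, `rel_bias`) and the specific bias scheme are assumptions, in the general spirit of relative-attention variants rather than any one paper.

```python
# Minimal sketch of relative attention: a learned bias indexed by relative
# position j - i is added to the query-key scores (illustrative assumptions).
import numpy as np

def relative_scores(Q, K, rel_bias):
    """Q, K: (n, d); rel_bias: (2n - 1,) biases for distances -(n-1) .. n-1."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    dist = np.arange(n)[None, :] - np.arange(n)[:, None]  # dist[i, j] = j - i
    return scores + rel_bias[dist + (n - 1)]               # shift to valid indices

n, d = 5, 4
rng = np.random.default_rng(1)
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))
rel_bias = rng.normal(size=(2 * n - 1,))
print(relative_scores(Q, K, rel_bias).shape)  # (5, 5)
```

Because the bias depends only on the distance between positions, the same pattern can transfer to offsets in pixels, time steps, or graph neighborhoods rather than being tied to absolute positions.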
What’s next
- non-autoregressive transformers/decoding
- self-supervision
- understanding
- multitask learning
- long-range context