Self-attention
            ┌─┐
            │E│
            └─┘
             ▲
             │
           ┌───┐
           │ a │◀──────────────┐
           └───┘               │
             ▲                 │
             │                 │
           ┌────┐              │
           │dot │              │
    ┌─────▶│prod│◀─────┐       │
    │      └────┘      │       │
    │                  │       │
 ┌─────┐           ┌─────┐     │
 │ W_k │           │ W_q │     │
 └─────┘           └─────┘     │
    ▲                  ▲       │
    │                  │       │
   ┌─┐                ┌─┐     ┌─┐
   │E│                │E│     │E│
   └─┘                └─┘     └─┘

\(W_k\) and \(W_q\) are specific to one attention head, so if we want to capture more relations, we need more attention heads.
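To make the diagram concrete, here is a minimal NumPy sketch of single-head dot-product self-attention: the attention weights \(a\) come from the softmax of the query–key dot products and reweight the embeddings \(E\), just as in the diagram (a full Transformer also applies a value projection \(W_v\), which the diagram omits). The function names, shapes, and the \(1/\sqrt{d}\) scaling are illustrative assumptions, not from the notes.

```python
# Minimal sketch of single-head dot-product self-attention (NumPy).
# As in the diagram, the weights "a" directly reweight E (no value projection).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(E, W_q, W_k):
    """E: (seq_len, d_model) embeddings; W_q, W_k: (d_model, d_head) projections."""
    Q = E @ W_q                              # queries
    K = E @ W_k                              # keys
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # dot product of every query with every key
    a = softmax(scores, axis=-1)             # attention weights "a" in the diagram
    return a @ E                             # each position: weighted sum over all E

# Each head has its own W_q and W_k; more heads can capture more relations.
rng = np.random.default_rng(0)
E = rng.normal(size=(4, 8))                  # 4 tokens, model dim 8
W_q, W_k = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
print(self_attention(E, W_q, W_k).shape)     # (4, 8)
```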
Advantages
- direct connections between any two positions
- direct modeling of context
- easy to parallelize
- models similarity by nature
- relative attention adds expressiveness for inputs such as images, music, or graphs (see the sketch after this list)
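As a rough illustration of the last point, the sketch below adds a learned bias indexed by the relative distance \(j - i\) to the query–key scores. The names (`relative_scores`, `rel_bias`) and the specific bias scheme are assumptions, in the general spirit of relative-attention variants rather than any one paper.

```python
# Minimal sketch of relative attention: a learned bias indexed by relative
# position j - i is added to the query-key scores (illustrative assumptions).
import numpy as np

def relative_scores(Q, K, rel_bias):
    """Q, K: (n, d); rel_bias: (2n - 1,) biases for distances -(n-1) .. n-1."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    dist = np.arange(n)[None, :] - np.arange(n)[:, None]  # dist[i, j] = j - i
    return scores + rel_bias[dist + (n - 1)]               # shift to valid indices

n, d = 5, 4
rng = np.random.default_rng(1)
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))
rel_bias = rng.normal(size=(2 * n - 1,))
print(relative_scores(Q, K, rel_bias).shape)  # (5, 5)
```

Because the bias depends only on the distance between positions, the same pattern can transfer to offsets in pixels, time steps, or graph neighborhoods rather than being tied to absolute positions.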
What’s next
- non-autoregressive transformers/decoding
- self-supervision
- understanding
- multitask learning
- long-range context