First Pass
Content
- Title, abstract, and introduction
- All headings
- Conclusion
- References
Goals
- What is the category of this paper?
    - Transformer
- What is the context of this paper?
    - To make the Transformer more efficient
- What are the assumptions of this paper?
    - A longer input is not always better
    - Starting with short sequences and then moving on to longer sequences (curriculum learning) helps the model in terms of perplexity
- What are the main contributions of this paper?
    - Position-infused attention (PIA) makes it possible to cache hidden representations, allowing attention to work across non-overlapping sequences
    - Two-stage, sequence-length-based curriculum learning improves the model's performance
- Is the paper well-written?
Second Pass
Content
- Figures, diagrams, and other illustrations
- Mark useful references
Goals
- Summarize the paper with supporting evidence
- Two-stage training (see the curriculum sketch after this list)
    - Short sequences (32-1536 in the paper) first, then long sequences (3072 in the paper) last
- PIA (see the numpy sketch after this list)
    - L' is the cached sequence length
    - Key + Pos: [L' + L, H], the de facto context
    - Query + Pos: [H, 1], the current word
    - Value: [L' + L, H]
    - weights = (Key + Pos) @ (Query + Pos) = [L' + L, H] @ [H, 1] = [L' + L, 1]
      weights = softmax(weights)
      output = Value^T @ weights = [H, L' + L] @ [L' + L, 1] = [H, 1]
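A minimal numpy sketch of the PIA computation above, assuming a single head and a single query word: positions are added to queries and keys only, so the cached values stay position-free and reusable. The variable names, the position-indexing scheme, and the 1/sqrt(H) scaling are my additions, not taken from the paper.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

H, L_cache, L_cur = 8, 6, 4                   # hidden size, cached length L', current length L

cache = np.random.randn(L_cache, H)           # hidden states cached from the previous subsequence (no positions baked in)
current = np.random.randn(L_cur, H)           # hidden states of the current subsequence
pos = np.random.randn(L_cache + L_cur, H)     # position embeddings for the concatenated context

context = np.concatenate([cache, current])    # [L' + L, H]

# Position-infused attention for one query word (the last current token):
query = current[-1] + pos[-1]                 # [H]          position added to the query
key = context + pos                           # [L' + L, H]  positions added to the keys
value = context                               # [L' + L, H]  values stay position-free, so the cache is reusable

weights = key @ query / np.sqrt(H)            # [L' + L]     scaled dot products (scaling not shown in the notes above)
weights = softmax(weights)                    # attention over the whole cached + current context
output = value.T @ weights                    # [H]          weighted sum of position-free values
print(output.shape)                           # (8,)
```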
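And a sketch of the two-stage, sequence-length-based curriculum: train on short subsequences first, then switch to long ones. The 50/50 stage split and the 128-token short length are illustrative assumptions; the notes above only give 32-1536 (short) and 3072 (long).

```python
import numpy as np

def two_stage_batches(tokens, total_steps, short_len=128, long_len=3072, stage1_frac=0.5):
    """Yield one training subsequence per step: short in stage 1, long in stage 2.
    The lengths and the stage split here are placeholders, not the paper's schedule."""
    stage1_steps = int(total_steps * stage1_frac)
    for step in range(total_steps):
        seq_len = short_len if step < stage1_steps else long_len
        start = np.random.randint(0, len(tokens) - seq_len)
        yield tokens[start:start + seq_len]

tokens = np.arange(10_000)                    # stand-in for a tokenized corpus
for seq in two_stage_batches(tokens, total_steps=4):
    print(len(seq))                           # 128, 128, 3072, 3072; each seq would feed the usual LM training step
```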
Third Pass
Content
- Virtually re-implement
- Identify and challenge every assumption
Goals
- Reconstruct the entire structure
- Identify strong and weak points
- Identify implicit assumptions, missing citations, and issues with experimental and analytical techniques