Paper: Shortformer: Better Language Modeling Using Shorter Inputs

First Pass

Content

  • Title, abstract, and introduction
  • All headings
  • Conclusion
  • References

Goals

  • What is the category of this paper?

    Transformer

  • What is the context of this paper?

    Making the Transformer more efficient

  • What are the assumptions of this paper?

    • A longer input is not always better
    • Starting with short sequences and then moving on to longer ones (curriculum learning) helps the model in terms of perplexity
  • What are the main contributions of this paper?

    • Position-infused attention (PIA) makes it possible to cache hidden representations, allowing attention to work across non-overlapping sequences
    • A two-stage, sequence-length-based curriculum improves the model's performance
  • Is the paper well written?

Second Pass

Content

  • Figures, diagrams, and other illustrations
  • Mark useful references

Goals

  • Summarize the paper with supporting evidence

    • Two-stage training
      • Train on short sequences first (32-1536 tokens in the paper), then on long sequences (3072 tokens); see the training-loop sketch after this list
    • PIA
      • L' is the cached sequence length, L the current input length, H the hidden size
      • Key + Pos: shape [L' + L, H], the de facto context
      • Query + Pos: shape [H, 1], the current word
      • Value: shape [L' + L, H]; position embeddings are not added to values, so cached representations stay position-free and can be reused

    weights = (Key + Pos) @ (Query + Pos)    # [L' + L, H] @ [H, 1] = [L' + L, 1]
    weights = softmax(weights)               # normalize over the L' + L positions
    output = Value.T @ weights               # [H, L' + L] @ [L' + L, 1] = [H, 1]
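
    A minimal single-head sketch of the attention step above. The tensor names, the cache interface, and the standard scaled dot product are my reconstruction for these notes, not the paper's code:

    ```python
    import torch
    import torch.nn.functional as F

    def pia_attention(query, keys, values, pos):
        """Position-infused attention for one current token (single head).

        query:  [H]           current word representation (position-free)
        keys:   [L' + L, H]   cached + current hidden states (position-free)
        values: [L' + L, H]   same states; positions are never added to values
        pos:    [L' + L, H]   position embeddings for the attended window
        """
        q = query + pos[-1]                   # add the current position to the query
        k = keys + pos                        # add positions to the keys at attention time
        scores = (k @ q) / k.size(-1) ** 0.5  # scaled dot product, shape [L' + L]
        weights = F.softmax(scores, dim=0)    # normalize over all L' + L positions
        return values.T @ weights             # [H]; keys/values stay cacheable
    ```

    Because positions enter only at attention time, the cached keys and values carry no absolute position information and can be reattended by the next non-overlapping subsequence.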
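
    And a minimal sketch of the two-stage curriculum. The step counts, the 128-token stage-1 length, and the random-subsequence sampling are placeholder assumptions (the paper explores stage-1 lengths in the 32-1536 range before moving to 3072):

    ```python
    import torch

    def train_two_stage(model, optimizer, loss_fn, corpus_ids,
                        short_len=128, long_len=3072,
                        stage1_steps=10_000, stage2_steps=10_000):
        """Stage 1 trains on short subsequences, stage 2 on long ones."""
        for seq_len, num_steps in [(short_len, stage1_steps), (long_len, stage2_steps)]:
            for _ in range(num_steps):
                # Sample a random subsequence of the current stage's length.
                start = torch.randint(0, corpus_ids.numel() - seq_len - 1, (1,)).item()
                x = corpus_ids[start : start + seq_len].unsqueeze(0)          # [1, seq_len]
                y = corpus_ids[start + 1 : start + seq_len + 1].unsqueeze(0)  # next-token targets
                logits = model(x)                                             # [1, seq_len, vocab]
                loss = loss_fn(logits.view(-1, logits.size(-1)), y.view(-1))
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    ```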

Third Pass

Content

  • Virtually re-implement
  • Identify and challenge every assumption

Goals

  • Reconstruct the entire structure
  • Identify strong and weak points
  • Identify implicit assumptions, missing citations, and issues with experimental and analytical techniques