- Adapter-tuning: injecting small trainable modules (bottleneck layers) into the otherwise frozen network (a minimal sketch follows this list)
- Prefix-tuning: compressing task-related knowledge into trainable virtual token embeddings
- Tuning without introducing any new parameters: BitFit, which updates only the bias terms, or in-context (zero-shot or few-shot) learning as in GPT-3
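A minimal sketch of the bottleneck-adapter idea, assuming a PyTorch-style module; the dimensions, activation, and placement inside the Transformer block are illustrative assumptions, not the configuration of any specific paper.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project, apply a nonlinearity, up-project, and add the result
    back through a residual connection. Only these few parameters are
    trained; the surrounding network stays frozen."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual keeps the frozen representation intact when the
        # adapter output is small.
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```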
┌──────────────────────────────────────┐
│                Output                │
└──────────────────────────────────────┘
                   ▲
       ┌───────────┤
┌──────┴──────┐    │
│ bottleneck  │    │
│   layers    │    │
└─────────────┘    │
       ▲           │
       │           │
┌──────┴──────┐    │
│    Gate     │    │
└─────────────┘    │
       ▲           │
       └───────────┤
                   │
┌──────────────────────────────────────┐
│            frozen modules            │
└──────────────────────────────────────┘
                  ...
┌──────────────────────────────────────┐
│            frozen modules            │
└──────────────────────────────────────┘
                   ▲
                   │
┌──────────────────────────────────────┐
│                Input                 │
└──────────────────────────────────────┘
                   ▲
         ┌─────────┤
         │         │
    ┌────┴─────┐   │
    │   Gate   │   │
    └──────────┘   │
         ▲         │
         └─────────┤
                   │
          ┌────────┴───────┐
          │     hidden     │
          └────────────────┘
UNIPELT can be seen as a different form of conditional computation, or simply as an architecture with two MoE-style gating decisions:
- whether to introduce the prefix or not;
- whether to route the data through the bottleneck layers or not, with the two gates applied in a sequential fashion (a sketch of this two-gate view follows this list).
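A rough sketch of this two-gate view, again assuming PyTorch. The module and attribute names (`GatedPELTLayer`, `prefix_gate`, `adapter_gate`, `last_gates`) and the gate parameterization (a linear projection followed by a sigmoid, averaged over tokens) are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class GatedPELTLayer(nn.Module):
    """Wrap a frozen Transformer sub-layer with two gated PELT paths:
    a gated prefix prepended on the input side and a gated bottleneck
    adapter on the output side. Gate values near zero effectively
    switch the corresponding path off."""

    def __init__(self, frozen_layer: nn.Module, hidden_dim: int,
                 prefix_len: int = 10, bottleneck_dim: int = 64):
        super().__init__()
        self.frozen_layer = frozen_layer
        for p in self.frozen_layer.parameters():
            p.requires_grad = False  # only the PELT modules are trained

        # Prefix path: trainable virtual token embeddings plus a gate.
        self.prefix = nn.Parameter(0.02 * torch.randn(prefix_len, hidden_dim))
        self.prefix_gate = nn.Linear(hidden_dim, 1)

        # Adapter path: bottleneck layers plus a gate.
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.adapter_gate = nn.Linear(hidden_dim, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_dim)
        batch = hidden.size(0)

        # Gate 1: whether to introduce the prefix (scale the virtual tokens).
        g_p = torch.sigmoid(self.prefix_gate(hidden)).mean(dim=(1, 2), keepdim=True)
        prefix = g_p * self.prefix.unsqueeze(0).expand(batch, -1, -1)
        x = torch.cat([prefix, hidden], dim=1)

        # The frozen sub-layer processes the prefixed sequence.
        out = self.frozen_layer(x)[:, prefix.size(1):, :]  # drop prefix positions

        # Gate 2: whether to route the output through the bottleneck layers.
        g_a = torch.sigmoid(self.adapter_gate(out)).mean(dim=(1, 2), keepdim=True)

        # Keep the latest gate values around for inspection/ablation.
        self.last_gates = {"prefix": g_p.detach(), "adapter": g_a.detach()}
        return out + g_a * self.up(torch.relu(self.down(out)))
```

For example, wrapping `nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)` with `GatedPELTLayer(layer, hidden_dim=768)` gives one gated block in which only the prefix, adapter, and gate parameters receive gradients.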
Comments
- There is no ablation study on the gates. Do they do what we expect them to do, i.e. is a lower or higher gate value tied to the task or to the data? (A probing sketch follows these comments.)
- Counting the number of parameters is not very intuitive, since not all parameters are effectively active during training or inference when the gate values are (near) zero.
- The efficiency measurement lacks supporting evidence, especially since no comparison is given.
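On the first point, one way to start such an ablation would be to log the gate activations per layer across tasks and inputs. A hypothetical sketch, reusing the `last_gates` attribute from the sketch above (all names are assumptions, not from the paper's code):

```python
import torch

@torch.no_grad()
def probe_gates(pelt_layers, hidden):
    """Run one batch through a stack of GatedPELTLayer modules and collect
    the per-layer gate values they store during the forward pass."""
    records = []
    for i, layer in enumerate(pelt_layers):
        hidden = layer(hidden)
        gates = layer.last_gates
        records.append({
            "layer": i,
            "prefix_gate": gates["prefix"].mean().item(),
            "adapter_gate": gates["adapter"].mean().item(),
        })
    return records
```

Comparing these values across tasks would show whether the gates behave as routing decisions (task-dependent) or merely as learned scaling factors (input-dependent).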