UNIPELT: A Unified Framework for Parameter-Efficient Language Model Tuning

#mixture of experts #conditional computation #efficient computation #fine-tuning #transformers

Families of parameter-efficient language model tuning (PELT) methods:

  • Adapter-tuning: injects small trainable bottleneck layers into the otherwise frozen network (see the sketch after this list)
  • Prefix-tuning: compresses task-related knowledge into trainable virtual token embeddings
  • Tuning without introducing any new parameters: BitFit, which updates only the bias terms, or models like GPT-3 used with in-context (zero-shot or few-shot) learning
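
For reference, a bottleneck adapter is just a small down-projection/up-projection pair added residually on top of the frozen layer output; a minimal PyTorch sketch (module name and dimensions are illustrative assumptions, not from the paper):

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project -> nonlinearity -> up-project, added back residually.
    Only these small matrices are trained; the surrounding transformer stays frozen."""
    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 48):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```

The diagram below sketches where UNIPELT places gates around these submodules inside the frozen network: one gate scales the prefix ("hidden" virtual tokens) added at the input side, and another gate scales the bottleneck layers before the output.
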
┌──────────────────────────────────────┐
│                Output                │
└──────────────────────────────────────┘
                    ▲                   
        ┌───────────┤                   
 ┌──────┴──────┐    │                   
 │ bottleneck  │    │                   
 │   layers    │    │                   
 └─────────────┘    │                   
        ▲           │                   
        │           │                   
        │           │                   
 ┌─────────────┐    │                   
 │    Gate     │    │                   
 └─────────────┘    │                   
        ▲           │                   
        └───────────┤                   
                    │                   
┌──────────────────────────────────────┐
│            frozen modules            │
└──────────────────────────────────────┘
                   ...                  
                                        
┌──────────────────────────────────────┐
│            frozen modules            │
└──────────────────────────────────────┘
                    ▲                   
                    │                   
┌──────────────────────────────────────┐
│                Input                 │
└──────────────────────────────────────┘
                    ▲                   
           ┌────────┤                   
           │        │                   
     ┌──────────┐   │                   
     │   Gate   │   │                   
     └──────────┘   │                   
           ▲        │                   
           └────────┤                   
                    │                   
           ┌────────────────┐           
           │     hidden     │           
           └────────────────┘           

UNIPELT can be seen as a form of conditional computation, or simply as an architecture with two MoE-style gates (see the sketch below):

  • one deciding whether to introduce the prefix
  • one deciding whether to route data through the bottleneck layers, which sit sequentially in the forward pass.
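
A minimal sketch of those two gates, assuming sigmoid gates computed from the pooled layer input (the class name, the pooling choice, and the dimensions are my assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedPELTSubmodules(nn.Module):
    """Two soft gates per layer: one scales the prefix (virtual token)
    contribution, the other scales the bottleneck-adapter update."""

    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 48, prefix_len: int = 10):
        super().__init__()
        # Trainable prefix (virtual token embeddings) for this layer.
        self.prefix = nn.Parameter(torch.randn(prefix_len, hidden_dim) * 0.02)
        self.gate_prefix = nn.Linear(hidden_dim, 1)
        # Bottleneck adapter applied sequentially after the frozen sublayer.
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.gate_adapter = nn.Linear(hidden_dim, 1)

    def gated_prefix(self, hidden: torch.Tensor) -> torch.Tensor:
        # Gate in (0, 1), computed from the pooled layer input; scales the
        # prefix that the frozen layer will attend over.
        g = torch.sigmoid(self.gate_prefix(hidden.mean(dim=1)))   # (batch, 1)
        return g.unsqueeze(1) * self.prefix                       # (batch, prefix_len, hidden_dim)

    def gated_adapter(self, hidden: torch.Tensor) -> torch.Tensor:
        # Gate in (0, 1) scales the adapter's residual update, so a near-zero
        # gate effectively skips the bottleneck layers.
        g = torch.sigmoid(self.gate_adapter(hidden.mean(dim=1)))  # (batch, 1)
        update = self.up(F.gelu(self.down(hidden)))
        return hidden + g.unsqueeze(1) * update

layer = GatedPELTSubmodules()
x = torch.randn(2, 16, 768)        # (batch, seq_len, hidden_dim)
prefix = layer.gated_prefix(x)     # gated virtual tokens, shape (2, 10, 768)
out = layer.gated_adapter(x)       # hidden states after the gated adapter
```

Because the gates are soft (sigmoid) rather than hard routing decisions, every submodule is still computed; the gates only decide how much each contributes.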

Comments

  • No ablation study on the gates. Do they do what we expect them to do, i.e. is a lower or higher gate value tied to the task or to the data?
  • Counting the number of parameters is not very intuitive, since not all parameters are effectively active during training or inference when the gate values can be (near) zero (see the snippet below).
  • The efficiency measurement lacks evidence, especially since no comparison is given.
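
For reference, the usual parameter count simply sums everything trainable, which is why a gate stuck near zero still leaves its submodule in the count; a minimal sketch of that convention:

```python
import torch.nn as nn

def count_trainable_parameters(model: nn.Module) -> int:
    # Standard count: every parameter with requires_grad=True is included,
    # even if the gate scaling its submodule's output is (near) zero and the
    # submodule is therefore effectively inactive.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```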