data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
Problem
Different training objectives for each modality create performance gaps as well as engineering friction.
Solution
A unified objective – predicting latent representations of masked tokens/spans/patches, produced by an all-seeing teacher model:
- target: representations from the teacher model, averaged over all layers or the last \(K\) layers; the teacher's weights are a moving average of the student's, and the teacher sees the full (unmasked) input, so its hidden representations are contextualized
- prediction: hidden representations from the student model, whose input is masked (image -> patches, speech -> spans, text -> words)
- objective: regression (Smooth L1 loss in the paper) on the masked positions
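The target/prediction/objective recipe above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: shapes, the toy data, and the helper names (`build_targets`, `smooth_l1`) are assumptions; only the structure (average the last \(K\) normalized teacher layers, regress the student's masked outputs onto them) follows the notes.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each time step's feature vector; target normalization
    # is one of the anti-collapse tricks discussed below.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def build_targets(teacher_layers, K):
    # Average the normalized hidden states of the last K teacher layers.
    # teacher_layers: list of (T, D) arrays, one per transformer block,
    # computed on the UNMASKED input (hypothetical shapes for illustration).
    last_k = teacher_layers[-K:]
    return np.mean([layer_norm(h) for h in last_k], axis=0)

def smooth_l1(pred, target, beta=1.0):
    # Regression objective, evaluated on masked positions only.
    diff = np.abs(pred - target)
    loss = np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta)
    return loss.mean()

# Toy example: 4 transformer layers, 5 time steps, 8 features.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(5, 8)) for _ in range(4)]
targets = build_targets(layers, K=3)           # (5, 8)
student_pred = rng.normal(size=(5, 8))         # student outputs
mask = np.array([1, 0, 1, 0, 0], dtype=bool)   # which time steps were masked
loss = smooth_l1(student_pred[mask], targets[mask])
```

In the real model the student only has to predict targets at the masked positions, which is why the loss is indexed by `mask` here.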
Tricks
- Share encoder parameters between teacher and student
- Normalize target representations to prevent collapse – the model generating near-identical representations for all inputs. Collapse is also more likely when:
  - the learning rate is too small or too large
  - the EMA decay rate is too small
  - the mask is too small
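The teacher/student weight coupling mentioned above is an exponential moving average. A minimal sketch, assuming a flat dict of parameter arrays (the function name `ema_update` and the toy values are illustrative, not from the paper):

```python
import numpy as np

def ema_update(teacher_params, student_params, tau=0.999):
    # teacher <- tau * teacher + (1 - tau) * student.
    # tau is the decay rate; too small a tau makes the teacher track
    # the student too closely, one of the collapse factors noted above.
    return {k: tau * teacher_params[k] + (1.0 - tau) * student_params[k]
            for k in teacher_params}

# Toy parameters to show one update step.
student = {"w": np.ones(3)}
teacher = {"w": np.zeros(3)}
teacher = ema_update(teacher, student, tau=0.9)
# teacher["w"] is now 0.1 * student["w"]
```

In practice tau is typically annealed toward 1.0 over training, so the teacher changes quickly early on and stabilizes later.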
It is worth noting that each modality still requires a different encoding mechanism, since information density and input representations differ substantially across speech, vision, and text.