data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
Problem
Different training objectives for each modality create performance gaps as well as engineering friction.
Solution
A unified objective – predicting latent representations of masked tokens/spans/patches, produced by an all-seeing teacher model:
- target: representations from the teacher model, averaged over all layers or the last \(K\) layers; the teacher's weights are a moving average of the student's, and the teacher sees the full (unmasked) input, so its hidden representations are contextualized
- prediction: hidden representations from the student model, whose input is masked (image -> patches, speech -> spans, text -> words)
- objective: regression (Smooth L1 loss in the paper) on the masked positions
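The target/prediction/objective recipe above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: shapes, the toy data, and the helper names (`build_targets`, `smooth_l1`) are assumptions; only the structure (average the last \(K\) normalized teacher layers, regress the student's masked outputs onto them) follows the notes.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each time step's feature vector; target normalization
    # is one of the anti-collapse tricks discussed below.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def build_targets(teacher_layers, K):
    # Average the normalized hidden states of the last K teacher layers.
    # teacher_layers: list of (T, D) arrays, one per transformer block,
    # computed on the UNMASKED input (hypothetical shapes for illustration).
    last_k = teacher_layers[-K:]
    return np.mean([layer_norm(h) for h in last_k], axis=0)

def smooth_l1(pred, target, beta=1.0):
    # Regression objective, evaluated on masked positions only.
    diff = np.abs(pred - target)
    loss = np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta)
    return loss.mean()

# Toy example: 4 transformer layers, 5 time steps, 8 features.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(5, 8)) for _ in range(4)]
targets = build_targets(layers, K=3)           # (5, 8)
student_pred = rng.normal(size=(5, 8))         # student outputs
mask = np.array([1, 0, 1, 0, 0], dtype=bool)   # which time steps were masked
loss = smooth_l1(student_pred[mask], targets[mask])
```

In the real model the student only has to predict targets at the masked positions, which is why the loss is indexed by `mask` here.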
Tricks
- Share encoder parameters between teacher and student
- Normalize target representations to prevent collapse – the model generating near-identical representations for all inputs. Collapse is also more likely when:
  - the learning rate is too small or too large
  - the EMA decay rate is too small
  - the mask is too small
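The teacher/student weight coupling mentioned above is an exponential moving average. A minimal sketch, assuming a flat dict of parameter arrays (the function name `ema_update` and the toy values are illustrative, not from the paper):

```python
import numpy as np

def ema_update(teacher_params, student_params, tau=0.999):
    # teacher <- tau * teacher + (1 - tau) * student.
    # tau is the decay rate; too small a tau makes the teacher track
    # the student too closely, one of the collapse factors noted above.
    return {k: tau * teacher_params[k] + (1.0 - tau) * student_params[k]
            for k in teacher_params}

# Toy parameters to show one update step.
student = {"w": np.ones(3)}
teacher = {"w": np.zeros(3)}
teacher = ema_update(teacher, student, tau=0.9)
# teacher["w"] is now 0.1 * student["w"]
```

In practice tau is typically annealed toward 1.0 over training, so the teacher changes quickly early on and stabilizes later.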
It is worth noting that each modality still requires a different encoding mechanism, since information density and input representations differ substantially across speech, vision, and text.