data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

#Natural Language Processing #Multi-modality #Representation

@baevski_

Problem

Each modality uses a different self-supervised objective, which creates both performance gaps and friction when transferring ideas across modalities.

Solution

A unified objective – predict the latent representations of masked tokens/spans/patches, using targets produced by an all-seeing teacher model:

  • target: representations from the teacher model, whose weights are an exponential moving average of the student's; targets are built by averaging the top \(K\) layers. The teacher sees the full, unmasked input, so its hidden representations are contextualized
  • prediction: hidden representations from the student model, where input is masked (image -> patches, speech -> spans, text -> words)
  • objective: regression
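The target/prediction/objective recipe above can be sketched in NumPy. This is a minimal illustration, not the authors' fairseq implementation: the normalization, top-\(K\) averaging, and Smooth L1 loss follow the paper's description, but function names and shapes are my own assumptions.

```python
import numpy as np

def normalize_layer(h, eps=1e-6):
    # Normalize each token's feature vector (stand-in for the
    # target normalization that prevents collapse; see Tricks below).
    mu = h.mean(axis=-1, keepdims=True)
    sigma = h.std(axis=-1, keepdims=True)
    return (h - mu) / (sigma + eps)

def build_target(teacher_layer_outputs, k):
    # Average the normalized hidden states of the teacher's top-K layers.
    # teacher_layer_outputs: list of arrays, each (num_tokens, dim),
    # computed on the FULL (unmasked) input.
    top_k = teacher_layer_outputs[-k:]
    return np.mean([normalize_layer(h) for h in top_k], axis=0)

def smooth_l1(pred, target, beta=1.0):
    # Regression objective: Smooth L1 (Huber-style) loss between the
    # student's predictions at masked positions and the teacher targets.
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff ** 2 / beta,
                    diff - 0.5 * beta).mean()
```

In training, the student would only be penalized at masked positions; here `pred` stands for the student's outputs restricted to those positions.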

Tricks

  • Share the feature-encoder parameters between teacher and student;
  • Normalize the target representations to prevent collapse – the model degenerating into producing similar representations for all inputs. Collapse is also promoted by:
    • A learning rate that is too small or too large
    • A teacher decay rate that is too small
    • Masking too little of the input

It is worth noting that each modality still requires a different encoding mechanism, since the information density and input representations differ substantially (image -> patch embeddings, speech -> convolutional feature encoder, text -> sub-word embeddings).