Paper: CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

We’ve seen some ideas on remedying the vocabulary burden of modern large models with hashing tricks like PRADO/pQRNN. But all of them still require some up-front tokenization in which text is broken into hashable and somewhat meaningful chunks/subwords.

Model

In this paper, CANINE makes this more generic by applying character/code-point-level hash embeddings and block-wise (local) self-attention – exploiting the locality of characters, which don’t carry much long-range dependency on their own – plus strided convolutions to shrink the sequence length (\(\frac{2048}{4}\rightarrow 512\)) before the deep transformer stack.
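A minimal PyTorch sketch of this encoder-side idea, under my own simplifications: the class names, hash scheme, and hyperparameters below are illustrative (the paper uses a more careful multi-hash embedding and genuinely local attention, which I approximate with one full attention layer).

```python
# Rough sketch (not the reference implementation): code points are embedded via
# multiple hash functions, contextualized by one "local" transformer layer, then
# down-sampled 4x with a strided convolution (2048 -> 512 positions).
import torch
import torch.nn as nn

class HashCharEmbedding(nn.Module):
    """Multi-hash embedding for Unicode code points (no vocabulary needed)."""
    def __init__(self, num_hashes=8, bucket_size=16384, dim=768):
        super().__init__()
        assert dim % num_hashes == 0
        self.primes = [31, 43, 59, 61, 73, 97, 103, 113][:num_hashes]
        self.bucket_size = bucket_size
        self.tables = nn.ModuleList(
            nn.Embedding(bucket_size, dim // num_hashes) for _ in self.primes
        )

    def forward(self, codepoints):                       # (batch, seq) int64
        slices = []
        for prime, table in zip(self.primes, self.tables):
            bucket = (codepoints * prime) % self.bucket_size  # cheap hash, my choice
            slices.append(table(bucket))
        return torch.cat(slices, dim=-1)                  # (batch, seq, dim)

class Downsampler(nn.Module):
    """Local attention (approximated by one full layer) + stride-4 convolution."""
    def __init__(self, dim=768, rate=4):
        super().__init__()
        self.local = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
        self.conv = nn.Conv1d(dim, dim, kernel_size=rate, stride=rate)

    def forward(self, char_embs):                         # (batch, 2048, dim)
        h = self.local(char_embs)
        return self.conv(h.transpose(1, 2)).transpose(1, 2)  # (batch, 512, dim)

codepoints = torch.randint(0, 0x10FFFF, (2, 2048))
embed, down = HashCharEmbedding(), Downsampler()
print(down(embed(codepoints)).shape)                      # torch.Size([2, 512, 768])
```

The hashed buckets mean no code point is ever out-of-vocabulary, and the stride-4 convolution is what lets the expensive deep transformer run on 512 positions instead of 2048.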

The above architecture is sufficient for classification, but for sequence prediction tasks they up-sample back to the character level: the attended character embeddings (local, word/subword-level information) are concatenated with the repeated down-sampled hidden embeddings (contextual information), followed by another convolution and a final transformer layer that produces character-level outputs (from which masked characters are predicted autoregressively during pre-training).
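Continuing the same rough sketch (class names, the repeat-based up-sampling, and the "same" padding are my assumptions, not the paper’s exact recipe), the up-sampling side could look like:

```python
# Sketch of the up-sampling path for sequence prediction: repeat each
# down-sampled position 4x, concatenate with the original character-level
# encodings, then apply a convolution and one more transformer layer to get
# per-character representations.
import torch
import torch.nn as nn

class Upsampler(nn.Module):
    def __init__(self, dim=768, rate=4):
        super().__init__()
        self.rate = rate
        self.conv = nn.Conv1d(2 * dim, dim, kernel_size=rate, padding="same")
        self.final = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)

    def forward(self, char_encodings, downsampled):   # (B, 2048, d), (B, 512, d)
        repeated = downsampled.repeat_interleave(self.rate, dim=1)  # (B, 2048, d)
        h = torch.cat([char_encodings, repeated], dim=-1)           # (B, 2048, 2d)
        h = self.conv(h.transpose(1, 2)).transpose(1, 2)            # (B, 2048, d)
        return self.final(h)                          # per-character outputs
```

The concatenation is the point: each character position sees both its own local encoding and the contextual summary of the 4-character block it belongs to.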

Loss

  • Whitespace-bounded span masking and prediction: several character spans (bounded by whitespace) are masked and the model is asked to predict the characters within each span (see the sketch after this list).
  • Subword span masking and prediction: instead of using whitespace to find spans, they can fall back to subword boundaries (presumably produced by an external tokenizer such as SentencePiece), with the masked subword’s identity as the prediction target.
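To make the character-level targets concrete, here is an illustrative masking routine for the whitespace-bounded variant; the probability, the private-use mask code point, and the span-selection details are my assumptions, not the paper’s exact procedure.

```python
# Illustrative only: mask whitespace-bounded character spans and collect the
# code points the model would have to predict at those positions.
import random

MASK_CODEPOINT = 0xE000  # a private-use code point standing in for [MASK]

def mask_word_spans(text, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    codepoints = [ord(c) for c in text]
    masked, targets = list(codepoints), {}
    start = 0
    for end in [i for i, c in enumerate(text) if c == " "] + [len(text)]:
        if end > start and rng.random() < mask_prob:
            for i in range(start, end):          # mask the whole span
                targets[i] = codepoints[i]       # character-level prediction targets
                masked[i] = MASK_CODEPOINT
        start = end + 1
    return masked, targets

masked, targets = mask_word_spans("canine reads raw characters", mask_prob=0.5)
print(targets)  # positions of masked characters and the code points to predict
```

The subword variant would differ only in how spans are found and in predicting a subword ID per span rather than individual characters.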

Thoughts

It is a neat extension of what people are already doing with hash embeddings, but at the end of the day, if we were ever to build an ultimate multilingual model, CANINE would probably still need more parameters in its convolution layers, longer sequence lengths, and potentially more data to compensate for learning from raw characters from scratch.