202110091321 Open Vocabulary

#open vocabulary #representation learning #translation #nlp

Open Vocabulary

Problems

Word-based models suffer from the Out-of-Vocabulary (OOV) problem. Character-level models can help when the sequence length stays manageable. Recent sub-word models such as BPE and SentencePiece are a good compromise. To some extent, they can be considered open vocabulary, since any word can degenerate into individual characters (all Unicode characters in theory, but mostly ASCII characters in an English-dominated world); see the toy sketch after the list below. But the problems of having a vocabulary remain:

  • You have to store and maintain a physical copy of the vocabulary alongside your model
  • You have to store the token embeddings in your model, which can amount to a lot of parameters
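
As a toy illustration of the character fallback that makes sub-word models "open": a minimal greedy longest-match segmenter (not real BPE; the vocab set here is made up) that degenerates to single characters whenever no longer piece matches.

```python
def segment(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation with single-character fallback.
    Toy illustration only; real BPE/SentencePiece use learned merge rules."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            # take the longest known piece, or a single character if nothing matches
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

# hypothetical sub-word vocabulary
vocab = {"lang", "uage", "voc", "ab"}
print(segment("language", vocab))    # ['lang', 'uage']
print(segment("vocabulary", vocab))  # ['voc', 'ab', 'u', 'l', 'a', 'r', 'y']  <- character fallback
```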

Open, Low, and No Vocabulary

If we define open vocabulary as the criterion that there is no OOV problem, then you can either have a finite vocabulary or no vocabulary at all (hashing, VTR). If you do have one, the vocabulary can be small (bytes, ASCII characters), moderate, or big (BPE with a fixed size).
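
A minimal sketch of the "no vocabulary" hashing idea: embeddings are looked up by hashing the token string into a fixed-size table, so there is no vocabulary file to ship and no OOV by construction. Table size, number of hash seeds, and dimensions below are arbitrary choices.

```python
import hashlib
import numpy as np

def hashed_embedding(token: str, table: np.ndarray, num_hashes: int = 2) -> np.ndarray:
    """Map a token to rows of a fixed-size table via hashing; no stored vocabulary.
    Averaging over several seeded hashes softens collisions."""
    vec = np.zeros(table.shape[1])
    for seed in range(num_hashes):
        h = int(hashlib.md5(f"{seed}:{token}".encode()).hexdigest(), 16)
        vec += table[h % table.shape[0]]
    return vec / num_hashes

rng = np.random.default_rng(0)
table = rng.normal(size=(2**16, 64))                   # fixed number of buckets, 64-dim embeddings
print(hashed_embedding("zettelkasten", table).shape)   # (64,) -- any string works, no OOV
```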

No vocabulary

Paper Robust Open-Vocabulary Translation from Visual Text Representations

We have seen similar problems with subwords for languages like English in Paper CANINE Pre-training an Efficient Tokenization-Free Encoder for Language Representation, and in this paper the authors take on the same problem from a refreshing perspective: learning text embeddings from visually rendered text.

Text segmentation techniques like BPE and SentencePiece are brittle under surface noise, as in the following scenarios:

| Phenomena | Word | BPE |
| --- | --- | --- |
| Vowelization | كتاب | كتاب |
| | كِتاب | ك·ِ·ت·اب |
| Misspelling | language | language |
| | langauge | la ng au ge |
| Confusables | really | really |
| | rea11y | re a 1 1 y |
| Shared Character Components | 확인한다 | 확인·한·다 |
| | 확인했다 | 확인·했다 |
[[vtr.png|Architecture]]

Here, the rendered text image is segmented into blocks/slices with a sliding window, and each slice goes through a series of transformations (Conv2D, BatchNorm, ReLU, and a linear projection) before being fed to a standard transformer for further processing.
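
A minimal PyTorch sketch of this pipeline, assuming grayscale renderings; the window/stride values, channel count, and model dimensions are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class SliceEmbedder(nn.Module):
    """Turn each sliding-window slice of a rendered sentence into a d_model vector."""
    def __init__(self, slice_h=24, slice_w=16, channels=32, d_model=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        )
        self.proj = nn.Linear(channels * slice_h * slice_w, d_model)

    def forward(self, slices):                       # (batch, n_slices, 1, H, W)
        b, n, c, h, w = slices.shape
        x = self.conv(slices.view(b * n, c, h, w))   # conv + batchnorm + relu per slice
        return self.proj(x.flatten(1)).view(b, n, -1)

def make_slices(image, window=16, stride=8):
    """Cut a rendered line (H, W) into overlapping slices: (n_slices, 1, H, window)."""
    return image.unfold(1, window, stride).permute(1, 0, 2).unsqueeze(1)

img = torch.rand(24, 128)                            # stand-in for a rendered sentence image
slices = make_slices(img).unsqueeze(0)               # (1, n_slices, 1, 24, 16)
emb = SliceEmbedder()(slices)                        # (1, n_slices, 512)
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
out = nn.TransformerEncoder(layer, num_layers=2)(emb)   # standard transformer over slice embeddings
```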

Some interesting observations from the paper:

  • Given a fixed window size, increasing the stride degrades the performance
  • Increasing the convolution channel size does not necessarily translate into performance gains
  • Smaller strides increase the training time because the sequences get longer (see the sketch after this list)
  • Consistent improvements over baseline models on noisy data (confusables, permutations, natural noise like misspellings)
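
The stride/sequence-length trade-off in the third bullet is plain sliding-window arithmetic; a quick check with made-up pixel widths:

```python
def n_slices(width: int, window: int, stride: int) -> int:
    """Number of sliding-window slices over a rendered line of `width` pixels,
    i.e. the sequence length the downstream transformer has to process."""
    return (width - window) // stride + 1

# Halving the stride roughly doubles the sequence length, hence longer training:
print(n_slices(800, window=16, stride=10))  # 79
print(n_slices(800, window=16, stride=5))   # 157
```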

Low vocabulary

Paper CANINE Pre-training an Efficient Tokenization-Free Encoder for Language Representation

Paper Charformer Fast Character Transformers via Gradient-based Subword Tokenization