Open Vocabulary
Problems
Word-based models suffer from the out-of-vocabulary (OOV) problem. Character-level models can help when the sequence length stays manageable. Recent sub-word models such as BPE and SentencePiece are a good compromise. To some extent, they can be considered open vocabulary, since any word can degenerate into individual characters (all Unicode characters in theory, but mostly ASCII in the English-dominated world). But the problems of having a vocabulary remain:
- You have to store and maintain a physical copy of the vocabulary alongside your model
- You have to store the token embeddings in your model, which can account for a lot of parameters
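For a rough sense of scale (illustrative numbers, not from this post): a BERT-base-sized model with a 30,522-entry WordPiece vocabulary and 768-dimensional embeddings spends over 23M parameters on the token embedding table alone, roughly a fifth of its ~110M total.

```python
# Back-of-the-envelope: token embedding parameters = vocab_size * hidden_size.
# BERT-base's published configuration is used here purely as an example.
vocab_size, hidden_size = 30522, 768
embedding_params = vocab_size * hidden_size
print(f"{embedding_params:,}")  # 23,440,896 -- about 21% of BERT-base's ~110M parameters
```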
Open, Low, and No Vocabulary
If we define open vocabulary by the criterion that there is no OOV problem, then you can either have a finite vocabulary or no vocabulary at all (hashing, VTR). If you do keep a vocabulary, it can be small (bytes, ASCII characters), moderate, or big (BPE with a fixed size).
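As a concrete illustration of the "no vocabulary at all" end of the spectrum, here is a minimal hash-embedding sketch (my own illustration under assumed bucket counts and dimensions, not any specific paper's method): every string is hashed into a fixed number of buckets, so nothing is ever OOV and no vocabulary file ships with the model.

```python
# Minimal sketch of a hash-based "no vocabulary" embedding (illustrative only).
# Any Unicode string maps to bucket ids via hashing, so there is no OOV and no
# vocabulary file; several hash seeds reduce the impact of collisions.
import hashlib

import torch
import torch.nn as nn


class HashEmbedding(nn.Module):
    def __init__(self, num_buckets=2**16, d_model=256, num_hashes=2):
        super().__init__()
        self.num_buckets = num_buckets
        self.num_hashes = num_hashes
        self.table = nn.Embedding(num_buckets, d_model)

    def bucket_ids(self, token: str) -> list[int]:
        # hash the raw UTF-8 bytes with several seeds; no lookup table needed
        return [
            int(hashlib.md5(f"{seed}:{token}".encode("utf-8")).hexdigest(), 16)
            % self.num_buckets
            for seed in range(self.num_hashes)
        ]

    def forward(self, tokens: list[str]) -> torch.Tensor:
        ids = torch.tensor([self.bucket_ids(t) for t in tokens])  # (T, num_hashes)
        return self.table(ids).sum(dim=1)                         # (T, d_model)


emb = HashEmbedding()
vecs = emb(["hello", "wørld", "확인했다"])  # works for any string, nothing is OOV
```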
No vocabulary
We have seen similar problems with subwords for languages like English in CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation. In this paper, the authors take on the same problem from a refreshing perspective: learning text embeddings from visually rendered text.
Text segmentation techniques like BPE and SentencePiece are brittle under noise in scenarios like the following:
| Phenomena | Word | BPE |
|---|---|---|
| Vowelization | كتاب | كتاب |
| | الكِتاب | الك·ِ·ت·اب |
| Misspelling | language | language |
| | langauge | la·ng·au·ge |
| Confusables | really | really |
| | rea11y | re·a·1·1·y |
| Shared Character Components | 확인한다 | 확인·한·다 |
| | 확인했다 | 확인·했다 |
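To see this brittleness for yourself, run a stock subword tokenizer over the clean and perturbed spellings; the exact pieces depend on the tokenizer (the GPT-2 byte-level BPE via Hugging Face `transformers` is my choice here, not the paper's setup), but the perturbed form typically shatters into many short fragments.

```python
# Illustration only: compare subword segmentations of clean vs. noisy spellings.
# Requires `pip install transformers`; exact pieces depend on the tokenizer used.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # byte-level BPE
for clean, noisy in [("language", "langauge"), ("really", "rea11y")]:
    print(f"{clean!r:12} -> {tokenizer.tokenize(clean)}")
    print(f"{noisy!r:12} -> {tokenizer.tokenize(noisy)}")
```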
In the paper, the rendered text is segmented into slices with a sliding window; each slice goes through a small stack of transformations (Conv2D, BatchNorm, ReLU, and a linear projection), and the resulting sequence of slice embeddings is fed to a standard Transformer for further processing.
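Below is a minimal PyTorch sketch of that pipeline, under assumed hyperparameters (32 px tall renderings, a 30 px window with a 20 px stride, a single Conv2D layer, d_model = 512); it shows the shape of the computation rather than reproducing the paper's exact architecture.

```python
import torch
import torch.nn as nn


class SliceEmbedder(nn.Module):
    """Conv2D -> BatchNorm -> ReLU -> Linear, applied to each image slice."""

    def __init__(self, slice_h=32, slice_w=30, channels=64, d_model=512):
        super().__init__()
        self.conv = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(channels)
        self.proj = nn.Linear(channels * slice_h * slice_w, d_model)

    def forward(self, slices):                        # (batch, n_slices, 1, H, W)
        b, n, c, h, w = slices.shape
        x = slices.reshape(b * n, c, h, w)
        x = torch.relu(self.bn(self.conv(x)))         # (b*n, channels, H, W)
        x = self.proj(x.flatten(1))                   # (b*n, d_model)
        return x.reshape(b, n, -1)                    # (batch, n_slices, d_model)


class VisualTextEncoder(nn.Module):
    """Sliding-window slices of a rendered sentence -> Transformer encoder."""

    def __init__(self, d_model=512, nhead=8, num_layers=6, window=30, stride=20):
        super().__init__()
        self.window, self.stride = window, stride
        self.embed = SliceEmbedder(slice_w=window, d_model=d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, rendered):                      # (batch, 1, H=32, W) grayscale
        # unfold the width axis into overlapping slices
        slices = rendered.unfold(-1, self.window, self.stride)   # (b, 1, H, n, window)
        slices = slices.permute(0, 3, 1, 2, 4)                   # (b, n, 1, H, window)
        return self.encoder(self.embed(slices))                  # (b, n, d_model)


img = torch.rand(2, 1, 32, 400)   # two rendered sentences, 32 x 400 px
out = VisualTextEncoder()(img)    # -> (2, 19, 512)
```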
Some interesting observations from the paper:
- Given a fixed window size, increasing the stride degrades the performance
- Increasing the convolution channel size does not necessarily translate into performance gains
- Smaller strides increase training time because the sequences get longer (see the sliding-window arithmetic after this list)
- Consistent improvement over baseline models on noisy data (confusables, permutations, natural noise like misspelling)
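The last two observations follow from how the sliding window determines sequence length: with rendered width W, window w, and stride s, the Transformer sees roughly floor((W - w) / s) + 1 slices, so halving the stride roughly doubles the sequence. The widths and strides below are made up for illustration.

```python
# Number of slices produced by a sliding window of size `window` and step `stride`.
def num_slices(width, window, stride):
    return (width - window) // stride + 1

print(num_slices(600, 30, 20))  # 29
print(num_slices(600, 30, 10))  # 58  -- halving the stride roughly doubles the sequence
print(num_slices(600, 30, 5))   # 115
```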
Low vocabulary
- Paper: CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
- Paper: Charformer: Fast Character Transformers via Gradient-based Subword Tokenization