Open Vocabulary
Problems
Word-based models suffer from the out-of-vocabulary (OOV) problem. Character-level models can help when the sequence length stays manageable. Recent sub-word models such as BPE and SentencePiece are a good compromise. To some extent, they can be considered open vocabulary, since any word can degenerate into individual characters (all Unicode characters in theory, but mostly ASCII in the English-dominated world). But the problems of having a vocabulary remain:
- You have to store and maintain a physical copy of the vocabulary alongside your model
- You have to store the token embeddings in your model, which can account for a lot of parameters
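For a rough sense of scale (illustrative numbers, not from this post): a BERT-base-sized model with a 30,522-entry WordPiece vocabulary and 768-dimensional embeddings spends over 23M parameters on the token embedding table alone, roughly a fifth of its ~110M total.

```python
# Back-of-the-envelope: token embedding parameters = vocab_size * hidden_size.
# BERT-base's published configuration is used here purely as an example.
vocab_size, hidden_size = 30522, 768
embedding_params = vocab_size * hidden_size
print(f"{embedding_params:,}")  # 23,440,896 -- about 21% of BERT-base's ~110M parameters
```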
Open, Low, and No Vocabulary
If we define open vocabulary by the criterion that there is no OOV problem, then you can either have a finite vocabulary or no vocabulary at all (hashing, VTR). If you do keep a vocabulary, it can be small (bytes, ASCII characters), moderate, or big (BPE with a fixed size).
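As a concrete illustration of the "no vocabulary at all" end of the spectrum, here is a minimal hash-embedding sketch (my own illustration under assumed bucket counts and dimensions, not any specific paper's method): every string is hashed into a fixed number of buckets, so nothing is ever OOV and no vocabulary file ships with the model.

```python
# Minimal sketch of a hash-based "no vocabulary" embedding (illustrative only).
# Any Unicode string maps to bucket ids via hashing, so there is no OOV and no
# vocabulary file; several hash seeds reduce the impact of collisions.
import hashlib

import torch
import torch.nn as nn


class HashEmbedding(nn.Module):
    def __init__(self, num_buckets=2**16, d_model=256, num_hashes=2):
        super().__init__()
        self.num_buckets = num_buckets
        self.num_hashes = num_hashes
        self.table = nn.Embedding(num_buckets, d_model)

    def bucket_ids(self, token: str) -> list[int]:
        # hash the raw UTF-8 bytes with several seeds; no lookup table needed
        return [
            int(hashlib.md5(f"{seed}:{token}".encode("utf-8")).hexdigest(), 16)
            % self.num_buckets
            for seed in range(self.num_hashes)
        ]

    def forward(self, tokens: list[str]) -> torch.Tensor:
        ids = torch.tensor([self.bucket_ids(t) for t in tokens])  # (T, num_hashes)
        return self.table(ids).sum(dim=1)                         # (T, d_model)


emb = HashEmbedding()
vecs = emb(["hello", "wørld", "확인했다"])  # works for any string, nothing is OOV
```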
No vocabulary
We have seen similar problems with subwords for languages like English in CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation. In this paper, the authors take on the same problem from a refreshing perspective: learning text embeddings from visually rendered text.
Text segmentation techniques like BPE and SentencePiece are brittle under noise in scenarios like the following:
| Phenomena | Word | BPE |
|---|---|---|
| Vowelization | كتاب | كتاب |
| | الكِتاب | الك·ِ·ت·اب |
| Misspelling | language | language |
| | langauge | la·ng·au·ge |
| Confusables | really | really |
| | rea11y | re·a·1·1·y |
| Shared Character Components | 확인한다 | 확인·한·다 |
| | 확인했다 | 확인·했다 |
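To see this brittleness for yourself, run a stock subword tokenizer over the clean and perturbed spellings; the exact pieces depend on the tokenizer (the GPT-2 byte-level BPE via Hugging Face `transformers` is my choice here, not the paper's setup), but the perturbed form typically shatters into many short fragments.

```python
# Illustration only: compare subword segmentations of clean vs. noisy spellings.
# Requires `pip install transformers`; exact pieces depend on the tokenizer used.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # byte-level BPE
for clean, noisy in [("language", "langauge"), ("really", "rea11y")]:
    print(f"{clean!r:12} -> {tokenizer.tokenize(clean)}")
    print(f"{noisy!r:12} -> {tokenizer.tokenize(noisy)}")
```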
In the paper, the rendered text is segmented into slices with a sliding window; each slice goes through a small stack of transformations (Conv2D, BatchNorm, ReLU, and a linear projection), and the resulting sequence of slice embeddings is fed to a standard Transformer for further processing.
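Below is a minimal PyTorch sketch of that pipeline, under assumed hyperparameters (32 px tall renderings, a 30 px window with a 20 px stride, a single Conv2D layer, d_model = 512); it shows the shape of the computation rather than reproducing the paper's exact architecture.

```python
import torch
import torch.nn as nn


class SliceEmbedder(nn.Module):
    """Conv2D -> BatchNorm -> ReLU -> Linear, applied to each image slice."""

    def __init__(self, slice_h=32, slice_w=30, channels=64, d_model=512):
        super().__init__()
        self.conv = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(channels)
        self.proj = nn.Linear(channels * slice_h * slice_w, d_model)

    def forward(self, slices):                        # (batch, n_slices, 1, H, W)
        b, n, c, h, w = slices.shape
        x = slices.reshape(b * n, c, h, w)
        x = torch.relu(self.bn(self.conv(x)))         # (b*n, channels, H, W)
        x = self.proj(x.flatten(1))                   # (b*n, d_model)
        return x.reshape(b, n, -1)                    # (batch, n_slices, d_model)


class VisualTextEncoder(nn.Module):
    """Sliding-window slices of a rendered sentence -> Transformer encoder."""

    def __init__(self, d_model=512, nhead=8, num_layers=6, window=30, stride=20):
        super().__init__()
        self.window, self.stride = window, stride
        self.embed = SliceEmbedder(slice_w=window, d_model=d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, rendered):                      # (batch, 1, H=32, W) grayscale
        # unfold the width axis into overlapping slices
        slices = rendered.unfold(-1, self.window, self.stride)   # (b, 1, H, n, window)
        slices = slices.permute(0, 3, 1, 2, 4)                   # (b, n, 1, H, window)
        return self.encoder(self.embed(slices))                  # (b, n, d_model)


img = torch.rand(2, 1, 32, 400)   # two rendered sentences, 32 x 400 px
out = VisualTextEncoder()(img)    # -> (2, 19, 512)
```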
Some interesting observations from the paper:
- Given a fixed window size, increasing the stride degrades the performance
- Increasing the convolution channel size does not necessarily translate into performance gains
- Smaller strides increase training time because the sequences get longer (see the sliding-window arithmetic after this list)
- Consistent improvement over baseline models on noisy data (confusables, permutations, natural noise like misspelling)
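The last two observations follow from how the sliding window determines sequence length: with rendered width W, window w, and stride s, the Transformer sees roughly floor((W - w) / s) + 1 slices, so halving the stride roughly doubles the sequence. The widths and strides below are made up for illustration.

```python
# Number of slices produced by a sliding window of size `window` and step `stride`.
def num_slices(width, window, stride):
    return (width - window) // stride + 1

print(num_slices(600, 30, 20))  # 29
print(num_slices(600, 30, 10))  # 58  -- halving the stride roughly doubles the sequence
print(num_slices(600, 30, 5))   # 115
```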
Low vocabulary
- Paper: CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
- Paper: Charformer: Fast Character Transformers via Gradient-based Subword Tokenization