- Human language sounds (phonemes) are discrete/categorical, even though our mouth (articulation) is continuous.
- Morphemes are the smallest meaningful subword units.
- Writing systems differ between languages. Sometimes one word in one language (e.g., a compound) corresponds to four words in another.
Why we want a subword-level model:
- For understanding complex and rich morphology
- For transliteration (e.g., names)
- Informal spelling on the internet
Character-level Models
- Word embeddings from character embeddings
- Direct sequence modeling based on character embeddings
Characters may not mean much by themselves, but large/deep models are capable of learning to build up and remember higher-level meaning from those character embeddings.
However, a character-level model trained on English will not be useful for other languages, because the writing systems are not the same.
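A minimal sketch of the first idea above (a word embedding built from character embeddings), assuming PyTorch; the character vocabulary, dimensions, and BiLSTM composition are arbitrary choices for illustration, not what any particular paper uses.

```python
import torch
import torch.nn as nn

class CharToWordEncoder(nn.Module):
    """Builds a word vector by running a BiLSTM over character embeddings."""
    def __init__(self, n_chars, char_dim=25, word_dim=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.lstm = nn.LSTM(char_dim, word_dim // 2,
                            bidirectional=True, batch_first=True)

    def forward(self, char_ids):                      # (batch, word_len)
        x = self.char_emb(char_ids)                   # (batch, word_len, char_dim)
        _, (h, _) = self.lstm(x)                      # h: (2, batch, word_dim // 2)
        return torch.cat([h[0], h[1]], dim=-1)        # (batch, word_dim)

enc = CharToWordEncoder(n_chars=128)                  # naive ASCII character ids
ids = torch.tensor([[ord(c) for c in "hello"]])
print(enc(ids).shape)                                 # torch.Size([1, 100])
```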
Disadvantages
- Hard to train because of the increased sequence length
Subword Models
- Byte Pair Encoding (BPE): repeatedly merge the most frequent adjacent symbol pair into a new symbol (see the BPE sketch below)
- WordPiece: tokenize a word into subwords within the word boundary
- SentencePiece: tokenize the raw input directly, treating the space as a special character
- Conv/Highway + char embeddings to learn word representations.
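A minimal BPE sketch, assuming a toy word-frequency corpus and an arbitrary number of merges: start from characters plus an end-of-word marker and repeatedly merge the most frequent adjacent symbol pair.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over a {symbol-tuple: frequency} vocabulary."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the chosen pair with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words split into characters plus an end-of-word marker (frequencies are made up).
vocab = {tuple("low") + ("</w>",): 5,
         tuple("lower") + ("</w>",): 2,
         tuple("newest") + ("</w>",): 6,
         tuple("widest") + ("</w>",): 3}

for step in range(10):                    # learn 10 merge rules
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)      # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(step, best)
```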
Hybrid Models
Only use the character-level LSTM when a word is OOV, both in training and inference. Because this character model is a second-level (fallback) model, it does not capture context as well as the word-level representations do (translating names, for example).
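A minimal sketch of that dispatch (the names word_vocab, word_embeddings, and char_encoder are placeholders, not from the lecture): the word-level embedding is used for in-vocabulary tokens, and the character-level encoder only as an OOV fallback.

```python
def embed_token(token, word_vocab, word_embeddings, char_encoder):
    """Hybrid lookup: word-level path when in-vocabulary, character-level fallback for OOV."""
    if token in word_vocab:
        return word_embeddings[word_vocab[token]]   # ordinary word embedding
    return char_encoder(token)                      # OOV: build a vector from its characters
```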
fastText
A next-generation extension of the word vectors from the Lecture 1 overview: each word is represented by its character n-grams augmented with boundary symbols (e.g., where → <wh, whe, her, ere, re>, plus the whole word <where>).
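A small sketch of the n-gram extraction, assuming the usual 3-to-6 gram range (the function name is made up); the word's vector is then taken as the sum of the vectors of these n-grams plus the whole-word symbol.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word with boundary symbols, fastText-style."""
    marked = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    grams.append(marked)          # the whole word is kept as its own feature
    return grams

print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```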