What is Natural Language Generation
NLG refers to the setting where a model produces text on the output side.
Language models and conditional language models are the cornerstones of NLG: a language model gives a probability distribution over the next word given the words so far, while a conditional language model conditions not just on the previous words but on other input signals as well (e.g., a source sentence in translation or a document in summarization).
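In symbols, this is the standard autoregressive factorization, with the conditional variant adding an extra input $x$:

$$ P(y) = \prod_{t=1}^{T} P(y_t \mid y_{<t}) \quad \text{vs.} \quad P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, x) $$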
Decoding
Greedy decoding
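Greedy decoding takes the argmax token at every step and feeds it back in. A minimal sketch, where `next_token_logits`, `bos_id`, and `eos_id` are hypothetical stand-ins for a real model's interface:

```python
import numpy as np

def greedy_decode(next_token_logits, bos_id, eos_id, max_len=50):
    """Greedy decoding: at each step keep only the single most probable token.

    `next_token_logits` is a hypothetical model interface that maps the
    current prefix (a list of token ids) to a logit vector over the vocab.
    """
    prefix = [bos_id]
    for _ in range(max_len):
        logits = next_token_logits(prefix)
        next_id = int(np.argmax(logits))  # argmax: no randomness, no backtracking
        prefix.append(next_id)
        if next_id == eos_id:
            break
    return prefix
```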
Beam search
- small k: behaves almost like greedy decoding, with little ability to recover from early mistakes, and generally yields poor output
- large k: explores more hypotheses but is expensive to track, and a very large beam can also degrade quality (outputs become generic and less useful); a minimal sketch follows this list
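A hedged beam search sketch over the same hypothetical model interface as in the greedy sketch (here returning log-probabilities); real implementations add length normalization and batching:

```python
import numpy as np

def beam_search(next_token_log_probs, bos_id, eos_id, k=4, max_len=50):
    """Beam search sketch: keep the k highest-scoring partial hypotheses."""
    beams = [([bos_id], 0.0)]  # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_probs = next_token_log_probs(prefix)
            # Expand each beam with its k best continuations.
            for token in np.argsort(log_probs)[-k:]:
                candidates.append((prefix + [int(token)], score + log_probs[token]))
        # Prune back down to the k best hypotheses overall.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:k]:
            if prefix[-1] == eos_id:
                finished.append((prefix, score))  # hypothesis is complete
            else:
                beams.append((prefix, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])[0]
```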
Sampling
- Pure/naive sampling: sample directly from the model's distribution instead of taking the argmax as in greedy decoding
- Top-k sampling: restrict sampling to the k most probable tokens, renormalize, and sample from those (see the sketch after this list)
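A minimal top-k sampling sketch in NumPy; note that k = 1 recovers greedy decoding and k = |V| recovers pure sampling:

```python
import numpy as np

def top_k_sample(logits, k=10, rng=None):
    """Top-k sampling: keep only the k largest logits, renormalize, sample."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float)
    top = np.argsort(logits)[-k:]                    # indices of the k best tokens
    probs = np.exp(logits[top] - logits[top].max())  # softmax over the top k only
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))
```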
Temperature in Softmax
Temperature changes the decoding probability distribution; it is not a decoding algorithm itself.
$$ P_t(w) = \frac{\exp(s_w)}{\sum_{w'} \exp(s_{w'})} \rightarrow \frac{\exp(s_w/\tau)}{\sum_{w'} \exp(s_{w'}/\tau)} $$
- Higher temperature ($\tau > 1$): the distribution is flattened toward uniform, so token probabilities move closer together and the output is more diverse;
- Lower temperature ($\tau < 1$): the distribution becomes more peaked and the output less diverse (see the demo below);
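A quick NumPy demo of the formula above; the scores are made-up numbers chosen to show the flattening/sharpening effect:

```python
import numpy as np

def softmax_with_temperature(scores, tau):
    """Divide scores by tau before the softmax (see the formula above)."""
    scaled = np.asarray(scores, dtype=float) / tau
    scaled -= scaled.max()          # subtract max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

scores = [3.0, 1.0, 0.2]
print(softmax_with_temperature(scores, 1.0))  # baseline distribution
print(softmax_with_temperature(scores, 5.0))  # high tau: flatter, more diverse samples
print(softmax_with_temperature(scores, 0.5))  # low tau: spikier, close to argmax
```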
Tasks and approaches
Summarization
- single-document summarization
- multi-document summarization
Or
- extractive summarization
- abstractive summarization
Neural summarization
- Pointer-generator / copy mechanism: mix generating from the vocabulary with copying words directly from the source (see the mixture formula below)
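As a reference point, the pointer-generator network of See et al. (2017) combines the two modes with a learned switch $p_{gen}$, where $a_i$ is the attention weight on source position $i$:

$$ P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{i : w_i = w} a_i $$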
Bottom-up summarization:
- content selection: tag each source word as include-in-the-summary or not
- generation: generate while only copying from the selected words (see the sketch after this list)
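A rough sketch of how selection can constrain generation, assuming a hypothetical per-token selector output `select_probs` and a copy/attention distribution `copy_attn`; this mirrors the masking idea of bottom-up summarization (Gehrmann et al., 2018) rather than reproducing any specific implementation:

```python
import numpy as np

def mask_copy_distribution(copy_attn, select_probs, threshold=0.5):
    """Content selection as a hard mask: the generator may only copy
    from source tokens the selector tagged as include."""
    mask = (np.asarray(select_probs) >= threshold).astype(float)
    masked = np.asarray(copy_attn) * mask            # zero out unselected tokens
    total = masked.sum()
    return masked / total if total > 0 else copy_attn  # renormalize
```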
Dialogue
- Task-oriented
- Social dialogue
Vanilla RNN seq2seq models do not work well for this task because of:
- genericness: change the sampling strategy or the generation process (e.g., add a retrieval step)
- irrelevant responses: use a mutual-information objective to penalize generic responses (see the objective after this list)
- repetition: block generating the same n-grams; use a coverage mechanism
- lack of context
- lack of consistency, persona
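For the mutual-information fix mentioned above, one concrete form is the MMI objective of Li et al. (2016), which subtracts a language-model penalty so that generic, high-probability-everywhere responses ("I don't know") score lower:

$$ \hat{T} = \arg\max_{T} \; \log P(T \mid S) - \lambda \log P(T) $$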
Storytelling
Generating a story from an image or a writing prompt.
Evaluation
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) focuses more on recall (did the output capture the reference's information?) than on precision, which BLEU emphasizes. A higher ROUGE score does not guarantee a better summary; a toy computation follows.
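As a minimal illustration of the recall orientation, ROUGE-N recall counts how many reference n-grams appear in the system output (a simplified sketch, ignoring stemming and multi-reference handling):

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: overlapping n-grams / total n-grams in the reference."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())        # clipped n-gram overlap
    return overlap / max(sum(ref.values()), 1)

print(rouge_n_recall("the cat sat on the mat", "the cat is on the mat"))  # 5/6 ≈ 0.83
```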
Perplexity only tells you how strong your language model is, not how good its generations are.
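For reference, perplexity is the exponentiated average negative log-likelihood on held-out text, so it measures prediction quality rather than generation quality:

$$ PPL = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{<i}) \right) $$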
Aspect-based automatic metrics
- Fluency
- Style
- Diversity
- Relevance
Human evaluations aren't perfect either: they are slow, expensive, and annotators can be inconsistent.
Trends and the future
- incorporating discrete latent variables
- non-autoregressive generation
- better objectives
- use constraints in open-ended generation tasks
- aim for specific targets for both the model and the evaluation
- automatic metrics can help with such targeted evaluation
- reproducibility