The paradigm the community believes in constantly shifts, especially in NLP/DL; keeping up is a life-long learning process.
Debugging neural networks is hard but necessary.
From RNNs to CNNs
The idea of applying CNNs to text:
- The last word usually dominates in RNNs (attention helps)
- We’d like to capture some pieces of a sentence (similar to n-grams)
- Learn local phrase features first, then combine them later
For text we normally use 1D CNNs, convolving along the sequence dimension over the full hidden (embedding) size. Without padding, the convolution shrinks the sequence dimension. Each kernel produces a single output channel, so we apply multiple kernels to build up the output hidden size.
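A minimal sketch of the above, assuming PyTorch (all sizes are made up for illustration):

```python
import torch
import torch.nn as nn

batch, seq_len, emb_dim = 2, 7, 50          # hypothetical sizes
x = torch.randn(batch, seq_len, emb_dim)    # (batch, seq, hidden)

# Conv1d expects (batch, channels, seq), so channels = embedding size.
conv = nn.Conv1d(in_channels=emb_dim, out_channels=100, kernel_size=3)  # 100 kernels
out = conv(x.transpose(1, 2))               # (2, 100, 5): no padding -> seq shrinks 7 -> 5
print(out.shape)
```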
Max pooling (over time) summarizes the output further; it usually works well in NLP because we care most about the salient signals. ^92f988
Strides compact the output by skipping positions, yielding a smaller sequence dimension.
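A small sketch combining a strided convolution with max pooling over time, again assuming PyTorch and made-up sizes:

```python
import torch
import torch.nn as nn

x = torch.randn(2, 50, 7)                            # (batch, hidden, seq)
conv = nn.Conv1d(50, 100, kernel_size=3, stride=2)   # stride compacts the sequence: 7 -> 3
feats = torch.relu(conv(x))                          # (2, 100, 3)

# Max pooling over time: keep the single most salient activation per channel.
pooled = feats.max(dim=2).values                     # (2, 100)
# Local max pooling instead: nn.MaxPool1d(kernel_size=2)(feats)
```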
Tricks:
- You can also mix strides and pooling to create local max pooling
- K-max pooling: keep the top k activations per channel, in their original order, instead of just the max (see the sketch after this list)
- Use dilation to skip positions within a kernel when convolving, widening the receptive field
- Use dropout
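A hedged sketch of two of these tricks (k-max pooling via `topk`, and a dilated convolution), assuming PyTorch; the tensor shapes are illustrative only:

```python
import torch
import torch.nn as nn

feats = torch.randn(2, 100, 20)                       # (batch, channels, seq) from some conv layer

# K-max pooling: keep the top-k values per channel, then restore their original order.
k = 3
topk_vals, topk_idx = feats.topk(k, dim=2)
kmax = topk_vals.gather(2, topk_idx.argsort(dim=2))   # (2, 100, 3)

# Dilated convolution: the kernel taps every 2nd position, widening the receptive field.
dilated = nn.Conv1d(100, 100, kernel_size=3, dilation=2)
out = dilated(feats)                                  # seq 20 -> 20 - 2*(3-1) = 16
```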
Comparison
- Bag of Vectors: good for simple classification tasks
- Window model: good for sequence tasks
- CNNs: classification, efficient with GPUs
- RNNs: slower, but better for sequence tasks and language modeling; with attention they can be very performant
Useful concepts
Gated Units
- Residual Blocks, a.k.a. Skip Connections ^283dea
- Highway Blocks: a gated version of the skip connections above; theoretically, you can hope a plain residual network learns the gating on its own (see the sketch below)
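A minimal sketch of the two block types, assuming PyTorch and using 1D convolutions as the inner transform (the lecture describes the pattern, not this exact parameterization):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Residual / skip connection: output = x + F(x)
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU(),
                               nn.Conv1d(dim, dim, 3, padding=1))
    def forward(self, x):
        return x + self.f(x)

class HighwayBlock(nn.Module):
    # Highway: a learned gate T(x) mixes the transform F(x) with the identity.
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Conv1d(dim, dim, 3, padding=1)
        self.gate = nn.Conv1d(dim, dim, 3, padding=1)
    def forward(self, x):
        t = torch.sigmoid(self.gate(x))
        return t * torch.relu(self.f(x)) + (1 - t) * x
```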
Batch Normalization
- Z-score normalization (standardization) over a mini-batch
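A small sketch of the z-transform batch norm performs, assuming PyTorch; `nn.BatchNorm1d` additionally learns a scale/shift and tracks running statistics:

```python
import torch
import torch.nn as nn

x = torch.randn(32, 100, 20)                 # (batch, channels, seq)

# Manual z-transform per channel over the mini-batch (and all sequence positions)
mean = x.mean(dim=(0, 2), keepdim=True)
var = x.var(dim=(0, 2), unbiased=False, keepdim=True)
x_hat = (x - mean) / torch.sqrt(var + 1e-5)

bn = nn.BatchNorm1d(100)                     # library equivalent, plus learned scale/shift
```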
1x1 Convolutions
- Effectively a fully connected layer applied across channels at each position
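A tiny sketch, assuming PyTorch: a kernel-size-1 convolution acts as a per-position linear layer over channels.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 100, 20)                  # (batch, channels, seq)

# kernel_size=1: a position-wise fully connected layer across channels,
# here used to shrink 100 channels down to 32.
proj = nn.Conv1d(100, 32, kernel_size=1)
out = proj(x)                                # (2, 32, 20) -- sequence length unchanged
```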
Very Deep Convolutional Networks for Text Classification
- Character embeddings
- 1d/temporal convolution
- 9 to 29 layers; going up to 47 layers does not improve results
- Shows that deep CNNs for NLP work in general (a rough skeleton follows this list)
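A rough skeleton in this spirit, assuming PyTorch; the layer counts, channel sizes, and the final k are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    # Two temporal convolutions with batch norm, keeping the sequence length.
    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(c_in, c_out, 3, padding=1), nn.BatchNorm1d(c_out), nn.ReLU(),
            nn.Conv1d(c_out, c_out, 3, padding=1), nn.BatchNorm1d(c_out), nn.ReLU())
    def forward(self, x):
        return self.block(x)

class TinyVDCNN(nn.Module):
    def __init__(self, n_chars=70, emb=16, n_classes=4):
        super().__init__()
        self.emb = nn.Embedding(n_chars, emb)
        self.conv0 = nn.Conv1d(emb, 64, 3, padding=1)
        self.blocks = nn.Sequential(ConvBlock(64, 64), nn.MaxPool1d(2),
                                    ConvBlock(64, 128), nn.MaxPool1d(2))
        self.fc = nn.Linear(128 * 8, n_classes)    # 8 = k for the final k-max pooling
    def forward(self, chars):                      # chars: (batch, seq) of character ids, seq >= 32
        x = self.emb(chars).transpose(1, 2)
        x = self.blocks(self.conv0(x))
        x = x.topk(8, dim=2).values.flatten(1)     # k-max pooling over time
        return self.fc(x)
```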
Combine RNNs and CNNs
Quasi-Recurrent Neural Networks
- Carry local (convolutional) gates across time steps to obtain pseudo-recurrence (see the sketch below)
- More efficient than RNNs because the heavy computation parallelizes across time
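A hedged sketch of QRNN-style fo-pooling, assuming PyTorch; the gate names (z, f, o) follow the paper, but the padding scheme and sizes here are illustrative:

```python
import torch
import torch.nn as nn

class QRNNLayer(nn.Module):
    # Convolutions compute candidate and gate sequences in parallel,
    # then a cheap element-wise recurrence carries state across time.
    def __init__(self, d_in, d_hid, k=2):
        super().__init__()
        self.conv = nn.Conv1d(d_in, 3 * d_hid, k, padding=k - 1)
        self.d_hid = d_hid

    def forward(self, x):                                      # x: (batch, seq, d_in)
        zfo = self.conv(x.transpose(1, 2))[:, :, :x.size(1)]   # trim so each step is causal
        z, f, o = zfo.split(self.d_hid, dim=1)
        z, f, o = torch.tanh(z), torch.sigmoid(f), torch.sigmoid(o)
        c = torch.zeros(x.size(0), self.d_hid, device=x.device)
        hs = []
        for t in range(x.size(1)):                             # element-wise recurrence only
            c = f[:, :, t] * c + (1 - f[:, :, t]) * z[:, :, t]
            hs.append(o[:, :, t] * c)
        return torch.stack(hs, dim=1)                          # (batch, seq, d_hid)
```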