Vanishing or exploding gradient problem
During backpropagation through time, gradients shrink step by step as they are propagated backward through the sequence.
If the gradient arriving from a distant time step is vanishingly small, we cannot tell whether that dependency is simply not needed or whether the model failed to capture it. In practice, the model cannot learn long-distance dependencies (a small numeric sketch of the decay follows).
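A minimal sketch of why this happens, assuming a scalar hidden state with a hypothetical recurrent weight `w = 0.5` and an arbitrary pre-activation `h = 0.3` (both values are illustrative, not from the notes): backpropagating through each step multiplies the gradient by the recurrent weight and the tanh derivative, so the magnitude decays geometrically.

```python
import numpy as np

# Toy illustration of the vanishing gradient: each backward step through
# time multiplies the gradient by w * tanh'(h). When |w * tanh'(h)| < 1,
# the gradient shrinks geometrically with the number of steps.
w = 0.5        # hypothetical recurrent weight (illustrative value)
grad = 1.0     # gradient at the final time step
for t in range(20):
    h = 0.3                               # arbitrary hidden pre-activation
    grad *= w * (1 - np.tanh(h) ** 2)     # chain rule through one step
    print(f"step {t + 1:2d}: gradient magnitude = {grad:.2e}")
```

After 20 steps the gradient is on the order of 1e-7, so the loss at the final step contributes essentially nothing to updating parameters that matter for early time steps.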
Syntactic recency vs. sequential recency
- The writer of the books is (correct: the verb agrees with the syntactically recent head "writer")
- The writer of the books are (incorrect: the verb agrees with the sequentially recent noun "books")
Exploding gradients
- very large gradients cause a single update to take a large, destabilizing step in parameter space
Gradient clipping
- rescale the gradient when its norm exceeds a threshold, so updates take smaller steps where the loss surface has steep cliffs (see the sketch below)
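A minimal training-step sketch showing gradient clipping by global norm, assuming a generic PyTorch model, optimizer, and dummy data (all sizes and the `max_norm=1.0` threshold are illustrative choices, not values from the notes):

```python
import torch
import torch.nn as nn

# Placeholder model, loss, and optimizer for illustration.
model = nn.RNN(input_size=16, hidden_size=32, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

x = torch.randn(8, 10, 16)       # (batch, time, features), dummy input
target = torch.randn(8, 10, 32)  # dummy target matching the RNN output

output, _ = model(x)
loss = criterion(output, target)

optimizer.zero_grad()
loss.backward()
# Rescale all gradients so their global norm is at most 1.0; the
# threshold is a hyperparameter, not a value from the notes.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```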
RNN variants: LSTM and GRU
LSTM
- Maintains both a hidden state and a cell state; the cell state stores long-term information
$$ \begin{align} f^{t} &= \sigma(W_fh^{t-1} + U_fx^t + b_f)\\ i^{t} &= \sigma(W_ih^{t-1} + U_ix^t + b_i)\\ o^{t} &= \sigma(W_oh^{t-1} + U_ox^t + b_o) \end{align} $$
- Forget gate controls what is forgotten and kept
- Input gate controls what is updated
- Output gate controls what is output
$$ \begin{align} \tilde{c}^t &= \tanh(W_ch^{t-1} + U_cx^t + b_c)\\ c^t &= f^t \circ c^{t-1} + i^t \circ \tilde{c}^t\\ h^t &= o^t \circ \tanh(c^t) \end{align} $$
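A minimal NumPy sketch of a single LSTM step implementing the equations above. The parameter names mirror the formulas; the random initialization and the sizes (hidden size 4, input size 3) are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step; params holds W_*, U_*, b_* named as in the equations."""
    f = sigmoid(params["W_f"] @ h_prev + params["U_f"] @ x + params["b_f"])        # forget gate
    i = sigmoid(params["W_i"] @ h_prev + params["U_i"] @ x + params["b_i"])        # input gate
    o = sigmoid(params["W_o"] @ h_prev + params["U_o"] @ x + params["b_o"])        # output gate
    c_tilde = np.tanh(params["W_c"] @ h_prev + params["U_c"] @ x + params["b_c"])  # candidate cell
    c = f * c_prev + i * c_tilde   # keep part of the old cell state, write new content
    h = o * np.tanh(c)             # expose part of the cell state as the hidden state
    return h, c

# Tiny usage with random parameters (hidden size 4, input size 3).
rng = np.random.default_rng(0)
params = {k: rng.standard_normal((4, 4)) * 0.1 for k in ["W_f", "W_i", "W_o", "W_c"]}
params.update({k: rng.standard_normal((4, 3)) * 0.1 for k in ["U_f", "U_i", "U_o", "U_c"]})
params.update({k: np.zeros(4) for k in ["b_f", "b_i", "b_o", "b_c"]})
h, c = lstm_step(rng.standard_normal(3), np.zeros(4), np.zeros(4), params)
```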
GRU
- no separate cell state; the hidden state alone carries the information that the LSTM splits across hidden and cell states
$$ \begin{align} u^{t} &= \sigma(W_uh^{t-1} + U_ux^t + b_u)\\ r^{t} &= \sigma(W_rh^{t-1} + U_rx^t + b_r)\\ \tilde{h}^t &= \tanh(W_h(r^t \circ h^{t-1}) + U_hx^t + b_h)\\ h^t &= (1 - u^t) \circ h^{t-1} + u^t \circ \tilde{h}^t \end{align} $$
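For comparison, a NumPy sketch of a single GRU step following the equations above; parameter names mirror the formulas and are assumed to be initialized elsewhere:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, params):
    """One GRU step; params holds W_*, U_*, b_* named as in the equations."""
    u = sigmoid(params["W_u"] @ h_prev + params["U_u"] @ x + params["b_u"])  # update gate
    r = sigmoid(params["W_r"] @ h_prev + params["U_r"] @ x + params["b_r"])  # reset gate
    h_tilde = np.tanh(params["W_h"] @ (r * h_prev) + params["U_h"] @ x + params["b_h"])  # candidate
    h = (1 - u) * h_prev + u * h_tilde   # interpolate between old state and candidate
    return h
```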
Gradient clipping and skip connections
Skip connections give gradients a direct path backward through the network, so they can flow across many layers without vanishing (see the sketch below).
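A minimal PyTorch sketch of a skip (residual) connection; the block sizes and layers are arbitrary placeholders:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Adds the input back to the block's output, giving gradients an
    identity path around the transformation."""
    def __init__(self, dim):
        super().__init__()
        self.layer = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.layer(x)   # skip connection: identity path + transformed path
```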
Bidirectional RNN and Multi-layer RNN
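A brief sketch of both variants using PyTorch's built-in `nn.LSTM`, which stacks layers and runs in both directions via constructor flags (the sizes below are arbitrary examples):

```python
import torch
import torch.nn as nn

# Two stacked LSTM layers, each processing the sequence forward and backward.
rnn = nn.LSTM(input_size=16, hidden_size=32, num_layers=2,
              bidirectional=True, batch_first=True)
x = torch.randn(8, 10, 16)          # (batch, time, features), dummy input
output, (h_n, c_n) = rnn(x)
print(output.shape)                 # (8, 10, 64): forward and backward hidden states concatenated
```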