Lecture 7: Vanishing Gradients, Fancy RNNs

Vanishing or exploding gradient problem

Gradients get smaller and smaller as they are backpropagated through time, so earlier time steps receive almost no learning signal.

If gradients are vanishing, we cannot tell whether a long-distance dependency was simply not needed or just not captured. In practice, the model cannot learn long-distance dependencies.

Syntactic recency vs. sequential recency

  • "The writer of the books is": correct, the verb agrees with the syntactically recent head noun "writer"
  • "The writer of the books are": incorrect, the verb agrees with the sequentially recent noun "books"

Because of vanishing gradients, RNN language models tend to be better at modeling sequential recency than syntactic recency, so they often make this kind of agreement error.

Exploding gradients

  • a large gradient causes a very large parameter update, which can move the parameters into a bad region (and can produce Inf or NaN in practice)

Gradient clipping

  • if the norm of the gradient exceeds a threshold, scale the gradient down before applying the update, i.e. take a smaller step when the cliff is steep (see the sketch below)
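
A minimal sketch of norm-based gradient clipping in PyTorch; the `model`, `optimizer`, and `loss_fn` names and the `max_norm=5.0` threshold are illustrative assumptions, not values from the lecture:

```python
import torch

def training_step(model, optimizer, loss_fn, x, y, max_norm=5.0):
    """One training step with gradient-norm clipping (illustrative sketch)."""
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # If the total gradient norm exceeds max_norm, rescale all gradients so the
    # norm equals max_norm: the update direction is kept, the step gets smaller.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()
```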

RNN variants: LSTM and GRU

LSTM

  • The LSTM maintains both a hidden state h^t and a cell state c^t; the cell state stores long-term information

$$ \begin{align} f^{t} &= \sigma(W_fh^{t-1} + U_fx^t + b_f)\\ i^{t} &= \sigma(W_ih^{t-1} + U_ix^t + b_i)\\ o^{t} &= \sigma(W_oh^{t-1} + U_ox^t + b_o)\\ \end{align} $$

  • Forget gate controls what is kept vs. forgotten from the previous cell state
  • Input gate controls which parts of the new cell content are written to the cell
  • Output gate controls which parts of the cell state are output to the hidden state

$$ \begin{align} \tilde{c}^t &= \tanh(W_ch^{t-1} + U_cx^t + b_c)\\ c^t &= f^t \circ c^{t-1} + i^t \circ \tilde{c}^t\\ h^t &= o^t \circ \tanh(c^t) \end{align} $$
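
A numpy sketch of one LSTM time step following the equations above; the parameter dictionary and its key names (W_f, U_f, b_f, ...) are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step; params maps names like "W_f" to weight matrices and biases."""
    f = sigmoid(params["W_f"] @ h_prev + params["U_f"] @ x + params["b_f"])       # forget gate
    i = sigmoid(params["W_i"] @ h_prev + params["U_i"] @ x + params["b_i"])       # input gate
    o = sigmoid(params["W_o"] @ h_prev + params["U_o"] @ x + params["b_o"])       # output gate
    c_tilde = np.tanh(params["W_c"] @ h_prev + params["U_c"] @ x + params["b_c"])  # new cell content
    c = f * c_prev + i * c_tilde   # keep part of the old cell state, write part of the new content
    h = o * np.tanh(c)             # expose part of the cell state as the hidden state
    return h, c
```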

GRU

  • no separate cell state; a single hidden state plays both roles

$$ \begin{align} u^{t} &= \sigma(W_uh^{t-1} + U_ux^t + b_u)\\ r^{t} &= \sigma(W_rh^{t-1} + U_rx^t + b_r)\\ \tilde{h}^t &= \tanh(W_h(r^t \circ h^{t-1}) + U_hx^t + b_h)\\ h^t &= (1 - u^t) \circ h^{t-1} + u^t \circ \tilde{h}^t \end{align} $$
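
The same kind of numpy sketch for one GRU time step; the parameter names are again illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, params):
    """One GRU step; there is no cell state, only the hidden state is carried over."""
    u = sigmoid(params["W_u"] @ h_prev + params["U_u"] @ x + params["b_u"])  # update gate
    r = sigmoid(params["W_r"] @ h_prev + params["U_r"] @ x + params["b_r"])  # reset gate
    h_tilde = np.tanh(params["W_h"] @ (r * h_prev) + params["U_h"] @ x + params["b_h"])
    h = (1 - u) * h_prev + u * h_tilde  # interpolate between old and new hidden state
    return h
```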

Gradient clipping and skip connections

Both techniques help gradients flow: clipping prevents exploding updates, and skip (residual) connections give gradients a direct path through deep networks (see the sketch below).
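
A minimal sketch of a residual (skip) connection in PyTorch, assuming the input and output have the same dimension; the class name and sizes are illustrative:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Computes y = x + F(x); the identity path gives gradients a direct route back to x."""
    def __init__(self, dim=128):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.f(x)  # skip connection around the transformation F
```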

Bidirectional RNN and Multi-layer RNN
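
Bidirectional RNNs run a forward and a backward RNN over the input and concatenate their hidden states; multi-layer (stacked) RNNs feed each layer's hidden states to the next layer as inputs. A brief PyTorch sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

# 2-layer bidirectional LSTM: input features of size 100, hidden size 64 per direction.
rnn = nn.LSTM(input_size=100, hidden_size=64, num_layers=2,
              bidirectional=True, batch_first=True)

x = torch.randn(8, 20, 100)  # (batch, sequence length, input features)
output, (h_n, c_n) = rnn(x)
print(output.shape)  # (8, 20, 128): forward and backward hidden states concatenated
print(h_n.shape)     # (4, 8, 64): num_layers * num_directions final hidden states
```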