$$\frac{\partial s}{\partial W} = \delta^T x^T $$ where \(\delta\) is the error signal coming from above (later layers) and \(x\) is the local input signal.
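A minimal numpy sketch of this outer-product form, assuming a single linear layer \(z = Wx + b\) with upstream error \(\delta = \partial s / \partial z\) (the shapes and variable names below are illustrative, not from the notes):

```python
import numpy as np

# Illustrative shapes: W is (n, m), x is (m,), delta is (n,)
n, m = 3, 5
rng = np.random.default_rng(0)

W = rng.normal(size=(n, m))
b = rng.normal(size=n)
x = rng.normal(size=m)

z = W @ x + b                  # forward pass: z = Wx + b
delta = rng.normal(size=n)     # stand-in for the upstream error ds/dz

# ds/dW = delta^T x^T: an outer product with the same (n, m) shape as W
grad_W = np.outer(delta, x)
grad_b = delta                 # ds/db is just the upstream error

print(grad_W.shape)            # (3, 5), matches W -- the shape convention
```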
Tips on deriving gradients:
- Keep track of dimensionality
- Chain rule
- Softmax: two cases, one for correct label and one for others (see the sketch after this list)
- Element-wise derivatives first if you are getting confused
- Use shape convention
- Use pre-trained word vectors
- Train/fine-tune them only when you have a large dataset (see the embedding sketch after this list)
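For the softmax "two cases" tip, a minimal sketch of the cross-entropy gradient with respect to the scores, assuming a single example with integer label y (names are illustrative):

```python
import numpy as np

def softmax(scores):
    # Shift by the max for numerical stability
    e = np.exp(scores - scores.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
y = 0  # index of the correct class

probs = softmax(scores)

# Gradient of the cross-entropy loss w.r.t. the scores:
#   correct class:  p_y - 1
#   other classes:  p_j
dscores = probs.copy()
dscores[y] -= 1.0
print(dscores)
```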
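For the pre-trained word vector tips, a hedged PyTorch sketch of the two options, keeping the vectors frozen (small dataset) versus fine-tuning them (large dataset); the tensor of pretrained vectors here is a random placeholder:

```python
import torch
import torch.nn as nn

# Placeholder for real pre-trained vectors (e.g. GloVe), shape (vocab_size, dim)
pretrained = torch.randn(10000, 300)

# Small dataset: keep the word vectors fixed
emb_frozen = nn.Embedding.from_pretrained(pretrained, freeze=True)

# Large dataset: allow the vectors to be fine-tuned during training
emb_finetune = nn.Embedding.from_pretrained(pretrained, freeze=False)
```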
Backpropagation
$$ \frac{\partial s}{\partial z} = \frac{\partial s}{\partial h} \frac{\partial h}{\partial z} $$
Upstream gradient * local gradient -> downstream gradient
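A minimal numpy sketch of this upstream-times-local pattern for one layer \(h = f(z)\), taking \(f = \tanh\) as an assumed example (any element-wise non-linearity works the same way):

```python
import numpy as np

z = np.array([0.5, -1.0, 2.0])
h = np.tanh(z)                           # forward: h = f(z)

upstream = np.array([0.1, -0.2, 0.3])    # ds/dh, handed down from later layers
local = 1.0 - np.tanh(z) ** 2            # dh/dz for tanh, element-wise
downstream = upstream * local            # ds/dz = ds/dh * dh/dz

print(downstream)
```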
Gradient checking:
For a small h, approximate the gradient numerically with the central difference \(f'(x) \approx \frac{f(x+h) - f(x-h)}{2h}\) and compare it against the analytic gradient.
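A minimal numerical gradient check, assuming a scalar-valued function of a parameter vector (the example function here is arbitrary):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Central-difference estimate of df/dx, one coordinate at a time."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        orig = x[i]
        x[i] = orig + h
        f_plus = f(x)
        x[i] = orig - h
        f_minus = f(x)
        x[i] = orig                      # restore the coordinate
        grad[i] = (f_plus - f_minus) / (2 * h)
    return grad

# Example: f(x) = sum(x^2), whose analytic gradient is 2x
f = lambda x: np.sum(x ** 2)
x = np.array([1.0, -2.0, 3.0])
num = numerical_gradient(f, x)
print(np.max(np.abs(num - 2 * x)))       # should be close to 0
```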
Regularization
\(J(\theta) = ... + \lambda \sum_k \theta_k^2\) (L2 penalty: discourages large weights and reduces overfitting)
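A small sketch of adding the L2 term and its gradient, assuming theta is a flat numpy parameter vector and that data_loss / data_grad come from elsewhere (both are placeholders):

```python
import numpy as np

lam = 1e-4                                    # regularization strength lambda
theta = np.random.randn(10)

data_loss = 0.0                               # placeholder: loss from the data term
data_grad = np.zeros_like(theta)              # placeholder: its gradient

loss = data_loss + lam * np.sum(theta ** 2)   # J(theta) = ... + lambda * sum(theta_k^2)
grad = data_grad + 2 * lam * theta            # the penalty adds 2*lambda*theta to the gradient
```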
Non-linear activations
Parameter initialization
Optimizers
Learning rates
- Start small
- Or, if you are using a fancier adaptive optimizer, you can start larger (e.g., 0.1); see the sketch below
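A hedged PyTorch sketch of the two choices; the optimizer types, model, and learning-rate values are illustrative (the 0.1 simply follows the note's "start large with fancy optimizers" suggestion):

```python
import torch
import torch.nn as nn

model = nn.Linear(100, 10)   # placeholder model

# Plain SGD: start with a small learning rate
opt_sgd = torch.optim.SGD(model.parameters(), lr=1e-3)

# Adaptive ("fancy") optimizer: the note suggests you can start larger
opt_adam = torch.optim.Adam(model.parameters(), lr=0.1)
```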