Gradients without Backpropagation
Forward mode vs. reverse mode
Reverse mode, a.k.a. backpropagation, is often represented as $$ \begin{align} \underset{\mathcal{R}^n}{\theta} &\xrightarrow{\textit{forward}} \underset{\mathcal{R}^n \rightarrow \mathcal{R}^m}{f(\theta)}\\ \underset{\mathcal{R}^m \cdot \mathcal{R}^{m\times n}}{v^T J_f(\theta)} &\xleftarrow{\textit{reverse}} \underset{\mathcal{R}^m}{v} \end{align} $$
where \(v \in \mathcal{R}^m\) is the vector of gradients propagated back to this point and \(J_f(\theta) \in \mathcal{R}^{m\times n}\) is the Jacobian of all partial derivatives of \(f : \mathcal{R}^n \rightarrow \mathcal{R}^m\), evaluated at the parameter \(\theta \in \mathcal{R}^n\).
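To make this concrete, here is a minimal sketch of the reverse-mode picture using JAX's `jax.vjp` (JAX is my choice here, and the toy `f`, `theta`, and `v` are my own, not the paper's):

```python
# A toy reverse-mode (backpropagation) run: forward pass first, then pull
# back a cotangent v in R^m to get v^T J_f(theta) in R^n.
import jax
import jax.numpy as jnp

def f(theta):                            # f : R^3 -> R^2
    return jnp.array([jnp.sum(theta ** 2), jnp.prod(theta)])

theta = jnp.array([1.0, 2.0, 3.0])       # theta in R^n, n = 3
v = jnp.array([1.0, 0.0])                # v in R^m, m = 2

y, vjp_fn = jax.vjp(f, theta)            # forward pass, storing intermediates
(vT_J,) = vjp_fn(v)                      # reverse pass: v^T J_f(theta), shape (3,)
print(y, vT_J)
```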
I think they use \(\theta\) both as the parameter and as the input, which is really confusing. For example, if \(\theta \in \mathcal{R}^n\) is the parameter, how come it has nothing to do with \(m\) when \(f: \mathcal{R}^n \rightarrow \mathcal{R}^m\)? But if it is the input, why do they use \(\theta_t\) as their parameter in the pseudocode description?
Forward mode computes the forward gradient as follows (the last line assumes a scalar-valued \(f\), i.e. \(m = 1\), so that \(\nabla f(\theta) \cdot v\) is a scalar): $$ \begin{align} \underset{\mathcal{R}^n}{\theta} &\xrightarrow{\textit{forward}} \underset{\mathcal{R}^n \rightarrow \mathcal{R}^m}{f(\theta)}\\ \underset{\mathcal{R}^n}{v} &\xrightarrow{\textit{forward}} \underset{\mathcal{R}^{m\times n} \cdot \mathcal{R}^n}{J_f(\theta)\,v}\\ \underset{\mathcal{R}^n \rightarrow \mathcal{R}^n}{g(\theta)} &= \underset{(\mathcal{R}^n \cdot \mathcal{R}^n)\,\mathcal{R}^n}{(\nabla f(\theta)\cdot v)\,v} \end{align} $$
Here \(v \in \mathcal{R}^n\) is a perturbation vector (\(v_i \sim \mathcal{N}(0, 1)\)). The calculation of \(J_f(\theta)v\) is done by treating \(v\) as a weight vector: \(J_f(\theta)v\) is simply a weighted sum of the columns of \(J_f(\theta)\), the sensitivities of \(f\) at \(\theta\) with respect to each input.
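A corresponding sketch of the forward gradient with JAX's `jax.jvp`, again with a toy scalar-valued `f` and values of my own choosing:

```python
# Forward gradient g(theta) = (grad f(theta) . v) v for a scalar-valued f:
# a single forward run yields both f(theta) and the directional derivative.
import jax
import jax.numpy as jnp

def f(theta):                            # f : R^n -> R
    return jnp.sum(jnp.tanh(theta) ** 2)

theta = jnp.array([0.5, -1.0, 2.0])
v = jax.random.normal(jax.random.PRNGKey(0), theta.shape)  # v_i ~ N(0, 1)

y, dir_deriv = jax.jvp(f, (theta,), (v,))  # f(theta) and grad f(theta) . v together
g = dir_deriv * v                          # forward gradient, a vector in R^n
print(y, g)
```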
“Given a function f : R^n → R^m and the values θ ∈ R^n, v ∈ R^n, forward mode AD computes f(θ) and the Jacobian–vector product J_f(θ) v, where J_f(θ) ∈ R^{m×n} is the Jacobian matrix of all partial derivatives of f evaluated at θ, and v is a vector of perturbations. For the case of f : R^n → R the Jacobian–vector product corresponds to a directional derivative ∇f(θ) · v, which is the projection of the gradient ∇f at θ onto the direction vector v, representing the rate of change along that direction. It is important to note that the forward mode evaluates the function f and its Jacobian–vector product J_f v simultaneously in a single forward run. Also note that J_f v is obtained without having to compute the Jacobian J_f, a feature sometimes referred to as a matrix-free computation.” (Baydin et al., 2022, p. 2)
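The matrix-free point can be checked directly: a sketch (my own, with a toy vector-valued `f`) comparing the single-run `jax.jvp` result against multiplying an explicitly built Jacobian by v:

```python
# jvp gives J_f(theta) v without ever materializing J_f; the explicit
# Jacobian below exists only to verify the result.
import jax
import jax.numpy as jnp

def f(theta):                            # f : R^3 -> R^2
    return jnp.array([jnp.sum(theta ** 2), jnp.prod(theta)])

theta = jnp.array([1.0, 2.0, 3.0])
v = jnp.array([0.1, -0.2, 0.3])

_, Jv_matrix_free = jax.jvp(f, (theta,), (v,))    # single forward run, no Jacobian
Jv_explicit = jax.jacfwd(f)(theta) @ v            # builds the full 2x3 Jacobian first
print(jnp.allclose(Jv_matrix_free, Jv_explicit))  # True
```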
“v ∈ R^n is a perturbation vector taken as a multivariate random variable v ∼ p(v) such that v’s scalar components v_i are independent and have zero mean and unit variance for all i, and ∇f(θ) · v ∈ R is the directional derivative of f at point θ in direction v.” (Baydin et al., 2022, p. 3)
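A quick numerical sanity check, my own sketch rather than anything from the paper: because the \(v_i\) are independent with zero mean and unit variance, averaging \((\nabla f(\theta)\cdot v)\,v\) over many sampled \(v\) should approach \(\nabla f(\theta)\):

```python
# Monte Carlo check that the forward gradient averages out to the true gradient.
import jax
import jax.numpy as jnp

def f(theta):
    return jnp.sum(jnp.tanh(theta) ** 2)

theta = jnp.array([0.5, -1.0, 2.0])
keys = jax.random.split(jax.random.PRNGKey(0), 10_000)

def forward_grad(key):
    v = jax.random.normal(key, theta.shape)       # v_i ~ N(0, 1)
    _, dir_deriv = jax.jvp(f, (theta,), (v,))
    return dir_deriv * v

estimate = jnp.mean(jax.vmap(forward_grad)(keys), axis=0)
print(estimate)               # close to the true gradient
print(jax.grad(f)(theta))     # reverse-mode gradient for comparison
```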
“Computing ∇f using only forward mode is possible by evaluating f forward n times with direction vectors taken as standard basis (or one-hot) vectors e_i ∈ R^n, i = 1 . . . n, where e_i denotes a vector with a 1 in the ith coordinate and 0s elsewhere. This allows the evaluation of the sensitivity of f w.r.t. each input ∂f/∂θ_i separately, which when combined give us the gradient ∇f.” (Baydin et al., 2022, p. 3)
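That n-forward-pass construction, sketched with one-hot direction vectors (toy `f` of my own again):

```python
# One jvp per standard basis vector e_i recovers df/dtheta_i; stacking the
# n results reproduces the full gradient.
import jax
import jax.numpy as jnp

def f(theta):
    return jnp.sum(jnp.tanh(theta) ** 2)

theta = jnp.array([0.5, -1.0, 2.0])
n = theta.shape[0]

partials = []
for i in range(n):
    e_i = jnp.zeros(n).at[i].set(1.0)             # one-hot direction vector
    _, d_i = jax.jvp(f, (theta,), (e_i,))         # sensitivity df/dtheta_i
    partials.append(d_i)

grad_forward = jnp.stack(partials)
print(jnp.allclose(grad_forward, jax.grad(f)(theta)))  # True
```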
“Importantly, there is virtually no difference in memory consumption between the two methods.” (Baydin et al., 2022, p. 7)