OPT_2: Gradient descent convergence rate (P1)
-
date: 21/12/2021 15:50 · tags: MATH, OPTIMIZATION
Given an objective function $f(w)$, with $w$ the parameters (weights), that satisfies:
- Lipschitz continuous function: $|f(w_1) - f(w_2)| \le L_f \|w_1 - w_2\|$
- Lipschitz continuous gradient: $\|\nabla f(w_1) - \nabla f(w_2)\| \le L \|w_1 - w_2\|$

To minimize $f(w)$ by gradient descent, we update $w$ by the rule: $w_{t+1} = w_t - \eta \nabla f(w_t)$.

By Taylor expansion (combined with the Lipschitz continuity of the gradient), we have: $f(w_{t+1}) \le f(w_t) + \nabla f(w_t)^\top (w_{t+1} - w_t) + \frac{L}{2}\|w_{t+1} - w_t\|^2$.
Replace $w_{t+1}$ by $w_t - \eta \nabla f(w_t)$: $f(w_{t+1}) \le f(w_t) - \eta \|\nabla f(w_t)\|^2 + \frac{L\eta^2}{2}\|\nabla f(w_t)\|^2$. Consider the right-hand side as a quadratic function of $\eta$. The optimal solution is $\eta^* = \frac{1}{L}$. So, $\eta = \frac{1}{L}$ is the optimal learning rate for the function $f$.
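To make the derivation concrete, here is a minimal numerical sketch (my own example, not from the original note, and assuming NumPy): on a quadratic $f(w) = \frac{1}{2} w^\top A w$, the Lipschitz constant of the gradient is the largest eigenvalue of $A$, and the step size that minimizes the one-step upper bound comes out at $1/L$.

```python
# Sketch: check that the descent-lemma bound
#   g(eta) = f(w) - eta * ||grad||^2 + 0.5 * L * eta^2 * ||grad||^2
# is minimized at eta = 1/L for a quadratic f(w) = 0.5 * w^T A w.
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B.T @ B + np.eye(5)          # symmetric positive definite Hessian
L = np.linalg.eigvalsh(A).max()  # Lipschitz constant of the gradient

def f(w):
    return 0.5 * w @ A @ w

w = rng.standard_normal(5)
grad = A @ w                     # gradient of the quadratic
g2 = grad @ grad

def bound(eta):
    # Right-hand side of the bound after substituting the GD update.
    return f(w) - eta * g2 + 0.5 * L * eta**2 * g2

etas = np.linspace(1e-4, 2.0 / L, 1000)
eta_best = etas[np.argmin([bound(e) for e in etas])]
print(f"1/L = {1.0 / L:.4f}, bound-minimizing eta = {eta_best:.4f}")

# One gradient step with eta = 1/L decreases f, as the bound guarantees.
w_next = w - (1.0 / L) * grad
print(f"f(w) = {f(w):.4f}  ->  f(w - (1/L) grad) = {f(w_next):.4f}")
```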
In practice, we never use $\eta = 1/L$ because:
- $L$ is expensive to compute.
- If $L$ is really big, then $1/L$ is too small. (A small illustration of this follows below.)
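Here is a hypothetical illustration of the second point (again assuming NumPy, not from the original note): on an ill-conditioned quadratic, the global constant $L$ is dictated by the steepest direction, so the step $1/L$ makes almost no progress along the flat direction.

```python
# Sketch: with curvatures 1e4 and 1, L = 1e4, so eta = 1/L = 1e-4.
# The steep coordinate converges in one step; the flat one barely moves.
import numpy as np

A = np.diag([1e4, 1.0])   # Hessian of f(w) = 0.5 * w^T A w
L = 1e4                   # largest eigenvalue of A
eta = 1.0 / L             # "safe" learning rate from the bound

w = np.array([1.0, 1.0])
for _ in range(1000):
    w = w - eta * (A @ w) # gradient descent step, grad f(w) = A w

print(w)  # roughly [0.0, 0.905]: the flat coordinate decayed only by (1 - 1e-4)^1000
```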