\( \newcommand{\bm}[1]{\boldsymbol{#1}} \newcommand{\textm}[1]{\textsf{#1}} \def\T{{\mkern-2mu\raise-1mu\mathsf{T}}} \newcommand{\R}{\mathbb{R}} % real numbers \newcommand{\E}{{\rm I\kern-.2em E}} \newcommand{\w}{\bm{w}} % bold w \newcommand{\bmu}{\bm{\mu}} % bold mu \newcommand{\bSigma}{\bm{\Sigma}} % bold Sigma \newcommand{\bigO}{O} %\mathcal{O} \renewcommand{\d}[1]{\operatorname{d}\!{#1}} \)

B.2 Gradient methods

Consider the unconstrained optimization problem \[\begin{equation} \begin{array}{ll} \underset{\bm{x}}{\textm{minimize}} & f(\bm{x}) \end{array} \tag{B.1} \end{equation}\] where \(f\) is the objective function (assumed to be continuously differentiable).

An iterative method or algorithm produces a sequence of iterates \(\bm{x}^0, \bm{x}^1, \bm{x}^2,\dots\) that may or may not converge to an optimal solution \(\bm{x}^\star\). In an ideal case with \(f\) convex, one may expect that, as the iterations proceed with \(k \rightarrow \infty\), the objective function converges to the optimal value of the problem, \[ f\left(\bm{x}^k\right) \rightarrow p^\star, \] and the gradient tends to zero \[ \nabla f\left(\bm{x}^k\right) \rightarrow \bm{0}. \]

For details on iterative methods and convergence properties, the reader is referred to the many available textbooks, such as (Bertsekas, 1999), (S. P. Boyd and Vandenberghe, 2004), (Nocedal and Wright, 2006), and (Beck, 2017). The case of nondifferentiable \(f\) leads to subgradient methods (Beck, 2017), which are not considered herein.

Descent methods

Descent methods (a.k.a. gradient methods) satisfy the property that \(f\left(\bm{x}^{k+1}\right) < f\left(\bm{x}^{k}\right)\) and obtain the iterates as follows: \[\bm{x}^{k+1} = \bm{x}^{k} + \alpha^{k}\bm{d}^{k},\] where \(\bm{d}^{k}\) is the search direction and \(\alpha^{k}\) is the stepsize at iteration \(k\).

For the descent property to hold with a sufficiently small step, the direction \(\bm{d}\) has to satisfy \(\nabla f\left(\bm{x}\right)^\T \bm{d} < 0\) (i.e., it must be a descent direction) and \(\alpha\) has to be properly chosen: if \(\alpha\) is too large, the descent property may be violated even along a descent direction.
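
To make the stepsize selection concrete, the following is a minimal Python sketch of a backtracking (Armijo) line search, one common way to choose \(\alpha\); the function signature and default parameter values are illustrative assumptions, not a standard API:

```python
import numpy as np

def backtracking_line_search(f, grad_fx, x, d, alpha0=1.0, beta=0.5, c=1e-4):
    """Backtracking (Armijo) line search: shrink alpha until the
    sufficient-decrease condition f(x + alpha*d) <= f(x) + c*alpha*grad_fx^T d
    holds, assuming d is a descent direction (grad_fx^T d < 0)."""
    alpha = alpha0
    fx = f(x)
    slope = grad_fx @ d  # directional derivative; negative for a descent direction
    while f(x + alpha * d) > fx + c * alpha * slope:
        alpha *= beta    # geometrically reduce the stepsize
    return alpha
```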

Gradient descent method

The gradient descent method (also known as steepest descent method) is a descent method where the search direction is chosen as the opposite direction of the gradient: \[ \bm{d} = -\nabla f(\bm{x}), \] which is clearly a descent direction since it satisfies \(\nabla f\left(\bm{x}\right)^\T\bm{d} = -\|\nabla f(\bm{x})\|_2^2 < 0\) (whenever the gradient is nonzero).

The gradient descent update is then \[\bm{x}^{k+1} = \bm{x}^{k} - \alpha^{k}\nabla f\left(\bm{x}^{k}\right).\] As a stopping criterion, the heuristic \(\|\nabla f(\bm{x})\|_2 \le \epsilon\) is commonly used.

Unfortunately, convergence of the gradient descent method is often slow, so it is rarely used in practice unless the high dimensionality of the problem or the need for a distributed implementation demands it. The gradient descent method is summarized in Algorithm B.1.

Algorithm B.1: Gradient descent method for the unconstrained problem (B.1).

Choose initial point \(\bm{x}^0\) and tolerance \(\epsilon>0\);
Set \(k \gets 0\);
repeat

  1. Compute the negative gradient as descent direction: \(\bm{d}^{k} = -\nabla f\left(\bm{x}^{k}\right)\);
  2. Line search: Choose a stepsize \(\alpha^{k} > 0\) via exact or backtracking line search;
  3. Obtain next iterate: \[\bm{x}^{k+1} = \bm{x}^{k} - \alpha^{k}\nabla f\left(\bm{x}^{k}\right);\]
  4. \(k \gets k+1\);

until convergence (i.e., \(\|\nabla f(\bm{x}^{k})\|_2 \le \epsilon\));
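
As an illustration, here is a minimal Python sketch of Algorithm B.1, reusing the backtracking routine sketched earlier; the quadratic test problem at the end is purely illustrative:

```python
def gradient_descent(f, grad_f, x0, eps=1e-6, max_iter=10_000):
    """Algorithm B.1: gradient descent with backtracking line search.
    Stops when the gradient norm falls below eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:  # stopping criterion ||grad f(x)||_2 <= eps
            break
        d = -g                        # steepest descent direction
        alpha = backtracking_line_search(f, g, x, d)
        x = x + alpha * d             # descent update
    return x

# Example: minimize the convex quadratic f(x) = (1/2) x^T A x - b^T x,
# whose unique minimizer solves A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_opt = gradient_descent(lambda x: 0.5 * x @ A @ x - b @ x,
                         lambda x: A @ x - b,
                         x0=np.zeros(2))
```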

Newton’s method

Newton’s method is a descent method that uses not only the gradient but also the Hessian of \(f\), denoted by \(\nabla^2 f(\bm{x})\), in the search direction as follows: \[ \bm{d} = -\nabla^2 f(\bm{x})^{-1} \nabla f(\bm{x}), \] which is a descent direction since \(\nabla f(\bm{x})^\T\bm{d} = -\nabla f(\bm{x})^\T\nabla^2 f(\bm{x})^{-1}\nabla f(\bm{x}) < 0\) whenever the Hessian is positive definite. It is tacitly assumed that \(f\) is twice continuously differentiable and that the Hessian matrix is positive definite for all \(\bm{x}\).

This direction has the interesting interpretation that \(\bm{x} + \bm{d}\) minimizes the second-order approximation of \(f\) around \(\bm{x}\): \[ \hat{f}(\bm{x} + \bm{v}) = f(\bm{x}) + \nabla f(\bm{x})^\T\bm{v} + \frac{1}{2}\bm{v}^\T\nabla^2 f(\bm{x})\bm{v}. \]
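
Indeed, setting the gradient of \(\hat{f}\) with respect to \(\bm{v}\) to zero gives \[ \nabla f(\bm{x}) + \nabla^2 f(\bm{x})\,\bm{v} = \bm{0}, \] whose solution \(\bm{v} = -\nabla^2 f(\bm{x})^{-1}\nabla f(\bm{x})\) is precisely the Newton direction \(\bm{d}\).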

Newton’s method update is then \[\bm{x}^{k+1} = \bm{x}^{k} - \alpha^{k} \nabla^2 f\left(\bm{x}^{k}\right)^{-1} \nabla f\left(\bm{x}^{k}\right).\]

An interesting quantity in Newton’s method is the Newton decrement \[ \lambda(\bm{x}) = \left(\nabla f(\bm{x})^\T\nabla^2 f(\bm{x})^{-1}\nabla f(\bm{x})\right)^{1/2}, \] which measures the proximity of \(\bm{x}\) to an optimal point. More specifically, it gives an estimate of \(f(\bm{x}) - p^\star\) as \[ f(\bm{x}) - \underset{\bm{y}}{\textm{inf}}\, \hat{f}(\bm{y}) = \frac{1}{2} \lambda(\bm{x})^2, \] where \(\hat{f}\) is the second-order approximation of \(f\) at \(\bm{x}\). Observe that the computational cost of the Newton decrement is negligible given that we already have the Newton direction: \(\lambda(\bm{x})^2 = -\nabla f(\bm{x})^\T\bm{d}\).
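
This estimate follows by evaluating \(\hat{f}\) at its minimizer \(\bm{v} = -\nabla^2 f(\bm{x})^{-1}\nabla f(\bm{x})\): \[ \underset{\bm{y}}{\textm{inf}}\, \hat{f}(\bm{y}) = f(\bm{x}) - \frac{1}{2}\nabla f(\bm{x})^\T\nabla^2 f(\bm{x})^{-1}\nabla f(\bm{x}) = f(\bm{x}) - \frac{1}{2}\lambda(\bm{x})^2. \]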

Newton’s method enjoys fast convergence and is at the heart of most modern solvers. Only when the dimensionality of the problem is very large does it become impractical, due to the computation and storage of the Hessian. For such cases, there are approximate versions, called quasi-Newton methods (Nocedal and Wright, 2006). Newton’s method is summarized in Algorithm B.2.

Algorithm B.2: Newton’s method for the unconstrained problem (B.1).

Choose initial point \(\bm{x}^0\) and tolerance \(\epsilon>0\);
Set \(k \gets 0\);
repeat

  1. Compute Newton direction and decrement: \[\bm{d}^{k} = -\nabla^2 f(\bm{x}^{k})^{-1} \nabla f(\bm{x}^{k}) \quad\textm{and}\quad \lambda(\bm{x}^{k})^2 = -\nabla f(\bm{x}^{k})^\T\bm{d}^{k}; \]
  2. Line search: Choose a stepsize \(\alpha^{k} > 0\) via exact or backtracking line search;
  3. Obtain next iterate: \[\bm{x}^{k+1} = \bm{x}^{k} - \alpha^{k} \nabla^2 f\left(\bm{x}^{k}\right)^{-1} \nabla f\left(\bm{x}^{k}\right);\]
  4. \(k \gets k+1\);

until convergence (i.e., \(\lambda(\bm{x}^{k})^2/2 \le \epsilon\));
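
Analogously, here is a minimal Python sketch of Algorithm B.2, again reusing the earlier backtracking routine; note that it solves the linear system \(\nabla^2 f(\bm{x}^{k})\bm{d}^{k} = -\nabla f(\bm{x}^{k})\) rather than explicitly inverting the Hessian:

```python
def newton_method(f, grad_f, hess_f, x0, eps=1e-8, max_iter=100):
    """Algorithm B.2: Newton's method with backtracking line search.
    Stops when lambda(x)^2 / 2 <= eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        d = np.linalg.solve(hess_f(x), -g)  # Newton direction (no explicit inverse)
        lambda_sq = -g @ d                  # squared Newton decrement
        if lambda_sq / 2.0 <= eps:          # stopping criterion
            break
        alpha = backtracking_line_search(f, g, x, d)
        x = x + alpha * d
    return x
```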

Convergence

Ideally, we would like the sequence \(\{\bm{x}^{k}\}\) generated by a descent method to converge to a global minimum. Unfortunately, this is too much to expect, at least when \(f\) is not convex, because of the presence of local minima that are not global. Thus, the most we can expect from a descent method is that it converges to a stationary point. Such a point is a global minimum if \(f\) is convex, but this need not be so for nonconvex problems.

Descent methods enjoy nice theoretical convergence guarantees (Bertsekas, 1999), as summarized in Theorem B.1.

Theorem B.1 (Convergence of descent methods) Suppose \(\{\bm{x}^{k}\}\) is a sequence generated by a descent method (such as the gradient descent method or Newton’s method) and the stepsize \(\alpha^{k}\) is chosen by exact or backtracking line search. Then, every limit point of \(\{\bm{x}^{k}\}\) is a stationary point of the problem.

Other, simpler stepsize rules also enjoy theoretical convergence guarantees (Bertsekas, 1999):

  • constant stepsize: \(\alpha^{k} = \alpha\) for sufficiently small \(\alpha\),
  • diminishing stepsize rule: \(\alpha^{k} \rightarrow 0\) with \(\sum_{k=0}^\infty \alpha^{k} = \infty\) (a sketch of this rule follows the list).
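
For instance, the harmonic schedule \(\alpha^{k} = \alpha^{0}/(k+1)\) satisfies both conditions, since it tends to zero while its partial sums diverge. A minimal Python sketch of gradient descent under this rule (the schedule and iteration count are illustrative choices):

```python
def gradient_descent_diminishing(grad_f, x0, alpha0=1.0, num_iter=1000):
    """Gradient descent with the diminishing stepsize alpha_k = alpha0 / (k+1),
    which satisfies alpha_k -> 0 and sum_k alpha_k = infinity."""
    x = np.asarray(x0, dtype=float)
    for k in range(num_iter):
        alpha = alpha0 / (k + 1)  # diminishing stepsize
        x = x - alpha * grad_f(x)
    return x
```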

Regarding Newton’s method, it is worth knowing that its convergence can be divided into two phases:

  1. damped Newton phase: the convergence is slow; and
  2. quadratically convergent phase: the convergence is extremely fast.

In practice, the gradient descent method converges slowly, whereas Newton’s method enjoys much faster convergence (at the expense of computing the Hessian). Thus, if the problem dimensionality is manageable, Newton’s method is preferred. In some applications, however, the dimensionality can be extremely large (such as in deep learning) and computing and storing the Hessian is not a feasible option.

References

Beck, A. (2017). First-order methods in optimization. Society for Industrial and Applied Mathematics (SIAM).
Bertsekas, D. P. (1999). Nonlinear programming. Athena Scientific.
Boyd, S. P., and Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.
Nocedal, J., and Wright, S. J. (2006). Numerical optimization. Springer Verlag.