\( \newcommand{\bm}[1]{\boldsymbol{#1}} \newcommand{\textm}[1]{\textsf{#1}} \def\T{{\mkern-2mu\raise-1mu\mathsf{T}}} \newcommand{\R}{\mathbb{R}} % real numbers \newcommand{\E}{{\rm I\kern-.2em E}} \newcommand{\w}{\bm{w}} % bold w \newcommand{\bmu}{\bm{\mu}} % bold mu \newcommand{\bSigma}{\bm{\Sigma}} % bold mu \newcommand{\bigO}{O} %\mathcal{O} \renewcommand{\d}[1]{\operatorname{d}\!{#1}} \)

B.3 Projected gradient methods

Consider the constrained optimization problem \[ \begin{array}{ll} \underset{\bm{x}}{\textm{minimize}} & f(\bm{x})\\ \textm{subject to} & \bm{x} \in \mathcal{X}, \end{array} \] where \(f\) is the objective function (assumed to be continuously differentiable) and \(\mathcal{X}\) is a convex set.

If we were to use a descent method \[\bm{x}^{k+1} = \bm{x}^{k} + \alpha^{k}\bm{d}^{k}\] with \(\bm{d}^{k}\) a descent direction, we would possibly end up with an infeasible point \(\bm{x}^{k+1}\).

Projected gradient methods or gradient projection methods address this issue by projecting onto the feasible set after taking the step (Beck, 2017; Bertsekas, 1999): \[\bm{x}^{k+1} = \left[\bm{x}^{k} + \alpha^{k}\bm{d}^{k}\right]_\mathcal{X}\]

where \(\left[\bm{x}\right]_\mathcal{X}\) denotes projection onto the set \(\mathcal{X}\) defined as the solution to \(\textm{min}_{\bm{y}} \|\bm{y} - \bm{x}\|\) subject to \(\bm{y} \in \mathcal{X}.\)

A slightly more general version of the gradient projection method is \[ \begin{aligned} \bm{\bar{x}}^{k} &= \left[\bm{x}^{k} + s^{k}\bm{d}^{k}\right]_\mathcal{X}\\ \bm{x}^{k+1} &= \bm{x}^{k} + \alpha^{k}\left(\bm{\bar{x}}^{k} - \bm{x}^{k}\right), \end{aligned} \] where \(\bm{d}^{k} = \bm{\bar{x}}^{k} - \bm{x}^{k}\) is a feasible direction (because \(\bm{\bar{x}}^{k}\) is feasible due to the projection), \(\alpha^{k}\) is the stepsize, and \(s^{k}\) is a positive scalar (Bertsekas, 1999). Note that if we choose \(\alpha^{k}=1\) then the iteration simplifies to the previous expression: \[\bm{x}^{k+1} = \left[\bm{x}^{k} + s^{k}\bm{d}^{k}\right]_\mathcal{X}\] where now \(s^k\) can be viewed as a stepsize. Further note that if \(\bm{x}^{k} + s^{k}\bm{d}^{k}\) is already feasible, then the gradient projection method reduces to the regular gradient method.

In practice, the gradient projection method only makes sense if the projection is easy to compute.

Projected gradient descent method

The projected gradient descent method simply uses as search direction the negative gradient: \[ \begin{aligned} \bm{\bar{x}}^{k} &= \left[\bm{x}^{k} - s^{k} \nabla f\left(\bm{x}^{k}\right)\right]_\mathcal{X}\\ \bm{x}^{k+1} &= \bm{x}^{k} + \alpha^{k}\left(\bm{\bar{x}}^{k} - \bm{x}^{k}\right). \end{aligned} \]

Constrained Newton’s method

Let us assume that \(f\) is twice continuously differentiable and that the Hessian matrix is positive definite for all \(\bm{x}\in\mathcal{X}\).

The constrained Newton’s method is \[ \begin{aligned} \bm{\bar{x}}^{k} &= \underset{\bm{x}\in\mathcal{X}}{\textm{arg min}} \left\{\nabla f\left(\bm{x}^k\right)^\T\left(\bm{x} - \bm{x}^k\right) + \frac{1}{2s^k}\left(\bm{x} - \bm{x}^k\right)^\T \nabla^2 f\left(\bm{x}^k\right)\left(\bm{x} - \bm{x}^k\right)\right\}\\ \bm{x}^{k+1} &= \bm{x}^{k} + \alpha^{k}\left(\bm{\bar{x}}^{k} - \bm{x}^{k}\right). \end{aligned} \] Note that if \(s^k=1\), the quadratic cost above is the second order Taylor series expansion of \(f\) around \(\bm{x}^k\) (apart from a constant term). In particular, if \(\alpha^k=1\) and \(s^k=1\), then \(\bm{x}^{k+1}\) is the vector that minimizes the second order Taylor series expansion around \(\bm{x}^k\), just as in the case of Newton’s method for unconstrained optimization.

The main difficulty with this method is in solving the quadratic subproblem to find \(\bm{\bar{x}}^{k}\). This may not be simple even when the constraint set \(\mathcal{X}\) has a simple structure. Thus the method typically makes practical sense only for problems of small dimension.

Convergence

The convergence of gradient projection methods can be found in (Bertsekas, 1999) and it is summarized in Theorem B.2.

Theorem B.2 (Convergence of gradient projection methods) Suppose \(\{\bm{x}^{k}\}\) is a sequence generated by a gradient projection method (such as the projected gradient descent method or the constrained Newton’s method) and the stepsize \(\alpha^{k}\) is chosen by the exact line search or backtracking line search. Then, every limit point of \(\{\bm{x}^{k}\}\) is a stationary point of the problem.

Other simpler stepsize rules also enjoy theoretical convergence, such as the constant stepsize \(\alpha^{k} = 1\) and \(s^k = s\) for sufficiently small \(s\) (Bertsekas, 1999).

References

Beck, A. (2017). First-order methods in optimization. Society for Industrial and Applied Mathematics (SIAM).
Bertsekas, D. P. (1999). Nonlinear programming. Athena Scientific.