3.2 The Newton Direction

Given a current best estimate $x_n$, we can approximate $f$ with a quadratic polynomial. For some small $p$,

$$
f(x_n + p) \approx f(x_n) + p^\prime f^\prime(x_n) + \frac{1}{2} p^\prime f^{\prime\prime}(x_n)\, p.
$$

If we minimize the right-hand side with respect to $p$, we obtain $p_n = f^{\prime\prime}(x_n)^{-1}[-f^\prime(x_n)]$, which we can think of as the steepest descent direction “twisted” by the inverse of the Hessian matrix $f^{\prime\prime}(x_n)^{-1}$. Newton’s method has a “natural” step length of 1, so that the updating procedure is

$$
x_{n+1} = x_n - f^{\prime\prime}(x_n)^{-1} f^\prime(x_n).
$$

Newton’s method makes a quadratic approximation to the target function f at each step of the algorithm. This follows the “optimization transfer” principle mentioned earlier, whereby we take a complex function f, replace it with a simpler function g that is easier to optimize, and then optimize the simpler function repeatedly until convergence to the solution.
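
To make the update concrete, here is a small illustrative sketch of the one-dimensional Newton iteration. The helper newton1d() and its arguments are ours, written purely for illustration; the first and second derivatives are supplied by hand.

newton1d <- function(x0, fp, fpp, tol = 1e-8, maxit = 100) {
        ## Newton update x <- x - f'(x) / f''(x), repeated until the step is tiny
        x <- x0
        for(i in seq_len(maxit)) {
                step <- fp(x) / fpp(x)
                x <- x - step
                if(abs(step) < tol)
                        break
        }
        x
}

## Minimize f(x) = -dnorm(x), for which f'(x) = x * dnorm(x) and
## f''(x) = (1 - x^2) * dnorm(x). Starting near the minimum (where f'' > 0),
## the iteration converges quickly to 0.
newton1d(0.5, fp = function(x) x * dnorm(x),
         fpp = function(x) (1 - x^2) * dnorm(x))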

In one dimension, we can easily visualize how Newton’s method makes its quadratic approximation to the target function.

curve(-dnorm(x), -2, 3, lwd = 2, ylim = c(-0.55, .1))  ## target function f(x) = -dnorm(x)
xn <- -1.2                                             ## current iterate x_n
abline(v = xn, lty = 2)
axis(3, xn, expression(x[n]))
g <- function(x) {
        ## quadratic approximation to f at the current iterate xn, used in the figure
        -dnorm(xn) + (x-xn) * xn * dnorm(xn) - 0.5 * (x-xn)^2 * (dnorm(xn) - xn * (xn * dnorm(xn)))
}
curve(g, -2, 3, add = TRUE, col = 4)     ## overlay the approximation
op <- optimize(g, c(0, 3))               ## minimize the approximation
abline(v = op$minimum, lty = 2)
axis(3, op$minimum, expression(x[n+1]))  ## mark the next iterate

In the above figure, the next iterate, $x_{n+1}$, is actually further away from the minimum than our previous iterate $x_n$. The quadratic approximation that Newton’s method makes to $f$ is not guaranteed to be good at every point of the function.

This shows an important “feature” of Newton’s method, which is that it is not monotone. The successive iterates that Newton’s method produces are not guaranteed to be improvements, in the sense of each iterate being closer to the truth. The tradeoff here is that while Newton’s method is very fast (quadratic convergence), it can be unstable at times. Monotone algorithms, like the EM algorithm that we discuss later, always produce improvements and are more stable, but they generally converge at slower rates.

In the next figure, however, we can see that the solution provided by the next approximation, $x_{n+2}$, is indeed quite close to the true minimum.

## Redraw the target function and recover the previous iterate x_{n+1}
curve(-dnorm(x), -2, 3, lwd = 2, ylim = c(-0.55, .1))
xn <- -1.2
op <- optimize(g, c(0, 3))
abline(v = op$minimum, lty = 2)
axis(3, op$minimum, expression(x[n+1]))

## Build the quadratic approximation at x_{n+1} and locate x_{n+2}
xn <- op$minimum
curve(g, -2, 3, add = TRUE, col = 4)
op <- optimize(g, c(0, 3))
abline(v = op$minimum, lty = 2)
axis(3, op$minimum, expression(x[n+2]))

It is worth noting that in the rare event that f is in fact a quadratic polynomial, Newton’s method will converge in a single step because the quadratic approximation that it makes to f will be exact.
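For instance, applying the update $x_{n+1} = x_n - f^\prime(x_n)/f^{\prime\prime}(x_n)$ to a small made-up quadratic lands exactly on the minimizer from any starting point:

## For f(x) = (x - 3)^2, f'(x) = 2 * (x - 3) and f''(x) = 2, so a single
## Newton step from any starting value x0 returns the exact minimizer x = 3.
x0 <- -50
x0 - 2 * (x0 - 3) / 2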

3.2.1 Generalized Linear Models

The generalized linear model is an extension of the standard linear model to allow for non-Normal response distributions. The distributions used typically come from an exponential family whose density functions share some common characteristics. With a GLM, we typically present it as $y_i \sim p(y_i\mid\mu_i)$, where $p$ is an exponential family distribution, $\mathbb{E}[y_i] = \mu_i$, $g(\mu_i) = x_i^\prime\beta$, where $g$ is a nonlinear link function, and $\text{Var}(y_i) = V(\mu_i)$, where $V$ is a known variance function.

Unlike the standard linear model, the maximum likelihood estimate of the parameter vector $\beta$ cannot be obtained in closed form, so an iterative algorithm must be used to obtain the estimate. The traditional algorithm used is the Fisher scoring algorithm. This algorithm uses a linear approximation to the nonlinear link function $g$, which can be written as

$$
g(y_i) \approx g(\mu_i) + (y_i - \mu_i)\, g^\prime(\mu_i).
$$

The typical notation of GLMs refers to $z_i = g(\mu_i) + (y_i - \mu_i)\, g^\prime(\mu_i)$ as the working response. The Fisher scoring algorithm then works as follows.

  1. Start with $\hat{\mu}_i$, some initial value.

  2. Compute $z_i = g(\hat{\mu}_i) + (y_i - \hat{\mu}_i)\, g^\prime(\hat{\mu}_i)$.

  3. Given the $n\times 1$ vector of working responses $z$ and the $n\times p$ predictor matrix $X$, we compute a weighted regression of $z$ on $X$ to get $\beta_n = (X^\prime W X)^{-1}X^\prime W z$, where $W$ is a diagonal matrix with diagonal elements $w_{ii} = \left[g^\prime(\mu_i)^2\, V(\mu_i)\right]^{-1}$.

  4. Given $\beta_n$, we can recompute $\hat{\mu}_i = g^{-1}(x_i^\prime\beta_n)$ and go to 2.

Note that in Step 3 above, the weights are simply the inverses of the variances of the $z_i$, i.e.

$$
\text{Var}(z_i) = \text{Var}\left(g(\mu_i) + (y_i - \mu_i)\, g^\prime(\mu_i)\right) = \text{Var}\left((y_i - \mu_i)\, g^\prime(\mu_i)\right) = V(\mu_i)\, g^\prime(\mu_i)^2.
$$

Naturally, when doing a weighted regression, we would weight by the inverse of the variances.
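
A small sketch of this iteration in R, assuming the user supplies the link $g$, its derivative, its inverse, and the variance function $V$; the function fisher_scoring() below is purely illustrative and not part of any package.

fisher_scoring <- function(X, y, g, gprime, ginv, V, mu0, maxit = 25, tol = 1e-8) {
        mu <- mu0
        beta <- rep(0, ncol(X))
        for(i in seq_len(maxit)) {
                z <- g(mu) + (y - mu) * gprime(mu)         ## working response
                w <- 1 / (gprime(mu)^2 * V(mu))            ## working weights 1 / Var(z_i)
                beta.new <- lm.wfit(X, z, w)$coefficients  ## weighted regression of z on X
                mu <- ginv(drop(X %*% beta.new))           ## recompute fitted means
                if(sum((beta.new - beta)^2) < tol)
                        break
                beta <- beta.new
        }
        beta.new
}

For a Poisson model with the log link (worked out below), the inputs would be g = log, gprime = function(mu) 1 / mu, ginv = exp, and V = identity.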

3.2.1.1 Example: Poisson Regression

For a Poisson regression, we have $y_i \sim \text{Poisson}(\mu_i)$ where $g(\mu_i) = \log\mu_i = x_i^\prime\beta$ because the log is the canonical link function for the Poisson distribution. We also have $g^\prime(\mu_i) = \frac{1}{\mu_i}$ and $V(\mu_i) = \mu_i$. Therefore, the Fisher scoring algorithm is

  1. Initialize $\hat{\mu}_i$, perhaps using $y_i + 1$ (to avoid zeros).

  2. Let $z_i = \log\hat{\mu}_i + (y_i - \hat{\mu}_i)\frac{1}{\hat{\mu}_i}$.

  3. Regress $z$ on $X$ using the weights $w_{ii} = \left[\frac{1}{\hat{\mu}_i^2}\,\hat{\mu}_i\right]^{-1} = \hat{\mu}_i$.
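
These three steps are easy to carry out directly in R. The sketch below uses simulated data (our own, purely for illustration) and compares the result with glm(), which fits the same model by Fisher scoring.

## Simulated Poisson data (illustrative only)
set.seed(2)
x <- runif(200)
X <- cbind(1, x)
y <- rpois(200, lambda = exp(1 + 2 * x))

mu <- y + 1                                  ## Step 1: initialize (avoids zeros)
for(i in 1:25) {
        z <- log(mu) + (y - mu) / mu         ## Step 2: working response
        W <- diag(mu)                        ## weights w_ii = mu_i
        beta <- solve(t(X) %*% W %*% X, t(X) %*% W %*% z)  ## Step 3: weighted regression
        mu <- exp(drop(X %*% beta))          ## recompute fitted means
}
cbind(fisher = drop(beta), glm = coef(glm(y ~ x, family = poisson)))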

Using the Poisson regression example, we can draw a connection between the usual Fisher scoring algorithm for fitting GLMs and Newton’s method. Recall that if $\ell(\beta)$ is the log-likelihood as a function of the regression parameters $\beta$, then the Newton updating scheme is

$$
\beta_{n+1} = \beta_n + \ell^{\prime\prime}(\beta_n)^{-1}\left[-\ell^\prime(\beta_n)\right].
$$

The log-likelihood for a Poisson regression model can be written in vector/matrix form as

$$
\ell(\beta) = y^\prime X\beta - \exp(X\beta)^\prime \mathbf{1}
$$

where the exponential is taken component-wise on the vector $X\beta$ and $\mathbf{1}$ is an $n\times 1$ vector of ones. The gradient function is

$$
\ell^\prime(\beta) = X^\prime y - X^\prime\exp(X\beta) = X^\prime(y - \mu)
$$

and the Hessian is

$$
\ell^{\prime\prime}(\beta) = -X^\prime W X
$$

where $W$ is a diagonal matrix with the values $w_{ii} = \exp(x_i^\prime\beta)$ on the diagonal. Because $z = X\beta_n + W^{-1}(y - \mu)$, the Newton iteration is then

$$
\begin{aligned}
\beta_{n+1} &= \beta_n + (X^\prime W X)^{-1}\left(X^\prime(y - \mu)\right)\\
&= \beta_n + (X^\prime W X)^{-1} X^\prime W(z - X\beta_n)\\
&= (X^\prime W X)^{-1} X^\prime W z + \beta_n - (X^\prime W X)^{-1} X^\prime W X\beta_n\\
&= (X^\prime W X)^{-1} X^\prime W z
\end{aligned}
$$

Therefore the iteration is exactly the same as the Fisher scoring algorithm in this case. In general, Newton’s method and Fisher scoring will coincide for any generalized linear model based on an exponential family with a canonical link function.
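
Continuing with the simulated X and y from the sketch above, we can check this algebra numerically at an arbitrary (non-converged) value of $\beta$; the two updates agree up to numerical rounding.

## Newton update beta + (X'WX)^{-1} X'(y - mu) versus the Fisher scoring
## update (X'WX)^{-1} X'Wz, evaluated at the same arbitrary beta0.
beta0 <- c(0.5, 0.5)
eta <- drop(X %*% beta0)
mu <- exp(eta)
W <- diag(mu)
z <- eta + (y - mu) / mu
newton <- beta0 + solve(t(X) %*% W %*% X, t(X) %*% (y - mu))
fisher <- solve(t(X) %*% W %*% X, t(X) %*% W %*% z)
cbind(newton = drop(newton), fisher = drop(fisher))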

3.2.2 Newton’s Method in R

The nlm() function in R implements Newton’s method for minimizing a function given a vector of starting values. By default, one does not need to supply the gradient or Hessian functions; they will be estimated numerically by the algorithm. However, to improve the accuracy of the algorithm, both the gradient and the Hessian can be supplied as attributes of the target function.
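
In its simplest form, nlm() needs only the target function and a starting vector. A quick illustration with a small made-up quadratic objective, purely for illustration:

## With no gradient or Hessian supplied, nlm() estimates them numerically.
## The minimizer of this quadratic is (1, -2).
fq <- function(b) (b[1] - 1)^2 + (b[2] + 2)^2
nlm(fq, c(0, 0))$estimate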

As an example, we will use the nlm() function to fit a simple logistic regression model for binary data. This model specifies that $y_i\sim\text{Bernoulli}(p_i)$ where

$$
\log\frac{p_i}{1-p_i} = \beta_0 + x_i\beta_1
$$

and the goal is to estimate $\beta$ via maximum likelihood. Given the assumed Bernoulli distribution, we can write the log-likelihood as

$$
\begin{aligned}
\log L(\beta) &= \log\left\{\prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i}\right\}\\
&= \sum_{i=1}^n y_i\log p_i + (1-y_i)\log(1-p_i)\\
&= \sum_{i=1}^n y_i\log\frac{p_i}{1-p_i} + \log(1-p_i)\\
&= \sum_{i=1}^n y_i(\beta_0 + x_i\beta_1) + \log\left(\frac{1}{1+e^{\beta_0+x_i\beta_1}}\right)\\
&= \sum_{i=1}^n y_i(\beta_0 + x_i\beta_1) - \log\left(1+e^{\beta_0+x_i\beta_1}\right)
\end{aligned}
$$

If we take the very last line of the above derivation and take a single element inside the sum, we have

$$
\ell_i(\beta) = y_i(\beta_0 + x_i\beta_1) - \log\left(1+e^{\beta_0+x_i\beta_1}\right)
$$

We will need the gradient and Hessian of this with respect to $\beta$. Because the sum and the derivative are exchangeable, we can then sum each of the individual gradients and Hessians to get the full gradient and Hessian for the entire sample, so that $\ell^\prime(\beta) = \sum_{i=1}^n \ell_i^\prime(\beta)$ and $\ell^{\prime\prime}(\beta) = \sum_{i=1}^n \ell_i^{\prime\prime}(\beta)$.

Now, taking the gradient and Hessian of the above expression may be mildly inconvenient, but it is far from impossible. Nevertheless, R provides an automated way to do symbolic differentiation so that manual work can be avoided. The deriv() function computes the gradient and Hessian of an expression symbolically so that they can be used in minimization routines. It cannot compute gradients of arbitrary expressions, but it does support a wide range of common statistical functions.
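
A minimal sketch of how this can be put together, using simulated data and wrappers of our own (the names nll_one and nll are purely illustrative): deriv() produces the per-observation negative log-likelihood with gradient and Hessian attributes, which we sum over observations and pass to nlm().

## Symbolic gradient and Hessian of the per-observation negative log-likelihood
nll_one <- deriv(~ -(y * (b0 + x * b1) - log(1 + exp(b0 + x * b1))),
                 namevec = c("b0", "b1"),
                 function.arg = c("b0", "b1", "x", "y"),
                 hessian = TRUE)

## Simulated binary data (illustrative only)
set.seed(3)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(0.5 + 1.2 * x))

## Total negative log-likelihood with summed gradient and Hessian attributes
nll <- function(b) {
        v <- nll_one(b[1], b[2], x, y)
        res <- sum(v)
        attr(res, "gradient") <- colSums(attr(v, "gradient"))
        attr(res, "hessian") <- apply(attr(v, "hessian"), c(2, 3), sum)
        res
}
nlm(nll, c(0, 0))$estimate   ## compare with coef(glm(y ~ x, family = binomial))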