5.4 Local likelihood

We next explore an extension of the local polynomial estimator that mimics the generalization that generalized linear models made of the linear model, allowing the consideration of non-continuous responses. This extension aims to estimate the regression function by relying on the likelihood rather than on least squares. The main idea behind local likelihood is, therefore, to locally fit parametric models by maximum likelihood.

We begin by seeing that local likelihood based on the linear model is equivalent to local polynomial modeling. Theorem B.1 shows that, under the assumptions given in Section B.1.2, the maximum likelihood estimate of β in the linear model

\[
Y|(X_1,\ldots,X_p)\sim\mathcal{N}(\beta_0+\beta_1X_1+\cdots+\beta_pX_p,\sigma^2) \tag{5.15}
\]

is equivalent to the least squares estimate, $\hat{\boldsymbol{\beta}}=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$. The reason for this is the form of the conditional (on $X_1,\ldots,X_p$) log-likelihood:

\[
\ell(\boldsymbol{\beta})=-\frac{n}{2}\log(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^n\left(Y_i-\beta_0-\beta_1X_{i1}-\cdots-\beta_pX_{ip}\right)^2.
\]

If there is a single predictor $X$, the situation on which we focus, a polynomial fit of order $p$ for the conditional mean can be achieved by the well-known trick of identifying the $j$-th predictor $X_j$ in (5.15) with $X^j$.188 This results in

\[
Y|X\sim\mathcal{N}(\beta_0+\beta_1X+\cdots+\beta_pX^p,\sigma^2). \tag{5.16}
\]

Therefore, we can define the weighted log-likelihood of the linear model (5.16) about x as

\[
\ell_{x,h}(\boldsymbol{\beta}):=-\frac{n}{2}\log(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^n\left(Y_i-\beta_0-\beta_1(X_i-x)-\cdots-\beta_p(X_i-x)^p\right)^2K_h(x-X_i). \tag{5.17}
\]

Maximizing the local log-likelihood (5.17) with respect to $\boldsymbol{\beta}$ provides $\hat\beta_0=\hat{m}(x;p,h)$, which is precisely the local polynomial estimator obtained in (4.10), now recovered from a likelihood-based perspective. The key point is to realize that the very same idea can be applied to the family of generalized linear models, for which linear regression (here manifested as polynomial regression) is just a particular case.
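
This equivalence is easy to check numerically: maximizing (5.17) amounts to a weighted least squares fit with kernel weights. The snippet below is a minimal sketch under simulated data; the evaluation point x0, bandwidth h, and object names (fit_wls, Xd, W) are arbitrary choices for illustration only.

# Check numerically that maximizing (5.17) is a weighted least squares fit
# (simulated data; x0 and h are arbitrary choices for illustration)
set.seed(123456)
n <- 100
X <- rnorm(n)
Y <- sin(X) + rnorm(n, sd = 0.5)
x0 <- 0.5
h <- 0.5
K <- dnorm(x = x0, mean = X, sd = h) # Kernel weights K_h(x0 - X_i)

# Local linear fit (p = 1) about x0 via weighted least squares
fit_wls <- lm(Y ~ I(X - x0), weights = K)
fit_wls$coefficients[1] # beta_hat_0 = m_hat(x0; 1, h)

# The same intercept from the closed-form weighted least squares solution
W <- diag(K)
Xd <- cbind(1, X - x0)
drop(solve(t(Xd) %*% W %*% Xd, t(Xd) %*% W %*% Y))[1]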

Figure 5.1: Construction of the local likelihood estimator. The animation shows how local likelihood fits in a neighborhood of x are combined to provide an estimate of the regression function for a binary response, which depends on the polynomial degree, bandwidth, and kernel (gray density at the bottom). The data points are shaded according to their weights for the local fit at x.

We illustrate the local likelihood principle for the logistic regression (see Section B.2 for a quick review). In this case, the sample is $(X_1,Y_1),\ldots,(X_n,Y_n)$ with

\[
Y_i|X_i\sim\mathrm{Ber}(\mathrm{logistic}(\eta(X_i))),\quad i=1,\ldots,n,
\]

with the polynomial term189

\[
\eta(x):=\beta_0+\beta_1x+\cdots+\beta_px^p.
\]

The log-likelihood of β is

\[
\ell(\boldsymbol{\beta})=\sum_{i=1}^n\left\{Y_i\log(\mathrm{logistic}(\eta(X_i)))+(1-Y_i)\log(1-\mathrm{logistic}(\eta(X_i)))\right\}=\sum_{i=1}^n\ell(Y_i,\eta(X_i)),
\]

where we consider the log-likelihood addend $\ell(y,\eta):=y\eta-\log(1+e^{\eta})$. For the sake of clarity in the next developments, we make explicit the dependence of this addend on $\eta(x)$ and leave its dependence on $\boldsymbol{\beta}$ implicit.
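
As a quick sanity check, the two forms of the addend coincide: for y in {0, 1}, y log(logistic(η)) + (1 − y) log(1 − logistic(η)) equals yη − log(1 + e^η). The snippet below verifies this numerically on an arbitrary grid of η values; the helper names logistic and ell are just for this check.

# Verify the two forms of the Bernoulli log-likelihood addend coincide
logistic <- function(eta) 1 / (1 + exp(-eta))
ell <- function(y, eta) y * eta - log(1 + exp(eta))
eta <- seq(-3, 3, by = 0.5) # Arbitrary values of eta for the check
max(abs(ell(1, eta) - log(logistic(eta)))) # ~0 (case y = 1)
max(abs(ell(0, eta) - log(1 - logistic(eta)))) # ~0 (case y = 0)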

The local log-likelihood of β about x is then defined as

\[
\ell_{x,h}(\boldsymbol{\beta}):=\sum_{i=1}^n\ell(Y_i,\eta(X_i-x))K_h(x-X_i). \tag{5.18}
\]

Maximizing190 the local log-likelihood (5.18) with respect to β provides

\[
\hat{\boldsymbol{\beta}}_h=\arg\max_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}\ell_{x,h}(\boldsymbol{\beta}).
\]

The local likelihood estimate of η(x) is the intercept of this fit,

\[
\hat\eta(x):=\hat\beta_{h,0}.
\]

Note that the dependence of $\hat\beta_{h,0}$ on $x$ is omitted. From $\hat\eta(x)$, we can obtain the local logistic regression evaluated at $x$ as191

\[
\hat{m}(x;p,h):=g^{-1}(\hat\eta(x))=\mathrm{logistic}(\hat\beta_{h,0}). \tag{5.19}
\]

Each evaluation of $\hat{m}(x;p,h)$ at a different $x$ thus requires a weighted fit of the underlying local logistic model.

The code below shows three different ways of implementing the local logistic regression (with p = 1) in R.

# Simulate some data
n <- 200
logistic <- function(x) 1 / (1 + exp(-x))
p <- function(x) logistic(1 - 3 * sin(x))
set.seed(123456)
X <- sort(runif(n = n, -3, 3))
Y <- rbinom(n = n, size = 1, prob = p(X))

# Set bandwidth and evaluation grid
h <- 0.25
x <- seq(-3, 3, l = 501)

# Approach 1: optimize the weighted log-likelihood through the workhorse
# function underneath glm, glm.fit
suppressWarnings(
  fit_glm <- sapply(x, function(x) {
    K <- dnorm(x = x, mean = X, sd = h)
    glm.fit(x = cbind(1, X - x), y = Y, weights = K,
            family = binomial())$coefficients[1]
  })
)

# Approach 2: optimize directly the weighted log-likelihood
suppressWarnings(
  fit_nlm <- sapply(x, function(x) {
    K <- dnorm(x = x, mean = X, sd = h)
    nlm(f = function(beta) {
      -sum(K * (Y * (beta[1] + beta[2] * (X - x)) -
                  log(1 + exp(beta[1] + beta[2] * (X - x)))))
      }, p = c(0, 0))$estimate[1]
  })
)

# Approach 3: employ locfit::locfit
# CAREFUL: locfit::locfit uses a different internal parametrization for h!
# As can be seen, the bandwidths in approaches 1-2 and approach 3 do NOT
# give the same results! With h = 0.75 in locfit::locfit the fit is close
# to the previous ones (done for h = 0.25), but not exactly the same...
fit_locfit <- locfit::locfit(Y ~ locfit::lp(X, deg = 1, h = 0.75),
                             family = "binomial", kern = "gauss")

# Compare fits
plot(x, p(x), ylim = c(0, 1.5), type = "l", lwd = 2)
lines(x, logistic(fit_glm), col = 2, lwd = 2)
lines(x, logistic(fit_nlm), col = 3, lty = 2)
lines(x, predict(fit_locfit, newdata = x), col = 4)
legend("topright", legend = c("p(x)", "glm", "nlm", "locfit"),
       lwd = 2, col = c(1, 2, 3, 4), lty = c(1, 1, 2, 1))

Bandwidth selection can be done by means of likelihood cross-validation. The objective is to maximize the local log-likelihood fit at $(X_i,Y_i)$, but removing the influence of the datum itself. That is,

\[
\hat{h}_{\mathrm{LCV}}=\arg\max_{h>0}\mathrm{LCV}(h),\quad \mathrm{LCV}(h)=\sum_{i=1}^n\ell\left(Y_i,\hat\eta_{-i}(X_i)\right), \tag{5.20}
\]

where $\hat\eta_{-i}(X_i)$ represents the local fit at $X_i$ without the $i$-th datum $(X_i,Y_i)$.192 Unfortunately, the nonlinearity of (5.19) precludes a simplifying result like the one in Proposition 4.1. Thus, in principle, obtaining a single evaluation of (5.20) requires fitting $n$ local likelihoods, each with sample size $n-1$.193

We conclude by illustrating how to compute the LCV function and optimize it. Keep in mind that much more efficient implementations are possible!

# Exact LCV - recall that we *maximize* the LCV!
h <- seq(0.1, 2, by = 0.1)
suppressWarnings(
  LCV <- sapply(h, function(h) {
  sum(sapply(1:n, function(i) {
    K <- dnorm(x = X[i], mean = X[-i], sd = h)
    # Leave-one-out local logistic fit about X[i], without the i-th datum
    beta_i <- nlm(f = function(beta) {
      -sum(K * (Y[-i] * (beta[1] + beta[2] * (X[-i] - X[i])) -
                  log(1 + exp(beta[1] + beta[2] * (X[-i] - X[i])))))
      }, p = c(0, 0))$estimate
    # Log-likelihood addend of the left-out datum, l(Y_i, eta_-i(X_i)) in (5.20)
    Y[i] * beta_i[1] - log(1 + exp(beta_i[1]))
    }))
  })
)
plot(h, LCV, type = "o")
abline(v = h[which.max(LCV)], col = 2)
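
Once the likelihood cross-validated bandwidth has been located, the final estimate can be recomputed with it, for example by rerunning Approach 2 above. A minimal sketch follows, reusing the objects X, Y, p, logistic, h, and LCV already defined in this section; the names h_LCV and fit_LCV are just illustrative.

# Refit the local logistic regression with the LCV-selected bandwidth
h_LCV <- h[which.max(LCV)]
x <- seq(-3, 3, l = 501)
suppressWarnings(
  fit_LCV <- sapply(x, function(x) {
    K <- dnorm(x = x, mean = X, sd = h_LCV)
    nlm(f = function(beta) {
      -sum(K * (Y * (beta[1] + beta[2] * (X - x)) -
                  log(1 + exp(beta[1] + beta[2] * (X - x)))))
      }, p = c(0, 0))$estimate[1]
  })
)
plot(x, p(x), type = "l", lwd = 2, ylim = c(0, 1))
lines(x, logistic(fit_LCV), col = 2, lwd = 2)
legend("topright", legend = c("p(x)", "LCV fit"), lwd = 2, col = 1:2)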

Exercise 5.12 (Adapted from Example 4.6 in Wasserman (2006)) The dataset at http://www.stat.cmu.edu/~larry/all-of-nonpar/=data/bpd.dat contains information about the presence of bronchopulmonary dysplasia (binary response) and the birth weight in grams (predictor) of 223 newborns.

  1. Use the three approaches described above to compute the local logistic regression (first degree) and plot their outputs for a bandwidth that is somehow adequate.
  2. Using h^LCV, explore and comment on the resulting estimates, providing insights into the data.
  3. From the obtained fit, derive several simple medical diagnostic rules for the probability of the presence of bronchopulmonary dysplasia from the birth weight.

Exercise 5.13 The challenger.txt dataset contains information regarding the state of the solid rocket boosters after launch for the 23 shuttle flights prior to the Challenger launch. Each row has, among others, the variables fail.field (indicator of whether there was an incident with the O-rings), nfail.field (number of incidents with the O-rings), and temp (temperature on the day of launch, measured in degrees Celsius). See Section 5.1 in García-Portugués (2025) for further context on the Challenger case study.

  1. Fit a local logistic regression (first degree) for fail.field ~ temp, for three choices of bandwidths: one that oversmooths, another that is somehow adequate, and another that undersmooths. Do the effects of temp on fail.field seem to be significant?
  2. Obtain h^LCV and plot the LCV function with a reasonable accuracy.
  3. Using h^LCV, predict the probability of an incident at temperatures -0.6 (the launch temperature of the Challenger) and 11.67 (the vice president of engineers' recommended launch threshold).
  4. What are the local odds at -0.6, 11.67, and 20? Show the local logistic models about these points, in the spirit of Figure 5.1, and interpret the results.

Exercise 5.14 Implement your own version of the local likelihood estimator (first degree) for the Poisson regression model. To do so:

  1. Derive the local log-likelihood about x for the Poisson regression (which is analogous to (5.18)). You can check Section 5.2.2 in García-Portugués (2025) for information on the Poisson regression.
  2. Code from scratch an R function, loc_pois, that maximizes the previous local likelihood for a vector of evaluation points. loc_pois must take as input the samples X and Y, the vector of evaluation points x, the bandwidth h, and the kernel K.
  3. Implement a cv_loc_pois function that obtains the likelihood cross-validated bandwidth for the local Poisson regression.
  4. Validate the correct behavior of loc_pois and cv_loc_pois by simulating a sample of n = 200 observations from Y|X = x ~ Poisson(λ(x)), where λ(x) = e^{sin(x)} and X is distributed according to a nor1mix::MW.nm7.
  5. Compare your results with the locfit::locfit function using family = "poisson" (you should not expect an exact fit due to the different parametrization of h used in the locfit::locfit function, as warned in the previous explanatory code chunk).

References

García-Portugués, E. 2025. Notes for Predictive Modeling. https://bookdown.org/egarpor/PM-UC3M/.
Loader, C. 1999. Local Regression and Likelihood. Statistics and Computing. New York: Springer. https://doi.org/10.2307/1270956.
Wasserman, L. 2006. All of Nonparametric Statistics. Springer Texts in Statistics. New York: Springer. https://doi.org/10.1007/0-387-30623-4.

  188. Observe that, for the sake of easier presentation, we come back to the notation and setting employed in Chapter 4, in which only one predictor X was available for explaining Y, and where p denoted the order of the polynomial fit employed.

  189. If p = 1, then we have the usual simple logistic model. Notice that we are just applying the "polynomial trick" used in linear regression, which consists in expanding the set of predictors with the powers $X^j$, $j=1,\ldots,p$, of $X$ to achieve more flexibility in the parametric fit.

  190. There is no analytical solution for the optimization problem due to its nonlinearity, so a numerical approach is required.

  191. An alternative and useful view is that, by maximizing (5.18), we are fitting the logistic model $\hat{p}_x(t):=\mathrm{logistic}(\hat\beta_{h,0}+\hat\beta_{h,1}(t-x)+\cdots+\hat\beta_{h,p}(t-x)^p)$ that is centered about x. Then, we employ this model to predict Y for X = t = x, resulting in $\mathrm{logistic}(\hat\beta_{h,0})$.

  192. Observe that (5.20) is equivalent to (4.23) if the generalized linear model is a linear model.

  193. The interested reader is referred to Sections 4.3.3 and 4.4.3 in Loader (1999) for an approximation of (5.20) that only requires a local likelihood fit for a single sample.