3.2 Kernel regression estimation
3.2.1 Nadaraya–Watson estimator
Our objective is to estimate the regression function \(m\) nonparametrically. Due to its definition as a conditional expectation, we can rewrite it as follows:
\[
m(x) = \mathbb{E}[Y \mid X = x] = \int y f_{Y \mid X = x}(y) \,\mathrm{d}y = \frac{\int y f(x, y) \,\mathrm{d}y}{f_X(x)}.
\]
This expression shows an interesting point: the regression function can be computed from the joint density \(f\) and the marginal \(f_X\). Therefore, given a sample \((X_1, Y_1), \ldots, (X_n, Y_n)\), an estimate of \(m\) follows by replacing the previous densities by their kdes. To that aim, recall that in Exercise 2.15 we defined a multivariate extension of the kde based on product kernels. For the two-dimensional case, the kde with equal bandwidths \(h\) is
\[
\hat f(x, y; h) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i) K_h(y - Y_i). \tag{3.11}
\]
Using (3.11),
\[
\hat m(x; 0, h) := \frac{\int y \hat f(x, y; h) \,\mathrm{d}y}{\int \hat f(x, y; h) \,\mathrm{d}y} = \frac{\frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i) \int y\, K_h(y - Y_i) \,\mathrm{d}y}{\frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i)} = \frac{\sum_{i=1}^{n} K_h(x - X_i)\, Y_i}{\sum_{i=1}^{n} K_h(x - X_i)}.
\]
This yields the so-called Nadaraya–Watson estimate:
\[
\hat m(x; 0, h) = \sum_{i=1}^{n} \frac{K_h(x - X_i)}{\sum_{j=1}^{n} K_h(x - X_j)} Y_i = \sum_{i=1}^{n} W_i^0(x)\, Y_i,
\]
where \(W_i^0(x) := \frac{K_h(x - X_i)}{\sum_{j=1}^{n} K_h(x - X_j)}\). This estimate can be seen as a weighted average of \(Y_1, \ldots, Y_n\) by means of the set of weights \(\{W_i^0(x)\}_{i=1}^{n}\) (check that they add to one). The set of weights depends on the evaluation point \(x\). That means that the Nadaraya–Watson estimator is a local mean of \(Y_1, \ldots, Y_n\) around \(X = x\) (see Figure 3.6).
Let’s implement the Nadaraya–Watson estimate to get a feeling of how it works in practice.
# Nadaraya-Watson
mNW <- function(x, X, Y, h, K = dnorm) {

  # Arguments
  # x: evaluation points
  # X: vector (size n) with the predictors
  # Y: vector (size n) with the response variable
  # h: bandwidth
  # K: kernel

  # Matrix of size length(x) x n with the kernel evaluations
  Kx <- sapply(X, function(Xi) K((x - Xi) / h) / h)

  # Weights
  W <- Kx / rowSums(Kx) # Column recycling!

  # Means at x ("drop" to drop the matrix attributes)
  drop(W %*% Y)

}
# Generate some data to test the implementation
set.seed(12345)
n <- 100
eps <- rnorm(n, sd = 2)
m <- function(x) x^2 * cos(x)
X <- rnorm(n, sd = 2)
Y <- m(X) + eps
xGrid <- seq(-10, 10, l = 500)

# Bandwidth
h <- 0.5
# Plot data
plot(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)
lines(xGrid, mNW(x = xGrid, X = X, Y = Y, h = h), col = 2)
legend("topright", legend = c("True regression", "Nadaraya-Watson"),
lwd = 2, col = 1:2)
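As a quick sanity check of the claim that the Nadaraya–Watson weights add to one, we can recompute them outside mNW for the data above. This is a minimal sketch that assumes X, xGrid, and h from the previous chunk are still in the workspace.

# Sanity check: the Nadaraya-Watson weights at each evaluation point sum to one
Kx <- sapply(X, function(Xi) dnorm((xGrid - Xi) / h) / h)
W <- Kx / rowSums(Kx) # Column recycling, as in mNW
range(rowSums(W)) # Both values should be (numerically) equal to 1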
Exercise 3.4 Implement your own version of the Nadaraya–Watson estimator in R and compare it with mNW. You may focus only on the normal kernel, and you may reduce the accuracy of the final computation to 1e-7 to achieve better efficiency. Are you able to improve the speed of mNW? Use system.time or the microbenchmark package to measure the running times for a sample size of \(n\).
Winner-takes-all bonus: the significantly fastest version of the Nadaraya–Watson estimator (under the above conditions) will get a bonus of 0.5 points.
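As a starting point for the timing comparisons, here is a minimal sketch using system.time and microbenchmark. It assumes mNW, m, and xGrid from the previous chunks are in the workspace; the sample size n_big and the object names ending in _big are only illustrative, not the ones required by the exercise.

# Timing skeleton for Exercise 3.4 (illustrative sample size and names)
set.seed(12345)
n_big <- 1e4
X_big <- rnorm(n_big, sd = 2)
Y_big <- m(X_big) + rnorm(n_big, sd = 2)

# Coarse measurement
system.time(mNW(x = xGrid, X = X_big, Y = Y_big, h = 0.5))

# More precise measurement (requires the microbenchmark package)
microbenchmark::microbenchmark(mNW(x = xGrid, X = X_big, Y = Y_big, h = 0.5),
                               times = 10)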
The code below illustrates the effect of varying \(h\) for the Nadaraya–Watson estimator using manipulate.
# Simple plot of N-W for varying h's
manipulate({
# Plot data
plot(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)
lines(xGrid, mNW(x = xGrid, X = X, Y = Y, h = h), col = 2)
legend("topright", legend = c("True regression", "Nadaraya-Watson"),
lwd = 2, col = 1:2)
}, h = slider(min = 0.01, max = 2, initial = 0.5, step = 0.01))
3.2.2 Local polynomial regression
Nadaraya–Watson can be seen as a particular case of a local polynomial fit, specifically, the one corresponding to a local constant fit. The motivation for the local polynomial fit comes from attempting to minimize the RSS
\[
\sum_{i=1}^{n} \left( Y_i - m(X_i) \right)^2. \tag{3.13}
\]
This is not achievable directly, since no knowledge of \(m\) is available. However, by a \(p\)-th order Taylor expansion it is possible to obtain that, for \(x\) close to \(X_i\),
\[
m(X_i) \approx \sum_{j=0}^{p} \frac{m^{(j)}(x)}{j!} (X_i - x)^j. \tag{3.14}
\]
Replacing (3.14) in (3.13), we have that
\[
\sum_{i=1}^{n} \left( Y_i - \sum_{j=0}^{p} \frac{m^{(j)}(x)}{j!} (X_i - x)^j \right)^2. \tag{3.15}
\]
This expression is still not workable: it depends on \(m^{(j)}(x)\), \(j = 0, \ldots, p\), which of course are unknown! The great idea is to set \(\beta_j := \frac{m^{(j)}(x)}{j!}\) and turn (3.15) into a linear regression problem where the unknown parameters are \(\boldsymbol{\beta} = (\beta_0, \ldots, \beta_p)'\):
\[
\sum_{i=1}^{n} \left( Y_i - \sum_{j=0}^{p} \beta_j (X_i - x)^j \right)^2. \tag{3.16}
\]
While doing so, an estimate of \(\boldsymbol{\beta}\) automatically will yield estimates for \(m^{(j)}(x)\), \(j = 0, \ldots, p\), and we know how to obtain \(\hat{\boldsymbol{\beta}}\) by minimizing (3.16). The final touch is to make the contributions of \(X_i\) dependent on the distance to \(x\) by weighting with kernels:
\[
\hat{\boldsymbol{\beta}}_h := \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} \left( Y_i - \sum_{j=0}^{p} \beta_j (X_i - x)^j \right)^2 K_h(x - X_i). \tag{3.17}
\]
Denoting
\[
\mathbf{X} := \begin{pmatrix}
1 & X_1 - x & \cdots & (X_1 - x)^p \\
\vdots & \vdots & \ddots & \vdots \\
1 & X_n - x & \cdots & (X_n - x)^p
\end{pmatrix}_{n \times (p + 1)}
\]
and
\[
\mathbf{W} := \operatorname{diag}(K_h(X_1 - x), \ldots, K_h(X_n - x)), \qquad \mathbf{Y} := (Y_1, \ldots, Y_n)',
\]
we can re-express (3.17) into a weighted least squares problem whose exact solution is
\[
\hat{\boldsymbol{\beta}}_h = (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1} \mathbf{X}'\mathbf{W}\mathbf{Y}.
\]
The estimate for \(m(x)\) is then computed as
\[
\hat m(x; p, h) := \hat\beta_{h,0} = \mathbf{e}_1' (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1} \mathbf{X}'\mathbf{W}\mathbf{Y} = \sum_{i=1}^{n} W_i^p(x)\, Y_i,
\]
where \(W_i^p(x) := \mathbf{e}_1' (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1} \mathbf{X}'\mathbf{W} \mathbf{e}_i\) and \(\mathbf{e}_i\) is the \(i\)-th canonical vector. Just as the Nadaraya–Watson estimator, the local polynomial estimator is a linear combination of the responses. Two cases deserve special attention:
\(p = 0\) is the local constant estimator or the Nadaraya–Watson estimator (Exercise 3.6). In this situation, the estimator has explicit weights, as we saw before:
\[
W_i^0(x) = \frac{K_h(x - X_i)}{\sum_{j=1}^{n} K_h(x - X_j)}.
\]
\(p = 1\) is the local linear estimator, which has weights equal to (Exercise 3.9):
\[
W_i^1(x) = \frac{1}{n} \frac{\hat s_2(x; h) - \hat s_1(x; h) (X_i - x)}{\hat s_2(x; h) \hat s_0(x; h) - \hat s_1(x; h)^2} K_h(x - X_i),
\]
where \(\hat s_r(x; h) := \frac{1}{n} \sum_{i=1}^{n} (X_i - x)^r K_h(x - X_i)\).
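To connect (3.17) and its closed-form solution with code, the sketch below computes the local polynomial estimator directly from the weighted least squares expression. The function name locPolyWLS and its arguments are made up for illustration; this is not KernSmooth's implementation and it is written for clarity, not efficiency.

# Local polynomial estimator via the weighted least squares solution
# (illustrative sketch; locPolyWLS is not part of any package)
locPolyWLS <- function(x, X, Y, h, p = 1, K = dnorm) {
  sapply(x, function(x0) {
    Xmat <- outer(X - x0, 0:p, "^") # n x (p + 1) design matrix at x0
    W <- diag(K((X - x0) / h) / h) # kernel weights K_h(X_i - x0)
    # Solve (X'WX) beta = X'WY; may fail at x0 with no data nearby
    beta <- solve(crossprod(Xmat, W %*% Xmat), crossprod(Xmat, W %*% Y))
    beta[1] # the intercept estimates m(x0)
  })
}

The returned intercept is precisely \(\hat\beta_{h,0} = \hat m(x_0; p, h)\), so, for instance, locPolyWLS(xGrid, X, Y, h = 0.5, p = 1) can be overlaid on the previous plots and compared with the locpoly output shown below.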
Figure 3.6 illustrates the construction of the local polynomial estimator (up to cubic degree) and shows how \(\hat\beta_{h,0}\), the intercept of the local fit, estimates \(m\) at \(x\).
Figure 3.6: Construction of the local polynomial estimator. The animation shows how local polynomial fits in a neighborhood of \(x\) are combined to provide an estimate of the regression function, which depends on the polynomial degree, bandwidth, and kernel (gray density at the bottom). The data points are shaded according to their weights for the local fit at \(x\). Application also available here.
KernSmooth's locpoly implements the local polynomial estimator. Below are some examples of its usage.
# Generate some data to test the implementation
set.seed(123456)
n <- 100
eps <- rnorm(n, sd = 2)
m <- function(x) x^3 * sin(x)
X <- rnorm(n, sd = 1.5)
Y <- m(X) + eps
xGrid <- seq(-10, 10, l = 500)

# Bandwidth
h <- 0.25

# Local constant and local linear fits with KernSmooth::locpoly
library(KernSmooth)
lp0 <- locpoly(x = X, y = Y, bandwidth = h, degree = 0, range.x = c(-10, 10),
               gridsize = 500)
lp1 <- locpoly(x = X, y = Y, bandwidth = h, degree = 1, range.x = c(-10, 10),
               gridsize = 500)
# Plot data
plot(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)
lines(lp0$x, lp0$y, col = 2)
lines(lp1$x, lp1$y, col = 3)
legend("bottom", legend = c("True regression", "Local constant",
"Local linear"),
lwd = 2, col = 1:3)
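Note that locpoly returns the estimated curve only at gridsize equally spaced points within range.x, with no argument to evaluate it elsewhere. If the fit is needed at other points, such as the sample X, a common workaround is to interpolate the returned grid; a quick sketch (the object name Y_hat is arbitrary):

# Evaluate the local linear fit at the sample points by linear interpolation
Y_hat <- approx(x = lp1$x, y = lp1$y, xout = X)$y
head(cbind(X, Y, Y_hat))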
# Simple plot of local polynomials for varying h's
manipulate({
  # Local polynomial fit
  lpp <- locpoly(x = X, y = Y, bandwidth = h, degree = p, range.x = c(-10, 10),
                 gridsize = 500)

  # Plot data
plot(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)
lines(lpp$x, lpp$y, col = p + 2)
legend("bottom", legend = c("True regression", "Local polynomial fit"),
lwd = 2, col = c(1, p + 2))
}, h = slider(min = 0.01, max = 2, initial = 0.5, step = 0.01),
   p = slider(min = 0, max = 4, initial = 0, step = 1))
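Since the degree = 0 fit is the Nadaraya–Watson estimator, we can also cross-check locpoly against our mNW implementation on this dataset. The two curves should be close wherever there is data, although exact agreement is not expected because locpoly works with binned data. This sketch assumes lp0, X, Y, and h from the previous chunks are in the workspace.

# Cross-check: locpoly with degree = 0 vs. the exact Nadaraya-Watson fit
plot(lp0$x, lp0$y, type = "l", col = 2, xlab = "x", ylab = "Estimate of m(x)")
lines(lp0$x, mNW(x = lp0$x, X = X, Y = Y, h = h), col = 4, lty = 2)
legend("bottom", legend = c("locpoly, degree = 0", "mNW"),
       lwd = 2, col = c(2, 4), lty = c(1, 2))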