## 3.4 Bandwidth selection

Bandwidth selection is, as in kernel density estimation, of key practical importance. Several bandwidth selectors, in the spirit of the plug-in and cross-validation ideas discussed in Section 2.4 have been proposed. There are, for example, a rule-of-thumb analogue (see Section 4.2 in Fan and Gijbels (1996)) and a direct plug-in analogue (see Section 5.8 in Wand and Jones (1995)). For simplicity, we will focus only on the cross-validation bandwidth selector.

Following an analogy with the fit of the linear model, we could look for the bandwidth $$h$$ such that it minimizes the RSS of the form

\begin{align} \frac{1}{n}\sum_{i=1}^n(Y_i-\hat m(X_i;p,h))^2.\tag{3.24} \end{align}

However, this is a bad idea. Attempting to minimize (3.24) always leads to $$h\approx 0$$ that results in a useless interpolation of the data. Let’s see an example.

# Grid for representing (3.22)
hGrid <- seq(0.1, 1, l = 200)^2
error <- sapply(hGrid, function(h)
mean((Y - mNW(x = X, X = X, Y = Y, h = h))^2))

# Error curve
plot(hGrid, error, type = "l")
rug(hGrid)
abline(v = hGrid[which.min(error)], col = 2) The root of the problem is the comparison of $$Y_i$$ with $$\hat m(X_i;p,h)$$, since there is nothing that forbids $$h\to0$$ and as a consequence $$\hat m(X_i;p,h)\to Y_i$$. We can change this behavior if we compare $$Y_i$$ with $$\hat m_{-i}(X_i;p,h)$$, the leave-one-out estimate computed without the $$i$$-th datum $$(X_i,Y_i)$$:

\begin{align*} \mathrm{CV}(h)&:=\frac{1}{n}\sum_{i=1}^n(Y_i-\hat m_{-i}(X_i;p,h))^2,\\ h_\mathrm{CV}&:=\arg\min_{h>0}\mathrm{CV}(h). \end{align*}

The optimization of the above criterion might seem to be computationally expensive, since it is required to compute $$n$$ regressions for a single evaluation of the objective function.

Proposition 3.1 The weights of the leave-one-out estimator $$\hat m_{-i}(x;p,h)=\sum_{\substack{j=1\\j\neq i}}^nW_{-i,j}^p(x)Y_j$$ can be obtained from $$\hat m(x;p,h)=\sum_{i=1}^nW_{i}^p(x)Y_i$$:

\begin{align*} W_{-i,j}^p(x)=\frac{W^p_j(x)}{\sum_{\substack{k=1\\k\neq i}}^nW_k^p(x)}=\frac{W^p_j(x)}{1-W_i^p(x)}. \end{align*}

This implies that

\begin{align*} \mathrm{CV}(h)=\frac{1}{n}\sum_{i=1}^n\left(\frac{Y_i-\hat m(X_i;p,h)}{1-W_i^p(X_i)}\right)^2. \end{align*}

Let’s implement this simple bandwidth selector in R.

# Generate some data to test the implementation
set.seed(12345)
n <- 100
eps <- rnorm(n, sd = 2)
m <- function(x) x^2 + sin(x)
X <- rnorm(n, sd = 1.5)
Y <- m(X) + eps
xGrid <- seq(-10, 10, l = 500)

# Objective function
cvNW <- function(X, Y, h, K = dnorm) {
sum(((Y - mNW(x = X, X = X, Y = Y, h = h, K = K)) /
(1 - K(0) / colSums(K(outer(X, X, "-") / h))))^2)
}

# Find optimum CV bandwidth, with sensible grid
bw.cv.grid <- function(X, Y,
h.grid = diff(range(X)) * (seq(0.1, 1, l = 200))^2,
K = dnorm, plot.cv = FALSE) {
obj <- sapply(h.grid, function(h) cvNW(X = X, Y = Y, h = h, K = K))
h <- h.grid[which.min(obj)]
if (plot.cv) {
plot(h.grid, obj, type = "o")
rug(h.grid)
abline(v = h, col = 2, lwd = 2)
}
h
}

# Bandwidth
h <- bw.cv.grid(X = X, Y = Y, plot.cv = TRUE) h
##  0.3117806

# Plot result
plot(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)
lines(xGrid, mNW(x = xGrid, X = X, Y = Y, h = h), col = 2)
legend("topright", legend = c("True regression", "Nadaraya-Watson"),
lwd = 2, col = 1:2) ### References

Fan, J., and I. Gijbels. 1996. Local Polynomial Modelling and Its Applications. Vol. 66. Monographs on Statistics and Applied Probability. London: Chapman & Hall.

Wand, M. P., and M. C. Jones. 1995. Kernel Smoothing. Vol. 60. Monographs on Statistics and Applied Probability. London: Chapman & Hall, Ltd. https://doi.org/10.1007/978-1-4899-4493-1.