2.5 Confidence intervals

Obtaining a Confidence Interval (CI) for \(f(x)\) is a challenging task. Due to the bias results seen in Section 2.3, we know that the kde is biased for finite sample sizes, \(\mathbb{E}[\hat f(x;h)]=(K_h*f)(x)\), and it is only asymptotically unbiased when \(h\to0\). This bias is called the smoothing bias and in essence complicates the obtention of CIs for \(f(x)\), but not of \((K_h*f)(x)\). Let’s see with an illustrative example the differences between these two objects.

Two well-known facts for normal densities (see Appendix C in Wand and Jones (1995)) are:

\[\begin{align} (\phi_{\sigma_1}(\cdot-\mu_1)*\phi_{\sigma_2}(\cdot-\mu_2))(x)&=\phi_{(\sigma_1^2+\sigma_2^2)^{1/2}}(x-\mu_1-\mu_2),\tag{2.23}\\ \int\phi_{\sigma_1}(x-\mu_1)\phi_{\sigma_2}(x-\mu_2)\,\mathrm{d}x&=\phi_{(\sigma_1^2+\sigma_2^2)^{1/2}}(\mu_1-\mu_2),\\ \phi_{\sigma}(x-\mu)^r&=\frac{1}{\sigma^{r-1}}(2\pi)^{(1-r)/2}\phi_{\sigma/ r^{1/2}}(x-\mu)\frac{1}{r^{1/2}}.\tag{2.24} \end{align}\]

As a consequence, if \(K=\phi\) (i.e., \(K_h=\phi_h\)) and \(f(\cdot)=\phi_\sigma(\cdot-\mu)\), we have that

\[\begin{align} (K_h*f)(x)&=\phi_{(h^2+\sigma^2)^{1/2}}(x-\mu),\tag{2.25}\\ (K_h^2*f)(x)&=\left(\frac{1}{(2\pi)^{1/2}h}\phi_{h/2^{1/2}}/2^{1/2}*f\right)(x)\nonumber\\ &=\frac{1}{2\pi^{1/2}h}\left(\phi_{h/2^{1/2}}*f\right)(x)\nonumber\\ &=\frac{1}{2\pi^{1/2}h}\phi_{(h^2/2+\sigma^2)^{1/2}}(x-\mu).\tag{2.26} \end{align}\]

Thus, the exact expectation of the kde for estimating the density of a \(\mathcal{N}(\mu,\sigma^2)\) is the density of a \(\mathcal{N}(\mu,\sigma^2+h^2)\). Clearly, when \(h\to0\), the bias disappears. Removing this finite-sample size bias is not simple: if the bias is expanded, \(f''\) appears. Thus for attempting to unbias \(\hat f(\cdot;h)\) we have to estimate \(f''\), which is a harder task than estimating \(f\)! Taking second derivatives on the kde does not simply work out-of-the-box, since the bandwidths for estimating \(f\) and \(f''\) scale differently. We refer the interested reader to Section 2.12 in Wand and Jones (1995) for a quick review of derivative kde.

The previous deadlock can be solved if we limit our ambitions. Rather than constructing a confidence interval for \(f(x)\), we will do it for \(\mathbb{E}[\hat f(x;h)]=(K_h*f)(x)\). There is nothing wrong with this, as long as we are careful when we report the results to make it clear that the CI is for \((K_h*f)(x)\) and not \(f(x)\).

The building block for the CI for \(\mathbb{E}[\hat f(x;h)]=(K_h*f)(x)\) is Theorem 2.2, which stated that:

\[\begin{align*} \sqrt{nh}(\hat f(x;h)-\mathbb{E}[\hat f(x;h)])\stackrel{d}{\longrightarrow}\mathcal{N}(0,R(K)f(x)). \end{align*}\]

Plugging-in \(\hat f(x;h)=f(x)+O_\mathbb{P}(h^2+(nh)^{-1})=f(x)(1+o_\mathbb{P}(1))\) (see Exercise 2.8) as an estimate for \(f(x)\) in the variance, we have by the Slutsky’s theorem that

\[\begin{align*} \sqrt{\frac{nh}{R(K)\hat f(x;h)}}&(\hat f(x;h)-\mathbb{E}[\hat f(x;h)])\\ &=\sqrt{\frac{nh}{R(K)f(x)}}(\hat f(x;h)-\mathbb{E}[\hat f(x;h)])(1+o_\mathbb{P}(1))\\ &\stackrel{d}{\longrightarrow}\mathcal{N}(0,1). \end{align*}\]

Therefore, an asymptotic \(100(1-\alpha)\%\) confidence interval for \(\mathbb{E}[\hat f(x;h)]\) that can be straightforwardly computed is:

\[\begin{align} I=\left(\hat f(x;h)\pm z_{\alpha}\sqrt{\frac{R(K)\hat f(x;h)}{nh}}\right).\tag{2.27} \end{align}\]

Recall that if we wanted to do obtain a CI with the second result in Theorem 2.2 we will need to estimate \(f''(x)\).

Remark. Several points regarding the CI require proper awareness:

  1. The CI is for \(\mathbb{E}[\hat f(x;h)]=(K_h*f)(x)\), not \(f(x)\).
  2. The CI is pointwise. That means that \(\mathbb{P}\left[\mathbb{E}[\hat f(x;h)]\in I\right]\approx 1-\alpha\) for each \(x\in\mathbb{R}\), but not simultaneously. This is,\(\mathbb{P}\left[\mathbb{E}[\hat f(x;h)]\in I,\,\forall x\in\mathbb{R}\right]\neq 1-\alpha\).
  3. We are approximating \(f(x)\) in the variance by \(\hat f(x;h)=f(x)+O_\mathbb{P}(h^2+(nh)^{-1})\). Additionally, the convergence to a normal distribution happens at rate \(\sqrt{nh}\). Hence both \(h\) and \(nh\) need to small and large, respectively, for a good coverage.
  4. The CI is for a deterministic bandwidth \(h\) (i.e., not selected from the sample), which is not usually the case in practise. If a bandwidth selector is employed, the coverage may be affected, specially for small \(n\).

We illustrate the construction of the CI in (2.27) for the situation where the reference density is a \(\mathcal{N}(\mu,\sigma^2)\) and the normal kernel is employed. This allows to use (2.25) and (2.26) in combination with (2.6) and (2.7) to obtain:

\[\begin{align*} \mathbb{E}[\hat f(x;h)]&=\phi_{(h^2+\sigma^2)^{1/2}}(x-\mu),\\ \mathbb{V}\mathrm{ar}[\hat f(x;h)]&=\frac{1}{n}\left(\frac{\phi_{(h^2/2+\sigma^2)^{1/2}}(x-\mu)}{2\pi^{1/2}h}-(\phi_{(h^2+\sigma^2)^{1/2}}(x-\mu))^2\right). \end{align*}\]

The following piece of code evaluates the proportion of times that \(\mathbb{E}[\hat f(x;h)]\) belongs to \(I\) for each \(x\in\mathbb{R}\), both estimating and knowing the variance in the asymptotic distribution.

# R(K) for a normal
Rk <- 1 / (2 * sqrt(pi))

# Generate a sample from a N(mu, sigma^2)
n <- 100
mu <- 0
sigma <- 1
set.seed(123456)
x <- rnorm(n = n, mean = mu, sd = sigma)

# Compute the kde (NR bandwidth)
kde <- density(x = x, from = -4, to = 4, n = 1024, bw = "nrd")

# Selected bandwidth
h <- kde$bw

# Estimate the variance
hatVarKde <- kde$y * Rk / (n * h)

# True expectation and variance (because the density is a normal)
EKh <- dnorm(x = kde$x, mean = mu, sd = sqrt(sigma^2 + h^2))
varKde <- (dnorm(kde$x, mean = mu, sd = sqrt(h^2 / 2 + sigma^2)) /
             (2 * sqrt(pi) * h) - EKh^2) / n

# CI with estimated variance
alpha <- 0.05
zalpha <- qnorm(1 - alpha/2)
ciLow1 <- kde$y - zalpha * sqrt(hatVarKde)
ciUp1 <- kde$y + zalpha * sqrt(hatVarKde)

# CI with known variance
ciLow2 <- kde$y - zalpha * sqrt(varKde)
ciUp2 <- kde$y + zalpha * sqrt(varKde)

# Plot estimate, CIs and expectation
plot(kde, main = "Density and CIs", ylim = c(0, 1))
lines(kde$x, ciLow1, col = "gray")
lines(kde$x, ciUp1, col = "gray")
lines(kde$x, ciUp2, col = "gray", lty = 2)
lines(kde$x, ciLow2, col = "gray", lty = 2)
lines(kde$x, EKh, col = "red")
legend("topright", legend = c("Estimate", "CI estimated var",
                              "CI known var", "Smoothed density"),
       col = c("black", "gray", "gray", "red"), lwd = 2, lty = c(1, 1, 2, 1))

The above code computes the CI. But it does not show any insight on the effective coverage of the CI. The next simulation exercise tackles with this issue.

# Simulation setting
n <- 100; h <- 0.2
mu <- 0; sigma <- 1 # Normal parameters
M <- 5e2 # Number of replications in the simulation
nGrid <- 512 # Number of x's for computing the kde
alpha <- 0.05; zalpha <- qnorm(1 - alpha/2) # alpha for CI

# Compute expectation and variance
kde <- density(x = 0, bw = h, from = -4, to = 4, n = nGrid) # Just to get kde$x
EKh <- dnorm(x = kde$x, mean = mu, sd = sqrt(sigma^2 + h^2))
varKde <- (dnorm(kde$x, mean = mu, sd = sqrt(h^2 / 2 + sigma^2)) /
           (2 * sqrt(pi) * h) - EKh^2) / n

# For storing if the mean is inside the CI
insideCi1 <- insideCi2 <- matrix(nrow = M, ncol = nGrid)

# Simulation
set.seed(12345)
for (i in 1:M) {

  # Sample & kde
  x <- rnorm(n = n, mean = mu, sd = sigma)
  kde <- density(x = x, bw = h, from = -4, to = 4, n = nGrid)
  hatSdKde <- sqrt(kde$y * Rk / (n * h))

  # CI with estimated variance
  ciLow1 <- kde$y - zalpha * hatSdKde
  ciUp1 <- kde$y + zalpha * hatSdKde

  # CI with known variance
  ciLow2 <- kde$y - zalpha * sqrt(varKde)
  ciUp2 <- kde$y + zalpha * sqrt(varKde)

  # Check if for each x the mean is inside the CI
  insideCi1[i, ] <- EKh > ciLow1 & EKh < ciUp1
  insideCi2[i, ] <- EKh > ciLow2 & EKh < ciUp2

}

# Plot results
plot(kde$x, colMeans(insideCi1), ylim = c(0.25, 1), type = "l",
     main = "Coverage CIs", xlab = "x", ylab = "Coverage")
lines(kde$x, colMeans(insideCi2), col = 4)
abline(h = 1 - alpha, col = 2)
abline(h = 1 - alpha + c(-1, 1) * qnorm(0.975) *
         sqrt(alpha * (1 - alpha) / M), col = 2, lty = 2)
legend(x = "bottom", legend = c("CI estimated var", "CI known var",
                                "Nominal level",
                                "95% CI for the nominal level"),
       col = c(1, 4, 2, 2), lwd = 2, lty = c(1, 1, 1, 2))

Exercise 2.3 Explore the coverage of the asymptotic CI for varying values of \(h\). To that end, adapt the previous code to work in a manipulate environment like the example given below.
# Load manipulate
# install.packages("manipulate")
library(manipulate)

# Sample
x <- rnorm(100)

# Simple plot of kde for varying h
manipulate({

  kde <- density(x = x, from = -4, to = 4, bw = h)
  plot(kde, ylim = c(0, 1), type = "l", main = "")
  curve(dnorm(x), from = -4, to = 4, col = 2, add = TRUE)
  rug(x)

}, h = slider(min = 0.01, max = 2, initial = 0.5, step = 0.01))

References

Wand, M. P., and M. C. Jones. 1995. Kernel Smoothing. Vol. 60. Monographs on Statistics and Applied Probability. London: Chapman & Hall, Ltd. https://doi.org/10.1007/978-1-4899-4493-1.