2.5 Confidence intervals
Obtaining a Confidence Interval (CI) for f(x) is a challenging task. Due to the bias results seen in Section 2.3, we know that the kde is biased for finite sample sizes, \mathbb{E}[\hat f(x;h)]=(K_h*f)(x), and that it is only asymptotically unbiased when h\to0. This bias is called the smoothing bias and, in essence, it complicates obtaining CIs for f(x), but not for (K_h*f)(x). Let's illustrate the difference between these two objects with an example.
Some well-known facts for normal densities (see Appendix C in Wand and Jones (1995)) are:
\begin{align*} (\phi_{\sigma_1}(\cdot-\mu_1)*\phi_{\sigma_2}(\cdot-\mu_2))(x)&=\phi_{(\sigma_1^2+\sigma_2^2)^{1/2}}(x-\mu_1-\mu_2),\\ \int\phi_{\sigma_1}(x-\mu_1)\phi_{\sigma_2}(x-\mu_2)\,\mathrm{d}x&=\phi_{(\sigma_1^2+\sigma_2^2)^{1/2}}(\mu_1-\mu_2),\\ \phi_\sigma(x-\mu)^r&=\frac{(2\pi)^{(1-r)/2}}{\sigma^{r-1}r^{1/2}}\,\phi_{\sigma/r^{1/2}}(x-\mu). \end{align*}
As a consequence, if K=\phi (i.e., K_h=\phi_h) and f(\cdot)=\phi_\sigma(\cdot-\mu), we have that
\begin{align*} (K_h*f)(x)&=\phi_{(h^2+\sigma^2)^{1/2}}(x-\mu),\\ (K_h^2*f)(x)&=\left(\frac{1}{(2\pi)^{1/2}h}\frac{\phi_{h/2^{1/2}}}{2^{1/2}}*f\right)(x)=\frac{1}{2\pi^{1/2}h}(\phi_{h/2^{1/2}}*f)(x)=\frac{1}{2\pi^{1/2}h}\phi_{(h^2/2+\sigma^2)^{1/2}}(x-\mu). \end{align*}
Thus, the exact expectation of the kde when estimating the density of a \mathcal{N}(\mu,\sigma^2) is the density of a \mathcal{N}(\mu,\sigma^2+h^2). Clearly, when h\to0, the bias disappears. Removing this finite-sample bias is not simple: if the bias is expanded, f'' appears. Thus, in order to unbias \hat f(\cdot;h) we would have to estimate f'', which is a harder task than estimating f! In addition, taking second derivatives of the kde does not work out-of-the-box, since the bandwidths for estimating f and f'' scale differently. We refer the interested reader to Section 2.12 in Wand and Jones (1995) for a quick review of the kde of derivatives.
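This identity can also be verified numerically. The following snippet (an illustration added here; the values of mu, sigma, h, and x0 are arbitrary) computes the convolution by numerical integration and compares it with the normal density with inflated variance:
# Numerical check of (K_h * f)(x) = phi_{sqrt(h^2 + sigma^2)}(x - mu) with K = phi
# The particular values of mu, sigma, h, and x0 are arbitrary illustrative choices
mu <- 0.5; sigma <- 1.5; h <- 0.75; x0 <- 1
# Left-hand side: convolution (K_h * f)(x0) by numerical integration
integrate(function(y) dnorm(x0 - y, sd = h) * dnorm(y, mean = mu, sd = sigma),
          lower = -Inf, upper = Inf)$value
# Right-hand side: normal density with variance sigma^2 + h^2
dnorm(x0, mean = mu, sd = sqrt(sigma^2 + h^2))
# Both numbers should coincide up to the integration error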
The previous deadlock can be solved if we limit our ambitions. Rather than constructing a confidence interval for f(x), we will do it for \mathbb{E}[\hat f(x;h)]=(K_h*f)(x). There is nothing wrong with this, as long as we are careful when we report the results to make it clear that the CI is for (K_h*f)(x) and not f(x).
The building block for the CI for \mathbb{E}[\hat f(x;h)]=(K_h*f)(x) is Theorem 2.2, which stated that:
\begin{align*} \sqrt{nh}(\hat f(x;h)-\mathbb{E}[\hat f(x;h)])\stackrel{d}{\longrightarrow}\mathcal{N}(0,R(K)f(x)). \end{align*}
Plugging in \hat f(x;h)=f(x)+O_\mathbb{P}(h^2+(nh)^{-1})=f(x)(1+o_\mathbb{P}(1)) (see Exercise 2.8) as an estimate of f(x) in the variance, we have, by Slutsky's theorem, that
\begin{align*} \sqrt{\frac{nh}{R(K)\hat f(x;h)}}&(\hat f(x;h)-\mathbb{E}[\hat f(x;h)])\\ &=\sqrt{\frac{nh}{R(K)f(x)}}(\hat f(x;h)-\mathbb{E}[\hat f(x;h)])(1+o_\mathbb{P}(1))\\ &\stackrel{d}{\longrightarrow}\mathcal{N}(0,1). \end{align*}
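Before exploiting this result, a quick Monte Carlo sanity check may be illustrative. The following sketch (an addition to the text; n, h, x0, and M are arbitrary choices) standardizes the kde at a fixed point for samples from a \mathcal{N}(0,1), using the exact expectation derived above, and checks that the resulting statistic has approximately zero mean and unit standard deviation:
# Monte Carlo check of the asymptotic normality of the standardized kde at a
# fixed point x0, for f the N(0, 1) density and K the normal kernel
# (n, h, x0, and M are arbitrary illustrative choices)
set.seed(42)
n <- 500; h <- 0.25; x0 <- 0.5; M <- 1e3
Rk <- 1 / (2 * sqrt(pi)) # R(K) for the normal kernel
EKh <- dnorm(x0, sd = sqrt(1 + h^2)) # Exact E[hat f(x0; h)] (see above)
stat <- replicate(M, {
  samp <- rnorm(n)
  fHat <- mean(dnorm((x0 - samp) / h)) / h # kde evaluated at x0
  sqrt(n * h / (Rk * fHat)) * (fHat - EKh)
})
c(mean(stat), sd(stat)) # Should be approximately 0 and 1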
Therefore, an asymptotic 100(1-\alpha)\% confidence interval for \mathbb{E}[\hat f(x;h)] that can be straightforwardly computed is:
\begin{align} I=\left(\hat f(x;h)\pm z_{\alpha/2}\sqrt{\frac{R(K)\hat f(x;h)}{nh}}\right),\tag{2.27} \end{align}
where z_{\alpha/2}:=\Phi^{-1}(1-\alpha/2) denotes the upper \alpha/2 quantile of a standard normal. Recall that if we wanted to obtain a CI based on the second result in Theorem 2.2, we would need to estimate f''(x).
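For concreteness, (2.27) can be wrapped into a small helper function. The following sketch (an illustration; the name ciKde and its arguments are arbitrary) evaluates the CI on a grid of points from a sample, employing a normal kernel:
# Asymptotic 100(1 - alpha)% CI (2.27) for E[hat f(x; h)] on a grid of points,
# using a normal kernel; "ciKde" and its arguments are illustrative choices
ciKde <- function(x, h, xGrid, alpha = 0.05) {
  Rk <- 1 / (2 * sqrt(pi)) # R(K) for the normal kernel
  fHat <- sapply(xGrid, function(x0) mean(dnorm((x0 - x) / h)) / h) # kde
  moe <- qnorm(1 - alpha/2) * sqrt(Rk * fHat / (length(x) * h)) # Margin of error
  cbind(lower = fHat - moe, kde = fHat, upper = fHat + moe)
}
ciKde(x = rnorm(100), h = 0.3, xGrid = c(-1, 0, 1))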
Remark. Several points regarding the CI deserve attention:
- The CI is for \mathbb{E}[\hat f(x;h)]=(K_h*f)(x), not f(x).
- The CI is pointwise. That means that \mathbb{P}\left[\mathbb{E}[\hat f(x;h)]\in I\right]\approx 1-\alpha for each x\in\mathbb{R}, but not simultaneously. That is, \mathbb{P}\left[\mathbb{E}[\hat f(x;h)]\in I,\,\forall x\in\mathbb{R}\right]\neq 1-\alpha.
- We are approximating f(x) in the variance by \hat f(x;h)=f(x)+O_\mathbb{P}(h^2+(nh)^{-1}). In addition, the convergence to a normal distribution happens at rate \sqrt{nh}. Hence, h needs to be small and nh large for a good coverage.
- The CI is for a deterministic bandwidth h (i.e., not selected from the sample), which is not usually the case in practice. If a bandwidth selector is employed, the coverage may be affected, especially for small n.
We illustrate the construction of the CI in (2.27) for the situation where the reference density is a \mathcal{N}(\mu,\sigma^2) and the normal kernel is employed. This allows us to use (2.25) and (2.26), in combination with (2.6) and (2.7), to obtain:
\begin{align*} \mathbb{E}[\hat f(x;h)]&=\phi_{(h^2+\sigma^2)^{1/2}}(x-\mu),\\ \mathbb{V}\mathrm{ar}[\hat f(x;h)]&=\frac{1}{n}\left(\frac{\phi_{(h^2/2+\sigma^2)^{1/2}}(x-\mu)}{2\pi^{1/2}h}-(\phi_{(h^2+\sigma^2)^{1/2}}(x-\mu))^2\right). \end{align*}
The following piece of code constructs the CI (2.27), both estimating and knowing the variance in the asymptotic distribution. The simulation afterwards evaluates the proportion of times that \mathbb{E}[\hat f(x;h)] belongs to I for each x\in\mathbb{R}.
# R(K) for a normal kernel
Rk <- 1 / (2 * sqrt(pi))

# Generate a sample from a N(mu, sigma^2)
n <- 100
mu <- 0
sigma <- 1
set.seed(123456)
x <- rnorm(n = n, mean = mu, sd = sigma)

# Compute the kde (NR bandwidth)
kde <- density(x = x, from = -4, to = 4, n = 1024, bw = "nrd")

# Selected bandwidth
h <- kde$bw

# Estimate the variance
hatVarKde <- kde$y * Rk / (n * h)

# True expectation and variance (because the density is a normal)
EKh <- dnorm(x = kde$x, mean = mu, sd = sqrt(sigma^2 + h^2))
varKde <- (dnorm(kde$x, mean = mu, sd = sqrt(h^2 / 2 + sigma^2)) /
             (2 * sqrt(pi) * h) - EKh^2) / n

# CI with estimated variance
alpha <- 0.05
zalpha <- qnorm(1 - alpha/2)
ciLow1 <- kde$y - zalpha * sqrt(hatVarKde)
ciUp1 <- kde$y + zalpha * sqrt(hatVarKde)

# CI with known variance
ciLow2 <- kde$y - zalpha * sqrt(varKde)
ciUp2 <- kde$y + zalpha * sqrt(varKde)
# Plot estimate, CIs and expectation
plot(kde, main = "Density and CIs", ylim = c(0, 1))
lines(kde$x, ciLow1, col = "gray")
lines(kde$x, ciUp1, col = "gray")
lines(kde$x, ciUp2, col = "gray", lty = 2)
lines(kde$x, ciLow2, col = "gray", lty = 2)
lines(kde$x, EKh, col = "red")
legend("topright", legend = c("Estimate", "CI estimated var",
"CI known var", "Smoothed density"),
col = c("black", "gray", "gray", "red"), lwd = 2, lty = c(1, 1, 2, 1))
The above code computes the CI, but it does not give any insight into its effective coverage. The next simulation exercise tackles this issue.
# Simulation setting
n <- 100; h <- 0.2
mu <- 0; sigma <- 1 # Normal parameters
M <- 5e2 # Number of replications in the simulation
nGrid <- 512 # Number of x's for computing the kde
alpha <- 0.05; zalpha <- qnorm(1 - alpha/2) # alpha for CI

# Compute expectation and variance
kde <- density(x = 0, bw = h, from = -4, to = 4, n = nGrid) # Just to get kde$x
EKh <- dnorm(x = kde$x, mean = mu, sd = sqrt(sigma^2 + h^2))
varKde <- (dnorm(kde$x, mean = mu, sd = sqrt(h^2 / 2 + sigma^2)) /
             (2 * sqrt(pi) * h) - EKh^2) / n

# For storing if the mean is inside the CI
insideCi1 <- insideCi2 <- matrix(nrow = M, ncol = nGrid)

# Simulation
set.seed(12345)
for (i in 1:M) {

  # Sample & kde
  x <- rnorm(n = n, mean = mu, sd = sigma)
  kde <- density(x = x, bw = h, from = -4, to = 4, n = nGrid)
  hatSdKde <- sqrt(kde$y * Rk / (n * h))

  # CI with estimated variance
  ciLow1 <- kde$y - zalpha * hatSdKde
  ciUp1 <- kde$y + zalpha * hatSdKde

  # CI with known variance
  ciLow2 <- kde$y - zalpha * sqrt(varKde)
  ciUp2 <- kde$y + zalpha * sqrt(varKde)

  # Check if for each x the mean is inside the CI
  insideCi1[i, ] <- EKh > ciLow1 & EKh < ciUp1
  insideCi2[i, ] <- EKh > ciLow2 & EKh < ciUp2

}
# Plot results
plot(kde$x, colMeans(insideCi1), ylim = c(0.25, 1), type = "l",
main = "Coverage CIs", xlab = "x", ylab = "Coverage")
lines(kde$x, colMeans(insideCi2), col = 4)
abline(h = 1 - alpha, col = 2)
abline(h = 1 - alpha + c(-1, 1) * qnorm(0.975) *
sqrt(alpha * (1 - alpha) / M), col = 2, lty = 2)
legend(x = "bottom", legend = c("CI estimated var", "CI known var",
"Nominal level",
"95% CI for the nominal level"),
col = c(1, 4, 2, 2), lwd = 2, lty = c(1, 1, 1, 2))
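As a brief numerical follow-up to the previous simulation (an addition to the original code), the empirical pointwise coverages can be averaged over the grid and compared with the simultaneous coverage, which illustrates the pointwise nature of the CI stressed in the remark above:
# Average pointwise coverage over the grid (should be close to 1 - alpha)
mean(colMeans(insideCi1)) # Estimated variance
mean(colMeans(insideCi2)) # Known variance
# Simultaneous coverage: proportion of replications in which the CI contains
# E[hat f(x; h)] at *all* the grid points -- much smaller than 1 - alpha
mean(apply(insideCi1, 1, all))
mean(apply(insideCi2, 1, all))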
Exercise 2.3 Explore the coverage of the asymptotic CI for varying values of h. To that end, adapt the previous code to work in a manipulate
environment like the example given below.
# Load manipulate
# install.packages("manipulate")
library(manipulate)
# Sample
x <- rnorm(100)

# Simple plot of kde for varying h
manipulate({
  kde <- density(x = x, from = -4, to = 4, bw = h)
  plot(kde, ylim = c(0, 1), type = "l", main = "")
  curve(dnorm(x), from = -4, to = 4, col = 2, add = TRUE)
  rug(x)
}, h = slider(min = 0.01, max = 2, initial = 0.5, step = 0.01))