2.4 Bandwidth selection

As we saw in the previous sections, bandwidth selection is a key issue in density estimation. The purpose of this section is to introduce objective and automatic bandwidth selectors that attempt to minimize the estimation error of the target density f.

The first step is to define a global, rather than local, error criterion. The Integrated Squared Error (ISE),

$$\mathrm{ISE}[\hat{f}(\cdot;h)]:=\int\big(\hat{f}(x;h)-f(x)\big)^2\,\mathrm{d}x,$$

is the squared distance between the kde and the target density. The ISE is a random quantity, since it depends directly on the sample $X_1,\ldots,X_n$. As a consequence, looking for an optimal-ISE bandwidth is a hard task, since the optimality depends on the sample itself and not only on the population and $n$. To avoid this problem, it is usual to compute the Mean Integrated Squared Error (MISE):

$$\mathrm{MISE}[\hat{f}(\cdot;h)]:=\mathbb{E}\big[\mathrm{ISE}[\hat{f}(\cdot;h)]\big]=\mathbb{E}\left[\int\big(\hat{f}(x;h)-f(x)\big)^2\,\mathrm{d}x\right]=\int\mathbb{E}\left[\big(\hat{f}(x;h)-f(x)\big)^2\right]\,\mathrm{d}x=\int\mathrm{MSE}\big[\hat{f}(x;h)\big]\,\mathrm{d}x.$$

The MISE is convenient due to its mathematical tractability and its natural relation with the MSE. There are, however, other error criteria that present attractive properties, such as the Mean Integrated Absolute Error (MIAE):

$$\mathrm{MIAE}[\hat{f}(\cdot;h)]:=\mathbb{E}\left[\int\big|\hat{f}(x;h)-f(x)\big|\,\mathrm{d}x\right]=\int\mathbb{E}\left[\big|\hat{f}(x;h)-f(x)\big|\right]\,\mathrm{d}x.$$

The MIAE, unlike the MISE, is invariant with respect to monotone transformations of the data. For example, if $g(x)=f(t^{-1}(x))(t^{-1})'(x)$ is the density of $Y=t(X)$ with $X\sim f$, then, if the change of variables $y=t(x)$ is made,

$$\int\big|\hat{f}(x;h)-f(x)\big|\,\mathrm{d}x=\int\big|\hat{f}(t^{-1}(y);h)-f(t^{-1}(y))\big|(t^{-1})'(y)\,\mathrm{d}y=\int\big|\hat{g}(y;h)-g(y)\big|\,\mathrm{d}y.$$

Despite this appealing property, the analysis of MIAE is substantially more complicated. We refer to Devroye and Györfi (1985) for a comprehensive treatment of absolute value metrics for kde.

Once the MISE is set as the error criterion to be minimized, our aim is to find

$$h_{\mathrm{MISE}}:=\arg\min_{h>0}\mathrm{MISE}[\hat{f}(\cdot;h)].$$

For that purpose, we need an explicit expression of the MISE that we can attempt to minimize. An asymptotic expansion for the MISE can be easily derived from the MSE expression.

Corollary 2.2 Under A1–A3,

$$\mathrm{MISE}[\hat{f}(\cdot;h)]=\frac{1}{4}\mu_2^2(K)R(f'')h^4+\frac{R(K)}{nh}+o\big(h^4+(nh)^{-1}\big).\tag{2.17}$$

Therefore, $\mathrm{MISE}[\hat{f}(\cdot;h)]\to0$ when $n\to\infty$.

Proof. Trivial.

The dominating part of the MISE is denoted as AMISE, which stands for Asymptotic MISE: $\mathrm{AMISE}[\hat{f}(\cdot;h)]=\frac{1}{4}\mu_2^2(K)R(f'')h^4+\frac{R(K)}{nh}$. Due to its closed expression, it is possible to obtain a bandwidth that minimizes the AMISE.

Corollary 2.3 The bandwidth that minimizes the AMISE is

$$h_{\mathrm{AMISE}}=\left[\frac{R(K)}{\mu_2^2(K)R(f'')n}\right]^{1/5}.$$

The optimal AMISE is:

$$\inf_{h>0}\mathrm{AMISE}[\hat{f}(\cdot;h)]=\frac{5}{4}\left(\mu_2^2(K)R(K)^4\right)^{1/5}R(f'')^{1/5}n^{-4/5}.$$

Proof. Solving $\frac{\mathrm{d}}{\mathrm{d}h}\mathrm{AMISE}[\hat{f}(\cdot;h)]=0$, i.e., $\mu_2^2(K)R(f'')h^3-R(K)n^{-1}h^{-2}=0$, yields $h_{\mathrm{AMISE}}$, and $\mathrm{AMISE}[\hat{f}(\cdot;h_{\mathrm{AMISE}})]$ gives the AMISE-optimal error.
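
To fix ideas, the following minimal sketch evaluates $h_{\mathrm{AMISE}}$ in the idealized situation in which the target density is known. It assumes a normal kernel, so $\mu_2(K)=1$ and $R(K)=\frac{1}{2\pi^{1/2}}$, and a $\mathcal{N}(0,1)$ target, for which $R(f'')=\frac{3}{8\pi^{1/2}}$.

# h_AMISE when the target density is known: normal kernel and N(0, 1) target
h.amise <- function(n, R.K = 1 / (2 * sqrt(pi)), mu2.K = 1,
                    R.fpp = 3 / (8 * sqrt(pi))) {
  (R.K / (mu2.K^2 * R.fpp * n))^(1 / 5)
}
h.amise(n = 100)
h.amise(n = 200) / h.amise(n = 100) # Bandwidths shrink at the rate n^(-1/5)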

The AMISE-optimal order deserves some further inspection. It can be seen in Section 3.2 of Scott (2015) that the AMISE-optimal order for the histogram of Section 2.1 is $\left(\frac{3}{4}\right)^{2/3}R(f')^{1/3}n^{-2/3}$. Two facts are of interest. First, the MISE of the histogram is asymptotically larger than the MISE of the kde ($n^{-4/5}=o(n^{-2/3})$). This is a quantification of the quite apparent visual improvement of the kde over the histogram. Second, $R(f')$ appears instead of $R(f'')$, evidencing that the histogram is affected not by the curvature of the target density $f$ but by how fast it varies.

Unfortunately, the AMISE bandwidth depends on $R(f'')=\int(f''(x))^2\,\mathrm{d}x$, which measures the curvature of the density. As a consequence, it cannot be readily applied in practice. In the next subsection we will see how to plug in estimates for $R(f'')$.

2.4.1 Plug-in rules

A simple solution to estimate $R(f'')$ is to assume that $f$ is the density of a $\mathcal{N}(\mu,\sigma^2)$ and then plug in the form of the curvature for such density:

$$R(\phi_\sigma''(\cdot-\mu))=\frac{3}{8\pi^{1/2}}\sigma^{-5}.$$

While doing so, we approximate the curvature of an arbitrary density by means of the curvature of a Normal. We have that

$$h_{\mathrm{AMISE}}=\left[\frac{8\pi^{1/2}R(K)}{3\mu_2^2(K)n}\right]^{1/5}\sigma.$$

Interestingly, the bandwidth is directly proportional to the standard deviation of the target density. Replacing $\sigma$ by an estimate yields the normal scale bandwidth selector, which we denote by $\hat{h}_{\mathrm{NS}}$ to emphasize its randomness:

$$\hat{h}_{\mathrm{NS}}=\left[\frac{8\pi^{1/2}R(K)}{3\mu_2^2(K)n}\right]^{1/5}\hat\sigma.$$

The estimate $\hat\sigma$ can be chosen as the standard deviation $s$ or, in order to avoid the effects of potential outliers, as the standardized interquartile range

$$\hat\sigma_{\mathrm{IQR}}:=\frac{X_{([0.75n])}-X_{([0.25n])}}{\Phi^{-1}(0.75)-\Phi^{-1}(0.25)}.$$

Silverman (1986) suggests employing the minimum of both quantities,

$$\hat\sigma=\min(s,\hat\sigma_{\mathrm{IQR}}).\tag{2.18}$$

When combined with a normal kernel, for which $\mu_2(K)=1$ and $R(K)=\phi_{\sqrt{2}}(0)=\frac{1}{2\pi^{1/2}}$ (check (2.24)), this particularization of $\hat{h}_{\mathrm{NS}}$ gives the famous rule-of-thumb for bandwidth selection:

$$\hat{h}_{\mathrm{RT}}=\left(\frac{4}{3}\right)^{1/5}n^{-1/5}\hat\sigma\approx1.06\,n^{-1/5}\hat\sigma.$$

$\hat{h}_{\mathrm{RT}}$ is implemented in R through the function bw.nrd (not to be confused with bw.nrd0).

# Data
x <- rnorm(100)

# Rule-of-thumb
bw.nrd(x = x)
## [1] 0.3897358

# Same as
iqr <- diff(quantile(x, c(0.75, 0.25))) / diff(qnorm(c(0.75, 0.25)))
1.06 * length(x)^(-1/5) * min(sd(x), iqr)
## [1] 0.3871416

The previous selector is an example of a zero-stage plug-in selector, a terminology that reflects the fact that $R(f'')$ was estimated by plugging in a parametric assumption at the “very beginning.” Instead, we could have opted to estimate $R(f'')$ nonparametrically and then plug the estimate into $h_{\mathrm{AMISE}}$. Let’s explore this possibility in more detail. But first, note the useful equality

$$\int f^{(s)}(x)^2\,\mathrm{d}x=(-1)^s\int f^{(2s)}(x)f(x)\,\mathrm{d}x.$$

This holds by an iterative application of integration by parts. For example, for $s=2$, take $u=f''(x)$ and $\mathrm{d}v=f''(x)\,\mathrm{d}x$. This gives

$$\int f''(x)^2\,\mathrm{d}x=\left[f''(x)f'(x)\right]_{-\infty}^{+\infty}-\int f'''(x)f'(x)\,\mathrm{d}x=-\int f'''(x)f'(x)\,\mathrm{d}x$$

under the assumption that the derivatives vanish at infinity. Applying integration by parts again, now with $u=f'''(x)$ and $\mathrm{d}v=f'(x)\,\mathrm{d}x$, gives the result. This simple derivation has an important consequence: for estimating the functionals $R(f^{(s)})$ it is only required to estimate, for $r=2s$, the functionals

$$\psi_r:=\int f^{(r)}(x)f(x)\,\mathrm{d}x=\mathbb{E}\left[f^{(r)}(X)\right].$$

In particular, $R(f'')=\psi_4$.
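
As a quick numerical check of this identity (purely illustrative, for a standard normal density), both $R(f'')$ and $\psi_4$ can be computed by numerical integration and compared with the closed form $\frac{3}{8\pi^{1/2}}$ that follows from the expression for $R(\phi_\sigma'')$ given above:

# R(f'') and psi_4 for a N(0, 1) density; both should equal 3 / (8 * sqrt(pi))
f2 <- function(x) (x^2 - 1) * dnorm(x)            # f''(x) for the N(0, 1) density
f4 <- function(x) (x^4 - 6 * x^2 + 3) * dnorm(x)  # f''''(x) for the N(0, 1) density
integrate(function(x) f2(x)^2, lower = -Inf, upper = Inf)$value           # R(f'')
integrate(function(x) f4(x) * dnorm(x), lower = -Inf, upper = Inf)$value  # psi_4
3 / (8 * sqrt(pi))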

Thanks to the previous expression, a possible way to estimate $\psi_r$ nonparametrically is

$$\hat\psi_r(g)=\frac{1}{n}\sum_{i=1}^n\hat{f}^{(r)}(X_i;g)=\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^nL_g^{(r)}(X_i-X_j),\tag{2.19}$$

where $\hat{f}^{(r)}(\cdot;g)$ is the $r$-th derivative of a kde with bandwidth $g$ and kernel $L$, i.e., $\hat{f}^{(r)}(x;g)=\frac{1}{ng^{r+1}}\sum_{i=1}^nL^{(r)}\left(\frac{x-X_i}{g}\right)$. Note that $g$ and $L$ can be different from $h$ and $K$, respectively. It turns out that estimating $\psi_r$ involves the adequate selection of a bandwidth $g$. The agenda is analogous to the one for $h_{\mathrm{AMISE}}$, but now taking into account that both $\hat\psi_r(g)$ and $\psi_r$ are scalar quantities:

  1. Under certain regularity assumptions (see Section 3.5 in Wand and Jones (1995) for full details), the asymptotic bias and variance of $\hat\psi_r(g)$ are obtained. With them, we can compute the asymptotic expansion of the MSE and obtain the Asymptotic Mean Squared Error (AMSE):

    $$\mathrm{AMSE}[\hat\psi_r(g)]=\left\{\frac{L^{(r)}(0)}{ng^{r+1}}+\frac{\mu_2(L)\psi_{r+2}g^{2}}{2}\right\}^2+\frac{2R(L^{(r)})\psi_0}{n^2g^{2r+1}}+\frac{4}{n}\left\{\int f^{(r)}(x)^2f(x)\,\mathrm{d}x-\psi_r^2\right\}.$$

    Note: $k$ denotes the order of the kernel $L$, that is, the first integer such that $\mu_k(L)\neq0$; for a kernel of order $k$, the second bias term above becomes $\frac{\mu_k(L)\psi_{r+k}g^{k}}{k!}$. In these notes we have restricted ourselves to the case $k=2$ for the kernels $K$, but there are theoretical gains if one allows high-order kernels $L$ (with vanishing even moments below order $k>2$) for estimating $\psi_r$.

  2. Obtain the AMSE-optimal bandwidth:

    $$g_{\mathrm{AMSE}}=\left[\frac{-k!\,L^{(r)}(0)}{\mu_k(L)\psi_{r+k}n}\right]^{1/(r+k+1)}$$

    The order of the optimal AMSE is

    $$\inf_{g>0}\mathrm{AMSE}[\hat\psi_r(g)]=\begin{cases}O\left(n^{-(2k+1)/(r+k+1)}\right),&k<r,\\O\left(n^{-1}\right),&k\geq r,\end{cases}$$

    which shows that a parametric-like rate of convergence can be achieved with high-order kernels. If we consider $L=K$ and $k=2$, then

    $$g_{\mathrm{AMSE}}=\left[\frac{-2K^{(r)}(0)}{\mu_2(K)\psi_{r+2}n}\right]^{1/(r+3)}.$$

The above result has a major problem: it depends on $\psi_{r+2}$! Thus, if we want to estimate $R(f'')=\psi_4$ by $\hat\psi_4(g_{\mathrm{AMSE}})$, we will need to estimate $\psi_6$. But $\hat\psi_6(g_{\mathrm{AMSE}})$ will depend on $\psi_8$, and so on. The solution to this convoluted problem is to stop estimating the functionals $\psi_r$ after a given number $\ell$ of stages, therefore the terminology $\ell$-stage plug-in selector. At the $\ell$-th stage, the functional $\psi_{2\ell+4}$ appearing in the AMSE-optimal bandwidth for estimating $\psi_{2\ell+2}$ is computed assuming that the density is a $\mathcal{N}(\mu,\sigma^2)$, for which

$$\psi_r=\frac{(-1)^{r/2}\,r!}{(2\sigma)^{r+1}(r/2)!\,\pi^{1/2}},\quad\text{for }r\text{ even.}$$

Typically, two stages are considered a good trade-off between bias (mitigated when $\ell$ increases) and variance (which augments with $\ell$) of the plug-in selector. This is the method proposed by Sheather and Jones (1991), where they consider $L=K$ and $\ell=2$, yielding what we call the direct plug-in (DPI). The algorithm is:

  1. Estimate $\psi_8$ using $\hat\psi_8^{\mathrm{NS}}:=\frac{105}{32\pi^{1/2}\hat\sigma^9}$, where $\hat\sigma$ is given in (2.18).

  2. Estimate $\psi_6$ using $\hat\psi_6(g_1)$ from (2.19), where

    $$g_1:=\left[\frac{-2K^{(6)}(0)}{\mu_2(K)\hat\psi_8^{\mathrm{NS}}n}\right]^{1/9}.$$

  3. Estimate $\psi_4$ using $\hat\psi_4(g_2)$ from (2.19), where

    $$g_2:=\left[\frac{-2K^{(4)}(0)}{\mu_2(K)\hat\psi_6(g_1)n}\right]^{1/7}.$$

  4. The selected bandwidth is

    $$\hat{h}_{\mathrm{DPI}}:=\left[\frac{R(K)}{\mu_2^2(K)\hat\psi_4(g_2)n}\right]^{1/5}.$$

Remark. The derivatives $K^{(r)}$ of the normal kernel can be obtained using Hermite polynomials: $\phi^{(r)}(x)=(-1)^r\phi(x)H_r(x)$. For $r=4,6$: $H_4(x)=x^4-6x^2+3$ and $H_6(x)=x^6-15x^4+45x^2-15$.
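
For concreteness, below is a minimal from-scratch sketch of the four steps for the normal kernel, using (2.18) for $\hat\sigma$ and the Hermite-polynomial expressions of the remark, so that $K^{(4)}(0)=3\phi(0)$ and $K^{(6)}(0)=-15\phi(0)$. It is purely illustrative: the implementations in bw.SJ and hpi rely on binning and slightly different scale estimates, so their outputs will be close but not identical to this sketch.

# Two-stage DPI for the normal kernel, following steps 1-4 (illustrative sketch)
h.dpi.sketch <- function(x) {
  n <- length(x)
  # Scale estimate (2.18)
  sigma <- min(sd(x),
               diff(quantile(x, c(0.25, 0.75))) / diff(qnorm(c(0.25, 0.75))))
  # Estimate of psi_r in (2.19) with L = K = normal kernel, r = 4 or 6
  psi <- function(g, r) {
    D <- outer(x, x, "-") / g
    H <- if (r == 4) D^4 - 6 * D^2 + 3 else D^6 - 15 * D^4 + 45 * D^2 - 15
    sum(H * dnorm(D)) / (n^2 * g^(r + 1))
  }
  # Step 1: normal-scale estimate of psi_8
  psi8 <- 105 / (32 * sqrt(pi) * sigma^9)
  # Step 2: estimate psi_6 with bandwidth g1; K^(6)(0) = -15 * dnorm(0), mu_2(K) = 1
  g1 <- (-2 * (-15 * dnorm(0)) / (psi8 * n))^(1 / 9)
  psi6 <- psi(g1, r = 6)
  # Step 3: estimate psi_4 with bandwidth g2; K^(4)(0) = 3 * dnorm(0)
  g2 <- (-2 * (3 * dnorm(0)) / (psi6 * n))^(1 / 7)
  psi4 <- psi(g2, r = 4)
  # Step 4: DPI bandwidth; R(K) = 1 / (2 * sqrt(pi))
  (1 / (2 * sqrt(pi) * psi4 * n))^(1 / 5)
}

# For a normal sample, the result should be close to bw.SJ(x, method = "dpi")
h.dpi.sketch(x = rnorm(100))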

$\hat{h}_{\mathrm{DPI}}$ is implemented in R through the function bw.SJ (use method = "dpi"). The package ks provides hpi, which is a faster implementation that allows for more flexibility and has a somewhat more complete documentation.

# Data
x <- rnorm(100)

# DPI selector
bw.SJ(x = x, method = "dpi")
## [1] 0.450495

# Similar to
library(ks)
hpi(x) # Default is two-stages
## [1] 0.4537216

2.4.2 Cross-validation

We now turn our attention to a different philosophy of bandwidth estimation. Instead of trying to minimize the AMISE by plugging in estimates for the unknown curvature term, we directly attempt to minimize the MISE by using the sample twice: once for computing the kde and once for evaluating its performance in estimating $f$. To avoid the clear dependence on the sample, we do this evaluation in a cross-validatory way: the data used for computing the kde is not used for its evaluation.

We begin by expanding the square in the MISE expression:

$$\mathrm{MISE}[\hat{f}(\cdot;h)]=\mathbb{E}\left[\int\big(\hat{f}(x;h)-f(x)\big)^2\,\mathrm{d}x\right]=\mathbb{E}\left[\int\hat{f}(x;h)^2\,\mathrm{d}x\right]-2\,\mathbb{E}\left[\int\hat{f}(x;h)f(x)\,\mathrm{d}x\right]+\int f(x)^2\,\mathrm{d}x.$$

Since the last term does not depend on $h$, minimizing $\mathrm{MISE}[\hat{f}(\cdot;h)]$ is equivalent to minimizing

$$\mathbb{E}\left[\int\hat{f}(x;h)^2\,\mathrm{d}x\right]-2\,\mathbb{E}\left[\int\hat{f}(x;h)f(x)\,\mathrm{d}x\right].$$

This quantity is unknown, but it can be estimated unbiasedly (see Exercise 2.7) by

$$\mathrm{LSCV}(h):=\int\hat{f}(x;h)^2\,\mathrm{d}x-2n^{-1}\sum_{i=1}^n\hat{f}_{-i}(X_i;h),\tag{2.20}$$

where $\hat{f}_{-i}(\cdot;h)$ is the leave-one-out kde, based on the sample with $X_i$ removed:

$$\hat{f}_{-i}(x;h)=\frac{1}{n-1}\sum_{\substack{j=1\\j\neq i}}^nK_h(x-X_j).$$

The motivation for (2.20) is the following. The first term is unbiased by design. The second arises from approximating $\int\hat{f}(x;h)f(x)\,\mathrm{d}x$ by Monte Carlo or, in other words, by replacing $f(x)\,\mathrm{d}x=\mathrm{d}F(x)$ with $\mathrm{d}F_n(x)$. This gives

$$\int\hat{f}(x;h)f(x)\,\mathrm{d}x\approx\frac{1}{n}\sum_{i=1}^n\hat{f}(X_i;h)$$

and, in order to mitigate the dependence on the sample, we replace $\hat{f}(X_i;h)$ with $\hat{f}_{-i}(X_i;h)$ above. In that way, we use the sample for estimating the integral involving $\hat{f}(\cdot;h)$, but for each $X_i$ we compute the kde on the remaining points. The least squares cross-validation (LSCV) selector, also denoted unbiased cross-validation (UCV) selector, is defined as

$$\hat{h}_{\mathrm{LSCV}}:=\arg\min_{h>0}\mathrm{LSCV}(h).$$

Numerical optimization is required for obtaining $\hat{h}_{\mathrm{LSCV}}$, contrary to the previous plug-in selectors, and there is little control on the shape of the objective function. This will also be the case for the forthcoming bandwidth selectors. The following remark warns about the dangers of numerical optimization in this context.

Remark. Numerical optimization of the LSCV function can be challenging. In practice, several local minima are possible, and the roughness of the objective function can vary notably depending on $n$ and $f$. As a consequence, optimization routines may get trapped in spurious solutions. To be on the safe side, it is always advisable to check the solution by plotting $\mathrm{LSCV}(h)$ for a range of $h$, or to perform a search in a bandwidth grid: $\hat{h}_{\mathrm{LSCV}}\approx\arg\min_{h\in\{h_1,\ldots,h_G\}}\mathrm{LSCV}(h)$.
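
Following the advice in the remark, $\mathrm{LSCV}(h)$ can also be implemented directly from (2.20). A minimal sketch for the normal kernel is given below; it exploits the facts that $\int\hat{f}(x;h)^2\,\mathrm{d}x=\frac{1}{n^2}\sum_{i,j}(K_h*K_h)(X_i-X_j)$ and that, for the normal kernel, $K_h*K_h$ is a normal density with standard deviation $\sqrt{2}h$. The sketch does not use the binned C routines behind bw.ucv, so its output will differ slightly from bw.ucv's.

# LSCV(h) computed from its definition (2.20) for the normal kernel
# (illustrative sketch; bw.ucv relies on binned C code, so results differ slightly)
lscv <- function(h, x) {
  n <- length(x)
  D <- outer(x, x, "-")
  # int f_hat(x; h)^2 dx = (1 / n^2) * sum_ij phi_{sqrt(2) * h}(X_i - X_j)
  t1 <- sum(dnorm(D, sd = sqrt(2) * h)) / n^2
  # (2 / n) * sum_i f_hat_{-i}(X_i; h): kernel matrix without the diagonal
  K <- dnorm(D, sd = h)
  diag(K) <- 0
  t1 - 2 * sum(K) / (n * (n - 1))
}

# Exhaustive search in a bandwidth grid, as advised above
set.seed(123456)
x <- rnorm(100)
h.grid <- seq(0.05, 1.5, by = 0.01)
h.grid[which.min(sapply(h.grid, lscv, x = x))]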

$\hat{h}_{\mathrm{LSCV}}$ is implemented in R through the function bw.ucv. bw.ucv uses R’s optimize, which is quite sensitive to the selection of the search interval (long intervals containing the solution may lead to unsatisfactory termination of the search; short intervals might not contain the minimum). Therefore, some care is needed, and that is why the bw.ucv.mod function is presented.

# Data
set.seed(123456)
x <- rnorm(100)

# UCV gives a warning
bw.ucv(x = x)
## [1] 0.4499177

# Extend search interval
args(bw.ucv)
## function (x, nb = 1000L, lower = 0.1 * hmax, upper = hmax, tol = 0.1 * 
##     lower) 
## NULL
bw.ucv(x = x, lower = 0.01, upper = 1)
## [1] 0.5482419

# bw.ucv.mod replaces the optimization routine of bw.ucv by an exhaustive search on
# "h.grid" (chosen adaptatively from the sample) and optionally plots the LSCV curve
# with "plot.cv"
bw.ucv.mod <- function(x, nb = 1000L,
                       h.grid = diff(range(x)) * (seq(0.1, 1, l = 200))^2,
                       plot.cv = FALSE) {
  if ((n <- length(x)) < 2L)
    stop("need at least 2 data points")
  n <- as.integer(n)
  if (is.na(n))
    stop("invalid length(x)")
  if (!is.numeric(x))
    stop("invalid 'x'")
  nb <- as.integer(nb)
  if (is.na(nb) || nb <= 0L)
    stop("invalid 'nb'")
  storage.mode(x) <- "double"
  hmax <- 1.144 * sqrt(var(x)) * n^(-1/5)
  Z <- .Call(stats:::C_bw_den, nb, x)
  d <- Z[[1L]]
  cnt <- Z[[2L]]
  fucv <- function(h) .Call(stats:::C_bw_ucv, n, d, cnt, h)
  # h <- optimize(fucv, c(lower, upper), tol = tol)$minimum
  # if (h < lower + tol | h > upper - tol)
  #   warning("minimum occurred at one end of the range")
  obj <- sapply(h.grid, function(h) fucv(h))
  h <- h.grid[which.min(obj)]
  if (plot.cv) {
    plot(h.grid, obj, type = "o")
    rug(h.grid)
    abline(v = h, col = 2, lwd = 2)
  }
  h
}

# Compute the bandwidth and plot the LSCV curve
bw.ucv.mod(x = x, plot.cv = TRUE)
## [1] 0.5431732

# We can compare with the default bw.ucv output
abline(v = bw.ucv(x = x), col = 3)

The next cross-validation selector is based on biased cross-validation (BCV). The BCV selector presents a hybrid strategy that combines plug-in and cross-validation ideas. It starts by considering the AMISE expression in (2.17)

$$\mathrm{AMISE}[\hat{f}(\cdot;h)]=\frac{1}{4}\mu_2^2(K)R(f'')h^4+\frac{R(K)}{nh}$$

and then plugs in an estimate of $R(f'')$ based on a modification of $R(\hat{f}''(\cdot;h))$. The modification is

$$\tilde{R}(f''):=R(\hat{f}''(\cdot;h))-\frac{R(K'')}{nh^5}=\frac{1}{n^2}\sum_{i=1}^n\sum_{\substack{j=1\\j\neq i}}^n(K_h''*K_h'')(X_i-X_j),\tag{2.21}$$

a leave-out-diagonals estimate of $R(f'')$. It is designed to reduce the bias of $R(\hat{f}''(\cdot;h))$, since $\mathbb{E}[R(\hat{f}''(\cdot;h))]=R(f'')+\frac{R(K'')}{nh^5}+O(h^2)$ (Scott and Terrell 1987). Plugging (2.21) into the AMISE expression yields the BCV objective function and the BCV bandwidth selector:

$$\mathrm{BCV}(h):=\frac{1}{4}\mu_2^2(K)\tilde{R}(f'')h^4+\frac{R(K)}{nh},\qquad\hat{h}_{\mathrm{BCV}}:=\arg\min_{h>0}\mathrm{BCV}(h).$$

The appealing property of $\hat{h}_{\mathrm{BCV}}$ is that it has a considerably smaller variance compared to $\hat{h}_{\mathrm{LSCV}}$. This reduction in variance comes at the price of an increased bias, which tends to make $\hat{h}_{\mathrm{BCV}}$ larger than $h_{\mathrm{MISE}}$.
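
For completeness, $\mathrm{BCV}(h)$ can also be evaluated directly from its definition. The sketch below assumes the normal kernel, for which $K_h''*K_h''=(K_h*K_h)^{(4)}$ is the fourth derivative of a normal density with standard deviation $\sqrt{2}h$ (computable with the Hermite polynomial $H_4$ given in the remark of Section 2.4.1). Again, it is only illustrative and will not match bw.bcv exactly.

# BCV(h) from its definition for the normal kernel (illustrative sketch)
bcv <- function(h, x) {
  n <- length(x)
  s <- sqrt(2) * h
  Z <- outer(x, x, "-") / s
  # (K_h'' * K_h'')(u) = (z^4 - 6 * z^2 + 3) * dnorm(z) / s^5, with z = u / s
  conv4 <- (Z^4 - 6 * Z^2 + 3) * dnorm(Z) / s^5
  diag(conv4) <- 0                        # leave-out-diagonals, as in (2.21)
  R.tilde <- sum(conv4) / n^2
  # mu_2(K) = 1 and R(K) = 1 / (2 * sqrt(pi)) for the normal kernel
  R.tilde * h^4 / 4 + 1 / (2 * sqrt(pi) * n * h)
}

# Exhaustive search in a bandwidth grid
set.seed(123456)
x <- rnorm(100)
h.grid <- seq(0.1, 1.5, by = 0.01)
h.grid[which.min(sapply(h.grid, bcv, x = x))]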

$\hat{h}_{\mathrm{BCV}}$ is implemented in R through the function bw.bcv. Again, bw.bcv uses R’s optimize, so the bw.bcv.mod function is presented to have better guarantees on finding the adequate minimum.

# Data
set.seed(123456)
x <- rnorm(100)

# BCV gives a warning
bw.bcv(x = x)
## [1] 0.4500924

# Extend search interval
args(bw.bcv)
## function (x, nb = 1000L, lower = 0.1 * hmax, upper = hmax, tol = 0.1 * 
##     lower) 
## NULL
bw.bcv(x = x, lower = 0.01, upper = 1)
## [1] 0.5070129

# bw.bcv.mod replaces the optimization routine of bw.bcv by an exhaustive search on
# "h.grid" (chosen adaptatively from the sample) and optionally plots the BCV curve
# with "plot.cv"
bw.bcv.mod <- function(x, nb = 1000L,
                       h.grid = diff(range(x)) * (seq(0.1, 1, l = 200))^2,
                       plot.cv = FALSE) {
  if ((n <- length(x)) < 2L)
    stop("need at least 2 data points")
  n <- as.integer(n)
  if (is.na(n))
    stop("invalid length(x)")
  if (!is.numeric(x))
    stop("invalid 'x'")
  nb <- as.integer(nb)
  if (is.na(nb) || nb <= 0L)
    stop("invalid 'nb'")
  storage.mode(x) <- "double"
  hmax <- 1.144 * sqrt(var(x)) * n^(-1/5)
  Z <- .Call(stats:::C_bw_den, nb, x)
  d <- Z[[1L]]
  cnt <- Z[[2L]]
  fbcv <- function(h) .Call(stats:::C_bw_bcv, n, d, cnt, h)
  # h <- optimize(fbcv, c(lower, upper), tol = tol)$minimum
  # if (h < lower + tol | h > upper - tol)
  #   warning("minimum occurred at one end of the range")
  obj <- sapply(h.grid, function(h) fbcv(h))
  h <- h.grid[which.min(obj)]
  if (plot.cv) {
    plot(h.grid, obj, type = "o")
    rug(h.grid)
    abline(v = h, col = 2, lwd = 2)
  }
  h
}

# Compute the bandwidth and plot the BCV curve
bw.bcv.mod(x = x, plot.cv = TRUE)
## [1] 0.5130493

# We can compare with the default bw.bcv output
abline(v = bw.bcv(x = x), col = 3)

2.4.3 Comparison of bandwidth selectors

We state next some insights from the convergence results of the DPI, LSCV, and BCV selectors. All of them are based on results of the kind

$$n^{\nu}(\hat{h}/h_{\mathrm{MISE}}-1)\stackrel{d}{\longrightarrow}\mathcal{N}(0,\sigma^2),$$

where $\sigma^2$ depends only on $K$ and $f$, and measures how variable the selector is. The rate $n^{\nu}$ quantifies how fast the relative error $\hat{h}/h_{\mathrm{MISE}}-1$ decreases (the larger the $\nu$, the faster the convergence). Under certain regularity conditions, we have:

  • $n^{1/10}(\hat{h}_{\mathrm{LSCV}}/h_{\mathrm{MISE}}-1)\stackrel{d}{\longrightarrow}\mathcal{N}(0,\sigma_{\mathrm{LSCV}}^2)$ and $n^{1/10}(\hat{h}_{\mathrm{BCV}}/h_{\mathrm{MISE}}-1)\stackrel{d}{\longrightarrow}\mathcal{N}(0,\sigma_{\mathrm{BCV}}^2)$. Both selectors have a slow rate of convergence (compare it with the $n^{1/2}$ of the CLT). Inspection of the variances of both selectors reveals that, for the normal kernel, $\sigma_{\mathrm{LSCV}}^2/\sigma_{\mathrm{BCV}}^2\approx15.7$. Therefore, LSCV is considerably more variable than BCV.

  • $n^{5/14}(\hat{h}_{\mathrm{DPI}}/h_{\mathrm{MISE}}-1)\stackrel{d}{\longrightarrow}\mathcal{N}(0,\sigma_{\mathrm{DPI}}^2)$. Thus, the DPI selector has a convergence rate much faster than the cross-validation selectors. There is an appealing explanation for this phenomenon. Recall that $\hat{h}_{\mathrm{BCV}}$ minimizes the slightly modified version of $\mathrm{BCV}(h)$ given by

    $$\frac{1}{4}\mu_2^2(K)\tilde\psi_4(h)h^4+\frac{R(K)}{nh},$$

    where

    $$\tilde\psi_4(h):=\frac{1}{n(n-1)}\sum_{i=1}^n\sum_{\substack{j=1\\j\neq i}}^n(K_h''*K_h'')(X_i-X_j)=\frac{n}{n-1}\tilde{R}(f'').$$

    $\tilde\psi_4$ is a leave-out-diagonals estimate of $\psi_4$. Despite being different from $\hat\psi_4$, it serves to build a DPI analogue of the BCV that points towards the precise fact that drags down the performance of the BCV. The modified version of the DPI minimizes

    $$\frac{1}{4}\mu_2^2(K)\tilde\psi_4(g)h^4+\frac{R(K)}{nh},$$

    where $g$ is independent of $h$. The two methods differ in the way $g$ is chosen: BCV sets $g=h$, whereas the modified DPI looks for the best $g$ in terms of $\mathrm{AMSE}[\tilde\psi_4(g)]$. It can be seen that $g_{\mathrm{AMSE}}=O(n^{-2/13})$, whereas the $h$ used in BCV is asymptotically $O(n^{-1/5})$. This suboptimality in the choice of $g$ is the reason for the asymptotic deficiency of BCV.

We focus now on exploring the empirical performance of bandwidth selectors. The workhorse for doing that is simulation. A popular collection of simulation scenarios was given by Marron and Wand (1992) and is conveniently available through the package nor1mix. They form a collection of normal $r$-mixtures of the form

$$f(x;\boldsymbol\mu,\boldsymbol\sigma,\mathbf{w}):=\sum_{j=1}^rw_j\phi_{\sigma_j}(x-\mu_j),$$

where $w_j\geq0$, $j=1,\ldots,r$, and $\sum_{j=1}^rw_j=1$. Densities of this form are especially attractive since they allow for arbitrary flexibility and, if the normal kernel is employed, they allow for explicit and exact MISE expressions:

$$\mathrm{MISE}_r[\hat{f}(\cdot;h)]=(2\pi^{1/2}nh)^{-1}+\mathbf{w}'\left\{(1-n^{-1})\boldsymbol\Omega_2-2\boldsymbol\Omega_1+\boldsymbol\Omega_0\right\}\mathbf{w},\tag{2.22}$$
$$(\boldsymbol\Omega_a)_{ij}=\phi_{(ah^2+\sigma_i^2+\sigma_j^2)^{1/2}}(\mu_i-\mu_j),\quad i,j=1,\ldots,r.$$
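
For instance, a direct implementation of (2.22) allows approximating $h_{\mathrm{MISE}}$ numerically for any given normal mixture and sample size. The sketch below takes the mixture parameters as explicit vectors (the bimodal mixture employed is an arbitrary choice for illustration); they could equally be extracted from a nor1mix object.

# Exact MISE (2.22) for a normal r-mixture and the normal kernel
mise.exact <- function(h, n, mu, sigma, w) {
  omega <- function(a) {
    sd <- sqrt(a * h^2 + outer(sigma^2, sigma^2, "+"))
    dnorm(outer(mu, mu, "-"), sd = sd)
  }
  1 / (2 * sqrt(pi) * n * h) +
    drop(t(w) %*% ((1 - 1 / n) * omega(2) - 2 * omega(1) + omega(0)) %*% w)
}

# h_MISE for an arbitrary bimodal mixture and n = 100
mu <- c(-1.5, 1.5); sigma <- c(0.75, 0.75); w <- c(0.5, 0.5)
optimize(mise.exact, interval = c(0.01, 2), n = 100,
         mu = mu, sigma = sigma, w = w)$minimum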

# Load package
library(nor1mix)

# Available models
?MarronWand

# Simulating
samp <- rnorMix(n = 500, obj = MW.nm9) # MW object in the second argument
hist(samp, freq = FALSE)

# Density evaluation
x <- seq(-4, 4, length.out = 400)
lines(x, dnorMix(x = x, obj = MW.nm9), col = 2)


# Plot a MW object directly
# A normal with the same mean and variance is plotted in dashed lines
par(mfrow = c(2, 2))
plot(MW.nm5)
plot(MW.nm7)
plot(MW.nm10)
plot(MW.nm12)
lines(MW.nm10) # Also possible

Figure 2.4 presents a visualization of the performance of the kde with different bandwidth selectors, carried out in the family of mixtures of Marron and Wand (1992).

Figure 2.4: Performance comparison of bandwidth selectors. The RT, DPI, LSCV, and BCV selectors are computed for samples drawn from a normal mixture density. For each sample, the ISEs of the selectors are computed and sorted from best to worst. Changing the scenarios gives insight into the adequacy of each selector to hard- and simple-to-estimate densities.

References

Devroye, L., and L. Györfi. 1985. Nonparametric Density Estimation. Wiley Series in Probability and Mathematical Statistics: Tracts on Probability and Statistics. New York: John Wiley & Sons, Inc.
Marron, J. S., and M. P. Wand. 1992. “Exact Mean Integrated Squared Error.” Ann. Statist. 20 (2): 712–36.
Scott, D. W. 2015. Multivariate Density Estimation. Second. Wiley Series in Probability and Statistics. Hoboken: John Wiley & Sons, Inc.
Scott, D. W., and G. R. Terrell. 1987. “Biased and Unbiased Cross-Validation in Density Estimation.” J. Am. Stat. Assoc. 82 (400): 1131–46.
Sheather, S. J., and M. C. Jones. 1991. “A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation.” J. Roy. Statist. Soc. Ser. B 53 (3): 683–90.
Silverman, B. W. 1986. Density Estimation for Statistics and Data Analysis. Monographs on Statistics and Applied Probability. London: Chapman & Hall.
Wand, M. P., and M. C. Jones. 1995. Kernel Smoothing. Vol. 60. Monographs on Statistics and Applied Probability. London: Chapman & Hall, Ltd. https://doi.org/10.1007/978-1-4899-4493-1.