2.5 Practical issues
In this section we discuss several practical issues for kernel density estimation.
2.5.1 Boundary issues and transformations
In Section 2.3 we assumed certain regularity conditions for Assumption A1 stated that should be twice continuously differentiable (on !). It is simple to think a counterexample to that: take any density with a support with boundary, for example a in as seen below. The kde will run into trouble!
# Sample from a LN(0, 1)
set.seed(123456)
samp <- rlnorm(n = 500)
# kde and density
plot(density(samp), ylim = c(0, 0.8))
curve(dlnorm(x), from = -2, to = 10, n = 500, col = 2, add = TRUE)
rug(samp)
What is happening is clear: the kde is spreading probability mass outside the support of because the kernels are functions defined in Since the kde places probability mass at negative values, it takes it from the positive side, resulting in a severe negative bias about the right-hand side of As a consequence, the kde does not integrate one in the support of the data. No matter what the sample size considered is, the kde will always have a negative bias of at the boundary, instead of the standard
A simple approach to deal with the boundary bias is to map a non-real-supported density into a real-supported density which is simpler to estimate, by means of a transformation :
The transformation kde is obtained by replacing with the usual kde, but acting on the transformed data :
Note that is in the scale of not Hence, another benefit of this approach, in addition to its simplicity, is that bandwidth selection can be done transparently50 in terms of the already discussed bandwidth selectors: select a bandwidth from and use it in (2.34). A table with some common transformations51 is the following.
Transform. | Data in | ||
---|---|---|---|
Log | |||
Probit | |||
Shifted power | (data skewed) | if | if |
Figure 2.12: Construction of the transformation kde for the log and probit transformations. The left panel shows the kde (2.7) applied to the transformed data. The right plot shows the transformed kde (2.34) applied to the original data. Application available here.
The code below illustrates how to compute a transformation kde in practice from density
.
# kde with log-transformed data
kde <- density(log(samp))
plot(kde, main = "Kde of transformed data")
rug(log(samp))
# Careful: kde$x is in the reals!
range(kde$x)
## [1] -4.542984 3.456035
# Untransform kde$x so the grid is in (0, infty)
kde_transf <- kde
kde_transf$x <- exp(kde_transf$x)
# Transform the density using the chain rule
kde_transf$y <- kde_transf$y * 1 / kde_transf$x
# Transformed kde
plot(kde_transf, main = "Transformed kde", xlim = c(0, 15))
curve(dlnorm(x), col = 2, add = TRUE, n = 500)
rug(samp)
Exercise 2.20 Consider the data given by set.seed(12345); x <- rbeta(n = 500, shape1 = 2, shape2 = 2)
. Compute:
- The untransformed kde employing the DPI and LSCV selectors. Overlay the true density.
- The transformed kde employing a probit transformation and using the DPI and LSCV selectors on the transformed data.
2.5.2 Sampling
Sampling a kde is relatively straightforward. The trick is to recall
as a mixture density made of mixture components in which each of them is sampled independently. The only part that might require special treatment is sampling from the density although R contains specific sampling functions for the most popular kernels.
The algorithm for sampling points goes as follows:
- Choose at random.
- Sample from
- Repeat the previous steps times.
Let’s see a quick example.
# Sample the claw
n <- 100
set.seed(23456)
samp <- nor1mix::rnorMix(n = n, obj = nor1mix::MW.nm10)
# Kde
h <- 0.1
plot(density(samp, bw = h), main = "", col = 4)
# # Naive sampling algorithm
# N <- 1e6
# samp_kde <- numeric(N)
# for (k in 1:N) {
#
# i <- sample(x = 1:n, size = 1)
# samp_kde[k] <- rnorm(n = 1, mean = samp[i], sd = h)
#
# }
# Sample N points from the kde
N <- 1e6
i <- sample(x = n, size = N, replace = TRUE)
samp_kde <- rnorm(N, mean = samp[i], sd = h)
# Add kde of the sampled kde -- almost equal
lines(density(samp_kde), col = 3)
legend("topright", legend = c("Kde", "Kde of sampled kde"),
lwd = 2, col = 4:3)
Exercise 2.23 Sample data points from the kde of iris$Petal.Width
that is computed with the NS selector.
Exercise 2.24 The dataset sunspot.year
contains the yearly numbers of sunspots from 1700 to 1988 (rounded to one digit). Employing a log-transformed kde with DPI bandwidth, sample new sunspots observations. Check by simulation that the sampling is done appropriately by comparing the log-transformed kde of the sampled data with the original kde.