3.5 Applications of kernel density estimation

Once we are able to adequately estimate the multivariate density \(f\) of a random vector \(\mathbf{X}\) by \(\hat{f}(\cdot;\mathbf{H})\), we can perform a series of interesting applications that go beyond the mere visualization and graphical description of the estimated density. These applications are intimately related to multivariate analysis methods rooted in the normality assumption:

  • Density level set estimation. The level set of the density \(f\) at level \(c\geq0\) is defined as \(\mathcal{L}(f;c):=\{\mathbf{x}\in\mathbb{R}^p: f(\mathbf{x})\geq c\}\). An estimation of \(\mathcal{L}(f;c)\) is useful to visualize the highest density regions, which give a concise view of the most likely values for \(\mathbf{X}\) and therefore summarize its structure. In addition, this summary is produced without needing to restrict to connected sets, as the box plot does, and with the additional benefit of determining the approximate probability contained in them. Level sets are also useful to estimate the support of \(\mathbf{X}\) and to detect multivariate outliers without the need to employ the Mahalanobis distance.

  • Clustering or unsupervised learning. The aim of clustering techniques, such as hierarchical clustering or \(k\)-means, is to find homogeneous clusters (subgroups) within the sample (group). Despite their usefulness, these two techniques have important drawbacks: they do not automatically determine the number of clusters, and they lack a simple interpretation in terms of the population, relying entirely on the sample for their construction and performance evaluation. Based on a density \(f\), the population clusters of the domain of \(\mathbf{X}\) can be precisely defined as the “domains of attraction of the density modes.” The mean shift clustering algorithm replaces \(f\) with \(\hat{f}(\cdot;\mathbf{H})\) and gives a neat clustering recipe that automatically determines the number of clusters.

  • Classification or supervised learning. Suppose \(\mathbf{X}\) has a categorical random variable \(Y\) as a companion, the labels of which indicate several classes within the population \(\mathbf{X}\). Then, the task of assigning a point \(\mathbf{x}\in\mathbb{R}^p\) to a class of \(Y\), based on a sample \((\mathbf{X}_1,Y_1),\ldots,(\mathbf{X}_n,Y_n)\), is a classification problem. Classic multivariate techniques such as Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) tackle this task by fitting a normal to each class, say \(\mathcal{N}_p(\boldsymbol{\mu}_1,\boldsymbol{\Sigma}_1)\) and \(\mathcal{N}_p(\boldsymbol{\mu}_2,\boldsymbol{\Sigma}_2)\), and then assigning \(\mathbf{x}\in\mathbb{R}^p\) to the class that maximizes \(\phi_{\hat{\boldsymbol{\Sigma}}_i}(\mathbf{x}-\hat{\boldsymbol{\mu}}_i)\), \(i=1,2\). The very same principle can be followed by replacing the estimated normal densities with the kdes of each class (a minimal sketch of this idea is given right after this list).

  • Description of the main features of the data. Principal Component Analysis (PCA) is a well-known linear dimension-reduction technique that is closely related to the normal distribution. PCA helps to describe the main features of a dataset by estimating the principal directions in \(\mathbb{R}^p\), \(p\geq2\), of maximum projected variance of \(\mathbf{X}\). These principal directions are closely related to the density ridges of \(\phi_{\boldsymbol{\Sigma}}(\cdot-\boldsymbol{\mu})\), a concept that generates flexible principal curves for any density \(f\). As a consequence, the density ridges of \(\hat{f}(\cdot;\mathbf{H})\) give a flexible non-linear description of the structure of the data.
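
As an illustration of the kde-based classification idea in the list above, the following minimal sketch builds one kde per class of iris$Species (using only the first two variables of iris) and assigns a hypothetical new observation x0 to the class with the largest prior-weighted kde. It is only a sketch of the principle; a full implementation of kernel discriminant analysis is available in ks::kda.

# Sketch of kde-based classification with the first two variables of iris.
# The point x0 is a hypothetical new observation
classes <- levels(iris$Species)
x0 <- c(6.5, 3.0)
dens <- sapply(classes, function(lev) {
  x_lev <- as.matrix(iris[iris$Species == lev, 1:2])
  ks::kde(x = x_lev, H = ks::Hpi(x = x_lev), eval.points = rbind(x0))$estimate
})
priors <- table(iris$Species) / nrow(iris)
classes[which.max(priors * dens)] # Class with the largest prior-weighted kde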

Next, we review the details associated with these applications, which are excellently described in Chapters 6 and 7 of Chacón and Duong (2018).

3.5.1 Level set estimation

The estimation of the level set \(\mathcal{L}(f;c)=\{\mathbf{x}\in\mathbb{R}^p: f(\mathbf{x})\geq c\}\), useful to determine high-density regions, can be straightforwardly done by plugging the kde of \(f\) and considering

\[\begin{align*} \mathcal{L}(\hat{f}(\cdot;\mathbf{H});c) = \{\mathbf{x}\in\mathbb{R}^p:\hat{f}(\mathbf{x};\mathbf{H})\geq c\}. \end{align*}\]

Obtaining the representation of \(\mathcal{L}(\hat{f}(\cdot;\mathbf{H});c)\) in practice involves considering a grid in \(\mathbb{R}^p\) in order to evaluate the condition \(\hat{f}(\mathbf{x};\mathbf{H})\geq c\) and to determine the region of \(\mathbb{R}^p\) in which it is satisfied.

We illustrate the computation of \(\mathcal{L}(\hat{f}(\cdot;\mathbf{H});c)\) for a couple of simulated examples in \(\mathbb{R}\). Notice, in the code below, how the function kde_level_set automatically deals with the annoyance that \(\mathcal{L}(\hat{f}(\cdot;\mathbf{H});c)\) may not be a connected region, that is, that \(\mathcal{L}(\hat{f}(\cdot;\mathbf{H});c)=\bigcup_{i=1}^k[a_i,b_i]\) with \(k\) unknown beforehand.

# Simulated sample
n <- 100
set.seed(12345)
samp <- rnorm(n = n)

# Kde as usual, but force to evaluate it at seq(-4, 4, length = 4096)
bw <- bw.nrd(x = samp)
kde <- density(x = samp, bw = bw, n = 4096, from = -4, to = 4)

# For a given c, what is the theoretical level set? Since we know that the
# real density is symmetric and unimodal, then the level set is an interval
# of the form [-x_c, x_c]
c <- 0.2
x_c <- tryCatch(uniroot(function(x) dnorm(x) - c, lower = 0, upper = 4)$root,
                error = function(e) NA)

# Show theoretical level set
x <- seq(-4, 4, by = 0.01)
plot(x, dnorm(x), type = "l", ylim = c(0, 0.5), ylab = "Density")
rug(samp)
polygon(x = c(-x_c, -x_c, x_c, x_c), y = c(0, c, c, 0),
        col = rgb(0, 0, 0, alpha = 0.5), density = 10)

# Function to compute and plot a kde level set. Observe that kde stands for an
# object containing the output of density(), although obvious modifications
# could be done to the function to receive a ks::kde object
# as the main argument
kde_level_set <- function(kde, c, add_plot = FALSE, ...) {

  # Begin and end index for the potentially many intervals in the level sets
  # of the kde
  kde_larger_c <- kde$y >= c
  run_length_kde <- rle(kde_larger_c) # Trick to compute the length of the
  # sequence of TRUEs that indicates an interval for which kde$y >= c
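  # Note: this assumes kde$y < c at the first and last grid points (i.e., the
  # level set does not touch the boundary of the evaluation grid); otherwise,
  # begin and end would be misaligned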
  begin <- which(diff(kde_larger_c) > 0) # Trick to search for the beginning
  # of each of the intervals
  end <- begin + run_length_kde$lengths[run_length_kde$values] - 1 # Compute
  # the end of the intervals from begin + length

  # Add polygons to a density plot? If so, ... are the additional parameters
  # for polygon()
  if (add_plot) {

    apply(cbind(begin, end), 1, function(ind) {
      polygon(x = c(kde$x[ind[1]], kde$x[ind[1]],
                    kde$x[ind[2]], kde$x[ind[2]]),
              y = c(0, kde$y[ind[1]],
                    kde$y[ind[2]], 0), ...)
      })

  }

  # Return the [a_i, b_i], i = 1, ..., K in the K rows
  return(cbind(kde$x[begin], kde$x[end]))

}

# Add kde and level set
lines(kde, col = 2)
kde_level_set(kde = kde, c = c, add_plot = TRUE,
              col = rgb(1, 0, 0, alpha = 0.5))
##           [,1]     [,2]
## [1,] -1.018803 1.442735
abline(h = c, col = 4) # Level
legend("topright", legend = c("True density", "Kde", "True level set",
                              "Kde level set", "Level c"),
       lwd = 2, col = c(1, 2, rgb(0:1, 0, 0, alpha = 0.5), 4))

Figure 3.4: Level set \(\mathcal{L}(f;c)\) and its estimation by \(\mathcal{L}(\hat{f}(\cdot;h);c)\) for \(c=0.2\) and \(f=\phi\).

The following code chunk illustrates how changing the bandwidth \(h\) and the level of the set \(c\) affects the connectedness of \(\mathcal{L}(\hat{f}(\cdot;h);c)\).

# Simulated sample
n <- 100
set.seed(12345)
samp <- rnorm(n = n)

# Interactive visualization
x <- seq(-4, 4, by = 0.01)
manipulate::manipulate({

  # Show theoretical level set
  plot(x, dnorm(x), type = "l", ylim = c(0, 0.5), ylab = "Density")
  rug(samp)
  x_c <- tryCatch(uniroot(function(x) dnorm(x) - c, lower = 0, upper = 4)$root,
                  error = function(e) NA) # tryCatch() to bypass errors
  polygon(x = c(-x_c, -x_c, x_c, x_c), y = c(0, c, c, 0),
          col = rgb(0, 0, 0, alpha = 0.5), density = 10)

  # Add estimation
  kde <- density(x = samp, bw = bw, n = 1e5, from = -4, to = 4)
  lines(kde, col = 2)
  kde_level_set(kde = kde, c = c, add_plot = TRUE,
                col = rgb(1, 0, 0, alpha = 0.5))
  abline(h = c, col = 4) # Level
  legend("topright", legend = c("True density", "Kde", "True level set",
                                "Kde level set", "Level c"),
         lwd = 2, col = c(1, 2, rgb(0:1, 0, 0, alpha = 0.5), 4))

}, c = manipulate::slider(min = 0.01, max = 0.5, initial = 0.2, step = 0.01),
bw = manipulate::slider(min = 0.01, max = 1, initial = 0.25, step = 0.01))

Exercise 3.12 Consider the bimodal density \(f\) given in nor1mix::MW.nm6. Consider \(c=0.15,0.25\).

  • Compute \(\mathcal{L}(f;c)\).
  • From a sample of size \(n=200\), compute \(\mathcal{L}(\hat{f}(\cdot;\hat{h}_\mathrm{DPI});c)\).

Highest density regions for a given probability

The level \(c\) in \(\mathcal{L}(f;c)=\{\mathbf{x}\in\mathbb{R}^p: f(\mathbf{x})\geq c\}\) may be difficult to interpret: its effective scale depends on the dimension \(p\) of the random vector \(\mathbf{X}\) and on the units in which its components are measured. A more convenient parametrization of the level set is attained if we specify the probability it is intended to contain, rather than the minimum value of the density attained inside the region. For that purpose, one usually considers the largest \(c_\alpha\) such that

\[\begin{align*} \int_{\mathcal{L}(f;c_\alpha)}f(\mathbf{x})\,\mathrm{d}\mathbf{x}\geq 1-\alpha, \quad \alpha\in(0,1). \end{align*}\]

This formulation raises a key interpretation of \(\mathcal{L}(f;c_\alpha)\):

\(\mathcal{L}(f;c_\alpha)\) is the smallest region of \(\mathbb{R}^p\) that contains at least \(1-\alpha\) of the probability of \(\mathbf{X}\).

As defined, \(c_\alpha\) depends on \(f\), which is of course unknown. However, it can be estimated in a very elegant and computationally efficient way. To see it, first recall that

\[\begin{align} \int_{\mathcal{L}(f;c_\alpha)}f(\mathbf{x})\,\mathrm{d}\mathbf{x}&=\mathbb{P}[\mathbf{X}\in\mathcal{L}(f;c_\alpha)]\nonumber\\ &=\mathbb{P}[f(\mathbf{X})\geq c_\alpha]\geq1-\alpha,\tag{3.19} \end{align}\]

which amounts to finding the largest \(c_\alpha\) such that \(\mathbb{P}[f(\mathbf{X})\leq c_\alpha]\leq \alpha\). Therefore,

\(c_\alpha\) is precisely the lower \(\alpha\)-quantile of the random variable \(f(\mathbf{X})\)!

From this insight, it follows that, if we knew \(f\), then from the sample \(\mathbf{X}_1,\ldots,\mathbf{X}_n\) we could estimate \(c_\alpha\) by the (lower) sample \(\alpha\)-quantile of \(f(\mathbf{X}_1),\ldots,f(\mathbf{X}_n)\). Since this is not the case, we can opt to replace \(f\) with the kde and define

\(\hat{c}_\alpha\) as the sample \(\alpha\)-quantile of \(\hat{f}(\mathbf{X}_1;\mathbf{H}),\ldots,\hat{f}(\mathbf{X}_n;\mathbf{H})\).

Then, the estimator of \(\mathcal{L}(f;c_\alpha)\), the highest density region that accumulates \(1-\alpha\) of the probability, follows by using the kde twice: once for obtaining \(\hat{c}_\alpha\), and again for replacing \(f\) with \(\hat{f}(\cdot;\mathbf{H})\). The result is \(\mathcal{L}(\hat{f}(\cdot;\mathbf{H});\hat{c}_\alpha)\).

The following code illustrates how to estimate \(\mathcal{L}(f;c_\alpha)\) by \(\mathcal{L}(\hat{f}(\cdot;\mathbf{H});\hat{c}_\alpha)\). Since we need to evaluate the kde at \(\mathbf{X}_1,\ldots,\mathbf{X}_n\), we employ ks::kde instead of density, as the latter only allows evaluating the kde at a uniformly spaced grid.

# Simulate sample
n <- 200
set.seed(12345)
samp <- rnorm(n = n)

# We want to estimate the highest density region containing 0.75 probability
alpha <- 0.25

# For the N(0, 1), we know that this region is the interval [-x_c, x_c] with
x_c <- qnorm(1 - alpha / 2)
c_alpha <- dnorm(x_c)
c_alpha
## [1] 0.2058535
# This corresponds to the c_alpha

# Theoretical level set
x <- seq(-4, 4, by = 0.01)
plot(x, dnorm(x), type = "l", ylim = c(0, 0.5), ylab = "Density")
rug(samp)
polygon(x = c(-x_c, -x_c, x_c, x_c), y = c(0, c_alpha, c_alpha, 0),
        col = rgb(0, 0, 0, alpha = 0.5), density = 10)
abline(h = c_alpha, col = 3, lty = 2) # Level

# Kde
bw <- bw.nrd(x = samp)
c_alpha_hat <- quantile(ks::kde(x = samp, h = bw, eval.points = samp)$estimate,
                        probs = alpha)
c_alpha_hat
##       25% 
## 0.1838304
kde <- density(x = samp, bw = bw, n = 4096, from = -4, to = 4)
lines(kde, col = 2)
kde_level_set(kde = kde, c = c_alpha_hat, add_plot = TRUE,
              col = rgb(1, 0, 0, alpha = 0.5))
##         [,1]     [,2]
## [1,] -1.2337 1.378266
abline(h = c_alpha_hat, col = 4, lty = 2) # Level
legend("topright", legend = expression("True density", "Kde", "True level set",
                                       "Kde level set", "Level " * c[alpha],
                                       "Level " * hat(c)[alpha]),
       lwd = 2, col = c(1, 2, rgb(0:1, 0, 0, alpha = 0.5), 3:4),
       lty = c(rep(1, 4), rep(2, 4)))

Figure 3.5: Level set \(\mathcal{L}(f;c_\alpha)\) and its estimation by \(\mathcal{L}(\hat{f}(\cdot;h);\hat{c}_\alpha)\) for \(\alpha=0.25\) and \(f=\phi\).

Observe that equation (3.19) points towards a simple way of approximating the multidimensional integral \(\int_{\mathcal{L}(f;c)}f(\mathbf{x})\,\mathrm{d}\mathbf{x}\) through a sample \(\mathbf{X}_1,\ldots,\mathbf{X}_n\) of \(f\):

\[\begin{align} \int_{\mathcal{L}(f;c)}f(\mathbf{x})\,\mathrm{d}\mathbf{x}=\mathbb{P}[f(\mathbf{X})\geq c]\approx \frac{1}{n}\sum_{i=1}^n1_{\{f(\mathbf{X}_i)\geq c\}}.\tag{3.20} \end{align}\]

This Monte Carlo method has the important advantage of avoiding an expensive numerical integration. This is precisely the same advantage behind the approximation of \(c_\alpha\) (essentially, such that \(\int_{\mathcal{L}(f;c_\alpha)}f(\mathbf{x})\,\mathrm{d}\mathbf{x}=1-\alpha\) for \(\alpha\in(0,1)\)) by the \(\alpha\)-quantile of \(f(\mathbf{X}_1),\ldots,f(\mathbf{X}_n)\).

The application of (3.20) is illustrated below.

# N(0, 1) case
alpha <- 0.3
x_c <- qnorm(1 - alpha / 2)
c_alpha <- dnorm(x_c)
c_alpha
## [1] 0.2331588

# Approximates c_alpha
quantile(dnorm(samp), probs = alpha)
##       30% 
## 0.2043856

# True integral: 1 - alpha (by construction)
1 - 2 * pnorm(-x_c)
## [1] 0.7

# Monte Carlo integration, approximates 1 - alpha
mean(dnorm(samp) >= c_alpha)
## [1] 0.655

Exercise 3.13 Consider the kde for the faithful$eruptions dataset with a DPI bandwidth.

  • What is the estimation of the shortest set that contains \(50\%\) of the probability? And \(75\%\)?
  • Compare the results with those obtained from the application of a box plot. Which graphical analysis do you think is more informative?
Exercise 3.14 Repeat Exercise 3.13 with the data airquality$Wind.

Exercise 3.15 Section 2F in Tukey (1977) presents “the Rayleigh example” as an illustration of the weakness of box plots (referred to as “schematic plots” in Tukey’s terminology) for analyzing bimodal data. The weights of nitrogen obtained by Lord Rayleigh (table on page 49 of Tukey (1977)) are:

x <- c(2.30143, 2.29816, 2.30182, 2.29890, 2.31017,
       2.30986, 2.31010, 2.31001, 2.29889, 2.29940,
       2.29849, 2.29889, 2.31024, 2.31030, 2.31028)
  1. Obtain the boxplot of x. Do you see any interesting insight?
  2. Compute \(\hat h_{\mathrm{LSCV}}\) (beware of local minima) and draw a kde based on this bandwidth.
  3. Estimate and plot the shortest sets containing \(50\%\) and \(75\%\) of the probability. What are your insights?

Exercise 3.16 Repeat Exercise 3.12 but now considering \(\alpha=0.25,0.50\) and obtaining the corresponding \(c_\alpha\) and \(\hat{c}_\alpha\). Compute \(c_\alpha\) by Monte Carlo from (3.20) using \(M=10,000\) replications.

Exercise 3.17 Adapt the function kde_level_set to:

  1. Receive as main arguments a sample x and a bandwidth h, instead of a density object.
  2. Use internally ks::kde instead of density.
  3. Receive an \(\alpha\), instead of \(c\), and internally determine \(\hat c_\alpha\).
  4. Retain the functionality of kde_level_set and return also c_alpha_hat.
Call this new function kde_level_set2 and validate it with the examples seen so far.

Bivariate and trivariate level sets

The computation of bivariate and trivariate level sets is conceptually similar, but their parametrization and graphical display are more challenging. In \(\mathbb{R}^2\), the level sets are simply obtained from the contour levels of \(\hat{f}(\cdot;\mathbf{H})\), which is done easily with the ks package.

# Simulated sample from a mixture of normals
n <- 200
set.seed(123456)
mu <- c(2, 2)
Sigma1 <- diag(c(2, 2))
Sigma2 <- diag(c(1, 1))
samp <- rbind(mvtnorm::rmvnorm(n = n / 2, mean = mu, sigma = Sigma1),
              mvtnorm::rmvnorm(n = n / 2, mean = -mu, sigma = Sigma2))

# Level set of the true density at levels c
c <- c(0.01, 0.03)
x <- seq(-5, 5, by = 0.1)
xx <- as.matrix(expand.grid(x, x))
contour(x, x, 0.5 * matrix(mvtnorm::dmvnorm(xx, mean = mu, sigma = Sigma1) +
                             mvtnorm::dmvnorm(xx, mean = -mu, sigma = Sigma2),
                           nrow = length(x), ncol = length(x)),
        levels = c)

# Plot of the contour level
H <- ks::Hpi(x = samp)
kde <- ks::kde(x = samp, H = H)
plot(kde, display = "slice", abs.cont = c, add = TRUE, col = 4) # Argument
# "abs.cont" for specifying c rather than (1 - alpha) * 100 in "cont"
legend("topleft", lwd = 2, col = c(1, 4),
       legend = expression(L * "(" * f * ";" * c * ")",
                           L * "(" * hat(f) * "(" %.% ";" * H * "), " * c *")"))


# Computation of the probability accumulated in the level sets by numerical
# integration
ks::contourSizes(kde, abs.cont = c)
## [1] 35.865815  7.321872

Exercise 3.18 Compute the level set containing \(50\%\) of the data of the following bivariate datasets: faithful, data(unicef, package = "ks"), and iris[, 1:2].

In \(\mathbb{R}^3\), the level sets are more challenging to visualize, since doing so requires combining three-dimensional contours with the use of transparencies.

# Simulate a sample from a mixture of normals
n <- 5e2
set.seed(123456)
mu <- c(2, 2, 2)
Sigma1 <- rbind(c(1, 0.5, 0.2),
                c(0.5, 1, 0.5),
                c(0.2, 0.5, 1))
Sigma2 <- rbind(c(1, 0.25, -0.5),
                c(0.25, 1, 0.5),
                c(-0.5, 0.5, 1))
samp <- rbind(mvtnorm::rmvnorm(n = n / 2, mean = mu, sigma = Sigma1),
              mvtnorm::rmvnorm(n = n / 2, mean = -mu, sigma = Sigma2))

# Plot of the contour level, changing the color palette
H <- ks::Hns(x = samp)
kde <- ks::kde(x = samp, H = H)
plot(kde, cont = 100 * c(0.99, 0.95, 0.5), col.fun = viridis::viridis,
     drawpoints = TRUE, col.pt = 1, theta = 20, phi = 20)

# Simulate a large sample from a single normal
n <- 5e4
set.seed(123456)
mu <- c(0, 0, 0)
Sigma <- rbind(c(1, 0.5, 0.2),
               c(0.5, 1, 0.5),
               c(0.2, 0.5, 1))
samp <- mvtnorm::rmvnorm(n = n, mean = mu, sigma = Sigma)

# Plot of the contour level
H <- ks::Hns(x = samp)
kde <- ks::kde(x = samp, H = H)
plot(kde, cont = 100 * c(0.75, 0.5, 0.25), xlim = c(-2.5, 2.5),
     ylim = c(-2.5, 2.5), zlim = c(-2.5, 2.5))

Support estimation and outlier detection

In addition to the highest density regions that accumulate \(1-\alpha\) of the probability, for a relatively large \(\alpha>0.5\), it is also interesting to set \(\alpha\approx0\), as this gives an estimation of the effective support (excluding \(\alpha\) probability) of \(f\). The estimated effective support of \(\mathbf{X}\) gives valuable graphical insight into the structure of \(\mathbf{X}\) and is helpful, for example, to determine the plausible region for data points coming from \(f\) and then to flag as outliers those observations that fall outside the region, in arbitrary dimension.

The following code chunk shows how to compute estimates of the support in \(\mathbb{R}^2\) via ks::ksupp.

# Compute kde of unicef dataset
data(unicef, package = "ks")
kde <- ks::kde(x = unicef)

# ks::ksupp evaluates whether the points in the grid spanned by ks::kde belong
# to the level set for alpha = 0.05 and then returns the points that belong to
# the level set
sup <- ks::ksupp(fhat = kde, cont = 95) # Effective support, excluding 5% of data
plot(sup)

When dealing with sets defined by points in \(\mathbb{R}^p\), it is useful to consider the convex hull of the set of points, which is defined as the smallest convex set that contains all the points. The convex hull in \(\mathbb{R}^2\) can be computed with chull().

# The convex hull boundary of the level set can be computed with chull()
# It returns the indexes of the points passed that form the corners of the
# polygon of the convex hull
ch <- chull(sup)
plot(sup)
# One extra point for closing the polygon
lines(sup[c(ch, ch[1]), ], col = 2, lwd = 2)

The convex hull may be seen as a conservative estimate of the effective support, since it may enlarge the level set estimation considerably. Another disadvantage is that it merges the individual components of unconnected supports. However, convex hulls offer interesting advantages: they can be computed in \(\mathbb{R}^p\) as convex polyhedra, and it is then possible to check whether a given point belongs to them. Thus, they provide a handle to parametrize level sets of unknown form in \(\mathbb{R}^p\).

These two functionalities are available in the geometry package and have an interesting application: rather than evaluating whether \(\mathbf{x}\in\mathcal{L}(\hat{f}(\cdot;\mathbf{H});c)\), it is possible to simply check whether \(\mathbf{x}\in\mathrm{conv}(\{\mathbf{x}_1,\ldots,\mathbf{x}_N\})\), where \(\mathrm{conv}\) denotes the convex hull operator and \(\{\mathbf{x}_1,\ldots,\mathbf{x}_N\}\) is a collection of points that roughly determines \(\mathcal{L}(\hat{f}(\cdot;\mathbf{H});c)\): for example, the output of ks::ksupp in \(\mathbb{R}^2\), or some points that belong to \(\mathcal{L}(\hat{f}(\cdot;\mathbf{H});c)\) for \(\mathbb{R}^p\), \(p\geq2\).

# Compute the convex hull of sup via geometry::convhulln()
C <- geometry::convhulln(p = sup)
# The output of geometry::convhulln() is different from chull()

# geometry::inhulln() allows checking whether points lie inside the convex hull
geometry::inhulln(ch = C, p = rbind(c(50, 50), c(150, 50)))
## [1] FALSE  TRUE

# The convex hull works as well in R^p. An example in which the level set is
# evaluated by Monte Carlo and then the convex hull of the points in the level
# set is computed

# Sample
set.seed(2134)
samp <- mvtnorm::rmvnorm(n = 1e2, mean = rep(0, 3))

# Evaluation sample: random data in [-3, 3]^3
M <- 1e3
eval_set <- matrix(runif(n = 3 * M, -3, 3), M, 3)

# Kde of samp, evaluated at eval_set
H <- ks::Hns.diag(samp)
kde <- ks::kde(x = samp, H = H, eval.points = eval_set)

# Convex hull of points in the level set for a given c
c <- 0.01
C <- geometry::convhulln(p = eval_set[kde$estimate > c, ])

# We can test if a new point belongs to the level set by just checking if
# it belongs to the convex hull, which is much more efficient as it avoids
# re-evaluating the kde
new_points <- rbind(c(1, 1, 1), c(2, 2, 2))
geometry::inhulln(ch = C, p = new_points)
## [1]  TRUE FALSE
ks::kde(x = samp, H = H, eval.points = new_points)$estimate > c
## [1]  TRUE FALSE

# # Performance evaluation
# microbenchmark::microbenchmark(
#   geometry::inhulln(ch = C, p = new_points),
#   ks::kde(x = samp, H = H, eval.points = new_points)$estimate > c)

Detecting multivariate outliers in \(\mathbb{R}^p\) is more challenging than in \(\mathbb{R}\), and, unfortunately, the mere inspection of the marginals in search of extreme values is not enough for successful outlier detection. A simple example in \(\mathbb{R}^2\) is given in Figure 3.6.


Figure 3.6: An outlier (green point) in \(\mathbb{R}^2\) that is not an outlier in any of the two marginals. The contour levels represent the joint density of a \(\mathcal{N}_2(\mathbf{0},\boldsymbol{\Sigma})\), where \(\boldsymbol{\Sigma}\) is such that \(\sigma^2_{11}=\sigma^2_{22}=1\) and \(\sigma_{12}=0.8\). The black points are sampled from that normal. The red and blue lines represent the \(1-\alpha\) probability intervals \((-z_{\alpha/2},z_{\alpha/2})\) for \(\alpha=0.01\).
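
The situation of Figure 3.6 can be sketched numerically with a few lines of code. The correlation \(\sigma_{12}=0.8\) and \(\alpha=0.01\) follow the caption, whereas the outlying point \((2,-2)'\) and the sample size are illustrative choices.

# A point that is unremarkable in both marginals but atypical for the joint
# density of a N_2(0, Sigma) with sigma_12 = 0.8
Sigma_out <- rbind(c(1, 0.8), c(0.8, 1))
set.seed(12345)
samp_out <- mvtnorm::rmvnorm(n = 200, mean = c(0, 0), sigma = Sigma_out)
x_out <- c(2, -2)
alpha <- 0.01
z <- qnorm(1 - alpha / 2)
abs(x_out) < z # TRUE TRUE: inside both marginal (1 - alpha) intervals
mvtnorm::dmvnorm(x = x_out, sigma = Sigma_out) # Very low joint density...
min(mvtnorm::dmvnorm(x = samp_out, sigma = Sigma_out)) # ...versus the sample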

The next exercise shows an application of a level set estimation to outlier detection.

Exercise 3.19 Megaloblastic anemia is a kind of anemia that produces abnormally large red blood cells. Among others, indicators of megaloblastic anemia are low levels of: vitamin B12 (reference values: \(178-1,100\) pg/ml), Folic Acid (FA; reference values: \(>5.4\) ng), Mean Corpuscular Volume (MCV; reference values: \(78-100\) fL), Hemoglobin (H; reference values: \(13-17\) g/dL), and Iron (I; reference values: \(60-160\) \(\mu\)g/dL). However, none can diagnose megaloblastic anemia by itself.

The dataset healthy-patients.txt contains realistically simulated data from \(n_1=5,835\) blood analyses that include measurements for \(\mathbf{X}=(\mathrm{B12}, \mathrm{FA}, \mathrm{MCV}, \mathrm{H}, \mathrm{I})\), that is, a sample \(\mathbf{X}_1,\ldots,\mathbf{X}_{n_1}\). On the other hand, the dataset new-patients.txt contains observations of \(\mathbf{X}\) for \(n_2=117\) new patients, for which the presence of megaloblastic anemia is unknown. This sample is denoted by \(\tilde{\mathbf{X}}_{1},\ldots,\tilde{\mathbf{X}}_{n_2}\).

Our aim is to identify outliers in this second dataset with respect to the first, in order to flag potential manifestations of megaloblastic anemia. To do so, we proceed with the following steps:

  1. Obtain the normal scale bandwidth \(\hat{\mathbf{H}}_\mathrm{NS}\) for the sample \(\mathbf{X}_1,\ldots,\mathbf{X}_{n_1}\).
  2. Obtain the level \(\hat{c}_\alpha\) such that \(0.995\) of the estimated probability of \(\mathbf{X}\) is contained in \(\mathcal{L}(\hat{f}(\cdot;\mathbf{H});\hat{c}_\alpha)\). Hint: recall how \(\hat{c}_\alpha\) was defined.
  3. Evaluate the kde \(\hat{f}_i(\tilde{\mathbf{X}}_i;\hat{\mathbf{H}}_\mathrm{NS})\), \(i=1,\ldots,n_2\), for the new sample \(\tilde{\mathbf{X}}_1,\ldots,\tilde{\mathbf{X}}_{n_2}\), where \(\hat{f}_i(\cdot;\hat{\mathbf{H}}_\mathrm{NS})\) stands for the kde of \(\mathbf{X}_1,\ldots,\mathbf{X}_{n_1},\tilde{\mathbf{X}}_i\).
  4. Check if \(\hat{f}_i(\tilde{\mathbf{X}}_i;\hat{\mathbf{H}}_\mathrm{NS})<\hat{c}_\alpha\), \(i=1,\ldots,n_2\), and flag \(\tilde{\mathbf{X}}_{i}\) as an outlier if the condition holds.

Are there outliers? Do the observations flagged as outliers have the measurements inside the reference values?

Exercise 3.20 The wines.txt dataset contains chemical analyses of wines grown in Piedmont (Italy) coming from three different vintages: Nebbiolo, Barberas, and Grignolino. Perform a PCA after removing the vintage variable and standardizing the variables.

  1. Considering the first three PCs, what is the smallest set that contains \(90\%\) of the probability? Represent the set graphically.
  2. Is the point \(\mathbf{x}_1=(0,-1.5,0)\) inside the level set? Could it be considered an outlier (remember the previous exercise)?
  3. What about \(\mathbf{x}_2=(2, -1.5, 2)\)?
  4. And what about \(\mathbf{x}_3=(3, -3, 3)\)?
  5. Visualize the locations of \(\mathbf{x}_1\), \(\mathbf{x}_2\), and \(\mathbf{x}_3\) to interpret the previous results.

We conclude the section by highlighting a useful fact about the computation of areas under normal densities that is helpful to determine exactly the density levels for given probabilities. The density level \(c_\alpha\) that contains \(1-\alpha\) probability for \(\mathbf{X}\sim\mathcal{N}_p(\boldsymbol{\mu},\boldsymbol{\Sigma})\) can be computed exactly by

\[\begin{align*} 1-\alpha=\mathbb{P}[\phi_{\boldsymbol\Sigma}(\mathbf{X}-\boldsymbol{\mu})\geq c_\alpha]=\mathbb{P}\left[\chi^2_p\leq -2\log\left(|\boldsymbol{\Sigma}|^{1/2}(2\pi)^{p/2}c_\alpha\right)\right], \end{align*}\]

since \((\mathbf{X}-\boldsymbol{\mu})'{\boldsymbol\Sigma}^{-1}(\mathbf{X}-\boldsymbol{\mu})\sim \chi_p^2\). Therefore, for a given \(\alpha\in(0,1)\)

\[\begin{align} c_\alpha=|\boldsymbol{\Sigma}|^{-1/2}(2\pi)^{-p/2}e^{-\chi^2_{p;\alpha}/2},\tag{3.21} \end{align}\]

where \(\chi^2_{p;\alpha}\) is the \(\alpha\)-upper quantile of a \(\chi^2_p\). The following code shows the computation of (3.21).

alpha <- 0.4
p <- 2
Sigma <- diag(p) # A p x p covariance matrix is required; the identity is used
# here as an illustrative choice, since no Sigma is defined in this chunk
c_alpha <- exp(-0.5 * qchisq(p = 1 - alpha, df = p)) /
  (sqrt(det(Sigma)) * (2 * pi)^(p / 2))
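
A quick Monte Carlo check of (3.21), in the spirit of (3.20): the proportion of density values above \(c_\alpha\) should be close to \(1-\alpha\). The sketch below reuses the alpha, p, Sigma, and c_alpha of the previous chunk; the sample size \(10^5\) is an arbitrary choice.

# Monte Carlo check of (3.21) via (3.20)
samp_norm <- mvtnorm::rmvnorm(n = 1e5, mean = rep(0, p), sigma = Sigma)
mean(mvtnorm::dmvnorm(x = samp_norm, mean = rep(0, p), sigma = Sigma) >= c_alpha)
# Should be close to 1 - alpha = 0.6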

3.5.2 Mean shift clustering

Perhaps one of the best-known clustering methods is \(k\)-means. Given a sample \(\mathbf{X}_1,\ldots,\mathbf{X}_n\) in \(\mathbb{R}^p\), \(k\)-means is defined as the problem of finding the clusters \(C_1,\ldots,C_k\) such that the within-cluster variation is as small as possible:

\[\begin{align*} \min_{C_1,\ldots,C_k}\left\{\sum_{j=1}^k \mathrm{W}(C_j)\right\} \end{align*}\]

where the within-cluster variation of \(C_j\) is defined as

\[\begin{align*} \mathrm{W}(C_j):=\frac{1}{|C_j|}\sum_{i,i'\in C_j} \|\mathbf{X}_i-\mathbf{X}_{i'}\|^2 \end{align*}\]

with \(|C_j|:=\#\{i=1,\ldots,n:\mathbf{X}_i\in C_j\}\) denoting the number of observations in the \(j\)-th cluster.


Figure 3.7: \(k\)-means partitions for a two-dimensional dataset with \(k=1,2,3,4\). The center of each cluster is displayed with an asterisk.
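
The within-cluster variation \(\mathrm{W}(C_j)\) defined above, based on pairwise squared distances, coincides with twice the within-cluster sum of squares about the centroid that kmeans() reports, since \(\mathrm{W}(C_j)=2\sum_{i\in C_j}\|\mathbf{X}_i-\bar{\mathbf{X}}_{C_j}\|^2\), with \(\bar{\mathbf{X}}_{C_j}\) the centroid of \(C_j\). A quick numerical check of this relation follows; the simulated two-dimensional sample is an arbitrary choice, as the dataset behind Figure 3.7 is not provided.

# Check that W(C_j) = 2 * within-cluster sum of squares reported by kmeans()
set.seed(23456)
samp_km <- rbind(mvtnorm::rmvnorm(n = 100, mean = c(0, 0)),
                 mvtnorm::rmvnorm(n = 100, mean = c(4, 4)))
km <- kmeans(x = samp_km, centers = 2, nstart = 10)
W <- sapply(1:2, function(j) {
  X_j <- samp_km[km$cluster == j, , drop = FALSE]
  sum(as.matrix(dist(X_j))^2) / nrow(X_j) # Pairwise-distances definition
})
rbind("W(C_j)" = W, "2 * withinss" = 2 * km$withinss) # Both rows coincide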

Behind the usefulness and simplicity of \(k\)-means hides an important drawback inherent to many clustering techniques: it is a method mostly defined on a sample basis and for which a population analogue (in terms of the distribution of \(\mathbf{X}\)) is not clearly evident. This has an important consequence: the somewhat subjective concept of “cluster” depends on the observed sample, and thus there is no immediate objective definition in terms of the population. This implies that there is no clear population reference with which to compare the performance of a given sample clustering, that precise indications on the difficulty of performing clustering in a certain scenario are harder to make, and that there is no clear ground truth for what the right number of clusters should be for a given population.


Figure 3.8: \(k\)-means partitions for \(10,000\) observations sampled from the claw density for \(k=5\). Notice how despite the density’s having \(5\) clear modes, the clusters are not associated with them (the blue on the right hand side captures the data on the right tail), even for the large sample size considered.
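
Figure 3.8 can be roughly reproduced with the sketch below, assuming that the claw density corresponds to nor1mix::MW.nm10; the seed and the use of nstart = 10 are arbitrary choices.

# k-means with k = 5 on a large sample from the claw density
set.seed(23456)
samp_claw <- nor1mix::rnorMix(1e4, nor1mix::MW.nm10)
km_claw <- kmeans(x = samp_claw, centers = 5, nstart = 10)
sort(km_claw$centers) # Compare with the modes, approximately at 0, ±0.5, ±1
plot(density(samp_claw), main = "")
abline(v = km_claw$centers, col = 2) # Centers driven by distances, not modes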

In the following, we study a principled population approach to clustering that tries to bypass these issues.

A population approach to clustering

The general goal of clustering, to find clusters of data with low within-cluster variation, can be regarded as the task of determining data-rich regions in the sample. From the density perspective, data-rich regions have a precise definition: modes and high-density regions. Therefore, modes are going to be crucial for defining population clusters in the sense introduced by Chacón (2015).

Given the random vector \(\mathbf{X}\) in \(\mathbb{R}^p\) with pdf \(f\), we denote the modes of \(f\) by \(\boldsymbol{\xi}_1,\ldots,\boldsymbol{\xi}_m\). These are the local maxima of \(f\), i.e., \(\mathrm{D}f(\boldsymbol{\xi}_j)=\mathbf{0}\), \(j=1,\ldots,m\). Intuitively, we can think of the population clusters as the regions of \(\mathbb{R}^p\) that are “associated” with each of the modes of \(f\). This “association” can be visualized, for example, by a gravitational analogy: if \(\boldsymbol{\xi}_1,\ldots,\boldsymbol{\xi}_m\) denote fixed equal-mass planets distributed in space, then the population clusters can be thought of as the regions that determine the domains of attraction of each planet for an object \(\mathbf{x}\) in space with zero initial speed. If \(\mathbf{x}\) is attracted by \(\boldsymbol{\xi}_1\), then \(\mathbf{x}\) belongs to the cluster defined by \(\boldsymbol{\xi}_1\). In our setting, the role of the gravitational attraction is played by the gradient of the density \(f\), \(\mathrm{D}f:\mathbb{R}^p\longrightarrow \mathbb{R}^p\), which forms a vector field over \(\mathbb{R}^p\).

The code below illustrates the previous analogy and serves to present the function numDeriv::grad.

# Planets
th <- 2 * pi / 3
r <- 2
xi_1 <- r * c(cos(th + 0.5), sin(th + 0.5))
xi_2 <- r * c(cos(2 * th + 0.5), sin(2 * th + 0.5))
xi_3 <- r * c(cos(3 * th + 0.5), sin(3 * th + 0.5))

# Gravity force
gravity <- function(x) {

  (mvtnorm::dmvnorm(x = x, mean = xi_1, sigma = diag(rep(0.5, 2))) +
     mvtnorm::dmvnorm(x = x, mean = xi_2, sigma = diag(rep(0.5, 2))) +
     mvtnorm::dmvnorm(x = x, mean = xi_3, sigma = diag(rep(0.5, 2)))) / 3

}

# Compute numerically the gradient of an arbitrary function
attraction <- function(x) numDeriv::grad(func = gravity, x = x)

# Evaluate the vector field
x <- seq(-4, 4, l = 20)
xy <- expand.grid(x = x, y = x)
dir <- apply(xy, 1, attraction)

# Scale arrows to unit length for better visualization
len <- sqrt(colSums(dir^2))
dir <- 0.25 * scale(dir, center = FALSE, scale = len)

# Colors of the arrows according to their original magnitude
brk <- quantile(len, probs = seq(0, 1, length.out = 21))
cuts <- cut(x = len, breaks = brk)
cols <- viridis::viridis(20)[cuts]

# Vector field plot
plot(0, 0, type = "n", xlim = c(-4, 4), ylim = c(-4, 4),
     xlab = "x", ylab = "y")
arrows(x0 = xy$x, y0 = xy$y,
       x1 = xy$x + dir[1, ], y1 = xy$y + dir[2, ],
       angle = 10, length = 0.1, col = cols, lwd = 2)
points(rbind(xi_1, xi_2, xi_3), pch = 19, cex = 1.5)

Figure 3.9: Sketch of the gravity vector field associated with three equal-mass planets (black points). The vector field is computed as the gradient of a mixture of three bivariate normals centered at the black points and having covariance matrices \(\frac{1}{2}\mathbf{I}_2\). The direction of the arrows denotes the direction of the gravity field, and their color, the strength of the gravity force.

The previous idea can be mathematically formalized as follows. We seek to partition \(\mathbb{R}^p\) into a collection of disjoint subsets \(W^s_+(\boldsymbol{\xi}_1),\ldots, W^s_+(\boldsymbol{\xi}_m)\) defined as

\[\begin{align*} W^s_+(\boldsymbol{\xi}):=\left\{\mathbf{x}\in\mathbb{R}^p:\lim_{t\to\infty}\boldsymbol{\phi}_{\mathbf{x}}(t)=\boldsymbol{\xi}\right\} \end{align*}\]

where \(\boldsymbol{\phi}_{\mathbf{x}}:\mathbb{R}\longrightarrow\mathbb{R}^p\) is a curve in \(\mathbb{R}^p\), parametrized by \(t\in\mathbb{R}\), that satisfies the following Ordinary Differential Equation (ODE):

\[\begin{align} \frac{\mathrm{d}}{\mathrm{d}t}\boldsymbol{\phi}_{\mathbf{x}}(t)=\mathrm{D}f(\boldsymbol{\phi}_{\mathbf{x}}(t)),\quad \boldsymbol{\phi}_{\mathbf{x}}(0)=\mathbf{x}.\tag{3.22} \end{align}\]

This ODE admits a clear interpretation: the flow curve \(\boldsymbol{\phi}_{\mathbf{x}}\) is the path that, starting at \(\mathbf{x}\), ascends along the direction of maximum increase of \(f\) until reaching the corresponding mode \(\boldsymbol{\xi}_j\). The ODE can be solved through different numerical schemes. For example, observing that \(\frac{\mathrm{d}}{\mathrm{d}t}\boldsymbol{\phi}_{\mathbf{x}}(t)=\lim_{h\to0}\frac{\boldsymbol{\phi}_{\mathbf{x}}(t+h)-\boldsymbol{\phi}_{\mathbf{x}}(t)}{h}\), the Euler method considers the approximation

\[\begin{align} \boldsymbol{\phi}_{\mathbf{x}}(t+h)\approx\boldsymbol{\phi}_{\mathbf{x}}(t)+h\mathrm{D}f(\boldsymbol{\phi}_{\mathbf{x}}(t))\quad\text{if } h\approx 0,\tag{3.23} \end{align}\]

which motivates the iterative scheme

\[\begin{align} \begin{cases} \mathbf{x}_{t+1}=\mathbf{x}_t+h\mathrm{D}f(\mathbf{x}_t),\quad t=0,\ldots,N,\\ \mathbf{x}_{0}=\mathbf{x}, \end{cases} \tag{3.24} \end{align}\]

for a step \(h>0\) and a number of maximum iterations \(N\).

The following chunk of code illustrates the implementation of the Euler method (3.24) for solving (3.22) and gives insight into the paths \(\boldsymbol{\phi}_{\mathbf{x}}\). The code implements the exact gradient of the mixture density, based on the gradient of a multivariate normal given in (3.9).

# Mixture parameters
mu_1 <- rep(1, 2)
mu_2 <- rep(-1.5, 2)
Sigma_1 <- matrix(c(1, -0.75, -0.75, 3), nrow = 2, ncol = 2)
Sigma_2 <- matrix(c(2, 0.75, 0.75, 3), nrow = 2, ncol = 2)
Sigma_1_inv <- solve(Sigma_1)
Sigma_2_inv <- solve(Sigma_2)
w <- 0.45

# Density
f <- function(x) {
  w * mvtnorm::dmvnorm(x = x, mean = mu_1, sigma = Sigma_1) +
    (1 - w) * mvtnorm::dmvnorm(x = x, mean = mu_2, sigma = Sigma_2)
}

# Gradient (caution: only works adequately for x a vector, it is not
# vectorized; observe that in the Sigma_inv %*% (x - mu) part the subtraction
# of mu and premultiplication by Sigma_inv are specific to a *single* point x)
Df <- function(x) {
  -(w * mvtnorm::dmvnorm(x = x, mean = mu_1, sigma = Sigma_1) *
      Sigma_1_inv %*% (x - mu_1) +
    (1 - w) * mvtnorm::dmvnorm(x = x, mean = mu_2, sigma = Sigma_2) *
      Sigma_2_inv %*% (x - mu_2))
}

# Plot density
ks::plotmixt(mus = rbind(mu_1, mu_2), Sigmas = rbind(Sigma_1, Sigma_2),
             props = c(w, 1 - w), display = "filled.contour2",
             gridsize = rep(251, 2), xlim = c(-5, 5), ylim = c(-5, 5),
             cont = seq(0, 90, by = 10), col.fun = viridis::viridis)

# Euler solution
x <- c(-2, 2)
# x <- c(-4, 0)
# x <- c(-4, 4)
N <- 1e3
h <- 0.5
phi <- matrix(nrow = N + 1, ncol = 2)
phi[1, ] <- x
for (t in 1:N) {

  phi[t + 1, ] <- phi[t, ] + h * Df(phi[t, ]) # Add "/ f(phi[t, ])" for the
  # normalized-gradient variant

}
lines(phi, type = "l")
points(rbind(x), pch = 19)
text(rbind(x), labels = "x", pos = 3)

# Mean of the components
points(rbind(mu_1, mu_2), pch = 16, col = 4)
text(rbind(mu_1, mu_2), labels = expression(mu[1], mu[2]), pos = 4, col = 4)

# The modes are different from the mean of the components! -- see the gradients
cbind(Df(mu_1), Df(mu_2))
##              [,1]         [,2]
## [1,] -0.005195479 1.528460e-04
## [2,] -0.002886377 7.132814e-05

# Modes
xi_1 <- optim(par = mu_1, fn = function(x) sum(Df(x)^2))$par
xi_2 <- optim(par = mu_2, fn = function(x) sum(Df(x)^2))$par
points(rbind(xi_1, xi_2), pch = 16, col = 2)
text(rbind(xi_1, xi_2), labels = expression(xi[1], xi[2]), col = 2, pos = 2)

Figure 3.10: The curve \(\boldsymbol{\phi}_{\mathbf{x}}\) computed by the Euler method, in black. The population density is the mixture of bivariate normals \(w\phi_{\boldsymbol{\Sigma}_1}(\cdot-\boldsymbol{\mu}_1)+(1-w)\phi_{\boldsymbol{\Sigma}_2}(\cdot-\boldsymbol{\mu}_2)\), where \(\boldsymbol{\mu}_1=(1,1)'\), \(\boldsymbol{\mu}_2=(-1.5,-1.5)'\), \(\boldsymbol{\Sigma}_1=(1, -0.75; -0.75, 3)\), \(\boldsymbol{\Sigma}_2=(2, 0.75; 0.75, 3)\), and \(w=0.45\). The component means \(\boldsymbol{\mu}_1\) and \(\boldsymbol{\mu}_2\) are shown in blue, whereas the two modes \(\boldsymbol{\xi}_1\) and \(\boldsymbol{\xi}_2\) of the density are represented in red. Note that the modes and the component means may be different.

Exercise 3.21 Inspect the code generating Figure 3.10. Then:

  1. Experiment with different values for the initial point \(\mathbf{x}\) and inspect the resulting flow curves. What happens if \(\mathbf{x}\) is in a low-density region?
  2. Change the step \(h\) to larger and smaller values. What happens? Does \(h\) have an influence on the convergence to a mode?
  3. Does the “adequate” step \(h\) depend on \(\mathbf{x}\)?

Clustering a sample: kernel mean shift clustering

From the previous example, it is clear how to assign a point \(\mathbf{x}\) to a population cluster: compute its associated flow curve and assign the \(j\)-th cluster label if \(\mathbf{x}\in W_+^s(\boldsymbol{\xi}_j)\). More importantly, once the population view of clustering is clear, performing clustering for a sample is conceptually trivial: just replace \(f\) in (3.24) with its kde \(\hat{f}(\cdot;\mathbf{H})\)! Doing so results in

\[\begin{align} \begin{cases} \mathbf{x}_{t+1}=\mathbf{x}_t+h\mathrm{D}\hat{f}(\mathbf{x}_t;\mathbf{H}),\quad t=0,\ldots,N,\\ \mathbf{x}_{0}=\mathbf{x}. \end{cases} \tag{3.25} \end{align}\]

A couple more tweaks are required to turn (3.25) into a more practical relation. The first tweak boosts the travel through low-density regions and decreases the ascent at high-density regions (see Figure 3.11) by adapting the step size taken at \(\mathbf{x}_{t+1}\) according to the density at \(\mathbf{x}_{t}\). In terms of (3.24), this amounts to considering the normalized gradient

\[\begin{align*} \boldsymbol{\eta}(\mathbf{x}):=\frac{\mathrm{D}f(\mathbf{x})}{f(\mathbf{x})} \end{align*}\]

instead of the unnormalized gradient. Considering the normalized gradient can be seen as replacing, in (3.25), the constant step \(h\) with the variable step \(h(\mathbf{x}):=a/f(\mathbf{x})\), \(a>0\).


Figure 3.11: Comparison between unnormalized and normalized gradient vector fields. Observe how the unnormalized gradient almost vanishes at regions with low density and “overshoots” close to the modes. The normalized gradient gently adapts to the density region, boosting and slowing ascent as required. Both gradients have been standardized so that the norm of the maximum gradient is \(2\), therefore enabling their comparison.
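
A quick numerical check of this behavior, reusing the mixture density f() and gradient Df() defined in the code generating Figure 3.10 (the two evaluation points are arbitrary choices):

# Unnormalized gradient norms and step boosts 1 / f(x) at a low-density point
# and at a point close to a mode
x_low <- c(-4, 4)
x_high <- c(1, 1)
c("low" = sqrt(sum(Df(x_low)^2)), "high" = sqrt(sum(Df(x_high)^2)))
c("low" = 1 / f(x_low), "high" = 1 / f(x_high)) # Normalization boosts x_low most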

The second tweak replaces \(a\) in the variable step with a positive definite matrix \(\mathbf{A}\) that premultiplies \(\boldsymbol{\eta}(\mathbf{x})\), allowing for more generality. This apparently innocuous change enables a convenient choice for \(\mathbf{A}\), and hence for the step in Euler’s method: \(\mathbf{A}=\mathbf{H}\).

Combining the previous two tweaks results in the recurrence relation of kernel mean shift clustering:

\[\begin{align} \mathbf{x}_{t+1}=\mathbf{x}_t+\mathbf{H}\hat{\boldsymbol{\eta}}(\mathbf{x}_t;\mathbf{H}),\quad \hat{\boldsymbol{\eta}}(\mathbf{x};\mathbf{H}):=\frac{\mathrm{D}\hat{f}(\mathbf{x};\mathbf{H})}{\hat{f}(\mathbf{x};\mathbf{H})}.\tag{3.26} \end{align}\]

The recipe for clustering a sample \(\mathbf{X}_1,\ldots,\mathbf{X}_n\) is now simple:

  1. Select a “suitable” bandwidth \(\hat{\mathbf{H}}\).
  2. For each element \(\mathbf{X}_i\), iterate the recurrence relation (3.26) “until convergence” to a given \(\mathbf{y}_i\), \(i=1,\ldots,n\).
  3. Find the set of “unique” end points \(\{\boldsymbol{\xi}_1,\ldots,\boldsymbol{\xi}_m\}\) (the modes) among \(\{\mathbf{y}_1,\ldots,\mathbf{y}_n\}\).
  4. Label \(\mathbf{X}_i\) as \(j\) if it is associated with the \(j\)-th mode \(\boldsymbol{\xi}_j\).

Some practical details are important when implementing the previous algorithm. First and most importantly, a “suitable” bandwidth \(\hat{\mathbf{H}}\) has to be considered in order to carry out (3.26). Observe the crucial role of the bandwidth:

  • A large bandwidth \(\hat{\mathbf{H}}\) will oversmooth the data and underestimate the number of modes. In the most extreme case, it will merge all the possible clusters into a single one positioned close to the sample mean \(\bar{\mathbf{X}}\).
  • A small bandwidth \(\hat{\mathbf{H}}\) will undersmooth the data, thus overestimating the number of modes. In the most extreme case, it will detect as many modes as data points.

The recommended kind of bandwidth selectors are the ones designed for density gradient estimation (see Sections 3.1 and 3.4), since (3.26) critically depends on adequately estimating the gradient (see the discussion in Section 6.2.2 in Chacón and Duong (2018)). These bandwidth selectors yield larger bandwidths than the ones designed for density estimation (compare (3.18) with (3.17)).

In addition, both the iteration “until convergence” and the “uniqueness” of the end points have to be evaluated in practice with adequate numerical tolerances. The function ks::kms implements kernel mean shift clustering and automatically employs tested choices for numerical tolerances and convergence criteria.

# A simulated example for which the population clusters are known
# Extracted from ?ks::dmvnorm.mixt
mus <- rbind(c(-1, 0), c(1, 2 / sqrt(3)), c(1, -2 / sqrt(3)))
Sigmas <- 1/25 * rbind(ks::invvech(c(9, 63/10, 49/4)),
                       ks::invvech(c(9, 0, 49/4)),
                       ks::invvech(c(9, 0, 49/4)))
props <- c(3, 3, 1) / 7

# Sample the mixture
set.seed(123456)
x <- ks::rmvnorm.mixt(n = 1000, mus = mus, Sigmas = Sigmas, props = props)

# Kernel mean shift clustering. If H is not specified, then
# H = ks::Hpi(x, deriv.order = 1) is employed. Its computation may take some
# time, so it is advisable to compute it separately for later reuse
H <- ks::Hpi(x = x, deriv.order = 1)
kms <- ks::kms(x = x, H = H)

# Plot clusters
plot(kms, col = viridis::viridis(kms$nclust), pch = 19, xlab = "x", ylab = "y")


# Summary
summary(kms)
## Number of clusters = 3 
## Cluster label table = 521 416 63 
## Cluster modes =
##           V1          V2
## 1  1.0486466  0.96316436
## 2 -1.0049258  0.08419048
## 3  0.9888924 -1.43852908

# Objects in the kms object
kms$nclust # Number of clusters found
## [1] 3
kms$nclust.table # Sizes of clusters
## 
##   1   2   3 
## 521 416  63
kms$mode # Estimated modes
##            [,1]        [,2]
## [1,]  1.0486466  0.96316436
## [2,] -1.0049258  0.08419048
## [3,]  0.9888924 -1.43852908

# With keep.path = TRUE the ascending paths are returned
kms <- ks::kms(x = x, H = H, keep.path = TRUE)
cols <- viridis::viridis(kms$nclust, alpha = 0.5)[kms$label]
plot(x, col = cols, pch = 19, xlab = "x", ylab = "y")
for (i in 1:nrow(x)) lines(kms$path[[i]], col = cols[i])
points(kms$mode, pch = 8, cex = 2, lwd = 2)
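
As stressed above, the bandwidth drives the number of detected clusters. A quick sketch of this effect, reusing x and H from the previous chunk; the scaling factors 1/25 and 25 are arbitrary choices that force under- and oversmoothing.

# Number of clusters detected with smaller, equal, and larger bandwidths
sapply(c(1 / 25, 1, 25), function(s) ks::kms(x = x, H = s * H)$nclust)
# Small bandwidths tend to detect spurious clusters; large ones merge them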

The partition of the whole space (not just the sample) can be done with ks::kms.part, whereas ks::mvnorm.mixt.part allows partitioning the population when it is a mixture of normals.

# Partition of the whole sample space
kms_part <- ks::kms.part(x = x, H = H, xmin = c(-3, -3), xmax = c(3, 4),
                         gridsize = c(150, 150))
plot(kms_part, display = "filled.contour2", col = viridis::viridis(kms$nclust),
     xlab = "x", ylab = "y")
points(kms_part$mode, pch = 8, cex = 2, lwd = 2)


# Partition of the population
mixt_part <- ks::mvnorm.mixt.part(mus = mus, Sigmas = Sigmas, props = props,
                                  xmin = c(-3, -3), xmax = c(3, 4),
                                  gridsize = c(150, 150))
plot(mixt_part, display = "filled.contour2", col = viridis::viridis(kms$nclust),
     xlab = "x", ylab = "y")

# Obtain the modes of a mixture of normals automatically
modes <- ks::mvnorm.mixt.mode(mus = mus, Sigmas = Sigmas, props = props)
points(modes, pch = 8, cex = 2, lwd = 2)

modes
##            [,1]         [,2]
## [1,] -0.9975330  0.002125069
## [2,]  0.9898321  1.153091638
## [3,]  0.9999969 -1.119886815
mus
##      [,1]      [,2]
## [1,]   -1  0.000000
## [2,]    1  1.154701
## [3,]    1 -1.154701

Finally, the ks::kms function is applied to the iris[, 1:3] dataset to illustrate a three-dimensional example.

# Obtain PI bandwidth
H <- ks::Hpi(x = iris[, 1:3], deriv.order = 1)

# Many (8) clusters: probably due to the repetitions in the data
kms_iris <- ks::kms(x = iris[, 1:3], H = H)
summary(kms_iris)
## Number of clusters = 8 
## Cluster label table = 47 3 25 11 55 3 3 3 
## Cluster modes =
##   Sepal.Length Sepal.Width Petal.Length
## 1     5.065099    3.442888     1.470614
## 2     5.783786    3.975575     1.255177
## 3     6.726385    3.026522     4.801402
## 4     5.576415    2.478507     3.861941
## 5     6.081276    2.884925     4.710959
## 6     6.168988    2.232741     4.310609
## 7     6.251387    3.375452     5.573491
## 8     7.208475    3.596510     6.118948

# Force to only find clusters that contain at least 10% of the data
# kms merges internally the small clusters with the closest ones
kms_iris <- ks::kms(x = iris[, 1:3], H = H, min.clust.size = 15)
summary(kms_iris)
## Number of clusters = 3 
## Cluster label table = 50 31 69 
## Cluster modes =
##   Sepal.Length Sepal.Width Petal.Length
## 1     5.065099    3.442888     1.470614
## 2     6.726385    3.026522     4.801402
## 3     5.576415    2.478507     3.861941

# Pairs plot -- good match of clustering with Species
plot(kms_iris, pch = as.numeric(iris$Species) + 1,
     col = viridis::viridis(kms_iris$nclust))

# See ascending paths
kms_iris <- ks::kms(x = iris[, 1:3], H = H, min.clust.size = 15,
                    keep.path = TRUE)
cols <- viridis::viridis(kms_iris$nclust)[kms_iris$label]
rgl::plot3d(kms_iris$x, col = cols)
for (i in 1:nrow(iris)) rgl::lines3d(kms_iris$path[[i]], col = cols[i])
rgl::points3d(kms_iris$mode, size = 5)
rgl::rglwidget()