2.1 Histograms

2.1.1 Histogram

The simplest method to estimate a density \(f\) from an iid sample \(X_1,\ldots,X_n\) is the histogram. From an analytical point of view, the idea is to aggregate the data in intervals of the form \([x_0, x_0+h)\) and then use their relative frequency to approximate the density at \(x\in[x_0, x_0+h)\), \(f(x)\), by the estimate of^15

\[ f(x_0) = F'(x_0) = \lim_{h\to0^+}\frac{F(x_0+h)-F(x_0)}{h} = \lim_{h\to0^+}\frac{\mathbb{P}[x_0<X<x_0+h]}{h}. \]

More precisely, given an origin \(t_0\) and a bandwidth \(h>0\), the histogram builds a piecewise constant function in the intervals \(\{B_k:=[t_k,t_{k+1}): t_k=t_0+hk, k\in\mathbb{Z}\}\) by counting the number of sample points inside each of them. These constant-length intervals are also called bins. The fact that they are of constant length \(h\) is important: we can easily standardize the counts on any bin by \(h\) in order to have relative frequencies per length^16 in the bins. The histogram at a point \(x\) is defined as

\[ \hat{f}_\mathrm{H}(x;t_0,h) := \frac{1}{nh}\sum_{i=1}^n \mathbb{1}_{\{X_i\in B_k : x\in B_k\}}. \tag{2.1} \]

Equivalently, if we denote the number of observations \(X_1,\ldots,X_n\) in \(B_k\) by \(v_k\), then the histogram can be written as

\[ \hat{f}_\mathrm{H}(x;t_0,h) = \frac{v_k}{nh}, \quad \text{if } x\in B_k \text{ for a certain } k\in\mathbb{Z}. \]

The computation of histograms is straightforward in R. As an example, we consider the old-but-gold faithful dataset. This dataset contains the duration of the eruption and the waiting time between eruptions for the Old Faithful geyser in Yellowstone National Park (USA).

# The faithful dataset is included in R
head(faithful)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55

# Duration of eruption
faith_eruptions <- faithful$eruptions

# Default histogram: automatically chosen bins and absolute frequencies!
histo <- hist(faith_eruptions)


# List that contains several objects
str(histo)
## List of 6
##  $ breaks  : num [1:9] 1.5 2 2.5 3 3.5 4 4.5 5 5.5
##  $ counts  : int [1:8] 55 37 5 9 34 75 54 3
##  $ density : num [1:8] 0.4044 0.2721 0.0368 0.0662 0.25 ...
##  $ mids    : num [1:8] 1.75 2.25 2.75 3.25 3.75 4.25 4.75 5.25
##  $ xname   : chr "faith_eruptions"
##  $ equidist: logi TRUE
##  - attr(*, "class")= chr "histogram"

# With relative frequencies
hist(faith_eruptions, probability = TRUE)


# Choosing the breaks
# t0 = min(faith_eruptions), h = 0.25
Bk <- seq(min(faith_eruptions), max(faith_eruptions), by = 0.25)
hist(faith_eruptions, probability = TRUE, breaks = Bk)
rug(faith_eruptions) # Plotting the sample
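
Under the hood, computing (2.1) takes only a few lines of code. The following is a minimal sketch of a direct implementation (the function hist_manual and the evaluation grid are ours, for illustration only), applied with the same \(t_0\) and \(h\) as above.

# A direct implementation of (2.1); assumes h > 0 and that x may be a vector
hist_manual <- function(x, samp, t0, h) {
  tk <- t0 + floor((x - t0) / h) * h # Left endpoint t_k of the bin B_k containing x
  vk <- sapply(tk, function(t) sum(samp >= t & samp < t + h)) # Bin counts v_k
  vk / (length(samp) * h) # Relative frequencies per length
}

# Piecewise constant estimate on a grid; same t0 and h as above
x_grid <- seq(1.5, 5.5, by = 0.01)
plot(x_grid, hist_manual(x = x_grid, samp = faith_eruptions,
                         t0 = min(faith_eruptions), h = 0.25),
     type = "s", xlab = "eruptions", ylab = "Density")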

Exercise 2.1 For iris$Sepal.Length, compute:

  1. The histogram of relative frequencies with five bins.
  2. The histogram of absolute frequencies with \(t_0=4.3\) and \(h=1\).

Add the rug of the data for both histograms.

The analysis of \(\hat{f}_\mathrm{H}(x;t_0,h)\) as a random variable is simple, once one recognizes that the bin counts \(v_k\) are distributed as \(\mathrm{B}(n,p_k)\), with \(p_k:=\mathbb{P}[X\in B_k]=\int_{B_k}f(t)\,\mathrm{d}t\).^17 If \(f\) is continuous, then by the integral mean value theorem, \(p_k=hf(\xi_{k,h})\) for a \(\xi_{k,h}\in(t_k,t_{k+1})\). Assume that \(k\in\mathbb{Z}\) is such that \(x\in B_k=[t_k,t_{k+1})\).^18 Therefore:

\[ \mathbb{E}[\hat{f}_\mathrm{H}(x;t_0,h)] = \frac{np_k}{nh} = f(\xi_{k,h}), \qquad \mathrm{Var}[\hat{f}_\mathrm{H}(x;t_0,h)] = \frac{np_k(1-p_k)}{n^2h^2} = \frac{f(\xi_{k,h})(1-hf(\xi_{k,h}))}{nh}. \tag{2.2} \]
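
Before extracting insights from (2.2), it is simple to corroborate these expressions by simulation. The following sketch (ours; the choices \(f=\phi\), \(x=0.3\), \(t_0=0\), and \(h=0.5\) are arbitrary) compares Monte Carlo approximations of the mean and variance of \(\hat{f}_\mathrm{H}(x;t_0,h)\) with (2.2).

# Monte Carlo check of (2.2) with f = dnorm; x = 0.3 belongs to B_k = [0, 0.5)
set.seed(1234567)
M <- 1e4; n <- 200; t0 <- 0; h <- 0.5; x <- 0.3
tk <- t0 + floor((x - t0) / h) * h # Left endpoint of B_k
pk <- pnorm(tk + h) - pnorm(tk) # p_k = P[X in B_k]
f_hat <- replicate(M, {
  samp <- rnorm(n)
  sum(samp >= tk & samp < tk + h) / (n * h) # (2.1) evaluated at x
})
c(mean(f_hat), pk / h) # Both approximate f(xi_kh)
c(var(f_hat), pk * (1 - pk) / (n * h^2)) # Both sides of the variance in (2.2)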

The results above yield interesting insights:

  1. If \(h\to0\), then \(\xi_{k,h}\to x\),^19 resulting in \(f(\xi_{k,h})\to f(x)\), and thus (2.1) becomes an asymptotically (when \(h\to0\)) unbiased estimator of \(f(x)\).
  2. But if \(h\to0\), the variance increases. Decreasing the variance requires \(nh\to\infty\).
  3. The variance depends directly on \(f(\xi_{k,h})(1-hf(\xi_{k,h}))\to f(x)\) (as \(h\to0\)), hence there is more variability in regions with higher density.

A more detailed analysis of the histogram can be seen in Section 3.2.2 in Scott (2015). We skip it since the detailed asymptotic analysis for the more general kernel density estimator will be given in Section 2.2.

Exercise 2.2 Given (2.2), obtain \(\mathrm{MSE}[\hat{f}_\mathrm{H}(x;t_0,h)]\). What should happen in order to have \(\mathrm{MSE}[\hat{f}_\mathrm{H}(x;t_0,h)]\to0\)?

Clearly, the shape of the histogram depends on:

  • \(t_0\), since the separation between bins happens at \(t_0+kh\), \(k\in\mathbb{Z}\);
  • \(h\), which controls the bin size and the effective number of bins for aggregating the sample.

We focus first on exploring the dependence on \(t_0\), as it serves to motivate the next density estimator.

# Sample from a U(0, 1)
set.seed(1234567)
u <- runif(n = 100)

# Bins for t0 = 0, h = 0.2
Bk1 <- seq(0, 1, by = 0.2)

# Bins for t0 = -0.1, h = 0.2
Bk2 <- seq(-0.1, 1.1, by = 0.2)

# Comparison of histograms for different t0's
hist(u, probability = TRUE, breaks = Bk1, ylim = c(0, 1.5),
     main = "t0 = 0, h = 0.2")
rug(u)
abline(h = 1, col = 2)
hist(u, probability = TRUE, breaks = Bk2, ylim = c(0, 1.5),
     main = "t0 = -0.1, h = 0.2")
rug(u)
abline(h = 1, col = 2)

Figure 2.2: The dependence of the histogram on the origin \(t_0\). The red curve represents the uniform pdf.

A high dependence on \(t_0\) also arises when estimating densities that are not compactly supported. The following snippet of code illustrates this.

# Sample 50 points from a N(0, 1) and 25 from a N(3.25, 0.5)
set.seed(1234567)
samp <- c(rnorm(n = 50, mean = 0, sd = 1),
          rnorm(n = 25, mean = 3.25, sd = sqrt(0.5)))

# min and max for choosing Bk1 and Bk2
range(samp)
## [1] -2.082486  4.344547

# Comparison
Bk1 <- seq(-2.5, 5, by = 0.5)
Bk2 <- seq(-2.25, 5.25, by = 0.5)
hist(samp, probability = TRUE, breaks = Bk1, ylim = c(0, 0.5),
     main = "t0 = -2.5, h = 0.5")
curve(2/3 * dnorm(x, mean = 0, sd = 1) +
        1/3 * dnorm(x, mean = 3.25, sd = sqrt(0.5)), col = 2, add = TRUE,
      n = 200)
rug(samp)
hist(samp, probability = TRUE, breaks = Bk2, ylim = c(0, 0.5),
     main = "t0 = -2.25, h = 0.5")
curve(2/3 * dnorm(x, mean = 0, sd = 1) +
        1/3 * dnorm(x, mean = 3.25, sd = sqrt(0.5)), col = 2, add = TRUE,
      n = 200)
rug(samp)

Figure 2.3: The dependence of the histogram on the origin \(t_0\) for non-compactly supported densities. The red curve represents the underlying pdf, a mixture of two normals.

Clearly, the subjectivity introduced by the dependence on \(t_0\) is something we would like to get rid of. We can do so by allowing the bins to depend on \(x\) (the point at which we want to estimate \(f(x)\)), rather than fixing them beforehand.

2.1.2 Moving histogram

An alternative to avoid the dependence on \(t_0\) is the moving histogram, also known as the naive density estimator.^20 The idea is to aggregate the sample \(X_1,\ldots,X_n\) in intervals of the form \((x-h, x+h)\) and then use its relative frequency in \((x-h, x+h)\) to approximate the density at \(x\), which can be expressed as

\[ f(x) = F'(x) = \lim_{h\to0^+}\frac{F(x+h)-F(x-h)}{2h} = \lim_{h\to0^+}\frac{\mathbb{P}[x-h<X<x+h]}{2h}. \]

Recall the differences with the histogram: the intervals depend on the evaluation point \(x\) and are centered about it. This allows us to directly estimate \(f(x)\) (without the proxy \(f(x_0)\)) using an estimate based on the symmetric derivative of \(F\) at \(x\), instead of employing an estimate based on the forward derivative of \(F\) at \(x_0\).

More precisely, given a bandwidth \(h>0\), the naive density estimator builds a piecewise constant function by considering the relative frequency of \(X_1,\ldots,X_n\) inside \((x-h, x+h)\):

\[ \hat{f}_\mathrm{N}(x;h) := \frac{1}{2nh}\sum_{i=1}^n \mathbb{1}_{\{x-h<X_i<x+h\}}. \tag{2.3} \]

Figure 2.4 shows the moving histogram for the same sample used in Figure 2.3, clearly revealing a remarkable improvement over the previous histograms in the estimation of the underlying density.


Figure 2.4: The naive density estimator \(\hat{f}_\mathrm{N}(\cdot;h)\) (black curve). The red curve represents the underlying pdf, a mixture of two normals.

Exercise 2.3 Is \(\hat{f}_\mathrm{N}(\cdot;h)\) continuous in general? Justify your answer. If the answer is negative, then:

  1. What is the maximum number of discontinuities it may have?
  2. What should happen to have fewer discontinuities than its maximum?

Exercise 2.4 Implement your own version of the moving histogram in R. It must be a function that takes as inputs:

  1. a vector with the evaluation points \(x\);
  2. a sample \(X_1,\ldots,X_n\);
  3. a bandwidth \(h\).

The function must return (2.3) evaluated at each point \(x\). Test the implementation by comparing the true density of a \(\mathcal{N}(0,1)\) with its estimate computed from \(n=100\) observations.
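
For reference, one possible vectorized implementation follows (a sketch, not necessarily the most efficient one; the name mov_hist and the testing choices are ours). Try to write your own version before looking at it.

# Moving histogram (2.3); x: evaluation points, samp: sample, h: bandwidth > 0
mov_hist <- function(x, samp, h) {
  sapply(x, function(xi) mean(xi - h < samp & samp < xi + h)) / (2 * h)
}

# Test: estimate based on n = 100 observations from a N(0, 1) vs. the true pdf
set.seed(1234567)
samp_norm <- rnorm(n = 100)
x_grid <- seq(-3.5, 3.5, by = 0.01)
plot(x_grid, mov_hist(x = x_grid, samp = samp_norm, h = 0.5), type = "l",
     xlab = "x", ylab = "Density")
curve(dnorm(x), col = 2, add = TRUE) # True density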

Analogously to the histogram, the analysis of \(\hat{f}_\mathrm{N}(x;h)\) as a random variable follows from realizing that

\[ \sum_{i=1}^n \mathbb{1}_{\{x-h<X_i<x+h\}} \sim \mathrm{B}(n, p_{x,h}), \qquad p_{x,h} := \mathbb{P}[x-h<X<x+h] = F(x+h)-F(x-h). \]

Then:

\[ \mathbb{E}[\hat{f}_\mathrm{N}(x;h)] = \frac{F(x+h)-F(x-h)}{2h}, \tag{2.4} \]
\[ \mathrm{Var}[\hat{f}_\mathrm{N}(x;h)] = \frac{F(x+h)-F(x-h)}{4nh^2} - \frac{(F(x+h)-F(x-h))^2}{4nh^2}. \tag{2.5} \]

Exercise 2.5 Derive expressions (2.4) and (2.5) from the binomial relation indicated above.

Results (2.4) and (2.5) provide interesting insights into the effect of \(h\):

  1. If \(h\to0\), then:

    • \(\mathbb{E}[\hat{f}_\mathrm{N}(x;h)]\to f(x)\) and (2.3) is an asymptotically unbiased estimator of \(f(x)\).
    • \(\mathrm{Var}[\hat{f}_\mathrm{N}(x;h)]\approx\frac{f(x)}{2nh}-\frac{f(x)^2}{n}\).
  2. If \(h\to\infty\), then:

    • \(\mathbb{E}[\hat{f}_\mathrm{N}(x;h)]\to0\).
    • \(\mathrm{Var}[\hat{f}_\mathrm{N}(x;h)]\to0\).
  3. The variance shrinks to zero if \(nh\to\infty\).^21 So both the bias and the variance can be reduced if \(n\to\infty\), \(h\to0\), and \(nh\to\infty\), simultaneously.

  4. The variance is almost proportional^22 to \(f(x)\).

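These effects can be quantified exactly from (2.4) and (2.5) whenever \(F\) is known. A small numerical illustration (ours; \(F=\Phi\), \(x=0\), and the bandwidth grid are arbitrary choices) makes the bias-variance tradeoff explicit.

# Exact bias and variance of the moving histogram at x = 0 for F = pnorm
n <- 100; x <- 0
h <- c(0.01, 0.05, 0.25, 0.5, 1, 2)
p_xh <- pnorm(x + h) - pnorm(x - h) # F(x + h) - F(x - h)
bias <- p_xh / (2 * h) - dnorm(x) # (2.4) minus f(x)
variance <- p_xh * (1 - p_xh) / (4 * n * h^2) # (2.5)
round(cbind(h, bias, variance), 5)
# Small h: negligible bias, large variance. Large h: large bias, small variance
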
The animation in Figure 2.5 illustrates the previous points and gives insights into how the performance of (2.3) varies smoothly with \(h\).

Figure 2.5: Bias and variance for the moving histogram. The animation shows how for small bandwidths the bias of \(\hat{f}_\mathrm{N}(x;h)\) in estimating \(f(x)\) is small, but the variance is high, and how for large bandwidths the bias is large and the variance is small. The variance is visualized through the asymptotic 95% confidence intervals for \(\hat{f}_\mathrm{N}(x;h)\). Application available here.

The estimator (2.3) raises an important question:

Why give the same weight to all \(X_1,\ldots,X_n\) in \((x-h, x+h)\) for approximating \(\frac{\mathbb{P}[x-h<X<x+h]}{2h}\)?

We are estimating \(f(x)=F'(x)\) by estimating \(\frac{F(x+h)-F(x-h)}{2h}\) through the relative frequency of \(X_1,\ldots,X_n\) in the interval \((x-h, x+h)\). Therefore, it seems reasonable that the data points closer to \(x\) should be more important in assessing the infinitesimal probability of \(x\) than those further away. This observation shows that (2.3) is indeed a particular case of a wider and more sensible class of density estimators, which we will see next.

References

Scott, D. W. 2015. Multivariate Density Estimation. Second. Wiley Series in Probability and Statistics. Hoboken: John Wiley & Sons. https://doi.org/10.1002/9781118575574.

  15. Note that we estimate \(f(x)\) by means of an estimate for \(f(x_0)\), where \(x\) is at most \(h>0\) units above \(x_0\). Thus, we do not estimate directly \(f(x)\) with the histogram.↩︎

  16. Recall that, with this standardization, we approach the concept of probability density.↩︎

  17. Note that it is key that the \(\{B_k\}\) are deterministic (and not sample-dependent) for this result to hold.↩︎

  18. This is an important point. Notice also that this \(k\) depends on \(h\) because \(t_k=t_0+kh\); therefore, the \(k\) for which \(x\in[t_k,t_{k+1})\) will change when, for example, \(h\to0\).↩︎

  19. Because \(x\in[t_k,t_{k+1})\) with \(k\) changing as \(h\to0\) (see the previous footnote) and the interval ends up collapsing at \(x\), so any point in \([t_k,t_{k+1})\) converges to \(x\).↩︎

  20. The motivation for this terminology will be apparent in Section 2.2.↩︎

  21. Or, in other words, if \(h^{-1}=o(n)\), i.e., if \(h^{-1}\) grows more slowly than \(n\) does.↩︎

  22. Why so?↩︎