2.1 Histograms

2.1.1 Histogram

The simplest method to estimate a density $f$ from an iid sample $X_1, \ldots, X_n$ is the histogram. From an analytical point of view, the idea is to aggregate the data in intervals of the form $[x_0, x_0 + h)$ and then use their relative frequency to approximate the density at $x \in [x_0, x_0 + h)$, $f(x)$, by the estimate of

$$f(x_0) = F'(x_0) = \lim_{h \to 0^+} \frac{F(x_0 + h) - F(x_0)}{h} = \lim_{h \to 0^+} \frac{\mathbb{P}[x_0 < X < x_0 + h]}{h}.$$
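As a quick numerical check of this limit, take $F = \Phi$, the standard normal cdf (pnorm in R), whose derivative is dnorm; the point $x_0$ and the sequence of bandwidths below are arbitrary choices:

# Difference quotients (F(x0 + h) - F(x0)) / h approach f(x0) as h decreases
x0 <- 1
h <- 10^(-(1:5))
cbind(h = h, diff_quotient = (pnorm(x0 + h) - pnorm(x0)) / h,
      f_x0 = dnorm(x0))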

More precisely, given an origin $t_0$ and a bandwidth $h > 0$, the histogram builds a piecewise constant function in the intervals $\{B_k := [t_k, t_{k+1}) : t_k = t_0 + hk, k \in \mathbb{Z}\}$ by counting the number of sample points inside each of them. These constant-length intervals are also called bins. The fact that they have constant length $h$ is important, since it allows standardizing by $h$ in order to have relative frequencies per length in the bins. The histogram at a point $x$ is defined as

$$\hat{f}_\mathrm{H}(x; t_0, h) := \frac{1}{nh} \sum_{i=1}^n \mathbb{1}_{\{X_i \in B_k : x \in B_k\}}. \tag{2.1}$$

Equivalently, if we denote the number of points in $B_k$ by $v_k$, then the histogram is $\hat{f}_\mathrm{H}(x; t_0, h) = \frac{v_k}{nh}$ if $x \in B_k$ for a certain $k \in \mathbb{Z}$.
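For concreteness, (2.1) can be coded directly. The following is a minimal sketch, where hist_manual is our own illustrative function, not part of base R:

# Sketch of (2.1): evaluates the histogram of samp at the points x,
# given the origin t0 and the bandwidth h
hist_manual <- function(x, samp, t0, h) {
  k <- floor((x - t0) / h)  # indexes of the bins B_k containing each x
  v_k <- sapply(k, function(j) {
    sum(samp >= t0 + j * h & samp < t0 + (j + 1) * h)  # points inside B_k
  })
  v_k / (length(samp) * h)  # v_k / (n * h)
}

# Evaluation at some points for the eruptions of the faithful dataset
hist_manual(x = c(2, 3, 4), samp = faithful$eruptions, t0 = 1.5, h = 0.5)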

The analysis of $\hat{f}_\mathrm{H}(x; t_0, h)$ as a random variable is simple, once it is recognized that the bin counts $v_k$ are distributed as $\mathrm{B}(n, p_k)$, with $p_k = \mathbb{P}[X \in B_k] = \int_{B_k} f(t)\,\mathrm{d}t$.¹ If $f$ is continuous, then by the mean value theorem, $p_k = h f(\xi_{k,h})$ for a $\xi_{k,h} \in (t_k, t_{k+1})$. Assume that $k \in \mathbb{Z}$ is such that $x \in B_k = [t_k, t_{k+1})$. Therefore:

$$\mathbb{E}[\hat{f}_\mathrm{H}(x; t_0, h)] = \frac{n p_k}{nh} = f(\xi_{k,h}), \qquad \mathrm{Var}[\hat{f}_\mathrm{H}(x; t_0, h)] = \frac{n p_k (1 - p_k)}{n^2 h^2} = \frac{f(\xi_{k,h})(1 - h f(\xi_{k,h}))}{nh}.$$

These results yield some interesting insights:

  1. If $h \to 0$, then $\xi_{k,h} \to x$, resulting in $f(\xi_{k,h}) \to f(x)$, and thus (2.1) becomes an asymptotically unbiased estimator of $f(x)$.
  2. But if $h \to 0$, the variance increases. Decreasing the variance requires $nh \to \infty$.
  3. The variance depends directly on $f(x)(1 - h f(x)) \approx f(x)$, hence there is more variability in regions with higher density.
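These insights can be checked empirically. Below is a small Monte Carlo sketch for a $\mathcal{N}(0, 1)$ population; the choices of M, n, h, and x are arbitrary:

# Monte Carlo approximation of the mean and variance of the histogram
# at x = 0.3, for samples of a N(0, 1)
M <- 1e4; n <- 100; t0 <- 0; h <- 0.5; x <- 0.3
k <- floor((x - t0) / h)  # x belongs to B_k = [t0 + k * h, t0 + (k + 1) * h)
f_hat <- replicate(M, {
  samp <- rnorm(n)
  sum(samp >= t0 + k * h & samp < t0 + (k + 1) * h) / (n * h)
})
p_k <- pnorm(t0 + (k + 1) * h) - pnorm(t0 + k * h)  # P[X in B_k]
c(mc_mean = mean(f_hat), exact_mean = p_k / h, f_x = dnorm(x))
c(mc_var = var(f_hat), exact_var = p_k * (1 - p_k) / (n * h^2))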

A more detailed analysis of the histogram can be found in Section 3.2.2 of Scott (2015). We skip it here, since the detailed asymptotic analysis of the more general kernel density estimator is given in Section 2.2.

The implementation of histograms is very simple in R. As an example, we consider the old-but-gold faithful dataset, which contains the durations of the eruptions and the waiting times between eruptions of the Old Faithful geyser in Yellowstone National Park (USA).

# The faithful dataset is included in R
head(faithful)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55

# Duration of eruption
faithE <- faithful$eruptions

# Default histogram: automatically chosen bins and absolute frequencies!
histo <- hist(faithE)


# List that contains several objects
str(histo)
## List of 6
##  $ breaks  : num [1:9] 1.5 2 2.5 3 3.5 4 4.5 5 5.5
##  $ counts  : int [1:8] 55 37 5 9 34 75 54 3
##  $ density : num [1:8] 0.4044 0.2721 0.0368 0.0662 0.25 ...
##  $ mids    : num [1:8] 1.75 2.25 2.75 3.25 3.75 4.25 4.75 5.25
##  $ xname   : chr "faithE"
##  $ equidist: logi TRUE
##  - attr(*, "class")= chr "histogram"

# With relative frequencies
hist(faithE, probability = TRUE)


# Choosing the breaks
# t0 = min(faithE), h = 0.25
Bk <- seq(min(faithE), max(faithE), by = 0.25)
hist(faithE, probability = TRUE, breaks = Bk)
rug(faithE) # Plotting the sample

The shape of the histogram depends on:

  • $t_0$, since the bin edges are located at $t_0 + hk$, $k \in \mathbb{Z}$;
  • h, which controls the bin size and the effective number of bins for aggregating the sample.

We focus first on exploring the dependence on $t_0$, as it serves to motivate the next density estimator.

# Uniform sample
set.seed(1234567)
u <- runif(n = 100)

# t0 = 0, h = 0.2
Bk1 <- seq(0, 1, by = 0.2)

# t0 = -0.1, h = 0.2
Bk2 <- seq(-0.1, 1.1, by = 0.2)

# Comparison
hist(u, probability = TRUE, breaks = Bk1, ylim = c(0, 1.5),
     main = "t0 = 0, h = 0.2")
rug(u)
abline(h = 1, col = 2)

hist(u, probability = TRUE, breaks = Bk2, ylim = c(0, 1.5),
     main = "t0 = -0.1, h = 0.2")
rug(u)
abline(h = 1, col = 2)

A high dependence on $t_0$ also arises when estimating densities that are not compactly supported. The next snippet of code illustrates this effect.

# Sample 100 points from a N(0, 1) and 50 from a N(3.25, 0.5)
set.seed(1234567)
samp <- c(rnorm(n = 100, mean = 0, sd = 1),
          rnorm(n = 50, mean = 3.25, sd = sqrt(0.5)))

# min and max for choosing Bk1 and Bk2
range(samp)
## [1] -2.107233  4.679118

# Comparison
Bk1 <- seq(-2.5, 5, by = 0.5)
Bk2 <- seq(-2.25, 5.25, by = 0.5)
hist(samp, probability = TRUE, breaks = Bk1, ylim = c(0, 0.5),
     main = "t0 = -2.5, h = 0.5")
rug(samp)

hist(samp, probability = TRUE, breaks = Bk2, ylim = c(0, 0.5),
     main = "t0 = -2.25, h = 0.5")
rug(samp)

2.1.2 Moving histogram

An alternative that avoids the dependence on $t_0$ is the moving histogram, also known as the naive density estimator. The idea is to aggregate the sample $X_1, \ldots, X_n$ in intervals of the form $(x - h, x + h)$ and then use its relative frequency in $(x - h, x + h)$ to approximate the density at $x$:

$$f(x) = F'(x) = \lim_{h \to 0^+} \frac{F(x + h) - F(x - h)}{2h} = \lim_{h \to 0^+} \frac{\mathbb{P}[x - h < X < x + h]}{2h}.$$

Recall the differences with the histogram: the intervals now depend on the evaluation point $x$ and are centered around it. This allows us to estimate $f(x)$ directly (without the proxy $f(x_0)$) by an estimate of the symmetric derivative.

More precisely, given a bandwidth $h > 0$, the naive density estimator builds a piecewise constant function by considering the relative frequency of $X_1, \ldots, X_n$ inside $(x - h, x + h)$:

$$\hat{f}_\mathrm{N}(x; h) := \frac{1}{2nh} \sum_{i=1}^n \mathbb{1}_{\{x - h < X_i < x + h\}}. \tag{2.2}$$

The function has $2n$ discontinuities, located at $X_i \pm h$.
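The estimator (2.2) is simple to code from scratch. A possible sketch follows, where naive_den is our own function name; note that $\mathbb{1}_{\{x - h < X_i < x + h\}} = \mathbb{1}_{\{|X_i - x| < h\}}$:

# Sketch of (2.2): the naive density estimator evaluated at the points x
naive_den <- function(x, samp, h) {
  sapply(x, function(xi) sum(abs(samp - xi) < h)) / (2 * length(samp) * h)
}

# Two bandwidths for the eruptions of the faithful dataset
x_grid <- seq(1, 6, by = 0.01)
plot(x_grid, naive_den(x_grid, faithful$eruptions, h = 0.25), type = "l",
     xlab = "x", ylab = "Density")
lines(x_grid, naive_den(x_grid, faithful$eruptions, h = 1), col = 2)
rug(faithful$eruptions)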

Similarly to the histogram, the analysis of $\hat{f}_\mathrm{N}(x; h)$ as a random variable follows from realizing that $\sum_{i=1}^n \mathbb{1}_{\{x - h < X_i < x + h\}} \sim \mathrm{B}(n, p_{x,h})$, with $p_{x,h} = \mathbb{P}[x - h < X < x + h] = F(x + h) - F(x - h)$. Then:

$$\mathbb{E}[\hat{f}_\mathrm{N}(x; h)] = \frac{F(x + h) - F(x - h)}{2h}, \qquad \mathrm{Var}[\hat{f}_\mathrm{N}(x; h)] = \frac{F(x + h) - F(x - h)}{4nh^2} - \frac{(F(x + h) - F(x - h))^2}{4nh^2}.$$

These two results provide interesting insights on the effect of h:

  1. If $h \to 0$, then $\mathbb{E}[\hat{f}_\mathrm{N}(x; h)] \to f(x)$ and (2.2) is an asymptotically unbiased estimator of $f(x)$. But also $\mathrm{Var}[\hat{f}_\mathrm{N}(x; h)] \approx \frac{f(x)}{2nh} - \frac{f(x)^2}{n}$, which increases as $h \to 0$.
  2. If $h \to \infty$, then $\mathbb{E}[\hat{f}_\mathrm{N}(x; h)] \to 0$ and $\mathrm{Var}[\hat{f}_\mathrm{N}(x; h)] \to 0$.
  3. The variance shrinks to zero if $nh \to \infty$ (or, in other words, if $h^{-1} = o(n)$, i.e., if $h^{-1}$ grows more slowly than $n$). So both the bias and the variance can be reduced if $n \to \infty$, $h \to 0$, and $nh \to \infty$, simultaneously; see the simulation sketch after this list.
  4. The variance is (almost) proportional to $f(x)$.
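The following simulation sketches the third insight with the arbitrary rate $h = n^{-1/2}$, so that $h \to 0$ and $nh = n^{1/2} \to \infty$; both the bias and the variance of $\hat{f}_\mathrm{N}(x; h)$ at $x = 0$ shrink for a $\mathcal{N}(0, 1)$:

# Bias and variance of the naive estimator at x = 0 for a N(0, 1),
# with h = n^(-1/2), so that h -> 0 and nh -> infinity as n grows
M <- 1e3; x <- 0
set.seed(123456)
for (n in c(100, 1000, 10000)) {
  h <- n^(-1/2)
  f_hat <- replicate(M, sum(abs(rnorm(n) - x) < h) / (2 * n * h))
  cat("n =", n, "; bias =", mean(f_hat) - dnorm(x),
      "; variance =", var(f_hat), "\n")
}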

The animation in Figure 2.1 illustrates the previous points and gives insight into how the performance of (2.2) varies smoothly with $h$.

Figure 2.1: Bias and variance for the moving histogram. The animation shows how for small bandwidths the bias of $\hat{f}_\mathrm{N}(x; h)$ on estimating $f(x)$ is small, but the variance is high, and how for large bandwidths the bias is large and the variance is small. The variance is represented by the asymptotic 95% confidence intervals for $\hat{f}_\mathrm{N}(x; h)$.
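A static snapshot of what the animation displays can be produced by reusing the naive_den sketch from above, plugging the asymptotic approximation $\mathrm{Var}[\hat{f}_\mathrm{N}(x; h)] \approx \frac{f(x)}{2nh}$ into normal confidence intervals; n, h, and the grid are arbitrary choices:

# Naive estimator for a N(0, 1) sample, with pointwise asymptotic 95%
# confidence intervals based on Var[f_N(x; h)] approx f(x) / (2 * n * h)
set.seed(12345)
n <- 200; h <- 0.3
samp <- rnorm(n)
x_grid <- seq(-3, 3, by = 0.01)
f_N <- naive_den(x_grid, samp, h)
ci <- qnorm(0.975) * sqrt(f_N / (2 * n * h))
plot(x_grid, f_N, type = "l", ylim = c(0, 0.6), xlab = "x", ylab = "Density")
lines(x_grid, dnorm(x_grid), col = 2)  # true density
lines(x_grid, f_N + ci, lty = 2)
lines(x_grid, f_N - ci, lty = 2)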

The estimator (2.2) raises an important question: why give the same weight to all the points $X_1, \ldots, X_n$ in $(x - h, x + h)$? After all, we are estimating $f(x) = F'(x)$ by estimating $\frac{F(x + h) - F(x - h)}{2h}$ through the relative frequency of $X_1, \ldots, X_n$ in $(x - h, x + h)$. Should not the data points closer to $x$ be more important than those further away? The answer to this question shows that (2.2) is indeed a particular case of a wider class of density estimators.
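Indeed, anticipating Section 2.2, (2.2) can be numerically checked to be a kernel density estimator with a uniform kernel. A quick sketch with R's density follows; recall that density's bw is the standard deviation of the kernel, and the uniform kernel on $(-h, h)$ has standard deviation $h / \sqrt{3}$, so the match with the naive_den sketch is exact up to density's internal binning:

# The naive estimator as a kernel density estimator with a uniform kernel
h <- 0.25
kde <- density(faithful$eruptions, bw = h / sqrt(3), kernel = "rectangular")
plot(kde, main = "Naive estimator via density()")
lines(kde$x, naive_den(kde$x, faithful$eruptions, h = h), col = 2, lty = 2)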

References

Scott, D. W. 2015. Multivariate Density Estimation. Second. Wiley Series in Probability and Statistics. Hoboken: John Wiley & Sons, Inc.

  1. Note that it is key that the $\{B_k\}$ are deterministic (and not sample-dependent) for this result to hold.