2.1 Histograms
2.1.1 Histogram
The simplest method to estimate a density $f$ from an iid sample $X_1, \ldots, X_n$ is the histogram. From an analytical point of view, the idea is to aggregate the data in intervals of the form $[x, x + h)$ and then use their relative frequency to approximate the density at $x$, $f(x)$, by the estimate of

$$f(x) = F'(x) = \lim_{h \to 0^+} \frac{F(x + h) - F(x)}{h} = \lim_{h \to 0^+} \frac{\mathbb{P}[x < X < x + h]}{h}.$$
More precisely, given an origin $t_0$ and a bandwidth $h > 0$, the histogram builds a piecewise constant function in the intervals $\{B_k := [t_k, t_{k+1}) : t_k = t_0 + hk, \, k \in \mathbb{Z}\}$ by counting the number of sample points inside each of them. These constant-length intervals are also denoted bins. The fact that they are of constant length $h$ is important, since it allows to standardize by $h$ in order to have relative frequencies per length. The histogram at a point $x$ is defined as

$$\hat{f}_{\mathrm{H}}(x; t_0, h) := \frac{1}{nh} \sum_{i=1}^n 1_{\{X_i \in B_k : x \in B_k\}}. \tag{2.1}$$
Equivalently, if we denote the number of points in $B_k$ as $v_k$, then the histogram is $\hat{f}_{\mathrm{H}}(x; t_0, h) = \frac{v_k}{nh}$ if $x \in B_k$ for a $k \in \mathbb{Z}$.
The analysis of $\hat{f}_{\mathrm{H}}(x; t_0, h)$ as a random variable is simple, once it is recognized that the bin counts $v_k$ are distributed as $\mathrm{B}(n, p_k)$, with $p_k := \mathbb{P}(X \in B_k)$.⁴ If $f$ is continuous, then by the mean value theorem, $p_k = h f(\xi_{k,h})$ for a $\xi_{k,h} \in (t_k, t_{k+1})$. Assume that $x$ is such that $x \in B_k$. Therefore:

$$\begin{aligned}
\mathbb{E}[\hat{f}_{\mathrm{H}}(x; t_0, h)] &= \frac{np_k}{nh} = f(\xi_{k,h}),\\
\mathbb{V}\mathrm{ar}[\hat{f}_{\mathrm{H}}(x; t_0, h)] &= \frac{np_k(1 - p_k)}{n^2h^2} = \frac{f(\xi_{k,h})(1 - hf(\xi_{k,h}))}{nh}.
\end{aligned}$$
The results above provide interesting insights:
- If $h \to 0$, then $\xi_{k,h} \to x$, resulting in $f(\xi_{k,h}) \to f(x)$, and thus (2.1) becomes an asymptotically unbiased estimator of $f(x)$.
- But if $h \to 0$, the variance increases. For decreasing the variance, $nh \to \infty$ is required.
- The variance is directly dependent on $f(\xi_{k,h}) \approx f(x)$, hence there is more variability at regions with higher density.
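The two expressions above can be verified by simulation. The following sketch approximates the mean and variance of the histogram at a fixed $x$ by Monte Carlo and compares them with the Binomial-based formulas; the N(0, 1) model, the evaluation point, and all object names are illustrative choices.

```r
# Monte Carlo check of the histogram's mean and variance at a fixed point
# Illustrative sketch: t0, h, x, and the N(0, 1) model are arbitrary choices
set.seed(42)
M <- 5000    # Monte Carlo replicates
n <- 200     # sample size
t0 <- 0; h <- 0.5
x <- 0.25    # evaluation point
k <- floor((x - t0) / h)  # bin index such that x belongs to B_k
tk <- t0 + k * h
# p_k = P(X in B_k) for X ~ N(0, 1)
pk <- pnorm(tk + h) - pnorm(tk)
# M replicates of the histogram at x: v_k / (n * h)
fhat <- replicate(M, {
  X <- rnorm(n)
  sum(X >= tk & X < tk + h) / (n * h)
})
# Compare with the exact Binomial-based expressions
c(mc_mean = mean(fhat), theo_mean = pk / h)
c(mc_var = var(fhat), theo_var = pk * (1 - pk) / (n * h^2))
```

The Monte Carlo mean and variance should be close to $\frac{p_k}{h}$ and $\frac{p_k(1 - p_k)}{nh^2}$, respectively.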
A more detailed analysis of the histogram can be seen in Section 3.2.2 of Scott (2015). We skip it here since the detailed asymptotic analysis for the more general kernel density estimator is given in Section 2.2.
The implementation of histograms is very simple in R. As an example, we consider the old-but-gold dataset faithful. This dataset contains the duration of the eruptions and the waiting times between eruptions for the Old Faithful geyser in Yellowstone National Park (USA).
# The faithful dataset is included in R
head(faithful)
## eruptions waiting
## 1 3.600 79
## 2 1.800 54
## 3 3.333 74
## 4 2.283 62
## 5 4.533 85
## 6 2.883 55
# Duration of eruption
faithE <- faithful$eruptions
# Default histogram: automatically chosen bins and absolute frequencies!
histo <- hist(faithE)
# List that contains several objects
str(histo)
## List of 6
## $ breaks : num [1:9] 1.5 2 2.5 3 3.5 4 4.5 5 5.5
## $ counts : int [1:8] 55 37 5 9 34 75 54 3
## $ density : num [1:8] 0.4044 0.2721 0.0368 0.0662 0.25 ...
## $ mids : num [1:8] 1.75 2.25 2.75 3.25 3.75 4.25 4.75 5.25
## $ xname : chr "faithE"
## $ equidist: logi TRUE
## - attr(*, "class")= chr "histogram"
# With relative frequencies
hist(faithE, probability = TRUE)
# Choosing the breaks
# t0 = min(faithE), h = 0.25
Bk <- seq(min(faithE), max(faithE), by = 0.25)
hist(faithE, probability = TRUE, breaks = Bk)
rug(faithE) # Plotting the sample
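As a sanity check, the densities reported by hist() can be recomputed from the definition $\frac{v_k}{nh}$. The snippet below is an illustrative sketch; the bin counting via cut() is just one way of obtaining the $v_k$.

```r
# Recompute the histogram densities from the definition v_k / (n * h)
faithE <- faithful$eruptions
h <- 0.25
Bk <- seq(min(faithE), max(faithE), by = h)
H <- hist(faithE, breaks = Bk, plot = FALSE)
# Bin counts v_k (cut() uses the same right-closed convention as hist())
vk <- as.numeric(table(cut(faithE, breaks = Bk, include.lowest = TRUE)))
n <- length(faithE)
all.equal(H$density, vk / (n * h))  # TRUE
```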
The shape of the histogram depends on:

- $t_0$, since the separation between bins happens at $t_0 + kh$, $k \in \mathbb{Z}$;
- $h$, which controls the bin size and the effective number of bins for aggregating the sample.

We focus first on exploring the dependence on $t_0$, as it serves to motivate the next density estimator.
# Uniform sample
set.seed(1234567)
u <- runif(n = 100)
# t0 = 0, h = 0.2
Bk1 <- seq(0, 1, by = 0.2)
# t0 = -0.1, h = 0.2
Bk2 <- seq(-0.1, 1.1, by = 0.2)
# Comparison
hist(u, probability = TRUE, breaks = Bk1, ylim = c(0, 1.5),
main = "t0 = 0, h = 0.2")
rug(u)
abline(h = 1, col = 2)
hist(u, probability = TRUE, breaks = Bk2, ylim = c(0, 1.5),
main = "t0 = -0.1, h = 0.2")
rug(u)
abline(h = 1, col = 2)
High dependence on $t_0$ also happens when estimating densities that are not compactly supported. The next snippet of code illustrates this.
# Sample 100 points from a N(0, 1) and 50 from a N(3.25, 0.5)
set.seed(1234567)
samp <- c(rnorm(n = 100, mean = 0, sd = 1),
          rnorm(n = 50, mean = 3.25, sd = sqrt(0.5)))
# min and max for choosing Bk1 and Bk2
range(samp)
## [1] -2.107233 4.679118
# Comparison
Bk1 <- seq(-2.5, 5, by = 0.5)
Bk2 <- seq(-2.25, 5.25, by = 0.5)
hist(samp, probability = TRUE, breaks = Bk1, ylim = c(0, 0.5),
main = "t0 = -2.5, h = 0.5")
rug(samp)
hist(samp, probability = TRUE, breaks = Bk2, ylim = c(0, 0.5),
main = "t0 = -2.25, h = 0.5")
rug(samp)
2.1.2 Moving histogram
An alternative to avoid the dependence on $t_0$ is the moving histogram, or naive density estimator. The idea is to aggregate the sample $X_1, \ldots, X_n$ in intervals of the form $(x - h, x + h)$ and then use its relative frequency in $(x - h, x + h)$ to approximate the density at $x$:

$$f(x) = F'(x) = \lim_{h \to 0^+} \frac{F(x + h) - F(x - h)}{2h} = \lim_{h \to 0^+} \frac{\mathbb{P}[x - h < X < x + h]}{2h}.$$
Recall the differences with the histogram: the intervals depend on the evaluation point $x$ and are centered around it. This allows us to directly estimate $f(x)$ (without the proxy $f(\xi_{k,h})$) by an estimate of the symmetric derivative of $F$ at $x$.
More precisely, given a bandwidth $h > 0$, the naive density estimator builds a piecewise constant function by considering the relative frequency of $X_1, \ldots, X_n$ inside $(x - h, x + h)$:

$$\hat{f}_{\mathrm{N}}(x; h) := \frac{1}{2nh} \sum_{i=1}^n 1_{\{x - h < X_i < x + h\}}. \tag{2.2}$$
The function $\hat{f}_{\mathrm{N}}(\cdot; h)$ has discontinuities that are located at $X_i \pm h$, $i = 1, \ldots, n$.
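The naive estimator is straightforward to implement from its definition. Below is a minimal sketch (the function name naive_den is made up for illustration, not part of any package), evaluated on the faithful eruption times.

```r
# Naive (moving histogram) density estimator: illustrative sketch
# fhat_N(x; h) = (1 / (2 * n * h)) * sum_i 1{x - h < X_i < x + h}
naive_den <- function(x, X, h) {
  sapply(x, function(xi) sum(xi - h < X & X < xi + h) / (2 * length(X) * h))
}

# Example on the faithful eruptions
faithE <- faithful$eruptions
xgrid <- seq(1, 6, by = 0.01)
plot(xgrid, naive_den(x = xgrid, X = faithE, h = 0.25), type = "l",
     xlab = "x", ylab = "Density")
rug(faithE)  # Plotting the sample
```

Note the piecewise constant shape of the resulting curve, with jumps at the points $X_i \pm h$.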
Similarly to the histogram, the analysis of $\hat{f}_{\mathrm{N}}(x; h)$ as a random variable follows from realizing that $\sum_{i=1}^n 1_{\{x - h < X_i < x + h\}} \sim \mathrm{B}(n, p_{x,h})$, with $p_{x,h} := \mathbb{P}(x - h < X < x + h) = F(x + h) - F(x - h)$. Then:

$$\begin{aligned}
\mathbb{E}[\hat{f}_{\mathrm{N}}(x; h)] &= \frac{F(x + h) - F(x - h)}{2h},\\
\mathbb{V}\mathrm{ar}[\hat{f}_{\mathrm{N}}(x; h)] &= \frac{F(x + h) - F(x - h)}{4nh^2} - \frac{(F(x + h) - F(x - h))^2}{4nh^2}.
\end{aligned}$$
These two results provide interesting insights on the effect of $h$:
- If $h \to 0$, then $\mathbb{E}[\hat{f}_{\mathrm{N}}(x; h)] \to f(x)$ and (2.2) is an asymptotically unbiased estimator of $f(x)$. But also $\mathbb{V}\mathrm{ar}[\hat{f}_{\mathrm{N}}(x; h)] \to \infty$.
- If $h \to \infty$, then $\mathbb{E}[\hat{f}_{\mathrm{N}}(x; h)] \to 0$ and $\mathbb{V}\mathrm{ar}[\hat{f}_{\mathrm{N}}(x; h)] \to 0$.
- The variance shrinks to zero if $nh \to \infty$ (or, in other words, if $h^{-1} = o(n)$, i.e., if $h^{-1}$ grows more slowly than $n$). So both the bias and the variance can be reduced if $n \to \infty$, $h \to 0$, and $nh \to \infty$ simultaneously.
- The variance is (almost) proportional to $f(x)$, since $\mathbb{V}\mathrm{ar}[\hat{f}_{\mathrm{N}}(x; h)] \approx \frac{f(x)}{2nh}$ for small $h$.
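These effects can be illustrated numerically. The sketch below, with arbitrary illustrative settings (a N(0, 1) density, $x = 0$, and three bandwidths), approximates the bias and variance of $\hat{f}_{\mathrm{N}}(x; h)$ by Monte Carlo.

```r
# Monte Carlo bias and variance of the naive estimator at x = 0 for a N(0, 1)
# Illustrative settings; object names are made up for the example
set.seed(42)
M <- 2000; n <- 200; x <- 0
fx <- dnorm(x)  # true density value at x
hs <- c(0.05, 0.5, 2)
res <- sapply(hs, function(h) {
  fhat <- replicate(M, {
    X <- rnorm(n)
    sum(x - h < X & X < x + h) / (2 * n * h)
  })
  c(bias = mean(fhat) - fx, var = var(fhat))
})
colnames(res) <- paste0("h = ", hs)
round(res, 5)
# Small h: nearly unbiased but high variance; large h: biased but low variance
```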
The animation in Figure 2.1 illustrates the previous points and gives insight on how the performance of (2.2) varies smoothly with $h$.
Figure 2.1: Bias and variance for the moving histogram. The animation shows how for small bandwidths the bias of $\hat{f}_{\mathrm{N}}(x; h)$ on estimating $f(x)$ is small, but the variance is high, and how for large bandwidths the bias is large and the variance is small. The variance is represented by the asymptotic confidence intervals for $\hat{f}_{\mathrm{N}}(x; h)$. Application also available here.
The estimator (2.2) raises an important question: why give the same weight to all the points in $(x - h, x + h)$? After all, we are estimating $f(x) = F'(x)$ by estimating the symmetric derivative of $F$ at $x$ through the relative frequency of the sample in $(x - h, x + h)$. Should not the data points closer to $x$ be more important than the ones further away? The answer to this question shows that (2.2) is indeed a particular case of a wider class of density estimators.
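Indeed, $\hat{f}_{\mathrm{N}}$ assigns the uniform weight $\frac{1}{2h}$ on $(x - h, x + h)$ to each data point, which can be replicated with R's density() and its rectangular kernel. One caveat, documented in ?density: bandwidths there are parametrized by the standard deviation of the kernel, and a uniform on $(-h, h)$ has standard deviation $h / \sqrt{3}$, hence the rescaling below. The comparison is a sketch: density() bins the data internally, so the match is close but not exact.

```r
# The naive estimator as a kernel density estimator with a uniform kernel
faithE <- faithful$eruptions
h <- 0.25
# density() scales kernels to have sd = bw; a uniform on (-h, h) has
# sd = h / sqrt(3), so bw = h / sqrt(3) mimics the naive estimator
kde <- density(faithE, bw = h / sqrt(3), kernel = "rectangular")
# Manual naive estimator on the same evaluation grid
fN <- sapply(kde$x, function(xi)
  sum(xi - h < faithE & faithE < xi + h) / (2 * length(faithE) * h))
max(abs(kde$y - fN))  # close to zero (differences due to density()'s binning)
```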
References
Scott, D. W. (2015). Multivariate Density Estimation: Theory, Practice, and Visualization. Second edition. Wiley.
Note that it is key that the $\{B_k\}$ are deterministic (and not sample-dependent) for this result to hold.↩︎