2.1 Histograms
2.1.1 Histogram
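The simplest method to estimate a density $f$ from an iid sample $X_1,\ldots,X_n$ is the histogram. From an analytical point of view, the idea is to aggregate the data in intervals of the form $[x_0,x_0+h)$ and then use their relative frequency to approximate the density at $x\in[x_0,x_0+h)$, $f(x)$, by the estimate of[15]
$$f(x_0)=F'(x_0)=\lim_{h\to0^+}\frac{F(x_0+h)-F(x_0)}{h}=\lim_{h\to0^+}\frac{\mathbb{P}[x_0<X<x_0+h]}{h}.$$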
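More precisely, given an origin $t_0$ and a bandwidth $h>0$, the histogram builds a piecewise constant function in the intervals $\{B_k:=[t_k,t_{k+1}):t_k=t_0+hk,\,k\in\mathbb{Z}\}$ by counting the number of sample points inside each of them. These constant-length intervals are also called bins. The fact that they are of constant length $h$ is important: we can easily standardize the counts on any bin by $h$ in order to have relative frequencies per length[16] in the bins. The histogram at a point $x$ is defined as
$$\hat f_\mathrm{H}(x;t_0,h):=\frac{1}{nh}\sum_{i=1}^n 1_{\{X_i\in B_k:\,x\in B_k\}}. \tag{2.1}$$
Equivalently, if we denote the number of observations $X_1,\ldots,X_n$ in $B_k$ as $v_k$, then the histogram can be written as
$$\hat f_\mathrm{H}(x;t_0,h)=\frac{v_k}{nh},\quad\text{if }x\in B_k\text{ for a certain }k\in\mathbb{Z}.$$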
The computation of histograms is straightforward in R. As an example, we consider the old-but-gold faithful dataset. This dataset contains the duration of the eruption and the waiting time between eruptions for the Old Faithful geyser in Yellowstone National Park (USA).
# The faithful dataset is included in R
head(faithful)
## eruptions waiting
## 1 3.600 79
## 2 1.800 54
## 3 3.333 74
## 4 2.283 62
## 5 4.533 85
## 6 2.883 55
# Duration of eruption
faith_eruptions <- faithful$eruptions
# Default histogram: automatically chosen bins and absolute frequencies!
histo <- hist(faith_eruptions)
# List that contains several objects
str(histo)
## List of 6
## $ breaks : num [1:9] 1.5 2 2.5 3 3.5 4 4.5 5 5.5
## $ counts : int [1:8] 55 37 5 9 34 75 54 3
## $ density : num [1:8] 0.4044 0.2721 0.0368 0.0662 0.25 ...
## $ mids : num [1:8] 1.75 2.25 2.75 3.25 3.75 4.25 4.75 5.25
## $ xname : chr "faith_eruptions"
## $ equidist: logi TRUE
## - attr(*, "class")= chr "histogram"
# With relative frequencies
hist(faith_eruptions, probability = TRUE)
# Choosing the breaks
# t0 = min(faith_eruptions), h = 0.25
Bk <- seq(min(faith_eruptions), max(faith_eruptions), by = 0.25)
hist(faith_eruptions, probability = TRUE, breaks = Bk)
rug(faith_eruptions) # Plotting the sample
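The `density` component returned by `hist()` can be checked against the definition of the histogram as bin counts divided by the sample size times the bin length. The following is a minimal sketch under that reading: `hist_est` is a hypothetical helper (not part of R), and the values `t0 = 1.5` and `h = 0.5` are assumptions chosen to mimic the default breaks shown above. Since `hist()` uses right-closed bins by default, `right = FALSE` is set so that the bins are of the form [t_k, t_{k+1}).

```r
# Histogram estimator with origin t0 and bandwidth h
# (hist_est is a hypothetical helper, not a base R function)
hist_est <- function(x, samp, t0, h) {
  # Index k of the bin [t0 + k * h, t0 + (k + 1) * h) containing each point
  k_x <- floor((x - t0) / h)
  k_samp <- floor((samp - t0) / h)
  # Bin count v_k divided by n * h, for the bin of each evaluation point
  sapply(k_x, function(k) sum(k_samp == k)) / (length(samp) * h)
}

# Assumed origin and bandwidth, mimicking the default breaks of hist()
samp <- faithful$eruptions
t0 <- 1.5
h <- 0.5

# Reference histogram with left-closed bins and no plotting
ref <- hist(samp, breaks = seq(t0, 5.5, by = h), right = FALSE, plot = FALSE)

# Manual estimate at the bin midpoints versus the output of hist()
est <- hist_est(x = ref$mids, samp = samp, t0 = t0, h = h)
max(abs(est - ref$density))

# The density component is the count standardized by n * h
max(abs(ref$density - ref$counts / (length(samp) * h)))
```

Both maximum differences should be numerically zero. Because the estimate is constant within each bin, evaluating it at the midpoints is enough to compare the whole function.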
Exercise 2.1 For iris$Sepal.Length, compute:
- The histogram of relative frequencies with five bins.
- The histogram of absolute frequencies with and
Add the rug of the data for both histograms.
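The analysis of $\hat f_\mathrm{H}(x;t_0,h)$ as a random variable is simple, once one recognizes that the bin counts $v_k$ are distributed as $\mathrm{B}(n,p_k)$, with $p_k:=\mathbb{P}[X\in B_k]=\int_{B_k}f(t)\,\mathrm{d}t$.[17] If $f$ is continuous, then by the integral mean value theorem, $p_k=hf(\xi_{k,h})$ for a $\xi_{k,h}\in(t_k,t_{k+1})$. Assume that $k\in\mathbb{Z}$ is such that $x\in B_k=[t_k,t_{k+1})$.[18] Therefore:
$$\begin{aligned}\mathbb{E}[\hat f_\mathrm{H}(x;t_0,h)]&=\frac{np_k}{nh}=f(\xi_{k,h}),\\ \mathbb{V}\mathrm{ar}[\hat f_\mathrm{H}(x;t_0,h)]&=\frac{np_k(1-p_k)}{n^2h^2}=\frac{f(\xi_{k,h})(1-hf(\xi_{k,h}))}{nh}.\end{aligned} \tag{2.2}$$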
The results above yield interesting insights:
- If $h\to0$, then $\xi_{k,h}\to x$,[19] resulting in $f(\xi_{k,h})\to f(x)$, and thus (2.1) becomes an asymptotically (when $h\to0$) unbiased estimator of $f(x)$.
- But if $h\to0$, the variance increases. For decreasing the variance, $nh\to\infty$ is required.
- The variance is directly dependent on $f(\xi_{k,h})(1-hf(\xi_{k,h}))\to f(x)$ (as $h\to0$), hence there is more variability at regions with higher density.
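These insights can be checked numerically. Since the count of the bin containing $x$ is binomial with success probability $p_k$, the histogram at $x$ has mean $p_k/h$ and variance $p_k(1-p_k)/(nh^2)$. The sketch below compares those values with Monte Carlo estimates. The standard normal density, the values of `n`, `h`, `t0`, and `x`, and the number of replicates `M` are all assumptions made for illustration.

```r
set.seed(1)
n <- 100   # Sample size (assumed)
h <- 0.5   # Bandwidth (assumed)
t0 <- 0    # Origin (assumed)
x <- 0.3   # Evaluation point (assumed)
M <- 2000  # Monte Carlo replicates (assumed)

# Left end of the bin [t_k, t_k + h) that contains x
t_k <- t0 + floor((x - t0) / h) * h

# Bin probability under a N(0, 1) and the resulting exact mean and variance
p_k <- pnorm(t_k + h) - pnorm(t_k)
mean_theo <- p_k / h
var_theo <- p_k * (1 - p_k) / (n * h^2)

# Histogram estimate at x for M independent samples of size n
est <- replicate(M, {
  samp <- rnorm(n)
  sum(samp >= t_k & samp < t_k + h) / (n * h)
})

# Exact versus simulated mean, alongside the target density value
c(mean_theo = mean_theo, mean_mc = mean(est), f_x = dnorm(x))
# Exact versus simulated variance
c(var_theo = var_theo, var_mc = var(est))
```

Rerunning the sketch with a smaller `h` should bring the mean closer to `dnorm(x)` while inflating the variance, unless `n` is increased so that `n * h` also grows.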
A more detailed analysis of the histogram can be seen in Section 3.2.2 in Scott (2015). We skip it since the detailed asymptotic analysis for the more general kernel density estimator will be given in Section 2.2.
Exercise 2.2 Given (2.2), obtain $\mathrm{MSE}[\hat f_\mathrm{H}(x;t_0,h)]$. What should happen in order to have $\mathrm{MSE}[\hat f_\mathrm{H}(x;t_0,h)]\to0$?
Clearly, the shape of the histogram depends on:
- $t_0$, since the separation between bins happens at $t_0+kh$, $k\in\mathbb{Z}$;
- $h$, which controls the bin size and the effective number of bins for aggregating the sample.
We focus first on exploring the dependence on $t_0$, as it serves for motivating the next density estimator.
# Sample from a U(0, 1)
set.seed(1234567)
u <- runif(n = 100)
# Bins for t0 = 0, h = 0.2
Bk1 <- seq(0, 1, by = 0.2)
# Bins for t0 = -0.1, h = 0.2
Bk2 <- seq(-0.1, 1.1, by = 0.2)
# Comparison of histograms for different t0's
hist(u, probability = TRUE, breaks = Bk1, ylim = c(0, 1.5),
     main = "t0 = 0, h = 0.2")
rug(u)
abline(h = 1, col = 2)
hist(u, probability = TRUE, breaks = Bk2, ylim = c(0, 1.5),
     main = "t0 = -0.1, h = 0.2")
rug(u)
abline(h = 1, col = 2)


Figure 2.2: The dependence of the histogram on the origin $t_0$. The red curve represents the uniform pdf.
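The effect of the origin in this uniform example can also be quantified without simulation, since the expected height of the histogram at a point is the probability of the bin containing it divided by the bin length. The sketch below is an illustration under that fact. `exp_hist_unif` is a hypothetical helper, and the evaluation points are arbitrary choices.

```r
# Expected histogram height at x under a U(0, 1): the probability of the bin
# containing x divided by h (exp_hist_unif is a hypothetical helper)
exp_hist_unif <- function(x, t0, h) {
  # Left end of the bin [t_k, t_k + h) that contains x
  t_k <- t0 + floor((x - t0) / h) * h
  (punif(t_k + h) - punif(t_k)) / h
}

# Points near the left boundary, in the interior, and near the right boundary
x <- c(0.05, 0.45, 0.95)

# Origin aligned with the support: every bin lies inside [0, 1]
e1 <- exp_hist_unif(x = x, t0 = 0, h = 0.2)
# Shifted origin: the first and last bins straddle the support
e2 <- exp_hist_unif(x = x, t0 = -0.1, h = 0.2)
rbind(x = x, "t0 = 0" = e1, "t0 = -0.1" = e2)
```

With the shifted origin, half of each boundary bin falls outside the support, so the expected height there is halved even though the density is flat.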
High dependence on $t_0$ also happens when estimating densities that are not compactly supported. The following snippet of code points towards it.
# Sample 50 points from a N(0, 1) and 25 from a N(3.25, 0.5)
set.seed(1234567)
samp <- c(rnorm(n = 50, mean = 0, sd = 1),
          rnorm(n = 25, mean = 3.25, sd = sqrt(0.5)))
# min and max for choosing Bk1 and Bk2
range(samp)
## [1] -2.082486 4.344547
# Comparison
Bk1 <- seq(-2.5, 5, by = 0.5)
Bk2 <- seq(-2.25, 5.25, by = 0.5)
hist(samp, probability = TRUE, breaks = Bk1, ylim = c(0, 0.5),
     main = "t0 = -2.5, h = 0.5")
curve(2/3 * dnorm(x, mean = 0, sd = 1) +
        1/3 * dnorm(x, mean = 3.25, sd = sqrt(0.5)), col = 2, add = TRUE,
      n = 200)
rug(samp)
hist(samp, probability = TRUE, breaks = Bk2, ylim = c(0, 0.5),
     main = "t0 = -2.25, h = 0.5")
curve(2/3 * dnorm(x, mean = 0, sd = 1) +
        1/3 * dnorm(x, mean = 3.25, sd = sqrt(0.5)), col = 2, add = TRUE,
      n = 200)
rug(samp)


Figure 2.3: The dependence of the histogram on the origin $t_0$ for non-compactly supported densities. The red curve represents the underlying pdf, a mixture of two normals.
Clearly, the subjectivity introduced by the dependence on $t_0$ is something that we would like to get rid of. We can do so by allowing the bins to be dependent on $x$ (the point at which we want to estimate $f(x)$), rather than fixing them beforehand.
2.1.2 Moving histogram
An alternative to avoid the dependence on $t_0$ is the moving histogram, also known as the naive density estimator.[20] The idea is to aggregate the sample $X_1,\ldots,X_n$ in intervals of the form $(x-h,x+h)$ and then use its relative frequency in $(x-h,x+h)$ to approximate the density at $x$, which can be expressed as
$$f(x)=F'(x)=\lim_{h\to0^+}\frac{F(x+h)-F(x-h)}{2h}=\lim_{h\to0^+}\frac{\mathbb{P}[x-h<X<x+h]}{2h}.$$
Recall the differences with the histogram: the intervals depend on the evaluation point $x$ and are centered about it. This allows us to directly estimate $f(x)$ (without the proxy $f(x_0)$) using an estimate based on the symmetric derivative of $F$ at $x$, instead of employing an estimate based on the forward derivative of $F$ at $x_0$.
More precisely, given a bandwidth $h>0$, the naive density estimator builds a piecewise constant function by considering the relative frequency of $X_1,\ldots,X_n$ inside $(x-h,x+h)$:
$$\hat f_\mathrm{N}(x;h):=\frac{1}{2nh}\sum_{i=1}^n 1_{\{x-h<X_i<x+h\}}. \tag{2.3}$$
Figure 2.4 shows the moving histogram for the same sample used in Figure 2.3, clearly revealing the remarkable improvement with respect to the histograms shown when estimating the underlying density.

Figure 2.4: The naive density estimator (black curve). The red curve represents the underlying pdf, a mixture of two normals.
Exercise 2.3 Is $\hat f_\mathrm{N}(\cdot;h)$ continuous in general? Justify your answer. If the answer is negative, then:
- What is the maximum number of discontinuities it may have?
- What should happen to have fewer discontinuities than its maximum?
Exercise 2.4 Implement your own version of the moving histogram in R. It must be a function that takes as inputs:
- a vector with the evaluation points
- sample
- bandwidth
The function must return (2.3) evaluated for each Test the implementation by comparing the density of a when estimated with observations.
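Analogously to the histogram, the analysis of $\hat f_\mathrm{N}(x;h)$ as a random variable follows from realizing that
$$\sum_{i=1}^n 1_{\{x-h<X_i<x+h\}}\sim\mathrm{B}(n,p_{x,h}),\quad p_{x,h}:=\mathbb{P}[x-h<X<x+h]=F(x+h)-F(x-h).$$
Then:
$$\mathbb{E}[\hat f_\mathrm{N}(x;h)]=\frac{F(x+h)-F(x-h)}{2h}, \tag{2.4}$$
$$\mathbb{V}\mathrm{ar}[\hat f_\mathrm{N}(x;h)]=\frac{F(x+h)-F(x-h)}{4nh^2}-\frac{(F(x+h)-F(x-h))^2}{4nh^2}. \tag{2.5}$$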
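Results (2.4) and (2.5) provide interesting insights into the effect of $h$:
1. If $h\to0$, then:
   - $\mathbb{E}[\hat f_\mathrm{N}(x;h)]\to f(x)$ and (2.3) is an asymptotically unbiased estimator of $f(x)$.
   - $\mathbb{V}\mathrm{ar}[\hat f_\mathrm{N}(x;h)]\approx\frac{f(x)}{2nh}-\frac{f(x)^2}{n}\to\infty$.
2. If $h\to\infty$, then:
   - $\mathbb{E}[\hat f_\mathrm{N}(x;h)]\to0$.
   - $\mathbb{V}\mathrm{ar}[\hat f_\mathrm{N}(x;h)]\to0$.
3. The variance shrinks to zero if $nh\to\infty$.[21] So both the bias and the variance can be reduced if $n\to\infty$, $h\to0$, and $nh\to\infty$ simultaneously.
4. The variance is almost proportional[22] to $f(x)$.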
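The bias and variance trade-off described by these points can be illustrated with a small simulation. The sketch below relies only on the binomial distribution of the number of observations in $(x-h,x+h)$, whose success probability is $p=F(x+h)-F(x-h)$, so that the estimator at $x$ has mean $p/(2h)$ and variance $p(1-p)/(4nh^2)$. The standard normal density, the evaluation point, the sample size, the bandwidths, and the number of replicates are assumptions chosen for illustration.

```r
set.seed(1)
n <- 100                      # Sample size (assumed)
x <- 0                        # Evaluation point (assumed)
M <- 2000                     # Monte Carlo replicates (assumed)
bandwidths <- c(0.1, 0.5, 2)  # Small, moderate, and large h (assumed)

res <- t(sapply(bandwidths, function(h) {
  # Exact mean and variance from the binomial distribution of the counts
  p <- pnorm(x + h) - pnorm(x - h)
  mean_theo <- p / (2 * h)
  var_theo <- p * (1 - p) / (4 * n * h^2)
  # Relative frequency in (x - h, x + h) over 2 * h, for M N(0, 1) samples
  est <- replicate(M, sum(abs(rnorm(n) - x) < h) / (2 * n * h))
  c(h = h, bias_theo = mean_theo - dnorm(x), mean_theo = mean_theo,
    mean_mc = mean(est), var_theo = var_theo, var_mc = var(est))
}))
res
```

Each row corresponds to one bandwidth. The absolute bias grows with `h` while the variance shrinks, which is the trade-off stated above.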
The animation in Figure 2.5 illustrates the previous points and gives insights into how the performance of (2.3) varies smoothly with $h$.
Figure 2.5: Bias and variance for the moving histogram. The animation shows how for small bandwidths the bias of $\hat f_\mathrm{N}(x;h)$ on estimating $f(x)$ is small, but the variance is high, and how for large bandwidths the bias is large and the variance is small. The variance is visualized through the asymptotic confidence intervals for $\hat f_\mathrm{N}(x;h)$.
The estimator (2.3) raises an important question:
Why give the same weight to all $X_1,\ldots,X_n$ in $(x-h,x+h)$ for approximating $\frac{\mathbb{P}[x-h<X<x+h]}{2h}$?
We are estimating $f(x)=F'(x)$ by estimating $\frac{F(x+h)-F(x-h)}{2h}$ through the relative frequency of $X_1,\ldots,X_n$ in the interval $(x-h,x+h)$. Therefore, it seems reasonable that the data points closer to $x$ are more important to assess the infinitesimal probability of $x$ than the ones further away. This observation shows that (2.3) is indeed a particular case of a wider and more sensible class of density estimators, which we will see next.
Footnotes
15. Note that we estimate $f(x)$ by means of an estimate for $f(x_0)$, where $x$ is at most $h>0$ units above $x_0$. Thus, we do not estimate $f(x)$ directly with the histogram.
16. Recall that, with this standardization, we approach the probability density concept.
17. Note that it is key that the $\{B_k\}$ are deterministic (and not sample-dependent) for this result to hold.
18. This is an important point. Notice also that this $k$ depends on $h$ because $t_k=t_0+kh$, therefore the $k$ for which $x\in[t_k,t_{k+1})$ will change when, for example, $h\to0$.
19. Because $\xi_{k,h}\in(t_k,t_{k+1})$, with $k$ changing as $h\to0$ (see the previous footnote), and the interval ends up collapsing in $x$, so any point in $(t_k,t_{k+1})$ converges to $x$.
20. The motivation for this terminology will be apparent in Section 2.2.
21. Or, in other words, if $h^{-1}=o(n)$, i.e., if $h^{-1}$ grows more slowly than $n$ does.
22. Why so?