2.3 The central limit theorem

The Central Limit Theorem (CLT) is a cornerstone result in statistics, if not the cornerstone result. The result states that the sampling distribution of the sample mean \bar{X} of iid rv’s X_1,\ldots,X_n converges to a normal distribution as n\to\infty. The beauty of this result is that it holds irrespective of the original distribution of X_1,\ldots,X_n.

Theorem 2.5 (Central limit theorem) Let X_1,\ldots,X_n be iid rv’s with expectation \mathbb{E}[X_i]=\mu and variance \mathbb{V}\mathrm{ar}[X_i]=\sigma^2<\infty. Then, the cdf of the rv

\begin{align*} U_n=\frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \end{align*}

converges to the cdf of a \mathcal{N}(0,1) as n\to\infty.

Remark. To denote “approximately distributed as” we will use \cong. We can therefore write that, due to the CLT, \frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\cong\mathcal{N}(0,1) (as n\to\infty). Observe that this notation is different to \sim, used to denote “distributed as”, or \approx, used for approximations of numerical quantities.

Proof (Proof of Theorem 2.5). We compute the mgf of the rv U_n. Since X_1,\ldots,X_n are independent,

\begin{align*} M_{U_n}(s)=\mathbb{E}\left[\exp\{sU_n\}\right]=\mathbb{E}\left[\exp\left\{s\sum_{i=1}^n\frac{X_i-\mu}{\sigma\sqrt{n}}\right\}\right]=\prod_{i=1}^n\mathbb{E}\left[\exp\left\{s\,\frac{X_i-\mu}{\sigma\sqrt{n}}\right\}\right]. \end{align*}

Now, denoting Z:=(X_1-\mu)/\sigma (this variable is standardized, but it is not a standard normal variable), we have

\begin{align*} M_{U_n}(s)=\left(\mathbb{E}\left[\exp\left\{\frac{s}{\sqrt{n}}Z\right\}\right]\right)^n=\left[M_Z\left(\frac{s}{\sqrt{n}}\right)\right]^n. \end{align*}

The rv Z is such that \mathbb{E}[Z]=0 and \mathbb{E}[Z^2]=1. We perform a second-order Taylor expansion of M_Z(s/\sqrt{n}) about s=0:

\begin{align*} M_Z\left(\frac{s}{\sqrt{n}}\right)=M_Z(0)+M_Z^{(1)}(0)\frac{s/\sqrt{n}}{1!}+M_Z^{(2)}(0)\frac{(s/\sqrt{n})^2}{2!}+R\left(\frac{s}{\sqrt{n}}\right), \end{align*}

where M_Z^{(k)}(0) is the k-th derivative of the mgf evaluated at s=0 and the remainder term R is a function that satisfies

\begin{align*} \lim_{x\to0}\frac{R(x)}{x^2}=0, \end{align*}

i.e., its contribution to the Taylor expansion is negligible in comparison with the quadratic term (s/\sqrt{n})^2 when n\to\infty.

Since the derivatives evaluated at zero are equal to the moments (Theorem 1.1), we have:

\begin{align*} M_Z\left(\frac{s}{\sqrt{n}}\right) &=\mathbb{E}\left[e^0\right]+\mathbb{E}[Z]\frac{s}{\sqrt{n}}+\mathbb{E}\left[Z^2\right]\frac{s^2}{2n}+R\left(\frac{s}{\sqrt{n}}\right) \\ &=1+\frac{s^2}{2n}+R\left(\frac{s}{\sqrt{n}}\right) \\ &=1+\frac{1}{n}\left[\frac{s^2}{2}+nR\left(\frac{s}{\sqrt{n}}\right)\right]. \end{align*}

Now,

\begin{align*} \lim_{n\to\infty} M_{U_n}(s)&=\lim_{n\to\infty}\left[M_Z\left(\frac{s}{\sqrt{n}}\right)\right]^n\\ &=\lim_{n\to\infty}\left\{1+\frac{1}{n}\left[\frac{s^2}{2}+nR\left(\frac{s}{\sqrt{n}}\right)\right]\right\}^n. \end{align*}

In order to compute the limit, we use that, as x\to0, \log(1+x)=x+L(x) with remainder L such that \lim_{x\to 0} L(x)/x=0. Then,

\begin{align} \lim_{n\to\infty}nR\left(\frac{s}{\sqrt{n}}\right)&=\lim_{n\to\infty} s^2 \frac{R(s/\sqrt{n})}{(s/\sqrt{n})^2}=0,\tag{2.10}\\ \lim_{n\to\infty} nL\left(\frac{s^2}{2n}\right)&=\lim_{n\to\infty}\frac{s^2}{2}\frac{L\left(s^2/(2n)\right)}{s^2/(2n)}=0.\tag{2.11} \end{align}

Then, applying the logarithm and using its expansion together with (2.10) and (2.11), we have

\begin{align*} \lim_{n\to\infty} \log M_{U_n}(s)&=\lim_{n\to\infty}n\log\left\{1+\frac{1}{n}\left[\frac{s^2}{2}+nR\left(\frac{s}{\sqrt{n}}\right)\right]\right\}\\ &=\lim_{n\to\infty}n \left\{\frac{1}{n}\left[\frac{s^2}{2}+nR\left(\frac{s}{\sqrt{n}}\right)\right]+L\left(\frac{s^2}{2n}\right)\right\}\\ &=\frac{s^2}{2}+\lim_{n\to\infty} nR\left(\frac{s}{\sqrt{n}}\right)+\lim_{n\to\infty} nL\left(\frac{s^2}{2n}\right)\\ &=\frac{s^2}{2}. \end{align*}

Therefore,

\begin{align*} \lim_{n\to\infty}M_{U_n}(s)=e^{s^2/2}=M_{\mathcal{N}(0,1)}(s), \end{align*}

where M_{\mathcal{N}(0,1)}(s)=e^{s^2/2} is the mgf of a \mathcal{N}(0,1), as seen in Exercise 1.18.

The application of Theorem 1.3 ends the proof.
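The mgf convergence in the proof can also be checked numerically. Below is a minimal R sketch under illustrative choices: it takes Z=2X-1 with X\sim\mathrm{Exp}(2), a standardized but non-normal rv with closed-form mgf M_Z(t)=e^{-t}/(1-t) for t<1, and evaluates [M_Z(s/\sqrt{n})]^n at s=1 for growing n.

# Numerical check that [M_Z(s / sqrt(n))]^n approaches exp(s^2 / 2)
# for Z = 2 * X - 1 with X ~ Exp(2), so that E[Z] = 0 and Var[Z] = 1
M_Z <- function(t) exp(-t) / (1 - t)  # mgf of Z, defined for t < 1
s <- 1  # illustrative evaluation point
sapply(c(1e1, 1e2, 1e3, 1e4), function(n) M_Z(s / sqrt(n))^n)
exp(s^2 / 2)  # the limit: the mgf of a N(0, 1) evaluated at s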

Remark. If X_1,\ldots,X_n are normal rv’s, then U_n is exactly distributed as \mathcal{N}(0,1). This is the exact version of the CLT.

Example 2.13 The grades in a first-year course of a given university have a mean of \mu=6.1 points (out of 10) and a standard deviation of \sigma=1.5, both known from long-term records. A specific generation of n=38 students had a mean of 5.5 points. Can we state that these students have a significantly lower performance? Or is this perhaps a reasonable deviation from the average performance, merely due to randomness? In order to answer, compute the probability that the sample mean is at most 5.5 when n=38.

Let \bar{X} be the sample mean of the srs of the grades of the n=38 students. By the CLT we know that

\begin{align*} Z=\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\cong \mathcal{N}(0,1). \end{align*}

Then, we are interested in computing the probability

\begin{align*} \mathbb{P}(\bar{X}\leq 5.5)=\mathbb{P}\left(Z\leq \frac{5.5-6.1}{1.5/\sqrt{38}}\right)\approx\mathbb{P}(Z\leq -2.4658)\approx 0.0068, \end{align*}

which has been obtained with

pnorm(-2.4658)
## [1] 0.006835382
pnorm(5.5, mean = 6.1, sd = 1.5 / sqrt(38)) # Alternatively
## [1] 0.006836039

Observe that this probability is very small. This suggests that it is highly unlikely that this generation of students belongs to the population with mean \mu=6.1. It is more likely that its true mean is smaller than 6.1 and, therefore, that the current generation of students has a significantly lower performance than previous generations.

Example 2.14 The random waiting time at a cash register of a supermarket follows a distribution with mean 1.5 minutes and variance 1. What is the approximate probability of serving n=100 clients in less than 2 hours?

If X_i denotes the waiting time of the i-th client, then we want to compute

\begin{align*} \mathbb{P}\left(\sum_{i=1}^{100} X_i\leq 120\right)=\mathbb{P}\left(\bar{X}\leq 1.2\right). \end{align*}

As the sample size is large, the CLT entails that \bar{X}\cong\mathcal{N}(1.5, 1/100), so

\begin{align*} \mathbb{P}\left(\bar{X}\leq 1.2\right)=\mathbb{P}\left(\frac{\bar{X}-1.5}{1/\sqrt{100}} \leq \frac{1.2-1.5}{1/\sqrt{100}}\right)\approx \mathbb{P}(Z\leq -3). \end{align*}

Employing pnorm() we can compute \mathbb{P}(Z\leq -3) right away:

pnorm(-3)
## [1] 0.001349898
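As in Example 2.13, the same value can also be obtained directly from the approximate distribution of \bar{X}:

pnorm(1.2, mean = 1.5, sd = 1 / sqrt(100)) # Alternatively
## [1] 0.001349898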

The CLT employs a type of convergence for sequences of rv’s that we make precise next.

Definition 2.6 (Convergence in distribution) A sequence of rv’s X_1,X_2,\ldots converges in distribution to a rv X if

\begin{align*} \lim_{n\to\infty} F_{X_n}(x)=F_X(x), \end{align*}

for all the points x\in\mathbb{R} where F_X(x) is continuous. The convergence in distribution is denoted by

\begin{align*} X_n \stackrel{d}{\longrightarrow} X. \end{align*}

The thesis of the CLT can be rewritten in terms of this (more explicit) notation as follows:

\begin{align*} U_n=\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\stackrel{d}{\longrightarrow}\mathcal{N}(0,1). \end{align*}

This means that the cdf of U_n, F_{U_n}, verifies that \lim_{n\to\infty} F_{U_n}(x)=\Phi(x) for all x\in\mathbb{R}. We can visualize how this convergence happens by plotting the cdf’s of U_n for increasing values of n and matching them with \Phi, the cdf of a standard normal. This is done in Figures 2.8 and 2.9, where the cdf’s F_{U_n} are approximated by Monte Carlo.


Figure 2.8: Cdf’s of U_n=(\bar{X}-\mu)/(\sigma/\sqrt{n}) for X_1,\ldots,X_n\sim \mathrm{Exp}(2). A fast convergence of F_{U_n} towards \Phi appears, despite the heavy non-normality of \mathrm{Exp}(2).


Figure 2.9: Cdf’s of U_n=(\bar{X}-\mu)/(\sigma/\sqrt{n}) for X_1,\ldots,X_n\sim \mathrm{Bin}(1,p). The cdf’s F_{U_n} are discrete for every n, which makes their convergence towards \Phi slower than in the continuous case.
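The following minimal R sketch reproduces the spirit of these figures for the exponential case of Figure 2.8 (the sample sizes in ns and the number of replicates M are illustrative choices):

# Monte Carlo approximation of the cdf of U_n for X_1, ..., X_n ~ Exp(2)
set.seed(42)
M <- 1e4                     # Monte Carlo replicates
mu <- 1 / 2; sigma <- 1 / 2  # mean and sd of an Exp(2)
curve(pnorm(x), from = -3, to = 3, lwd = 2, ylab = "cdf")  # Phi
ns <- c(2, 5, 50); cols <- c("red", "orange", "blue")
for (i in seq_along(ns)) {
  U_n <- replicate(M, {
    x <- rexp(ns[i], rate = 2)
    (mean(x) - mu) / (sigma / sqrt(ns[i]))
  })
  lines(ecdf(U_n), col = cols[i], do.points = FALSE)  # approximates F_{U_n}
}
legend("bottomright", legend = c("Phi", paste0("n = ", ns)),
       col = c("black", cols), lwd = 2)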

The diagram below summarizes the available sampling distributions for the sample mean:

\begin{align*} \!\!\!\!\!\!\!\!(X_1,\ldots,X_n)\begin{cases} \sim \mathcal{N}(\mu,\sigma^2) & \implies \bar{X}\sim \mathcal{N}(\mu,\sigma^2/n) \\ \nsim \mathcal{N}(\mu,\sigma^2) & \begin{cases} n \;\text{ small} & \implies \text{?} \\ n \;\text{ large} & \implies \bar{X}\cong \mathcal{N}(\mu,\sigma^2/n) \end{cases} \end{cases} \end{align*}

The following examples provide another view on the applicability of the CLT: to approximate certain distributions for large values of their parameters.

Example 2.15 (Normal approximation to the binomial) Let Y be a binomial rv of size n\in\mathbb{N} and success probability p\in[0,1] (see Example 1.22), denoted by \mathrm{Bin}(n,p). It holds that Y=\sum_{i=1}^n X_i, where X_i\sim \mathrm{Ber}(p), i=1,\ldots,n, with \mathrm{Ber}(p)\stackrel{d}{=}\mathrm{Bin}(1,p) and X_1,\ldots,X_n independent.

It is easy to see that

\begin{align*} \mathbb{E}[X_i]=p, \quad \mathbb{V}\mathrm{ar}[X_i]=p(1-p). \end{align*}

Then, applying the CLT, we have that

\begin{align*} \bar{X}=\frac{Y}{n}\cong \mathcal{N}\left(p,\frac{p(1-p)}{n}\right), \end{align*}

which implies that \mathrm{Bin}(n,p)/n\cong \mathcal{N}\left(p,\frac{p(1-p)}{n}\right), or, equivalently,

\begin{align} \mathrm{Bin}(n,p)\cong \mathcal{N}(np,np(1-p)) \tag{2.12} \end{align}

as n\to\infty. This approximation works well even when n is not very large (say, n<30), as long as p is “not very close to zero or one”, a condition often translated into requiring that both np>5 and np(1-p)>5 hold.


Figure 2.10: Normal approximation to the \mathrm{Bin}(n,p) for n=5 and n=50 with p=0.25.

A continuity correction is often employed to improve the accuracy when computing binomial probabilities through the normal approximation. Then, if \tilde{Y}\sim\mathcal{N}(np,np(1-p)) denotes the approximating rv, better approximations are obtained with

\begin{align*} \mathbb{P}(Y\leq m) & \approx \mathbb{P}(\tilde{Y}\leq m+0.5), \\ \mathbb{P}(Y\geq m) & \approx \mathbb{P}(\tilde{Y}\geq m-0.5), \\ \mathbb{P}(Y=m) & \approx \mathbb{P}(m-0.5\leq \tilde{Y}\leq m+0.5), \end{align*}

where m\in\mathbb{N} and m\leq n.

Example 2.16 Let Y\sim\mathrm{Bin}(25,0.4). Using the normal distribution, approximate the probability that Y\leq 8 and that Y=8.

Using the continuity correction, \mathbb{P}(Y\leq 8) can be approximated as

\begin{align*} \mathbb{P}(Y\leq 8)\approx \mathbb{P}(\tilde{Y}\leq 8.5)=\mathbb{P}(\mathcal{N}(10,6)\leq 8.5), \end{align*}

which takes the value

pnorm(8.5, mean = 10, sd = sqrt(6))
## [1] 0.2701457

The probability of Y=8 is approximated as

\begin{align*} \mathbb{P}(Y=8) \approx \mathbb{P}(7.5\leq \tilde{Y}\leq 8.5)=\mathbb{P}(7.5\leq \mathcal{N}(10,6)\leq 8.5), \end{align*}

which takes the value

pnorm(8.5, mean = 10, sd = sqrt(6)) - pnorm(7.5, mean = 10, sd = sqrt(6))
## [1] 0.1164286

We can compare these two approximate probabilities with the exact ones

\begin{align*} \mathbb{P}(\mathrm{Bin}(n, p) \leq m) \text{ and } \mathbb{P}(\mathrm{Bin}(n, p) = m), \end{align*}

which can be computed right away with pbinom(m, size = n, prob = p) and dbinom(m, size = n, prob = p), respectively:

pbinom(8, size = 25, prob = 0.4)
## [1] 0.2735315
dbinom(8, size = 25, prob = 0.4)
## [1] 0.1199797

The normal approximation is therefore quite precise, even for n=25.

Example 2.17 A candidate for city mayor believes that he/she may win the election if he/she obtains at least 55\% of the votes in district D. In addition, he/she assumes that 50\% of all the voters in the city support him/her. If n=100 voters vote in district D (consider them as a srs of the voters of the city), what is the exact probability that the candidate wins at least 55\% of the votes?

Let Y be the number of voters in district D that support the candidate. Then Y is distributed as \mathrm{Bin}(100,0.5) and we want to know the probability

\begin{align*} \mathbb{P}(Y/n\geq 0.55)=\mathbb{P}(Y\geq 55). \end{align*}

The desired probability is

\begin{align*} \mathbb{P}(\mathrm{Bin}(100, 0.5) \geq 55)=1-\mathbb{P}(\mathrm{Bin}(100, 0.5) < 55), \end{align*}

which can be computed as:

1 - pbinom(54, size = 100, prob = 0.5)
## [1] 0.1841008
pbinom(54, size = 100, prob = 0.5, lower.tail = FALSE) # Alternatively
## [1] 0.1841008

The previous value is the exact binomial probability. Let us now see how close the CLT approximation is. Because of the CLT,

\begin{align*} \mathbb{P}(Y/n\geq 0.55)\approx\mathbb{P}\left(\mathcal{N}\left(0.5, \frac{0.25}{100}\right)\geq 0.55\right) \end{align*}

with the actual value given by

1 - pnorm(0.55, mean = 0.5, sd = sqrt(0.25 / 100))
## [1] 0.1586553

If the continuity correction is employed, the probability is approximated by

\begin{align*} \mathbb{P}(Y\geq 0.55n)\approx\mathbb{P}\left(\tilde{Y}\geq 0.55n-0.5\right)=\mathbb{P}\left(\mathcal{N}\left(50, 25\right)\geq 54.5\right), \end{align*}

which takes the value

1 - pnorm(54.5, mean = 50, sd = sqrt(25))
## [1] 0.1840601

As seen, the continuity correction offers a significant improvement, even for n=100 and p=0.5.

Example 2.18 (Normal approximation to additive distributions) The additive property of the binomial is key to obtaining its normal approximation by the CLT seen in Example 2.15. On the one hand, if X_1,\ldots,X_n are \mathrm{Bin}(1,p), then n\bar{X} is \mathrm{Bin}(n,p). On the other hand, the CLT thesis can also be expressed as

\begin{align} n\bar{X}\cong\mathcal{N}(n\mu,n\sigma^2) \tag{2.13} \end{align}

as n\to\infty. Equating both results, we obtain the approximation (2.12).

Different additive properties are also satisfied by other distributions, as we saw in Exercise 1.20 (Poisson distribution) and Exercise 1.21 (gamma distribution). In addition, the chi-square distribution, as a particular case of the gamma, also has an additive property (Proposition 2.2) in the degrees of freedom \nu. Therefore, we can obtain normal approximations for these distributions as well.

For the Poisson, we know that, if (X_1,\ldots,X_n) is a srs of a \mathrm{Pois}(\lambda), then

\begin{align} n\bar{X}=\sum_{i=1}^nX_i\sim\mathrm{Pois}(n\lambda).\tag{2.14} \end{align}

Since the expectation and the variance of a \mathrm{Pois}(\lambda) are both equal to \lambda, equating (2.13) and (2.14) when n\to\infty yields

\begin{align*} \mathrm{Pois}(n\lambda)\cong \mathcal{N}(n\lambda,n\lambda). \end{align*}


Figure 2.11: Normal approximation to the \mathrm{Pois}(\lambda) for \lambda=5 and \lambda=20.

Note that this approximation is equivalent to saying that, when the intensity parameter \lambda tends to infinity, \mathrm{Pois}(\lambda) is approximately a \mathcal{N}(\lambda,\lambda).
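A quick numerical check of this approximation in R, with the continuity correction carried over from the binomial case (the intensity \lambda and the grid of evaluation points are illustrative choices):

# Exact Pois(lambda) cdf vs. its normal approximation with continuity correction
lambda <- 20  # illustrative intensity
m <- 0:50     # grid of evaluation points
max(abs(ppois(m, lambda) - pnorm(m + 0.5, mean = lambda, sd = sqrt(lambda))))
# the maximum absolute error shrinks as lambda grows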

Following similar arguments for the chi-square distribution, it is easy to see that, when \nu\to\infty,

\begin{align} \chi^2_\nu\cong \mathcal{N}(\nu,2\nu). \tag{2.15} \end{align}


Figure 2.12: Normal approximation to the \chi^2_\nu for \nu=10 and \nu=50.
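Approximation (2.15) can be inspected analogously (the degrees of freedom \nu and the grid are illustrative choices):

# Exact chi^2_nu cdf vs. the N(nu, 2 * nu) approximation
nu <- 50  # illustrative degrees of freedom
x <- seq(0, 2 * nu, length.out = 201)
max(abs(pchisq(x, df = nu) - pnorm(x, mean = nu, sd = sqrt(2 * nu))))
# the discrepancy decreases as nu increases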