7 📝 Limit Theorems

'AreUnormal' by Enrico Chavez

Figure 3.1: ‘AreUnormal’ by Enrico Chavez

7.1 Markov’s and Chebyshev’s Inequalities

We start by reviewing two inequalities that provide upper bounds on probability statements and play an important role in establishing the convergence results presented later in this chapter.

7.1.1 Markov’s inequality

Proposition 7.1 (Markov’s Inequality) Let \(Z\) be a random variable and \(h(z)\) a non-negative function for all \(z\in \mathbb{R}\). Then: \[\begin{equation} \Pr(h(Z)\geq \zeta)\leq \frac{E[h(Z)]}{\zeta}\quad\text{for all}\,\zeta>0. \label{Eq. M1} \end{equation}\]
Proof. To verify Markov’s inequality, observe that \[\begin{eqnarray*} E[h( Z)]&=&\int_{-\infty }^{\infty }h(z) f_{Z}\left(z\right)dz\\ &=&\int_{\{z:h(z)\geq\zeta\}}h(z) f_{Z}\left(z\right) dz+ \int_{\{z:h(z)<\zeta\}}h(z) f_{Z}\left(z\right) dz \\ &\geq&\int_{\{z:h(z)\geq\zeta\}}h(z) f_{Z}\left(z\right) dz\\ &\geq&\int_{\{z:h(z)\geq\zeta\}}\zeta f_{Z}\left(z\right)dz=\zeta\Pr(h(Z)\geq\zeta)\,, \end{eqnarray*}\] giving the desired result on division by \(\zeta\).

Example 5.2 (Markov’s Inequality) Q. On the A2 highway (in the Luzern Canton), the speed limit is \(80\) Km/h. Most drivers do not drive that fast, and the average speed on the highway is \(70\) Km/h. If \(Z\) denotes a randomly chosen driver’s speed, what is the probability that such a person is driving faster than the speed limit?

A. Since we do not know the whole distribution of \(Z\), but only limited information (namely \(E[Z]=70\) Km/h), we have to resort to Markov’s inequality. Speeds are non-negative, so we can take \(h(z)=z\) and \(\zeta=80\) in \eqref{Eq. M1}, obtaining an upper bound on the probability:

\[ P(Z \geq 80) \leq \frac{70}{80} = 0.875. \]
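
As a sanity check, here is a minimal simulation sketch (Python, using numpy). The speed distribution is not given in the text, so a Gamma distribution with mean \(70\) Km/h is assumed purely for illustration; Markov’s bound uses only the mean, so it must hold whatever distribution we pick.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical speed distribution (NOT specified in the text): Gamma with mean 70 Km/h.
# Markov's bound only uses E[Z] = 70, so it holds whatever distribution generated the data.
shape, scale = 49.0, 70.0 / 49.0                  # mean = shape * scale = 70
speeds = rng.gamma(shape, scale, size=1_000_000)

empirical = np.mean(speeds >= 80)                 # simulated Pr(Z >= 80)
markov_bound = speeds.mean() / 80                 # E[Z] / 80, roughly 70/80 = 0.875

print(f"empirical Pr(Z >= 80) = {empirical:.4f}")
print(f"Markov upper bound    = {markov_bound:.4f}")
```

The bound \(0.875\) is very crude here, which is typical: Markov’s inequality trades sharpness for generality.
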
Proposition 7.2 (Chebyshev’s Inequality) For any random variable \(Z\) with mean \(\mu_Z\) and variance \(\sigma_Z^2<\infty\), \[\begin{equation*} \Pr \left( \left\vert Z-\mu_Z\right\vert <r\sigma_Z\right) \geq 1-\frac{1 }{r^{2}} \end{equation*}\] for all \(r>0\).

Remark. Note that an equivalent expression is given by \[\begin{equation} \Pr \left( \left\vert Z-\mu_Z\right\vert \geq r\sigma_Z\right) \leq \frac{1 }{r^{2}} \label{Eq. C2} \end{equation}\]

Put in words, this inequality says that the probability that a random variable lies more than \(r\) standard deviations away from its mean value is bounded above by \(1/r^2\).

Proof. Chebyshev’s inequality is a special case of Markov’s inequality: take \(h(z)=(z-\mu_Z)^2\) and \(\zeta=r^2\sigma_Z^2\) in \eqref{Eq. M1}. Since the event \(\{(Z-\mu_Z)^2\geq r^2\sigma_Z^2\}\) is the same as \(\{\left\vert Z-\mu_Z\right\vert\geq r\sigma_Z\}\), Markov’s inequality yields \[\Pr\left(\left\vert Z-\mu_Z\right\vert \geq r\sigma_Z\right)=\Pr\left((Z-\mu_Z)^2\geq r^2\sigma_Z^2\right)\leq \frac{E\left[(Z-\mu_Z)^2\right]}{r^2\sigma_Z^2}=\frac{\sigma_Z^2}{r^2\sigma_Z^2}=\frac{1}{r^{2}}\,,\] which is \eqref{Eq. C2}.
Chebyshev’s inequality can be used to construct crude bounds on the probabilities associated with deviations of a random variable from its mean.

Example 7.1 (Chebyshev’s Inequality) Q. On the A2 highway (in the Luzern Canton), the speed limit is \(80\) Km/h. Most drivers do not drive that fast, and the average speed on the highway is \(70\) Km/h. If \(Z\) denotes a randomly chosen driver’s speed, what is the probability that such a person is driving faster than the speed limit?

A. Since we do not know the whole distribution of \(Z\), but only limited information (namely \(E[Z]=70\) Km/h and \(V(Z)=9\) \((Km/h)^2\)), we resort to Chebyshev’s inequality to obtain an upper bound on the probability. Thus,

\[\begin{eqnarray*} P( Z \geq 80) &=& P( Z - E[Z]\geq 80 - 70) \\ &\leq& P(\vert Z-E[Z] \vert \geq 10) = P\left( \frac{\vert Z-E[Z] \vert }{\sqrt{V(Z)}}\geq \frac{10}{\sqrt{9}}\right). \end{eqnarray*}\]

Using \eqref{Eq. C2}, with \(r=\frac{10}{3}\) and \(\sigma_Z= 3\), we finally get

\[\begin{eqnarray*} P( Z \geq 80) \leq P\left(\Big\vert Z-E[Z] \Big\vert \geq \left(\frac{10}{3}\right) {3}\right) \leq \frac{1}{\left(\frac{10}{3}\right)^{2}} = \frac{9}{100} = 0.09\,. \end{eqnarray*}\]
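
A minimal numerical sketch of this bound (Python with numpy/scipy), assuming, purely for illustration, a Normal distribution with the stated mean \(70\) and variance \(9\); the Chebyshev bound itself does not rely on this choice.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Hypothetical distribution with E[Z] = 70 and V(Z) = 9 (not given in the text): N(70, 3^2).
speeds = rng.normal(loc=70, scale=3, size=1_000_000)

empirical = np.mean(speeds >= 80)            # simulated Pr(Z >= 80)
exact_normal = norm.sf(80, loc=70, scale=3)  # exact tail under the assumed normal
chebyshev_bound = 9 / 10**2                  # sigma_Z^2 / eps^2 with eps = 10
markov_bound = 70 / 80                       # uses the mean only

print(f"simulated Pr(Z >= 80)  = {empirical:.5f}")
print(f"exact normal tail      = {exact_normal:.5f}")
print(f"Chebyshev upper bound  = {chebyshev_bound:.3f}")
print(f"Markov upper bound     = {markov_bound:.3f}")
```

Knowing the variance tightens the bound from \(0.875\) to \(0.09\), although both remain conservative compared with the (here assumed) exact tail.
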

Remark. Chebyshev’s inequality can be rewritten in a different way.

Indeed, for any random variable \(Z\) with mean \(\mu_Z\) and variance \(\sigma_Z^2<\infty\), \[\begin{equation} \Pr \left( \left\vert Z-\mu_Z\right\vert \geq \varepsilon \right) \leq \frac{E[(Z-\mu_Z)^2]}{\varepsilon^{2}} = \frac{\sigma_Z^2}{\varepsilon^{2}}. \label{Eq. C3} \end{equation}\]

It is easy to check that Eq. \eqref{Eq. C3} coincides with Eq. \eqref{Eq. C2}, by setting in Eq. \eqref{Eq. C3}

\[ \varepsilon = r \sigma_Z. \]

Do the check as an exercise!!

7.2 Sequences of Random Variables

Definition 5.2 A sequence of random variables is an ordered list of random variables of the form

\[\begin{equation*} S_{1},S_{2},...,S_{n},... \end{equation*}\]

where, in an abstract sense, the sequence is infinitely long.

We would like to say something about how these random variables behave as \(n\) gets larger and larger (i.e. as \(n\) tends towards infinity, denoted by \(n\rightarrow\infty\)).

The study of such limiting behaviour is commonly called a study of `asymptotics’ — after the word asymptote used in standard calculus.

7.2.1 Example: Bernoulli Trials and their sum

Let \(\tilde Z\) denote a dichotomous random variable with \(\tilde Z\sim \mathcal{B}(p)\). A sequence of Bernoulli trials provides us with a sequence of values \(\tilde Z_{1},\tilde Z_{2},...,\tilde Z_{n},...\), where each \(\tilde {Z}_{i}\) is such that

\[\begin{eqnarray*} \Pr("Success")=\Pr \left( \tilde{Z}_{i}=1\right) = p & \text{and} & \Pr("Failure")=\Pr \left( \tilde Z_{i}=0\right) = 1-p \end{eqnarray*}\]

Now let \[S_n=\sum_{s=1}^n \tilde Z_s,\] the number of “Successes” in the first \(n\) Bernoulli trials. This yields a new sequence of random variables

\[\begin{eqnarray*} S_{1} &=& \tilde Z_{1} \\ S_{2} &=&\left( \tilde Z_{1}+ \tilde Z_{2}\right)\\ &&\vdots \\ S_{n} &=&\left( \tilde Z_{1}+ \tilde Z_{2}+\cdots + \tilde Z_{n}\right) = \sum_{i=1}^n \tilde Z_i \end{eqnarray*}\]

This new sequence is such that \(S_n\sim B(n,p)\) for each \(n\).

Now consider the sequence
\[{P}_n=S_n/n,\quad n=1,2,\ldots,\] which corresponds to the proportion of `Successes’ in the first \(n\) Bernoulli trials.

It is natural to ask how the behaviour of \({P}_n\) is related to the true probability of a `Success’ (\(p\)).

Specifically, the open question at this point is:

“Do these results imply that \({P}_n\) collapses onto the true \(p\) as \(n\) increases, and if so, in what way?”

To gain a clue, let us consider the simulated values of \({P}_n\).

7.2.2 Example: Bernoulli Trials and limit behaviour
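
The numerical illustration referred to in the remark below is not reproduced here; the following is a minimal simulation sketch playing the same role, with \(p=0.3\) and a single simulated path of \(10{,}000\) trials chosen purely for illustration (Python, numpy).

```python
import numpy as np

rng = np.random.default_rng(3)

p, n_max = 0.3, 10_000                       # p and path length chosen for illustration only
z = rng.binomial(1, p, size=n_max)           # one realisation of Z~_1, ..., Z~_n
s = np.cumsum(z)                             # running number of "Successes" S_n
p_n = s / np.arange(1, n_max + 1)            # running proportion P_n = S_n / n

for n in (10, 100, 1_000, 10_000):
    print(f"n = {n:6d}   P_n = {p_n[n - 1]:.4f}")
# P_n fluctuates considerably for small n, then settles closer and closer to p = 0.3.
```
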

Remark. This numerical illustration leads us to suspect that there is a sense in which \({P}_n\) converges to \(p\) — notice that although the sequence is random, the `limiting’ value here is a constant (i.e. is non-random).

So, informally, we say that a sequence of random variables \(X_{1},X_{2},...,X_{n},...\) converges if the probability distribution of \(X_{n}\) becomes more and more concentrated around a single point as \(n\) tends to infinity.

7.3 Convergence in Probability (\(\overset{p}{\rightarrow }\))

More formally,

Definition 5.3 A sequence of random variables \(X_{1},X_{2},...,X_{n},...\) is said to converge in probability to a number \(\alpha\) if for any arbitrary constant \(\varepsilon >0\)

\[\begin{equation*} \lim_{n\rightarrow \infty }\Pr \left( \left\vert X_{n}-\alpha \right\vert >\varepsilon \right) =0 \end{equation*}\]

If this is the case, we write \(X_{n}\overset{p}{\rightarrow }\alpha\) or \(p\lim X_{n}=\alpha\).

A sequence of random variables \(X_{1},X_{2},...,X_{n},...\) is said to converge in probability to a random variable \(X\) if for any arbitrary constant \(\varepsilon >0\)

\[\begin{equation*} \lim_{n\rightarrow \infty }\Pr \left( \left\vert X_{n}-X \right\vert >\varepsilon \right) =0\,, \end{equation*}\]

written \(X_{n}\overset{p}{\rightarrow }X\) or \(p\lim(X_{n}-X)=0\).
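
The definition can be illustrated with the Bernoulli proportion \(P_n\) from the previous section: a minimal Monte Carlo sketch (Python, numpy) that estimates \(\Pr(\vert P_n - p\vert > \varepsilon)\) for increasing \(n\), with \(p=0.3\) and \(\varepsilon=0.05\) chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

p, eps, reps = 0.3, 0.05, 20_000             # p, eps and Monte Carlo size chosen for illustration

for n in (10, 100, 1_000, 10_000):
    p_n = rng.binomial(n, p, size=reps) / n  # reps independent copies of P_n = S_n / n
    prob = np.mean(np.abs(p_n - p) > eps)    # Monte Carlo estimate of Pr(|P_n - p| > eps)
    print(f"n = {n:6d}   Pr(|P_n - p| > {eps}) ~ {prob:.4f}")
# The estimated probabilities shrink towards 0, consistent with P_n ->p p.
```
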

7.3.1 Operational Rules for \(\overset{p}{\rightarrow }\)

Let us itemize some rules. To this end, let \(a\) be any (non-random) number:

  • If \(X_{n}\overset{p}{\rightarrow } \alpha\) then

    • \(aX_{n}\overset{p}{\rightarrow }a\alpha\) and
    • \(a+X_{n}\overset{p}{\rightarrow }a+\alpha\),

  • If \(X_{n}\overset{p}{\rightarrow }X\) then

    • \(aX_{n}\overset{p}{\rightarrow }aX\) and
    • \(a+X_{n}\overset{p}{\rightarrow }a+X\)
  • If \(X_{n}\overset{p}{\rightarrow }\alpha\) and \(Y_{n}\overset{p}{\rightarrow }\gamma\) then

    • \(X_{n}Y_{n}\overset{p}{\rightarrow }\alpha \gamma\) and
    • \(X_{n}+Y_{n}\overset{p}{\rightarrow }\alpha +\gamma\).
  • If \(X_{n}\overset{p}{\rightarrow }X\) and \(Y_{n}\overset{p}{\rightarrow }Y\) then

    • \(X_{n}Y_{n}\overset{p}{\rightarrow }X Y\) and
    • \(X_{n}+Y_{n}\overset{p}{\rightarrow }X +Y\)
  • Let \(g\left( x\right)\) be any (non-random) continuous function. If \(X_{n}\overset{p}{\rightarrow }\alpha\) then \[g\left( X_{n}\right) \overset{p}{\rightarrow }g\left( \alpha \right),\] and if \(X_{n}\overset{p}{\rightarrow }X\) then

\[g\left( X_{n}\right) \overset{p}{\rightarrow }g\left( X \right).\] A small numerical check of this continuous-mapping rule is sketched right after this list.
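
Here is the numerical check announced above: a minimal sketch (Python, numpy) of the continuous-mapping rule, with \(g(x)=x(1-x)\), \(X_n = P_n\) the Bernoulli proportion, and \(p=0.3\), all chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Continuous mapping: P_n ->p p implies g(P_n) ->p g(p) for continuous g.
p = 0.3                                      # chosen for illustration


def g(x):
    """A continuous (non-random) function of x."""
    return x * (1 - x)


for n in (10, 100, 1_000, 10_000, 100_000):
    p_n = rng.binomial(n, p) / n             # one realisation of P_n
    print(f"n = {n:7d}   g(P_n) = {g(p_n):.5f}   target g(p) = {g(p):.5f}")
# g(P_n) settles around g(p) = 0.21 as n grows.
```
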

Suppose \(X_{1},X_{2},...,X_{n},...\) is a sequence of independent random variables with common distribution \(F_X(x)\) and moments \(\mu_r=E [X^r]\). At any given point along the sequence, \(X_{1},X_{2},...,X_{n}\) constitutes a simple random sample of size \(n\).

For each fixed sample size \(n\), the \(r\)th sample moment is (using an obvious notation) \[\begin{equation*} M_{(r,n)}=\frac{1}{n}\left( X_{1}^r+X_{2}^r+\cdots +X_{n}^r\right)=\frac{1}{n}\sum_{s=1}^nX_s^r\,, \end{equation*}\] and we know that \[E[M_{(r,n)}]=\mu_r\quad\text{and}\quad Var(M_{(r,n)})=\frac{1}{n}(\mu_{2r}-\mu_r^2)\,.\]

Now consider the sequence of sample moments \(M_{(r,1)},M_{(r,2)},...,M_{(r,n)},...\) or, equivalently, \(\{M_{(r,i)}\}_{i\geq 1}\).

7.3.2 Convergence of Sample Moments as a motivation…

The distribution of \(M_{(r,n)}\) (which is unknown, because \(F_X(x)\) has not been specified) is thus centred at \(\mu_r\) for all \(n\), with a variance which tends to zero as \(n\) increases.

So the distribution of \(M_{(r,n)}\) becomes more and more concentrated around \(\mu_r\) as \(n\) increases, and therefore we might conjecture that \[\begin{equation*} M_{(r,n)}\overset{p}{\rightarrow }\mu_r. \end{equation*}\]

In fact, this result follows from what is known as the Weak Law of Large Numbers (WLLN).
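
As a concrete illustration of this claim, here is a minimal sketch (Python, numpy) of the second sample moment for an assumed \(N(0,1)\) sample, so that \(r=2\) and \(\mu_2 = 1\); both choices are made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

# Second sample moment M_(2,n) of an illustrative N(0,1) sample; here mu_2 = E[X^2] = 1.
r, mu_r = 2, 1.0

for n in (10, 100, 1_000, 10_000, 100_000):
    x = rng.normal(size=n)
    m_rn = np.mean(x**r)                     # M_(r,n) = (1/n) sum_s X_s^r
    print(f"n = {n:7d}   M_(2,n) = {m_rn:.4f}   |M_(2,n) - mu_2| = {abs(m_rn - mu_r):.4f}")
# The deviation from mu_2 = 1 shrinks as n grows, in line with M_(r,n) ->p mu_r.
```
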

7.3.3 The Weak Law of Large Numbers (WLLN)

Proposition 7.3 Let \(X_{1},X_{2},...,X_{n},...\) be a sequence of independent random variables with common probability distribution \(F_X(x)\), and let \(Y=h(X)\) be such that \[\begin{eqnarray*} E[Y]=E\left[ h(X)\right] &=&\mu_Y \\ Var(Y)=Var\left( h(X)\right) &=&\sigma_Y ^{2}<\infty\,. \end{eqnarray*}\] Set \[\overline{Y}_n=\frac{1}{n}\sum_{s=1}^nY_s\quad\text{where}\quad Y_s=h(X_s)\,,\quad s=1,\ldots,n\,.\] Then for any two numbers \(\varepsilon\) and \(\delta\) satisfying \(\varepsilon>0\) and \(0<\delta<1\)

\[\Pr \left( \left\vert \overline{Y}_{n}-\mu_Y \right\vert<\varepsilon \right)\geq 1-\delta\]

for all \(n>\sigma_Y^2/(\varepsilon^2\delta)\). Choosing both \(\varepsilon\) and \(\delta\) to be arbitrarily small implies that \(p\lim_{n\rightarrow\infty}(\overline{Y}_{n}-\mu_Y)=0\), or equivalently \(\overline{Y}_{n}\overset{p}{\rightarrow }\mu_Y\).
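
Before proving the proposition, here is a minimal numerical sketch (Python, numpy) of the sample-size bound \(n>\sigma_Y^2/(\varepsilon^2\delta)\), assuming, purely for illustration, Bernoulli(\(p\)) data with \(p=0.3\) (so \(\mu_Y=p\) and \(\sigma_Y^2=p(1-p)\)) and \(\varepsilon=\delta=0.05\).

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative Bernoulli(p) data: mu_Y = p, sigma_Y^2 = p(1 - p).
p = 0.3
mu_y, sigma2_y = p, p * (1 - p)
eps, delta = 0.05, 0.05

# Smallest integer n strictly greater than sigma_Y^2 / (eps^2 * delta).
n_req = int(np.ceil(sigma2_y / (eps**2 * delta))) + 1
print(f"Chebyshev-based sample size: n = {n_req}")

reps = 20_000
ybar = rng.binomial(n_req, p, size=reps) / n_req      # reps copies of Y-bar_n
coverage = np.mean(np.abs(ybar - mu_y) < eps)
print(f"Pr(|Ybar_n - mu_Y| < {eps}) ~ {coverage:.4f}   (guaranteed >= {1 - delta})")
```

The guarantee is conservative: the simulated probability is typically much closer to 1 than the \(1-\delta=0.95\) promised by the bound.
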

7.3.4 The WLLN and Chebyshev’s Inequality

  • First note that \(E[\overline{Y}_n]=\mu_Y\) and \(Var(\overline{Y}_n)=\sigma_Y^2/n\).
  • Now, according to Chebyshev’s inequality \[\begin{eqnarray*} \Pr \left( |\overline{Y}_{n}-\mu_Y| <\varepsilon\right) &\geq &1-\frac{E\left[ \left( \overline{Y}_{n}-\mu_Y \right) ^{2}\right] }{\varepsilon^{2}} \\ &=&1-\frac{\sigma_Y ^{2}/n}{\varepsilon^{2}} \\ &=&1-\frac{\sigma_Y ^{2}}{n\varepsilon^{2}}\geq 1-\delta \end{eqnarray*}\] for all \(n>\sigma_Y^2/(\varepsilon^2\delta)\).
  • Thus the WLLN is proven, since Chebyshev’s inequality has already been verified in Section 7.1.
  • Note that by considering the limit as \(n\rightarrow \infty\) we also have \[\begin{equation*} \lim_{n\rightarrow \infty }\Pr \left( \left\vert \overline{Y}_{n}-\mu_Y\right\vert <\varepsilon\right) \geq \lim_{n\rightarrow \infty }\left( 1-\frac{\sigma_Y^{2}}{n\varepsilon^{2}}\right) =1\,, \end{equation*}\] again implying that \(\left( \overline{Y}_{n}-\mu_Y \right) \overset{p}{\rightarrow }0\).
A second mode of convergence concerns the behaviour of the CDFs of the \(X_{n}\) rather than of the random variables themselves: convergence in distribution.

Definition 7.1 Consider a sequence of random variables \(X_{1},X_{2},...,X_{n},...\) with corresponding CDFs \(F_{X_{1}}\left( x\right) ,F_{X_{2}}\left( x\right),...,F_{X_{n}}\left(x\right) ,...\). We say that the sequence \(X_{1},X_{2},...,X_{n},...\) converges in distribution to the random variable \(X\), having probability distribution \(F_X(x)\), if and only if \[\begin{equation*} \lim_{n\rightarrow \infty }F_{X_n}\left( x\right) =F_{X}\left( x\right) \end{equation*}\] at all points \(x\) where \(F_{X}\left( x\right)\) is continuous. In this case we write \(X_{n}\overset{D}{\rightarrow }X\).
Some useful operational rules for \(\overset{D}{\rightarrow }\), analogous to those for \(\overset{p}{\rightarrow }\), are the following.

  • If \(p\lim_{n\rightarrow\infty}(X_n-X)=0\) then \(X_{n}\overset{D}{\rightarrow }X\).

  • Let \(a\) be any real number. If \(X_{n}\overset{D}{\rightarrow }X\), then \(aX_{n}\overset{D}{\rightarrow }aX\).

  • If \(Y_{n}\overset{p}{\rightarrow }\phi\) and \(X_{n}\overset{D}{\rightarrow }X\), then

    • \(Y_{n}X_{n}\overset{D}{\rightarrow }\phi X\), and
    • \(Y_{n}+X_{n}\overset{D}{\rightarrow }\phi +X\).
  • If \(X_{n}\overset{D}{\rightarrow }X\) and \(g\left( x\right)\) is any continuous function, then \(g\left( X_{n}\right) \overset{D}{\rightarrow }g\left( X\right)\).

Example 7.2 Suppose \(X_{1},X_{2},...,X_{n},...\) is a sequence of independent random variables where \(X_n\sim B(n,p)\) with probability of “Success” \(p\).

  • We already know that, if \(p=\lambda/n\), where \(\lambda>0\) is fixed, then as \(n\) goes to infinity, \(F_{X_{n}}\left( x\right)\) converges to the probability distribution of a \(Poisson\left( \lambda \right)\) random variable. So, \(X_{n}\overset{D}{\rightarrow }X\), where \(X\sim Poisson\left( \lambda \right)\).

  • Now consider another case. If \(p\) is fixed, the probability distribution of \[\begin{equation*} Y_{n}=\frac{X_{n}-np}{\sqrt{np\left( 1-p\right) }} \end{equation*}\] converges, as \(n\) goes to infinity, to that of a standard Normal random variable [Theorem of De Moivre-Laplace]. So, \(Y_{n}\overset{D}{\rightarrow }Y\), where \(Y\sim N(0,1)\). Both limits can be checked numerically, as in the sketch below.
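
A minimal numerical sketch of both limits (Python with scipy.stats), using \(\lambda=2\), \(p=0.3\), and a few evaluation points chosen purely for illustration.

```python
import numpy as np
from scipy.stats import binom, norm, poisson

# (a) Poisson limit: X_n ~ B(n, lambda/n) with lambda = 2, CDF evaluated at x = 3.
lam, x = 2.0, 3
for n in (10, 100, 1_000, 10_000):
    print(f"n = {n:6d}   B(n, lam/n) CDF at {x}: {binom.cdf(x, n, lam / n):.5f}")
print(f"Poisson({lam}) CDF at {x}:         {poisson.cdf(x, lam):.5f}")

# (b) De Moivre-Laplace: the CDF of Y_n = (X_n - np) / sqrt(np(1-p)) approaches the N(0,1) CDF.
p, y = 0.3, 1.0
for n in (10, 100, 1_000, 10_000):
    x_cut = n * p + y * np.sqrt(n * p * (1 - p))      # {Y_n <= y} rewritten as {X_n <= x_cut}
    print(f"n = {n:6d}   Pr(Y_n <= {y}) = {binom.cdf(x_cut, n, p):.5f}")
print(f"N(0,1) CDF at {y}:       {norm.cdf(y):.5f}")
```
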


Example 3.5 Let us consider a sequence of continuous r.v.’s \(X_1, X_2, ..., X_n,...\), where \(X_n\) has range \((0, n]\) and CDF \[ F_{X_n} (x) = 1- \left( 1- \frac{x}{n} \right)^n, \ \ 0<x\leq n. \] Then, as \(n \to \infty\), the limiting support is \((0,\infty)\) and, \(\forall x >0\), we have \[ F_{X_n} (x) \to F_X(x) = 1 - e^{-x}, \] which is the CDF of an exponential r.v. (and the convergence holds at all its continuity points).

So, we conclude that \(X_n\) converges in distribution to an exponential r.v., that is \[ X_n \overset{D}{\rightarrow } X, \quad X \sim \text{Exp}(1). \]
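
A quick numerical check of this convergence (Python, numpy), evaluating \(F_{X_n}\) at \(x=1.5\), a point chosen purely for illustration.

```python
import numpy as np

# F_{X_n}(x) = 1 - (1 - x/n)^n versus the limiting Exp(1) CDF F_X(x) = 1 - exp(-x), at x = 1.5.
x = 1.5
for n in (2, 5, 10, 100, 1_000, 10_000):
    f_n = 1 - (1 - x / n) ** n if x <= n else 1.0     # CDF of X_n (equal to 1 for x > n)
    print(f"n = {n:6d}   F_Xn({x}) = {f_n:.6f}")
print(f"limit: 1 - exp(-{x}) = {1 - np.exp(-x):.6f}")
```
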

The following theorem, the Central Limit Theorem (CLT), is often said to be one of the most important results in probability and statistics. Its significance lies in the fact that it allows accurate probability approximations to be made without knowledge of the underlying distribution!

Theorem 7.1 Let \(X_{1},X_{2},...,X_{n},...\) be a sequence of independent random variables with common probability distribution \(F_X(x)\), and let \(Y=h(X)\) be such that \[\begin{eqnarray*} E[Y]=E\left[ h(X)\right] &=&\mu_Y \\ Var(Y)=Var\left( h(X)\right) &=&\sigma_Y ^{2}<\infty\,. \end{eqnarray*}\] Set \[ \overline{Y}_n=\frac{1}{n}\sum_{s=1}^nY_s\quad\text{where}\quad Y_s=h(X_s)\,,\quad s=1,\ldots,n\,. \] Then (under quite general regularity conditions)

\[\begin{equation*} \frac{\sqrt{n}\left( \overline{Y}_{n}-\mu_Y \right) }{\sigma_Y }\overset{D}{% \rightarrow }N\left( 0,1\right). \end{equation*}\]
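
A minimal simulation sketch of the theorem (Python with numpy/scipy), assuming, purely for illustration, \(Y\sim\text{Exp}(1)\) (a clearly non-normal distribution with \(\mu_Y=\sigma_Y=1\)) and checking one tail probability of the standardised mean.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)

# Illustrative skewed distribution: Y ~ Exp(1), so mu_Y = sigma_Y = 1.
mu_y, sigma_y, reps = 1.0, 1.0, 20_000

for n in (5, 30, 200, 1_000):
    ybar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)   # reps copies of Y-bar_n
    z = np.sqrt(n) * (ybar - mu_y) / sigma_y                         # standardised sample mean
    print(f"n = {n:5d}   Pr(Z <= 1.96) ~ {np.mean(z <= 1.96):.4f}")
print(f"N(0,1):     Pr(Z <= 1.96) = {norm.cdf(1.96):.4f}")
```

Even for a strongly skewed parent distribution, the simulated probabilities approach the normal value quite quickly.
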


Remark. Several generalizations of this statement are available. For instance, one can state a CLT for data which are independent but NOT identically distributed. Another possibility is to state a CLT for data which are NOT independent, namely for dependent data: for this you need to attend my course on Time Series, offered in the Fall semester of the Master in Statistics at the University of Geneva!