1.3 Stochastic convergence review

Let \(X_n\) be a sequence of random variables defined on a common probability space \((\Omega,\mathcal{A},\mathbb{P})\). The four most common types of convergence of \(X_n\) to a random variable \(X\) defined on \((\Omega,\mathcal{A},\mathbb{P})\) are the following.

Definition 1.1 (Convergence in distribution) \(X_n\) converges in distribution to \(X\), written \(X_n\stackrel{d}{\longrightarrow}X\), if \(\lim_{n\to\infty}F_n(x)=F(x)\) for all \(x\) for which \(F\) is continuous, where \(X_n\sim F_n\) and \(X\sim F\).

“Convergence in distribution” is also referred to as weak convergence. Proposition 1.6 justifies this terminology.

Definition 1.2 (Convergence in probability) \(X_n\) converges in probability to \(X\), written \(X_n\stackrel{\mathbb{P}}{\longrightarrow}X\), if \(\lim_{n\to\infty}\mathbb{P}[|X_n-X|>\varepsilon]=0\), \(\forall \varepsilon>0\).

Definition 1.3 (Convergence almost surely) \(X_n\) converges almost surely (as) to \(X\), written \(X_n\stackrel{\mathrm{as}}{\longrightarrow}X\), if \(\mathbb{P}[\{\omega\in\Omega:\lim_{n\to\infty}X_n(\omega)=X(\omega)\}]=1\).

Definition 1.4 (Convergence in \(r\)-mean) For \(r\geq1\), \(X_n\) converges in \(r\)-mean to \(X\), written \(X_n\stackrel{r}{\longrightarrow}X\), if \(\lim_{n\to\infty}\mathbb{E}[|X_n-X|^r]=0\).

Remark. The previous definitions can be extended to a sequence of \(p\)-random vectors \(\mathbf{X}_n\). For Definitions 1.2 and 1.4, replace \(|\cdot|\) with the Euclidean norm \(||\cdot||\). Alternatively, Definition 1.2 can be extended marginally: \(\mathbf{X}_n\stackrel{\mathbb{P}}{\longrightarrow}\mathbf{X}:\iff X_{j,n}\stackrel{\mathbb{P}}{\longrightarrow}X_j\), \(\forall j=1,\ldots,p\). For Definition 1.1, replace \(F_n\) and \(F\) by the joint cdfs of \(\mathbf{X}_n\) and \(\mathbf{X}\), respectively. Definition 1.3 also extends marginally.
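
The definitions above can be explored numerically. The following minimal Monte Carlo sketch (in Python; the choice \(X_n = X + Z_n/\sqrt{n}\) with \(Z_n\sim\mathcal{N}(0,1)\) independent of \(X\sim\mathcal{N}(0,1)\), as well as the sample sizes and seed, are illustrative assumptions) approximates the quantities appearing in Definitions 1.2 and 1.4 and shows both shrinking as \(n\) grows.

```python
import numpy as np

# Monte Carlo sketch of Definitions 1.2 and 1.4 for X_n = X + Z_n / sqrt(n),
# with Z_n ~ N(0, 1) independent of X ~ N(0, 1). Illustrative choices only.
rng = np.random.default_rng(42)
M = 100_000   # Monte Carlo replicates
eps = 0.1     # the epsilon in Definition 1.2

for n in (10, 100, 1_000, 10_000):
    X = rng.standard_normal(M)                     # realizations of the limit X
    Xn = X + rng.standard_normal(M) / np.sqrt(n)   # realizations of X_n
    prob = np.mean(np.abs(Xn - X) > eps)           # approximates P[|X_n - X| > eps]
    mse = np.mean(np.abs(Xn - X) ** 2)             # approximates E[|X_n - X|^2]
    print(f"n = {n:6d}  P[|X_n - X| > eps] ≈ {prob:.4f}  E[|X_n - X|^2] ≈ {mse:.6f}")
```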

Convergence in \(2\)-mean plays a notable role in the definition of a consistent estimator \(\hat\theta_n=T(X_1,\ldots,X_n)\) of a parameter \(\theta\). We say that the estimator is consistent if its Mean Squared Error (MSE),

\[\begin{align*} \mathrm{MSE}[\hat\theta_n]:\!&=\mathbb{E}[(\hat\theta_n-\theta)^2]\\ &=(\mathbb{E}[\hat\theta_n]-\theta)^2+\mathbb{V}\mathrm{ar}[\hat\theta_n]\\ &=:\mathrm{Bias}[\hat\theta_n]^2+\mathbb{V}\mathrm{ar}[\hat\theta_n], \end{align*}\]

goes to zero as \(n\to\infty\) or, equivalently, if \(\hat\theta_n\stackrel{2}{\longrightarrow}\theta\).
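
The bias–variance decomposition can be checked by simulation. Below is a minimal sketch (Python; the normal model, sample sizes, and seed are illustrative assumptions) that estimates the bias, variance, and MSE of \(\hat\theta_n=\bar X\) for \(\mathcal{N}(\mu,\sigma^2)\) data, confirming that \(\mathrm{MSE}\approx\mathrm{Bias}^2+\mathbb{V}\mathrm{ar}\) and that both vanish as \(n\) grows.

```python
import numpy as np

# Monte Carlo sketch of the MSE decomposition for the sample mean of N(mu, sigma^2) data.
rng = np.random.default_rng(1)
mu, sigma = 2.0, 3.0
M = 50_000  # Monte Carlo replicates

for n in (10, 100, 1_000):
    theta_hat = rng.normal(mu, sigma, size=(M, n)).mean(axis=1)  # M realizations of the estimator
    bias2 = (theta_hat.mean() - mu) ** 2
    var = theta_hat.var()
    mse = np.mean((theta_hat - mu) ** 2)
    print(f"n = {n:5d}  Bias^2 ≈ {bias2:.2e}  Var ≈ {var:.4f}  MSE ≈ {mse:.4f}  "
          f"(theory: {sigma**2 / n:.4f})")
```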

If \(X_n\stackrel{d,\mathbb{P},r,\mathrm{as}}{\longrightarrow} X\) and \(X\) is a degenerate random variable such that \(\mathbb{P}[X=c]=1\), \(c\in\mathbb{R}\), then we write \(X_n\stackrel{d,\mathbb{P},r,\mathrm{as}}{\longrightarrow} c\) (the list notation \(\stackrel{d,\mathbb{P},r,\mathrm{as}}{\longrightarrow}\) is used to condense four different convergence results into the same line).

The relations between the types of convergence are conveniently summarized in the following proposition.

Proposition 1.6 Let \(X_n\) be a sequence of random variables and \(X\) a random variable. Then the following implication diagram is satisfied:

\[\begin{align*} \begin{array}{rcl} X_n\stackrel{r}{\longrightarrow}X \quad\implies & X_n\stackrel{\mathbb{P}}{\longrightarrow}X & \impliedby\quad X_n\stackrel{\mathrm{as}}{\longrightarrow}X\\ & \Downarrow & \\ & X_n\stackrel{d}{\longrightarrow}X & \end{array} \end{align*}\]

Also, if \(s\geq r\geq 1\), then \(X_n\stackrel{s}{\longrightarrow}X\implies X_n\stackrel{r}{\longrightarrow}X\).

None of the converses holds in general. However, there are some notable exceptions:

  1. If \(X_n\stackrel{d}{\longrightarrow}c\), then \(X_n\stackrel{\mathbb{P}}{\longrightarrow}c\), \(c\in\mathbb{R}\).
  2. If \(\sum_{n=1}^\infty\mathbb{P}[|X_n-X|>\varepsilon]<\infty\), \(\forall\varepsilon>0\) (this condition implies \(X_n\stackrel{\mathbb{P}}{\longrightarrow}X\); intuitively, if convergence in probability is fast enough, then the convergence is also almost sure), then \(X_n\stackrel{\mathrm{as}}{\longrightarrow}X\).
  3. If \(X_n\stackrel{\mathbb{P}}{\longrightarrow}X\) and \(\mathbb{P}[|X_n|\leq M]=1\) for all \(n\in\mathbb{N}\) and some \(M>0\) (i.e., the \(X_n\) are uniformly bounded random variables), then \(X_n\stackrel{r}{\longrightarrow}X\) for \(r\geq1\).
  4. If \(S_n=\sum_{i=1}^nX_i\) with \(X_1,\ldots,X_n\) iid, then \(S_n\stackrel{\mathbb{P}}{\longrightarrow}S\iff S_n\stackrel{\mathrm{as}}{\longrightarrow}S\).

The cornerstone limit result in probability is the Central Limit Theorem (CLT). One of its simplest versions has the following form.

Theorem 1.1 (CLT) Let \(X_n\) be a sequence of iid random variables with \(\mathbb{E}[X_i]=\mu\) and \(\mathbb{V}\mathrm{ar}[X_i]=\sigma^2<\infty\), \(i\in\mathbb{N}\), and let \(\bar{X}:=\frac{1}{n}\sum_{i=1}^nX_i\). Then:

\[\begin{align*} \frac{\sqrt{n}(\bar{X}-\mu)}{\sigma}\stackrel{d}{\longrightarrow}\mathcal{N}(0,1). \end{align*}\]
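
A minimal simulation sketch of Theorem 1.1 (Python; the \(\mathrm{Exp}(1)\) model, for which \(\mu=\sigma=1\), and the sample sizes are assumptions made for illustration): the standardized sample mean of iid exponential data is already close to \(\mathcal{N}(0,1)\) for moderate \(n\).

```python
import numpy as np
from scipy import stats

# Sketch of the CLT for iid Exp(1) data (mu = sigma = 1), chosen for illustration.
rng = np.random.default_rng(7)
M, n = 50_000, 200
x = rng.exponential(scale=1.0, size=(M, n))
z = np.sqrt(n) * (x.mean(axis=1) - 1.0) / 1.0   # sqrt(n) (X_bar - mu) / sigma

# Compare a few empirical quantiles with those of N(0, 1)
for q in (0.05, 0.5, 0.95):
    print(f"q = {q:.2f}  empirical: {np.quantile(z, q):+.3f}  N(0,1): {stats.norm.ppf(q):+.3f}")
```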

We will later use the following CLT for random variables that are independent but not identically distributed due to its easy-to-check moment conditions.

Theorem 1.2 (Lyapunov’s CLT) Let \(X_n\) be a sequence of independent random variables with \(\mathbb{E}[X_i]=\mu_i\) and \(\mathbb{V}\mathrm{ar}[X_i]=\sigma_i^2<\infty\), \(i\in\mathbb{N}\), and such that for some \(\delta>0\)

\[\begin{align*} \frac{1}{s_n^{2+\delta}}\sum_{i=1}^n\mathbb{E}\left[|X_i-\mu_i|^{2+\delta}\right]\longrightarrow0\text{ as }n\to\infty, \end{align*}\]

with \(s_n^2=\sum_{i=1}^n\sigma^2_i\). Then:

\[\begin{align*} \frac{1}{s_n}\sum_{i=1}^n(X_i-\mu_i)\stackrel{d}{\longrightarrow}\mathcal{N}(0,1). \end{align*}\]
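
The sketch below (Python; the Bernoulli probabilities \(p_i\) are an arbitrary illustrative choice) simulates independent but non-identically distributed \(X_i\sim\mathrm{Ber}(p_i)\). Since the variables are uniformly bounded and \(s_n^2\) grows linearly, Lyapunov's condition holds for \(\delta=1\), and the standardized sum is approximately standard normal.

```python
import numpy as np

# Sketch of Lyapunov's CLT with independent, non-identically distributed Bernoulli variables.
rng = np.random.default_rng(3)
M, n = 50_000, 500
i = np.arange(1, n + 1)
p = 0.5 + 0.4 * np.sin(i)                 # illustrative p_i in [0.1, 0.9]
x = rng.binomial(1, p, size=(M, n))       # rows: replicates of (X_1, ..., X_n)
s_n = np.sqrt(np.sum(p * (1 - p)))        # s_n = sqrt of the sum of variances
z = np.sum(x - p, axis=1) / s_n           # (1 / s_n) * sum_i (X_i - mu_i)

print(f"mean ≈ {z.mean():+.3f} (should be ≈ 0), sd ≈ {z.std():.3f} (should be ≈ 1)")
print(f"P[Z <= 1.645] ≈ {np.mean(z <= 1.645):.3f} (N(0,1) gives 0.950)")
```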

Finally, the following results will be useful (\('\) denotes transposition).

Theorem 1.3 (Cramér–Wold device) Let \(\mathbf{X}_n\) be a sequence of \(p\)-dimensional random vectors. Then:

\[\begin{align*} \mathbf{X}_n\stackrel{d}{\longrightarrow}\mathbf{X}\iff \mathbf{c}'\mathbf{X}_n\stackrel{d}{\longrightarrow}\mathbf{c}'\mathbf{X},\quad \forall \mathbf{c}\in\mathbb{R}^p. \end{align*}\]

Theorem 1.4 (Continuous mapping theorem) If \(\mathbf{X}_n\stackrel{d,\,\mathbb{P},\,\mathrm{as}}{\longrightarrow}\mathbf{X}\), then

\[\begin{align*} g(\mathbf{X}_n)\stackrel{d,\,\mathbb{P},\,\mathrm{as}}{\longrightarrow}g(\mathbf{X}) \end{align*}\]

for any continuous function \(g\).
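
For instance, if \(Z_n:=\sqrt{n}(\bar X-\mu)/\sigma\stackrel{d}{\longrightarrow}\mathcal{N}(0,1)\) and \(g(x)=x^2\) (a continuous map), then \(Z_n^2\stackrel{d}{\longrightarrow}\chi^2_1\). A minimal numerical check (Python; the uniform model is an illustrative assumption):

```python
import numpy as np
from scipy import stats

# Continuous mapping theorem sketch: the squared standardized mean is approximately chi^2_1.
rng = np.random.default_rng(11)
M, n = 50_000, 200
mu, sigma = 0.5, np.sqrt(1 / 12)                         # mean and sd of U(0, 1)
x = rng.uniform(size=(M, n))
z2 = (np.sqrt(n) * (x.mean(axis=1) - mu) / sigma) ** 2   # g(Z_n) with g(x) = x^2

for q in (0.5, 0.9, 0.95):
    print(f"q = {q:.2f}  empirical: {np.quantile(z2, q):.3f}  chi^2_1: {stats.chi2.ppf(q, df=1):.3f}")
```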

Theorem 1.5 (Slutsky’s theorem) Let \(X_n\) and \(Y_n\) be sequences of random variables and \(c\in\mathbb{R}\). Then:

  1. If \(X_n\stackrel{d}{\longrightarrow}X\) and \(Y_n\stackrel{\mathbb{P}}{\longrightarrow} c\), then \(X_nY_n\stackrel{d}{\longrightarrow}cX\).
  2. If \(X_n\stackrel{d}{\longrightarrow}X\) and \(Y_n\stackrel{\mathbb{P}}{\longrightarrow} c\), \(c\neq0\), then \(\frac{X_n}{Y_n}\stackrel{d}{\longrightarrow}\frac{X}{c}\).
  3. If \(X_n\stackrel{d}{\longrightarrow}X\) and \(Y_n\stackrel{\mathbb{P}}{\longrightarrow} c\), then \(X_n+Y_n\stackrel{d}{\longrightarrow}X+c\).
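
A classical application of Slutsky's theorem is the studentized mean: \(\sqrt{n}(\bar X-\mu)/\sigma\stackrel{d}{\longrightarrow}\mathcal{N}(0,1)\) by the CLT and the sample standard deviation satisfies \(S_n\stackrel{\mathbb{P}}{\longrightarrow}\sigma\), so item 2 gives \(\sqrt{n}(\bar X-\mu)/S_n\stackrel{d}{\longrightarrow}\mathcal{N}(0,1)\) even for non-normal data. A minimal sketch (Python; the exponential model is an illustrative assumption):

```python
import numpy as np

# Slutsky sketch: the studentized mean of non-normal (exponential) data is approximately N(0, 1).
rng = np.random.default_rng(5)
M, n = 50_000, 300
mu = 1.0
x = rng.exponential(scale=mu, size=(M, n))
t = np.sqrt(n) * (x.mean(axis=1) - mu) / x.std(axis=1, ddof=1)  # sigma replaced by S_n

print(f"mean ≈ {t.mean():+.3f}, sd ≈ {t.std():.3f} (close to 0 and 1)")
print(f"P[|T| > 1.96] ≈ {np.mean(np.abs(t) > 1.96):.3f} (N(0,1) gives 0.050)")
```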

Theorem 1.6 (Limit algebra for \((\mathbb{P},\,r,\,\mathrm{as})\)-convergence) Let \(X_n\) and \(Y_n\) be sequences of random variables, and let \(a_n\to a\) and \(b_n\to b\) be two sequences of real numbers. Denote by \(X_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}X\) the statement “\(X_n\) converges in probability (respectively, in \(r\)-mean, respectively, almost surely) to \(X\).”

  1. If \(X_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}X\) and \(Y_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}Y\), then \(a_nX_n+b_nY_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}aX+bY\).
  2. If \(X_n\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}X\) and \(Y_n\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}Y\), then \(X_nY_n\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}XY\).

Remark. Recall the absence of analogous results for convergence in distribution. There are no such results! In particular, it is false that \(X_n\stackrel{d}{\longrightarrow}X\) and \(Y_n\stackrel{d}{\longrightarrow}Y\) imply \(X_n+Y_n\stackrel{d}{\longrightarrow}X+Y\) or \(X_nY_n\stackrel{d}{\longrightarrow}XY\). For example, if \(X_n=Z\) and \(Y_n=-Z\) with \(Z\sim\mathcal{N}(0,1)\), then \(X_n\stackrel{d}{\longrightarrow}\mathcal{N}(0,1)\) and \(Y_n\stackrel{d}{\longrightarrow}\mathcal{N}(0,1)\), yet \(X_n+Y_n=0\) is not distributed as the sum of two independent standard normals. It is true, however, that \((X_n,Y_n)\stackrel{d}{\longrightarrow}(X,Y)\) (a much stronger premise) implies both \(X_n+Y_n\stackrel{d}{\longrightarrow}X+Y\) and \(X_nY_n\stackrel{d}{\longrightarrow}XY\).

Theorem 1.7 (Delta method) If \(\sqrt{n}(X_n-\mu)\stackrel{d}{\longrightarrow}\mathcal{N}(0,\sigma^2)\), then

\[\begin{align*} \sqrt{n}(g(X_n)-g(\mu))\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,(g'(\mu))^2\sigma^2\right) \end{align*}\]

for any function \(g\) that is differentiable at \(\mu\) and such that \(g'(\mu)\neq0\).
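
A minimal sketch of the delta method (Python; the exponential model with mean \(\mu=2\), hence \(\sigma^2=4\), and the map \(g(x)=x^2\) are illustrative assumptions): the theorem predicts \(\sqrt{n}(\bar X^2-\mu^2)\stackrel{d}{\longrightarrow}\mathcal{N}(0,(2\mu)^2\sigma^2)\), that is, an asymptotic standard deviation of \(2\mu\sigma=8\).

```python
import numpy as np

# Delta method sketch with g(x) = x^2 and iid exponential data with mean mu = 2 (so sigma^2 = 4).
rng = np.random.default_rng(13)
M, n = 50_000, 500
mu, sigma = 2.0, 2.0
x = rng.exponential(scale=mu, size=(M, n))
d = np.sqrt(n) * (x.mean(axis=1) ** 2 - mu ** 2)   # sqrt(n) (g(X_bar) - g(mu))

print(f"sample sd ≈ {d.std():.2f}  vs  |g'(mu)| * sigma = {2 * mu * sigma:.2f}")
```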

Example 1.2 It is well known that, given a parametric density \(f_\theta\) with parameter \(\theta\in\Theta\) and iid \(X_1,\ldots,X_n\sim f_\theta\), the Maximum Likelihood (ML) estimator \(\hat\theta_{\mathrm{ML}}:=\arg\max_{\theta\in\Theta}\sum_{i=1}^n\log f_\theta(X_i)\) (the parameter that maximizes the “probability” of the data based on the model) is asymptotically normal under certain regularity conditions:

\[\begin{align*} \sqrt{n}(\hat\theta_{\mathrm{ML}}-\theta)\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,I(\theta)^{-1}\right), \end{align*}\]

where \(I(\theta):=-\mathbb{E}_\theta\left[\frac{\partial^2\log f_\theta(x)}{\partial\theta^2}\right]\) is known as the Fisher information. Then, applying the delta method (provided \(g\) is differentiable at \(\theta\) and \(g'(\theta)\neq0\)), it follows that

\[\begin{align*} \sqrt{n}(g(\hat\theta_{\mathrm{ML}})-g(\theta))\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,(g'(\theta))^2I(\theta)^{-1}\right). \end{align*}\]

Note that, had we applied the continuous mapping theorem for \(g\) instead, we would have obtained a different result:

\[\begin{align*} g(\sqrt{n}(\hat\theta_{\mathrm{ML}}-\theta))\stackrel{d}{\longrightarrow}g\left(\mathcal{N}\left(0,I(\theta)^{-1}\right)\right). \end{align*}\]
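
A concrete sketch of this example (Python; the exponential model \(f_\lambda(x)=\lambda e^{-\lambda x}\) is an assumption made for illustration): here \(\hat\lambda_{\mathrm{ML}}=1/\bar X\) and \(I(\lambda)=1/\lambda^2\), so \(\sqrt{n}(\hat\lambda_{\mathrm{ML}}-\lambda)\stackrel{d}{\longrightarrow}\mathcal{N}(0,\lambda^2)\); taking \(g(\lambda)=1/\lambda\) (the mean), the delta method gives \(\sqrt{n}(1/\hat\lambda_{\mathrm{ML}}-1/\lambda)\stackrel{d}{\longrightarrow}\mathcal{N}(0,1/\lambda^2)\), in agreement with the CLT for \(\bar X\).

```python
import numpy as np

# MLE asymptotics sketch for Exp(lambda): lambda_hat = 1 / X_bar, I(lambda) = 1 / lambda^2.
rng = np.random.default_rng(17)
M, n = 50_000, 500
lam = 2.0
x = rng.exponential(scale=1 / lam, size=(M, n))
lam_hat = 1 / x.mean(axis=1)

z1 = np.sqrt(n) * (lam_hat - lam)          # asymptotically N(0, lambda^2)
z2 = np.sqrt(n) * (1 / lam_hat - 1 / lam)  # delta method with g(lambda) = 1 / lambda

print(f"sd of sqrt(n)(lambda_hat - lambda) ≈ {z1.std():.3f}  vs  lambda = {lam:.3f}")
print(f"sd of sqrt(n)(g(lambda_hat) - g(lambda)) ≈ {z2.std():.3f}  vs  1 / lambda = {1 / lam:.3f}")
```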

Exercise 1.10 Let’s dig further into the differences between the delta method and the continuous mapping theorem when applied to \(\sqrt{n}(X_n-\mu)\stackrel{d}{\longrightarrow}\mathcal{N}(0,\sigma^2)\):

  1. For what kind of maps \(g\) are the results \(\sqrt{n}(g(X_n)-g(\mu))\stackrel{d}{\longrightarrow}\mathcal{N}(0,(g'(\mu))^2\sigma^2)\) and \(g(\sqrt{n}(X_n-\mu))\stackrel{d}{\longrightarrow}g(\mathcal{N}(0,\sigma^2))\) equivalent?
  2. Take \(g(x)=e^x\). What two results do you obtain with the delta method and the continuous mapping theorem when applied to \(\sqrt{n}\bar X \stackrel{d}{\longrightarrow}\mathcal{N}(0,\sigma^2)\)?

Theorem 1.8 (Weak and strong laws of large numbers) Let \(X_n\) be an iid sequence with \(\mathbb{E}[X_i]=\mu\), \(i\geq1\). Then: \(\frac{1}{n}\sum_{i=1}^nX_i\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}\mu\).
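
A brief sketch of the law of large numbers (Python; the Poisson model and seed are illustrative assumptions): the running sample mean of iid data settles around \(\mu\) as \(n\) grows.

```python
import numpy as np

# Law of large numbers sketch: running means of iid Poisson(3) data approach mu = 3.
rng = np.random.default_rng(23)
mu, N = 3.0, 100_000
x = rng.poisson(lam=mu, size=N)
running_mean = np.cumsum(x) / np.arange(1, N + 1)

for n in (100, 1_000, 10_000, 100_000):
    print(f"n = {n:6d}  X_bar_n = {running_mean[n - 1]:.4f}  "
          f"|X_bar_n - mu| = {abs(running_mean[n - 1] - mu):.4f}")
```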

