## 1.3 Stochastic convergence review

Let $$X_n$$ be a sequence of random variables defined in a common probability space $$(\Omega,\mathcal{A},\mathbb{P})$$. The four most common types of convergence of $$X_n$$ to a random variable in $$(\Omega,\mathcal{A},\mathbb{P})$$ are the following.

Definition 1.1 (Convergence in distribution) $$X_n$$ converges in distribution to $$X$$, written $$X_n\stackrel{d}{\longrightarrow}X$$, if $$\lim_{n\to\infty}F_n(x)=F(x)$$ for all $$x$$ at which $$F$$ is continuous, where $$X_n\sim F_n$$ and $$X\sim F$$.

“Convergence in distribution” is also referred to as weak convergence. Proposition 1.6 justifies this terminology.

Definition 1.2 (Convergence in probability) $$X_n$$ converges in probability to $$X$$, written $$X_n\stackrel{\mathbb{P}}{\longrightarrow}X$$, if $$\lim_{n\to\infty}\mathbb{P}[|X_n-X|>\varepsilon]=0$$, $$\forall \varepsilon>0$$.

Definition 1.3 (Convergence almost surely) $$X_n$$ converges almost surely (as) to $$X$$, written $$X_n\stackrel{\mathrm{as}}{\longrightarrow}X$$, if $$\mathbb{P}[\{\omega\in\Omega:\lim_{n\to\infty}X_n(\omega)=X(\omega)\}]=1$$.

Definition 1.4 (Convergence in $$r$$-mean) For $$r\geq1$$, $$X_n$$ converges in $$r$$-mean to $$X$$, written $$X_n\stackrel{r}{\longrightarrow}X$$, if $$\lim_{n\to\infty}\mathbb{E}[|X_n-X|^r]=0$$.

Remark. The previous definitions can be extended to a sequence of $$p$$-dimensional random vectors $$\mathbf{X}_n$$. For Definitions 1.2 and 1.4, replace $$|\cdot|$$ with the Euclidean norm $$\|\cdot\|$$. Alternatively, Definition 1.2 can be extended marginally: $$\mathbf{X}_n\stackrel{\mathbb{P}}{\longrightarrow}\mathbf{X}:\iff X_{j,n}\stackrel{\mathbb{P}}{\longrightarrow}X_j$$, $$\forall j=1,\ldots,p$$. For Definition 1.1, replace $$F_n$$ and $$F$$ by the joint cdfs of $$\mathbf{X}_n$$ and $$\mathbf{X}$$, respectively. Definition 1.3 also extends marginally.

The $$2$$-mean convergence plays a remarkable role in defining a consistent estimator $$\hat\theta_n=T(X_1,\ldots,X_n)$$ of a parameter $$\theta$$. We say that the estimator is consistent if its Mean Squared Error (MSE),

\begin{align*} \mathrm{MSE}[\hat\theta_n]:\!&=\mathbb{E}[(\hat\theta_n-\theta)^2]\\ &=(\mathbb{E}[\hat\theta_n]-\theta)^2+\mathbb{V}\mathrm{ar}[\hat\theta_n]\\ &=:\mathrm{Bias}[\hat\theta_n]^2+\mathbb{V}\mathrm{ar}[\hat\theta_n], \end{align*}

goes to zero as $$n\to\infty$$. Equivalently written, if $$\hat\theta_n\stackrel{2}{\longrightarrow}\theta$$.
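The decomposition and the decay of the MSE can be checked by simulation. The following is a minimal sketch, not part of the original notes: it assumes normal data and takes as $$\hat\theta_n$$ the divisor-$$n$$ (biased) estimator of $$\theta=\sigma^2$$. The function name `mse_decomposition` and its parameters are illustrative choices.

```python
import random

def mse_decomposition(n, reps=1000, sigma=1.0, seed=1):
    """Monte Carlo estimates of (MSE, Bias^2, Var) for the divisor-n variance estimator."""
    rng = random.Random(seed)
    theta = sigma ** 2  # true parameter: the variance of N(0, sigma^2) (assumed model)
    estimates = []
    for _ in range(reps):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        # Biased variance estimator: sum of squared deviations divided by n
        estimates.append(sum((x - xbar) ** 2 for x in sample) / n)
    mean_est = sum(estimates) / reps
    mse = sum((t - theta) ** 2 for t in estimates) / reps
    bias2 = (mean_est - theta) ** 2
    var = sum((t - mean_est) ** 2 for t in estimates) / reps
    return mse, bias2, var

for n in (10, 100, 400):
    mse, bias2, var = mse_decomposition(n)
    print(f"n={n:4d}  MSE={mse:.5f}  Bias^2={bias2:.5f}  Var={var:.5f}")
```

Within each run, the Monte Carlo MSE equals the sum of the squared bias and the variance up to floating-point error, and all three quantities should shrink as $$n$$ grows.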

If $$X_n\stackrel{d,\mathbb{P},r,\mathrm{as}}{\longrightarrow} X$$ and $$X$$ is a degenerate random variable such that $$\mathbb{P}[X=c]=1$$, $$c\in\mathbb{R}$$, then we write $$X_n\stackrel{d,\mathbb{P},r,\mathrm{as}}{\longrightarrow} c$$ (the list-notation $$\stackrel{d,\mathbb{P},r,\mathrm{as}}{\longrightarrow}$$ is used to condense four different convergence results in the same line).

The relations between the types of convergences are conveniently summarized in the following proposition.

Proposition 1.6 Let $$X_n$$ be a sequence of random variables and $$X$$ a random variable. Then the following implication diagram is satisfied:

\begin{align*} \begin{array}{rcl} X_n\stackrel{r}{\longrightarrow}X \quad\implies & X_n\stackrel{\mathbb{P}}{\longrightarrow}X & \impliedby\quad X_n\stackrel{\mathrm{as}}{\longrightarrow}X\\ & \Downarrow & \\ & X_n\stackrel{d}{\longrightarrow}X & \end{array} \end{align*}

Also, if $$s\geq r\geq 1$$, then $$X_n\stackrel{s}{\longrightarrow}X\implies X_n\stackrel{r}{\longrightarrow}X$$.

None of the converses holds in general. However, there are some notable exceptions:

1. If $$X_n\stackrel{d}{\longrightarrow}c$$, then $$X_n\stackrel{\mathbb{P}}{\longrightarrow}c$$, $$c\in\mathbb{R}$$.
2. If $$\sum_{n=1}^\infty\mathbb{P}[|X_n-X|>\varepsilon]<\infty$$, $$\forall\varepsilon>0$$ (implies$$^{1}$$ $$X_n\stackrel{\mathbb{P}}{\longrightarrow}X$$), then $$X_n\stackrel{\mathrm{as}}{\longrightarrow}X$$.
3. If $$X_n\stackrel{\mathbb{P}}{\longrightarrow}X$$ and $$\mathbb{P}[|X_n|\leq M]=1$$, $$\forall n\in\mathbb{N}$$, for some $$M>0\,^{2}$$, then $$X_n\stackrel{r}{\longrightarrow}X$$ for $$r\geq1$$.
4. If $$S_n=\sum_{i=1}^nX_i$$ with $$X_1,\ldots,X_n$$ iid, then $$S_n\stackrel{\mathbb{P}}{\longrightarrow}S\iff S_n\stackrel{\mathrm{as}}{\longrightarrow}S$$.

The cornerstone limit result in probability is the Central Limit Theorem (CLT). One of its simpler versions has the following form.

Theorem 1.1 (CLT) Let $$X_n$$ be a sequence of iid random variables with $$\mathbb{E}[X_i]=\mu$$ and $$\mathbb{V}\mathrm{ar}[X_i]=\sigma^2\in(0,\infty)$$, $$i\in\mathbb{N}$$, and let $$\bar{X}:=\frac{1}{n}\sum_{i=1}^nX_i$$. Then:

\begin{align*} \frac{\sqrt{n}(\bar{X}-\mu)}{\sigma}\stackrel{d}{\longrightarrow}\mathcal{N}(0,1). \end{align*}
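The statement can be illustrated numerically. The sketch below is an illustration under assumptions that are not in the text: the $$X_i$$ are taken to be $$\mathrm{Exp}(1)$$ (so $$\mu=\sigma=1$$), and the helper names `standardized_means` and `ecdf_at` are hypothetical. It compares the empirical cdf of the standardized mean with the $$\mathcal{N}(0,1)$$ cdf at a few points.

```python
import math
import random
from statistics import NormalDist

def standardized_means(n, reps=2000, seed=1):
    """Simulate sqrt(n) * (Xbar - mu) / sigma for iid Exp(1) samples of size n."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0  # mean and standard deviation of Exp(1)
    out = []
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        out.append(math.sqrt(n) * (xbar - mu) / sigma)
    return out

def ecdf_at(values, x):
    """Empirical cdf of `values` evaluated at x."""
    return sum(v <= x for v in values) / len(values)

for n in (5, 50, 200):
    z = standardized_means(n)
    row = ", ".join(
        f"F_n({x:+.0f})={ecdf_at(z, x):.3f} vs Phi={NormalDist().cdf(x):.3f}"
        for x in (-1.0, 0.0, 1.0)
    )
    print(f"n={n:3d}: {row}")
```

The agreement should improve as $$n$$ grows. Because the exponential distribution is skewed, small $$n$$ typically shows a visible discrepancy.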

We will later use the following CLT for random variables that are independent but not identically distributed due to its easy-to-check moment conditions.

Theorem 1.2 (Lyapunov’s CLT) Let $$X_n$$ be a sequence of independent random variables with $$\mathbb{E}[X_i]=\mu_i$$ and $$\mathbb{V}\mathrm{ar}[X_i]=\sigma_i^2<\infty$$, $$i\in\mathbb{N}$$, and such that for some $$\delta>0$$

\begin{align*} \frac{1}{s_n^{2+\delta}}\sum_{i=1}^n\mathbb{E}\left[|X_i-\mu_i|^{2+\delta}\right]\longrightarrow0\text{ as }n\to\infty, \end{align*}

with $$s_n^2=\sum_{i=1}^n\sigma^2_i$$. Then:

\begin{align*} \frac{1}{s_n}\sum_{i=1}^n(X_i-\mu_i)\stackrel{d}{\longrightarrow}\mathcal{N}(0,1). \end{align*}
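To see how the moment condition is checked in practice, the following sketch evaluates the Lyapunov ratio exactly for independent $$X_i\sim\mathrm{Bernoulli}(p_i)$$. The Bernoulli model, the cycling probabilities $$0.1, 0.5, 0.9$$, and the function names are assumptions made only for illustration.

```python
def lyapunov_ratio(ps, delta=1.0):
    """Lyapunov ratio for independent X_i ~ Bernoulli(p_i), with p_i taken from `ps`."""
    # s_n^2 is the sum of the variances p_i * (1 - p_i)
    s2 = sum(p * (1 - p) for p in ps)
    # E|X_i - p_i|^(2 + delta), computed exactly from the two-point distribution
    moments = sum(
        p * (1 - p) ** (2 + delta) + (1 - p) * p ** (2 + delta) for p in ps
    )
    return moments / s2 ** ((2 + delta) / 2)

def probabilities(n):
    """Non-identical success probabilities cycling through 0.1, 0.5, 0.9 (illustrative)."""
    cycle = (0.1, 0.5, 0.9)
    return [cycle[i % 3] for i in range(n)]

for n in (10, 100, 1000, 10000):
    print(n, round(lyapunov_ratio(probabilities(n)), 4))
```

Here $$|X_i-p_i|\leq1$$, so the numerator is at most $$s_n^2$$ and the ratio is bounded by $$s_n^{-\delta}$$, which tends to zero because $$s_n^2$$ grows linearly in $$n$$.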

Finally, the following results will be useful ($$'$$ denotes transposition).

Theorem 1.3 (Cramér–Wold device) Let $$\mathbf{X}_n$$ be a sequence of $$p$$-dimensional random vectors and $$\mathbf{X}$$ a $$p$$-dimensional random vector. Then:

\begin{align*} \mathbf{X}_n\stackrel{d}{\longrightarrow}\mathbf{X}\iff \mathbf{c}'\mathbf{X}_n\stackrel{d}{\longrightarrow}\mathbf{c}'\mathbf{X},\quad \forall \mathbf{c}\in\mathbb{R}^p. \end{align*}

Theorem 1.4 (Continuous mapping theorem) If $$\mathbf{X}_n\stackrel{d,\,\mathbb{P},\,\mathrm{as}}{\longrightarrow}\mathbf{X}$$, then

\begin{align*} g(\mathbf{X}_n)\stackrel{d,\,\mathbb{P},\,\mathrm{as}}{\longrightarrow}g(\mathbf{X}) \end{align*}

for any continuous function $$g$$.

Theorem 1.5 (Slutsky’s theorem) Let $$X_n$$ and $$Y_n$$ be sequences of random variables and $$c\in\mathbb{R}$$. Then:

1. If $$X_n\stackrel{d}{\longrightarrow}X$$ and $$Y_n\stackrel{\mathbb{P}}{\longrightarrow} c$$, then $$X_nY_n\stackrel{d}{\longrightarrow}cX$$.
2. If $$X_n\stackrel{d}{\longrightarrow}X$$ and $$Y_n\stackrel{\mathbb{P}}{\longrightarrow} c$$, $$c\neq0$$, then $$\frac{X_n}{Y_n}\stackrel{d}{\longrightarrow}\frac{X}{c}$$.
3. If $$X_n\stackrel{d}{\longrightarrow}X$$ and $$Y_n\stackrel{\mathbb{P}}{\longrightarrow} c$$, then $$X_n+Y_n\stackrel{d}{\longrightarrow}X+c$$.
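A typical use of the second statement is the studentized mean: $$\sqrt{n}(\bar X-\mu)/\sigma\stackrel{d}{\longrightarrow}\mathcal{N}(0,1)$$ by the CLT and $$S_n/\sigma\stackrel{\mathbb{P}}{\longrightarrow}1$$ for the sample standard deviation $$S_n$$, so $$\sqrt{n}(\bar X-\mu)/S_n\stackrel{d}{\longrightarrow}\mathcal{N}(0,1)$$. The sketch below illustrates this by simulation, assuming $$\mathcal{U}(0,1)$$ data. The distribution and the function names are illustrative choices, not taken from the text.

```python
import math
import random
from statistics import NormalDist

def studentized_means(n, reps=2000, seed=1):
    """Simulate sqrt(n) * (Xbar - mu) / S_n for iid U(0, 1) samples of size n."""
    rng = random.Random(seed)
    mu = 0.5  # mean of U(0, 1)
    out = []
    for _ in range(reps):
        sample = [rng.random() for _ in range(n)]
        xbar = sum(sample) / n
        # Sample standard deviation S_n (divisor n - 1), replacing the unknown sigma
        s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
        out.append(math.sqrt(n) * (xbar - mu) / s)
    return out

def ecdf_at(values, x):
    """Empirical cdf of `values` evaluated at x."""
    return sum(v <= x for v in values) / len(values)

t = studentized_means(200)
for x in (-1.0, 0.0, 1.0):
    print(f"x={x:+.0f}: empirical={ecdf_at(t, x):.3f}, Phi={NormalDist().cdf(x):.3f}")
```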

Theorem 1.6 (Limit algebra for $$(\mathbb{P},\,r,\,\mathrm{as})$$-convergence) Let $$X_n$$ and $$Y_n$$ be sequences of random variables, and $$a_n\to a$$ and $$b_n\to b$$ two sequences. Let $$X_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}X$$ denote “$$X_n$$ converges in probability (respectively in $$r$$-mean, respectively almost surely) to $$X$$.”

1. If $$X_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}X$$ and $$Y_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}Y$$, then $$a_nX_n+b_nY_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}aX+bY$$.
2. If $$X_n\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}X$$ and $$Y_n\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}Y$$, then $$X_nY_n\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}XY$$.

Remark. Note the absence of analogous results for convergence in distribution: there are none! In particular, it is false that $$X_n\stackrel{d}{\longrightarrow}X$$ and $$Y_n\stackrel{d}{\longrightarrow}Y$$ imply $$X_n+Y_n\stackrel{d}{\longrightarrow}X+Y$$ or $$X_nY_n\stackrel{d}{\longrightarrow}XY$$. It is true, however, that $$(X_n,Y_n)\stackrel{d}{\longrightarrow}(X,Y)$$ (a much stronger premise) implies both $$X_n+Y_n\stackrel{d}{\longrightarrow}X+Y$$ and $$X_nY_n\stackrel{d}{\longrightarrow}XY$$.
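A standard counterexample makes the failure concrete (a sketch; the auxiliary variable $$Z$$ is introduced here for illustration). Take $$Z\sim\mathcal{N}(0,1)$$ and set $$X_n:=Z$$ and $$Y_n:=-Z$$ for all $$n$$. Since $$-Z\sim\mathcal{N}(0,1)$$ by symmetry, both $$X_n\stackrel{d}{\longrightarrow}Z$$ and $$Y_n\stackrel{d}{\longrightarrow}Z$$ hold. However,

\begin{align*} X_n+Y_n=0\quad\text{for all }n,\qquad\text{whereas}\qquad Z+Z=2Z\sim\mathcal{N}(0,4), \end{align*}

so $$X_n+Y_n$$ does not converge in distribution to $$Z+Z$$. Joint convergence fails here because $$(X_n,Y_n)=(Z,-Z)$$ does not converge in distribution to $$(Z,Z)$$.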

Theorem 1.7 (Delta method) If $$\sqrt{n}(X_n-\mu)\stackrel{d}{\longrightarrow}\mathcal{N}(0,\sigma^2)$$, then

\begin{align*} \sqrt{n}(g(X_n)-g(\mu))\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,(g'(\mu))^2\sigma^2\right) \end{align*}

for any function $$g$$ that is differentiable at $$\mu$$ with $$g'(\mu)\neq0$$.
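The asymptotic variance $$(g'(\mu))^2\sigma^2$$ can be checked by simulation. The following sketch assumes $$X_n=\bar X$$ for iid $$\mathrm{Exp}(1)$$ data ($$\mu=\sigma^2=1$$) and $$g(x)=x^2$$, so the delta method predicts a limiting variance of $$(2\mu)^2\sigma^2=4$$. These choices and the function name are illustrative assumptions.

```python
import math
import random

def delta_method_variance(n, reps=2000, seed=1):
    """Sample variance of sqrt(n) * (g(Xbar) - g(mu)) for Exp(1) data and g(x) = x^2."""
    rng = random.Random(seed)
    mu = 1.0  # mean of Exp(1)

    def g(x):
        return x ** 2

    values = []
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        values.append(math.sqrt(n) * (g(xbar) - g(mu)))
    mean = sum(values) / reps
    return sum((v - mean) ** 2 for v in values) / (reps - 1)

predicted = (2 * 1.0) ** 2 * 1.0  # (g'(mu))^2 * sigma^2 with g'(x) = 2x
for n in (20, 200):
    print(f"n={n:3d}: simulated={delta_method_variance(n):.3f}, predicted={predicted:.3f}")
```

For finite $$n$$ the simulated variance differs from the limit by lower-order terms and Monte Carlo error, so only approximate agreement should be expected.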

Example 1.2 It is well known that, given a parametric density $$f_\theta$$ with parameter $$\theta\in\Theta$$ and iid $$X_1,\ldots,X_n\sim f_\theta$$, the Maximum Likelihood (ML) estimator $$\hat\theta_{\mathrm{ML}}:=\arg\max_{\theta\in\Theta}\sum_{i=1}^n\log f_\theta(X_i)$$ (the parameter that maximizes the “probability” of the data based on the model) is asymptotically normal under certain regularity conditions:

\begin{align*} \sqrt{n}(\hat\theta_{\mathrm{ML}}-\theta)\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,I(\theta)^{-1}\right), \end{align*}

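where $$I(\theta):=-\mathbb{E}_\theta\left[\frac{\partial^2\log f_\theta(X)}{\partial\theta^2}\right]$$, with $$X\sim f_\theta$$, is known as the Fisher information. Then, by the delta method (Theorem 1.7), for any $$g$$ differentiable at $$\theta$$ with $$g'(\theta)\neq0$$,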

\begin{align*} \sqrt{n}(g(\hat\theta_{\mathrm{ML}})-g(\theta))\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,(g'(\theta))^2I(\theta)^{-1}\right). \end{align*}

Note that, had we applied the continuous mapping theorem for $$g$$, we would have obtained a different result:

\begin{align*} g(\sqrt{n}(\hat\theta_{\mathrm{ML}}-\theta))\stackrel{d}{\longrightarrow}g\left(\mathcal{N}\left(0,I(\theta)^{-1}\right)\right). \end{align*}
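As a concrete check of the delta method result (a sketch under the assumption of an exponential model, which is not part of the original example), take $$f_\theta(x)=\theta e^{-\theta x}$$, $$x>0$$, $$\theta>0$$. Then $$\hat\theta_{\mathrm{ML}}=1/\bar X$$ and

\begin{align*} \log f_\theta(x)=\log\theta-\theta x,\qquad I(\theta)=-\mathbb{E}_\theta\left[-\frac{1}{\theta^2}\right]=\frac{1}{\theta^2}, \end{align*}

so $$\sqrt{n}(\hat\theta_{\mathrm{ML}}-\theta)\stackrel{d}{\longrightarrow}\mathcal{N}(0,\theta^2)$$. For $$g(\theta)=1/\theta$$, the mean of the model, $$g'(\theta)=-1/\theta^2$$ and $$g(\hat\theta_{\mathrm{ML}})=\bar X$$, hence

\begin{align*} \sqrt{n}\left(\bar X-\frac{1}{\theta}\right)\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,\frac{1}{\theta^4}\,\theta^2\right)=\mathcal{N}\left(0,\frac{1}{\theta^2}\right), \end{align*}

which agrees with Theorem 1.1, since $$\mathbb{V}\mathrm{ar}[X_i]=1/\theta^2$$ in this model.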

Exercise 1.10 Let’s dig further into the differences between the delta method and the continuous mapping theorem when applied to $$\sqrt{n}(X_n-\mu)\stackrel{d}{\longrightarrow}\mathcal{N}(0,\sigma^2)$$:

1. For what kind of maps $$g$$ are the results $$\sqrt{n}(g(X_n)-g(\mu))\stackrel{d}{\longrightarrow}\mathcal{N}(0,(g'(\mu))^2\sigma^2)$$ and $$g(\sqrt{n}(X_n-\mu))\stackrel{d}{\longrightarrow}g(\mathcal{N}(0,\sigma^2))$$ equivalent?
2. Take $$g(x)=e^x$$. What two results do you obtain with the delta method and the continuous mapping theorem when applied to $$\sqrt{n}\bar X \stackrel{d}{\longrightarrow}\mathcal{N}(0,\sigma^2)$$?

Theorem 1.8 (Weak and strong laws of large numbers) Let $$X_n$$ be an iid sequence with $$\mathbb{E}[X_i]=\mu$$, $$i\geq1$$. Then: $$\frac{1}{n}\sum_{i=1}^nX_i\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}\mu$$.
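The weak law can be visualized through Definition 1.2 by estimating $$\mathbb{P}[|\bar X-\mu|>\varepsilon]$$ for increasing $$n$$. The sketch below assumes $$\mathrm{Exp}(1)$$ data and $$\varepsilon=0.1$$. Both choices and the function name are illustrative. It illustrates only convergence in probability, since almost sure convergence concerns whole sample paths rather than a fixed $$n$$.

```python
import random

def exceedance_probability(n, eps=0.1, reps=500, seed=1):
    """Monte Carlo estimate of P[|Xbar_n - mu| > eps] for iid Exp(1) data."""
    rng = random.Random(seed)
    mu = 1.0  # mean of Exp(1)
    count = 0
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        if abs(xbar - mu) > eps:
            count += 1
    return count / reps

for n in (10, 100, 1000):
    print(f"n={n:4d}: estimated P[|Xbar - mu| > 0.1] = {exceedance_probability(n):.3f}")
```

The estimated probabilities should decrease towards zero as $$n$$ grows, as the weak law predicts.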

1. Intuitively: if convergence in probability is fast enough, then we have almost sure convergence.

2. “Uniformly bounded random variables.”