1.3 Basic stochastic convergence review

Let $X_n$ be a sequence of random variables defined on a common probability space $(\Omega,\mathcal{A},\mathbb{P}).$ The four most common types of convergence of $X_n$ to a random variable $X$ defined on $(\Omega,\mathcal{A},\mathbb{P})$ are the following.

Definition 1.1 (Convergence in distribution) $X_n$ converges in distribution to $X,$ written $X_n\stackrel{d}{\longrightarrow}X,$ if $\lim_{n\to\infty}F_n(x)=F(x)$ for all $x$ for which $F$ is continuous, where $X_n\sim F_n$ and $X\sim F.$

Definition 1.2 (Convergence in probability) $X_n$ converges in probability to $X,$ written $X_n\stackrel{\mathbb{P}}{\longrightarrow}X,$ if $\lim_{n\to\infty}\mathbb{P}[|X_n-X|>\varepsilon]=0,$ $\forall \varepsilon>0.$

Definition 1.3 (Convergence almost surely) $X_n$ converges almost surely (as) to $X,$ written $X_n\stackrel{\mathrm{as}}{\longrightarrow}X,$ if $\mathbb{P}[\{\omega\in\Omega:\lim_{n\to\infty}X_n(\omega)=X(\omega)\}]=1.$

Definition 1.4 (Convergence in $r$-mean) $X_n$ converges in $r$-mean to $X,$ written $X_n\stackrel{r}{\longrightarrow}X,$ if $\lim_{n\to\infty}\mathbb{E}[|X_n-X|^r]=0.$

Remark. The previous definitions can be extended to a sequence of $p$-dimensional random vectors $\mathbf{X}_n.$ For Definitions 1.2 and 1.4, replace $|\cdot|$ by the Euclidean norm $\|\cdot\|.$ Alternatively, Definition 1.2 can be extended marginally: $\mathbf{X}_n\stackrel{\mathbb{P}}{\longrightarrow}\mathbf{X}:\iff X_{j,n}\stackrel{\mathbb{P}}{\longrightarrow}X_j,$ $\forall j=1,\ldots,p.$ For Definition 1.1, replace $F_n$ and $F$ by the joint cdfs of $\mathbf{X}_n$ and $\mathbf{X},$ respectively. Definition 1.3 extends marginally as well.
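As a hedged numerical illustration of Definitions 1.2 and 1.4, consider the hypothetical sequence $X_n:=X+Z/\sqrt{n}$ with $Z\sim\mathcal{N}(0,1)$ (an example chosen here for convenience, not taken from the text). In that case $\mathbb{P}[|X_n-X|>\varepsilon]=2(1-\Phi(\varepsilon\sqrt{n}))$ and $\mathbb{E}[|X_n-X|^2]=1/n$ are available in closed form, and the Python sketch below simply tabulates them for increasing $n.$

```python
from statistics import NormalDist

def tail_prob(n, eps):
    """P[|X_n - X| > eps] for the assumed example X_n = X + Z / sqrt(n), Z ~ N(0, 1)."""
    return 2.0 * (1.0 - NormalDist().cdf(eps * n ** 0.5))

def second_moment(n):
    """E[|X_n - X|^2] = Var[Z] / n = 1 / n for the same assumed example."""
    return 1.0 / n

eps = 0.1
for n in (10, 100, 1000, 10000):
    print(f"n={n:6d}  P[|X_n-X|>{eps}]={tail_prob(n, eps):.4f}  "
          f"E|X_n-X|^2={second_moment(n):.4f}")
```

Both columns shrink towards zero, which is what convergence in probability and in $2$-mean require for this particular sequence.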

The $2$-mean convergence plays a key role in defining a consistent estimator $\hat\theta_n=T(X_1,\ldots,X_n)$ of a parameter $\theta.$ We say that the estimator is consistent if its mean squared error (MSE),

$\begin{align*} \mathrm{MSE}[\hat\theta_n]:=&\,\mathbb{E}[(\hat\theta_n-\theta)^2]\\ =&\,(\mathbb{E}[\hat\theta_n]-\theta)^2+\mathbb{V}\mathrm{ar}[\hat\theta_n]\\ =&:\mathrm{Bias}[\hat\theta_n]^2+\mathbb{V}\mathrm{ar}[\hat\theta_n], \end{align*}$

goes to zero as $n\to\infty.$ Equivalently, if $\hat\theta_n\stackrel{2}{\longrightarrow}\theta.$
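The bias-variance decomposition of the MSE can be checked by simulation. The sketch below is only an illustration under assumed choices: data from $\mathcal{N}(0,1),$ the divisor-$n$ variance estimator as $\hat\theta_n$ (so $\theta=1$ and the estimator is biased), and helper names (`var_n`, `mse_decomposition`) that are hypothetical.

```python
import random

def var_n(sample):
    """Variance estimator with divisor n (biased for the true variance)."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / len(sample)

def mse_decomposition(n, reps=2000, seed=1):
    """Monte Carlo MSE, squared bias, and variance of var_n for N(0, 1) samples of size n."""
    rng = random.Random(seed)
    theta = 1.0  # true variance of N(0, 1)
    est = [var_n([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(reps)]
    mean_est = sum(est) / reps
    mse = sum((t - theta) ** 2 for t in est) / reps
    bias2 = (mean_est - theta) ** 2
    var = sum((t - mean_est) ** 2 for t in est) / reps
    return mse, bias2, var

results = {n: mse_decomposition(n) for n in (10, 40, 160)}
for n, (mse, bias2, var) in results.items():
    print(f"n={n:4d}  MSE={mse:.4f}  Bias^2={bias2:.5f}  Var={var:.4f}")
```

In each row the simulated MSE equals the sum of the squared bias and the variance (up to floating-point error), and all three quantities decrease as $n$ grows, in line with $\hat\theta_n\stackrel{2}{\longrightarrow}\theta$ for this assumed estimator.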

If $X_n\stackrel{d,\mathbb{P},r,\mathrm{as}}{\longrightarrow} X$ and $X$ is a degenerate random variable such that $\mathbb{P}[X=c]=1,$ $c\in\mathbb{R},$ then we write $X_n\stackrel{d,\mathbb{P},r,\mathrm{as}}{\longrightarrow} c$ (the list-notation is used to condense four different convergence results into the same line).

The relations between the types of convergence are conveniently summarized in the next proposition.

Proposition 1.6 Let $X_n$ be a sequence of random variables and $X$ a random variable. Then the following implication diagram holds:

$\begin{align*} \begin{array}{rcl} X_n\stackrel{r}{\longrightarrow}X \quad\implies & X_n\stackrel{\mathbb{P}}{\longrightarrow}X & \impliedby\quad X_n\stackrel{\mathrm{as}}{\longrightarrow}X\\ & \Downarrow & \\ & X_n\stackrel{d}{\longrightarrow}X & \end{array} \end{align*}$

None of the converses hold in general. However, there are some notable exceptions:

If $X_n\stackrel{d}{\longrightarrow}c,$ then $X_n\stackrel{\mathbb{P}}{\longrightarrow}c.$
If, $\forall\varepsilon>0,$ $\sum_{n=1}^\infty\mathbb{P}[|X_n-X|>\varepsilon]<\infty$ (which implies³ $X_n\stackrel{\mathbb{P}}{\longrightarrow}X$ ), then $X_n\stackrel{\mathrm{as}}{\longrightarrow}X.$
If $X_n\stackrel{\mathbb{P}}{\longrightarrow}X$ and there exists $M>0$ such that $\mathbb{P}[|X_n|\leq M]=1,$ $\forall n\in\mathbb{N},$ then $X_n\stackrel{r}{\longrightarrow}X$ for $r\geq1.$
If $S_n=\sum_{i=1}^nX_i$ with $X_1,\ldots,X_n$ iid, then $S_n\stackrel{\mathbb{P}}{\longrightarrow}S\iff S_n\stackrel{\mathrm{as}}{\longrightarrow}S.$

Also, if $s\geq r\geq 1,$ then $X_n\stackrel{s}{\longrightarrow}X\implies X_n\stackrel{r}{\longrightarrow}X.$
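To see why a converse can fail, one standard counterexample (added here as an illustration, not part of the proposition) takes $U\sim\mathcal{U}(0,1)$ and $X_n:=n\,1_{\{U\leq 1/n\}}.$ For any $\varepsilon>0$ and $n>\varepsilon,$

$\begin{align*} \mathbb{P}[|X_n-0|>\varepsilon]=\mathbb{P}[U\leq 1/n]=\frac{1}{n}\longrightarrow0,\qquad \mathbb{E}[|X_n-0|]=n\cdot\frac{1}{n}=1\not\longrightarrow0, \end{align*}$

so $X_n\stackrel{\mathbb{P}}{\longrightarrow}0$ but $X_n$ does not converge to $0$ in $1$-mean. This does not contradict the bounded case above, since no single $M$ bounds every $X_n.$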

The cornerstone limit result in probability is the central limit theorem (CLT). In its simplest version, it states that:

Theorem 1.1 (CLT) Let $X_n$ be a sequence of iid random variables with $\mathbb{E}[X_i]=\mu$ and $\mathbb{V}\mathrm{ar}[X_i]=\sigma^2<\infty,$ $i\in\mathbb{N}.$ Then, for $\bar{X}:=\frac{1}{n}\sum_{i=1}^nX_i$:

$\begin{align*} \frac{\sqrt{n}(\bar{X}-\mu)}{\sigma}\stackrel{d}{\longrightarrow}\mathcal{N}(0,1). \end{align*}$
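A small Monte Carlo experiment can make the CLT concrete. The following sketch assumes iid $\mathrm{Exp}(1)$ data (so $\mu=\sigma=1$), a sample size of $100,$ and $4000$ replicates; these choices and the function name `standardized_means` are illustrative assumptions. It compares the empirical cdf of $\sqrt{n}(\bar{X}-\mu)/\sigma$ with the $\mathcal{N}(0,1)$ cdf on a small grid.

```python
import math
import random
from statistics import NormalDist

def standardized_means(n, reps, rng, mu=1.0, sigma=1.0):
    """Replicates of sqrt(n) * (mean - mu) / sigma for n iid Exp(1) draws."""
    out = []
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        out.append(math.sqrt(n) * (xbar - mu) / sigma)
    return out

rng = random.Random(42)
z = standardized_means(n=100, reps=4000, rng=rng)

grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
gaps = []
for x in grid:
    ecdf = sum(v <= x for v in z) / len(z)   # empirical cdf at x
    gaps.append(abs(ecdf - NormalDist().cdf(x)))
    print(f"x={x:+.1f}  empirical={ecdf:.3f}  normal={NormalDist().cdf(x):.3f}")
max_gap = max(gaps)
```

The two columns should be close at every grid point; the remaining gap mixes Monte Carlo noise with the finite-$n$ skewness of the exponential distribution.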

Later, we will use the following CLT for random variables that are independent but not identically distributed, chosen for its easy-to-check moment conditions.

Theorem 1.2 (Lyapunov's CLT) Let $X_n$ be a sequence of independent random variables with $\mathbb{E}[X_i]=\mu_i$ and $\mathbb{V}\mathrm{ar}[X_i]=\sigma_i^2<\infty,$ $i\in\mathbb{N},$ and such that for some $\delta>0$

$\begin{align*} \frac{1}{s_n^{2+\delta}}\sum_{i=1}^n\mathbb{E}\left[|X_i-\mu_i|^{2+\delta}\right]\longrightarrow0\text{ as }n\to\infty, \end{align*}$

with $s_n^2=\sum_{i=1}^n\sigma^2_i.$ Then:

$\begin{align*} \frac{1}{s_n}\sum_{i=1}^n(X_i-\mu_i)\stackrel{d}{\longrightarrow}\mathcal{N}(0,1). \end{align*}$
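To illustrate how the Lyapunov condition is checked in practice, the sketch below evaluates the ratio for $\delta=2$ in an assumed example: independent $X_i\sim\mathcal{U}(-i,i),$ for which $\mu_i=0,$ $\sigma_i^2=i^2/3,$ and $\mathbb{E}[|X_i|^4]=i^4/5.$ The example and the name `lyapunov_ratio` are hypothetical choices made here.

```python
def lyapunov_ratio(n):
    """Lyapunov ratio with delta = 2 for independent X_i ~ Uniform(-i, i), i = 1..n."""
    s2 = sum(i ** 2 / 3 for i in range(1, n + 1))       # s_n^2, since Var[X_i] = i^2 / 3
    fourth = sum(i ** 4 / 5 for i in range(1, n + 1))   # sum of E|X_i - mu_i|^4 = i^4 / 5
    return fourth / s2 ** 2                             # s_n^(2 + delta) = (s_n^2)^2

for n in (1, 10, 100, 1000):
    print(f"n={n:5d}  ratio={lyapunov_ratio(n):.5f}")
```

The ratio decays roughly like $3.24/n,$ so the condition holds with $\delta=2$ and the theorem applies to this non-identically distributed sequence.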

Finally, the next results will be useful ( $'$ denotes transposition).

Theorem 1.3 (Cramér--Wold device) Let $\mathbf{X}_n$ be a sequence of $p$ -dimensional random vectors. Then:

$\begin{align*} \mathbf{X}_n\stackrel{d}{\longrightarrow}\mathbf{X}\iff \mathbf{c}'\mathbf{X}_n\stackrel{d}{\longrightarrow}\mathbf{c}'\mathbf{X},\quad \forall \mathbf{c}\in\mathbb{R}^p. \end{align*}$

Theorem 1.4 (Continuous mapping theorem) If $\mathbf{X}_n\stackrel{d,\,\mathbb{P},\,\mathrm{as}}{\longrightarrow}\mathbf{X},$ then $g(\mathbf{X}_n)\stackrel{d,\,\mathbb{P},\,\mathrm{as}}{\longrightarrow}g(\mathbf{X})$ for any continuous function $g.$

Theorem 1.5 (Slutsky's theorem) Let $X_n$ and $Y_n$ be sequences of random variables.

If $X_n\stackrel{d}{\longrightarrow}X$ and $Y_n\stackrel{\mathbb{P}}{\longrightarrow} c,$ then $X_nY_n\stackrel{d}{\longrightarrow}cX.$
If $X_n\stackrel{d}{\longrightarrow}X$ and $Y_n\stackrel{\mathbb{P}}{\longrightarrow} c,$ $c\neq0,$ then $\frac{X_n}{Y_n}\stackrel{d}{\longrightarrow}\frac{X}{c}.$
If $X_n\stackrel{d}{\longrightarrow}X$ and $Y_n\stackrel{\mathbb{P}}{\longrightarrow} c,$ then $X_n+Y_n\stackrel{d}{\longrightarrow}X+c.$
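A typical use of Slutsky's theorem is the studentized mean: $\sqrt{n}(\bar{X}-\mu)/\sigma\stackrel{d}{\longrightarrow}\mathcal{N}(0,1)$ by the CLT and $S_n/\sigma\stackrel{\mathbb{P}}{\longrightarrow}1,$ so their quotient $\sqrt{n}(\bar{X}-\mu)/S_n$ is also asymptotically $\mathcal{N}(0,1).$ The sketch below checks this by simulation under assumed settings ($\mathcal{U}(0,1)$ data, $n=200,$ $2000$ replicates); the function name is hypothetical.

```python
import math
import random

def studentized_mean(n, rng, mu=0.5):
    """sqrt(n) * (mean - mu) / S for n iid Uniform(0, 1) draws (true mean mu = 0.5)."""
    x = [rng.random() for _ in range(n)]
    xbar = sum(x) / n
    # Sample standard deviation S_n, which converges in probability to sigma.
    s = math.sqrt(sum((v - xbar) ** 2 for v in x) / (n - 1))
    return math.sqrt(n) * (xbar - mu) / s

rng = random.Random(7)
t = [studentized_mean(200, rng) for _ in range(2000)]
coverage = sum(abs(v) <= 1.96 for v in t) / len(t)
print(f"P[|T_n| <= 1.96] is approximately {coverage:.3f} (N(0,1) gives about 0.950)")
```

Replacing the unknown $\sigma$ by $S_n$ leaves the limit distribution unchanged, which is what justifies the usual asymptotic confidence intervals for $\mu.$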

Theorem 1.6 (Limit algebra for $(\mathbb{P},\,r,\,\mathrm{as})$-convergence) Let $X_n$ and $Y_n$ be sequences of random variables and $a_n\to a$ and $b_n\to b$ two sequences of real numbers. Denote by $X_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}X$ that “$X_n$ converges to $X$ in probability (respectively in $r$-mean, respectively almost surely).”

If $X_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}X$ and $Y_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}Y,$ then $a_nX_n+b_nY_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}aX+bY.$
If $X_n\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}X$ and $Y_n\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}Y,$ then $X_nY_n\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}XY.$

Remark. Note the absence of analogous results for convergence in distribution: they do not hold in general. In particular, $X_n\stackrel{d}{\longrightarrow}X$ and $Y_n\stackrel{d}{\longrightarrow}Y$ do not imply $X_n+Y_n\stackrel{d}{\longrightarrow}X+Y.$
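A simple counterexample (added here as an illustration) is the following. Take $Z\sim\mathcal{N}(0,1)$ and set $X_n:=Z$ and $Y_n:=-Z$ for all $n.$ Since $-Z\sim\mathcal{N}(0,1),$ both sequences converge in distribution to $Z.$ However,

$\begin{align*} X_n+Y_n=Z-Z=0\quad\text{for all }n,\qquad\text{whereas}\qquad Z+Z=2Z\sim\mathcal{N}(0,4), \end{align*}$

so the sum of the sequences does not converge in distribution to the sum of the limits. The failure comes from the dependence between $X_n$ and $Y_n,$ which marginal convergence in distribution does not control.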

Theorem 1.7 (Delta method) If $\sqrt{n}(X_n-\mu)\stackrel{d}{\longrightarrow}\mathcal{N}(0,\sigma^2),$ then $\sqrt{n}(g(X_n)-g(\mu))\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,(g'(\mu))^2\sigma^2\right)$ for any function $g$ that is differentiable at $\mu$ and such that $g'(\mu)\neq0.$
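The delta method can be checked numerically. The sketch below assumes $X_n=\bar{X}$ for iid $\mathrm{Exp}(1)$ data (so $\mu=1$ and $\sigma^2=1$ by the CLT) and $g(x)=x^2,$ for which the theorem predicts the asymptotic variance $(g'(1))^2\sigma^2=4.$ The data model, sample sizes, and function name are illustrative assumptions.

```python
import math
import random

def scaled_error(n, rng):
    """sqrt(n) * (g(mean) - g(mu)) for n iid Exp(1) draws, with g(x) = x^2 and mu = 1."""
    xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
    return math.sqrt(n) * (xbar ** 2 - 1.0)

rng = random.Random(123)
d = [scaled_error(200, rng) for _ in range(2000)]

m = sum(d) / len(d)
emp_var = sum((v - m) ** 2 for v in d) / (len(d) - 1)
print(f"empirical variance: {emp_var:.2f}  (delta method limit: 4)")
```

The empirical variance should be near $4,$ with a small upward deviation expected at finite $n$ because the delta method only keeps the first-order term of the expansion of $g.$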

Example 1.2 It is well known that, given a parametric density $f_\theta,$ with parameter $\theta\in\Theta,$ and iid $X_1,\ldots,X_n\sim f_\theta,$ the maximum likelihood (ML) estimator $\hat\theta_{\mathrm{ML}}:=\arg\max_{\theta\in\Theta}\sum_{i=1}^n\log f_\theta(X_i)$ (the parameter value that maximizes the likelihood of the data under the model) is asymptotically normal under certain regularity conditions:

$\begin{align*} \sqrt{n}(\hat\theta_{\mathrm{ML}}-\theta)\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,I(\theta)^{-1}\right), \end{align*}$

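where $I(\theta):=-\mathbb{E}_\theta\left[\frac{\partial^2\log f_\theta(X)}{\partial\theta^2}\right],$ with $X\sim f_\theta,$ is the Fisher information. Then, by the delta method, for any $g$ differentiable at $\theta$ with $g'(\theta)\neq0$ it is satisfied that: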

$\begin{align*} \sqrt{n}(g(\hat\theta_{\mathrm{ML}})-g(\theta))\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,(g'(\theta))^2I(\theta)^{-1}\right). \end{align*}$

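If we had instead applied the continuous mapping theorem for a continuous $g,$ we would have obtained that

$\begin{align*} g(\sqrt{n}(\hat\theta_{\mathrm{ML}}-\theta))\stackrel{d}{\longrightarrow}g\left(\mathcal{N}\left(0,I(\theta)^{-1}\right)\right), \end{align*}$

which is a different statement: it describes the limit of $g$ applied to the scaled estimation error, not the limit of the scaled error of $g(\hat\theta_{\mathrm{ML}}).$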

Theorem 1.8 (Weak and strong laws of large numbers) Let $X_n$ be an iid sequence with $\mathbb{E}[X_i]=\mu,$ $i\geq1.$ Then: $\frac{1}{n}\sum_{i=1}^nX_i\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}\mu.$
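The strong law can be visualized by following the running mean along a single simulated path. The sketch below assumes iid fair die rolls ($\mu=3.5$) and a fixed seed; both are illustrative choices made here.

```python
import random

rng = random.Random(2024)
checkpoints = (10, 100, 1000, 10000, 50000)

running_sum = 0
means = {}
for i in range(1, max(checkpoints) + 1):
    running_sum += rng.randint(1, 6)   # fair die roll, E[X_i] = 3.5
    if i in checkpoints:
        means[i] = running_sum / i     # sample mean of the first i rolls

for n, m in means.items():
    print(f"n={n:6d}  sample mean={m:.4f}")
```

Along this one path the sample mean settles around $3.5,$ which is the behavior that almost sure convergence describes; the weak law instead concerns the probability of a large deviation at each fixed $n.$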

³ Intuitively: if convergence in probability is fast enough, then we have almost sure convergence.