## 1.3 Basic stochastic convergence review

Let $$X_n$$ be a sequence of random variables defined on a common probability space $$(\Omega,\mathcal{A},\mathbb{P})$$. The four most common types of convergence of $$X_n$$ to a random variable $$X$$ in $$(\Omega,\mathcal{A},\mathbb{P})$$ are the following.

Definition 1.1 (Convergence in distribution) $$X_n$$ converges in distribution to $$X$$, written $$X_n\stackrel{d}{\longrightarrow}X$$, if $$\lim_{n\to\infty}F_n(x)=F(x)$$ for all $$x$$ for which $$F$$ is continuous, where $$X_n\sim F_n$$ and $$X\sim F$$.
Definition 1.2 (Convergence in probability) $$X_n$$ converges in probability to $$X$$, written $$X_n\stackrel{\mathbb{P}}{\longrightarrow}X$$, if $$\lim_{n\to\infty}\mathbb{P}[|X_n-X|>\varepsilon]=0$$, $$\forall \varepsilon>0$$.
Definition 1.3 (Convergence almost surely) $$X_n$$ converges almost surely (as) to $$X$$, written $$X_n\stackrel{\mathrm{as}}{\longrightarrow}X$$, if $$\mathbb{P}[\{\omega\in\Omega:\lim_{n\to\infty}X_n(\omega)=X(\omega)\}]=1$$.
Definition 1.4 (Convergence in $$r$$-mean) $$X_n$$ converges in $$r$$-mean to $$X$$, written $$X_n\stackrel{r}{\longrightarrow}X$$, if $$\lim_{n\to\infty}\mathbb{E}[|X_n-X|^r]=0$$.
Remark. The previous definitions extend to a sequence of $$p$$-dimensional random vectors $$\mathbf{X}_n$$. For Definitions 1.2 and 1.4, replace $$|\cdot|$$ with the Euclidean norm $$\|\cdot\|$$. Alternatively, Definition 1.2 can be extended marginally: $$\mathbf{X}_n\stackrel{\mathbb{P}}{\longrightarrow}\mathbf{X}:\iff X_{j,n}\stackrel{\mathbb{P}}{\longrightarrow}X_j$$, $$\forall j=1,\ldots,p$$. For Definition 1.1, replace $$F_n$$ and $$F$$ with the joint cdfs of $$\mathbf{X}_n$$ and $$\mathbf{X}$$, respectively. Definition 1.3 extends marginally as well.
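Convergence in probability can be checked empirically. The following NumPy sketch (an illustration, not part of the original notes) takes $$X_n = X + Z_n/n$$ with $$Z_n\sim\mathcal{N}(0,1)$$ and estimates $$\mathbb{P}[|X_n-X|>\varepsilon]$$ by Monte Carlo:

```python
import numpy as np

# Monte Carlo check of Definition 1.2 for X_n = X + Z_n / n, Z_n ~ N(0, 1):
# P[|X_n - X| > eps] = P[|Z_n| > n * eps], which vanishes as n grows.
rng = np.random.default_rng(42)
eps, M = 0.1, 100_000  # tolerance and number of Monte Carlo replicates
probs = []
for n in (1, 10, 100):
    x = rng.normal(size=M)             # realizations of X
    x_n = x + rng.normal(size=M) / n   # realizations of X_n
    probs.append(np.mean(np.abs(x_n - x) > eps))
# probs decreases towards zero, as Definition 1.2 requires
```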

Convergence in $$2$$-mean plays a remarkable role in defining a consistent estimator $$\hat\theta_n=T(X_1,\ldots,X_n)$$ of a parameter $$\theta$$. We say that the estimator is consistent if its mean squared error (MSE),

\begin{align*} \mathrm{MSE}[\hat\theta_n]:=&\,\mathbb{E}[(\hat\theta_n-\theta)^2]\\ =&\,(\mathbb{E}[\hat\theta_n]-\theta)^2+\mathbb{V}\mathrm{ar}[\hat\theta_n]\\ =&:\mathrm{Bias}[\hat\theta_n]^2+\mathbb{V}\mathrm{ar}[\hat\theta_n], \end{align*}

goes to zero as $$n\to\infty$$. Equivalently, if $$\hat\theta_n\stackrel{2}{\longrightarrow}\theta$$.
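The bias-variance decomposition above can be verified numerically. A small Monte Carlo sketch (illustrative; the biased variance estimator $$\frac{1}{n}\sum_{i=1}^n(X_i-\bar X)^2$$ is my choice of example, not from the notes):

```python
import numpy as np

# Check MSE = Bias^2 + Var for the biased variance estimator
# (1/n) * sum (X_i - bar X)^2 of sigma^2 = 1, with X_i ~ N(0, 1).
rng = np.random.default_rng(1)
n, M, sigma2 = 20, 200_000, 1.0
samples = rng.normal(size=(M, n))
est = samples.var(axis=1)  # ddof=0 divides by n: the estimator is biased
mse = np.mean((est - sigma2) ** 2)
decomposition = (est.mean() - sigma2) ** 2 + est.var()  # Bias^2 + Var
# mse and decomposition coincide (the identity is exact)
```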

If $$X_n\stackrel{d,\mathbb{P},r,\mathrm{as}}{\longrightarrow} X$$ and $$X$$ is a degenerate random variable such that $$\mathbb{P}[X=c]=1$$ for some $$c\in\mathbb{R}$$, then we write $$X_n\stackrel{d,\mathbb{P},r,\mathrm{as}}{\longrightarrow} c$$ (the list notation condenses four different convergence results into the same line).

The relations between the types of convergence are conveniently summarized in the next proposition.

Proposition 1.6 Let $$X_n$$ be a sequence of random variables and $$X$$ a random variable. Then the following implication diagram is satisfied:

\begin{align*} \begin{array}{rcl} X_n\stackrel{r}{\longrightarrow}X \quad\implies & X_n\stackrel{\mathbb{P}}{\longrightarrow}X & \impliedby\quad X_n\stackrel{\mathrm{as}}{\longrightarrow}X\\ & \Downarrow & \\ & X_n\stackrel{d}{\longrightarrow}X & \end{array} \end{align*}

None of the converses hold in general. However, there are some notable exceptions:

1. If $$X_n\stackrel{d}{\longrightarrow}c$$, then $$X_n\stackrel{\mathbb{P}}{\longrightarrow}c$$.
2. If $$\sum_{n=1}^\infty\mathbb{P}[|X_n-X|>\varepsilon]<\infty$$, $$\forall\varepsilon>0$$ (this implies $$X_n\stackrel{\mathbb{P}}{\longrightarrow}X$$), then $$X_n\stackrel{\mathrm{as}}{\longrightarrow}X$$.
3. If $$X_n\stackrel{\mathbb{P}}{\longrightarrow}X$$ and $$\mathbb{P}[|X_n|\leq M]=1$$ for all $$n\in\mathbb{N}$$ and some $$M>0$$, then $$X_n\stackrel{r}{\longrightarrow}X$$ for all $$r\geq1$$.
4. If $$S_n=\sum_{i=1}^nX_i$$ with $$X_1,\ldots,X_n$$ iid, then $$S_n\stackrel{\mathbb{P}}{\longrightarrow}S\iff S_n\stackrel{\mathrm{as}}{\longrightarrow}S$$.

Also, if $$s\geq r\geq 1$$, then $$X_n\stackrel{s}{\longrightarrow}X\implies X_n\stackrel{r}{\longrightarrow}X$$.
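This ordering of the $$r$$-means follows from Jensen's inequality. A quick numerical sketch (illustrative choice $$X_n=Z/\sqrt{n}$$, $$Z\sim\mathcal{N}(0,1)$$, converging to $$X=0$$):

```python
import numpy as np

# For X_n = Z / sqrt(n): the 2-mean E[X_n^2] = 1/n -> 0 and, by Jensen,
# E|X_n| <= E[X_n^2]^(1/2), so 1-mean convergence follows as well.
rng = np.random.default_rng(7)
z = rng.normal(size=100_000)
m1s, m2s = [], []
for n in (10, 100, 1000):
    x_n = z / np.sqrt(n)
    m1s.append(np.mean(np.abs(x_n)))  # approx sqrt(2 / pi) / sqrt(n)
    m2s.append(np.mean(x_n ** 2))     # approx 1 / n
```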

The cornerstone limit result in probability is the central limit theorem (CLT). In its simplest version, it states that:

Theorem 1.1 (CLT) Let $$X_n$$ be a sequence of iid random variables with $$\mathbb{E}[X_i]=\mu$$ and $$\mathbb{V}\mathrm{ar}[X_i]=\sigma^2<\infty$$, $$i\in\mathbb{N}$$, and let $$\bar{X}:=\frac{1}{n}\sum_{i=1}^nX_i$$. Then:

\begin{align*} \frac{\sqrt{n}(\bar{X}-\mu)}{\sigma}\stackrel{d}{\longrightarrow}\mathcal{N}(0,1). \end{align*}
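A Monte Carlo sketch of the theorem (illustrative choice: Exp(1) variables, for which $$\mu=\sigma=1$$):

```python
import numpy as np

# CLT check: standardized means of Exp(1) samples (mu = sigma = 1)
# should be approximately N(0, 1) for moderately large n.
rng = np.random.default_rng(0)
n, M = 200, 50_000
x = rng.exponential(scale=1.0, size=(M, n))
z = np.sqrt(n) * (x.mean(axis=1) - 1.0) / 1.0
# z has mean approx 0, sd approx 1, and P[z <= 1.96] approx 0.975
```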

We will later use the following CLT for random variables that are independent but not identically distributed, due to its easy-to-check moment conditions.

Theorem 1.2 (Lyapunov’s CLT) Let $$X_n$$ be a sequence of independent random variables with $$\mathbb{E}[X_i]=\mu_i$$ and $$\mathbb{V}\mathrm{ar}[X_i]=\sigma_i^2<\infty$$, $$i\in\mathbb{N}$$, and such that for some $$\delta>0$$

\begin{align*} \frac{1}{s_n^{2+\delta}}\sum_{i=1}^n\mathbb{E}\left[|X_i-\mu_i|^{2+\delta}\right]\longrightarrow0\text{ as }n\to\infty, \end{align*}

with $$s_n^2=\sum_{i=1}^n\sigma^2_i$$. Then:

\begin{align*} \frac{1}{s_n}\sum_{i=1}^n(X_i-\mu_i)\stackrel{d}{\longrightarrow}\mathcal{N}(0,1). \end{align*}
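The condition is easy to check in simple heterogeneous settings. For instance (an illustrative choice, not from the notes), for independent $$X_i\sim\mathcal{U}(-i,i)$$ one has $$\mu_i=0$$, $$\sigma_i^2=i^2/3$$, and the Lyapunov condition holds with $$\delta=1$$; a simulation confirms the normal limit:

```python
import numpy as np

# Lyapunov CLT with independent, non-identically distributed
# X_i ~ Uniform(-i, i): mu_i = 0, sigma_i^2 = i^2 / 3.
rng = np.random.default_rng(3)
n, M = 200, 50_000
i = np.arange(1, n + 1)
x = rng.uniform(-i, i, size=(M, n))   # each row holds X_1, ..., X_n
s_n = np.sqrt(np.sum(i ** 2 / 3.0))   # s_n^2 = sum of sigma_i^2
z = x.sum(axis=1) / s_n
# z is approximately N(0, 1)
```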

Finally, the next results will be useful ($$'$$ denotes transposition).

Theorem 1.3 (Cramér–Wold device) Let $$\mathbf{X}_n$$ be a sequence of $$p$$-dimensional random vectors. Then:

\begin{align*} \mathbf{X}_n\stackrel{d}{\longrightarrow}\mathbf{X}\iff \mathbf{c}'\mathbf{X}_n\stackrel{d}{\longrightarrow}\mathbf{c}'\mathbf{X},\quad \forall \mathbf{c}\in\mathbb{R}^p. \end{align*}

Theorem 1.4 (Continuous mapping theorem) If $$\mathbf{X}_n\stackrel{d,\,\mathbb{P},\,\mathrm{as}}{\longrightarrow}\mathbf{X}$$, then $$g(\mathbf{X}_n)\stackrel{d,\,\mathbb{P},\,\mathrm{as}}{\longrightarrow}g(\mathbf{X})$$ for any continuous function $$g$$.

Theorem 1.5 (Slutsky’s theorem) Let $$X_n$$ and $$Y_n$$ be sequences of random variables.

1. If $$X_n\stackrel{d}{\longrightarrow}X$$ and $$Y_n\stackrel{\mathbb{P}}{\longrightarrow} c$$, then $$X_nY_n\stackrel{d}{\longrightarrow}cX$$.
2. If $$X_n\stackrel{d}{\longrightarrow}X$$ and $$Y_n\stackrel{\mathbb{P}}{\longrightarrow} c$$, $$c\neq0$$, then $$\frac{X_n}{Y_n}\stackrel{d}{\longrightarrow}\frac{X}{c}$$.
3. If $$X_n\stackrel{d}{\longrightarrow}X$$ and $$Y_n\stackrel{\mathbb{P}}{\longrightarrow} c$$, then $$X_n+Y_n\stackrel{d}{\longrightarrow}X+c$$.
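A standard application of Slutsky's theorem is studentization: since $$S_n\stackrel{\mathbb{P}}{\longrightarrow}\sigma$$, replacing $$\sigma$$ by the sample standard deviation $$S_n$$ in the CLT does not change the limit. A NumPy sketch (illustrative, with Exp(1) data so $$\mu=\sigma=1$$):

```python
import numpy as np

# Slutsky: sqrt(n) (bar X - mu) / S_n ->d N(0, 1) because
# S_n / sigma ->P 1 (item 2 of the theorem).
rng = np.random.default_rng(5)
n, M = 1_000, 10_000
x = rng.exponential(size=(M, n))  # mu = sigma = 1
t = np.sqrt(n) * (x.mean(axis=1) - 1.0) / x.std(axis=1, ddof=1)
# t behaves approximately as a N(0, 1) variable
```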

Theorem 1.6 (Limit algebra for $$(\mathbb{P},\,r,\,\mathrm{as})$$-convergence) Let $$X_n$$ and $$Y_n$$ be sequences of random variables, and let $$a_n\to a$$ and $$b_n\to b$$ be two numeric sequences. Denote by $$X_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}X$$ that “$$X_n$$ converges to $$X$$ in probability (respectively, in $$r$$-mean; almost surely)”.

1. If $$X_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}X$$ and $$Y_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}Y$$, then $$a_nX_n+b_nY_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}aX+bY$$.
2. If $$X_n\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}X$$ and $$Y_n\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}Y$$, then $$X_nY_n\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}XY$$.
Remark. Note the absence of results for convergence in distribution: they do not hold in general. In particular, $$X_n\stackrel{d}{\longrightarrow}X$$ and $$Y_n\stackrel{d}{\longrightarrow}Y$$ do not imply $$X_n+Y_n\stackrel{d}{\longrightarrow}X+Y$$.
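The remark can be made concrete with a two-line counterexample (my construction): take $$X_n=Z$$ and $$Y_n=-Z$$ with $$Z\sim\mathcal{N}(0,1)$$; both converge in distribution to $$\mathcal{N}(0,1)$$, yet their sum is identically zero rather than $$\mathcal{N}(0,2)$$.

```python
import numpy as np

# X_n = Z and Y_n = -Z both have N(0, 1) distributions for every n,
# but X_n + Y_n = 0: the sum does not converge to N(0, 2).
rng = np.random.default_rng(11)
z = rng.normal(size=100_000)
x_n, y_n = z, -z  # same N(0, 1) marginal distributions
s = x_n + y_n     # identically zero
```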
Theorem 1.7 (Delta method) If $$\sqrt{n}(X_n-\mu)\stackrel{d}{\longrightarrow}\mathcal{N}(0,\sigma^2)$$, then $$\sqrt{n}(g(X_n)-g(\mu))\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,(g'(\mu))^2\sigma^2\right)$$ for any function $$g$$ that is differentiable at $$\mu$$ with $$g'(\mu)\neq0$$.
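A quick sketch of the theorem (illustrative choices $$g(x)=x^2$$ and $$X_i\sim\mathcal{N}(\mu,\sigma^2)$$ with $$\mu=2$$, $$\sigma=1$$), for which the limit standard deviation is $$|g'(\mu)|\sigma=2\mu=4$$:

```python
import numpy as np

# Delta method with g(x) = x^2: sqrt(n) (bar X^2 - mu^2) ->d
# N(0, (2 mu)^2 sigma^2), here with mu = 2 and sigma = 1 (sd = 4).
rng = np.random.default_rng(2)
n, M, mu, sigma = 500, 20_000, 2.0, 1.0
x = rng.normal(mu, sigma, size=(M, n))
z = np.sqrt(n) * (x.mean(axis=1) ** 2 - mu ** 2)
# z is approximately N(0, 16)
```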

Example 1.2 It is well known that, given a parametric density $$f_\theta$$ with parameter $$\theta\in\Theta$$, and iid $$X_1,\ldots,X_n\sim f_\theta$$, the maximum likelihood (ML) estimator $$\hat\theta_{\mathrm{ML}}:=\arg\max_{\theta\in\Theta}\sum_{i=1}^n\log f_\theta(X_i)$$ (the parameter value that maximizes the likelihood of the data under the model) is asymptotically normal under certain regularity conditions:

\begin{align*} \sqrt{n}(\hat\theta_{\mathrm{ML}}-\theta)\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,I(\theta)^{-1}\right), \end{align*}

where $$I(\theta):=-\mathbb{E}_\theta\left[\frac{\partial^2\log f_\theta(X)}{\partial\theta^2}\right]$$ is the Fisher information. Then, by the delta method, it is satisfied that:

\begin{align*} \sqrt{n}(g(\hat\theta_{\mathrm{ML}})-g(\theta))\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,(g'(\theta))^2I(\theta)^{-1}\right). \end{align*}

Had we instead applied the continuous mapping theorem with $$g$$, we would have obtained

\begin{align*} g(\sqrt{n}(\hat\theta_{\mathrm{ML}}-\theta))\stackrel{d}{\longrightarrow}g\left(\mathcal{N}\left(0,I(\theta)^{-1}\right)\right), \end{align*}

which is a statement about the transformed standardized variable, not about $$g(\hat\theta_{\mathrm{ML}})$$, and hence not what is needed for inference on $$g(\theta)$$.
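For a concrete instance (my choice of model, not from the notes): if $$f_\theta(x)=\theta e^{-\theta x}$$, $$x>0$$, then $$\hat\theta_{\mathrm{ML}}=1/\bar X$$ and $$I(\theta)=1/\theta^2$$, so $$\sqrt{n}(\hat\theta_{\mathrm{ML}}-\theta)\stackrel{d}{\longrightarrow}\mathcal{N}(0,\theta^2)$$:

```python
import numpy as np

# Asymptotic normality of the ML estimator for X_i ~ Exp(theta) with
# density theta * exp(-theta * x): theta_hat = 1 / bar X, I(theta) = 1 / theta^2.
rng = np.random.default_rng(8)
theta, n, M = 2.0, 500, 20_000
x = rng.exponential(scale=1 / theta, size=(M, n))
theta_hat = 1 / x.mean(axis=1)
z = np.sqrt(n) * (theta_hat - theta)
# z is approximately N(0, theta^2), i.e. sd approx 2
```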

Theorem 1.8 (Weak and strong laws of large numbers) Let $$X_n$$ be an iid sequence with $$\mathbb{E}[X_i]=\mu$$, $$i\geq1$$. Then: $$\frac{1}{n}\sum_{i=1}^nX_i\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}\mu$$.
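The law is easy to visualize through the running mean of a single sample path (a NumPy sketch with Exp(1) draws, so $$\mu=1$$):

```python
import numpy as np

# Law of large numbers: the running mean of iid Exp(1) draws
# (mu = 1) settles down to mu along the sample path.
rng = np.random.default_rng(4)
x = rng.exponential(size=100_000)
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
# running_mean gets, and stays, close to mu = 1
```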

1. Intuitively: if convergence in probability is fast enough, then almost sure convergence follows.