1.3 Basic stochastic convergence review

Let X_n be a sequence of random variables defined on a common probability space (\Omega,\mathcal{A},\mathbb{P}). The four most common types of convergence of X_n to a random variable in (\Omega,\mathcal{A},\mathbb{P}) are the following.

Definition 1.1 (Convergence in distribution) X_n converges in distribution to X, written X_n\stackrel{d}{\longrightarrow}X, if \lim_{n\to\infty}F_n(x)=F(x) for all x for which F is continuous, where X_n\sim F_n and X\sim F.

Definition 1.2 (Convergence in probability) X_n converges in probability to X, written X_n\stackrel{\mathbb{P}}{\longrightarrow}X, if \lim_{n\to\infty}\mathbb{P}[|X_n-X|>\varepsilon]=0, \forall \varepsilon>0.

Definition 1.3 (Convergence almost surely) X_n converges almost surely (as) to X, written X_n\stackrel{\mathrm{as}}{\longrightarrow}X, if \mathbb{P}[\{\omega\in\Omega:\lim_{n\to\infty}X_n(\omega)=X(\omega)\}]=1.

Definition 1.4 (Convergence in $r$-mean) X_n converges in r-mean to X, written X_n\stackrel{r}{\longrightarrow}X, if \lim_{n\to\infty}\mathbb{E}[|X_n-X|^r]=0.

Remark. The previous definitions can be extended to a sequence of p-dimensional random vectors \mathbf{X}_n. For Definitions 1.2 and 1.4, replace |\cdot| by the Euclidean norm \|\cdot\|. Alternatively, Definition 1.2 can be extended componentwise: \mathbf{X}_n\stackrel{\mathbb{P}}{\longrightarrow}\mathbf{X}:\iff X_{j,n}\stackrel{\mathbb{P}}{\longrightarrow}X_j, \forall j=1,\ldots,p. For Definition 1.1, replace F_n and F by the joint cdfs of \mathbf{X}_n and \mathbf{X}, respectively. Definition 1.3 extends componentwise as well.

The 2-mean convergence plays a key role in defining a consistent estimator \hat\theta_n=T(X_1,\ldots,X_n) of a parameter \theta. We say that the estimator is consistent if its mean squared error (MSE),

\begin{align*} \mathrm{MSE}[\hat\theta_n]:=&\,\mathbb{E}[(\hat\theta_n-\theta)^2]\\ =&\,(\mathbb{E}[\hat\theta_n]-\theta)^2+\mathbb{V}\mathrm{ar}[\hat\theta_n]\\ =&:\mathrm{Bias}[\hat\theta_n]^2+\mathbb{V}\mathrm{ar}[\hat\theta_n], \end{align*}

goes to zero as n\to\infty. Equivalently, if \hat\theta_n\stackrel{2}{\longrightarrow}\theta.
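
As a quick numerical illustration (a Python sketch added here, not part of the original text; the normal model and all constants are arbitrary choices), the Monte Carlo approximation below shows the MSE of the sample mean \bar{X}_n as an estimator of \mu shrinking to zero as n grows and matching the decomposition \mathrm{Bias}^2+\mathbb{V}\mathrm{ar}:

```python
# Monte Carlo check of MSE[theta_hat_n] = Bias^2 + Var for theta_hat_n = sample mean.
# Illustrative sketch only; the N(mu, sigma^2) data-generating model is an assumption.
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, M = 2.0, 1.5, 5000  # true mean, sd, number of Monte Carlo replicates

for n in (10, 100, 1000, 10000):
    # M replicated samples of size n, one sample mean per replicate
    theta_hat = rng.normal(mu, sigma, size=(M, n)).mean(axis=1)
    bias2 = (theta_hat.mean() - mu) ** 2
    var = theta_hat.var()
    mse = np.mean((theta_hat - mu) ** 2)
    print(f"n={n:6d}  MSE={mse:.5f}  Bias^2+Var={bias2 + var:.5f}  sigma^2/n={sigma**2/n:.5f}")
```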

If X_n\stackrel{d,\mathbb{P},r,\mathrm{as}}{\longrightarrow} X and X is a degenerate random variable such that \mathbb{P}[X=c]=1, c\in\mathbb{R}, then we write X_n\stackrel{d,\mathbb{P},r,\mathrm{as}}{\longrightarrow} c (the list notation is used to condense four different convergence results into a single line).

The relations between the types of convergence are conveniently summarized in the next proposition.

Proposition 1.6 Let X_n be a sequence of random variables and X a random variable. Then the following implication diagram is satisfied:

\begin{align*} \begin{array}{rcl} X_n\stackrel{r}{\longrightarrow}X \quad\implies & X_n\stackrel{\mathbb{P}}{\longrightarrow}X & \impliedby\quad X_n\stackrel{\mathrm{as}}{\longrightarrow}X\\ & \Downarrow & \\ & X_n\stackrel{d}{\longrightarrow}X & \end{array} \end{align*}

None of the converses hold in general. However, there are some notable exceptions:

  1. If X_n\stackrel{d}{\longrightarrow}c, then X_n\stackrel{\mathbb{P}}{\longrightarrow}c.
  2. If \sum_{n=1}^\infty\mathbb{P}[|X_n-X|>\varepsilon]<\infty, \forall\varepsilon>0 (implies3 X_n\stackrel{\mathbb{P}}{\longrightarrow}X), then X_n\stackrel{\mathrm{as}}{\longrightarrow}X.
  3. If X_n\stackrel{\mathbb{P}}{\longrightarrow}X and there exists M>0 such that \mathbb{P}[|X_n|\leq M]=1, \forall n\in\mathbb{N}, then X_n\stackrel{r}{\longrightarrow}X for r\geq1.
  4. If S_n=\sum_{i=1}^nX_i with X_1,\ldots,X_n iid, then S_n\stackrel{\mathbb{P}}{\longrightarrow}S\iff S_n\stackrel{\mathrm{as}}{\longrightarrow}S.

Also, if s\geq r\geq 1, then X_n\stackrel{s}{\longrightarrow}X\implies X_n\stackrel{r}{\longrightarrow}X.
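
For instance, a classical counterexample (added here for concreteness) shows that convergence in probability does not imply almost sure convergence: take X_n\sim\mathrm{Ber}(1/n) independent. Then \mathbb{P}[|X_n-0|>\varepsilon]\leq1/n\to0 for every \varepsilon>0, so X_n\stackrel{\mathbb{P}}{\longrightarrow}0 (and also X_n\stackrel{r}{\longrightarrow}0, since \mathbb{E}[|X_n|^r]=1/n\to0). However, \sum_{n=1}^\infty\mathbb{P}[X_n=1]=\sum_{n=1}^\infty1/n=\infty and, by the second Borel–Cantelli lemma, X_n=1 infinitely often with probability one, so X_n does not converge almost surely to 0.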

The cornerstone limit result in probability is the central limit theorem (CLT). In its simplest version, it states the following:

Theorem 1.1 (CLT) Let X_n be a sequence of iid random variables with \mathbb{E}[X_i]=\mu and \mathbb{V}\mathrm{ar}[X_i]=\sigma^2<\infty, i\in\mathbb{N}. Then:

\begin{align*} \frac{\sqrt{n}(\bar{X}-\mu)}{\sigma}\stackrel{d}{\longrightarrow}\mathcal{N}(0,1), \end{align*}

where \bar{X}:=\frac{1}{n}\sum_{i=1}^nX_i is the sample mean.
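
To see the CLT at work numerically, the following Python sketch (illustrative only; the Exponential(1) model, for which \mu=\sigma=1, is an arbitrary choice) compares empirical quantiles of the standardized sample mean with those of \mathcal{N}(0,1):

```python
# Monte Carlo illustration of the CLT: sqrt(n) * (x_bar - mu) / sigma is approximately
# N(0, 1) for large n, here with Exponential(1) data (mu = sigma = 1). Illustrative sketch.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, M = 1.0, 1.0, 10000  # Exp(1): mean 1, sd 1; M Monte Carlo replicates

for n in (5, 50, 500):
    x_bar = rng.exponential(scale=1.0, size=(M, n)).mean(axis=1)
    z = np.sqrt(n) * (x_bar - mu) / sigma
    # Compare empirical quantiles of z with the N(0, 1) quantiles (-1.645, 0, 1.645)
    q = np.quantile(z, [0.05, 0.50, 0.95])
    print(f"n={n:4d}  5%/50%/95% quantiles of z: {q.round(3)}  (N(0,1): [-1.645, 0, 1.645])")
```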

Later we will use the following CLT for random variables that are independent but not identically distributed, due to its easy-to-check moment conditions.

Theorem 1.2 (Lyapunov's CLT) Let X_n be a sequence of independent random variables with \mathbb{E}[X_i]=\mu_i and \mathbb{V}\mathrm{ar}[X_i]=\sigma_i^2<\infty, i\in\mathbb{N}, and such that for some \delta>0

\begin{align*} \frac{1}{s_n^{2+\delta}}\sum_{i=1}^n\mathbb{E}\left[|X_i-\mu_i|^{2+\delta}\right]\longrightarrow0\text{ as }n\to\infty, \end{align*}

with s_n^2=\sum_{i=1}^n\sigma^2_i. Then:

\begin{align*} \frac{1}{s_n}\sum_{i=1}^n(X_i-\mu_i)\stackrel{d}{\longrightarrow}\mathcal{N}(0,1). \end{align*}
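
The next Python sketch (illustrative, not from the original text) simulates independent but not identically distributed X_i\sim\mathrm{Unif}(0,i), for which \mu_i=i/2, \sigma_i^2=i^2/12 and the Lyapunov condition holds with \delta=1, and checks that the standardized sum looks standard normal:

```python
# Monte Carlo illustration of Lyapunov's CLT with independent, non-identically
# distributed X_i ~ Uniform(0, i) (an arbitrary choice): mu_i = i/2, sigma_i^2 = i^2/12.
# The standardized sum (1/s_n) * sum_i (X_i - mu_i) should be approximately N(0, 1).
import numpy as np

rng = np.random.default_rng(7)
n, M = 200, 10000
i = np.arange(1, n + 1)
mu_i = i / 2
s_n = np.sqrt(np.sum(i**2 / 12))

x = rng.uniform(low=0.0, high=i, size=(M, n))  # row m: one realization of X_1, ..., X_n
z = np.sum(x - mu_i, axis=1) / s_n             # standardized sums

print("mean ~ 0:", z.mean().round(3), " var ~ 1:", z.var().round(3))
print("P[z <= 1.96] ~ 0.975:", np.mean(z <= 1.96).round(3))
```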

Finally, the following results will be useful (' denotes transposition).

Theorem 1.3 (Cramér--Wold device) Let \mathbf{X}_n be a sequence of p-dimensional random vectors. Then:

\begin{align*} \mathbf{X}_n\stackrel{d}{\longrightarrow}\mathbf{X}\iff \mathbf{c}'\mathbf{X}_n\stackrel{d}{\longrightarrow}\mathbf{c}'\mathbf{X},\quad \forall \mathbf{c}\in\mathbb{R}^p. \end{align*}

Theorem 1.4 (Continuous mapping theorem) If \mathbf{X}_n\stackrel{d,\,\mathbb{P},\,\mathrm{as}}{\longrightarrow}\mathbf{X}, then g(\mathbf{X}_n)\stackrel{d,\,\mathbb{P},\,\mathrm{as}}{\longrightarrow}g(\mathbf{X}) for any continuous function g.

Theorem 1.5 (Slutsky's theorem) Let X_n and Y_n be sequences of random variables.

  1. If X_n\stackrel{d}{\longrightarrow}X and Y_n\stackrel{\mathbb{P}}{\longrightarrow} c, then X_nY_n\stackrel{d}{\longrightarrow}cX.
  2. If X_n\stackrel{d}{\longrightarrow}X and Y_n\stackrel{\mathbb{P}}{\longrightarrow} c, c\neq0, then \frac{X_n}{Y_n}\stackrel{d}{\longrightarrow}\frac{X}{c}.
  3. If X_n\stackrel{d}{\longrightarrow}X and Y_n\stackrel{\mathbb{P}}{\longrightarrow} c, then X_n+Y_n\stackrel{d}{\longrightarrow}X+c.
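
A typical application of Slutsky's theorem is the studentized mean: by the CLT, \sqrt{n}(\bar{X}-\mu)/\sigma\stackrel{d}{\longrightarrow}\mathcal{N}(0,1), and since the sample standard deviation S_n\stackrel{\mathbb{P}}{\longrightarrow}\sigma, part 2 gives \sqrt{n}(\bar{X}-\mu)/S_n\stackrel{d}{\longrightarrow}\mathcal{N}(0,1). The Python sketch below (illustrative; the Gamma model is an arbitrary choice) checks this numerically:

```python
# Slutsky in action: replacing the unknown sigma by the sample sd S_n does not change
# the N(0, 1) limit of the studentized mean. Illustrative sketch with Gamma(2, 1) data.
import numpy as np

rng = np.random.default_rng(3)
shape, M, n = 2.0, 10000, 500
mu, sigma = shape, np.sqrt(shape)   # Gamma(shape, scale=1): mean = var = shape

x = rng.gamma(shape, scale=1.0, size=(M, n))
x_bar = x.mean(axis=1)
s_n = x.std(axis=1, ddof=1)         # consistent estimator of sigma

t = np.sqrt(n) * (x_bar - mu) / s_n # sigma replaced by S_n (Slutsky, part 2)
print("mean ~ 0:", t.mean().round(3), " var ~ 1:", t.var().round(3))
print("P[|t| <= 1.96] ~ 0.95:", np.mean(np.abs(t) <= 1.96).round(3))
```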

Theorem 1.6 (Limit algebra for $(\mathbb{P},\,r,\,\mathrm{as})$-convergence) Let X_n and Y_n be sequences of random variables and a_n\to a and b_n\to b two sequences of real numbers. Here X_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}X denotes that “X_n converges to X in probability (respectively, in r-mean; respectively, almost surely).”

  1. If X_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}X and Y_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}Y, then a_nX_n+b_nY_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}aX+bY.
  2. If X_n\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}X and Y_n\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}Y, then X_nY_n\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}XY.

Remark. Note the absence of analogous results for convergence in distribution: in general, they do not hold. In particular, X_n\stackrel{d}{\longrightarrow}X and Y_n\stackrel{d}{\longrightarrow}Y do not imply X_n+Y_n\stackrel{d}{\longrightarrow}X+Y.

Theorem 1.7 (Delta method) If \sqrt{n}(X_n-\mu)\stackrel{d}{\longrightarrow}\mathcal{N}(0,\sigma^2), then \sqrt{n}(g(X_n)-g(\mu))\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,(g'(\mu))^2\sigma^2\right) for any function g that is differentiable at \mu with g'(\mu)\neq0.
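
A quick numerical check of the delta method (an illustrative Python sketch, not part of the original text): with X_i\sim\mathrm{Exp}(1), so that \sqrt{n}(\bar{X}-1)\stackrel{d}{\longrightarrow}\mathcal{N}(0,1), and g(x)=\log x with g'(1)=1, the delta method predicts \sqrt{n}\log\bar{X}\stackrel{d}{\longrightarrow}\mathcal{N}(0,1):

```python
# Delta method check: with X_i ~ Exp(1) (mu = sigma = 1) and g(x) = log(x),
# sqrt(n) * (g(x_bar) - g(mu)) should be approximately N(0, (g'(mu))^2 * sigma^2) = N(0, 1).
import numpy as np

rng = np.random.default_rng(11)
M, n = 10000, 1000
x_bar = rng.exponential(scale=1.0, size=(M, n)).mean(axis=1)

z = np.sqrt(n) * (np.log(x_bar) - np.log(1.0))  # g(x) = log(x), g(mu) = 0
print("mean ~ 0:", z.mean().round(3), " var ~ (g'(1))^2 * sigma^2 = 1:", z.var().round(3))
```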

Example 1.2 It is well known that, given a parametric density f_\theta with parameter \theta\in\Theta and iid X_1,\ldots,X_n\sim f_\theta, the maximum likelihood (ML) estimator \hat\theta_{\mathrm{ML}}:=\arg\max_{\theta\in\Theta}\sum_{i=1}^n\log f_\theta(X_i) (the parameter value that maximizes the likelihood of the data under the model) is asymptotically normal under certain regularity conditions:

\begin{align*} \sqrt{n}(\hat\theta_{\mathrm{ML}}-\theta)\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,I(\theta)^{-1}\right), \end{align*}

where I(\theta):=-\mathbb{E}_\theta\left[\frac{\partial^2\log f_\theta(X)}{\partial\theta^2}\right] is the Fisher information. Then, by the delta method, for any function g differentiable at \theta with g'(\theta)\neq0, it is satisfied that:

\begin{align*} \sqrt{n}(g(\hat\theta_{\mathrm{ML}})-g(\theta))\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,(g'(\theta))^2I(\theta)^{-1}\right). \end{align*}

If we had applied the continuous mapping theorem to g instead, we would have obtained that

\begin{align*} g(\sqrt{n}(\hat\theta_{\mathrm{ML}}-\theta))\stackrel{d}{\longrightarrow}g\left(\mathcal{N}\left(0,I(\theta)^{-1}\right)\right), \end{align*}

which is a different statement: the limit is the transformation of a normal random variable and, unless g is linear, it is not normal. The delta method, by contrast, linearizes g about \theta and hence delivers a normal limit.
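
A concrete instance of Example 1.2 (checked below with an illustrative Python sketch, not from the original text): for X_i\sim\mathrm{Exp}(\lambda) with density f_\lambda(x)=\lambda e^{-\lambda x}, the ML estimator is \hat\lambda_{\mathrm{ML}}=1/\bar{X} and I(\lambda)=1/\lambda^2, so \sqrt{n}(\hat\lambda_{\mathrm{ML}}-\lambda)\stackrel{d}{\longrightarrow}\mathcal{N}(0,\lambda^2):

```python
# Asymptotic normality of the ML estimator for X_i ~ Exp(lambda):
# lambda_hat_ML = 1 / x_bar and sqrt(n) * (lambda_hat_ML - lambda) ~ N(0, lambda^2).
import numpy as np

rng = np.random.default_rng(5)
lam, M, n = 2.0, 10000, 1000

x_bar = rng.exponential(scale=1.0 / lam, size=(M, n)).mean(axis=1)  # numpy scale = 1/lambda
lam_hat = 1.0 / x_bar                                               # ML estimator of lambda

z = np.sqrt(n) * (lam_hat - lam)
print("mean ~ 0:", z.mean().round(3), " var ~ lambda^2 = 4:", z.var().round(3))
```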

Theorem 1.8 (Weak and strong laws of large numbers) Let X_n be an iid sequence of random variables with \mathbb{E}[X_i]=\mu, i\geq1. Then: \frac{1}{n}\sum_{i=1}^nX_i\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}\mu.
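
A minimal Python sketch of the law of large numbers (illustrative, not part of the original text): along a single realization, the running mean of iid Bernoulli(0.3) draws settles around \mu=0.3:

```python
# Law of large numbers: the running sample mean of iid Bernoulli(0.3) variables
# approaches mu = 0.3 along a single realization (a.s. convergence); illustrative sketch.
import numpy as np

rng = np.random.default_rng(13)
p, n = 0.3, 100000
x = rng.binomial(1, p, size=n)
running_mean = np.cumsum(x) / np.arange(1, n + 1)

for k in (10, 100, 1000, 10000, 100000):
    print(f"n={k:6d}  running mean = {running_mean[k - 1]:.4f}  (mu = {p})")
```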


  1. Intuitively: if convergence in probability is fast enough, then we have almost sure convergence.↩︎