1.3 Basic stochastic convergence review

Let \(X_n\) be a sequence of random variables defined on a common probability space \((\Omega,\mathcal{A},\mathbb{P}).\) The four most common types of convergence of \(X_n\) to a random variable \(X\) in \((\Omega,\mathcal{A},\mathbb{P})\) are the following.

Definition 1.1 (Convergence in distribution) \(X_n\) converges in distribution to \(X,\) written \(X_n\stackrel{d}{\longrightarrow}X,\) if \(\lim_{n\to\infty}F_n(x)=F(x)\) for all \(x\) for which \(F\) is continuous, where \(X_n\sim F_n\) and \(X\sim F.\)

Definition 1.2 (Convergence in probability) \(X_n\) converges in probability to \(X,\) written \(X_n\stackrel{\mathbb{P}}{\longrightarrow}X,\) if \(\lim_{n\to\infty}\mathbb{P}[|X_n-X|>\varepsilon]=0,\) \(\forall \varepsilon>0.\)

Definition 1.3 (Convergence almost surely) \(X_n\) converges almost surely (as) to \(X,\) written \(X_n\stackrel{\mathrm{as}}{\longrightarrow}X,\) if \(\mathbb{P}[\{\omega\in\Omega:\lim_{n\to\infty}X_n(\omega)=X(\omega)\}]=1.\)

Definition 1.4 (Convergence in $r$-mean) \(X_n\) converges in \(r\)-mean to \(X,\) written \(X_n\stackrel{r}{\longrightarrow}X,\) if \(\lim_{n\to\infty}\mathbb{E}[|X_n-X|^r]=0.\)

Remark. The previous definitions extend to a sequence of \(p\)-dimensional random vectors \(\mathbf{X}_n.\) For Definitions 1.2 and 1.4, replace \(|\cdot|\) with the Euclidean norm \(\|\cdot\|.\) Alternatively, Definition 1.2 can be extended component-wise: \(\mathbf{X}_n\stackrel{\mathbb{P}}{\longrightarrow}\mathbf{X}:\iff X_{j,n}\stackrel{\mathbb{P}}{\longrightarrow}X_j,\) \(\forall j=1,\ldots,p.\) For Definition 1.1, replace \(F_n\) and \(F\) with the joint cdfs of \(\mathbf{X}_n\) and \(\mathbf{X},\) respectively. Definition 1.3 also extends component-wise.
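
To build intuition for Definition 1.2, the following minimal sketch in Python (the sequence \(X_n=X+Z/\sqrt{n}\), the tolerance \(\varepsilon,\) and the sample sizes are illustrative choices, not taken from the text) approximates \(\mathbb{P}[|X_n-X|>\varepsilon]\) by Monte Carlo and shows it shrinking as \(n\) grows.

```python
import numpy as np

# Monte Carlo check of Definition 1.2 for X_n = X + Z / sqrt(n),
# where X ~ N(0, 1) and Z ~ N(0, 1) are independent (illustrative choice)
rng = np.random.default_rng(42)
M = 100_000   # Monte Carlo replicates
eps = 0.1

for n in [10, 100, 1000, 10000]:
    X = rng.standard_normal(M)
    Z = rng.standard_normal(M)
    X_n = X + Z / np.sqrt(n)
    prob = np.mean(np.abs(X_n - X) > eps)   # estimates P[|X_n - X| > eps]
    print(f"n = {n:>5}: P[|X_n - X| > {eps}] ~ {prob:.4f}")
```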

Convergence in \(2\)-mean plays an important role in defining a consistent estimator \(\hat\theta_n=T(X_1,\ldots,X_n)\) of a parameter \(\theta.\) We say that the estimator is consistent if its mean squared error (MSE),

\[\begin{align*} \mathrm{MSE}[\hat\theta_n]:=&\,\mathbb{E}[(\hat\theta_n-\theta)^2]\\ =&\,(\mathbb{E}[\hat\theta_n]-\theta)^2+\mathbb{V}\mathrm{ar}[\hat\theta_n]\\ =&:\mathrm{Bias}[\hat\theta_n]^2+\mathbb{V}\mathrm{ar}[\hat\theta_n], \end{align*}\]

goes to zero as \(n\to\infty.\) Equivalently, if \(\hat\theta_n\stackrel{2}{\longrightarrow}\theta.\)
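
As a quick numerical illustration (a minimal sketch in Python; the uniform model and the sample sizes are arbitrary choices), the code below approximates the MSE, squared bias, and variance of the biased variance estimator \(\hat\sigma_n^2=\frac{1}{n}\sum_{i=1}^n(X_i-\bar{X})^2\) for \(X_i\sim\mathcal{U}(0,1),\) verifying the decomposition and showing that the MSE vanishes as \(n\) grows.

```python
import numpy as np

# Monte Carlo check of MSE = Bias^2 + Var for the (biased) variance
# estimator sigma2_hat = (1 / n) * sum((X_i - X_bar)^2), X_i ~ U(0, 1)
rng = np.random.default_rng(1)
M = 50_000         # Monte Carlo replicates
sigma2 = 1 / 12    # true variance of U(0, 1)

for n in [10, 50, 250]:
    X = rng.uniform(size=(M, n))
    sigma2_hat = X.var(axis=1)               # ddof=0 -> divides by n (biased)
    bias2 = (sigma2_hat.mean() - sigma2) ** 2
    var = sigma2_hat.var()
    mse = np.mean((sigma2_hat - sigma2) ** 2)
    print(f"n = {n:>3}: MSE ~ {mse:.2e},  Bias^2 + Var ~ {bias2 + var:.2e}")
```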

If \(X_n\stackrel{d,\mathbb{P},r,\mathrm{as}}{\longrightarrow} X\) and \(X\) is a degenerate random variable such that \(\mathbb{P}[X=c]=1,\) \(c\in\mathbb{R},\) then we write \(X_n\stackrel{d,\mathbb{P},r,\mathrm{as}}{\longrightarrow} c\) (the list notation condenses four different convergence statements into the same line).

The relations between the types of convergence are conveniently summarized in the next proposition.

Proposition 1.6 Let \(X_n\) be a sequence of random variables and \(X\) a random variable. Then the following implication diagram is satisfied:

\[\begin{align*} \begin{array}{rcl} X_n\stackrel{r}{\longrightarrow}X \quad\implies & X_n\stackrel{\mathbb{P}}{\longrightarrow}X & \impliedby\quad X_n\stackrel{\mathrm{as}}{\longrightarrow}X\\ & \Downarrow & \\ & X_n\stackrel{d}{\longrightarrow}X & \end{array} \end{align*}\]

None of the converses hold in general. However, there are some notable exceptions:

  1. If \(X_n\stackrel{d}{\longrightarrow}c,\) then \(X_n\stackrel{\mathbb{P}}{\longrightarrow}c.\)
  2. If \(\sum_{n=1}^\infty\mathbb{P}[|X_n-X|>\varepsilon]<\infty,\) \(\forall\varepsilon>0\) (implies\(^{1}\) \(X_n\stackrel{\mathbb{P}}{\longrightarrow}X\)), then \(X_n\stackrel{\mathrm{as}}{\longrightarrow}X.\)
  3. If \(X_n\stackrel{\mathbb{P}}{\longrightarrow}X\) and \(\mathbb{P}[|X_n|\leq M]=1\) for all \(n\in\mathbb{N}\) and some \(M>0,\) then \(X_n\stackrel{r}{\longrightarrow}X\) for \(r\geq1.\)
  4. If \(S_n=\sum_{i=1}^nX_i\) with \(X_1,\ldots,X_n\) iid, then \(S_n\stackrel{\mathbb{P}}{\longrightarrow}S\iff S_n\stackrel{\mathrm{as}}{\longrightarrow}S.\)

Also, if \(s\geq r\geq 1,\) then \(X_n\stackrel{s}{\longrightarrow}X\implies X_n\stackrel{r}{\longrightarrow}X.\)
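
To see that convergence in distribution does not imply convergence in probability (hence the degeneracy requirement in the first exception), consider the classical counterexample \(X\sim\mathcal{N}(0,1)\) and \(X_n:=-X\) for all \(n.\) Since \(-X\sim\mathcal{N}(0,1),\) trivially \(X_n\stackrel{d}{\longrightarrow}X,\) yet

\[\begin{align*} \mathbb{P}[|X_n-X|>\varepsilon]=\mathbb{P}[2|X|>\varepsilon]>0,\quad\forall n\in\mathbb{N}, \end{align*}\]

so \(X_n\) does not converge in probability to \(X.\)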

The cornerstone limit result in probability is the central limit theorem (CLT). In its simplest version, it states that:

Theorem 1.1 (CLT) Let \(X_n\) be a sequence of iid random variables with \(\mathbb{E}[X_i]=\mu\) and \(\mathbb{V}\mathrm{ar}[X_i]=\sigma^2<\infty,\) \(i\in\mathbb{N},\) and let \(\bar{X}:=\frac{1}{n}\sum_{i=1}^nX_i.\) Then:

\[\begin{align*} \frac{\sqrt{n}(\bar{X}-\mu)}{\sigma}\stackrel{d}{\longrightarrow}\mathcal{N}(0,1). \end{align*}\]
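
The CLT can be visualized with a quick simulation (a minimal sketch in Python; the exponential model, the sample size, and the number of replicates are arbitrary choices): the standardized sample means of markedly non-normal \(\mathrm{Exp}(1)\) data (for which \(\mu=\sigma=1\)) are compared against \(\mathcal{N}(0,1).\)

```python
import numpy as np
from scipy import stats

# Standardized sample means of Exp(1) data (mu = sigma = 1) vs. N(0, 1)
rng = np.random.default_rng(7)
M, n = 10_000, 200                             # replicates and sample size
X = rng.exponential(scale=1.0, size=(M, n))
Z = np.sqrt(n) * (X.mean(axis=1) - 1.0) / 1.0  # sqrt(n) (X_bar - mu) / sigma

# Kolmogorov-Smirnov distance to the standard normal cdf
print(stats.kstest(Z, "norm"))                 # small statistic -> close to N(0, 1)
```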

We will later use the following CLT for random variables that are independent but not identically distributed, due to its easy-to-check moment conditions.

Theorem 1.2 (Lyapunov's CLT) Let \(X_n\) be a sequence of independent random variables with \(\mathbb{E}[X_i]=\mu_i\) and \(\mathbb{V}\mathrm{ar}[X_i]=\sigma_i^2<\infty,\) \(i\in\mathbb{N},\) and such that for some \(\delta>0\)

\[\begin{align*} \frac{1}{s_n^{2+\delta}}\sum_{i=1}^n\mathbb{E}\left[|X_i-\mu_i|^{2+\delta}\right]\longrightarrow0\text{ as }n\to\infty, \end{align*}\]

with \(s_n^2=\sum_{i=1}^n\sigma^2_i.\) Then:

\[\begin{align*} \frac{1}{s_n}\sum_{i=1}^n(X_i-\mu_i)\stackrel{d}{\longrightarrow}\mathcal{N}(0,1). \end{align*}\]
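
A small numerical check (a minimal sketch in Python; the choice \(X_i\sim\mathcal{U}(0,i),\) \(\delta=1,\) and the sample sizes are illustrative) evaluates the Lyapunov ratio and simulates the standardized sum, which is close to \(\mathcal{N}(0,1)\) despite the \(X_i\) not being identically distributed.

```python
import numpy as np
from scipy import stats

# Lyapunov condition with delta = 1 for independent X_i ~ U(0, i):
# mu_i = i / 2, sigma_i^2 = i^2 / 12, E|X_i - mu_i|^3 = (i / 2)^3 / 4
n = 500
i = np.arange(1, n + 1)
mu = i / 2
s2 = np.sum(i**2 / 12)                     # s_n^2
lyap = np.sum((i / 2)**3 / 4) / s2**1.5    # Lyapunov ratio (s_n^{2 + delta} = s_n^3)
print(f"Lyapunov ratio for n = {n}: {lyap:.4f}")   # decreases towards 0 with n

# Simulate the standardized sum and compare it with N(0, 1)
rng = np.random.default_rng(3)
M = 10_000
X = rng.uniform(low=0.0, high=i, size=(M, n))      # high broadcasts row-wise
Z = (X - mu).sum(axis=1) / np.sqrt(s2)
print(stats.kstest(Z, "norm"))                     # small KS statistic
```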

Finally, the following results will be useful (\('\) denotes transposition).

Theorem 1.3 (Cramér--Wold device) Let \(\mathbf{X}_n\) be a sequence of \(p\)-dimensional random vectors. Then:

\[\begin{align*} \mathbf{X}_n\stackrel{d}{\longrightarrow}\mathbf{X}\iff \mathbf{c}'\mathbf{X}_n\stackrel{d}{\longrightarrow}\mathbf{c}'\mathbf{X},\quad \forall \mathbf{c}\in\mathbb{R}^p. \end{align*}\]
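
For example, the device yields a multivariate CLT from the univariate one: if \(\mathbf{X}_1,\ldots,\mathbf{X}_n\) are iid \(p\)-vectors with mean \(\boldsymbol\mu\) and covariance matrix \(\boldsymbol\Sigma,\) then for any \(\mathbf{c}\in\mathbb{R}^p\) with \(\mathbf{c}'\boldsymbol\Sigma\mathbf{c}>0\) the variables \(\mathbf{c}'\mathbf{X}_i\) are iid with mean \(\mathbf{c}'\boldsymbol\mu\) and variance \(\mathbf{c}'\boldsymbol\Sigma\mathbf{c},\) so Theorem 1.1 gives

\[\begin{align*} \mathbf{c}'\sqrt{n}(\bar{\mathbf{X}}-\boldsymbol\mu)=\sqrt{n}\left(\overline{\mathbf{c}'\mathbf{X}}-\mathbf{c}'\boldsymbol\mu\right)\stackrel{d}{\longrightarrow}\mathcal{N}(0,\mathbf{c}'\boldsymbol\Sigma\mathbf{c}), \end{align*}\]

which is the distribution of \(\mathbf{c}'\mathbf{X}\) for \(\mathbf{X}\sim\mathcal{N}_p(\mathbf{0},\boldsymbol\Sigma)\) (the degenerate case \(\mathbf{c}'\boldsymbol\Sigma\mathbf{c}=0\) is immediate). Hence the device gives \(\sqrt{n}(\bar{\mathbf{X}}-\boldsymbol\mu)\stackrel{d}{\longrightarrow}\mathcal{N}_p(\mathbf{0},\boldsymbol\Sigma).\)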

Theorem 1.4 (Continuous mapping theorem) If \(\mathbf{X}_n\stackrel{d,\,\mathbb{P},\,\mathrm{as}}{\longrightarrow}\mathbf{X},\) then \(g(\mathbf{X}_n)\stackrel{d,\,\mathbb{P},\,\mathrm{as}}{\longrightarrow}g(\mathbf{X})\) for any continuous function \(g.\)
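
As an illustration of the continuous mapping theorem, applying \(g(x)=x^2\) to the CLT in Theorem 1.1 gives

\[\begin{align*} \frac{n(\bar{X}-\mu)^2}{\sigma^2}\stackrel{d}{\longrightarrow}\chi^2_1, \end{align*}\]

since the square of a \(\mathcal{N}(0,1)\) random variable is \(\chi^2_1\)-distributed.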

Theorem 1.5 (Slutsky's theorem) Let \(X_n\) and \(Y_n\) be sequences of random variables.

  1. If \(X_n\stackrel{d}{\longrightarrow}X\) and \(Y_n\stackrel{\mathbb{P}}{\longrightarrow} c,\) then \(X_nY_n\stackrel{d}{\longrightarrow}cX.\)
  2. If \(X_n\stackrel{d}{\longrightarrow}X\) and \(Y_n\stackrel{\mathbb{P}}{\longrightarrow} c,\) \(c\neq0,\) then \(\frac{X_n}{Y_n}\stackrel{d}{\longrightarrow}\frac{X}{c}.\)
  3. If \(X_n\stackrel{d}{\longrightarrow}X\) and \(Y_n\stackrel{\mathbb{P}}{\longrightarrow} c,\) then \(X_n+Y_n\stackrel{d}{\longrightarrow}X+c.\)
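
A standard application of Slutsky's theorem is replacing \(\sigma\) in the CLT with a consistent estimator: if \(S_n\stackrel{\mathbb{P}}{\longrightarrow}\sigma>0\) (e.g., \(S_n\) is the sample standard deviation), then

\[\begin{align*} \frac{\sqrt{n}(\bar{X}-\mu)}{S_n}=\frac{\sqrt{n}(\bar{X}-\mu)}{\sigma}\cdot\frac{\sigma}{S_n}\stackrel{d}{\longrightarrow}\mathcal{N}(0,1), \end{align*}\]

by Theorem 1.1, the continuous mapping theorem (which gives \(\sigma/S_n\stackrel{\mathbb{P}}{\longrightarrow}1\)), and the first part of Slutsky's theorem.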

Theorem 1.6 (Limit algebra for $(\mathbb{P},\,r,\,\mathrm{as})$-convergence) Let \(X_n\) and \(Y_n\) be sequences of random variables and let \(a_n\to a\) and \(b_n\to b\) be two deterministic sequences. Let \(X_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}X\) denote that “\(X_n\) converges to \(X\) in probability (respectively, in \(r\)-mean; respectively, almost surely).”

  1. If \(X_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}X\) and \(Y_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}Y,\) then \(a_nX_n+b_nY_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}aX+bY.\)
  2. If \(X_n\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}X\) and \(Y_n\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}Y,\) then \(X_nY_n\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}XY.\)

Remark. Note the absence of analogous results for convergence in distribution: they do not hold in general. In particular, \(X_n\stackrel{d}{\longrightarrow}X\) and \(Y_n\stackrel{d}{\longrightarrow}Y\) do not imply \(X_n+Y_n\stackrel{d}{\longrightarrow}X+Y.\)
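
For instance (a classical counterexample), take \(X\sim\mathcal{N}(0,1),\) \(X_n:=X,\) \(Y_n:=-X,\) and let \(Y\) be a copy of \(X\) independent of \(X.\) Since \(-X\sim\mathcal{N}(0,1),\) both \(X_n\stackrel{d}{\longrightarrow}X\) and \(Y_n\stackrel{d}{\longrightarrow}Y,\) yet

\[\begin{align*} X_n+Y_n=0,\ \forall n\in\mathbb{N},\quad\text{whereas}\quad X+Y\sim\mathcal{N}(0,2). \end{align*}\]

The issue is that convergence in distribution only constrains the marginal laws of \(X_n\) and \(Y_n,\) not their joint behavior.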

Theorem 1.7 (Delta method) If \(\sqrt{n}(X_n-\mu)\stackrel{d}{\longrightarrow}\mathcal{N}(0,\sigma^2),\) then \(\sqrt{n}(g(X_n)-g(\mu))\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,(g'(\mu))^2\sigma^2\right)\) for any function \(g\) that is differentiable at \(\mu\) with \(g'(\mu)\neq0.\)
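
For example, applying the delta method with \(g(x)=x^2\) (and \(\mu\neq0\)) to the CLT for the sample mean yields

\[\begin{align*} \sqrt{n}(\bar{X}^2-\mu^2)\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,4\mu^2\sigma^2\right). \end{align*}\]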

Example 1.2 It is well known that, given a parametric density \(f_\theta\) with parameter \(\theta\in\Theta,\) and iid \(X_1,\ldots,X_n\sim f_\theta,\) the maximum likelihood (ML) estimator \(\hat\theta_{\mathrm{ML}}:=\arg\max_{\theta\in\Theta}\sum_{i=1}^n\log f_\theta(X_i)\) (the parameter value that maximizes the likelihood of the sample under the model) is asymptotically normal under certain regularity conditions:

\[\begin{align*} \sqrt{n}(\hat\theta_{\mathrm{ML}}-\theta)\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,I(\theta)^{-1}\right), \end{align*}\]

where \(I(\theta):=-\mathbb{E}_\theta\left[\frac{\partial^2\log f_\theta(X)}{\partial\theta^2}\right]\) is the Fisher information. Then, by the delta method (Theorem 1.7), for any \(g\) differentiable at \(\theta\) with \(g'(\theta)\neq0\):

\[\begin{align*} \sqrt{n}(g(\hat\theta_{\mathrm{ML}})-g(\theta))\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,(g'(\theta))^2I(\theta)^{-1}\right). \end{align*}\]

Had we instead applied the continuous mapping theorem (Theorem 1.4) to \(g,\) we would have obtained

\[\begin{align*} g(\sqrt{n}(\hat\theta_{\mathrm{ML}}-\theta))\stackrel{d}{\longrightarrow}g\left(\mathcal{N}\left(0,I(\theta)^{-1}\right)\right), \end{align*}\]

which is a statement about \(g\) applied to the limiting \(\mathcal{N}\left(0,I(\theta)^{-1}\right)\) random variable (in general, not normally distributed) and not about the asymptotic distribution of \(g(\hat\theta_{\mathrm{ML}})\) itself.
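
The following simulation (a minimal sketch in Python; the exponential model, the true \(\theta=2,\) and the sample size are arbitrary choices) illustrates Example 1.2: for \(X_i\sim\mathrm{Exp}(\theta)\) with density \(f_\theta(x)=\theta e^{-\theta x},\) the ML estimator is \(\hat\theta_{\mathrm{ML}}=1/\bar{X},\) the Fisher information is \(I(\theta)=1/\theta^2,\) and the delta method with \(g(\theta)=1/\theta\) gives \(\sqrt{n}(1/\hat\theta_{\mathrm{ML}}-1/\theta)\stackrel{d}{\longrightarrow}\mathcal{N}(0,1/\theta^2).\)

```python
import numpy as np

# Example 1.2 for X_i ~ Exp(theta) with f_theta(x) = theta * exp(-theta * x):
# theta_hat_ML = 1 / X_bar and I(theta) = 1 / theta^2
rng = np.random.default_rng(0)
theta, n, M = 2.0, 500, 10_000
X = rng.exponential(scale=1 / theta, size=(M, n))
theta_hat = 1 / X.mean(axis=1)

# sd of sqrt(n) (theta_hat - theta) should approach I(theta)^{-1/2} = theta
print(np.std(np.sqrt(n) * (theta_hat - theta)), theta)

# Delta method with g(theta) = 1 / theta: asymptotic sd is |g'(theta)| theta = 1 / theta
print(np.std(np.sqrt(n) * (1 / theta_hat - 1 / theta)), 1 / theta)
```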

Theorem 1.8 (Weak and strong laws of large numbers) Let \(X_n\) be an iid sequence of random variables with \(\mathbb{E}[X_i]=\mu,\) \(i\geq1.\) Then: \(\frac{1}{n}\sum_{i=1}^nX_i\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}\mu.\)
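
As a quick illustration (a minimal sketch in Python; the exponential model and the sample size are arbitrary choices), the running mean of iid \(\mathrm{Exp}(1)\) observations stabilizes around \(\mu=1\):

```python
import numpy as np

# Running mean of Exp(1) observations (mu = 1) stabilizing around mu
rng = np.random.default_rng(5)
X = rng.exponential(scale=1.0, size=100_000)
running_mean = np.cumsum(X) / np.arange(1, X.size + 1)
print(running_mean[[9, 99, 999, 9999, 99999]])   # values approaching 1
```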


  1. Intuitively: if convergence in probability is fast enough, then we have almost sure convergence.