1.3 Stochastic convergence review
Let \(X_n\) be a sequence of random variables defined on a common probability space \((\Omega,\mathcal{A},\mathbb{P}).\) The four most common types of convergence of \(X_n\) to a random variable \(X\) defined on \((\Omega,\mathcal{A},\mathbb{P})\) are the following.
Definition 1.1 (Convergence in distribution) \(X_n\) converges in distribution to \(X,\) written \(X_n\stackrel{d}{\longrightarrow}X,\) if \(\lim_{n\to\infty}F_n(x)=F(x)\) for all \(x\) for which \(F\) is continuous, where \(X_n\sim F_n\) and \(X\sim F.\)
“Convergence in distribution” is also referred to as weak convergence. Proposition 1.6 justifies this terminology.
Definition 1.2 (Convergence in probability) \(X_n\) converges in probability to \(X,\) written \(X_n\stackrel{\mathbb{P}}{\longrightarrow}X,\) if \(\lim_{n\to\infty}\mathbb{P}[|X_n-X|>\varepsilon]=0,\) \(\forall \varepsilon>0.\)
Definition 1.3 (Convergence almost surely) \(X_n\) converges almost surely (as) to \(X,\) written \(X_n\stackrel{\mathrm{as}}{\longrightarrow}X,\) if \(\mathbb{P}[\{\omega\in\Omega:\lim_{n\to\infty}X_n(\omega)=X(\omega)\}]=1.\)
Definition 1.4 (Convergence in \(r\)-mean) For \(r\geq1,\) \(X_n\) converges in \(r\)-mean to \(X,\) written \(X_n\stackrel{r}{\longrightarrow}X,\) if \(\lim_{n\to\infty}\mathbb{E}[|X_n-X|^r]=0.\)
Remark. The previous definitions can be extended to a sequence of \(p\)-random vectors \(\mathbf{X}_n.\) For Definitions 1.2 and 1.4, replace \(|\cdot|\) with the Euclidean norm \(||\cdot||.\) Alternatively, Definition 1.2 can be extended marginally: \(\mathbf{X}_n\stackrel{\mathbb{P}}{\longrightarrow}\mathbf{X}:\iff X_{j,n}\stackrel{\mathbb{P}}{\longrightarrow}X_j,\) \(\forall j=1,\ldots,p.\) For Definition 1.1, replace \(F_n\) and \(F\) by the joint cdfs of \(\mathbf{X}_n\) and \(\mathbf{X},\) respectively. Definition 1.3 also extends marginally.
Convergence in \(2\)-mean plays an important role in assessing the consistency of an estimator \(\hat\theta_n=T(X_1,\ldots,X_n)\) of a parameter \(\theta.\) We say that the estimator is consistent (in the MSE sense) if its Mean Squared Error (MSE),
\[\begin{align*} \mathrm{MSE}[\hat\theta_n]:\!&=\mathbb{E}[(\hat\theta_n-\theta)^2]\\ &=(\mathbb{E}[\hat\theta_n]-\theta)^2+\mathbb{V}\mathrm{ar}[\hat\theta_n]\\ &=:\mathrm{Bias}[\hat\theta_n]^2+\mathbb{V}\mathrm{ar}[\hat\theta_n], \end{align*}\]
goes to zero as \(n\to\infty\) or, equivalently, if \(\hat\theta_n\stackrel{2}{\longrightarrow}\theta\) (which, by Proposition 1.6 below, implies \(\hat\theta_n\stackrel{\mathbb{P}}{\longrightarrow}\theta\)).
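For illustration, the sketch below (in Python, using an arbitrary \(\mathcal{N}(\mu,\sigma^2)\) model and the sample mean as estimator) approximates the MSE by Monte Carlo and shows how it shrinks towards zero as \(n\) grows.

```python
import numpy as np

# Monte Carlo approximation of the MSE of the sample mean (a consistent
# estimator of mu) under an arbitrary N(mu, sigma^2) model; the MSE should
# shrink towards zero as n grows (here it equals sigma^2 / n exactly).
rng = np.random.default_rng(42)
mu, sigma, n_reps = 2.0, 1.5, 500

for n in (10, 100, 1000, 10000):
    estimates = rng.normal(mu, sigma, size=(n_reps, n)).mean(axis=1)
    mse = np.mean((estimates - mu) ** 2)  # approximates E[(theta_hat - theta)^2]
    print(f"n = {n:>5d}   MSE ≈ {mse:.5f}   (exact: {sigma ** 2 / n:.5f})")
```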
If \(X_n\stackrel{d,\mathbb{P},r,\mathrm{as}}{\longrightarrow} X\) and \(X\) is a degenerate random variable such that \(\mathbb{P}[X=c]=1,\) \(c\in\mathbb{R},\) then we write \(X_n\stackrel{d,\mathbb{P},r,\mathrm{as}}{\longrightarrow} c\) (the list notation \(\stackrel{d,\mathbb{P},r,\mathrm{as}}{\longrightarrow}\) is used to condense four different convergence results into the same line).
The relations between the types of convergence are conveniently summarized in the following proposition.
Proposition 1.6 Let \(X_n\) be a sequence of random variables and \(X\) a random variable. Then the following implication diagram is satisfied:
\[\begin{align*} \begin{array}{rcl} X_n\stackrel{r}{\longrightarrow}X \quad\implies & X_n\stackrel{\mathbb{P}}{\longrightarrow}X & \impliedby\quad X_n\stackrel{\mathrm{as}}{\longrightarrow}X\\ & \Downarrow & \\ & X_n\stackrel{d}{\longrightarrow}X & \end{array} \end{align*}\]
Also, if \(s\geq r\geq 1,\) then \(X_n\stackrel{s}{\longrightarrow}X\implies X_n\stackrel{r}{\longrightarrow}X.\)
None of the converses holds in general. However, there are some notable exceptions:
- If \(X_n\stackrel{d}{\longrightarrow}c,\) then \(X_n\stackrel{\mathbb{P}}{\longrightarrow}c,\) \(c\in\mathbb{R}.\)
- If, \(\forall\varepsilon>0,\) \(\sum_{n=1}^\infty\mathbb{P}[|X_n-X|>\varepsilon]<\infty\) (which in particular implies \(X_n\stackrel{\mathbb{P}}{\longrightarrow}X\)), then \(X_n\stackrel{\mathrm{as}}{\longrightarrow}X.\)
- If \(X_n\stackrel{\mathbb{P}}{\longrightarrow}X\) and \(\mathbb{P}[|X_n|\leq M]=1\) for all \(n\in\mathbb{N}\) and some \(M>0,\) then \(X_n\stackrel{r}{\longrightarrow}X\) for \(r\geq1.\)
- If \(S_n=\sum_{i=1}^nX_i\) with \(X_1,\ldots,X_n\) iid, then \(S_n\stackrel{\mathbb{P}}{\longrightarrow}S\iff S_n\stackrel{\mathrm{as}}{\longrightarrow}S.\)
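To see that the converses do fail, consider the classical counterexample for \(\stackrel{\mathbb{P}}{\longrightarrow}\not\Rightarrow\stackrel{\mathrm{as}}{\longrightarrow}\) given by independent random variables with
\[\begin{align*} \mathbb{P}[X_n=1]=\frac{1}{n},\quad \mathbb{P}[X_n=0]=1-\frac{1}{n},\quad n\geq1. \end{align*}\]
Then \(\mathbb{P}[|X_n|>\varepsilon]\leq 1/n\to0\) for any \(\varepsilon>0,\) so \(X_n\stackrel{\mathbb{P}}{\longrightarrow}0.\) However, \(\sum_{n=1}^\infty\mathbb{P}[X_n=1]=\infty\) and, by independence and the second Borel–Cantelli lemma, \(X_n=1\) happens infinitely often with probability one, so \(X_n\) does not converge almost surely to \(0\) (observe also how the summability condition of the second point above fails).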
The weak Law of Large Numbers (LLN) and its strong version are the two most representative results on convergence in probability and almost sure convergence, respectively.
Theorem 1.1 (Weak and strong LLN) Let \(X_n\) be an iid sequence with \(\mathbb{E}[X_i]=\mu,\) \(i\geq1.\) Then: \(\frac{1}{n}\sum_{i=1}^nX_i\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}\mu.\)
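The LLN is easy to visualize by simulation; a minimal sketch in Python follows (the exponential distribution is just an arbitrary choice with \(\mu=1\)).

```python
import numpy as np

# Running sample means of a single sequence of iid Exp(1) draws (mu = 1);
# by the LLN, the running means stabilize about mu as n grows.
rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=100_000)
running_means = np.cumsum(x) / np.arange(1, x.size + 1)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n = {n:>6d}   sample mean = {running_means[n - 1]:.4f}")
```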
The cornerstone limit result in probability is the Central Limit Theorem (CLT). One of its simplest versions has the following form.
Theorem 1.2 (CLT) Let \(X_n\) be a sequence of iid random variables with \(\mathbb{E}[X_i]=\mu\) and \(\mathbb{V}\mathrm{ar}[X_i]=\sigma^2<\infty,\) \(i\in\mathbb{N}.\) Then:
\[\begin{align*} \frac{\sqrt{n}(\bar{X}-\mu)}{\sigma}\stackrel{d}{\longrightarrow}\mathcal{N}(0,1). \end{align*}\]
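A quick Monte Carlo check of the CLT follows (a sketch; the uniform distribution and the sample size are arbitrary choices).

```python
import numpy as np
from scipy import stats

# Standardized sample means of iid U(0, 1) samples; by the CLT, their
# distribution should be close to N(0, 1) for moderately large n.
rng = np.random.default_rng(1)
n, n_reps = 50, 5_000
mu, sigma = 0.5, np.sqrt(1 / 12)  # mean and standard deviation of U(0, 1)

z = np.sqrt(n) * (rng.uniform(size=(n_reps, n)).mean(axis=1) - mu) / sigma

# Compare a few empirical quantiles with the standard normal ones
for p in (0.05, 0.25, 0.50, 0.75, 0.95):
    print(f"p = {p:.2f}   empirical: {np.quantile(z, p):+.3f}   "
          f"N(0,1): {stats.norm.ppf(p):+.3f}")
```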
We will later use the following CLT for random variables that are independent but not identically distributed due to its easy-to-check moment conditions.
Theorem 1.3 (Lyapunov's CLT) Let \(X_n\) be a sequence of independent random variables with \(\mathbb{E}[X_i]=\mu_i\) and \(\mathbb{V}\mathrm{ar}[X_i]=\sigma_i^2<\infty,\) \(i\in\mathbb{N},\) and such that for some \(\delta>0\)
\[\begin{align*} \frac{1}{s_n^{2+\delta}}\sum_{i=1}^n\mathbb{E}\left[|X_i-\mu_i|^{2+\delta}\right]\longrightarrow0\text{ as }n\to\infty, \end{align*}\]
with \(s_n^2=\sum_{i=1}^n\sigma^2_i.\) Then:
\[\begin{align*} \frac{1}{s_n}\sum_{i=1}^n(X_i-\mu_i)\stackrel{d}{\longrightarrow}\mathcal{N}(0,1). \end{align*}\]
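As an illustration (a simulation sketch; the heteroskedastic exponential design below is an arbitrary choice that satisfies Lyapunov's condition with \(\delta=1\)):

```python
import numpy as np

# Independent, non-identically distributed summands: X_i ~ Exp(scale = i),
# so mu_i = i and sigma_i^2 = i^2.  Lyapunov's condition holds with delta = 1,
# hence sum_{i <= n} (X_i - mu_i) / s_n should be approximately N(0, 1).
rng = np.random.default_rng(7)
n, n_reps = 200, 5_000
i = np.arange(1, n + 1)
s_n = np.sqrt(np.sum(i ** 2.0))

x = rng.exponential(scale=i, size=(n_reps, n))  # each row holds X_1, ..., X_n
z = np.sum(x - i, axis=1) / s_n

print(f"mean ≈ {z.mean():+.3f}   (target 0)")
print(f"sd   ≈ {z.std():.3f}   (target 1)")
print(f"P[Z <= 1.645] ≈ {np.mean(z <= 1.645):.3f}   (target ≈ 0.95)")
```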
Finally, the following results will be useful (\('\) denotes transposition). In particular, Slutsky’s theorem allows combining the LLNs and the CLT through sums, products, and quotients.
Theorem 1.4 (Cramér–Wold device) Let \(\mathbf{X}_n\) be a sequence of \(p\)-dimensional random vectors. Then:
\[\begin{align*} \mathbf{X}_n\stackrel{d}{\longrightarrow}\mathbf{X}\iff \mathbf{c}'\mathbf{X}_n\stackrel{d}{\longrightarrow}\mathbf{c}'\mathbf{X},\quad \forall \mathbf{c}\in\mathbb{R}^p. \end{align*}\]
Theorem 1.5 (Continuous mapping theorem) If \(\mathbf{X}_n\stackrel{d,\,\mathbb{P},\,\mathrm{as}}{\longrightarrow}\mathbf{X},\) then
\[\begin{align*} g(\mathbf{X}_n)\stackrel{d,\,\mathbb{P},\,\mathrm{as}}{\longrightarrow}g(\mathbf{X}) \end{align*}\]
for any continuous function \(g.\)
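For instance, combining Theorem 1.2 with the continuous mapping theorem for \(g(x)=x^2\) yields
\[\begin{align*} \frac{n(\bar{X}-\mu)^2}{\sigma^2}=\left(\frac{\sqrt{n}(\bar{X}-\mu)}{\sigma}\right)^2\stackrel{d}{\longrightarrow}\chi^2_1, \end{align*}\]
since the square of a \(\mathcal{N}(0,1)\) random variable is \(\chi^2_1\)-distributed.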
Theorem 1.6 (Slutsky's theorem) Let \(X_n\) and \(Y_n\) be sequences of random variables and \(c\in\mathbb{R}.\) Then:
- If \(X_n\stackrel{d}{\longrightarrow}X\) and \(Y_n\stackrel{\mathbb{P}}{\longrightarrow} c,\) then \(X_nY_n\stackrel{d}{\longrightarrow}cX.\)
- If \(X_n\stackrel{d}{\longrightarrow}X\) and \(Y_n\stackrel{\mathbb{P}}{\longrightarrow} c,\) \(c\neq0,\) then \(\frac{X_n}{Y_n}\stackrel{d}{\longrightarrow}\frac{X}{c}.\)
- If \(X_n\stackrel{d}{\longrightarrow}X\) and \(Y_n\stackrel{\mathbb{P}}{\longrightarrow} c,\) then \(X_n+Y_n\stackrel{d}{\longrightarrow}X+c.\)
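A canonical application of Slutsky's theorem is the studentization of the CLT: if \(S_n\stackrel{\mathbb{P}}{\longrightarrow}\sigma\) (e.g., \(S_n\) the sample standard deviation), then \(S_n/\sigma\stackrel{\mathbb{P}}{\longrightarrow}1\) and the second statement above gives \(\sqrt{n}(\bar X-\mu)/S_n\stackrel{d}{\longrightarrow}\mathcal{N}(0,1),\) just as if \(\sigma\) were known. A minimal simulation sketch in Python (the exponential model is an arbitrary choice):

```python
import numpy as np

# Studentized mean: sqrt(n) * (xbar - mu) / s_n, with s_n the sample standard
# deviation.  Since s_n / sigma ->P 1, Slutsky's theorem gives a N(0, 1) limit,
# just as if sigma were known.
rng = np.random.default_rng(3)
n, n_reps, mu = 100, 5_000, 2.0

samples = rng.exponential(scale=mu, size=(n_reps, n))  # mean mu, sd mu
t = np.sqrt(n) * (samples.mean(axis=1) - mu) / samples.std(axis=1, ddof=1)

print(f"mean ≈ {t.mean():+.3f}   (target 0)")
print(f"sd   ≈ {t.std():.3f}   (target 1)")
print(f"P[|T| <= 1.96] ≈ {np.mean(np.abs(t) <= 1.96):.3f}   (target ≈ 0.95)")
```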
Theorem 1.7 (Limit algebra for \((\mathbb{P},\,r,\,\mathrm{as})\)-convergence) Let \(X_n\) and \(Y_n\) be sequences of random variables, and let \(a_n\to a\) and \(b_n\to b\) be two sequences of real numbers.
- If \(X_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}X\) and \(Y_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}Y,\) then \(a_nX_n+b_nY_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}aX+bY.\)
- If \(X_n\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}X\) and \(Y_n\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}Y,\) then \(X_nY_n\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}XY.\)
Remark. Note the absence of analogous results for convergence in distribution. In general, they do not hold!
- In particular, it is false in general that \(X_n\stackrel{d}{\longrightarrow}X\) and \(Y_n\stackrel{d}{\longrightarrow}Y\) imply \(X_n+Y_n\stackrel{d}{\longrightarrow}X+Y\) or \(X_nY_n\stackrel{d}{\longrightarrow}XY.\)
- It is true, however, that \((X_n,Y_n)\stackrel{d}{\longrightarrow}(X,Y)\) (a much stronger premise) implies both \(X_n+Y_n\stackrel{d}{\longrightarrow}X+Y\) and \(X_nY_n\stackrel{d}{\longrightarrow}XY\), as Theorem 1.5 indicates. Note that \(X_n+Y_n\stackrel{d}{\longrightarrow}X+Y\) is also implied by \((X_n,Y_n)\stackrel{d}{\longrightarrow}(X,Y)\) by Theorem 1.4 with \(\mathbf{c}=(1,1)'.\)
- Consequently, it is also true that, if \(X_n\) and \(Y_n\) are independent, then \(X_n\stackrel{d}{\longrightarrow}X\) and \(Y_n\stackrel{d}{\longrightarrow}Y\) imply \(X_n+Y_n\stackrel{d}{\longrightarrow}X+Y\) and \(X_nY_n\stackrel{d}{\longrightarrow}XY.\)
Theorem 1.8 (Delta method) If \(\sqrt{n}(X_n-\mu)\stackrel{d}{\longrightarrow}\mathcal{N}(0,\sigma^2),\) then
\[\begin{align*} \sqrt{n}(g(X_n)-g(\mu))\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,(g'(\mu))^2\sigma^2\right) \end{align*}\]
for any function \(g\) that is differentiable at \(\mu\) and such that \(g'(\mu)\neq0.\)
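A quick numerical check of the delta method (a sketch; \(g(x)=x^2\) and the exponential model are arbitrary choices):

```python
import numpy as np

# Delta method check for g(x) = x^2: if sqrt(n) * (xbar - mu) ->d N(0, sigma^2),
# then sqrt(n) * (xbar^2 - mu^2) ->d N(0, (2 * mu)^2 * sigma^2), since g'(mu) = 2 * mu.
rng = np.random.default_rng(5)
n, n_reps = 500, 5_000
mu = 2.0                 # Exp(scale = 2) has mean 2 and variance 4
sigma2 = mu ** 2

xbar = rng.exponential(scale=mu, size=(n_reps, n)).mean(axis=1)
z = np.sqrt(n) * (xbar ** 2 - mu ** 2)

print(f"empirical sd of z ≈ {z.std():.3f}")
print(f"delta method sd   = {np.sqrt((2 * mu) ** 2 * sigma2):.3f}")
```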
Example 1.2 It is well known that, given a parametric density \(f_\theta\) with parameter \(\theta\in\Theta\) and iid \(X_1,\ldots,X_n\sim f_\theta,\) the Maximum Likelihood (ML) estimator \(\hat\theta_{\mathrm{ML}}:=\arg\max_{\theta\in\Theta}\sum_{i=1}^n\log f_\theta(X_i)\) (the parameter that maximizes the “probability” of the data based on the model) is asymptotically normal under certain regularity conditions:
\[\begin{align*} \sqrt{n}(\hat\theta_{\mathrm{ML}}-\theta)\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,I(\theta)^{-1}\right), \end{align*}\]
where \(I(\theta):=-\mathbb{E}_\theta\left[\frac{\partial^2\log f_\theta(x)}{\partial\theta^2}\right]\) is known as the Fisher information. Then, by the delta method (Theorem 1.8),
\[\begin{align*} \sqrt{n}(g(\hat\theta_{\mathrm{ML}})-g(\theta))\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,(g'(\theta))^2I(\theta)^{-1}\right). \end{align*}\]
Note that, had we applied the continuous mapping theorem with this \(g,\) we would have obtained a different result:
\[\begin{align*} g(\sqrt{n}(\hat\theta_{\mathrm{ML}}-\theta))\stackrel{d}{\longrightarrow}g\left(\mathcal{N}\left(0,I(\theta)^{-1}\right)\right). \end{align*}\]
Exercise 1.10 Let’s dig further into the differences between the delta method and the continuous mapping theorem when applied to \(\sqrt{n}(X_n-\mu)\stackrel{d}{\longrightarrow}\mathcal{N}(0,\sigma^2)\):
- For what kind of maps \(g\) are the results \(\sqrt{n}(g(X_n)-g(\mu))\stackrel{d}{\longrightarrow}\mathcal{N}(0,(g'(\mu))^2\sigma^2)\) and \(g(\sqrt{n}(X_n-\mu))\stackrel{d}{\longrightarrow}g(\mathcal{N}(0,\sigma^2))\) equivalent?
- Take \(g(x)=e^x.\) What two results do you obtain with the delta method and the continuous mapping theorem when applied to \(\sqrt{n}\bar X \stackrel{d}{\longrightarrow}\mathcal{N}(0,\sigma^2)\)?