1.3 Stochastic convergence review

Let $$X_n$$ be a sequence of random variables defined on a common probability space $$(\Omega,\mathcal{A},\mathbb{P})$$. The four most common types of convergence of $$X_n$$ to a random variable $$X$$ defined on $$(\Omega,\mathcal{A},\mathbb{P})$$ are the following.

Definition 1.1 (Convergence in distribution) $$X_n$$ converges in distribution to $$X$$, written $$X_n\stackrel{d}{\longrightarrow}X$$, if $$\lim_{n\to\infty}F_n(x)=F(x)$$ for all $$x$$ for which $$F$$ is continuous, where $$X_n\sim F_n$$ and $$X\sim F$$.

“Convergence in distribution” is also referred to as weak convergence. Proposition 1.6 justifies this terminology.

Definition 1.2 (Convergence in probability) $$X_n$$ converges in probability to $$X$$, written $$X_n\stackrel{\mathbb{P}}{\longrightarrow}X$$, if $$\lim_{n\to\infty}\mathbb{P}[|X_n-X|>\varepsilon]=0$$, $$\forall \varepsilon>0$$.

Definition 1.3 (Convergence almost surely) $$X_n$$ converges almost surely (as) to $$X$$, written $$X_n\stackrel{\mathrm{as}}{\longrightarrow}X$$, if $$\mathbb{P}[\{\omega\in\Omega:\lim_{n\to\infty}X_n(\omega)=X(\omega)\}]=1$$.

Definition 1.4 (Convergence in $$r$$-mean) For $$r\geq1$$, $$X_n$$ converges in $$r$$-mean to $$X$$, written $$X_n\stackrel{r}{\longrightarrow}X$$, if $$\lim_{n\to\infty}\mathbb{E}[|X_n-X|^r]=0$$.

Remark. The previous definitions can be extended to a sequence of $$p$$-random vectors $$\mathbf{X}_n$$. For Definitions 1.2 and 1.4, replace $$|\cdot|$$ with the Euclidean norm $$||\cdot||$$. Alternatively, Definition 1.2 can be extended marginally: $$\mathbf{X}_n\stackrel{\mathbb{P}}{\longrightarrow}\mathbf{X}:\iff X_{j,n}\stackrel{\mathbb{P}}{\longrightarrow}X_j$$, $$\forall j=1,\ldots,p$$. For Definition 1.1, replace $$F_n$$ and $$F$$ by the joint cdfs of $$\mathbf{X}_n$$ and $$\mathbf{X}$$, respectively. Definition 1.3 also extends marginally.
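As a quick numerical illustration of Definition 1.2 (a Python sketch, not part of the formal development), take $$X_n=X+Z_n$$ with $$Z_n\sim\mathcal{N}(0,1/n)$$ independent of $$X$$: then $$|X_n-X|\sim|\mathcal{N}(0,1/n)|$$ and $$\mathbb{P}[|X_n-X|>\varepsilon]\to0$$ for each $$\varepsilon>0$$. The sample sizes, tolerance, and number of replicates below are illustrative choices.

```python
import numpy as np

# Monte Carlo estimate of P[|X_n - X| > eps] for X_n = X + N(0, 1/n):
# the deviation |X_n - X| is distributed as |N(0, 1/n)|, so the
# probability must shrink to zero as n grows.
rng = np.random.default_rng(42)
M = 100_000   # Monte Carlo replicates (illustrative choice)
eps = 0.1     # tolerance in the definition (illustrative choice)

probs = {}
for n in (1, 10, 100, 1000):
    deviations = np.abs(rng.normal(0, 1 / np.sqrt(n), size=M))
    probs[n] = np.mean(deviations > eps)

# The estimated probabilities decrease towards zero with n
print(probs)
```

The estimated probabilities decay monotonically in $$n$$, matching $$\mathbb{P}[|X_n-X|>\varepsilon]=2(1-\Phi(\varepsilon\sqrt{n}))$$.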

Convergence in $$2$$-mean plays a central role in defining a consistent estimator $$\hat\theta_n=T(X_1,\ldots,X_n)$$ of a parameter $$\theta$$. We say that the estimator is consistent (in the MSE sense) if its Mean Squared Error (MSE),

\begin{align*} \mathrm{MSE}[\hat\theta_n]:\!&=\mathbb{E}[(\hat\theta_n-\theta)^2]\\ &=(\mathbb{E}[\hat\theta_n]-\theta)^2+\mathbb{V}\mathrm{ar}[\hat\theta_n]\\ &=:\mathrm{Bias}[\hat\theta_n]^2+\mathbb{V}\mathrm{ar}[\hat\theta_n], \end{align*}

goes to zero as $$n\to\infty$$ or, equivalently, if $$\hat\theta_n\stackrel{2}{\longrightarrow}\theta$$.
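The bias–variance decomposition of the MSE can be checked by simulation. The following Python sketch (the estimator, sample size, and distribution are illustrative choices, not from the text) estimates $$\theta=1$$ by the sample mean of $$n=50$$ $$\mathrm{Exp}(1)$$ observations, for which the theoretical MSE is $$\sigma^2/n=1/50$$.

```python
import numpy as np

# Monte Carlo check of MSE = Bias^2 + Var for the sample mean of
# n = 50 iid Exp(1) observations (true parameter theta = 1).
rng = np.random.default_rng(1)
M, n, theta = 200_000, 50, 1.0

estimates = rng.exponential(scale=theta, size=(M, n)).mean(axis=1)

mse = np.mean((estimates - theta) ** 2)
bias2 = (np.mean(estimates) - theta) ** 2
var = np.var(estimates)  # ddof=0 makes the decomposition an exact identity

# Both sides of the decomposition agree, and both are close to
# the theoretical MSE sigma^2 / n = 1 / 50 = 0.02
print(mse, bias2 + var)
```

Since the sample mean is unbiased, virtually all of the MSE here comes from the variance term.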

If $$X_n\stackrel{d,\mathbb{P},r,\mathrm{as}}{\longrightarrow} X$$ and $$X$$ is a degenerate random variable such that $$\mathbb{P}[X=c]=1$$, $$c\in\mathbb{R}$$, then we write $$X_n\stackrel{d,\mathbb{P},r,\mathrm{as}}{\longrightarrow} c$$ (the list notation $$\stackrel{d,\mathbb{P},r,\mathrm{as}}{\longrightarrow}$$ condenses four different convergence results into the same line).

The relations between the types of convergence are conveniently summarized in the following proposition.

Proposition 1.6 Let $$X_n$$ be a sequence of random variables and $$X$$ a random variable. Then the following implication diagram is satisfied:

\begin{align*} \begin{array}{rcl} X_n\stackrel{r}{\longrightarrow}X \quad\implies & X_n\stackrel{\mathbb{P}}{\longrightarrow}X & \impliedby\quad X_n\stackrel{\mathrm{as}}{\longrightarrow}X\\ & \Downarrow & \\ & X_n\stackrel{d}{\longrightarrow}X & \end{array} \end{align*}

Also, if $$s\geq r\geq 1$$, then $$X_n\stackrel{s}{\longrightarrow}X\implies X_n\stackrel{r}{\longrightarrow}X$$.

None of the converses holds in general. However, there are some notable exceptions:

1. If $$X_n\stackrel{d}{\longrightarrow}c$$, then $$X_n\stackrel{\mathbb{P}}{\longrightarrow}c$$, $$c\in\mathbb{R}$$.
2. If $$\sum_{n=1}^\infty\mathbb{P}[|X_n-X|>\varepsilon]<\infty$$, $$\forall\varepsilon>0$$ (which implies¹ $$X_n\stackrel{\mathbb{P}}{\longrightarrow}X$$), then $$X_n\stackrel{\mathrm{as}}{\longrightarrow}X$$.
3. If $$X_n\stackrel{\mathbb{P}}{\longrightarrow}X$$ and $$\mathbb{P}[|X_n|\leq M]=1$$, $$\forall n\in\mathbb{N}$$, for some $$M>0$$², then $$X_n\stackrel{r}{\longrightarrow}X$$ for $$r\geq1$$.
4. If $$S_n=\sum_{i=1}^nX_i$$ with $$X_1,\ldots,X_n$$ iid, then $$S_n\stackrel{\mathbb{P}}{\longrightarrow}S\iff S_n\stackrel{\mathrm{as}}{\longrightarrow}S$$.

The cornerstone limit result in probability is the Central Limit Theorem (CLT). One of its simplest versions has the following form.

Theorem 1.1 (CLT) Let $$X_n$$ be a sequence of iid random variables with $$\mathbb{E}[X_i]=\mu$$ and $$\mathbb{V}\mathrm{ar}[X_i]=\sigma^2<\infty$$, $$i\in\mathbb{N}$$. Then:

\begin{align*} \frac{\sqrt{n}(\bar{X}-\mu)}{\sigma}\stackrel{d}{\longrightarrow}\mathcal{N}(0,1), \end{align*}

where $$\bar{X}:=\frac{1}{n}\sum_{i=1}^nX_i$$ denotes the sample mean.
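The CLT is easy to verify empirically. The following Python sketch (distribution and sample sizes are illustrative choices) standardizes means of $$n=100$$ iid $$\mathrm{Exp}(1)$$ variables, for which $$\mu=\sigma=1$$, and checks that the resulting statistic behaves like a $$\mathcal{N}(0,1)$$.

```python
import numpy as np

# Monte Carlo sketch of the CLT: standardized means of n = 100 iid
# Exp(1) variables (mu = sigma = 1) are approximately N(0, 1).
rng = np.random.default_rng(7)
M, n = 100_000, 100

x_bar = rng.exponential(scale=1.0, size=(M, n)).mean(axis=1)
z = np.sqrt(n) * (x_bar - 1.0) / 1.0

# Sample mean and sd of the standardized statistic approach 0 and 1
print(z.mean(), z.std())
```

A histogram of `z` against the $$\mathcal{N}(0,1)$$ density makes the approximation visible; the mild residual skewness of order $$2/\sqrt{n}$$ is inherited from the exponential distribution.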

We will later use the following CLT for random variables that are independent but not identically distributed due to its easy-to-check moment conditions.

Theorem 1.2 (Lyapunov’s CLT) Let $$X_n$$ be a sequence of independent random variables with $$\mathbb{E}[X_i]=\mu_i$$ and $$\mathbb{V}\mathrm{ar}[X_i]=\sigma_i^2<\infty$$, $$i\in\mathbb{N}$$, and such that for some $$\delta>0$$

\begin{align*} \frac{1}{s_n^{2+\delta}}\sum_{i=1}^n\mathbb{E}\left[|X_i-\mu_i|^{2+\delta}\right]\longrightarrow0\text{ as }n\to\infty, \end{align*}

with $$s_n^2=\sum_{i=1}^n\sigma^2_i$$. Then:

\begin{align*} \frac{1}{s_n}\sum_{i=1}^n(X_i-\mu_i)\stackrel{d}{\longrightarrow}\mathcal{N}(0,1). \end{align*}
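Lyapunov's CLT can also be checked by simulation. In the Python sketch below (an illustrative choice, not from the text), $$X_i\sim\mathrm{Uniform}(-i,i)$$ independently, so $$\mu_i=0$$, $$\sigma_i^2=i^2/3$$, and the Lyapunov condition holds with $$\delta=1$$ since $$\sum_{i=1}^n\mathbb{E}[|X_i|^3]/s_n^3=O(n^{-1/2})$$.

```python
import numpy as np

# Monte Carlo sketch of Lyapunov's CLT with independent, non-identically
# distributed X_i ~ Uniform(-i, i): mu_i = 0, sigma_i^2 = i^2 / 3, and
# the Lyapunov condition holds for delta = 1.
rng = np.random.default_rng(0)
M, n = 100_000, 200

i = np.arange(1, n + 1)
s_n = np.sqrt(np.sum(i**2 / 3))

x = rng.uniform(-i, i, size=(M, n))  # row m holds one draw of X_1, ..., X_n
z = x.sum(axis=1) / s_n

# The standardized sum is approximately N(0, 1)
print(z.mean(), z.std())
```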

Finally, the following results will be useful ($$'$$ denotes transposition).

Theorem 1.3 (Cramér–Wold device) Let $$\mathbf{X}_n$$ be a sequence of $$p$$-dimensional random vectors. Then:

\begin{align*} \mathbf{X}_n\stackrel{d}{\longrightarrow}\mathbf{X}\iff \mathbf{c}'\mathbf{X}_n\stackrel{d}{\longrightarrow}\mathbf{c}'\mathbf{X},\quad \forall \mathbf{c}\in\mathbb{R}^p. \end{align*}

Theorem 1.4 (Continuous mapping theorem) If $$\mathbf{X}_n\stackrel{d,\,\mathbb{P},\,\mathrm{as}}{\longrightarrow}\mathbf{X}$$, then

\begin{align*} g(\mathbf{X}_n)\stackrel{d,\,\mathbb{P},\,\mathrm{as}}{\longrightarrow}g(\mathbf{X}) \end{align*}

for any continuous function $$g$$.

Theorem 1.5 (Slutsky’s theorem) Let $$X_n$$ and $$Y_n$$ be sequences of random variables and $$c\in\mathbb{R}$$. Then:

1. If $$X_n\stackrel{d}{\longrightarrow}X$$ and $$Y_n\stackrel{\mathbb{P}}{\longrightarrow} c$$, then $$X_nY_n\stackrel{d}{\longrightarrow}cX$$.
2. If $$X_n\stackrel{d}{\longrightarrow}X$$ and $$Y_n\stackrel{\mathbb{P}}{\longrightarrow} c$$, $$c\neq0$$, then $$\frac{X_n}{Y_n}\stackrel{d}{\longrightarrow}\frac{X}{c}$$.
3. If $$X_n\stackrel{d}{\longrightarrow}X$$ and $$Y_n\stackrel{\mathbb{P}}{\longrightarrow} c$$, then $$X_n+Y_n\stackrel{d}{\longrightarrow}X+c$$.

Theorem 1.6 (Limit algebra for $$(\mathbb{P},\,r,\,\mathrm{as})$$-convergence) Let $$X_n$$ and $$Y_n$$ be sequences of random variables, and $$a_n\to a$$ and $$b_n\to b$$ two sequences. Write $$X_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}X$$ for “$$X_n$$ converges in probability (respectively, in $$r$$-mean, respectively, almost surely) to $$X$$.”

1. If $$X_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}X$$ and $$Y_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}Y$$, then $$a_nX_n+b_nY_n\stackrel{\mathbb{P},\,r,\,\mathrm{as}}{\longrightarrow}aX+bY$$.
2. If $$X_n\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}X$$ and $$Y_n\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}Y$$, then $$X_nY_n\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}XY$$.

Remark. Note the absence of analogous results for convergence in distribution: there are none in general. In particular, it is false that $$X_n\stackrel{d}{\longrightarrow}X$$ and $$Y_n\stackrel{d}{\longrightarrow}Y$$ imply $$X_n+Y_n\stackrel{d}{\longrightarrow}X+Y$$ or $$X_nY_n\stackrel{d}{\longrightarrow}XY$$. It is true, however, that $$(X_n,Y_n)\stackrel{d}{\longrightarrow}(X,Y)$$ (a much stronger premise) implies both $$X_n+Y_n\stackrel{d}{\longrightarrow}X+Y$$ and $$X_nY_n\stackrel{d}{\longrightarrow}XY$$.

Theorem 1.7 (Delta method) If $$\sqrt{n}(X_n-\mu)\stackrel{d}{\longrightarrow}\mathcal{N}(0,\sigma^2)$$, then

\begin{align*} \sqrt{n}(g(X_n)-g(\mu))\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,(g'(\mu))^2\sigma^2\right) \end{align*}

for any function $$g$$ that is differentiable at $$\mu$$ with $$g'(\mu)\neq0$$.
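The delta method is also easy to verify by simulation. The Python sketch below (with the illustrative choices $$g(x)=x^2$$, $$X_i\sim\mathcal{N}(\mu,\sigma^2)$$, $$\mu=2$$, $$\sigma=1$$) checks that the standard deviation of $$\sqrt{n}(g(\bar X)-g(\mu))$$ approaches $$|g'(\mu)|\sigma=2\mu\sigma=4$$.

```python
import numpy as np

# Monte Carlo sketch of the delta method with g(x) = x^2: for iid
# N(mu, sigma^2) data, sqrt(n) (X_bar^2 - mu^2) ->d N(0, (2 mu)^2 sigma^2).
rng = np.random.default_rng(5)
M, n, mu, sigma = 100_000, 500, 2.0, 1.0

x_bar = rng.normal(mu, sigma, size=(M, n)).mean(axis=1)
stat = np.sqrt(n) * (x_bar**2 - mu**2)

# Empirical sd approaches |g'(mu)| sigma = 2 * 2 * 1 = 4
print(stat.std())
```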

Example 1.2 It is well known that, given a parametric density $$f_\theta$$ with parameter $$\theta\in\Theta$$ and iid $$X_1,\ldots,X_n\sim f_\theta$$, then the Maximum Likelihood (ML) estimator $$\hat\theta_{\mathrm{ML}}:=\arg\max_{\theta\in\Theta}\sum_{i=1}^n\log f_\theta(X_i)$$ (the parameter that maximizes the “probability” of the data based on the model) converges to a normal under certain regularity conditions:

\begin{align*} \sqrt{n}(\hat\theta_{\mathrm{ML}}-\theta)\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,I(\theta)^{-1}\right), \end{align*}

where $$I(\theta):=-\mathbb{E}_\theta\left[\frac{\partial^2\log f_\theta(X)}{\partial\theta^2}\right]$$ is known as the Fisher information. Then, by the delta method, it is satisfied that

\begin{align*} \sqrt{n}(g(\hat\theta_{\mathrm{ML}})-g(\theta))\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,(g'(\theta))^2I(\theta)^{-1}\right). \end{align*}

Note that, had we applied the continuous mapping theorem for $$g$$ instead, we would have obtained a different result:

\begin{align*} g(\sqrt{n}(\hat\theta_{\mathrm{ML}}-\theta))\stackrel{d}{\longrightarrow}g\left(\mathcal{N}\left(0,I(\theta)^{-1}\right)\right). \end{align*}
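The asymptotic normality of the ML estimator can be illustrated numerically. In the Python sketch below (the exponential model is an illustrative choice), for $$X_i\sim\mathrm{Exp}(\lambda)$$ the ML estimator is $$\hat\lambda_{\mathrm{ML}}=1/\bar X$$ and $$I(\lambda)=1/\lambda^2$$, so $$\sqrt{n}(\hat\lambda_{\mathrm{ML}}-\lambda)\stackrel{d}{\longrightarrow}\mathcal{N}(0,\lambda^2)$$.

```python
import numpy as np

# Monte Carlo sketch of the asymptotic normality of the MLE for the rate
# lambda of an exponential: lambda_hat = 1 / X_bar and I(lambda) = 1 / lambda^2,
# hence sqrt(n) (lambda_hat - lambda) ->d N(0, lambda^2).
rng = np.random.default_rng(13)
M, n, lam = 100_000, 500, 2.0

x_bar = rng.exponential(scale=1 / lam, size=(M, n)).mean(axis=1)
stat = np.sqrt(n) * (1 / x_bar - lam)

# Empirical sd approaches lambda = 2
print(stat.std())
```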

Exercise 1.10 Let’s dig further into the differences between the delta method and the continuous mapping theorem when applied to $$\sqrt{n}(X_n-\mu)\stackrel{d}{\longrightarrow}\mathcal{N}(0,\sigma^2)$$:

1. For what kind of maps $$g$$ are the results $$\sqrt{n}(g(X_n)-g(\mu))\stackrel{d}{\longrightarrow}\mathcal{N}(0,(g'(\mu))^2\sigma^2)$$ and $$g(\sqrt{n}(X_n-\mu))\stackrel{d}{\longrightarrow}g(\mathcal{N}(0,\sigma^2))$$ equivalent?
2. Take $$g(x)=e^x$$. What two results do you obtain with the delta method and the continuous mapping theorem when applied to $$\sqrt{n}\bar X \stackrel{d}{\longrightarrow}\mathcal{N}(0,\sigma^2)$$?

Theorem 1.8 (Weak and strong laws of large numbers) Let $$X_n$$ be an iid sequence with $$\mathbb{E}[X_i]=\mu$$, $$i\geq1$$. Then: $$\frac{1}{n}\sum_{i=1}^nX_i\stackrel{\mathbb{P},\,\mathrm{as}}{\longrightarrow}\mu$$.
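The law of large numbers can be visualized along a single trajectory. The Python sketch below (an illustrative choice of distribution) tracks the running mean of iid $$\mathrm{Exp}$$ draws with $$\mu=2$$ and checks that it settles near $$\mu$$.

```python
import numpy as np

# Sketch of the law of large numbers: the running mean of iid exponential
# draws with mean mu = 2 approaches mu along a single trajectory.
rng = np.random.default_rng(11)
mu = 2.0
x = rng.exponential(scale=mu, size=100_000)
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

# The running mean stabilizes around mu = 2
print(running_mean[99], running_mean[-1])
```

Plotting `running_mean` against the index shows the trajectory fluctuating early on and then stabilizing around $$\mu$$, the almost sure convergence of Theorem 1.8.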

1. Intuitively: if convergence in probability is fast enough, then we have almost sure convergence.↩︎

2. “Uniformly bounded random variables.”↩︎