Chapter 2 Special Distributions, Order Statistics, Convergence (Lecture on 01/09/2020)

Continuing from Chapter 1, there are two more important distributions to introduce, namely the Student’s t distribution and the F distribution.

The intuition behind the Student’s t distribution is that we want to quantify the variability of \bar{X} as an estimate of \mu when \sigma is unknown. Suppose X_1,\cdots,X_n are a random sample from N(\mu,\sigma^2). Then from Theorem 1.4, \frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\sim N(0,1), which can be used as a basis of inference. However, if \sigma is unknown, a natural idea is to substitute S for it and consider \frac{\bar{X}-\mu}{S/\sqrt{n}}. \begin{equation} \frac{\bar{X}-\mu}{S/\sqrt{n}}=\frac{(\bar{X}-\mu)/(\sigma/\sqrt{n})}{\sqrt{S^2/\sigma^2}} \tag{2.1} \end{equation} Notice that the numerator of (2.1) is a N(0,1) r.v., while S^2/\sigma^2 under the square root in the denominator is, by Theorem 1.4, a \chi^2_{n-1}/(n-1) random variable independent of the numerator. This leads to the Student’s t distribution.

Definition 2.1 (Student’s t Distribution) Let X_1,\cdots,X_n be a random sample from a N(\mu,\sigma^2) distribution. Then the quantity \frac{\bar{X}-\mu}{S/\sqrt{n}} has a t distribution with n-1 degrees of freedom. Equivalently, a r.v. T has a t distribution with p degrees of freedom, written T\sim t_p, if it has pdf \begin{equation} f_T(t)=\frac{\Gamma(\frac{p+1}{2})}{\Gamma(\frac{p}{2})}\frac{1}{(p\pi)^{1/2}}\frac{1}{(1+t^2/p)^{(p+1)/2}},\quad -\infty<t<\infty \tag{2.2} \end{equation}
The derivation of the pdf of the t distribution is straightforward: simply apply the transformation t=\frac{u}{\sqrt{v/p}}, w=v to the independent r.v.s U\sim N(0,1) and V\sim\chi^2_p, then integrate out w.
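As a quick sanity check of this construction (a simulation sketch that is not part of the original notes), the following draws repeated normal samples and compares the empirical quantiles of (\bar{X}-\mu)/(S/\sqrt{n}) with the t_{n-1} quantiles from scipy.stats; the sample size n, the number of replications, and the values of \mu and \sigma are arbitrary choices.

```python
# Monte Carlo sketch: (Xbar - mu) / (S / sqrt(n)) should behave like t_{n-1}.
# n, reps, mu, sigma below are arbitrary illustration choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps, mu, sigma = 10, 100_000, 2.0, 3.0

x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)
s = x.std(axis=1, ddof=1)                    # sample standard deviation S
t_stat = (xbar - mu) / (s / np.sqrt(n))

qs = [0.05, 0.25, 0.5, 0.75, 0.95]
print("empirical quantiles:", np.round(np.quantile(t_stat, qs), 3))
print("t_{n-1} quantiles  :", np.round(stats.t.ppf(qs, df=n - 1), 3))
```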

The t distribution has no mgf! It does not have moments of all orders: if there are p degrees of freedom, then only the first p-1 moments exist. We have the following property of the t_p distribution.

Lemma 2.1 If T_p\sim t_p, then \begin{equation} ET_p=0 \text{ for } p>1,\qquad Var(T_p)=\frac{p}{p-2} \text{ for } p>2 \tag{2.3} \end{equation} (This is from part (a) of Exercise 5.18 in Casella and Berger (2002))

Proof. For the mean, using the definition we have \begin{equation} ET_p=\int_{-\infty}^{+\infty}t\,\frac{\Gamma(\frac{p+1}{2})}{\Gamma(\frac{p}{2})}\frac{1}{(p\pi)^{1/2}}\frac{1}{(1+t^2/p)^{(p+1)/2}}dt \tag{2.4} \end{equation} Notice that the integrand of (2.4) is an odd function and that the integral converges absolutely for p>1; therefore, the integral is 0 when p>1.

As for the variance, note that T_p=\frac{U}{\sqrt{V/p}} with independent U\sim N(0,1) and V\sim\chi^2_p. Thus, since ET_p=0, \begin{equation} Var(T_p)=E(T_p^2)=p\,E(U^2)E(V^{-1})=\frac{p}{p-2},\quad p>2 \tag{2.5} \end{equation} where we used the result that the expectation of the inverse of a chi-squared r.v. with p degrees of freedom is \frac{1}{p-2}.
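Lemma 2.1 can also be checked numerically (a small sketch, not from the lecture), since scipy’s t distribution reports its mean and variance in closed form; the chosen degrees of freedom are arbitrary.

```python
# For p > 2 the reported variance should equal p / (p - 2); for p = 2 it is infinite.
from scipy import stats

for p in [2, 3, 5, 10]:
    theory = p / (p - 2) if p > 2 else float("inf")
    print(p, stats.t.mean(df=p), stats.t.var(df=p), theory)
```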

For the F distribution, the underlying intuition is to compare the variability of two populations, N(\mu_1,\sigma_1^2) and N(\mu_2,\sigma_2^2). The quantity of interest is \frac{\sigma_1^2}{\sigma_2^2}, whose information is contained in \frac{S_1^2}{S_2^2}. The F distribution describes the distribution of the ratio in (2.6), which allows one to compare the two variances. \begin{equation} \frac{S_1^2/S_2^2}{\sigma_1^2/\sigma_2^2}=\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \tag{2.6} \end{equation} Notice from (2.6) that the F distribution is the ratio of two independent scaled chi-squared random variables.

Definition 2.2 (F Distribution) Let X_1,\cdots,X_n be a random sample from N(\mu_X,\sigma_X^2), and let Y_1,\cdots,Y_m be a random sample from N(\mu_Y,\sigma_Y^2). Then the r.v. F=\frac{S_X^2/\sigma_X^2}{S_Y^2/\sigma_Y^2} has an F distribution with n-1 and m-1 degrees of freedom. Equivalently, if U\sim\chi^2_n and V\sim\chi^2_m are independent, then \frac{U/n}{V/m}\sim F_{n,m}. A random variable F with p and q degrees of freedom has pdf \begin{equation} f_F(x)=\frac{\Gamma(\frac{p+q}{2})}{\Gamma(\frac{p}{2})\Gamma(\frac{q}{2})}\left(\frac{p}{q}\right)^{p/2}\frac{x^{p/2-1}}{[1+(p/q)x]^{(p+q)/2}},\quad 0<x<\infty \tag{2.7} \end{equation}
Example 2.1 (Expectation of F distribution) We compute the expectation of F_{n-1,m-1} as follows: \begin{equation} \begin{split} EF_{n-1,m-1}&=E\left(\frac{\chi^2_{n-1}/(n-1)}{\chi^2_{m-1}/(m-1)}\right)=E\left(\frac{\chi^2_{n-1}}{n-1}\right)E\left(\frac{m-1}{\chi^2_{m-1}}\right)\\ &=\left(\frac{n-1}{n-1}\right)\left(\frac{m-1}{m-3}\right)=\frac{m-1}{m-3} \end{split} \tag{2.8} \end{equation} Thus, for m>3, EF_{n-1,m-1}=\frac{m-1}{m-3}. Also from (2.8), dropping the expectations, for reasonably large m we have \frac{S_X^2/S_Y^2}{\sigma_X^2/\sigma_Y^2}\approx\frac{m-1}{m-3}\approx1, as expected.
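A brief numerical check of Example 2.1 (a sketch with arbitrarily chosen n and m, not part of the notes): the closed-form mean reported by scipy.stats.f and a Monte Carlo average should both be close to (m-1)/(m-3).

```python
# Check E F_{n-1, m-1} = (m-1)/(m-3) three ways: formula, scipy, simulation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, m = 6, 12                         # F with n-1 = 5 and m-1 = 11 degrees of freedom
dfn, dfd = n - 1, m - 1

print("theory     :", (m - 1) / (m - 3))
print("scipy mean :", stats.f.mean(dfn, dfd))
print("Monte Carlo:", rng.f(dfn, dfd, size=1_000_000).mean())
```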

Theorem 2.1 (Properties of the F Distribution) a. If X\sim F_{p,q}, then 1/X\sim F_{q,p}.

  b. If X\sim t_q, then X^2\sim F_{1,q}.

  c. If X\sim F_{p,q}, then \frac{(p/q)X}{1+(p/q)X}\sim Beta(p/2,q/2).

(This is from Exercise 5.17 and Exercise 5.18 in Casella and Berger (2002))

Proof. a. By definition, X=\frac{U/p}{V/q} with independent U\sim\chi^2_p and V\sim\chi^2_q. Therefore, 1/X=\frac{V/q}{U/p}, which follows F_{q,p} by definition.

  b. By definition, X=\frac{U}{\sqrt{V/q}} with independent U\sim N(0,1) and V\sim\chi^2_q. Therefore, X^2=\frac{U^2/1}{V/q}, which follows F_{1,q} by definition since U^2\sim\chi^2_1.

  c. This follows from a change of variables: let Y=\frac{(p/q)X}{1+(p/q)X}, so X=\frac{q}{p}\cdot\frac{Y}{1-Y}; substituting into (2.7) and multiplying by the Jacobian \frac{dX}{dY}=\frac{q}{p}\cdot\frac{1}{(1-Y)^2} yields the Beta(p/2,q/2) density. A numerical check is sketched below.
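The following numerical sketch (with arbitrarily chosen p and q, not a substitute for the change-of-variables argument) transforms simulated F_{p,q} draws and compares them with Beta(p/2,q/2) via a Kolmogorov-Smirnov test.

```python
# If X ~ F_{p,q}, then (p/q)X / (1 + (p/q)X) should match Beta(p/2, q/2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
p, q = 4, 7
x = rng.f(p, q, size=200_000)
y = (p / q) * x / (1 + (p / q) * x)

# A tiny KS distance / large p-value indicates agreement with Beta(p/2, q/2).
print(stats.kstest(y, "beta", args=(p / 2, q / 2)))
```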
Definition 2.3 (Order Statistics) The order statistics of a random sample X_1,\cdots,X_n are the sample values placed in ascending order, denoted X_{(1)},\cdots,X_{(n)}. It follows that X_{(1)}\leq\cdots\leq X_{(n)}.
There are some commonly used examples of order statistics, such as the sample range R:=X_{(n)}-X_{(1)} and the sample median, defined by \begin{equation} M=\begin{cases} X_{((n+1)/2)} & n \text{ odd}\\ \left(X_{(n/2)}+X_{(n/2+1)}\right)/2 & n \text{ even} \end{cases} \tag{2.9} \end{equation} The sample median is more robust than the sample mean. For 0\leq p\leq 1, the (100p)th sample percentile is the observation such that approximately np of the observations are less than it. The lower quartile is the 25th percentile and the upper quartile is the 75th percentile; their difference is termed the interquartile range, which is also a measure of dispersion.
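The sketch below (not part of the notes) computes these quantities for a small simulated sample; the normal sample and the odd sample size n = 9 are arbitrary choices.

```python
# Order statistics, sample range, sample median and interquartile range for one sample.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=9)                       # n = 9 (odd), arbitrary sample
x_order = np.sort(x)                         # X_(1) <= ... <= X_(n)
n = len(x_order)

sample_range = x_order[-1] - x_order[0]      # R = X_(n) - X_(1)
median = x_order[(n + 1) // 2 - 1]           # odd n: X_((n+1)/2), 0-based indexing
q1, q3 = np.percentile(x, [25, 75])          # lower and upper quartiles

print("order statistics:", np.round(x_order, 3))
print("range           :", round(sample_range, 3))
print("median          :", round(median, 3), "(np.median agrees:", np.median(x) == median, ")")
print("IQR             :", round(q3 - q1, 3))
```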
Theorem 2.2 (PDF of order statistics) Let X_1,\cdots,X_n be a random sample from a discrete distribution with Pr(X=x_i)=p_i, where x_1<x_2<\cdots denote the possible values of X in ascending order. Define P_0=0 and P_i=\sum_{j=1}^ip_j. Then \begin{align} &Pr(X_{(j)}\leq x_i)=\sum_{k=j}^n{n \choose k}P_i^k(1-P_i)^{n-k} \tag{2.10}\\ &P(X_{(j)}=x_i)=\sum_{k=j}^n{n \choose k}[P_i^k(1-P_i)^{n-k}-P_{i-1}^k(1-P_{i-1})^{n-k}] \tag{2.11} \end{align} If X has continuous cdf F_X and pdf f_X, then the pdf of the j-th order statistic is \begin{equation} f_{X_{(j)}}(x)=\frac{n!}{(j-1)!(n-j)!}f_{X}(x)[F_X(x)]^{j-1}[1-F_X(x)]^{n-j} \tag{2.12} \end{equation}

Proof. For fixed i, let Y be the random variable that counts the number of X_1,\cdots,X_n that are less than or equal to x_i. Then Y\sim Bin(n,P_i). The event \{X_{(j)}\leq x_i\} is equivalent to \{Y\geq j\}, so (2.10) is just the binomial probability P(Y\geq j)=P(X_{(j)}\leq x_i). Equation (2.11) is just the difference \begin{equation} P(X_{(j)}=x_i)=P(X_{(j)}\leq x_i)-P(X_{(j)}\leq x_{i-1}) \tag{2.13} \end{equation} with the exception of the case i=1, where P(X_{(j)}=x_1)=P(X_{(j)}\leq x_1).

For the continuous case, fix x and let Y count the number of X_1,\cdots,X_n that are less than or equal to x, so that Y\sim Bin(n,F_X(x)). Thus \begin{equation} F_{X_{(j)}}(x)=P(Y\geq j)=\sum_{k=j}^n{n \choose k}[F_X(x)]^k[1-F_X(x)]^{n-k} \tag{2.14} \end{equation} and the pdf of X_{(j)} is obtained by differentiating the cdf: \begin{equation} \begin{split} f_{X_{(j)}}(x)&=\frac{d}{dx}F_{X_{(j)}}(x)\\ &=\sum_{k=j}^n{n \choose k}\Big(k[F_X(x)]^{k-1}[1-F_X(x)]^{n-k}f_X(x)\\ &\quad-(n-k)[F_X(x)]^k[1-F_X(x)]^{n-k-1}f_X(x)\Big)\\ &={n \choose j}j[F_X(x)]^{j-1}[1-F_X(x)]^{n-j}f_X(x)\\ &\quad+\sum_{k=j+1}^n{n \choose k}k[F_X(x)]^{k-1}[1-F_X(x)]^{n-k}f_X(x)\\ &\quad-\sum_{k=j}^{n-1}{n \choose k}(n-k)[F_X(x)]^k[1-F_X(x)]^{n-k-1}f_X(x)\\ &=\frac{n!}{(j-1)!(n-j)!}f_X(x)[F_X(x)]^{j-1}[1-F_X(x)]^{n-j}\\ &\quad+\sum_{k=j}^{n-1}{n \choose {k+1}}(k+1)[F_X(x)]^k[1-F_X(x)]^{n-k-1}f_X(x)\\ &\quad-\sum_{k=j}^{n-1}{n \choose k}(n-k)[F_X(x)]^k[1-F_X(x)]^{n-k-1}f_X(x) \end{split} \tag{2.15} \end{equation} Noting that \begin{equation} {n \choose {k+1}}(k+1)=\frac{n!}{k!(n-k-1)!}={n \choose k}(n-k) \tag{2.16} \end{equation} the last two terms of (2.15) cancel and we are left with (2.12).
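As a sanity check of (2.12) (a simulation sketch, not part of the notes), take a Uniform(0,1) sample, for which F_X(x)=x and f_X(x)=1, so (2.12) reduces to the Beta(j, n-j+1) density; the values of n and j are arbitrary.

```python
# Compare simulated quantiles of the j-th order statistic of n Uniform(0,1) draws
# with the quantiles of Beta(j, n-j+1), which (2.12) predicts in this case.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, j = 7, 3
u = rng.uniform(size=(100_000, n))
x_j = np.sort(u, axis=1)[:, j - 1]           # j-th order statistic of each sample

qs = [0.1, 0.5, 0.9]
print("simulated:", np.round(np.quantile(x_j, qs), 3))
print("Beta     :", np.round(stats.beta.ppf(qs, j, n - j + 1), 3))
```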
Definition 2.4 (Convergence in Probability) A sequence of random variables X_1,X_2,\cdots converges in probability to a random variable X if, for every \epsilon>0, \lim_{n\to\infty}P(|X_n-X|\geq\epsilon)=0 or, equivalently, \lim_{n\to\infty}P(|X_n-X|<\epsilon)=1.
Note that this definition does not require independence! Convergence in probability is also referred to as weak convergence.
Theorem 2.3 (Weak Law of Large Numbers) Let X_1,X_2,\cdots be i.i.d. random variables with EX_i=\mu and Var(X_i)=\sigma^2<\infty. Define \bar{X}_n=\frac{1}{n}\sum_{i=1}^nX_i. Then \bar{X}_n converges in probability to \mu, i.e. \begin{equation} \lim_{n\to\infty}P(|\bar{X}_n-\mu|<\epsilon)=1,\quad \forall \epsilon>0 \tag{2.17} \end{equation}
Proof. By Chebyshev's inequality, \begin{equation} P(|\bar{X}_n-\mu|\geq\epsilon)=P((\bar{X}_n-\mu)^2\geq\epsilon^2)\leq\frac{E(\bar{X}_n-\mu)^2}{\epsilon^2}=\frac{Var(\bar{X}_n)}{\epsilon^2}=\frac{\sigma^2}{n\epsilon^2} \tag{2.18} \end{equation} Hence, P(|\bar{X}_n-\mu|<\epsilon)=1-P(|\bar{X}_n-\mu|\geq\epsilon)\geq1-\frac{\sigma^2}{n\epsilon^2}\to1 as n\to\infty.
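The sketch below (not from the notes) illustrates the Weak Law and the Chebyshev bound (2.18) by estimating P(|\bar{X}_n-\mu|\geq\epsilon) by simulation; the Exponential(1) population (so \mu=\sigma^2=1) and the value of \epsilon are arbitrary choices.

```python
# Empirical P(|Xbar_n - mu| >= eps) versus the Chebyshev bound sigma^2 / (n eps^2).
import numpy as np

rng = np.random.default_rng(5)
mu, sigma2, eps, reps = 1.0, 1.0, 0.1, 20_000

for n in [10, 100, 1000, 10000]:
    xbar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
    empirical = np.mean(np.abs(xbar - mu) >= eps)
    bound = min(1.0, sigma2 / (n * eps**2))
    print(f"n={n:6d}  empirical={empirical:.4f}  Chebyshev bound={bound:.4f}")
```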
The property that a sequence of the “same” sample quantity approaches a constant as n\to\infty is known as consistency.

Theorem 2.4 Suppose X_1,X_2,\cdots converges in probability to a random variable X and h is a continuous function. Then h(X_1),h(X_2),\cdots converges in probability to h(X).

(This is from Exercise 5.39 in Casella and Berger (2002))
Proof. Fix \epsilon>0. Since h is continuous, there exists \delta>0 such that |h(y)-h(x)|<\epsilon whenever |y-x|<\delta (for this sketch we take \delta uniform in x, as when h is uniformly continuous; the general case follows by first restricting to a large compact set that contains X with probability close to 1). Since X_1,X_2,\cdots converges in probability to X, P(|X_n-X|<\delta)\to1. Thus 1\geq P(|h(X_n)-h(X)|<\epsilon)\geq P(|X_n-X|<\delta)\to1 as n\to\infty, so P(|h(X_n)-h(X)|<\epsilon)\to1, as desired.
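A quick illustration of Theorem 2.4 (a sketch with the arbitrary choice h(x)=e^x): \bar{X}_n\to\mu in probability by the Weak Law, so e^{\bar{X}_n}\to e^{\mu} in probability as well.

```python
# P(|h(Xbar_n) - h(mu)| >= eps) should shrink to 0 as n grows, for continuous h.
import numpy as np

rng = np.random.default_rng(6)
mu, eps, reps = 0.5, 0.05, 20_000
h = np.exp                                   # an arbitrary continuous function

for n in [10, 100, 1000, 10000]:
    xbar = rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1)
    print(n, np.mean(np.abs(h(xbar) - h(mu)) >= eps))
```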
Definition 2.5 (Almost Sure Convergence) A sequence of random variables X_1,X_2,\cdots converges almost surely to a r.v. X if, for every \epsilon>0, \begin{equation} P(\lim_{n\to\infty}|X_n-X|<\epsilon)=1 \tag{2.19} \end{equation}
Almost sure convergence is sometimes also referred to as convergence with probability 1 or strong convergence. It is strong in the sense that it implies convergence in probability, while the converse fails: a sequence of random variables can converge in probability while NOT converging almost surely. One example: let the sample space S be the closed interval [0,1] with the uniform probability distribution. Define X_1(s)=s+I_{[0,1]}(s), X_2(s)=s+I_{[0,\frac{1}{2}]}(s), X_3(s)=s+I_{[\frac{1}{2},1]}(s), X_4(s)=s+I_{[0,\frac{1}{3}]}(s), X_5(s)=s+I_{[\frac{1}{3},\frac{2}{3}]}(s), X_6(s)=s+I_{[\frac{2}{3},1]}(s), etc., and let X(s)=s. Then X_n converges to X in probability but not almost surely.
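The sketch below (not part of the notes) implements this sliding-indicator construction: the length of the set A_n on which X_n(s)=s+1 shrinks to 0, which gives convergence in probability, yet any fixed s falls into A_n once in every block, so X_n(s) fails to converge for every s.

```python
# The "sliding interval" sequence: block k consists of the k intervals [i/k, (i+1)/k].
def intervals(num_terms):
    """Return the sets A_1, A_2, ..., A_{num_terms} as (left, right) pairs."""
    out, k = [], 1
    while len(out) < num_terms:
        out.extend((i / k, (i + 1) / k) for i in range(k))
        k += 1
    return out[:num_terms]

s = 0.37                                     # an arbitrary fixed sample point
A = intervals(1000)

lengths = [right - left for left, right in A]        # P(|X_n - X| >= eps) for eps < 1
hits = sum(left <= s <= right for left, right in A)  # times X_n(s) = s + 1 so far

print("last few interval lengths:", [round(v, 4) for v in lengths[-3:]])  # -> 0
print("number of n <= 1000 with X_n(s) = s + 1:", hits)  # grows without bound in n
```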
Theorem 2.5 (Strong Law of Large Numbers) Let X_1,X_2,\cdots be i.i.d. random variables with EX_i=\mu and Var(X_i)=\sigma^2<\infty. Define \bar{X}_n=\frac{1}{n}\sum_{i=1}^nX_i. Then \bar{X}_n converges to \mu almost surely, i.e. \begin{equation} P(\lim_{n\to\infty}|\bar{X}_n-\mu|<\epsilon)=1,\quad \forall \epsilon>0 \tag{2.20} \end{equation}
In both the Weak and the Strong Law of Large Numbers we assumed a finite variance, which is not actually required; the only moment condition needed is E|X_i|<\infty. See theoretical probability books for details.
Definition 2.6 (Convergence in Distribution) A sequence of random variables X_1,X_2,\cdots converges in distribution to a r.v. X if \begin{equation} \lim_{n\to\infty}F_{X_n}(x)=F_X(x) \tag{2.21} \end{equation} at all points x where F_X(x) is continuous.
Convergence in distribution is actually convergence of cdfs. It is fundamentally different from convergence in probability and almost sure convergence, which concern convergence of the random variables themselves.

Theorem 2.6 If the sequence of random variables X_1,X_2,\cdots converges in probability to a random variable X, the sequence also converges in distribution to X.

(This is Exercise 5.40 in Casella and Berger (2002))

Proof. First we prove the following lemma: for any random variables X,Y on a sample space S, any real number a and any \epsilon>0, we have P(Y\leq a)\leq P(X\leq a+\epsilon)+P(|Y-X|>\epsilon). To prove it, denote S_1:=\{s\in S:Y(s)\leq a\}, S_2:=\{s\in S: |Y(s)-X(s)|\leq\epsilon\} and S_3:=\{s\in S: X(s)\leq a+\epsilon\}. Since Y\leq a and |Y-X|\leq\epsilon together imply X\leq a+\epsilon, we have S_1\subset S_2^c\cup S_3, and thus P(Y\leq a)\leq P(X\leq a+\epsilon)+P(|Y-X|>\epsilon). The lemma is proved.

Then for any fixed t at which F_X is continuous and any \epsilon>0, it follows from the lemma that \begin{align} &P(X \leq t-\epsilon)\leq P(X_n\leq t)+P(|X_n-X|>\epsilon) \tag{2.22}\\ &P(X_n \leq t)\leq P(X\leq t+\epsilon)+P(|X_n-X|>\epsilon) \tag{2.23} \end{align} Therefore, P(X \leq t-\epsilon)-P(|X_n-X|>\epsilon)\leq P(X_n \leq t)\leq P(X\leq t+\epsilon)+P(|X_n-X|>\epsilon). Letting n\to\infty and using P(|X_n-X|>\epsilon)\to0, we have P(X \leq t-\epsilon)\leq \lim_{n\to\infty}F_{X_n}(t)\leq P(X\leq t+\epsilon) for every \epsilon>0. Since F_X is by assumption continuous at t, letting \epsilon\to0 finally gives \lim_{n\to\infty}F_{X_n}(t)=F_X(t), as desired.

One special case in which the converse of Theorem 2.6 holds is stated below.

Theorem 2.7 The sequence of random variables X_1,X_2,\cdots converges in probability to a constant \mu iff the sequence also converges in distribution to \mu. That is, \begin{equation} P(|X_n-\mu|>\epsilon)\to 0\quad\forall\epsilon>0\iff P(X_n\leq x)\to\left\{ \begin{aligned} &0 & \text{if}\ x<\mu \\ &1 & \text{if}\ x>\mu \end{aligned} \right. \tag{2.24} \end{equation}

(This is Exercise 5.41 in Casella and Berger (2002))

Proof. (\Longrightarrow) Set \epsilon=|x-\mu|>0. If x>\mu, then the set S_1:=\{s\in S: |X_n(s)-\mu|\leq\epsilon\} is contained in the set S_2:=\{s\in S: X_n(s)\leq x\}. Therefore, 1\geq P(X_n\leq x)\geq P(|X_n-\mu|\leq\epsilon)\to1 as n\to\infty. On the other hand, if x<\mu, then the set S_1^*:=\{s\in S: |X_n(s)-\mu|\geq\epsilon\} contains the set S_2, which indicates 0\leq P(X_n\leq x)\leq P(|X_n-\mu|\geq\epsilon)\to0 as n\to\infty. Hence, the \Longrightarrow part is proved.

(\Longleftarrow) For any \epsilon>0, it follows that \begin{equation} \begin{split} 0&\leq P(|X_n-\mu|>\epsilon)\\ &\leq P(X_n-\mu<-\epsilon)+P(X_n-\mu>\epsilon)\\ &=P(X_n<\mu-\epsilon)+P(X_n>\mu+\epsilon)\\ &=P(X_n<\mu-\epsilon)+1-P(X_n\leq\mu+\epsilon) \end{split} \tag{2.25} \end{equation} Since \mu-\epsilon<\mu and \mu+\epsilon>\mu, the assumed convergence in distribution gives P(X_n<\mu-\epsilon)\leq P(X_n\leq\mu-\epsilon/2)\to0 and P(X_n\leq\mu+\epsilon)\to1 as n\to\infty, hence P(|X_n-\mu|>\epsilon)\to 0, as desired.
Theorem 2.8 (Central Limit Theorem) Let X_1,X_2,\cdots be a sequence of i.i.d. random variables whose mgfs exist in a neighborhood of 0. Let EX_i=\mu and Var(X_i)=\sigma^2>0. (Both are finite since the mgf exists.) Let G_n(x) denote the cdf of \frac{\sqrt{n}(\bar{X}_n-\mu)}{\sigma}. Then for any -\infty<x<\infty, \begin{equation} \lim_{n\to\infty}G_n(x)=\int_{-\infty}^x\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}}dy \tag{2.26} \end{equation} that is, \frac{\sqrt{n}(\bar{X}_n-\mu)}{\sigma} converges in distribution to a standard normal random variable.

Proof. We show this via mgfs: for |t|<h, the mgf of \sqrt{n}(\bar{X}_n-\mu)/\sigma converges to e^{t^2/2}, the mgf of a standard normal random variable.

Define Y_i=\frac{X_i-\mu}{\sigma} and let M_Y(t) denote the common mgf of the Y_i's. Then \begin{equation} \frac{\sqrt{n}(\bar{X}_n-\mu)}{\sigma}=\frac{1}{\sqrt{n}}\sum_{i=1}^nY_i \tag{2.27} \end{equation} and from the properties of mgfs, \begin{equation} \begin{split} M_{\sqrt{n}(\bar{X}_n-\mu)/\sigma}(t)&=M_{\sum_{i=1}^nY_i/\sqrt{n}}(t)\\ &=M_{\sum_{i=1}^nY_i}(\frac{t}{\sqrt{n}})\\ &=[M_Y(\frac{t}{\sqrt{n}})]^n \end{split} \tag{2.28} \end{equation} We now expand M_Y(t/\sqrt{n}) in a Taylor series around 0: \begin{equation} M_Y(t/\sqrt{n})=\sum_{k=0}^{\infty}M_Y^{(k)}(0)\frac{(t/\sqrt{n})^k}{k!} \tag{2.29} \end{equation} where M_Y^{(k)}(0)=(d^k/dt^k)M_Y(t)|_{t=0}. Using the facts that M_Y^{(0)}(0)=1, M_Y^{(1)}(0)=EY=0 and M_Y^{(2)}(0)=EY^2=Var(Y)=1, we have \begin{equation} M_Y(\frac{t}{\sqrt{n}})=1+\frac{(t/\sqrt{n})^2}{2!}+R_Y(\frac{t}{\sqrt{n}}) \tag{2.30} \end{equation} For fixed t\neq0, the remainder R_Y(\frac{t}{\sqrt{n}}) collects the terms in \frac{t}{\sqrt{n}} of order higher than 2, so \begin{equation} \lim_{n\to\infty}\frac{R_Y(\frac{t}{\sqrt{n}})}{(\frac{t}{\sqrt{n}})^2}=0 \tag{2.31} \end{equation} Since t is fixed, we also have \begin{equation} \lim_{n\to\infty}\frac{R_Y(\frac{t}{\sqrt{n}})}{(\frac{1}{\sqrt{n}})^2}=\lim_{n\to\infty}nR_Y(\frac{t}{\sqrt{n}})=0 \tag{2.32} \end{equation} which is also true at t=0. Thus, for any fixed t, \begin{equation} \begin{split} \lim_{n\to\infty}(M_Y(\frac{t}{\sqrt{n}}))^n&=\lim_{n\to\infty}[1+\frac{(t/\sqrt{n})^2}{2!}+R_Y(\frac{t}{\sqrt{n}})]^n\\ &=\lim_{n\to\infty}[1+\frac{1}{n}(\frac{t^2}{2}+nR_Y(\frac{t}{\sqrt{n}}))]^n\\ &=e^{t^2/2} \end{split} \tag{2.33} \end{equation} where the last step uses the standard limit \lim_{n\to\infty}(1+\frac{a_n}{n})^n=e^a whenever a_n\to a, as desired.
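A simulation sketch of the CLT (not part of the proof; the Exponential(1) population and the sample sizes are arbitrary choices): the standardized sample means approach the standard normal quantiles as n grows.

```python
# Quantiles of sqrt(n)(Xbar_n - mu)/sigma versus N(0,1) quantiles.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
mu, sigma = 1.0, 1.0                          # mean and sd of Exponential(1)
qs = [0.05, 0.25, 0.5, 0.75, 0.95]

for n in [5, 30, 200]:
    xbar = rng.exponential(1.0, size=(50_000, n)).mean(axis=1)
    z = np.sqrt(n) * (xbar - mu) / sigma
    print(f"n={n:4d}", np.round(np.quantile(z, qs), 2))
print("N(0,1)", np.round(stats.norm.ppf(qs), 2))
```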

The CLT describes the limiting distribution of the sample mean. It can be shown that the only two constraints really required are independence and a finite variance, and normality still emerges. The CLT shows that we can use the normal distribution to approximate other distributions, although the quality of the approximation differs from case to case.

We conclude this chapter with a useful theorem, stated without proof; a small simulation sketch follows its statement.

Theorem 2.9 (Slutsky's Theorem) If X_n\to X in distribution and Y_n\to a in probability, where a is a constant, then

  1. Y_nX_n\to aX in distribution;

  2. X_n+Y_n\to X+a in distribution.
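The simulation sketch below (not part of the notes) illustrates part 1 with the usual t-statistic: \sqrt{n}(\bar{X}_n-\mu)/\sigma\to N(0,1) in distribution by the CLT and \sigma/S\to1 in probability, so by Slutsky their product \sqrt{n}(\bar{X}_n-\mu)/S also converges in distribution to N(0,1); the Exponential(1) population and sample sizes are arbitrary choices.

```python
# Quantiles of sqrt(n)(Xbar_n - mu)/S versus N(0,1) quantiles.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
mu = 1.0                                      # mean of Exponential(1)
qs = [0.05, 0.5, 0.95]

for n in [20, 200, 2000]:
    x = rng.exponential(1.0, size=(50_000, n))
    t = np.sqrt(n) * (x.mean(axis=1) - mu) / x.std(axis=1, ddof=1)
    print(f"n={n:5d}", np.round(np.quantile(t, qs), 2))
print("N(0,1)", np.round(stats.norm.ppf(qs), 2))
```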

References

Casella, George, and Roger Berger. 2002. Statistical Inference. 2nd ed. Belmont, CA: Duxbury Resource Center.