Chapter 1 Random Samples, Special Distributions (Lecture on 01/07/2020)

Often, the data collected in an experiment consist of several observations on a variable of interest. Random sampling is the model for data collection that is often used to describe this situation.

Definition 1.1 (Random Sample) The random variables \(X_1,\cdots,X_n\) are called a random sample of size n from the population \(f(x)\) if \(X_1,\cdots,X_n\) are mutually independent random variables and the marginal pdf or pmf of each \(X_i\) is the same function \(f(x)\). Alternatively, \(X_1,\cdots,X_n\) are called independent and identically distributed random variables with pdf or pmf \(f(x)\). This is commonly abbreviated to i.i.d. random variables.

The joint pdf or pmf of \(X_1,\cdots,X_n\) is given by

\[\begin{equation} f(x_1,\cdots,x_n)=f(x_1)f(x_2)\cdots f(x_n)=\prod_{i=1}^nf(x_i) \tag{1.1} \end{equation}\]

Since \(X_1,\cdots,X_n\) are identically distributed, all the marginal densities \(f(x)\) are the same function. Furthermore, if the population pdf or pmf is a member of a parametric family, with pdf or pmf given by \(f(x|\theta)\), then the joint pdf or pmf is

\[\begin{equation} f(x_1,\cdots,x_n|\theta)=\prod_{i=1}^nf(x_i|\theta) \tag{1.2} \end{equation}\]
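As a small numerical illustration of (1.2) (not part of the lecture, and with the \(N(\theta,1)\) family, sample values, and \(\theta\) below chosen arbitrarily), the following Python sketch evaluates the joint density both as a product of marginals and via the log-likelihood:

```python
import numpy as np
from scipy.stats import norm

x = np.array([0.3, -1.2, 0.8, 2.1])   # hypothetical observed sample
theta = 0.5                            # hypothetical parameter value

# joint pdf as the product of the N(theta, 1) marginal densities, as in (1.2)
joint_pdf = np.prod(norm.pdf(x, loc=theta, scale=1.0))
# equivalently, exponentiate the sum of the log marginal densities
log_lik = np.sum(norm.logpdf(x, loc=theta, scale=1.0))

print(joint_pdf, np.exp(log_lik))      # the two values agree
```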

Random sampling is also referred to as infinite population sampling. If you sample \(X_1,\cdots,X_n\) sequentially, the independence assumption means that the observed value \(x_1\) of \(X_1\) will not influence the observed value \(x_2\) of \(X_2\): "removing" \(x_1\) from an infinite population does not change the population.
When sampling is from a finite population, it may or may not be a random sample. If the sampling is done with replacement, it is a random sample. If it is done without replacement, it is not a random sample because the independence assumption in Definition 1.1 is violated: \(P(X_2=y|X_1=y)=0\) while \(P(X_2=y|X_1=x)\neq 0\) for \(x\neq y\), which means the conditional distribution of \(X_2\) does depend on the value of \(X_1\). However, \(X_1,\cdots,X_n\) are still identically distributed, which can be proved by the law of total probability. This kind of sampling is sometimes called simple random sampling. If the population size N is large compared to the sample size n, the samples are nearly independent and probabilities can be approximated by assuming independence.
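The following simulation sketch (an informal check, with a population of size N = 10 chosen arbitrarily) illustrates the point: under sampling without replacement the two draws are still identically distributed but are negatively correlated, whereas with replacement they are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)
population = np.arange(10)              # a small finite population, N = 10
reps = 50_000

with_rep = rng.choice(population, size=(reps, 2), replace=True)
without_rep = np.array([rng.choice(population, size=2, replace=False)
                        for _ in range(reps)])

# marginal means of X1 and X2 match under both schemes (identically distributed)
print(with_rep.mean(axis=0), without_rep.mean(axis=0))
# but only sampling without replacement induces correlation (about -1/(N-1))
print(np.corrcoef(with_rep.T)[0, 1], np.corrcoef(without_rep.T)[0, 1])
```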

When a sample \(X_1,\cdots,X_n\) is drawn, some summary of the values is usually computed. Any well-defined summary may be expressed mathematically as a function \(T(X_1,\cdots,X_n)\) whose domain includes the sample space of the random vector \((X_1,\cdots,X_n)\) . The function T may be real-valued or vector-valued; thus the summary is a random variable (or vector), \(Y=T(X_1,\cdots,X_n)\).

Definition 1.2 (Statistic) Let \(X_1,\cdots,X_n\) be a random sample of size n from the population and let \(T(X_1,\cdots,X_n)\) be a real-valued or vector-valued function whose domain includes the sample space of \((X_1,\cdots,X_n)\). Then the random variable or random vector \(Y=T(X_1,\cdots,X_n)\) is called a statistic. The probability distribution of a statistic Y is called the sampling distribution of Y.

The only restriction on a statistic is that it cannot be a function of unknown parameters. The sample mean, sample variance, and sample standard deviation are often used and provide good summaries of the sample.

Definition 1.3 (Sample Mean) The sample mean is defined as \[\begin{equation} \bar{X}=\frac{X_1+\cdots+X_n}{n}=\frac{1}{n}\sum_{i=1}^nX_i \tag{1.3} \end{equation}\]
Definition 1.4 (Sample Variance and Standard Deviation) The sample variance is defined as \[\begin{equation} S^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2 \tag{1.4} \end{equation}\] And the sample standard deviation is defined as \[\begin{equation} S=\sqrt{S^2} \tag{1.5} \end{equation}\]

The sample mean minimizes the total squared deviation, i.e. \[\begin{equation} \min_{a}\sum_{i=1}^n(x_i-a)^2=\sum_{i=1}^n(x_i-\bar{x})^2 \tag{1.6} \end{equation}\] (1.6) can easily be proved by the classic trick of adding and subtracting \(\bar{x}\) inside the brackets, and then applying another classic property of the sample mean: \[\begin{equation} \sum_{i=1}^n(x_i-\bar{x})=0 \tag{1.7} \end{equation}\] Another useful identity relating the sample mean and variance is: \[\begin{equation} (n-1)s^2=\sum_{i=1}^nx_i^2-n\bar{x}^2 \tag{1.8} \end{equation}\]
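These identities are easy to check numerically; the sketch below (with an arbitrary simulated sample) verifies (1.6)-(1.8), using the n-1 divisor of Definition 1.4 for the sample variance.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=20)                 # an arbitrary sample
n, xbar = len(x), x.mean()

# (1.7): deviations about the sample mean sum to zero (up to rounding)
print(np.isclose(np.sum(x - xbar), 0.0))

# (1.6): the sum of squares about xbar is minimal over a grid of centers a
a_grid = np.linspace(xbar - 2, xbar + 2, 401)
sums = [np.sum((x - a) ** 2) for a in a_grid]
print(np.isclose(a_grid[np.argmin(sums)], xbar, atol=0.02))

# (1.8): (n-1) s^2 = sum(x_i^2) - n * xbar^2, with s^2 using the n-1 divisor
print(np.isclose((n - 1) * x.var(ddof=1), np.sum(x**2) - n * xbar**2))
```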

Lemma 1.1 Let \(X_1,\cdots,X_n\) be a random sample from a population, and let \(g(x)\) be a function such that \(E(g(X_1))\) and \(Var(g(X_1))\) exist. Then \[\begin{equation} \begin{split} &E(\sum_{i=1}^n g(X_i))=n(E(g(X_1)))\\ &Var(\sum_{i=1}^n g(X_i))=n(Var(g(X_1))) \end{split} \tag{1.9} \end{equation}\]

Proof. The first part of (1.9) follows easily from the linearity of expectation. To prove the second part, note that \[\begin{equation} \begin{split} Var(\sum_{i=1}^n g(X_i))&=E[\sum_{i=1}^n g(X_i)-E(\sum_{i=1}^n g(X_i))]^2\\ &=E[\sum_{i=1}^n(g(X_i)-E g(X_i))]^2 \end{split} \tag{1.10} \end{equation}\]

Notice that when the square in (1.10) is expanded, there are n terms of the form \((g(X_i)-E g(X_i))^2, i=1,\cdots,n\), and the expectation of each is just \(Var(g(X_1))\). The remaining terms are all of the form \((g(X_i)-E g(X_i))(g(X_j)-E g(X_j)), i\neq j\), whose expectation is \(Cov(g(X_i),g(X_j))=0\) by independence.
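As an informal Monte Carlo check of Lemma 1.1 (with \(g(x)=x^2\) and an Exp(1) population chosen arbitrarily), the mean and variance of \(\sum_i g(X_i)\) should be roughly n times those of \(g(X_1)\):

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 5, 400_000
g = lambda x: x**2                      # an arbitrary choice of g

samples = rng.exponential(scale=1.0, size=(reps, n))
T = g(samples).sum(axis=1)              # sum_i g(X_i), one value per replication
gX1 = g(samples[:, 0])                  # g(X_1) alone

print(T.mean(), n * gX1.mean())         # approximately equal
print(T.var(), n * gX1.var())           # approximately equal
```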

Theorem 1.1 Let \(X_1,\cdots,X_n\) be a random sample from a population with mean \(\mu\) and variance \(\sigma^2<\infty\). Then

  1. \(E\bar{X}=\mu\)
  2. \(Var(\bar{X})=\frac{\sigma^2}{n}\)
  3. \(ES^2=\sigma^2\)
(The sample mean and sample variance are unbiased estimators!)

Proof. For (a), let \(g(X_i)=X_i/n\), so \(Eg(X_i)=\mu/n\), then apply Lemma 1.1.

For (b), \(Var(g(X_i))=\sigma^2/n^2\), then by Lemma 1.1, \(Var(\bar{X})=\frac{\sigma^2}{n}\).

Finally for (c), we have \[\begin{equation} \begin{split} ES^2&=E(\frac{1}{n-1}[\sum_{i=1}^nX_i^2-n\bar{X}^2])\\ &=\frac{1}{n-1}(nEX_1^2-nE\bar{X}^2)\\ &=\frac{1}{n-1}(n(\sigma^2+\mu^2)-n(\frac{\sigma^2}{n}+\mu^2))=\sigma^2 \end{split} \end{equation}\] where the last step uses the fact that \(EY^2=Var(Y)+(EY)^2\) for any random variable Y.
From (a) and (c) of Theorem 1.1, the sample mean and sample variance are unbiased estimators of the population mean and variance, respectively.
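A simulation sketch of Theorem 1.1 (with a Gamma population chosen arbitrarily, so that \(\mu=6\) and \(\sigma^2=12\)): averaging over many samples, \(\bar{X}\) and \(S^2\) should be close to \(\mu\) and \(\sigma^2\), and the variance of \(\bar{X}\) close to \(\sigma^2/n\).

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 10, 200_000
shape, scale = 3.0, 2.0                 # population mean 6, variance 12
X = rng.gamma(shape, scale, size=(reps, n))

xbar = X.mean(axis=1)
s2 = X.var(axis=1, ddof=1)

print(xbar.mean(), shape * scale)            # E[Xbar] ~ mu
print(xbar.var(), shape * scale**2 / n)      # Var(Xbar) ~ sigma^2 / n
print(s2.mean(), shape * scale**2)           # E[S^2] ~ sigma^2
```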

Theorem 1.2 Let \(X_1,\cdots,X_n\) be a random sample from a population with pdf \(f_X(x)\), and let \(\bar{X}\) denote the sample mean. Then, regardless of whether the mgf of X exists, \[\begin{equation} f_{\bar{X}}(x)=nf_{X_1+\cdots+X_n}(nx) \tag{1.11} \end{equation}\]

Furthermore, if the mgf of X does exist, denoted as \(M_X(t)\), then \[\begin{equation} M_{\bar{X}}(t)=[M_X(\frac{t}{n})]^n \tag{1.12} \end{equation}\]

(This theorem combines Exercise 5.5 and Theorem 5.2.7 of Casella and Berger (2002))
Proof. Let \(Y=X_1+\cdots+X_n\), then \(Y=n\bar{X}\), and \[\begin{equation} f_{\bar{X}}(x)=f_Y(nx)\left|\frac{dy}{dx}\right|=nf_Y(nx) \end{equation}\] where \(y=nx\) is the inverse of the transformation \(\bar{x}=y/n\). For mgfs, \[\begin{equation} M_{\bar{X}}(t)=Ee^{t\bar{X}}=Ee^{t(X_1+\cdots+X_n)/n}=Ee^{(t/n)Y}=[M_X(\frac{t}{n})]^n \end{equation}\] where the last step uses the i.i.d. property of random samples.
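As a quick numerical sketch of (1.11) (with an Exp(1) population and n = 5 chosen arbitrarily): here \(X_1+\cdots+X_n\sim\) Gamma(n, 1) and \(\bar{X}\sim\) Gamma(n, scale 1/n), so the two sides of (1.11) can be compared directly.

```python
import numpy as np
from scipy.stats import gamma

n = 5
x = np.array([0.3, 0.8, 1.5])               # arbitrary evaluation points

lhs = gamma.pdf(x, a=n, scale=1 / n)        # pdf of Xbar directly
rhs = n * gamma.pdf(n * x, a=n)             # n * f_{X_1+...+X_n}(n x), as in (1.11)
print(np.allclose(lhs, rhs))                # True
```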
The convolution formula is useful for finding the pdf of \(\bar{X}\). If X and Y are independent random variables with pdfs \(f_X(x)\) and \(f_Y(y)\), then the pdf of \(Z=X+Y\) is \[\begin{equation} f_Z(z)=\int_{-\infty}^{+\infty}f_X(\omega)f_Y(z-\omega)d\omega \end{equation}\]
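A minimal numerical sketch of the convolution formula (with X and Y both Exp(1), an arbitrary choice): the integral is computed with scipy and compared against the known Gamma(2, 1) density of the sum.

```python
import numpy as np
from scipy import integrate
from scipy.stats import expon, gamma

def conv_pdf(z):
    # both densities vanish for negative arguments, so integrate over [0, z]
    val, _ = integrate.quad(lambda w: expon.pdf(w) * expon.pdf(z - w), 0.0, z)
    return val

for z in [0.5, 1.0, 2.5]:
    print(conv_pdf(z), gamma.pdf(z, a=2))   # the two columns should agree
```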

Turning to special distributions, the first and most important one to consider is the multivariate normal distribution (MVN for short).

Definition 1.5 (Multivariate Normal Distribution) Let \(\mu\in\mathbb{R}^p\) and let \(\Sigma\in\mathbb{R}^{p\times p}\) be positive definite. A random vector \(X\in\mathbb{R}^p\) has a p-variate normal distribution with mean \(\mu\) and covariance matrix \(\Sigma\) if it has pdf \[\begin{equation} f(\mathbf{x})=|2\pi\Sigma|^{-\frac{1}{2}}\exp[-\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^T\Sigma^{-1}(\mathbf{x}-\mathbf{\mu})] \tag{1.13} \end{equation}\] for \(\mathbf{x}\in\mathbb{R}^p\), and we write \(X\sim N_p(\mathbf{\mu},\Sigma)\).
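The density formula (1.13) can be checked against scipy's implementation; the sketch below uses an arbitrary \(\mu\), \(\Sigma\), and evaluation point.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.3, -1.5])                   # an arbitrary evaluation point

d = x - mu
quad_form = d @ np.linalg.solve(Sigma, d)   # (x - mu)^T Sigma^{-1} (x - mu)
pdf_by_formula = np.linalg.det(2 * np.pi * Sigma) ** (-0.5) * np.exp(-0.5 * quad_form)

print(pdf_by_formula, multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # agree
```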

Recall that the moment generating function and the characteristic function of a random vector \(X\) are defined as (1.14) and (1.15), respectively. \[\begin{equation} M_X(\mathbf{t})=E(e^{\mathbf{t}^TX}) \tag{1.14} \end{equation}\] \[\begin{equation} \Phi_X(\mathbf{t})=E(e^{i\mathbf{t}^TX}) \tag{1.15} \end{equation}\] If X and Y are independent, then we have the following properties for the mgf and characteristic function \[\begin{align} &M_{X+Y}(\mathbf{t})=M_X(\mathbf{t}) \cdot M_Y(\mathbf{t}) \tag{1.16} \\ &\Phi_{X+Y}(\mathbf{t})=\Phi_X(\mathbf{t}) \cdot \Phi_Y(\mathbf{t}) \tag{1.17} \end{align}\] Finally, the mgf and characteristic function of a multivariate normally distributed random vector \(X\) are given by \[\begin{align} &M_{X}(\mathbf{t})=\exp(\mathbf{t}^T\mathbf{\mu}+\frac{1}{2}\mathbf{t}^T\Sigma\mathbf{t}) \tag{1.18} \\ &\Phi_{X}(\mathbf{t})=\exp(i\mathbf{t}^T\mathbf{\mu}-\frac{1}{2}\mathbf{t}^T\Sigma\mathbf{t}) \tag{1.19} \end{align}\]
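Formula (1.18) can also be checked by Monte Carlo: the empirical average of \(e^{\mathbf{t}^TX}\) over many draws should approach \(\exp(\mathbf{t}^T\mathbf{\mu}+\frac{1}{2}\mathbf{t}^T\Sigma\mathbf{t})\). The \(\mu\), \(\Sigma\), and \(\mathbf{t}\) below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)
mu = np.array([0.5, -0.5])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.8]])
t = np.array([0.4, -0.2])

X = rng.multivariate_normal(mu, Sigma, size=1_000_000)   # rows are draws of X
mc_mgf = np.exp(X @ t).mean()                            # Monte Carlo estimate of E[exp(t^T X)]
exact_mgf = np.exp(t @ mu + 0.5 * t @ Sigma @ t)         # formula (1.18)
print(mc_mgf, exact_mgf)                                 # approximately equal
```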

Theorem 1.3 Suppose \(X\sim N_p(\mathbf{\mu},\Sigma)\). Then for any matrix \(B\in\mathbb{R}^{k\times p}\) with rank \(k\leq p\), the random vector \(Y=BX\) satisfies \(Y\sim N_k(B\mathbf{\mu},B\Sigma B^T)\).

(This theorem is Theorem 4.4a of Rencher and Schaalje (2007))

Proof. The mgf of \(Y\) is by definition

\[\begin{equation} M_Y(t)=E(e^{\mathbf{t}^TY})=E(e^{\mathbf{t}^TBX})=M_X(B^T\mathbf{t}) \tag{1.20} \end{equation}\] From (1.18) we have the form of \(M_X(t)\), therefore \[\begin{equation} M_Y(t)=\exp(\mathbf{t}^TB\mathbf{\mu}+\frac{1}{2}\mathbf{t}^TB\Sigma B^T\mathbf{t}) \tag{1.21} \end{equation}\] which is the mgf of \(N_k(B\mathbf{\mu},B\Sigma B^T)\). Thus, the theorem is proved.
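An informal simulation check of Theorem 1.3, with \(\mu\), \(\Sigma\), and B chosen arbitrarily (k = 2, p = 3): the sample mean and covariance of Y = BX should approach \(B\mathbf{\mu}\) and \(B\Sigma B^T\).

```python
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([1.0, 0.0, -1.0])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.5]])
B = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, -1.0]])            # k = 2, p = 3, full row rank

X = rng.multivariate_normal(mu, Sigma, size=200_000)     # rows are draws of X
Y = X @ B.T                                              # rows are draws of Y = BX

print(Y.mean(axis=0), B @ mu)                            # ~ B mu
print(np.cov(Y, rowvar=False), B @ Sigma @ B.T)          # ~ B Sigma B^T
```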
Definition 1.6 (Chi-square distribution) Another important special distribution is the chi-square distribution, whose pdf is given by \[\begin{equation} f(x)=\frac{1}{\Gamma(p/2)2^{p/2}}x^{\frac{p}{2}-1}e^{-x/2}, \quad x>0 \tag{1.22} \end{equation}\] where \(p\) is called the degrees of freedom. A chi-square random variable with p degrees of freedom has the same distribution as the sum of squares of p independent standard normal random variables.

Notice that the chi-square distribution can be viewed as a Gamma distribution with shape parameter \(\alpha=\frac{p}{2}\) and rate parameter \(\beta=\frac{1}{2}\). Therefore, the mean and variance of \(X\sim\chi_p^2\) are \(\alpha/\beta=p\) and \(\alpha/\beta^2=2p\), respectively.
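A brief check of this Gamma connection with scipy (which parameterizes the Gamma by shape and scale, so rate 1/2 corresponds to scale 2); the degrees of freedom p = 7 is an arbitrary example.

```python
import numpy as np
from scipy.stats import chi2, gamma

p = 7
x = np.linspace(0.1, 20, 5)                  # arbitrary evaluation points

# chi^2_p density equals the Gamma(shape = p/2, scale = 2) density
print(np.allclose(chi2.pdf(x, df=p), gamma.pdf(x, a=p / 2, scale=2)))   # True
print(chi2.mean(df=p), chi2.var(df=p))                                  # p and 2p
```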

Lemma 1.2 If \(\chi_p^2\) denotes a chi-squared r.v. with p degrees of freedom, then

  1. If \(Z\sim N(0,1)\), then \(Z^2\sim\chi_1^2\)

  2. If \(X_1,\cdots,X_n\) are independent and \(X_i\sim\chi_{p_i}^2\), then \(X_1+\cdots+X_n\sim\chi_{p_1+\cdots+p_n}^2\).

The proof is straightforward and follows from standard results in a first probability course.
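An empirical sketch of Lemma 1.2 (sample sizes and degrees of freedom chosen arbitrarily), using a Kolmogorov-Smirnov test as an informal goodness-of-fit check:

```python
import numpy as np
from scipy.stats import chi2, kstest

rng = np.random.default_rng(8)

# part 1: the square of a standard normal behaves like chi^2_1
z2 = rng.normal(size=100_000) ** 2
print(kstest(z2, chi2(df=1).cdf).pvalue)       # large p-value expected

# part 2: a sum of independent chi^2_2 and chi^2_3 draws behaves like chi^2_5
s = rng.chisquare(2, size=100_000) + rng.chisquare(3, size=100_000)
print(kstest(s, chi2(df=5).cdf).pvalue)        # large p-value expected
```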

We conclude this chapter with a theorem about the properties of \(\bar{X}\) and \(S^2\) when we add a normality assumption.

Theorem 1.4 Let \(X_1,\cdots,X_n\) be a random sample from a \(N(\mu,\sigma^2)\) distribution, and let \(\bar{X}\) and \(S^2\) be the sample mean and sample variance defined in (1.3) and (1.4). Then

  1. \(\bar{X}\) and \(S^2\) are independent.

  2. \(\bar{X}\) has a \(N(\mu,\sigma^2/n)\) distribution.

  3. \((n-1)S^2/\sigma^2\) has a chi-squared distribution with n-1 degrees of freedom.

Proof. We can assume, without loss of generality, that \(\mu=0\) and \(\sigma=1\).

For (a), notice that \(S^2\) can be expressed as a function of the n-1 deviations \[\begin{equation} \begin{split} S^2&=\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2\\ &=\frac{1}{n-1}((X_1-\bar{X})^2+\sum_{i=2}^n(X_i-\bar{X})^2)\\ &=\frac{1}{n-1}((\sum_{i=2}^n(X_i-\bar{X}))^2+\sum_{i=2}^n(X_i-\bar{X})^2) \end{split} \tag{1.23} \end{equation}\] where the last step uses the classic property of the sample mean given in (1.7). Thus, \(S^2\) can be written as a function of only \((X_2-\bar{X},\cdots,X_n-\bar{X})\). The joint pdf of the sample is \[\begin{equation} f(x_1,\cdots,x_n)=\frac{1}{(2\pi)^{n/2}}e^{-\frac{\sum_{i=1}^nx_i^2}{2}} \tag{1.24} \end{equation}\] Define the variable transformation \(y_1=\bar{x},y_2=x_2-\bar{x},\cdots,y_n=x_n-\bar{x}\); then the Jacobian determinant of this transformation is \[\begin{equation} |J|=\begin{vmatrix}1 & -1 & -1 & \cdots &-1\\ 1 & 1 & 0 & \cdots & 0\\ 1 & 0 & 1 & \cdots & 0\\ \vdots & \vdots & \vdots & & \vdots\\ 1 & 0 & 0 & \cdots & 1 \end{vmatrix}=n \tag{1.25} \end{equation}\]
The proof of (1.25) can be done by induction. Therefore, we have \[\begin{equation} \begin{split} f(y_1,\cdots,y_n)&=\frac{n}{(2\pi)^{n/2}}e^{-\frac{(y_1-\sum_{i=2}^ny_i)^2}{2}}e^{-\frac{\sum_{i=2}^n(y_i+y_1)^2}{2}}\\ &=[(\frac{n}{2\pi})^{1/2}e^{-\frac{ny_1^2}{2}}][\frac{n^{1/2}}{(2\pi)^{(n-1)/2}}e^{-\frac{\sum_{i=2}^ny_i^2+(\sum_{i=2}^ny_i)^2}{2}}], \quad -\infty<y_i<\infty \end{split} \tag{1.26} \end{equation}\] Hence, the joint pdf factors into a function of \(y_1\) alone times a function of \(y_2,\cdots,y_n\) alone, so \(\bar{X}\) is independent of \((X_2-\bar{X},\cdots,X_n-\bar{X})\), and therefore \(\bar{X}\) and \(S^2\) are independent.
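The value of the determinant in (1.25) can also be confirmed numerically for a particular n (n = 6 below, an arbitrary choice):

```python
import numpy as np

n = 6
J = np.eye(n)
J[0, 1:] = -1.0          # first row: 1, -1, ..., -1
J[1:, 0] = 1.0           # first column below the corner: all 1
print(np.linalg.det(J))  # ~ 6, i.e. equal to n
```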

For (b), define \(B=\frac{1}{n}(1,\cdots,1)\), then \(\bar{X}=B\mathbf{X}\) with \(\mathbf{X}=(X_1,\cdots,X_n)^T\). By Theorem 1.3 we have (b) as desired.

Finally, for (c), we use induction. Denote the sample mean and variance of the first k observations as \(\bar{X}_k\) and \(S^2_k\). Then \[\begin{equation} (n-1)S_n^2=(n-2)S_{n-1}^2+(\frac{n-1}{n})(X_n-\bar{X}_{n-1})^2 \tag{1.27} \end{equation}\] The proof of (1.27) is given in Exercise 1.1. First consider \(n=2\): from (1.27) we have that \[\begin{equation} S_2^2=\frac{1}{2}(X_2-X_1)^2 \tag{1.28} \end{equation}\] Since \(\frac{X_2-X_1}{\sqrt{2}}\sim N(0,1)\), by the property of the chi-squared distribution in Lemma 1.2, \(S_2^2\sim\chi_1^2\). Proceeding with the induction, we assume that for \(n=k\), \((k-1)S_k^2\sim\chi_{k-1}^2\). For \(n=k+1\), by (1.27) \[\begin{equation} kS_{k+1}^2=(k-1)S_k^2+(\frac{k}{k+1})(X_{k+1}-\bar{X}_k)^2 \tag{1.29} \end{equation}\] By the induction hypothesis, \((k-1)S_k^2\sim\chi_{k-1}^2\), so we only need to show that \((\frac{k}{k+1})(X_{k+1}-\bar{X}_k)^2\sim\chi_1^2\) and is independent of \(S_k^2\); then by Lemma 1.2 we get the desired result.

Since the vector \((X_{k+1},\bar{X}_k)\) is independent of \(S_k^2\) (by part (a) applied to the first k observations, together with the independence of \(X_{k+1}\) from them), so is any function of this vector, in particular \((X_{k+1}-\bar{X}_k)^2\). Furthermore, \(X_{k+1}-\bar{X}_k\sim N(0,\frac{k+1}{k})\), because \(X_{k+1}\sim N(0,1)\) and \(\bar{X}_k\sim N(0,\frac{1}{k})\) and they are independent. Therefore \((\frac{k}{k+1})(X_{k+1}-\bar{X}_k)^2\sim\chi_1^2\), as desired.
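A simulation sketch of Theorem 1.4 for normal data (the values of \(\mu\), \(\sigma\), and n are arbitrary choices): \(\bar{X}\) and \(S^2\) are essentially uncorrelated (consistent with independence), \(\bar{X}\) has mean \(\mu\) and variance \(\sigma^2/n\), and \((n-1)S^2/\sigma^2\) matches the \(\chi^2_{n-1}\) moments.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, n, reps = 2.0, 3.0, 8, 300_000
X = rng.normal(mu, sigma, size=(reps, n))

xbar = X.mean(axis=1)
s2 = X.var(axis=1, ddof=1)
W = (n - 1) * s2 / sigma**2

print(np.corrcoef(xbar, s2)[0, 1])      # ~ 0, consistent with part 1
print(xbar.mean(), xbar.var())          # ~ mu and sigma^2 / n (part 2)
print(W.mean(), W.var())                # ~ n - 1 and 2(n - 1) (part 3)
```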

The following exercise shows the results that are needed in the induction proof of Theorem 1.4, part (c).

Exercise 1.1 Show the following

  1. \(\bar{X}_n=\frac{X_{n}+(n-1)\bar{X}_{n-1}}{n}\)

  2. \((n-1)S^2_n=(n-2)S_{n-1}^2+(\frac{n-1}{n})(X_n-\bar{X}_{n-1})^2\)

(This problem is revised from Exercise 5.15 of Casella and Berger (2002))
Proof. By the definition of the sample mean, it is straightforward that \[\begin{equation} \bar{X}_n=\frac{\sum_{i=1}^nX_i}{n}=\frac{X_n+(n-1)\bar{X}_{n-1}}{n} \tag{1.30} \end{equation}\] For (b), we have \[\begin{equation} \begin{split} &(n-2)S_{n-1}^2+(\frac{n-1}{n})(X_n-\bar{X}_{n-1})^2\\ &=\sum_{i=1}^{n-1}(X_i-\bar{X}_{n-1})^2+\frac{n-1}{n}(X_n-\bar{X}_{n-1})^2\\ &=\sum_{i=1}^{n-1}(X_i-\bar{X}_{n}+\bar{X}_{n}-\bar{X}_{n-1})^2+\frac{n-1}{n}(X_n-\bar{X}_{n-1})^2\\ &=\sum_{i=1}^{n-1}(X_i-\bar{X}_{n})^2+(n-1)(\bar{X}_{n}-\bar{X}_{n-1})^2+2(\bar{X}_n-\bar{X}_{n-1})\sum_{i=1}^{n-1}(X_i-\bar{X}_n)\\ &+\frac{n-1}{n}(X_n-\bar{X}_{n-1})^2 \end{split} \tag{1.31} \end{equation}\] To simplify (1.31) further, we need the following simple results \[\begin{equation} \begin{split} &\bar{X}_{n}-\bar{X}_{n-1}=\frac{X_n-\bar{X}_n}{n-1} \\ &\sum_{i=1}^{n-1}(X_i-\bar{X}_n)=\bar{X}_n-X_n \\ &X_n-\bar{X}_{n-1}=\frac{n(X_n-\bar{X}_n)}{n-1} \end{split} \tag{1.32} \end{equation}\] Substituting (1.32) into (1.31), we obtain \[\begin{equation} \begin{split} &(n-2)S_{n-1}^2+(\frac{n-1}{n})(X_n-\bar{X}_{n-1})^2\\ &=\sum_{i=1}^{n-1}(X_i-\bar{X}_{n})^2+(n-1)\times\frac{(X_n-\bar{X}_n)^2}{(n-1)^2}\\ &+2(\frac{X_n-\bar{X}_n}{n-1})\times(\bar{X}_n-X_n)+\frac{n-1}{n}\times(\frac{n}{n-1})^2\times(X_n-\bar{X}_n)^2\\ &=\sum_{i=1}^{n-1}(X_i-\bar{X}_{n})^2+(\frac{1}{n-1}-\frac{2}{n-1}+\frac{n}{n-1})(X_n-\bar{X}_n)^2\\ &=\sum_{i=1}^{n}(X_i-\bar{X}_{n})^2=(n-1)S^2_n \end{split} \tag{1.33} \end{equation}\]
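Both recursions of Exercise 1.1 can be verified numerically on an arbitrary sequence; the sketch below compares the one-step updates against direct recomputation at every n.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=30)                     # an arbitrary sequence of observations

for n in range(2, len(x) + 1):
    xbar_n = x[:n].mean()
    xbar_prev = x[:n - 1].mean()
    s2_n = x[:n].var(ddof=1)
    s2_prev = x[:n - 1].var(ddof=1) if n > 2 else 0.0   # the (n-2) S^2_{n-1} term vanishes at n = 2

    # part (a): mean update
    assert np.isclose(xbar_n, (x[n - 1] + (n - 1) * xbar_prev) / n)
    # part (b): variance update, i.e. equation (1.27)
    assert np.isclose((n - 1) * s2_n,
                      (n - 2) * s2_prev + ((n - 1) / n) * (x[n - 1] - xbar_prev) ** 2)

print("both recursions verified")
```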

References

Casella, George, and Roger Berger. 2002. Statistical Inference. 2nd ed. Belmont, CA: Duxbury Resource Center.

Rencher, Alvin, and Bruce Schaalje. 2007. Linear Models in Statistics. 2nd ed. John Wiley & Sons.