Chapter 11 Maximum Likelihood Estimator, Bayes Estimator, EM Algorithm (Lecture on 02/04/2020)

Definition 11.1 (Invariance Property of Maximum Likelihood Estimators) Suppose that a distribution is indexed by a parameter θ, and one is interested in finding an estimator for some function of θ, denoted τ(θ). The invariance property of MLEs says that if ˆθ is the MLE of θ, then τ(ˆθ) is the MLE of τ(θ).
There are some constraints on the choice of \tau(\theta). If \tau(\theta) is one-to-one, this definition is fine. In this case, denote \eta=\tau(\theta); then the inverse function \tau^{-1}(\eta)=\theta exists, and the likelihood function of \eta is \begin{equation} L^*(\eta|\mathbf{x})=\prod_{i=1}^nf(x_i|\tau^{-1}(\eta))=L(\tau^{-1}(\eta)|\mathbf{x}) \tag{11.1} \end{equation} and \begin{equation} \sup_{\eta}L^*(\eta|\mathbf{x})=\sup_{\eta}L(\tau^{-1}(\eta)|\mathbf{x})=\sup_{\theta}L(\theta|\mathbf{x}) \tag{11.2} \end{equation} Thus, the maximum of L^*(\eta|\mathbf{x}) is attained at \eta=\tau(\hat{\theta}), showing that the MLE of \tau(\theta) is \tau(\hat{\theta}).
Definition 11.2 (Induced Likelihood Function) For \tau(\theta), the induced likelihood function L^* is given by \begin{equation} L^*(\eta|\mathbf{x})=\sup_{\{\theta:\tau(\theta)=\eta\}}L(\theta|\mathbf{x}) \tag{11.3} \end{equation} The value \hat{\eta} that maximizes L^*(\eta|\mathbf{x}) will be called the MLE of \eta=\tau(\theta).
Theorem 11.1 (Invariance Property of MLEs) If \hat{\theta} is the MLE of \theta, then for any function \tau(\theta), the MLE of \tau(\theta) is \tau(\hat{\theta}).
Proof. Let \hat{\eta} denote the value that maximizes L^*(\eta|\mathbf{x}). We must show that L^*(\hat{\eta}|\mathbf{x})=L^*[\tau(\hat{\theta})|\mathbf{x}]. By the definition of L^*, the maxima of L and L^* coincide, so we have \begin{equation} \begin{split} L^*(\hat{\eta}|\mathbf{x})&=\sup_{\eta}\sup_{\{\theta:\tau(\theta)=\eta\}}L(\theta|\mathbf{x})\\ &=\sup_{\theta}L(\theta|\mathbf{x})\\ &=L(\hat{\theta}|\mathbf{x}) \end{split} \tag{11.4} \end{equation} Furthermore, \begin{equation} L(\hat{\theta}|\mathbf{x})=\sup_{\{\theta:\tau(\theta)=\tau(\hat{\theta})\}}L(\theta|\mathbf{x})=L^*(\tau(\hat{\theta})|\mathbf{x}) \tag{11.5} \end{equation} Hence L^*(\hat{\eta}|\mathbf{x})=L^*[\tau(\hat{\theta})|\mathbf{x}], and therefore \tau(\hat{\theta}) is the MLE of \tau(\theta).
The invariance property of MLEs also holds in multivariate case. If the MLE of (\theta_1,\cdots,\theta_k) is (\hat{\theta}_1,\cdots,\hat{\theta}_k) and \tau(\theta_1,\cdots,\theta_k) is any function of the parameters, the MLE of \tau(\theta_1,\cdots,\theta_k) is \tau(\hat{\theta}_1,\cdots,\hat{\theta}_k).
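As a quick numerical illustration of the invariance property, the sketch below (with made-up Bernoulli data) finds the MLE of p by maximizing the likelihood directly over a grid, and then obtains the MLE of the odds tau(p) = p/(1-p) simply by plugging in the MLE of p:

```python
import math

# Hypothetical Bernoulli data for illustration (not from the lecture).
x = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
n = len(x)

def log_lik(p):
    # Bernoulli log likelihood log L(p | x)
    return sum(xi * math.log(p) + (1 - xi) * math.log(1 - p) for xi in x)

# Crude grid search for the MLE (the closed form is the sample mean).
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=log_lik)

# Invariance: the MLE of the odds tau(p) = p/(1-p) is tau(p_hat).
odds_hat = p_hat / (1 - p_hat)
print(p_hat, odds_hat)
```

Here the grid search recovers the sample mean 0.7, and no separate maximization over the odds scale is needed.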
Example 11.1 (Normal MLEs, Both Parameters Unknown) Let X_1,\cdots,X_n be i.i.d. N(\theta,\sigma^2), with both \theta and \sigma^2 unknown. Then \begin{equation} L(\theta,\sigma^2|\mathbf{x})=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp(-\frac{\sum_{i=1}^n(x_i-\theta)^2}{2\sigma^2}) \tag{11.6} \end{equation} and \begin{equation} \log(L(\theta,\sigma^2|\mathbf{x}))=-\frac{n}{2}\log(2\pi)-\frac{n}{2}\log(\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\theta)^2 \tag{11.7} \end{equation} The partial derivatives w.r.t. \theta and \sigma^2 are
\begin{equation} \begin{split} &\frac{\partial}{\partial\theta}\log(L(\theta,\sigma^2|\mathbf{x}))=\frac{1}{\sigma^2}\sum_{i=1}^n(x_i-\theta)\\ &\frac{\partial}{\partial\sigma^2}\log(L(\theta,\sigma^2|\mathbf{x}))=-\frac{n}{2\sigma^2}+\frac{1}{2\sigma^4}\sum_{i=1}^n(x_i-\theta)^2 \end{split} \tag{11.8} \end{equation} Setting these partial derivatives equal to 0 and solving yields the solution \hat{\theta}=\bar{x} and \hat{\sigma}^2=n^{-1}\sum_{i=1}^n(x_i-\bar{x})^2. It remains to verify that this solution is a global maximum. Since \sum_{i=1}^n(x_i-\theta)^2>\sum_{i=1}^n(x_i-\bar{x})^2 for all \theta\neq\bar{x}, for any value of \sigma^2 \begin{equation} \frac{1}{(2\pi\sigma^2)^{n/2}}\exp(-\frac{\sum_{i=1}^n(x_i-\bar{x})^2}{2\sigma^2})\geq\frac{1}{(2\pi\sigma^2)^{n/2}}\exp(-\frac{\sum_{i=1}^n(x_i-\theta)^2}{2\sigma^2}) \tag{11.9} \end{equation} Therefore, the problem reduces to a one-dimensional problem in \sigma^2 only, which attains its global maximum at \hat{\sigma}^2=\frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^2. Therefore, (\bar{X},\frac{1}{n}\sum_{i=1}^n(X_i-\bar{X})^2) are the MLEs.
\frac{1}{(2\pi\sigma^2)^{n/2}}\exp(-\frac{\sum_{i=1}^n(x_i-\bar{x})^2}{2\sigma^2}) is called the profile likelihood for \sigma^2. From Miscellanea 7.5.5 of Casella and Berger (2002), the likelihood has many modifications. Some are used to deal with nuisance parameters (such as the profile likelihood); others are used when a more robust specification is desired (such as the quasi-likelihood); and others are useful when the data are censored (such as the partial likelihood).
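The closed-form normal MLEs can be checked numerically. The following sketch (with illustrative data, not from the lecture) computes the sample mean and the variance estimate with divisor n, and verifies that nearby parameter values give no larger log likelihood:

```python
import math

# Illustrative data; any numeric sample works.
x = [2.1, 3.4, 1.9, 2.8, 3.0, 2.5]
n = len(x)

theta_hat = sum(x) / n                                   # MLE of theta: sample mean
sigma2_hat = sum((xi - theta_hat) ** 2 for xi in x) / n  # MLE of sigma^2 (divisor n, not n-1)

def log_lik(theta, sigma2):
    # log of (11.6)
    return (-n / 2 * math.log(2 * math.pi * sigma2)
            - sum((xi - theta) ** 2 for xi in x) / (2 * sigma2))

# The closed-form MLEs should dominate nearby parameter values.
nearby = [log_lik(theta_hat + dt, sigma2_hat + ds)
          for dt in (-0.1, 0.0, 0.1) for ds in (-0.1, 0.0, 0.1)]
print(log_lik(theta_hat, sigma2_hat) >= max(nearby))
```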

Proposition 11.1 (Check Maximum for Two-Dimensional Functions) To use bivariate calculus to verify that a function H(\theta_1,\theta_2) has a local maximum at (\hat{\theta}_1,\hat{\theta}_2), it must be shown that the following three conditions hold.

  1. The first-order partial derivatives are both 0, \begin{equation} \begin{split} &\frac{\partial}{\partial\theta_1}H(\theta_1,\theta_2)|_{\theta_1=\hat{\theta}_1,\theta_2=\hat{\theta}_2}=0\quad \text{and} \\ &\frac{\partial}{\partial\theta_2}H(\theta_1,\theta_2)|_{\theta_1=\hat{\theta}_1,\theta_2=\hat{\theta}_2}=0 \end{split} \tag{11.10} \end{equation}

  2. At least one second-order partial derivative is negative, \begin{equation} \begin{split} &\frac{\partial^2}{\partial\theta_1^2}H(\theta_1,\theta_2)|_{\theta_1=\hat{\theta}_1,\theta_2=\hat{\theta}_2}<0\quad \text{or}\\ &\frac{\partial^2}{\partial\theta_2^2}H(\theta_1,\theta_2)|_{\theta_1=\hat{\theta}_1,\theta_2=\hat{\theta}_2}<0 \end{split} \tag{11.11} \end{equation}

  3. The Jacobian of the second-order partial derivatives is positive, \begin{equation} \begin{split} &\begin{vmatrix} \frac{\partial^2}{\partial\theta_1^2}H(\theta_1,\theta_2) & \frac{\partial^2}{\partial\theta_1\partial\theta_2}H(\theta_1,\theta_2)\\ \frac{\partial^2}{\partial\theta_1\partial\theta_2}H(\theta_1,\theta_2) & \frac{\partial^2}{\partial\theta_2^2}H(\theta_1,\theta_2) \end{vmatrix}_{\theta_1=\hat{\theta}_1,\theta_2=\hat{\theta}_2}\\ &=\frac{\partial^2}{\partial\theta_1^2}H(\theta_1,\theta_2)\frac{\partial^2}{\partial\theta_2^2}H(\theta_1,\theta_2)-(\frac{\partial^2}{\partial\theta_1\partial\theta_2}H(\theta_1,\theta_2))^2|_{\theta_1=\hat{\theta}_1,\theta_2=\hat{\theta}_2}>0 \end{split} \tag{11.12} \end{equation}
Example 11.2 For the normal likelihood, the second-order partial derivatives are \begin{equation} \begin{split} &\frac{\partial^2}{\partial\theta^2}\log L(\theta,\sigma^2|\mathbf{x})=-\frac{n}{\sigma^2}\\ &\frac{\partial^2}{\partial(\sigma^2)^2}\log L(\theta,\sigma^2|\mathbf{x})=\frac{n}{2\sigma^4}-\frac{1}{\sigma^6}\sum_{i=1}^n(x_i-\theta)^2\\ &\frac{\partial^2}{\partial\theta\partial\sigma^2}\log L(\theta,\sigma^2|\mathbf{x})=-\frac{1}{\sigma^4}\sum_{i=1}^n(x_i-\theta)\\ \end{split} \tag{11.13} \end{equation} and the Jacobian is \begin{equation} \begin{split} |J|&=\frac{1}{\sigma^6}[-\frac{n^2}{2}+\frac{n}{\sigma^2}\sum_{i=1}^n(x_i-\theta)^2-\frac{1}{\sigma^2}(\sum_{i=1}^n(x_i-\theta))^2]|_{\theta=\hat{\theta},\sigma^2=\hat{\sigma}^2}\\ &=\frac{1}{\hat{\sigma}^6}[-\frac{n^2}{2}+n^2-\frac{1}{\hat{\sigma}^2}(\sum_{i=1}^n(x_i-\bar{x}))^2]\\ &=\frac{n^2}{2\hat{\sigma}^6}>0 \end{split} \tag{11.14} \end{equation} where the second line uses \sum_{i=1}^n(x_i-\bar{x})^2=n\hat{\sigma}^2 and the last line uses \sum_{i=1}^n(x_i-\bar{x})=0. Thus, the calculus conditions are satisfied and we have indeed found a maximum. (Strictly speaking, we still need to check the boundary.)
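The three calculus conditions of Proposition 11.1 can also be checked numerically. The sketch below (made-up sample) approximates the second-order partials of the normal log likelihood at the closed-form MLEs by central finite differences:

```python
import math

# Made-up sample; verify Proposition 11.1 for the normal log likelihood
# at (theta_hat, sigma2_hat) using central finite differences.
x = [1.2, 0.7, 1.9, 1.1, 0.4, 1.6]
n = len(x)
theta_hat = sum(x) / n
sigma2_hat = sum((xi - theta_hat) ** 2 for xi in x) / n

def ll(theta, s2):
    return (-n / 2 * math.log(2 * math.pi * s2)
            - sum((xi - theta) ** 2 for xi in x) / (2 * s2))

h = 1e-4
Haa = (ll(theta_hat + h, sigma2_hat) - 2 * ll(theta_hat, sigma2_hat)
       + ll(theta_hat - h, sigma2_hat)) / h**2      # analytically -n/sigma2_hat
Hbb = (ll(theta_hat, sigma2_hat + h) - 2 * ll(theta_hat, sigma2_hat)
       + ll(theta_hat, sigma2_hat - h)) / h**2      # analytically -n/(2*sigma2_hat**2)
Hab = (ll(theta_hat + h, sigma2_hat + h) - ll(theta_hat + h, sigma2_hat - h)
       - ll(theta_hat - h, sigma2_hat + h) + ll(theta_hat - h, sigma2_hat - h)) / (4 * h**2)
det = Haa * Hbb - Hab**2                            # analytically n**2/(2*sigma2_hat**3)

print(Haa < 0, Hbb < 0, det > 0)
```

The mixed partial is approximately 0 at the MLE (since the residuals sum to 0), and the determinant matches n^2/(2\hat{\sigma}^6) up to discretization error.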
MLEs can be numerically unstable. We calculate \hat{\theta} based on L(\theta|\mathbf{x}), but we may ask what value we would get for the MLE if we based our calculations on L(\theta|\mathbf{x}+\epsilon), for some small \epsilon. Intuitively, this new MLE should be close to \hat{\theta}, but this is not always the case. Such occurrences happen when the likelihood function is very flat in the neighborhood of its maximum or when there is no finite maximum. When MLEs cannot be found explicitly and numerical solutions are required, it is useful to investigate the stability of the solution.

Bayes Estimator

Example 11.3 (Binomial Bayes Estimation) Let X_1,\cdots,X_n be i.i.d. Bernoulli(p), so that Y=\sum_{i=1}^nX_i is Bin(n,p). Assume the prior on p is Beta(\alpha,\beta). Then using Bayes' theorem, the posterior distribution of p is Beta(y+\alpha,n-y+\beta). We can use the mean of the posterior distribution as a Bayesian point estimator of p. Thus, the Bayes estimator of p is \begin{equation} \hat{p}_{B}=\frac{y+\alpha}{\alpha+\beta+n} \tag{11.15} \end{equation}

The prior mean is \frac{\alpha}{\alpha+\beta}, and from the data alone we might use \hat{p}=\frac{y}{n} as an estimate of p. The Bayes estimator of p combines all of this information. We can rewrite the Bayes estimator as \begin{equation} \hat{p}_{B}=(\frac{n}{n+\alpha+\beta})(\frac{y}{n})+(\frac{\alpha+\beta}{n+\alpha+\beta})(\frac{\alpha}{\alpha+\beta}) \tag{11.16} \end{equation} Thus, \hat{p}_{B} is a linear combination of the prior mean and the sample mean, with weights determined by \alpha,\beta and n.
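A minimal sketch of (11.15) and (11.16), with hypothetical values of y, n, alpha and beta, confirming that the posterior mean and the shrinkage form give the same number:

```python
# Posterior-mean Bayes estimator for Y ~ Bin(n, p) with a Beta(alpha, beta) prior.
def bayes_estimate(y, n, alpha, beta):
    # posterior mean of Beta(y + alpha, n - y + beta), as in (11.15)
    return (y + alpha) / (n + alpha + beta)

def shrinkage_form(y, n, alpha, beta):
    # weighted average of sample mean and prior mean, as in (11.16)
    w = n / (n + alpha + beta)  # weight on the sample mean y/n
    return w * (y / n) + (1 - w) * (alpha / (alpha + beta))

y, n, alpha, beta = 13, 20, 2.0, 3.0
print(bayes_estimate(y, n, alpha, beta))   # 0.6
print(shrinkage_form(y, n, alpha, beta))   # same value up to floating point
```

As n grows with alpha and beta fixed, the weight w on the sample mean tends to 1, so the data dominate the prior.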

When using Bayes estimation, it is often useful to choose the prior distribution from a conjugate family, because doing so yields a closed-form estimator.

Definition 11.3 (Conjugate Family) Let \mathcal{F} denote the class of p.d.f.s or p.m.f.s f(x|\theta) (indexed by \theta). A class \Pi of prior distributions is a conjugate family for \mathcal{F} if the posterior distribution is in the class \Pi for all f\in\mathcal{F}, all priors in \Pi, and all x\in\mathcal{X}.
When the prior belongs to a conjugate family for the likelihood, updating the prior amounts to updating its parameters.

Example 11.4 (Normal Bayes Estimation) Let X\sim N(\theta,\sigma^2) and suppose the prior on \theta is N(\mu,\tau^2). Here we assume \sigma^2,\mu,\tau^2 are all known. The posterior distribution of \theta is then \theta|x\sim N(\mu_*,\sigma_*^2) with \begin{equation} \begin{split} &\mu_*=\frac{\tau^2}{\tau^2+\sigma^2}x+\frac{\sigma^2}{\sigma^2+\tau^2}\mu\\ &\sigma_*^2=\frac{\sigma^2\tau^2}{\sigma^2+\tau^2} \end{split} \tag{11.17} \end{equation} The Bayes estimator of \theta is given by \mu_*.

The Bayes estimator is again a linear combination of the prior and sample means. If the prior information becomes more vague, meaning that \tau^2 increases, more weight is given to the sample information. On the other hand, if the prior information is good, so that \tau^2<\sigma^2, then more weight is given to the prior mean.
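The weighting behavior follows directly from (11.17). In this sketch (illustrative numbers), a vague prior pulls the posterior mean toward the observation, while a tight prior keeps it near the prior mean:

```python
def normal_posterior(x, mu, sigma2, tau2):
    """Posterior N(mu_star, sigma2_star) for theta given one observation x,
    likelihood N(theta, sigma2) and prior N(mu, tau2), as in (11.17)."""
    mu_star = (tau2 / (tau2 + sigma2)) * x + (sigma2 / (sigma2 + tau2)) * mu
    sigma2_star = sigma2 * tau2 / (sigma2 + tau2)
    return mu_star, sigma2_star

# Vague prior (large tau2): posterior mean approaches the observation x.
m_vague, _ = normal_posterior(x=4.0, mu=0.0, sigma2=1.0, tau2=1e6)
# Tight prior (small tau2): posterior mean stays near the prior mean mu.
m_tight, _ = normal_posterior(x=4.0, mu=0.0, sigma2=1.0, tau2=1e-6)
print(round(m_vague, 3), round(m_tight, 3))  # 4.0 0.0
```

Note also that \sigma_*^2 is always smaller than both \sigma^2 and \tau^2, so the posterior is more concentrated than either source of information alone.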

EM (Expectation-Maximization) Algorithm

Rather than detailing a procedure for solving directly for the MLE, the EM algorithm produces a sequence of iterates that is guaranteed to converge to the MLE. It is based on the idea of replacing one difficult likelihood maximization with a sequence of easier maximizations whose limit is the answer to the original problem.

Example 11.5 (Multiple Poisson Rates) We observe X_1,\cdots,X_n and Y_1,\cdots,Y_n, all mutually independent, where Y_i\sim Pois(\beta\tau_i) and X_i\sim Pois(\tau_i). The joint p.m.f. is therefore \begin{equation} f((x_1,y_1),\cdots,(x_n,y_n)|\beta,\tau_1,\cdots,\tau_n)=\prod_{i=1}^n\frac{e^{-\beta\tau_i}(\beta\tau_i)^{y_i}}{y_i!}\frac{e^{-\tau_i}(\tau_i)^{x_i}}{x_i!} \tag{11.18} \end{equation} The maximum likelihood estimators can be found by straightforward differentiation of the log-likelihood function \begin{equation} \ell(\beta,\tau_1,\cdots,\tau_n)=\log(\beta)\sum_{i=1}^ny_i+\sum_{i=1}^n(x_i+y_i)\log(\tau_i)-(\beta+1)\sum_{i=1}^n\tau_i-\sum_{i=1}^n\log(x_i!y_i!) \tag{11.19} \end{equation} Taking first derivatives with respect to all parameters and setting them equal to 0, we find that the maximum likelihood estimators \hat{\beta},\hat{\tau}_j,j=1,\cdots,n, satisfy \begin{equation} \begin{split} & \frac{\sum_{i=1}^ny_i}{\hat{\beta}}-\sum_{i=1}^n\hat{\tau}_i=0\\ & \frac{x_i+y_i}{\hat{\tau}_i}-(\hat{\beta}+1)=0\quad i=1,2,\cdots,n \end{split} \tag{11.20} \end{equation} Solving (11.20) gives the maximum likelihood estimators \begin{equation} \begin{split} & \hat{\beta}=\frac{\sum_{i=1}^ny_i}{\sum_{i=1}^nx_i}\\ & \hat{\tau}_i=\frac{x_i+y_i}{\hat{\beta}+1}\quad i=1,2,\cdots,n \end{split} \tag{11.21} \end{equation}
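The closed-form solution (11.21) can be checked against the score equations (11.20); the counts below are made up for illustration:

```python
# Check the closed-form MLEs (11.21) against the score equations (11.20)
# for illustrative count data.
x = [3, 7, 2, 5]   # X_i ~ Pois(tau_i)
y = [6, 15, 3, 9]  # Y_i ~ Pois(beta * tau_i)

beta_hat = sum(y) / sum(x)
tau_hat = [(xi + yi) / (beta_hat + 1) for xi, yi in zip(x, y)]

# Both score equations from (11.20) should vanish at the MLE.
score_beta = sum(y) / beta_hat - sum(tau_hat)
score_tau = [(xi + yi) / ti - (beta_hat + 1) for xi, yi, ti in zip(x, y, tau_hat)]

print(abs(score_beta) < 1e-9, all(abs(s) < 1e-9 for s in score_tau))
```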

The likelihood based on (11.18) is the complete-data likelihood, and ((x_1,y_1),\cdots,(x_n,y_n)) is called the complete data. One common occurrence is missing data, which makes estimation more difficult. Suppose, for instance, that x_1 is missing. Discarding y_1 is not a good idea, since we would lose the information in y_1. Starting from the complete-data likelihood (11.18), summing out x_1 gives the likelihood when x_1 is missing, \begin{equation} \sum_{x_1=0}^{\infty}f((x_1,y_1),\cdots,(x_n,y_n)|\beta,\tau_1,\cdots,\tau_n) \tag{11.22} \end{equation}

Definition 11.4 (Complete Data Problem and Incomplete Data Problem) In general, we can move from the complete-data problem to the incomplete-data problem or the reverse. If \mathbf{Y}=(Y_1,\cdots,Y_n) are the incomplete data and \mathbf{X}=(X_1,\cdots,X_m) are the augmented data, making (\mathbf{Y},\mathbf{X}) the complete data, the densities g(\cdot|\theta) of \mathbf{Y} and f(\cdot|\theta) of (\mathbf{Y},\mathbf{X}) have the relationship \begin{equation} g(\mathbf{y}|\theta)=\int_{\mathcal{X}}f(\mathbf{y},\mathbf{x}|\theta)d\mathbf{x} \tag{11.23} \end{equation} with sums replacing the integral in the discrete case.

If we turn these into likelihoods, L(\theta|\mathbf{y})=g(\mathbf{y}|\theta) is the incomplete-data likelihood and L(\theta|\mathbf{y},\mathbf{x})=f(\mathbf{y},\mathbf{x}|\theta) is the complete-data likelihood. When one of these likelihoods is hard to work with, it may be easier to work with the other.
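For the Poisson example, moving between the two likelihoods is just a matter of summing out the missing x_1; since the Pois(tau_1) p.m.f. sums to 1, the sum (11.22) equals the complete-data likelihood with the x_1 factor removed. A small numerical check with arbitrary parameter values, truncating the infinite sum:

```python
import math

def pois_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

# Arbitrary illustrative values with n = 2 pairs and x_1 missing.
beta, tau = 1.5, [2.0, 3.0]
y = [5, 7]
x2 = 4

# Incomplete-data likelihood: the Pois(tau_1) factor for x_1 is simply absent.
incomplete = (pois_pmf(y[0], beta * tau[0]) * pois_pmf(y[1], beta * tau[1])
              * pois_pmf(x2, tau[1]))

# Summing the complete-data likelihood over x_1; each term factors as
# incomplete * pois_pmf(x1, tau_1), and the truncated tail is negligible.
marginalized = sum(incomplete * pois_pmf(x1, tau[0]) for x1 in range(60))

print(abs(marginalized - incomplete) < 1e-12)
```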
Example 11.6 The incomplete-data likelihood is obtained from (11.18) by summing over x_1. This gives \begin{equation} L(\beta,\tau_1,\cdots,\tau_n|y_1,(x_2,y_2),\cdots,(x_n,y_n))=[\prod_{i=1}^n\frac{e^{-\beta\tau_i}(\beta\tau_i)^{y_i}}{y_i!}][\prod_{i=2}^n\frac{e^{-\tau_i}(\tau_i)^{x_i}}{x_i!}] \tag{11.24} \end{equation} and (y_1,(x_2,y_2),\cdots,(x_n,y_n)) is the incomplete data. This is the likelihood that we need to maximize. Differentiation leads to the MLE equations \begin{equation} \begin{split} &\hat{\beta}=\frac{\sum_{i=1}^ny_i}{\sum_{i=1}^n\hat{\tau}_i}\\ &y_1=\hat{\tau}_1\hat{\beta}\\ &x_j+y_j=\hat{\tau}_j(\hat{\beta}+1)\quad j=2,3,\cdots,n \end{split} \tag{11.25} \end{equation} which we now solve with the EM algorithm.
Definition 11.5 (EM Algorithm) The EM algorithm allows us to maximize L(\theta|\mathbf{y}) by working only with L(\theta|\mathbf{y},\mathbf{x}) and the conditional p.d.f. or p.m.f. of \mathbf{X} given \mathbf{y} and \theta, defined by \begin{equation} \begin{split} &L(\theta|\mathbf{y},\mathbf{x})=f(\mathbf{y},\mathbf{x}|\theta)\\ &L(\theta|\mathbf{y})=g(\mathbf{y}|\theta)\\ &k(\mathbf{x}|\theta,\mathbf{y})=\frac{f(\mathbf{y},\mathbf{x}|\theta)}{g(\mathbf{y}|\theta)} \end{split} \tag{11.26} \end{equation} Rearranging the last equation in (11.26) gives the identity \begin{equation} \log(L(\theta|\mathbf{y}))=\log(L(\theta|\mathbf{y},\mathbf{x}))-\log(k(\mathbf{x}|\theta,\mathbf{y})) \tag{11.27} \end{equation} As \mathbf{x} is missing, we replace the right side of (11.27) with its expectation under k(\mathbf{x}|\theta^{\prime},\mathbf{y}) for some \theta^{\prime}, creating the new identity \begin{equation} \log(L(\theta|\mathbf{y}))=E[\log(L(\theta|\mathbf{y},\mathbf{x}))|\theta^{\prime},\mathbf{y}]-E[\log(k(\mathbf{x}|\theta,\mathbf{y}))|\theta^{\prime},\mathbf{y}] \tag{11.28} \end{equation} The algorithm starts with an initial value \theta^{(0)} and creates a sequence \theta^{(r)} according to \begin{equation} \theta^{(r+1)}=\arg\max_{\theta}E[\log(L(\theta|\mathbf{y},\mathbf{x}))|\theta^{(r)},\mathbf{y}] \tag{11.29} \end{equation} The “E-step” of the algorithm calculates the expected log likelihood, and the “M-step” finds its maximum.

Example 11.7 Let (\mathbf{x},\mathbf{y})=((x_1,y_1),\cdots,(x_n,y_n)) denote the complete data and (\mathbf{x}_{(-1)},\mathbf{y})=(y_1,(x_2,y_2),\cdots,(x_n,y_n)) denote the incomplete data. The expected complete-data log likelihood is \begin{equation} \begin{split} & E[\log L(\beta,\tau_1,\cdots,\tau_n|(\mathbf{x},\mathbf{y}))|\tau_1^{(r)},(\mathbf{x}_{(-1)},\mathbf{y})]\\ &=\sum_{x_1=0}^{\infty}\log(\prod_{i=1}^n\frac{e^{-\beta\tau_i}(\beta\tau_i)^{y_i}}{y_i!}\frac{e^{-\tau_i}(\tau_i)^{x_i}}{x_i!})\frac{e^{-\tau_1^{(r)}}(\tau_1^{(r)})^{x_1}}{x_1!}\\ &=\sum_{i=1}^n[-\beta\tau_i+y_i(\log\beta+\log\tau_i)-\log y_i!]+\sum_{i=2}^n[-\tau_i+x_i\log\tau_i-\log x_i!]\\ &\quad+\sum_{x_1=0}^{\infty}[-\tau_1+x_1\log\tau_1-\log x_1!]\frac{e^{-\tau_1^{(r)}}(\tau_1^{(r)})^{x_1}}{x_1!}\\ &=(\sum_{i=1}^n[-\beta\tau_i+y_i(\log\beta+\log\tau_i)]+\sum_{i=2}^n[-\tau_i+x_i\log\tau_i]+\sum_{x_1=0}^{\infty}[-\tau_1+x_1\log\tau_1]\frac{e^{-\tau_1^{(r)}}(\tau_1^{(r)})^{x_1}}{x_1!})\\ &\quad-(\sum_{i=1}^n\log y_i!+\sum_{i=2}^n \log x_i!+\sum_{x_1=0}^{\infty}\log x_1!\frac{e^{-\tau_1^{(r)}}(\tau_1^{(r)})^{x_1}}{x_1!}) \end{split} \tag{11.30} \end{equation} The first equality in (11.30) uses (11.18), (11.24) and the last equation of (11.26). The last equality in (11.30) simply groups together the terms involving \beta and \tau_i and the terms that do not involve these parameters. We can ignore the terms without \beta and \tau_i. Thus, we only need to maximize the terms in the first set of parentheses, where the last sum can be written as \begin{equation} -\tau_1+\log\tau_1\sum_{x_1=0}^{\infty}x_1\frac{e^{-\tau_1^{(r)}}(\tau_1^{(r)})^{x_1}}{x_1!}=-\tau_1+\tau_1^{(r)}\log\tau_1 \tag{11.31} \end{equation} Substituting back into (11.30), we see that the expected complete-data log likelihood is the same as the original complete-data log likelihood, except that x_1 is replaced by \tau_1^{(r)}.
Thus, in the rth step the MLEs are only a minor variation of (11.21) and are given by \begin{equation} \begin{split} & \hat{\beta}^{(r+1)}=\frac{\sum_{i=1}^ny_i}{\hat{\tau}_1^{(r)}+\sum_{i=2}^nx_i}\\ & \hat{\tau}_1^{(r+1)}=\frac{\hat{\tau}_1^{(r)}+y_1}{\hat{\beta}^{(r+1)}+1}\\ & \hat{\tau}_i^{(r+1)}=\frac{x_i+y_i}{\hat{\beta}^{(r+1)}+1}\quad i=2,3,\cdots,n \end{split} \tag{11.32} \end{equation}

This defines both the E-step (which results in the substitution of \hat{\tau}_1^{(r)} for x_1) and the M-step (which results in the calculation in (11.32) for the MLEs at the rth iteration). The properties of the EM algorithm give us assurance that the sequence (\hat{\beta}^{(r)},\hat{\tau}_1^{(r)},\cdots,\hat{\tau}_n^{(r)}) converges to the incomplete-data MLE as r\to\infty.
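A sketch of the EM iterations (11.32) for made-up counts with x_1 missing; the fixed point should satisfy the incomplete-data MLE equations (11.25):

```python
# Made-up counts; x_1 is missing, as in the incomplete-data problem.
x = [None, 7, 2, 5]
y = [6, 15, 3, 9]

tau1 = 1.0  # initial guess tau_1^(0); any positive starting value works
for _ in range(200):
    beta = sum(y) / (tau1 + sum(x[1:]))   # beta^(r+1) from (11.32)
    tau1 = (tau1 + y[0]) / (beta + 1)     # tau_1^(r+1) from (11.32)

tau = [tau1] + [(xi + yi) / (beta + 1) for xi, yi in zip(x[1:], y[1:])]

# The limit should satisfy the incomplete-data MLE equations (11.25).
print(abs(beta - sum(y) / sum(tau)) < 1e-8,
      abs(y[0] - tau[0] * beta) < 1e-8)
```

For these counts the iterates settle at beta = 27/14 and tau_1 = y_1/beta, which indeed solves (11.25).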
Theorem 11.2 (Monotonic EM Sequence) The sequence \{\hat{\theta}^{(r)}\} defined by (11.29) satisfies \begin{equation} L(\hat{\theta}^{(r+1)}|\mathbf{y})\geq L(\hat{\theta}^{(r)}|\mathbf{y}) \tag{11.33} \end{equation} with equality holding if and only if successive iterations yield the same value of the maximized expected complete-data log likelihood, that is,
\begin{equation} E[\log(L(\hat{\theta}^{(r+1)}|\mathbf{y},\mathbf{X}))|\hat{\theta}^{(r)},\mathbf{y}]=E[\log(L(\hat{\theta}^{(r)}|\mathbf{y},\mathbf{X}))|\hat{\theta}^{(r)},\mathbf{y}] \tag{11.34} \end{equation}
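The monotonicity in (11.33) can be observed numerically for the Poisson example: tracking the incomplete-data log likelihood (constants dropped) along the iterations (11.32), with made-up counts, gives a non-decreasing sequence:

```python
import math

# Track the incomplete-data log likelihood along the EM iterations (11.32)
# for made-up counts with x_1 missing.
x = [None, 7, 2, 5]
y = [6, 15, 3, 9]

def incomplete_loglik(beta, tau):
    # log of (11.24), omitting the constant log y_i! and log x_i! terms
    ll = sum(-beta * t + yi * math.log(beta * t) for t, yi in zip(tau, y))
    ll += sum(-t + xi * math.log(t) for t, xi in zip(tau[1:], x[1:]))
    return ll

tau1, logliks = 1.0, []
for _ in range(50):
    beta = sum(y) / (tau1 + sum(x[1:]))
    tau1 = (tau1 + y[0]) / (beta + 1)
    tau = [tau1] + [(xi + yi) / (beta + 1) for xi, yi in zip(x[1:], y[1:])]
    logliks.append(incomplete_loglik(beta, tau))

print(all(b >= a - 1e-12 for a, b in zip(logliks, logliks[1:])))
```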

References

Casella, George, and Roger Berger. 2002. Statistical Inference. 2nd ed. Belmont, CA: Duxbury Resource Center.