Chapter 11 Maximum Likelihood Estimator, Bayes Estimator, EM Algorithm (Lecture on 02/04/2020)

Definition 11.1 (Invariance Property of Maximum Likelihood Estimators) Suppose that a distribution is indexed by a parameter \(\theta\), and one is interested in finding an estimator for some function of \(\theta\), denoted by \(\tau(\theta)\). The invariance property of MLEs says that if \(\hat{\theta}\) is the MLE of \(\theta\), then \(\tau(\hat{\theta})\) is the MLE of \(\tau(\theta)\).
There are some constraints on the choice of \(\tau(\theta)\). If \(\tau(\theta)\) is one-to-one, this definition poses no difficulty. In this case, denote \(\eta=\tau(\theta)\); then the inverse function \(\tau^{-1}(\eta)=\theta\) exists, and the likelihood function of \(\eta\) is \[\begin{equation} L^*(\eta|\mathbf{x})=\prod_{i=1}^nf(x_i|\tau^{-1}(\eta))=L(\tau^{-1}(\eta)|\mathbf{x}) \tag{11.1} \end{equation}\] and \[\begin{equation} \sup_{\eta}L^*(\eta|\mathbf{x})=\sup_{\eta}L(\tau^{-1}(\eta)|\mathbf{x})=\sup_{\theta}L(\theta|\mathbf{x}) \tag{11.2} \end{equation}\] Thus, the maximum of \(L^*(\eta|\mathbf{x})\) is attained at \(\eta=\tau(\hat{\theta})\), showing that the MLE of \(\tau(\theta)\) is \(\tau(\hat{\theta})\).
Definition 11.2 (Induced Likelihood Function) For \(\tau(\theta)\), the induced likelihood function \(L^*\) is given by \[\begin{equation} L^*(\eta|\mathbf{x})=\sup_{\{\theta:\tau(\theta)=\eta\}}L(\theta|\mathbf{x}) \tag{11.3} \end{equation}\] The value \(\hat{\eta}\) that maximizes \(L^*(\eta|\mathbf{x})\) will be called the MLE of \(\eta=\tau(\theta)\).
Theorem 11.1 (Invariance Property of MLEs) If \(\hat{\theta}\) is the MLE of \(\theta\), then for any function \(\tau(\theta)\), the MLE of \(\tau(\theta)\) is \(\tau(\hat{\theta})\).
Proof. Let \(\hat{\eta}\) denote the value that maximizes \(L^*(\eta|\mathbf{x})\). We must show that \(L^*(\hat{\eta}|\mathbf{x})=L^*[\tau(\hat{\theta})|\mathbf{x}]\). By the definition of \(L^*\), the maxima of \(L\) and \(L^*\) coincide, so we have \[\begin{equation} \begin{split} L^*(\hat{\eta}|\mathbf{x})&=\sup_{\eta}\sup_{\{\theta:\tau(\theta)=\eta\}}L(\theta|\mathbf{x})\\ &=\sup_{\theta}L(\theta|\mathbf{x})\\ &=L(\hat{\theta}|\mathbf{x}) \end{split} \tag{11.4} \end{equation}\] Furthermore, \[\begin{equation} L(\hat{\theta}|\mathbf{x})=\sup_{\{\theta:\tau(\theta)=\tau(\hat{\theta})\}}L(\theta|\mathbf{x})=L^*(\tau(\hat{\theta})|\mathbf{x}) \tag{11.5} \end{equation}\] Hence \(L^*(\hat{\eta}|\mathbf{x})=L^*[\tau(\hat{\theta})|\mathbf{x}]\), and \(\tau(\hat{\theta})\) is the MLE of \(\tau(\theta)\).
The invariance property of MLEs also holds in the multivariate case. If the MLE of \((\theta_1,\cdots,\theta_k)\) is \((\hat{\theta}_1,\cdots,\hat{\theta}_k)\) and \(\tau(\theta_1,\cdots,\theta_k)\) is any function of the parameters, then the MLE of \(\tau(\theta_1,\cdots,\theta_k)\) is \(\tau(\hat{\theta}_1,\cdots,\hat{\theta}_k)\).
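As a quick numerical illustration of the invariance property (a sketch with simulated Bernoulli data; the sample size, true \(p\), and grid are arbitrary choices), maximizing the induced likelihood of the odds \(\eta=p/(1-p)\) over a grid recovers \(\tau(\hat{p})\):

```python
import numpy as np

# For X_1,...,X_n i.i.d. Bernoulli(p), the MLE of p is the sample mean;
# by invariance, the MLE of the odds tau(p) = p/(1-p) is p_hat/(1-p_hat).
rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=200)
p_hat = x.mean()

def log_lik(p):
    return x.sum() * np.log(p) + (len(x) - x.sum()) * np.log(1 - p)

# Maximize the induced likelihood L*(eta) over a grid of odds values:
# since tau is one-to-one here, L*(eta) = L(tau^{-1}(eta)).
etas = np.linspace(0.01, 2.0, 2000)
induced = log_lik(etas / (1 + etas))   # tau^{-1}(eta) = eta/(1+eta)
eta_hat = etas[np.argmax(induced)]

print(eta_hat, p_hat / (1 - p_hat))    # agree up to grid resolution
```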
Example 11.1 (Normal MLEs, Both Parameters Unknown) Let \(X_1,\cdots,X_n\) be i.i.d. \(N(\theta,\sigma^2)\), with both \(\theta\) and \(\sigma^2\) unknown. Then \[\begin{equation} L(\theta,\sigma^2|\mathbf{x})=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac{\sum_{i=1}^n(x_i-\theta)^2}{2\sigma^2}\right) \tag{11.6} \end{equation}\] and \[\begin{equation} \log(L(\theta,\sigma^2|\mathbf{x}))=-\frac{n}{2}\log(2\pi)-\frac{n}{2}\log(\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\theta)^2 \tag{11.7} \end{equation}\] The partial derivatives w.r.t. \(\theta\) and \(\sigma^2\) are
\[\begin{equation} \begin{split} &\frac{\partial}{\partial\theta}\log(L(\theta,\sigma^2|\mathbf{x}))=\frac{1}{\sigma^2}\sum_{i=1}^n(x_i-\theta)\\ &\frac{\partial}{\partial\sigma^2}\log(L(\theta,\sigma^2|\mathbf{x}))=-\frac{n}{2\sigma^2}+\frac{1}{2\sigma^4}\sum_{i=1}^n(x_i-\theta)^2 \end{split} \tag{11.8} \end{equation}\] Setting these partial derivatives equal to 0 and solving yields the solution \(\hat{\theta}=\bar{x}\) and \(\hat{\sigma}^2=n^{-1}\sum_{i=1}^n(x_i-\bar{x})^2\). It remains to verify that this solution is a global maximum. Since \(\sum_{i=1}^n(x_i-\theta)^2>\sum_{i=1}^n(x_i-\bar{x})^2\) for all \(\theta\neq\bar{x}\), for any value of \(\sigma^2\) \[\begin{equation} \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac{\sum_{i=1}^n(x_i-\bar{x})^2}{2\sigma^2}\right)\geq\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac{\sum_{i=1}^n(x_i-\theta)^2}{2\sigma^2}\right) \tag{11.9} \end{equation}\] Therefore, the problem reduces to a one-dimensional problem in \(\sigma^2\) only, which attains its global maximum at \(\hat{\sigma}^2=\frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^2\). Therefore, \((\bar{X},\frac{1}{n}\sum_{i=1}^n(X_i-\bar{X})^2)\) are the MLEs.
\(\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac{\sum_{i=1}^n(x_i-\bar{x})^2}{2\sigma^2}\right)\) is called the profile likelihood for \(\sigma^2\). As noted in Miscellanea 7.5.5 of Casella and Berger (2002), the likelihood has many modifications. Some are used to deal with nuisance parameters (such as the profile likelihood); others are used when a more robust specification is desired (such as the quasi-likelihood); and others are useful when the data are censored (such as the partial likelihood).
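The MLEs in Example 11.1 can be checked numerically; the sketch below uses simulated data (the sample size and true parameters are arbitrary choices) and verifies that the log likelihood (11.7) is largest at \((\bar{x},\hat{\sigma}^2)\):

```python
import numpy as np

# Sanity check: the log likelihood is maximized at theta_hat = x_bar and
# sigma2_hat = (1/n) * sum((x_i - x_bar)^2).
rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=500)
n = len(x)

def log_lik(theta, sigma2):
    return (-n / 2 * np.log(2 * np.pi) - n / 2 * np.log(sigma2)
            - np.sum((x - theta) ** 2) / (2 * sigma2))

theta_hat = x.mean()
sigma2_hat = np.mean((x - theta_hat) ** 2)
best = log_lik(theta_hat, sigma2_hat)

# the claimed maximum beats nearby perturbed parameter values
print(all(best > log_lik(theta_hat + dt, sigma2_hat + ds)
          for dt in (-0.1, 0.0, 0.1) for ds in (-0.1, 0.0, 0.1)
          if (dt, ds) != (0.0, 0.0)))  # True
```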

Proposition 11.1 (Check Maximum for Two Dimensional Functions) To use two-variate calculus to verify that a function \(H(\theta_1,\theta_2)\) has a local maximum at \((\hat{\theta}_1,\hat{\theta}_2)\), it must be shown that the following three conditions hold.

  1. The first-order partial derivatives are both 0, \[\begin{equation} \begin{split} &\frac{\partial}{\partial\theta_1}H(\theta_1,\theta_2)|_{\theta_1=\hat{\theta}_1,\theta_2=\hat{\theta}_2}=0\quad \text{and} \\ &\frac{\partial}{\partial\theta_2}H(\theta_1,\theta_2)|_{\theta_1=\hat{\theta}_1,\theta_2=\hat{\theta}_2}=0 \end{split} \tag{11.10} \end{equation}\]

  2. At least one second-order partial derivative is negative, \[\begin{equation} \begin{split} &\frac{\partial^2}{\partial\theta_1^2}H(\theta_1,\theta_2)|_{\theta_1=\hat{\theta}_1,\theta_2=\hat{\theta}_2}<0\quad \text{or}\\ &\frac{\partial^2}{\partial\theta_2^2}H(\theta_1,\theta_2)|_{\theta_1=\hat{\theta}_1,\theta_2=\hat{\theta}_2}<0 \end{split} \tag{11.11} \end{equation}\]

  3. The Jacobian of the second-order partial derivatives is positive, \[\begin{equation} \begin{split} &\begin{vmatrix} \frac{\partial^2}{\partial\theta_1^2}H(\theta_1,\theta_2) & \frac{\partial^2}{\partial\theta_1\partial\theta_2}H(\theta_1,\theta_2)\\ \frac{\partial^2}{\partial\theta_1\partial\theta_2}H(\theta_1,\theta_2) & \frac{\partial^2}{\partial\theta_2^2}H(\theta_1,\theta_2) \end{vmatrix}_{\theta_1=\hat{\theta}_1,\theta_2=\hat{\theta}_2}\\ &=\frac{\partial^2}{\partial\theta_1^2}H(\theta_1,\theta_2)\frac{\partial^2}{\partial\theta_2^2}H(\theta_1,\theta_2)-\left(\frac{\partial^2}{\partial\theta_1\partial\theta_2}H(\theta_1,\theta_2)\right)^2\bigg|_{\theta_1=\hat{\theta}_1,\theta_2=\hat{\theta}_2}>0 \end{split} \tag{11.12} \end{equation}\]
Example 11.2 For the normal likelihood, the second-order partial derivatives are \[\begin{equation} \begin{split} &\frac{\partial^2}{\partial\theta^2}\log L(\theta,\sigma^2|\mathbf{x})=-\frac{n}{\sigma^2}\\ &\frac{\partial^2}{\partial(\sigma^2)^2}\log L(\theta,\sigma^2|\mathbf{x})=\frac{n}{2\sigma^4}-\frac{1}{\sigma^6}\sum_{i=1}^n(x_i-\theta)^2\\ &\frac{\partial^2}{\partial\theta\partial\sigma^2}\log L(\theta,\sigma^2|\mathbf{x})=-\frac{1}{\sigma^4}\sum_{i=1}^n(x_i-\theta)\\ \end{split} \tag{11.13} \end{equation}\] and the Jacobian is \[\begin{equation} \begin{split} |J|&=\frac{1}{\sigma^6}\left[-\frac{n^2}{2}+\frac{n}{\sigma^2}\sum_{i=1}^n(x_i-\theta)^2-\frac{1}{\sigma^2}\left(\sum_{i=1}^n(x_i-\theta)\right)^2\right]\bigg|_{\theta=\hat{\theta},\sigma^2=\hat{\sigma}^2}\\ &=\frac{1}{\hat{\sigma}^6}\left[-\frac{n^2}{2}+n^2-\frac{1}{\hat{\sigma}^2}\left(\sum_{i=1}^n(x_i-\bar{x})\right)^2\right]\\ &=\frac{n^2}{2\hat{\sigma}^6}>0 \end{split} \tag{11.14} \end{equation}\] where the second equality uses \(\frac{n}{\hat{\sigma}^2}\sum_{i=1}^n(x_i-\bar{x})^2=n^2\) and the last uses \(\sum_{i=1}^n(x_i-\bar{x})=0\). Thus, the calculus conditions are satisfied and we have indeed found a maximum. (Strictly speaking, we still need to check the boundary.)
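The three conditions of Proposition 11.1 can also be checked by finite differences; the sketch below does so for the normal log likelihood with simulated data (the sample, step size `h`, and tolerances are arbitrary choices):

```python
import numpy as np

# Finite-difference check of conditions 1-3 at the MLE.
rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=200)
n = len(x)

def H(theta, s2):
    return (-n / 2 * np.log(2 * np.pi) - n / 2 * np.log(s2)
            - np.sum((x - theta) ** 2) / (2 * s2))

t_hat = x.mean()
s2_hat = np.mean((x - t_hat) ** 2)
h = 1e-4

# second-order central differences
H11 = (H(t_hat + h, s2_hat) - 2 * H(t_hat, s2_hat) + H(t_hat - h, s2_hat)) / h**2
H22 = (H(t_hat, s2_hat + h) - 2 * H(t_hat, s2_hat) + H(t_hat, s2_hat - h)) / h**2
H12 = (H(t_hat + h, s2_hat + h) - H(t_hat + h, s2_hat - h)
       - H(t_hat - h, s2_hat + h) + H(t_hat - h, s2_hat - h)) / (4 * h**2)

det = H11 * H22 - H12**2
print(H11 < 0, H22 < 0, det > 0)   # conditions 2 and 3 hold: True True True
# det should match the analytic value n^2 / (2 * sigma2_hat^3) from (11.14)
```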
MLEs can suffer from numerical instability. We calculate \(\hat{\theta}\) based on \(L(\theta|\mathbf{x})\), but we may ask what value we would get for the MLE if we based our calculations on \(L(\theta|\mathbf{x}+\epsilon)\) for some small \(\epsilon\). Intuitively, this new MLE should be close to \(\hat{\theta}\), but this is not always the case. Such occurrences happen when the likelihood function is very flat in the neighborhood of its maximum or when there is no finite maximum. When MLEs cannot be found explicitly and numerical solutions are required, it is useful to investigate the stability of the solution.

Bayes Estimator

Example 11.3 (Binomial Bayes Estimation) Let \(X_1,\cdots,X_n\) be i.i.d. \(Bernoulli(p)\), so that \(Y=\sum_{i=1}^nX_i\) is \(Bin(n,p)\). Assume the prior of \(p\) is \(Beta(\alpha,\beta)\). Then, using Bayes' theorem, the posterior distribution of \(p\) is \(Beta(y+\alpha,n-y+\beta)\). We can use the mean of the posterior distribution as a Bayesian point estimator of \(p\). Thus, the Bayes estimator of \(p\) is \[\begin{equation} \hat{p}_{B}=\frac{y+\alpha}{\alpha+\beta+n} \tag{11.15} \end{equation}\]

The prior mean is \(\frac{\alpha}{\alpha+\beta}\), and from the data we may use \(\frac{y}{n}\) as an estimate of \(p\). The Bayes estimate of \(p\) combines all of this information. We can rewrite the Bayes estimator as \[\begin{equation} \hat{p}_{B}=(\frac{n}{n+\alpha+\beta})(\frac{y}{n})+(\frac{\alpha+\beta}{n+\alpha+\beta})(\frac{\alpha}{\alpha+\beta}) \tag{11.16} \end{equation}\] Thus, \(\hat{p}_{B}\) is a linear combination of the prior mean and the sample mean, with weights determined by \(\alpha,\beta\), and \(n\).
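The weighted-average identity (11.16) is easy to verify in code; the numbers below are purely illustrative:

```python
# Posterior mean of Beta(y + alpha, n - y + beta), as in (11.15).
def binom_bayes(y, n, alpha, beta):
    return (y + alpha) / (alpha + beta + n)

y, n, alpha, beta = 13, 20, 2.0, 3.0
direct = binom_bayes(y, n, alpha, beta)

# the same estimator written as the weighted average (11.16)
w = n / (n + alpha + beta)
weighted = w * (y / n) + (1 - w) * (alpha / (alpha + beta))
print(direct, weighted)  # 0.6 0.6
```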

When using Bayes estimation, it is often convenient to choose the prior distribution from a conjugate family, because doing so yields a closed-form estimator.

Definition 11.3 (Conjugate Family) Let \(\mathcal{F}\) denote the class of p.d.f.s or p.m.f.s \(f(x|\theta)\) (indexed by \(\theta\)). A class \(\Pi\) of prior distributions is a conjugate family for \(\mathcal{F}\) if the posterior distribution is in the class \(\Pi\) for all \(f\in\mathcal{F}\), all priors in \(\Pi\), and all \(x\in\mathcal{X}\).
When the prior belongs to a conjugate family for the likelihood, updating the prior takes the form of updating its parameters.

Example 11.4 (Normal Bayes Estimation) Let \(X\sim N(\theta,\sigma^2)\) and suppose the prior for \(\theta\) is \(N(\mu,\tau^2)\). Here we assume \(\sigma^2,\mu,\tau^2\) are all known. The posterior distribution of \(\theta\) is \(\theta|x\sim N(\mu_*,\sigma_*^2)\) with \[\begin{equation} \begin{split} &\mu_*=\frac{\tau^2}{\tau^2+\sigma^2}x+\frac{\sigma^2}{\sigma^2+\tau^2}\mu\\ &\sigma_*^2=\frac{\sigma^2\tau^2}{\sigma^2+\tau^2} \end{split} \tag{11.17} \end{equation}\] The Bayes estimator of \(\theta\) is given by \(\mu_*\).

The Bayes estimator is again a linear combination of the prior mean and the observation. If the prior information becomes more vague, meaning that \(\tau^2\) increases, more weight is given to the sample information. On the other hand, if the prior information is precise, so that \(\sigma^2>\tau^2\), then more weight is given to the prior mean.
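The conjugate update (11.17) and the shrinkage behavior just described can be sketched as follows (all numbers are illustrative):

```python
# Conjugate normal update (11.17) for a single observation x.
def normal_posterior(x, mu, sigma2, tau2):
    w = tau2 / (tau2 + sigma2)               # weight on the observation
    mu_star = w * x + (1 - w) * mu
    sigma2_star = sigma2 * tau2 / (sigma2 + tau2)
    return mu_star, sigma2_star

x, mu, sigma2 = 4.0, 0.0, 1.0
m_vague, _ = normal_posterior(x, mu, sigma2, tau2=100.0)  # vague prior
m_tight, _ = normal_posterior(x, mu, sigma2, tau2=0.01)   # strong prior
print(m_vague, m_tight)  # near x = 4.0, near mu = 0.0
```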

EM (Expectation-Maximization) Algorithm

The EM algorithm does not detail a closed-form procedure for solving for the MLE; rather, it produces a sequence of estimates that is guaranteed to converge to the MLE. It is based on the idea of replacing one difficult likelihood maximization with a sequence of easier maximizations whose limit is the answer to the original problem.

Example 11.5 (Multiple Poisson Rates) We observe \(X_1,\cdots,X_n\) and \(Y_1,\cdots,Y_n\), all mutually independent, where \(Y_i\sim Pois(\beta\tau_i)\) and \(X_i\sim Pois(\tau_i)\). The joint p.m.f. is therefore \[\begin{equation} f((x_1,y_1),\cdots,(x_n,y_n)|\beta,\tau_1,\cdots,\tau_n)=\prod_{i=1}^n\frac{e^{-\beta\tau_i}(\beta\tau_i)^{y_i}}{y_i!}\frac{e^{-\tau_i}(\tau_i)^{x_i}}{x_i!} \tag{11.18} \end{equation}\] The maximum likelihood estimators can be found by straightforward differentiation of the log-likelihood function \[\begin{equation} \ell(\beta,\tau_1,\cdots,\tau_n)=\log(\beta)\sum_{i=1}^ny_i+\sum_{i=1}^n(x_i+y_i)\log(\tau_i)-(\beta+1)\sum_{i=1}^n\tau_i-\sum_{i=1}^n\log(x_i!y_i!) \tag{11.19} \end{equation}\] Setting the first derivatives w.r.t. all parameters equal to 0, we find that the MLEs \(\hat{\beta},\hat{\tau}_j,j=1,\cdots,n\), satisfy \[\begin{equation} \begin{split} & \frac{\sum_{i=1}^ny_i}{\hat{\beta}}-\sum_{i=1}^n\hat{\tau}_i=0\\ & \frac{x_i+y_i}{\hat{\tau}_i}-(\hat{\beta}+1)=0\quad i=1,2,\cdots,n \end{split} \tag{11.20} \end{equation}\] Solving (11.20), we obtain the MLEs \[\begin{equation} \begin{split} & \hat{\beta}=\frac{\sum_{i=1}^ny_i}{\sum_{i=1}^nx_i}\\ & \hat{\tau}_i=\frac{x_i+y_i}{\hat{\beta}+1}\quad i=1,2,\cdots,n \end{split} \tag{11.21} \end{equation}\]
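The closed-form solution (11.21) can be checked against the score equations (11.20); the sketch below uses simulated data (the sample size, \(\beta\), and the \(\tau_i\) are arbitrary choices):

```python
import numpy as np

# Simulate the multiple-Poisson model and compute the closed-form MLEs.
rng = np.random.default_rng(3)
n, beta_true = 10, 2.0
tau_true = rng.uniform(2.0, 5.0, size=n)
x = rng.poisson(tau_true)
y = rng.poisson(beta_true * tau_true)

beta_hat = y.sum() / x.sum()                 # first line of (11.21)
tau_hat = (x + y) / (beta_hat + 1)           # second line of (11.21)

# both score equations in (11.20) hold at the estimates (up to rounding)
print(np.isclose(y.sum() / beta_hat, tau_hat.sum()))   # True
print(np.allclose(tau_hat * (beta_hat + 1), x + y))    # True
```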

The likelihood based on (11.18) is the complete-data likelihood, and \(((x_1,y_1),\cdots,(x_n,y_n))\) is called the complete data. One common occurrence is missing data, which makes estimation more difficult. Suppose, for instance, that \(x_1\) is missing. Discarding \(y_1\) is not a good idea, since we would lose the information in \(y_1\). Starting from the complete-data likelihood (11.18), summing out \(x_1\) gives the likelihood when \(x_1\) is missing: \[\begin{equation} \sum_{x_1=0}^{\infty}f((x_1,y_1),\cdots,(x_n,y_n)|\beta,\tau_1,\cdots,\tau_n) \tag{11.22} \end{equation}\]

Definition 11.4 (Complete Data Problem and Incomplete Data Problem) In general, we can move from the complete-data problem to the incomplete-data problem or the reverse. If \(\mathbf{Y}=(Y_1,\cdots,Y_n)\) are the incomplete data and \(\mathbf{X}=(X_1,\cdots,X_m)\) are the augmented data, making \((\mathbf{Y},\mathbf{X})\) the complete data, the densities \(g(\cdot|\theta)\) of \(\mathbf{Y}\) and \(f(\cdot|\theta)\) of \((\mathbf{Y},\mathbf{X})\) have the relationship \[\begin{equation} g(\mathbf{y}|\theta)=\int_{\mathcal{X}}f(\mathbf{y},\mathbf{x}|\theta)d\mathbf{x} \tag{11.23} \end{equation}\] with sums replacing integrals in the discrete case.

If we turn these into likelihoods, \(L(\theta|\mathbf{y})=g(\mathbf{y}|\theta)\) is the incomplete-data likelihood and \(L(\theta|\mathbf{y},\mathbf{x})=f(\mathbf{y},\mathbf{x}|\theta)\) is the complete-data likelihood. When either likelihood is hard to work with, consider working with the other.
Example 11.6 The incomplete-data likelihood is obtained from (11.18) by summing over \(x_1\). This gives \[\begin{equation} L(\beta,\tau_1,\cdots,\tau_n|y_1,(x_2,y_2),\cdots,(x_n,y_n))=\left[\prod_{i=1}^n\frac{e^{-\beta\tau_i}(\beta\tau_i)^{y_i}}{y_i!}\right]\left[\prod_{i=2}^n\frac{e^{-\tau_i}(\tau_i)^{x_i}}{x_i!}\right] \tag{11.24} \end{equation}\] and \((y_1,(x_2,y_2),\cdots,(x_n,y_n))\) is the incomplete data. This is the likelihood that we need to maximize. Differentiation leads to the MLE equations \[\begin{equation} \begin{split} &\hat{\beta}=\frac{\sum_{i=1}^ny_i}{\sum_{i=1}^n\hat{\tau}_i}\\ &y_1=\hat{\tau}_1\hat{\beta}\\ &x_j+y_j=\hat{\tau}_j(\hat{\beta}+1)\quad j=2,3,\cdots,n \end{split} \tag{11.25} \end{equation}\] which we now solve with the EM algorithm.
Definition 11.5 (EM Algorithm) The EM algorithm allows us to maximize \(L(\theta|\mathbf{y})\) by working with only \(L(\theta|\mathbf{y},\mathbf{x})\) and the conditional p.d.f. or p.m.f. of \(\mathbf{X}\) given \(\mathbf{y}\) and \(\theta\), defined by \[\begin{equation} \begin{split} &L(\theta|\mathbf{y},\mathbf{x})=f(\mathbf{y},\mathbf{x}|\theta)\\ &L(\theta|\mathbf{y})=g(\mathbf{y}|\theta)\\ &k(\mathbf{x}|\theta,\mathbf{y})=\frac{f(\mathbf{y},\mathbf{x}|\theta)}{g(\mathbf{y}|\theta)} \end{split} \tag{11.26} \end{equation}\] Rearranging the last equation in (11.26) gives the identity \[\begin{equation} \log(L(\theta|\mathbf{y}))=\log(L(\theta|\mathbf{y},\mathbf{x}))-\log(k(\mathbf{x}|\theta,\mathbf{y})) \tag{11.27} \end{equation}\] As \(\mathbf{x}\) is missing, we replace the right side of (11.27) with its expectation under \(k(\mathbf{x}|\theta^{\prime},\mathbf{y})\) for some \(\theta^{\prime}\), creating the new identity \[\begin{equation} \log(L(\theta|\mathbf{y}))=E[\log(L(\theta|\mathbf{y},\mathbf{x}))|\theta^{\prime},\mathbf{y}]-E[\log(k(\mathbf{x}|\theta,\mathbf{y}))|\theta^{\prime},\mathbf{y}] \tag{11.28} \end{equation}\] The algorithm starts with an initial value \(\theta^{(0)}\) and creates a sequence \(\theta^{(r)}\) according to \[\begin{equation} \theta^{(r+1)}=\arg\max_{\theta}E[\log(L(\theta|\mathbf{y},\mathbf{x}))|\theta^{(r)},\mathbf{y}] \tag{11.29} \end{equation}\] The “E-step” of the algorithm calculates the expected log likelihood, and the “M-step” finds its maximum.

Example 11.7 Let \((\mathbf{x},\mathbf{y})=((x_1,y_1),\cdots,(x_n,y_n))\) denote the complete data and \((\mathbf{x}_{(-1)},\mathbf{y})=(y_1,(x_2,y_2),\cdots,(x_n,y_n))\) denote the incomplete data. The expected complete-data log likelihood is \[\begin{equation} \begin{split} & E[\log L(\beta,\tau_1,\cdots,\tau_n|(\mathbf{x},\mathbf{y}))|\tau_1^{(r)},(\mathbf{x}_{(-1)},\mathbf{y})]\\ &=\sum_{x_1=0}^{\infty}\log\left(\prod_{i=1}^n\frac{e^{-\beta\tau_i}(\beta\tau_i)^{y_i}}{y_i!}\frac{e^{-\tau_i}(\tau_i)^{x_i}}{x_i!}\right)\frac{e^{-\tau_1^{(r)}}(\tau_1^{(r)})^{x_1}}{x_1!}\\ &=\sum_{i=1}^n[-\beta\tau_i+y_i(\log\beta+\log\tau_i)-\log y_i!]+\sum_{i=2}^n[-\tau_i+x_i\log\tau_i-\log x_i!]\\ &\quad+\sum_{x_1=0}^{\infty}[-\tau_1+x_1\log\tau_1-\log x_1!]\frac{e^{-\tau_1^{(r)}}(\tau_1^{(r)})^{x_1}}{x_1!}\\ &=\left(\sum_{i=1}^n[-\beta\tau_i+y_i(\log\beta+\log\tau_i)]+\sum_{i=2}^n[-\tau_i+x_i\log\tau_i]+\sum_{x_1=0}^{\infty}[-\tau_1+x_1\log\tau_1]\frac{e^{-\tau_1^{(r)}}(\tau_1^{(r)})^{x_1}}{x_1!}\right)\\ &\quad-\left(\sum_{i=1}^n\log y_i!+\sum_{i=2}^n \log x_i!+\sum_{x_1=0}^{\infty}\log x_1!\frac{e^{-\tau_1^{(r)}}(\tau_1^{(r)})^{x_1}}{x_1!}\right) \end{split} \tag{11.30} \end{equation}\] The first equality in (11.30) uses (11.18), (11.24), and the last equation of (11.26). The last equality simply groups together terms involving \(\beta\) and the \(\tau_i\) and terms that do not involve these parameters. We can ignore the terms without \(\beta\) and \(\tau_i\). Thus, we only need to maximize the terms in the first set of parentheses, where the last sum can be written as \[\begin{equation} -\tau_1+\log\tau_1\sum_{x_1=0}^{\infty}x_1\frac{e^{-\tau_1^{(r)}}(\tau_1^{(r)})^{x_1}}{x_1!}=-\tau_1+\tau_1^{(r)}\log\tau_1 \tag{11.31} \end{equation}\] Substituting back into (11.30), we see that the expected complete-data log likelihood is the same as the original complete-data log likelihood, except that \(x_1\) is replaced by \(\tau_1^{(r)}\). 
Thus, in the \(r\)th step the MLEs are only a minor variation of (11.21) and are given by \[\begin{equation} \begin{split} & \hat{\beta}^{(r+1)}=\frac{\sum_{i=1}^ny_i}{\hat{\tau}_1^{(r)}+\sum_{i=2}^nx_i}\\ & \hat{\tau}_1^{(r+1)}=\frac{\hat{\tau}_1^{(r)}+y_1}{\hat{\beta}^{(r+1)}+1}\\ & \hat{\tau}_i^{(r+1)}=\frac{x_i+y_i}{\hat{\beta}^{(r+1)}+1}\quad i=2,3,\cdots,n \end{split} \tag{11.32} \end{equation}\]

This defines both the E-step (which results in the substitution of \(\hat{\tau}_1^{(r)}\) for \(x_1\)) and the M-step (which results in the calculation in (11.32) for the MLEs at the \(r\)th iteration). The properties of the EM algorithm give us assurance that the sequence \((\hat{\beta}^{(r)},\hat{\tau}_1^{(r)},\cdots,\hat{\tau}_n^{(r)})\) converges to the incomplete-data MLE as \(r\to\infty\).
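The iteration (11.32) can be sketched in a few lines; below, the data are simulated and \(x_1\) is then treated as missing (the starting value and iteration count are arbitrary choices):

```python
import numpy as np

# EM for the missing-x1 Poisson problem of Examples 11.6 and 11.7.
rng = np.random.default_rng(4)
n, beta_true = 10, 2.0
tau_true = rng.uniform(2.0, 5.0, size=n)
x = rng.poisson(tau_true).astype(float)   # x[0] will be treated as missing
y = rng.poisson(beta_true * tau_true).astype(float)

tau1 = 1.0                                # starting value tau_1^{(0)}
for _ in range(200):
    # E-step: the missing x_1 is replaced by its expectation tau_1^{(r)};
    # M-step: the complete-data MLEs (11.21) with that substitution.
    beta = y.sum() / (tau1 + x[1:].sum())
    tau1 = (tau1 + y[0]) / (beta + 1)
tau_rest = (x[1:] + y[1:]) / (beta + 1)

# at convergence the incomplete-data MLE equations (11.25) hold,
# in particular y_1 = tau_1_hat * beta_hat
print(np.isclose(y[0], tau1 * beta))      # True
```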
Theorem 11.2 (Monotonic EM Sequence) The sequence \(\{\hat{\theta}^{(r)}\}\) defined by (11.29) satisfies \[\begin{equation} L(\hat{\theta}^{(r+1)}|\mathbf{y})\geq L(\hat{\theta}^{(r)}|\mathbf{y}) \tag{11.33} \end{equation}\] with equality holding if and only if successive iterations yield the same value of the maximized expected complete-data log likelihood, that is,
\[\begin{equation} E[\log(L(\hat{\theta}^{(r+1)}|\mathbf{y},\mathbf{X}))|\hat{\theta}^{(r)},\mathbf{y}]=E[\log(L(\hat{\theta}^{(r)}|\mathbf{y},\mathbf{X}))|\hat{\theta}^{(r)},\mathbf{y}] \tag{11.34} \end{equation}\]

References

Casella, George, and Roger Berger. 2002. Statistical Inference. 2nd ed. Belmont, CA: Duxbury Resource Center.