Chapter 11 Maximum Likelihood Estimator, Bayes Estimator, EM Algorithm (Lecture on 02/04/2020)
\[\begin{equation} \begin{split} &\frac{\partial}{\partial\theta}\log(L(\theta,\sigma^2|\mathbf{x}))=\frac{1}{\sigma^2}\sum_{i=1}^n(x_i-\theta)\\ &\frac{\partial}{\partial\sigma^2}\log(L(\theta,\sigma^2|\mathbf{x}))=-\frac{n}{2\sigma^2}+\frac{1}{2\sigma^4}\sum_{i=1}^n(x_i-\theta)^2 \end{split} \tag{11.8} \end{equation}\] Setting these partial derivatives equal to 0 and solving yields the solution \(\hat{\theta}=\bar{x}\) and \(\hat{\sigma}^2=n^{-1}\sum_{i=1}^n(x_i-\bar{x})^2\). It remains to verify that this solution is a global maximum. Since \(\sum_{i=1}^n(x_i-\theta)^2>\sum_{i=1}^n(x_i-\bar{x})^2\) for all \(\theta\neq\bar{x}\), for any value of \(\sigma^2\) \[\begin{equation} \frac{1}{(2\pi\sigma^2)^{n/2}}\exp(-\frac{\sum_{i=1}^n(x_i-\bar{x})^2}{2\sigma^2})\geq\frac{1}{(2\pi\sigma^2)^{n/2}}\exp(-\frac{\sum_{i=1}^n(x_i-\theta)^2}{2\sigma^2}) \tag{11.9} \end{equation}\] Therefore, the problem reduces to a one-dimensional maximization over \(\sigma^2\) only, which attains its global maximum at \(\hat{\sigma}^2=\frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^2\). Therefore, \((\bar{X},\frac{1}{n}\sum_{i=1}^n(X_i-\bar{X})^2)\) are the MLEs.
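As a quick numerical check (a minimal sketch, assuming simulated normal data and the numpy/scipy libraries, which are not part of these notes), the closed-form MLEs can be compared against a direct numerical maximization of the log likelihood:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=200)   # simulated data (illustrative values)

# Closed-form MLEs: the sample mean and the 1/n (biased) sample variance
theta_hat = x.mean()
sigma2_hat = ((x - x.mean()) ** 2).mean()

# Numerical check: minimize the negative log likelihood over (theta, sigma2)
def neg_loglik(par):
    theta, sigma2 = par
    if sigma2 <= 0:
        return np.inf
    return 0.5 * len(x) * np.log(2 * np.pi * sigma2) + ((x - theta) ** 2).sum() / (2 * sigma2)

res = minimize(neg_loglik, x0=[0.0, 1.0], method="Nelder-Mead")
print(theta_hat, sigma2_hat)   # closed-form MLEs
print(res.x)                   # numerical optimum; should agree closely
```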
Proposition 11.1 (Check Maximum for Two Dimensional Functions) To use two-variate calculus to verify that a function \(H(\theta_1,\theta_2)\) has a local maximum at \((\hat{\theta}_1,\hat{\theta}_2)\), it must be shown that the following three conditions hold.
- The first-order partial derivatives are both 0, \[\begin{equation} \begin{split} &\frac{\partial}{\partial\theta_1}H(\theta_1,\theta_2)|_{\theta_1=\hat{\theta}_1,\theta_2=\hat{\theta}_2}=0\quad \text{and} \\ &\frac{\partial}{\partial\theta_2}H(\theta_1,\theta_2)|_{\theta_1=\hat{\theta}_1,\theta_2=\hat{\theta}_2}=0 \end{split} \tag{11.10} \end{equation}\]
- At least one second-order partial derivative is negative, \[\begin{equation} \begin{split} &\frac{\partial^2}{\partial\theta_1^2}H(\theta_1,\theta_2)|_{\theta_1=\hat{\theta}_1,\theta_2=\hat{\theta}_2}<0\quad \text{or}\\ &\frac{\partial^2}{\partial\theta_2^2}H(\theta_1,\theta_2)|_{\theta_1=\hat{\theta}_1,\theta_2=\hat{\theta}_2}<0 \end{split} \tag{11.11} \end{equation}\]
- The Jacobian of the second-order partial derivatives (the determinant of the matrix of second-order partial derivatives) is positive, \[\begin{equation} \begin{split} &\begin{vmatrix} \frac{\partial^2}{\partial\theta_1^2}H(\theta_1,\theta_2) & \frac{\partial^2}{\partial\theta_1\partial\theta_2}H(\theta_1,\theta_2)\\ \frac{\partial^2}{\partial\theta_1\partial\theta_2}H(\theta_1,\theta_2) & \frac{\partial^2}{\partial\theta_2^2}H(\theta_1,\theta_2) \end{vmatrix}_{\theta_1=\hat{\theta}_1,\theta_2=\hat{\theta}_2}\\ &=\frac{\partial^2}{\partial\theta_1^2}H(\theta_1,\theta_2)\frac{\partial^2}{\partial\theta_2^2}H(\theta_1,\theta_2)-(\frac{\partial^2}{\partial\theta_1\partial\theta_2}H(\theta_1,\theta_2))^2\Big|_{\theta_1=\hat{\theta}_1,\theta_2=\hat{\theta}_2}>0 \end{split} \tag{11.12} \end{equation}\]
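The three conditions of Proposition 11.1 can also be checked numerically. The sketch below (assuming simulated data, with finite differences standing in for the analytic second derivatives) evaluates the second and third conditions for the normal log likelihood at \((\hat{\theta},\hat{\sigma}^2)\):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.5, size=200)             # simulated data (illustrative values)

def loglik(theta, sigma2):
    return -0.5 * len(x) * np.log(2 * np.pi * sigma2) - ((x - theta) ** 2).sum() / (2 * sigma2)

theta_hat, sigma2_hat = x.mean(), ((x - x.mean()) ** 2).mean()
h = 1e-4  # step size for central finite differences

def d2(i):
    """Second partial with respect to argument i (0: theta, 1: sigma2) at the candidate maximum."""
    e = [(h, 0.0), (0.0, h)][i]
    return (loglik(theta_hat + e[0], sigma2_hat + e[1]) - 2 * loglik(theta_hat, sigma2_hat)
            + loglik(theta_hat - e[0], sigma2_hat - e[1])) / h**2

def d12():
    """Mixed second partial derivative at the candidate maximum."""
    return (loglik(theta_hat + h, sigma2_hat + h) - loglik(theta_hat + h, sigma2_hat - h)
            - loglik(theta_hat - h, sigma2_hat + h) + loglik(theta_hat - h, sigma2_hat - h)) / (4 * h**2)

H11, H22, H12 = d2(0), d2(1), d12()
print(H11 < 0 or H22 < 0)        # condition 2: at least one negative second partial
print(H11 * H22 - H12**2 > 0)    # condition 3: positive determinant
```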
Bayes Estimator
Example 11.3 (Binomial Bayes Estimation) Let \(X_1,\cdots,X_n\) be i.i.d. \(Bernoulli(p)\); then \(Y=\sum_{i=1}^nX_i\) is \(Bin(n,p)\). Assume the prior of \(p\) is \(Beta(\alpha,\beta)\). Then, by Bayes' theorem, the posterior distribution of \(p\) is \(Beta(y+\alpha,n-y+\beta)\). We can use the mean of the posterior distribution as a Bayesian point estimator of \(p\). Thus, the Bayes estimator of \(p\) is \[\begin{equation} \hat{p}_{B}=\frac{y+\alpha}{\alpha+\beta+n} \tag{11.15} \end{equation}\]
The prior mean is \(\frac{\alpha}{\alpha+\beta}\), and from the data we may use \(\hat{p}=\frac{y}{n}\) as an estimate of \(p\). The Bayes estimate of \(p\) combines all of this information. We can rewrite the Bayes estimator as \[\begin{equation} \hat{p}_{B}=(\frac{n}{n+\alpha+\beta})(\frac{y}{n})+(\frac{\alpha+\beta}{n+\alpha+\beta})(\frac{\alpha}{\alpha+\beta}) \tag{11.16} \end{equation}\] Thus, \(\hat{p}_{B}\) is a linear combination of the prior mean and the sample mean, with weights determined by \(\alpha\), \(\beta\), and \(n\). When using Bayes estimation, it is often convenient to choose a prior distribution from the conjugate family, because it yields an estimator in closed form.
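A minimal sketch of Example 11.3 (with illustrative values of \(n\), \(\alpha\), \(\beta\) that are not from the notes) confirms that the posterior mean (11.15) and the weighted-average form (11.16) agree:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p_true = 50, 0.3
alpha, beta = 2.0, 2.0                      # Beta(alpha, beta) prior (illustrative choice)
y = rng.binomial(n, p_true)                 # sufficient statistic Y = sum of the Bernoullis

# Posterior is Beta(y + alpha, n - y + beta); Bayes estimator = posterior mean (11.15)
p_bayes = (y + alpha) / (alpha + beta + n)

# Equivalent weighted-average form (11.16)
w = n / (n + alpha + beta)
p_weighted = w * (y / n) + (1 - w) * (alpha / (alpha + beta))

print(p_bayes, p_weighted)                  # identical up to floating point
```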
Example 11.4 (Normal Bayes Estimation) Let \(X\sim N(\theta,\sigma^2)\) and suppose the prior for \(\theta\) is \(N(\mu,\tau^2)\). Here we assume \(\sigma^2,\mu,\tau^2\) are all known. The posterior distribution of \(\theta\) is then \(\theta|x\sim N(\mu_*,\sigma_*^2)\) with \[\begin{equation} \begin{split} &\mu_*=\frac{\tau^2}{\tau^2+\sigma^2}x+\frac{\sigma^2}{\sigma^2+\tau^2}\mu\\ &\sigma_*^2=\frac{\sigma^2\tau^2}{\sigma^2+\tau^2} \end{split} \tag{11.17} \end{equation}\] The Bayes estimator of \(\theta\) is given by \(\mu_*\).
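The posterior update (11.17) is a one-line computation; the sketch below (with illustrative, assumed values) also previews how the weight on the observation shifts with \(\tau^2\), which is discussed next:

```python
def normal_posterior(x, mu, sigma2, tau2):
    """Posterior mean and variance of theta given one observation x ~ N(theta, sigma2)
    under the prior theta ~ N(mu, tau2), as in (11.17)."""
    w = tau2 / (tau2 + sigma2)              # weight on the observation
    mu_star = w * x + (1 - w) * mu
    sigma2_star = sigma2 * tau2 / (sigma2 + tau2)
    return mu_star, sigma2_star

# Illustrative values: a vague prior (large tau2) pulls the estimate toward x,
# while a tight prior (small tau2) pulls it toward the prior mean mu.
print(normal_posterior(x=1.0, mu=0.0, sigma2=1.0, tau2=100.0))
print(normal_posterior(x=1.0, mu=0.0, sigma2=1.0, tau2=0.01))
```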
The Bayes estimator is again a linear combination of the prior mean and the observation. If the prior information becomes more vague, meaning that \(\tau^2\) increases, more weight is given to the sample information. On the other hand, if the prior information is good, so that \(\sigma^2>\tau^2\), then more weight is given to the prior mean.
EM (Expectation-Maximization) Algorithm
Example 11.5 (Multiple Poisson Rates) We observe \(X_1,\cdots,X_n\) and \(Y_1,\cdots,Y_n\), all mutually independent, where \(Y_i\sim Pois(\beta\tau_i)\) and \(X_i\sim Pois(\tau_i)\). The joint p.m.f. is therefore \[\begin{equation} f((x_1,y_1),\cdots,(x_n,y_n)|\beta,\tau_1,\cdots,\tau_n)=\prod_{i=1}^n\frac{e^{-\beta\tau_i}(\beta\tau_i)^{y_i}}{y_i!}\frac{e^{-\tau_i}(\tau_i)^{x_i}}{x_i!} \tag{11.18} \end{equation}\] The maximum likelihood estimators can be found by straightforward differentiation of the log-likelihood function \[\begin{equation} \ell(\beta,\tau_1,\cdots,\tau_n)=\log(\beta)\sum_{i=1}^ny_i+\sum_{i=1}^n(x_i+y_i)\log(\tau_i)-(\beta+1)\sum_{i=1}^n\tau_i-\sum_{i=1}^n\log(x_i!y_i!) \tag{11.19} \end{equation}\] Taking the first derivative with respect to each parameter and setting it equal to 0, we find that the maximum likelihood estimators \(\hat{\beta},\hat{\tau}_i,i=1,\cdots,n\), satisfy \[\begin{equation} \begin{split} & \frac{\sum_{i=1}^ny_i}{\hat{\beta}}-\sum_{i=1}^n\hat{\tau}_i=0\\ & \frac{x_i+y_i}{\hat{\tau}_i}-(\hat{\beta}+1)=0\quad i=1,2,\cdots,n \end{split} \tag{11.20} \end{equation}\] Solving (11.20), we obtain the maximum likelihood estimators \[\begin{equation} \begin{split} & \hat{\beta}=\frac{\sum_{i=1}^ny_i}{\sum_{i=1}^nx_i}\\ & \hat{\tau}_i=\frac{x_i+y_i}{\hat{\beta}+1}\quad i=1,2,\cdots,n \end{split} \tag{11.21} \end{equation}\]
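A short sketch of the closed-form complete-data MLEs (11.21), assuming simulated Poisson data with hypothetical true values of \(\beta\) and the \(\tau_i\):

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta_true = 20, 2.0
tau_true = rng.uniform(1.0, 5.0, size=n)     # hypothetical true rates
x = rng.poisson(tau_true)                    # X_i ~ Pois(tau_i)
y = rng.poisson(beta_true * tau_true)        # Y_i ~ Pois(beta * tau_i)

# Closed-form complete-data MLEs from (11.21)
beta_hat = y.sum() / x.sum()
tau_hat = (x + y) / (beta_hat + 1)

print(beta_hat)        # should be roughly beta_true
print(tau_hat[:5])     # estimates of the first few tau_i
```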
The likelihood based on (11.18) is the complete-data likelihood, and \(((x_1,y_1),\cdots,(x_n,y_n))\) is called the complete data. One common complication is missing data, which makes estimation more difficult. Suppose, for instance, that \(x_1\) is missing. Discarding \(y_1\) is not a good idea, since we would lose the information in \(y_1\). Starting from the complete-data likelihood (11.18), summing out \(x_1\) gives the likelihood when \(x_1\) is missing, \[\begin{equation} \sum_{x_1=0}^{\infty}f((x_1,y_1),\cdots,(x_n,y_n)|\beta,\tau_1,\cdots,\tau_n) \tag{11.22} \end{equation}\]
Definition 11.4 (Complete Data Problem and Incomplete Data Problem) In general, we can move from the complete-data problem to the incomplete-data problem or the reverse. If \(\mathbf{Y}=(Y_1,\cdots,Y_n)\) are the incomplete data and \(\mathbf{X}=(X_1,\cdots,X_m)\) are the augmented data, making \((\mathbf{Y},\mathbf{X})\) the complete data, the densities \(g(\cdot|\theta)\) of \(\mathbf{Y}\) and \(f(\cdot|\theta)\) of \((\mathbf{Y},\mathbf{X})\) have the relationship \[\begin{equation} g(\mathbf{y}|\theta)=\int_{\mathcal{X}}f(\mathbf{y},\mathbf{x}|\theta)d\mathbf{x} \tag{11.23} \end{equation}\] with sums replacing integrals in the discrete case.
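As a sketch of the relationship (11.22)–(11.23) in this discrete example (with small illustrative values and a truncated sum, both assumptions not taken from the notes), the incomplete-data likelihood can be evaluated by summing the complete-data likelihood over the missing \(x_1\):

```python
import numpy as np
from scipy.stats import poisson

# Illustrative values (assumed): n = 3 pairs, with x_1 treated as missing
beta, tau = 2.0, np.array([2.0, 1.5, 3.0])
y = np.array([5, 2, 7])
x_obs = np.array([1, 4])                      # observed x_2, x_3

def complete_logpmf(x1):
    """log f((x_1,y_1),...,(x_n,y_n) | beta, tau_1,...,tau_n), as in (11.18)."""
    x = np.concatenate(([x1], x_obs))
    return (poisson.logpmf(y, beta * tau) + poisson.logpmf(x, tau)).sum()

# Incomplete-data log likelihood (11.22): sum x_1 out (truncated; the terms decay rapidly)
g = np.logaddexp.reduce([complete_logpmf(x1) for x1 in range(200)])

# Because the Pois(tau_1) p.m.f. sums to 1, the sum simply drops the x_1 factor:
g_direct = (poisson.logpmf(y, beta * tau) + poisson.logpmf(x_obs, tau[1:])).sum()
print(g, g_direct)                            # agree up to truncation error
```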
If we turn these into likelihoods, \(L(\theta|\mathbf{y})=g(\mathbf{y}|\theta)\) is the incomplete-data likelihood and \(L(\theta|\mathbf{y},\mathbf{x})=f(\mathbf{y},\mathbf{x}|\theta)\) is the complete-data likelihood. When either of the likelihoods is hard to work with, consider working with the other one.
Example 11.7 Let \((\mathbf{x},\mathbf{y})=((x_1,y_1),\cdots,(x_n,y_n))\) denote the complete data and \((\mathbf{x}_{(-1)},\mathbf{y})=(y_1,(x_2,y_2),\cdots,(x_n,y_n))\) denote the incomplete data. The expected complete-data log likelihood is \[\begin{equation} \begin{split} & E[\log L(\beta,\tau_1,\cdots,\tau_n|(\mathbf{x},\mathbf{y}))|\tau^{(r)},(\mathbf{x}_{(-1)},\mathbf{y})]\\ &=\sum_{x_1=0}^{\infty}\log(\prod_{i=1}^n\frac{e^{-\beta\tau_i}(\beta\tau_i)^{y_i}}{y_i!}\frac{e^{-\tau_i}(\tau_i)^{x_i}}{x_i!})\frac{e^{-\tau_1^{(r)}}(\tau_1^{(r)})^{x_1}}{x_1!}\\ &=\sum_{i=1}^n[-\beta\tau_i+y_i(\log\beta+\log\tau_i)-\log y_i!]+\sum_{i=2}^n[-\tau_i+x_i\log\tau_i-\log x_i!]\\ &+\sum_{x_1=0}^{\infty}[-\tau_1+x_1\log\tau_1-\log x_1!]\frac{e^{-\tau_1^{(r)}}(\tau_1^{(r)})^{x_1}}{x_1!}\\ &=(\sum_{i=1}^n[-\beta\tau_i+y_i(\log\beta+\log\tau_i)]+\sum_{i=2}^n[-\tau_i+x_i\log\tau_i]+\sum_{x_1=0}^{\infty}[-\tau_1+x_1\log\tau_1]\frac{e^{-\tau_1^{(r)}}(\tau_1^{(r)})^{x_1}}{x_1!})\\ &-(\sum_{i=1}^n\log y_i!+\sum_{i=2}^n \log x_i!+\sum_{x_1=0}^{\infty}\log x_1!\frac{e^{-\tau_1^{(r)}}(\tau_1^{(r)})^{x_1}}{x_1!}) \end{split} \tag{11.30} \end{equation}\] The first equality in (11.30) uses the complete-data likelihood (11.18), (11.24), and the last equation of (11.26). The last equality in (11.30) simply groups together the terms involving \(\beta\) and \(\tau_i\) and the terms that do not involve these parameters. We can ignore the terms without \(\beta\) and \(\tau_i\). Thus, we only need to maximize the terms in the first set of parentheses, where the last sum can be written as \[\begin{equation} -\tau_1+\log\tau_1\sum_{x_1=0}^{\infty}x_1\frac{e^{-\tau_1^{(r)}}(\tau_1^{(r)})^{x_1}}{x_1!}=-\tau_1+\tau_1^{(r)}\log\tau_1 \tag{11.31} \end{equation}\] since the remaining sum is the mean of a \(Pois(\tau_1^{(r)})\) distribution. Substituting back into (11.30), we see that the expected complete-data log likelihood is the same as the original complete-data log likelihood, except that \(x_1\) is replaced by \(\tau_1^{(r)}\). Thus, in the \(r\)th step the MLEs are only a minor variation of (11.21) and are given by \[\begin{equation} \begin{split} & \hat{\beta}^{(r+1)}=\frac{\sum_{i=1}^ny_i}{\hat{\tau}_1^{(r)}+\sum_{i=2}^nx_i}\\ & \hat{\tau}_1^{(r+1)}=\frac{\hat{\tau}_1^{(r)}+y_1}{\hat{\beta}^{(r+1)}+1}\\ & \hat{\tau}_i^{(r+1)}=\frac{x_i+y_i}{\hat{\beta}^{(r+1)}+1}\quad i=2,3,\cdots,n \end{split} \tag{11.32} \end{equation}\]
This defines both the E-step (which results in the substitution of \(\hat{\tau}_1^{(r)}\) for \(x_1\)) and the M-step (which results in the calculation in (11.32) of the MLEs at the \(r\)th iteration). The properties of the EM algorithm give us assurance that the sequence \((\hat{\beta}^{(r)},\hat{\tau}_1^{(r)},\cdots,\hat{\tau}_n^{(r)})\) converges to the incomplete-data MLE as \(r\to\infty\). In particular, the incomplete-data likelihood is nondecreasing along the EM sequence, \(L(\hat{\theta}^{(r+1)}|\mathbf{y})\geq L(\hat{\theta}^{(r)}|\mathbf{y})\), with equality holding if and only if successive iterations yield the same value of the maximized expected complete-data log likelihood, that is, \[\begin{equation} E[\log(L(\hat{\theta}^{(r+1)}|\mathbf{y},\mathbf{X}))|\hat{\theta}^{(r)},\mathbf{y}]=E[\log(L(\hat{\theta}^{(r)}|\mathbf{y},\mathbf{X}))|\hat{\theta}^{(r)},\mathbf{y}] \tag{11.34} \end{equation}\]
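A minimal sketch of the EM iteration (11.32) for this example, assuming simulated data and simple starting values (both assumptions not taken from the notes):

```python
import numpy as np

rng = np.random.default_rng(4)
n, beta_true = 20, 2.0
tau_true = rng.uniform(1.0, 5.0, size=n)     # hypothetical true rates
x = rng.poisson(tau_true).astype(float)
y = rng.poisson(beta_true * tau_true).astype(float)
x[0] = np.nan                                # x_1 is treated as missing

tau1 = y[0] + 1.0                            # starting value for tau_1 (an assumption)
for _ in range(200):                         # EM iterations from (11.32)
    beta = y.sum() / (tau1 + x[1:].sum())    # update beta, using tau_1^(r) in place of x_1
    tau1 = (tau1 + y[0]) / (beta + 1)        # update tau_1 (E-step substitution, then M-step)
    tau_rest = (x[1:] + y[1:]) / (beta + 1)  # update tau_2, ..., tau_n

print(beta, tau1)                            # EM estimates of beta and tau_1
print(tau_rest[:4])                          # EM estimates of the next few tau_i
```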