Chapter 9 Method of Moments, Maximum Likelihood Estimator (Lecture on 01/28/2020)

Definition 9.1 (Point Estimator) A point estimator is any function \(W(X_1,\cdots,X_n)\) of a sample; that is, any statistic is a point estimator.

The definition makes no mention of any correspondence between the estimator and the parameter it is to estimate.

An estimator is a function of the sample, while an \(estimate\) is the realized value of an estimator (that is, a number) that is obtained when a sample is actually taken.

The method of moments is perhaps the oldest method of finding point estimators. It is simple to use and almost always yields some sort of estimate. However, the estimators it yields can often be improved upon. It is nevertheless a good starting point, especially when other methods prove intractable.

Definition 9.2 (Method of Moments) Let \(X_1,\cdots,X_n\) be a sample from a population with p.d.f. or p.m.f. \(f(x|\theta_1,\cdots,\theta_k)\). Method of moments estimators are found by equating the first \(k\) sample moments to the corresponding \(k\) population moments, and solving the resulting system of simultaneous equations. That is, define \[\begin{equation} \begin{split} & m_1=\frac{1}{n}\sum_{i=1}^nX_i,\quad \mu^{\prime}_1=EX\\ & m_2=\frac{1}{n}\sum_{i=1}^nX_i^2,\quad \mu^{\prime}_2=EX^2\\ & \vdots\\ & m_k=\frac{1}{n}\sum_{i=1}^nX_i^k,\quad \mu^{\prime}_k=EX^k\\ \end{split} \tag{9.1} \end{equation}\]

The population moments \(\mu^{\prime}_j\) will typically be a function of \(\theta_1,\cdots,\theta_k\), denoted as \(\mu^{\prime}_j(\theta_1,\cdots,\theta_k)\). The method of moments estimator \((\tilde{\theta}_1,\cdots,\tilde{\theta}_k)\) is obtained by solving the following system of equations for \((\theta_1,\cdots,\theta_k)\) in terms of \((m_1,\cdots,m_k)\): \[\begin{equation} \begin{split} & m_1=\mu^{\prime}_1(\theta_1,\cdots,\theta_k)\\ & m_2=\mu^{\prime}_2(\theta_1,\cdots,\theta_k)\\ & \vdots\\ & m_k=\mu^{\prime}_k(\theta_1,\cdots,\theta_k)\\ \end{split} \tag{9.2} \end{equation}\]
We equate the first \(k\) sample moments because the p.m.f. or p.d.f. has \(k\) parameters.

Example 9.1 (Normal method of moments) Suppose \(X_1,\cdots,X_n\) are i.i.d. \(N(\theta,\sigma^2)\). We have \(m_1=\bar{X}\), \(m_2=\frac{1}{n}\sum_{i=1}^nX_i^2\), \(\mu_1^{\prime}=\theta\), \(\mu_2^{\prime}=\theta^2+\sigma^2\). Hence, we need to solve the following system of equations: \[\begin{equation} \left\{\begin{aligned} &\bar{X}=\theta \\ &\frac{1}{n}\sum_{i=1}^nX_i^2=\theta^2+\sigma^2 \end{aligned} \right. \tag{9.3} \end{equation}\]

Solving for \(\theta\) and \(\sigma^2\) yields the method of moments estimators \[\begin{equation} \begin{split} & \tilde{\theta}=\bar{X}\\ & \tilde{\sigma}^2=\frac{1}{n}\sum_{i=1}^nX_i^2-\bar{X}^2=\frac{1}{n}\sum_{i=1}^n(X_i-\bar{X})^2 \end{split} \tag{9.4} \end{equation}\]
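As a quick numerical illustration of Example 9.1, the following sketch (assuming NumPy is available; the seed, sample size, and true parameter values are arbitrary choices) computes the method of moments estimates in (9.4) from simulated normal data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
theta_true, sigma_true = 2.0, 1.5           # hypothetical true parameters
x = rng.normal(theta_true, sigma_true, n)   # simulated N(theta, sigma^2) sample

m1 = x.mean()               # first sample moment
m2 = np.mean(x**2)          # second sample moment

theta_mom = m1              # solves m1 = theta
sigma2_mom = m2 - m1**2     # solves m2 = theta^2 + sigma^2
print(theta_mom, sigma2_mom)
```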
Example 9.2 (Binomial method of moments) Let \(X_1,\cdots,X_n\) be i.i.d. \(Bin(k,p)\), where both \(k\) and \(p\) are unknown. Equating the first two sample moments yields the system of equations \[\begin{equation} \left\{\begin{aligned} &\bar{X}=kp \\ &\frac{1}{n}\sum_{i=1}^nX_i^2=kp(1-p)+k^2p^2 \end{aligned} \right. \tag{9.5} \end{equation}\] Solving (9.5) for \(k\) and \(p\) yields \[\begin{equation} \begin{split} & \tilde{k}=\frac{\bar{X}^2}{\bar{X}-(1/n)\sum_{i=1}^n(X_i-\bar{X})^2}\\ & \tilde{p}=\frac{\bar{X}}{\tilde{k}} \end{split} \tag{9.6} \end{equation}\]
These are not the best estimators of the two parameters. It is possible to get negative estimates of \(k\) and \(p\), a case where the range of the estimator does not coincide with the range of the parameter it is estimating.
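A small simulation makes the last point concrete. In the sketch below (assuming NumPy; the true \(k\), \(p\), and sample size are illustrative), \(\tilde{k}\) in (9.6) becomes negative whenever the sample variance exceeds the sample mean.

```python
import numpy as np

rng = np.random.default_rng(1)
k_true, p_true, n = 10, 0.2, 20
x = rng.binomial(k_true, p_true, n)    # simulated Bin(k, p) sample

xbar = x.mean()
s2 = np.mean((x - xbar) ** 2)          # (1/n) * sum (X_i - Xbar)^2

k_mom = xbar**2 / (xbar - s2)          # negative whenever s2 > xbar
p_mom = xbar / k_mom
print(k_mom, p_mom)
```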

The method of moments can be very useful in obtaining approximations to the distributions of statistics. This technique is called moment matching.

Example 9.3 (Satterthwaite approximation) If \(Y_i\), \(i=1,\cdots,k\), are independent \(\chi_{r_i}^2\) random variables, we have seen that the distribution of \(\sum_{i=1}^kY_i\) is also chi squared, with degrees of freedom equal to \(\sum_{i=1}^k r_i\). However, the distribution of a weighted sum \(\sum_{i=1}^ka_iY_i\), where the \(a_i\)s are known constants, is in general hard to obtain. It is reasonable to assume that \(\chi^2_{\nu}/\nu\) for some \(\nu\) is a good approximation. This was essentially Satterthwaite's problem, where \(\sum_{i=1}^ka_iY_i\) represented the square of the denominator of a \(t\) statistic whose distribution he wanted to approximate. Hence, for given \(a_1,\cdots,a_k\), he wanted to find a value of \(\nu\) such that \[\begin{equation} \sum_{i=1}^ka_iY_i\sim \frac{\chi_{\nu}^2}{\nu}\quad (approximately) \tag{9.7} \end{equation}\]

Since \(E(\chi_{\nu}^2/\nu)=1\), to match first moments we need \[\begin{equation} E(\sum_{i=1}^ka_iY_i)=\sum_{i=1}^ka_ir_i=1 \tag{9.8} \end{equation}\]

This is a constraint on the \(a_i\)s. To estimate \(\nu\), we must match the second moments, which gives \[\begin{equation} E(\sum_{i=1}^ka_iY_i)^2=E(\frac{\chi_{\nu}^2}{\nu})^2=\frac{2}{\nu}+1 \end{equation}\] Applying the method of moments, we drop the first expectation and solve for \(\nu\), which yields \[\begin{equation} \hat{\nu}=\frac{2}{(\sum_{i=1}^ka_iY_i)^2-1} \tag{9.9} \end{equation}\] However, this estimator can be negative. Hence, consider \[\begin{equation} \begin{split} E(\sum_{i=1}^ka_iY_i)^2&=Var(\sum_{i=1}^ka_iY_i)+(E(\sum_{i=1}^ka_iY_i))^2\\ &=(E(\sum_{i=1}^ka_iY_i))^2[\frac{Var(\sum_{i=1}^ka_iY_i)}{(E(\sum_{i=1}^ka_iY_i))^2}+1]\\ &=\frac{Var(\sum_{i=1}^ka_iY_i)}{(E(\sum_{i=1}^ka_iY_i))^2}+1 \end{split} \tag{9.10} \end{equation}\] where the last equality holds because of (9.8). Equating second moments now gives \[\begin{equation} \nu=\frac{2(E(\sum_{i=1}^ka_iY_i))^2}{Var(\sum_{i=1}^ka_iY_i)} \tag{9.11} \end{equation}\]

Finally, since \(Y_1,\cdots,Y_k\) are independent chi squared random variables with \(EY_i=r_i\) and \(Var(Y_i)=2r_i\), \[\begin{equation} Var(\sum_{i=1}^ka_iY_i)=\sum_{i=1}^ka_i^2Var(Y_i)=2\sum_{i=1}^k\frac{a_i^2(EY_i)^2}{r_i} \tag{9.12} \end{equation}\] Substituting into (9.11) and dropping the expectations (replacing \(EY_i\) by \(Y_i\)) yields \[\begin{equation} \hat{\nu}=\frac{(\sum_{i=1}^ka_iY_i)^2}{\sum_{i=1}^k\frac{a_i^2}{r_i}Y_i^2} \tag{9.13} \end{equation}\] This estimator is always positive.
An estimator must be a function of the data alone, so it cannot involve the unknown quantities \(EY_i\) or \(Var(Y_i)\). We therefore keep the expectations until the last step and only then drop them, replacing \(EY_i\) with \(Y_i\).
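The calculation in (9.13) is easy to carry out numerically. Below is a minimal sketch (assuming NumPy is available; the degrees of freedom and weights are arbitrary illustrative choices, with the weights scaled so that \(\sum_ia_ir_i=1\) as required by (9.8)).

```python
import numpy as np

rng = np.random.default_rng(2)
r = np.array([3.0, 5.0, 7.0])            # degrees of freedom r_i (illustrative)
a = np.array([0.2, 0.3, 0.5]) / r        # weights with sum(a_i * r_i) = 1, per (9.8)
y = rng.chisquare(r)                     # one draw of each Y_i ~ chi^2_{r_i}

weighted_sum = np.sum(a * y)             # the quantity approximated in (9.7)
nu_hat = weighted_sum**2 / np.sum(a**2 * y**2 / r)   # Satterthwaite estimator (9.13)
print(nu_hat)
```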

Maximum likelihood estimation is by far the most popular technique for deriving estimators.

Definition 9.3 (Maximum Likelihood Estimators) If \(X_1,\cdots,X_n\) are an i.i.d. sample from a population with p.d.f. or p.m.f. \(f(x|\theta_1,\cdots,\theta_k)\), the likelihood function is defined by \[\begin{equation} L(\theta|\mathbf{x})=L(\theta_1,\cdots,\theta_k|x_1,\cdots,x_n)=\prod_{i=1}^nf(x_i|\theta_1,\cdots,\theta_k) \tag{9.14} \end{equation}\]

For each sample point \(\mathbf{x}\), let \(\hat{\theta}(\mathbf{x})\) be a parameter value at which \(L(\theta|\mathbf{x})\) attains its maximum as a function of \(\theta\), with \(\mathbf{x}\) held fixed. A maximum likelihood estimator (MLE) of the parameter \(\theta\) based on a sample \(\mathbf{X}\) is \(\hat{\theta}(\mathbf{X})\).
  • The range of the MLE coincides with the range of the parameter.

  • The MLE is the parameter point for which the observed sample is most likely.

  • Drawbacks include the difficulty of finding the global maximum of the likelihood and verifying that a candidate is in fact the global maximum, and numerical sensitivity (how sensitive the estimate is to small changes in the data).

If the likelihood function is differentiable (in \(\theta_i\)), possible candidates for the MLE are the values of \((\theta_1,\cdots,\theta_k)\) that solve \[\begin{equation} \frac{\partial}{\partial\theta_i}L(\boldsymbol{\theta}|\mathbf{x})=0,\quad i=1,\cdots,k \tag{9.15} \end{equation}\]

The solutions to (9.15) are only possible candidates for the MLE: points at which the first derivatives are zero may be local or global minima, local or global maxima, or inflection points, and they lie in the interior of the parameter space. One may also need to check the boundary for the MLE.

Example 9.4 (Normal Likelihood) Let \(X_1,\cdots,X_n\) be i.i.d. \(N(\theta,1)\) and let \(L(\theta|\mathbf{x})\) denote the likelihood function. Then \[\begin{equation} L(\theta|\mathbf{x})=\prod_{i=1}^n\frac{1}{(2\pi)^{1/2}}e^{-(x_i-\theta)^2/2}= \frac{1}{(2\pi)^{n/2}}e^{-\sum_{i=1}^n(x_i-\theta)^2/2} \tag{9.16} \end{equation}\]

The equation \(\frac{d}{d\theta}L(\theta|\mathbf{x})=0\) reduces to \(\sum_{i=1}^n(x_i-\theta)=0\), which has the solution \(\hat{\theta}=\bar{x}\). Hence, \(\bar{x}\) is a candidate for the MLE. To verify that it is in fact the global maximum of the likelihood function, we still need to check the second derivative and the boundary; alternatively, we can argue from the properties of the quadratic function \(\sum_{i=1}^n(x_i-\theta)^2\).

Another way to find the MLE of \(\theta\) is to notice that \[\begin{equation} \min_a\sum_{i=1}^n(x_i-a)^2=\sum_{i=1}^n(x_i-\bar{x})^2 \tag{9.17} \end{equation}\] Thus, for any \(\theta\), \[\begin{equation} \exp(-\sum_{i=1}^n(x_i-\theta)^2/2)\leq \exp(-\sum_{i=1}^n(x_i-\bar{x})^2/2) \tag{9.18} \end{equation}\] and the maximum is attained by choosing \(\theta=\bar{x}\). Hence \(\bar{X}\) is the MLE.
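For a purely numerical cross-check of Example 9.4, the sketch below (assuming NumPy and SciPy are available; the data are simulated with an arbitrary true mean) minimizes the negative log likelihood of the \(N(\theta,1)\) model and compares the minimizer with \(\bar{x}\).

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
x = rng.normal(1.0, 1.0, 200)            # simulated N(theta = 1, 1) sample

def neg_loglik(theta):
    # negative log likelihood of N(theta, 1), up to an additive constant
    return 0.5 * np.sum((x - theta) ** 2)

res = minimize_scalar(neg_loglik)
print(res.x, x.mean())                   # the two values should agree closely
```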
Example 9.5 (Bernoulli MLE) Let \(X_1,\cdots,X_n\) be i.i.d. \(Bernoulli(p)\). Then the likelihood function is \[\begin{equation} L(p|\mathbf{x})=\prod_{i=1}^np^{x_i}(1-p)^{1-x_i}=p^y(1-p)^{n-y} \tag{9.19} \end{equation}\] where \(y=\sum_{i=1}^nx_i\). It is much easier to differentiate the log likelihood \[\begin{equation} \log L(p|\mathbf{x})=y\log p+(n-y)\log (1-p) \tag{9.20} \end{equation}\] For \(0<y<n\), differentiating \(\log L(p|\mathbf{x})\) and setting the derivative to 0 gives \(\hat{p}=\frac{y}{n}\), which is in fact the global maximum. When \(y=0\) or \(y=n\), the log likelihood is monotone in \(p\), and \(\hat{p}=\frac{y}{n}\) is still the global maximum. Thus, \(\frac{1}{n}\sum_{i=1}^nX_i\) is the MLE of \(p\).
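As a sanity check on Example 9.5, this sketch (assuming NumPy; the true \(p\) and sample size are arbitrary) compares the closed-form MLE \(\hat{p}=y/n\) with a crude grid maximization of the log likelihood (9.20).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p_true = 100, 0.3
x = rng.binomial(1, p_true, n)       # Bernoulli(p) sample
y = x.sum()

p_hat = y / n                        # closed-form MLE

# grid search over (0, 1) as a numerical check of the same maximum
grid = np.linspace(0.001, 0.999, 999)
loglik = y * np.log(grid) + (n - y) * np.log(1 - grid)
p_grid = grid[np.argmax(loglik)]
print(p_hat, p_grid)
```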

When finding a maximum likelihood estimator, the maximization takes place only over the range of parameter values.

Example 9.6 (Restricted Range MLE) Let \(X_1,\cdots,X_n\) be i.i.d. \(N(\theta,1)\), where it is known that \(\theta\) must be nonnegative. In this case, the MLE of \(\theta\) is \[\begin{equation} \hat{\theta}=\left\{\begin{aligned} & \bar{X} & \bar{X}\geq 0 \\ & 0 & o.w. \end{aligned} \right. \tag{9.21} \end{equation}\] If \(\bar{X}<0\), the likelihood is decreasing in \(\theta\) on \([0,\infty)\), so its maximum over the restricted range is attained at the boundary value \(\theta=0\).

Example 9.7 (Binomial MLE with Unknown Number of Trials) Let \(X_1,\cdots,X_n\) be a random sample from a \(Bin(k,p)\) distribution, where \(p\) is known and \(k\) is unknown. The likelihood function is \[\begin{equation} L(k|\mathbf{x},p)=\prod_{i=1}^n{k \choose x_i}p^{x_i}(1-p)^{k-x_i} \tag{9.22} \end{equation}\]

Maximizing \(L(k|\mathbf{x},p)\) by differentiation is difficult because \(k\) is an integer and the likelihood involves factorials of \(k\).

First, notice that \(L(k|\mathbf{x},p)=0\) if \(k<\max_ix_i\). Thus, the MLE is an integer \(k\geq\max_ix_i\) that satisfies \(L(k|\mathbf{x},p)/L(k-1|\mathbf{x},p)\geq1\) and \(L(k+1|\mathbf{x},p)/L(k|\mathbf{x},p)<1\). It remains to show that there is only one such \(k\).

The ratio of likelihoods is \[\begin{equation} \frac{L(k|\mathbf{x},p)}{L(k-1|\mathbf{x},p)}=\frac{k^n(1-p)^n}{\prod_{i=1}^n(k-x_i)} \tag{9.23} \end{equation}\] Thus, the condition for a maximum is \[\begin{equation} \left\{\begin{aligned} &k^n(1-p)^n\geq \prod_{i=1}^n(k-x_i) \\ &(k+1)^n(1-p)^n<\prod_{i=1}^n(k+1-x_i) \end{aligned}\right. \tag{9.24} \end{equation}\] Denote \(z=\frac{1}{k}\). Dividing the first inequality in (9.24) by \(k^n\) shows that it is equivalent to \((1-p)^n\geq\prod_{i=1}^n(1-x_iz)\), and similarly for the second. Consider the equation \[\begin{equation} (1-p)^n=\prod_{i=1}^n(1-x_iz) \tag{9.25} \end{equation}\] for \(0\leq z\leq 1/\max_ix_i\). Since \(\prod_{i=1}^n(1-x_iz)\) is a strictly decreasing function of \(z\) that equals 1 at \(z=0\) and 0 at \(z=1/\max_ix_i\), there is a unique \(\hat{z}\) satisfying (9.25). However, \(\frac{1}{\hat{z}}\) may not be an integer, so we take \(\hat{k}\) to be the largest integer less than or equal to \(\frac{1}{\hat{z}}\). Because \(\prod_{i=1}^n(1-x_iz)\) is decreasing in \(z\), this \(\hat{k}\) satisfies (9.24) and is therefore the MLE. It can be found by numerically solving an \(n\)th-degree polynomial equation.
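Because \(\prod_{i=1}^n(1-x_iz)\) is decreasing on \([0,1/\max_ix_i]\), \(\hat{z}\) can be found by bisection. The sketch below (assuming NumPy; the observed counts and known \(p\) are illustrative) solves (9.25) and then takes \(\hat{k}=\lfloor 1/\hat{z}\rfloor\).

```python
import numpy as np

p = 0.3                                  # known success probability
x = np.array([8, 5, 6, 9, 7])            # illustrative observed counts

def g(z):
    # g(z) = prod(1 - x_i z) - (1 - p)^n, strictly decreasing on [0, 1/max(x)]
    return np.prod(1.0 - x * z) - (1.0 - p) ** len(x)

# bisection: g(0) > 0 and g(1/max(x)) < 0, so the root of (9.25) lies in between
lo, hi = 0.0, 1.0 / x.max()
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if g(mid) > 0:
        lo = mid
    else:
        hi = mid
z_hat = 0.5 * (lo + hi)

k_hat = int(np.floor(1.0 / z_hat))       # MLE of k
print(k_hat)
```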

(This is Exercise 7.5 in Casella and Berger (2002).)

References

Casella, George, and Roger Berger. 2002. Statistical Inference. 2nd ed. Belmont, CA: Duxbury Resource Center.