Chapter 12 Mean Squared Error, Best Unbiased Estimators (Lecture on 02/06/2020)

Since we can usually apply more than one of these methods to find estimators in a particular situation, and these methods do not necessarily produce the same estimator, we are often faced with the task of choosing between estimators. The general topic of evaluating statistical procedures is part of the branch of statistics known as decision theory.

Definition 12.1 (Mean Squared Error) The mean squared error (MSE) of an estimator \(W\) of a parameter \(\theta\) is the function of \(\theta\) defined by \(E_{\theta}(W-\theta)^2\).

(Here \(\theta\) is a fixed parameter (not random), so the expectation is taken with respect to the distribution of \(W\), that is, with respect to the sample.)
In general, any increasing function of the absolute distance \(|W-\theta|\) would serve to measure the goodness of an estimator, for example, mean absolute error \(E_{\theta}(|W-\theta|)\), but MSE has at least two advantages over other distance measures: First, it is quite tractable analytically and second, it has the interpretation \[\begin{equation} E_{\theta}(W-\theta)^2=Var_{\theta}W+(E_{\theta}W-\theta)^2=Var_{\theta}W+(Bias_{\theta}W)^2 \tag{12.1} \end{equation}\] Thus, MSE incorporates two components, one measuring the variability of the estimator (precision) and the other measuring its bias (accuracy).

Definition 12.2 (Bias) The bias of a point estimator \(W\) of a parameter \(\theta\) is the difference between the expected value of \(W\) and \(\theta\); that is, \(Bias_{\theta}W=E_{\theta}W-\theta\). An estimator whose bias is identically (in \(\theta\)) equal to 0 is called unbiased and satisfies \(E_{\theta}W=\theta\) for all \(\theta\).

For an unbiased estimator, the MSE is equal to the variance, i.e. \(E_{\theta}(W-\theta)^2=Var_{\theta}W\).
Example 12.1 (Normal MSE) Let \(X_1,\cdots,X_n\) be i.i.d. \(N(\mu,\sigma^2)\). The statistics \(\bar{X}\) and \(S^2\) are both unbiased estimators since \(E\bar{X}=\mu\) and \(ES^2=\sigma^2\), for all \(\mu\) and \(\sigma^2\). The MSEs of these estimators are given by \[\begin{equation} \begin{split} &E(\bar{X}-\mu)^2=Var(\bar{X})=\frac{\sigma^2}{n}\\ &E(S^2-\sigma^2)^2=Var(S^2)=\frac{2\sigma^4}{n-1} \end{split} \tag{12.2} \end{equation}\] (12.2) holds because from Theorem 1.1 and Theorem 1.4, \(Var(\bar{X})=\frac{\sigma^2}{n}\) and \(\frac{(n-1)S^2}{\sigma^2}\sim\chi_{n-1}^2\). Notice also that the first equation of (12.2) still holds if the normality assumption is dropped, but the second one does not.
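The formulas in (12.2) are easy to check by simulation. Below is a minimal Python sketch (not from the original notes; the parameter values, sample size, and number of replications are arbitrary choices) that estimates the MSEs of \(\bar{X}\) and \(S^2\) by Monte Carlo and compares them with \(\sigma^2/n\) and \(2\sigma^4/(n-1)\).

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n, reps = 2.0, 3.0, 10, 200_000

# Draw `reps` samples of size n; compute Xbar and S^2 for each sample.
x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)   # unbiased sample variance

print(np.mean((xbar - mu) ** 2), sigma2 / n)                 # both ~ 0.30
print(np.mean((s2 - sigma2) ** 2), 2 * sigma2**2 / (n - 1))  # both ~ 2.00
```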
Although many unbiased estimators are also reasonable from the standpoint of MSE, be aware that controlling bias does not guarantee that MSE is controlled. In particular, it is sometimes the case that a trade-off occurs between variance and bias in such a way that a small increase in bias can be traded for a larger decrease in variance, resulting in an improvement in MSE. This is the well-known Bias-Variance Trade-off in statistics.
Example 12.2 An alternative estimator for \(\sigma^2\) is the maximum likelihood estimator \(\hat{\sigma}^2=\frac{1}{n}\sum_{i=1}^n(X_i-\bar{X})^2=\frac{n-1}{n}S^2\). Thus, \(E\hat{\sigma}^2=E(\frac{n-1}{n}S^2)=\frac{n-1}{n}\sigma^2\), so \(\hat{\sigma}^2\) is a biased estimator of \(\sigma^2\). The variance of \(\hat{\sigma}^2\) is \[\begin{equation} Var\hat{\sigma}^2=Var(\frac{n-1}{n}S^2)=(\frac{n-1}{n})^2Var(S^2)=\frac{2(n-1)\sigma^4}{n^2} \tag{12.3} \end{equation}\] and hence, the MSE is given by \[\begin{equation} E(\hat{\sigma}^2-\sigma^2)^2=\frac{2(n-1)\sigma^4}{n^2}+(\frac{n-1}{n}\sigma^2-\sigma^2)^2=(\frac{2n-1}{n^2})\sigma^4 \tag{12.4} \end{equation}\] We thus have \[\begin{equation} E(\hat{\sigma}^2-\sigma^2)^2<E(S^2-\sigma^2)^2 \tag{12.5} \end{equation}\] that is, \(\hat{\sigma}^2\) has smaller MSE than \(S^2\). By trading off variance for bias, the MSE is improved.
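Because both MSEs are available in closed form, the comparison in (12.5) can be checked directly. The short sketch below (illustrative only) prints \((2n-1)/n^2\) and \(2/(n-1)\), i.e. the two MSEs in units of \(\sigma^4\), for a few sample sizes.

```python
# MSE of the MLE sigma-hat^2 versus MSE of S^2, in units of sigma^4.
for n in (2, 5, 10, 50, 100):
    mse_mle = (2 * n - 1) / n**2   # (2n-1)/n^2
    mse_s2 = 2 / (n - 1)           # 2/(n-1)
    print(n, round(mse_mle, 4), round(mse_s2, 4), mse_mle < mse_s2)
```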
  • It can be argued that MSE, while a reasonable criterion for location parameters, is not reasonable for scale parameters. (One problem is that MSE penalizes equally for overestimation and underestimation, which is fine in the location case. In the scale case, however, 0 is a natural lower bound, so the estimation problem is not symmetric. Use of MSE in this case tends to be forgiving of underestimation.)

  • In general, since MSE is a function of the parameter, there will not be one “best” estimator. Often, the MSEs of two estimators will cross each other, showing that each estimator is better (with respect to the other) in only a portion of the parameter space.

Example 12.3 (MSE of Binomial Bayes Estimator) Let \(X_1,\cdots,X_n\) be i.i.d. \(Bernoulli(p)\). The MLE of \(p\) is \(\hat{p}=\bar{X}\), and its MSE as an estimator of \(p\) is \[\begin{equation} E_p(\hat{p}-p)^2=Var_p\bar{X}=\frac{p(1-p)}{n} \tag{12.6} \end{equation}\]

Let \(Y=\sum_{i=1}^nX_i\). The Bayes estimator for \(p\) is \(\hat{p}_{B}=\frac{Y+\alpha}{\alpha+\beta+n}\) (see Example 11.3). The MSE of this Bayes estimator of \(p\) is \[\begin{equation} \begin{split} E_p(\hat{p}_{B}-p)^2&=Var_p(\hat{p}_{B})+(Bias_p\hat{p}_{B})^2\\ &=Var_p(\frac{Y+\alpha}{\alpha+\beta+n})+(E_p(\frac{Y+\alpha}{\alpha+\beta+n})-p)^2\\ &=\frac{np(1-p)}{(\alpha+\beta+n)^2}+(\frac{np+\alpha}{\alpha+\beta+n}-p)^2 \end{split} \tag{12.7} \end{equation}\]

We may try to choose \(\alpha\) and \(\beta\) so that the MSE of \(\hat{p}_{B}\) is constant in \(p\); this choice is \(\alpha=\beta=\sqrt{n/4}\). In that case, \(E_p(\hat{p}_{B}-p)^2=\frac{n}{4(n+\sqrt{n})^2}\).

We compare the MSEs of \(\hat{p}_{B}\) and \(\hat{p}\) for different values of \(p\) in Figure 12.1. As suggested by Figure 12.1, for small \(n\), \(\hat{p}_B\) is the better choice unless there is a strong belief that \(p\) is near 0 or 1. For large \(n\), \(\hat{p}\) is the better choice unless there is a strong belief that \(p\) is close to \(\frac{1}{2}\).


FIGURE 12.1: Comparison of MSE for MLE and Bayes estimator of p when sample size is 4 and 400
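The curves in Figure 12.1 follow directly from (12.6) and (12.7) with \(\alpha=\beta=\sqrt{n/4}\). The sketch below is my reconstruction of that comparison (it is not the code that produced the figure): it evaluates both MSEs on a grid of \(p\) for \(n=4\) and \(n=400\), confirms that the Bayes MSE is constant in \(p\), and reports on how much of the grid the Bayes estimator wins.

```python
import numpy as np

def mse_mle(p, n):
    # MSE of the MLE p-hat = Y/n, equation (12.6)
    return p * (1 - p) / n

def mse_bayes(p, n):
    # MSE of the Bayes estimator with alpha = beta = sqrt(n/4), equation (12.7)
    a = b = np.sqrt(n / 4)
    var = n * p * (1 - p) / (a + b + n) ** 2
    bias = (n * p + a) / (a + b + n) - p
    return var + bias ** 2

p = np.linspace(0.0, 1.0, 101)
for n in (4, 400):
    const = n / (4 * (n + np.sqrt(n)) ** 2)
    print(n, mse_bayes(0.5, n), const)                  # constant in p
    print(n, np.mean(mse_bayes(p, n) < mse_mle(p, n)))  # fraction of grid where Bayes wins
```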

MSE can be a helpful criterion for finding the best estimator within a class of equivariant estimators. For an estimator \(W(\mathbf{X})\) of \(\theta\), the principles of Measurement Equivariance and Formal Invariance give

Measurement Equivariance: \(W(\mathbf{x})\) estimates \(\theta\Rightarrow\bar{g}(W(\mathbf{x}))\) estimates \(\bar{g}(\theta)=\theta^{\prime}\).

Formal Invariance: \(W(\mathbf{x})\) estimates \(\theta\Rightarrow W(g(\mathbf{x}))\) estimates \(\bar{g}(\theta)=\theta^{\prime}\).

Measurement equivariance means that if we change the scale of measurement on \(\theta\) (here \(\bar{g}(\cdot)\) represents the change of measurement), the inference should not change; this gives the first relationship. Formal invariance means that when two inference problems have the same mathematical form, they should be treated identically; estimating \(\bar{g}(\theta)\) has the same formal structure as estimating \(\theta\), which leads to the second relationship.

Putting these two requirements together gives \(W(g(\mathbf{x}))=\bar{g}(W(\mathbf{x}))\).

Example 12.4 (MSE of Equivariant Estimators) Let \(X_1,\cdots,X_n\) be i.i.d. \(f(x-\theta)\). For an estimator \(W(X_1,\cdots,X_n)\) to satisfy \(W(g_a(\mathbf{x}))=\bar{g}_a(W(\mathbf{x}))\), we must have \[\begin{equation} W(x_1,\cdots,x_n)+a=W(x_1+a,\cdots,x_n+a) \tag{12.8} \end{equation}\]

which specifies the equivariant estimators w.r.t. the group of transformations defined by \(\mathcal{G}=\{g_a(\mathbf{x}):-\infty<a<\infty\}\), where \(g_a(x_1,\cdots,x_n)=(x_1+a,\cdots,x_n+a)\). For these estimators we have \[\begin{equation} \begin{split} E_{\theta}(W&(X_1,\cdots,X_n)-\theta)^2\\ &=E_{\theta}(W(X_1+a,\cdots,X_n+a)-a-\theta)^2\quad (\text{by } (12.8),\text{ for any }a)\\ &=E_{\theta}(W(X_1-\theta,\cdots,X_n-\theta))^2\quad (a=-\theta)\\ &=\int_{\mathcal{X}}(W(x_1-\theta,\cdots,x_n-\theta))^2\prod_{i=1}^nf(x_i-\theta)d\mathbf{x}\\ &=\int_{\mathcal{X}}(W(\mu_1,\cdots,\mu_n))^2\prod_{i=1}^nf(\mu_i)d\boldsymbol{\mu}\quad (\mu_i=x_i-\theta)\\ \end{split} \tag{12.9} \end{equation}\] This last expression does not depend on \(\theta\); hence, the MSEs of these equivariant estimators are not functions of \(\theta\). The MSE can therefore be used to order the equivariant estimators, and an equivariant estimator with smallest MSE can be found.
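The \(\theta\)-free MSE in (12.9) is easy to see in a simulation: any estimator satisfying (12.8), such as the sample mean or the sample median, has the same estimated MSE no matter what \(\theta\) is. The sketch below is an illustration with a standard normal base density \(f\); the choices of \(\theta\), \(n\), and replication count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 15, 100_000

# Location family f(x - theta) with f the standard normal density; the sample
# mean and sample median both satisfy the equivariance condition (12.8).
for theta in (-5.0, 0.0, 3.0):
    x = theta + rng.standard_normal((reps, n))
    mse_mean = np.mean((x.mean(axis=1) - theta) ** 2)
    mse_median = np.mean((np.median(x, axis=1) - theta) ** 2)
    print(theta, round(mse_mean, 4), round(mse_median, 4))
# Each column is (up to Monte Carlo error) the same for every theta.
```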

A comparison of estimators based on MSE may not yield a clear favorite. Indeed, there is no one "best MSE" estimator. One way to make the problem of finding a "best" estimator tractable is to limit the class of estimators. Suppose there is an estimator \(W^*\) of \(\theta\) with \(E_{\theta}W^*=\tau(\theta)\), and consider the class of estimators \(\mathcal{C}_{\tau}=\{W:E_{\theta}W=\tau(\theta)\}\). For any \(W_1,W_2\in\mathcal{C}_{\tau}\), the biases of the two estimators are the same, so the MSE comparison is determined by the variances. We therefore prefer the estimator with the smallest variance in this class.

Definition 12.3 (Best Unbiased Estimator) An estimator \(W^*\) is a best unbiased estimator of \(\tau(\theta)\) if it satisfies \(E_{\theta}W^*=\tau(\theta)\) for all \(\theta\) and, for any other estimator \(W\) satisfying \(E_{\theta}W=\tau(\theta)\), we have \(Var_{\theta}(W^*)\leq Var_{\theta}(W)\) for all \(\theta\). \(W^*\) is also called a uniform minimum variance unbiased estimator (UMVUE) of \(\tau(\theta)\).

Example 12.5 (Poisson Unbiased Estimation) Let \(X_1,\cdots,X_n\) be i.i.d. \(Pois(\lambda)\) and let \(\bar{X}\) and \(S^2\) be the sample mean and variance, respectively. For the Poisson distribution, the mean and variance are both equal to \(\lambda\). Therefore, \(E_{\lambda}\bar{X}=E_{\lambda}S^2=\lambda\) for all \(\lambda\), so both \(\bar{X}\) and \(S^2\) are unbiased estimators of \(\lambda\).

Now consider the variances of \(\bar{X}\) and \(S^2\). It can be shown that \(Var_{\lambda}(\bar{X})\leq Var_{\lambda}(S^2),\forall\lambda\), but calculating \(Var_{\lambda}(S^2)\) is not an easy task. Furthermore, consider the class of estimators \[\begin{equation} W_a(\bar{X},S^2)=a\bar{X}+(1-a)S^2 \tag{12.10} \end{equation}\] For every constant \(a\), \(W_a(\bar{X},S^2)\) is an unbiased estimator of \(\lambda\). Comparing the variances of all of them is not tractable, not to mention that there may be other unbiased estimators of a different form.
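A quick simulation (illustrative only; the values of \(\lambda\), \(n\), and \(a\) are arbitrary) suggests that \(\bar{X}\) has the smallest variance among these unbiased estimators, even though an analytic comparison is awkward.

```python
import numpy as np

rng = np.random.default_rng(2)
lam, n, reps = 4.0, 20, 200_000

x = rng.poisson(lam, size=(reps, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)

print(np.mean(xbar), np.mean(s2))   # both ~ lambda (unbiasedness)
print(np.var(xbar), np.var(s2))     # Var(Xbar) = lambda/n = 0.2 < Var(S^2)
for a in (0.25, 0.5, 0.75):
    w = a * xbar + (1 - a) * s2     # W_a from (12.10), also unbiased
    print(a, np.mean(w), np.var(w))
```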

A more reasonable approach to finding a best unbiased estimator is to first specify a lower bound \(B(\theta)\) on the variance of any unbiased estimator, and then find an estimator \(W^*\) satisfying \(Var_{\theta}(W^*)=B(\theta)\). This approach is taken with the use of the Cramer-Rao Lower Bound.

Theorem 12.1 (Cramer-Rao Inequality) Let \(X_1,\cdots,X_n\) be a sample with p.d.f. \(f(x|\theta)\) and let \(W(\mathbf{X})=W(X_1,\cdots,X_n)\) be any estimator satisfying \[\begin{equation} \frac{d}{d\theta}E_{\theta}W(\mathbf{X})=\int_{\mathcal{X}}\frac{\partial}{\partial\theta}[W(\mathbf{x})f(\mathbf{x}|\theta)]d\mathbf{x} \tag{12.11} \end{equation}\] and \(Var_{\theta}(W(\mathbf{X}))<\infty\), then \[\begin{equation} Var_{\theta}(W(\mathbf{X}))\geq\frac{(\frac{d}{d\theta}E_{\theta}W(\mathbf{X}))^2}{E_{\theta}((\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta))^2)} \tag{12.12} \end{equation}\]

Proof. By Cauchy-Schwarz Inequality, for any two random variables \(X\) and \(Y\) \[\begin{equation} [Cov(X,Y)]^2\leq(Var(X))(Var(Y)) \tag{12.13} \end{equation}\]

Rearrange terms in (12.13) we get the lower bound on the variance of \(X\) \[\begin{equation} Var(X)\geq\frac{[Cov(X,Y)]^2}{Var(Y)} \tag{12.14} \end{equation}\]

The cleverness in this theorem lies in choosing \(X\) to be the estimator \(W(\mathbf{X})\) and \(Y\) to be the quantity \(\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta)\).

First, note that \[\begin{equation} \begin{split} \frac{d}{d\theta}E_{\theta}W(\mathbf{X})&=\int_{\mathcal{X}}W(\mathbf{x})[\frac{\partial}{\partial\theta}f(\mathbf{x}|\theta)]d\mathbf{x}\\ &=E_{\theta}[W(\mathbf{X})\frac{\frac{\partial}{\partial\theta}f(\mathbf{X}|\theta)}{f(\mathbf{X}|\theta)}]\\ &=E_{\theta}[W(\mathbf{X})\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta)] \end{split} \tag{12.15} \end{equation}\] Here the second equality holds because we multiply and divide the integrand by \(f(\mathbf{x}|\theta)\) and rewrite the integral as an expectation with respect to \(\mathbf{X}\); the third equality uses the derivative of the logarithm. Taking \(W(\mathbf{x})\equiv 1\) in (12.15) gives \[\begin{equation} E_{\theta}(\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta))=\frac{d}{d\theta}E_{\theta}(1)=0 \tag{12.16} \end{equation}\] Therefore \[\begin{equation} Cov_{\theta}(W(\mathbf{X}),\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta))=E_{\theta}[W(\mathbf{X})\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta)]=\frac{d}{d\theta}E_{\theta}W(\mathbf{X}) \tag{12.17} \end{equation}\] and, since \(E_{\theta}(\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta))=0\), \[\begin{equation} Var_{\theta}(\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta))= E_{\theta}((\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta))^2) \tag{12.18} \end{equation}\] Applying (12.14) with these choices of \(X\) and \(Y\), together with (12.17) and (12.18), yields (12.12), as desired.
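The two identities (12.16) and (12.17) at the heart of the proof can be checked numerically. The sketch below is an illustration (not part of the notes) using a \(N(\theta,1)\) sample with \(W(\mathbf{X})=\bar{X}\), for which \(\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta)=\sum_i(X_i-\theta)\) and \(\frac{d}{d\theta}E_{\theta}\bar{X}=1\).

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 1.5, 8, 500_000

x = rng.normal(theta, 1.0, size=(reps, n))
score = (x - theta).sum(axis=1)   # d/d(theta) log f(X|theta) for N(theta, 1)
w = x.mean(axis=1)                # the estimator W(X) = Xbar

print(score.mean())               # ~ 0, as in (12.16)
print(np.cov(w, score)[0, 1])     # ~ 1 = d/d(theta) E_theta(Xbar), as in (12.17)
```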
Corollary 12.1 (Cramer-Rao Inequality, i.i.d. case) If the assumptions of Theorem 12.1 are satisfied and, additionally, \(X_1,\cdots,X_n\) are i.i.d. with p.d.f. \(f(x|\theta)\), then \[\begin{equation} Var_{\theta}(W(\mathbf{X}))\geq\frac{(\frac{d}{d\theta}E_{\theta}W(\mathbf{X}))^2}{nE_{\theta}((\frac{\partial}{\partial\theta}\log f(X|\theta))^2)} \tag{12.19} \end{equation}\] That is, the expectation in the denominator becomes a univariate calculation.
Proof. We only need to show that \[\begin{equation} E_{\theta}((\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta))^2) =nE_{\theta}((\frac{\partial}{\partial\theta}\log f(X|\theta))^2) \tag{12.20} \end{equation}\] Since \(X_1,\cdots,X_n\) are independent, \[\begin{equation} \begin{split} E_{\theta}((\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta))^2) &=E_{\theta}((\frac{\partial}{\partial\theta}\log \prod_{i=1}^nf(X_i|\theta))^2)\\ &=E_{\theta}((\sum_{i=1}^n\frac{\partial}{\partial\theta}\log f(X_i|\theta))^2)\\ &=\sum_{i=1}^nE_{\theta}((\frac{\partial}{\partial\theta}\log f(X_i|\theta))^2)\\ &+\sum_{i\neq j}E_{\theta}(\frac{\partial}{\partial\theta}\log f(X_i|\theta)\frac{\partial}{\partial\theta}\log f(X_j|\theta)) \end{split} \tag{12.21} \end{equation}\] For \(i\neq j\), by independence and (12.16), \[\begin{equation} \begin{split} E_{\theta}&(\frac{\partial}{\partial\theta}\log f(X_i|\theta)\frac{\partial}{\partial\theta}\log f(X_j|\theta))\\ &=E_{\theta}(\frac{\partial}{\partial\theta}\log f(X_i|\theta))E_{\theta}(\frac{\partial}{\partial\theta}\log f(X_j|\theta))\\ &=0 \end{split} \tag{12.22} \end{equation}\] Therefore, we have established (12.20) and the corollary holds.
  • The Cramer-Rao Lower Bound also holds for discrete random variables, with (12.11) modified to require that differentiation and summation can be interchanged.

  • \(E_{\theta}((\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta))^2)\) is called the information number, or Fisher information of the sample. This terminology reflects the fact that the information number gives a bound on the variance of the best unbiased estimator of \(\theta\). As the information number gets bigger and we have more information about \(\theta\), we have a smaller bound on the variance of the best unbiased estimator.

  • For any differentiable function \(\tau(\theta)\), the Cramer-Rao inequality gives a lower bound on the variance of any estimator \(W\) satisfying (12.11) and \(E_{\theta}W=\tau(\theta)\). The bound depends only on \(\tau(\theta)\) and \(f(x|\theta)\). Any candidate estimator satisfying \(E_{\theta}W=\tau(\theta)\) and attaining this lower bound is a best unbiased estimator of \(\tau(\theta)\).
Lemma 12.1 If \(f(x|\theta)\) satisfies \[\begin{equation} \frac{d}{d\theta}E_{\theta}(\frac{\partial}{\partial\theta}\log f(X|\theta)) = \int\frac{\partial}{\partial\theta}[(\frac{\partial}{\partial\theta}\log f(x|\theta))f(x|\theta)]dx \tag{12.23} \end{equation}\] (which holds for exponential families), then \[\begin{equation} E_{\theta}((\frac{\partial}{\partial\theta}\log f(X|\theta))^2)=-E_{\theta}(\frac{\partial^2}{\partial\theta^2}\log f(X|\theta)) \tag{12.24} \end{equation}\]
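As a quick numerical check of (12.24), the sketch below (illustrative; it uses the exponential density \(f(x|\theta)=\theta e^{-\theta x}\), a one-parameter exponential family) verifies by Monte Carlo that \(E_{\theta}((\frac{\partial}{\partial\theta}\log f(X|\theta))^2)\) and \(-E_{\theta}(\frac{\partial^2}{\partial\theta^2}\log f(X|\theta))\) agree; both equal \(1/\theta^2\).

```python
import numpy as np

rng = np.random.default_rng(4)
theta, reps = 2.0, 1_000_000

x = rng.exponential(scale=1.0 / theta, size=reps)  # f(x|theta) = theta * exp(-theta * x)
score = 1.0 / theta - x                            # d/d(theta) log f(x|theta)
second_deriv = -1.0 / theta**2                     # d^2/d(theta)^2 log f(x|theta), free of x

print(np.mean(score**2))    # ~ 1/theta^2 = 0.25
print(-second_deriv)        # exactly 1/theta^2 = 0.25
```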
Example 12.6 (Poisson Unbiased Estimation Continued) Here \(\tau(\lambda)=\lambda\), so \(\tau^{\prime}(\lambda)=1\). Also, since we have an exponential family, using Corollary 12.1 and Lemma 12.1 we have \[\begin{equation} \begin{split} E_{\lambda}((\frac{\partial}{\partial\lambda}\log\prod_{i=1}^nf(X_i|\lambda))^2)&= -nE_{\lambda}(\frac{\partial^2}{\partial\lambda^2}\log f(X|\lambda))\\ &=-nE_{\lambda}(\frac{\partial^2}{\partial\lambda^2}\log(\frac{e^{-\lambda}\lambda^X}{X!}))\\ &=-nE_{\lambda}(\frac{\partial^2}{\partial\lambda^2}(-\lambda+X\log\lambda-\log X!))\\ &=-nE_{\lambda}(-\frac{X}{\lambda^2})\\ &=\frac{n}{\lambda} \end{split} \tag{12.25} \end{equation}\] Hence, for any unbiased estimator \(W\) of \(\lambda\), we must have \[\begin{equation} Var_{\lambda}W\geq\frac{\lambda}{n} \tag{12.26} \end{equation}\] Since \(Var_{\lambda}\bar{X}=\frac{\lambda}{n}\), \(\bar{X}\) is a best unbiased estimator of \(\lambda\).
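The unit information \(1/\lambda\) computed in (12.25) can also be obtained numerically by summing \((\frac{\partial}{\partial\lambda}\log f(x|\lambda))^2f(x|\lambda)\) over the (truncated) Poisson support, as in the sketch below (illustrative, with arbitrary \(\lambda\) and \(n\)); the resulting bound \(\lambda/n\) coincides with \(Var_{\lambda}\bar{X}\).

```python
import numpy as np
from scipy.stats import poisson

lam, n = 3.0, 25
x = np.arange(0, 200)                  # truncation of the Poisson support
pmf = poisson.pmf(x, lam)
score = x / lam - 1.0                  # d/d(lambda) log f(x|lambda)
unit_info = np.sum(score**2 * pmf)     # unit Fisher information

print(unit_info, 1.0 / lam)            # both ~ 1/3
print(1.0 / (n * unit_info), lam / n)  # Cramer-Rao bound ~ Var(Xbar) = lambda/n
```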
The key assumption in the Cramer-Rao Theorem is the ability to differentiate under the integral sign. Densities in the exponential class satisfy this assumption, but in general the assumption should be checked; otherwise contradictions can arise.
Example 12.7 (Unbiased Estimator for the Scale Uniform) Let \(X_1,\cdots,X_n\) be i.i.d. with p.d.f. \(f(x|\theta)=1/\theta,x\in(0,\theta)\). Since \(\frac{\partial}{\partial\theta}\log f(x|\theta)=-1/\theta\), we have \[\begin{equation} E_{\theta}((\frac{\partial}{\partial\theta}\log f(X|\theta))^2)=\frac{1}{\theta^2} \tag{12.27} \end{equation}\] The Cramer-Rao Theorem would indicate that if \(W\) is any unbiased estimator of \(\theta\), \[\begin{equation} Var_{\theta}(W)\geq\frac{\theta^2}{n} \tag{12.28} \end{equation}\] Now consider the sufficient statistic \(Y=\max(X_1,\cdots,X_n)\). The p.d.f. of \(Y\) is \(f_Y(y|\theta)=ny^{n-1}/\theta^n,0<y<\theta\), so \[\begin{equation} E_{\theta}Y=\int_{0}^{\theta}y\frac{ny^{n-1}}{\theta^n}dy=\frac{n}{n+1}\theta \tag{12.29} \end{equation}\] showing that \(\frac{n+1}{n}Y\) is an unbiased estimator of \(\theta\). Thus \[\begin{equation} \begin{split} Var_{\theta}(\frac{n+1}{n}Y)&=(\frac{n+1}{n})^2Var_{\theta}Y\\ &=(\frac{n+1}{n})^2[E_{\theta}Y^2-(\frac{n}{n+1}\theta)^2]\\ &=(\frac{n+1}{n})^2[\frac{n}{n+2}\theta^2-(\frac{n}{n+1}\theta)^2]\\ &=\frac{1}{n(n+2)}\theta^2 \end{split} \tag{12.30} \end{equation}\] This is uniformly smaller than \(\theta^2/n\), which indicates that the Cramer-Rao Theorem is not applicable to this p.d.f. To see why, note that for any function \(h\),
\[\begin{equation} \begin{split} \frac{d}{d\theta}\int_0^{\theta}h(x)f(x|\theta)dx&=\frac{d}{d\theta}\int_0^{\theta}h(x)\frac{1}{\theta}dx\\ &=\frac{h(\theta)}{\theta}+\int_0^{\theta}h(x)\frac{\partial}{\partial\theta}(\frac{1}{\theta})dx\\ &\neq \int_0^{\theta}h(x)\frac{\partial}{\partial\theta}(\frac{1}{\theta})dx \end{split} \tag{12.31} \end{equation}\] unless \(h(\theta)/\theta=0\) for all \(\theta\). Hence, the Cramer-Rao Theorem does not apply. In general, if the range of the p.d.f. depends on the parameter, the theorem will not be applicable.
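The variance in (12.30), and the fact that it beats the inapplicable "bound" \(\theta^2/n\), can be confirmed by simulation; a minimal sketch with arbitrary \(\theta\) and \(n\):

```python
import numpy as np

rng = np.random.default_rng(5)
theta, n, reps = 2.0, 10, 500_000

x = rng.uniform(0.0, theta, size=(reps, n))
w = (n + 1) / n * x.max(axis=1)   # the unbiased estimator (n+1)/n * Y

print(np.mean(w), theta)                     # ~ theta (unbiased)
print(np.var(w), theta**2 / (n * (n + 2)))   # ~ 0.033, matches (12.30)
print(theta**2 / n)                          # 0.4, the inapplicable bound
```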
There is no guarantee that the Cramer-Rao Lower Bound is attainable. In fact, the Cramer-Rao Lower Bound may be strictly smaller than the variance of any unbiased estimator. In the usually favorable case where \(f(x|\theta)\) is a one-parameter exponential family, there exists a parameter \(\tau(\theta)\) with an unbiased estimator that achieves the Cramer-Rao Lower Bound. However, in other typical situations, and for other parameters, the bound may not be attainable.
Example 12.8 (Normal Variance Bound) Let \(X_1,\cdots,X_n\) be i.i.d. \(N(\mu,\sigma^2)\) and consider estimation of \(\sigma^2\), where \(\mu\) is unknown. The normal p.d.f. satisfies the assumptions of the Cramer-Rao Theorem and Lemma 12.1, so we have \[\begin{equation} \frac{\partial^2}{\partial(\sigma^2)^2}\log((2\pi\sigma^2)^{-1/2}\exp(-\frac{(x-\mu)^2}{2\sigma^2}))=\frac{1}{2\sigma^4}-\frac{(x-\mu)^2}{\sigma^6} \tag{12.32} \end{equation}\] and \[\begin{equation} -E[\frac{\partial^2}{\partial(\sigma^2)^2}\log f(X|\mu,\sigma^2)|\mu,\sigma^2]=-E[\frac{1}{2\sigma^4}-\frac{(X-\mu)^2}{\sigma^6}|\mu,\sigma^2]=\frac{1}{2\sigma^4} \tag{12.33} \end{equation}\] Thus, any unbiased estimator \(W\) of \(\sigma^2\) must satisfy \(Var(W|\mu,\sigma^2)\geq\frac{2\sigma^4}{n}\). From Example 12.1 we know \(Var(S^2|\mu,\sigma^2)=\frac{2\sigma^4}{n-1}\), so \(S^2\) does not attain the Cramer-Rao Lower Bound.
Corollary 12.2 (Attainment) Let \(X_1,\cdots,X_n\) be i.i.d. \(f(x|\theta)\), where \(f(x|\theta)\) satisfies the conditions of the Cramer-Rao Theorem. Let \(L(\theta|\mathbf{x})=\prod_{i=1}^nf(x_i|\theta)\) denote the likelihood function. If \(W(\mathbf{X})=W(X_1,\cdots,X_n)\) is any unbiased estimator of \(\tau(\theta)\), then \(W(\mathbf{X})\) attains the Cramer-Rao Lower Bound if and only if \[\begin{equation} a(\theta)[W(\mathbf{x})-\tau(\theta)]=\frac{\partial}{\partial\theta}\log L(\theta|\mathbf{x}) \tag{12.34} \end{equation}\] for some function \(a(\theta)\).

Proof. The Cramer-Rao Inequality can be written as \[\begin{equation} [Cov_{\theta}(W(\mathbf{X}),\frac{\partial}{\partial\theta}\log \prod_{i=1}^nf(X_i|\theta))]^2\leq Var_{\theta}(W(\mathbf{X}))Var_{\theta}(\frac{\partial}{\partial\theta}\log \prod_{i=1}^nf(X_i|\theta)) \tag{12.35} \end{equation}\] recalling that \(E_{\theta}W=\tau(\theta)\) and \(E_{\theta}(\frac{\partial}{\partial\theta}\log \prod_{i=1}^nf(X_i|\theta))=0\). (12.35) is just the Cauchy-Schwarz inequality for random variables; writing \(\langle X,Y\rangle=E_{\theta}(XY)\), it can also be expressed as \[\begin{equation} \begin{split} &|\langle W-E_{\theta}W,\ \frac{\partial}{\partial\theta}\log \prod_{i=1}^nf(X_i|\theta)-E_{\theta}(\frac{\partial}{\partial\theta}\log \prod_{i=1}^nf(X_i|\theta))\rangle|^2\\ &\leq\langle W-E_{\theta}W,W-E_{\theta}W\rangle\langle \frac{\partial}{\partial\theta}\log \prod_{i=1}^nf(X_i|\theta)-E_{\theta}(\frac{\partial}{\partial\theta}\log \prod_{i=1}^nf(X_i|\theta)),\\ &\qquad\frac{\partial}{\partial\theta}\log \prod_{i=1}^nf(X_i|\theta)-E_{\theta}(\frac{\partial}{\partial\theta}\log \prod_{i=1}^nf(X_i|\theta))\rangle \end{split} \tag{12.36} \end{equation}\] By the necessary and sufficient condition for equality in the Cauchy-Schwarz inequality, the bound is attained if and only if \(W(\mathbf{x})-\tau(\theta)\) is proportional (as a function of \(\mathbf{x}\), with a proportionality constant that may depend on \(\theta\)) to \(\frac{\partial}{\partial\theta}\log \prod_{i=1}^nf(x_i|\theta)\), which is exactly (12.34).

Example 12.9 (Normal Variance Bound Continued) As in Example 12.8, we have \[\begin{equation} L(\mu,\sigma^2|\mathbf{x})=\prod_{i=1}^n(2\pi\sigma^2)^{-1/2}\exp(-\frac{(x_i-\mu)^2}{2\sigma^2})=(2\pi\sigma^2)^{-n/2}\exp(-\frac{\sum_{i=1}^n(x_i-\mu)^2}{2\sigma^2}) \tag{12.37} \end{equation}\] and hence \[\begin{equation} \frac{\partial}{\partial\sigma^2}\log L(\mu,\sigma^2|\mathbf{x})=\frac{n}{2\sigma^4}(\sum_{i=1}^n\frac{(x_i-\mu)^2}{n}-\sigma^2) \tag{12.38} \end{equation}\] Thus, taking \(a(\sigma^2)=n/(2\sigma^4)\) shows that the best unbiased estimator of \(\sigma^2\) is \(\sum_{i=1}^n\frac{(x_i-\mu)^2}{n}\), which is not calculable if \(\mu\) is unknown. In that case, the bound cannot be attained.
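When \(\mu\) is known, the estimator \(\sum_{i=1}^n(X_i-\mu)^2/n\) does attain the bound \(2\sigma^4/n\), while \(S^2\) does not; the sketch below (illustrative, with arbitrary parameter values) checks both variances by simulation.

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma2, n, reps = 1.0, 2.0, 10, 300_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
w_known_mu = np.mean((x - mu) ** 2, axis=1)   # sum (X_i - mu)^2 / n, with mu known
s2 = x.var(axis=1, ddof=1)

print(np.var(w_known_mu), 2 * sigma2**2 / n)   # ~ 0.80, attains the CR bound
print(np.var(s2), 2 * sigma2**2 / (n - 1))     # ~ 0.89, strictly above the bound
```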