Chapter 12 Mean Squared Error, Best Unbiased Estimators (Lecture on 02/06/2020)
Since we can usually apply more than one of these methods for finding estimators in a particular situation, and these methods do not necessarily produce the same estimator, we are often faced with the task of choosing between estimators. The general topic of evaluating statistical procedures is part of the branch of statistics known as decision theory.
Definition 12.1 (Mean Squared Error) The mean squared error (MSE) of an estimator \(W\) of a parameter \(\theta\) is the function of \(\theta\) defined by \(E_{\theta}(W-\theta)^2\).
(Here \(\theta\) is the parameter (not random), so the expectation is taken w.r.t. the distribution of \(W\).)

Definition 12.2 (Bias) The bias of a point estimator \(W\) of a parameter \(\theta\) is the difference between the expected value of \(W\) and \(\theta\); that is, \(Bias_{\theta}W=E_{\theta}W-\theta\). An estimator whose bias is identically (in \(\theta\)) equal to 0 is called unbiased and satisfies \(E_{\theta}W=\theta\) for all \(\theta\).
The MSE decomposes into a variance term plus a squared bias term (see the identity below), so for an unbiased estimator the MSE equals the variance, i.e. \(E_{\theta}(W-\theta)^2=Var_{\theta}W\). It can be argued that MSE, while a reasonable criterion for location parameters, is not reasonable for scale parameters. (One problem is that MSE penalizes overestimation and underestimation equally, which is fine in the location case. In the scale case, however, 0 is a natural lower bound, so the estimation problem is not symmetric. Use of MSE in this case tends to be forgiving of underestimation.)
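This bias-variance decomposition, which is also used in (12.7) below, follows by adding and subtracting \(E_{\theta}W\) inside the square: \[\begin{split} E_{\theta}(W-\theta)^2&=E_{\theta}\big[(W-E_{\theta}W)+(E_{\theta}W-\theta)\big]^2\\ &=E_{\theta}(W-E_{\theta}W)^2+2(E_{\theta}W-\theta)E_{\theta}(W-E_{\theta}W)+(E_{\theta}W-\theta)^2\\ &=Var_{\theta}W+(Bias_{\theta}W)^2, \end{split}\] where the cross term vanishes because \(E_{\theta}(W-E_{\theta}W)=0\).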
- In general, since MSE is a function of the parameter, there will not be one “best” estimator. Often, the MSEs of two estimators will cross each other, showing that each estimator is better (with respect to the other) in only a portion of the parameter space.
Example 12.3 (MSE of Binomial Bayes Estimator) Let \(X_1,\cdots,X_n\) be i.i.d. \(Bernoulli(p)\). The MLE of \(p\) is \(\hat{p}=\bar{X}\), and its MSE as an estimator of \(p\) is \[\begin{equation} E_p(\hat{p}-p)^2=Var_p\bar{X}=\frac{p(1-p)}{n} \tag{12.6} \end{equation}\]
Let \(Y=\sum_{i=1}^nX_i\); the Bayes estimator of \(p\) is \(\hat{p}_{B}=\frac{Y+\alpha}{\alpha+\beta+n}\) (see Example 11.3). The MSE of this Bayes estimator of \(p\) is \[\begin{equation} \begin{split} E_p(\hat{p}_{B}-p)^2&=Var_p(\hat{p}_{B})+(Bias_p\hat{p}_{B})^2\\ &=Var_p\Big(\frac{Y+\alpha}{\alpha+\beta+n}\Big)+\Big(E_p\Big[\frac{Y+\alpha}{\alpha+\beta+n}\Big]-p\Big)^2\\ &=\frac{np(1-p)}{(\alpha+\beta+n)^2}+\Big(\frac{np+\alpha}{\alpha+\beta+n}-p\Big)^2 \end{split} \tag{12.7} \end{equation}\]
We may try to choose \(\alpha\) and \(\beta\) so that the MSE of \(\hat{p}_{B}\) is constant in \(p\); such a choice is \(\alpha=\beta=\sqrt{n/4}\). In that case, \(E_p(\hat{p}_{B}-p)^2=\frac{n}{4(n+\sqrt{n})^2}\).
We compare the MSEs of \(\hat{p}_{B}\) and \(\hat{p}\) for different values of \(p\) in Figure 12.1. As suggested by Figure 12.1, for small \(n\), \(\hat{p}_B\) is the better choice unless there is a strong belief that \(p\) is near 0 or 1. For large \(n\), \(\hat{p}\) is the better choice unless there is a strong belief that \(p\) is close to \(\frac{1}{2}\).
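The comparison in Figure 12.1 can be reproduced by evaluating (12.6) and (12.7) on a grid of \(p\) values. The sketch below is only an illustration (the function names and the grid are not from the lecture); it uses \(\alpha=\beta=\sqrt{n/4}\), so the Bayes MSE should print as a constant equal to \(\frac{n}{4(n+\sqrt{n})^2}\).

```python
import numpy as np

def mse_mle(p, n):
    """MSE of the MLE p_hat = X_bar, equation (12.6)."""
    return p * (1 - p) / n

def mse_bayes(p, n, alpha, beta):
    """MSE of the Bayes estimator (Y + alpha) / (alpha + beta + n), equation (12.7)."""
    var = n * p * (1 - p) / (alpha + beta + n) ** 2
    bias = (n * p + alpha) / (alpha + beta + n) - p
    return var + bias ** 2

p_grid = np.linspace(0.05, 0.95, 7)
for n in (4, 400):                      # a small and a large sample size
    a = b = np.sqrt(n / 4)              # the choice that makes the Bayes MSE constant in p
    print(f"n = {n}")
    print("  MLE  :", np.round(mse_mle(p_grid, n), 5))
    print("  Bayes:", np.round(mse_bayes(p_grid, n, a, b), 5))
    print("  n/(4(n+sqrt(n))^2) =", round(n / (4 * (n + np.sqrt(n)) ** 2), 5))
```

For \(n=4\) the Bayes estimator has the smaller MSE except near \(p=0\) or \(p=1\); for \(n=400\) the ordering is reversed except near \(p=\frac{1}{2}\), matching the discussion above.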
MSE can be a helpful criterion for finding the best estimator in a class of equivariant estimators. For an estimator \(W(\mathbf{X})\) of \(\theta\), the principles of Measurement Equivariance and Formal Invariance give
Measurement Equivariance: \(W(\mathbf{x})\) estimates \(\theta\Rightarrow\bar{g}(W(\mathbf{x}))\) estimates \(\bar{g}(\theta)=\theta^{\prime}\).
Formal Invariance: \(W(\mathbf{x})\) estimates \(\theta\Rightarrow W(g(\mathbf{x}))\) estimates \(\bar{g}(\theta)=\theta^{\prime}\).
Measurement equivariance means that if \(\theta\) is measured on a different scale, the inference should not change; \(\bar{g}(\cdot)\) here represents the change of measurement scale. This gives the first relationship. Formal invariance means that when two inference problems have the same mathematical form, the results should be identical. Estimating \(\bar{g}(\theta)\) from the transformed data \(g(\mathbf{x})\) has the same formal structure as estimating \(\theta\) from \(\mathbf{x}\), which leads to the second relationship.
Putting these two requirements together gives \(W(g(\mathbf{x}))=\bar{g}(W(\mathbf{x}))\).
Example 12.4 (MSE of Equivariant Estimators) Let \(X_1,\cdots,X_n\) be i.i.d. \(f(x-\theta)\). For an estimator \(W(X_1,\cdots,X_n)\) to satisfy \(W(g_a(\mathbf{x}))=\bar{g}_a(W(\mathbf{x}))\), we must have \[\begin{equation} W(x_1,\cdots,x_n)+a=W(x_1+a,\cdots,x_n+a) \tag{12.8} \end{equation}\]
which specifies the equivariant estimators w.r.t. the group of transformations defined by \(\mathcal{G}=\{g_a(\mathbf{x}):-\infty<a<\infty\}\), where \(g_a(x_1,\cdots,x_n)=(x_1+a,\cdots,x_n+a)\). For these estimators we have \[\begin{equation} \begin{split} E_{\theta}(W&(X_1,\cdots,X_n)-\theta)^2\\ &=E_{\theta}(W(X_1+a,\cdots,X_n+a)-a-\theta)^2\quad (\text{by }(12.8))\\ &=E_{\theta}(W(X_1-\theta,\cdots,X_n-\theta))^2\quad (a=-\theta)\\ &=\int_{\mathcal{X}}(W(x_1-\theta,\cdots,x_n-\theta))^2\prod_{i=1}^nf(x_i-\theta)d\mathbf{x}\\ &=\int_{\mathcal{X}}(W(\mu_1,\cdots,\mu_n))^2\prod_{i=1}^nf(\mu_i)d\mathbf{\mu}\quad (\mu_i=x_i-\theta)\\ \end{split} \tag{12.9} \end{equation}\] This last expression does not depend on \(\theta\); hence, the MSEs of these equivariant estimators are not functions of \(\theta\). MSE can therefore be used to order the equivariant estimators, and an equivariant estimator with smallest MSE can be found (see the simulation sketch below).

A comparison of estimators based on MSE may not yield a clear favorite; indeed, there is no one "best MSE" estimator. One way to make the problem of finding a "best" estimator tractable is to limit the class of estimators. Suppose there is an estimator \(W^*\) of \(\theta\) with \(E_{\theta}W^*=\tau(\theta)\), and consider the class of estimators \(\mathcal{C}_{\tau}=\{W:E_{\theta}W=\tau(\theta)\}\). For any \(W_1,W_2\in\mathcal{C}_{\tau}\), the biases are the same, so the MSE is determined by the variance, and we prefer the estimator with the smallest variance in this class.
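As a quick numerical illustration of (12.9) in Example 12.4 (a sketch, not part of the lecture; the sample mean and the sample median are chosen only as convenient examples of estimators satisfying (12.8), with \(f\) taken to be the standard normal density), the simulated MSE of an equivariant location estimator comes out essentially the same for every value of \(\theta\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 20, 200_000

for theta in (-5.0, 0.0, 3.0):
    # i.i.d. sample from f(x - theta) with f the standard normal density
    x = rng.normal(loc=theta, size=(reps, n))
    for name, w in (("mean", x.mean(axis=1)), ("median", np.median(x, axis=1))):
        mse = np.mean((w - theta) ** 2)   # Monte Carlo estimate of E_theta (W - theta)^2
        print(f"theta = {theta:+.1f}  {name:6s}  MSE ~ {mse:.4f}")
```

Both estimators satisfy (12.8), and their simulated MSEs do not change with \(\theta\); MSE can therefore rank them, and under this normal \(f\) the mean has the smaller MSE.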
Example 12.5 (Poisson Unbiased Estimation) Let \(X_1,\cdots,X_n\) be i.i.d. \(Pois(\lambda)\) and let \(\bar{X}\) and \(S^2\) be the sample mean and variance, respectively. For the Poisson distribution, the mean and the variance are both equal to \(\lambda\). Therefore \(E_{\lambda}\bar{X}=E_{\lambda}S^2=\lambda\) for all \(\lambda\), so both \(\bar{X}\) and \(S^2\) are unbiased estimators of \(\lambda\).
Now consider the variances of \(\bar{X}\) and \(S^2\). It can be shown that \(Var_{\lambda}(\bar{X})\leq Var_{\lambda}(S^2)\) for all \(\lambda\), but calculating \(Var_{\lambda}(S^2)\) is not an easy task. Furthermore, consider the class of estimators \[\begin{equation} W_a(\bar{X},S^2)=a\bar{X}+(1-a)S^2 \tag{12.10} \end{equation}\] For every constant \(a\), \(W_a(\bar{X},S^2)\) is an unbiased estimator of \(\lambda\). Comparing the variances of all of them is not tractable, not to mention that there may be other unbiased estimators of a different form.
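A small Monte Carlo sketch (not part of the lecture; the particular values of \(\lambda\), \(n\), and \(a\) below are arbitrary) illustrates the situation: every \(W_a\) has mean close to \(\lambda\), while the variances differ, with \(\bar{X}\) (the case \(a=1\)) the smallest among those tried.

```python
import numpy as np

rng = np.random.default_rng(1)
lam, n, reps = 3.0, 25, 200_000

x = rng.poisson(lam, size=(reps, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)               # unbiased sample variance

print("Var(X_bar) ~", round(xbar.var(), 4), " (theory: lambda/n =", lam / n, ")")
print("Var(S^2)   ~", round(s2.var(), 4))
for a in (0.0, 0.5, 1.0):
    w = a * xbar + (1 - a) * s2          # W_a from (12.10), unbiased for every a
    print(f"a = {a:.1f}   mean(W_a) ~ {w.mean():.4f}   Var(W_a) ~ {w.var():.4f}")
```

This does not settle whether \(\bar{X}\) is best among all unbiased estimators, which is exactly why a lower bound on the variance is useful.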
A more reasonable approach to finding a best unbiased estimator is to first specify a lower bound \(B(\theta)\) on the variance of any unbiased estimator, and then find an estimator \(W^*\) satisfying \(Var_{\theta}(W^*)=B(\theta)\). This approach is taken with the use of the Cramer-Rao Lower Bound.
Theorem 12.1 (Cramer-Rao Inequality) Let \(X_1,\cdots,X_n\) be a sample with p.d.f. \(f(x|\theta)\) and let \(W(\mathbf{X})=W(X_1,\cdots,X_n)\) be any estimator satisfying \[\begin{equation} \frac{d}{d\theta}E_{\theta}W(\mathbf{X})=\int_{\mathcal{X}}\frac{\partial}{\partial\theta}[W(\mathbf{x})f(\mathbf{x}|\theta)]d\mathbf{x} \tag{12.11} \end{equation}\] and \(Var_{\theta}(W(\mathbf{X}))<\infty\), then \[\begin{equation} Var_{\theta}(W(\mathbf{X}))\geq\frac{(\frac{d}{d\theta}E_{\theta}W(\mathbf{X}))^2}{E_{\theta}((\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta))^2)} \tag{12.12} \end{equation}\]
Proof. By the Cauchy-Schwarz inequality, for any two random variables \(X\) and \(Y\), \[\begin{equation} [Cov(X,Y)]^2\leq(Var(X))(Var(Y)) \tag{12.13} \end{equation}\]
Rearranging terms in (12.13), we get a lower bound on the variance of \(X\): \[\begin{equation} Var(X)\geq\frac{[Cov(X,Y)]^2}{Var(Y)} \tag{12.14} \end{equation}\]
The cleverness in this theorem is in choosing \(X\) to be the estimator \(W(\mathbf{X})\) and \(Y\) to be the quantity \(\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta)\).
Firstly, note that \[\begin{equation} \begin{split} \frac{d}{d\theta}E_{\theta}W(\mathbf{X})&=\int_{\mathcal{X}}W(\mathbf{x})[\frac{\partial}{\partial\theta}f(\mathbf{x}|\theta)]d\mathbf{x}\\ &=E_{\theta}[W(\mathbf{X})\frac{\frac{\partial}{\partial\theta}f(\mathbf{X}|\theta)}{f(\mathbf{X}|\theta)}]\\ &=E_{\theta}[W(\mathbf{X})\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta)] \end{split} \tag{12.15} \end{equation}\] Here the first equality uses assumption (12.11), the second equality holds by multiplying and dividing by \(f(\mathbf{x}|\theta)\) and writing the integral as an expectation w.r.t. the random variable \(\mathbf{X}\), and the third equality uses the derivative of the logarithm. Taking \(W(\mathbf{x})=1\) for all \(\mathbf{x}\) in (12.15) gives \[\begin{equation} E_{\theta}(\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta))=\frac{d}{d\theta}E_{\theta}(1)=0 \tag{12.16} \end{equation}\] Therefore we have \[\begin{equation} Cov_{\theta}(W(\mathbf{X}),\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta))=E_{\theta}[W(\mathbf{X})\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta)]=\frac{d}{d\theta}E_{\theta}W(\mathbf{X}) \tag{12.17} \end{equation}\] and since \(E_{\theta}(\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta))=0\), \[\begin{equation} Var_{\theta}(\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta))= E_{\theta}((\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta))^2) \tag{12.18} \end{equation}\] Applying (12.14) with this choice of \(X\) and \(Y\), together with (12.17) and (12.18), gives (12.12), as desired. The Cramer-Rao Lower Bound also holds for discrete random variables, with the interchange of differentiation and integration in (12.11) replaced by an interchange of differentiation and summation.
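The two facts (12.16) and (12.18) are easy to check numerically. The sketch below is only an illustration (the \(N(\theta,1)\) model is chosen for convenience; for this model \(\frac{\partial}{\partial\theta}\log f(\mathbf{x}|\theta)=\sum_{i=1}^n(x_i-\theta)\)): the simulated score has mean approximately zero, and its variance matches its second moment.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 1.5, 10, 500_000

x = rng.normal(theta, 1.0, size=(reps, n))
score = (x - theta).sum(axis=1)          # d/dtheta log f(X|theta) for the N(theta, 1) sample

print("E[score]   ~", round(score.mean(), 4))           # approximately 0, as in (12.16)
print("Var[score] ~", round(score.var(), 4))            # approximately n, as in (12.18)
print("E[score^2] ~", round((score ** 2).mean(), 4))    # the same quantity, the information number
```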
\(E_{\theta}((\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta))^2)\) is called the information number, or Fisher information of the sample. This terminology reflects the fact that the information number gives a bound on the variance of the best unbiased estimator of \(\theta\). As the information number gets bigger and we have more information about \(\theta\), we have a smaller bound on the variance of the best unbiased estimator.
- For any differentiable function \(\tau(\theta)\), the Cramer-Rao inequality gives a lower bound on the variance of any estimator \(W\) that satisfies (12.11) and \(E_{\theta}W=\tau(\theta)\). The bound depends only on \(\tau(\theta)\) and \(f(x|\theta)\). Any candidate estimator satisfying \(E_{\theta}W=\tau(\theta)\) and attaining this lower bound is a best unbiased estimator of \(\tau(\theta)\); a numerical check for the Bernoulli setting of Example 12.3 is sketched below.
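As a sketch (relying on the standard fact that for \(n\) i.i.d. \(Bernoulli(p)\) observations the information number is \(n/(p(1-p))\); the particular \(p\) and \(n\) below are arbitrary), the simulated variance of \(\bar{X}\) matches the Cramer-Rao bound \(p(1-p)/n\), so \(\bar{X}\) is a best unbiased estimator of \(p\) in Example 12.3.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, reps = 0.3, 50, 200_000

x = rng.binomial(1, p, size=(reps, n))
pbar = x.mean(axis=1)                       # unbiased estimator of p (here tau(p) = p)

fisher_info = n / (p * (1 - p))             # information number of the Bernoulli(p) sample
cr_bound = 1.0 / fisher_info                # Cramer-Rao lower bound for unbiased estimators of p
print("Cramer-Rao bound :", cr_bound)
print("Var(X_bar)       ~", round(pbar.var(), 6))   # attains the bound
```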
To see how condition (12.11) can fail when the support of the p.d.f. depends on \(\theta\), consider \(X_1,\cdots,X_n\) i.i.d. \(Uniform(0,\theta)\), so that \(f(x|\theta)=\frac{1}{\theta}\) for \(0<x<\theta\). For any (sufficiently regular) function \(h(x)\), \[\begin{equation} \begin{split} \frac{d}{d\theta}\int_0^{\theta}h(x)f(x|\theta)dx&=\frac{d}{d\theta}\int_0^{\theta}h(x)\frac{1}{\theta}dx\\ &=\frac{h(\theta)}{\theta}+\int_0^{\theta}h(x)\frac{\partial}{\partial\theta}\Big(\frac{1}{\theta}\Big)dx\\ &\neq \int_0^{\theta}h(x)\frac{\partial}{\partial\theta}\Big(\frac{1}{\theta}\Big)dx \end{split} \tag{12.31} \end{equation}\] unless \(h(\theta)/\theta=0\) for all \(\theta\). Hence, the Cramer-Rao Theorem does not apply. In general, if the range of the p.d.f. depends on the parameter, the theorem will not be applicable.
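For instance, taking \(h(x)=x\) as an illustrative choice makes the discrepancy explicit: \[\frac{d}{d\theta}\int_0^{\theta}x\cdot\frac{1}{\theta}dx=\frac{d}{d\theta}\,\frac{\theta}{2}=\frac{1}{2}, \qquad\text{while}\qquad \int_0^{\theta}x\,\frac{\partial}{\partial\theta}\Big(\frac{1}{\theta}\Big)dx=-\frac{1}{\theta^2}\cdot\frac{\theta^2}{2}=-\frac{1}{2},\] and the two sides differ by exactly the boundary term \(h(\theta)/\theta=1\).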
Proof. The Cramer-Rao Inequality can be written as \[\begin{equation} [Cov_{\theta}(W(\mathbf{X}),\frac{\partial}{\partial\theta}\log \prod_{i=1}^nf(X_i|\theta))]^2\leq Var_{\theta}(W(\mathbf{X}))Var_{\theta}(\frac{\partial}{\partial\theta}\log \prod_{i=1}^nf(X_i|\theta)) \tag{12.35} \end{equation}\] recalling that \(E_{\theta}W=\tau(\theta)\) and \(E_{\theta}(\frac{\partial}{\partial\theta}\log \prod_{i=1}^nf(X_i|\theta))=0\). Inequality (12.35) is just the Cauchy-Schwarz inequality in inner product form; it can also be written as \[\begin{equation} \begin{split} &|\langle W-E_{\theta}W,\frac{\partial}{\partial\theta}\log \prod_{i=1}^nf(X_i|\theta)-E_{\theta}(\frac{\partial}{\partial\theta}\log \prod_{i=1}^nf(X_i|\theta))\rangle|^2\\ &\leq\langle W-E_{\theta}W,W-E_{\theta}W\rangle\langle\frac{\partial}{\partial\theta}\log \prod_{i=1}^nf(X_i|\theta)-E_{\theta}(\frac{\partial}{\partial\theta}\log \prod_{i=1}^nf(X_i|\theta)),\\ &\frac{\partial}{\partial\theta}\log \prod_{i=1}^nf(X_i|\theta)-E_{\theta}(\frac{\partial}{\partial\theta}\log \prod_{i=1}^nf(X_i|\theta))\rangle \end{split} \tag{12.36} \end{equation}\] Thus, by the necessary and sufficient condition for equality in the Cauchy-Schwarz inequality, equality holds if and only if \(W(\mathbf{x})-\tau(\theta)\) is proportional to \(\frac{\partial}{\partial\theta}\log \prod_{i=1}^nf(x_i|\theta)\), which is exactly (12.34).
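As an illustration of this proportionality condition (a standard computation, not taken from the lecture, continuing Example 12.5), for \(X_1,\cdots,X_n\) i.i.d. \(Pois(\lambda)\) and \(\tau(\lambda)=\lambda\), \[\frac{\partial}{\partial\lambda}\log \prod_{i=1}^nf(x_i|\lambda)=\frac{\partial}{\partial\lambda}\sum_{i=1}^n\big(-\lambda+x_i\log\lambda-\log x_i!\big)=\frac{\sum_{i=1}^nx_i}{\lambda}-n=\frac{n}{\lambda}\big(\bar{x}-\lambda\big),\] so \(\bar{x}-\lambda\) is proportional to the score, with a proportionality factor \(n/\lambda\) that may depend on \(\lambda\) but not on \(\mathbf{x}\). Hence \(\bar{X}\) attains the Cramer-Rao lower bound and is a best unbiased estimator of \(\lambda\), which resolves the comparison left open in Example 12.5.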