Chapter 12 Mean Squared Error, Best Unbiased Estimators (Lecture on 02/06/2020)

Since we can usually apply more than one of these methods for finding estimators in a particular situation, and these methods do not necessarily give the same estimate, we are often faced with the task of choosing between estimators. The general topic of evaluating statistical procedures is part of the branch of statistics known as decision theory.

Definition 12.1 (Mean Squared Error) The mean squared error (MSE) of an estimator $W$ of a parameter $\theta$ is the function of $\theta$ defined by $E_\theta(W-\theta)^2$.

(Here $\theta$ is a fixed parameter, not a random quantity, so the expectation is taken with respect to the sampling distribution of $W$.)
In general, any increasing function of the absolute distance $|W - \theta|$ would serve to measure the goodness of an estimator, for example the mean absolute error $E_\theta(|W-\theta|)$, but MSE has at least two advantages over other distance measures: first, it is quite tractable analytically, and second, it has the interpretation $$E_\theta(W-\theta)^2 = \mathrm{Var}_\theta W + (E_\theta W - \theta)^2 = \mathrm{Var}_\theta W + (\mathrm{Bias}_\theta W)^2.$$ Thus, MSE incorporates two components, one measuring the variability of the estimator (precision) and the other measuring its bias (accuracy).

Definition 12.2 (Bias) The bias of a point estimator $W$ of a parameter $\theta$ is the difference between the expected value of $W$ and $\theta$; that is, $\mathrm{Bias}_\theta W = E_\theta W - \theta$. An estimator whose bias is identically (in $\theta$) equal to 0 is called unbiased and satisfies $E_\theta W = \theta$ for all $\theta$.

For an unbiased estimator, the MSE is equal to the variance, i.e. $E_\theta(W-\theta)^2 = \mathrm{Var}_\theta W$.
Example 12.1 (Normal MSE) Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$. The statistics $\bar{X}$ and $S^2$ are both unbiased estimators, since $E\bar{X} = \mu$ and $ES^2 = \sigma^2$ for all $\mu$ and $\sigma^2$. The MSEs of these estimators are given by $$E(\bar{X} - \mu)^2 = \mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}, \qquad E(S^2 - \sigma^2)^2 = \mathrm{Var}(S^2) = \frac{2\sigma^4}{n-1}. \tag{12.2}$$ The equalities in (12.2) hold because, from Theorem 1.1 and Theorem 1.4, $\mathrm{Var}(\bar{X}) = \sigma^2/n$ and $(n-1)S^2/\sigma^2 \sim \chi^2_{n-1}$. Notice also that the first equation of (12.2) still holds if the normality assumption is dropped, but the second one does not.
Although many unbiased estimators are also reasonable from the standpoint of MSE, be aware that controlling bias does not guarantee that MSE is controlled. In particular, it is sometimes the case that a trade-off occurs between variance and bias in such a way that a small increase in bias can be traded for a larger decrease in variance, resulting in an improvement in MSE. This is the well-known Bias-Variance Trade-off in statistics.
Example 12.2 An alternative estimator for $\sigma^2$ is the maximum likelihood estimator $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2 = \frac{n-1}{n}S^2$. Then $E\hat{\sigma}^2 = E\left(\frac{n-1}{n}S^2\right) = \frac{n-1}{n}\sigma^2$, so $\hat{\sigma}^2$ is a biased estimator of $\sigma^2$. The variance of $\hat{\sigma}^2$ is $$\mathrm{Var}\,\hat{\sigma}^2 = \mathrm{Var}\left(\frac{n-1}{n}S^2\right) = \left(\frac{n-1}{n}\right)^2 \mathrm{Var}(S^2) = \frac{2(n-1)\sigma^4}{n^2},$$ and hence the MSE is given by $$E(\hat{\sigma}^2 - \sigma^2)^2 = \frac{2(n-1)\sigma^4}{n^2} + \left(\frac{n-1}{n}\sigma^2 - \sigma^2\right)^2 = \left(\frac{2n-1}{n^2}\right)\sigma^4.$$ We thus have $E(\hat{\sigma}^2 - \sigma^2)^2 < E(S^2 - \sigma^2)^2$, i.e. $\hat{\sigma}^2$ has smaller MSE than $S^2$. By trading off variance for bias, the MSE is improved.
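This trade-off is easy to check numerically. The following is a minimal Monte Carlo sketch (assuming NumPy is available; the values of $n$, $\sigma^2$, the seed, and the number of replications are arbitrary illustration choices) that estimates the MSEs of $S^2$ and $\hat{\sigma}^2$ and compares them with the formulas $2\sigma^4/(n-1)$ and $(2n-1)\sigma^4/n^2$ derived above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, reps = 10, 4.0, 200_000   # arbitrary illustration values

x = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))
s2 = x.var(axis=1, ddof=1)           # unbiased sample variance S^2
sig2_hat = x.var(axis=1, ddof=0)     # MLE, equal to (n-1)/n * S^2

mse_s2 = np.mean((s2 - sigma2) ** 2)
mse_mle = np.mean((sig2_hat - sigma2) ** 2)

print("MSE(S^2)        : simulated %.4f, formula %.4f" % (mse_s2, 2 * sigma2**2 / (n - 1)))
print("MSE(sigma2_hat) : simulated %.4f, formula %.4f" % (mse_mle, (2 * n - 1) * sigma2**2 / n**2))
```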
  • It can be argued that MSE, while a reasonable criterion for location parameters, is not reasonable for scale parameters. (One problem is that MSE penalizes equally for overestimation and underestimation, which is fine in the location case. In the scale case, however, 0 is a natural lower bound, so the estimation problem is not symmetric. Use of MSE in this case tends to be forgiving of underestimation.)

  • In general, since MSE is a function of the parameter, there will not be one “best” estimator. Often, the MSEs of two estimators will cross each other, showing that each estimator is better (with respect to the other) in only a portion of the parameter space.

Example 12.3 (MSE of Binomial Bayes Estimator) Let $X_1, \ldots, X_n$ be i.i.d. Bernoulli($p$). The MSE of the MLE $\hat{p} = \bar{X}$ as an estimator of $p$ is $$E_p(\hat{p} - p)^2 = \mathrm{Var}_p \bar{X} = \frac{p(1-p)}{n}.$$

Let $Y = \sum_{i=1}^n X_i$; the Bayes estimator of $p$ (under a conjugate Beta$(\alpha, \beta)$ prior; see Example 11.3) is $\hat{p}_B = \frac{Y + \alpha}{\alpha + \beta + n}$. The MSE of this Bayes estimator of $p$ is $$E_p(\hat{p}_B - p)^2 = \mathrm{Var}_p(\hat{p}_B) + (\mathrm{Bias}_p\,\hat{p}_B)^2 = \mathrm{Var}_p\left(\frac{Y+\alpha}{\alpha+\beta+n}\right) + \left(E_p\left(\frac{Y+\alpha}{\alpha+\beta+n}\right) - p\right)^2 = \frac{np(1-p)}{(\alpha+\beta+n)^2} + \left(\frac{np+\alpha}{\alpha+\beta+n} - p\right)^2.$$

We may try to choose $\alpha$ and $\beta$ to make the MSE of $\hat{p}_B$ constant in $p$; such a choice is $\alpha = \beta = \sqrt{n/4}$. In that case, $$E_p(\hat{p}_B - p)^2 = \frac{n}{4(n+\sqrt{n})^2}.$$

We compare the MSEs of $\hat{p}_B$ and $\hat{p}$ for different values of $p$ in Figure 12.1. As the figure suggests, for small $n$, $\hat{p}_B$ is the better choice unless there is a strong belief that $p$ is near 0 or 1. For large $n$, $\hat{p}$ is the better choice unless there is a strong belief that $p$ is close to $1/2$.


FIGURE 12.1: Comparison of MSE for MLE and Bayes estimator of p when sample size is 4 and 400
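The curves in Figure 12.1 can be reproduced directly from the closed-form MSE expressions above. The sketch below (plain NumPy; the grid of $p$ values is an arbitrary choice) evaluates both MSE functions for $n = 4$ and $n = 400$ with $\alpha = \beta = \sqrt{n/4}$, so that the Bayes MSE is constant in $p$.

```python
import numpy as np

def mse_mle(p, n):
    # MSE of p-hat = X-bar: p(1-p)/n
    return p * (1 - p) / n

def mse_bayes(p, n):
    # MSE of the Bayes estimator with alpha = beta = sqrt(n/4),
    # which makes the MSE constant in p: n / (4 (n + sqrt(n))^2)
    a = b = np.sqrt(n / 4)
    var = n * p * (1 - p) / (a + b + n) ** 2
    bias = (n * p + a) / (a + b + n) - p
    return var + bias ** 2

p = np.linspace(0.01, 0.99, 9)
for n in (4, 400):
    print("n =", n)
    print("  MSE(MLE)  :", np.round(mse_mle(p, n), 5))
    print("  MSE(Bayes):", np.round(mse_bayes(p, n), 5))
```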

MSE can be a helpful criterion for finding the best estimator within a class of equivariant estimators. For an estimator $W(\mathbf{X})$ of $\theta$, the principles of Measurement Equivariance and Formal Invariance give the following:

Measurement Equivariance: $W(\mathbf{x})$ estimates $\theta \;\Longrightarrow\; \bar{g}(W(\mathbf{x}))$ estimates $\bar{g}(\theta) = \theta^*$.

Formal Invariance: $W(\mathbf{x})$ estimates $\theta \;\Longrightarrow\; W(g(\mathbf{x}))$ estimates $\bar{g}(\theta) = \theta^*$.

Measurement equivariance means that if $\theta$ is measured on a different scale, with $\bar{g}(\cdot)$ representing the change of measurement and $\theta^* = \bar{g}(\theta)$ the parameter on the new scale, the inference should not change; this gives the first relationship. Formal invariance means that when two inference problems have the same mathematical form, the results should be identical: estimating $\theta^* = \bar{g}(\theta)$ from the transformed data $g(\mathbf{x})$ has the same form as estimating $\theta$ from $\mathbf{x}$, which leads to the second relationship.

Putting these two requirements together gives $$W(g(\mathbf{x})) = \bar{g}(W(\mathbf{x})).$$

Example 12.4 (MSE of Equivariant Estimators) Let $X_1, \ldots, X_n$ be i.i.d. $f(x - \theta)$. For an estimator $W(X_1, \ldots, X_n)$ to satisfy $W(g_a(\mathbf{x})) = \bar{g}_a(W(\mathbf{x}))$, we must have $$W(x_1, \ldots, x_n) + a = W(x_1 + a, \ldots, x_n + a),$$

which specifies the equivariant estimators with respect to the group of transformations defined by $\mathcal{G} = \{g_a(\mathbf{x}) : -\infty < a < \infty\}$, where $g_a(x_1, \ldots, x_n) = (x_1 + a, \ldots, x_n + a)$. For these estimators we have
$$\begin{aligned}
E_\theta(W(X_1, \ldots, X_n) - \theta)^2 &= E_\theta(W(X_1 + a, \ldots, X_n + a) - a - \theta)^2 \qquad (a = -\theta)\\
&= E_\theta(W(X_1 - \theta, \ldots, X_n - \theta))^2\\
&= \int_{\mathcal{X}} (W(x_1 - \theta, \ldots, x_n - \theta))^2 \prod_{i=1}^n f(x_i - \theta)\, d\mathbf{x}\\
&= \int_{\mathcal{X}} (W(\mu_1, \ldots, \mu_n))^2 \prod_{i=1}^n f(\mu_i)\, d\boldsymbol{\mu} \qquad (\mu_i = x_i - \theta).
\end{aligned}$$
This last expression does not depend on $\theta$; hence, the MSEs of these equivariant estimators are not functions of $\theta$. The MSE can therefore be used to order the equivariant estimators, and an equivariant estimator with smallest MSE can be found.
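As a numerical illustration of this fact, the sketch below (an illustrative example under added assumptions: a normal location family and the sample median, which is equivariant under the shifts $g_a$; the values of $n$, $\theta$, and the seed are arbitrary) shows that the simulated MSE is essentially the same for different values of $\theta$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 15, 100_000

for theta in (-3.0, 0.0, 10.0):                  # different location parameters
    x = theta + rng.standard_normal((reps, n))   # X_i = theta + error, density f(x - theta)
    w = np.median(x, axis=1)                     # equivariant: median(x + a) = median(x) + a
    print("theta = %5.1f, simulated MSE of median = %.4f" % (theta, np.mean((w - theta) ** 2)))
```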

A comparison of estimators based on MSE may not yield a clear favorite. Indeed, there is no one “best MSE” estimator. One way to make the problem of finding a “best” estimator tractable is to limit the class of estimators. Suppose there is an estimator $W$ of $\theta$ with $E_\theta W = \tau(\theta)$, and consider the class of estimators $\mathcal{C}_\tau = \{W : E_\theta W = \tau(\theta)\}$. For any $W_1, W_2 \in \mathcal{C}_\tau$, the biases of the two estimators are the same, so the MSE comparison is determined by the variances. We therefore favor the estimator with the smallest variance in this class.

Definition 12.3 (Best Unbiased Estimator) An estimator $W^*$ is a best unbiased estimator of $\tau(\theta)$ if it satisfies $E_\theta W^* = \tau(\theta)$ for all $\theta$ and, for any other estimator $W$ satisfying $E_\theta W = \tau(\theta)$, we have $\mathrm{Var}_\theta(W^*) \leq \mathrm{Var}_\theta(W)$ for all $\theta$. $W^*$ is also called a uniform minimum variance unbiased estimator (UMVUE) of $\tau(\theta)$.

Example 12.5 (Poisson Unbiased Estimation) Let $X_1, \ldots, X_n$ be i.i.d. Pois($\lambda$) and let $\bar{X}$ and $S^2$ be the sample mean and variance, respectively. For the Poisson distribution, the mean and variance are both equal to $\lambda$. Therefore, $E_\lambda \bar{X} = E_\lambda S^2 = \lambda$ for all $\lambda$, and both $\bar{X}$ and $S^2$ are unbiased estimators of $\lambda$.

Now consider the variances of $\bar{X}$ and $S^2$. It can be shown that $\mathrm{Var}_\lambda(\bar{X}) \leq \mathrm{Var}_\lambda(S^2)$ for all $\lambda$, but calculating $\mathrm{Var}_\lambda(S^2)$ is not an easy task. Furthermore, consider the class of estimators $W_a(\bar{X}, S^2) = a\bar{X} + (1-a)S^2$: for every constant $a$, $W_a(\bar{X}, S^2)$ is an unbiased estimator of $\lambda$. Comparing the variances of all of these estimators is not tractable, not to mention that there may be other unbiased estimators of a different form.
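A quick Monte Carlo check (a sketch with arbitrary $\lambda$, $n$, seed, and number of replications, using NumPy) is consistent with the ordering $\mathrm{Var}_\lambda(\bar{X}) \leq \mathrm{Var}_\lambda(S^2)$ and shows how a few of the combined estimators $W_a$ behave.

```python
import numpy as np

rng = np.random.default_rng(2)
lam, n, reps = 3.0, 20, 200_000

x = rng.poisson(lam, size=(reps, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)

print("Var(X-bar):", xbar.var(), " (lambda/n =", lam / n, ")")
print("Var(S^2)  :", s2.var())
for a in (0.0, 0.5, 1.0):
    w = a * xbar + (1 - a) * s2          # each W_a is unbiased for lambda
    print("a = %.1f: mean = %.3f, var = %.4f" % (a, w.mean(), w.var()))
```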

A more systematic way of finding a best unbiased estimator is to first specify a lower bound $B(\theta)$ on the variance of any unbiased estimator, and then find an estimator $W^*$ satisfying $\mathrm{Var}_\theta(W^*) = B(\theta)$. This approach is taken with the use of the Cramer-Rao Lower Bound.

Theorem 12.1 (Cramer-Rao Inequality) Let $X_1, \ldots, X_n$ be a sample with joint p.d.f. $f(\mathbf{x}|\theta)$ and let $W(\mathbf{X}) = W(X_1, \ldots, X_n)$ be any estimator satisfying $$\frac{d}{d\theta} E_\theta W(\mathbf{X}) = \int_{\mathcal{X}} \frac{\partial}{\partial\theta}\left[W(\mathbf{x}) f(\mathbf{x}|\theta)\right] d\mathbf{x} \tag{12.11}$$ and $\mathrm{Var}_\theta(W(\mathbf{X})) < \infty$. Then $$\mathrm{Var}_\theta(W(\mathbf{X})) \geq \frac{\left(\frac{d}{d\theta} E_\theta W(\mathbf{X})\right)^2}{E_\theta\left(\left(\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta)\right)^2\right)}. \tag{12.12}$$

Proof. By the Cauchy-Schwarz inequality, for any two random variables $X$ and $Y$, $$[\mathrm{Cov}(X, Y)]^2 \leq \mathrm{Var}(X)\,\mathrm{Var}(Y). \tag{12.13}$$

Rearranging terms in (12.13), we get a lower bound on the variance of $X$: $$\mathrm{Var}(X) \geq \frac{[\mathrm{Cov}(X, Y)]^2}{\mathrm{Var}(Y)}. \tag{12.14}$$

The cleverness in this theorem is in choosing $X$ to be the estimator $W(\mathbf{X})$ and $Y$ to be the quantity $\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta)$.

First, note that
$$\frac{d}{d\theta} E_\theta W(\mathbf{X}) = \int_{\mathcal{X}} W(\mathbf{x})\left[\frac{\partial}{\partial\theta} f(\mathbf{x}|\theta)\right] d\mathbf{x} = E_\theta\left[W(\mathbf{X}) \frac{\frac{\partial}{\partial\theta} f(\mathbf{X}|\theta)}{f(\mathbf{X}|\theta)}\right] = E_\theta\left[W(\mathbf{X}) \frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta)\right]. \tag{12.15}$$
Here the second equality holds by multiplying and dividing by $f(\mathbf{x}|\theta)$ and writing the integral as an expectation with respect to $\mathbf{X}$; the third equality uses the derivative of the logarithm. Taking $W(\mathbf{x}) = 1$ for all $\mathbf{x}$ in (12.15) gives
$$E_\theta\left(\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta)\right) = \frac{d}{d\theta} E_\theta(1) = 0. \tag{12.16}$$
Therefore,
$$\mathrm{Cov}_\theta\left(W(\mathbf{X}), \frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta)\right) = E_\theta\left[W(\mathbf{X}) \frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta)\right] = \frac{d}{d\theta} E_\theta W(\mathbf{X}), \tag{12.17}$$
and since $E_\theta\left(\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta)\right) = 0$,
$$\mathrm{Var}_\theta\left(\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta)\right) = E_\theta\left(\left(\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta)\right)^2\right). \tag{12.18}$$
Using (12.14), (12.17), and (12.18), we obtain (12.12), as desired.
Corollary 12.1 (Cramer-Rao Inequality, i.i.d. case) If the assumptions of Theorem 12.1 are satisfied and, additionally, $X_1, \ldots, X_n$ are i.i.d. with p.d.f. $f(x|\theta)$, then $$\mathrm{Var}_\theta(W(\mathbf{X})) \geq \frac{\left(\frac{d}{d\theta} E_\theta W(\mathbf{X})\right)^2}{n E_\theta\left(\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right)^2\right)}.$$ That is, the expectation in the denominator becomes a univariate calculation.
Proof. We only need to show that
$$E_\theta\left(\left(\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta)\right)^2\right) = n E_\theta\left(\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right)^2\right). \tag{12.20}$$
Since $X_1, \ldots, X_n$ are independent,
$$\begin{aligned}
E_\theta\left(\left(\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta)\right)^2\right) &= E_\theta\left(\left(\frac{\partial}{\partial\theta}\log \prod_{i=1}^n f(X_i|\theta)\right)^2\right) = E_\theta\left(\left(\sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(X_i|\theta)\right)^2\right)\\
&= \sum_{i=1}^n E_\theta\left(\left(\frac{\partial}{\partial\theta}\log f(X_i|\theta)\right)^2\right) + \sum_{i \neq j} E_\theta\left(\frac{\partial}{\partial\theta}\log f(X_i|\theta)\,\frac{\partial}{\partial\theta}\log f(X_j|\theta)\right).
\end{aligned}$$
For $i \neq j$, by independence and (12.16),
$$E_\theta\left(\frac{\partial}{\partial\theta}\log f(X_i|\theta)\,\frac{\partial}{\partial\theta}\log f(X_j|\theta)\right) = E_\theta\left(\frac{\partial}{\partial\theta}\log f(X_i|\theta)\right) E_\theta\left(\frac{\partial}{\partial\theta}\log f(X_j|\theta)\right) = 0.$$
Therefore, we have established (12.20) and the corollary holds.
  • The Cramer-Rao Lower Bound also holds for discrete random variables, with condition (12.11) modified so that differentiation and summation can be interchanged.

  • $E_\theta\left(\left(\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta)\right)^2\right)$ is called the information number, or Fisher information, of the sample. This terminology reflects the fact that the information number gives a bound on the variance of the best unbiased estimator of $\theta$. As the information number gets bigger, we have more information about $\theta$ and a smaller bound on the variance of the best unbiased estimator.

  • For any differentiable function $\tau(\theta)$, the Cramer-Rao inequality gives a lower bound on the variance of any estimator $W$ that satisfies (12.11) and $E_\theta W = \tau(\theta)$. The bound depends only on $\tau(\theta)$ and $f(x|\theta)$. Any candidate estimator satisfying $E_\theta W = \tau(\theta)$ and attaining this lower bound is a best unbiased estimator of $\tau(\theta)$.
Lemma 12.1 If $f(x|\theta)$ satisfies $$\frac{d}{d\theta} E_\theta\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right) = \int \frac{\partial}{\partial\theta}\left[\left(\frac{\partial}{\partial\theta}\log f(x|\theta)\right) f(x|\theta)\right] dx$$ (true for exponential families), then $$E_\theta\left(\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right)^2\right) = -E_\theta\left(\frac{\partial^2}{\partial\theta^2}\log f(X|\theta)\right).$$
Example 12.6 (Poisson Unbiased Estimation Continued) Here $\tau(\lambda) = \lambda$, so $\tau'(\lambda) = 1$. Also, since we have an exponential family, using Lemma 12.1 we have
$$\begin{aligned}
E_\lambda\left(\left(\frac{\partial}{\partial\lambda}\log \prod_{i=1}^n f(X_i|\lambda)\right)^2\right) &= -n E_\lambda\left(\frac{\partial^2}{\partial\lambda^2}\log f(X|\lambda)\right) = -n E_\lambda\left(\frac{\partial^2}{\partial\lambda^2}\log\left(\frac{e^{-\lambda}\lambda^X}{X!}\right)\right)\\
&= -n E_\lambda\left(\frac{\partial^2}{\partial\lambda^2}\left(-\lambda + X\log\lambda - \log X!\right)\right) = -n E_\lambda\left(-\frac{X}{\lambda^2}\right) = \frac{n}{\lambda}.
\end{aligned}$$
Hence, for any unbiased estimator $W$ of $\lambda$, we must have $$\mathrm{Var}_\lambda W \geq \frac{\lambda}{n}.$$ Since $\mathrm{Var}_\lambda \bar{X} = \frac{\lambda}{n}$, $\bar{X}$ is a best unbiased estimator of $\lambda$.
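As a numerical sanity check of this example and of Lemma 12.1 (a sketch with arbitrary $\lambda$, $n$, and seed), one can estimate the per-observation Fisher information $E_\lambda\left[\left(\frac{\partial}{\partial\lambda}\log f(X|\lambda)\right)^2\right]$ from simulated data and compare $\mathrm{Var}_\lambda(\bar{X})$ with the bound $\lambda/n$.

```python
import numpy as np

rng = np.random.default_rng(3)
lam, n, reps = 2.5, 30, 200_000

x = rng.poisson(lam, size=(reps, n))

# score of a single Poisson observation: d/dlam log f(x|lam) = -1 + x/lam
score = -1.0 + x[:, 0] / lam
print("E[score^2] ~ %.4f, 1/lambda = %.4f" % (np.mean(score**2), 1 / lam))

xbar = x.mean(axis=1)
print("Var(X-bar) ~ %.5f, CRLB lambda/n = %.5f" % (xbar.var(), lam / n))
```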
The key assumption in the Cramer-Rao Theorem is the ability to differentiate under the integral sign. Densities in the exponential class satisfy this assumption, but in general the assumption should be checked; otherwise contradictions will arise.
Example 12.7 (Unbiased Estimator for the Scale Uniform) Let $X_1, \ldots, X_n$ be i.i.d. with p.d.f. $f(x|\theta) = 1/\theta$, $x \in (0, \theta)$. Since $\frac{\partial}{\partial\theta}\log f(x|\theta) = -1/\theta$, we have $$E_\theta\left(\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right)^2\right) = \frac{1}{\theta^2}.$$ The Cramer-Rao Theorem would indicate that if $W$ is any unbiased estimator of $\theta$, $$\mathrm{Var}_\theta(W) \geq \frac{\theta^2}{n}.$$ Now consider the sufficient statistic $Y = \max(X_1, \ldots, X_n)$. The p.d.f. of $Y$ is $f_Y(y|\theta) = ny^{n-1}/\theta^n$, $0 < y < \theta$, so $$E_\theta Y = \int_0^\theta y\, \frac{ny^{n-1}}{\theta^n}\, dy = \frac{n}{n+1}\theta,$$ showing that $\frac{n+1}{n}Y$ is an unbiased estimator of $\theta$. Thus $$\mathrm{Var}_\theta\left(\frac{n+1}{n}Y\right) = \left(\frac{n+1}{n}\right)^2 \mathrm{Var}_\theta Y = \left(\frac{n+1}{n}\right)^2\left[E_\theta Y^2 - \left(\frac{n}{n+1}\theta\right)^2\right] = \left(\frac{n+1}{n}\right)^2\left[\frac{n}{n+2}\theta^2 - \left(\frac{n}{n+1}\theta\right)^2\right] = \frac{1}{n(n+2)}\theta^2.$$ This is uniformly smaller than $\theta^2/n$, which indicates that the Cramer-Rao Theorem is not applicable to this p.d.f. To see why,
note that for any integrable function $h(x)$, Leibniz's rule gives $$\frac{d}{d\theta}\int_0^\theta h(x) f(x|\theta)\, dx = \frac{d}{d\theta}\int_0^\theta h(x)\frac{1}{\theta}\, dx = \frac{h(\theta)}{\theta} + \int_0^\theta h(x)\frac{\partial}{\partial\theta}\left(\frac{1}{\theta}\right) dx \neq \int_0^\theta h(x)\frac{\partial}{\partial\theta}\left(\frac{1}{\theta}\right) dx$$ unless $h(\theta)/\theta = 0$ for all $\theta$. Hence, differentiation and integration cannot be interchanged, and the Cramer-Rao Theorem does not apply. In general, if the range of the p.d.f. depends on the parameter, the theorem will not be applicable.
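The failure of the bound can also be seen numerically. This sketch (arbitrary $\theta$, $n$, and seed) compares the simulated variance of $\frac{n+1}{n}Y$ with the exact value $\theta^2/(n(n+2))$ and with the would-be bound $\theta^2/n$.

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 5.0, 10, 200_000

x = rng.uniform(0.0, theta, size=(reps, n))
w = (n + 1) / n * x.max(axis=1)          # unbiased estimator based on the maximum

print("mean of W            :", w.mean(), " (target theta =", theta, ")")
print("Var(W)               :", w.var())
print("theta^2/(n(n+2))     :", theta**2 / (n * (n + 2)))
print("'CR bound' theta^2/n :", theta**2 / n)   # larger than Var(W): the theorem does not apply
```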
There is no guarantee that the Cramer-Rao Lower Bound is attainable. In fact, it may be strictly smaller than the variance of any unbiased estimator. In the favorable case where $f(x|\theta)$ is a one-parameter exponential family, there exists a parameter $\tau(\theta)$ with an unbiased estimator that achieves the Cramer-Rao Lower Bound. However, in other typical situations, and for other parameters, the bound may not be attainable.
Example 12.8 (Normal Variance Bound) Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$ and consider estimation of $\sigma^2$, where $\mu$ is unknown. The normal p.d.f. satisfies the assumptions of the Cramer-Rao Theorem and Lemma 12.1, so we have $$\frac{\partial^2}{\partial(\sigma^2)^2}\log\left((2\pi\sigma^2)^{-1/2}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\right) = \frac{1}{2\sigma^4} - \frac{(x-\mu)^2}{\sigma^6}$$ and $$E\left[\frac{\partial^2}{\partial(\sigma^2)^2}\log f(X|\mu,\sigma^2)\,\Big|\,\mu,\sigma^2\right] = E\left[\frac{1}{2\sigma^4} - \frac{(X-\mu)^2}{\sigma^6}\,\Big|\,\mu,\sigma^2\right] = -\frac{1}{2\sigma^4}.$$ Thus, any unbiased estimator $W$ of $\sigma^2$ must satisfy $$\mathrm{Var}(W|\mu,\sigma^2) \geq \frac{2\sigma^4}{n}.$$ From Example 12.1 we know $\mathrm{Var}(S^2|\mu,\sigma^2) = \frac{2\sigma^4}{n-1}$, so $S^2$ does not attain the Cramer-Rao Lower Bound.
Corollary 12.2 (Attainment) Let $X_1, \ldots, X_n$ be i.i.d. $f(x|\theta)$, where $f(x|\theta)$ satisfies the conditions of the Cramer-Rao Theorem. Let $L(\theta|\mathbf{x}) = \prod_{i=1}^n f(x_i|\theta)$ denote the likelihood function. If $W(\mathbf{X}) = W(X_1, \ldots, X_n)$ is any unbiased estimator of $\tau(\theta)$, then $W(\mathbf{X})$ attains the Cramer-Rao Lower Bound if and only if $$a(\theta)[W(\mathbf{x}) - \tau(\theta)] = \frac{\partial}{\partial\theta}\log L(\theta|\mathbf{x}) \tag{12.34}$$ for some function $a(\theta)$.

Proof. The Cramer-Rao Inequality can be written as $$\left[\mathrm{Cov}_\theta\left(W(\mathbf{X}), \frac{\partial}{\partial\theta}\log\prod_{i=1}^n f(X_i|\theta)\right)\right]^2 \leq \mathrm{Var}_\theta(W(\mathbf{X}))\,\mathrm{Var}_\theta\left(\frac{\partial}{\partial\theta}\log\prod_{i=1}^n f(X_i|\theta)\right), \tag{12.35}$$ recalling that $E_\theta W = \tau(\theta)$ and $E_\theta\left(\frac{\partial}{\partial\theta}\log\prod_{i=1}^n f(X_i|\theta)\right) = 0$. Writing $S(\theta, \mathbf{X}) = \frac{\partial}{\partial\theta}\log\prod_{i=1}^n f(X_i|\theta)$ and $\langle U, V\rangle = E_\theta(UV)$, (12.35) is just the Cauchy-Schwarz inequality $$|\langle W - E_\theta W,\; S - E_\theta S\rangle|^2 \leq \langle W - E_\theta W,\; W - E_\theta W\rangle\,\langle S - E_\theta S,\; S - E_\theta S\rangle.$$ Thus, by the necessary and sufficient condition for equality in the Cauchy-Schwarz inequality, the bound is attained exactly when $W(\mathbf{x}) - \tau(\theta)$ is proportional to $\frac{\partial}{\partial\theta}\log\prod_{i=1}^n f(X_i|\theta)$ (as a function of $\mathbf{x}$, with proportionality constant depending on $\theta$), which is exactly (12.34).

Example 12.9 (Normal Variance Bound Continued) As in Example 12.8, we have $$L(\mu, \sigma^2|\mathbf{x}) = \prod_{i=1}^n (2\pi\sigma^2)^{-1/2}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)$$ and hence $$\frac{\partial}{\partial\sigma^2}\log L(\mu, \sigma^2|\mathbf{x}) = \frac{n}{2\sigma^4}\left(\frac{\sum_{i=1}^n (x_i-\mu)^2}{n} - \sigma^2\right).$$ Thus, taking $a(\sigma^2) = \frac{n}{2\sigma^4}$ shows that the best unbiased estimator of $\sigma^2$ is $\frac{\sum_{i=1}^n (x_i-\mu)^2}{n}$, which is not calculable if $\mu$ is unknown. In that case, the bound cannot be attained.
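As a numerical check of Examples 12.8 and 12.9 (a sketch with arbitrary $\mu$, $\sigma^2$, $n$, and seed, treating $\mu$ as known for the second estimator), the variance of $S^2$ sits above the bound $2\sigma^4/n$, while $\sum_{i=1}^n (X_i - \mu)^2/n$ attains it.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma2, n, reps = 1.0, 2.0, 8, 300_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
s2 = x.var(axis=1, ddof=1)                    # S^2, does not use the true mu
known_mu = np.mean((x - mu) ** 2, axis=1)     # uses the true mu, attains the bound

print("CRLB 2 sigma^4 / n    = %.4f" % (2 * sigma2**2 / n))
print("Var(S^2)              ~ %.4f  (theory 2 sigma^4/(n-1) = %.4f)"
      % (s2.var(), 2 * sigma2**2 / (n - 1)))
print("Var(sum (X-mu)^2 / n) ~ %.4f" % known_mu.var())
```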