Chapter 12 Mean Squared Error, Best Unbiased Estimators (Lecture on 02/06/2020)
Since we can usually apply more than one of these methods to find estimators in a particular situation, and these methods do not necessarily produce the same estimate, we are often faced with the task of choosing between estimators. The general topic of evaluating statistical procedures is part of the branch of statistics known as decision theory.
Definition 12.1 (Mean Squared Error) The mean squared error (MSE) of an estimator $W$ of a parameter $\theta$ is the function of $\theta$ defined by $E_\theta(W-\theta)^2$.
(Here $\theta$ is the parameter (not random), so the expectation is taken w.r.t. $W$.)

Definition 12.2 (Bias) The bias of a point estimator $W$ of a parameter $\theta$ is the difference between the expected value of $W$ and $\theta$; that is, $\mathrm{Bias}_\theta W = E_\theta W - \theta$. An estimator whose bias is identically (in $\theta$) equal to 0 is called unbiased and satisfies $E_\theta W = \theta$ for all $\theta$.
For an unbiased estimator, the MSE is equal to the variance, i.e. $E_\theta(W-\theta)^2 = \mathrm{Var}_\theta W$. It can be argued that MSE, while a reasonable criterion for location parameters, is not reasonable for scale parameters. (One problem is that MSE penalizes overestimation and underestimation equally, which is fine in the location case. In the scale case, however, 0 is a natural lower bound, so the estimation problem is not symmetric. Use of MSE in this case tends to be forgiving of underestimation.)
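More generally, expanding the square around $E_\theta W$ gives the bias-variance decomposition of the MSE, which is the identity used in Example 12.3 below:
$$
E_\theta(W-\theta)^2 = E_\theta\big[(W - E_\theta W) + (E_\theta W - \theta)\big]^2 = \mathrm{Var}_\theta W + (\mathrm{Bias}_\theta W)^2,
$$
since the cross term $2(E_\theta W - \theta)\,E_\theta(W - E_\theta W)$ vanishes.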
- In general, since MSE is a function of the parameter, there will not be one “best” estimator. Often, the MSEs of two estimators will cross each other, so that each estimator is better than the other over only a portion of the parameter space.
Example 12.3 (MSE of Binomial Bayes Estimator) Let $X_1,\dots,X_n$ be i.i.d. Bernoulli($p$). The MSE of the MLE $\hat p = \bar X$ as an estimator of $p$ is
$$E_p(\hat p - p)^2 = \mathrm{Var}_p \bar X = \frac{p(1-p)}{n}.$$
Let $Y=\sum_{i=1}^n X_i$; the Bayes estimator for $p$ is $\hat p_B = \frac{Y+\alpha}{\alpha+\beta+n}$ (see Example 11.3). The MSE of this Bayes estimator of $p$ is
$$E_p(\hat p_B - p)^2 = \mathrm{Var}_p(\hat p_B) + (\mathrm{Bias}_p \hat p_B)^2 = \mathrm{Var}_p\left(\frac{Y+\alpha}{\alpha+\beta+n}\right) + \left(E_p\left[\frac{Y+\alpha}{\alpha+\beta+n}\right] - p\right)^2 = \frac{np(1-p)}{(\alpha+\beta+n)^2} + \left(\frac{np+\alpha}{\alpha+\beta+n} - p\right)^2.$$
We may try to choose $\alpha$ and $\beta$ to make the MSE of $\hat p_B$ constant in $p$; such a choice is $\alpha=\beta=\sqrt{n/4}$. In that case,
$$E_p(\hat p_B - p)^2 = \frac{n}{4(n+\sqrt n)^2}.$$
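To see why this choice makes the MSE free of $p$, substitute $\alpha=\beta=\sqrt{n/4}=\sqrt n/2$ into the expression above. The bias term becomes
$$
\frac{np+\alpha}{\alpha+\beta+n} - p = \frac{np + \tfrac{\sqrt n}{2} - p(n+\sqrt n)}{n+\sqrt n} = \frac{\sqrt n\left(\tfrac12 - p\right)}{n+\sqrt n},
$$
so that
$$
E_p(\hat p_B - p)^2 = \frac{np(1-p) + n\left(\tfrac12 - p\right)^2}{(n+\sqrt n)^2} = \frac{np - np^2 + \tfrac n4 - np + np^2}{(n+\sqrt n)^2} = \frac{n}{4(n+\sqrt n)^2}.
$$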
We compare the MSE of $\hat p_B$ and $\hat p$ for different values of $p$ in Figure 12.1. As suggested by Figure 12.1, for small $n$, $\hat p_B$ is the better choice unless there is a strong belief that $p$ is near 0 or 1. For large $n$, $\hat p$ is the better choice unless there is a strong belief that $p$ is close to $\frac12$.
FIGURE 12.1: Comparison of the MSE of the MLE and the Bayes estimator of $p$ for sample sizes $n=4$ and $n=400$
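A minimal numerical sketch of this comparison (assuming NumPy is available; the function names are illustrative):

```python
import numpy as np

def mse_mle(p, n):
    # MSE of the MLE (the sample proportion): p(1-p)/n
    return p * (1.0 - p) / n

def mse_bayes(n):
    # MSE of the Bayes estimator with alpha = beta = sqrt(n)/2; constant in p
    return n / (4.0 * (n + np.sqrt(n)) ** 2)

p = np.linspace(0.0, 1.0, 1001)
for n in (4, 400):
    better = p[mse_bayes(n) < mse_mle(p, n)]
    print(f"n = {n}: Bayes estimator has smaller MSE for p in "
          f"[{better.min():.3f}, {better.max():.3f}]")
```

For $n=4$ the Bayes estimator wins over most of the interval, while for $n=400$ it wins only in a neighborhood of $p=1/2$, matching Figure 12.1.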
MSE can be a helpful criterion for finding the best estimator in a class of equivariant estimators. For an estimator $W(X)$ of $\theta$, using the principles of Measurement Equivariance and Formal Invariance, we have
Measurement Equivariance: $W(x)$ estimates $\theta \Rightarrow \bar g(W(x))$ estimates $\bar g(\theta)=\theta'$.
Formal Invariance: $W(x)$ estimates $\theta \Rightarrow W(g(x))$ estimates $\bar g(\theta)=\theta'$.
Measurement equivariance means that when a different measurement scale is used for $\theta$, the inference should not change; $\bar g(\cdot)$ here is the change of measurement. This gives the first relationship. Formal invariance means that when two inference problems have the same mathematical form, the results should be identical. Estimating $\theta'=\bar g(\theta)$ from the transformed data $g(x)$ has the same mathematical form as estimating $\theta$ from $x$, which leads to the second relationship.
Putting these two requirements together gives
$$W(g(x)) = \bar g(W(x)).$$
Example 12.4 (MSE of Equivariant Estimators) Let $X_1,\dots,X_n$ be i.i.d. with p.d.f. $f(x-\theta)$. For an estimator $W(X_1,\dots,X_n)$ to satisfy $W(g_a(x)) = \bar g_a(W(x))$, we must have
$$W(x_1,\dots,x_n) + a = W(x_1+a,\dots,x_n+a),$$
which specifies the equivariant estimators w.r.t. the group of transformations defined by $\mathcal{G}=\{g_a(x):-\infty<a<\infty\}$, where $g_a(x_1,\dots,x_n)=(x_1+a,\dots,x_n+a)$. For these estimators we have
$$
\begin{aligned}
E_\theta(W(X_1,\dots,X_n)-\theta)^2 &= E_\theta(W(X_1+a,\dots,X_n+a)-a-\theta)^2 \\
&= E_\theta(W(X_1-\theta,\dots,X_n-\theta))^2 \qquad (a=-\theta) \\
&= \int_{\mathcal X} (W(x_1-\theta,\dots,x_n-\theta))^2 \prod_{i=1}^n f(x_i-\theta)\,dx \\
&= \int_{\mathcal X} (W(\mu_1,\dots,\mu_n))^2 \prod_{i=1}^n f(\mu_i)\,d\mu \qquad (\mu_i = x_i-\theta).
\end{aligned}
$$
This last expression does not depend on $\theta$; hence, the MSEs of these equivariant estimators are not functions of $\theta$. The MSE can therefore be used to order the equivariant estimators, and an equivariant estimator with smallest MSE can be found.

A comparison of estimators based on MSE may not yield a clear favorite. Indeed, there is no single “best MSE” estimator. One way to make the problem of finding a “best” estimator tractable is to limit the class of estimators. Suppose there is an estimator $W^*$ of $\theta$ with $E_\theta W^* = \tau(\theta)$, and consider the class of estimators $\mathcal{C}_\tau = \{W : E_\theta W = \tau(\theta)\}$. For any $W_1, W_2 \in \mathcal{C}_\tau$, the biases of the two estimators are the same, so the MSE is determined by the variance. We prefer the estimator with the smallest variance in this class.
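A small simulation sketch of this fact (illustrative, assuming NumPy; the noise density $f$ is taken to be standard normal, and both the sample mean and the sample median are equivariant under the shift group $g_a$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10, 200_000

def mse(estimator, theta):
    # X_i = theta + Z_i with Z_i ~ f; here f is standard normal for illustration
    x = theta + rng.standard_normal((reps, n))
    return np.mean((estimator(x) - theta) ** 2)

# Both W(x) = mean(x) and W(x) = median(x) satisfy W(x_1 + a, ..., x_n + a) = W(x) + a,
# so their Monte Carlo MSEs should agree across theta up to simulation error.
for theta in (-5.0, 0.0, 12.0):
    print(theta,
          mse(lambda x: x.mean(axis=1), theta),
          mse(lambda x: np.median(x, axis=1), theta))
```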
Example 12.5 (Poisson Unbiased Estimation) Let $X_1,\dots,X_n$ be i.i.d. Pois($\lambda$), and let $\bar X$ and $S^2$ be the sample mean and variance, respectively. For the Poisson distribution, the mean and variance are both equal to $\lambda$. Therefore $E_\lambda \bar X = E_\lambda S^2 = \lambda$ for all $\lambda$, so both $\bar X$ and $S^2$ are unbiased estimators of $\lambda$.
Now consider the variances of $\bar X$ and $S^2$. It can be shown that $\mathrm{Var}_\lambda(\bar X) \le \mathrm{Var}_\lambda(S^2)$ for all $\lambda$, but calculating $\mathrm{Var}_\lambda(S^2)$ is not an easy task. Furthermore, consider the class of estimators $W_a(\bar X, S^2) = a\bar X + (1-a)S^2$: for every constant $a$, $W_a(\bar X, S^2)$ is an unbiased estimator of $\lambda$. Comparing the variances of all of them is not tractable, not to mention that there may be other unbiased estimators of a different form.
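A quick Monte Carlo sketch of this comparison (illustrative; assumes NumPy, with an arbitrary choice of $\lambda$ and $n$):

```python
import numpy as np

rng = np.random.default_rng(1)
lam, n, reps = 3.0, 20, 200_000

x = rng.poisson(lam, size=(reps, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)      # unbiased sample variance

# Both estimators are unbiased for lambda, but X_bar has the smaller variance.
print("E[X_bar] ~", xbar.mean(), "  Var(X_bar) ~", xbar.var())
print("E[S^2]   ~", s2.mean(),   "  Var(S^2)   ~", s2.var())
```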
A more reasonable way to find a best unbiased estimator is to first specify a lower bound $B(\theta)$ on the variance of any unbiased estimator, and then find an estimator $W^*$ satisfying $\mathrm{Var}_\theta(W^*) = B(\theta)$. This approach is taken with the use of the Cramer-Rao Lower Bound.
Theorem 12.1 (Cramer-Rao Inequality) Let $X_1,\dots,X_n$ be a sample with p.d.f. $f(x|\theta)$, and let $W(X)=W(X_1,\dots,X_n)$ be any estimator satisfying
$$\frac{d}{d\theta} E_\theta W(X) = \int_{\mathcal X} \frac{\partial}{\partial\theta}\left[W(x) f(x|\theta)\right] dx \tag{12.11}$$
and $\mathrm{Var}_\theta(W(X)) < \infty$. Then
$$\mathrm{Var}_\theta(W(X)) \ge \frac{\left(\frac{d}{d\theta} E_\theta W(X)\right)^2}{E_\theta\left(\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right)^2\right)}. \tag{12.12}$$
Proof. By the Cauchy-Schwarz Inequality, for any two random variables $X$ and $Y$,
$$[\mathrm{Cov}(X,Y)]^2 \le \mathrm{Var}(X)\,\mathrm{Var}(Y). \tag{12.13}$$
Rearranging the terms in (12.13), we get a lower bound on the variance of $X$:
$$\mathrm{Var}(X) \ge \frac{[\mathrm{Cov}(X,Y)]^2}{\mathrm{Var}(Y)}. \tag{12.14}$$
The cleverness in this theorem is in choosing $X$ to be the estimator $W(X)$ and $Y$ to be the quantity $\frac{\partial}{\partial\theta}\log f(X|\theta)$.
First, note that
$$\frac{d}{d\theta} E_\theta W(X) = \int_{\mathcal X} W(x)\left[\frac{\partial}{\partial\theta} f(x|\theta)\right] dx = E_\theta\left[W(X)\,\frac{\frac{\partial}{\partial\theta} f(X|\theta)}{f(X|\theta)}\right] = E_\theta\left[W(X)\,\frac{\partial}{\partial\theta}\log f(X|\theta)\right]. \tag{12.15}$$
Here the second equality holds by multiplying and dividing the integrand by $f(x|\theta)$ and writing the integral as an expectation w.r.t. the random variable $X$, and the third equality uses the derivative of the logarithm. Taking $W(x)=1$ for all $x$, (12.15) gives
$$E_\theta\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right) = \frac{d}{d\theta} E_\theta(1) = 0. \tag{12.16}$$
Therefore,
$$\mathrm{Cov}_\theta\left(W(X),\, \frac{\partial}{\partial\theta}\log f(X|\theta)\right) = E_\theta\left[W(X)\,\frac{\partial}{\partial\theta}\log f(X|\theta)\right] = \frac{d}{d\theta} E_\theta W(X), \tag{12.17}$$
and since $E_\theta\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right) = 0$,
$$\mathrm{Var}_\theta\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right) = E_\theta\left(\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right)^2\right). \tag{12.18}$$
Using (12.14), (12.17), and (12.18), we see that (12.12) holds, as desired.

The Cramer-Rao Lower Bound also holds for discrete random variables, with condition (12.11) modified to the interchangeability of differentiation and summation.
$E_\theta\left(\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right)^2\right)$ is called the information number, or Fisher information, of the sample. This terminology reflects the fact that the information number gives a bound on the variance of the best unbiased estimator of $\theta$. As the information number gets bigger, so that we have more information about $\theta$, we have a smaller bound on the variance of the best unbiased estimator.
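For example, for the i.i.d. Poisson($\lambda$) sample of Example 12.5, the joint p.m.f. gives $\log f(x|\lambda) = -n\lambda + \left(\sum_i x_i\right)\log\lambda - \sum_i \log(x_i!)$, so
$$
\frac{\partial}{\partial\lambda}\log f(X|\lambda) = \frac{\sum_{i=1}^n X_i}{\lambda} - n, \qquad
E_\lambda\left(\left(\frac{\partial}{\partial\lambda}\log f(X|\lambda)\right)^2\right) = \frac{\mathrm{Var}_\lambda\left(\sum_{i=1}^n X_i\right)}{\lambda^2} = \frac{n}{\lambda},
$$
so by the Cramer-Rao inequality any unbiased estimator of $\lambda$ has variance at least $\lambda/n$, which is exactly $\mathrm{Var}_\lambda(\bar X)$.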
- For any differentiable function $\tau(\theta)$, the Cramer-Rao inequality gives a lower bound on the variance of any estimator $W$ that satisfies (12.11) and $E_\theta W = \tau(\theta)$. The bound depends only on $\tau(\theta)$ and $f(x|\theta)$. Any candidate estimator satisfying $E_\theta W = \tau(\theta)$ and attaining this lower bound is a best unbiased estimator of $\tau(\theta)$.
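A small Monte Carlo sketch of this bound for the Bernoulli model of Example 12.3 (illustrative, assuming NumPy): the squared score averaged over simulated samples approximates the information number, and the variance of the sample proportion matches the resulting bound $p(1-p)/n$.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, reps = 0.3, 25, 200_000

x = rng.binomial(1, p, size=(reps, n))
s = x.sum(axis=1)
# Score of the joint Bernoulli model: d/dp log prod_i f(x_i|p) = sum(x)/p - (n - sum(x))/(1-p)
score = s / p - (n - s) / (1 - p)

info_mc = np.mean(score ** 2)        # Monte Carlo estimate of the information number
info_exact = n / (p * (1 - p))       # analytic value n / (p(1-p))
print("information number:", info_mc, "vs", info_exact)
print("Cramer-Rao bound 1/I =", 1 / info_exact,
      "  Var(p_hat) ~", x.mean(axis=1).var())   # the MLE attains the bound
```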
The interchange condition (12.11) can fail when the support of $f(x|\theta)$ depends on $\theta$. For instance, if $f(x|\theta) = 1/\theta$ for $0 < x < \theta$ (the Uniform$(0,\theta)$ density), then for a statistic $h(X)$,
$$\frac{d}{d\theta}\int_0^\theta h(x) f(x|\theta)\,dx = \frac{d}{d\theta}\int_0^\theta h(x)\frac1\theta\,dx = \frac{h(\theta)}{\theta} + \int_0^\theta h(x)\frac{\partial}{\partial\theta}\left(\frac1\theta\right)dx \ne \int_0^\theta h(x)\frac{\partial}{\partial\theta}\left(\frac1\theta\right)dx$$
unless $h(\theta)/\theta = 0$ for all $\theta$. Hence, the Cramer-Rao Theorem does not apply. In general, if the range of the p.d.f. depends on the parameter, the theorem will not be applicable.
Proof. The Cramer-Rao Inequality can be written as
$$\left[\mathrm{Cov}_\theta\left(W(X),\, \frac{\partial}{\partial\theta}\log\prod_{i=1}^n f(X_i|\theta)\right)\right]^2 \le \mathrm{Var}_\theta(W(X))\,\mathrm{Var}_\theta\left(\frac{\partial}{\partial\theta}\log\prod_{i=1}^n f(X_i|\theta)\right), \tag{12.35}$$
recalling that $E_\theta W = \tau(\theta)$ and $E_\theta\left(\frac{\partial}{\partial\theta}\log\prod_{i=1}^n f(X_i|\theta)\right) = 0$. Inequality (12.35) is just the Cauchy-Schwarz inequality in probability. Writing $S(\theta) = \frac{\partial}{\partial\theta}\log\prod_{i=1}^n f(X_i|\theta)$ and $\langle U, V\rangle = E_\theta(UV)$, it can also be written as
$$\left|\left\langle W - E_\theta W,\ S(\theta) - E_\theta S(\theta)\right\rangle\right|^2 \le \left\langle W - E_\theta W,\ W - E_\theta W\right\rangle\,\left\langle S(\theta) - E_\theta S(\theta),\ S(\theta) - E_\theta S(\theta)\right\rangle.$$
Thus, by the necessary and sufficient condition for equality in the Cauchy-Schwarz inequality, equality holds exactly when $W(x) - \tau(\theta)$ is proportional to $\frac{\partial}{\partial\theta}\log\prod_{i=1}^n f(x_i|\theta)$, which is exactly (12.34).
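For instance, in the Poisson setting of Example 12.5, the score factors as
$$
\frac{\partial}{\partial\lambda}\log\prod_{i=1}^n f(x_i|\lambda) = \frac{\sum_{i=1}^n x_i}{\lambda} - n = \frac{n}{\lambda}\,(\bar x - \lambda),
$$
which is of the form $a(\lambda)\,[W(x) - \tau(\lambda)]$ with $a(\lambda) = n/\lambda$, $W(x) = \bar x$, and $\tau(\lambda) = \lambda$; hence $\bar X$ attains the Cramer-Rao Lower Bound and is the best unbiased estimator of $\lambda$.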