2 Estimation Theory

2.1 Bias, Variance and MSE

Given a sample X_1,\dots,X_n from the population distribution P, consider an estimator \hat\theta_n\equiv\hat\theta(X_1,\dots,X_n) of a real-valued parameter \theta\in\Theta\subseteq\mathbb{R}. The parameter \theta is simply a feature of P.


Statistical inference requires assessing the accuracy of an estimator. One way to do so is via the bias of an estimator, which is defined as \textrm{Bias}(\hat\theta_n)=E(\hat\theta_n)-\theta. An estimator is called unbiased if E(\hat\theta_n)=\theta for all possible values of the true parameter \theta.


The variance of an estimator is given by \textrm{var}(\hat\theta_n)=E\left((\hat\theta_n-E(\hat\theta_n))^2\right).


Performance of an estimator is most frequently evaluated with respect to the quadratic loss (also called L_2 loss) (\hat\theta_n-\theta)^2. The corresponding risk is the Mean Squared Error (MSE) E\left((\hat\theta_n-\theta)^2\right)=\textrm{Bias}(\hat\theta_n)^2+\textrm{var}(\hat\theta_n). For an unbiased estimator the mean squared error is equal to the variance of the estimator.


Example: Assume an i.i.d. sample X_1,\dots,X_n with mean \mu=E(X_i) and variance \sigma^2=\textrm{var}(X_i)<\infty.

  • The sample average \bar X is an unbiased estimator of the true mean \mu, since the equation E(\bar X)=\mu holds for any possible value of the true mean \mu.
  • The variance of the estimator \bar X is given by \textrm{var}(\bar X)=\sigma^2/n.
  • The mean squared error of the estimator \bar X is given by E\left((\bar X-\mu)^2\right)=\textrm{var}(\bar X)=\sigma^2/n; these three quantities are illustrated by the simulation sketch below.
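To make these formulas concrete, the following minimal Python sketch estimates bias, variance and MSE of \bar X by Monte Carlo. The population (normal with \mu=2, \sigma=3), the sample size n=50 and the number of replications are arbitrary choices made purely for illustration.

```python
# Monte Carlo illustration of Bias, var and MSE of the sample mean.
# Population, n and number of replications are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, n_rep = 2.0, 3.0, 50, 100_000

# one sample mean per replication
means = rng.normal(mu, sigma, size=(n_rep, n)).mean(axis=1)

bias = means.mean() - mu
var = means.var()
mse = np.mean((means - mu) ** 2)

print(f"bias ≈ {bias:.4f}   (theory: 0)")
print(f"var  ≈ {var:.4f}   (theory: sigma^2/n = {sigma**2 / n:.4f})")
print(f"mse  ≈ {mse:.4f}   (bias^2 + var = {bias**2 + var:.4f})")
```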

2.2 Consistency of Estimators

Asymptotic theory is concerned with theoretical results valid for “large sample sizes”. Important keywords of asymptotic theory are:

  • consistency
  • rates of convergence
  • asymptotic distributions

All of these rely on concepts of stochastic convergence of random variables.


Stochastic convergence. Let \{Z_n\}_{n=1,2,3,\dots} be a sequence of random variables. Mathematically, there are different kinds of convergence of \{Z_n\} to a fixed value c. Two important ones are:

  • Convergence in probability (abbreviated Z_n\to_P c): \lim_{n\to\infty} P\left(|Z_n-c|>\epsilon\right)=0\quad\hbox{ for all }\quad \epsilon>0
  • Convergence in quadratic mean (abbreviated Z_n\to_{q.m.} c): \lim_{n\to\infty} E\left((Z_n-c)^2\right)=0

It can be shown using Chebyshev’s inequality, P(|Z_n-c|\geq\epsilon)\leq E\left((Z_n-c)^2\right)/\epsilon^2 for every \epsilon>0, that Z_n\to_{q.m.} c implies Z_n\to_P c.


Consistency of estimators. Based on a sample X_1,\dots,X_n let \hat\theta_n\equiv\hat\theta(X_1,\dots,X_n) be an estimator of an unknown parameter \theta. We say that \hat\theta_n is a (weakly) consistent estimator of \theta if \hat{\theta}_n\to_{P} \theta\quad \hbox{ as }\quad n\to\infty.


Example: Assume again an i.i.d. sample X_1,\dots,X_n with mean \mu=E(X_i) and variance \sigma^2=\textrm{var}(X_i)<\infty. As stated above we then have E\left((\bar X-\mu)^2\right)=\textrm{var}(\bar X)=\sigma^2/n\rightarrow 0 \quad \text{as } n\rightarrow\infty. Therefore, \bar X \to_{q.m.} \mu. The latter implies that \bar X \to_{P} \mu, i.e. \bar X is a (weakly) consistent estimator of \mu.
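A small simulation can make this convergence visible. The sketch below estimates P(|\bar X-\mu|>\epsilon) for increasing n; the exponential population with mean \mu=1, the tolerance \epsilon=0.1 and the sample sizes are arbitrary choices for illustration. The probabilities shrink towards zero, as consistency requires.

```python
# Consistency of the sample mean: P(|Xbar - mu| > eps) -> 0 as n grows.
# Exponential population with mean mu = 1; eps and the n values are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
mu, eps, n_rep = 1.0, 0.1, 10_000

for n in (10, 100, 1000):
    means = rng.exponential(scale=mu, size=(n_rep, n)).mean(axis=1)
    print(f"n = {n:4d}:  P(|Xbar - mu| > {eps}) ≈ {np.mean(np.abs(means - mu) > eps):.4f}")
```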

2.3 Rates of Convergence

Rates of convergence quantify the (stochastic) order of magnitude of an estimation error as a function of the sample size n. This order of magnitude is usually represented using the symbols O_P and o_P.


Let \{Z_n\}_{n=1,2,3,\dots} be a sequence of random variables, and let \{c_n\}_{n=1,2,3,\dots} be a sequence of positive (deterministic) numbers.

  • We will write Z_n=O_p(c_n) if for any \epsilon>0 there exist numbers 0<M<\infty and m such that P(|Z_n|\ge M\cdot c_n)\leq\epsilon\quad\hbox{ for all }\quad n\geq m.
  • We will write Z_n=o_p(c_n) if \lim_{n\to\infty} P(|Z_n|\geq\epsilon\cdot c_n)=0\quad\hbox{ for all }\quad \epsilon>0.
  • With c_n=1 for all n, Z_n=O_p(1) means that the sequence \{Z_n\} is stochastically bounded. That is, for any \epsilon>0 there exist numbers 0<M<\infty and m such that P(|Z_n|\geq M)\leq\epsilon\quad\hbox{ for all }\quad n\geq m. A simulation illustrating stochastic boundedness is sketched after the notes below.
  • With c_n=1 for all n, Z_n=o_P(1) is equivalent to Z_n\to_{P} 0, i.e., Z_n converges in probability to zero.

Note that:

  • Z_n=O_p(c_n) is equivalent to Z_n/c_n=O_p(1)
  • Z_n=o_p(c_n) is equivalent to Z_n/c_n=o_p(1)
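As a numerical illustration of these definitions, consider Z_n=\bar X-\mu for a standard normal sample (so \mu=0). Then Z_n=O_p(n^{-1/2}), i.e. \sqrt{n}Z_n=O_p(1), while nZ_n is not stochastically bounded. The sketch below (sample sizes and quantile level are arbitrary choices) compares upper quantiles of |\sqrt{n}Z_n| and |nZ_n|: the former stabilize, the latter grow.

```python
# Stochastic boundedness: sqrt(n)*(Xbar - mu) = O_p(1), but n*(Xbar - mu) is not.
# Standard normal population (mu = 0); sample sizes and quantile level arbitrary.
import numpy as np

rng = np.random.default_rng(2)
n_rep = 10_000

for n in (10, 100, 1000):
    means = rng.standard_normal(size=(n_rep, n)).mean(axis=1)
    q_sqrt = np.quantile(np.abs(np.sqrt(n) * means), 0.99)
    q_lin = np.quantile(np.abs(n * means), 0.99)
    print(f"n = {n:4d}:  99%-quantile of sqrt(n)|Xbar| ≈ {q_sqrt:.2f},  of n|Xbar| ≈ {q_lin:.2f}")
```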


Definition: An estimator \hat\theta_n of a parameter \theta possesses the rate of convergence n^{-r} if and only if r is the largest positive number with the property that |\hat\theta_n-\theta|=O_P(n^{-r}).

The rate of convergence quantifies how fast the estimation error decreases when increasing the sample size n.


Some remarks:

  • Estimators based on parametric models, such as maximum-likelihood estimators, typically have a rate of convergence of n^{-1/2} (there are exceptions!).
  • The situation is different, for instance, in nonparametric curve estimation problems. For example kernel estimators (of a density or regression function) only achieve the rate of convergence n^{-2/5}.
  • The rate of convergence is an important criterion for selecting the best possible estimator for a given problem. For most parametric problems, it is well known that the optimal (i.e. fastest possible) convergence rate is n^{-1/2}. In nonparametric regression or density estimation the optimal convergence rate is only n^{-2/5}, if the underlying function is twice continuously differentiable.
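The rate n^{-1/2} of the sample mean can be checked empirically: the slope of \log(\textrm{RMSE}) against \log(n) should be close to -1/2. The sketch below does this for a uniform population on [0,1]; the population, the simulated sample sizes and the number of replications are illustrative choices only.

```python
# Empirical check of the n^(-1/2) rate of the sample mean via a log-log fit.
# Uniform(0,1) population (mu = 0.5); sample sizes chosen for illustration.
import numpy as np

rng = np.random.default_rng(3)
mu, n_rep = 0.5, 20_000
ns = np.array([25, 50, 100, 200, 400, 800])

rmse = np.array([
    np.sqrt(np.mean((rng.uniform(size=(n_rep, n)).mean(axis=1) - mu) ** 2))
    for n in ns
])
slope = np.polyfit(np.log(ns), np.log(rmse), 1)[0]
print(f"estimated rate exponent ≈ {slope:.3f}   (theory: -0.5)")
```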


O_P-rules:

  • We have Z_n\rightarrow_P Z \qquad \text{if and only if }\qquad Z_n=Z+o_p(1) This follows from Z_n=Z+(Z_n-Z) and Z_n-Z\rightarrow_P 0.


  • If Z_n=O_P(n^{-\delta}) for some \delta>0, then Z_n=o_P(1)


  • If Z_n=O_P(r_n), then Z_n^\delta=O_P(r_n^\delta) for any \delta>0. Similarly, Z_n=o_P(r_n) implies Z_n^\delta=o_P(r_n^\delta) for any \delta>0.


  • If Z_n=O_P(r_n) and V_n=O_P(s_n), then \begin{align*} Z_n+V_n & =O_P(\max\{r_n,s_n\})\\ Z_nV_n & =O_P(r_ns_n) \end{align*}


  • If Z_n=o_P(r_n) and V_n=O_P(s_n), then Z_nV_n=o_P(r_n s_n)


  • If E(|Z_n|^k)=O(r_n), then Z_n=O_p(|r_n|^{1/k}) for k=1,2,3,\dots
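A small worked application of these rules, using the sample mean example from Section 2.1 and assuming one wants to estimate \mu^2 by \bar X^2: \begin{align*} \bar X-\mu &= O_P(n^{-1/2}) &&\text{(last rule with } k=2\text{, since } E((\bar X-\mu)^2)=\sigma^2/n\text{)}\\ \bar X+\mu &= 2\mu+(\bar X-\mu)=O_P(1) &&\text{(sum rule)}\\ \bar X^2-\mu^2 &= (\bar X-\mu)(\bar X+\mu)=O_P(n^{-1/2})\cdot O_P(1)=O_P(n^{-1/2}) &&\text{(product rule)} \end{align*} Hence \bar X^2 estimates \mu^2 with the same rate of convergence n^{-1/2}.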

2.4 Asymptotic Distributions

The practically most important version of stochastic convergence is convergence in distribution. Knowledge about the “asymptotic distribution” of an estimator allows one to construct confidence intervals and tests.


Definition: Let Z_n be a sequence of random variables with corresponding distribution functions G_n. Then Z_n converges in distribution to a random variable Z with distribution function G, if G_n(x)\to G(x)\quad\hbox{ as }\quad n\to\infty at all continuity points x of G (abbreviated: Z_n\to_L Z or Z_n\to_L G or “\to_D” instead of “\to_L”).


In the vast majority of practically important situations the limiting distribution is the normal distribution. One then speaks of asymptotic normality. Asymptotic normality is usually a consequence of central limit theorems. The simplest result in this direction is the central limit theorem of Lindeberg-Levy.


Theorem (Lindeberg-Levy) Let X_1,X_2,\dots be a sequence of i.i.d. random variables with finite mean \mu and variance \sigma^2<\infty. Then \sqrt{n}\left(\frac{1}{n} \sum_{i=1}^n X_i -\mu\right)\rightarrow_L N(0,\sigma^2).


We can conclude that \bar X is an “asymptotically normal estimator” of \mu. If n is sufficiently large, then \bar X is approximately normal with mean \mu and variance \sigma^2/n. Frequently used notations:

  • \bar X\sim AN(\mu,\sigma^2/n)
  • \bar X\overset{a}{\sim}N(\mu,\sigma^2/n)
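A quick simulation check of this approximation (with an exponential population, so \mu=\sigma=1, and an arbitrarily chosen n=500): the standardized sample mean \sqrt{n}(\bar X-\mu)/\sigma should behave approximately like N(0,1).

```python
# CLT check: the standardized sample mean is approximately N(0,1) for large n.
# Exponential population with mu = sigma = 1; n and n_rep chosen for illustration.
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n, n_rep = 1.0, 1.0, 500, 20_000

z = np.sqrt(n) * (rng.exponential(scale=mu, size=(n_rep, n)).mean(axis=1) - mu) / sigma

for p, q in [(0.025, -1.960), (0.500, 0.000), (0.975, 1.960)]:
    print(f"{p:.3f}-quantile: empirical {np.quantile(z, p):+.3f},  N(0,1): {q:+.3f}")
```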


Most estimators \hat\theta_n used in parametric and nonparametric statistics are asymptotically normal. In parametric problems (with rate of convergence n^{-1/2}) one usually obtains \sqrt{n}(\hat\theta_n -\theta )\to_L N(0,v^2), where v^2 is the asymptotic variance of the estimator (often, but not necessarily, v^2=\lim_{n\to\infty} n\cdot\textrm{var}(\hat\theta_n)).


Multivariate generalization: The above concepts are easily generalized to estimators \hat\theta_n of a multivariate parameter vector \theta\in\mathbb{R}^p. Consistency and rates of convergence then have to be derived separately for each element of the vector. Convergence in distribution is defined via convergence of the multivariate distribution functions. For standard estimators (e.g., maximum likelihood) in parametric problems one usually obtains \sqrt{n}(\hat\theta_n -\theta )\to_L N_p(0,V), where V is the asymptotic covariance matrix (usually, V=\lim_{n\to\infty} n\cdot\textrm{Cov}(\hat\theta_n)).


Multivariate normality holds if and only if for any vector c=(c_1,\dots,c_p)'\in\mathbb{R}^p with \sum_{j=1}^p c_j^2=\Vert c\Vert_2^2=1 \sqrt{n}\left(\sum_{j=1}^p c_j (\hat\theta_{jn} -\theta_j)\right)=\sqrt{n}\left(c'\hat\theta_n-c'\theta\right)\to_L N\left(0,v_c^2\right), where v_c^2=c'Vc=\sum_{j=1}^p\sum_{k=1}^p c_jc_k V_{jk}, and where V_{jk} are the elements of the asymptotic covariance matrix V.

This condition is frequently called the “Cramer-Wold device”. Using one-dimensional central limit theorems, it can be verified for any vector c.


Example: Let X_1=(X_{11},X_{12})',\dots,X_n=(X_{n1},X_{n2})' be i.i.d. two-dimensional random vectors with E(X_i)=\mu=(\mu_1,\mu_2)' and Cov(X_i)=\Sigma. The Cramer-Wold device and Lindeberg-Levy’s central limit theorem then imply that \sqrt{n}\left(\bar X -\mu\right)\to_L N_2\left(0,\Sigma\right).
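The following sketch illustrates this bivariate case with a simple non-normal population chosen purely for illustration: X_i=(U_i,U_i+V_i)' with U_i,V_i i.i.d. Exp(1), so that \mu=(1,2)' and \Sigma has entries \Sigma_{11}=1, \Sigma_{12}=\Sigma_{21}=1, \Sigma_{22}=2. The empirical covariance matrix of \sqrt{n}(\bar X-\mu) should be close to \Sigma.

```python
# Multivariate CLT check: Cov of sqrt(n)*(Xbar - mu) approaches Sigma.
# Bivariate population X = (U, U+V) with U, V ~ Exp(1), chosen for illustration.
import numpy as np

rng = np.random.default_rng(5)
n, n_rep = 400, 10_000
mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 1.0],
                  [1.0, 2.0]])

U = rng.exponential(size=(n_rep, n))
V = rng.exponential(size=(n_rep, n))
X = np.stack([U, U + V], axis=-1)          # shape (n_rep, n, 2)
Z = np.sqrt(n) * (X.mean(axis=1) - mu)     # one sqrt(n)*(Xbar - mu) per replication

print("empirical covariance of sqrt(n)*(Xbar - mu):\n", np.cov(Z, rowvar=False).round(2))
print("Sigma:\n", Sigma)
```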


Note that asymptotic normality usually also holds for nonparametric curve estimators with convergence rates slower than n^{-1/2}.

2.5 Asymptotic Theory

Many estimation procedures in modern statistics rely on fairly general assumptions. For a given sample size n it is then often impossible to derive the exact distribution of \widehat{\theta}_n. The necessary calculations are too complex, and finite-sample distributions usually depend on unknown characteristics of the distribution of the underlying data.


The goal of asymptotic theory is then to derive reasonable approximations. For large samples such approximations are very accurate; for small samples there may be a considerable approximation error. Therefore, for small samples the quality of asymptotic approximations is usually studied by Monte Carlo simulations, which we will discuss later.


Asymptotic theory is used in order to select an appropriate estimation procedure in complex situations. The idea is to determine the estimator which, at least for large sample sizes, provides the smallest possible estimation error. This leads to the concept of “asymptotically efficient” estimators.


Properties of an asymptotically efficient estimator \widehat{\theta}_n:

  • For the estimation problem under consideration \widehat{\theta}_n is consistent and attains the fastest possible rate of convergence (generally n^{-1/2} in parametric statistics; n^{-2/5} can be achieved in nonparametric univariate curve estimation problems).
  • In most regular situations one is additionally interested in a “best asymptotically normal” (BAN) estimator. Assume that \sqrt{n}(\widehat{\theta}_n -\theta)\to_L N(0,v^2). Then \widehat{\theta}_n is a BAN-estimator if any alternative estimator \tilde\theta_n with \sqrt{n}(\tilde\theta_n -\theta)\to_L N(0,\tilde v^2) possesses an asymptotic variance which is at least as large, i.e. \tilde v^2\geq v^2.
  • Multivariate generalization: An estimator \widehat{\theta}_n with \sqrt{n}(\widehat{\theta}_n -\theta)\to_L N_p(0,V) is best asymptotically normal if c'\tilde V c\geq c'Vc\quad\hbox{ for all }\quad c\in\mathbb{R}^p, \Vert c\Vert_2^2=1 for any other estimator \tilde\theta_n satisfying \sqrt{n}(\tilde\theta_n -\theta)\to_L N_p(0,\tilde V).


For most estimation problems in parametric statistics maximum-likelihood estimators are best asymptotically normal.

2.6 Mathematical tools

2.6.1 Taylor expansions

Taylor’s theorem: Let f be a real-valued function which is (k+1)-times continuously differentiable in the interior of an interval [a,b]. Consider a point x_0\in (a,b). For any other value x\in (a,b) there exists some \psi between x_0 and x such that f(x)=f(x_0)+\sum_{r=1}^k \frac{1}{r!}f^{(r)}(x_0)\cdot(x-x_0)^r+\frac{1}{(k+1)!}f^{(k+1)}(\psi)\cdot(x-x_0)^{k+1}

Multivariate generalization: x_0,x\in\mathbb{R}^p, f'(x_0)\in\mathbb{R}^p the gradient, and f''(x_0) the p\times p Hessian matrix.

First order Taylor approximation: f(x)=f(x_0)+f'(x_0)\cdot(x-x_0)+O(\Vert x-x_0\Vert_2^2)

Second order Taylor approximation: f(x)=f(x_0)+f'(x_0)\cdot(x-x_0)+\frac{1}{2} (x-x_0)^T f''(x_0)(x-x_0)+O(\Vert x-x_0\Vert_2^3)
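A quick numerical check of these error orders, using the (arbitrarily chosen) univariate function f(x)=\exp(x) around x_0=0: halving the step roughly quarters the first-order error and divides the second-order error by about eight.

```python
# Taylor approximation errors for f(x) = exp(x) at x0 = 0:
# first-order error ~ h^2/2, second-order error ~ h^3/6.
import numpy as np

x0 = 0.0
f0 = f1 = f2 = np.exp(x0)   # f, f', f'' all equal exp(x0) for the exponential
for h in (0.1, 0.05, 0.025):
    err1 = abs(np.exp(x0 + h) - (f0 + f1 * h))
    err2 = abs(np.exp(x0 + h) - (f0 + f1 * h + 0.5 * f2 * h ** 2))
    print(f"h = {h:5.3f}:  1st-order error {err1:.2e} (h^2/2 = {h**2/2:.2e}),  "
          f"2nd-order error {err2:.2e} (h^3/6 = {h**3/6:.2e})")
```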

2.6.2 Tools for deriving asymptotic distributions

Let \{W_n\}, \{Z_n\} be sequences of random variables, then:

  • Z_n=W_n+o_P(1)\quad \Leftrightarrow \quad Z_n-W_n\to_P 0. If additionally W_n\to_L N(0,v^2) then Z_n\to_L N(0,v^2).
  • For any fixed constant c\neq 0: If Z_n\to_P c and W_n\to_L N(0,v^2), then cW_n\to_L N(0,c^2v^2)\quad\hbox{as well as }\quad V_n:=Z_n\cdot W_n\to_L N(0,c^2v^2). Furthermore, if Z_n and c are positive (with probability 1), then also W_n/c\to_L N(0,v^2/c^2)\quad\hbox{as well as }\quad V_n:= W_n/Z_n\to_L N(0,v^2/c^2).
  • Multivariate generalization (C, Z_n p\times p matrices; W_n p-dimensional random vectors): If Z_n\to_P C as well as W_n\to_L N_p(0,V), then \begin{align*} CW_n&\to_L N_p(0,CVC')\quad\hbox{as well as }\\ V_n:=Z_n\cdot W_n&\to_L N_p(0,CVC') \end{align*}
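A typical application of these rules is the studentized mean: with W_n=\sqrt{n}(\bar X-\mu)\to_L N(0,\sigma^2) and Z_n=S_n, the sample standard deviation, satisfying S_n\to_P\sigma, the second rule gives \sqrt{n}(\bar X-\mu)/S_n\to_L N(0,1). The sketch below checks this for an Exp(1) population (so \mu=\sigma=1; the population, n and the number of replications are illustrative choices).

```python
# Slutsky-type argument: sqrt(n)*(Xbar - mu)/S_n ->_L N(0,1), since S_n ->_P sigma.
# Exp(1) population (mu = sigma = 1); n and n_rep chosen for illustration.
import numpy as np

rng = np.random.default_rng(6)
n, n_rep = 500, 20_000

X = rng.exponential(size=(n_rep, n))
t = np.sqrt(n) * (X.mean(axis=1) - 1.0) / X.std(axis=1, ddof=1)

print(f"empirical variance of the studentized mean ≈ {t.var():.3f}   (theory: 1)")
print(f"empirical 97.5%-quantile ≈ {np.quantile(t, 0.975):.3f}   (N(0,1): 1.960)")
```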

2.6.3 The Delta-Method

A further tool which is frequently used in asymptotic statistics is the so-called delta-method.

Delta-Method: Let \widehat{\theta}_n be a sequence of estimators of a one-dimensional parameter \theta satisfying n^{r} (\widehat{\theta}_n-\theta)\rightarrow_L N(0,v^2), and let g(.) be a real-valued function which is continuously differentiable at \theta and satisfies g'(\theta)\neq 0. Then n^{r} \left(g(\widehat{\theta}_n)-g(\theta)\right) \rightarrow_L N\left(0,g'(\theta)^2v^2\right).


Example: Assume an i.i.d. random sample X_1,\dots,X_n from an exponential distribution. That is, the underlying density of X_i, i=1,\dots,n, is given by f(x|\theta)=\theta\exp(-\theta x). We then have \mu:=E(X_i)=1/\theta as well as \sigma^2_X:=\textrm{var}(X_i)=1/\theta^2. The underlying parameter \theta>0 is unknown and has to be estimated from the data.


The maximum-likelihood estimator of \theta is \hat\theta=1/\bar X.


We know that \sqrt{n}(\bar X-\frac{1}{\theta})\to_L N(0,\frac{1}{\theta^2}), but what about the distribution of 1/\bar X? For this purpose the delta-method can be applied with g(x)=1/x. Then g'(x)=-1/x^2, g'(1/\theta)=-\theta^2, and consequently n^{1/2} \left(\frac{1}{\bar X}-\theta\right)=n^{1/2}\left(g\left(\bar X\right)-g\left(\frac{1}{\theta}\right)\right)\rightarrow_L N\left(0,\theta^2\right).
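The delta-method prediction for this example can again be checked by simulation. The sketch below uses the arbitrary true value \theta=2, n=500 and 20{,}000 replications, and compares the empirical variance of \sqrt{n}(1/\bar X-\theta) with the asymptotic value \theta^2.

```python
# Delta-method check for the exponential example: sqrt(n)*(1/Xbar - theta) ~ N(0, theta^2).
# True theta, n and n_rep are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(7)
theta, n, n_rep = 2.0, 500, 20_000

X = rng.exponential(scale=1.0 / theta, size=(n_rep, n))   # density theta * exp(-theta * x)
z = np.sqrt(n) * (1.0 / X.mean(axis=1) - theta)

print(f"empirical mean ≈ {z.mean():.3f}   (theory: ≈ 0)")
print(f"empirical variance ≈ {z.var():.3f}   (delta-method: theta^2 = {theta**2:.1f})")
```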