2 Estimation Theory
2.1 Bias, Variance and MSE
Given a sample \(X_1,\dots,X_n\) from the population distribution \(\mathcal{P}\), consider an estimator \(\widehat{\theta}_n\equiv\widehat{\theta}(X_1,\dots,X_n)\) of a real-valued parameter \(\theta\in\Theta\subset\mathbb{R}\). The parameter \(\theta\) is simply a feature of \(\mathcal{P}\).
Statistical inference requires assessing the accuracy of an estimator. One way to do so is via the bias of an estimator, which is defined as \[\textrm{Bias}(\widehat{\theta}_n)=E(\widehat{\theta}_n)-\theta.\] An estimator is called unbiased if \(E(\widehat{\theta}_n)=\theta\) for all possible values of the true parameter \(\theta\).
The variance of an estimator is given by \[\textrm{var}(\widehat{\theta}_n)=E\left((\widehat{\theta}_n-E(\widehat{\theta}_n))^2\right).\]
Performance of an estimator is most frequently evaluated with respect to the quadratic loss (also called \(L_2\) loss) \[(\widehat{\theta}_n-\theta)^2.\] The corresponding risk is the Mean Squared Error (MSE) \[E\left((\widehat{\theta}_n-\theta)^2\right)=\textrm{Bias}(\widehat{\theta}_n)^2+\textrm{var}(\widehat{\theta}_n).\] For an unbiased estimator the mean squared error is equal to the variance of the estimator.
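To see where this decomposition comes from, add and subtract \(E(\widehat{\theta}_n)\) inside the square: \[E\left((\widehat{\theta}_n-\theta)^2\right)=E\left(\left(\widehat{\theta}_n-E(\widehat{\theta}_n)\right)^2\right)+\left(E(\widehat{\theta}_n)-\theta\right)^2=\textrm{var}(\widehat{\theta}_n)+\textrm{Bias}(\widehat{\theta}_n)^2,\] since the cross term \(2E\left(\left(\widehat{\theta}_n-E(\widehat{\theta}_n)\right)\left(E(\widehat{\theta}_n)-\theta\right)\right)\) vanishes: \(E(\widehat{\theta}_n)-\theta\) is a constant and \(E\left(\widehat{\theta}_n-E(\widehat{\theta}_n)\right)=0\).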
Example: Assume an i.i.d. sample \(X_1,\dots,X_n\) with mean \(\mu=E(X_i)\) and variance \(\sigma^2=\textrm{var}(X_i)<\infty\).
- The sample average \(\bar X\) is an unbiased estimator of the true mean \(\mu\), since the equation \[E(\bar X)=\mu\] holds for any possible value of the true mean \(\mu\).
- The variance of the estimator \(\bar X\) is given by \[\textrm{var}(\bar X)=\sigma^2/n\]
- The mean squared error of the estimator \(\bar X\) is given by \[E\left((\bar X-\mu)^2\right)=\textrm{var}(\bar X)=\sigma^2/n\]
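These three facts can be checked quickly by simulation. The following Python sketch (a normal population with \(\mu=2\), \(\sigma=3\) and \(n=50\) is assumed purely for illustration; none of the results depend on normality) estimates bias, variance and MSE of \(\bar X\) by Monte Carlo:

```python
# Monte Carlo check of bias, variance and MSE of the sample mean.
# Illustrative sketch only; the population N(mu, sigma^2) is an assumption.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 2.0, 3.0, 50, 100_000

# draw `reps` independent samples of size n and compute each sample mean
xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

bias = xbar.mean() - mu              # should be close to 0 (unbiasedness)
var = xbar.var()                     # should be close to sigma^2 / n = 0.18
mse = np.mean((xbar - mu) ** 2)      # should be close to bias^2 + var
print(f"bias={bias:.4f}  var={var:.4f}  mse={mse:.4f}  sigma^2/n={sigma**2 / n:.4f}")
```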
2.2 Consistency of Estimators
Asymptotic theory is concerned with theoretical results valid for “large sample sizes”. Important keywords of asymptotic theory are:
- consistency
- rates of convergence
- asymptotic distributions
They all rely on concepts of stochastic convergence of random variables.
Stochastic convergence. Let \(\{Z_n\}_{n=1,2,3,\dots}\) be a sequence of random variables. Mathematically, there are different kinds of convergence of \(\{Z_n\}\) to a fixed value \(c\). Two important ones are:
- Convergence in probability (abbreviated \(Z_n\to_P c\)): \[\lim_{n\to\infty} P\left(|Z_n-c|>\epsilon\right)=0\quad\hbox{ for all }\quad\epsilon>0\]
- Convergence in quadratic mean (abbreviated \(Z_n\to_{q.m.} c\)): \[\lim_{n\to\infty} E\left((Z_n-c)^2\right)=0\]
It can be shown using Chebyshev's inequality that \(Z_n\to_{q.m.} c\) implies \(Z_n\to_P c\).
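Indeed, Chebyshev's (or Markov's) inequality gives, for every \(\epsilon>0\), \[P\left(|Z_n-c|>\epsilon\right)\leq \frac{E\left((Z_n-c)^2\right)}{\epsilon^2}\rightarrow 0 \quad \text{as } n\rightarrow\infty.\]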
Consistency of estimators. Based on a sample \(X_1,\dots,X_n\), let \(\hat\theta_n\equiv\hat\theta(X_1,\dots,X_n)\) be an estimator of an unknown parameter \(\theta\). We say that \(\hat\theta_n\) is a (weakly) consistent estimator of \(\theta\) if \[\hat{\theta}_n\to_{P} \theta\quad \hbox{ as }\quad n\to\infty. \]
Example: Assume again an i.i.d. sample \(X_1,\dots,X_n\) with mean \(\mu=E(X_i)\) and variance \(\sigma^2=\textrm{var}(X_i)<\infty\). As stated above we then have \[E\left((\bar X-\mu)^2\right)=\textrm{var}(\bar X)=\sigma^2/n\rightarrow 0 \quad \text{as } n\rightarrow\infty.\] Therefore, \(\bar X \to_{q.m.} \mu\). The latter implies that \(\bar X \to_{P} \mu\), i.e. \(\bar X\) is a (weakly) consistent estimator of \(\mu\).
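The following Python sketch illustrates this numerically (an Exp(1) population with \(\mu=1\) is assumed purely for concreteness): the Monte Carlo estimate of \(P(|\bar X-\mu|>\epsilon)\) shrinks towards zero as \(n\) grows.

```python
# Illustration of weak consistency of the sample mean:
# P(|Xbar - mu| > eps) -> 0 as n grows. Sketch; Exp(1) data (mu = 1) assumed.
import numpy as np

rng = np.random.default_rng(1)
mu, eps, reps = 1.0, 0.1, 2_000

for n in [10, 100, 1_000, 10_000]:
    xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
    prob = np.mean(np.abs(xbar - mu) > eps)   # Monte Carlo estimate
    print(f"n={n:6d}   P(|Xbar - mu| > {eps}) ~ {prob:.3f}")
```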
2.3 Rates of Convergence
Rates of convergence quantify the (stochastic) order of magnitude of the estimation error as a function of the sample size \(n\). This order of magnitude is usually expressed using the symbols \(O_P\) and \(o_P\).
Let \(\{Z_n\}_{n=1,2,3,\dots}\) be a sequence of random variables, and let \(\{c_n\}_{n=1,2,3,\dots}\) be a sequence of positive (deterministic) numbers.
- We will write \(Z_n=O_P(c_n)\) if for any \(\epsilon>0\) there exist numbers \(0<M<\infty\) and \(m\) such that \[P(|Z_n|\ge M\cdot c_n)\leq\epsilon\quad\hbox{ for all }\quad n\geq m.\]
- We will write \(Z_n=o_P(c_n)\) if \[\lim_{n\to\infty} P(|Z_n|\geq\epsilon\cdot c_n)=0\quad\hbox{ for all }\quad \epsilon>0.\]
- With \(c_n=1\) for all \(n\), \(Z_n=O_P(1)\) means that the sequence \(\{Z_n\}\) is stochastically bounded, i.e., for any \(\epsilon>0\) there exist numbers \(0<M<\infty\) and \(m\) such that \[P(|Z_n|\geq M)\leq\epsilon\quad\hbox{ for all }\quad n\geq m.\]
- With \(c_n=1\) for all \(n\), \(Z_n=o_P(1)\) is equivalent to \(Z_n\to_{P} 0\), i.e., \(Z_n\) converges in probability to zero.
Note that:
- \(Z_n=O_P(c_n)\) is equivalent to \(Z_n/c_n=O_P(1)\)
- \(Z_n=o_P(c_n)\) is equivalent to \(Z_n/c_n=o_P(1)\)
Definition: An estimator \(\hat\theta_n\) of a parameter \(\theta\) possesses the rate of convergence \(n^{-r}\) if and only if \(r\) is the largest positive number with the property that \[|\hat\theta_n-\theta|=O_P(n^{-r}).\]
The rate of convergence quantifies how fast the estimation error decreases as the sample size \(n\) increases.
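A small simulation can make the \(O_P\) statement concrete. In the sketch below (standard normal data, so \(\mu=0\); the choice is ours, purely for illustration) the 90%-quantile of \(\sqrt{n}\,|\bar X-\mu|\) stabilizes as \(n\) grows, while scaling by the faster rate \(n^{0.6}\) produces quantiles that keep growing:

```python
# The sample mean has rate n^{-1/2}: sqrt(n)*|Xbar - mu| is stochastically
# bounded, whereas n^0.6*|Xbar - mu| is not. Sketch with N(0,1) data (mu = 0).
import numpy as np

rng = np.random.default_rng(2)
reps = 2_000

for n in [100, 1_000, 10_000]:
    err = np.abs(rng.standard_normal((reps, n)).mean(axis=1))   # |Xbar - mu|
    q_sqrt = np.quantile(np.sqrt(n) * err, 0.9)   # stabilizes around 1.64
    q_fast = np.quantile(n ** 0.6 * err, 0.9)     # keeps increasing with n
    print(f"n={n:6d}   90%-quantile of sqrt(n)*err: {q_sqrt:.2f}   of n^0.6*err: {q_fast:.2f}")
```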
Some remarks:
- Estimators based on parametric models, such as maximum-likelihood estimators, typically have a rate of convergence of \(n^{-1/2}\) (there are exceptions!).
- The situation is different, for instance, in nonparametric curve estimation problems. For example, kernel estimators (of a density or regression function) only achieve the rate of convergence \(n^{-2/5}\).
- The rate of convergence is an important criterion for selecting the best possible estimator for a given problem. For most parametric problems, it is well known that the optimal (i.e. fastest possible) convergence rate is \(n^{-1/2}\). In nonparametric regression or density estimation the optimal convergence rate is only \(n^{-2/5}\), if the underlying function is twice continuously differentiable.
\(O_P\)-rules:
- We have \[Z_n\rightarrow_P Z \qquad \text{if and only if }\qquad Z_n=Z+o_p(1)\] This follows from \(Z_n=Z+(Z_n-Z)\) and \(Z_n-Z\rightarrow_P 0\).
- If \(Z_n=O_P(n^{-\delta})\) for some \(\delta>0\), then \(Z_n=o_P(1)\)
- If \(Z_n=O_P(r_n)\), then \(|Z_n|^\delta=O_P(r_n^\delta)\) for any \(\delta>0\). Similarly, \(Z_n=o_P(r_n)\) implies \(|Z_n|^\delta=o_P(r_n^\delta)\) for any \(\delta>0\).
- If \(Z_n=O_P(r_n)\) and \(V_n=O_P(s_n)\), then \[\begin{align*} Z_n+V_n & =O_P(\max\{r_n,s_n\})\\ Z_nV_n & =O_P(r_ns_n) \end{align*}\]
- If \(Z_n=o_P(r_n)\) and \(V_n=O_P(s_n)\), then \(Z_nV_n=o_P(r_n s_n)\)
- If \(E(|Z_n|^k)=O(r_n)\), then \(Z_n=O_P(r_n^{1/k})\) for \(k=1,2,3,\dots\)
2.4 Asymptotic Distributions
In practice, the most important type of stochastic convergence is convergence in distribution. Knowledge of the “asymptotic distribution” of an estimator allows one to construct confidence intervals and tests.
Definition: Let \(Z_n\) be a sequence of random variables with corresponding distribution functions \(G_n\). Then \(Z_n\) converges in distribution to a random variable \(Z\) with distribution function \(G\) if \[G_n(x)\to G(x)\quad\hbox{ as }\quad n\to\infty \] at all continuity points \(x\) of \(G\) (abbreviated: \(Z_n\to_L Z\) or \(Z_n\to_L G\); some authors write “\(\to_D\)” instead of “\(\to_L\)”).
In the vast majority of practically important situations the limiting distribution is the normal distribution. One then speaks of asymptotic normality. Asymptotic normality is usually a consequence of central limit theorems. The simplest result in this direction is the central limit theorem of Lindeberg-Levy.
Theorem (Lindeberg-Levy) Let \(X_1,X_2,\dots\) be a sequence of i.i.d. random variables with finite mean \(\mu\) and variance \(\sigma^2<\infty\). Then \[\sqrt{n}\left(\frac{1}{n} \sum_{i=1}^n X_i -\mu\right)\rightarrow_L N(0,\sigma^2).\]
We can conclude that \(\bar X\) is an “asymptotically normal estimator” of \(\mu\). If \(n\) is sufficiently large, then \(\bar X\) is approximately normally distributed with mean \(\mu\) and variance \(\sigma^2/n\). Frequently used notations:
- \(\bar X\sim AN(\mu,\sigma^2/n)\)
- \(\bar X\overset{a}{\sim}N(\mu,\sigma^2/n)\)
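As a Monte Carlo illustration of the theorem, the sketch below uses a clearly non-normal Exp(1) population (so \(\mu=\sigma^2=1\); this choice is an assumption made purely for illustration) and compares a tail probability of \(\sqrt{n}(\bar X-\mu)\) with its normal limit:

```python
# Lindeberg-Levy CLT illustration (sketch): for skewed Exp(1) data,
# sqrt(n)*(Xbar - mu) is approximately N(0, sigma^2) = N(0, 1) for large n.
import numpy as np

rng = np.random.default_rng(3)
reps = 50_000

for n in [5, 30, 200]:
    xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
    z = np.sqrt(n) * (xbar - 1.0)
    # exact N(0,1) value of P(Z > 1.645) is about 0.050
    print(f"n={n:3d}   P(sqrt(n)*(Xbar - mu) > 1.645) ~ {np.mean(z > 1.645):.3f}")
```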
Most estimators \(\hat\theta_n\) used in parametric and nonparametric statistics are asymptotically normal. In parametric problems (with rate of convergence \(n^{-1/2}\)) one usually obtains \[\sqrt{n}(\hat\theta_n -\theta )\to_L N(0,v^2),\] where \(v^2\) is the asymptotic variance of the estimator (often, but not necessarily, \(v^2=\lim_{n\to\infty} n\cdot\textrm{var}(\hat\theta_n)\)).
Multivariate generalization: The above concepts are easily generalized to estimators \(\hat\theta_n\) of a multivariate parameter vector \(\theta\in\mathbb{R}^p\). Consistency and rates of convergence then have to be derived separately for each element of the vector. Convergence in distribution is defined via convergence of the multivariate distribution functions. For standard estimators (e.g., maximum likelihood) in parametric problems one usually obtains \[\sqrt{n}(\hat\theta_n -\theta )\to_L N_p(0,V),\] where \(V\) is the asymptotic covariance matrix (usually, \(V=\lim_{n\to\infty} n\cdot\textrm{Cov}(\hat\theta_n)\)).
Multivariate normality holds if and only if for any vector \(c=(c_1,\dots,c_p)'\in\mathbb{R}^p\) with \(\sum_{j=1}^p c_j^2=\Vert c\Vert_2^2=1\) \[\sqrt{n}\left(\sum_{j=1}^p c_j (\hat\theta_{jn} -\theta_j)\right)=\sqrt{n}\left(c'\hat\theta_n-c'\theta\right)\to_L N\left(0,v_c^2\right),\] where \[v_c^2=c'Vc=\sum_{j=1}^p\sum_{k=1}^p c_jc_k V_{jk},\] and where \(V_{jk}\) are the elements of the asymptotic covariance matrix \(V\).
This condition is frequently called the “Cramer-Wold device”. Using one-dimensional central limit theorems, it can be verified for any vector \(c\).
Example: Let \(X_1=(X_{11},X_{12})',\dots,X_n=(X_{n1},X_{n2})'\) be i.i.d. two-dimensional random vectors with \(E(X_i)=\mu=(\mu_1,\mu_2)'\) and \(Cov(X_i)=\Sigma\). The Cramer-Wold device and Lindeberg-Levy’s central limit theorem then imply that \[\sqrt{n}\left(\bar X -\mu\right)\to_L N_2\left(0,\Sigma\right).\]
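A corresponding sketch for the bivariate case uses the deliberately non-normal construction \(X_{i1}=E_{i1}\), \(X_{i2}=E_{i1}+E_{i2}\) with independent Exp(1) variables (our own choice, purely for illustration), so that \(\mu=(1,2)'\) and \(\Sigma\) has entries \(\Sigma_{11}=1\), \(\Sigma_{12}=\Sigma_{21}=1\), \(\Sigma_{22}=2\); it checks that the covariance of \(\sqrt{n}(\bar X-\mu)\) is close to \(\Sigma\):

```python
# Bivariate CLT sketch: the empirical covariance of sqrt(n)*(Xbar - mu)
# should be close to Sigma = [[1, 1], [1, 2]]. Non-normal Exp(1) building
# blocks are assumed purely for illustration.
import numpy as np

rng = np.random.default_rng(4)
n, reps = 200, 10_000

E = rng.exponential(size=(reps, n, 2))
X = np.stack([E[..., 0], E[..., 0] + E[..., 1]], axis=-1)   # X_i = (E_1, E_1 + E_2)
mu = np.array([1.0, 2.0])

Z = np.sqrt(n) * (X.mean(axis=1) - mu)   # sqrt(n)*(Xbar - mu), one row per replication
print(np.cov(Z, rowvar=False))           # approximately [[1, 1], [1, 2]]
```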
Note that asymptotic normality usually also holds for nonparametric curve estimators with convergence rates slower than \(n^{-1/2}\).
2.5 Asymptotic Theory
Many estimation procedures in modern statistics rely on fairly general assumptions. For a given sample size \(n\) it is then often impossible to derive the exact distribution of \(\widehat{\theta}_n\). The necessary calculations are too complex, and finite sample distributions usually depend on unknown characteristics of the distribution of the underlying data.
The goal of asymptotic theory is then to derive reasonable approximations. For large samples such approximations are of course very accurate, but for small samples there may be a considerable approximation error. Therefore, for small samples the quality of asymptotic approximations is usually studied by Monte Carlo simulations, which we will discuss later.
Asymptotic theory is used in order to select an appropriate estimation procedure in complex situations. The idea is to determine the estimator which, at least for large sample sizes, provides the smallest possible estimation error. This leads to the concept of “asymptotically efficient” estimators.
Properties of an asymptotically efficient estimator \(\widehat{\theta}_n\):
- For the estimation problem under consideration, \(\widehat{\theta}_n\) is consistent and attains the fastest possible rate of convergence (generally \(n^{-1/2}\) in parametric statistics; \(n^{-2/5}\) can be achieved in nonparametric univariate curve estimation problems).
- In most regular situations one is additionally interested in a “best asymptotically normal” (BAN) estimator. Assume that \(\sqrt{n}(\widehat{\theta}_n -\theta)\to_L N(0,v^2)\). Then \(\widehat{\theta}_n\) is a BAN-estimator if every alternative estimator \(\tilde\theta_n\) with \(\sqrt{n}(\tilde\theta_n -\theta)\to_L N(0,\tilde v^2)\) has an asymptotic variance at least as large, i.e. \(\tilde v^2\geq v^2\).
- Multivariate generalization: An estimator \(\widehat{\theta}_n\) with \(\sqrt{n}(\widehat{\theta}_n -\theta)\to_L N_p(0,V)\) is best asymptotically normal if \[c'\tilde V c\geq c'Vc\quad\hbox{ for all }\quad c\in\mathbb{R}^p, \Vert c\Vert_2^2=1\] for any other estimator \(\tilde\theta_n\) satisfying \(\sqrt{n}(\tilde\theta_n -\theta)\to_L N_p(0,\tilde V)\).
For most estimation problems in parametric statistics maximum-likelihood estimators are best asymptotically normal.
2.6 Mathematical tools
2.6.1 Taylor expansions
Taylor's theorem: Let \(f\) be a real-valued function which is \((k+1)\) times continuously differentiable in the interior of an interval \([a,b]\). Consider a point \(x_0\in (a,b)\). For any other value \(x\in (a,b)\) there exists some \(\psi\) between \(x_0\) and \(x\) such that \[f(x)=f(x_0)+\sum_{r=1}^k \frac{1}{r!}f^{(r)}(x_0)\cdot(x-x_0)^r+\frac{1}{(k+1)!}f^{(k+1)}(\psi)\cdot(x-x_0)^{k+1}.\]
Multivariate generalization: \(x_0,x\in\mathbb{R}^p\), \(f'(x_0)\in\mathbb{R}^p\) the gradient, and \(f''(x_0)\) the \(p\times p\) Hessian matrix.
First order Taylor approximation: \[f(x)=f(x_0)+f'(x_0)\cdot(x-x_0)+O(\Vert x-x_0\Vert_2^2)\]
Second order Taylor approximation: \[f(x)=f(x_0)+f'(x_0)\cdot(x-x_0)+\frac{1}{2} (x-x_0)^T f''(x_0)(x-x_0)+O(\Vert x-x_0\Vert_2^3)\]
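The order of the remainder can be verified numerically. In the one-dimensional sketch below (the function \(f(x)=e^x\) and the expansion point \(x_0=0\) are our own choices for illustration), the error of the second order approximation behaves like \(|x-x_0|^3\), so the ratio error\(/h^3\) stays roughly constant:

```python
# Numerical check of the second order Taylor approximation of f(x) = exp(x)
# around x0 = 0; the approximation error shrinks like h^3 (here about h^3 / 6).
import numpy as np

f = np.exp
x0 = 0.0
for h in [0.5, 0.25, 0.125]:
    x = x0 + h
    taylor2 = f(x0) + f(x0) * (x - x0) + 0.5 * f(x0) * (x - x0) ** 2   # f' = f'' = exp
    err = abs(f(x) - taylor2)
    print(f"h={h:.3f}   error={err:.2e}   error/h^3={err / h**3:.3f}")
```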
2.6.2 Tools for deriving asymptotic distributions
Let \(\{W_n\}\), \(\{Z_n\}\) be sequences of random variables, then:
- \(Z_n=W_n+o_P(1)\quad \Leftrightarrow \quad Z_n-W_n\to_P 0\). If additionally \(W_n\to_L N(0,v^2)\) then \(Z_n\to_L N(0,v^2)\).
- For any fixed constant \(c\neq 0\): If \(Z_n\to_P c\) and \(W_n\to_L N(0,v^2)\), then \[cW_n\to_L N(0,c^2v^2)\quad\hbox{as well as }\quad V_n:=Z_n\cdot W_n\to_L N(0,c^2v^2).\] Furthermore, if \(Z_n\) and \(c\) are positive (with probability 1), then also \[W_n/c\to_L N(0,v^2/c^2)\quad\hbox{as well as }\quad V_n:= W_n/Z_n\to_L N(0,v^2/c^2).\] (A simulation illustrating this scalar case is sketched after this list.)
- Multivariate generalization (\(C\), \(Z_n\) \(p\times p\) matrices; \(W_n\) \(p\)-dimensional random vectors): If \(Z_n\to_P C\) as well as \(W_n\to_L N_p(0,V)\), then \[\begin{align*} CW_n&\to_L N_p(0,CVC')\quad\hbox{as well as }\\ V_n:=Z_n\cdot W_n&\to_L N_p(0,CVC') \end{align*}\]
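A typical application of these rules is studentization: replacing the unknown \(\sigma\) by a consistent estimate does not change the limit distribution. The Python sketch below (Exp(1) data, so \(\mu=\sigma^2=1\), chosen purely for illustration) checks that \(\sqrt{n}(\bar X-\mu)/s_n\) is approximately \(N(0,1)\), where \(s_n\) is the sample standard deviation:

```python
# Slutsky-type rule in action (sketch): W_n = sqrt(n)*(Xbar - mu) -> N(0, sigma^2)
# and Z_n = s_n -> sigma in probability, hence W_n / Z_n -> N(0, 1).
import numpy as np

rng = np.random.default_rng(5)
n, reps = 300, 20_000

X = rng.exponential(size=(reps, n))        # Exp(1): mu = 1, sigma^2 = 1
W = np.sqrt(n) * (X.mean(axis=1) - 1.0)    # -> N(0, 1)
Z = X.std(axis=1, ddof=1)                  # sample standard deviation -> 1
T = W / Z
print(f"var(T) ~ {T.var():.3f}   P(|T| > 1.96) ~ {np.mean(np.abs(T) > 1.96):.3f}   (normal: 0.050)")
```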
2.6.3 The Delta-Method
A further tool which is frequently used in asymptotic statistics is the so-called delta-method.
Delta-Method: Let \(\widehat{\theta}_n\) be a sequence of estimators of a one-dimensional parameter \(\theta\) satisfying \(n^{r} (\widehat{\theta}_n-\theta)\rightarrow_L N(0,v^2),\) and let \(g(.)\) be a real-valued function which is continuously differentiable at \(\theta\) and satisfies \(g'(\theta)\neq 0\). Then \[n^{r} \left(g(\widehat{\theta}_n)-g(\theta)\right) \rightarrow_L N\left(0,g'(\theta)^2v^2\right).\]
Example: Assume an i.i.d. random sample \(X_1,\dots,X_n\) from an exponential distribution. That is, the underlying density of \(X_i\), \(i=1,\dots,n\), is given by \(f(x|\theta)=\theta\exp(-\theta x)\). We then have \(\mu:=E(X_i)=1/\theta\) as well as \(\sigma^2_X:=\textrm{var}(X_i)=1/\theta^2\). The underlying parameter \(\theta>0\) is unknown and has to be estimated from the data.
The maximum-likelihood estimator of \(\theta\) is \(\hat\theta=1/\bar X\).
We know that \(\sqrt{n}(\bar X-\frac{1}{\theta})\to_L N(0,\frac{1}{\theta^2})\), but what about the distribution of \(1/\bar X\)? For this purpose the delta-method can be applied with \(g(x)=1/x\). Then \(g'(x)=-1/x^2\), \(g'(1/\theta)=-\theta^2\), and consequently \[n^{1/2} \left(\frac{1}{\bar X}-\theta\right)=n^{1/2}\left(g\left(\bar X\right)-g\left(\frac{1}{\theta}\right)\right)\rightarrow_L N\left(0,\theta^2\right).\]
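This limiting result can again be checked by simulation. The sketch below (with the true value \(\theta=2\) chosen by us purely for illustration) estimates the variance of \(\sqrt{n}(1/\bar X-\theta)\), which should be close to \(\theta^2=4\):

```python
# Monte Carlo check of the delta-method result for the ML estimator 1/Xbar
# under Exp(theta) data (sketch; theta = 2 is an arbitrary illustrative value).
import numpy as np

rng = np.random.default_rng(6)
theta, n, reps = 2.0, 500, 20_000

X = rng.exponential(scale=1.0 / theta, size=(reps, n))   # density theta*exp(-theta*x)
theta_hat = 1.0 / X.mean(axis=1)                         # ML estimator 1/Xbar
Z = np.sqrt(n) * (theta_hat - theta)
print(f"mean(Z) ~ {Z.mean():.3f}   var(Z) ~ {Z.var():.3f}   (delta method: theta^2 = {theta**2})")
```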