# Chapter 3 Point estimation

**Definition 3.1 (Estimation)**

*Estimation* is the process of inferring, or attempting to guess, the value of one or several population parameters from a sample. Therefore, an *estimator* \(\hat\theta\) of a parameter \(\theta\in\Theta\) is a statistic with range in the *parameter space* \(\Theta\).

**Example 3.1 **The following table shows the usual estimators for well-known parameters.

Parameter | Estimator |
---|---|
Proportion \(p\) | Sample proportion \(\bar{X}\) |
Mean \(\mu\) | Sample mean \(\bar{X}\) |
Variance \(\sigma^2\) | Sample variance \(S^2\) and sample quasivariance \(S'^2\) |

## 3.1 Unbiased estimators

**Definition 3.2 (Unbiased estimator) **Given an estimator \(\hat\theta\) of a parameter \(\theta\), the quantity \(\mathrm{B}[\hat\theta]=\mathbb{E}[\hat\theta]-\theta\) is the *bias* of the estimator \(\hat\theta\). The estimator \(\hat\theta\) is *unbiased* if its bias is zero, *i.e.*, if \(\mathbb{E}[\hat\theta]=\theta\).

**Example 3.2** We saw in (2.3) and (2.4) that the sample variance \(S^2\) is not an unbiased estimator of \(\sigma^2\), whereas the sample quasivariance \(S'^2\) is unbiased. Theorem 2.2 provides an alternative way, based on assuming normality, of seeing that \(S^2\) is not unbiased. On one hand, \[ \mathbb{E}\left[\frac{nS^2}{\sigma^2}\right]=\mathbb{E}\left[\frac{(n-1)S'^2}{\sigma^2}\right]=\mathbb{E}\left[\chi_{n-1}^2\right]=n-1 \] and, on the other, \[ \mathbb{E}\left[\frac{nS^2}{\sigma^2}\right]=\frac{n}{\sigma^2}\mathbb{E}[S^2]. \] Therefore, equating both results and solving for \(\mathbb{E}[S^2]\), we have \[ \mathbb{E}[S^2]=\frac{n-1}{n}\,\sigma^2. \] We can also see that \(S'^2\) is indeed unbiased. First, we have that \[ \mathbb{E}\left[\frac{(n-1)S'^2}{\sigma^2}\right]=\frac{n-1}{\sigma^2}\mathbb{E}[S'^2]. \] Then, equating this expectation with the mean of a \(\chi^2_{n-1}\) r.v., which is \(n-1\), and solving for \(\mathbb{E}[S'^2]\), it follows that \(\mathbb{E}[S'^2]=\sigma^2\).
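The two expectations in Example 3.2 can be checked by simulation. The following is a minimal sketch, not part of the original notes: the function names, the seed, and the choices \(n=5\), \(\sigma=2\), and 20,000 replicates are illustrative assumptions.

```python
import random

def variance_estimators(x):
    """Return (S^2, S'^2): the sum of squares divided by n and by n - 1."""
    n = len(x)
    mean = sum(x) / n
    ss = sum((xi - mean) ** 2 for xi in x)
    return ss / n, ss / (n - 1)

def average_estimates(n, sigma, reps, seed=0):
    """Monte Carlo averages of S^2 and S'^2 over `reps` normal samples."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for reproducibility
    total_s2 = total_qs2 = 0.0
    for _ in range(reps):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        s2, qs2 = variance_estimators(sample)
        total_s2 += s2
        total_qs2 += qs2
    return total_s2 / reps, total_qs2 / reps

n, sigma = 5, 2.0
avg_s2, avg_qs2 = average_estimates(n, sigma, reps=20_000)
print(f"average S^2 : {avg_s2:.3f} (theory {(n - 1) / n * sigma**2:.3f})")
print(f"average S'^2: {avg_qs2:.3f} (theory {sigma**2:.3f})")
```

The averages only approximate the expectations, so small Monte Carlo deviations from \((n-1)\sigma^2/n\) and \(\sigma^2\) are to be expected.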

**Example 3.3 **Let \(X\) be a uniform r.v. in the interval \((0,\theta)\), that is, \(X\sim \mathcal{U}(0,\theta)\) with p.d.f. \(f_X(x)=1/\theta\), \(0<x<\theta\). Let \((X_1,\ldots,X_n)\) be a s.r.s. of \(X\). Let us obtain an unbiased estimator of \(\theta\).

Since \(\theta\) is the upper bound for the sample realization, the value from the sample that is closest to \(\theta\) is \(X_{(n)}\), the maximum of the sample. Hence, we take \(\hat\theta=X_{(n)}\) as an estimator of \(\theta\) and check whether it is unbiased. In order to compute its expectation, we need to obtain its p.d.f. We can derive it from Exercise 2.1: the c.d.f. of \(X_{(n)}\) for a s.r.s. of a r.v. with c.d.f. \(F_{X}\) is \([F_{X}]^n\).

The c.d.f. of \(X\) for \(0< x < \theta\) is

\[ F_{X}(x)=\int_0^x f_X(t)\,\mathrm{d}t=\int_0^x \frac{1}{\theta}\,\mathrm{d}t=\frac{x}{\theta}. \]

Then, the full c.d.f. is

\[ F_X(x)=\left\{\begin{array}{ll} 0 & x<0,\\ x/\theta & 0\leq x<\theta,\\ 1 & x\geq \theta. \end{array}\right. \]

Consequently, the c.d.f. of the maximum is

\[ F_{X_{(n)}}(x)=\left\{\begin{array}{ll} 0 & x<0,\\ (x/\theta)^n & 0\leq x<\theta,\\ 1 & x\geq \theta. \end{array}\right. \]

The density of \(X_{(n)}\) follows by differentiation:

\[ f_{X_{(n)}}(x)=\frac{n}{\theta}\left(\frac{x}{\theta}\right)^{n-1}, \ x\in (0,\theta). \]

Finally, the expectation of \(\hat\theta=X_{(n)}\) is

\[\begin{align*} \mathbb{E}[\hat\theta]&=\int_0^{\theta} x \frac{n}{\theta}\left(\frac{x}{\theta}\right)^{n-1} \,\mathrm{d}x=\frac{n}{\theta^n}\int_0^{\theta} x^n\,\mathrm{d}x\\ &=\frac{n}{\theta^n}\frac{\theta^{n+1}}{n+1}=\frac{n}{n+1}\theta\neq\theta. \end{align*}\]

Therefore, \(\hat\theta\) is *not* unbiased. However, it can be readily patched as
\[
\hat\theta'=\frac{n+1}{n}X_{(n)},
\]
which *is* an unbiased estimator of \(\theta\):
\[
\mathbb{E}[\hat\theta']=\frac{n+1}{n}\frac{n}{n+1}\theta=\theta.
\]
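As a sanity check of Example 3.3, the expectation \(\frac{n}{n+1}\theta\) of the sample maximum and the effect of the correction factor can be approximated by simulation. This is a sketch under illustrative assumptions (function name, seed, \(n=4\), \(\theta=10\), 50,000 replicates), none of which come from the original text.

```python
import random

def average_maximum(n, theta, reps, seed=0):
    """Monte Carlo average of X_(n) over `reps` samples from U(0, theta)."""
    rng = random.Random(seed)  # arbitrary fixed seed
    total = 0.0
    for _ in range(reps):
        total += max(rng.uniform(0.0, theta) for _ in range(n))
    return total / reps

n, theta = 4, 10.0
avg_max = average_maximum(n, theta, reps=50_000)
avg_corrected = (n + 1) / n * avg_max  # average of the corrected estimator

print(f"average X_(n)          : {avg_max:.3f} (theory {n / (n + 1) * theta:.3f})")
print(f"average (n+1)/n * X_(n): {avg_corrected:.3f} (theory {theta:.3f})")
```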

**Example 3.4 **Let \(X\) be a r.v. with p.d.f.
\[
f_{X}(x)=\theta e^{-\theta x}, \ x>0, \ \theta>0.
\]
Let \((X_1,\ldots,X_n)\) be a s.r.s. of such r.v. We are going to find an unbiased estimator for \(\theta\).

Since \(X\sim \mathrm{Exp}(1/\theta)\), we know that \(\mathbb{E}[X]=1/\theta\) and hence \(\theta=1/\mathbb{E}[X]\). As \(\bar{X}\) is an unbiased estimator of \(\mathbb{E}[X]\), it is reasonable to consider \(\hat{\theta}=1/\bar{X}\) as an estimator of \(\theta\). Checking whether it is unbiased requires its p.d.f. Observe that \(X\sim \Gamma(1,1/\theta)\). Then, by the additive property of the gamma (see Exercise 1.17):

\[ T=\sum_{i=1}^n X_i\sim \Gamma\left(n,1/\theta\right), \]

with p.d.f.

\[ f_T(t)=\frac{1}{(n-1)!} \theta^n t^{n-1} e^{-\theta t}, \ t>0. \]

Then, the expectation of the estimator \(\hat\theta=n/T=1/\bar{X}\) is given by

\[\begin{align*} \mathbb{E}[\hat\theta]&=\int_0^{\infty} \frac{n}{t}\frac{1}{(n-1)!} \theta^n t^{n-1} e^{-\theta t}\,\mathrm{d}t\\ &=\frac{n \theta}{n-1}\int_0^{\infty}\frac{1}{(n-2)!} \theta^{n-1} t^{(n-1)-1} e^{-\theta t}\,\mathrm{d}t\\ &=\frac{n}{n-1} \theta \end{align*}\]

and as a consequence \(\hat\theta\) is *not* unbiased for \(\theta\). However, the corrected estimator

\[ \hat\theta'=\frac{n-1}{n}\frac{1}{\bar{X}} \]

*is* unbiased.
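The bias factor \(n/(n-1)\) of \(1/\bar{X}\) can also be observed numerically. The sketch below assumes the exponential density \(\theta e^{-\theta x}\) (rate \(\theta\)); the function name, the seed, and the values \(n=5\), \(\theta=2\) are illustrative choices.

```python
import random

def average_inverse_mean(n, theta, reps, seed=0):
    """Monte Carlo average of 1 / Xbar for samples with density theta * exp(-theta * x)."""
    rng = random.Random(seed)  # arbitrary fixed seed
    total = 0.0
    for _ in range(reps):
        # random.expovariate takes the rate, so each draw has mean 1 / theta
        sample_sum = sum(rng.expovariate(theta) for _ in range(n))
        total += n / sample_sum
    return total / reps

n, theta = 5, 2.0
avg_raw = average_inverse_mean(n, theta, reps=50_000)
avg_corrected = (n - 1) / n * avg_raw  # average of the corrected estimator

print(f"average 1/Xbar        : {avg_raw:.3f} (theory {n / (n - 1) * theta:.3f})")
print(f"average (n-1)/(n Xbar): {avg_corrected:.3f} (theory {theta:.3f})")
```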

In the previous example we have seen that, even if \(\bar{X}\) is unbiased for \(\mathbb{E}[X]\), \(1/\bar{X}\) is not unbiased for \(1/\mathbb{E}[X]\). This illustrates that, even if \(\hat\theta\) is an unbiased estimator of \(\theta\), then in general \(g(\hat\theta)\) is *not* unbiased for \(g(\theta)\).

The quantity \(\hat\theta-\theta\) is the *estimation error*, and depends on the particular value of \(\hat\theta\) for the observed (or realized) sample. Observe that the bias is the expected (or mean) estimation error across all the possible realizations of the sample, which does *not* depend on the actual realization of \(\hat\theta\) for a particular sample:

\[ \mathrm{B}[\hat\theta]=\mathbb{E}[\hat\theta]-\theta=\mathbb{E}[\hat\theta-\theta]. \]

If the estimation error is measured in absolute value, \(|\hat\theta-\theta|\), the quantity \(\mathbb{E}[|\hat\theta-\theta|]\) is referred to as the *mean absolute error*. If the square is taken, \((\hat\theta-\theta)^2\), then we obtain the so-called *Mean Squared Error* (MSE)

\[
\mathrm{MSE}[\hat\theta]=\mathbb{E}\big[(\hat\theta-\theta)^2\big].
\]
The MSE is mathematically more tractable than the mean absolute error, hence is usually preferred. Since the MSE gives an average of the squared estimation errors, it introduces a *performance measure* for comparing two estimators \(\hat\theta_1\) and \(\hat\theta_2\) of a parameter \(\theta\). The estimator with the lowest MSE is the optimal (according to the performance measure based on the MSE) for estimating \(\theta\).

A key identity for the MSE is the following *bias-variance decomposition*:

\[\begin{align*} \mathrm{MSE}[\hat\theta]&=\mathbb{E}\big[(\hat\theta-\theta)^2\big]\\ &=\mathbb{E}\big[(\hat\theta-\mathbb{E}[\hat\theta]+\mathbb{E}[\hat\theta]-\theta)^2\big]\\ &=\mathbb{E}\big[(\hat\theta-\mathbb{E}[\hat\theta])^2\big]+\mathbb{E}\big[(\mathbb{E}[\hat\theta]-\theta)^2\big]+2\mathbb{E}\big[(\hat\theta-\mathbb{E}[\hat\theta])(\mathbb{E}[\hat\theta]-\theta)\big]\\ &=\mathbb{V}\mathrm{ar}[\hat\theta]+(\mathbb{E}[\hat\theta]-\theta)^2+2\mathbb{E}[\hat\theta-\mathbb{E}[\hat\theta]](\mathbb{E}[\hat\theta]-\theta)\\ &=\mathbb{V}\mathrm{ar}[\hat\theta]+\mathrm{B}^2[\hat\theta]+2(\mathbb{E}[\hat\theta]-\mathbb{E}[\hat\theta])(\mathbb{E}[\hat\theta]-\theta)\\ &=\mathrm{B}^2[\hat\theta]+\mathbb{V}\mathrm{ar}[\hat\theta]. \end{align*}\]

This identity tells us that if we want to minimize the MSE, it does not suffice to find an unbiased estimator: *the variance contributes to the MSE the same as the squared bias*. Therefore, if we search for the optimal estimator in terms of MSE, both bias and variance should be minimized.
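The bias-variance decomposition also holds exactly for the empirical moments of any set of simulated estimates, provided the variance is computed with divisor equal to the number of replicates. The sketch below illustrates this with \(\hat\theta=X_{(n)}\) for a \(\mathcal{U}(0,\theta)\) population; the seed and the values of \(\theta\), \(n\), and the number of replicates are illustrative assumptions.

```python
import random

rng = random.Random(1)  # arbitrary fixed seed
theta, n, reps = 1.0, 5, 10_000

# Replicates of the estimator X_(n) for samples from U(0, theta)
estimates = [max(rng.uniform(0.0, theta) for _ in range(n)) for _ in range(reps)]

mean_est = sum(estimates) / reps
bias = mean_est - theta                                     # empirical bias
var = sum((e - mean_est) ** 2 for e in estimates) / reps    # empirical variance
mse = sum((e - theta) ** 2 for e in estimates) / reps       # empirical MSE

print(f"MSE          : {mse:.6f}")
print(f"bias^2 + var : {bias**2 + var:.6f}")
```

Both printed values agree up to floating-point rounding, since the cross term in the decomposition vanishes for empirical moments as well.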

**Example 3.5 **Let us compute the MSE of the sample variance \(S^2\) and the sample quasivariance \(S'^2\) when estimating the population variance \(\sigma^2\) of a normal r.v. (this assumption is fundamental for obtaining the expression for the variance of \(S^2\) and \(S'^2\)).

In Exercise 2.15 we saw that, for a normal population,

\[\begin{align*} \mathbb{E}[S^2]&=\frac{n-1}{n}\sigma^2, & \mathbb{V}\mathrm{ar}[S^2]&=\frac{2(n-1)}{n^2}\sigma^4,\\ \mathbb{E}[S'^2]&=\sigma^2, & \mathbb{V}\mathrm{ar}[S'^2]&=\frac{2}{n-1}\sigma^4. \end{align*}\]

Therefore, the bias of \(S^2\) is

\[ \mathrm{B}[S^2]=\frac{n-1}{n}\sigma^2-\sigma^2=-\frac{1}{n}\sigma^2<0 \]

and the MSE of \(S^2\) for estimating \(\sigma^2\) is

\[ \mathrm{MSE}[S^2]=\mathrm{B}^2[S^2]+\mathbb{V}\mathrm{ar}[S^2]=\frac{1}{n^2}\sigma^4+\frac{2(n-1)}{n^2}\sigma^4= \frac{2n-1}{n^2}\sigma^4. \]

Replicating the calculations for the sample quasivariance, we have that

\[ \mathrm{MSE}[S'^2]=\mathbb{V}\mathrm{ar}[S'^2]=\frac{2}{n-1}\sigma^4. \]

Since \(n>1\), we have

\[ \frac{2}{n-1}>\frac{2n-1}{n^2} \]

and, as a consequence,

\[ \mathrm{MSE}[S'^2]>\mathrm{MSE}[S^2]. \]

The bottom line is clear: although \(S'^2\) is unbiased and \(S^2\) is not, for normal populations \(S^2\) has lower MSE than \(S'^2\) when estimating \(\sigma^2\). Therefore, \(S^2\) is better than \(S'^2\) in terms of MSE for estimating \(\sigma^2\) in normal populations.
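The inequality \(\frac{2}{n-1}>\frac{2n-1}{n^2}\) used in Example 3.5 is equivalent to \(2n^2>(2n-1)(n-1)=2n^2-3n+1\), that is, to \(3n>1\), which holds for every \(n>1\). The following sketch tabulates both MSEs in units of \(\sigma^4\) with exact rational arithmetic; the function names are illustrative.

```python
from fractions import Fraction

def mse_s2(n):
    """MSE[S^2] / sigma^4 = (2n - 1) / n^2 for a normal population."""
    return Fraction(2 * n - 1, n * n)

def mse_quasi(n):
    """MSE[S'^2] / sigma^4 = 2 / (n - 1) for a normal population."""
    return Fraction(2, n - 1)

for n in (2, 5, 10, 100):
    ratio = mse_s2(n) / mse_quasi(n)
    print(f"n={n:>3}: MSE[S^2]={float(mse_s2(n)):.4f}  "
          f"MSE[S'^2]={float(mse_quasi(n)):.4f}  ratio={float(ratio):.4f}")
```

The ratio approaches one as \(n\) grows, so the advantage of \(S^2\) in terms of MSE is most noticeable for small samples.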

The use of unbiased estimators is convenient when the sample size \(n\) is large, since in those cases the variance tends to be small. However, when \(n\) is small, the bias is usually very small compared with the variance, so a smaller MSE can be obtained by focusing on decreasing the variance. On the other hand, it is possible that, for a parameter and a given sample, there is no unbiased estimator, as the following example shows.

**Example 3.6 **The next game is presented to us. We have to pay \(6\) euros in order to participate and the payoff is \(12\) euros if we obtain two heads in two tosses of a coin with heads probability \(p\). We receive \(0\) euros otherwise. We are allowed to perform a test toss for estimating the value
of the success probability \(\theta=p^2\).

In the coin toss we observe the value of the r.v.

\[ X_1=\left\{\begin{array}{ll} 1 & \text{if heads},\\ 0 & \text{if tails}. \end{array}\right. \]

We know that \(X_1\sim \mathrm{Ber}(p)\). Let

\[ \hat\theta=\left\{\begin{array}{ll} \hat\theta(1) & \text{if $X_1=1$},\\ \hat\theta(0) & \text{if $X_1=0$}. \end{array}\right. \]

be an estimator of \(\theta=p^2\). Its expectation is

\[ \mathbb{E}[\hat\theta]=\hat\theta(1)p+\hat\theta(0)(1-p), \]

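which is a polynomial of degree at most one in \(p\) and hence cannot equal \(p^2\) for every \(p\in(0,1)\), whatever the values \(\hat\theta(0)\) and \(\hat\theta(1)\) are. Therefore, for a sample of size \(n=1\), \(X_1\), there does not exist any unbiased estimator of \(p^2\).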

## 3.2 Invariant estimators

**Example 3.7 **The manufacturer of a given product claims that the product packages contain at least \(\theta\) grams of product. If this claim is true, then the product content within a package is distributed as a \(\mathcal{U}(\theta, \theta+100)\). To check the claim of the manufacturer, a s.r.s. measuring the content was taken. The realization of this s.r.s. is \((x_1,\ldots,x_n)\) and is used to compute an estimate \(\hat\theta(x_1,\ldots,x_n)\). But, after computing the estimate, it is discovered that the balance was systematically weighing \(c\) grams less. Can we simply correct our estimate as \(\hat\theta(x_1,\ldots,x_n)+c\)?

**Definition 3.3 (Translation-invariant estimator) **Let \((X_1,\ldots,X_n)\) be a sample, whose distribution depends on an unknown parameter \(\theta\),
and let \(\hat\theta(X_1,\ldots,X_n)\) be an estimator of \(\theta\). This estimator is
*invariant to translations* or *translation-invariant* if, for any \(c\in \mathbb{R}\),
\[
\hat\theta(X_1+c,\ldots,X_n+c)=\hat\theta(X_1,\ldots,X_n)+c.
\]

**Example 3.8 **Check that \(X_{(1)}\), \(\bar{X}\), and \((X_{(1)}+X_{(n)})/2\) are invariant to translations, but that the geometric mean \((\prod_{i=1}^n X_i)^{1/n}\) and the harmonic mean \(n/\sum_{i=1}^n X_i^{-1}\) are not.

For \(X_{(1)}=\min_{1\leq i\leq n} X_i\), we have

\[ \min_{1\leq i\leq n} (X_i+c)=\min_{1\leq i\leq n} (X_i)+c. \]

Therefore, \(X_{(1)}\) is translation-invariant. For \(\bar{X}=(1/n)\sum_{i=1}^n X_i\),

\[ \frac{1}{n}\sum_{i=1}^n (X_i+c)=\frac{1}{n}\left[\sum_{i=1}^n X_i+nc\right]=\frac{1}{n}\sum_{i=1}^n X_i+c=\bar{X}+c. \]

So \(\bar{X}\) is translation-invariant too. Finally, we check \((X_{(1)}+X_{(n)})/2\):

\[\begin{align*} \frac{1}{2}\left[\min_{1\leq i\leq n}(X_i+c)+\max_{1\leq i\leq n}(X_i+c)\right] &=\frac{1}{2}\left[\min_{1\leq i\leq n}X_i+\max_{1\leq i\leq n}X_i+2c\right]\\ &=\frac{1}{2}\left(X_{(1)}+X_{(n)}\right)+c. \end{align*}\]

To see that neither the geometric nor the harmonic mean is invariant to translations, we only need to find a counterexample. Consider the sample of size \(n=3\) given by \(x_1=1\), \(x_2=2\), and \(x_3=3\). For these data, the geometric and harmonic means are, respectively, \[\begin{align*} \left[\prod_{i=1}^n x_i\right]^{1/n}&=(1\cdot 2\cdot 3)^{1/3}=6^{1/3}\approx 1.82,\\ \frac{n}{\sum_{i=1}^n x_i^{-1}}&=\frac{3}{1+\frac{1}{2}+\frac{1}{3}}=\frac{18}{11}\approx 1.64. \end{align*}\] However, if we take \(c=1\): \[\begin{align*} \left[\prod_{i=1}^n (x_i+c)\right]^{1/n}&=(2\cdot 3\cdot 4)^{1/3}\approx 2.88\neq 1.82+1,\\ \frac{n}{\sum_{i=1}^n (x_i+c)^{-1}}&=\frac{3}{\frac{1}{2}+\frac{1}{3}+\frac{1}{4}}=\frac{36}{13}\approx 2.77\neq 1.64+1, \end{align*}\] and we see that neither of them is translation-invariant.

**Example 3.9 **A woman always arrives at the bus stop at the same time. She wishes to estimate the maximum waiting time for the bus, knowing that the waiting time is distributed as a \(\mathcal{U}(0,\theta)\). For that purpose, she times how long she waits during \(n\) days and obtains a realization of a s.r.s., \((x_1,\ldots,x_n)\), measured in seconds. Based on that sample, she obtains an estimate of the maximum waiting time \(\hat\theta(x_1,\ldots,x_n)\) in seconds. If she wants to convert the result to minutes, can she just compute \(\hat\theta(x_1,\ldots,x_n)/60\)?

The answer depends on whether the estimator satisfies \[ \hat\theta(x_1/60,\ldots,x_n/60)=\hat\theta(x_1,\ldots,x_n)/60. \]

If this is not the case, then she will need to compute \(\hat\theta(x_1/60,\ldots,x_n/60)\).

**Definition 3.4 (Scale-invariant estimators)** Let \((X_1,\ldots,X_n)\) be a sample whose distribution depends on an unknown parameter \(\theta\) and let \(\hat\theta(X_1,\ldots,X_n)\) be an estimator of \(\theta\). This estimator is *invariant to scale* or *scale-invariant* if, for any \(c>0\), \[ \hat\theta(cX_1,\ldots,cX_n)=c\,\hat\theta(X_1,\ldots,X_n). \]

**Example 3.10** Check that \(\bar{X}\) and \(X_{(n)}\) are scale-invariant and that \(\log((1/n)\sum_{i=1}^n\exp({X_i}))\) and \(X_{(n)}/X_{(1)}\) are not.
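A numerical check cannot prove that an estimator is scale-invariant, but a single failing sample is enough to show that it is not. The sketch below evaluates the four statistics of Example 3.10 on the illustrative sample \((1,2,3)\) with \(c=2\); the helper names and the tolerance are assumptions made for this illustration.

```python
import math

def sample_mean(x):
    return sum(x) / len(x)

def log_mean_exp(x):
    return math.log(sum(math.exp(v) for v in x) / len(x))

def max_over_min(x):
    return max(x) / min(x)

def is_scale_invariant(estimator, x, c, tol=1e-9):
    """Check estimator(c * x) == c * estimator(x) on one sample, up to `tol`."""
    scaled = [c * v for v in x]
    return abs(estimator(scaled) - c * estimator(x)) <= tol

x, c = [1.0, 2.0, 3.0], 2.0
for name, est in [("sample mean", sample_mean), ("maximum", max),
                  ("log-mean-exp", log_mean_exp), ("max / min", max_over_min)]:
    print(f"{name:>12}: {is_scale_invariant(est, x, c)}")
```

A `True` for the sample mean and the maximum is only consistent with scale-invariance; the general argument still requires the algebraic verification requested in the example. The two `False` results are genuine counterexamples.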

## 3.3 Consistent estimators

The idea of consistency is related to the sample size \(n\). Assume that \(X\) is a r.v. with induced probability \(\mathbb{P}(\, \cdot\, ;\theta)\), \(\theta\in\Theta\), and that \((X_1,\ldots,X_n)\) is a s.r.s. of \(X\). Take an estimator \(\hat\theta=\hat\theta(X_1,\ldots,X_n)\), where we have emphasized the dependence of the estimator on the s.r.s. Of course, \(\hat\theta\) depends on \(n\) and, to emphasize that, we will also denote the estimator by \(\hat\theta_n\). The question now is: what happens to the distribution of this estimator as the sample size increases? Is its distribution going to be more concentrated around the true value \(\theta\)?

We will study different concepts of consistency, all of them taking into account the distribution of \(\hat\theta\). The first concept tells us that an estimator is *consistent in probability* if the probability of \(\hat\theta\) being far away from \(\theta\) decays to zero as \(n\to\infty\).

**Example 3.11 **Let \(X\sim \mathcal{N}(\mu,\sigma^2)\). Let us see how the distribution of \(\bar{X}-\mu\) changes as \(n\) increases, for \(\sigma=2\).

We know from Theorem 2.1 that \[ Z=\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\sim\mathcal{N}(0,1). \]

From that result, we have computed the next probabilities for \(\varepsilon=1\):

\(n\) | \(\bar{X}_n\) | \(\mathbb{P}\left(\vert\bar{X}_n-\mu\vert>\varepsilon\right)=2\mathbb{P}\left(Z>\frac{\varepsilon}{\sigma/\sqrt{n}}\right)\) |
---|---|---|
\(1\) | \(X_1\) | \(0.617\) |
\(2\) | \((X_1+X_2)/2\) | \(0.484\) |
\(3\) | \((X_1+X_2+X_3)/3\) | \(0.3844\) |
\(10\) | \(\sum_{i=1}^{10} X_i/10\) | \(0.1142\) |
\(20\) | \(\sum_{i=1}^{20} X_i/20\) | \(0.0258\) |

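Observe that the probabilities in the last column decay as \(n\) increases. If another threshold \(\varepsilon_2\) is used instead of \(\varepsilon_1=1\), the resulting probability is *larger* than the one for \(\varepsilon_1\), for each \(n\), if \(\varepsilon_2\leq \varepsilon_1\) (respectively, *smaller* if \(\varepsilon_2\geq \varepsilon_1\)). These observations can be visualized in Figure 3.1.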
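The probabilities of Example 3.11 can be evaluated without normal tables through the identity \(2\,\mathbb{P}(Z>z)=\mathrm{erfc}(z/\sqrt{2})\). The following sketch uses only the Python standard library; the function name and its default arguments (\(\varepsilon=1\), \(\sigma=2\)) are illustrative. Differences in the third decimal with respect to tabulated values may appear if the latter were obtained from a normal table with rounded quantiles.

```python
import math

def tail_prob(n, eps=1.0, sigma=2.0):
    """P(|Xbar_n - mu| > eps) = 2 P(Z > eps * sqrt(n) / sigma) for a normal population."""
    z = eps * math.sqrt(n) / sigma
    return math.erfc(z / math.sqrt(2.0))  # erfc(z / sqrt(2)) = 2 * (1 - Phi(z))

for n in (1, 2, 3, 10, 20):
    print(f"n={n:>2}: eps=1 -> {tail_prob(n):.4f}   eps=0.5 -> {tail_prob(n, eps=0.5):.4f}")
```

For each \(n\), the probability for the smaller threshold \(\varepsilon=0.5\) is larger than the one for \(\varepsilon=1\), and both decrease as \(n\) grows.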

Then, an estimator \(\hat\theta\) for \(\theta\) is consistent in probability when the probabilities of \(\hat\theta\) and \(\theta\) differing more than \(\varepsilon>0\) decrease as \(n\) increases. Or, equivalently, when the probabilities of \(\hat\theta\) and \(\theta\) being closer than \(\varepsilon\) grow as \(n\) increases. This must happen for any \(\varepsilon>0\). If this is true, then the distribution of \(\hat\theta\) becomes more and more concentrated about \(\theta\).

**Definition 3.5 (Consistency in probability) **Let \(X\) be a r.v. with induced probability \(\mathbb{P}(\,\cdot\,;\theta)\). Let \((X_1,\ldots,X_n)\) be a s.r.s. of \(X\) and let \(\hat\theta_n=\hat\theta_n(X_1,\ldots,X_n)\) be an estimator of \(\theta\). The sequence \(\{\hat\theta_n: \ n\in\mathbb{N}\}\) is said to be *consistent in probability* (or *consistent*) for \(\theta\), if
\[
\forall \varepsilon>0, \quad \lim_{n\rightarrow \infty}
\mathbb{P}(|\hat\theta_n-\theta|>\varepsilon)=0.
\]
We simply say that the *estimator \(\hat\theta_n\) is consistent in probability* to indicate that the sequence \(\{\hat\theta_n: \ n\in\mathbb{N}\}\) is consistent in probability.

**Example 3.12 **Let \(X\sim\mathcal{U}(0,\theta)\) and let \((X_1,\ldots,X_n)\) be a s.r.s. of \(X\). Show that the estimator \(\hat\theta_n=X_{(n)}\) is consistent in probability for \(\theta\).

We take \(\varepsilon>0\) and compute the limit of the probability given in Definition 3.5. Since \(\theta \geq X_i\), \(i=1,\dots,n\), we have that

\[ \mathbb{P}(|\hat\theta_n-\theta|>\varepsilon)=\mathbb{P}(\theta-X_{(n)}>\varepsilon) =\mathbb{P}\left(X_{(n)}<\theta-\varepsilon\right)=F_{X_{(n)}}(\theta-\varepsilon). \]

If \(\varepsilon\geq\theta\), then \(\theta-\varepsilon\leq 0\) and therefore \(F_{X_{(n)}}(\theta-\varepsilon)=0\). For \(\varepsilon<\theta\), the c.d.f. evaluated at \(\theta-\varepsilon\) is not zero. Using the c.d.f. of \(X_{(n)}\) given in Exercise 2.1 and the fact that \(X\sim\mathcal{U}(0,\theta)\), we have

\[ F_{X_{(n)}}(x)=\left(F_{X}(x)\right)^n=\left(\frac{x}{\theta}\right)^n, \quad x\in(0,\theta). \]

Considering the value of such distribution at \(\theta-\varepsilon\in [0,\theta)\), we get

\[ F_{X_{(n)}}(\theta-\varepsilon)=\left(\frac{\theta-\varepsilon}{\theta}\right)^n =\left(1-\frac{\varepsilon}{\theta}\right)^n. \]

Then, taking the limit as \(n\rightarrow \infty\) and noting that \(\varepsilon<\theta\), we have

\[ \lim_{n\rightarrow \infty} \mathbb{P}(|\hat\theta_n-\theta|>\varepsilon)=\lim_{n\rightarrow \infty}\left(1-\frac{\varepsilon}{\theta}\right)^n=0. \]

That is, \(\hat\theta_n=X_{(n)}\) is consistent in probability for \(\theta\).

The concept of consistency in probability of a sequence of estimators can be extended to general sequences of r.v.’s. This gives an important type of convergence of r.v.’s, the *convergence in probability*.

**Definition 3.6 (Convergence in probability)** A sequence \(\{X_n: \ n\in\mathbb{N}\}\) of r.v.’s defined over the same probability space \((\Omega,\mathcal{A},\mathbb{P})\) is said to *converge in probability* to another r.v. \(X\), and is denoted by \[ X_n \stackrel{\mathbb{P}}{\longrightarrow} X, \] if the following statement holds: \[ \forall \varepsilon>0, \quad \lim_{n\rightarrow \infty} \mathbb{P}(|X_n-X|>\varepsilon)=0, \] where here \(\mathbb{P}\) stands for the joint induced probability function of \(X_n\) and \(X\).

The following definition establishes another type of consistency that is stronger than consistency in probability, in the sense that it implies, but is not implied by, the latter.

**Definition 3.7 (Consistency in squared mean)** A sequence of estimators \(\{\hat\theta_n: \ n\in\mathbb{N}\}\) is *consistent in squared mean* for \(\theta\) if \[ \lim_{n\rightarrow \infty}\mathrm{MSE}[\hat\theta_n]=0, \] or, equivalently, if \[ \lim_{n\rightarrow \infty} \mathrm{B}[\hat\theta_n]=0 \quad \text{and}\quad \lim_{n\rightarrow \infty} \mathbb{V}\mathrm{ar}[\hat\theta_n]=0. \]

**Example 3.13 **Let us check if \(\hat\theta_n=X_{(n)}\) is consistent in squared mean for \(\theta\).

The MSE of \(\hat\theta_n\) is given by \[ \mathrm{MSE}[\hat\theta_n]=\mathbb{E}\left[(\hat\theta_n-\theta)^2\right]=\mathbb{E}\left[\hat\theta_n^2-2\theta\hat\theta_n+\theta^2\right]. \] Therefore, we have to compute the first two moments of \(\hat\theta_n=X_{(n)}\). In Example 3.3, the density of \(\hat\theta_n=X_{(n)}\) and its expectation were obtained as \[ f_{\hat\theta_n}(x)=\frac{n}{\theta}\left(\frac{x}{\theta}\right)^{n-1}, \quad x\in(0,\theta), \quad \mathbb{E}[\hat\theta_n]=\frac{n}{n+1}\theta. \]

It remains to compute

\[\begin{align*} \mathbb{E}[\hat\theta_n^2]&=\int_0^{\theta} x^2 \frac{n}{\theta}\left(\frac{x}{\theta}\right)^{n-1}\,\mathrm{d}x\\ &=\frac{n}{\theta^n} \int_0^{\theta} x^{n+1}\,\mathrm{d}x=\frac{n}{\theta^n}\frac{\theta^{n+2}}{n+2}= \frac{n\theta^2}{n+2}. \end{align*}\]

Then, the MSE is \[ \mathrm{MSE}[\hat\theta_n]=\left(1-\frac{2n}{n+1}+\frac{n}{n+2}\right)\theta^2=\frac{2\theta^2}{(n+1)(n+2)}. \] Taking the limit as \(n\rightarrow \infty\), we obtain \[ \lim_{n\rightarrow \infty} \mathrm{MSE}[\hat\theta_n]=0, \] so \(\hat\theta_n=X_{(n)}\) is consistent in squared mean, even if it is not unbiased.

From the previous definition we deduce automatically the following result.

**Proposition 3.1 **Assume that \(\hat\theta_n\) is unbiased for \(\theta\), for all
\(n\in\mathbb{N}\). Then \(\hat\theta_n\) is consistent in squared mean for
\(\theta\) if and only if
\[
\lim_{n\rightarrow \infty} \mathbb{V}\mathrm{ar}[\hat\theta_n]=0.
\]

**Theorem 3.1 (Markov’s inequality)** For any r.v. \(X\), \[ \mathbb{P}(|X|\geq k)\leq \frac{\mathbb{E}[X^2]}{k^2}, \quad k>0. \]

*Proof* (Proof of Theorem 3.1). Assume that \(X\) is a continuous r.v. and let \(f\) be its p.d.f. We compute the second-order moment of \(X\):

\[\begin{align*} \mathbb{E}[X^2] &=\int_{-\infty}^{\infty} x^2 f(x)\,\mathrm{d}x\\ &=\int_{-\infty}^{-k} x^2 f(x)\,\mathrm{d}x+\int_{-k}^{k} x^2 f(x)\,\mathrm{d}x+\int_{k}^{\infty} x^2 f(x)\,\mathrm{d}x \\ & \geq \int_{-\infty}^{-k} x^2 f(x)\,\mathrm{d}x + \int_{k}^{\infty} x^2 f(x)\,\mathrm{d}x \\ &\geq k^2 \int_{-\infty}^{-k} f(x)\,\mathrm{d}x + k^2\int_{k}^{\infty} f(x)\,\mathrm{d}x\\ & = k^2\left[\mathbb{P}(X\leq -k)+\mathbb{P}(X\geq k)\right]\\ &=k^2\mathbb{P}(|X|\geq k), \end{align*}\]

which is equivalent to

\[ \mathbb{P}(|X|\geq k)\leq \frac{\mathbb{E}[X^2]}{k^2}. \]

The proof for a discrete r.v. is analogous.

**Theorem 3.2 (Chebyshev’s inequality)** Let \(X\) be a r.v. with \(\mathbb{E}[X]=\mu\) and \(\mathbb{V}\mathrm{ar}[X]=\sigma^2\). Then, \[ \mathbb{P}\left(|X-\mu|\geq k\sigma\right)\leq \frac{1}{k^2}, \quad k>0. \]

*Proof* (Proof of Theorem 3.2). This inequality follows from Markov’s inequality. Indeed, taking
\[
X'=X-\mu, \quad k'=k\sigma,
\]
and replacing \(X\) by \(X'\) and \(k\) by \(k'\) in Markov’s inequality, we get
\[
\mathbb{P}(|X-\mu|\geq k\sigma)\leq
\frac{\sigma^2}{k^2\sigma^2}=\frac{1}{k^2}.
\]

This inequality is useful for obtaining probability bounds about \(\hat\theta\) when its probability distribution is unknown; we only need to know its expectation and variance. Indeed, for an unbiased estimator \(\hat\theta\) with standard deviation \(\sigma_{\hat\theta}\), taking \(k=2\) in Chebyshev’s inequality gives \[ \mathbb{P}(|\hat\theta-\theta|\leq 2\sigma_{\hat\theta})\geq 1-\frac{1}{4}=0.75, \] which means that at least \(75\%\) of the realized values of \(\hat\theta\) fall within the interval \([\theta-2\sigma_{\hat\theta}, \theta+2\sigma_{\hat\theta}]\). However, if we additionally know that the distribution of the estimator \(\hat\theta\) is normal, \(\hat\theta\sim\mathcal{N}(\theta,\sigma_{\hat\theta}^2)\), then we obtain the much more precise result \[ \mathbb{P}(|\hat\theta-\theta|\leq 2\sigma_{\hat\theta})\approx 0.95>0.75. \] The fact that the true probability, \(\approx 0.95\), is in this case substantially larger than the lower bound given by Chebyshev’s inequality, \(0.75\), is reasonable, since Chebyshev’s inequality does not employ any knowledge of the distribution of \(\hat\theta\). Thus, the precision increases as more information about the true distribution of \(\hat\theta\) becomes available.
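The gap between the distribution-free bound and the exact normal probability can be tabulated for several values of \(k\). The sketch below uses the identity \(\mathbb{P}(|Z|\leq k)=\mathrm{erf}(k/\sqrt{2})\) for \(Z\sim\mathcal{N}(0,1)\); the function names and the chosen values of \(k\) are illustrative.

```python
import math

def chebyshev_lower_bound(k):
    """Lower bound for P(|X - mu| <= k * sigma) given by Chebyshev's inequality."""
    return 1.0 - 1.0 / k**2

def normal_prob_within(k):
    """Exact P(|Z| <= k) for Z ~ N(0, 1)."""
    return math.erf(k / math.sqrt(2.0))

for k in (1.5, 2.0, 3.0):
    print(f"k={k}: Chebyshev >= {chebyshev_lower_bound(k):.4f}, "
          f"normal = {normal_prob_within(k):.4f}")
```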

**Example 3.14 **From previous experience, it is known that the time \(X\) (in minutes) that a periodic check of a machine requires is distributed as \(\Gamma(3,2)\). A new worker spends \(19\) minutes checking that machine. Is this time coherent with the previous experience?
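One possible way of approaching the question in Example 3.14 is through Chebyshev's inequality. The sketch below assumes that \(\Gamma(3,2)\) denotes a gamma distribution with shape \(3\) and scale \(2\) (the convention that matches the density used in Example 3.4), so that \(\mathbb{E}[X]=6\) and \(\mathbb{V}\mathrm{ar}[X]=12\). The variable and function names are illustrative, and the exact tail uses the closed form available for integer shape.

```python
import math

shape, scale = 3, 2.0               # assumed reading of Gamma(3, 2): shape 3, scale 2
mean = shape * scale                # E[X] = 6
sd = math.sqrt(shape * scale**2)    # sqrt(Var[X]) = sqrt(12)

k = (19 - mean) / sd                # 19 minutes lies k standard deviations above the mean
chebyshev_bound = 1 / k**2          # P(|X - 6| >= 13) <= 12 / 169

def gamma_upper_tail(x, shape, scale):
    """P(X >= x) for a gamma r.v. with integer shape (Erlang closed form)."""
    y = x / scale
    return math.exp(-y) * sum(y**j / math.factorial(j) for j in range(shape))

exact_tail = gamma_upper_tail(19, shape, scale)

print(f"k = {k:.3f} standard deviations")
print(f"Chebyshev bound: {chebyshev_bound:.4f}")
print(f"Exact P(X >= 19): {exact_tail:.4f}")
```

Under this reading, a check of \(19\) minutes is more than \(3.7\) standard deviations away from the mean, an event whose probability Chebyshev's inequality bounds by about \(0.07\) and whose exact probability is considerably smaller. This suggests that such a time is hardly coherent with the previous experience.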

**Theorem 3.3 (Consistency in squared mean implies consistency in probability)** Consistency in squared mean of an estimator \(\hat\theta_n\) of \(\theta\) implies consistency in probability of \(\hat\theta_n\) to \(\theta\).

*Proof* (Proof of Theorem 3.3). Assume that \(\hat\theta_n\) is a consistent estimator of \(\theta\) in squared mean. Then, taking \(\varepsilon>0\) and applying Markov’s inequality for \(X=\hat\theta_n-\theta\) and \(k=\varepsilon\), we obtain \[ 0\leq \mathbb{P}(|\hat\theta_n-\theta|\geq \varepsilon)\leq \frac{\mathbb{E}[(\hat\theta_n-\theta)^2]}{\varepsilon^2}=\frac{\mathrm{MSE}[\hat\theta_n]}{\varepsilon^2}. \] The right-hand side tends to zero as \(n\to\infty\) because of the consistency in squared mean. Then, \[ \lim_{n\rightarrow \infty} \mathbb{P}(|\hat\theta_n-\theta|\geq \varepsilon)=0, \] and this happens for any \(\varepsilon>0\).

Combining Proposition 3.1 and Theorem 3.3, the task of proving the consistency in probability of an unbiased estimator is notably simplified: it is only required to compute the limit of its variance, a much simpler task than directly employing Definition 3.5. Besides, if the estimator is not unbiased, Theorem 3.3 also provides a recipe for proving the consistency in probability of \(\hat\theta_n\) to \(\theta\) by checking that: \[ \lim_{n\rightarrow \infty} \mathbb{E}[\hat\theta_n]=\theta,\quad \lim_{n\rightarrow \infty} \mathbb{V}\mathrm{ar}[\hat\theta_n]=0. \]

**Example 3.15 **Let \(X_1,\ldots,X_n\) be a s.r.s. of a r.v. with mean \(\mu\) and variance \(\sigma^2\). Consider the following estimators of \(\mu\):

\[ \hat\mu_1=\frac{X_1+X_2}{2},\quad \hat\mu_2=\frac{X_1}{4}+\frac{X_2+\ldots + X_{n-1}}{2(n-2)}+\frac{X_n}{4},\quad \hat\mu_3=\bar{X}. \]

Which ones are unbiased? Which ones are consistent in probability for \(\mu\)?

Their expectations are respectively given by

\[ \mathbb{E}[\hat\mu_1]=\mathbb{E}[\hat\mu_2]=\mathbb{E}[\hat\mu_3]=\mu, \]

so all of them are unbiased. Now, to check whether they are consistent in probability or not, it only remains to check whether their variances tend to zero. But their variances are respectively given by

\[ \mathbb{V}\mathrm{ar}[\hat\mu_1]=\frac{\sigma^2}{2}, \quad \mathbb{V}\mathrm{ar}[\hat\mu_2]=\frac{n\sigma^2}{8(n-2)}, \quad \mathbb{V}\mathrm{ar}[\hat\mu_3]=\frac{\sigma^2}{n}. \]

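The variance of \(\hat\mu_3\) converges to zero, so \(\hat\mu_3\) is consistent in squared mean and hence in probability for \(\mu\). In contrast, the variances of \(\hat\mu_1\) and \(\hat\mu_2\) converge to \(\sigma^2/2\) and \(\sigma^2/8\), respectively, not to zero: these estimators keep a fixed weight on individual observations no matter how large \(n\) is, so only \(\hat\mu_3\) is consistent in probability for \(\mu\).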
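The three variances of Example 3.15 follow from \(\mathbb{V}\mathrm{ar}\big[\sum_i w_iX_i\big]=\sigma^2\sum_i w_i^2\) for independent observations. The sketch below evaluates them in units of \(\sigma^2\) with exact rational arithmetic; the function name is illustrative and \(n\geq 3\) is required for \(\hat\mu_2\) to be defined.

```python
from fractions import Fraction

def variances(n):
    """Variances of mu_1, mu_2, mu_3 in units of sigma^2 (requires n >= 3)."""
    var1 = 2 * Fraction(1, 4)                      # two weights equal to 1/2
    var2 = 2 * Fraction(1, 16) + (n - 2) * Fraction(1, (2 * (n - 2)) ** 2)
    var3 = n * Fraction(1, n * n)                  # n weights equal to 1/n
    return var1, var2, var3

for n in (3, 10, 100, 10_000):
    v1, v2, v3 = variances(n)
    print(f"n={n:>6}: Var[mu_1]={float(v1):.4f}  Var[mu_2]={float(v2):.4f}  "
          f"Var[mu_3]={float(v3):.6f}")
```

As \(n\) grows, the second column approaches \(1/8\) while only the third one vanishes.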

The Law of Large Numbers (LLN) stated below follows by straightforward application of the previous results.

**Theorem 3.4 (Law of Large Numbers) **Let \((X_1,\ldots,X_n)\) be a s.r.s. of a r.v. \(X\) with mean \(\mu\)
and variance \(\sigma^2<\infty\). Then,
\[
\bar{X} \stackrel{\mathbb{P}}{\longrightarrow} \mu.
\]

The above LLN ensures that, by averaging many i.i.d. observations of a r.v. \(X\), we can get arbitrarily close to the true mean \(\mu=\mathbb{E}[X]\). For example, if we were interested in knowing the average weight of an animal, by taking many independent measurements of the weight and averaging them we will get a value very close to the true weight.
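The LLN can be visualized by tracking the sample mean as observations accumulate. The following sketch uses rolls of a fair die, for which \(\mu=3.5\); the example, the seed, and the helper name `running_means` are illustrative assumptions and not part of the original text.

```python
import random

def running_means(draws):
    """Return the sequence of sample means Xbar_1, Xbar_2, ..., Xbar_n."""
    total, means = 0.0, []
    for i, x in enumerate(draws, start=1):
        total += x
        means.append(total / i)
    return means

rng = random.Random(42)  # arbitrary fixed seed
rolls = [rng.randint(1, 6) for _ in range(100_000)]  # i.i.d. rolls with mean 3.5
means = running_means(rolls)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n={n:>6}: Xbar_n = {means[n - 1]:.4f}")
```

A single simulated path does not decrease its error monotonically, but the sample means typically settle near \(3.5\) for large \(n\), in agreement with the convergence in probability stated by the theorem.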

**Example 3.16 **Let \(Y\sim \mathrm{Bin}(n,p)\). Let us see that \(\hat p=Y/n\) is consistent
in probability for \(p\).

Indeed, we know that a binomial r.v. counts the number of successes in \(n\) random trials with two possible outcomes, *success* and *failure*, with probability of *success* equal to \(p\). Thus, \(Y=n\bar{X}=\sum_{i=1}^n X_i\), where the r.v.’s \(X_i\) are i.i.d. \(\mathrm{Ber}(p)\), with

\[ \mathbb{E}[X_i]=p,\quad \mathbb{V}\mathrm{ar}[X_i]=p(1-p)<\infty. \]

This means that the sample proportion \(\hat p\) is actually a sample mean:

\[ \hat p=\frac{\sum_{i=1}^n X_i}{n}=\bar{X}. \]

Applying the LLN, we get that \(\hat p=\bar{X}\) converges in probability to \(p=\mathbb{E}[X_i]\).
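For the binomial case, the probability in Definition 3.5 can be computed exactly from the p.m.f., without simulation. The sketch below does so for the illustrative values \(p=0.3\) and \(\varepsilon=0.05\), and compares the result with the bound \(p(1-p)/(n\varepsilon^2)\) obtained from Markov's inequality applied to \(\hat p-p\); the function name is an assumption.

```python
import math

def prob_far(n, p, eps):
    """Exact P(|Y/n - p| > eps) for Y ~ Bin(n, p), summing the binomial p.m.f."""
    return sum(math.comb(n, y) * p**y * (1 - p) ** (n - y)
               for y in range(n + 1) if abs(y / n - p) > eps)

p, eps = 0.3, 0.05
for n in (10, 100, 1_000):
    bound = min(1.0, p * (1 - p) / (n * eps**2))  # MSE[p_hat] / eps^2, capped at 1
    print(f"n={n:>5}: exact = {prob_far(n, p, eps):.4f}, bound = {bound:.4f}")
```

The exact probabilities decrease towards zero as \(n\) grows and stay below the bound, which is informative only once \(n\) is moderately large.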

The LLN implies the following result, giving the condition under which the sample moments are consistent in probability for the population moments.

**Corollary 3.1 (Consistency in probability of the sample moments) **Let \(X\) be a r.v. with \(k\)-th population moment \(\alpha_k=\mathbb{E}[X^k]\) and such that \(\alpha_{2k}=\mathbb{E}[X^{2k}]<\infty\). Let \((X_1,\ldots,X_n)\) be a s.r.s. of \(X\), with \(k\)-th sample moment \(a_k=\frac{1}{n}\sum_{i=1}^n X_i^k\). Then,
\[
a_k\stackrel{\mathbb{P}}{\longrightarrow} \alpha_k.
\]

*Proof* (Proof of Corollary 3.1). The proof is straightforward by taking \(Y_i=X_i^k\), \(i=1,\ldots,n\), in such a way that \((Y_1,\ldots,Y_n)\) represents a s.r.s. of a r.v. \(Y\) with mean \(\mathbb{E}[Y]=\mathbb{E}[X^k]=\alpha_k\) and variance \(\mathbb{V}\mathrm{ar}[Y]=\mathbb{E}[(X^k-\alpha_k)^2]=\alpha_{2k}-\alpha_k^2<\infty\). Then, by the LLN, we have that \(a_k=\frac{1}{n}\sum_{i=1}^n Y_i\stackrel{\mathbb{P}}{\longrightarrow} \alpha_k\).

The following theorem states that any continuous transformation \(g\) of an estimator \(\hat\theta_n\) that is consistent in probability to \(\theta\) is also consistent for the transformed parameter \(g(\theta)\).

**Theorem 3.5 (Version of the continuous-mapping Theorem) **

- Let \(\hat\theta_n\stackrel{\mathbb{P}}{\longrightarrow} \theta\) and let \(g\) be a function that is continuous at \(\theta\). Then, \[ g(\hat\theta_n)\stackrel{\mathbb{P}}{\longrightarrow} g(\theta). \]
- Let \(\hat\theta_n\stackrel{\mathbb{P}}{\longrightarrow} \theta\) and \(\hat\theta_n'\stackrel{\mathbb{P}}{\longrightarrow} \theta'\). Let \(g\) be a bivariate function that is continuous at \((\theta,\theta')\). Then, \[ g\left(\hat\theta_n,\hat\theta_n'\right)\stackrel{\mathbb{P}}{\longrightarrow}g\left(\theta,\theta'\right). \]

*Proof* (Proof of Theorem 3.5). Due to the continuity of \(g\) at \(\theta\), for all \(\varepsilon>0\), there exists a \(\delta=\delta(\varepsilon)>0\) such that \[ |x-\theta|<\delta\implies \left|g(x)-g(\theta)\right|<\varepsilon. \] Hence, for any fixed \(n\), it holds \[ 1\geq \mathbb{P}\left(\left|g(\hat\theta_n)-g(\theta)\right|\leq \varepsilon\right)\geq \mathbb{P}(|\hat\theta_n-\theta|< \delta). \] Therefore, if \(\hat\theta_n\stackrel{\mathbb{P}}{\longrightarrow} \theta\), then \[ \lim_{n\rightarrow\infty} \mathbb{P}(|\hat\theta_n-\theta|< \delta)=1, \] and, as a consequence, \[ \lim_{n\rightarrow\infty} \mathbb{P}\left(\left|g(\hat\theta_n)-g(\theta)\right|\leq \varepsilon\right)=1. \]

The following corollary readily follows from Theorem 3.5, but it makes explicit some of the possible algebraic operations that preserve the convergence in probability.

**Corollary 3.2 (Algebra of convergence in probability) **Assume that \(\hat\theta_n\stackrel{\mathbb{P}}{\longrightarrow} \theta\)
and \(\hat\theta_n'\stackrel{\mathbb{P}}{\longrightarrow} \theta'\). Then:

- \(\hat\theta_n+\hat\theta_n' \stackrel{\mathbb{P}}{\longrightarrow} \theta+\theta'\).
- \(\hat\theta_n\hat\theta_n' \stackrel{\mathbb{P}}{\longrightarrow}\theta\theta'\).
- \(\hat\theta_n/\hat\theta_n' \stackrel{\mathbb{P}}{\longrightarrow} \theta/\theta'\) if \(\theta'\neq 0\).
- Let \(a_n\) be a deterministic sequence whose limit is \(a\). Then \(a_n\hat\theta_n\stackrel{\mathbb{P}}{\longrightarrow} a\theta\).

**Example 3.17 (Consistency in probability of the sample quasivariance) **Let \((X_1,\ldots,X_n)\) be a s.r.s. of a r.v. \(X\) with the following finite moments:
\[
\mathbb{E}[X]=\mu, \quad \mathbb{E}[X^2]=\alpha_2, \quad \mathbb{E}[X^4]=\alpha_4.
\]
Its variance is therefore \(\mathbb{V}\mathrm{ar}[X]=\alpha_2-\mu^2=\sigma^2\). Let us check that \(S'^2\stackrel{\mathbb{P}}{\longrightarrow} \sigma^2\).

\(S'^2\) can be written as

\[ S'^2=\frac{n}{n-1}\left(\frac{1}{n}\sum_{i=1}^n X_i^2-{\bar X}^2\right). \]

Now, by the LLN,

\[ \bar{X}\stackrel{\mathbb{P}}{\longrightarrow} \mu. \]

Firstly, applying part ii of Corollary 3.2, we obtain

\[\begin{align} {\bar{X}}^2\stackrel{\mathbb{P}}{\longrightarrow} \mu^2. \tag{3.1} \end{align}\]

Secondly, the r.v.’s \(Y_i=X_i^2\) have mean \(\mathbb{E}[Y_i]=\mathbb{E}[X_i^2]=\alpha_2\) and variance \(\mathbb{V}\mathrm{ar}[Y_i]=\mathbb{E}[X_i^4]-\mathbb{E}[X_i^2]^2=\alpha_4-\alpha_2^2<\infty\), \(i=1,\ldots,n\). Applying the LLN to \((Y_1,\ldots,Y_n)\), we obtain

\[\begin{align} \bar Y=\frac{1}{n}\sum_{i=1}^n X_i^2 \stackrel{\mathbb{P}}{\longrightarrow} \alpha_2.\tag{3.2} \end{align}\]

Applying now part i of Corollary 3.2 to (3.1) and (3.2), we get

\[ \frac{1}{n}\sum_{i=1}^n X_i^2-{\bar{X}}^2\stackrel{\mathbb{P}}{\longrightarrow} \alpha_2-\mu^2=\sigma^2. \] Observe that we have just proved consistency in probability for the sample variance \(S^2\). Finally, applying part iv of Corollary 3.2 with the deterministic sequence \(a_n=n/(n-1)\), for which

\[ \lim_{n\rightarrow \infty}\frac{n}{n-1}=1, \]

we conclude that

\[ S'^2\stackrel{\mathbb{P}}{\longrightarrow} \sigma^2. \]
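The convergence in Example 3.17 can be observed numerically. The following is a minimal sketch, assuming a \(\mathcal{N}(0,\sigma^2)\) population with \(\sigma=2\); the function names, the seed, and the sample sizes are illustrative choices, not part of the example. It computes \(S'^2\) through the identity used above and prints it for increasing \(n\).

```python
import random

def quasivariance(sample):
    """S'^2 computed as n/(n-1) * (mean of squares - squared mean)."""
    n = len(sample)
    mean = sum(sample) / n
    mean_sq = sum(x * x for x in sample) / n
    return n / (n - 1) * (mean_sq - mean ** 2)

def simulate_quasivariance(n, sigma=2.0, seed=0):
    """Draw a s.r.s. of size n from N(0, sigma^2) (an assumed population) and return S'^2."""
    rng = random.Random(seed)
    sample = [rng.gauss(0.0, sigma) for _ in range(n)]
    return quasivariance(sample)

# S'^2 should approach sigma^2 = 4 as n grows
for n in (10, 100, 10_000, 100_000):
    print(n, round(simulate_quasivariance(n), 4))
```

A single simulated path does not prove convergence in probability, but the printed values should cluster around \(\sigma^2=4\) more tightly as \(n\) increases.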

The next theorem delivers asymptotic normality of the \(T\) statistic that was used in Section 2.2.3 for normal populations. This theorem will be rather useful for deriving the asymptotic distribution (via convergence in distribution) of an estimator that is affected by a nuisance factor converging to \(1\) in probability.

**Theorem 3.6 (Version of Slutsky’s Theorem) **Let \(U_n\) and \(W_n\) be two r.v.’s that satisfy, respectively,
\[
U_n\stackrel{d}{\longrightarrow} \mathcal{N}(0,1),\quad W_n\stackrel{\mathbb{P}}{\longrightarrow} 1.
\]
Then,
\[
U_n/W_n \stackrel{d}{\longrightarrow} \mathcal{N}(0,1).
\]
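A small Monte Carlo experiment may help to visualize the theorem. The sketch below assumes an \(\mathrm{Exp}(1)\) population (mean and variance equal to \(1\)), takes \(U_n\) as the standardized sample mean and \(W_n=S'/\sigma\), and estimates \(\mathbb{P}(|U_n/W_n|\leq 1.96)\), which should be close to \(0.95\) if \(U_n/W_n\) is approximately \(\mathcal{N}(0,1)\). The population, the sample size, and the number of replicates are assumptions made only for illustration.

```python
import math
import random

def studentized_mean(sample, mu):
    """(mean - mu) / (S' / sqrt(n)), i.e., U_n / W_n with W_n = S' / sigma."""
    n = len(sample)
    mean = sum(sample) / n
    quasivar = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return (mean - mu) / math.sqrt(quasivar / n)

def coverage(n=200, replicates=2000, seed=1):
    """Fraction of replicates with |U_n / W_n| <= 1.96 for Exp(1) samples."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(replicates):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        if abs(studentized_mean(sample, mu=1.0)) <= 1.96:
            inside += 1
    return inside / replicates

print(coverage())
```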

**Example 3.18 **Let \((X_1,\ldots,X_n)\) be a s.r.s. of a r.v. \(X\) with \(\mathbb{E}[X]=\mu\) and \(\mathbb{V}\mathrm{ar}[X]=\sigma^2\). Show that:

\[ T=\frac{\bar{X}-\mu}{S'/\sqrt{n}}\stackrel{d}{\longrightarrow} \mathcal{N}(0,1). \]

In Example 3.17 we saw \(S'^2\stackrel{\mathbb{P}}{\longrightarrow} \sigma^2\). Employing iv in Corollary 3.2, it follows that

\[ \frac{S'^2}{\sigma^2} \stackrel{\mathbb{P}}{\longrightarrow} 1. \]

Taking square root (a continuous function at \(1\)) in Theorem 3.5, we have

\[ \frac{S'}{\sigma} \stackrel{\mathbb{P}}{\longrightarrow} 1. \]

In addition, by the CLT, we know that

\[ U_n=\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\stackrel{d}{\longrightarrow} \mathcal{N}(0,1). \]

Finally, applying Theorem 3.6 to \(U_n\) and \(W_n=S'/\sigma\), we get

\[ \frac{\bar{X}-\mu}{S'/\sqrt{n}}=U_n/W_n \stackrel{d}{\longrightarrow} \mathcal{N}(0,1). \]

**Example 3.19 **Let \(X_1,\ldots,X_n\) be i.i.d. r.v.’s with distribution \(\mathrm{Ber}(p)\), where \(p\) is the probability of success. Prove that the sample proportion, \(\hat p=\bar{X}\), satisfies

\[ \frac{\hat p-p}{\sqrt{\hat p(1-\hat p)/n}}\stackrel{d}{\longrightarrow} \mathcal{N}(0,1). \]

As in Example 2.14, applying the CLT to \(\hat p=\bar{X}\), it readily follows that

\[ U_n=\frac{\hat p-p}{\sqrt{p(1-p)/n}}\stackrel{d}{\longrightarrow} \mathcal{N}(0,1). \] Now, because of the LLN, we have

\[ \hat p\stackrel{\mathbb{P}}{\longrightarrow} p, \]

which, by ii in Corollary 3.2, leads to

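\[ \hat p(1-\hat p)\stackrel{\mathbb{P}}{\longrightarrow} p(1-p). \] Applying now iv in Corollary 3.2, with the constant sequence \(a_n=1/(p(1-p))\), and then Theorem 3.5 with the square root (a continuous function at \(1\)), we get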

\[ W_n=\sqrt{\frac{\hat p(1-\hat p)}{p(1-p)}}\stackrel{\mathbb{P}}{\longrightarrow} 1. \] Finally, Theorem 3.6 gives \[ \frac{\hat p-p}{\sqrt{\hat p(1-\hat p)/n}}=U_n/W_n\stackrel{d}{\longrightarrow} \mathcal{N}(0,1). \]

## 3.4 Sufficient statistics

As we will see below, a statistic is sufficient for a parameter \(\theta\) if it collects all the useful information of the sample about \(\theta\). In other words: if for a given sample we know the value of a sufficient statistic, then the sample does not offer any additional information about \(\theta\).

The following example motivates adequately the definition of sufficiency.

**Example 3.20 **Assume that an experiment with two possible results, *success* and *failure*, is repeated \(n\) times, in such a way that \((X_1,\ldots,X_n)\) is a s.r.s. of a \(\mathrm{Ber}(p)\). If we compute the value of the statistic \(Y=\sum_{i=1}^n X_i\), counting the number of successes in the \(n\) trials, does the sample provide any information about \(p\) in addition to that contained in the observed value of \(Y\)?

We can answer this question by computing the probability of the sample given the observed value of \(Y\):

\[ \mathbb{P}\left(X_1=x_1,\ldots, X_n=x_n|Y=y\right)= \left\{\begin{array}{ll} \frac{\mathbb{P}(X_1=x_1,\ldots, X_n=x_n)}{\mathbb{P}(Y=y)} & \text{if}\quad \sum_{i=1}^n x_i=y,\\ 0 & \text{if}\quad \sum_{i=1}^n x_i\neq y. \end{array}\right. \]

For the case \(\sum_{i=1}^n x_i=y\), the above probability is given by

\[ \mathbb{P}\left(X_1=x_1,\ldots, X_n=x_n|Y=y\right) = \frac{p^y(1-p)^{n-y}} {\binom{n}{y}p^y(1-p)^{n-y}}=\frac{1}{\binom{n}{y}}, \]

which does not depend on \(p\). This means that, once the number of total successes \(Y\) is known, there is no useful information left in the sample about the probability of success \(p\). In this case, the information that remains in the sample is only about the order of appearance of the successes, which is superfluous for estimating \(p\) because the trials are independent.
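The computation in Example 3.20 can be checked by brute force for a small \(n\). The following sketch enumerates all binary samples of length \(n\), computes \(\mathbb{P}(Y=y)\) by summing the joint probabilities, and compares the conditional probability with \(1/\binom{n}{y}\) for several values of \(p\). The function name and the particular sample are illustrative assumptions.

```python
from itertools import product
from math import comb

def conditional_prob(sample, p):
    """P(X_1=x_1,...,X_n=x_n | Y=y) for a Ber(p) s.r.s., with y = sum(sample)."""
    n, y = len(sample), sum(sample)

    def joint(xs):
        # Joint p.m.f. of n independent Ber(p) r.v.'s
        s = sum(xs)
        return p ** s * (1 - p) ** (n - s)

    # P(Y = y): sum of the joint p.m.f. over all samples with y successes
    p_y = sum(joint(xs) for xs in product((0, 1), repeat=n) if sum(xs) == y)
    return joint(sample) / p_y

sample = (1, 0, 0, 1, 0)
for p in (0.1, 0.5, 0.9):
    # The conditional probability is the same for every p
    print(p, conditional_prob(sample, p), 1 / comb(len(sample), sum(sample)))
```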

**Definition 3.8 (Sufficient statistic) **Let \((X_1,\ldots,X_n)\) be a s.r.s. of a r.v. \(X\) whose distribution depends on an unknown parameter \(\theta\). A statistic \(T=T(X_1,\ldots,X_n)\) is *sufficient* for \(\theta\) if the distribution of the sample given \(T\), that is, the distribution of \((X_1,\ldots,X_n)|T\), does not depend on \(\theta\).

*Remark.* Observe that the previous definition implies that if \(T\) is a sufficient statistic for a parameter \(\theta\), then \(T\) is also sufficient for any parameter \(g(\theta)\) that is a function of \(\theta\).

Given a realization of a s.r.s. \((X_1,\ldots,X_n)\) of a r.v. \(X\) with distribution depending on an unknown parameter \(\theta\), the *likelihood* of a value of \(\theta\) represents the credibility that the sample gives to that value of \(\theta\). The likelihood is one of the most important concepts in statistics and statistical inference – it is the core of many inferential tools with excellent properties. It is defined through the joint p.m.f. for discrete r.v.’s or through the joint p.d.f. for continuous r.v.’s.

**Definition 3.9 (Likelihood) **Let \((X_1,\ldots,X_n)\) be a s.r.s. of a r.v. \(X\) with distribution depending on an unknown parameter \(\theta\). Let \(x_1,\ldots,x_n\) be a realization of the s.r.s. If the r.v.’s are discrete, the *likelihood* of \(\theta\) for \((x_1,\ldots,x_n)\) is defined as the joint p.m.f. evaluated at \((x_1,\ldots,x_n)\):
\[
\mathcal{L}(\theta;x_1,\ldots,x_n)=\mathbb{P}(X_1=x_1,\ldots,X_n=x_n;\theta)=p_{(X_1,\ldots,X_n)}(x_1,\ldots,x_n;\theta).
\]
If the r.v.’s are continuous, the likelihood is defined as the joint p.d.f. evaluated at \((x_1,\ldots,x_n)\):
\[
\mathcal{L}(\theta;x_1,\ldots,x_n)=f_{(X_1,\ldots,X_n)}(x_1,\ldots,x_n;\theta).
\]

\(\mathcal{L}(\theta;x_1,\ldots,x_n)\) is usually considered as a function of the parameter \(\theta\), since the realization of the sample \((x_1,\ldots,x_n)\) is fixed. In the situation in which the r.v.’s are independent (the case we cover in these notes), the likelihood is

\[ \mathcal{L}(\theta;x_1,\ldots,x_n)=\left\{\begin{array}{ll} \prod_{i=1}^n p_{X_i}(x_i;\theta) & \text{if the r.v.'s are discrete}, \\ \prod_{i=1}^n f_{X_i}(x_i;\theta) & \text{if the r.v.'s are continuous}. \end{array}\right. \]
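As a minimal sketch of the product form above, the following code evaluates the likelihood of an independent sample as a function of the parameter. The helper names `likelihood` and `bernoulli_pmf`, as well as the \(\mathrm{Ber}(p)\) model and the sample, are assumptions chosen only for illustration.

```python
import math

def likelihood(theta, sample, pmf_or_pdf):
    """L(theta; x_1,...,x_n) for independent observations: product of the marginals."""
    return math.prod(pmf_or_pdf(x, theta) for x in sample)

def bernoulli_pmf(x, p):
    """p.m.f. of a Ber(p) evaluated at x in {0, 1}."""
    return p if x == 1 else 1.0 - p

# The realization is fixed; the likelihood is regarded as a function of p
sample = [1, 0, 1, 1, 0]
for p in (0.2, 0.6, 0.8):
    print(p, likelihood(p, sample, bernoulli_pmf))
```

For this sample, with three successes in five trials, the value \(p=0.6\) receives a larger likelihood than \(p=0.2\) or \(p=0.8\), in agreement with the interpretation of the likelihood as the credibility that the sample gives to each value of the parameter.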

The next theorem gives a simple method for checking whether a statistic is sufficient.

**Theorem 3.7 (Factorization criterion) **Let \((X_1,\ldots,X_n)\) be a s.r.s. of a r.v. \(X\) with distribution depending on an unknown parameter \(\theta\). The statistic \(T=T(X_1,\ldots,X_n)\) is sufficient for \(\theta\) if and only if the likelihood can be factorized into two nonnegative functions of the form \[ \mathcal{L}(\theta;x_1,\ldots,x_n)=g(t,\theta)h(x_1,\ldots,x_n), \] where \(g(t,\theta)\) only depends on the sample through \(t=T(x_1,\ldots,x_n)\) and \(h(x_1,\ldots,x_n)\) does not depend on \(\theta\).

*Proof* (Proof of Theorem 3.7). We only prove the result for discrete r.v.’s.

“\(\Longrightarrow\)”. Let \(t=T(x_1,\ldots,x_n)\) be the observed value of the statistic for the sample \((x_1,\ldots,x_n)\). Since \(T\) is sufficient, \(\mathbb{P}(X_1=x_1,\ldots,X_n=x_n|T=t)\) is independent of \(\theta\) and therefore the likelihood can be factorized as

\[\begin{align*} \mathcal{L}(\theta;x_1,\ldots,x_n) &=\mathbb{P}(X_1=x_1,\ldots,X_n=x_n;\theta)\\ &=\mathbb{P}\left(\{X_1=x_1,\ldots,X_n=x_n\} \cap\{T=t\};\theta\right) \\ &=p(T=t;\theta)\mathbb{P}(X_1=x_1,\ldots,X_n=x_n|T=t), \end{align*}\]

which agrees with the desired factorization just by taking

\[ g(t,\theta)=p(T=t;\theta), \quad h(x_1,\ldots,x_n)=\mathbb{P}(X_1=x_1,\ldots,X_n=x_n|T=t). \]

“\(\Longleftarrow\)”. Assume now that the factorization

\[ \mathcal{L}(\theta;x_1,\ldots,x_n)=\mathbb{P}(X_1=x_1,\ldots,X_n=x_n;\theta)=g(t,\theta)h(x_1,\ldots,x_n) \]

holds. We define the set

\[ A_t=\left\{(x_1,\ldots,x_n)\in\mathbb{R}^n: \ T(x_1,\ldots,x_n)=t\right\}. \]

Then,

\[\begin{align*} p(T=t;\theta)&=\sum_{(x_1,\ldots,x_n)\in A_t} \mathbb{P}(X_1=x_1,\ldots,X_n=x_n;\theta)\\ &=g(t,\theta)\sum_{(x_1,\ldots,x_n)\in A_t}h(x_1,\ldots,x_n), \end{align*}\]

so therefore

\[ \mathbb{P}(X_1=x_1,\ldots,X_n=x_n|T=t;\theta)=\left\{\begin{array}{ll} \frac{h(x_1,\ldots,x_n)}{\sum_{(x_1,\ldots,x_n)\in A_t}h(x_1,\ldots,x_n)} & \text{if} \ T(x_1,\ldots,x_n)=t,\\ 0 & \text{if} \ T(x_1,\ldots,x_n)\neq t. \end{array}\right. \]

Since \(h(x_1,\ldots,x_n)\) does not depend on \(\theta\), then the conditional distribution of \((X_1,\ldots,X_n)\) given \(T\) does not depend on \(\theta\). Therefore, \(T\) is sufficient.

**Example 3.21 **In Example 3.20, prove that \(T=\sum_{i=1}^n X_i\) is sufficient for \(p\) using the factorization criterion.

The likelihood is

\[ \mathcal{L}(p;x_1,\ldots,x_n)=p^{\sum_{i=1}^n x_i} (1-p)^{n-\sum_{i=1}^n x_i} =g(t,p) \]

with \(t=\sum_{i=1}^n x_i\) and \(h(x_1,\ldots,x_n)=1\). Therefore, \(T=\sum_{i=1}^n X_i\) is sufficient for \(p\).

**Example 3.22 **Let \((X_1,\ldots,X_n)\) be a s.r.s. of a r.v. \(X\) with \(\mathrm{Exp}(\alpha)\) distribution. Let us see that \(\bar{X}\) is sufficient for \(\alpha\).

Indeed,

\[ \mathcal{L}(\alpha;x_1,\ldots,x_n)=\prod_{i=1}^n f(x_i)=\prod_{i=1}^n \frac{e^{-x_i/\alpha}}{\alpha}=\frac{e^{-\sum_{i=1}^n x_i/\alpha}}{\alpha^n}=\frac{e^{-n\bar{x}/\alpha}}{\alpha^n}=g(t,\alpha) \]

with \(h(x_1,\ldots,x_n)=1\). Then, \(g(t,\alpha)\) depends on the sample through \(t=\bar{x}\). Therefore, \(T=\bar{X}\) is sufficient for \(\alpha\).

**Example 3.23 **Let \((X_1,\ldots,X_n)\) be a s.r.s. of a \(\mathcal{U}(\theta_1,\theta_2)\) r.v. with \(\theta_1<\theta_2\). Let us find a sufficient statistic for \((\theta_1,\theta_2)\).

The likelihood is

\[ \mathcal{L}(\theta_1,\theta_2;x_1,\ldots,x_n)=\frac{1}{(\theta_2-\theta_1)^n}, \ \theta_1<x_1,\ldots,x_n<\theta_2. \]

Rewriting the likelihood in terms of indicator functions, we get the factorization

\[ \mathcal{L}(\theta_1,\theta_2;x_1,\ldots,x_n)=\frac{1}{(\theta_2-\theta_1)^n}\, 1_{\{x_{(1)}>\theta_1\}}1_{\{x_{(n)}<\theta_2\}}=g(t,\theta_1,\theta_2) \]

by taking \(h(x_1,\ldots,x_n)=1\). Since \(g(t,\theta_1,\theta_2)\) depends on the sample through \((x_{(1)},x_{(n)})\), then the statistic

\[ T=(X_{(1)},X_{(n)}) \]

is sufficient for \((\theta_1,\theta_2)\).

However, if \(\theta_1\) was known, then the factorization would be

\[ g(t,\theta_2)=\frac{1}{(\theta_2-\theta_1)^n}\, 1_{\{x_{(n)}<\theta_2\}}, \ h(x_1,\ldots,x_n)=1_{\{x_{(1)}>\theta_1\}} \]

and in this case \(T(X_1,\ldots,X_n)=X_{(n)}\) is a sufficient statistic for \(\theta_2\).
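The factorization in Example 3.23 implies that two samples with the same minimum and maximum have the same likelihood for every \((\theta_1,\theta_2)\). The sketch below checks this numerically; the function name and the two samples are illustrative assumptions.

```python
def uniform_likelihood(theta1, theta2, sample):
    """Likelihood of (theta1, theta2) for a s.r.s. of a U(theta1, theta2)."""
    n = len(sample)
    # The indicators only involve the sample through x_(1) and x_(n)
    if theta1 < min(sample) and max(sample) < theta2:
        return 1.0 / (theta2 - theta1) ** n
    return 0.0

x = [0.2, 0.5, 0.9, 0.3]
x_prime = [0.9, 0.2, 0.7, 0.4]  # same minimum and maximum, different interior points

for theta1, theta2 in [(0.0, 1.0), (0.1, 2.0), (0.25, 1.0), (0.0, 0.8)]:
    print((theta1, theta2),
          uniform_likelihood(theta1, theta2, x),
          uniform_likelihood(theta1, theta2, x_prime))
```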

**Example 3.24 **Let \((X_1,\ldots,X_n)\) be a s.r.s. of a r.v. with \(\mathcal{N}(\mu,\sigma^2)\) distribution. Let us find sufficient statistics for

- \(\sigma^2\), if \(\mu\) is known.
- \(\mu\), if \(\sigma^2\) is known.
- \(\mu\) and \(\sigma^2\).

The likelihood of the sample is

\[ \mathcal{L}(\mu,\sigma^2;x_1,\ldots,x_n)=\frac{1}{(\sqrt{2\pi\sigma^2})^n} \exp\left\{-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i-\mu)^2\right\}. \]

With this, we already have the adequate factorization for part a, by taking \(h(x_1,\ldots,x_n)=1\). Therefore, \(T=\sum_{i=1}^n (X_i-\mu)^2\) is sufficient for \(\sigma^2\).

For part b, adding and subtracting \(\bar{x}\) inside the square in the exponential, we obtain

\[ \sum_{i=1}^n (x_i-\mu)^2=\sum_{i=1}^n (x_i-\bar{x})^2+n(\bar x-\mu)^2, \]

since the cross term vanishes. Then, the likelihood can be factorized in the form of

\[ \mathcal{L}(\mu,\sigma^2;x_1,\ldots,x_n)=\frac{1}{(\sqrt{2\pi\sigma^2})^n} \exp\left\{-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i-\bar{x})^2\right\}\exp\left\{-\frac{n}{2\sigma^2} (\bar{x}-\mu)^2\right\}. \]

Therefore, in this case \(T=\bar{X}\) is sufficient for \(\mu\).

Finally, a sufficient statistic for \((\mu,\sigma^2)\) follows from the same factorization:

\[ T=\left(\bar{X},\sum_{i=1}^n (X_i-\bar{X})^2\right), \]

or, equivalently, multiplying and dividing by \(n\) inside of the first exponential, it follows that

\[ T=(\bar{X},S^2) \]

is sufficient for \((\mu,\sigma^2)\).
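The sufficiency of \(T=(\bar{X},S^2)\) means that the likelihood can be evaluated from \((\bar{x},S^2)\) and \(n\) alone. The following sketch compares the log-likelihood computed from the full sample with the one computed from the statistic, using the decomposition of \(\sum_{i=1}^n (x_i-\mu)^2\) given above. Here \(S^2\) is the sample variance with divisor \(n\); the function names and the data are illustrative assumptions.

```python
import math

def normal_loglik(mu, sigma2, sample):
    """Log-likelihood of (mu, sigma2) computed from the full sample."""
    n = len(sample)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in sample) / (2 * sigma2))

def normal_loglik_from_T(mu, sigma2, n, xbar, s2):
    """Same log-likelihood using only n and T = (xbar, S^2), with S^2 dividing by n."""
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - (n * s2 + n * (xbar - mu) ** 2) / (2 * sigma2))

sample = [1.2, -0.3, 2.5, 0.7]
n = len(sample)
xbar = sum(sample) / n
s2 = sum((x - xbar) ** 2 for x in sample) / n

for mu, sigma2 in [(0.0, 1.0), (1.0, 2.5)]:
    print(normal_loglik(mu, sigma2, sample),
          normal_loglik_from_T(mu, sigma2, n, xbar, s2))
```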

## 3.5 Minimal sufficient statistics

Intuitively, a minimal sufficient statistic for parameter \(\theta\) is the one that collects the useful information in the sample about \(\theta\) *but only the essential one*, excluding any superfluous information on the sample that does not help on the estimation of \(\theta\).

Observe that, if \(T\) is a sufficient statistic and \(T'=\varphi(T)\) is also a sufficient statistic, with \(\varphi\) a non-injective mapping^{2}, then \(T'\) *condenses the information more*, that is, the information in \(T\) cannot be obtained from that in \(T'\) because \(\varphi\) cannot be inverted, yet both still collect the sufficient amount of information. A minimal sufficient statistic is a sufficient statistic that can be obtained by means of (not necessarily injective but measurable) functions of any other sufficient statistic.

**Definition 3.10 (Minimal sufficient statistic) **A sufficient statistic \(T\) for \(\theta\) is *minimal sufficient* if, for any other sufficient statistic \(\tilde T\), there exists a measurable function \(\varphi\) such that
\[
T=\varphi(\tilde T).
\]

The factorization criterion of Theorem 3.7 provides an effective way of obtaining sufficient statistics that *usually* happen to be minimal. A guarantee of minimality is given by the next theorem.

**Theorem 3.8 (Sufficient condition for minimal sufficiency) **A statistic \(T\) is minimal sufficient if the following property holds:
\[\begin{align}
\frac{\mathcal{L}(\theta;x_1,\ldots,x_n)}{\mathcal{L}(\theta;x_1',\ldots,x_n')}\ \text{is independent of $\theta$} \iff
T(x_1,\ldots,x_n)=T(x_1',\ldots,x_n').\tag{3.3}
\end{align}\]

*Proof* (Proof of Theorem 3.8). We prove the theorem for discrete r.v.’s. Let \(T\) be a statistic that satisfies (3.3). Let us see that, then, it is minimal sufficient. Firstly, we check that \(T\) is
sufficient. Indeed, for any sample \((x_1',\ldots,x_n')\), we have that

\[ \mathbb{P}(X_1=x_1',\ldots,X_n=x_n'|T=t) =\left\{\begin{array}{ll} 0 & \text{if} \ T(x_1',\ldots,x_n')\neq t,\\ {\frac{\mathbb{P}(X_1=x_1',\ldots,X_n=x_n';\theta)}{\mathbb{P}(T=t;\theta)}} & \text{if} \ T(x_1',\ldots,x_n')= t. \end{array}\right. \]

Then, if we have a sample \((x_1',\ldots,x_n')\) such that \(T(x_1',\ldots,x_n')=t\), then

\[\begin{align*} \mathbb{P}(X_1=x_1',\ldots,X_n=x_n'|T=t)&=\frac{\mathbb{P}(X_1=x_1',\ldots,X_n=x_n';\theta)}{\mathbb{P}(T=t;\theta)}\\ &=\frac{\mathbb{P}(X_1=x_1',\ldots,X_n=x_n';\theta)}{\sum_{(x_1,\ldots,x_n)\in A_t} \mathbb{P}(X_1=x_1,\ldots,X_n=x_n;\theta)}, \end{align*}\]

where

\[ A_t=\{(x_1,\ldots,x_n)\in\mathbb{R}^n: \ T(x_1,\ldots,x_n)=t\}, \]

that is,

\[\begin{align*} \mathbb{P}(X_1=x_1',\ldots,X_n=x_n'|T=t)&=\frac{\mathcal{L}(\theta;x_1',\ldots,x_n')}{{\sum_{(x_1,\ldots,x_n)\in A_t}} \mathcal{L}(\theta;x_1,\ldots,x_n)}\\ &=\frac{1}{{\sum_{(x_1,\ldots,x_n)\in A_t}} {\frac{\mathcal{L}(\theta;x_1,\ldots,x_n)}{\mathcal{L}(\theta;x_1',\ldots,x_n')}}}. \end{align*}\]

All the samples \((x_1,\ldots,x_n)\in A_t\) share the same value of the statistic, \(T(x_1,\ldots,x_n)=t\), just like \((x_1',\ldots,x_n')\). Therefore, the ratio of likelihoods in the denominator, for the samples \((x_1,\ldots,x_n)\in A_t\), does not depend on \(\theta\); that is, \(T\) is sufficient.

Let \(\tilde T\) be another sufficient statistic. Let us see that then it has to be \(T=\varphi(\tilde T)\). Let \((x_1,\ldots,x_n)\) and \((x_1',\ldots,x_n')\) be two samples with the same value for the new sufficient statistic:

\[ \tilde T(x_1,\ldots,x_n)=\tilde T(x_1',\ldots,x_n')=\tilde t. \]

Then, the probabilities of such samples given \(\tilde T=\tilde t\) are

\[\begin{align*} \mathbb{P}(X_1=x_1,\ldots,X_n=x_n|\tilde T=\tilde t)&=\frac{f(x_1,\ldots,x_n;\theta)}{\mathbb{P}(\tilde T=\tilde t;\theta)},\\ \mathbb{P}(X_1=x_1',\ldots,X_n=x_n'|\tilde T=\tilde t)&=\frac{f(x_1',\ldots,x_n';\theta)}{\mathbb{P}(\tilde T=\tilde t;\theta)}. \end{align*}\]

Both are independent of \(\theta\), so the ratio

\[ \frac{f(x_1,\ldots,x_n;\theta)}{f(x_1',\ldots,x_n';\theta)}=\frac{\mathcal{L}(\theta;x_1,\ldots,x_n)}{\mathcal{L}(\theta;x_1',\ldots,x_n')} \]

is also independent of \(\theta\). By (3.3), it follows that

\[ T(x_1,\ldots,x_n)=T(x_1',\ldots,x_n'). \]

We have obtained that all the samples that share the same value of \(\tilde T\) also share the same value of \(T\), that is, for each value \(\tilde t\) of \(\tilde T\), there exists a unique value \(\varphi(\tilde t)\), and therefore \(T=\varphi(\tilde T)\). This means that \(T\) is minimal sufficient.

Now we prove the reciprocal. Let \(T\) be a minimal sufficient statistic. Because of sufficiency, Theorem 3.7 ensures that, for two samples \((x_1,\ldots,x_n)\) and \((x_1',\ldots,x_n')\), the likelihood can be factorized as

\[\begin{align*} \mathcal{L}(\theta;x_1,\ldots,x_n)&=g(T(x_1,\ldots,x_n),\theta)h(x_1,\ldots,x_n),\\ \mathcal{L}(\theta;x_1',\ldots,x_n')&=g(T(x_1',\ldots,x_n'),\theta)h(x_1',\ldots,x_n'). \end{align*}\]

Then, if the likelihood ratio \[ \frac{\mathcal{L}(\theta;x_1,\ldots,x_n)}{\mathcal{L}(\theta;x_1',\ldots,x_n')} \]

depends on \(\theta\), then the ratio

\[ \frac{g(T(x_1,\ldots,x_n),\theta)}{g(T(x_1',\ldots,x_n'),\theta)} \]

also depends on \(\theta\). This implies that

\[ T(x_1,\ldots,x_n)\neq T(x_1',\ldots,x_n'). \]

On the other hand, consider another sufficient statistic \(\tilde T\). Again, because of sufficiency, if the ratio of likelihoods does not depend on \(\theta\), then

\[ \tilde T(x_1,\ldots,x_n)= \tilde T(x_1',\ldots,x_n'). \]

In addition, because \(T\) is minimal, it satisfies \(T=\varphi(\tilde T)\). Then, \[ T(x_1,\ldots,x_n)=\varphi(\tilde T(x_1,\ldots,x_n))=\varphi(\tilde T(x_1',\ldots,x_n'))=T(x_1',\ldots,x_n'). \]

**Example 3.25 **Let us find a minimal sufficient statistic for \(p\) in Example 3.20.

The ratio of likelihoods is

\[\begin{align*} \frac{\mathcal{L}(p;x_1,\ldots,x_n)}{\mathcal{L}(p;x_1',\ldots,x_n')}&=\frac{p^{\sum_{i=1}^n x_i}(1-p)^{n-\sum_{i=1}^n x_i}} {p^{\sum_{i=1}^n x_i'}(1-p)^{n-\sum_{i=1}^n x_i'}}\\ &=\frac{(1-p)^n \left(\frac{p}{1-p}\right)^{\sum_{i=1}^n x_i}} {(1-p)^n \left(\frac{p}{1-p}\right)^{\sum_{i=1}^n x_i'}}\\ &=\left(\frac{p}{1-p}\right)^{\sum_{i=1}^n x_i-\sum_{i=1}^n x_i'}. \end{align*}\]

The ratio is independent of \(p\) if and only if \(\sum_{i=1}^n x_i=\sum_{i=1}^n x_i'\). Therefore, \(T=\sum_{i=1}^n X_i\) is minimal sufficient for \(p\).

The *exponential family* is a family of probability distributions sharing a common structure which gives them excellent properties. In particular, minimal sufficient statistics for parameters of distributions within the exponential family are trivial to obtain.

**Definition 3.11 (Exponential family) **A r.v. \(X\) belongs to the (univariate) *exponential family* with parameter \(\theta\) if
its p.m.f. or p.d.f \(f(\cdot;\theta)\) can be expressed as
\[
f(x;\theta)=c(\theta)h(x)\exp\{w(\theta)t(x)\}.
\]

**Example 3.26 **Let us check that a r.v. \(X\sim \mathrm{Bin}(n,\theta)\) belongs to the
exponential family.

Writing the p.m.f. of the binomial as follows,

\[\begin{align*} p(x;\theta) &=\binom{n}{x} \theta^x (1-\theta)^{n-x}=(1-\theta)^n\binom{n}{x} \left(\frac{\theta}{1-\theta}\right)^x \\ &=(1-\theta)^n\binom{n}{x} \exp\left\{x\log\left(\frac{\theta}{1-\theta}\right)\right\}, \end{align*}\]

we can see that it has the shape of the exponential family.

**Example 3.27 **Let us check that a r.v. \(X\sim \Gamma(\theta,3)\) belongs to the exponential family.

Again, writing the p.d.f. of a gamma as \[ f(x;\theta)=\frac{1}{\Gamma(\theta)3^{\theta}}x^{\theta-1}e^{-x/3}=\frac{1}{\Gamma(\theta)3^{\theta}} e^{-x/3} \exp\{(\theta-1)\log x\}, \] it readily follows that it belongs to the exponential family.

**Example 3.28 **Let us see that a r.v. \(X\sim \mathcal{U}(0,\theta)\) does not belong to the exponential family.

The p.d.f. of a \(\mathcal{U}(0,\theta)\) is given by \[ f(x;\theta)=\left\{ \begin{array}{ll} 1/\theta & \text{if}\ x\in(0,\theta),\\ 0 & \text{if} \ x\notin (0,\theta), \end{array}\right. \] which can be expressed as \[ f(x;\theta)=\frac{1}{\theta}1_{(0,\theta)}(x). \]

Since the indicator is a function of \(x\) and \(\theta\) at the same time, and it is impossible to express it in terms of an exponential function, we conclude that \(X\) does not belong to the exponential family.

**Proposition 3.2 (Minimal sufficient statistics in the exponential family) **For the distributions within the exponential family with parameter \(\theta\), the statistic
\[
T(X_1,\ldots,X_n)=\sum_{i=1}^n t(X_i)
\]
is minimal sufficient.

*Proof* (Proof of Proposition 3.2). First, we prove that \(T(X_1,\ldots,X_n)=\sum_{i=1}^n t(X_i)\) is sufficient. The likelihood of a s.r.s. for distributions within the exponential family is given by

\[ \mathcal{L}(\theta;x_1,\ldots,x_n)=[c(\theta)]^n \prod_{i=1}^n h(x_i)\exp\left\{w(\theta)\sum_{i=1}^n t(x_i)\right\}. \]

Applying Theorem 3.7, we have that

\[ h(x_1,\ldots,x_n)=\prod_{i=1}^n h(x_i), \quad g(t,\theta) =[c(\theta)]^n\exp\left\{w(\theta)\sum_{i=1}^n t(x_i)\right\}, \]

and we can see that \(g(t,\theta)\) depends on the sample through \(\sum_{i=1}^n t(x_i)\). Therefore, \(T=\sum_{i=1}^n t(X_i)\) is sufficient for \(\theta\). To check that it is minimal sufficient, we apply Theorem 3.8:

\[\begin{align*} \frac{\mathcal{L}(\theta;x_1,\ldots,x_n)}{\mathcal{L}(\theta;x_1',\ldots,x_n')}&=\frac{[c(\theta)]^n \prod_{i=1}^n h(x_i)\exp\{w(\theta)\sum_{i=1}^n t(x_i)\}} {[c(\theta)]^n \prod_{i=1}^n h(x_i')\exp\{w(\theta)\sum_{i=1}^n t(x_i')\}} \\ & =\exp\left\{w(\theta)\left[T(x_1,\ldots,x_n)-T(x_1',\ldots,x_n')\right]\right\}\prod_{i=1}^n \frac{h(x_i)}{h(x_i')}. \end{align*}\]

The ratio is independent of \(\theta\) if and only if

\[ T(x_1,\ldots,x_n)=T(x_1',\ldots,x_n'). \]

**Example 3.29 **A minimal sufficient statistic for \(\theta\) in Example 3.26 is
\[
T=\sum_{i=1}^n X_i.
\]

**Example 3.30 **A minimal sufficient statistic for \(\theta\) in Example 3.27 is
\[
T=\sum_{i=1}^n \log X_i.
\]
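The characterization (3.3) can be checked numerically for the \(\Gamma(\theta,3)\) model of Example 3.27. In the sketch below, two samples with the same value of \(\sum_{i=1}^n \log x_i\) yield a log-likelihood difference that is constant in \(\theta\), whereas a sample with a different value does not. The function name and the samples are illustrative assumptions.

```python
import math

def gamma_loglik(theta, sample, scale=3.0):
    """Log-likelihood of the shape theta for a s.r.s. of a Gamma(theta, scale)."""
    n = len(sample)
    return (-n * (math.lgamma(theta) + theta * math.log(scale))
            + (theta - 1) * sum(math.log(x) for x in sample)
            - sum(sample) / scale)

x = (1.0, 4.0, 2.0)
x_same = (2.0, 2.0, 2.0)   # same sum of logs as x (both products equal 8)
x_diff = (1.0, 1.0, 2.0)   # different sum of logs

for theta in (0.5, 2.0, 7.0):
    # First difference is constant in theta; the second one changes with theta
    print(theta,
          gamma_loglik(theta, x) - gamma_loglik(theta, x_same),
          gamma_loglik(theta, x) - gamma_loglik(theta, x_diff))
```

The constant difference in the first case equals \(-(\sum_{i=1}^n x_i-\sum_{i=1}^n x_i')/3=-1/3\), which comes only from the factor \(\prod_{i=1}^n h(x_i)/h(x_i')\).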

## 3.6 Uniformly minimum-variance unbiased estimators

In this section, we focus on estimators \(\hat\theta\) of \(\theta\in\Theta\) that are unbiased and have finite variance, that is, such that

\[\begin{align} \mathbb{E}[\hat\theta]=\theta,\quad \mathbb{E}[\hat\theta^2]<\infty, \quad \forall \theta\in \Theta.\tag{3.4} \end{align}\]

**Definition 3.12 (UMVUE) **An estimator \(\hat\theta\) of \(\theta\) is a *Uniformly Minimum-Variance Unbiased Estimator* (UMVUE) if it is unbiased and, among the set of unbiased estimators that satisfy (3.4), has the *minimum variance* for *any* value of the parameter \(\theta\in \Theta\), that is
\[
\mathbb{V}\mathrm{ar}[\hat\theta]\leq \mathbb{V}\mathrm{ar}[\hat\theta'], \quad
\ \forall \hat\theta', \forall \theta\in \Theta.
\]

**Theorem 3.9 (Rao–Blackwell’s Theorem) **Let \(T\) be a sufficient statistic for \(\theta\). Let \(\hat\theta\) be an unbiased estimator of \(\theta\). Then, the estimator
\[
\hat\theta'=\mathbb{E}[\hat\theta|T]
\]
verifies:

- \(\hat\theta'\) does not depend on \(\theta\) (that is, it is a statistic).
- \(\mathbb{E}[\hat\theta']=\theta\), \(\forall \theta\in \Theta\).
- \(\mathbb{V}\mathrm{ar}[\hat\theta']\leq \mathbb{V}\mathrm{ar}[\hat\theta]\), \(\forall\theta\in\Theta\).

In addition, \(\mathbb{V}\mathrm{ar}[\hat\theta']= \mathbb{V}\mathrm{ar}[\hat\theta]\) if and only if \(\mathbb{P}(\hat\theta'=\hat\theta)=1\), \(\forall \theta \in \Theta\).

Observe that the new estimator \(\hat\theta'\) depends on the sample through the sufficient statistic \(T\). Therefore, if the UMVUE exists, it is a function of some sufficient statistic, in particular, of the minimal sufficient statistic.

**Example 3.31 **Let \((X_1,\ldots,X_n)\) be a s.r.s. of \(X\sim \mathrm{Pois}(\lambda)\), that is, with p.m.f.

\[ p(x;\lambda)=\frac{\lambda^x e^{-\lambda}}{x!}, \quad x=0,1,\ldots, \ \lambda>0. \]

Consider the parameter \(\theta=p(0;\lambda)=e^{-\lambda}\). From the previous theorem, we can obtain the UMVUE of \(\theta\).

First, we need an unbiased estimator, for example:

\[ \hat\theta=\left\{\begin{array}{ll} 1 & \text{if} \ X_1=0,\\ 0 & \text{if} \ X_1\neq 0, \end{array}\right. \]

since

\[ \mathbb{E}[\hat\theta]=1\,\mathbb{P}(X_1=0)+ 0\, \mathbb{P}(X_1\neq 0)=\mathbb{P}(X_1=0)=e^{-\lambda}. \]

Now we need a sufficient statistic. Writing the p.m.f. of the Poisson in the form of the exponential family, we obtain \[ p(x;\lambda)=e^{-\lambda}\frac{1}{x!}e^{x\log{\lambda}} \] and therefore \(T(X_1,\ldots,X_n)=\sum_{i=1}^n X_i\) is sufficient for \(\lambda\) and also for \(e^{-\lambda}\).

Then, we can Rao–Blackwellize \(\hat\theta\) by means of the estimator \(\hat\theta'=\mathbb{E}[\hat\theta|T]\):

\[\begin{align*} \hat\theta' &=\mathbb{E}[\hat\theta|T]\\ &=1\, \mathbb{P}\left(X_1=0\Bigg|\sum_{i=1}^n X_i=t\right)+0\,\mathbb{P}\left(X_1\neq 0\Bigg|\sum_{i=1}^n X_i=t\right) \\ &=\frac{\mathbb{P}(X_1=0,\sum_{i=1}^n X_i=t)}{\mathbb{P}(\sum_{i=1}^n X_i=t)}\\ &=\frac{\mathbb{P}(X_1=0)\mathbb{P}(\sum_{i=2}^n X_i=t)}{\mathbb{P}(\sum_{i=1}^n X_i=t)}. \end{align*}\]

Now, it holds that, if \(X_i\sim \mathrm{Pois}(\lambda)\), \(i=1,\ldots,n\), are independent, then (see Exercise 1.16)

\[ \sum_{i=1}^n X_i \sim \mathrm{Pois}(n\lambda). \]

Therefore,

\[ \hat\theta' =\frac{e^{-\lambda}[(n-1)\lambda]^t e^{-(n-1)\lambda}/t!}{(n\lambda)^t e^{-n\lambda}/t!}=\left(\frac{n-1}{n}\right)^t. \]

Then, we have obtained the estimator

\[ \hat\theta'=\mathbb{E}[\hat\theta|T]=\left(\frac{n-1}{n}\right)^T, \]

which is unbiased, and whose variance is smaller than the one of \(\hat\theta\). Indeed,

\[\begin{align*} \mathbb{E}\left[\hat\theta'\right]&=\sum_{x=0}^{\infty} \left(\frac{n-1}{n}\right)^x \frac{e^{-n\lambda}(n\lambda)^x}{x!}\\ &=e^{-n\lambda}\sum_{x=0}^{\infty} \frac{(n-1)^x \lambda^x}{x!}\\ &=e^{-n\lambda}e^{(n-1)\lambda}\\ &=e^{-\lambda}=\theta. \end{align*}\]

Therefore, \(\hat\theta'\) is unbiased. We compute its variance. For that, in the first place, we compute

\[\begin{align*} \mathbb{E}[\hat\theta'^2] &=\sum_{t=0}^{\infty} \left(\frac{n-1}{n}\right)^{2t} \frac{e^{-n\lambda} (n\lambda)^t}{t!}\\ &=e^{-n\lambda}\sum_{t=0}^{\infty} \left(\frac{(n-1)^2\lambda}{n}\right)^t \frac{1}{t!} \\ &=e^{-n\lambda}e^{(n-1)^2\lambda/n}\\ &=e^{-2\lambda+\lambda/n}. \end{align*}\]

Then, the variance is

\[ \mathbb{V}\mathrm{ar}[\hat\theta']=\mathbb{E}[\hat\theta'^2]-\mathbb{E}^2[\hat\theta']=e^{-2\lambda+\lambda/n}-e^{-2\lambda} =e^{-2\lambda}(e^{\lambda/n}-1). \]

We calculate the variance of \(\hat\theta\) for comparison:

\[ \mathbb{E}[\hat\theta^2]=1\,\mathbb{P}(X_1=0)=\mathbb{E}[\hat\theta]=e^{-\lambda}. \]

As a consequence,

\[ \mathbb{V}\mathrm{ar}[\hat\theta]=e^{-\lambda}-e^{-2\lambda}=e^{-\lambda}(1-e^{-\lambda}). \]

Therefore,

\[ \frac{\mathbb{V}\mathrm{ar}[\hat\theta']}{\mathbb{V}\mathrm{ar}[\hat\theta]} =\frac{e^{-2\lambda}(e^{\lambda/n}-1)}{e^{-\lambda}(1-e^{-\lambda})} =\frac{e^{\lambda/n}-1}{e^{\lambda}-1}<1, \ \forall n\geq 2. \] For \(n=1\) the ratio equals \(1\), since then \(\hat\theta'=\hat\theta\).
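The variance reduction in Example 3.31 can be observed by simulation. The sketch below assumes \(\lambda=1\) and \(n=5\), and draws Poisson variates with Knuth's multiplication method because the Python standard library has no Poisson sampler; the function names, the seed, and the number of replicates are illustrative assumptions. It compares the Monte Carlo mean and variance of \(\hat\theta\) and \(\hat\theta'\) with the closed-form expressions derived in the example.

```python
import math
import random

def poisson_draw(rng, lam):
    """One Pois(lam) draw by Knuth's multiplication method (adequate for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def simulate(lam=1.0, n=5, replicates=20000, seed=2):
    """Replicates of the raw estimator 1{X_1 = 0} and of ((n-1)/n)^T."""
    rng = random.Random(seed)
    raw, rb = [], []
    for _ in range(replicates):
        sample = [poisson_draw(rng, lam) for _ in range(n)]
        raw.append(1.0 if sample[0] == 0 else 0.0)
        rb.append(((n - 1) / n) ** sum(sample))
    return raw, rb

def mean_var(values):
    """Sample mean and quasivariance of a list of values."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / (len(values) - 1)
    return m, v

lam, n = 1.0, 5
raw, rb = simulate(lam, n)
print("raw:", mean_var(raw), "theory:", (math.exp(-lam), math.exp(-lam) * (1 - math.exp(-lam))))
print("R-B:", mean_var(rb), "theory:", (math.exp(-lam), math.exp(-2 * lam) * (math.exp(lam / n) - 1)))
```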

## 3.7 Efficient estimators

**Definition 3.13 (Fisher information) **Let \(X\) be a continuous r.v. with distribution depending on a parameter \(\theta\in \Theta\subset \mathbb{R}\) and with p.d.f. \(f(\cdot;\theta)\). The *Fisher information* of \(X\) about \(\theta\) is defined as

\[ \mathcal{I}(\theta)=\mathbb{E}\left[\left(\frac{\partial \log f(X;\theta)}{\partial \theta}\right)^2\right]. \]

When \(X\) is discrete, the Fisher information is defined analogously by just replacing the p.d.f. \(f(\cdot;\theta)\) by the p.m.f. \(p(\cdot;\theta)\).

Observe that the quantity

\[\begin{align} \frac{\partial \log f(x;\theta)}{\partial \theta} =\frac{1}{f(x;\theta)}\frac{\partial f(x;\theta)}{\partial \theta} \tag{3.5} \end{align}\]

is the relative rate of variation of \(f\) for infinitesimal variations of \(\theta\), for the realization \(x\) of the r.v. \(X\). Therefore, (3.5) represents the information contained in \(x\) for discriminating \(\theta\) from a close value \(\theta+h\). When taking the expectation, we obtain the average information of \(X\) that is useful to distinguish \(\theta\) from close values. This quantity is related with the precision of an unbiased estimator of \(\theta\).

**Example 3.32 **Compute the Fisher information of a r.v. \(X\sim \mathrm{Pois}(\lambda)\).

The Poisson p.m.f. is given by

\[ p(x;\lambda)=\frac{\lambda^x e^{-\lambda}}{x!}, \quad x=0,1,2,\ldots, \]

so its logarithm is

\[ \log{p(x;\lambda)}=x\log{\lambda}-\lambda-\log{(x!)} \]

and its derivative is \[ \frac{\partial \log p(x;\lambda)}{\partial \lambda}=\frac{x}{\lambda}-1. \]

The Fisher information is then obtained by taking the expectation of \[ \left(\frac{\partial \log p(X;\lambda)}{\partial \lambda}\right)^2=\left(\frac{X-\lambda}{\lambda}\right)^2. \]

Noting that \(\mathbb{E}[X]=\lambda\) and, therefore, \(\mathbb{E}[(X-\lambda)^2]=\mathbb{V}\mathrm{ar}[X]=\lambda\), we obtain

\[ \mathcal{I}(\lambda)=\mathbb{E}\left[\left(\frac{X-\lambda}{\lambda}\right)^2\right] =\frac{\mathbb{V}\mathrm{ar}[X]}{\lambda^2}=\frac{1}{\lambda}. \]
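The result \(\mathcal{I}(\lambda)=1/\lambda\) can be verified numerically by evaluating the expectation in the definition as a sum over the support. The sketch below truncates the series at a hypothetical cutoff `x_max`, which is an assumption that is harmless for moderate \(\lambda\) because the Poisson tail is negligible there; the function names are also illustrative.

```python
import math

def poisson_pmf(x, lam):
    """Poisson p.m.f., computed on the log scale for numerical stability."""
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))

def fisher_information_poisson(lam, x_max=200):
    """E[(X / lam - 1)^2] for X ~ Pois(lam), with the series truncated at x_max."""
    return sum((x / lam - 1.0) ** 2 * poisson_pmf(x, lam) for x in range(x_max + 1))

for lam in (0.5, 2.5, 10.0):
    print(lam, fisher_information_poisson(lam), 1 / lam)
```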

**Theorem 3.10 (Fréchet–Cramér–Rao’s inequality) **Let \((X_1,\ldots,X_n)\) be a s.r.s. of a r.v. with p.d.f. \(f(x;\theta)\), and let \(\mathcal{I}_n(\theta)\) be the Fisher information of the sample \((X_1,\ldots,X_n)\) about \(\theta\). If \(\hat\theta=\hat\theta(X_1,\ldots,X_n)\) is an unbiased estimator of \(\theta\) then, under certain general conditions, it holds
\[
\mathbb{V}\mathrm{ar}[\hat\theta]\geq \mathcal{I}_n^{-1}(\theta).
\]

**Definition 3.14 (Efficient estimator) **An unbiased estimator \(\hat\theta\) of \(\theta\) that verifies \(\mathbb{V}\mathrm{ar}[\hat\theta]=\mathcal{I}_n^{-1}(\theta)\) is said to be *efficient*.

*Remark.* Of course, an efficient estimator is UMVUE. However, a UMVUE might not attain the Fréchet–Cramér–Rao lower bound \(\mathcal{I}_n^{-1}(\theta)\) and therefore is not necessarily efficient.

**Example 3.33** Show that for a r.v. \(\mathrm{Pois}(\lambda)\) the estimator \(\hat\lambda=\bar{X}\) is efficient.

We first calculate the Fisher information \(\mathcal{I}_n(\lambda)\) of the sample \((X_1,\ldots,X_n)\). The joint p.m.f. of \((X_1,\ldots,X_n)\) is

\[ p(x_1,\ldots,x_n;\lambda)=\frac{\lambda^{\sum_{i=1}^n x_i} e^{-n\lambda}}{\prod_{i=1}^n x_i!}, \]

and then its logarithm is

\[ \log p(x_1,\ldots,x_n;\lambda)=\sum_{i=1}^n x_i\log \lambda-n\lambda-\sum_{i=1}^n\log x_i!. \]

Differentiating with respect to \(\lambda\), we get

\[ \frac{\partial}{\partial \lambda}\log p(x_1,\ldots,x_n;\lambda) =\frac{\sum_{i=1}^n x_i}{\lambda}-n. \]

Therefore,

\[\begin{align*} \mathcal{I}_n(\lambda)&=\mathbb{E}\left[\left(\frac{\sum_{i=1}^n X_i-n\lambda}{\lambda}\right)^2\right]=\frac{1}{\lambda^2}\mathbb{V}\mathrm{ar}\left[\sum_{i=1}^n X_i\right]\\ &=\frac{n\mathbb{V}\mathrm{ar}[X]}{\lambda^2}=\frac{n\lambda}{\lambda^2}=\frac{n}{\lambda}. \end{align*}\]

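On the other hand, the variance of \(\hat\lambda=\bar{X}\) is

\[ \mathbb{V}\mathrm{ar}[\hat\lambda]=\frac{1}{n^2}\mathbb{V}\mathrm{ar}\left[\sum_{i=1}^n X_i\right]=\frac{n\lambda}{n^2}=\frac{\lambda}{n}. \]

Since \(\bar{X}\) is unbiased for \(\lambda\) and \(\mathbb{V}\mathrm{ar}[\hat\lambda]=\lambda/n=\mathcal{I}_n^{-1}(\lambda)\), the estimator \(\hat\lambda=\bar{X}\) is efficient.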
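The conclusion of Example 3.33 can be illustrated by simulation. The sketch below is only an illustration under stated assumptions: Poisson draws are generated with Knuth's multiplication method (adequate for small \(\lambda\)), and the values of \(\lambda\), \(n\), the number of replicates, and the seed are arbitrary choices that do not come from the text.

```python
import math
import random
import statistics

def rpois(lam: float, rng: random.Random) -> int:
    """Draw one Pois(lam) value with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def simulate_var_of_mean(lam: float, n: int, reps: int, seed: int = 1) -> float:
    """Monte Carlo estimate of Var[X-bar] for samples of size n from Pois(lam)."""
    rng = random.Random(seed)
    means = [sum(rpois(lam, rng) for _ in range(n)) / n for _ in range(reps)]
    return statistics.variance(means)

lam, n = 3.0, 10  # hypothetical values chosen for the illustration
print("simulated variance of the sample mean:", simulate_var_of_mean(lam, n, reps=20000))
print("lower bound lambda / n:", lam / n)
```

The simulated variance should be close to \(\lambda/n\), the inverse of \(\mathcal{I}_n(\lambda)=n/\lambda\), up to Monte Carlo error.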

## 3.8 Robust estimators

An estimator \(\hat{\theta}\) of the parameter \(\theta\) associated with a r.v. with p.d.f. \(f(\cdot;\theta)\) is *robust* if it preserves good properties (small bias and variance) even if the model suffers from small *contamination*, that is, if the assumed density \(f(\cdot;\theta)\) is just an approximation of the true density due to the presence of observations coming from other distributions in the data.

The theory of statistical robustness is deep and has a broad toolbox of methods aimed at different contexts. In this section, we just provide some ideas for robust estimation of the mean \(\mu\) and the standard deviation \(\sigma\) of a population. For that, we consider the following widely-used contamination model for \(f(\cdot;\theta)\): \[ (1-\varepsilon)f(x;\theta)+\varepsilon g(x), \quad x\in \mathbb{R}, \] for a small \(0<\varepsilon<0.5\) and an arbitrary p.d.f. \(g\).

The next example shows that the sample mean is *not* robust for this kind of contamination.

**Example 3.34 (\(\bar{X}\) is not robust for \(\mu\))** Let \(f(\cdot;\theta)\) be the p.d.f. of a \(\mathcal{N}(\mu,\sigma^2)\), with unknown vector of parameters \(\theta=(\mu,\sigma^2)'\) and \((X_1,\ldots,X_n)\) a s.r.s. of that distribution. In absence of contamination, we know that the variance of the sample mean is \(\mathbb{V}\mathrm{ar}_\theta(\bar{X})=\sigma^2/n\) and therefore the sample mean \(\bar{X}\) is efficient. However, if we now contaminate \(f(\cdot;\theta)\) by the p.d.f. \(g\) of a \(\mathcal{N}(\mu, c^2\sigma^2)\), where \(c>0\) is a constant, then both components of the mixture share the mean \(\mu\), the variance of each observation is the weighted sum of the two variances, and the variance of the sample mean \(\bar{X}\) under the contamination model becomes

\[ \mathbb{V}\mathrm{ar}_{\theta,\varepsilon,c}(\bar{X})=\frac{(1-\varepsilon)\sigma^2+\varepsilon c^2\sigma^2}{n}=\frac{\sigma^2}{n}(1+\varepsilon [c^2-1]), \]

which can be very different from \(\sigma^2/n\). For example, for \(c=5\) and \(\varepsilon=0.01\), the factor is \((1+\varepsilon [c^2-1])=1.24\), a \(24\%\) increase in variance caused by only \(1\%\) of contamination. In addition, \(\lim_{c\to \infty} \mathbb{V}\mathrm{ar}_{\theta,\varepsilon,c}(\bar{X})=\infty\), for all \(\varepsilon>0\). Therefore, \(\bar{X}\) is not robust.
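To get a feel for how quickly the factor \(1+\varepsilon[c^2-1]\) grows, it can be tabulated for several values of \(c\). This is a small sketch, assuming the contaminating normal component has standard deviation \(c\sigma\) (so that \(c^2\) multiplies \(\sigma^2\), as in the displayed formula); the function name is hypothetical.

```python
def variance_inflation(eps: float, c: float) -> float:
    """Factor 1 + eps * (c^2 - 1) multiplying the uncontaminated variance."""
    return 1.0 + eps * (c ** 2 - 1.0)

# With 1% contamination, the factor grows quadratically in c.
for c in (2, 5, 10, 100):
    print(c, variance_inflation(0.01, c))
```

For \(c=5\) and \(\varepsilon=0.01\) the function reproduces the value \(1.24\) of the example, and the output makes visible that the factor is unbounded in \(c\) for any fixed \(\varepsilon>0\).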

The concept of *outlier* is intimately related to robustness. Outliers are “abnormal” observations in the sample that seem very unlikely for the assumed distribution model or are remarkably different from the rest of sample observations. Outliers can be caused by measurement errors, exceptional circumstances, changes in the data generating process, etc.

There are two main approaches for preventing outliers or contamination from undermining the estimation of \(\theta\):

- Detect the outliers through a diagnosis of the model fit and re-estimate the model once the outliers have been removed.
- Employ a robust estimator.

The first approach is the traditional one and is still popular due to its simplicity. Besides, it allows us to employ non-robust efficient estimators that tend to be simpler to compute, provided the data have been cleaned adequately. However, this procedure may quickly run into problems, since, for example, detecting outliers in higher dimensions is usually complicated and this detection may require manual inspection of the data.

In addition, robust estimators may be needed even when performing the first approach, as the following example illustrates. A simple rule to detect outliers in a normal population is to flag as outliers the observations that lie further away than \(3\sigma\) from the mean \(\mu\), since those observations are highly extreme. Since the probability of such an observation is \(0.0027\), we expect to flag as an outlier about \(1\) out of every \(370\) observations even if the data come from a perfectly valid normal population. However, applying this procedure entails first estimating \(\mu\) and \(\sigma\) from the data. But the conventional estimators, the sample mean and variance, are themselves very sensitive to outliers, and therefore their resulting values may mask the existence of outliers. Therefore, it is better to rely on a robust estimator, which brings us back to the second approach. As a consequence, it is sometimes preferred to employ robust estimators from the beginning.
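The probability quoted for the \(3\sigma\) rule can be reproduced with the standard normal tail. The sketch below relies on the identity \(\mathbb{P}(|Z|>k)=\mathrm{erfc}(k/\sqrt{2})\) for \(Z\sim\mathcal{N}(0,1)\); the function name is an illustrative choice.

```python
import math

def two_sided_tail(k: float) -> float:
    """P(|Z| > k) for Z ~ N(0, 1), via the complementary error function."""
    return math.erfc(k / math.sqrt(2.0))

p = two_sided_tail(3.0)
print(p)        # probability of lying beyond 3 sigma, about 0.0027
print(1.0 / p)  # expected number of observations per flagged one, about 370
```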

The next definition introduces a simple measure of the robustness of an estimator.

**Definition 3.15 (Finite-sample breakdown point)** For a realized s.r.s. \(\mathbf{x}=(x_1,\ldots,x_n)'\) and an integer \(m\) with \(0\leq m\leq n\), let us define the set of samples that differ from \(\mathbf{x}\) in exactly \(m\) observations as
\[
U_m({\mathbf{x}})=\{\mathbf{y}=(y_1,\ldots,y_n)'\in\mathbb{R}^n : |\{i: x_i\neq y_i\}|=m\}.
\]
The *maximum change* of an estimator \(\hat{\theta}\) when \(m\) observations are contaminated is
\[
A(\mathbf{x},m)=\sup_{\mathbf{y}\in U_m({\mathbf{x}})}|\hat{\theta}(\mathbf{y})-\hat{\theta}(\mathbf{x})|,
\]
and the *breakdown point* of \(\hat{\theta}\) for the sample \(\mathbf{x}\) is defined as
\[
\max\left\{\frac{m}{n}:A(\mathbf{x},m)<\infty\right\}.
\]

The breakdown point of an estimator \(\hat{\theta}\) can be interpreted as the maximum fraction of the sample that can be changed without being able to drive the value of \(\hat{\theta}\) arbitrarily far away.

**Example 3.35** It can be seen that:

- The breakdown point of the sample mean is \(0\).
- The breakdown point of the sample median is \(\lfloor n/2\rfloor/n\), with \(\lfloor n/2\rfloor/n\to 0.5\) as \(n\to\infty\).
- The breakdown point of the sample variance (and of the standard deviation) is \(0\).
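The different behavior of the mean and the median can be illustrated by replacing observations with increasingly large values. This is only an illustration with a hypothetical sample of odd size \(n=7\): it explores one particular way of contaminating the sample, not the supremum over all contaminated samples in Definition 3.15.

```python
import statistics

def contaminate(x, m, value):
    """Return a copy of x with its last m observations replaced by value."""
    return list(x[: len(x) - m]) + [value] * m

x = [2.1, 3.4, 1.9, 2.8, 3.0, 2.5, 2.2]  # hypothetical sample, n = 7

# One contaminated observation: the mean follows it, the median does not.
for value in (1e3, 1e6, 1e9):
    y = contaminate(x, 1, value)
    print(value, statistics.mean(y), statistics.median(y))

# The median resists up to floor(7 / 2) = 3 replaced observations, but not 4.
for m in (3, 4):
    print(m, statistics.median(contaminate(x, m, 1e9)))
```

With a single replaced observation the mean grows without bound as the replaced value grows, whereas the median stays within the range of the original data until more than three of the seven observations are replaced.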

The so-called *trimmed means* defined below form a popular class of robust estimators for \(\mu\) that generalizes the mean and the median in a very intuitive way.

**Definition 3.16 (Trimmed mean)** Let \((X_1,\ldots, X_n)\) be a s.r.s. The *\(\alpha\)-trimmed mean* at level \(0\leq \alpha\leq 0.5\) is defined as \[ T_{\alpha}=\frac{1}{n-2m(\alpha)}\sum_{i=m(\alpha)+1}^{n-m(\alpha)}X_{(i)}, \] where \(m(\alpha)=\lfloor n\cdot \alpha\rfloor\) is the number of trimmed observations at each extreme.

Observe that \(\alpha=0\) corresponds to the sample mean and \(\alpha=0.5\) to the sample median. The next result reveals that the breakdown point of the trimmed mean is approximately equal to \(\alpha>0\), which is larger than that of the sample mean. Of course, this gain in robustness comes at the expense of a moderate loss of efficiency in the form of an increased variance, which in a normal population is an increase of about \(6\%\) when \(\alpha=0.10\).

**Proposition 3.3 (Properties of the trimmed mean)**

- For symmetric distributions, \(T_{\alpha}\) is unbiased for \(\mu\).
- The breakdown point of \(T_{\alpha}\) is \(m(\alpha)/n\), with \(m(\alpha)/n\to \alpha\) as \(n\to\infty\).
- For \(X\sim \mathcal{N}(\mu,\sigma^2)\), \(\mathbb{V}\mathrm{ar}(T_{0.1})\cong1.06\cdot \sigma^2/n\) for large \(n\).
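Definition 3.16 translates almost directly into code. The following is a minimal sketch: the function name and the sample are hypothetical, and the sketch raises an error when trimming leaves no observations (which happens for \(\alpha=0.5\) and even \(n\), where \(n-2m(\alpha)=0\)).

```python
import math

def trimmed_mean(x, alpha: float) -> float:
    """alpha-trimmed mean: drop m = floor(n * alpha) order statistics at each end."""
    if not 0.0 <= alpha <= 0.5:
        raise ValueError("alpha must lie in [0, 0.5]")
    n = len(x)
    m = math.floor(n * alpha)
    kept = sorted(x)[m : n - m]  # order statistics X_(m+1), ..., X_(n-m)
    if not kept:
        raise ValueError("no observations left after trimming")
    return sum(kept) / len(kept)

x = [2.1, 3.4, 1.9, 2.8, 3.0, 2.5, 2.2, 2.7, 3.1, 250.0]  # one gross outlier
print(trimmed_mean(x, 0.0))  # plain sample mean, dragged up by 250.0
print(trimmed_mean(x, 0.1))  # one observation trimmed at each end
```

In this sample the single outlier is discarded as soon as \(m(\alpha)\geq 1\), so the \(0.1\)-trimmed mean stays close to the bulk of the data while the sample mean does not.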

Another well-known class of robust estimators for the population mean is the class of \(M\)-estimators.

**Definition 3.17 (\(M\)-estimator for \(\mu\))** An \(M\)-estimator for \(\mu\) is a statistic \(\tilde{\mu}\) based on the s.r.s. \((X_1,\ldots,X_n)\) that satisfies
\[
\tilde{\mu}=\arg \min_{a} \sum_{i=1}^n \rho\left(\frac{X_i-a}{\hat{s}}\right),
\]
where \(\hat{s}\) is a robust estimator of the standard deviation (so that \(\tilde{\mu}\) is scale-invariant) and \(\rho\) is the *objective function*, which satisfies the following properties:

- \(\rho\) is always nonnegative: \(\rho(x)\geq 0\), \(\forall x\in\mathbb{R}\).
- \(\rho(0)=0\).
- \(\rho\) is symmetric: \(\rho(x)=\rho(-x)\), \(\forall x\in\mathbb{R}\).
- \(\rho\) is monotone nondecreasing on \([0,\infty)\): \(0\leq x\leq x'\implies \rho(x)\leq \rho(x')\).

**Example 3.36** The sample mean is the least squares estimator of the mean, that is,

\[ \bar{X}= \arg \min_{a} \sum_{i=1}^n (X_i-a)^2, \]

and therefore it is an \(M\)-estimator with \(\rho(x)=x^2\).

Analogously, the sample median minimizes the sum of absolute distances,

\[ \text{med}\{X_1,\ldots,X_n\}= \arg \min_{a} \sum_{i=1}^n |X_i-a|, \]

and hence it is an \(M\)-estimator with \(\rho(x)=|x|\).

A popular objective function is Huber’s rho function, \[ \rho_c(d)=\left\{\begin{array}{ll} 0.5d^2 & \text{if $|d|\leq c$},\\ c|d|-0.5c^2 & \text{if $|d|>c$}, \end{array}\right. \] for any constant \(c>0\). For small distances, \(\rho_c\) employs quadratic distances, as in the case of the sample mean. For large absolute distances (that are more influential), it employs absolute distances, as the sample median does.
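An \(M\)-estimator based on Huber's \(\rho_c\) has no closed form, but it can be approximated iteratively. The sketch below is one possible scheme (iteratively reweighted averaging), not a prescribed algorithm from the text. It makes several assumptions: the tuning constant \(c=1.345\) is a commonly used default, \(\hat{s}\) is taken as the MAD of Definition 3.18 with constant \(1.4826\), and the function name, tolerance, and sample are hypothetical.

```python
import statistics

def huber_location(x, c: float = 1.345, tol: float = 1e-10, max_iter: int = 200) -> float:
    """Huber M-estimate of location via iteratively reweighted averaging."""
    med = statistics.median(x)
    # Robust scale (assumption): MAD with the normal-consistency constant 1.4826.
    s = 1.4826 * statistics.median([abs(xi - med) for xi in x])
    if s == 0:
        return med  # degenerate sample: at least half of the values coincide
    a = med
    for _ in range(max_iter):
        # Weight 1 where rho_c is quadratic, c * s / |x_i - a| where it is linear.
        w = [1.0 if abs(xi - a) <= c * s else c * s / abs(xi - a) for xi in x]
        a_new = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
        if abs(a_new - a) < tol:
            return a_new
        a = a_new
    return a

x = [2.1, 3.4, 1.9, 2.8, 3.0, 2.5, 2.2, 2.7, 3.1, 250.0]  # one gross outlier
print(statistics.mean(x), statistics.median(x), huber_location(x))
```

Observations within \(c\hat{s}\) of the current estimate are averaged as in the sample mean, while farther ones are downweighted, so the outlier has a bounded influence on the result.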

Finally, a robust alternative for estimating \(\sigma\) is the *median absolute deviation*.

**Definition 3.18 (Median absolute deviation)** The *Median Absolute Deviation* (MAD) of a s.r.s. \((X_1,\ldots,X_n)\) is defined as
\[
\text{MAD}(X_1,\ldots,X_n)=c_n\cdot \text{med}\{|X_1-\text{med}\{X_1,\ldots,X_n\}|,\dots,|X_n-\text{med}\{X_1,\ldots,X_n\}|\},
\]
where \(\text{med}\{X_1,\ldots,X_n\}\) stands for the median of the s.r.s. and \(c_n\) is a correction factor that makes the MAD centered for \(\sigma\) in a normal population. For large \(n\), \(c_n\) is approximately the constant \(c\) determined by
\[
\mathbb{P}(|X-\mu|\le \sigma/c)=0.5\iff c=1/0.675\cong 1.48.
\]
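The MAD of Definition 3.18 and its use in the \(3\sigma\) rule discussed earlier in this section can be sketched as follows. The sketch assumes the large-sample constant \(c\) in place of \(c_n\); the function names and the sample are hypothetical.

```python
import statistics

# Constant c solving P(|X - mu| <= sigma / c) = 0.5 in a normal population.
C_NORMAL = 1.0 / statistics.NormalDist().inv_cdf(0.75)

def mad(x, c: float = C_NORMAL) -> float:
    """Scaled median of the absolute deviations from the sample median."""
    med = statistics.median(x)
    return c * statistics.median([abs(xi - med) for xi in x])

def flag_outliers(x, center: float, scale: float, k: float = 3.0):
    """Observations lying further than k * scale from center."""
    return [xi for xi in x if abs(xi - center) > k * scale]

x = [2.1, 3.4, 1.9, 2.8, 3.0, 2.5, 2.2, 2.7, 3.1, 250.0]  # one gross outlier
print(C_NORMAL)
print(flag_outliers(x, statistics.mean(x), statistics.stdev(x)))  # classical rule
print(flag_outliers(x, statistics.median(x), mad(x)))             # robust rule
```

In this sample the outlier inflates the sample mean and standard deviation so much that the classical rule does not flag it, whereas the rule based on the median and the MAD does.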

## Exercises

**Exercise 3.1** We have a s.r.s. \((X_1,\ldots,X_n)\) of size \(n\) from a population with mean \(\mu\) and variance \(\sigma^2\).

- Prove that \(\sum_{i=1}^{n}a_{i}X_{i}\) is an unbiased estimator of \(\mu\) if \(\sum_{i=1}^{n}a_{i}=1\).
- Among all the unbiased estimators of this form, find the one with minimum variance and compute its variance.

**Exercise 3.2** Let \((X_1,\ldots,X_n)\) be a s.r.s. of a r.v. with \(\mathrm{Pois}(\lambda)\) distribution. Check that
\[
\frac{\bar{X}-\lambda}{\sqrt{\bar{X}/n}}\stackrel{d}{\longrightarrow}
\mathcal{N}(0,1).
\]

**Exercise 3.3** Let \((X_1,\ldots,X_n)\) be a s.r.s. of a r.v. following a geometric distribution with parameter \(\theta\), that is, with p.m.f.
\[
p(x;\theta)=\theta (1-\theta )^{x-1},\qquad
x=1,2,\ldots,\qquad \theta \in \left(0, 1\right).
\]
Prove that \(\sum_{i=1}^{n}X_{i}\) is sufficient for \(\theta\).

**Exercise 3.4** Let \((X_{1},\ldots ,X_{n})\) be a s.r.s. from a population with p.d.f. \[ f\left(x;\theta \right) =\frac{x}{\theta }e^{-\frac{x^{2}}{\theta } }1_{\{ 0 <x<\infty\}}. \] Show that \(\sum_{i=1}^{n}X_{i}^{2}\) is sufficient for \(\theta\) and that \(\sum_{i=1}^{n}X_{i}\) is not.

**Exercise 3.5** Let \((X_{1},\ldots,X_{n_{1}})\) be a s.r.s. of \(X\sim \mathcal{N}(\mu_{1},\sigma^2)\) and let \((Y_{1},\ldots ,Y_{n_{2}})\) be a s.r.s. of \(Y\sim \mathcal{N}(\mu_{2},\sigma^2)\), with \(X\) and \(Y\) independent. As an estimator of \(\sigma^2\), we consider a linear combination of the sample quasi-variances \(S_{1}'^{2}\) and \(S_{2}'^{2}\), that is, \(\lambda S_{1}'^{2} + (1-\lambda)S_{2}'^{2}\), for \(0\leq \lambda \leq 1\).

- Prove that this estimator is unbiased for any value of \(\lambda\).
- Obtain the value of \(\lambda\) that provides the most efficient estimator.

**Exercise 3.6** Let \((X_{1},\ldots,X_{n_{1}})\) be a s.r.s. of \(X\) with \(\mathbb{E}[X]=\mu_{1}\) and \(\mathbb{V}\mathrm{ar}[X]=\sigma^{2}\), and let \((Y_{1},\ldots ,Y_{n_{2}})\) be a s.r.s. of \(Y\) with \(\mathbb{E}[Y]=\mu_{2}\) and \(\mathbb{V}\mathrm{ar}[Y]=\sigma^2\), with \(X\) and \(Y\) independent.

- Prove that \(S^2=\lambda S_{1}'^2+(1-\lambda)S_{2}'^2\) is a consistent estimator of \(\sigma^2\), for any \(\lambda\in(0,1)\).
- Show that \[ \frac{\bar{X}-\bar Y-(\mu_1-\mu_2)}{S\sqrt{{\frac{1}{n_1}}+{\frac{1}{n_2}}}} \stackrel{d}{\longrightarrow} \mathcal{N}(0,1). \]

**Exercise 3.7** The number of independent arrivals to an emergency service during a day follows a \(\mathrm{Pois}(\theta)\) distribution. In order to estimate \(\theta\) and forecast the amount of required personnel in the service, we observe the arrivals during \(n\) days. We know that \(T=\sum_{i=1}^{n}X_i\) is a sufficient statistic. Determine:

- \(k\) such that the estimator \(T_{k}=kT\) with \(k>0\) is unbiased for \(\theta\).
- The condition that the sequence \(k_{n}\) must satisfy for obtaining a sequence of estimators that is consistent in probability, if the sample size \(n\) is allowed to grow.

**Exercise 3.8** Let \(X\) be a normal r.v. with mean zero and variance \(\sigma^2\). Is \(\left| X\right|\) a sufficient statistic for \(\sigma^2\)?

**Exercise 3.9** Assume that the r.v.’s \(Y_{1},\ldots,Y_{n}\) are such that
\[
Y_{i}=\beta x_{i}+\varepsilon_{i},\quad i=1,\ldots,n,
\]
where the \(x_{i}\)’s are constants and the \(\varepsilon_{i}\) are i.i.d. r.v.’s distributed as \(\mathcal{N}(0,\sigma^{2})\).

- Prove that \(\sum_{i=1}^{n}Y_{i}/\sum_{i=1}^{n}x_{i}\) is an unbiased estimator of \(\beta\).
- Prove that \((1/n)\sum_{i=1}^{n}Y_{i}/x_{i}\) is also unbiased.
- Compute the variances of both estimators.

**Exercise 3.10** Let \((X_1,\ldots,X_n)\) be a s.r.s. from a \(\mathcal{U}\left(0, \theta \right)\), \(\theta >0\). Find out whether the following estimators are unbiased for the population mean and, if they are not, find their biases:
\[
\bar{X},\quad X_{1},\quad X_{(n)},\quad X_{(1)},\quad 0.5 X_{(n)}+0.5 X_{(1)}.
\]

**Exercise 3.11** Consider a sample as in Exercise 3.10.

- Prove that \(X_{(n)}\) is consistent in probability for \(\theta\).
- Check whether \(Y_{n}=2\bar{X}_{n}\) is also consistent in probability for \(\theta\).

**Exercise 3.12** Let \((X_1,\ldots,X_n)\) be a s.r.s. from an \(\mathrm{Exp}(\lambda)\).

- Find an unbiased estimator of \(\lambda\) based on \(X_{(1)}\).
- Compare the previous estimator with \(\bar{X}\) and decide which one is better.
- The lifetime of a light bulb in days is usually modeled by an exponential distribution. The following data are lifetimes for light bulbs: \(50.1\), \(70.1\), \(137\), \(166.9\), \(170.5\), \(152.8\), \(80.5\), \(123.5\), \(112.6\), \(148.5\), \(160\), \(125.4\). Estimate the average duration of a light bulb using the two previous estimators.

**Exercise 3.13** Let \((X_1,\ldots,X_{10})\) be a s.r.s. from a distribution with mean \(\mu\) and variance \(\sigma^2\). Consider the following two estimators of \(\mu\): \[ \hat{\mu}_{1}=X_{1}+\frac{1}{2}X_{2}-\frac{1}{2}X_{3}, \quad \hat{\mu}_{2}=X_{4}+\frac{1}{5}X_{5}-\frac{1}{10}X_{10}. \] Which one is better in terms of MSE? Are they consistent in probability?

**Exercise 3.14** Let \((X_1,\ldots,X_{n})\) be a s.r.s. from a distribution with mean \(\mu\) and variance \(\sigma^2\). Consider the following estimators of \(\mu\):
\[
\hat{\mu}_{1}=\frac{X_{1}+2X_{2}+3X_{3}}{6}, \quad
\hat{\mu}_{2}=\frac{X_{1}+4X_{2}+X_{3}}{6}, \quad
\hat{\mu}_{3}=\frac{\frac{3}{2}X_{1}+\frac{1}{2}X_{2}+X_{3}+\ldots
+X_{n}}{n}.
\]

- Which ones are unbiased?
- Among the unbiased ones, which is the most efficient?
- Find an unbiased estimator of \(\mu\) different from \(\bar{X}\) that is more efficient than the previous unbiased estimators.
- Which of them is consistent in squared mean?

**Exercise 3.15** Let \((X_1,\ldots,X_{n})\) be a s.r.s. from a Weibull distribution, whose p.d.f. is given by \[ f(x;\theta)=\frac{2x}{\theta}e^{-x^2/\theta}, \quad x>0. \] Find a minimal sufficient statistic for \(\theta\).

**Exercise 3.16** Let \((X_1,\ldots,X_n)\) be a s.r.s. of a r.v. \(X\sim\mathcal{N}(\mu,\sigma^2)\). Show that \(\bar{X}\) is an efficient estimator of \(\mu\).

**Exercise 3.17** The consumption of a certain good in a family with four members during the summer months is a r.v. with \(\mathcal{U}(\alpha,\alpha+1)\) distribution. Let \((X_{1},\ldots,X_{n})\) be a s.r.s. of consumptions of the same good for different families.

- Show that the sample mean is biased for \(\alpha\) and that its bias is \(1/2\).
- Compute the MSE of \(\bar{X}\) as an estimator of \(\alpha\).
- Obtain from \(\bar{X}\) an unbiased estimator of \(\alpha\) and provide its MSE.

In that case, \(\varphi(x)=\varphi(y)\) does not imply that \(x=y\) and there might be different elements having the same image by \(\varphi\).