# Chapter 4 Estimation methods

## 4.1 Method of moments

Consider a population \(X\) whose distribution depends on \(K\) unknown parameters \(\theta_1,\ldots,\theta_K\). If they exist, the population moments are, in general, functions of the unknown parameters. That is,
\[
\alpha_r=\alpha_r(\theta_1,\ldots,\theta_K)=\mathbb{E}[X^r], \quad
r=1,2,3,\ldots
\]
Given a s.r.s. of \(X\), we denote by \(a_r\) the sample moment of order \(r\), which estimates \(\alpha_r\):
\[
a_r=\bar{X^r}=\frac{1}{n}\sum_{i=1}^n X_i^r, \quad r=1,2,3,\ldots
\]
Note that the sample moments do not depend on \(\theta_1,\ldots,\theta_K\), but the population moments do. This is the key fact that the *method of moments* exploits: it finds the values of the parameters \(\theta_1,\ldots,\theta_K\) that exactly equate \(\alpha_r\) to \(a_r\) (which is what should happen when \(n\to\infty\)) for as many \(r\)’s as necessary.

**Definition 4.1 (Method of moments)** Let \(X\) be a r.v. whose distribution depends on the unknown parameters \(\theta_1,\ldots,\theta_K\). The *method of moments* produces estimators from a s.r.s. of \(X\) by solving, for the variables \(\theta_1,\ldots,\theta_K\), the system of equations

\[ \alpha_r(\theta_1,\ldots,\theta_K)=a_r, \quad r=1,\ldots, R, \]

where \(R\geq K\) is the lowest integer such that the system admits a unique solution. The estimator of \(\theta\) produced by the method of moments is simply referred to as the *moment estimator* of \(\theta\) and is denoted by \(\hat\theta_{\mathrm{MM}}\).

**Example 4.1** Assume that we have a population with distribution \(\mathcal{N}(\mu,\sigma^2)\) and a s.r.s. \((X_1,\ldots,X_n)\) from it. Let’s compute the moment estimators of \(\mu\) and \(\sigma^2\).

For estimating two parameters, we need at least two equations. We compute in the first place the first two moments of the r.v. \(X\sim \mathcal{N}(\mu,\sigma^2)\). The first one is \(\alpha_1(\mu,\sigma^2)=\mathbb{E}[X]=\mu\). The second order moment arises from the variance \(\sigma^2\):

\[ \alpha_2(\mu,\sigma^2)=\mathbb{E}[X^2]=\mathbb{V}\mathrm{ar}[X]+\mathbb{E}[X]^2=\sigma^2+\mu^2. \]

On the other hand, the first two sample moments are given by

\[ a_1=\bar{X}, \quad a_2=\frac{1}{n}\sum_{i=1}^n X_i^2=\bar{X^2}. \]

Then, the equations to solve are

\[ \left\{ \begin{array}{rl} \alpha_1(\mu,\sigma^2)&=\bar{X} \\ \alpha_2(\mu,\sigma^2)&=\bar{X^2} \end{array} \right. \]

The solution for \(\mu\) is already in the first equation. Substituting this value in the second equation and solving \(\sigma^2\), we get the estimators

\[ \hat\mu_{\mathrm{MM}}=\bar{X},\quad \hat\sigma^2_{\mathrm{MM}}=\bar{X^2}-\hat\mu^2=\bar{X^2}-\bar{X}^2=S^2. \]
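As a quick numerical check of Example 4.1, the following sketch computes both moment estimators from a simulated normal sample. The function name `normal_moment_estimators`, the seed, and the parameter values \(\mu=5\), \(\sigma^2=4\) are illustrative assumptions, not part of the example itself.

```python
import random

def normal_moment_estimators(sample):
    """Moment estimators of (mu, sigma^2) for a normal sample (Example 4.1)."""
    n = len(sample)
    a1 = sum(sample) / n                  # first sample moment a_1
    a2 = sum(x * x for x in sample) / n   # second sample moment a_2
    return a1, a2 - a1 ** 2               # (mu_MM, sigma2_MM)

random.seed(42)  # arbitrary seed, used only for reproducibility
# Assumed true values: mu = 5 and sigma = 2 (so sigma^2 = 4)
sample = [random.gauss(5.0, 2.0) for _ in range(10_000)]
mu_mm, sigma2_mm = normal_moment_estimators(sample)
print(mu_mm, sigma2_mm)  # both should be close to 5 and 4
```

With a sample of this size, both estimates should fall near the assumed true values, though the exact figures depend on the seed.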

**Example 4.2** Let \((X_1,\ldots,X_n)\) be a s.r.s. of a r.v. \(X\sim\mathcal{U}(0,\theta)\), \(\theta>0\). Let’s obtain the estimator of \(\theta\) by the method of moments.

The first population moment is \(\alpha_1(\theta)=\mathbb{E}[X]=\theta/2\) and the first sample moment is \(a_1=\bar{X}\). Equating both and solving for \(\theta\), we obtain \(\hat\theta_{\mathrm{MM}}=2\bar{X}\).

Observe that the estimator \(\hat\theta_{\mathrm{MM}}\) of the upper limit of the range can actually be smaller than the maximum observation, in which case the estimate is incompatible with the data. Intuitively, then, the estimator is clearly suboptimal. This is an illustration of a more general fact: the estimators obtained by the method of moments are usually not the most efficient ones.
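A small hypothetical sample makes this drawback concrete. In the sketch below, the three observations and the function name `uniform_moment_estimator` are assumptions chosen for illustration only.

```python
def uniform_moment_estimator(sample):
    """Moment estimator 2 * mean of theta for a U(0, theta) sample (Example 4.2)."""
    return 2 * sum(sample) / len(sample)

sample = [0.1, 0.2, 0.9]  # hypothetical observations
theta_mm = uniform_moment_estimator(sample)
print(theta_mm, max(sample))
# True here: the estimated upper limit lies below an observed value
print(theta_mm < max(sample))
```

For this sample the estimate is roughly \(0.8\), while the largest observation is \(0.9\), a value that would be impossible under \(\mathcal{U}(0,0.8)\).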

**Example 4.3** Let \((X_1,\ldots,X_n)\) be a s.r.s. of a r.v. \(X\sim\mathcal{U}(-\theta,\theta)\), \(\theta>0\). Obtain the estimator of \(\theta\) by the method of moments.

Now the first population moment is \(\alpha_1(\theta)=\mathbb{E}[X]=0\) and it does not contain any information about \(\theta\). Therefore, we need to look into the second population moment, that is, \(\alpha_2(\theta)=\mathbb{E}[X^2]=\mathbb{V}\mathrm{ar}[X]+\mathbb{E}[X]^2=\mathbb{V}\mathrm{ar}[X]=\frac{\theta^2}{3}\). Solving \(\alpha_2(\theta)=\bar{X^2}\) for \(\theta\), we obtain \(\hat\theta_{\mathrm{MM}}=\sqrt{3\bar{X^2}}\).

This example illustrates that in certain situations it may be required to consider \(R>K\) equations (here \(R=2\) and \(K=1\)) if some of them are non-informative.

An important observation is that if the parameters to be estimated \(\theta_1,\ldots,\theta_K\) can be written as functions of \(K\) population moments through continuous functions \(g_i\), \[ \theta_i=g_i(\alpha_1,\ldots,\alpha_K), \] then the estimator of \(\theta_i\) by the method of moments is \[ \hat\theta_{\mathrm{MM},i}=g_i(a_1,\ldots,a_K). \] Recall that \(g_i\) is such that \(\theta_i=g_i\left(\alpha_1(\theta_1,\ldots,\theta_K),\ldots,\alpha_K(\theta_1,\ldots,\theta_K)\right)\). This means that \(g_i\) is the \(i\)-th component of the inverse function of \[ \alpha:(\theta_1,\ldots,\theta_K)\in\mathbb{R}^K\mapsto \left(\alpha_1(\theta_1,\ldots,\theta_K),\ldots,\alpha_K(\theta_1,\ldots,\theta_K)\right)\in\mathbb{R}^K. \]
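For instance, in Example 4.1 we have \(K=2\), \(\alpha_1=\mu\), and \(\alpha_2=\sigma^2+\mu^2\). Inverting this map gives

\[ \mu=g_1(\alpha_1,\alpha_2)=\alpha_1, \quad \sigma^2=g_2(\alpha_1,\alpha_2)=\alpha_2-\alpha_1^2, \]

both of which are continuous. Plugging in the sample moments recovers the estimators of that example:

\[ \hat\mu_{\mathrm{MM}}=g_1(a_1,a_2)=\bar{X}, \quad \hat\sigma^2_{\mathrm{MM}}=g_2(a_1,a_2)=\bar{X^2}-\bar{X}^2. \]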

**Proposition 4.1 (Consistency in probability of the method of moments)** Let \((X_1,\ldots,X_n)\) be a s.r.s. of a r.v. \(X\) whose distribution depends on the unknown parameters \(\theta_1,\ldots,\theta_K\), and that verifies

\[\begin{align} \mathbb{E}[(X^k-\alpha_k)^2]<\infty, \quad k=1,\ldots,K.\tag{4.1} \end{align}\]

If \(\theta_i=g_i(\alpha_1,\ldots,\alpha_K)\), with \(g_i\) continuous, \(i=1,\ldots,K\), then the moment estimators for \(\theta_i\), \(\hat\theta_{\mathrm{MM},i}=g_i(a_1,\ldots,a_K)\), are consistent in probability:

\[ \hat\theta_{\mathrm{MM},i}\stackrel{\mathbb{P}}{\longrightarrow} \theta_i,\quad i=1,\ldots,K. \]

*Proof* (Proof of Proposition 4.1). Thanks to condition (4.1), the LLN implies that the sample moments \(a_1,\ldots,a_K\) are consistent in probability for estimating the population moments \(\alpha_1,\ldots,\alpha_K\). In addition, the functions \(g_i\) are continuous, \(i=1,\ldots,K\), hence by Theorem 3.5, \(\hat\theta_{\mathrm{MM},i}\) is consistent in probability for \(\theta_i\), \(i=1,\ldots,K\).
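Consistency can be explored empirically with a small simulation. The sketch below revisits the estimator \(2\bar{X}\) of Example 4.2 for increasing sample sizes; the true value \(\theta=2\), the seed, and the sample sizes are assumptions made for illustration. A single simulated run only suggests the behavior and is not a proof.

```python
import random

def uniform_moment_estimator(sample):
    """Moment estimator 2 * mean of theta for a U(0, theta) sample."""
    return 2 * sum(sample) / len(sample)

random.seed(1)  # arbitrary seed, used only for reproducibility
theta = 2.0     # assumed true parameter
errors = {}
for n in (100, 10_000, 100_000):
    sample = [random.uniform(0, theta) for _ in range(n)]
    # Absolute estimation error for this sample size
    errors[n] = abs(uniform_moment_estimator(sample) - theta)
    print(n, errors[n])
```

The absolute error is expected to shrink at a rate of order \(1/\sqrt{n}\), although it need not decrease monotonically in any particular run.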

## 4.2 Maximum likelihood

**Example 4.4** Consider an experiment consisting of tossing a coin two times independently. Let \(X\) be the r.v. “number of observed heads in two tosses”. Then, \(X\sim \mathrm{Bin}(2,\theta)\), where \(\theta=\mathbb{P}(\text{“heads”})\in\{0.2,0.8\}\). That is, the p.m.f. is

\[ p(x;\theta)=\left(\begin{array}{c}2 \\ x\end{array}\right)\theta^x (1-\theta)^{2-x}, \quad x=0,1,2. \]

Then, the p.m.f. of \(X\) according to the possible values of \(\theta\) is given in the following table:

\(\theta\) | \(x=0\) | \(x=1\) | \(x=2\) |
---|---|---|---|
\(0.20\) | \(0.64\) | \(0.32\) | \(0.04\) |
\(0.80\) | \(0.04\) | \(0.32\) | \(0.64\) |

In view of this table, it seems logical to estimate \(\theta\) in the following way: if in the experiment we obtain \(x=0\) heads, then we set as estimator \(\hat\theta=0.2\) since, provided \(x=0\), \(\theta=0.2\) is more likely than \(\theta=0.8\). Analogously, if we obtain \(x=2\) heads, we set \(\hat\theta=0.8\). If \(x=1\), we do not have a clear choice between \(\theta=0.2\) and \(\theta=0.8\), since both are equally likely according to the available information, so we can arbitrarily choose one of the values for \(\hat\theta\).
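The table and the resulting decision rule can be reproduced with a few lines of code. This is a minimal sketch; the function names `binom_pmf` and `mle_finite` are hypothetical helpers introduced here.

```python
from math import comb

def binom_pmf(x, theta, size=2):
    """p.m.f. of Bin(size, theta) evaluated at x."""
    return comb(size, x) * theta ** x * (1 - theta) ** (size - x)

def mle_finite(x, candidates=(0.2, 0.8)):
    """Candidate value of theta with the largest likelihood for the observed x."""
    return max(candidates, key=lambda theta: binom_pmf(x, theta))

# Rows of the table: p.m.f. at x = 0, 1, 2 for each candidate theta
for theta in (0.2, 0.8):
    print(theta, [round(binom_pmf(x, theta), 2) for x in (0, 1, 2)])

# Estimates for x = 0 and x = 2
print(mle_finite(0), mle_finite(2))
```

For \(x=1\) both candidates have the same likelihood up to floating-point rounding, so whichever value `mle_finite(1)` returns is as arbitrary as the choice described in the text.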

The previous example illustrates the core idea behind the maximum likelihood method: estimate the unknown parameter \(\theta\) with the value that maximizes the probability of obtaining a sample realization like the one actually observed. In other words, select the most likely value of \(\theta\) according to the data at hand. Note, however, that this interpretation is only valid for discrete r.v.’s \(X\), for which we can define the probability of the sample realization \((X_1=x_1,\ldots,X_n=x_n)\) as \(\mathbb{P}(X_1=x_1,\ldots,X_n=x_n)\). In the continuous case, the probability of a particular sample realization is zero. In that case, the maximum likelihood estimator is the \(\theta\) that maximizes the *density* (instead of the p.m.f.) of the sample, evaluated at the sample realization.

The formal definition of the maximum likelihood estimator is given next.

**Definition 4.2 (Maximum likelihood estimator)** Let \((X_1=x_1,\ldots,X_n=x_n)\) be the realization of a s.r.s. of a r.v. whose distribution belongs to the family \(\mathcal{F}_{\Theta}=\{F_{\theta}: \theta\in \Theta\}\), where \(\Theta\subset\mathbb{R}^K\). The *Maximum Likelihood Estimator* (MLE) of \(\theta\) is the quantity \(\hat\theta_{\mathrm{MLE}}\) that verifies
\[
\hat{\theta}_{\mathrm{MLE}}=\arg\max_{\theta\in \Theta} \mathcal{L}(\theta;x_1,\ldots,x_n).
\]

*Remark.* Quite often, the *log-likelihood* \[ \ell(\theta;x_1,\ldots,x_n)=\log \mathcal{L}(\theta;x_1,\ldots,x_n) \] is more manageable than the likelihood. As the logarithm function is monotonically increasing, the maxima of \(\ell(\theta;x_1,\ldots,x_n)\) and \(\mathcal{L}(\theta;x_1,\ldots,x_n)\) are attained at the same point.

*Remark.* If the parametric space \(\Theta\) is finite, the maximum can be found by exhaustive enumeration of the values \(\{\mathcal{L}(\theta;x_1,\ldots,x_n):\theta\in\Theta\}\) (as done in Example 4.4). If the cardinality of \(\Theta\) is infinite and \(\mathcal{L}(\theta;x_1,\ldots,x_n)\) is differentiable with respect to \(\theta\) in the interior of \(\Theta\), then we only need to solve the so-called *log-likelihood equations* (usually simpler than the *likelihood equations*), which characterize the relative extremes of the log-likelihood function: \[ \frac{\partial}{\partial \theta_j}\mathcal{\ell}(\theta;x_1,\ldots,x_n)=0, \quad j=1,\ldots,K. \] The global maximum is found by checking which of the relative extremes are local maxima and comparing them with the values of the likelihood at the boundary of \(\Theta\).

**Example 4.5** Let \((X_1,\ldots,X_n)\) be a s.r.s. of \(X\sim\mathcal{N}(\mu,\sigma^2)\). Let’s obtain the MLE of \(\mu\) and \(\sigma^2\).

Since \(\theta=(\mu,\sigma^2)\), the parametric space is \(\Theta=\mathbb{R}\times\mathbb{R}^+\). The likelihood of \(\theta\) is given by

\[ \mathcal{L}(\mu,\sigma^2;x_1,\ldots,x_n)=\frac{1}{(\sqrt{2\pi\sigma^2})^n}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2\right\} \]

and the log-likelihood by

\[ \ell(\mu,\sigma^2;x_1,\ldots,x_n)=-\frac{n}{2}\log(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2. \]

The derivatives with respect to \(\mu\) and \(\sigma^2\) are

\[\begin{align*} \frac{\partial \ell}{\partial \mu} &=\frac{1}{\sigma^2}\sum_{i=1}^n (x_i-\mu)=0,\\ \frac{\partial \ell}{\partial \sigma^2} &= -\frac{n}{2}\frac{1}{\sigma^2}+\frac{\sum_{i=1}^n(x_i-\mu)^2}{2\sigma^4}=0. \end{align*}\]

The solution to the log-likelihood equations is

\[ \hat\mu=\bar{X}, \quad \hat\sigma^2=\frac{1}{n}\sum_{i=1}^n (x_i-\bar{X})^2=s^2. \]

We can see that the solution is not a minimum since, taking the limit as \(\sigma^2\) approaches zero, we obtain

\[ \lim_{\sigma^2 \downarrow 0} \mathcal{L}(\mu,\sigma^2;x_1,\ldots,x_n)=0 <\mathcal{L}(\mu,\sigma^2;x_1,\ldots,x_n), \quad \forall \mu\in\mathbb{R}, \forall \sigma^2>0. \]

In particular,

\[ \lim_{\sigma^2 \downarrow 0} \mathcal{L}(\mu,\sigma^2;x_1,\ldots,x_n)=0 <\mathcal{L}(\hat\mu,\hat\sigma^2;x_1,\ldots,x_n). \]

This means that the solution is a maximum. If the second derivatives are computed, it is seen that the Hessian matrix evaluated at \((\hat\mu,\hat\sigma^2)\) is negative definite, so \(\hat\mu=\bar{X}\) and \(\hat\sigma^2=S^2\) are the MLE of \(\mu\) and \(\sigma^2\).
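The closed-form solution of Example 4.5 can be checked numerically by comparing the log-likelihood at \((\hat\mu,\hat\sigma^2)\) with its value at nearby points. The data below are hypothetical and the function names are illustrative assumptions. Checking a few perturbations does not replace the analytical argument.

```python
import math

def normal_loglik(mu, sigma2, sample):
    """Log-likelihood of (mu, sigma^2) for a normal sample."""
    n = len(sample)
    ss = sum((x - mu) ** 2 for x in sample)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - ss / (2 * sigma2)

def normal_mle(sample):
    """Closed-form solution of the log-likelihood equations."""
    n = len(sample)
    mu_hat = sum(sample) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in sample) / n
    return mu_hat, sigma2_hat

sample = [4.1, 5.3, 6.0, 4.7, 5.9]  # hypothetical data
mu_hat, sigma2_hat = normal_mle(sample)
best = normal_loglik(mu_hat, sigma2_hat, sample)

# Perturbing either coordinate should never increase the log-likelihood
for dmu in (-0.5, -0.1, 0.1, 0.5):
    for ds in (-0.2, -0.05, 0.05, 0.2):
        assert normal_loglik(mu_hat + dmu, sigma2_hat + ds, sample) <= best
print(mu_hat, sigma2_hat)
```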

The previous example shows that the MLE is not necessarily unbiased, since the sample variance \(S^2\) is a biased estimator of the variance \(\sigma^2\).
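The size of this bias is a standard result, recalled here for completeness. With \(S^2=\frac{1}{n}\sum_{i=1}^n(X_i-\bar{X})^2\) and a population with finite variance \(\sigma^2\),

\[ \mathbb{E}[S^2]=\frac{n-1}{n}\sigma^2, \quad \text{so that} \quad \mathbb{E}[S^2]-\sigma^2=-\frac{\sigma^2}{n}, \]

which vanishes as \(n\to\infty\).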

**Example 4.6** Consider a continuous parametric space \(\Theta=[0,1]\) for the experiment of Example 4.4. Let’s obtain the MLE of \(\theta\).

In the first place, observe that \(\Theta\) is a closed and bounded interval. A continuous function defined on such an interval always has a maximum, which may be attained at the extremes of the interval. Therefore, in this situation we must compare the solutions to the log-likelihood equation with the extremes of the interval.

The likelihood is

\[ \mathcal{L}(\theta;x)=\left(\begin{array}{c}2\\ x \end{array}\right)\theta^x(1-\theta)^{2-x}, \quad x=0,1,2. \]

The log-likelihood is then

\[ \ell(\theta;x)=\log \left(\begin{array}{c}2\\ x \end{array}\right)+x\log\theta+(2-x)\log(1-\theta), \]

with derivative

\[ \frac{\partial \ell}{\partial \theta}=\frac{x}{\theta}-\frac{2-x}{1-\theta}. \]

Equating this derivative to zero, and assuming that \(\theta\neq 0,1\) (at the boundary of \(\Theta\), \(\ell\) is not differentiable), we obtain the equation

\[ x(1-\theta)-(2-x)\theta=0. \]

Solving for \(\theta\), we obtain

\[ \hat\theta=x/2, \]

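that is, the proportion of heads obtained in the two tosses. This root lies in the interior of \(\Theta\) only when \(x=1\); in that case, comparing the likelihood at \(\hat\theta=1/2\) with its values at the boundary of \(\Theta\), we have

\[ \mathcal{L}(0;1)=\mathcal{L}(1;1)=0<\mathcal{L}(1/2;1)=1/2. \]

For \(x=0\) and \(x=2\), the likelihoods \(\mathcal{L}(\theta;0)=(1-\theta)^2\) and \(\mathcal{L}(\theta;2)=\theta^2\) are monotone in \(\theta\) and attain their maxima at the boundary points \(\theta=0\) and \(\theta=1\), respectively, which again coincide with \(x/2\).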

Therefore, \(\hat\theta=x/2\) is the MLE of \(\theta\).
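A brute-force search over a grid on \(\Theta=[0,1]\) agrees with \(\hat\theta=x/2\) for each possible outcome, including \(x=0\) and \(x=2\), where the maximizer lies on the boundary. The grid resolution and the function names below are assumptions of this sketch.

```python
from math import comb

def likelihood(theta, x, size=2):
    """Likelihood of theta for x observed heads in `size` tosses."""
    return comb(size, x) * theta ** x * (1 - theta) ** (size - x)

def grid_mle(x, steps=1000):
    """Maximizer of the likelihood over an equispaced grid on [0, 1]."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda theta: likelihood(theta, x))

# Compare the grid maximizer with the closed-form x / 2
for x in (0, 1, 2):
    print(x, grid_mle(x), x / 2)
```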

The following example shows a likelihood that is not continuous for all the possible values of \(\theta\) but that still delivers the MLE of \(\theta\).

**Example 4.7** Let \(X\sim \mathcal{U}(0,\theta)\) with p.d.f.
\[
f(x;\theta)=\frac{1}{\theta}1_{\{0\leq x\leq \theta\}}, \quad
\theta>0,
\]
where \(\theta\) is unknown, and let \((X_1,\ldots,X_n)\) be a s.r.s. of \(X\). We compute the MLE of \(\theta\).

The likelihood is given by

\[ \mathcal{L}(\theta;x_1,\ldots,x_n)=\frac{1}{\theta^n}, \quad 0\leq x_1,\ldots,x_n\leq \theta. \]

The restriction involving \(\theta\) can be included within the likelihood function (which will take otherwise the value \(0\)), rewriting it as

\[ \mathcal{L}(\theta;x_1,\ldots,x_n)=\frac{1}{\theta^n}1_{\{x_{(1)}\geq0\}}1_{\{x_{(n)}\leq\theta\}}. \]
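The indicator form of the likelihood can be evaluated directly for several candidate values of \(\theta\). The sample below is hypothetical and `uniform_likelihood` is an illustrative name.

```python
def uniform_likelihood(theta, sample):
    """Likelihood of theta for a U(0, theta) sample, indicators included."""
    if theta <= 0 or min(sample) < 0 or max(sample) > theta:
        return 0.0  # some indicator is zero
    return theta ** (-len(sample))

sample = [0.8, 2.1, 1.4, 3.0, 0.3]  # hypothetical observations, maximum 3.0
for theta in (2.5, 2.9, 3.0, 3.5, 4.0):
    print(theta, uniform_likelihood(theta, sample))
```

The printed values are zero for candidates below the largest observation (\(3.0\) here) and positive from that point on.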

We can observe that, for \(\theta\geq x_{(n)}\), \(\mathcal{L}\) is decreasing in \(\theta\). However, for \(\theta<x_{(n)}\), the likelihood is zero. Therefore, the maximum is attained precisely at \(\hat\theta=x_{(n)}\) and, as a consequence, this is the MLE of \(\theta\). Figure 4.1 helps to visualize this reasoning.

The next example shows that the MLE does not have to be unique.

**Example 4.8** Let \(X\sim \mathcal{U}(\theta-1,\theta+1)\) with p.d.f.
\[
f(x;\theta)=\frac{1}{2}1_{\{x\in[\theta-1,\theta+1]\}},
\]
where \(\theta>0\) is unknown, and let \((X_1,\ldots,X_n)\) be a s.r.s. of \(X\). We compute the MLE of \(\theta\).

The likelihood is given by

\[ \mathcal{L}(\theta;x_1,\ldots,x_n)=\frac{1}{2^n}1_{\{x_{(1)}\geq \theta-1\}} 1_{\{x_{(n)}\leq \theta+1\}}=\frac{1}{2^n}1_{\{x_{(n)}-1\leq \theta\leq x_{(1)}+1\}}. \]

Therefore, \(\mathcal{L}(\theta;x_1,\ldots,x_n)\) is constant for any value of \(\theta\) in the interval \([x_{(n)}-1,x_{(1)}+1]\), and is zero if \(\theta\) is not in that interval. This means that any value inside \([x_{(n)}-1,x_{(1)}+1]\) is a maximum and, therefore, is a MLE.
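The interval of maximizers is easy to compute and to check numerically. In the following sketch the sample is hypothetical and the function names are assumptions.

```python
def mle_interval(sample):
    """Endpoints [x_(n) - 1, x_(1) + 1] of the set of MLEs for U(theta - 1, theta + 1)."""
    return max(sample) - 1, min(sample) + 1

def likelihood(theta, sample):
    """Likelihood of theta for a U(theta - 1, theta + 1) sample."""
    inside = min(sample) >= theta - 1 and max(sample) <= theta + 1
    return 0.5 ** len(sample) if inside else 0.0

sample = [2.3, 3.1, 2.8, 3.6]  # hypothetical observations
lo, hi = mle_interval(sample)
print(lo, hi)
# Constant value 1 / 2^n inside the interval, zero outside
print([likelihood(t, sample) for t in (lo + 0.01, (lo + hi) / 2, hi - 0.01, hi + 0.1)])
```

Any point of the interval, such as its midpoint \((x_{(1)}+x_{(n)})/2\), could be reported as the estimate.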

The MLE is a function of any sufficient statistic. However, this function does not have to be bijective and, as a consequence, the MLE is not guaranteed to be sufficient (any sufficient statistic can be transformed into the MLE, but the MLE cannot necessarily be transformed into a sufficient statistic). The next example shows a MLE that is not sufficient.

**Example 4.9** Let \(X\sim\mathcal{U}(\theta,2\theta)\) with p.d.f.
\[
f_{\theta}(x)=\frac{1}{\theta}1_{\{\theta\leq x\leq 2\theta\}},
\]
where \(\theta>0\) is unknown. Obtain the MLE of \(\theta\) from a s.r.s. \((X_1,\ldots,X_n)\).

The likelihood is \[ \mathcal{L}(\theta;x_1,\ldots,x_n)=\frac{1}{\theta^n}1_{\{x_{(1)}\geq \theta\}}1_{\{x_{(n)}\leq 2\theta\}}=\frac{1}{\theta^n}1_{\{x_{(n)}/2\leq \theta\leq x_{(1)}\}}. \]

Because of Theorem 3.7, the statistic

\[ T=(X_{(1)},X_{(n)}) \]

is sufficient. On the other hand, since \(\mathcal{L}(\theta;x_1,\ldots,x_n)\) is decreasing in \(\theta\) on the interval where it is positive, the maximum is attained at the left endpoint of that interval,

\[ \hat\theta=X_{(n)}/2, \]

which is *not* sufficient, since \(\mathcal{L}(\theta;x_1,\ldots,x_n)\) can not be factorized adequately.
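One way to see the loss of information is the following. By the factorization criterion, two samples with the same value of a sufficient statistic must have likelihoods that are proportional as functions of \(\theta\). The sketch below uses two hypothetical samples that share \(\hat\theta=X_{(n)}/2\) but whose likelihoods are not proportional; the function names are assumptions.

```python
def likelihood(theta, sample):
    """Likelihood of theta for a U(theta, 2 * theta) sample."""
    if theta <= 0 or min(sample) < theta or max(sample) > 2 * theta:
        return 0.0  # some indicator is zero
    return theta ** (-len(sample))

def mle(sample):
    """MLE of theta: half of the largest observation."""
    return max(sample) / 2

sample_a = [1.0, 1.8]  # hypothetical samples with the same maximum
sample_b = [1.2, 1.8]
print(mle(sample_a), mle(sample_b))  # identical MLEs

# Equal likelihoods at theta = 0.95, but at theta = 1.1 only sample_b is possible
print(likelihood(0.95, sample_a), likelihood(0.95, sample_b))
print(likelihood(1.1, sample_a), likelihood(1.1, sample_b))
```

The two likelihoods agree at \(\theta=0.95\) yet differ at \(\theta=1.1\), where one is zero and the other is positive, so knowing \(X_{(n)}/2\) alone does not determine the likelihood up to a constant factor.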

In addition, the MLE verifies that if there exists an unbiased and efficient estimator for \(\theta\), then this has to be the *unique* MLE of \(\theta\).

**Proposition 4.2 (Invariance of the MLE)** The MLE is invariant with respect to bijective transformations of the parameter. That is, if \(\phi=h(\theta)\), where \(h\) is bijective and \(\hat\theta\) is the MLE of \(\theta\), then \(\hat\phi=h(\hat\theta)\) is the MLE of \(\phi\).

*Proof* (Proof of Proposition 4.2). Let \(\mathcal{L}_{\theta}(\theta;x_1,\ldots,x_n)\) be the likelihood of \(\theta\) for the sample \((x_1,\ldots,x_n)\). The likelihood of \(\phi\) verifies

\[ \mathcal{L}_{\phi}(\phi;x_1,\ldots,x_n)=\mathcal{L}_{\theta}(\theta;x_1,\ldots,x_n)=\mathcal{L}_{\theta}(h^{-1}(\phi);x_1,\ldots,x_n), \quad\forall \phi\in\Phi. \]

If \(\hat\theta\) is the MLE of \(\theta\) then, by definition,

\[ \mathcal{L}_\theta(\hat\theta;x_1,\ldots,x_n)\geq \mathcal{L}_\theta(\theta;x_1,\ldots,x_n), \quad \forall \theta\in \Theta. \]

Denote \(\hat\phi=h(\hat\theta)\). Since \(h\) is bijective, \(\hat\theta=h^{-1}(\hat\phi)\). Then, it follows that

\[\begin{align*} \mathcal{L}_\phi(\hat\phi;x_1,\ldots,x_n)&=\mathcal{L}_\theta(h^{-1}(\hat\phi);x_1,\ldots,x_n)=\mathcal{L}_\theta(\hat\theta;x_1,\ldots,x_n) \\ &\geq \mathcal{L}_\theta(\theta;x_1,\ldots,x_n)=\mathcal{L}_\theta(h^{-1}(\phi);x_1,\ldots,x_n)\\ &=\mathcal{L}_\phi(\phi;x_1,\ldots,x_n), \quad \forall \phi\in\Phi. \end{align*}\]

Therefore, \(\hat\phi\) is the MLE of \(\phi=h(\theta)\).

**Example 4.10** The lifetime of a certain type of light bulbs is a r.v. \(X\sim\mathrm{Exp}(\theta)\), \(\theta>0\). After observing the lifetime of \(n\) of them, we wish to estimate the probability that the lifetime of a light bulb is above \(500\) hours.

The p.d.f. of \(X\) is

\[ f(x;\theta)=\frac{1}{\theta}\exp\left\{-\frac{x}{\theta}\right\}, \quad \theta>0,\ x>0. \]

The probability that a light bulb lasts more than \(500\) hours is

\[ \mathbb{P}(X>500)=\int_{500}^{\infty} \frac{1}{\theta}\exp\left\{-\frac{x}{\theta}\right\}\,\mathrm{d}x =\left[-\exp\left\{-\frac{x}{\theta}\right\}\right]_{500}^{\infty} =e^{-500/\theta}. \]

Then, the parameter to estimate is

\[ \phi=e^{-500/\theta}. \]

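Once we compute the MLE \(\hat\theta\) of \(\theta\), Proposition 4.2 gives the MLE of \(\phi\) simply as \(\hat\phi=h(\hat\theta)\), with \(h(\theta)=e^{-500/\theta}\), which is a bijection from \((0,\infty)\) onto \((0,1)\).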

For the sample \((x_1,\ldots,x_n)\), the likelihood of \(\theta\) is

\[ \mathcal{L}(\theta;x_1,\ldots,x_n)=\frac{1}{\theta^n}\exp\left\{-\frac{\sum_{i=1}^n x_i}{\theta}\right\}. \]

The log-likelihood of \(\theta\) is

\[ \ell(\theta;x_1,\ldots,x_n)=-n\log\theta-\frac{\sum_{i=1}^n x_i}{\theta}. \]

Differentiating and equating to zero, we obtain the log-likelihood equation:

\[ \frac{\partial \ell(\theta)}{\partial \theta}=-\frac{n}{\theta}+\frac{\sum_{i=1}^n x_i}{\theta^2}=0. \]

Since \(\theta\neq 0\), then solving for \(\theta\), we get

\[ \hat\theta=\bar{X}. \]

The second derivative is

\[ \frac{\partial^2\ell(\theta)}{\partial \theta^2}=\frac{n}{\theta^2}-\frac{2\sum_{i=1}^n x_i}{\theta^3}=\frac{1}{\theta^2}\left(n-\frac{2\sum_{i=1}^n x_i}{\theta}\right). \]

If we evaluate it at \(\hat\theta=\bar{X}\), we get

\[ \left.\frac{\partial^2 \ell(\theta)}{\partial \theta^2}\right|_{\theta=\bar{X}}=\frac{1}{\bar{X}^2}\left(n-\frac{2n\bar{X}}{\bar{X}}\right)=-\frac{n}{\bar{X}^2}<0,\ \forall n>0. \]

Therefore, \(\hat\theta=\bar{X}\) is a local maximum. Since \(\Theta=\mathbb{R}^+\) is open and \(\hat\theta\) is the unique local maximum, then \(\hat\theta=\bar{X}\) is a global maximum. Hence, the MLE of \(\phi=e^{-500/\theta}\) is \(\hat\phi=e^{-500/\bar{X}}\).
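In practice, the whole computation of Example 4.10 reduces to two lines. The lifetimes below are hypothetical values used only to illustrate the invariance property; `exp_mle_survival` is an assumed name.

```python
import math

def exp_mle_survival(sample, t=500.0):
    """MLEs of theta (mean lifetime) and of phi = P(X > t) for an Exp(theta) sample."""
    theta_hat = sum(sample) / len(sample)       # MLE of theta: the sample mean
    return theta_hat, math.exp(-t / theta_hat)  # MLE of phi by invariance

lifetimes = [420.0, 610.0, 380.0, 750.0, 540.0]  # hypothetical lifetimes in hours
theta_hat, phi_hat = exp_mle_survival(lifetimes)
print(theta_hat, phi_hat)
```

No separate maximization over \(\phi\) is needed: the estimate of the survival probability is obtained by plugging \(\hat\theta\) into \(h\).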

One of the most important properties of the MLE is that its asymptotic distribution is completely specified, no matter what the underlying parametric model is.

**Theorem 4.1 (Asymptotic distribution of the MLE)** Let \(X\) be a r.v. whose distribution depends on an unknown parameter \(\theta\in\Theta\), with \(\Theta\) an open interval of \(\mathbb{R}\). Under certain regularity conditions, any sequence \(\hat\theta_n\) of roots of the log-likelihood equation such that \(\hat\theta_n\stackrel{\mathbb{P}}{\longrightarrow}\theta\) verifies that

\[ \sqrt{n}(\hat\theta_n-\theta)\stackrel{d}{\longrightarrow} \mathcal{N}(0,\mathcal{I}(\theta)^{-1}). \]

**Example 4.11** Let \((X_1,\ldots,X_n)\) be a s.r.s. of \(X\sim\Gamma(k,\theta)\), whose p.d.f. is

\[ f(x;\theta)=\frac{\theta^k}{\Gamma(k)}x^{k-1}e^{-\theta x}, \quad x>0, \ k>0, \ \theta>0, \] where \(k\) is known and \(\theta\) is unknown. We compute the MLE of \(\theta\) and find its asymptotic distribution.

The likelihood is given by

\[ \mathcal{L}(\theta;x_1,\ldots,x_n)=\frac{\theta^{nk}}{(\Gamma(k))^n}\left(\prod_{i=1}^n x_i\right)^{k-1}e^{-\theta\sum_{i=1}^n x_i}. \]

The log-likelihood is then

\[ \ell(\theta;x_1,\ldots,x_n)=nk\log\theta-n\log \Gamma(k)+(k-1)\sum_{i=1}^n\log x_i-\theta\sum_{i=1}^n x_i. \]

Differentiating with respect to \(\theta\) and equating to zero, we get

\[ \frac{\partial \ell}{\partial \theta}=\frac{nk}{\theta}-n\bar{X}=0. \]

Then, the solution to the log-likelihood equation is \(\hat\theta=k/\bar{X}\). In addition, \(\partial \ell/\partial \theta>0\) if \(\theta\in(0,k/\bar{X})\), while \(\partial \ell/\partial \theta<0\) if \(\theta\in(k/\bar{X},\infty)\). Therefore, \(\ell\) is increasing in \((0,k/\bar{X})\) and decreasing in \((k/\bar{X},\infty)\), which means that it attains the maximum at \(k/\bar{X}\). Therefore, \(\hat\theta=k/\bar{X}\) is the MLE of \(\theta\).

We compute the Fisher information quantity of \(f(x;\theta)\). For that, we compute:

\[\begin{align*} \log f(x;\theta)&=k\log\theta-\log\Gamma(k)+(k-1)\log x-\theta x,\\ \frac{\partial \log f(x;\theta)}{\partial \theta}&=\frac{k}{\theta}-x. \end{align*}\]

Squaring and taking expectations, we get

\[ \mathcal{I}(\theta)=\mathbb{E}\left[\left(\frac{\partial \log f(X;\theta)}{\partial \theta}\right)^2\right] =\mathbb{E}\left[\left(\frac{k}{\theta}-X\right)^2\right]=\mathbb{V}\mathrm{ar}[X]=\frac{k}{\theta^2}. \]

Applying Theorem 4.1, we have that

\[ \sqrt{n}(\hat\theta-\theta)\stackrel{d}{\longrightarrow} \mathcal{N}\left(0,\theta^2/k\right), \]

which means that for large \(n\),

\[ \hat\theta\cong \mathcal{N}\left(\theta,\theta^2/(nk)\right). \]
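The asymptotic variance \(\theta^2/k\) can be compared against a Monte Carlo approximation. The sketch below is only a rough check: the values \(k=2\), \(\theta=3\), \(n=200\), the number of replicates, and the seed are all assumptions, and the agreement is only approximate for finite \(n\). Note that `random.gammavariate` is parametrized by shape and scale, so the rate \(\theta\) enters as the scale \(1/\theta\).

```python
import math
import random
import statistics

def gamma_rate_mle(sample, k):
    """MLE k / mean of the rate theta of a Gamma(k, theta) sample with known shape k."""
    return k / (sum(sample) / len(sample))

random.seed(7)  # arbitrary seed, used only for reproducibility
k, theta, n, reps = 2.0, 3.0, 200, 2000  # assumed values for the experiment
scaled = []
for _ in range(reps):
    # gammavariate(shape, scale): the scale is the reciprocal of the rate
    sample = [random.gammavariate(k, 1 / theta) for _ in range(n)]
    scaled.append(math.sqrt(n) * (gamma_rate_mle(sample, k) - theta))

# Empirical variance of sqrt(n) * (theta_hat - theta) versus theta^2 / k
print(statistics.variance(scaled), theta ** 2 / k)
```

The two printed numbers should be of similar magnitude, with the empirical variance fluctuating around the theoretical value \(\theta^2/k=4.5\).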

## Exercises

**Exercise 4.1** The lifetime of a certain type of electrical components follows a \(\Gamma(\alpha,\beta)\) distribution, but the values of the parameters \(\alpha\) and \(\beta\) are unknown. Assume that three of these components are tested independently and the following lifetimes are measured: \(120\), \(130\), and \(128\) hours.

- Obtain the moment estimators of \(\alpha\) and \(\beta\) and provide their concrete values for the sample at hand.
- Prove the consistency in probability of the estimators by using Corollary 3.2.

**Exercise 4.2** The proportion of impurities in each manufactured unit of a certain kind of chemical product is a r.v. with p.d.f.
\[
f(x;\theta)=\left\{\begin{array}{ll}
(\theta+1)x^\theta & 0<x<1,\\
0 & \text{otherwise},
\end{array}\right.
\]
where \(\theta>-1\). Five units of the manufactured product are taken in one day, resulting in the following impurity proportions: \(0.33\), \(0.51\), \(0.02\), \(0.15\), \(0.12\).

- Obtain the estimator of \(\theta\) by the method of moments.
- Obtain the maximum likelihood estimator of \(\theta\).

**Exercise 4.3** An industrial product is packaged in batches of \(N\) units each. The number of defective units within each batch is unknown. Since checking whether a unit is defective or not is expensive, the quality control consists of selecting \(n\) units of the batch and obtaining an estimate of the number of defective units within the batch. The batch is rejected if the estimated number of defective units exceeds \(N/4\).

- Find the moment estimator of the number of defective units within a batch.
- If \(N=20\), \(n=5\), and among these \(n\) units \(2\) of them are defective, is the batch rejected?

**Exercise 4.4** Assume that there are \(3\) balls in an urn: \(\theta\) of them are red and \(3-\theta\) white. Two balls are extracted (with replacement).

- The two balls are red. Obtain the MLE of \(\theta\).
- If only one of the balls is red, what is now the MLE of \(\theta\)?

**Exercise 4.5** A particular machine may fail because of two mutually exclusive types of failures: A and B. We wish to estimate the probability of each failure by knowing:

- The probability of failure A is twice the one of B.
- In \(30\) days we observe \(2\) failures of A, \(3\) failures of B, and \(25\) days without failures.

**Exercise 4.6** In the manufacturing process of metallic pieces of three sizes, the proportions of small, normal, and large pieces are \(p_{1}=0.05\), \(p_{2}=0.9\), and \(p_{3}=0.05\), respectively. It is suspected that the machine is not properly adjusted and that the proportions have changed in the following way: \(p_{1}=0.05+\tau\), \(p_{2}=0.9-2\tau\), and \(p_{3}=0.05+\tau\). To check this suspicion, \(5000\) pieces are analyzed, finding \(278\), \(4428\), and \(294\) pieces of small, normal, and large sizes, respectively. Compute the MLE of \(\tau\).

**Exercise 4.7** Let \(X_i\sim \mathcal{N}(i\theta, 1)\), \(i=1,\ldots,n\) (independent).

- Obtain the MLE of \(\theta\).
- Is it unbiased?
- Obtain its asymptotic variance.

**Exercise 4.8** Let \(X\) be a r.v. with p.d.f.
\[
f(x;\theta)=\frac{\theta }{x^{2}},\qquad 0<\theta \leq x.
\]

- Find the MLE of \(\theta\).
- Find a sufficient statistic for \(\theta\).
- Obtain the moment estimator of \(\theta\).

**Exercise 4.9** Consider the Pareto r.v. \(X\) with p.d.f.
\[
f(x)=\left\{
\begin{array}{lll}
{\frac{3\theta ^{3}}{x^{4}}} & \text{if} & x\geq \theta, \\
0 & \text{if} &x< \theta.
\end{array}
\right.
\]

- Find a sufficient statistic for \(\theta\).
- Find the MLE of \(\theta\).
- Find the moment estimator \(\hat{\theta}_{\mathrm{MM}}\). Does it always make sense?
- Is \(\hat{\theta}_{\mathrm{MM}}\) unbiased?
- Compute the precision of \(\hat{\theta}_{MM}\).

**Exercise 4.10** A s.r.s. of size \(n\) from the p.d.f.
\[
f(x)=2\theta xe^{-\theta x^{2}},\qquad x>0,
\]
where \(\theta >0\) is an unknown parameter, is available.

- Determine the estimator of \(\theta\) by the method of moments.
- Determine the MLE of \(\theta\) and find its asymptotic distribution.