Chapter 10 Techniques for Deriving Estimators

10.1 Introduction

In this section we introduce two techniques for deriving estimators:

  • The Method of Moments: a simple, intuitive approach, but one with limitations beyond simple random sampling (i.i.d. observations).

  • Maximum Likelihood Estimation: an approach which can be extended to complex modelling scenarios; likelihood-based estimation will be central to statistical inference procedures throughout not only this module but the whole course.

10.2 Method of Moments

Let $X$ be a random variable.

Moments

If $E[X^k]$ exists in the sense that it is finite, then $E[X^k]$ is said to be the $k$th moment of the random variable $X$.

For example,

  • $E[X] = \mu$ is the first moment of $X$;

  • $E[X^2]$ is the second moment of $X$.

Note that $\mathrm{var}(X) = E[X^2] - (E[X])^2$ is a function of the first and second moments.

Sample moments

Let $X_1, X_2, \ldots, X_n$ be a random sample. The $k$th sample moment is
$$\hat{\mu}_k = \frac{1}{n}\sum_{i=1}^n X_i^k.$$
Since
$$E[\hat{\mu}_k] = E\left[\frac{1}{n}\sum_{i=1}^n X_i^k\right] = \frac{1}{n}\sum_{i=1}^n E[X_i^k] = E[X^k],$$

it follows that the $k$th sample moment is an unbiased estimator of the $k$th moment of a distribution. Therefore, if one wants to estimate the parameters of a particular distribution, one can write the parameters as functions of the moments of the distribution and then estimate them by their corresponding sample moments. This is known as the method of moments.

Method of Moments: Mean and Variance

Let $X_1, X_2, \ldots, X_n$ be a random sample from any distribution with mean $\mu$ and variance $\sigma^2$.
The method of moments estimators for $\mu$ and $\sigma^2$ are:
$$\hat{\mu} = \bar{X} \quad \text{and} \quad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2.$$
The method of moments estimator for $\mu$ is
$$\hat{\mu} = \hat{\mu}_1 = \frac{1}{n}\sum_{i=1}^n X_i = \bar{X}.$$
Given that $\sigma^2 = E[X^2] - (E[X])^2$, the method of moments estimator for $\sigma^2$ is
$$\hat{\sigma}^2 = \hat{\mu}_2 - (\hat{\mu}_1)^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \left(\frac{1}{n}\sum_{i=1}^n X_i\right)^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2.$$

Note that $E[\hat{\mu}] = E[\bar{X}] = \mu$, so $\hat{\mu}$ is an unbiased estimator, whilst
$$E[\hat{\sigma}^2] = E\left[\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2\right] = \frac{n-1}{n}\sigma^2,$$

so $\hat{\sigma}^2$ is a biased estimator, but is asymptotically unbiased. See Section 9.4, where the properties of $\hat{\sigma}^2$ are explored further.
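
As a quick illustration, here is a minimal R sketch (using simulated data, so the numbers themselves are purely illustrative) computing the method of moments estimates of $\mu$ and $\sigma^2$ for a sample:

```r
# Method of moments estimates of the mean and variance from a sample
set.seed(1)
x <- rnorm(50, mean = 5, sd = 2)      # illustrative data; any i.i.d. sample will do

mu_hat     <- mean(x)                 # first sample moment
sigma2_hat <- mean(x^2) - mean(x)^2   # second sample moment minus (first moment)^2
# equivalently mean((x - mean(x))^2); note the divide-by-n, unlike var(x) which divides by n - 1

c(mu_hat = mu_hat, sigma2_hat = sigma2_hat)
```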

Method of Moments: Binomial distribution

Let $X_1, X_2, \ldots, X_n \sim \mathrm{Bin}(m, \theta)$ where $m$ is known. Find the method of moments estimator for $\theta$.

The first moment (mean) of the Binomial distribution is $m\theta$. Therefore,

$$\hat{\theta} = \frac{\hat{\mu}_1}{m} = \frac{\bar{X}}{m}.$$


Method of Moments: Exponential distribution

Let $X_1, X_2, \ldots, X_n \sim \mathrm{Exp}(\theta)$. Find the method of moments estimator for $\theta$.

For $x > 0$ and $\theta > 0$,
$$f(x|\theta) = \theta e^{-\theta x}.$$
Therefore $E[X] = 1/\theta$, so $1/\hat{\theta} = \bar{X}$ and
$$\hat{\theta} = 1/\bar{X}.$$
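
A minimal R sketch (again with simulated data, so the output is only illustrative) checking the exponential method of moments estimator $\hat{\theta} = 1/\bar{X}$ against the rate used to generate the data:

```r
# Exponential(theta) method of moments: theta_hat = 1 / xbar
set.seed(2)
theta_true <- 1.5
x <- rexp(200, rate = theta_true)   # rate parametrisation matches f(x|theta) = theta * exp(-theta * x)

theta_hat <- 1 / mean(x)
theta_hat                           # should be close to theta_true for a sample of this size
```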

The sampling properties of the $k$th sample moment are fairly desirable:

  • $\hat{\mu}_k$ is an unbiased estimator of $E[X^k]$;
  • By the Central Limit Theorem, $\hat{\mu}_k$ is asymptotically normal if $E[X^{2k}]$ exists;
  • $\hat{\mu}_k$ is a consistent estimator of $E[X^k]$.

If $h$ is a continuous function, then $\hat{\theta} = h(\hat{\mu}_1, \hat{\mu}_2, \ldots, \hat{\mu}_k)$ is a consistent estimator of $\theta = h(\mu_1, \mu_2, \ldots, \mu_k)$, but it may not be an unbiased or an asymptotically normal estimator.

There are often difficulties with the method of moments:

  • Finding $\theta$ as a function of theoretical moments is not always simple;
  • For some models, moments may not exist.

10.3 Maximum Likelihood Estimation

In the study of probability, for random variables $X_1, X_2, \ldots, X_n$ we consider the joint probability mass function or probability density function as just a function of the random variables $X_1, X_2, \ldots, X_n$. Specifically, we assume that the parameter value(s) are completely known.

For example, if $X_1, X_2, \ldots, X_n$ is a random sample from a Poisson distribution with mean $\lambda$, then
$$P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = p_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n) = \frac{e^{-n\lambda}\lambda^{\sum_{i=1}^n x_i}}{\prod_{i=1}^n x_i!}$$

for $\lambda > 0$. See Section 6.4 for the derivation.

However, in the study of statistics, we assume the parameter values are unknown. Therefore, if we are given a specific random sample $x_1, x_2, \ldots, x_n$, then $p(x_1, x_2, \ldots, x_n)$ will take on different values for each possible value of the parameters ($\lambda$ in the Poisson example). Hence, we can consider $p(x_1, x_2, \ldots, x_n)$ to also be a function of the unknown parameter and write $p(x_1, x_2, \ldots, x_n|\lambda)$ to make the dependence on $\lambda$ explicit. In maximum likelihood estimation we choose $\hat{\lambda}$ to be the value of $\lambda$ which most likely produced the random sample $x_1, x_2, \ldots, x_n$, that is, the value of $\lambda$ which maximises $p(x_1, x_2, \ldots, x_n|\lambda)$ for the observed $x_1, x_2, \ldots, x_n$.

Likelihood function

The likelihood function of the random variables $X_1, X_2, \ldots, X_n$ is the joint p.m.f. (discrete case) or joint p.d.f. (continuous case) of the observed data given the parameter $\theta$, that is,
$$L(\theta) = f(x_1, x_2, \ldots, x_n|\theta).$$
Note that if $X_1, X_2, \ldots, X_n$ are a random sample from a distribution with probability function $f(x|\theta)$ then
$$L(\theta) = \prod_{i=1}^n f(x_i|\theta).$$

Maximum likelihood estimator

The maximum likelihood estimator of $\theta$, denoted in shorthand by MLE or m.l.e., is the value $\hat{\theta}$ which maximises $L(\theta)$.

Suppose that we collect a random sample from a Poisson distribution such that $X_1 = 1$, $X_2 = 2$, $X_3 = 3$ and $X_4 = 4$. Find the maximum likelihood estimator of $\lambda$.


The likelihood function is
$$L(\lambda) = p(x_1, x_2, x_3, x_4|\lambda) = p(1, 2, 3, 4|\lambda) = \frac{e^{-4\lambda}\lambda^{10}}{1!\,2!\,3!\,4!}.$$
Since $\log x$ is a monotonic increasing function, the value $\hat{\lambda}$ that maximises $\log L(\lambda)$ will also maximise $L(\lambda)$. Hence calculate
$$\log L(\lambda) = -4\lambda + 10\log\lambda - \log(1!\,2!\,3!\,4!).$$
To maximise $\log L(\lambda)$ we solve
$$\frac{d\log L(\lambda)}{d\lambda} = 0.$$

Now, $\frac{d\log L(\lambda)}{d\lambda} = -4 + \frac{10}{\lambda} = 0$. Hence, $\hat{\lambda} = \frac{5}{2} = 2.5$.
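
A minimal R sketch (the object names are my own) evaluating this log-likelihood on a grid and confirming that the maximum is at $\hat{\lambda} = 2.5$, the sample mean:

```r
# Poisson log-likelihood for the observed data x = (1, 2, 3, 4)
x <- c(1, 2, 3, 4)
loglik <- function(lambda) sum(dpois(x, lambda, log = TRUE))

lambda_grid <- seq(0.5, 6, by = 0.01)
ll <- sapply(lambda_grid, loglik)

plot(lambda_grid, ll, type = "l", xlab = expression(lambda), ylab = "log L")
lambda_grid[which.max(ll)]          # approximately 2.5 = mean(x)
```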

Log likelihood function

If $L(\theta)$ is the likelihood function of $\theta$, then $l(\theta) = \log L(\theta)$ is called the log likelihood function of $\theta$.

Binomial MLE

Let $X \sim \mathrm{Bin}(m, \theta)$. Find the MLE of $\theta$ given observation $x$.

Attempt Example 10.3.5: Binomial MLE and then watch Video 17 for the solutions.

We will use the case $m = 10$ and $x = 3$ to illustrate the calculations.

Video 17: Binomial MLE

Solution to Example 10.3.5: Binomial MLE

Given that $x$ is sampled from the random variable $X$, we have that

$$L(\theta) = \binom{m}{x}\theta^x(1-\theta)^{m-x}, \qquad 0 \le \theta \le 1.$$

In the case $m = 10$ and $x = 3$ the likelihood becomes $L(\theta) = 120\,\theta^3(1-\theta)^7$, and this is illustrated in Figure 10.1.

Figure 10.1: Likelihood function.

Take the derivative of $L(\theta)$ (using the product rule):

$$\frac{dL(\theta)}{d\theta} = \binom{m}{x}x\theta^{x-1}(1-\theta)^{m-x} - \binom{m}{x}\theta^x(m-x)(1-\theta)^{m-x-1} = \binom{m}{x}\theta^{x-1}(1-\theta)^{m-x-1}\left[x(1-\theta) - (m-x)\theta\right].$$

Setting $\frac{dL(\theta)}{d\theta} = 0$, we obtain

$$x(1-\theta) - (m-x)\theta = 0.$$

Hence, $\hat{\theta} = \frac{x}{m}$ is a possible value for the MLE of $\theta$.

Since $L(\theta)$ is a continuous function over $[0, 1]$, the maximum must exist at either the stationary point or at one of the endpoints of the interval. Given $L(0) = 0$, $L(1) = 0$ and $L(x/m) > 0$, it follows that $\hat{\theta} = \frac{x}{m}$ is the MLE of $\theta$.

In the illustrative example, $m = 10$ and $x = 3$, giving $\hat{\theta} = \frac{3}{10} = 0.3$. In Figure 10.2 the MLE is marked on the plot of the likelihood function.

Figure 10.2: Likelihood function with MLE at 0.3.

It is easier to use the log-likelihood $l(\theta)$ to derive the MLE.

We have that

$$l(\theta) = \log\left[\binom{m}{x}\theta^x(1-\theta)^{m-x}\right] = \log\binom{m}{x} + x\log\theta + (m-x)\log(1-\theta).$$

In the case $m = 10$ and $x = 3$ the log-likelihood becomes $l(\theta) = \log 120 + 3\log\theta + 7\log(1-\theta)$, and this is illustrated in Figure 10.3.

Figure 10.3: Log-likelihood function.

Take the derivative of $l(\theta)$:

$$\frac{dl(\theta)}{d\theta} = 0 + \frac{x}{\theta} - \frac{m-x}{1-\theta} = \frac{x(1-\theta) - (m-x)\theta}{\theta(1-\theta)}.$$
Setting $\frac{dl(\theta)}{d\theta} = 0$ again requires solving
$$x(1-\theta) - (m-x)\theta = 0,$$

giving $\hat{\theta} = \frac{x}{m}$.

In the illustrative example, $m = 10$ and $x = 3$, giving $\hat{\theta} = \frac{3}{10} = 0.3$. In Figure 10.4 the MLE is marked on the plot of the log-likelihood function.

Figure 10.4: Log-likelihood function with MLE at 0.3.
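
A minimal R sketch (my own code, not the module's) reproducing plots along the lines of Figures 10.1–10.4 for $m = 10$, $x = 3$, with the MLE $\hat{\theta} = 0.3$ marked:

```r
# Binomial likelihood and log-likelihood for m = 10, x = 3
m <- 10; x <- 3
theta <- seq(0.001, 0.999, by = 0.001)

L <- dbinom(x, size = m, prob = theta)               # likelihood L(theta)
l <- dbinom(x, size = m, prob = theta, log = TRUE)   # log-likelihood l(theta)

par(mfrow = c(1, 2))
plot(theta, L, type = "l", xlab = expression(theta), ylab = "L")
abline(v = x / m, lty = 2)                           # MLE at x/m = 0.3
plot(theta, l, type = "l", xlab = expression(theta), ylab = "l")
abline(v = x / m, lty = 2)
```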


The following R Shiny app allows you to investigate the MLE for data from a geometric distribution, $X \sim \mathrm{Geom}(p)$. The success probability of the geometric distribution can be varied from 0.01 to 1. The likelihood, log-likelihood and relative likelihood (likelihood divided by its maximum) functions can be plotted. Note that as the number of observations becomes large, the likelihood becomes very small, and equal to 0 to computer accuracy. You will observe that the likelihood function becomes more focussed about the MLE as the sample size increases. Also, the MLE will generally be closer to the true value of $p$ used to generate the data as the sample size increases.

R Shiny app: MLE Geometric Distribution
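
If you want to experiment outside the app, here is a minimal R sketch along the same lines (assumptions: a true success probability of $p = 0.3$, and R's Geom($p$) convention of counting failures before the first success) plotting the relative likelihood for two sample sizes:

```r
# Relative likelihood of p for simulated geometric samples of different sizes
set.seed(3)
p_true <- 0.3
p_grid <- seq(0.01, 1, by = 0.01)

rel_lik <- function(x) {
  ll <- sapply(p_grid, function(p) sum(dgeom(x, p, log = TRUE)))
  exp(ll - max(ll))                  # likelihood divided by its maximum
}

plot(p_grid, rel_lik(rgeom(10, p_true)), type = "l", lty = 2,
     xlab = "p", ylab = "relative likelihood")
lines(p_grid, rel_lik(rgeom(100, p_true)))   # larger sample: more concentrated about the MLE
```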

Poisson MLE

Let $X_1, X_2, \ldots, X_n$ be a random sample from a Poisson distribution with mean $\lambda$. Find the MLE of $\lambda$.


We have
$$L(\lambda) = p(x_1, x_2, \ldots, x_n|\lambda) = \frac{e^{-n\lambda}\lambda^{\sum_{i=1}^n x_i}}{\prod_{i=1}^n x_i!},$$
where $\lambda > 0$. So,
$$l(\lambda) = -n\lambda + \sum_{i=1}^n x_i \log\lambda - \log\prod_{i=1}^n x_i!.$$
Now
$$\frac{dl(\lambda)}{d\lambda} = -n + \frac{\sum_{i=1}^n x_i}{\lambda}.$$
Setting $\frac{dl(\lambda)}{d\lambda} = 0$ and solving yields
$$\hat{\lambda} = \frac{\sum_{i=1}^n x_i}{n} = \bar{x}.$$

Since $\frac{d^2 l(\lambda)}{d\lambda^2} = -\frac{\sum_{i=1}^n x_i}{\lambda^2} < 0$, it follows that $\hat{\lambda} = \bar{X}$ is a maximum, and so is the MLE of $\lambda$.
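
As a quick numerical cross-check (simulated data, so the exact figures are illustrative), applying R's optimise() to the Poisson log-likelihood should return the sample mean:

```r
# Numerical check that the Poisson MLE equals the sample mean
set.seed(4)
x <- rpois(100, lambda = 3.2)

loglik <- function(lambda) sum(dpois(x, lambda, log = TRUE))
opt <- optimise(loglik, interval = c(0.01, 20), maximum = TRUE)

c(numerical = opt$maximum, sample_mean = mean(x))    # should agree to optimiser tolerance
```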


In both Example 10.3.5 and Example 10.3.6, we note that terms in the likelihood which do not involve the parameter of interest play no role in the calculation of the MLE, for example, $\binom{m}{x}$ in the binomial and $\left[\prod_{i=1}^n x_i!\right]^{-1}$ in the Poisson. Therefore it is sufficient to consider a function $H(\theta)$ which is proportional to the likelihood, that is, there exists $K > 0$ such that
$$L(\theta) = K H(\theta) \quad \text{for all } \theta.$$
We write $L(\theta) \propto H(\theta)$ and note that if $h(\theta) = \log H(\theta)$, then
$$l(\theta) = \log K + h(\theta)$$
and
$$\frac{d}{d\theta}l(\theta) = \frac{d}{d\theta}h(\theta).$$

MLE of mean of a Normal random variable

Let $X_1, X_2, \ldots, X_n$ be a random sample from $N(\theta, 1)$ with mean $\theta$. Find the MLE of $\theta$ given observations $x_1, x_2, \ldots, x_n$.


For each of the $x_i$:
$$f(x_i|\theta) = \frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{1}{2}(x_i-\theta)^2\right\}.$$
Thus:
$$L(\theta) = (2\pi)^{-n/2}\prod_{i=1}^n \exp\left\{-\frac{1}{2}(x_i-\theta)^2\right\}$$
and so,
$$L(\theta) \propto \prod_{i=1}^n \exp\left\{-\frac{1}{2}(x_i-\theta)^2\right\} = \exp\left\{-\frac{1}{2}\sum_{i=1}^n (x_i-\theta)^2\right\}$$
and
$$l(\theta) = \log L(\theta) = -\frac{1}{2}\sum_{i=1}^n (x_i-\theta)^2 + \text{constant}.$$
Hence
$$\frac{dl(\theta)}{d\theta} = \sum_{i=1}^n (x_i-\theta) = 0$$
gives the stationary point of the likelihood, with
$$\hat{\theta} = \frac{\sum_{i=1}^n x_i}{n} = \bar{x}. \tag{10.1}$$
It is easily verified that $\hat{\theta}$ given in (10.1) is a maximum since
$$\frac{d^2 l(\theta)}{d\theta^2} = -n < 0.$$

So $\hat{\theta} = \bar{x}$ is the MLE of $\theta$.
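
A minimal R sketch (simulated data; purely illustrative) showing that maximising the full $N(\theta, 1)$ log-likelihood and maximising only the kernel $h(\theta) = -\frac{1}{2}\sum_{i=1}^n (x_i - \theta)^2$ give the same $\hat{\theta} = \bar{x}$, as the proportionality argument above promises:

```r
# Full log-likelihood vs. kernel (constants dropped) for a N(theta, 1) sample
set.seed(5)
x <- rnorm(30, mean = 1.7, sd = 1)

full_ll <- function(theta) sum(dnorm(x, mean = theta, sd = 1, log = TRUE))
kernel  <- function(theta) -0.5 * sum((x - theta)^2)

c(full   = optimise(full_ll, c(-10, 10), maximum = TRUE)$maximum,
  kernel = optimise(kernel,  c(-10, 10), maximum = TRUE)$maximum,
  xbar   = mean(x))    # all three agree (up to optimiser tolerance)
```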

In Example 10.3.5, Example 10.3.6 and Example 10.3.7 the maximum likelihood estimators coincide with the method of moments estimators. In Example 10.3.8 we consider a situation where the maximum likelihood estimator is very different from the method of moments estimator.

MLE for Uniform random variables

Let $U_1, U_2, \ldots, U_n$ be i.i.d. samples from $U[0, \theta]$. Given observations $u_1, u_2, \ldots, u_n$:

  1. Find the MLE of $\theta$.
  2. Find the method of moments estimator of $\theta$.

Attempt Example 10.3.8: MLE for Uniform random variables and then watch Video 18 for the solutions.

We will use the data $u = (u_1, u_2, \ldots, u_5) = (1.30, 2.12, 2.40, 0.98, 1.43)$ as an illustrative example. These 5 observations were simulated from $U(0, 3)$.

Video 18: MLE for Uniform random variables

Solution to Example 10.3.8: MLE for Uniform random variables
  1. If $U_i \sim U[0, \theta]$, then its p.d.f. is given by
    $$f(u|\theta) = \begin{cases} \frac{1}{\theta}, & \text{if } 0 \le u \le \theta, \\ 0, & \text{otherwise.} \end{cases}$$
    Note that if $\theta < u_i$ for some $i$, then $L(\theta) = 0$. Since we want to maximise $L(\theta)$, and $L(\theta)$ is positive whenever $0 \le u_i \le \theta$ for all $i = 1, \ldots, n$, we can restrict attention to such $\theta$; then
    $$L(\theta) = \prod_{i=1}^n f(u_i|\theta) = \prod_{i=1}^n \frac{1}{\theta} = \frac{1}{\theta^n}.$$
    Hence, $L(\theta)$ is a decreasing function of $\theta$ and its maximum must occur at the smallest value that $\theta$ can take. Since $\theta \ge \max\{u_1, u_2, \ldots, u_n\}$, the MLE of $\theta$ is $\hat{\theta} = \max\{u_1, u_2, \ldots, u_n\}$.

Figure 10.5 shows the likelihood function $L(\theta)$ using the data $u = (1.30, 2.12, 2.40, 0.98, 1.43)$.

Figure 10.5: Likelihood function for $u = (1.30, 2.12, 2.40, 0.98, 1.43)$.

  2. By comparison, the method of moments estimator $\check{\theta}$ of $\theta$ uses $E[U] = \frac{0+\theta}{2} = \frac{\theta}{2}$ and hence is given by $\check{\theta} = 2\bar{u}$. Note that if $2\bar{u} < \max\{u_1, u_2, \ldots, u_n\}$ then $\check{\theta}$ will not be consistent with the data, i.e. $L(\check{\theta}) = 0$.

To observe the difference between the MLE and the method of moments estimator, using $u = (1.30, 2.12, 2.40, 0.98, 1.43)$:

  • MLE: $\hat{\theta} = \max\{1.30, 2.12, 2.40, 0.98, 1.43\} = 2.40$;
  • Method of Moments: $\check{\theta} = 2\bar{u} = 2(1.646) = 3.292$.
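
A minimal R sketch reproducing both estimates from the illustrative data:

```r
# MLE vs. method of moments for U[0, theta] with the illustrative data
u <- c(1.30, 2.12, 2.40, 0.98, 1.43)

theta_mle <- max(u)        # 2.40
theta_mom <- 2 * mean(u)   # 3.292; a value below max(u) would have zero likelihood

c(MLE = theta_mle, MoM = theta_mom)
```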


10.4 Comments on the Maximum Likelihood Estimator

The following points on the maximum likelihood estimator are worth noting:

  1. When finding the MLE you want to maximise the likelihood function. However, it is often more convenient to maximise the log-likelihood function instead. Both functions are maximised by the same parameter values;
  2. MLEs may not exist, and if they do, they may not be unique;
  3. The likelihood function is NOT the probability distribution for $\theta$. The correct interpretation of the likelihood function is that it is the probability of obtaining the observed data if $\theta$ were the true value of the parameter. We assume $\theta$ is an unknown constant, not a random variable. In Bayesian statistics we will consider the parameter to be random;
  4. The MLE has some nice large sample properties, including consistency, asymptotic normality and other optimality properties;
  5. The MLE can be used for non-independent data or non-identically distributed data as well;
  6. Often the MLE cannot be found using calculus techniques and must be found numerically. It is often useful, if we can, to plot the likelihood function to find good starting points for finding the MLE numerically (a short sketch illustrating this follows this list);
  7. The MLE satisfies a useful invariance property. Namely, if $\phi = h(\theta)$, where $h(\theta)$ is a one-to-one function of $\theta$, then the MLE of $\phi$ is given by $\hat{\phi} = h(\hat{\theta})$. For example, if $\phi = \frac{1}{\theta}$ and $\hat{\theta} = \bar{X}$, then $\hat{\phi} = \frac{1}{\hat{\theta}} = \frac{1}{\bar{X}}$.
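
To illustrate point 6, here is a minimal R sketch (a hypothetical example of my choosing: a Gamma sample with both parameters unknown, where the MLE has no closed form) finding the MLE numerically, using method of moments estimates as starting values:

```r
# Numerical MLE for a Gamma(shape, rate) sample: no closed-form solution exists
set.seed(6)
x <- rgamma(200, shape = 2, rate = 1.5)

negloglik <- function(logpar) {       # work on the log scale so both parameters stay positive
  par <- exp(logpar)
  -sum(dgamma(x, shape = par[1], rate = par[2], log = TRUE))
}

# Method of moments estimates make convenient starting values
start <- log(c(shape = mean(x)^2 / var(x), rate = mean(x) / var(x)))
fit <- optim(start, negloglik)        # optim() minimises, hence the negative log-likelihood

exp(fit$par)                          # numerical MLEs of (shape, rate)
```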

Student Exercises

Attempt the exercises below.


Let $X_1, X_2, \ldots, X_n$ be independent random variables, each with p.d.f. $f(x|\theta) = \theta^2 x \exp(-\theta x)$, for $x > 0$. Use the method of moments to determine an estimator of $\theta$.

Remember that if $X \sim \mathrm{Gamma}(\alpha, \beta)$ then $E[X] = \alpha/\beta$.

Solution to Exercise 10.1.
The distribution is $\mathrm{Gamma}(\alpha, \beta)$ with $\alpha = 2$ and $\beta = \theta$ since
$$\frac{\beta^\alpha x^{\alpha-1}e^{-\beta x}}{\Gamma(\alpha)} = \frac{\theta^2 x^{2-1}e^{-\theta x}}{\Gamma(2)} = \theta^2 x e^{-\theta x}.$$
For the $\mathrm{Gamma}(\alpha, \beta)$ distribution,
$$E[X] = \frac{\alpha}{\beta} = \frac{2}{\theta}.$$
Alternatively this can be obtained directly using integration by parts.
By the method of moments, set $E[X] = \frac{\alpha}{\beta} = \frac{2}{\theta} = \bar{x}$, giving
$$\hat{\theta} = \frac{2}{\bar{x}}.$$



Let $X_1, X_2, \ldots, X_n$ be a random sample from the distribution with p.d.f.
$$f(x|\theta) = \theta e^{-(x-1)\theta}, \qquad x > 1,$$

where $\theta > 0$ is an unknown parameter. Find the MLE of $\theta$.

Solution to Exercise 10.2.
The likelihood is
$$L(\theta) = \prod_{i=1}^n f(x_i|\theta) = \prod_{i=1}^n \theta e^{-(x_i-1)\theta} = \theta^n \exp\left\{-\sum_{i=1}^n (x_i-1)\theta\right\} = \theta^n \exp\left\{-\left(\sum_{i=1}^n x_i - n\right)\theta\right\}.$$
Thus
$$l(\theta) = \log L(\theta) = n\log\theta - \left(\sum_{i=1}^n x_i - n\right)\theta,$$
so
$$l'(\theta) = \frac{n}{\theta} - \sum_{i=1}^n x_i + n.$$
For a stationary point,
$$l'(\theta) = 0 \;\Leftrightarrow\; \frac{n}{\theta} - \sum_{i=1}^n x_i + n = 0 \;\Leftrightarrow\; \theta = \frac{n}{\sum_{i=1}^n x_i - n} = \frac{1}{\bar{x}-1}.$$
This corresponds to a maximum since
$$l''(\theta) = -\frac{n}{\theta^2} < 0.$$

Thus the MLE of $\theta$ is $\hat{\theta} = \dfrac{1}{\bar{x}-1}$.
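
A minimal R sketch (simulated data from this shifted exponential; purely a numerical cross-check) confirming the formula $\hat{\theta} = 1/(\bar{x} - 1)$:

```r
# Shifted exponential: f(x|theta) = theta * exp(-(x - 1) * theta), x > 1
set.seed(7)
theta_true <- 2
x <- 1 + rexp(500, rate = theta_true)   # shift a standard exponential sample by 1

loglik <- function(theta) sum(log(theta) - (x - 1) * theta)
numerical <- optimise(loglik, c(0.01, 20), maximum = TRUE)$maximum

c(formula = 1 / (mean(x) - 1), numerical = numerical)   # should agree
```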



(a) Let $X_1, X_2, \ldots, X_n$ be a random sample from the distribution having p.d.f. $f(x|\theta) = \frac{1}{2}(1+\theta x)$, $-1 < x < 1$, where $\theta \in (-1, 1)$ is an unknown parameter. Show that the method of moments estimator for $\theta$ is
$$\tilde{\theta}_1 = 3\bar{X},$$
where $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$.
(b) Suppose instead that it is observed only whether a given observation is positive or negative. For $i = 1, 2, \ldots, n$, let
$$Y_i = \begin{cases} 1 & \text{if } X_i \ge 0, \\ 0 & \text{if } X_i < 0. \end{cases}$$
Show that the method of moments estimator for $\theta$ based on $Y_1, Y_2, \ldots, Y_n$ is
$$\tilde{\theta}_2 = 4\bar{Y} - 2,$$

where $\bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i$.
(c) Justifying your answers,

  1. which, if either, of the estimators $\tilde{\theta}_1$ and $\tilde{\theta}_2$ are unbiased?
  2. which of the estimators $\tilde{\theta}_1$ and $\tilde{\theta}_2$ is more efficient?
  3. which, if either, of the estimators $\tilde{\theta}_1$ and $\tilde{\theta}_2$ are mean-square consistent?
Solution to Exercise 10.3.
  (a) Since
    $$E[X_1] = \int_{-1}^{1} \frac{x}{2}(1+\theta x)\,dx = \left[\frac{x^2}{4} + \frac{\theta x^3}{6}\right]_{-1}^{1} = \frac{\theta}{3},$$
    the method of moments estimator is obtained by solving $\bar{X} = \frac{\theta}{3}$, yielding $\tilde{\theta}_1 = 3\bar{X}$.
  (b) First note that
    $$P(X_1 > 0) = \int_0^1 \frac{1}{2}(1+\theta x)\,dx = \left[\frac{x}{2} + \frac{\theta x^2}{4}\right]_0^1 = \frac{1}{2}\left(1+\frac{\theta}{2}\right).$$
    Thus
    $$E[Y_1] = P(X_1 > 0) = \frac{1}{2}\left(1+\frac{\theta}{2}\right),$$
    so setting $\bar{Y} = \frac{1}{2}\left(1+\frac{\theta}{2}\right)$ yields
    $$\tilde{\theta}_2 = 4\bar{Y} - 2.$$
  (c) 1. Both estimators are unbiased:
    $$E[\tilde{\theta}_1] = E\left[\frac{3}{n}\sum_{i=1}^n X_i\right] = \frac{3}{n}\sum_{i=1}^n E[X_i] = \frac{3}{n}\cdot n \cdot \frac{\theta}{3} = \theta,$$
    so $\tilde{\theta}_1$ is unbiased;
    $$E[\tilde{\theta}_2] = E\left[\frac{4}{n}\sum_{i=1}^n Y_i - 2\right] = \frac{4}{n}\sum_{i=1}^n E[Y_i] - 2 = \frac{4}{n}\cdot n \cdot \frac{1}{2}\left(1+\frac{\theta}{2}\right) - 2 = \theta,$$
    so $\tilde{\theta}_2$ is also unbiased.
  2. $$\mathrm{var}(\tilde{\theta}_1) = \mathrm{var}\left(\frac{3}{n}\sum_{i=1}^n X_i\right) = \frac{9}{n^2}\sum_{i=1}^n \mathrm{var}(X_i) = \frac{9}{n}\mathrm{var}(X_1).$$
    Now
    $$E[X_1^2] = \int_{-1}^{1}\frac{x^2}{2}(1+\theta x)\,dx = \left[\frac{x^3}{6} + \frac{\theta x^4}{8}\right]_{-1}^{1} = \frac{1}{3}.$$
    Thus
    $$\mathrm{var}(X_1) = \frac{1}{3} - \frac{\theta^2}{9} = \frac{1}{9}(3-\theta^2),$$
    so
    $$\mathrm{var}(\tilde{\theta}_1) = \frac{1}{n}(3-\theta^2).$$
    Similarly,
    $$\mathrm{var}(\tilde{\theta}_2) = \mathrm{var}(4\bar{Y}-2) = 16\,\mathrm{var}(\bar{Y}) = \frac{16}{n}\mathrm{var}(Y_1).$$
    Now $Y_1 \sim \mathrm{Bin}\left(1, \frac{1}{2}\left(1+\frac{\theta}{2}\right)\right)$, so
    $$\mathrm{var}(Y_1) = \frac{1}{2}\left(1+\frac{\theta}{2}\right)\cdot\frac{1}{2}\left(1-\frac{\theta}{2}\right) = \frac{1}{4}\left(1-\frac{\theta^2}{4}\right) = \frac{1}{16}(4-\theta^2),$$
    thus
    $$\mathrm{var}(\tilde{\theta}_2) = \frac{1}{n}(4-\theta^2).$$
    Hence $\mathrm{var}(\tilde{\theta}_1) < \mathrm{var}(\tilde{\theta}_2)$, so $\tilde{\theta}_1$ is more efficient than $\tilde{\theta}_2$.
  3. Since $\tilde{\theta}_i$ is unbiased, $\mathrm{MSE}(\tilde{\theta}_i) = \mathrm{var}(\tilde{\theta}_i)$ for $i = 1, 2$. Thus
    $$\mathrm{MSE}(\tilde{\theta}_1) = \frac{1}{n}(3-\theta^2) \to 0 \text{ as } n \to \infty,$$
    so $\tilde{\theta}_1$ is mean-square consistent. Also,
    $$\mathrm{MSE}(\tilde{\theta}_2) = \frac{1}{n}(4-\theta^2) \to 0 \text{ as } n \to \infty,$$
    so $\tilde{\theta}_2$ is mean-square consistent.