Chapter 10 Techniques for Deriving Estimators
10.1 Introduction
In this section we introduce two techniques for deriving estimators:
- The Method of Moments is a simple, intuitive approach, but it has limitations beyond simple random sampling (i.i.d. observations).
- Maximum Likelihood Estimation is an approach which can be extended to complex modelling scenarios; likelihood-based estimation will be central to statistical inference procedures throughout not only this module but the whole course.
10.2 Method of Moments
Let $X$ be a random variable.
Moments
If $E[X^k]$ exists, in the sense that it is finite, then $E[X^k]$ is said to be the $k$th moment of the random variable $X$.
For example,
$E[X] = \mu$ is the first moment of $X$;
$E[X^2]$ is the second moment of $X$.
Note that $\text{var}(X) = E[X^2] - (E[X])^2$ is a function of the first and second moments.
Sample moments
Let $X_1, X_2, \ldots, X_n$ be a random sample. The $k$th sample moment is
$$\hat{\mu}_k = \frac{1}{n} \sum_{i=1}^n X_i^k.$$
Since $E[\hat{\mu}_k] = \frac{1}{n}\sum_{i=1}^n E[X_i^k] = E[X^k]$, it follows that the $k$th sample moment is an unbiased estimator of the $k$th moment of a distribution. Therefore, if one wants to estimate the parameters of a particular distribution, one can write the parameters as functions of the moments of the distribution and then estimate them by the corresponding sample moments. This is known as the method of moments.
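A minimal R sketch of the $k$th sample moment (the function name sample_moment is illustrative only, not part of the notes):
```r
# kth sample moment of a numeric vector x: (1/n) * sum(x^k)
sample_moment <- function(x, k) {
  mean(x^k)
}

x <- c(1.2, 0.7, 2.5, 1.9)
sample_moment(x, 1)  # first sample moment (the sample mean)
sample_moment(x, 2)  # second sample moment
```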
Method of Moments: Mean and Variance
Let $X_1, X_2, \ldots, X_n$ be a random sample from any distribution with mean $\mu$ and variance $\sigma^2$. The method of moments estimators for $\mu$ and $\sigma^2$ are
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^n X_i = \bar{X} \qquad \text{and} \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \bar{X}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2.$$
Note that $E[\hat{\mu}] = E[\bar{X}] = \mu$, so $\hat{\mu}$ is an unbiased estimator, whilst
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2, \qquad \text{with } E[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2,$$
is a biased estimator, but is asymptotically unbiased. See Section 9.4 where the properties of $\hat{\sigma}^2$ are explored further.
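As a rough illustration, a short R sketch of these two estimators on simulated data (the function name mom_mean_var and the chosen parameter values are assumptions for this example only):
```r
# Method of moments estimates of the mean and variance from an i.i.d. sample x.
mom_mean_var <- function(x) {
  mu_hat     <- mean(x)               # first sample moment
  sigma2_hat <- mean(x^2) - mu_hat^2  # second sample moment minus squared mean
  c(mu_hat = mu_hat, sigma2_hat = sigma2_hat)
}

set.seed(1)
x <- rnorm(200, mean = 5, sd = 2)  # simulated sample with mu = 5, sigma^2 = 4
mom_mean_var(x)
```
Note that sigma2_hat uses the divisor $n$, so it is the (slightly biased) method of moments estimator rather than the sample variance with divisor $n-1$.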
Method of Moments: Binomial distribution
Let $X_1, X_2, \ldots, X_n \sim \text{Bin}(m, \theta)$ where $m$ is known. Find the method of moments estimator for $\theta$.
The first moment (mean) of the Binomial distribution is $m\theta$. Therefore, setting $m\theta = \bar{X}$ gives
$$\hat{\theta} = \frac{\bar{X}}{m}.$$
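A quick numerical check of $\hat{\theta} = \bar{X}/m$ on simulated data (the parameter values here are illustrative assumptions):
```r
# Method of moments for Bin(m, theta) with m known: theta_hat = xbar / m.
set.seed(2)
m     <- 10
theta <- 0.3
x <- rbinom(50, size = m, prob = theta)  # 50 simulated observations
mean(x) / m                              # should be close to 0.3
```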
Method of Moments: Exponential distribution
Let $X_1, X_2, \ldots, X_n \sim \text{Exp}(\theta)$. Find the method of moments estimator for $\theta$.
For $x > 0$ and $\theta > 0$, the p.d.f. is $f(x \,|\, \theta) = \theta e^{-\theta x}$, so the first moment is $E[X] = \frac{1}{\theta}$. Setting $\frac{1}{\theta} = \bar{X}$ gives the method of moments estimator $\hat{\theta} = \frac{1}{\bar{X}}$.

The sampling properties of the $k$th sample moment $\hat{\mu}_k$ are fairly desirable:
- $\hat{\mu}_k$ is an unbiased estimator of $E[X^k]$;
- By the Central Limit Theorem, $\hat{\mu}_k$ is asymptotically normal if $E[X^{2k}]$ exists;
- $\hat{\mu}_k$ is a consistent estimator of $E[X^k]$.
If $h$ is a continuous function, then $\hat{\theta} = h(\hat{\mu}_1, \hat{\mu}_2, \ldots, \hat{\mu}_k)$ is a consistent estimator of $\theta = h(\mu_1, \mu_2, \ldots, \mu_k)$, but it may not be an unbiased or an asymptotically normal estimator.
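To illustrate this consistency, a small R simulation for the exponential example above (assuming the rate parameterisation $\text{Exp}(\theta)$ with mean $1/\theta$, so $\hat{\theta} = 1/\bar{X}$; the true value $\theta = 2$ is an assumption for the example):
```r
# The method of moments estimator 1/xbar should approach theta as n grows.
set.seed(3)
theta <- 2
for (n in c(10, 100, 10000)) {
  x <- rexp(n, rate = theta)
  cat("n =", n, " theta_hat =", 1 / mean(x), "\n")
}
```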
There are often difficulties with the method of moments:
- Finding $\theta$ as a function of theoretical moments is not always simple;
- For some models, moments may not exist.
10.3 Maximum likelihood estimation
In the study of probability, for random variables $X_1, X_2, \ldots, X_n$ we consider the joint probability mass function or probability density function as just a function of the random variables $X_1, X_2, \ldots, X_n$. Specifically, we assume that the parameter value(s) are completely known.
For example, if $X_1, X_2, \ldots, X_n$ is a random sample from a Poisson distribution with mean $\lambda$, then
$$p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^n \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} = \frac{\lambda^{\sum_{i=1}^n x_i} \, e^{-n\lambda}}{\prod_{i=1}^n x_i!}$$
for $\lambda > 0$. See Section 6.4 for the derivation.
However, in the study of statistics, we assume the parameter values are unknown. Therefore, if we are given a specific random sample $x_1, x_2, \ldots, x_n$, then $p(x_1, x_2, \ldots, x_n)$ will take on different values for each possible value of the parameters ($\lambda$ in the Poisson example). Hence, we can consider $p(x_1, x_2, \ldots, x_n)$ to also be a function of the unknown parameter and write $p(x_1, x_2, \ldots, x_n \,|\, \lambda)$ to make the dependence on $\lambda$ explicit. In maximum likelihood estimation we choose $\hat{\lambda}$ to be the value of $\lambda$ which most likely produced the random sample $x_1, x_2, \ldots, x_n$, that is, the value of $\lambda$ which maximises $p(x_1, x_2, \ldots, x_n \,|\, \lambda)$ for the observed $x_1, x_2, \ldots, x_n$.
The likelihood function of the random variables $X_1, X_2, \ldots, X_n$ is the joint p.m.f. (discrete case) or joint p.d.f. (continuous case) of the observed data given the parameter $\theta$, that is
$$L(\theta) = p(x_1, x_2, \ldots, x_n \,|\, \theta) \quad \text{(discrete case)} \qquad \text{or} \qquad L(\theta) = f(x_1, x_2, \ldots, x_n \,|\, \theta) \quad \text{(continuous case)}.$$
Maximum likelihood estimator
The maximum likelihood estimator, denoted shorthand by MLE or m.l.e., of $\theta$ is the value $\hat{\theta}$ which maximises $L(\theta)$.
Suppose that we collect a random sample from a Poisson distribution such that $X_1 = 1$, $X_2 = 2$, $X_3 = 3$ and $X_4 = 4$. Find the maximum likelihood estimator of $\lambda$.
The likelihood is
$$L(\lambda) = \prod_{i=1}^4 \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} = \frac{\lambda^{10} e^{-4\lambda}}{1!\,2!\,3!\,4!},$$
so $\log L(\lambda) = 10\log\lambda - 4\lambda - \log(1!\,2!\,3!\,4!)$. Now, $\frac{d \log L(\lambda)}{d\lambda} = -4 + \frac{10}{\lambda} = 0$. Hence, $\hat{\lambda} = \frac{5}{2} = 2.5$.
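The same answer can be checked numerically in R; this is only a sketch using base R's optimize(), not part of the worked example:
```r
# Numerical check of the Poisson MLE for the data x = (1, 2, 3, 4).
x <- c(1, 2, 3, 4)
loglik <- function(lambda) sum(dpois(x, lambda, log = TRUE))

# optimize() maximises over an interval; the maximiser should equal mean(x) = 2.5.
optimize(loglik, interval = c(0.01, 10), maximum = TRUE)$maximum
mean(x)
```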
Log likelihood function
If $L(\theta)$ is the likelihood function of $\theta$, then $l(\theta) = \log L(\theta)$ is called the log likelihood function of $\theta$.
Binomial MLE
Let $X \sim \text{Bin}(m, \theta)$. Find the MLE of $\theta$ given observation $x$.
Attempt Example 10.3.5: Binomial MLE and then watch Video 17 for the solutions.
We will use the case $m = 10$ and $x = 3$ to illustrate the calculations.
Video 17: Binomial MLE
Solution to Example 10.3.5: Binomial MLE
Given $x$ is sampled from the random variable $X$, we have that
$$L(\theta) = \binom{m}{x} \theta^x (1 - \theta)^{m - x}, \qquad 0 \le \theta \le 1.$$
In the case $m = 10$ and $x = 3$ the likelihood becomes $L(\theta) = 120\,\theta^3 (1 - \theta)^7$ and this is illustrated in Figure 10.1.

Figure 10.1: Likelihood function.
Take the derivative of $L(\theta)$ (using the product rule):
$$\frac{dL(\theta)}{d\theta} = \binom{m}{x}\left[ x\theta^{x-1}(1-\theta)^{m-x} - (m-x)\theta^{x}(1-\theta)^{m-x-1}\right].$$
Setting $\frac{dL(\theta)}{d\theta} = 0$, we obtain
$$x\theta^{x-1}(1-\theta)^{m-x} = (m-x)\theta^{x}(1-\theta)^{m-x-1} \quad \Rightarrow \quad x(1-\theta) = (m-x)\theta \quad \Rightarrow \quad \theta = \frac{x}{m}.$$
Hence, $\hat{\theta} = \frac{x}{m}$ is a possible value for the MLE of $\theta$.
Since $L(\theta)$ is a continuous function over $[0,1]$, the maximum must exist at either the stationary point or at one of the endpoints of the interval. Given $L(0) = 0$, $L(1) = 0$ and $L\!\left(\frac{x}{m}\right) > 0$, it follows that $\hat{\theta} = \frac{x}{m}$ is the MLE of $\theta$.
In the illustrative example, $m = 10$ and $x = 3$, giving $\hat{\theta} = \frac{3}{10} = 0.3$. In Figure 10.2 the MLE is marked on the plot of the likelihood function.

Figure 10.2: Likelihood function with MLE at 0.3.
It is easier to use the log-likelihood $l(\theta)$ to derive the MLE.
We have that
$$l(\theta) = \log L(\theta) = \log\binom{m}{x} + x\log\theta + (m-x)\log(1-\theta).$$
In the case $m = 10$ and $x = 3$ the log-likelihood becomes $l(\theta) = \log 120 + 3\log\theta + 7\log(1-\theta)$ and this is illustrated in Figure 10.3.

Figure 10.3: Log-likelihood function.
Take the derivative of $l(\theta)$:
$$\frac{dl(\theta)}{d\theta} = \frac{x}{\theta} - \frac{m-x}{1-\theta} = 0 \quad \Rightarrow \quad x(1-\theta) = (m-x)\theta,$$
giving $\hat{\theta} = \frac{x}{m}$.
In the illustrative example, $m = 10$ and $x = 3$, giving $\hat{\theta} = \frac{3}{10} = 0.3$. In Figure 10.4 the MLE is marked on the plot of the log-likelihood function.

Figure 10.4: Log-likelihood function with MLE at 0.3.
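A short base-R sketch that reproduces plots along the lines of Figures 10.2 and 10.4 and marks the MLE at $x/m = 0.3$ (the plotting choices are my own, not taken from the notes):
```r
# Likelihood and log-likelihood for the Bin(10, theta) example with x = 3.
m <- 10; x <- 3
lik    <- function(theta) choose(m, x) * theta^x * (1 - theta)^(m - x)
loglik <- function(theta) log(choose(m, x)) + x * log(theta) + (m - x) * log(1 - theta)

theta <- seq(0.001, 0.999, by = 0.001)
par(mfrow = c(1, 2))
plot(theta, lik(theta),    type = "l", ylab = "L(theta)", main = "Likelihood")
abline(v = x / m, lty = 2)   # MLE at x/m = 0.3
plot(theta, loglik(theta), type = "l", ylab = "l(theta)", main = "Log-likelihood")
abline(v = x / m, lty = 2)
```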
The following R Shiny app allows you to investigate the MLE for data from a geometric distribution, $X \sim \text{Geom}(p)$. The success probability of the geometric distribution can be varied from 0.01 to 1. The likelihood, log-likelihood and relative likelihood (likelihood divided by its maximum) functions can be plotted. Note that as the number of observations becomes large the likelihood becomes very small, equal to 0 to computer accuracy. You will observe that the likelihood function becomes more focussed about the MLE as the sample size increases. Also, the MLE will generally be closer to the true value of $p$ used to generate the data as the sample size increases.
R Shiny app: MLE Geometric Distribution
Poisson MLE
Let $X_1, X_2, \ldots, X_n$ be a random sample from a Poisson distribution with mean $\lambda$. Find the MLE of $\lambda$.
The log-likelihood is
$$l(\lambda) = \log \prod_{i=1}^n \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} = \left(\sum_{i=1}^n x_i\right)\log\lambda - n\lambda - \sum_{i=1}^n \log(x_i!),$$
and setting $\frac{dl(\lambda)}{d\lambda} = \frac{\sum_{i=1}^n x_i}{\lambda} - n = 0$ gives $\hat{\lambda} = \bar{x}$. Since $\frac{d^2 l(\lambda)}{d\lambda^2} = -\frac{\sum_{i=1}^n x_i}{\lambda^2} < 0$, it follows that $\hat{\lambda} = \bar{X}$ is a maximum, so it is the MLE of $\lambda$.
MLE of mean of a Normal random variable
Let $X_1, X_2, \ldots, X_n$ be a random sample from the $N(\theta, 1)$ distribution with mean $\theta$. Find the MLE of $\theta$ given observations $x_1, x_2, \ldots, x_n$.
The log-likelihood is
$$l(\theta) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^n (x_i - \theta)^2,$$
and setting $\frac{dl(\theta)}{d\theta} = \sum_{i=1}^n (x_i - \theta) = 0$ gives $\hat{\theta} = \bar{x}$; since $\frac{d^2 l(\theta)}{d\theta^2} = -n < 0$ this is a maximum. So $\hat{\theta} = \bar{x}$ is the MLE of $\theta$.
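Again this can be checked numerically; a minimal sketch with simulated data (the true mean 1.7 and sample size 30 are assumptions for the example):
```r
# Numerical check of the N(theta, 1) MLE: the maximiser of l(theta) equals the sample mean.
set.seed(6)
x <- rnorm(30, mean = 1.7, sd = 1)
loglik <- function(theta) sum(dnorm(x, mean = theta, sd = 1, log = TRUE))

optimize(loglik, interval = c(-10, 10), maximum = TRUE)$maximum
mean(x)  # should agree with the numerical maximiser
```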
In Example 10.3.5, Example 10.3.6 and Example 10.3.7 the maximum likelihood estimators correspond with the method of moments estimators. In Example 10.3.8 we consider a situation where the maximum likelihood estimator is very different from the method of moments estimator.
MLE for Uniform random variables
Let $U_1, U_2, \ldots, U_n$ be i.i.d. samples from $U[0, \theta]$. Given observations $u_1, u_2, \ldots, u_n$:
- Find the MLE of $\theta$.
- Find the method of moments estimator of $\theta$.
Attempt Example 10.3.8: MLE for Uniform random variables and then watch Video 18 for the solutions.
We will use the data $u = (u_1, u_2, \ldots, u_5) = (1.30, 2.12, 2.40, 0.98, 1.43)$ as an illustrative example. These 5 observations were simulated from $U(0, 3)$.
Video 18: MLE for Uniform random variables
Solution to Example 10.3.8: MLE for Uniform random variables
- If $U_i \sim U[0, \theta]$, then its p.d.f. is given by
$$f(u \,|\, \theta) = \begin{cases} \frac{1}{\theta}, & \text{if } 0 \le u \le \theta, \\ 0, & \text{otherwise.} \end{cases}$$
Note that if $\theta < u_i$ for some $i$, then $L(\theta) = 0$. Since we want to maximise $L(\theta)$, we can restrict attention to values of $\theta$ with $0 \le u_i \le \theta$ for all $i = 1, \ldots, n$, for which
$$L(\theta) = \prod_{i=1}^n f(u_i \,|\, \theta) = \prod_{i=1}^n \frac{1}{\theta} = \frac{1}{\theta^n}.$$
Hence, $L(\theta)$ is a positive, decreasing function of $\theta$, and its maximum must occur at the smallest value that $\theta$ can take, namely $\theta = \max\{u_1, u_2, \ldots, u_n\}$. The MLE of $\theta$ is therefore $\hat{\theta} = \max\{u_1, u_2, \ldots, u_n\}$.
Figure 10.5 shows the likelihood function $L(\theta)$ using the data $u = (1.30, 2.12, 2.40, 0.98, 1.43)$.

Figure 10.5: Likelihood function for u = (1.30,2.12,2.40,0.98,1.43).
- By comparison, the method of moments estimator $\check{\theta}$ of $\theta$ uses $E[U] = \frac{0 + \theta}{2}$ and hence is given by $\check{\theta} = 2\bar{u}$. Note that if $2\bar{u} < \max\{u_1, u_2, \ldots, u_n\}$ then $\check{\theta}$ will not be consistent with the data, i.e. $L(\check{\theta}) = 0$.
To observe the difference between the MLE and the method of moments estimator, using $u = (1.30, 2.12, 2.40, 0.98, 1.43)$:
- MLE: $\hat{\theta} = \max\{1.30, 2.12, 2.40, 0.98, 1.43\} = 2.40$;
- Method of Moments: $\check{\theta} = 2\bar{u} = 2(1.646) = 3.292$.
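In R, the two estimates for the illustrative data are one-liners:
```r
# MLE and method of moments estimates for the U[0, theta] example data.
u <- c(1.30, 2.12, 2.40, 0.98, 1.43)
max(u)       # MLE: 2.40
2 * mean(u)  # method of moments: 3.292
```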
10.4 Comments on the Maximum Likelihood Estimator
The following points on the maximum likelihood estimator are worth noting:
- When finding the MLE you want to maximise the likelihood function. However, it is often more convenient to maximise the log likelihood function instead. Both functions are maximised by the same parameter values;
- MLEs may not exist, and if they do, they may not be unique;
- The likelihood function is NOT the probability distribution for $\theta$. The correct interpretation of the likelihood function is that it is the probability of obtaining the observed data if $\theta$ were the true value of the parameter. We assume $\theta$ is an unknown constant, not a random variable. In Bayesian statistics we will consider the parameter to be random;
- The MLE has some nice large sample properties, including consistency, asymptotic normality and other optimality properties;
- The MLE can be used for non-independent data or non-identically distributed data as well;
- Often the MLE cannot be found using calculus techniques and must be found numerically. It is often useful, if we can, to plot the likelihood function to find good starting points to find the MLE numerically;
- The MLE satisfies a useful invariance property. Namely, if $\phi = h(\theta)$, where $h(\theta)$ is a one-to-one function of $\theta$, then the MLE of $\phi$ is given by $\hat{\phi} = h(\hat{\theta})$. For example, if $\phi = \frac{1}{\theta}$ and $\hat{\theta} = \bar{X}$, then $\hat{\phi} = \frac{1}{\hat{\theta}} = \frac{1}{\bar{X}}$.
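As a small illustration of the invariance property, using the Poisson data from the earlier worked example and taking $\phi = 1/\lambda$ (this choice of $\phi$ is illustrative only):
```r
# Invariance: if phi = 1/lambda, the MLE of phi is 1/lambda_hat.
x <- c(1, 2, 3, 4)
lambda_hat <- mean(x)  # MLE of lambda (= 2.5)
1 / lambda_hat         # MLE of phi = 1/lambda, by invariance (= 0.4)
```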
Student Exercises
Attempt the exercises below.
Let $X_1, X_2, \ldots, X_n$ be independent random variables, each with p.d.f.
$$f(x \,|\, \theta) = \theta^2 x \exp(-\theta x),$$
for $x > 0$. Use the method of moments to determine an estimator of $\theta$.
Remember that if $X \sim \text{Gamma}(\alpha, \beta)$ then $E[X] = \alpha/\beta$.
Solution to Exercise 10.1.
By the method of moments, note that $f(x \,|\, \theta)$ is the p.d.f. of a $\text{Gamma}(2, \theta)$ distribution, so $E[X] = \frac{2}{\theta}$. Setting $\frac{2}{\theta} = \bar{X}$ gives $\hat{\theta} = \frac{2}{\bar{X}}$.
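A quick simulation check of this estimator (the true value $\theta = 1.5$ and the sample size are assumptions for the example):
```r
# Check of the Exercise 10.1 estimator theta_hat = 2 / xbar for
# f(x) = theta^2 * x * exp(-theta * x), i.e. a Gamma(shape = 2, rate = theta) density.
set.seed(5)
theta <- 1.5
x <- rgamma(10000, shape = 2, rate = theta)
2 / mean(x)  # should be close to 1.5
```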
Let $X_1, X_2, \ldots, X_n$ be a random sample from the distribution with p.d.f.
where $\theta > 0$ is an unknown parameter. Find the MLE of $\theta$.
Solution to Exercise 10.2.
Thus the MLE of $\theta$ is $\hat{\theta} = \frac{1}{\bar{x} - 1}$.
(a) Let $X_1, X_2, \ldots, X_n$ be a random sample from the distribution having p.d.f. $f(x \,|\, \theta) = \frac{1}{2}(1 + \theta x)$, $-1 < x < 1$, where $\theta \in (-1, 1)$ is an unknown parameter. Show that the method of moments estimator for $\theta$ is $\tilde{\theta}_1 = 3\bar{X}$.
(b) Suppose instead that it is observed only whether a given observation is positive or negative. For $i = 1, 2, \ldots, n$, let
$$Y_i = \begin{cases} 1, & \text{if } X_i > 0, \\ 0, & \text{if } X_i \le 0, \end{cases}$$
and find an estimator $\tilde{\theta}_2$ of $\theta$ based on $\bar{Y}$, where $\bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i$.
(c) Justifying your answers,
- which, if either, of the estimators $\tilde{\theta}_1$ and $\tilde{\theta}_2$ are unbiased?
- which of the estimators $\tilde{\theta}_1$ and $\tilde{\theta}_2$ is more efficient?
- which, if either, of the estimators $\tilde{\theta}_1$ and $\tilde{\theta}_2$ are mean-square consistent?
Solution to Exercise 10.3.
- Since
$$E[X_1] = \int_{-1}^{1} \frac{x}{2}(1 + \theta x)\, dx = \left[\frac{x^2}{4} + \frac{\theta x^3}{6}\right]_{-1}^{1} = \frac{\theta}{3},$$
the method of moments estimator is obtained by solving $\bar{X} = \frac{\theta}{3}$, yielding $\tilde{\theta}_1 = 3\bar{X}$.
- First note that
$$P(X_1 > 0) = \int_{0}^{1} \frac{1}{2}(1 + \theta x)\, dx = \left[\frac{x}{2} + \frac{\theta x^2}{4}\right]_{0}^{1} = \frac{1}{2}\left(1 + \frac{\theta}{2}\right).$$
Thus
$$E[Y_1] = P(X_1 > 0) = \frac{1}{2}\left(1 + \frac{\theta}{2}\right),$$
so setting $\bar{Y} = \frac{1}{2}\left(1 + \frac{\theta}{2}\right)$ yields
$$\tilde{\theta}_2 = 4\bar{Y} - 2.$$
- Both estimators are unbiased:
$$E[\tilde{\theta}_1] = E\!\left[\frac{3}{n}\sum_{i=1}^n X_i\right] = \frac{3}{n}\sum_{i=1}^n E[X_i] = \frac{3}{n}\, n\, \frac{\theta}{3} = \theta,$$
so $\tilde{\theta}_1$ is unbiased, and
$$E[\tilde{\theta}_2] = E\!\left[\frac{4}{n}\sum_{i=1}^n Y_i - 2\right] = \frac{4}{n}\sum_{i=1}^n E[Y_i] - 2 = \frac{4}{n}\, n\, \frac{1}{2}\left(1 + \frac{\theta}{2}\right) - 2 = \theta,$$
so $\tilde{\theta}_2$ is unbiased.
- For efficiency, compare the variances:
$$\text{var}(\tilde{\theta}_1) = \text{var}\!\left(\frac{3}{n}\sum_{i=1}^n X_i\right) = \frac{9}{n^2}\sum_{i=1}^n \text{var}(X_i) = \frac{9}{n}\,\text{var}(X_1).$$
Now
$$E[X_1^2] = \int_{-1}^{1} \frac{x^2}{2}(1 + \theta x)\, dx = \left[\frac{x^3}{6} + \frac{\theta x^4}{8}\right]_{-1}^{1} = \frac{1}{3}.$$
Thus $\text{var}(X_1) = \frac{1}{3} - \frac{\theta^2}{9} = \frac{1}{9}(3 - \theta^2)$, so
$$\text{var}(\tilde{\theta}_1) = \frac{1}{n}(3 - \theta^2).$$
Similarly,
$$\text{var}(\tilde{\theta}_2) = \text{var}(4\bar{Y} - 2) = 16\,\text{var}(\bar{Y}) = \frac{16}{n}\,\text{var}(Y_1).$$
Now $Y_1 \sim \text{Bin}\!\left(1, \frac{1}{2}\left(1 + \frac{\theta}{2}\right)\right)$, so
$$\text{var}(Y_1) = \frac{1}{2}\left(1 + \frac{\theta}{2}\right)\frac{1}{2}\left(1 - \frac{\theta}{2}\right) = \frac{1}{4}\left(1 - \frac{\theta^2}{4}\right) = \frac{1}{16}(4 - \theta^2),$$
thus
$$\text{var}(\tilde{\theta}_2) = \frac{1}{n}(4 - \theta^2).$$
Hence, $\text{var}(\tilde{\theta}_1) < \text{var}(\tilde{\theta}_2)$, so $\tilde{\theta}_1$ is more efficient than $\tilde{\theta}_2$.
- Since $\tilde{\theta}_i$ is unbiased, $\text{MSE}(\tilde{\theta}_i) = \text{var}(\tilde{\theta}_i)$ for $i = 1, 2$. Thus
$$\text{MSE}(\tilde{\theta}_1) = \frac{1}{n}(3 - \theta^2) \to 0 \quad \text{as } n \to \infty,$$
so $\tilde{\theta}_1$ is mean-square consistent. Also,
$$\text{MSE}(\tilde{\theta}_2) = \frac{1}{n}(4 - \theta^2) \to 0 \quad \text{as } n \to \infty,$$
so $\tilde{\theta}_2$ is mean-square consistent.
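A Monte Carlo sketch comparing $\tilde{\theta}_1$ and $\tilde{\theta}_2$ empirically (the inversion formula for simulating from $f(x \,|\, \theta) = \frac{1}{2}(1 + \theta x)$ is derived from the c.d.f. and is not part of the original exercise; the chosen $\theta$, $n$ and number of replications are assumptions):
```r
# Simulate from f(x | theta) = (1 + theta * x) / 2 on (-1, 1) by inverting the c.d.f.
# F(x) = x/2 + theta * x^2 / 4 + 1/2 - theta/4.
rdist <- function(n, theta) {
  u <- runif(n)
  if (abs(theta) < 1e-8) return(2 * u - 1)             # theta = 0 is the U(-1, 1) case
  (-1 + sqrt((1 - theta)^2 + 4 * theta * u)) / theta
}

set.seed(4)
theta <- 0.5; n <- 50; reps <- 5000
est <- replicate(reps, {
  x <- rdist(n, theta)
  c(t1 = 3 * mean(x), t2 = 4 * mean(x > 0) - 2)
})

rowMeans(est)       # both close to theta = 0.5, consistent with unbiasedness
apply(est, 1, var)  # roughly (3 - theta^2)/n and (4 - theta^2)/n respectively
```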