1.5 Likelihood
The likelihood function measures how plausible a parameter value \(\theta\) is given an observed value of the random variable \(X\). It is identical in form to the probability distribution function, except that it reverses which quantity is considered fixed. E.g., the binomial probability distribution expresses the probability that \(X = x\) given the success probability \(\theta = \pi\).
\[f(x|\pi) = \frac{n!}{x!(n-x)!} \pi^x (1-\pi)^{n-x}.\]
The corresponding likelihood function expresses the likelihood of the value \(\pi = p\) given the observed value \(x\).
\[L(p|x) = \frac{n!}{x!(n-x)!} p^x (1-p)^{n-x}.\]
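To make the role reversal concrete, here is a minimal Python sketch (assuming `scipy` is available; the specific numbers are illustrative): the same formula acts as a probability distribution when \(\pi\) is fixed and \(x\) varies, and as a likelihood when \(x\) is fixed and \(p\) varies.

```python
from scipy.stats import binom

n = 5

# Probability distribution: fix the parameter pi, vary the outcome x.
pi = 0.6
probs = [binom.pmf(x, n, pi) for x in range(n + 1)]
print(sum(probs))  # 1.0 -- a genuine distribution over x

# Likelihood: fix the observed x, vary the candidate parameter p.
x_obs = 3
for p in (0.2, 0.4, 0.6, 0.8):
    print(p, binom.pmf(x_obs, n, p))  # need not sum to 1 over p
```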
You usually want to know the value of \(\theta\) at the maximum of the likelihood function. When taking derivatives, any multiplicative constant is irrelevant and can be discarded. So for the binomial distribution, the likelihood function for \(\pi\) may instead be expressed as
\[L(p|x) \propto p^x (1-p)^{n-x}.\]
Calculating the maximum is usually simplified using the log-likelihood, \(l(\theta|x) = \log L(\theta|x)\). For the binomial distribution, \(l(p|x) = x \log p + (n - x) \log (1 - p)\). Frequently you derive the log-likelihood from a sample \(x = (x_1, \ldots, x_n)\). The overall likelihood is the product of the individual likelihoods, so the overall log-likelihood is the sum of the individual log-likelihoods.
\[l(\theta|x) = \log \prod_{i=1}^n f(x_i|\theta) = \sum_{i=1}^n \log f(x_i|\theta).\]
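A minimal sketch of this computation, assuming a hypothetical sample of Bernoulli trials (the data values here are made up for illustration):

```python
import numpy as np
from scipy.stats import bernoulli

# Hypothetical sample of Bernoulli observations.
sample = np.array([1, 0, 1, 1, 0])

def log_likelihood(theta, xs):
    # Overall log-likelihood: the sum of the individual log-densities.
    return np.sum(bernoulli.logpmf(xs, theta))

print(log_likelihood(0.6, sample))
```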
Here are plots of the binomial log-likelihood of \(\pi\) for several values of \(X\) from a sample of size \(n = 5\).
As the total sample size \(n\) grows, the log-likelihood function becomes more sharply peaked around its maximum, and becomes nearly quadratic (i.e. a parabola, if there is a single parameter). Here is the same plot with \(n = 500\).
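Plots of this kind can be generated with a sketch along the following lines (assuming `matplotlib`; the chosen values of \(x\) are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

p = np.linspace(0.01, 0.99, 200)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for n, ax in zip((5, 500), axes):
    # One log-likelihood curve per observed value x.
    for x in (0, n // 4, n // 2, 3 * n // 4, n):
        ax.plot(p, binom.logpmf(x, n, p), label=f"x = {x}")
    ax.set_title(f"n = {n}")
    ax.set_xlabel("p")
    ax.set_ylabel("log-likelihood")
    ax.legend()
plt.show()
```

For \(n = 500\) the curves are far more sharply peaked around their maxima than for \(n = 5\), which is the behavior described above.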
The value of \(\theta\) that maximizes \(l\) (and \(L\)) is the maximum-likelihood estimator (MLE) of \(\theta\), written \(\hat{\theta}\). E.g., suppose you have an experiment of \(n = 5\) Bernoulli trials \(\left(X \sim Bin(5, \pi) \right)\) and \(X = 3\) successful events. A plot of \(L(p|x) \propto p^3(1 - p)^2\) shows the MLE is at \(p = 0.6\).
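A quick numerical check of this example (a sketch assuming `scipy.optimize`): maximizing the log-likelihood, i.e. minimizing its negative, over \(p\) recovers the maximum at 0.6.

```python
from scipy.optimize import minimize_scalar
from scipy.stats import binom

n, x = 5, 3

# Maximize the log-likelihood by minimizing its negative over (0, 1).
result = minimize_scalar(lambda p: -binom.logpmf(x, n, p),
                         bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x)  # approximately 0.6, i.e. x / n
```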
This approach is called maximum-likelihood estimation. MLE usually involves setting the derivative of the log-likelihood to zero and solving for \(\theta\).
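For the binomial example, setting the derivative of \(l(p|x) = x \log p + (n - x)\log(1 - p)\) to zero gives the closed form
\[\frac{dl}{dp} = \frac{x}{p} - \frac{n - x}{1 - p} = 0 \quad\Longrightarrow\quad \hat{p} = \frac{x}{n},\]
so with \(n = 5\) and \(x = 3\), \(\hat{p} = 3/5 = 0.6\), matching the plot above.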