10.3 Maximum Likelihood Estimation
The estimation of the ARCH-GARCH model parameters is more complicated than the estimation of the CER model parameters. There are no simple plug-in principle estimators for the conditional variance parameters. Instead, an alternative estimation method called maximum likelihood (ML) is typically used to estimate the ARCH-GARCH parameters. This section reviews the ML estimation method and shows how it can be applied to estimate the ARCH-GARCH model parameters.
10.3.1 The Likelihood Function
Let X_{1},\ldots,X_{T} be an iid sample with probability density function (pdf) f(x_{t};\theta), where \theta is a (k\times1) vector of parameters that characterize f(x_{t};\theta). For example, if X_{t}\sim N(\mu,\sigma^{2}) then f(x_{t};\theta)=(2\pi\sigma^{2})^{-1/2}\exp\left(-\frac{1}{2\sigma^{2}}(x_{t}-\mu)^{2}\right) and \theta=(\mu,\sigma^{2})^{\prime}. The joint density of the sample is, by independence, equal to the product of the marginal densities \[ f(x_{1},\ldots,x_{T};\theta)=f(x_{1};\theta)\cdots f(x_{T};\theta)=\prod_{t=1}^{T}f(x_{t};\theta). \] The joint density is a T dimensional function of the data x_{1},\ldots,x_{T} given the parameter vector \theta. The joint density satisfies56 \begin{align*} f(x_{1},\ldots,x_{T};\theta) & \geq0,\\ \int\cdots\int f(x_{1},\ldots,x_{T};\theta)\,dx_{1}\cdots dx_{T} & =1. \end{align*}
The likelihood function is defined as the joint density treated as a function of the parameters \theta: \begin{equation} L(\theta|x_{1},\ldots,x_{T})=f(x_{1},\ldots,x_{T};\theta)=\prod_{t=1}^{T}f(x_{t};\theta).\tag{10.16} \end{equation} Notice that the likelihood function is a k dimensional function of \theta given the data x_{1},\ldots,x_{T}. It is important to keep in mind that the likelihood function, being a function of \theta and not the data, is not a proper pdf. It is always positive but \[ \int\cdots\int L(\theta|x_{1},\ldots,x_{T})\,d\theta_{1}\cdots d\theta_{k}\neq1. \]
To simplify notation, let the vector x=(x1,…,xT)′ denote the observed sample. Then the joint pdf and likelihood function may be expressed as f(x;θ) and L(θ|x).
Let R_{t} denote the daily return on an asset and assume that \{R_{t}\}_{t=1}^{T} is described by the CER model. Then \{R_{t}\}_{t=1}^{T} is an iid sample with R_{t}\sim N(\mu,\sigma^{2}). The pdf of R_{t} is \[ f(r_{t};\theta)=(2\pi\sigma^{2})^{-1/2}\exp\left(-\frac{1}{2\sigma^{2}}(r_{t}-\mu)^{2}\right),\quad-\infty<\mu<\infty,~\sigma^{2}>0,~-\infty<r_{t}<\infty, \] so that \theta=(\mu,\sigma^{2})^{\prime}. The likelihood function is then \begin{equation} L(\theta|\mathbf{r})=\prod_{t=1}^{T}(2\pi\sigma^{2})^{-1/2}\exp\left(-\frac{1}{2\sigma^{2}}(r_{t}-\mu)^{2}\right)=(2\pi\sigma^{2})^{-T/2}\exp\left(-\frac{1}{2\sigma^{2}}\sum_{t=1}^{T}(r_{t}-\mu)^{2}\right),\tag{10.18} \end{equation} where \mathbf{r}=(r_{1},\ldots,r_{T})^{\prime} denotes the observed sample.
\blacksquare
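To make the likelihood function concrete, the sketch below evaluates the CER likelihood numerically in Python. The function name `cer_likelihood` and the five return values are purely illustrative, not from the text.

```python
import numpy as np

def cer_likelihood(mu, sigma2, r):
    """CER likelihood L(theta|r): the product of N(mu, sigma2) densities."""
    return np.prod((2 * np.pi * sigma2) ** -0.5
                   * np.exp(-0.5 * (r - mu) ** 2 / sigma2))

# Hypothetical sample of daily returns
r = np.array([0.01, -0.02, 0.005, 0.015, -0.01])

# The likelihood is larger at parameter values close to the sample moments
# than at values far from them.
print(cer_likelihood(0.0, 0.0002, r) > cer_likelihood(0.05, 0.0002, r))
```

For large T the product of densities underflows in floating point, which is one practical reason the log-likelihood is maximized instead.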
Now suppose \{X_{t}\} is a covariance stationary time series whose joint pdf for a sample of size T is f(x_{1},\ldots,x_{T};\theta). Here, the iid factorization of the likelihood function given in (10.16) does not work because the random variables \{X_{t}\}_{t=1}^{T} generating the sample \{x_{t}\}_{t=1}^{T} are not iid. In this case, one can instead use the conditional-marginal factorization of the joint density f(x_{1},\ldots,x_{T};\theta). To see how this works, consider the joint density of two adjacent observations, f(x_{1},x_{2};\theta). This joint density can be factored as the product of the conditional density of X_{2} given X_{1}=x_{1} and the marginal density of X_{1}: \[ f(x_{1},x_{2};\theta)=f(x_{2}|x_{1};\theta)f(x_{1};\theta). \] For three observations, the factorization becomes \[ f(x_{1},x_{2},x_{3};\theta)=f(x_{3}|x_{2},x_{1};\theta)f(x_{2}|x_{1};\theta)f(x_{1};\theta). \] For a sample of size T, the conditional-marginal factorization of the joint pdf f(x_{1},\ldots,x_{T};\theta) has the form \[ f(x_{1},\ldots,x_{T};\theta)=\left(\prod_{t=p+1}^{T}f(x_{t}|I_{t-1};\theta)\right)\cdot f(x_{1},\ldots,x_{p};\theta), \] where I_{t}=\{x_{t},\ldots,x_{1}\} denotes the conditioning information available at time t, f(x_{t}|I_{t-1};\theta) is the pdf of x_{t} conditional on I_{t-1}, and f(x_{1},\ldots,x_{p};\theta) denotes the joint density of the p initial values x_{1},\ldots,x_{p}. Hence, the conditional-marginal factorization of the likelihood function is \begin{equation} L(\theta|x_{1},\ldots,x_{T})=f(x_{1},\ldots,x_{T};\theta)=\left(\prod_{t=p+1}^{T}f(x_{t}|I_{t-1};\theta)\right)\cdot f(x_{1},\ldots,x_{p};\theta).\tag{10.20} \end{equation} For many covariance stationary time series models, the conditional pdf f(x_{t}|I_{t-1};\theta) has a simple form whereas the marginal joint pdf f(x_{1},\ldots,x_{p};\theta) is complicated. For these models, the marginal joint pdf is often dropped from (10.20), which gives the simplified conditional likelihood function \begin{equation} L(\theta|x_{1},\ldots,x_{T})\approx\prod_{t=p+1}^{T}f(x_{t}|I_{t-1};\theta).\tag{10.21} \end{equation} For reasonably large samples (e.g., T>100), the density of the initial values f(x_{1},\ldots,x_{p};\theta) has a negligible influence on the shape of the likelihood function (10.20) and can be safely ignored. In what follows we consider only the conditional likelihood function (10.21).
Let R_{t} denote the daily return on an asset and assume that \{R_{t}\}_{t=1}^{T} is described by the GARCH(1,1) model \begin{align*} R_{t} & =\mu+\epsilon_{t},\\ \epsilon_{t} & =\sigma_{t}z_{t},\\ \sigma_{t}^{2} & =\omega+\alpha_{1}\epsilon_{t-1}^{2}+\beta_{1}\sigma_{t-1}^{2},\\ z_{t} & \sim\mathrm{iid}~N(0,1). \end{align*} From the properties of the GARCH(1,1) model we know that R_{t}|I_{t-1}\sim N(\mu,\sigma_{t}^{2}) and so \[ f(r_{t}|I_{t-1};\theta)=(2\pi\sigma_{t}^{2})^{-1/2}\exp\left(-\frac{1}{2\sigma_{t}^{2}}(r_{t}-\mu)^{2}\right), \] where \theta=(\mu,\omega,\alpha_{1},\beta_{1})^{\prime}. The conditional likelihood (10.21) is then \begin{equation} L(\theta|\mathbf{r})=\prod_{t=1}^{T}(2\pi\sigma_{t}^{2})^{-1/2}\exp\left(-\frac{1}{2\sigma_{t}^{2}}(r_{t}-\mu)^{2}\right).\tag{10.23} \end{equation} In (10.23), the values for \sigma_{t}^{2} are determined recursively from the variance equation (10.15) starting at t=1: \begin{equation} \sigma_{1}^{2}=\omega+\alpha_{1}\epsilon_{0}^{2}+\beta_{1}\sigma_{0}^{2}.\tag{10.24} \end{equation} Given values for the parameters \omega, \alpha_{1}, \beta_{1} and initial values \epsilon_{0}^{2} and \sigma_{0}^{2}, the value of \sigma_{1}^{2} is determined from (10.24). For t=2, we have \[ \sigma_{2}^{2}=\omega+\alpha_{1}\epsilon_{1}^{2}+\beta_{1}\sigma_{1}^{2}=\omega+\alpha_{1}(r_{1}-\mu)^{2}+\beta_{1}\sigma_{1}^{2}, \] where \sigma_{1}^{2} is determined from (10.24). Hence, the values of \sigma_{t}^{2} in (10.23) are determined recursively using the GARCH(1,1) variance equation (10.15) starting with (10.24).
\blacksquare
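A sketch of this recursion in Python follows. The choice of initial values \epsilon_{0}^{2}=\sigma_{0}^{2}=\widehat{\mathrm{var}}(r) is a common convention but is an assumption here, not something fixed by the model; the function names are illustrative.

```python
import numpy as np

def garch11_sigma2(r, mu, omega, alpha1, beta1):
    """Build sigma_t^2 recursively from the GARCH(1,1) variance equation."""
    eps = r - mu
    sigma2 = np.empty(len(r))
    eps0_sq = sigma0_sq = np.var(r)  # initial values: an assumed convention
    sigma2[0] = omega + alpha1 * eps0_sq + beta1 * sigma0_sq
    for t in range(1, len(r)):
        sigma2[t] = omega + alpha1 * eps[t - 1] ** 2 + beta1 * sigma2[t - 1]
    return sigma2

def garch11_loglik(theta, r):
    """Conditional log-likelihood implied by the likelihood function (10.23)."""
    mu, omega, alpha1, beta1 = theta
    s2 = garch11_sigma2(r, mu, omega, alpha1, beta1)
    return np.sum(-0.5 * np.log(2 * np.pi * s2) - 0.5 * (r - mu) ** 2 / s2)
```

Note that the whole path of conditional variances, and hence the likelihood, must be recomputed every time \theta changes, which is what a numerical optimizer does at each trial value of \theta.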
10.3.2 The Maximum Likelihood Estimator
Consider a sample of observations \{x_{t}\}_{t=1}^{T} with joint pdf f(x_{1},\ldots,x_{T};\theta) and likelihood function L(\theta|x_{1},\ldots,x_{T}). Suppose that X_{t} is a discrete random variable so that \[ L(\theta|x_{1},\ldots,x_{T})=f(x_{1},\ldots,x_{T};\theta)=\Pr(X_{1}=x_{1},\ldots,X_{T}=x_{T}). \] So for a given value of \theta, say \theta=\theta_{0}, L(\theta_{0}|x_{1},\ldots,x_{T}) gives the probability of the observed sample \{x_{t}\}_{t=1}^{T} coming from a model in which \theta_{0} is the true parameter. Now, some values of \theta will lead to higher probabilities of observing \{x_{t}\}_{t=1}^{T} than others. The value of \theta that gives the highest probability for the observed sample \{x_{t}\}_{t=1}^{T} is the value of \theta that maximizes the likelihood function L(\theta|x_{1},\ldots,x_{T}). This value of \theta is called the maximum likelihood estimator (MLE) for \theta. When X_{t} is a continuous random variable, f(x_{1},\ldots,x_{T};\theta) is not a joint probability but represents the height of the joint pdf as a function of \{x_{t}\}_{t=1}^{T} for a given \theta. However, the interpretation of the MLE is similar: it is the value of \theta that makes the observed data most likely.
Formally, the MLE for \theta, denoted \hat{\theta}_{mle}, is the value of \theta that maximizes L(\theta|\mathbf{x}). That is, \hat{\theta}_{mle} solves the optimization problem \[ \max_{\theta}\,L(\theta|\mathbf{x}). \]
It is often quite difficult to directly maximize L(\theta|\mathbf{x}). It is usually much easier to maximize the log-likelihood function \ln L(\theta|\mathbf{x}). Since \ln(\cdot) is a monotonically increasing function, the value of \theta that maximizes \ln L(\theta|\mathbf{x}) will also maximize L(\theta|\mathbf{x}). Therefore, we may also define \hat{\theta}_{mle} as the value of \theta that solves \max_{\theta}\ln L(\theta|\mathbf{x}). With random sampling, the log-likelihood is additive in the log of the marginal densities: \[ \ln L(\theta|\mathbf{x})=\ln\left(\prod_{t=1}^{T}f(x_{t};\theta)\right)=\sum_{t=1}^{T}\ln f(x_{t};\theta). \] For a covariance stationary time series, the conditional log-likelihood is additive in the log of the conditional densities: \[ \ln L(\theta|\mathbf{x})=\ln\left(\prod_{t=1}^{T}f(x_{t}|I_{t-1};\theta)\right)=\sum_{t=1}^{T}\ln f(x_{t}|I_{t-1};\theta). \]
Since the MLE is defined as a maximization problem, under suitable regularity conditions we can determine the MLE using simple calculus.57 We find the MLE by differentiating \ln L(\theta|\mathbf{x}) and solving the first order conditions \frac{\partial\ln L(\hat{\theta}_{mle}|\mathbf{x})}{\partial\theta}=\mathbf{0}. Since \theta is (k\times1) the first order conditions define k, potentially nonlinear, equations in k unknown values: \frac{\partial\ln L(\hat{\theta}_{mle}|\mathbf{x})}{\partial\theta}=\left(\begin{array}{c} \frac{\partial\ln L(\hat{\theta}_{mle}|\mathbf{x})}{\partial\theta_{1}}\\ \vdots\\ \frac{\partial\ln L(\hat{\theta}_{mle}|\mathbf{x})}{\partial\theta_{k}} \end{array}\right)=\left(\begin{array}{c} 0\\ \vdots\\ 0 \end{array}\right).
The vector of derivatives of the log-likelihood function is called the score vector and is denoted S(\theta|\mathbf{x})=\frac{\partial\ln L(\theta|\mathbf{x})}{\partial\theta}. By definition, the MLE satisfies S(\hat{\theta}_{mle}|\mathbf{x})=0.
In some cases, it is possible to find analytic solutions to the set of equations S(\hat{\theta}_{mle}|\mathbf{x})=\mathbf{0}. However, for ARCH and GARCH models the set of equations S(\hat{\theta}_{mle}|\mathbf{x})=0 are complicated nonlinear functions of the elements of \theta and no analytic solutions exist. As a result, numerical optimization methods are required to determine \hat{\theta}_{mle}.
In the CER model with likelihood function (10.18), the log-likelihood function is \begin{equation} \ln L(\theta|\mathbf{r})=-\frac{T}{2}\ln(2\pi)-\frac{T}{2}\ln(\sigma^{2})-\frac{1}{2\sigma^{2}}\sum_{t=1}^{T}(r_{t}-\mu)^{2},\tag{10.25} \end{equation} and we may determine the MLE for \theta=(\mu,\sigma^{2})^{\prime} by maximizing the log-likelihood (10.25). The sample score is a (2\times1) vector given by \[ S(\theta|\mathbf{r})=\left(\begin{array}{c} \frac{\partial\ln L(\theta|\mathbf{r})}{\partial\mu}\\ \frac{\partial\ln L(\theta|\mathbf{r})}{\partial\sigma^{2}} \end{array}\right), \] where \begin{align*} \frac{\partial\ln L(\theta|\mathbf{r})}{\partial\mu} & =\frac{1}{\sigma^{2}}\sum_{t=1}^{T}(r_{t}-\mu),\\ \frac{\partial\ln L(\theta|\mathbf{r})}{\partial\sigma^{2}} & =-\frac{T}{2}(\sigma^{2})^{-1}+\frac{1}{2}(\sigma^{2})^{-2}\sum_{t=1}^{T}(r_{t}-\mu)^{2}. \end{align*}
Solving S(\hat{\theta}_{mle}|\mathbf{r})=\mathbf{0} gives the first order conditions \begin{align*} \frac{\partial\ln L(\hat{\theta}_{mle}|\mathbf{r})}{\partial\mu} & =\frac{1}{\hat{\sigma}_{mle}^{2}}\sum_{t=1}^{T}(r_{t}-\hat{\mu}_{mle})=0,\\ \frac{\partial\ln L(\hat{\theta}_{mle}|\mathbf{r})}{\partial\sigma^{2}} & =-\frac{T}{2}(\hat{\sigma}_{mle}^{2})^{-1}+\frac{1}{2}(\hat{\sigma}_{mle}^{2})^{-2}\sum_{t=1}^{T}(r_{t}-\hat{\mu}_{mle})^{2}=0. \end{align*} Solving the first equation for \hat{\mu}_{mle} gives \begin{equation} \hat{\mu}_{mle}=\frac{1}{T}\sum_{t=1}^{T}r_{t}=\bar{r}.\tag{10.26} \end{equation} Hence, the sample average is the MLE for \mu. Using \hat{\mu}_{mle}=\bar{r} and solving the second equation for \hat{\sigma}_{mle}^{2} gives \begin{equation} \hat{\sigma}_{mle}^{2}=\frac{1}{T}\sum_{t=1}^{T}(r_{t}-\bar{r})^{2}.\tag{10.27} \end{equation} Notice that \hat{\sigma}_{mle}^{2} is not equal to the sample variance, which uses a divisor of T-1 instead of T.
It is instructive to compare the MLEs for \mu and \sigma^{2} with the CER model estimates presented in Chapter 7. Recall, the CER model estimators for \mu and \sigma^{2} are motivated by the plug-in principle and are equal to the sample mean and variance, respectively. Here, we see that the MLE for \mu is equal to the sample mean and the MLE for \sigma^{2} is (T-1)/T times the sample variance. In fact, it can be shown that the CER model estimates for the other parameters are also equal to or approximately equal to the MLEs.
\blacksquare
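The closed-form MLEs (10.26) and (10.27) are immediate to compute. The short Python check below, with made-up return values, also verifies the stated relation to the sample variance:

```python
import numpy as np

r = np.array([0.01, -0.02, 0.005, 0.015, -0.01])  # hypothetical returns
T = len(r)

mu_mle = np.mean(r)                         # (10.26): the sample average
sigma2_mle = np.sum((r - mu_mle) ** 2) / T  # (10.27): divisor T, not T-1

# The MLE equals (T-1)/T times the usual sample variance (divisor T-1)
print(np.isclose(sigma2_mle, np.var(r, ddof=1) * (T - 1) / T))
```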
In the GARCH(1,1) model with likelihood function (10.23) and parameter vector \theta=(\mu,\omega,\alpha_{1},\beta_{1})^{\prime}, the log-likelihood function is \begin{equation} \ln L(\theta|\mathbf{r})=-\frac{T}{2}\ln(2\pi)-\frac{1}{2}\sum_{t=1}^{T}\ln(\sigma_{t}^{2}(\theta))-\frac{1}{2}\sum_{t=1}^{T}\frac{(r_{t}-\mu)^{2}}{\sigma_{t}^{2}(\theta)}.\tag{10.28} \end{equation} In (10.23), it is instructive to write \sigma_{t}^{2}=\sigma_{t}^{2}(\theta) to emphasize that \sigma_{t}^{2} is a function of \theta. That is, from (10.15), \[ \sigma_{t}^{2}=\sigma_{t}^{2}(\theta)=\omega+\alpha_{1}\epsilon_{t-1}^{2}+\beta_{1}\sigma_{t-1}^{2}(\theta)=\omega+\alpha_{1}(r_{t-1}-\mu)^{2}+\beta_{1}\sigma_{t-1}^{2}(\theta). \] The sample score is a (4\times1) vector given by \[ S(\theta|\mathbf{r})=\left(\begin{array}{c} \frac{\partial\ln L(\theta|\mathbf{r})}{\partial\mu}\\ \frac{\partial\ln L(\theta|\mathbf{r})}{\partial\omega}\\ \frac{\partial\ln L(\theta|\mathbf{r})}{\partial\alpha_{1}}\\ \frac{\partial\ln L(\theta|\mathbf{r})}{\partial\beta_{1}} \end{array}\right). \] The elements of S(\theta|\mathbf{r}), unfortunately, do not have simple closed form expressions, and no analytic formulas are available for the MLEs of the elements of \theta. As a result, numerical methods, as discussed in sub-section 10.3.6 below, are required to determine the \hat{\theta}_{mle} that maximizes (10.28) and solves S(\hat{\theta}_{mle}|\mathbf{r})=\mathbf{0}.
\blacksquare
10.3.3 Invariance Property of Maximum Likelihood Estimators
One of the attractive features of the method of maximum likelihood is its invariance to one-to-one transformations of the parameters of the log-likelihood. That is, if \hat{\theta}_{mle} is the MLE of \theta and \alpha=h(\theta) is a one-to-one function of \theta then \hat{\alpha}_{mle}=h(\hat{\theta}_{mle}) is the MLE for \alpha.
In the CER model, the log-likelihood is parametrized in terms of \mu and \sigma^{2} and we have the MLEs (10.26) and (10.27). Suppose we are interested in the MLE for \sigma=h(\sigma^{2})=(\sigma^{2})^{1/2}, which is a one-to-one function for \sigma^{2}>0. The invariance property of MLEs gives the result \[ \hat{\sigma}_{mle}=(\hat{\sigma}_{mle}^{2})^{1/2}=\left(\frac{1}{T}\sum_{t=1}^{T}(r_{t}-\hat{\mu}_{mle})^{2}\right)^{1/2}. \]
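The invariance property can also be checked numerically: maximizing the CER log-likelihood reparametrized in (\mu,\sigma) gives the same \hat{\sigma}_{mle} as taking the square root of (10.27). The sketch below assumes SciPy is available; the data and starting values are arbitrary illustrations.

```python
import numpy as np
from scipy.optimize import minimize

r = np.array([0.01, -0.02, 0.005, 0.015, -0.01])  # hypothetical returns

def neg_loglik(theta, r):
    """Negative CER log-likelihood parametrized in (mu, sigma), not (mu, sigma^2)."""
    mu, sigma = theta
    return (0.5 * len(r) * np.log(2 * np.pi * sigma ** 2)
            + 0.5 * np.sum((r - mu) ** 2) / sigma ** 2)

res = minimize(neg_loglik, x0=[0.0, 0.02], args=(r,), method="Nelder-Mead")
sigma_hat = res.x[1]

# Invariance: the MLE of sigma is the square root of the MLE of sigma^2
sigma2_mle = np.mean((r - np.mean(r)) ** 2)
print(np.isclose(sigma_hat, np.sqrt(sigma2_mle), atol=1e-3))
```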
10.3.4 The Precision of the Maximum Likelihood Estimator
The likelihood, log-likelihood and score functions for a typical model are illustrated in figure xxx. The likelihood function is always positive (since it is the joint density of the sample) but the log-likelihood function is typically negative (being the log of a number less than 1). Here the log-likelihood is globally concave and has a unique maximum at \hat{\theta}_{mle}. Consequently, the score function is positive to the left of the maximum, crosses zero at the maximum and becomes negative to the right of the maximum.
Intuitively, the precision of \hat{\theta}_{mle} depends on the curvature of the log-likelihood function near \hat{\theta}_{mle}. If the log-likelihood is very curved or “steep” around \hat{\theta}_{mle}, then \theta will be precisely estimated. In this case, we say that we have a lot of information about \theta. On the other hand, if the log-likelihood is not curved or “flat” near \hat{\theta}_{mle}, then \theta will not be precisely estimated. Accordingly, we say that we do not have much information about \theta.
The extreme case of a completely flat likelihood in \theta is illustrated in figure xxx. Here, the sample contains no information about the true value of \theta because every value of \theta produces the same value of the likelihood function. When this happens we say that \theta is not identified. Formally, \theta is identified if for all \theta_{1}\neq\theta_{2} there exists a sample \mathbf{x} for which L(\theta_{1}|\mathbf{x})\neq L(\theta_{2}|\mathbf{x}).
The curvature of the log-likelihood is measured by its second derivative (Hessian) H(\theta|\mathbf{x})=\frac{\partial^{2}\ln L(\theta|\mathbf{x})}{\partial\theta\partial\theta^{\prime}}. Since the Hessian is negative semi-definite, the information in the sample about \theta may be measured by -H(\theta|\mathbf{x}). If \theta is a scalar then -H(\theta|\mathbf{x}) is a positive number. The expected amount of information in the sample about the parameter \theta is the information matrix I(\theta|\mathbf{x})=-E[H(\theta|\mathbf{x})]. As we shall see in the next sub-section, the information matrix is directly related to the precision of the MLE.
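To illustrate, the Hessian and the implied standard errors can be approximated by finite differences of the log-likelihood. The Python sketch below does this for the CER model on simulated returns; the step size, seed, and simulation settings are arbitrary choices, not part of the text.

```python
import numpy as np

def loglik(theta, r):
    """CER log-likelihood (10.25)."""
    mu, sigma2 = theta
    return -0.5 * len(r) * np.log(2 * np.pi * sigma2) - 0.5 * np.sum((r - mu) ** 2) / sigma2

def numerical_hessian(f, theta, h=1e-5):
    """Central finite-difference approximation to the Hessian of f at theta."""
    k = len(theta)
    H = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            e_i, e_j = np.eye(k)[i] * h, np.eye(k)[j] * h
            H[i, j] = (f(theta + e_i + e_j) - f(theta + e_i - e_j)
                       - f(theta - e_i + e_j) + f(theta - e_i - e_j)) / (4 * h * h)
    return H

rng = np.random.default_rng(0)
r = rng.normal(0.001, 0.02, size=500)  # simulated CER returns
theta_mle = np.array([np.mean(r), np.mean((r - np.mean(r)) ** 2)])

H = numerical_hessian(lambda th: loglik(th, r), theta_mle)
se = np.sqrt(np.diag(np.linalg.inv(-H)))  # asymptotic standard errors
# se[0] should be close to sqrt(sigma2_hat / T), the familiar standard error of the mean
```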
10.3.5 Asymptotic Properties of Maximum Likelihood Estimators
Let \{X_{t}\} be a covariance stationary time series with conditional probability density function (pdf) f(x_{t}|I_{t-1};\theta), where \theta is a (k\times1) vector of parameters that characterize f(x_{t}|I_{t-1};\theta). Under suitable regularity conditions, the ML estimator of \theta has the following asymptotic properties as the sample size T\rightarrow\infty:
- \hat{\theta}_{mle}\overset{p}{\rightarrow}\theta
- \sqrt{T}(\hat{\theta}_{mle}-\theta)\overset{d}{\rightarrow}N(0,I(\theta|x_{t})^{-1}), where \[ I(\theta|x_{t})=-E\left[H(\theta|x_{t})\right]=-E\left[\frac{\partial^{2}\ln f(x_{t}|I_{t-1};\theta)}{\partial\theta\partial\theta^{\prime}}\right]. \] That is, \mathrm{avar}(\sqrt{T}(\hat{\theta}_{mle}-\theta))=I(\theta|x_{t})^{-1}. Alternatively, for large T, \[ \hat{\theta}_{mle}\sim N\left(\theta,\frac{1}{T}I(\theta|x_{t})^{-1}\right)=N(\theta,I(\theta|\mathbf{x})^{-1}), \] where I(\theta|\mathbf{x})=T\cdot I(\theta|x_{t}) is the information matrix for the full sample.
- \hat{\theta}_{mle} is efficient in the class of consistent and asymptotically normal estimators. That is, \mathrm{avar}(\sqrt{T}(\hat{\theta}_{mle}-\theta))-\mathrm{avar}(\sqrt{T}(\tilde{\theta}-\theta))\leq0 (in the matrix sense) for any consistent and asymptotically normal estimator \tilde{\theta}.
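A small Monte Carlo experiment (a sketch; the parameter values, sample size, and number of replications are arbitrary) illustrates the asymptotic normality result for \hat{\mu}_{mle} in the CER model: across replications, the mean of the estimates should be near \mu and their standard deviation near \sigma/\sqrt{T}.

```python
import numpy as np

# Monte Carlo check: across replications, mu_hat is approximately N(mu, sigma^2/T)
rng = np.random.default_rng(1)
mu, sigma, T, n_sim = 0.001, 0.02, 250, 2000

mu_hats = np.array([rng.normal(mu, sigma, T).mean() for _ in range(n_sim)])

print(mu_hats.mean())  # close to mu = 0.001
print(mu_hats.std())   # close to sigma / sqrt(T) = 0.02 / sqrt(250)
```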
10.3.6 Computing the MLE Using Numerical Optimization Methods
Computation: Newton-Raphson Iteration
Goal: use an iterative scheme to compute \[ \hat{\theta}=\arg\max_{\theta}\,\ln L(\theta|\mathbf{x}). \] Numerical optimization methods generally have the following structure:
- Determine a set of starting values
- Update the starting values in a direction that increases \ln L(\theta|\mathbf{x})
- Check to see if the update satisfies the FOCs. If not, update again.
- Stop updating when the FOCs are satisfied.
The most common method for numerically maximizing the log-likelihood is Newton-Raphson iteration. This method can be motivated as follows. Consider a second order Taylor series expansion of \ln L(\theta|\mathbf{x}) about the starting value \hat{\theta}_{1}: \begin{align} \ln L(\theta|\mathbf{x}) & =\ln L(\hat{\theta}_{1}|\mathbf{x})+\frac{\partial\ln L(\hat{\theta}_{1}|\mathbf{x})}{\partial\theta^{\prime}}(\theta-\hat{\theta}_{1})\tag{10.29}\\ & +\frac{1}{2}(\theta-\hat{\theta}_{1})^{\prime}\frac{\partial^{2}\ln L(\hat{\theta}_{1}|\mathbf{x})}{\partial\theta\partial\theta^{\prime}}(\theta-\hat{\theta}_{1})+error.\nonumber \end{align} Now maximize (10.29) with respect to \theta. The FOCs are \[ \underset{k\times1}{\mathbf{0}}=\frac{\partial\ln L(\hat{\theta}_{1}|\mathbf{x})}{\partial\theta}+\frac{\partial^{2}\ln L(\hat{\theta}_{1}|\mathbf{x})}{\partial\theta\partial\theta^{\prime}}(\hat{\theta}_{2}-\hat{\theta}_{1}),
\]
which can be solved for \hat{\theta}_{2}: \begin{align*} \hat{\theta}_{2} & =\hat{\theta}_{1}-\left[\frac{\partial^{2}\ln L(\hat{\theta}_{1}|\mathbf{x})}{\partial\theta\partial\theta^{\prime}}\right]^{-1}\frac{\partial\ln L(\hat{\theta}_{1}|\mathbf{x})}{\partial\theta}\\ & =\hat{\theta}_{1}-H(\hat{\theta}_{1}|\mathbf{x})^{-1}S(\hat{\theta}_{1}|\mathbf{x}) \end{align*} This suggests the iterative scheme \hat{\theta}_{n+1}=\hat{\theta}_{n}-H(\hat{\theta}_{n}|\mathbf{x})^{-1}S(\hat{\theta}_{n}|\mathbf{x}) Iteration stops when S(\hat{\theta}_{n}|\mathbf{x})\approx0.
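For the CER model the score and Hessian have closed forms, so the iterative scheme above can be written out directly. The Python sketch below (simulated data and starting values are illustrative) converges to the closed-form MLEs (10.26)-(10.27) in a few iterations; note that Newton-Raphson can diverge from poor starting values, which is why the choice of starting values in the first step matters.

```python
import numpy as np

def score(theta, r):
    """Score vector S(theta|r) for the CER model."""
    mu, s2 = theta
    T = len(r)
    return np.array([np.sum(r - mu) / s2,
                     -T / (2 * s2) + np.sum((r - mu) ** 2) / (2 * s2 ** 2)])

def hessian(theta, r):
    """Hessian H(theta|r) of the CER log-likelihood."""
    mu, s2 = theta
    T = len(r)
    return np.array([
        [-T / s2,                   -np.sum(r - mu) / s2 ** 2],
        [-np.sum(r - mu) / s2 ** 2, T / (2 * s2 ** 2) - np.sum((r - mu) ** 2) / s2 ** 3],
    ])

def newton_raphson(r, theta0, tol=1e-8, max_iter=100):
    """Iterate theta <- theta - H^{-1} S until the score is approximately zero."""
    theta = np.array(theta0, dtype=float)
    for _ in range(max_iter):
        S = score(theta, r)
        if np.max(np.abs(S)) < tol:
            break
        theta = theta - np.linalg.solve(hessian(theta, r), S)
    return theta

rng = np.random.default_rng(2)
r = rng.normal(0.001, 0.02, 500)                 # simulated CER returns
theta_hat = newton_raphson(r, [0.0, np.var(r)])  # moment-based starting values

# theta_hat matches the closed-form MLEs (r-bar, mean((r - r-bar)^2))
print(np.allclose(theta_hat, [np.mean(r), np.mean((r - np.mean(r)) ** 2)]))
```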
- Practical considerations: a stopping criterion is needed, e.g., stop when the elements of S(\hat{\theta}_{n}|\mathbf{x}), or the change \hat{\theta}_{n+1}-\hat{\theta}_{n}, are smaller than some tolerance.
- Practical considerations: the derivatives may be computed analytically or numerically (by finite differences).
- Practical considerations: Newton-Raphson gives H(\hat{\theta}_{n}|\mathbf{x})^{-1} as a by-product of the optimization, and this is used to compute the asymptotic standard errors of the MLE.
If X_{1},\ldots,X_{T} are discrete random variables, then f(x_{1},\ldots,x_{T};\theta)=\Pr(X_{1}=x_{1},\ldots,X_{T}=x_{T}) for a fixed value of \theta.
Standard regularity conditions are: (1) the support of the random variables X,\,S_{X}=\{x:f(x;\theta)>0\}, does not depend on \theta; (2) f(x;\theta) is at least three times differentiable with respect to \theta; (3) the true value of \theta lies in a compact set \Theta.