6.2 Maximum Likelihood with the Kalman Filter

The basic idea here is that if we can formulate a time series model as a state space model, then we can use the Kalman filter to compute the log-likelihood of the observed data for a given set of parameters. We can then maximize the log-likelihood in the usual way, invoking the Kalman filter at each iteration to evaluate it at the current parameter values.

Our general state space model formulation described in the previous section has an observation equation $y_t = A x_t + V_t$ and a state equation $x_t = \Theta x_{t-1} + W_t$, where $y_t$ is a $p\times 1$ vector, $x_t$ is a $k\times 1$ vector, $A$ is a $p\times k$ matrix, and $\Theta$ is a $k\times k$ matrix. We will assume $V_t\sim\mathcal{N}(0, S)$ and $W_t\sim\mathcal{N}(0, R)$.
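To make the notation concrete, here is a minimal simulation sketch in Python/NumPy for the scalar case $p = k = 1$, an AR(1) state observed with noise. The parameter values and initial state here are illustrative assumptions of ours, not part of the model above.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 200
A = 1.0      # observation matrix (scalar here)
Theta = 0.8  # state transition matrix (scalar here)
S = 0.5      # Var(V_t), observation noise variance
R = 1.0      # Var(W_t), state noise variance

x = np.empty(n)
y = np.empty(n)
x_prev = 0.0  # assumed initial state x_0
for t in range(n):
    x[t] = Theta * x_prev + rng.normal(scale=np.sqrt(R))  # x_t = Theta x_{t-1} + W_t
    y[t] = A * x[t] + rng.normal(scale=np.sqrt(S))        # y_t = A x_t + V_t
    x_prev = x[t]
```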

To evaluate and maximize the likelihood function of the data, we need the joint density of the observed data, $p(y_1, y_2, \dots, y_n)$, which we can subsequently factor into
$$
\begin{aligned}
p(y_1, y_2, \dots, y_n)
&= p(y_1)\,p(y_2,\dots,y_n\mid y_1)\\
&= p(y_1)\,p(y_2\mid y_1)\,p(y_3,\dots,y_n\mid y_1, y_2)\\
&= p(y_1)\,p(y_2\mid y_1)\,p(y_3\mid y_1, y_2)\cdots p(y_n\mid y_1,\dots,y_{n-1}).
\end{aligned}
$$

If we pick apart this factorization, we initially need to compute $p(y_1)$. We can do this by augmenting it with the state variable $x_1$ and then integrating $x_1$ out. Although this sounds harder, it actually makes life easier.

$$
p(y_1) = \int p(y_1, x_1)\,dx_1 = \int p(y_1\mid x_1)\,p(x_1)\,dx_1.
$$

If you recall from the previous section, $p(y_1\mid x_1)$ is the density for the observation equation, which in this case is $\mathcal{N}(A x_1, S)$. The density $p(x_1)$ is (again, from the previous section) $\mathcal{N}(x_1^0, P_1^0)$, where

$$
\begin{aligned}
x_1^0 &= \Theta x_0^0\\
P_1^0 &= \Theta P_0^0 \Theta' + R.
\end{aligned}
$$

Integrating the product of those two densities gives us $p(y_1) = \mathcal{N}(A x_1^0,\, A P_1^0 A' + S)$. If we assume that $A$ is known (as previously), then the quantities $x_1^0$ and $P_1^0$ are routinely computed in the implementation of the Kalman filtering algorithm.
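Continuing the scalar sketch from above, a single prediction step produces $x_1^0$ and $P_1^0$, and with them the first term of the log-likelihood. The initial values $x_0^0$ and $P_0^0$ below are assumed for illustration.

```python
from scipy.stats import norm

x00, P00 = 0.0, 1.0  # assumed initial state mean and variance

x10 = Theta * x00              # x_1^0 = Theta x_0^0
P10 = Theta * P00 * Theta + R  # P_1^0 = Theta P_0^0 Theta' + R

# log p(y_1): Normal density with mean A x_1^0 and variance A P_1^0 A' + S
loglik = norm.logpdf(y[0], loc=A * x10, scale=np.sqrt(A * P10 * A + S))
```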

The general idea here is that we will start with $t = 1$, run the Kalman filtering algorithm for each $t$, and compute the necessary quantities for each step of the likelihood function. By the time we reach $t = n$, we should have everything we need to compute the joint likelihood function.

We can see that for $t = 2$ we will need to compute $p(y_2\mid y_1)$, which can be written

$$
\begin{aligned}
p(y_2\mid y_1)
&= \int p(y_2, x_2\mid y_1)\,dx_2\\
&= \int p(y_2\mid x_2)\,p(x_2\mid y_1)\,dx_2\\
&= \int \varphi(y_2\mid A x_2, S)\,\varphi(x_2\mid x_2^1, P_2^1)\,dx_2\\
&= \mathcal{N}(A x_2^1,\, A P_2^1 A' + S).
\end{aligned}
$$
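In code, this amounts to filtering $y_1$ into the state and then predicting one step ahead. A sketch continuing the example above, where the update step uses the standard Kalman gain formulas from the previous section (written here in their scalar form):

```python
K1 = P10 * A / (A * P10 * A + S)   # Kalman gain at t = 1
x11 = x10 + K1 * (y[0] - A * x10)  # x_1^1, filtered state mean
P11 = (1 - K1 * A) * P10           # P_1^1, filtered state variance

x21 = Theta * x11                  # x_2^1 = Theta x_1^1
P21 = Theta * P11 * Theta + R      # P_2^1 = Theta P_1^1 Theta' + R

# log p(y_2 | y_1): Normal with mean A x_2^1 and variance A P_2^1 A' + S
loglik += norm.logpdf(y[1], loc=A * x21, scale=np.sqrt(A * P21 * A + S))
```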

In general, we will have

$$
p(y_t\mid y_1,\dots,y_{t-1}) = \mathcal{N}(A x_t^{t-1},\, A P_t^{t-1} A' + S).
$$

If we let $\boldsymbol{\beta}$ represent the vector of unknown parameters, then the log-likelihood is the sum
$$
\ell(\boldsymbol{\beta}) = \sum_{t=1}^n \log p(y_t\mid y_1,\dots,y_{t-1}),
$$
where we define $p(y_1\mid y_0) = p(y_1)$ because there is no $y_0$. We can then maximize $\ell(\boldsymbol{\beta})$ with respect to $\boldsymbol{\beta}$ using standard non-linear maximization routines like Newton’s method or quasi-Newton approaches.
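Putting the pieces together, here is a sketch of $-\ell(\boldsymbol{\beta})$ for the scalar example above, taking $\boldsymbol{\beta} = (\Theta, \log S, \log R)$ and minimizing it with a quasi-Newton routine. The log-scale parameterization of the variances and the fixed initial conditions are choices of ours for illustration, not prescribed by the development above; the function name `neg_loglik` is likewise hypothetical.

```python
from scipy.optimize import minimize

def neg_loglik(beta, y, A=1.0, x00=0.0, P00=1.0):
    """Negative log-likelihood for the scalar model; beta = (Theta, log S, log R).

    Variances are parameterized on the log scale so the optimizer works on
    an unconstrained space; A and the initial conditions are held fixed.
    """
    Theta = beta[0]
    S, R = np.exp(beta[1]), np.exp(beta[2])
    xf, Pf = x00, P00  # filtered mean and variance, x_{t-1}^{t-1} and P_{t-1}^{t-1}
    ll = 0.0
    for yt in y:
        xp = Theta * xf              # prediction: x_t^{t-1}
        Pp = Theta * Pf * Theta + R  # prediction: P_t^{t-1}
        f = A * Pp * A + S           # innovation variance
        e = yt - A * xp              # innovation
        ll += -0.5 * (np.log(2 * np.pi * f) + e * e / f)  # log p(y_t | y_1,...,y_{t-1})
        K = Pp * A / f               # Kalman gain
        xf = xp + K * e              # update: x_t^t
        Pf = (1 - K * A) * Pp        # update: P_t^t
    return -ll

# Maximize the log-likelihood via a quasi-Newton method (BFGS here).
fit = minimize(neg_loglik, x0=np.array([0.5, 0.0, 0.0]), args=(y,), method="BFGS")
Theta_hat = fit.x[0]
S_hat, R_hat = np.exp(fit.x[1]), np.exp(fit.x[2])
```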