## 6.2 Maximum Likelihood with the Kalman Filter

The basic idea here is that if we can formulate a time series model as a state space model, then we can use the Kalman filter to compute the log-likelihood of the observed data for a given set of parameters. We can then maximize the log-likelihood with respect to those parameters in the usual way, re-running the Kalman filter at each candidate parameter value to evaluate the log-likelihood.

Our general state space model formulation described in the previous section has an observation equation $y_t = A x_t + V_t$ and a state equation $x_t = \Theta x_{t-1} + W_t$, where $$y_t$$ is a $$p\times 1$$ vector, $$x_t$$ is a $$k\times 1$$ vector, $$A$$ is a $$p\times k$$ matrix, and $$\Theta$$ is a $$k\times k$$ matrix. We will assume $$V_t\sim\mathcal{N}(0, S)$$ and $$W_t\sim\mathcal{N}(0, R)$$.
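As a concrete illustration, here is a short simulation of this state space model in Python; the particular values of `A`, `Theta`, `S`, `R`, and the dimensions are arbitrary choices for the example, not quantities from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions: p = 1 observation, k = 2 states (chosen for illustration)
p, k, n = 1, 2, 100

# Example system matrices (hypothetical values)
A = np.array([[1.0, 0.0]])                   # p x k observation matrix
Theta = np.array([[0.8, 0.1], [0.0, 0.7]])   # k x k state transition matrix
S = np.array([[0.5]])                        # p x p observation noise covariance
R = 0.1 * np.eye(k)                          # k x k state noise covariance

x = np.zeros(k)           # initial state x_0
y = np.empty((n, p))
for t in range(n):
    # State equation: x_t = Theta x_{t-1} + W_t,  W_t ~ N(0, R)
    x = Theta @ x + rng.multivariate_normal(np.zeros(k), R)
    # Observation equation: y_t = A x_t + V_t,  V_t ~ N(0, S)
    y[t] = A @ x + rng.multivariate_normal(np.zeros(p), S)
```

The simulated `y` is the kind of observed series whose likelihood the Kalman filter lets us evaluate.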

To evaluate and maximize the likelihood function of the data, we need the joint density of the observed data, $$p(y_1,y_2,\dots,y_n)$$, which we can subsequently factor into $\begin{eqnarray*} p(y_1,y_2,\dots,y_n) & = & p(y_1)p(y_2,\dots,y_n\mid y_1)\\ & = & p(y_1)p(y_2\mid y_1)p(y_3,\dots,y_n\mid y_1,y_2)\\ & \vdots & \\ & = & p(y_1)p(y_2\mid y_1)p(y_3\mid y_1, y_2)\cdots p(y_n\mid y_1,\dots,y_{n-1}) \end{eqnarray*}$

If we pick apart this factorization, we initially need to compute $$p(y_1)$$. We can do this by augmenting it with the state variable $$x_1$$ and then integrating it out. Although this sounds harder, it actually makes life easier.

$\begin{eqnarray*} p(y_1) & = & \int p(y_1, x_1)\,dx_1\\ & = & \int p(y_1\mid x_1)p(x_1)\,dx_1 \end{eqnarray*}$

If you recall from the previous section, $$p(y_1\mid x_1)$$ is the density for the observation equation, which in this case is $$\mathcal{N}(Ax_1, S)$$. The density $$p(x_1)$$ is (again, from the previous section) $$\mathcal{N}(x_1^0, P_1^0)$$, where

$\begin{eqnarray*} x_1^0 & = & \Theta x_0^0\\ P_1^0 & = & \Theta P_0^0\Theta^\prime + R. \end{eqnarray*}$

Integrating the product of these two densities over $$x_1$$ gives us $p(y_1) = \mathcal{N}(Ax_1^0, AP_1^0A^\prime + S).$ If we assume that $$A$$ is known (as previously), then the quantities $$x_1^0$$ and $$P_1^0$$ are routinely computed in the implementation of the Kalman filtering algorithm.
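The prediction step above is just two matrix computations. A minimal sketch in Python, using hypothetical values for the initial state mean $$x_0^0$$, its covariance $$P_0^0$$, and the system matrices:

```python
import numpy as np

# Hypothetical initial conditions and system matrices (illustration only)
x00 = np.zeros(2)        # x_0^0
P00 = np.eye(2)          # P_0^0
Theta = np.array([[0.8, 0.1], [0.0, 0.7]])
R = 0.1 * np.eye(2)
A = np.array([[1.0, 0.0]])
S = np.array([[0.5]])

# Prediction step: x_1^0 = Theta x_0^0,  P_1^0 = Theta P_0^0 Theta' + R
x10 = Theta @ x00
P10 = Theta @ P00 @ Theta.T + R

# Marginal distribution of y_1: N(A x_1^0, A P_1^0 A' + S)
mean_y1 = A @ x10
cov_y1 = A @ P10 @ A.T + S
```

`mean_y1` and `cov_y1` are exactly the ingredients needed to evaluate the first factor $$p(y_1)$$ in the likelihood.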

The general idea here is that we start with $$t=1$$, run the Kalman filtering algorithm for each $$t$$, and compute the quantities needed for each term of the likelihood function along the way. By the time we reach $$t=n$$, we have everything we need to compute the joint likelihood function.

We can see that for $$t=2$$ we will need to compute $$p(y_2\mid y_1)$$, which can be written

$\begin{eqnarray*} p(y_2\mid y_1) & = & \int p(y_2, x_2\mid y_1)\,dx_2\\ & = & \int p(y_2\mid x_2)p(x_2\mid y_1)\, dx_2\\ & = & \int \varphi(y_2\mid Ax_2, S)\varphi(x_2\mid x_2^1, P_2^1)\,dx_2\\ & = & \mathcal{N}(Ax_2^1, AP_2^1A^\prime + S). \end{eqnarray*}$ Here, $$\varphi(\cdot\mid\mu,\Sigma)$$ denotes the multivariate normal density with mean $$\mu$$ and covariance matrix $$\Sigma$$, and $$x_2^1$$ and $$P_2^1$$ are the one-step-ahead state mean and covariance produced by the Kalman filter after observing $$y_1$$.
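Obtaining $$x_2^1$$ and $$P_2^1$$ requires one Kalman filter update at $$t=1$$ (to incorporate $$y_1$$) followed by a prediction step. A sketch, with hypothetical matrices and an illustrative observed value `y1`:

```python
import numpy as np

# Hypothetical matrices and an observed y_1 (illustration only)
A = np.array([[1.0, 0.0]])
Theta = np.array([[0.8, 0.1], [0.0, 0.7]])
S = np.array([[0.5]])
R = 0.1 * np.eye(2)
x10, P10 = np.zeros(2), np.eye(2)   # predicted mean/cov of x_1, i.e. x_1^0, P_1^0
y1 = np.array([0.3])

# Update step at t = 1: incorporate y_1 via the Kalman gain
Sigma = A @ P10 @ A.T + S                # innovation covariance
K = P10 @ A.T @ np.linalg.inv(Sigma)     # Kalman gain
x11 = x10 + K @ (y1 - A @ x10)           # x_1^1 = E[x_1 | y_1]
P11 = P10 - K @ A @ P10                  # P_1^1

# Prediction step for t = 2: p(x_2 | y_1) = N(x_2^1, P_2^1)
x21 = Theta @ x11
P21 = Theta @ P11 @ Theta.T + R

# Conditional distribution of y_2: N(A x_2^1, A P_2^1 A' + S)
mean_y2, cov_y2 = A @ x21, A @ P21 @ A.T + S
```

Each subsequent term $$p(y_t\mid y_1,\dots,y_{t-1})$$ comes from repeating this update-then-predict cycle.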

In general, we will have

$p(y_t\mid y_1,\dots,y_{t-1}) = \mathcal{N}(Ax_t^{t-1}, A P_t^{t-1}A^\prime + S).$ If we let $$\boldsymbol{\beta}$$ represent the vector of unknown parameters, then computing the log-likelihood requires computing the sum $\ell(\boldsymbol{\beta}) = \sum_{t=1}^n \log p(y_t\mid y_1,\dots,y_{t-1}),$ where we define $$p(y_1\mid y_0) = p(y_1)$$ because there is no $$y_0$$. We can then maximize $$\ell(\boldsymbol{\beta})$$ with respect to $$\boldsymbol{\beta}$$ using standard non-linear maximization routines like Newton’s method or quasi-Newton approaches.
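To make the whole procedure concrete, here is a sketch of the log-likelihood computation and its maximization for a hypothetical scalar ($$p = k = 1$$) version of the model, using `scipy.optimize.minimize` as the quasi-Newton routine. The parameterization $$\boldsymbol{\beta} = (\theta, \log r, \log s)$$ is our choice for illustration; the log-variance form keeps the variances positive during unconstrained optimization.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.optimize import minimize

def kalman_loglik(beta, y, A, x00, P00):
    """Log-likelihood via the Kalman filter for a scalar state space model.

    Hypothetical parameterization: beta = (theta, log r, log s).
    """
    theta, r, s = beta[0], np.exp(beta[1]), np.exp(beta[2])
    Theta, R, S = np.array([[theta]]), np.array([[r]]), np.array([[s]])
    x, P = x00, P00
    ll = 0.0
    for yt in y:
        # Prediction step: x_t^{t-1}, P_t^{t-1}
        x = Theta @ x
        P = Theta @ P @ Theta.T + R
        # Likelihood contribution: log p(y_t | y_1, ..., y_{t-1})
        Sigma = A @ P @ A.T + S
        ll += multivariate_normal.logpdf(yt, mean=(A @ x).ravel(), cov=Sigma)
        # Update step: x_t^t, P_t^t
        K = P @ A.T @ np.linalg.inv(Sigma)
        x = x + K @ (yt - A @ x)
        P = P - K @ A @ P
    return ll

# Simulate data from the scalar model (theta = 0.8, r = 0.1, s = 0.5),
# then maximize the log-likelihood by minimizing its negative
rng = np.random.default_rng(1)
x, y = 0.0, np.empty((200, 1))
for t in range(200):
    x = 0.8 * x + rng.normal(scale=np.sqrt(0.1))
    y[t] = x + rng.normal(scale=np.sqrt(0.5))

A, x00, P00 = np.array([[1.0]]), np.zeros(1), np.eye(1)
fit = minimize(lambda b: -kalman_loglik(b, y, A, x00, P00),
               x0=np.array([0.5, np.log(0.2), np.log(0.2)]),
               method="BFGS")
```

After fitting, `fit.x[0]` estimates $$\theta$$ and `np.exp(fit.x[1:])` recovers the variance estimates.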