6.2 Maximum Likelihood with the Kalman Filter
The basic idea here is that if we can formulate a time series model as a state space model, then we can use the Kalman filter to compute the log-likelihood of the observed data for a given set of parameters. We can then maximize the log-likelihood in the usual way, re-running the Kalman filter at each iteration of the optimization to evaluate it.
Our general state space model formulation described in the previous section has an observation equation \[ y_t = A x_t + V_t \] and a state equation \[ x_t = \Theta x_{t-1} + W_t \] where \(y_t\) is a \(p\times 1\) vector, \(x_t\) is a \(k\times 1\) vector, \(A\) is a \(p\times k\) matrix, and \(\Theta\) is a \(k\times k\) matrix. We will assume \(V_t\sim\mathcal{N}(0, S)\) and \(W_t\sim\mathcal{N}(0, R)\).
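To make the notation concrete, here is a minimal sketch in Python of simulating data from this state space model. The function name `simulate_state_space` and all dimensions and parameter values below are illustrative assumptions, not part of the model itself.

```python
import numpy as np

def simulate_state_space(n, A, Theta, S, R, x0, rng):
    """Simulate n observations from y_t = A x_t + V_t, x_t = Theta x_{t-1} + W_t,
    with V_t ~ N(0, S) and W_t ~ N(0, R). All names and values are illustrative."""
    p, k = A.shape
    x = x0
    ys = np.empty((n, p))
    for t in range(n):
        # State equation: propagate the state and add N(0, R) noise
        x = Theta @ x + rng.multivariate_normal(np.zeros(k), R)
        # Observation equation: map the state to the data and add N(0, S) noise
        ys[t] = A @ x + rng.multivariate_normal(np.zeros(p), S)
    return ys

# Example: p = 1 observation of a k = 2 latent state (all values assumed)
rng = np.random.default_rng(0)
A = np.array([[1.0, 0.0]])
Theta = np.array([[0.8, 0.1], [0.0, 0.5]])
S = np.array([[0.5]])
R = 0.1 * np.eye(2)
y = simulate_state_space(200, A, Theta, S, R, x0=np.zeros(2), rng=rng)
```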
To evaluate and maximize the likelihood function of the data, we need the joint density of the observed data, \(p(y_1,y_2,\dots,y_n)\), which we can subsequently factor into \[\begin{eqnarray*} p(y_1,y_2,\dots,y_n) & = & p(y_1)p(y_2,\dots,y_n\mid y_1)\\ & = & p(y_1)p(y_2\mid y_1)p(y_3,\dots,y_n\mid y_1,y_2)\\ & \vdots & \\ & = & p(y_1)p(y_2\mid y_1)p(y_3\mid y_1, y_2)\cdots p(y_n\mid y_1,\dots,y_{n-1}) \end{eqnarray*}\]
If we pick apart this factorization, we initially need to compute \(p(y_1)\). We can do this by augmenting the density with the state variable \(x_1\) and then integrating it out. Although this sounds like more work, it actually makes life easier.
\[\begin{eqnarray*} p(y_1) & = & \int p(y_1, x_1)\,dx_1\\ & = & \int p(y_1\mid x_1)p(x_1)\,dx_1 \end{eqnarray*}\]
If you recall from the previous section, \(p(y_1\mid x_1)\) is the density for the observation equation, which in this case is \(\mathcal{N}(Ax_1, S)\). The density \(p(x_1)\) is (again, from the previous section) \(\mathcal{N}(x_1^0, P_1^0)\), where
\[\begin{eqnarray*} x_1^0 & = & \Theta x_0^0\\ P_1^0 & = & \Theta P_0^0\Theta^\prime + R. \end{eqnarray*}\]
Integrating the product of those two densities over \(x_1\) gives us \[ p(y_1) = \mathcal{N}(Ax_1^0, AP_1^0A^\prime + S). \] If we assume that \(A\) is known (as previously), then the quantities \(x_1^0\) and \(P_1^0\) are both routinely computed in the implementation of the Kalman filtering algorithm.
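As a small illustration, the computation of \(p(y_1)\) maps directly onto code. The sketch below (continuing the Python setup above; the name `log_p_y1` and its inputs are assumptions for illustration) forms \(x_1^0\) and \(P_1^0\) and then evaluates the resulting Gaussian log density.

```python
from scipy.stats import multivariate_normal

def log_p_y1(y1, A, Theta, S, R, x00, P00):
    # Illustrative helper, not from the text.
    # One-step-ahead prediction of the state: x_1^0 and P_1^0
    x10 = Theta @ x00
    P10 = Theta @ P00 @ Theta.T + R
    # Marginal of y_1 after integrating out x_1: N(A x_1^0, A P_1^0 A' + S)
    mean = A @ x10
    cov = A @ P10 @ A.T + S
    return multivariate_normal.logpdf(y1, mean=mean, cov=cov)
```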
The general idea here is that we will start with \(t=1\), run the Kalman filter forward through each \(t\), and compute the quantities needed for each term of the likelihood function along the way. By the time we reach \(t=n\), we will have everything we need to compute the joint likelihood function.
For \(t=2\), we will need to compute \(p(y_2\mid y_1)\), which can be written as
\[\begin{eqnarray*} p(y_2\mid y_1) & = & \int p(y_2, x_2\mid y_1)\,dx_2\\ & = & \int p(y_2\mid x_2)p(x_2\mid y_1)\, dx_2\\ & = & \int \varphi(y_2\mid Ax_2, S)\varphi(x_2\mid x_2^1, P_2^1)\,dx_2\\ & = & \mathcal{N}(Ax_2^1, AP_2^1A^\prime + S). \end{eqnarray*}\]
In general, we will have
\[ p(y_t\mid y_1,\dots,y_{t-1}) = \mathcal{N}(Ax_t^{t-1}, A P_t^{t-1}A^\prime + S). \] If we let \(\boldsymbol{\beta}\) represent the vector of unknown parameters, then computing the log-likelihood would require computing the following sum, \[ \ell(\boldsymbol{\beta}) = \sum_{t=1}^n \log p(y_t\mid y_1,\dots,y_{t-1}) \] where we would define \(p(y_1\mid y_0) = p(y_1)\) because there is no \(y_0\). We could then maximize \(\ell(\boldsymbol{\beta})\) with respect to \(\boldsymbol{\beta}\) using standard non-linear maximization routines like Newton’s method or quasi-Newton approaches.
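Putting the pieces together, below is a minimal sketch of the whole procedure: a Kalman filter pass that accumulates \(\ell(\boldsymbol{\beta})\) one term at a time, followed by a quasi-Newton maximization using `scipy.optimize.minimize`. The predict and update steps are the standard filter recursions; the particular parameterization of \(\boldsymbol{\beta}\) (a scalar-diagonal \(\Theta\) and variances on the log scale) is purely an illustrative assumption.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

def kalman_loglik(y, A, Theta, S, R, x00, P00):
    """Log-likelihood of y via the Kalman filter (prediction-error decomposition)."""
    x, P = x00, P00
    ll = 0.0
    for yt in y:
        # Predict: x_t^{t-1} and P_t^{t-1}
        xp = Theta @ x
        Pp = Theta @ P @ Theta.T + R
        # Innovation: y_t | y_1,...,y_{t-1} ~ N(A x_t^{t-1}, A P_t^{t-1} A' + S)
        mean = A @ xp
        cov = A @ Pp @ A.T + S
        ll += multivariate_normal.logpdf(yt, mean=mean, cov=cov)
        # Update: filter the state with the Kalman gain
        K = Pp @ A.T @ np.linalg.inv(cov)
        x = xp + K @ (yt - mean)
        P = Pp - K @ A @ Pp
    return ll

def negloglik(beta, y, A, x00, P00):
    # Hypothetical parameterization: beta = (theta, log s, log r), with
    # Theta = theta * I, S = exp(log s) * I, R = exp(log r) * I.
    k = A.shape[1]
    Theta = beta[0] * np.eye(k)
    S = np.exp(beta[1]) * np.eye(A.shape[0])
    R = np.exp(beta[2]) * np.eye(k)
    return -kalman_loglik(y, A, Theta, S, R, x00, P00)

# Maximize the log-likelihood (minimize its negative) with a quasi-Newton routine:
# fit = minimize(negloglik, x0=np.array([0.5, 0.0, 0.0]),
#                args=(y, A, np.zeros(A.shape[1]), np.eye(A.shape[1])),
#                method="BFGS")
```

Parameterizing the variances on the log scale keeps them positive without imposing explicit constraints on the optimizer.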