6.2 Maximum Likelihood with the Kalman Filter

The basic idea here is that if we can formulate a time series model as a state space model, then we can use the Kalman filter to compute the log-likelihood of the observed data for a given set of parameters. We can then maximize the log-likelihood in the usual way, invoking the Kalman filter at each iteration to evaluate it at the current parameter values.

Our general state space model formulation described in the previous section has an observation equation $y_t = A x_t + V_t$ and a state equation $x_t = \Theta x_{t-1} + W_t$, where $y_t$ is a $p\times 1$ vector, $x_t$ is a $k\times 1$ vector, $A$ is a $p\times k$ matrix, and $\Theta$ is a $k\times k$ matrix. We will assume $V_t\sim\mathcal{N}(0, S)$ and $W_t\sim\mathcal{N}(0, R)$.
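To make the notation concrete, here is a minimal simulation sketch in Python/NumPy for the scalar case $p = k = 1$, an AR(1) state observed with noise. The parameter values and initial state here are illustrative assumptions of ours, not part of the model above.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 200
A = 1.0      # observation matrix (scalar here)
Theta = 0.8  # state transition matrix (scalar here)
S = 0.5      # Var(V_t), observation noise variance
R = 1.0      # Var(W_t), state noise variance

x = np.empty(n)
y = np.empty(n)
x_prev = 0.0  # assumed initial state x_0
for t in range(n):
    x[t] = Theta * x_prev + rng.normal(scale=np.sqrt(R))  # x_t = Theta x_{t-1} + W_t
    y[t] = A * x[t] + rng.normal(scale=np.sqrt(S))        # y_t = A x_t + V_t
    x_prev = x[t]
```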

To evaluate and maximize the likelihood function of the data, we need the joint density of the observed data, $p(y_1, y_2, \dots, y_n)$, which we can subsequently factor into
$$
\begin{aligned}
p(y_1, y_2, \dots, y_n)
&= p(y_1)\,p(y_2,\dots,y_n\mid y_1)\\
&= p(y_1)\,p(y_2\mid y_1)\,p(y_3,\dots,y_n\mid y_1, y_2)\\
&= p(y_1)\,p(y_2\mid y_1)\,p(y_3\mid y_1, y_2)\cdots p(y_n\mid y_1,\dots,y_{n-1}).
\end{aligned}
$$

If we pick apart this factorization, we initially need to compute $p(y_1)$. We can do this by augmenting it with the state variable $x_1$ and then integrating $x_1$ out. Although this sounds harder, it actually makes life easier.

$$
p(y_1) = \int p(y_1, x_1)\,dx_1 = \int p(y_1\mid x_1)\,p(x_1)\,dx_1.
$$

If you recall from the previous section, $p(y_1\mid x_1)$ is the density for the observation equation, which in this case is $\mathcal{N}(A x_1, S)$. The density $p(x_1)$ is (again, from the previous section) $\mathcal{N}(x_1^0, P_1^0)$, where

$$
\begin{aligned}
x_1^0 &= \Theta x_0^0\\
P_1^0 &= \Theta P_0^0 \Theta' + R.
\end{aligned}
$$

Integrating the product of those two densities gives us $p(y_1) = \mathcal{N}(A x_1^0,\, A P_1^0 A' + S)$. If we assume that $A$ is known (as previously), then the quantities $x_1^0$ and $P_1^0$ are routinely computed in the implementation of the Kalman filtering algorithm.
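Continuing the scalar sketch from above, a single prediction step produces $x_1^0$ and $P_1^0$, and with them the first term of the log-likelihood. The initial values $x_0^0$ and $P_0^0$ below are assumed for illustration.

```python
from scipy.stats import norm

x00, P00 = 0.0, 1.0  # assumed initial state mean and variance

x10 = Theta * x00              # x_1^0 = Theta x_0^0
P10 = Theta * P00 * Theta + R  # P_1^0 = Theta P_0^0 Theta' + R

# log p(y_1): Normal density with mean A x_1^0 and variance A P_1^0 A' + S
loglik = norm.logpdf(y[0], loc=A * x10, scale=np.sqrt(A * P10 * A + S))
```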

The general idea here is that we will start with $t = 1$, run the Kalman filtering algorithm for each $t$, and compute the necessary quantities for each step of the likelihood function. By the time we reach $t = n$, we should have everything we need to compute the joint likelihood function.

We can see that for $t = 2$ we will need to compute $p(y_2\mid y_1)$, which can be written

$$
\begin{aligned}
p(y_2\mid y_1)
&= \int p(y_2, x_2\mid y_1)\,dx_2\\
&= \int p(y_2\mid x_2)\,p(x_2\mid y_1)\,dx_2\\
&= \int \varphi(y_2\mid A x_2, S)\,\varphi(x_2\mid x_2^1, P_2^1)\,dx_2\\
&= \mathcal{N}(A x_2^1,\, A P_2^1 A' + S).
\end{aligned}
$$
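In code, this amounts to filtering $y_1$ into the state and then predicting one step ahead. A sketch continuing the example above, where the update step uses the standard Kalman gain formulas from the previous section (written here in their scalar form):

```python
K1 = P10 * A / (A * P10 * A + S)   # Kalman gain at t = 1
x11 = x10 + K1 * (y[0] - A * x10)  # x_1^1, filtered state mean
P11 = (1 - K1 * A) * P10           # P_1^1, filtered state variance

x21 = Theta * x11                  # x_2^1 = Theta x_1^1
P21 = Theta * P11 * Theta + R      # P_2^1 = Theta P_1^1 Theta' + R

# log p(y_2 | y_1): Normal with mean A x_2^1 and variance A P_2^1 A' + S
loglik += norm.logpdf(y[1], loc=A * x21, scale=np.sqrt(A * P21 * A + S))
```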

In general, we will have

$$
p(y_t\mid y_1,\dots,y_{t-1}) = \mathcal{N}(A x_t^{t-1},\, A P_t^{t-1} A' + S).
$$

If we let $\boldsymbol{\beta}$ represent the vector of unknown parameters, then the log-likelihood is the sum
$$
\ell(\boldsymbol{\beta}) = \sum_{t=1}^n \log p(y_t\mid y_1,\dots,y_{t-1}),
$$
where we define $p(y_1\mid y_0) = p(y_1)$ because there is no $y_0$. We can then maximize $\ell(\boldsymbol{\beta})$ with respect to $\boldsymbol{\beta}$ using standard non-linear maximization routines like Newton’s method or quasi-Newton approaches.
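Putting the pieces together, here is a sketch of $-\ell(\boldsymbol{\beta})$ for the scalar example above, taking $\boldsymbol{\beta} = (\Theta, \log S, \log R)$ and minimizing it with a quasi-Newton routine. The log-scale parameterization of the variances and the fixed initial conditions are choices of ours for illustration, not prescribed by the development above; the function name `neg_loglik` is likewise hypothetical.

```python
from scipy.optimize import minimize

def neg_loglik(beta, y, A=1.0, x00=0.0, P00=1.0):
    """Negative log-likelihood for the scalar model; beta = (Theta, log S, log R).

    Variances are parameterized on the log scale so the optimizer works on
    an unconstrained space; A and the initial conditions are held fixed.
    """
    Theta = beta[0]
    S, R = np.exp(beta[1]), np.exp(beta[2])
    xf, Pf = x00, P00  # filtered mean and variance, x_{t-1}^{t-1} and P_{t-1}^{t-1}
    ll = 0.0
    for yt in y:
        xp = Theta * xf              # prediction: x_t^{t-1}
        Pp = Theta * Pf * Theta + R  # prediction: P_t^{t-1}
        f = A * Pp * A + S           # innovation variance
        e = yt - A * xp              # innovation
        ll += -0.5 * (np.log(2 * np.pi * f) + e * e / f)  # log p(y_t | y_1,...,y_{t-1})
        K = Pp * A / f               # Kalman gain
        xf = xp + K * e              # update: x_t^t
        Pf = (1 - K * A) * Pp        # update: P_t^t
    return -ll

# Maximize the log-likelihood via a quasi-Newton method (BFGS here).
fit = minimize(neg_loglik, x0=np.array([0.5, 0.0, 0.0]), args=(y,), method="BFGS")
Theta_hat = fit.x[0]
S_hat, R_hat = np.exp(fit.x[1]), np.exp(fit.x[2])
```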