5.3 Deriving the One-dimensional Case
There are a variety of ways to derive the Kalman filtering equations, but for statisticians it’s probably easiest to think of everything in terms of Normal distributions. Keep in mind that we generally won’t believe that the data are truly Normally distributed, but rather we can think of the Normal distribution as a working model. We will continue with the local level model described above, which has
\[\begin{eqnarray*} x_t & = & \theta x_{t-1} + w_t\\ y_t & = & x_t + v_t \end{eqnarray*}\] where \(w_t\sim\mathcal{N}(0, \tau^2)\) and \(v_t\sim\mathcal{N}(0,\sigma^2)\).
Let’s start at \(t=1\) where we will observe \(y_1\). Assume we have our initial state \(x_0\sim\mathcal{N}(x_0^0, P_0^0)\). First, we want to get the marginal distribution of \(x_1\), i.e. \(p(x_1)\). Because there is no \(y_0\) we cannot condition on any observed information yet. We can compute \(p(x_1)\) as \[\begin{eqnarray*} p(x_1) & = & \int p(x_1\mid x_0)p(x_0)\, dx_0\\ & = & \int\mathcal{N}(\theta x_0, \tau^2)\times\mathcal{N}(x_0^0, P_0^0)\,dx_0\\ & = & \mathcal{N}(\theta x_0^0, \theta^2 P_0^0 + \tau^2)\\ & = & \mathcal{N}(x_1^0, P_1^0). \end{eqnarray*}\] Note that we have defined \(x_1^0\stackrel{\Delta}{=}\theta x_0^0\) and \(P_1^0\stackrel{\Delta}{=}\theta^2P_0^0 + \tau^2\). We can think of \(x_1^0\) as the best prediction we can make based on our knowledge of the system and no data.
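As a quick numerical illustration of this prediction step, here is a minimal Python sketch; the parameter values (`theta`, `tau2`) and the initial state values are made up for the example, not taken from the text.

```python
# Prediction step at t = 1: p(x_1) = N(theta * x0_0, theta^2 * P0_0 + tau2)
theta, tau2 = 0.9, 1.0          # hypothetical state equation parameters
x0_0, P0_0 = 0.0, 1.0           # hypothetical initial state mean x_0^0 and variance P_0^0

x1_0 = theta * x0_0             # predicted mean x_1^0
P1_0 = theta**2 * P0_0 + tau2   # predicted variance P_1^0
```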
Given the new observation \(y_1\) we want to use this information to estimate \(x_1\). For that we need the conditional distribution \(p(x_1\mid y_1)\), which is called the filter density. We can figure that out with Bayes’ rule: \[ p(x_1\mid y_1) \propto p(y_1\mid x_1)p(x_1) \]
We know from the observation equation that \(p(y_1\mid x_1)=\mathcal{N}(x_1, \sigma^2)\) and we just computed \(p(x_1)\) above. Therefore, using the basic properties of the Normal distribution, we have \[\begin{eqnarray*} p(x_1\mid y_1) & \propto & p(y_1\mid x_1)p(x_1)\\ & = & \varphi(y_1\mid x_1, \sigma^2)\times\varphi(x_1\mid x_1^0, P_1^0)\\ & = & \mathcal{N}(x_1^0+K_1(y_1-x_1^0), (1-K_1)P_1^0) \end{eqnarray*}\] where \[ K_1 = \frac{P_1^0}{P_1^0 + \sigma^2} \] is the Kalman gain coefficient. Then for \(t=1\) we have our new estimates \[\begin{eqnarray*} x_1^1 & = & \mathbb{E}[x_1\mid y_1]\\ & = & x_1^0+K_1(y_1-x_1^0) \end{eqnarray*}\] and \[\begin{eqnarray*} P_1^1 & = & \text{Var}(x_1\mid y_1)\\ & = & (1-K_1)P_1^0. \end{eqnarray*}\] So the filter density is \(p(x_1\mid y_1) = \mathcal{N}(x_1^1, P_1^1)\).
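Continuing the sketch above, the filtering update at \(t=1\) follows directly from these formulas; the observation `y1` and the variance `sigma2` below are again hypothetical values.

```python
sigma2 = 0.5                     # hypothetical observation variance
y1 = 1.3                         # hypothetical observation at t = 1

K1 = P1_0 / (P1_0 + sigma2)      # Kalman gain coefficient
x1_1 = x1_0 + K1 * (y1 - x1_0)   # filtered mean x_1^1
P1_1 = (1 - K1) * P1_0           # filtered variance P_1^1
```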
Let’s now iterate through this process for \(t=2\), where we have a new observation \(y_2\). We want to compute the new filter density for \(x_2\), \[ p(x_2\mid y_1, y_2) \propto p(y_2\mid x_2)p(x_2\mid y_1). \] Implicit in the statement above is that \(y_2\) is conditionally independent of \(y_1\) given the value of \(x_2\). The new filter density \(p(x_2\mid y_1, y_2)\) is a function of the history of the observed data and is proportional to the product of the observation density \(p(y_2\mid x_2)\) and the forecast density \(p(x_2\mid y_1)\).
In this case, the observation density is simply \(\mathcal{N}(x_2, \sigma^2)\). The forecast density can be derived by augmenting \(p(x_2\mid y_1)\) with the previous state value \(x_1\), \[\begin{eqnarray*} p(x_2\mid y_1) & = & \int p(x_2, x_1\mid y_1)\,dx_1\\ & \propto & \int p(x_2\mid x_1)p(x_1\mid y_1)\,dx_1. \end{eqnarray*}\] Inside the integral, we have the product of the state equation density and the filter density for \(x_1\mid y_1\), which we just computed for \(t=1\). The state equation density is \(\mathcal{N}(\theta x_1, \tau^2)\) and the filter density is \(\mathcal{N}(x_1^1, P_1^1)\). Integrating out \(x_1\), we get \[\begin{eqnarray*} p(x_2\mid y_1) & \propto & \int\varphi(x_2\mid \theta x_1, \tau^2)\varphi(x_1\mid x_1^1, P_1^1)\,dx_1\\ & = & \varphi(x_2\mid \theta x_1^1, \theta^2 P_1^1 + \tau^2)\\ & = & \varphi(x_2\mid x_2^1, P_2^1), \end{eqnarray*}\] where \(x_2^1\stackrel{\Delta}{=}\theta x_1^1\) and \(P_2^1\stackrel{\Delta}{=}\theta^2 P_1^1 + \tau^2\).
Putting the forecast density we just computed with the observation density, we get \[\begin{eqnarray*} p(x_2\mid y_1, y_2) & \propto & \varphi(y_2\mid x_2,\sigma^2)\varphi(x_2\mid x_2^1, P_2^1)\\ & = & \mathcal{N}(x_2^1 + K_2(y_2-x_2^1),\, (1-K_2)P_2^1) \end{eqnarray*}\] where \[ K_2 = \frac{P_2^1}{P_2^1 + \sigma^2} \] is the new Kalman gain coefficient. If we define \(x_2^2=x_2^1 + K_2(y_2-x_2^1)\) and \(P_2^2=(1-K_2)P_2^1\) then we have that \(p(x_2\mid y_1,y_2)=\mathcal{N}(x_2^2, P_2^2)\).
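Because the computations at \(t=2\) have exactly the same form as at \(t=1\), one natural way to code them is as a single predict-and-update function. The sketch below is one such implementation; the function name and argument names are my own choices rather than notation from the text.

```python
def kalman_step(x_prev, P_prev, y, theta, tau2, sigma2):
    """One predict/update cycle of the one-dimensional Kalman filter."""
    # Forecast density: N(theta * x_prev, theta^2 * P_prev + tau2)
    x_pred = theta * x_prev
    P_pred = theta ** 2 * P_prev + tau2
    # Combine the forecast with the new observation y via the Kalman gain
    K = P_pred / (P_pred + sigma2)
    x_filt = x_pred + K * (y - x_pred)
    P_filt = (1 - K) * P_pred
    return x_filt, P_filt

# For example, the t = 2 update given the t = 1 filter values and a new y2:
# x2_2, P2_2 = kalman_step(x1_1, P1_1, y2, theta, tau2, sigma2)
```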
Shall we do \(t=3\) just for fun? Given a new observation \(y_3\), we want the new filter density \[ p(x_3\mid y_1, y_2, y_3) \propto p(y_3\mid x_3) p(x_3\mid y_1, y_2). \] Using the same ideas as before, we know the observation density \(p(y_3\mid x_3)\) and the forecast density is \[\begin{eqnarray*} p(x_3\mid y_1, y_2) & = & \int p(x_3, x_2\mid y_1, y_2)\,dx_2\\ & \propto & \int p(x_3\mid x_2)p(x_2\mid y_1, y_2)\,dx_2\\ & = & \int \varphi(x_3\mid \theta x_2, \tau^2)\varphi(x_2\mid x_2^2, P_2^2)\,dx_2\\ & = & \varphi(x_3\mid \theta x_2^2,\, \theta^2P_2^2+\tau^2)\\ & = & \varphi(x_3\mid x_3^2, P_3^2) \end{eqnarray*}\] Now the new filter density is \[\begin{eqnarray*} p(x_3\mid y_1, y_2, y_3) & \propto & p(y_3\mid x_3) p(x_3\mid y_1, y_2)\\ & = & \varphi(y_3\mid x_3, \sigma^2)\varphi(x_3\mid x_3^2, P_3^2)\\ & = & \mathcal{N}(x_3^2 + K_3(y_3-x_3^2),\,(1-K_3)P_3^2) \end{eqnarray*}\] where \(K_3=P_3^2/(P_3^2 + \sigma^2)\).
To summarize, for each \(t=1,2,3,\dots\) the estimate of \(x_t\) is the mean of the filter density \(p(x_t\mid y_1, \dots, y_t)\), and the filter density is proportional to the product of the observation density and the forecast density, i.e. \[ p(x_t\mid y_1,\dots,y_t) \propto p(y_t\mid x_t)p(x_t\mid y_1,\dots, y_{t-1}). \] The benefit of the Kalman filtering algorithm is that we compute each estimate recursively, so there is no need to “save” information from previous iterations. Each iteration has built into it all of the information from previous iterations.
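To see the recursion end to end, here is a short simulation-and-filtering sketch that reuses the `kalman_step` function defined above; the parameter values and simulated data are purely illustrative. Note that each pass through the loop carries forward only the current filtered mean and variance, which is exactly the recursive property described above.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, tau2, sigma2 = 0.9, 1.0, 0.5   # hypothetical parameter values
n = 100

# Simulate from the state and observation equations
x = np.zeros(n)
y = np.zeros(n)
x_prev = 0.0
for t in range(n):
    x[t] = theta * x_prev + rng.normal(scale=np.sqrt(tau2))
    y[t] = x[t] + rng.normal(scale=np.sqrt(sigma2))
    x_prev = x[t]

# Run the filter: only the current (x_filt, P_filt) pair is carried forward
x_filt, P_filt = 0.0, 1.0             # hypothetical initial x_0^0 and P_0^0
filtered_means = np.zeros(n)
for t in range(n):
    x_filt, P_filt = kalman_step(x_filt, P_filt, y[t], theta, tau2, sigma2)
    filtered_means[t] = x_filt        # x_t^t, the mean of p(x_t | y_1, ..., y_t)
```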