Chapter 18 Examples of HMM, Non-homogeneous Poisson Process (Lecture on 03/04/2021)

We have developed the Bayesian estimation procedure for HMMs. We now give some examples of their applications.

Example 18.1 (Simple application scenario of HMM in real life) Suppose we receive a picture from a friend every day. The picture shows the clothes he wore that day, such as: “shirts and trousers”, “jacket over shirt”, \(\cdots\). We therefore know from the pictures what he wears on successive days, but we do not know the weather on those days; the weather is a latent variable. We assume there is some connection between the weather outside on a given day and the clothes he wears. For simplicity, suppose there are three different states of weather, “sunny”, “windy” and “rainy”. Suppose also that the outfit the person wears is a categorical variable, denoted \(y\), which can take values in \(\{0,1,\cdots,C\}\) and is known to us. Suppose we have pictures from the friend for 10 successive days. That is, we have \(y_1,\cdots,y_{10}\), and we are interested in inference about the states of the weather on these 10 days, denoted \(S_1,\cdots,S_{10}\). We also do not know the transition probabilities for the weather states, that is, \(Pr(S_{t+1}=j|S_t=i)\) is unknown to us. In addition, the emission distribution \(y_t|S_t\) is unknown to us. This problem can be modeled using an HMM, from which we can obtain an estimate of the transition probabilities for \(S_t\).
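To make this setup concrete, below is a minimal simulation sketch in Python. The transition matrix, emission probabilities, outfit labels and initial distribution are purely illustrative assumptions, not quantities from the lecture or estimated from any data.

```python
import numpy as np

rng = np.random.default_rng(0)

weather = ["sunny", "windy", "rainy"]                               # hidden states S_t
outfits = ["shirts and trousers", "jacket over shirt", "raincoat"]  # observed y_t (C = 2)

# Hypothetical transition matrix Pr(S_{t+1} = j | S_t = i); rows sum to 1
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

# Hypothetical emission probabilities Pr(y_t = c | S_t = i); rows sum to 1
E = np.array([[0.80, 0.15, 0.05],
              [0.30, 0.60, 0.10],
              [0.10, 0.20, 0.70]])

T = 10                                       # ten successive days
S = np.zeros(T, dtype=int)
y = np.zeros(T, dtype=int)
S[0] = rng.integers(3)                       # uniform initial state, for simplicity
y[0] = rng.choice(3, p=E[S[0]])
for t in range(1, T):
    S[t] = rng.choice(3, p=P[S[t - 1]])      # latent weather evolves as a Markov chain
    y[t] = rng.choice(3, p=E[S[t]])          # we only get to see the outfit
print([outfits[c] for c in y])               # the observed data y_1, ..., y_10
```

In the inference problem of Example 18.1 we would observe only the printed outfits and try to recover \(P\), \(E\) and the latent states \(S_1,\cdots,S_{10}\).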

Example 18.2 (Application of HMM in genetics) Suppose we have a DNA sequence such as “ATGCTCGA\(\cdots\)”. Some parts of the sequence may code for amino acids, and when amino acids are combined, they construct proteins. Proteins are the building blocks of our body, and different types of proteins are related to different types of functions. We have the DNA sequence, but we do not know whether a subsequence of it corresponds to any specific type of protein. HMMs help people identify proteins from these long sequences. Not only that, they also tell you the probability of having two proteins in adjacent positions forming a protein block.

Example 18.3 (Application of HMM in natural language processing) In natural language processing (NLP), we try to tag the words of a long sentence with a few categories, such as: “verb”, “noun”, “modal” and “pronoun”. The data are given to us as many long sentences constructed from these types of words, and we try to learn the transition probabilities between word categories. For example, \(Pr(S_{t+1}=N|S_t=V)\) is the probability that a “noun” immediately follows a “verb”. Based on training data, a computer identifies these transition probabilities in a text, and through an HMM it is able to do this job. This is known as “part-of-speech tagging” in NLP.

Non-homogeneous Poisson process

A homogeneous Poisson process with intensity \(\lambda\) is one in which the number of occurrences in \((0,t)\) follows a \(Pois(\lambda t)\) distribution. It is called homogeneous because the rate of occurrence \(\lambda\) is constant as a function of \(t\). The expected number of occurrences in time \((0,t)\), denoted \(E(N(t))\), is equal to \(\lambda t\). However, the rate of occurrence of an event may also depend on the time at which the event occurs. Say the rate of occurrence at time \(t\) is \(\lambda(t)\). Then the number of occurrences in the interval \((0,T]\) follows the Poisson distribution \(Pois(\int_0^T\lambda(t)dt)\). If \(\lambda(t)=\lambda\), then \(\int_0^T\lambda(t)dt=\lambda T\), and we recover the homogeneous Poisson process.
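As a quick numerical illustration (with a made-up intensity \(\lambda(t)=2+\sin t\) over \((0,10]\), not one from the lecture), the expected count is just the integral of the intensity:

```python
import numpy as np
from scipy.integrate import quad

lam = lambda t: 2.0 + np.sin(t)      # hypothetical time-varying intensity
T = 10.0
Lambda_T, _ = quad(lam, 0.0, T)      # mean measure: E[N(T)] = integral_0^T lam(t) dt
print(Lambda_T)                      # about 21.84 here, versus lambda*T for a constant rate
```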

Definition 18.1 (Non-homogeneous Poisson process) A non-homogeneous Poisson process (NHPP) over time is defined by its intensity function \(\lambda(\cdot)\), which is a non-negative and locally integrable function, i.e.

  1. \(\lambda(t)\geq 0\), \(\forall t\);

  2. \(\int_B\lambda(u)du<\infty\) for all bounded sets \(B\).

The mean measure of the process is given by \(\Lambda(t)=\int_0^t\lambda(u)du\) for \(t\in\mathbb{R}^+\). Formally, a point process over time \(y=\{y(t):t\geq 0\}\) is a NHPP if \(y\) has independent increments, that is, \(y(t_2)-y(t_1)\) is independent of \(y(t_4)-y(t_3)\) whenever \(t_1<t_2\leq t_3<t_4\), and for any \(t>s\geq 0\), \(y(t)-y(s)\sim Pois(\Lambda(t)-\Lambda(s))\).
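This definition also suggests a direct way to simulate a NHPP, namely Lewis–Shedler thinning: generate candidate points from a homogeneous process with rate \(\lambda_{\max}\geq\sup_t\lambda(t)\) and keep each candidate with probability \(\lambda(t)/\lambda_{\max}\). A minimal sketch, using the same assumed intensity as above:

```python
import numpy as np

def simulate_nhpp(lam, T, lam_max, rng):
    """Simulate NHPP event times on (0, T] by thinning a homogeneous
    Poisson process of rate lam_max, where lam_max >= sup_t lam(t)."""
    t, times = 0.0, []
    while True:
        t += rng.exponential(1.0 / lam_max)        # next candidate point
        if t > T:
            break
        if rng.uniform() < lam(t) / lam_max:       # accept with prob lam(t)/lam_max
            times.append(t)
    return np.array(times)

rng = np.random.default_rng(1)
lam = lambda t: 2.0 + np.sin(t)                    # hypothetical intensity
events = simulate_nhpp(lam, T=10.0, lam_max=3.0, rng=rng)
print(len(events), events[:5])                     # event times of the kind shown in Figure 18.1
```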

When we observe data and try to fit a NHPP to them, our goal is to draw inference on the intensity function \(\lambda(\cdot)\). The data we have are usually the time points at which the events occurred, as displayed in Figure 18.1. Based on such data, our only job is to estimate the intensity function.


FIGURE 18.1: Example of data that can be modeled using NHPP.

If, besides the times at which the events occur, we also have an additional variable attached to each point, we can use a compound Poisson process to model the data.

Approach 1: Density Estimation Idea

This is a very neat approach, though it may face computational issues when the data size is even moderately large.

We consider a NHPP observed over the time interval \((0,T]\). Let the events occur at times \(0\leq t_1<t_2<\cdots<t_n\leq T\). The likelihood for the data is given by \(L\propto P(y(T)=n)f(t_1,\cdots,t_n|y(T)=n)\), which is the product of the probability that \(n\) events occur in \((0,T]\), given by \(P(y(T)=n)\), and the density of these events occurring at \(t_1,\cdots,t_n\), given by \(f(t_1,\cdots,t_n|y(T)=n)\). Firstly, since \(y(T)\sim Pois(\int_0^T\lambda(t)dt)\), we have \[\begin{equation} P(y(T)=n)=\exp(-\int_0^T\lambda(t)dt)\frac{(\int_0^T\lambda(t)dt)^n}{n!} \tag{18.1} \end{equation}\]

As for \(f(t_1,\cdots,t_n|y(T)=n)\): given that there are \(n\) events in the interval \((0,T]\), the times of occurrence of these events are i.i.d. with density \(\lambda(t)/\int_0^T\lambda(u)du\) (in the homogeneous case this reduces to i.i.d. \(Unif(0,T)\)). Therefore, \[\begin{equation} f(t_1,\cdots,t_n|y(T)=n)\propto \prod_{i=1}^n\frac{\lambda(t_i)}{\int_0^T\lambda(t)dt} \tag{18.2} \end{equation}\]

Thus, the full likelihood is \[\begin{equation} \begin{split} L&\propto \exp(-\int_0^T\lambda(t)dt)\frac{(\int_0^T\lambda(t)dt)^n}{n!}\prod_{i=1}^n\frac{\lambda(t_i)}{\int_0^T\lambda(t)dt}\\ &\propto \exp(-\int_0^T\lambda(t)dt)\prod_{i=1}^n\lambda(t_i) \end{split} \tag{18.3} \end{equation}\]
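Taking logs in (18.3), the log-likelihood is \(\sum_{i=1}^n\log\lambda(t_i)-\int_0^T\lambda(t)dt\) up to an additive constant. Below is a minimal sketch of evaluating it; the intensity and the event times are made-up illustrative values, not data from the lecture.

```python
import numpy as np
from scipy.integrate import quad

def nhpp_loglik(lam, times, T):
    """Log of the likelihood (18.3), up to a constant:
    sum_i log(lam(t_i)) - integral_0^T lam(t) dt."""
    integral, _ = quad(lam, 0.0, T)
    return np.sum(np.log(lam(times))) - integral

lam = lambda t: 2.0 + np.sin(t)                    # assumed intensity, for illustration
rng = np.random.default_rng(3)
times = np.sort(rng.uniform(0.0, 10.0, size=20))   # made-up event times on (0, 10]
print(nhpp_loglik(lam, times, T=10.0))
```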

Let \(\gamma=\int_0^T\lambda(t)dt\). We are trying to estimate \(\lambda(t)\), and we will do so using the density estimation framework. However, \(\lambda(t)\) is not a density function, but \(f(t)=\frac{\lambda(t)}{\gamma},t\in[0,T]\) is a density function on \([0,T]\).

Note that \((f(\cdot),\gamma)\) provides an equivalent representation of \(\lambda(\cdot)\). The prior we use for \(\gamma\) is \(p(\gamma)\propto \frac{1}{\gamma}\), and \(f\), being a density, will be modeled using Bayesian mixture modeling: \[\begin{equation} f(t|\boldsymbol{\theta})=\sum_{l=1}^S\omega_lK(t,\boldsymbol{\theta}_l) \tag{18.4} \end{equation}\] where \(K(t,\boldsymbol{\theta}_l)\) is a kernel function (density function) with parameters \(\boldsymbol{\theta}_l\), and the \(\omega_l\) are weights corresponding to the densities \(K(t,\boldsymbol{\theta}_l)\) with \(\sum_{l=1}^S\omega_l=1\).

A sophisticated construction of the \(\omega_l\) comes through the stick-breaking procedure. That is, \(\omega_1=Z_1\), \(\omega_2=Z_2(1-Z_1),\cdots\), \(\omega_j=Z_j\prod_{h=1}^{j-1}(1-Z_h),\cdots\), \(\omega_S=(1-Z_1)\cdots(1-Z_{S-1})\). Typically \(S\) is kept as a large number, say \(S=50\). If \(Z_l|\alpha\stackrel{i.i.d.}{\sim}Beta(1,\alpha),l=1,\cdots,S-1\), this stick-breaking process gives rise to the Dirichlet process mixture model. There are other types of constructions as well (e.g., the Pitman-Yor process).
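A minimal sketch of drawing truncated stick-breaking weights, with \(S=50\) and an assumed value of \(\alpha\):

```python
import numpy as np

def stick_breaking(alpha, S, rng):
    """Truncated stick-breaking weights: omega_j = Z_j * prod_{h<j}(1 - Z_h),
    with Z_j ~ Beta(1, alpha) for j < S and the last weight taking what remains."""
    Z = rng.beta(1.0, alpha, size=S - 1)
    omega = np.empty(S)
    remaining = 1.0
    for j in range(S - 1):
        omega[j] = Z[j] * remaining
        remaining *= 1.0 - Z[j]
    omega[S - 1] = remaining             # omega_S = prod_{h=1}^{S-1}(1 - Z_h)
    return omega

rng = np.random.default_rng(2)
w = stick_breaking(alpha=1.0, S=50, rng=rng)
print(w[:5], w.sum())                    # weights sum to 1 by construction
```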

We also have to make a choice of the kernel \(K(t,\boldsymbol{\theta}_l)\). Letting \(\boldsymbol{\theta}=(\mu,\tau)\), we can use \[\begin{equation} K(t;\mu,\tau)=\frac{t^{\mu\tau T^{-1}-1}(T-t)^{\tau(1-\mu T^{-1})-1}}{Beta(\mu\tau T^{-1},\tau(1-\mu T^{-1}))T^{\tau-1}} \tag{18.5} \end{equation}\]

This is a rescaled Beta density. Mixtures of Beta densities yield a wide range of distributional shapes; in fact, they can approximate arbitrarily well any density defined on a bounded interval (Diaconis and Ylvisaker, 1985).

Note that this kernel parameterizes the rescaled Beta distribution (rescaled because the traditional Beta distribution is defined on \((0,1)\), while in this case we need a distribution on \((0,T)\)) with support on \((0,T)\), where \(\mu\in(0,T)\) and \(\tau>0\). This mixture modeling reduces the estimation of \(f\), an infinite-dimensional estimation problem, to the estimation of a finite number of parameters.
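To illustrate (18.4)–(18.5), here is a sketch of the rescaled Beta kernel, written as a \(Beta(\mu\tau/T,\tau(1-\mu/T))\) density rescaled to \((0,T)\), together with a two-component mixture; the weights, \(\mu_l\), \(\tau_l\) and \(T\) are made-up illustrative values.

```python
import numpy as np
from scipy.stats import beta

def rescaled_beta_kernel(t, mu, tau, T):
    """Density of a Beta(mu*tau/T, tau*(1 - mu/T)) variable rescaled to (0, T),
    so mu in (0, T) is the mean and tau > 0 controls the dispersion (cf. (18.5))."""
    a, b = mu * tau / T, tau * (1.0 - mu / T)
    return beta.pdf(t / T, a, b) / T

def mixture_density(t, weights, mus, taus, T):
    """f(t | theta) = sum_l omega_l K(t; mu_l, tau_l), cf. (18.4)."""
    return sum(w * rescaled_beta_kernel(t, m, s, T)
               for w, m, s in zip(weights, mus, taus))

T = 10.0
t_grid = np.linspace(0.01, T - 0.01, 200)
f = mixture_density(t_grid, weights=[0.6, 0.4], mus=[3.0, 8.0], taus=[20.0, 40.0], T=T)
dt = t_grid[1] - t_grid[0]
print(np.sum(f) * dt)                    # crude Riemann sum, roughly 1 since f is a density on (0, T)
```

Multiplying such an \(f\) by a draw of \(\gamma\) gives a candidate intensity \(\lambda(t)=\gamma f(t)\), which can then be plugged into the likelihood (18.3).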