Chapter 2 Overview of Topic Modeling

We first lay out a basic topic model that does not account for time.

In this model, we have a number of people, M.

Each individual \(i\) has a number of diagnoses, \(N_i\), for \(i = 1, \dots, M\). We will simulate \(N_i\) as a Poisson random variable with a mean of 10 diagnoses per patient.

There is a total of \(N=\sum_{i=1}^{M}N_{i}\) diagnoses over all individuals.
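The counts described above can be sketched in a few lines of NumPy. This is only an illustration: the value of `M` and the random seed are assumptions chosen for the example, while the Poisson mean of 10 comes from the text.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed is arbitrary, for reproducibility only

M = 1000             # number of individuals (assumed value for illustration)
mean_diagnoses = 10  # Poisson mean stated in the text

# N_i: number of diagnoses for each individual i = 1..M
N_i = rng.poisson(lam=mean_diagnoses, size=M)

# N: total number of diagnoses over all individuals
N = int(N_i.sum())

print(N_i[:5], N)
```

Note that a Poisson draw can be zero, so under this sketch a few individuals may have no diagnoses at all.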

There are \(V\) distinct disease codes in total.

For each of the M individuals, there exists a distribution over K topics, \(\theta_{i}\).

The topic distribution for a given individual, \(\theta_i\), will be simulated sparsely, so that each individual loads on a small number of topics. We draw it from a Dirichlet distribution with \(\alpha = (0.1, \dots, 0.1)\), a vector of length \(K\).

For each of the \(K\) topics, there is a distribution over the \(V\) diseases, \(\phi_k\). This parameter is also simulated from a sparse Dirichlet, this time with hyperparameter \(\beta\), for example also 0.1 for each of the \(V\) diseases.
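The two Dirichlet draws described above can be sketched as follows. The sizes `M`, `K` and `V`, the seed, and the 5% threshold used to count "active" topics are assumptions made for the example; only the concentration value 0.1 is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed

M, K, V = 200, 5, 100    # illustrative sizes, not taken from the text

alpha = np.full(K, 0.1)  # symmetric prior over topics
beta = np.full(V, 0.1)   # symmetric prior over diseases

# theta[i] is the topic distribution of individual i; each row sums to 1
theta = rng.dirichlet(alpha, size=M)  # shape (M, K)

# phi[k] is the disease distribution of topic k; each row sums to 1
phi = rng.dirichlet(beta, size=K)     # shape (K, V)

# Rough sparsity check: topics holding more than 5% of an individual's mass
topics_per_person = (theta > 0.05).sum(axis=1)
print(topics_per_person.mean())
```

With a concentration well below 1, most of each row's mass tends to fall on one or two entries, which is the sparsity the text is aiming for.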

Here we allow each topic to have a ‘topic-specific’ disease enrichment: for a select number \(d\) of randomly sampled diseases (here \(d = 10\)), the weight is \((V-d)/V\), while each of the remaining \(V-d\) diseases gets \(1/V\).
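The text does not say which quantity the two values \((V-d)/V\) and \(1/V\) are assigned to. One possible reading, shown here purely as an assumption, is that they act as the Dirichlet weights for \(\phi_k\): the \(d\) enriched diseases in a topic get weight \((V-d)/V\) and all others get \(1/V\). The sizes `K` and `V` and the seed are also illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # arbitrary seed

K, V, d = 5, 100, 10  # K and V are illustrative; d = 10 as in the text

phi = np.empty((K, V))
enriched = np.empty((K, d), dtype=int)

for k in range(K):
    # Choose d distinct diseases at random to be enriched in topic k
    enriched[k] = rng.choice(V, size=d, replace=False)

    # ASSUMPTION: the two values from the text are used as Dirichlet weights
    beta_k = np.full(V, 1.0 / V)
    beta_k[enriched[k]] = (V - d) / V

    phi[k] = rng.dirichlet(beta_k)

# Share of each topic's probability mass that falls on its enriched diseases
share = np.array([phi[k, enriched[k]].sum() for k in range(K)])
print(share)
```

Under this reading the enriched diseases carry, in expectation, a fraction \(d/(d+1)\) of each topic's mass (about 0.91 for \(d = 10\)), whatever the value of \(V\). Other readings, such as setting the probabilities directly, would behave differently.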

2.1 Generative model

Each diagnosis arises from a topic (topics can overlap, so one disease can feature in several topics). The topic is recorded by a latent indicator \(z_{ij}\), for \(i = 1, \dots, M\) and \(j = 1, \dots, N_i\), which takes an integer value from 1 to \(K\).

The index \(j = 1, \dots, N_i\) runs over the \(N_i\) diagnoses of individual \(i\), so that \(Z\), the collection of all indicators, is a vector of length \(N\).

Given the assigned topic \(k\) for a diagnosis, the diagnosed disease is drawn from the distribution \(\phi_k\); that is, disease \(w\) is chosen with probability \(\phi_{k,w}\).
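Putting the pieces together, the generative steps above amount to drawing \(z_{ij} \sim \text{Categorical}(\theta_i)\) and then a disease from \(\text{Categorical}(\phi_{z_{ij}})\). The sketch below simulates this directly. The sizes, the seed, the symmetric 0.1 priors, and the helper array `person` are assumptions for illustration, and indices are 0-based as is usual in Python (the text counts from 1).

```python
import numpy as np

rng = np.random.default_rng(seed=3)  # arbitrary seed

M, K, V = 50, 5, 100  # illustrative sizes, not taken from the text

theta = rng.dirichlet(np.full(K, 0.1), size=M)  # (M, K) topic distributions
phi = rng.dirichlet(np.full(V, 0.1), size=K)    # (K, V) disease distributions
N_i = rng.poisson(lam=10, size=M)               # diagnoses per individual

person, Z, W = [], [], []
for i in range(M):
    for _ in range(N_i[i]):
        k = rng.choice(K, p=theta[i])  # latent topic indicator z_ij
        w = rng.choice(V, p=phi[k])    # observed disease, drawn from phi_k
        person.append(i)
        Z.append(k)
        W.append(w)

# Flatten into vectors of length N = sum_i N_i
person = np.array(person)
Z = np.array(Z)
W = np.array(W)
N = int(N_i.sum())

print(N, Z[:10], W[:10])
```

The explicit double loop is slow for large \(M\) but mirrors the description one diagnosis at a time, which makes it easy to check against the text.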

2.2 Table of Variables

| Variable | Type | Meaning |
|---|---|---|
| \(K\) | integer | number of topics (e.g. 50) |
| \(V\) | integer | number of diseases in the vocabulary (e.g. 50,000 or 1,000,000) |
| \(M\) | integer | number of persons |
| \(N_{i=1 \dots M}\) | integer | number of diagnoses in person \(i\) |
| \(N\) | integer | total number of diagnoses in all persons; sum of all \(N_i\) values, i.e. \(N = \sum_{i=1}^{M} N_i\) |
| \(\alpha_{k=1 \dots K}\) | positive real | prior weight of topic \(k\) in a person; usually the same for all topics; normally a number less than 1, e.g. 0.1, to prefer sparse topic distributions, i.e. few topics per person |
| \(\alpha\) | K-dimensional vector of positive reals | collection of all \(\alpha_k\) values, viewed as a single vector |
| \(\beta_{w=1 \dots V}\) | positive real | prior weight of disease \(w\) in a topic; usually the same for all diseases; normally a number much less than 1, e.g. 0.001, to strongly prefer sparse disease distributions, i.e. few diseases per topic |
| \(\beta\) | V-dimensional vector of positive reals | collection of all \(\beta_w\) values, viewed as a single vector |
| \(\phi_{k=1 \dots K,\, w=1 \dots V}\) | probability (real number between 0 and 1) | probability of disease \(w\) occurring in topic \(k\) |
| \(\Phi_{k=1 \dots K}\) | V-dimensional vector of probabilities, which must sum to 1 | distribution of diseases in topic \(k\) (written \(\phi_k\) in the text) |
| \(\theta_{i=1 \dots M,\, k=1 \dots K}\) | probability (real number between 0 and 1) | probability of topic \(k\) occurring in person \(i\) |
| \(\Theta_{i=1 \dots M}\) | K-dimensional vector of probabilities, which must sum to 1 | distribution of topics in person \(i\) (written \(\theta_i\) in the text) |
| \(z_{i=1 \dots M,\, j=1 \dots N_i}\) | integer between 1 and K | identity of the topic of diagnosis \(j\) in person \(i\) |
| \(Z\) | N-dimensional vector of integers between 1 and K | identity of the topic of all diagnoses in all persons |
| \(w_{i=1 \dots M,\, j=1 \dots N_i}\) | integer between 1 and V | identity of the disease at diagnosis \(j\) in person \(i\) |
| \(W\) | N-dimensional vector of integers between 1 and V | identity of all diagnosed diseases in all persons |