Chapter 2 Overview of Topic Modeling

We first lay out a basic topic model without any accounting for time.

In this model, we have a number of people, M.

Each individual $i$ has a number of diagnoses, $N_i$, for $i = 1, \dots, M$. We simulate $N_i$ as a Poisson random variable with a mean of 10 diagnoses per patient.

There are a total of $N = \sum_{i=1}^{M} N_i$ diagnoses over all individuals.
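As a minimal sketch of this step, the diagnosis counts can be drawn with NumPy. The value of `M` and the random seed are illustrative assumptions, not values from the text; only the Poisson mean of 10 comes from the description above.

```python
import numpy as np

# Fixed seed: an arbitrary choice, used only so the sketch is reproducible.
rng = np.random.default_rng(0)

M = 1000              # number of individuals (illustrative value)
mean_diagnoses = 10   # Poisson mean stated in the text

# N_i: number of diagnoses for each individual i = 1..M
N_i = rng.poisson(lam=mean_diagnoses, size=M)

# N: total number of diagnoses over all individuals
N = int(N_i.sum())
```

Note that a Poisson draw can be zero, so under this scheme a few individuals may have no diagnoses at all.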

There are $V$ distinct disease codes in total.

For each of the $M$ individuals, there exists a distribution over $K$ topics, $\theta_i$.

The topic distribution for a given individual, $\theta_i$, is simulated sparsely, so that each individual loads on a small number of topics. We draw it from a Dirichlet distribution with $\alpha = [0.1, \dots, 0.1]$, a vector of length $K$.
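The effect of a small concentration parameter can be checked with a short sketch. The sizes `M` and `K`, the seed, and the 5% threshold used to count "loaded" topics are illustrative assumptions; only the value 0.1 comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

M, K = 500, 20            # illustrative sizes, not from the text
alpha = np.full(K, 0.1)   # symmetric Dirichlet concentration of 0.1 per topic

# theta[i] is the topic distribution of individual i; each row sums to 1
theta = rng.dirichlet(alpha, size=M)   # shape (M, K)

# Count, per individual, the topics carrying more than 5% of the mass
# (the 5% cut-off is a hypothetical threshold chosen for this sketch).
topics_per_person = (theta > 0.05).sum(axis=1)
print("average number of loaded topics:", topics_per_person.mean())
```

With a concentration well below 1, most of each row's mass tends to sit on a handful of topics, which is the sparsity the text asks for.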

For each of the $K$ topics, there is a distribution over the $V$ diseases, $\phi_k$. This parameter is also simulated from a sparse Dirichlet, this time with hyperparameter $\beta$, for example also 0.1 for each of the $V$ diseases.

Here we allow each topic to have a 'topic-specific' disease enrichment: for a select number $d$ of randomly sampled diseases (here $d = 10$), the weight is $(V-d)/V$, while each of the remaining diseases has weight $1/V$.
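The text does not spell out how the enrichment weights combine with the Dirichlet draw, so the following is only one possible reading, sketched under stated assumptions: each topic's Dirichlet draw is multiplied elementwise by the enrichment weights and then renormalized. The sizes `K` and `V` and the seed are illustrative; `d = 10` and the weights $(V-d)/V$ and $1/V$ come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

K, V, d = 5, 1000, 10     # K and V are illustrative; d = 10 as in the text
beta = np.full(V, 0.1)    # symmetric Dirichlet concentration over diseases

phi = np.empty((K, V))                 # phi[k] = disease distribution of topic k
enriched = np.empty((K, d), dtype=int) # indices of the enriched diseases per topic

for k in range(K):
    # Pick d distinct diseases to enrich for this topic.
    enriched[k] = rng.choice(V, size=d, replace=False)

    # Enrichment weights: (V - d) / V for the selected diseases, 1 / V otherwise.
    weights = np.full(V, 1.0 / V)
    weights[enriched[k]] = (V - d) / V

    # ASSUMPTION: combine the sparse Dirichlet draw with the weights by
    # elementwise multiplication, then renormalize to a probability vector.
    base = rng.dirichlet(beta)
    unnormalized = base * weights
    phi[k] = unnormalized / unnormalized.sum()
```

Other readings are possible, for example using the weights to scale the Dirichlet hyperparameter instead of the draw itself.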

2.1 Generative model

Each diagnosis arises from a topic (topics can overlap, so one disease can feature in several topics). The topic is recorded by a latent indicator $z_{ij}$, for $i = 1, \dots, M$ and $j = 1, \dots, N_i$, which takes an integer value in $1, \dots, K$.

The index $j = 1, \dots, N_i$ reflects the fact that individual $i$ has $N_i$ diagnoses, so that $Z$, the collection of all $z_{ij}$, is a single long vector of length $N$.

Given the topic $k \in \{1, \dots, K\}$ assigned to a diagnosis, the disease for that diagnosis is drawn according to the probabilities $\phi_{k,w}$, where $w = 1, \dots, V$ indexes the diseases.
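The generative steps above can be sketched end to end as follows. The sizes and the seed are illustrative assumptions, the topic-specific enrichment is omitted for brevity, and the code uses 0-based indices (topics `0..K-1`, diseases `0..V-1`) where the text uses 1-based ones.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

M, K, V = 100, 5, 200   # illustrative sizes, not from the text

N_i = rng.poisson(10, size=M)                     # diagnoses per individual
theta = rng.dirichlet(np.full(K, 0.1), size=M)    # (M, K) topic distributions
phi = rng.dirichlet(np.full(V, 0.1), size=K)      # (K, V) disease distributions

z_parts, w_parts = [], []
for i in range(M):
    # Topic indicator for each of the N_i diagnoses of individual i
    z_i = rng.choice(K, size=N_i[i], p=theta[i])
    # Disease for each diagnosis, drawn from the assigned topic's distribution
    w_i = np.array([rng.choice(V, p=phi[k]) for k in z_i], dtype=int)
    z_parts.append(z_i)
    w_parts.append(w_i)

Z = np.concatenate(z_parts)   # topic of every diagnosis, length N
W = np.concatenate(w_parts)   # disease of every diagnosis, length N
```

Here `Z` and `W` play the role of the length-$N$ vectors of topic indicators and observed diseases, stacked individual by individual.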

2.2 Table of Variables

| Variable | Type | Meaning |
|---|---|---|
| $K$ | integer | number of topics (e.g. 50) |
| $V$ | integer | number of diseases in the vocabulary (e.g. 50,000 or 1,000,000) |
| $M$ | integer | number of persons |
| $N_i$, $i = 1, \dots, M$ | integer | number of diseases in person $i$ |
| $N$ | integer | total number of diseases in all persons; sum of all $N_i$ values, i.e. $N = \sum_{i=1}^{M} N_i$ |
| $\alpha_k$, $k = 1, \dots, K$ | positive real | prior weight of topic $k$ in a person; usually the same for all topics; normally a number less than 1, e.g. 0.1, to prefer sparse topic distributions, i.e. few topics per person |
| $\alpha$ | $K$-dimensional vector of positive reals | collection of all $\alpha_k$ values, viewed as a single vector |
| $\beta_w$, $w = 1, \dots, V$ | positive real | prior weight of disease $w$ in a topic; usually the same for all diseases; normally a number much less than 1, e.g. 0.001, to strongly prefer sparse disease distributions, i.e. few diseases per topic |
| $\beta$ | $V$-dimensional vector of positive reals | collection of all $\beta_w$ values, viewed as a single vector |
| $\phi_{k,w}$, $k = 1, \dots, K$, $w = 1, \dots, V$ | probability (real number between 0 and 1) | probability of disease $w$ occurring in topic $k$ |
| $\phi_k$, $k = 1, \dots, K$ | $V$-dimensional vector of probabilities, which must sum to 1 | distribution of diseases in topic $k$ |
| $\theta_{i,k}$, $i = 1, \dots, M$, $k = 1, \dots, K$ | probability (real number between 0 and 1) | probability of topic $k$ occurring in person $i$ |
| $\theta_i$, $i = 1, \dots, M$ | $K$-dimensional vector of probabilities, which must sum to 1 | distribution of topics in person $i$ |
| $z_{ij}$, $i = 1, \dots, M$, $j = 1, \dots, N_i$ | integer between 1 and $K$ | identity of the topic of the $j$-th disease in person $i$ |
| $Z$ | $N$-dimensional vector of integers between 1 and $K$ | identity of the topic of all diseases in all persons |
| $w_{ij}$, $i = 1, \dots, M$, $j = 1, \dots, N_i$ | integer between 1 and $V$ | identity of the $j$-th disease in person $i$ |
| $W$ | $N$-dimensional vector of integers between 1 and $V$ | identity of all diseases in all persons |