Chapter 2 Overview of Topic Modeling
We first lay out a basic topic model without any accounting for time.
In this model, we have a number of people, M.
Each individual $i$ has a number of diagnoses, $N_i$, for $i = 1, \dots, M$. We simulate $N_i$ as a Poisson random variable with a mean of 10 diagnoses per patient.
The total number of diagnoses over all individuals is $N = \sum_{i=1}^{M} N_i$.
There are $V$ distinct disease codes in total.
For each of the $M$ individuals, there is a distribution over the $K$ topics, $\theta_i$.
This topic distribution is simulated sparsely, so that each individual loads on only a small number of topics, by drawing $\theta_i$ from a Dirichlet distribution with $\alpha = (0.1, \dots, 0.1)$, a vector of length $K$.
For each of the $K$ topics, there is a distribution over the $V$ diseases, $\phi_k$. This parameter is also simulated from a sparse Dirichlet distribution, this time with hyperparameter $\beta$, for example again 0.1 for each of the $V$ diseases.
We additionally allow each topic a ‘topic-specific’ disease enrichment: for a select number $d$ of randomly sampled diseases (here $d = 10$), the weight is $(V-d)/V$, while each of the remaining diseases has weight $1/V$.
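The simulation set-up described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for this work: the function names, the default sizes (`M=100`, `K=5`, `V=50`) and the seed are hypothetical, and the enrichment weights $(V-d)/V$ and $1/V$ are read here as the Dirichlet concentration vector for each topic, which is only one possible interpretation of the text.

```python
import math
import random


def sample_poisson(rng, mean):
    """Draw one Poisson variate by Knuth's multiplication method (fine for small means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_dirichlet(rng, concentration):
    """Draw from a Dirichlet by normalizing independent Gamma(a, 1) variates."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [g / total for g in draws]


def simulate_parameters(M=100, K=5, V=50, d=10, alpha=0.1, mean_diagnoses=10, seed=0):
    """Simulate N_i, theta_i and phi_k (all default values are illustrative assumptions)."""
    rng = random.Random(seed)
    # Number of diagnoses per person: Poisson with mean 10.
    n_diagnoses = [sample_poisson(rng, mean_diagnoses) for _ in range(M)]
    # Sparse topic distribution per person: Dirichlet(0.1, ..., 0.1).
    theta = [sample_dirichlet(rng, [alpha] * K) for _ in range(M)]
    # Disease distribution per topic, with d randomly chosen enriched diseases.
    # ASSUMPTION: the enrichment weights are used as the Dirichlet concentration.
    phi = []
    for _ in range(K):
        enriched = set(rng.sample(range(V), d))
        beta_k = [(V - d) / V if w in enriched else 1.0 / V for w in range(V)]
        phi.append(sample_dirichlet(rng, beta_k))
    return n_diagnoses, theta, phi


n_diagnoses, theta, phi = simulate_parameters()
```

Each row of `theta` and `phi` sums to 1, and with a concentration of 0.1 most of the mass in a row of `theta` typically falls on one or two topics.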
2.1 Generative model
Each diagnosis arises from a topic (topics may overlap, so one disease can feature in several topics). The topic is recorded by a latent indicator $z_{i,j}$, for $i = 1, \dots, M$ and $j = 1, \dots, N_i$, which takes an integer value in $1, \dots, K$.
The index $j$ runs over the $N_i$ diagnoses of individual $i$, so that stacking all indicators gives a single vector $Z$ of length $N$.
Given the topic $k$ assigned to a diagnosis, the disease is drawn from $\phi_k$, so that disease $w$ occurs with probability $\phi_{k,w}$.
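The two sampling steps above (first a topic for each diagnosis, then a disease from that topic) can be sketched as follows. This is a minimal illustration: the function name `generate_diagnoses` and the toy values of `theta`, `phi` and `n_diagnoses` are hypothetical, and topics and diseases are indexed from 0 here rather than from 1 as in the text.

```python
import random


def generate_diagnoses(theta, phi, n_diagnoses, seed=0):
    """Draw a topic z and then a disease w for every diagnosis of every person."""
    rng = random.Random(seed)
    K, V = len(phi), len(phi[0])
    Z, W, person = [], [], []
    for i, n_i in enumerate(n_diagnoses):
        for _ in range(n_i):
            # Topic indicator for this diagnosis, drawn from the person's theta_i.
            k = rng.choices(range(K), weights=theta[i])[0]
            # Disease drawn from the disease distribution phi_k of that topic.
            w = rng.choices(range(V), weights=phi[k])[0]
            Z.append(k)
            W.append(w)
            person.append(i)
    return Z, W, person


# Toy inputs (assumed for illustration): 2 persons, 2 topics, 4 diseases.
theta = [[0.9, 0.1], [0.0, 1.0]]
phi = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
n_diagnoses = [3, 2]

Z, W, person = generate_diagnoses(theta, phi, n_diagnoses)
```

`Z` and `W` are flat lists of length $N$, one entry per diagnosis, and `person` records which individual each entry belongs to.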
2.2 Table of Variables
Variable | Type | Meaning |
---|---|---|
K | integer | number of topics (e.g. 50) |
V | integer | number of diseases in the vocabulary (e.g. 50,000 or 1,000,000) |
M | integer | number of persons |
$N_i$, $i = 1, \dots, M$ | integer | number of diseases in person $i$ |
$N$ | integer | total number of diseases in all persons; sum of all $N_i$ values, i.e. $N = \sum_{i=1}^{M} N_i$ |
$\alpha_k$, $k = 1, \dots, K$ | positive real | prior weight of topic $k$ in a person; usually the same for all topics; normally a number less than 1, e.g. 0.1, to prefer sparse topic distributions, i.e. few topics per person |
$\alpha$ | $K$-dimensional vector of positive reals | collection of all $\alpha_k$ values, viewed as a single vector |
$\beta_w$, $w = 1, \dots, V$ | positive real | prior weight of disease $w$ in a topic; usually the same for all diseases; normally a number much less than 1, e.g. 0.001, to strongly prefer sparse disease distributions, i.e. few diseases per topic |
$\beta$ | $V$-dimensional vector of positive reals | collection of all $\beta_w$ values, viewed as a single vector |
$\phi_{k,w}$, $k = 1, \dots, K$, $w = 1, \dots, V$ | probability (real number between 0 and 1) | probability of disease $w$ occurring in topic $k$ |
$\Phi_k$, $k = 1, \dots, K$ | $V$-dimensional vector of probabilities, which must sum to 1 | distribution of diseases in topic $k$ |
$\theta_{i,k}$, $i = 1, \dots, M$, $k = 1, \dots, K$ | probability (real number between 0 and 1) | probability of topic $k$ occurring in person $i$ |
$\Theta_i$, $i = 1, \dots, M$ | $K$-dimensional vector of probabilities, which must sum to 1 | distribution of topics in person $i$ |
$z_{i,j}$, $i = 1, \dots, M$, $j = 1, \dots, N_i$ | integer between 1 and $K$ | identity of the topic of the $j$-th disease in person $i$ |
$Z$ | $N$-dimensional vector of integers between 1 and $K$ | identity of topic of all diseases in all persons |
$w_{i,j}$, $i = 1, \dots, M$, $j = 1, \dots, N_i$ | integer between 1 and $V$ | identity of the $j$-th disease in person $i$ |
$W$ | $N$-dimensional vector of integers between 1 and $V$ | identity of all diseases in all persons |
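
Using the vector notation of the table, with $i$ indexing persons and $j$ indexing the diagnoses within a person, the model described in this chapter can be written compactly as below. This is a summary sketch of the text above; the Poisson line reflects the simulation choice for $N_i$ rather than a part of the topic model itself.

```latex
\begin{aligned}
N_i &\sim \mathrm{Poisson}(10), & i &= 1, \dots, M \\
\Theta_i &\sim \mathrm{Dirichlet}(\alpha), & i &= 1, \dots, M \\
\Phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1, \dots, K \\
z_{i,j} \mid \Theta_i &\sim \mathrm{Categorical}(\Theta_i), & j &= 1, \dots, N_i \\
w_{i,j} \mid z_{i,j} = k &\sim \mathrm{Categorical}(\Phi_k), & j &= 1, \dots, N_i
\end{aligned}
```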