Chapter 2 Overview of Topic Modeling
We first lay out a basic topic model without any accounting for time.
In this model, we have a number of people, M.
Each individual \(i\) has a number of diagnoses, \(N_i\), for \(i = 1, \dots, M\). We will simulate \(N_i\) as a Poisson random variable with a mean of 10 diagnoses per patient.
The total number of diagnoses over all individuals is \(N=\sum _{i=1}^{M}N_{i}\).
There are \(V\) disease codes in total.
For each of the M individuals, there exists a distribution over K topics, \(\theta_{i}\).
The topic distribution for a given individual, \(\theta_{i}\), is simulated sparsely, so that each individual loads on a small number of topics, by drawing from a Dirichlet distribution with \(\alpha = (0.1, \dots, 0.1)\), a vector of length K.
For each of the K topics, there is a distribution \(\phi_k\) over the V diseases. This parameter is also simulated from a sparse Dirichlet, this time with hyperparameter \(\beta\), a vector of length V whose entries are, for example, also 0.1.
Here we allow each topic to have a special ‘topic specific’ disease enrichment: for a select number d of randomly sampled diseases (here d=10) the value is (V-d)/V, while each of the remaining V-d diseases gets 1/V.
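The setup above can be sketched in a few lines of NumPy. This is a minimal illustration, not a definitive implementation: the sizes `M`, `K`, `V` and the seed are toy values chosen here, and the names `N_i`, `theta`, `phi`, `enriched` and `beta_k` are introduced only for this example. The enrichment step rests on one possible reading of the text, namely that the values (V-d)/V and 1/V are used as the Dirichlet hyperparameters for each topic; other readings are possible.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# Toy sizes (assumptions for illustration); d = enriched diseases per topic.
M, K, V, d = 200, 5, 100, 10

# Number of diagnoses per person: Poisson with mean 10.
N_i = rng.poisson(lam=10, size=M)
N = int(N_i.sum())  # total number of diagnoses

# Sparse per-person topic distributions: theta[i] ~ Dirichlet(0.1, ..., 0.1).
alpha = np.full(K, 0.1)
theta = rng.dirichlet(alpha, size=M)  # shape (M, K), each row sums to 1

# Per-topic disease distributions with topic-specific enrichment.
# ASSUMPTION: (V - d) / V and 1 / V are treated as Dirichlet hyperparameters,
# so each topic concentrates its mass on its own d randomly chosen diseases.
phi = np.empty((K, V))
enriched = np.empty((K, d), dtype=int)
for k in range(K):
    enriched[k] = rng.choice(V, size=d, replace=False)
    beta_k = np.full(V, 1.0 / V)
    beta_k[enriched[k]] = (V - d) / V
    phi[k] = rng.dirichlet(beta_k)  # shape (V,), sums to 1
```

Because each topic draws its enriched diseases independently, the same disease can be enriched in more than one topic, which matches the overlap allowed in the model.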
2.1 Generative model
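Each diagnosis arises from a topic; topics can overlap, so one disease can feature in several topics. The topic of diagnosis \(j\) in person \(i\) is recorded by a latent indicator \(z_{ij}\), for \(i = 1, \dots, M\) and \(j = 1, \dots, N_i\), which takes an integer value in \(1, \dots, K\) and is drawn from that individual's topic distribution \(\theta_i\).
The index \(j = 1, \dots, N_i\) reflects the fact that individual \(i\) has \(N_i\) diagnoses, so that \(Z\), the collection of all \(z_{ij}\), is a vector of length \(N\).
Given the topic \(k = z_{ij}\) assigned to a diagnosis, the diagnosed disease is drawn from that topic's distribution: disease \(w\) is chosen with probability \(\phi_{k,w}\).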
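The two sampling steps of the generative model can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `simulate_diagnoses`, the toy sizes, and the symmetric hyperparameter 0.1 for both Dirichlet draws are choices made for this example only, and topics and diseases are indexed from 0 rather than 1, as is usual in Python.

```python
import numpy as np

def simulate_diagnoses(theta, phi, n_per_person, rng):
    """Draw a topic z and a disease w for every diagnosis of every person.

    theta: (M, K) per-person topic distributions
    phi:   (K, V) per-topic disease distributions
    n_per_person: length-M array of diagnosis counts N_i
    Returns three length-N arrays: person index, topic Z, disease W.
    """
    K, V = phi.shape
    person, Z, W = [], [], []
    for i, n_i in enumerate(n_per_person):
        # Step 1: topic of each diagnosis, drawn from the person's theta_i.
        z_i = rng.choice(K, size=n_i, p=theta[i])
        # Step 2: disease of each diagnosis, drawn from phi for its topic.
        w_i = np.array([rng.choice(V, p=phi[k]) for k in z_i], dtype=int)
        person.append(np.full(n_i, i))
        Z.append(z_i)
        W.append(w_i)
    return np.concatenate(person), np.concatenate(Z), np.concatenate(W)

# Toy usage (sizes and hyperparameters are assumptions for illustration).
rng = np.random.default_rng(1)
M, K, V = 50, 3, 20
theta = rng.dirichlet(np.full(K, 0.1), size=M)
phi = rng.dirichlet(np.full(V, 0.1), size=K)
N_i = rng.poisson(10, size=M)
person, Z, W = simulate_diagnoses(theta, phi, N_i, rng)
```

Only `W` (together with `person`) would be observed in real data; `Z`, `theta` and `phi` are the latent quantities a topic model tries to recover.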
2.2 Table of Variables
Variable | Type | Meaning |
---|---|---|
K | integer | number of topics (e.g. 50) |
V | integer | number of diseases in the vocabulary (e.g. 50,000 or 1,000,000) |
M | integer | number of persons |
\(N_{i=1 \dots M}\) | integer | number of diagnoses in person \(i\) |
N | integer | total number of diagnoses in all persons; sum of all \(N_i\) values, i.e. \(N = \sum_{i=1}^{M} N_i\) |
\(\alpha_{k=1 \dots K}\) | positive real | prior weight of topic \(k\) in a person; usually the same for all topics; normally a number less than 1, e.g. 0.1, to prefer sparse topic distributions, i.e. few topics per person |
\(\alpha\) | K-dimensional vector of positive reals | collection of all \(\alpha_k\) values, viewed as a single vector |
\(\beta_{w=1 \dots V}\) | positive real | prior weight of disease \(w\) in a topic; usually the same for all diseases; normally a number much less than 1, e.g. 0.001, to strongly prefer sparse disease distributions, i.e. few diseases per topic |
\(\beta\) | V-dimensional vector of positive reals | collection of all \(\beta_w\) values, viewed as a single vector |
\(\phi_{k=1 \dots K,\, w=1 \dots V}\) | probability (real number between 0 and 1) | probability of disease \(w\) occurring in topic \(k\) |
\(\Phi_{k=1 \dots K}\) | V-dimensional vector of probabilities, which must sum to 1 | distribution of diseases in topic \(k\) (written \(\phi_k\) in the text) |
\(\theta_{i=1 \dots M,\, k=1 \dots K}\) | probability (real number between 0 and 1) | probability of topic \(k\) occurring in person \(i\) |
\(\Theta_{i=1 \dots M}\) | K-dimensional vector of probabilities, which must sum to 1 | distribution of topics in person \(i\) (written \(\theta_i\) in the text) |
\(z_{i=1 \dots M,\, j=1 \dots N_i}\) | integer between 1 and K | identity of the topic of diagnosis \(j\) in person \(i\) |
Z | N-dimensional vector of integers between 1 and K | identity of the topic of all diagnoses in all persons |
\(w_{i=1 \dots M,\, j=1 \dots N_i}\) | integer between 1 and V | identity of the disease of diagnosis \(j\) in person \(i\) |
W | N-dimensional vector of integers between 1 and V | identity of the disease of all diagnoses in all persons |