Chapter 4 Methods: Dynamic

Remember our first objective was to model the time dependent process akin to Blei et al 2006:

In the context of dynamic topic models, we model the evolution of topic proportions or word distributions within topics over time. For a \(V\)-dimensional multinomial mean parameter \(\pi\), the \(i^{th}\) component of the natural parameter vector \(\eta_{t,k}\) for topic \(k\) at time slice \(t\) is given by \(\log(\pi_{i} / \pi_{V})\). However, unlike the Dirichlet distribution, we can employ a Gaussian Process (GP) to model this evolution in a non-parametric Bayesian framework.

A Gaussian Process defines a distribution over functions and can be used as a prior for the function space of the topic trajectories. This prior is updated with the observed data to provide a posterior distribution over the topic trajectories, allowing for smooth and flexible modeling over time. The GP is fully specified by its mean function \(\mu(t)\) and covariance function \(k(t, t')\).

In our adapted model, the evolution of each topic’s natural parameters \(\eta_{t,k}\) can be defined using a Gaussian Process:

\[ \eta_{t,k} \sim \mathcal{GP}(\mu(t), k(t, t')) \]

where \(\mu(t)\) is the mean function, and \(k(t, t')\) is the covariance function, which encodes our assumptions about the smoothness and correlation structure over time. This can capture the dynamics with more flexibility than a fixed autoregressive coefficient \(\rho\).

The mean function \(\mu(t)\) can be thought of as the ‘trend’ of the topic proportions over time, while the covariance function \(k(t, t')\) describes how these proportions change together over different time points. For example, a squared exponential covariance function will imply that the topic proportions change in a smooth manner over time.

The time-dependent natural parameters \(\eta_{t,k}\) derived from the Gaussian Process can be related back to the probability scale using the softmax function:

\[ \pi(\eta_{k,t,w}) = \frac{\exp(\eta_{k,t,w})}{\sum_{w}\exp(\eta_{k,t,w})} \]

This retains the probabilistic interpretation of topic proportions at each time step and allows for the incorporation of time-based dynamics in a principled and flexible way.

4.0.1 Time warping

Our work is complicated by the fact that \(\eta\) must not include individual level variables, so we modify the time scale and warp time:

Given a sequence of topic-word distributions over time \(\eta_{k,v,t}\) for topic \(k\), word \(v\), and time point \(t\), we introduce a time-warping function \(W(t, \rho_d)\) for individual \(d\) with a warping parameter \(\rho_d\). The warping function transforms the original time index \(t\) to a new time index \(t'\) that corresponds to the “warped” time for individual \(d\).

The warping function can be defined as:

\[ t' = W(t, \rho_d) = \left\lfloor t \cdot \rho_d \right\rfloor \]

where \(\lfloor \cdot \rfloor\) denotes the floor function which maps a real number to the largest previous integer, and \(\rho_d\) is the individual-specific warping parameter that modulates the speed at which the individual progresses through the topic content.

The “warped” topic-word distributions for individual \(d\) at time \(t'\) are then given by:

\[ \eta'_{d,k,v,t'} = \eta_{k,v,W(t, \rho_d)} \]

After applying the warping function, the resulting sequence may have gaps due to compression (when \(\rho_d < 1\)) or may need interpolation due to expansion (when \(\rho_d > 1\)). These gaps are filled using interpolation or extrapolation methods, and the resulting sequence is then normalized to ensure that it represents valid probability distributions.

The critical piece is that all individuals share the same topic-specific vocabulary, as captured in the natural parameterization of \(\eta\). Because \(\eta\) is simulated according to a dynamic topic model in which the distribtuion of a topic over diagnoses is centered at the distribtuion at the previous time step, the functions are natrually continuous.

However, we then impose a ‘warping’ to allow individuals to have disease early and late while retaining a loading on a specific topic.

4.1 Dynamic modeling: draw topics

For each of topic k, draw the natural parameters _{t,k} for the un-normlaized loadings from a MV normal with mean centrued around the natural parameters of the previous time splice.

library(MASS)
library(MCMCpack)
library(ggplot2)
#set.seed(123)

D <- 60 # number of patient

K <- 3 # Number of topics
V <- 10 # Vocabulary size
T <- 100 # Number of time slices

eta <- array(dim = c(K, V, T))
eta_tilde <- array(dim = c(D,K, V, T))
# Define a function to calculate softmax values and normalize
softmax_normalize <- function(x) {
  exp_x <- exp(x) # Subtract max to avoid overflow
  normalized_values <- exp_x / sum(exp_x)
  return(normalized_values)
}

We can run in two unique ways: 1. Using a Gaussian process in which we simulate the distribtuion of proportional representation at each time step as a MV Normal with a prespecificed correlation matrix between time steps

  1. Using a spline as in ATM.

Here we will use the Gaussian process.

cov_function <- function(t1, t2, lengthscale, variance) {
    exp(-0.5 * (t1 - t2)^2 / lengthscale^2) * variance
}

# Store disease-specific covariances
Sigma <- array(dim = c(K, V, T, T)) # K topics, V diseases, T x T covariance matrix each



# Bias factor (adjust as needed)


# Generate covariances (adjust as needed)
for (k in 1:K) {
  for (v in 1:V) {
    # Sample a single lengthscale and variance for each topic-disease pair
      cov_matrix_kv <- matrix(0, nrow = T, ncol = T)
      #lengthscale_kv <- runif(1,min=5,max=20) ## choose random lengthscale for each topic and disease
      #variance_kv <- runif(1,0.1,0.5)## choose random variance for each topic and disease
      variance_kv=2
      lengthscale_kv=10
    for (i in 1:T) {
for (j in 1:T) {
        cov_matrix_kv[i, j] <- cov_function(i, j, lengthscale_kv, variance_kv)
      }
    }

    Sigma[k, v, , ] <- cov_matrix_kv
  }
}

4.2 Sample from the Gaussian Process

The probability of a disease within a topic over time has some covariance structure

# Modify eta sampling to use the correct covariance

topic_word_mean=array(NA,dim = c(K,V))

for (k in 1:K) {
  for (v in 1:V) {
    centroid=rnorm(1,mean = 0,sd = 2)
    topic_word_mean[k,v]=centroid
    eta[k, v,] <- mvrnorm(1, mu = rep(centroid,T), Sigma = Sigma[k, v, ,])
  }
}



eta_softmax <- array(dim = c(K, V, T))
# Normalize using softmax
for (k in 1:K) {
for (t in 1:T) {
 eta_softmax[k, , t] <- softmax_normalize(eta[k, , t])
}
}

4.3 Introduce time warping

ONe of the unique features of our model is that individuals progress through topics at different rates according to the power of their genetic factor:

The non-trivial modeling piece here is that we wish to add an individual dimension to the matrix \(\beta\) that is usually part of the topic \(\times\) disease space, and so this will make \(\beta\) an array. We are careful to keep the topic *disease** array shared across individuals, but the correlation coefficient \(\rho\) applied will be different.

4.4 From the Universal to the specific

Recall that critical to topic modeling is that eta is the same across topics. So now, we need to add the time warping feature to the individual level.

Now we need to reverse transform so that the disease proportions sum to 1.

# Initialize a new array for storing normalized values
beta <- array(0, dim(eta_tilde))

## since warping t isn't guaranteed to choose for every time point, we sometimes need to fill in gaps, can choose either fucntion

interpolate_spline <- function(x) {
  # Identify time points with assigned values
  time_with_values <- !is.na(x)
  
  if (all(time_with_values)) {
    return(x)  # No interpolation needed if no NA values
  } else {
    # Apply spline interpolation
    interpolated_values <- spline(x = which(time_with_values), y = x[time_with_values], xout = 1:length(x), method = "fmm")$y
    
    # Replace NA values with interpolated values
    x[!time_with_values] <- interpolated_values[!time_with_values]
    
    return(x)
  }
}


fill_gaps <- function(values) {
  # Use na.approx from the zoo package for linear interpolation
  library(zoo)
  filled_values <- na.approx(values, na.rm = FALSE)
  
  # For leading NAs, use the first non-NA value
  first_non_na <- which(!is.na(filled_values))[1]
  filled_values[1:(first_non_na - 1)] <- filled_values[first_non_na]
  
  # For trailing NAs, use the last non-NA value
  last_non_na <- which(!is.na(filled_values))[length(which(!is.na(filled_values)))]
  filled_values[(last_non_na + 1):length(filled_values)] <- filled_values[last_non_na]
  
  return(filled_values[1:T])
}






# Iterate through each person, topic, and time point to fill gaps and normalize
for (person_idx in 1:D) {
  for (topic_idx in 1:K) {
        
    for (v in 1:V) {
      #eta_tilde[person_idx, topic_idx, v, ] <- fill_gaps(eta_tilde[person_idx, topic_idx, v, ])
      eta_tilde[person_idx, topic_idx, v, ] <- interpolate_spline(eta_tilde[person_idx, topic_idx, v, ])
    }}}




# Target diseases for each topic
topic_specific_diseases <- list()
diseases=1:V
for(k in 1:K){
  sampledx=sample(diseases,size = 2,replace = F)
  topic_specific_diseases[[k]]=sampledx
  diseases=setdiff(diseases,sampledx)
  print(length(diseases))
  
  }
## [1] 8
## [1] 6
## [1] 4
bias_value=max(eta)*2

for (person_idx in 1:D) {
  for (topic_idx in 1:K) {
    # Retrieve the diseases and their risk periods for this topic
    #specific_diseases_periods <- topic_disease_risk_periods[[topic_idx]]
    active_diseases <- topic_specific_diseases[[topic_idx]]
    for (time_idx in 1:T) {
      # Assuming time_idx corresponds to actual ages, adjust as needed for your model
     # actual_age <- time_idx  # Convert time index to actual age, adjust based on your time scale
      
      # Find diseases that are active at this time point
      #active_diseases <- specific_diseases_periods$disease[actual_age >= specific_diseases_periods$start_age & actual_age <= specific_diseases_periods$end_age]
      #active_diseases <- topic_specific_diseases[[k]]
 
      disease_values <- eta_tilde[person_idx, topic_idx, , time_idx]
      
      # Increase the values for active diseases at this time before normalization
      if (length(active_diseases) > 0) {
        disease_values[active_diseases] <- disease_values[active_diseases] + bias_value
      }
      
      # Apply softmax normalization
      normalized_disease <- softmax_normalize(disease_values)
      beta[person_idx, topic_idx, , time_idx] <- normalized_disease
    }
  }
}


which(order(apply(beta[,1,,],2,mean),decreasing = T)%in%(topic_specific_diseases[[1]])==TRUE)
## [1] 1 2
which(order(apply(beta[,2,,],2,mean),decreasing = T)%in%(topic_specific_diseases[[2]])==TRUE)
## [1] 1 2
which(order(apply(beta[,3,,],2,mean),decreasing = T)%in%(topic_specific_diseases[[3]])==TRUE)
## [1] 1 2
# 'beta' now contains the softmax-normalized disease values, and they should sum to 1 for each person, topic, and time point.