Chapter 7 Computing the posterior on topic proportions

We can estimate theta, the per-individual topic proportions, in a couple of ways. In the update below, the "prior" at each time point is simply the estimate from the previous time step.


update_theta <- function(theta_prior, new_diagnoses, normalized_probs) {
  K <- nrow(normalized_probs) # Number of topics

  # Any adjustment of the prior for gamma_d (genetic stickiness) would go here; for now the prior is used as-is

  # Calculate the likelihood of new diagnoses for each topic
  likelihood <- sapply(1:K, function(k) {
    prod((normalized_probs[k, new_diagnoses]))  # Include only diagnosed diseases
  })

  # The posterior is the product of the adjusted prior and the likelihood
  unnormalized_posterior <- theta_prior * likelihood

  # Normalize the posterior to ensure it sums to 1
  theta_posterior <- unnormalized_posterior / sum(unnormalized_posterior)

  return(theta_posterior)
}
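As a quick sanity check, the function can be run on a toy example (the numbers below are hypothetical; rows of `normalized_probs` are topics, columns are diseases):

# Toy example with 2 topics and 4 diseases (hypothetical values)
toy_probs <- rbind(c(0.4, 0.3, 0.2, 0.1),   # topic 1: disease distribution
                   c(0.1, 0.1, 0.3, 0.5))   # topic 2: disease distribution
toy_prior <- c(0.5, 0.5)                     # equal prior over the two topics
update_theta(toy_prior, new_diagnoses = c(3, 4), normalized_probs = toy_probs)
# Diagnoses 3 and 4 are more probable under topic 2, so its posterior weight increases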

Simulate diagnoses based on each individual's topic distribution:

diags <- vector("list", D)
topics <- vector("list", D)
disease_in_topic <- vector("list", D)

for (d in 1:D) {
  diags[[d]]            <- vector("list", T) # one slot per time point
  topics[[d]]           <- vector("list", T)
  disease_in_topic[[d]] <- vector("list", T)
  
  for (t in 2:T) {
    lambda_dt = (t * 5) / T
    
    Nd = max(1, rpois(1, lambda = lambda_dt)) # Allow for more variance in number of diagnoses
    
    # Initialize an array with the correct length
    diagnoses_dt = rep(NA, Nd)
    topics_dt = rep(NA, Nd)
    truetest_dt = rep(NA, Nd)
    
    for (n in 1:Nd) {
      z = sample(K, size = 1, prob = Theta[d, , t]) # Sample topic
      topics_dt[n] <- z
      ## pull out disease vector
      beta_tk = beta[d, z, , t]
      
      sampled_disease = sample(V, size = 1, prob = beta_tk)
      truetest_dt[n] = sampled_disease %in% topic_specific_diseases[[z]]
      diagnoses_dt[n] <- sampled_disease
      # Store in the temporary array
      
    }
    
    # diags, topics, and disease_in_topic are nested lists (individual -> time point),
    # so each slot can hold a different number of diagnoses
    if (length(diagnoses_dt) > 0) {
      diags[[d]][[t]] <-  diagnoses_dt
      topics[[d]][[t]] <-  topics_dt
      disease_in_topic[[d]][[t]] <-  truetest_dt
    }
  }
}
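A couple of quick checks on the simulated data (a sketch; `disease_in_topic` flags whether each sampled disease belongs to its topic's signature set):

# Fraction of sampled diseases that fall in their topic's signature disease set
mean(unlist(disease_in_topic))

# How often each topic was sampled across all individuals and time points
table(unlist(topics))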

7.1 Estimating Theta

A simple sequential estimate of theta applies the update above time point by time point (shown commented out here; the GP-based version below is used instead):

estimated_Theta <- array(dim = c(D, K, T)) # Container for the estimated topic proportions
estimated_Theta[, , 1] <- Theta[, , 1]     # Anchor the first time point at the simulated Theta

# 
# for (d in 1:D) {
#   for (t in 2:T)
#   {
#     diagnoses_dt = diags[[d]][[t]]
#     
#     if (length(diagnoses_dt) > 0) {
#       estimated_Theta[d, , t] <-
#         update_theta(estimated_Theta[d, , t - 1], diagnoses_dt, beta[d, , , t])
#     }
#   }
# }
# 

7.2 Estimating theta using full history

7.2.0.1 Original Bayesian Update

Our initial formula updates the belief about each topic \(k\) given the newly observed diagnoses:

\[ p(\text{topic} = k | \text{diagnoses}) \propto p(\text{diagnoses} | \text{topic} = k) \times p(\text{topic} = k) \]

Here:

  • \(p(\text{diagnoses} | \text{topic} = k)\) is the likelihood of observing the diagnoses given the topic.
  • \(p(\text{topic} = k)\) is the prior probability of the topic, taken as the topic proportion from the previous time step.
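For intuition, consider a toy update with two topics (hypothetical numbers): a prior of \((0.6, 0.4)\) and diagnosis likelihoods of \((0.1, 0.3)\) give an unnormalized posterior of \((0.06, 0.12)\), which normalizes to \((1/3, 2/3)\), so the new data shift weight toward topic 2.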

7.3 Incorporating GP Predictions

A Gaussian process (GP) fit to each topic's trajectory modifies the prior, \(p(\text{topic} = k)\), so that it reflects the entire history up to the current point, capturing trends and dependencies over time. The GP provides a mean function (\(\mu_k(t)\)) and variance (\(\sigma^2_k(t)\)) for each topic \(k\) at each time \(t\), representing the expected proportion and the uncertainty around that expectation.

7.4 Log-Likelihood and GP Mean Integration

Blending the influence of new data (through its likelihood) with our prior beliefs (informed by the GP’s predictions) involves:

  1. Log-Likelihood Calculation: The log of the likelihood, \(\log p(\text{diagnoses} | \text{topic} = k)\), measures how well the observed diagnoses fit the expectation for topic \(k\).

  2. Adding GP Mean: Adding the GP mean (\(\mu_k(t)\)) to the log-likelihood updates our belief about the topic proportion, considering both the historical trend (GP mean) and the current data fit (likelihood).

7.5 Mathematical Framework Integration

The update process can be conceptually viewed as:

\[ \log p(\text{topic} = k | \text{diagnoses}) = \log p(\text{diagnoses} | \text{topic} = k) + \mu_k(t) + \text{constant} \]

This indicates that the updated log-posterior probability of a topic given the diagnoses is influenced by the log-likelihood of the diagnoses given the topic and the GP’s mean prediction.

7.6 Final Step: Softmax Normalization

After applying this update across all topics, softmax normalization converts these log-posterior scores back into a probability distribution:

\[ p(\text{topic} = k | \text{diagnoses}) = \text{softmax}(\log p(\text{diagnoses} | \text{topic} = k) + \mu_k(t)) \]

Softmax ensures that the final topic proportions sum to 1 across all topics \(k\), maintaining the probabilistic model’s integrity.
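The code below relies on a softmax() helper; in case it is not defined elsewhere in the project, a minimal numerically stable version is:

# Numerically stable softmax: subtract the max before exponentiating
softmax <- function(x) {
  ex <- exp(x - max(x))
  ex / sum(ex)
}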

require(kernlab)

# Fit a GP to one topic's historical proportions using the RBF kernel;
# sigma_value controls the kernel bandwidth and is passed explicitly by the caller
fit_gp <- function(t_train, y_train, sigma_value) {
  t_train_matrix <- as.matrix(t_train)
  
  # Specify the kernel parameters
  kpar_list <- list(sigma = sigma_value)
  
  # Fit the Gaussian Process model with the rbfdot kernel and specified sigma
  gp_model <- gausspr(x = t_train_matrix, y = y_train, kernel = "rbfdot", kpar = kpar_list)
  
  return(gp_model)
}



# Function to get the GP mean at new time points
get_gp_mean <- function(gp_model, t_new) {
  t_new_matrix <- as.matrix(t_new)
  gp_mean <- predict(gp_model, t_new_matrix)
  return(gp_mean)
}
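A minimal sketch of how the two helpers fit together, using a toy series rather than the model's actual data:

# Toy illustration: fit a GP to a short, slowly rising series and predict the next point
toy_t <- 1:5
toy_y <- c(0.20, 0.22, 0.25, 0.24, 0.27)
toy_gp <- fit_gp(toy_t, toy_y, sigma_value = 0.01)
get_gp_mean(toy_gp, 6)  # GP mean prediction at t = 6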

update_theta_with_gp_concept <-
  function(t,
           gp_models_d,
           new_diagnoses,
           beta_matrix,
           theta_prev,          # previous time point's estimate, used for optional shrinkage
           smoothing_param,
           shrinkage_strength) {
    K <-
      dim(beta_matrix)[1]  # Assuming beta_matrix is [K x V] for time t
    
    # Container for GP predictions
    gp_means <- numeric(K)
    
    
    # GP mean prediction for each topic at time t for this individual (length-K vector)
    for (k in 1:K) {
      gp_means[k] <- get_gp_mean(gp_models_d[[t]][[k]], t)
    }
    
    # Use the GP means as the (log-scale) prior and adjust them with the new diagnoses
    updated_proportions <- rep(0, K)

    # Likelihood of observing the new diagnoses under each topic, smoothed toward a
    # uniform distribution over the V diseases to avoid zero probabilities
    uniform_prob <-
      1 / ncol(beta_matrix)  # uniform probability over diseases (beta_matrix is K x V)
    
    likelihood <- sapply(1:K, function(k_index) {
      disease_probs <- beta_matrix[k_index, new_diagnoses, drop = FALSE]
      smoothed_probs <-
        (1 - smoothing_param) * disease_probs + smoothing_param * uniform_prob
      prod(smoothed_probs)
    })
    
    # Combine the GP prediction with the data: add the GP mean to the log-likelihood,
    # then map back to a probability scale with softmax below
    for (k in 1:K) {
      updated_proportions[k] <- log(likelihood[k]) + gp_means[k]
    }
    
    prelim_updated_proportions <- softmax(updated_proportions)
    
    # Optionally shrink toward the previous time point's estimate
    final_updated_proportions <-
      (1 - shrinkage_strength) * prelim_updated_proportions +
      shrinkage_strength * theta_prev
    theta_posterior <-
      final_updated_proportions / sum(final_updated_proportions)
    
    
    return(list(theta_posterior=theta_posterior,gp_means=gp_means))
  }

estimated_Theta <-
  array(dim = c(D, K, T)) # Container for the estimated topic proportions
estimated_Theta[, , 1] <- Theta[, , 1] # Anchor the first time point at the simulated Theta


### initialize params
sig=0.01
smoothing_param=0.5
shrinkage_strength=0

gp_mean_array <-
  array(dim = c(D, K, T)) 

gp_models = list()
for (d in 1:D) {
  gp_models_d <- list()  # Store GP models for each document and topic
  for (t in 2:T) {
    gp_models_d[[t]] = list()
    if (t <= 3) {
      # Too little history to fit a GP yet; fall back to the simple one-step update
      theta_prior <- estimated_Theta[d, , (t - 1)]

      # New diagnoses for individual d at time t
      new_diagnoses <- diags[[d]][[t]]

      # beta[d, , , t] plays the role of `normalized_probs`:
      # beta[d, k, v, t] is the probability of disease v under topic k

      # Update theta for time t based on the new diagnoses
      estimated_Theta[d, , t] <-
        update_theta(theta_prior = theta_prior, new_diagnoses, beta[d, , , t])

    } else {
      for (k in 1:K) {
        # Loop over topics to collect GP predictions
        
        # Historical time points and corresponding topic proportions up to time t-1
        t_train <- 1:(t - 1)  # Use only data up to t-1
        y_train <-estimated_Theta[d, k, 1:(t - 1)]  # Historical topic proportions up to t-1
        
        # Fit GP model based on historical data up to time t-1
        
        gp_model <- fit_gp(t_train, y_train,sigma_value = sig)
        
        gp_models_d[[t]][[k]] <- gp_model
        # Handle variances similarly if applicable
        
      }
      gp_models[[d]] = gp_models_d
      
      new_diagnoses <- diags[[d]][[t]]
      beta_matrix <-
        beta[d, , , t]  # Ensure this matches your data structure
      
      u = update_theta_with_gp_concept(
        t = t,
        gp_models_d = gp_models_d,
        new_diagnoses = new_diagnoses,
        beta_matrix = beta_matrix,
        theta_prev = estimated_Theta[d, , t - 1],
        smoothing_param = smoothing_param,
        shrinkage_strength = shrinkage_strength
      )
      
      estimated_Theta[d, , t]=u$theta_posterior
      gp_mean_array[d, , t]=u$gp_means
      
      
    }
  }
}
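As a rough check of how well the sequential estimates track the truth (a sketch; `Theta` here is the simulated ground-truth array):

# Mean absolute error between estimated and simulated topic proportions at each time point
mae_by_time <- sapply(1:T, function(t) mean(abs(estimated_Theta[, , t] - Theta[, , t])))
round(mae_by_time, 3)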

7.7 Posterior on topic loadings

beta_estimated <- array(runif(K * V * T), dim = c(K, V, T)) # K x V x T; overwritten below
word_topic_counts <- matrix(0, nrow = V, ncol = K) # disease-by-topic counts, accumulated over time
  
for(t in 1:T){
  
  for (d in 1:D) {
    diagnoses_dt <- diags[[d]][[t]]
    for (diagnosis in diagnoses_dt) {       
      for (k in 1:K) {
        # Weight the count by the topic proportion from Theta
        word_topic_counts[diagnosis, k] <- word_topic_counts[diagnosis, k] + Theta[d, k, t] 
      }
    }
  }

  # Laplace smoothing on a copy, so the running counts are not inflated at every time point
  smoothed_counts <- word_topic_counts + 1

  # Update beta: beta_estimated[k, v, t] = p(disease v | topic k);
  # normalize each topic's counts (columns of smoothed_counts) and transpose to K x V
  beta_estimated[, , t] <- t(smoothed_counts) / colSums(smoothed_counts)
}

7.8 What do the topics look like?
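Before plotting, a quick look at which diseases carry the most weight in each estimated topic (a sketch using the final time point):

# Top 5 diseases (by estimated loading) for each topic at the final time point;
# the result has one column per topic
top_diseases <- apply(beta_estimated[, , T], 1, function(p) order(p, decreasing = TRUE)[1:5])
top_diseases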

require(ggplot2)
require(reshape2)

# Correlation of disease loadings across time for topic 1
cor_matrix <- cor(t(beta_estimated[1, , ]))

# Melt the correlation matrix for plotting
cor_melted <- melt(cor_matrix)

# Plot using ggplot2
ggplot(cor_melted, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() + # heatmap
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0,
                       limit = c(min(cor_melted$value), max(cor_melted$value)),
                       name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), # rotate x-axis labels for readability
        axis.title = element_blank()) + # remove axis titles for a cleaner look
  labs(title = paste0("Estimated Correlation between Diseases: Topic ", 1), fill = "Correlation")
# Correlation of disease loadings across time for topic 2
cor_matrix <- cor(t(beta_estimated[2, , ]))

# Melt the correlation matrix for plotting
cor_melted <- melt(cor_matrix)

# Plot using ggplot2
ggplot(cor_melted, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() + # heatmap
  scale_fill_gradient2(limit = c(min(cor_melted$value), max(cor_melted$value)),
                       name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), # rotate x-axis labels for readability
        axis.title = element_blank()) + # remove axis titles for a cleaner look
  labs(title = paste0("Estimated Correlation between Diseases: Topic ", 2), fill = "Correlation")