Chapter 8 Estimating Theta

We can estimate theta in several ways. After simulating diagnoses, we first allocate storage and seed the first time slice with the simulated values:

estimated_Theta <- array(dim = c(D, K, T))  # D individuals, K topics, T time points
estimated_Theta[, , 1] <- Theta[, , 1]      # seed t = 1 with the simulated Theta

8.1 Estimating Theta Using the Full History

8.1.0.1 Original Bayesian Update

Our initial formula updates the belief about the probability of each topic \(k\) given new diagnosis data:

\[ p(\text{topic} = k | \text{diagnoses}) \propto p(\text{diagnoses} | \text{topic} = k) \times p(\text{topic} = k) \]

Here:

  • \(p(\text{diagnoses} | \text{topic} = k)\) is the likelihood of observing the diagnoses given the topic.
  • \(p(\text{topic} = k)\) is the prior probability of the topic, taken as the topic proportion from the previous time step.
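
As a concrete illustration, here is a minimal sketch of this update for one individual with K = 3 topics; the prior and per-topic likelihood values below are made up for the example:

prior <- c(0.5, 0.3, 0.2)          # p(topic = k) from the previous time step
likelihood <- c(0.02, 0.08, 0.01)  # p(diagnoses | topic = k), hypothetical values

unnormalized <- likelihood * prior
posterior <- unnormalized / sum(unnormalized)
posterior  # updated topic proportions; sums to 1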

8.2 Incorporating GP Predictions

GPs modify the prior, \(p(\text{topic} = k)\), to reflect the entire history up to the current point, capturing trends and dependencies over time. The GP provides a mean function (\(\mu_k(t)\)) and variance (\(\sigma^2_k(t)\)) for each topic \(k\) at each time \(t\), representing the expected proportion and the uncertainty around that expectation.
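
A minimal sketch of this idea with kernlab's gausspr (the fit_gp helper defined below wraps the same call); the toy trajectory y_hist is made up for illustration:

library(kernlab)

t_hist <- 1:9
y_hist <- 0.3 + 0.02 * t_hist + rnorm(9, sd = 0.01)  # hypothetical topic-k trajectory

gp <- gausspr(x = as.matrix(t_hist), y = y_hist,
              kernel = "rbfdot", kpar = list(sigma = 0.01))
mu_next <- predict(gp, as.matrix(10))  # GP mean for the next time point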

8.3 Log-Likelihood and GP Mean Integration

Blending the influence of new data (through its likelihood) with our prior beliefs (informed by the GP’s predictions) involves:

  1. Log-Likelihood Calculation: The log of the likelihood, \(\log p(\text{diagnoses} | \text{topic} = k)\), measures how well the observed diagnoses fit the expectation for topic \(k\).

  2. Adding the GP Mean: Because the GP mean (\(\mu_k(t)\)) lives on the probability scale, we add its logarithm, \(\log \mu_k(t)\), to the log-likelihood, combining the historical trend (GP mean) with the current data fit (likelihood).

8.4 Mathematical Framework Integration

The update process can be conceptually viewed as:

\[ \log p(\text{topic} = k | \text{diagnoses}) = \log p(\text{diagnoses} | \text{topic} = k) + \log \mu_k(t) + \text{constant} \]

This indicates that the updated log-posterior probability of a topic given the diagnoses combines the log-likelihood of the diagnoses under that topic with the log of the GP's mean prediction.

8.5 Final Step: Softmax Normalization

After applying this update across all topics, softmax normalization converts these log-posterior scores back into a probability distribution:

\[ p(\text{topic} = k | \text{diagnoses}) = \text{softmax}\left(\log p(\text{diagnoses} | \text{topic} = k) + \log \mu_k(t)\right) \]

Softmax ensures that the final topic proportions sum to 1 across all topics \(k\), maintaining the probabilistic model’s integrity.
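
A minimal sketch of this normalization in R, with hypothetical log-likelihoods and GP means (subtracting the maximum before exponentiating keeps the softmax numerically stable):

softmax <- function(x) {
  z <- exp(x - max(x))  # subtract the max for numerical stability
  z / sum(z)
}

log_lik <- c(-3.9, -2.5, -4.6)  # hypothetical log p(diagnoses | topic = k)
gp_mean <- c(0.45, 0.35, 0.20)  # hypothetical mu_k(t) on the probability scale

theta_t <- softmax(log_lik + log(gp_mean))
sum(theta_t)  # 1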

8.6 Initialization

We need to initialize theta with some values, and load kernlab for the GP fits:

library(kernlab)
individual_weight <- 0.8  # weight on the individual GP mean vs. the population mean

update_theta <- function(theta_prior, new_diagnoses, normalized_probs) {
  K <- nrow(normalized_probs)  # number of topics

  # Likelihood of the new diagnoses under each topic: product of the
  # per-disease probabilities over the diagnosed diseases only
  likelihood <- sapply(1:K, function(k) {
    prod(normalized_probs[k, new_diagnoses])
  })

  # Posterior is proportional to prior times likelihood; combine on the
  # log scale and exponentiate
  unnormalized_posterior <- exp(log(theta_prior) + log(likelihood))

  # Normalize so the posterior sums to 1
  theta_posterior <- unnormalized_posterior / sum(unnormalized_posterior)

  return(theta_posterior)
}


fit_gp <- function(t_train, y_train, sigma_value = 0.01) {
  t_train_matrix <- as.matrix(t_train)

  # RBF kernel hyperparameter for kernlab's gausspr
  kpar_list <- list(sigma = sigma_value)

  # Fit the Gaussian Process regression with the rbfdot kernel and fixed sigma
  gp_model <- gausspr(x = t_train_matrix, y = y_train,
                      kernel = "rbfdot", kpar = kpar_list)

  return(gp_model)
}

# Function to get the GP mean at new time points
get_gp_mean <- function(gp_model, t_new) {
  t_new_matrix <- as.matrix(t_new)
  gp_mean <- predict(gp_model, t_new_matrix)
  return(gp_mean)
}
estimated_Theta <- array(dim = c(D, K, T))  # re-initialize: agnostic start instead of the simulated Theta
estimated_Theta[, , 1] <- MCMCpack::rdirichlet(D, alpha = rep(1, K))  # symmetric (uniform) Dirichlet draw

### initialize parameters
sig <- 0.01             # RBF kernel bandwidth for the GP fits
smoothing_param <- 0.5  # mixing weight toward a uniform disease distribution
shrinkage_strength <- 0

## Inference incorporating the population trend

gp_mean_array <- array(dim = c(D, K, T))
ind_log_likes <- array(dim = c(D, K, T))
sigma_array <- array(dim = c(D, K, T))

# Warm start: at t = 2 and t = 3 there is too little history to fit a GP,
# so use the plain Bayesian update from the previous time step
for (d in 1:D) {
  for (t in 2:3) {
    theta_prior <- estimated_Theta[d, , (t - 1)]
    new_diagnoses <- diags[[d]][[t]]  # diags[[d]][[t]]: new diagnoses for individual d at time t

    # Update theta for time t from the new diagnoses
    estimated_Theta[d, , t] <- update_theta(theta_prior, new_diagnoses, beta[d, , , t])
  }
}
gp_models_d <- list()
gp_means_d <- estimated_Theta        # individual-level GP means
adjusted_gp_mean <- estimated_Theta  # weighted blend of individual and population means
gp_means_pop <- apply(estimated_Theta, c(2, 3), mean)  # population mean per topic and time

gp_models_pop <- list()

for (t in 4:T) {
  gp_models_pop[[t]] <- list()
  gp_models_d[[t]] <- list()
  
  ## first fit population mean trend with GP
  
  for (k in 1:K) {
    t_train <- 1:(t - 1)
    # Population mean topic proportion for topic k at each past time point
    y_train_pop_k <- apply(estimated_Theta[, k, 1:(t - 1)], 2, mean)
    gp_models_pop[[t]][[k]] <- fit_gp(t_train, y_train_pop_k, sigma_value = sig)
    gp_mean_pop <- get_gp_mean(gp_models_pop[[t]][[k]], t)
    if (is.na(gp_mean_pop)) {
      # Fall back to the previous time point's value if the prediction is NA
      gp_mean_pop <- gp_means_pop[k, t - 1]
    }
    gp_means_pop[k, t] <- as.numeric(gp_mean_pop)
  }

  ## then fit each individual's trend and shrink it toward the population
  for (d in 1:D) {
    gp_models_d[[t]][[d]] <- list()

    new_diagnoses <- diags[[d]][[t]]
    beta_matrix <- beta[d, , , t]  # K x V matrix: disease probabilities for individual d at time t
    
    for (k in 1:K) {
      # Historical time points and topic proportions up to time t - 1
      t_train <- 1:(t - 1)
      y_train <- estimated_Theta[d, k, 1:(t - 1)]

      # Fit a GP to this individual's history for topic k
      gp_model <- fit_gp(t_train, y_train, sigma_value = sig)
      gp_models_d[[t]][[d]][[k]] <- gp_model

      # Record the fitted RBF bandwidth via kernlab's accessors (the model is
      # fit with a fixed sigma, so this simply recovers that value)
      sigma_array[d, k, t] <- kpar(kernelf(gp_model))$sigma

      # Predict the individual's GP mean at time t
      individual_gp_mean <- get_gp_mean(gp_models_d[[t]][[d]][[k]], t)
      if (is.na(individual_gp_mean)) {
        # Fall back to the previous time point's value if the prediction is NA
        individual_gp_mean <- gp_means_d[d, k, t - 1]
      }
      gp_means_d[d, k, t] <- as.numeric(individual_gp_mean)

      # Blend the individual and population means with a weighted average
      # (ideally individual_weight would itself be estimated)
      adjusted_gp_mean[d, k, t] <-
        individual_weight * gp_means_d[d, k, t] + (1 - individual_weight) * gp_means_pop[k, t]
    }
    
    
    # Use the adjusted GP means as the prior and update them with the new data.
    # Likelihood of the observed diagnoses under each topic, smoothed toward a
    # uniform distribution over the V diseases to avoid zero probabilities
    uniform_prob <- 1 / ncol(beta_matrix)  # uniform over diseases (columns of beta_matrix)
    likelihood <- sapply(1:K, function(k_index) {
      disease_probs <- beta_matrix[k_index, new_diagnoses, drop = FALSE]
      smoothed_probs <- (1 - smoothing_param) * disease_probs + smoothing_param * uniform_prob
      prod(smoothed_probs)
    })
    
    
    
    log_likelihood <- log(likelihood)
    ind_log_likes[d, , t] <- log_likelihood

    # The adjusted GP mean is on the probability scale, so take its log before
    # adding it to the log-likelihood; the small constant avoids log(0)
    log_adjusted_gp_mean <- log(adjusted_gp_mean[d, , t] + 1e-10)

    # Combine prior and likelihood in log space
    updated_log_proportions <- log_likelihood + log_adjusted_gp_mean

    # Exponentiate and normalize (softmax) so the proportions sum to 1
    updated_proportions <- exp(updated_log_proportions)
    updated_proportions <- updated_proportions / sum(updated_proportions)
    estimated_Theta[d, , t] <- updated_proportions
  }
}

In the update step above, there is no need to incorporate previous diagnoses explicitly when estimating theta, provided we assume that beta (the probability of each diagnosis given a topic at time t) is known and that a Gaussian Process (GP) generated each topic's disease probabilities over time. The beta array already encodes the historical information about disease probabilities under each topic.

When we calculate the likelihood of new diagnoses given the current beta, we are inherently using historical information, because the beta slice at time t is the output of a GP that incorporates temporal dynamics. The GP pools information from all previous time points to produce the mean (and possibly the variance) at time t. This is the key point: historical data influence the GP's output at time t, so by using the GP's current output we implicitly include history.
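
A small illustration of this point: the likelihood computation consults only the current slice of beta, never the earlier ones. A minimal sketch, reusing d, t, K, beta, and diags from the code above:

# Likelihood of this visit's diagnoses under each topic, using only beta at time t
new_diagnoses <- diags[[d]][[t]]
lik_t <- sapply(1:K, function(k) prod(beta[d, k, new_diagnoses, t]))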

8.7 Posterior on Topic Loadings

beta_estimated <- array(0, dim = c(K, V, T))
word_topic_counts <- matrix(0, nrow = V, ncol = K)

for (t in 1:T) {

  for (d in 1:D) {
    diagnoses_dt <- diags[[d]][[t]]
    for (diagnosis in diagnoses_dt) {
      for (k in 1:K) {
        # Weight the count by the (simulated) topic proportion from Theta
        word_topic_counts[diagnosis, k] <- word_topic_counts[diagnosis, k] + Theta[d, k, t]
      }
    }
  }

  # Laplace smoothing, applied at normalization time so the running
  # counts are not inflated by 1 on every pass through the loop
  smoothed_counts <- word_topic_counts + 1

  # Normalize each topic's disease distribution to sum to 1 over the V
  # diseases, then transpose into the K x V slice of beta_estimated
  beta_estimated[, , t] <- t(sweep(smoothed_counts, 2, colSums(smoothed_counts), "/"))
}
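
As a quick sanity check, each row of a beta_estimated slice (one topic's distribution over diseases) should now sum to 1:

range(rowSums(beta_estimated[, , T]))  # should all be (near) 1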
# What do the estimated topics look like?
  
library(reshape2)  # for melt()

# Correlation of topic 1's disease probabilities across time
# (cor() is used here, so the heatmap shows correlations)
cor_matrix <- cor(t(beta_estimated[1, , ]))

# Melt the correlation matrix for plotting
cor_melted <- melt(cor_matrix)

# Plot using ggplot2
ggplot(cor_melted, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +  # heatmap tiles
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0,
                       limit = c(min(cor_melted$value), max(cor_melted$value)),
                       name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),  # rotate x-axis labels for readability
        axis.title = element_blank()) +
  labs(title = paste0("Estimated Correlation between Diseases: Topic ", 1))
# Repeat for topic 2
cor_matrix <- cor(t(beta_estimated[2, , ]))
cor_melted <- melt(cor_matrix)

ggplot(cor_melted, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0,
                       limit = c(min(cor_melted$value), max(cor_melted$value)),
                       name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title = element_blank()) +
  labs(title = paste0("Estimated Correlation between Diseases: Topic ", 2))