Chapter 7 Now computed posterior on topic porportions:
We can estimate theta in a couple of ways
### write previous instead of prior
update_theta <- function(theta_prior, new_diagnoses, normalized_probs) {
K <- nrow(normalized_probs) # Number of topics
# Adjust the prior based on gamma_d (genetic stickiness) – This likely remains unchanged
# Calculate the likelihood of new diagnoses for each topic
likelihood <- sapply(1:K, function(k) {
prod((normalized_probs[k, new_diagnoses])) # Include only diagnosed diseases
})
# The posterior is the product of the adjusted prior and the likelihood
unnormalized_posterior <- theta_prior * likelihood
# Normalize the posterior to ensure it sums to 1
theta_posterior <- unnormalized_posterior / sum(unnormalized_posterior)
return(theta_posterior)
}
Simulate diagnoses based on individuals topic distribution:
diags <- vector("list", D)
topics <- vector("list", D)
disease_in_topic <- vector("list", D)
for (d in 1:D) {
diags[[d]] <-
vector("list", T) # Each individual has a list for each time point
topics[[d]] <-
vector("list", T) # Each individual has a list for each time point
for (t in 2:T) {
lambda_dt = (t * 5) / T
Nd = max(1, rpois(1, lambda = lambda_dt)) # Allow for more variance in number of diagnoses
# Initialize an array with the correct length
diagnoses_dt = rep(NA, Nd)
topics_dt = rep(NA, Nd)
truetest_dt = rep(NA, Nd)
for (n in 1:Nd) {
z = sample(K, size = 1, prob = Theta[d, , t]) # Sample topic
topics_dt[n] <- z
## pull out disease vector
beta_tk = beta[d, z, , t]
sampled_disease = sample(V, size = 1, prob = beta_tk)
truetest_dt[n] = sampled_disease %in% topic_specific_diseases[[z]]
diagnoses_dt[n] <- sampled_disease
# Store in the temporary array
}
# Assuming diags is a three-dimensional array, you might need to adjust this
# part depending on how you want to handle varying sizes of Nd. Here it's assumed
# diags[d,,t] can dynamically adjust its size to accommodate diagnoses_dt,
# but this could depend on the data structure you're working with.
if (length(diagnoses_dt) > 0) {
diags[[d]][[t]] <- diagnoses_dt
topics[[d]][[t]] <- topics_dt
disease_in_topic[[d]][[t]] <- truetest_dt
}
}}
7.1 Estimating Theta
Simple estimation of theta:
estimated_Theta <- array(dim = c(D, K, T)) # Initialize estimated_Theta with the simulated Theta
estimated_Theta[,,1]=Theta[,,1]
#
# for (d in 1:D) {
# for (t in 2:T)
# {
# diagnoses_dt = diags[[d]][[t]]
#
# if (length(diagnoses_dt) > 0) {
# estimated_Theta[d, , t] <-
# update_theta(estimated_Theta[d, , t - 1], diagnoses_dt, beta[d, , , t])
# }
# }
# }
#
7.2 Estimating theta using full histroy
7.2.0.1 Original Bayesian Update
Our initial formula updates the belief about the probability of each topic \(k\) given new diagnoses data:
\[ p(\text{topic} = k | \text{diagnoses}) \propto p(\text{diagnoses} | \text{topic} = k) \times p(\text{topic} = k) \]
Here:
- \(p(\text{diagnoses} | \text{topic} = k)\) is the likelihood of observing the diagnoses given the topic.
- \(p(\text{topic} = k)\) is the prior probability of the topic, taken as the topic proportion from the previous time step.
7.3 Incorporating GP Predictions
GPs modify the prior, \(p(\text{topic} = k)\), to reflect the entire history up to the current point, capturing trends and dependencies over time. The GP provides a mean function (\(\mu_k(t)\)) and variance (\(\sigma^2_k(t)\)) for each topic \(k\) at each time \(t\), representing the expected proportion and the uncertainty around that expectation.
7.4 Log-Likelihood and GP Mean Integration
Blending the influence of new data (through its likelihood) with our prior beliefs (informed by the GP’s predictions) involves:
Log-Likelihood Calculation: The log of the likelihood, \(\log p(\text{diagnoses} | \text{topic} = k)\), measures how well the observed diagnoses fit the expectation for topic \(k\).
Adding GP Mean: Adding the GP mean (\(\mu_k(t)\)) to the log-likelihood updates our belief about the topic proportion, considering both the historical trend (GP mean) and the current data fit (likelihood).
7.5 Mathematical Framework Integration
The update process can be conceptually viewed as:
\[ \log p(\text{topic} = k | \text{diagnoses}) = \log p(\text{diagnoses} | \text{topic} = k) + \mu_k(t) + \text{constant} \]
This indicates that the updated log-posterior probability of a topic given the diagnoses is influenced by the log-likelihood of the diagnoses given the topic and the GP’s mean prediction.
7.6 Final Step: Softmax Normalization
After applying this update across all topics, softmax normalization converts these log-posterior scores back into a probability distribution:
\[ p(\text{topic} = k | \text{diagnoses}) = \text{softmax}(\log p(\text{diagnoses} | \text{topic} = k) + \mu_k(t)) \]
Softmax ensures that the final topic proportions sum to 1 across all topics \(k\), maintaining the probabilistic model’s integrity.
require(kernlab)
# Fit the GP model using the custom kernel
# Example of adjusting the sigma parameter
fit_gp <- function(t_train, y_train, sigma_value = sigma) {
t_train_matrix <- as.matrix(t_train)
# Specify the kernel parameters
kpar_list <- list(sigma = sigma_value)
# Fit the Gaussian Process model with the rbfdot kernel and specified sigma
gp_model <- gausspr(x = t_train_matrix, y = y_train, kernel = "rbfdot", kpar = kpar_list)
return(gp_model)
}
# Function to get the GP mean at new time points
get_gp_mean <- function(gp_model, t_new) {
t_new_matrix <- as.matrix(t_new)
gp_mean <- predict(gp_model, t_new_matrix)
return(gp_mean)
}
update_theta_with_gp_concept <-
function(t,
gp_models_d,
new_diagnoses,
beta_matrix,smoothing_param,shrinkage_strength) {
K <-
dim(beta_matrix)[1] # Assuming beta_matrix is [K x V] for time t
# Container for GP predictions
gp_means <- numeric(K)
# Obtain GP mean and variance predictions for each topic at time t for a given individual d, will return a k vector
for (k in 1:K) {
gp_means[k] <- get_gp_mean(gp_models_d[[t]][[k]], t)
}
# Use GP means as the 'prior' and adjust them based on new diagnoses
# Placeholder for combining GP means with new data; specifics depend on your model
updated_proportions <- rep(0, K) # Placeholder initialization
# Compute the likelihood of observing new diagnoses for each topi
uniform_prob <-
1 / nrow(beta_matrix) # Assuming uniform probability over diseases
likelihood <- sapply(1:K, function(k_index) {
disease_probs <- beta_matrix[k_index, new_diagnoses, drop = FALSE]
smoothed_probs <-
(1 - smoothing_param) * disease_probs + smoothing_param * uniform_prob
prod(smoothed_probs)
})
#Example of combining GP predictions with likelihood
# Here, you might simply add the GP mean to the log likelihood, then softmax
for (k in 1:K) {
updated_proportions[k] <-
#likelihood[k]*gp_means[k]
log(likelihood[k])+gp_means[k]
}
prelim_updated_proportions <- softmax(updated_proportions)
final_updated_proportions <-
(1 - shrinkage_strength) * prelim_updated_proportions + shrinkage_strength * estimated_Theta[d, k, t -
1]
theta_posterior <-
final_updated_proportions / sum(final_updated_proportions)
return(list(theta_posterior=theta_posterior,gp_means=gp_means))
}
estimated_Theta <-
array(dim = c(D, K, T)) # Initialize estimated_Theta with the simulated Theta
estimated_Theta[, , 1] = Theta[, , 1]
### initialize params
sig=0.01
smoothing_param=0.5
shrinkage_strength=0
gp_mean_array <-
array(dim = c(D, K, T))
gp_models = list()
for (d in 1:D) {
gp_models_d <- list() # Store GP models for each document and topic
for (t in 2:T) {
gp_models_d[[t]] = list()
if (t <= 3) {
theta_prior <- estimated_Theta[d, , (t - 1)]
# Assuming `diags[[d]][[t]]` holds the new diagnoses for document d at time point 2
new_diagnoses <- diags[[d]][[t]]
# Assuming `normalized_probs` is available and represents the probability of diagnoses given topics
# This needs to be structured such that `normalized_probs[k, v]` gives the probability of diagnosis v under topic k
# Update theta for time point 2 based on new diagnoses
estimated_Theta[d, , t] <-
update_theta(theta_prior = theta_prior, new_diagnoses, beta[d, , , t])
} else {
for (k in 1:K) {
# Loop over topics to collect GP predictions
# Historical time points and corresponding topic proportions up to time t-1
t_train <- 1:(t - 1) # Use only data up to t-1
y_train <-estimated_Theta[d, k, 1:(t - 1)] # Historical topic proportions up to t-1
# Fit GP model based on historical data up to time t-1
gp_model <- fit_gp(t_train, y_train,sigma_value = sig)
gp_models_d[[t]][[k]] <- gp_model
# Handle variances similarly if applicable
}
gp_models[[d]] = gp_models_d
new_diagnoses <- diags[[d]][[t]]
beta_matrix <-
beta[d, , , t] # Ensure this matches your data structure
u = update_theta_with_gp_concept(
t = t,
gp_models_d = gp_models_d,
new_diagnoses = new_diagnoses,
beta_matrix = beta_matrix,
smoothing_param = smoothing_param,shrinkage_strength = shrinkage_strength
)
estimated_Theta[d, , t]=u$theta_posterior
gp_mean_array[d, , t]=u$gp_means
}
}
}
7.7 posterior on topic loadings
beta_estimated <- array(runif(K * V * T), dim = c(K, V, T))
word_topic_counts <- matrix(0, nrow = V, ncol = K)
for(t in 1:T){
for (d in 1:D) {
diagnoses_dt <- diags[[d]][[t]]
for (diagnosis in diagnoses_dt) {
for (k in 1:K) {
# Weight the count by the topic proportion from Theta
word_topic_counts[diagnosis, k] <- word_topic_counts[diagnosis, k] + Theta[d, k, t]
}
}
}
# Smoothing (optional, same considerations as before)
word_topic_counts <- word_topic_counts + 1 # Laplace smoothing
# Update beta
beta_estimated[,,t] <- word_topic_counts / rowSums(word_topic_counts)
}
What do the estimated topics look like?
7.8 What do the topics look like?
# Calculate the covariance matrix
cov_matrix <- cor(t(beta_estimated[1,,]))
# Melt the covariance matrix for plotting
cov_melted <- melt(cov_matrix)
# Plot using ggplot2
ggplot(cov_melted, aes(x=Var1, y=Var2, fill=value)) +
geom_tile() + # Use geom_tile for heatmap
scale_fill_gradient2(low="blue", high="red", mid="white", midpoint=0, limit=c(min(cov_melted$value), max(cov_melted$value)), name="Covariance") +
theme_minimal() + # A minimal theme
theme(axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x-axis labels for better readability
axis.title = element_blank()) + # Remove axis titles for a cleaner look
labs(title = paste0("Estimated Covariance between Diseases:Topic",1), fill = "Covariance")
# Calculate the covariance matrix
cov_matrix <- cor(t(beta_estimated[2,,]))
# Melt the covariance matrix for plotting
cov_melted <- melt(cov_matrix)
# Plot using ggplot2
ggplot(cov_melted, aes(x=Var1, y=Var2, fill=value)) +
geom_tile() + # Use geom_tile for heatmap
scale_fill_gradient2(limit=c(min(cov_melted$value), max(cov_melted$value)), name="Covariance") +
theme_minimal() + # A minimal theme
theme(axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x-axis labels for better readability
axis.title = element_blank()) + # Remove axis titles for a cleaner look
labs(title = paste0("Estimated Covariance between Diseases:Topic",2), fill = "Covariance")