Chapter 7 Blocks

The primary goal of the E-step is to estimate the probability of each topic being active at time (t), given the observed diagnoses up to that point. Since diseases cannot reoccur, we need to consider the probability of not observing the same disease again under any active topic. However, we still want to account for the fact that the occurrence of a disease at any point in the past provides evidence for the topic’s activation.

Here’s the refined E-step calculation:

\[\mathbb{E}[u_{ikt} | Y, H, \Theta, \beta] = \frac{P(Y_{i \cdot t} | u_{ikt} = 1, H_{i \cdot (t-1)}, \beta) \cdot \Theta_{ikt}} {\sum_{k'} P(Y_{i \cdot t} | u_{ik't} = 1, H_{i \cdot (t-1)}, \beta) \cdot \Theta_{ik't}}\]

where:

\[P(Y_{i \cdot t} | u_{ikt} = 1, H_{i \cdot (t-1)}, \beta) = \prod_{d' \in D_{\text{occurred up to t-1}}} (1 - \beta_{kd't}) \prod_{d'' \notin D_{\text{occurred up to t-1}}} (1 - \beta_{kd''t})\]

(do we need to include diseases that did not ocurr?)

M-Step

\(\Theta\) Update: \[\Theta_{ikt}^{new} = \mathbb{E}[u_{ikt} | Y, H, \Theta, \beta]\] \(\beta\) Update:

\[\beta_{kdt} = \frac{\text{Number of individuals with first diagnosis of disease } d \text{ at time } t \text{ under topic } k}{\text{Total number of individuals}}\]

\[\beta_{kdt}^{new} = \frac{\sum_{i \in I_{\text{not occurred}}} Y_{idt} \cdot \mathbb{E}[u_{ikt} | Y, H, \Theta, \beta] }{\sum_{i \in I} \mathbb{E}[u_{ikt} | Y, H, \Theta, \beta]}\]

where \((I_{\text{not occurred}})\) is the set of individuals who have not yet experienced disease (d) at time (t).

Key Features

Carrying Forward Disease History: The set (D_{}) in the E-step calculation includes all diseases that have occurred for individual (i) up to time (t), ensuring that the history of disease occurrences is carried forward to influence the topic activation probability.

Excluding Occurred Diseases in Beta Update: The set (I_{}) in the M-step update for () excludes individuals who have already experienced disease (d), preventing rediagnosis while still allowing them to contribute to the estimation of topic activation probabilities.

7.0.1 Theta as Population-Level and Individual-Level Probabilities

\((\Theta_{ikt})\) represents both:

Individual-Level Probability: The probability that topic (k) is active for individual (i) at time (t).

Population-Level Probability: The proportion of individuals in the population for whom topic (k) is active at time (t).

This duality arises because we are assuming that the topic activation probabilities are independent across individuals. Therefore, by averaging (_{ikt}) over all individuals (i), we obtain the population-level probability of topic (k) being active at time (t).

7.0.2 Example:

If ({1k1} = 0.8), ({2k1} = 0.2), and (_{3k1} = 0.5), this means:

Individual 1 has an 80% probability of having topic (k) active at time 1. Individual 2 has a 20% probability of having topic (k) active at time 1. Individual 3 has a 50% probability of having topic (k) active at time 1. On average, 50% of the population has topic (k) active at time 1.

# Parameters
convergence_threshold <- 1e-5
max_iterations <- 1000

for (iteration in 1:max_iterations) {
  cat("Iteration:", iteration, "\n")
  
  # E-step: Compute expected values of latent variables
  E_u <- array(0, dim = c(I, K, T))
  for (i in 1:I) {
    for (k in 1:K) {
      for (t in 1:T) {
        if (any(disease_occurred[i, , t] == 1)) {
          prod_term <- prod(1 - beta[k, disease_occurred[i, , t] == 1, t])
          numerator <- (1 - prod_term) * theta[i, k, t]
          denominator <- sum((1 - sapply(1:K, function(kk) prod(1 - beta[kk, disease_occurred[i, , t] == 1, t]))) * theta[i, , t])
          if (!is.na(denominator) && denominator != 0) {
            E_u[i, k, t] <- numerator / denominator
          } else {
            E_u[i, k, t] <- theta[i, k, t]
          }
        } else {
          E_u[i, k, t] <- theta[i, k, t]
        }
      }
    }
  }
  
  # M-step: Update parameters
  new_theta <- array(0, dim = c(I, K, T))
  new_beta <- array(0, dim = c(K, D, T))
  
  # Update theta
  for (i in 1:I) {
    for (k in 1:K) {
      for (t in 1:T) {
        new_theta[i, k, t] <- E_u[i, k, t]
      }
    }
  }
  
  # Update beta
  for (k in 1:K) {
    for (d in 1:D) {
      for (t in 1:T) {
        if (t == 1) {
          not_occurred_idx <- disease_occurred[, d, t] == 0
        } else {
          not_occurred_idx <- disease_occurred[, d, t - 1] == 0
        }
        
        # Only consider the first occurrence of the disease
        numerator_beta <- sum((disease_occurred[, d, t] == 1 & disease_occurred_tracker[, d] == t)[not_occurred_idx] * E_u[, k, t][not_occurred_idx])
        denominator_beta <- sum(E_u[, k, t])
        
        if (denominator_beta > 0) {
          new_beta[k, d, t] <- numerator_beta / denominator_beta
        } else {
          new_beta[k, d, t] <- beta[k, d, t]
        }
      }
    }
  }
  
  cat("Theta change:", max(abs(theta - new_theta)), "\n")
  cat("Beta change:", max(abs(beta - new_beta)), "\n")
  
  if (max(abs(theta - new_theta)) < convergence_threshold && max(abs(beta - new_beta)) < convergence_threshold) {
    cat("Converged at iteration:", iteration, "\n")
    break
  }
  
  theta <- new_theta
  beta <- new_beta
}

7.1 one time point, then flatten

7.2 is it doing basic clustering

7.3 output all updates

7.4 EM temporal data

7.5 dynamic clustering –

7.6 two topics, 10 diseases

# Load necessary libraries
library(MASS)
library(reshape2)
library(dplyr)
library(ggplot2)

# Parameters
I <- 20  # Number of individuals
K <- 5   # Number of topics
D <- 50  # Number of diseases
T <- 100  # Number of time points

# Initial values for theta and beta
set.seed(123)
theta <- array(runif(I * K * T, min = 0.01, max = 0.1), dim = c(I, K, T))
beta <- array(runif(K * D * T, min = 0.01, max = 0.1), dim = c(K, D, T))

# Simulated disease occurrences
# disease_occurred <- <your_loaded_data>
# disease_occurred_tracker <- <your_loaded_data>

# Define a logistic function for theta and beta
logistic_function <- function(x) {
  return(1 / (1 + exp(-x)))
}

# Define a Gaussian process for smooth temporal updates
gp_function <- function(t, lengthscale, variance) {
  exp(-0.5 * (t / lengthscale)^2) * variance
}

lengthscale <- 10
variance <- 0.01

# E-step: Compute expected values of latent variables
E_u <- array(0, dim = c(I, K, T))
for (i in 1:I) {
  for (k in 1:K) {
    for (t in 1:T) {
      if (any(disease_occurred[i, , t] == 1)) {
        prod_term <- prod(1 - logistic_function(beta[k, disease_occurred[i, , t] == 1, t]), na.rm = TRUE)
        numerator <- (1 - prod_term) * logistic_function(theta[i, k, t])
        denominator <- sum((1 - sapply(1:K, function(kk) prod(1 - logistic_function(beta[kk, disease_occurred[i, , t] == 1, t]), na.rm = TRUE))) * logistic_function(theta[i, , t]), na.rm = TRUE)
        
        if (!is.na(denominator) && denominator != 0) {
          E_u[i, k, t] <- numerator / denominator
        } else {
          E_u[i, k, t] <- logistic_function(theta[i, k, t])
        }
      } else {
        E_u[i, k, t] <- logistic_function(theta[i, k, t])
      }
    }
  }
}

# M-step: Update parameters
new_theta <- array(0, dim = c(I, K, T))
new_beta <- array(0, dim = c(K, D, T))

for (i in 1:I) {
  for (k in 1:K) {
    for (t in 1:T) {
      new_theta[i, k, t] <- E_u[i, k, t]
    }
  }
}

for (k in 1:K) {
  for (d in 1:D) {
    for (t in 1:T) {
      not_occurred_idx <- if (t == 1) disease_occurred[, d, t] == 0 else disease_occurred[, d, t - 1] == 0

      # Only consider the first occurrence of the disease
      numerator_beta <- sum((disease_occurred[, d, t] == 1 & disease_occurred_tracker[, d] == t) * E_u[, k, t], na.rm = TRUE)
      denominator_beta <- sum(E_u[, k, t], na.rm = TRUE)

      if (denominator_beta > 0) {
        new_beta[k, d, t] <- numerator_beta / denominator_beta
      } else {
        new_beta[k, d, t] <- beta[k, d, t]
      }

      # Apply Gaussian process for smooth temporal updates
      if (t > 1) {
        new_beta[k, d, t] <- logistic_function(new_beta[k, d, t] + gp_function(t, lengthscale, variance))
      } else {
        new_beta[k, d, t] <- logistic_function(new_beta[k, d, t])
      }
    }
  }
}

# Convergence check
convergence_threshold <- 1e-5
max_iterations <- 1000

for (iteration in 1:max_iterations) {
  cat("Iteration:", iteration, "\n")

  # E-step: Compute expected values of latent variables
  E_u <- array(0, dim = c(I, K, T))
  for (i in 1:I) {
    for (k in 1:K) {
      for (t in 1:T) {
        if (any(disease_occurred[i, , t] == 1)) {
          prod_term <- prod(1 - logistic_function(beta[k, disease_occurred[i, , t] == 1, t]), na.rm = TRUE)
          numerator <- (1 - prod_term) * logistic_function(theta[i, k, t])
          denominator <- sum((1 - sapply(1:K, function(kk) prod(1 - logistic_function(beta[kk, disease_occurred[i, , t] == 1, t]), na.rm = TRUE))) * logistic_function(theta[i, , t]), na.rm = TRUE)
          
          if (!is.na(denominator) && denominator != 0) {
            E_u[i, k, t] <- numerator / denominator
          } else {
            E_u[i, k, t] <- logistic_function(theta[i, k, t])
          }
        } else {
          E_u[i, k, t] <- logistic_function(theta[i, k, t])
        }
      }
    }
  }

  # M-step: Update parameters
  new_theta <- array(0, dim = c(I, K, T))
  new_beta <- array(0, dim = c(K, D, T))

  for (i in 1:I) {
    for (k in 1:K) {
      for (t in 1:T) {
        new_theta[i, k, t] <- E_u[i, k, t]
      }
    }
  }

  for (k in 1:K) {
    for (d in 1:D) {
      for (t in 1:T) {
        not_occurred_idx <- if (t == 1) disease_occurred[, d, t] == 0 else disease_occurred[, d, t - 1] == 0

        # Only consider the first occurrence of the disease
        numerator_beta <- sum((disease_occurred[, d, t] == 1 & disease_occurred_tracker[, d] == t) * E_u[, k, t], na.rm = TRUE)
        denominator_beta <- sum(E_u[, k, t], na.rm = TRUE)

        if (denominator_beta > 0) {
          new_beta[k, d, t] <- numerator_beta / denominator_beta
        } else {
          new_beta[k, d, t] <- beta[k, d, t]
        }

        # Apply Gaussian process for smooth temporal updates
        if (t > 1) {
          new_beta[k, d, t] <- logistic_function(new_beta[k, d, t] + gp_function(t, lengthscale, variance))
        } else {
          new_beta[k, d, t] <- logistic_function(new_beta[k, d, t])
        }
      }
    }
  }

  cat("Theta change:", max(abs(theta - new_theta), na.rm = TRUE), "\n")
  cat("Beta change:", max(abs(beta - new_beta), na.rm = TRUE), "\n")

  if (max(abs(theta - new_theta), na.rm = TRUE) < convergence_threshold && max(abs(beta - new_beta), na.rm = TRUE) < convergence_threshold) {
    cat("Converged at iteration:", iteration, "\n")
    break
  }

  theta <- new_theta
  beta <- new_beta
}