Threshold Analysis

Weibull Exploration

Author
Affiliation

Sarah Urbut

MGH

Published

February 24, 2001

Rare and Common Event Timing

In general, individuals with both high common and rare variant burden have ‘earlier’ onset disease, but how can we quantify the proportion of rare/common heritability by time of event. Let’s first simulate some data:

  1. We’ll simulate 10 (MAF 1e-1) reare and 100 (MAF 5e-2) common variants with rare variant effect sizes \N (15,0.5) and \N (0.1,1) respectively.
  2. Every individuals total genetic load will then be the sum of this rare and common genetic load which we can sum, and normalize N(0,1)
Code
library(survival)
library(gridExtra)
library(data.table)
library(ggsci)
library(dplyr)
library(reshape2)
library(dplyr)
library(ggplot2)

set.seed(123)

# Define the number of individuals and variants
n <- 10000
n_rare_variants <- 10 # Number of rare variants per individual
n_common_variants <- 100 # Number of common variants per individual

# Simulate rare variants
rare_variants <- matrix(rbinom(n * n_rare_variants, 1, 1e-3), ncol = n_rare_variants)
rare_effects <- t(matrix(rnorm(n_rare_variants, mean = 15, sd = 0.5), ncol = n_rare_variants))
rare_load <- rare_variants %*% rare_effects

# Simulate common variants
common_variants <- matrix(rbinom(n * n_common_variants, 1, 0.5), ncol = n_common_variants)
common_effects <- t(matrix(rnorm(n_common_variants, mean = 0, sd = 1), ncol = n_common_variants))
common_effects <- scale(common_effects) # Standardize effects
common_load <- common_variants %*% common_effects

# Calculate the genetic risk score for each individual
genetic_risk_score <- rowSums(common_load + rare_load)

# Normalize the genetic risk score
genetic_risk_score_normalized <- scale(genetic_risk_score)

# Create the data frame
df <- data.frame(
  Genetic.Risk.Score = genetic_risk_score,
  Normalized.Genetic.Risk.Score = genetic_risk_score_normalized,
  Rare.Load = rowSums(rare_load),
  Common.Load = rowSums(common_load)
)

# Display the first few rows of the dataframe
summary(df)
 Genetic.Risk.Score  Normalized.Genetic.Risk.Score   Rare.Load      
 Min.   :-16.82215   Min.   :-3.27631              Min.   : 0.0000  
 1st Qu.: -3.27015   1st Qu.:-0.65776              1st Qu.: 0.0000  
 Median : -0.01982   Median :-0.02972              Median : 0.0000  
 Mean   :  0.13400   Mean   : 0.00000              Mean   : 0.1406  
 3rd Qu.:  3.52942   3rd Qu.: 0.65607              3rd Qu.: 0.0000  
 Max.   : 26.13272   Max.   : 5.02354              Max.   :16.1285  
  Common.Load        
 Min.   :-16.822155  
 1st Qu.: -3.310597  
 Median : -0.079738  
 Mean   : -0.006591  
 3rd Qu.:  3.421738  
 Max.   : 17.302561  
Code
# Load necessary libraries
# Assuming you have a dataframe 'df' with the columns:
# 'Genetic Risk Score', 'Normalized Genetic Risk Score', 'Rare Load', 'Common Load'
# If you don't have it already, you should create it based on your simulation results.

# Plot 1: Genetic Risk Score Distribution
grid.arrange(ggplot(df, aes(x = common_load)) +
  geom_histogram(aes(y = ..density..), fill = "skyblue", color = "white") +
  geom_density(alpha = 0.5, fill = "skyblue") +
  labs(title = "Common Load Distribution", x = "Common Load (effect x number)", y = "Density") +
  theme_minimal(),
  ggplot(df, aes(x = rare_load)) +
  geom_histogram(aes(y = ..density..), fill = "lightgreen", color = "white") +
  geom_density(alpha = 0.5, fill = "lightgreen") +
  labs(title = "Rare Load Distribution", x = "Rare load (effect x number)", y = "Density") +
  theme_minimal(),

# Plot 2: Normalized Genetic Risk Score Distribution
ggplot(df, aes(x = Normalized.Genetic.Risk.Score)) +
  geom_histogram(aes(y = ..density..), fill = "lightpink", color = "white") +
  geom_density(alpha = 0.5, fill = "lightpink") +
  labs(title = "Normalized Genetic Risk Score Distribution", x = "Normalized Genetic Risk Score", y = "Density") +
  theme_minimal(),nrow=3)

3) We can examine the distribution of rare and common load:

Code
# Plot 3: Rare Load Distribution
grid.arrange(
ggplot(df, aes(x = Rare.Load)) +
  geom_histogram(aes(y = ..density..), fill = "lightgreen", color = "white") +
  geom_density(alpha = 0.5, fill = "lightgreen") +
  labs(title = "Rare Load Distribution", x = "Rare Load", y = "Density") +
  theme_minimal(),

# Plot 4: Common Load Distribution
ggplot(df, aes(x = Common.Load)) +
  geom_histogram(aes(y = ..density..), fill = "lavender", color = "white") +
  geom_density(alpha = 0.5, fill = "lavender") +
  labs(title = "Common Load Distribution", x = "Common Load", y = "Density") +
  theme_minimal(),nrow=1)

Finally, we can examine the distribution of normalized score by rare load, and see that those at the extremes of the score distribtuion are indeed enriched in rare variant load.

Code
# Scatter plot: Normalized Genetic Risk Score vs. Rare Load
ggplot(df, aes(x = Normalized.Genetic.Risk.Score, y = Rare.Load)) +
  geom_point(alpha = 0.6, color = "lightcoral") +
  labs(title = "Normalized Genetic Risk Score vs Rare Load",
       x = "Normalized Genetic Risk Score", y = "Rare Load") +
  theme_minimal()

Empirical Data

Our aim is to model the age of onset of disease accuartely, and so we’ll use typical event distribution times from the UK Biobank. We’ll use the phecode data on individuals diagnosed with MI, including the minimum age at diagnosis.

Weibull Model Fitting

  • Weibull Model Fitting: Fit a Weibull survival model to the data. The Weibull distribution is used here because it’s flexible and can model various shapes of hazard functions, making it suitable for time-to-event data like age at onset of MI. It is typical of chronic (commonly late onset disease), we’ll use the Weibull distribution.

  • Parameter Extraction: You extract the shape and scale parameters from the fitted Weibull model. In survival analysis, the shape parameter describes the failure rate over time, while the scale parameter adjusts the distribution’s spread.

Simulation Based on Genetic Risk

  • Adjusting Scale Parameter: You adjust the scale parameter of the Weibull distribution based on individuals’ genetic risk scores. Higher genetic risk scores lead to a lower scale parameter, implying earlier onset ages for MI, reflecting the influence of genetic predisposition on disease onset.

  • Simulating New Onset Ages: Using the adjusted Weibull parameters, you simulate new onset ages for MI. This step aims to generate a distribution of MI onset ages that accounts for genetic risk factors, allowing for the exploration of how genetic predisposition affects the age of MI onset.

Code
# mi from ukb
mi=readRDS("~/Library/CloudStorage/Dropbox-Personal/mi.rds")
#ggplot(mi,aes(x=`min(age_diag)`,fill="lightblue"))+geom_histogram()+labs(x="Age of MI, UKB",fill="")+theme_pubclean()
#summary(mi$`min(age_diag)`)
surv_obj <- Surv(mi$`min(age_diag)`)

# Fit the Weibull model
fit <- survreg(surv_obj ~ 1, dist = 'weibull')

# survreg scale parameter maps to 1/shape, linear predictor to log(scale) per 
## survreg.distributions {survival}

# Extract the shape and scale parameters
shape_parameter_ukb <- 1 / fit$scale
scale_parameter_ukb <- exp(coef(fit))

event_ages <- rweibull(n, shape = shape_parameter_ukb, scale = scale_parameter_ukb)

#ggplot(data.frame(event_ages),aes(x=event_ages,fill="lightred"))+geom_histogram()+labs(x="Simulated Event Times, UKB",fill="")+theme_pubclean()

ea=data.frame(event_ages)
mi$type="Empirical"
ea$type="Simulated"
names(mi)[c(2,3)]=names(ea)=c("age","type")
m=rbind(mi[,c("age","type")],ea)

ggplot(m,aes(x=age,fill=as.factor(type)))+geom_density()+
         theme_classic()+labs(x="Event Ages",fill="Event Type",y="Density")+scale_fill_aaas()+facet_wrap(~type)

Adding genetics

Now, we’ll try and use these settings to adjust our simulations accordingly: namely, individuals with higher genetic risk scores are predisposed to experiencing the event at earlier ages. Here’s a step-by-step summary of what’s happening:

1. Weibull Distribution Parameters

  • Shape Parameter: This is derived from your data (denoted as shape_parameter_ukb), which dictates the form of the distribution from the UK Biobank.

  • Scale Parameter Adjustments: The scale parameter of the Weibull distribution is adjusted based on individual genetic risk scores. Normally, the scale parameter influences the distribution’s spread, with larger values stretching the distribution over a longer time. In this model, higher genetic risk scores decrease the scale parameter value, simulating earlier event onset.

  • GRM (Genetic Risk Modifier): A constant (grm) is used to determine how much the genetic risk score affects the scale parameter. This effectively models the impact of genetics on the timing of the event, with a focus on ensuring the scale parameter remains positive to avoid unrealistic negative ages.

Recall in the weibull distribution:

Shape Parameter (k)

  • Controls the ‘curve’ of the distribution. The shape parameter determines whether your simulated event ages will follow a pattern where occurrences are concentrated early on, spread relatively evenly, or concentrated later.

  • Influences Failure Rate:

    • k < 1: Failure rate (in this case, the likelihood of the event happening) decreases over time. This might simulate a scenario where an early, high-risk period exists, but the risk diminishes as individuals age.

    • k = 1: Failure rate is constant. This is equivalent to an exponential distribution.

    • k > 1: Failure rate increases over time. This simulates situations where the risk of the event increases as individuals age.

Scale Parameter (λ)

  • Stretches or Compresses the distribution. It influences how spread out the simulated event ages will be.

  • Reference Point: The scale parameter roughly approximates the age at which 63.2% of the events would have occurred.

  • Effect of Genetic Risk Score: In your code, a higher genetic risk score decreasesthe scale parameter. This simulates earlier onset of the event for individuals with higher risk.

In Your Simulation

  1. J-shaped distribution: Your choice of the shape_parameter_ukb aims to create a distribution where many simulated events happen early in life, followed by a slower increase in event occurrences as people age.

  2. Genetic Risk Score: The way you modify the scale_parameter ensures that individuals with a high genetic risk score will tend to have events earlier in their simulated lives. The grm parameter controls how strongly genetic risk influences the age of onset.

Caveats

  • Minimum and Maximum Age: The min_age and max_age are there to enforce realistic boundaries for your simulation.

  • Rounding Event Ages: Rounding the values using round(event_ages) is likely for the purpose of grouping or analysis within your variant_summary data frame.

Code
# Parameters for the Weibull distribution
shape_parameter <- shape_parameter_ukb  # Adjusted shape parameter for the desired 'J-shaped' distribution
#scale_base <- 55  # Base scale parameter around which to center the disease onset ages
# Adjust the scale parameter of the Weibull distribution based on the genetic risk score
# Higher genetic risk scores should lead to a lower scale parameter (earlier onset)
# Add a constant to ensure the scale parameter stays positive
 # Determines how strongly the risk score affects the scale
grm = 5
scale_parameter <- scale_parameter_ukb - grm*genetic_risk_score_normalized # Adjust the 10 multiplier as needed
# Ensure that the scale parameter does not go negative or too low, which could cause unrealistic ages
#scale_parameter[scale_parameter < 1] <- 1
# Simulate the disease onset ages
event_ages <- rweibull(n, shape = shape_parameter, scale = scale_parameter)
# Ensure that event ages are within a realistic range
min_age <- 20
max_age <- 100

variant_summary <- data.frame(
  event_age = round(event_ages),  # Round event ages to the nearest whole number for grouping
  # Count of common variants
  rare_burden = rare_load,  # Burden of rare variants
  common_burden = common_load,
  score=genetic_risk_score_normalized # Burden of common variants
)
#event_ages[event_ages > max_age] <- max_age
# Create a histogram to visualize the distribution of event ages

ea <- data.table(event_ages)

ggplot(ea[event_ages < 100 & event_ages > 20, ], aes(x=event_ages)) +
  geom_density(fill="#69b3a2") + # A soft, appealing shade of blue-green
  theme_classic() +
  labs(x = "Genetically Simulated Event Ages", y = "Density")

We’re not completely satisfied because we see that the distribution is more symmetric than we might hope. But let’s proceed with aggregate analyses.

Code
#Create a dataframe that includes the event age, counts, and burdens of variants for each individual}

# Aggregate the counts and burdens by event age
summary_by_age <- variant_summary%>%group_by(event_age)%>%
summarize(n=length(rare_burden),rare_burden=mean(rare_burden),common_burden=mean(common_burden),mean_score=mean(score))
# Rename the columns for clarity
names(summary_by_age) <- c("Age", "Event Count", "Rare Burden", "Common Burden", "Score")
# Print the summary table
#print(summary_by_age,n=60)
m=melt(summary_by_age,id.vars="Age")
ggplot(m,aes(Age,value,fill=variable))+geom_bar(stat="identity")+
facet_wrap(~variable,scales="free")+theme_classic()

Heritability like

Take the correlation between the normalized score and the binary phenotype of MI by age N

R2 is effectively the observed scaled narrow sense heritability, although this aggregates individual level data for a population level statistic.

We show that

1) while the overall genetic heritability of ‘ever having’ an event increases with age, the rare heritability decreases with time (as the variation becomes explained by more common effects) even though late events are often for nongenomic causes .

Code
# Calculate correlations for ages 40 to 70
ages <- 40:70
cor_data <- data.frame(Age = integer(0), Correlation = numeric(0), Type = character(0))
for (age in ages) {
  cor_genetic <- cor(genetic_risk_score_normalized, ifelse(event_ages < age, 1, 0))
  cor_rare <- cor(rare_load, ifelse(event_ages < age, 1, 0))
  cor_common <- cor(common_load, ifelse(event_ages < age, 1, 0))
  
  cor_data <- rbind(cor_data, data.frame(Age = age, Correlation = cor_genetic, Type = "Genetic Risk"))
  cor_data <- rbind(cor_data, data.frame(Age = age, Correlation = cor_rare, Type = "Rare Load"))
  cor_data <- rbind(cor_data, data.frame(Age = age, Correlation = cor_common, Type = "Common Load"))
}

# Convert 'Type' to a factor for coloring
cor_data$Type <- factor(cor_data$Type, levels = c("Genetic Risk", "Rare Load", "Common Load"))


# Plot using ggplot2
ggplot(cor_data, aes(x = Age, y = Correlation, color = Type)) +
  geom_point() +
  geom_line() +
  scale_color_manual(values = c("Genetic Risk" = "black", "Rare Load" = "red", "Common Load" = "blue")) +
  labs(x = "Age", y = "Correlation (R2)", title = "Correlation of Load and Genetic Risk with Event Age") +
  theme_minimal() +
  theme(legend.title = element_blank())

Introducing non CAD competing events

In the normal data, we observe a plateau.

This plateau could indicate that there are competing risks present that affect the distribution of event ages. In survival analysis, competing risks occur when individuals are at risk of several different events and the occurrence of one event prevents the occurrence of the other event. This is common in older populations where the risk of death from non-CAD causes increases with age.

  1. Competing Risks Model: Instead of using a single Weibull distribution, consider using a competing risks framework to simulate the time to CAD event and the time to other events separately. The cmprsk package in R can handle competing risks.

  2. Subdistribution Hazard: Instead of focusing on the cause-specific hazard, which only considers the hazard for the event of interest, look at the subdistribution hazard, which takes into account the fact that some individuals will experience competing events. The survival package can fit subdistribution hazard models using the survreg function with the competing.risks argument.

  3. Plateau Incorporation: Adjust your simulation to incorporate a plateau. For instance, you could simulate two separate age distributions: one for the initial rise and peak and another for the plateau, and then combine them.

  4. Truncation or Censoring: You could also consider truncating the simulated distribution at an age where the plateau begins or treat the plateau as censoring due to competing risks.

To refine your simulation, you would take the following steps:

  • Simulate the non-CAD death times using another survival distribution or a constant rate of occurrence.

  • Generate CAD event times with the estimated Weibull parameters.

  • Combine the two sets of times, taking the minimum time for each individual as the observed time and noting the cause of the event.

  • Analyze the combined dataset considering the competing events.

Code
library(purrr)
n <- 10000  # The number of individuals to simulate
mean_non_cad_death_age <- 80  # The average age at non-CAD death
min_age_threshold <- 40  # The minimum realistic age for CAD onset


df_pheno=readRDS("~/Library/CloudStorage/Dropbox-Personal/pheno_dir/output/df_final_withddeath.rds")
# Simulate non-CAD death times with the corrected rate
non_cad_death_ages <- na.omit(df_pheno$Death_censor_age[df_pheno$Cad_0_Any==1])

fit_non_cad <- survreg(Surv(non_cad_death_ages) ~ 1, dist = 'weibull')

# Extract the shape and scale parameters
shape_non_cad <- 1 / fit_non_cad$scale
scale_non_cad <- exp(coef(fit_non_cad))


simulated_non_cad_death_times <- data.frame(rweibull(n, shape = shape_non_cad, scale = scale_non_cad))
simulated_non_cad_death_times$type="Non Cad Event Times"
names(simulated_non_cad_death_times)=c("Times","Type")


# Plot the histogram of the simulated non-CAD death times to compare with the empirical histogram
# hist(simulated_non_cad_death_times, breaks = 100, main = "Simulated Non-CAD Death Times")

combined_event_times <- data.frame(pmin(event_ages, simulated_non_cad_death_times$Times))
combined_event_times$Type="Minimum Time"
names(combined_event_times)=c("Times","Type")

# Exclude unrealistic early deaths by setting a minimum age threshold
combined_event_times$Times[combined_event_times$Times<min_age_threshold]=min_age_threshold


r=rbind(combined_event_times,simulated_non_cad_death_times)
event_age2=data.frame(event_ages)
event_age2$Type="Cad"
names(event_age2)=c("Times","Type")
r=rbind(r,event_age2)
# Plot the distribution of the combined event times with the threshold
ggplot(r,aes(x = Times,fill=as.factor(Type)))+geom_density(alpha=0.8)+
labs(x = "Histogram of Combined Event Times",y="Density",fill="Event Type")+theme_classic ()

Code
#So we can see that as expected, the common disease heritability increases over time, and the rare heritability decreases.