In general, individuals with a high burden of both common and rare variants have earlier-onset disease, but how can we quantify the proportion of heritability attributable to rare versus common variants by time of event? Let's first simulate some data.
We'll simulate 10 rare variants (MAF $10^{-3}$) and 100 common variants (MAF 0.5), with per-variant effect sizes drawn from $\mathcal{N}(15, 0.5)$ for rare variants and $\mathcal{N}(0, 1)$ for common variants, respectively.
Each individual's total genetic load is then the sum of their rare and common loads, which we standardize to mean 0 and variance 1, as formalized below.
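Formally, writing $g_{ij} \in \{0, 1\}$ for the carrier status of individual $i$ at variant $j$ and $\beta_j$ for the simulated effect size, the raw and normalized scores are:

$$
G_i = \sum_{j=1}^{10} g_{ij}\,\beta_j^{\text{rare}} + \sum_{k=1}^{100} g_{ik}\,\beta_k^{\text{common}}, \qquad Z_i = \frac{G_i - \bar{G}}{\mathrm{sd}(G)}.
$$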
Code
library(survival)
library(gridExtra)
library(data.table)
library(ggsci)
library(dplyr)
library(reshape2)
library(ggplot2)

set.seed(123)

# Define the number of individuals and variants
n <- 10000
n_rare_variants <- 10     # Number of rare variants per individual
n_common_variants <- 100  # Number of common variants per individual

# Simulate rare variants (carrier probability 1e-3) and their effects
rare_variants <- matrix(rbinom(n * n_rare_variants, 1, 1e-3), ncol = n_rare_variants)
rare_effects <- t(matrix(rnorm(n_rare_variants, mean = 15, sd = 0.5), ncol = n_rare_variants))
rare_load <- rare_variants %*% rare_effects

# Simulate common variants (carrier probability 0.5) and their effects
common_variants <- matrix(rbinom(n * n_common_variants, 1, 0.5), ncol = n_common_variants)
common_effects <- t(matrix(rnorm(n_common_variants, mean = 0, sd = 1), ncol = n_common_variants))
common_effects <- scale(common_effects) # Standardize effects
common_load <- common_variants %*% common_effects

# Calculate the genetic risk score for each individual
genetic_risk_score <- rowSums(common_load + rare_load)

# Normalize the genetic risk score
genetic_risk_score_normalized <- scale(genetic_risk_score)

# Create the data frame
df <- data.frame(
  Genetic.Risk.Score = genetic_risk_score,
  Normalized.Genetic.Risk.Score = genetic_risk_score_normalized,
  Rare.Load = rowSums(rare_load),
  Common.Load = rowSums(common_load)
)

# Summarize the data frame
summary(df)
Genetic.Risk.Score Normalized.Genetic.Risk.Score Rare.Load
Min. :-16.82215 Min. :-3.27631 Min. : 0.0000
1st Qu.: -3.27015 1st Qu.:-0.65776 1st Qu.: 0.0000
Median : -0.01982 Median :-0.02972 Median : 0.0000
Mean : 0.13400 Mean : 0.00000 Mean : 0.1406
3rd Qu.: 3.52942 3rd Qu.: 0.65607 3rd Qu.: 0.0000
Max. : 26.13272 Max. : 5.02354 Max. :16.1285
Common.Load
Min. :-16.822155
1st Qu.: -3.310597
Median : -0.079738
Mean : -0.006591
3rd Qu.: 3.421738
Max. : 17.302561
Code
# Plot the distributions of common load, rare load, and the normalized genetic risk score
grid.arrange(
  ggplot(df, aes(x = Common.Load)) +
    geom_histogram(aes(y = ..density..), fill = "skyblue", color = "white") +
    geom_density(alpha = 0.5, fill = "skyblue") +
    labs(title = "Common Load Distribution", x = "Common Load (effect x number)", y = "Density") +
    theme_minimal(),
  ggplot(df, aes(x = Rare.Load)) +
    geom_histogram(aes(y = ..density..), fill = "lightgreen", color = "white") +
    geom_density(alpha = 0.5, fill = "lightgreen") +
    labs(title = "Rare Load Distribution", x = "Rare load (effect x number)", y = "Density") +
    theme_minimal(),
  ggplot(df, aes(x = Normalized.Genetic.Risk.Score)) +
    geom_histogram(aes(y = ..density..), fill = "lightpink", color = "white") +
    geom_density(alpha = 0.5, fill = "lightpink") +
    labs(title = "Normalized Genetic Risk Score Distribution", x = "Normalized Genetic Risk Score", y = "Density") +
    theme_minimal(),
  nrow = 3
)
We can also examine the distributions of rare and common load side by side:
Code
grid.arrange(
  # Rare load distribution
  ggplot(df, aes(x = Rare.Load)) +
    geom_histogram(aes(y = ..density..), fill = "lightgreen", color = "white") +
    geom_density(alpha = 0.5, fill = "lightgreen") +
    labs(title = "Rare Load Distribution", x = "Rare Load", y = "Density") +
    theme_minimal(),
  # Common load distribution
  ggplot(df, aes(x = Common.Load)) +
    geom_histogram(aes(y = ..density..), fill = "lavender", color = "white") +
    geom_density(alpha = 0.5, fill = "lavender") +
    labs(title = "Common Load Distribution", x = "Common Load", y = "Density") +
    theme_minimal(),
  nrow = 1
)
Finally, we can examine the distribution of the normalized score by rare load, and see that individuals at the extremes of the score distribution are indeed enriched for rare variant load.
Code
# Scatter plot: Normalized Genetic Risk Score vs. Rare Load
ggplot(df, aes(x = Normalized.Genetic.Risk.Score, y = Rare.Load)) +
  geom_point(alpha = 0.6, color = "lightcoral") +
  labs(title = "Normalized Genetic Risk Score vs Rare Load",
       x = "Normalized Genetic Risk Score", y = "Rare Load") +
  theme_minimal()
Empirical Data
Our aim is to model the age of disease onset accurately, so we'll use realistic event-time distributions from the UK Biobank. Specifically, we'll use the phecode data on individuals diagnosed with MI, taking the minimum age at diagnosis for each individual.
Weibull Model Fitting
Weibull Model Fitting: We fit a Weibull survival model to the data. The Weibull distribution is used here because it is flexible and can model a variety of hazard shapes, making it suitable for time-to-event data such as age at MI onset; it is also a typical choice for chronic, commonly late-onset diseases.
Parameter Extraction: We extract the shape and scale parameters from the fitted Weibull model. In survival analysis, the shape parameter describes how the failure rate changes over time, while the scale parameter controls the distribution's spread.
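As a quick reference, the Weibull density and hazard in the shape/scale ($k$, $\lambda$) parameterization used by rweibull are:

$$
f(t) = \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1} e^{-(t/\lambda)^{k}}, \qquad
h(t) = \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1}.
$$

survreg fits an accelerated failure time model on the log scale, so for an intercept-only fit the Weibull shape is recovered as 1/fit$scale and the Weibull scale as exp(coef(fit)), as done in the code below.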
Simulation Based on Genetic Risk
Adjusting the Scale Parameter: We adjust the scale parameter of the Weibull distribution based on each individual's genetic risk score. Higher genetic risk scores lead to a lower scale parameter, implying earlier MI onset and reflecting the influence of genetic predisposition on disease onset.
Simulating New Onset Ages: Using the adjusted Weibull parameters, we simulate new MI onset ages. This generates a distribution of onset ages that accounts for genetic risk, allowing us to explore how genetic predisposition shifts the age of MI onset.
Code
# MI ages from UKB
mi = readRDS("~/Library/CloudStorage/Dropbox-Personal/mi.rds")
#ggplot(mi, aes(x = `min(age_diag)`, fill = "lightblue")) + geom_histogram() + labs(x = "Age of MI, UKB", fill = "") + theme_pubclean()
#summary(mi$`min(age_diag)`)

surv_obj <- Surv(mi$`min(age_diag)`)

# Fit the Weibull model
fit <- survreg(surv_obj ~ 1, dist = 'weibull')

# survreg's scale parameter maps to 1/shape, and the linear predictor to log(scale);
# see survreg.distributions {survival}
shape_parameter_ukb <- 1 / fit$scale
scale_parameter_ukb <- exp(coef(fit))

# Simulate event ages from the fitted Weibull distribution
event_ages <- rweibull(n, shape = shape_parameter_ukb, scale = scale_parameter_ukb)
#ggplot(data.frame(event_ages), aes(x = event_ages, fill = "lightred")) + geom_histogram() + labs(x = "Simulated Event Times, UKB", fill = "") + theme_pubclean()

# Compare the empirical and simulated event-age distributions
ea = data.frame(event_ages)
mi$type = "Empirical"
ea$type = "Simulated"
names(mi)[c(2, 3)] = names(ea) = c("age", "type")
m = rbind(mi[, c("age", "type")], ea)
ggplot(m, aes(x = age, fill = as.factor(type))) +
  geom_density() +
  theme_classic() +
  labs(x = "Event Ages", fill = "Event Type", y = "Density") +
  scale_fill_aaas() +
  facet_wrap(~type)
Adding genetics
Now we'll use these fitted parameters to adjust our simulations so that individuals with higher genetic risk scores experience the event at earlier ages. Here's a step-by-step summary of what's happening:
1. Weibull Distribution Parameters
Shape Parameter: This is derived from the UK Biobank data (shape_parameter_ukb) and dictates the form of the onset-age distribution.
Scale Parameter Adjustments: The scale parameter of the Weibull distribution is adjusted based on individual genetic risk scores. Normally, the scale parameter influences the distribution’s spread, with larger values stretching the distribution over a longer time. In this model, higher genetic risk scores decrease the scale parameter value, simulating earlier event onset.
GRM (Genetic Risk Modifier): A constant (grm) determines how strongly the genetic risk score affects the scale parameter. This models the impact of genetics on event timing, with care taken to keep the scale parameter positive and avoid unrealistic negative ages (formalized just below).
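Concretely, using the notation from the code below ($Z_i$ is the normalized genetic risk score and $k_{\text{UKB}}$, $\lambda_{\text{UKB}}$ the fitted UK Biobank parameters), the adjustment is:

$$
\lambda_i = \lambda_{\text{UKB}} - \mathrm{grm} \cdot Z_i, \qquad T_i \sim \mathrm{Weibull}(k_{\text{UKB}}, \lambda_i),
$$

so each standard deviation of additional genetic risk lowers an individual's scale parameter, and hence their characteristic onset age, by grm years.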
Recall that in the Weibull distribution:
Shape Parameter (k)
Controls the 'curve' of the distribution. The shape parameter determines whether the simulated event ages are concentrated early on, spread relatively evenly, or concentrated later.
Influences Failure Rate:
k < 1: Failure rate (in this case, the likelihood of the event happening) decreases over time. This might simulate a scenario where an early, high-risk period exists, but the risk diminishes as individuals age.
k = 1: Failure rate is constant. This is equivalent to an exponential distribution.
k > 1: Failure rate increases over time. This simulates situations where the risk of the event increases as individuals age.
Scale Parameter (λ)
Stretches or Compresses the distribution. It influences how spread out the simulated event ages will be.
Reference Point: The scale parameter is the age by which about 63.2% of events have occurred, regardless of the shape parameter (see the short derivation after this list).
Effect of Genetic Risk Score: In our code, a higher genetic risk score decreases the scale parameter. This simulates earlier onset of the event for individuals with higher risk.
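The 63.2% reference point follows directly from the Weibull CDF evaluated at $t = \lambda$:

$$
F(\lambda) = 1 - e^{-(\lambda/\lambda)^{k}} = 1 - e^{-1} \approx 0.632,
$$

which holds for any shape $k$, so the scale parameter can always be read as the characteristic onset age.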
In This Simulation
J-shaped distribution: The choice of shape_parameter_ukb aims to create a distribution in which many simulated events happen early in life, followed by a slower increase in event occurrences as people age.
Genetic Risk Score: The way we modify scale_parameter ensures that individuals with a high genetic risk score tend to have events earlier in their simulated lives. The grm parameter controls how strongly genetic risk influences the age of onset.
Caveats
Minimum and Maximum Age: min_age and max_age enforce realistic boundaries on the simulated ages.
Rounding Event Ages: Rounding with round(event_ages) groups individuals by integer age for the summaries in the variant_summary data frame.
Code
# Parameters for the Weibull distribution
shape_parameter <- shape_parameter_ukb # Shape parameter fitted from the UKB data
#scale_base <- 55 # Base scale parameter around which to center the disease onset ages

# Adjust the scale parameter of the Weibull distribution based on the genetic risk score:
# higher genetic risk scores lead to a lower scale parameter (earlier onset).
# grm determines how strongly the risk score affects the scale
grm = 5
scale_parameter <- scale_parameter_ukb - grm * genetic_risk_score_normalized # Adjust the grm multiplier as needed

# Ensure that the scale parameter does not go negative or too low, which could cause unrealistic ages
#scale_parameter[scale_parameter < 1] <- 1

# Simulate the disease onset ages
event_ages <- rweibull(n, shape = shape_parameter, scale = scale_parameter)

# Realistic age range used for plotting
min_age <- 20
max_age <- 100

variant_summary <- data.frame(
  event_age = round(event_ages),        # Round event ages to the nearest whole number for grouping
  rare_burden = rare_load,              # Burden of rare variants
  common_burden = common_load,          # Burden of common variants
  score = genetic_risk_score_normalized # Normalized genetic risk score
)
#event_ages[event_ages > max_age] <- max_age

# Visualize the distribution of simulated event ages
ea <- data.table(event_ages)
ggplot(ea[event_ages < 100 & event_ages > 20, ], aes(x = event_ages)) +
  geom_density(fill = "#69b3a2") +
  theme_classic() +
  labs(x = "Genetically Simulated Event Ages", y = "Density")
We're not completely satisfied, because the resulting distribution is more symmetric than we might hope, but let's proceed with aggregate analyses.
Code
# Create a data frame with the event age, counts, and variant burdens for each individual,
# then aggregate the counts and burdens by event age
summary_by_age <- variant_summary %>%
  group_by(event_age) %>%
  summarize(
    n = length(rare_burden),
    rare_burden = mean(rare_burden),
    common_burden = mean(common_burden),
    mean_score = mean(score)
  )

# Rename the columns for clarity
names(summary_by_age) <- c("Age", "Event Count", "Rare Burden", "Common Burden", "Score")

# Print the summary table
#print(summary_by_age, n = 60)

m = melt(summary_by_age, id.vars = "Age")
ggplot(m, aes(Age, value, fill = variable)) +
  geom_bar(stat = "identity") +
  facet_wrap(~variable, scales = "free") +
  theme_classic()
Heritability-like estimates
We take the correlation between the normalized score and the binary phenotype of having had an MI by age $N$.
The squared correlation, $R^2$, is effectively the observed-scale narrow-sense heritability, although this aggregates individual-level data into a population-level statistic.
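In other words, for each age cutoff $a$ we compute (as a heuristic, treating the normalized score as the sole genetic predictor)

$$
h^2_{\text{obs}}(a) \approx \mathrm{cor}\!\left(Z,\ \mathbf{1}\{T < a\}\right)^2,
$$

where $Z$ is the normalized genetic risk score (or the rare/common load) and $T$ the simulated onset age; the code below plots the correlation itself, which can be squared to obtain this quantity.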
We show that
1) While the overall genetic heritability of 'ever having had' an event increases with age, the rare-variant heritability decreases with time (as the variation becomes increasingly explained by common effects), even though late events often have non-genomic causes.
Code
# Calculate correlations between each load/score and event-by-age status for ages 40 to 70
ages <- 40:70
cor_data <- data.frame(Age = integer(0), Correlation = numeric(0), Type = character(0))

for (age in ages) {
  # Correlation (r) between each predictor and the indicator of an event before this age;
  # squaring these gives the R2 / heritability-like quantity discussed above
  cor_genetic <- cor(genetic_risk_score_normalized, ifelse(event_ages < age, 1, 0))
  cor_rare    <- cor(rare_load, ifelse(event_ages < age, 1, 0))
  cor_common  <- cor(common_load, ifelse(event_ages < age, 1, 0))
  cor_data <- rbind(cor_data, data.frame(Age = age, Correlation = cor_genetic, Type = "Genetic Risk"))
  cor_data <- rbind(cor_data, data.frame(Age = age, Correlation = cor_rare, Type = "Rare Load"))
  cor_data <- rbind(cor_data, data.frame(Age = age, Correlation = cor_common, Type = "Common Load"))
}

# Convert 'Type' to a factor for coloring
cor_data$Type <- factor(cor_data$Type, levels = c("Genetic Risk", "Rare Load", "Common Load"))

# Plot using ggplot2
ggplot(cor_data, aes(x = Age, y = Correlation, color = Type)) +
  geom_point() +
  geom_line() +
  scale_color_manual(values = c("Genetic Risk" = "black", "Rare Load" = "red", "Common Load" = "blue")) +
  labs(x = "Age", y = "Correlation (r)", title = "Correlation of Load and Genetic Risk with Event Age") +
  theme_minimal() +
  theme(legend.title = element_blank())
Introducing non-CAD competing events
In the real (empirical) data, we observe a plateau in event ages.
This plateau could indicate that there are competing risks present that affect the distribution of event ages. In survival analysis, competing risks occur when individuals are at risk of several different events and the occurrence of one event prevents the occurrence of the other event. This is common in older populations where the risk of death from non-CAD causes increases with age.
Competing Risks Model: Instead of using a single Weibull distribution, consider using a competing risks framework to simulate the time to CAD event and the time to other events separately. The cmprsk package in R can handle competing risks.
Subdistribution Hazard: Instead of focusing on the cause-specific hazard, which only considers the hazard for the event of interest, look at the subdistribution hazard, which accounts for the fact that some individuals will experience competing events. Fine-Gray subdistribution hazard models can be fit with cmprsk::crr(), or with the survival package by building finegray() weights and fitting a weighted coxph() (a minimal sketch follows this list).
Plateau Incorporation: Adjust your simulation to incorporate a plateau. For instance, you could simulate two separate age distributions: one for the initial rise and peak and another for the plateau, and then combine them.
Truncation or Censoring: You could also consider truncating the simulated distribution at an age where the plateau begins or treat the plateau as censoring due to competing risks.
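For reference, here is a minimal, hypothetical sketch of the Fine-Gray approach using the survival package. It assumes a data frame d with a follow-up time column time, a factor status whose first level is censoring ("censor") followed by the causes "CAD" and "other", and a covariate score; none of these objects exist in the simulation above, so the names would need to be adapted to the combined event times generated below.
Code
library(survival)

# Hypothetical example data: time to first event with its cause, plus the genetic risk score.
# In practice, replace these simulated columns with the combined CAD / non-CAD times below.
set.seed(1)
d <- data.frame(
  time   = rweibull(1000, shape = 5, scale = 75),
  status = factor(sample(c("censor", "CAD", "other"), 1000, replace = TRUE),
                  levels = c("censor", "CAD", "other")),
  score  = rnorm(1000)
)

# Expand the data into Fine-Gray weighted counting-process form for the CAD endpoint
fg <- finegray(Surv(time, status) ~ ., data = d, etype = "CAD")

# Fit the subdistribution hazard model for CAD using the Fine-Gray case weights
fg_fit <- coxph(Surv(fgstart, fgstop, fgstatus) ~ score, weight = fgwt, data = fg)
summary(fg_fit)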
To refine the simulation, we take the following steps:
Simulate the non-CAD death times using another survival distribution or a constant rate of occurrence.
Generate CAD event times with the estimated Weibull parameters.
Combine the two sets of times, taking the minimum time for each individual as the observed time and noting the cause of the event.
Analyze the combined dataset considering the competing events.
Code
library(purrr)

n <- 10000                    # The number of individuals to simulate
mean_non_cad_death_age <- 80  # The average age at non-CAD death
min_age_threshold <- 40       # The minimum realistic age for CAD onset

df_pheno = readRDS("~/Library/CloudStorage/Dropbox-Personal/pheno_dir/output/df_final_withddeath.rds")

# Fit a Weibull model to the observed non-CAD death ages
non_cad_death_ages <- na.omit(df_pheno$Death_censor_age[df_pheno$Cad_0_Any == 1])
fit_non_cad <- survreg(Surv(non_cad_death_ages) ~ 1, dist = 'weibull')

# Extract the shape and scale parameters
shape_non_cad <- 1 / fit_non_cad$scale
scale_non_cad <- exp(coef(fit_non_cad))

# Simulate non-CAD death times
simulated_non_cad_death_times <- data.frame(rweibull(n, shape = shape_non_cad, scale = scale_non_cad))
simulated_non_cad_death_times$type = "Non Cad Event Times"
names(simulated_non_cad_death_times) = c("Times", "Type")

# Plot the histogram of the simulated non-CAD death times to compare with the empirical histogram
# hist(simulated_non_cad_death_times, breaks = 100, main = "Simulated Non-CAD Death Times")

# Take the minimum of the CAD and non-CAD times as the observed event time
combined_event_times <- data.frame(pmin(event_ages, simulated_non_cad_death_times$Times))
combined_event_times$Type = "Minimum Time"
names(combined_event_times) = c("Times", "Type")

# Exclude unrealistic early deaths by setting a minimum age threshold
combined_event_times$Times[combined_event_times$Times < min_age_threshold] = min_age_threshold

r = rbind(combined_event_times, simulated_non_cad_death_times)
event_age2 = data.frame(event_ages)
event_age2$Type = "Cad"
names(event_age2) = c("Times", "Type")
r = rbind(r, event_age2)

# Plot the distribution of the combined event times with the threshold
ggplot(r, aes(x = Times, fill = as.factor(Type))) +
  geom_density(alpha = 0.8) +
  labs(x = "Combined Event Times", y = "Density", fill = "Event Type") +
  theme_classic()
So we can see that, as expected, the common-variant heritability increases over time while the rare-variant heritability decreases.