Chapter 5 Lectures and Webinars
5.1 Transethnic Genetics
5.1.1 Alicia Martin, PhD
- Genetic variation reflects human history
- European populations show much less genetic variation than African populations
- Genetic studies are increasingly powerful, as the number of individuals in GWAS increases exponentially
- But genetics also has a large diversity problem
- How do ancestry study biases (eurocentric) impact generalizability of knowledge?
- Preventing white supremacists from continuing to latch onto genetics as their tool
- Important considerations
- Causal variant effects are mostly shared across populations
- Allele frequencies, LD, environment, and other factors differ (more complex factors)
- Polygenic risk score
- Genetic prediction of an individual’s phenotype
- \(Y = \sum_{j=1}^m g_j\beta_j\)
- Sum the products of genotypes and effect size estimates from a GWAS across the genome
- Considerations
- Which SNPs to include?
- What weights to apply?
- LD
- P-value thresholds
- Methods: pruning, thresholding, Bayesian approaches
- Incorporation of polygenic risk score into clinical practice due to:
- Increased sample size
- Cheaper testing (<$100)
- Better integration with other clinical factors
- Yet, no discussion of ancestry!
- Causal effects are mostly shared across populations, but what about polygenic scores?
- Schizophrenia (SCZ) risk in European and Asian individuals shows few causal differences in spite of allele frequency differences (very few variants have a different OR, or both a different MAF and a different OR)
- Human demographic history impacts genetic risk prediction across diverse populations
- Genetic prediction accuracy decays with increasing genetic distance between discovery and target data
- Polygenic scores may differ across populations, but these differences reflect bias and are not biologically meaningful
- Neutral human evolution is sufficient to explain differences
- Natural selection is not necessary
- Large disparities in accuracy
- Improving risk prediction accuracy in schizophrenia:
- Better result: ancestry-matched GWAS
- Best result: meta-analysis combining diverse cohorts
- Improving risk prediction accuracy in schizophrenia:
- Genetic studies generalize better across populations with diverse data
- Use of BioBank data
- Prediction accuracy outliers highlight known biology especially relevant to populations
- Outliers correspond to genes known to confer hemochromatosis, sickle cell disease, etc., which are quite relevant to African populations
- Complexities of comparing across populations/cohorts
- Where is the data coming from: hospital vs. general population?
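
The polygenic score formula \(Y = \sum_{j=1}^m g_j\beta_j\) and the P-value thresholding consideration from these notes can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the lecture: the SNP identifiers, effect sizes, and P-values are made up, the function name `polygenic_score` is hypothetical, and LD pruning is ignored entirely.

```python
from typing import Dict, Tuple

# Hypothetical GWAS summary statistics: SNP id -> (effect size beta, p-value).
# All values are invented for illustration.
gwas: Dict[str, Tuple[float, float]] = {
    "rs1": (0.20, 1e-9),
    "rs2": (-0.10, 3e-6),
    "rs3": (0.05, 0.20),
}


def polygenic_score(
    genotypes: Dict[str, int],
    summary_stats: Dict[str, Tuple[float, float]],
    p_threshold: float = 1.0,
) -> float:
    """Sum g_j * beta_j over SNPs whose GWAS p-value passes the threshold."""
    score = 0.0
    for snp, (beta, p_value) in summary_stats.items():
        if p_value <= p_threshold and snp in genotypes:
            # g_j is the effect-allele count for this individual (0, 1, or 2).
            score += genotypes[snp] * beta
    return score


# One hypothetical individual's effect-allele counts.
person = {"rs1": 2, "rs2": 1, "rs3": 0}

score_all = polygenic_score(person, gwas)              # every SNP included
score_strict = polygenic_score(person, gwas, 5e-8)     # genome-wide significant SNPs only
```

Changing `p_threshold` changes which SNPs contribute, which is one reason the choice of threshold appears in the list of considerations.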
5.1.2 Omer Weissbrod
- Polygenic risk scores can identify individuals at risk
- Clinical use of current polygenic risk scores may exacerbate health disparities
- Why do polygenic scores lose accuracy across populations?
- Discovery vs target populations
- LD differences, when using non-causal SNPs to predict (for one population)
- MAF differences, even when using causal SNPs (common vs rare SNPs)
- Different genetic architectures (trans-ethnic genetic correlation; GxE, GxG, incorrect coefficients, or incorrect modeling)
- Different heritabilities (population specific environmental factor, genetics becoming less important in determination of trait)
- Strategies to mitigate loss of PGS accuracy across populations
- LD: predict using causal SNPs (fine mapping)
- MAF differences/different additive genetic arch: combine data from both populations
- Different heritabilities: nothing :(
- How do different factors contribute to loss of PGS accuracy?
- Assumptions:
- \(M_c\) causal SNPs
- \(M_T\) predictor SNPs
- Predictor SNPs are approximately not in LD
- LD+MAF differences explain most PGS-accuracy loss in African populations, but not in South/East-Asian populations
5.1.3 LD Differences
- There are substantial LD differences between African, East-Asian and European populations
- LD similarity decays exponentially with number of generations since population separation
In population genetics, linkage disequilibrium (LD) is the non-random association of alleles at different loci in a given population. Loci are said to be in linkage disequilibrium when the frequency of association of their different alleles is higher or lower than what would be expected if the loci were independent and associated randomly.[1]
Linkage disequilibrium is influenced by many factors, including selection, the rate of genetic recombination, mutation rate, genetic drift, the system of mating, population structure, and genetic linkage. As a result, the pattern of linkage disequilibrium in a genome is a powerful signal of the population genetic processes that are structuring it.
- Fine mapping can help address LD differences by identifying causal SNPs
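
The definition of LD above (allele association higher or lower than expected under independence) can be made concrete with the standard coefficients \(D = p_{AB} - p_A p_B\) and \(r^2 = D^2 / (p_A(1-p_A)\,p_B(1-p_B))\). The sketch below uses toy haplotypes invented for illustration, a hypothetical function name `ld_stats`, and assumes both loci are biallelic and polymorphic in the sample.

```python
from typing import List, Tuple


def ld_stats(haplotypes: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Return (D, r_squared) for two biallelic loci.

    Each haplotype is a pair (a, b) of 0/1 allele indicators at locus A and
    locus B. Assumes both loci are polymorphic, so the denominator is nonzero.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n  # joint frequency
    d = p_ab - p_a * p_b                      # zero when the loci are independent
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r_squared


# Toy samples: alleles mostly travel together vs. combine at random.
linked = [(1, 1)] * 4 + [(0, 0)] * 4 + [(1, 0), (0, 1)]
independent = [(1, 1), (1, 0), (0, 1), (0, 0)] * 2

d_linked, r2_linked = ld_stats(linked)
d_indep, r2_indep = ld_stats(independent)
```

In the `linked` sample the two alleles co-occur more often than their frequencies predict, so \(D\) and \(r^2\) are positive. In the `independent` sample both are zero.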
5.1.4 MAF Differences
- Most causal variants are private (population specific)
- MAF: minor allele frequency
- If a causal SNP exists in only one population, we can’t analyze it in another population
- Private variants explain up to 50% of heritability of many complex traits
- Low-frequency SNPs have larger effect sizes due to negative selection (population history models)
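
One way to see why MAF differences matter even for truly causal SNPs: under an additive model with Hardy-Weinberg equilibrium, a SNP with allele frequency \(p\) and effect \(\beta\) contributes variance \(2p(1-p)\beta^2\). The sketch below applies that textbook formula to made-up frequencies; the effect size and the function name `variance_explained` are hypothetical and do not come from the lecture.

```python
def variance_explained(allele_freq: float, beta: float) -> float:
    """Variance contributed by one additive SNP under Hardy-Weinberg equilibrium."""
    return 2.0 * allele_freq * (1.0 - allele_freq) * beta ** 2


beta = 0.3  # hypothetical causal effect, assumed identical in both populations

var_common = variance_explained(0.30, beta)   # SNP is common in population A
var_rare = variance_explained(0.01, beta)     # same SNP is rare in population B
var_private = variance_explained(0.0, beta)   # SNP is absent (private to the other population)

# How much more trait variance the SNP explains where it is common.
ratio = var_common / var_rare
```

With the same causal effect, the SNP explains roughly twenty times more variance where it is common, and nothing at all where it is absent.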
5.1.5 Different genetic architectures
- Causal effect sizes are strongly but imperfectly correlated across populations (around .8)
- Causal effect sizes are more population-specific in functionally-important regions
5.1.5.1 Different heritabilities
- Similar load of deleterious variants across populations suggests similar heritabilities
- Higher SNP heritability estimates in UK Biobank compared to Biobank Japan, possibly due to phenotypic heterogeneity
- Cross-cohort analysis finds cohort-specific ‘hidden’ heritability for educational attainment and reproductive behavior
- Substantial evidence for HD mediated by GxE interactions
5.2 Neurogenetics in Biowulf
5.2.1 Meta GWAS
- Meta-analysis of GWAS, typically using fixed-effects models
- Comparison of regression coeff. and std dev. between many studies
- Summary statistics combined using fixed effects
- Random effects used to account for study heterogeneity
- Manolio et al. 2009 Nature paper
- PD GWAS studies, alongside Biowulf CPU usage, have grown tremendously over time
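
Combining summary statistics with a fixed-effects model is commonly done by inverse-variance weighting, with Cochran's Q as a check on study heterogeneity. The sketch below assumes each study reports an effect estimate and its standard error for the same SNP. The numbers are invented, the function name `fixed_effect_meta` is hypothetical, and this is not necessarily the pipeline used in the talk.

```python
import math
from typing import List, Tuple


def fixed_effect_meta(
    betas: List[float], standard_errors: List[float]
) -> Tuple[float, float, float]:
    """Inverse-variance-weighted pooled estimate, its standard error, and Cochran's Q."""
    weights = [1.0 / se ** 2 for se in standard_errors]  # precise studies count more
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    # Cochran's Q: weighted squared deviations of each study from the pooled estimate.
    q = sum(w * (b - pooled_beta) ** 2 for w, b in zip(weights, betas))
    return pooled_beta, pooled_se, q


# Hypothetical per-study log odds ratios and standard errors for one SNP.
study_betas = [0.10, 0.20, 0.15]
study_ses = [0.10, 0.10, 0.05]

pooled_beta, pooled_se, q = fixed_effect_meta(study_betas, study_ses)
```

Under homogeneity, Q is approximately chi-squared with (number of studies - 1) degrees of freedom. A Q well above that suggests heterogeneity, which is where random-effects models come in.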
5.2.2 Basic workflow
- Previously published GWAS are combined with new and unpublished GWAS for risk loci discovery
- This in turn produces heritability and polygenic risk estimates (LDSC to quantify heritability, predictive models using polygenic risk scoring, etc.)
- Alongside gene-level analyses, gene-set analyses, and shared heritability analyses
- Takes large amounts of imputed genotype data
- Reference data from various public sources (Mendelian randomization, QTL datasets)
- Web services (pathways, servers)
- Standardized workflows (GitHub, notebooks, etc.)
- This all includes over 2.4 TB of genotype data (PD study)
5.2.3 Results
- 90 independent common risk factors for PD
- Manhattan plot demonstrates new and old genes as risk factors towards PD
- Many genes contain moderate effects for PD
- 38 risk factors identified were novel hits
- Gene-level analysis → summary of QTL Mendelian randomization
- Used to make functional inferences of effects (like an RCT, which gene is best to follow up with?)
- QTL-MR analyses nominate GRN under a large peak of Manhattan plot, a possible link to FTD
5.2.4 Population Scale Analyses
- PRS = Polygenic risk score
- Summary of aggregate independent genetic risk variants for a disease per individual
- Weighted by external GWAS effect estimates
- \(PRS = \sum_{p} \beta_p \, SNP_p\)
- Consider overall disease heritability, prevalence and sample sizes
- Can be combined with clinical and demographic data for better results
- PRS is a stronger predictive model than any single SNP (weighted aggregate SNPs are more powerful)
- Can apply this model to a clinical trial before recruitment to improve efficacy
- Diagnosis of Parkinson’s disease on the basis of clinical and genetic classification: a population-based modeling study
5.2.6 Heritability and polygenic risk estimates
- Ability to make a genetic predictor is based on possible heritability
- LDSC heritability ~215, including 26-36% of heritable risk via GWAS
- Up to 70% AUC at validation with PRS comprising > 1800 variants
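
For reading the AUC figure above: AUC can be interpreted as the probability that a randomly chosen case has a higher score than a randomly chosen control. The sketch below computes it by direct pairwise comparison on invented PRS values; the function name `auc` and the data are hypothetical and unrelated to the PD study.

```python
from typing import List


def auc(case_scores: List[float], control_scores: List[float]) -> float:
    """Probability that a random case outscores a random control (ties count 0.5)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Invented polygenic scores for four cases and four controls.
cases = [0.9, 0.4, 0.7, 0.2]
controls = [0.1, 0.5, 0.3, 0.6]

prs_auc = auc(cases, controls)
```

An AUC of 0.5 means the score does no better than chance at separating cases from controls, and 1.0 means perfect separation. In this toy data, cases win 11 of the 16 pairwise comparisons.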
5.2.7 Where are we now with risk locus discovery in PD
- The more loci we have, the more biological insight we gain
- Power calculations based on sub-significant variants from PRS analysis point to 99K cases
- Once alleles get very rare with moderate effect sizes, hard to build into PRS
- Larger reference data for prioritization and colocalization
- Better imputation panels
- Actively recruiting more diverse populations of PD cases and controls
- Would a genetic predictor work in more diverse populations?
5.2.8 Next Steps in NDD genetics
Topics | Tools |
---|---|
Predictors of progression: single outcomes like cognitive score; general progression trajectories |
Deeply phenotyped studies |
More data from diverse sources: huge number of collaborators |
Machine learning pipelines: supervised for prediction; unsupervised for disease subtyping |
Improved disease predictors | Hybrid cloud for cost-efficient collaboration |
General concerns | Public code, data, and resources |