Chapter 2 Readings

2.1 DIMM-SC

A Dirichlet mixture model for clustering droplet-based single cell transcriptomic data

2.1.1 Summary

  • Droplet-based technologies utilize Unique Molecular Identifiers (UMIs) to tag the 3’ end of each transcript, reducing PCR amplification bias
  • How to identify cell subtypes from heterogeneous data in RNA-seq?
  • Most methods are developed for continuous data rather than UMI data
  • DIMM-SC models within-cluster and between-cluster variance of UMI count data
    • Cluster uncertainty for each cell
  • Compared work to CellTree and Seurat
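DIMM-SC itself fits cluster-specific Dirichlet priors via an EM-type algorithm; the per-cell cluster-uncertainty idea can be sketched under simplifying assumptions (known cluster gene-expression profiles and mixing weights, all numbers illustrative):

```python
import numpy as np

def cluster_posteriors(counts, gene_probs, weights):
    """Posterior cluster-membership probabilities for one cell's UMI counts.

    counts     : (G,) UMI counts for one cell
    gene_probs : (K, G) per-cluster gene expression probabilities (rows sum to 1)
    weights    : (K,) cluster mixing proportions
    """
    # Multinomial log-likelihood up to a constant that cancels across clusters
    log_lik = counts @ np.log(gene_probs).T            # (K,)
    log_post = np.log(weights) + log_lik
    log_post -= log_post.max()                         # stabilize before exp
    post = np.exp(log_post)
    return post / post.sum()

# Toy example: 2 clusters over 3 genes
gene_probs = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.2, 0.7]])
weights = np.array([0.5, 0.5])
cell = np.array([8, 2, 0])                             # counts dominated by gene 1
print(cluster_posteriors(cell, gene_probs, weights))   # cluster 0 heavily favored
```

The returned vector is exactly the "cluster uncertainty for each cell" noted above: a soft assignment rather than a hard label.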

2.2 Bayesian Hierarchical Models and GLM

2.2.1 1. Introduction

  • Challenges associated with analyzing high dimensional molecular and clinical data
    1. How to select predictive factors (among many candidates)
    2. Building a high-dimensional predictive model w/ accurate estimation of the parameters
    3. Integration of biological information in a meaningful manner
  • Hierarchical modeling is commonly used to analyze large-scale/structured data
  • They developed Bayesian hierarchical models for such data, through BhGLM

2.2.2 2. Models and algorithms

  • Bayesian: \(p(\beta, \theta | y, X) = C\, p(y|\beta, \theta, X)\, p(\beta, \theta)\), where \(C\) is a normalizing constant
  • BhGLM provides uniform priors on \(\theta\), and four types of informative priors on \(\beta\)
  • The first two priors are double exponential and Student-t:
    • \(\beta_j \sim \textsf{de}(0, \theta s_j)\), and \(\beta_j \sim t_v(0, \theta s_j)\), where a smaller scale \(s_j\) induces stronger shrinkage on \(\beta_j\)
    • The Student-\(t\) distribution includes the normal and Cauchy distributions as special cases
    • The last two types of priors are spike-and-slab mixture double-exponential and student-\(t\) distributions:
      • \(\beta_j \sim \textsf{de}(0, (1 - \gamma_j)s_0 + \gamma_j s_1)\), and \(\beta_j \sim t_v(0, (1 - \gamma_j)s_0 + \gamma_j s_1)\), where \(\gamma_j\) is an indicator variable selecting the spike scale \(s_0\) or the slab scale \(s_1\)
  • BhGLM package estimates parameters by maximizing the log joint posterior density: \(\textsf{log}[p(\beta, \theta | y, X)]\)
  • Functions in the package provide point estimates and p-values
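Under the double-exponential prior, maximizing the log joint posterior is equivalent to L1 (lasso-type) penalized regression. A one-coefficient sketch of that equivalence (not the package's actual algorithm; the scale `s` and noise variance `sigma2` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)          # true beta = 0.5

def neg_log_posterior(beta, s=0.1, sigma2=1.0):
    # Gaussian likelihood + double-exponential (Laplace) prior de(0, s):
    # -log p(beta | y, x), up to additive constants
    resid = y - beta * x
    return resid @ resid / (2 * sigma2) + abs(beta) / s

# MAP estimate by grid search over a fine grid
grid = np.linspace(-1, 1, 2001)
vals = [neg_log_posterior(b) for b in grid]
beta_map = grid[int(np.argmin(vals))]
print(beta_map)   # shrunk toward zero relative to the OLS estimate
```

The `abs(beta) / s` term is exactly an L1 penalty with weight `1/s`, which is why a smaller scale induces stronger shrinkage.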

2.2.3 3. Features

  • The BhGLM package provides the modeling functions bglm, bpolr, bcoxph, and bmlasso
  • covariates allows users to transform predictors onto a common scale and fill in missing data
  • summary.bh summarizes output from modeling functions
  • predict.bh predicts from a model
  • cv.bh performs k-fold cross validation
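cv.bh is an R function; the k-fold scheme it implements can be sketched generically in Python (function names and the least-squares model below are illustrative, not the package's API):

```python
import numpy as np

def kfold_cv_mse(X, y, fit, predict, k=5, seed=0):
    """Generic k-fold cross-validation returning the mean held-out MSE."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        pred = predict(model, X[test])
        errs.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(errs))

# Usage with a plain least-squares fit standing in for a Bayesian GLM
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda beta, X: X @ beta
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.0, -1.0]) + rng.normal(size=100)
print(kfold_cv_mse(X, y, fit, predict))   # near the noise variance of 1
```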

2.3 kTWAS: Integrating kernel-machine with transcriptome-wide association studies improves statistical power and reveals novel genes

2.3.1 Introduction

  • TWAS is an important technique for associating genetic variants and phenotypic changes

  • Often conducted in two steps

    1. Model is trained to predict gene expression from genotypes (using a reference dataset w/ paired expression and genotype data)
      • Techniques such as ElasticNet, Bayesian sparse linear mixed models, deep autoencoder models, and deep learning regression models are used
    2. Model is used to predict expression activity from the main dataset for genotype–phenotype association mapping (GWAS)
      • Contains genotype and phenotype data w/o expression data
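The two steps above can be sketched with simulated data (plain least squares stands in for the elastic-net expression model; all sample sizes and effect sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Step 1: reference panel with paired genotype + expression data.
n_ref, n_gwas, p = 300, 500, 10
G_ref = rng.binomial(2, 0.3, size=(n_ref, p)).astype(float)    # 0/1/2 genotypes
beta_true = np.array([0.8, -0.5] + [0.0] * (p - 2))
expr = G_ref @ beta_true + rng.normal(size=n_ref)
beta_hat = np.linalg.lstsq(G_ref, expr, rcond=None)[0]         # expression model

# Step 2: GWAS cohort with genotypes + phenotype but no expression data.
G_gwas = rng.binomial(2, 0.3, size=(n_gwas, p)).astype(float)
grex = G_gwas @ beta_hat                       # predicted (imputed) expression
pheno = 0.8 * (G_gwas @ beta_true) + rng.normal(size=n_gwas)

# Association: correlate imputed expression with the phenotype
r = np.corrcoef(grex, pheno)[0, 1]
print(r)   # clearly positive when the gene drives the phenotype
```

In a real analysis, step 2 would report a regression p-value per gene rather than a raw correlation.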
  • Meta-analysis methods for conducting TWAS using summary statistics from relevant GWAS have also been created

  • Transcriptomic data may be used to select for genetic variants critical to gene expression, such as eQTLs, improving the quality of downstream GWAS

  • TWAS collapses many genetic variants into a smaller number of meaningful linear combinations (similar to PCA)

  • Predicted gene expressions capture genetic component of expressions more precisely than actual expressions (which includes experimental artifacts/environmental factors)

  • TWAS is being used more than the sequence kernel association test (SKAT)

    • The kernel models the aggregated effects of many genetic variants and captures genetic interactions within a local region (TWAS uses expression data; kernel methods use genotype data)
  • Machine learning analogy

    • TWAS: akin to feature selection/pruning
    • Kernel: Feature modeling
  • Lack of comparison between the two in the literature

  • kTWAS aims to use expression data via TWAS feature selection alongside a kernel-based test (robust to non-linear genetic architecture)

  • kTWAS tends to outperform alternatives w/significant margins

2.3.2 Materials and methods

2.3.2.1 Mathematical details of SKAT, PrediXcan, and kTWAS

  • SKAT score test: \(Q = y'Ky\)

    • Where \(y\) is a vector of phenotype values
    • \(K\) is a kernel calculated from the centralized genotype matrix \(G\)
      • Where \(G_{ij}\) is the genotype at the \(j\)-th genomic position in the GWAS focal region for the \(i\)-th individual
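The score statistic is cheap to form once genotypes are centered: under the linear kernel \(K = GG'\), \(Q = y'Ky = \lVert G'y \rVert^2\). A minimal sketch with simulated data (sizes and allele frequency are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 15
G = rng.binomial(2, 0.25, size=(n, p)).astype(float)
G -= G.mean(axis=0)                 # centralize the genotype matrix
y = rng.normal(size=n)              # phenotype (pure noise here, i.e. the null)

K = G @ G.T                         # linear kernel from centralized genotypes
Q = y @ K @ y                       # SKAT score statistic Q = y'Ky
print(Q)
```

Because \(K\) is positive semi-definite, \(Q\) is always non-negative; its null distribution is a mixture of chi-squared distributions.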
  • Paper only focuses on common variants to be more comparable to TWAS (other extensions of SKAT focus on rare variants, removal of cofactors, etc)

  • Earliest TWAS method: PrediXcan

  • A linear model is trained to predict genetically regulated gene expression using a reference panel

  • Uses ElasticNet to train the regression parameters

  • GReX (genetically regulated expression) values are estimated for genotypes from the GWAS dataset

    • \(\hat{Z} = \sum \hat{\beta}_i G_i\)
  • Estimated values are associated to the phenotype

    • \(Y \sim \hat{Z} + \epsilon\)
  • kTWAS method

    • First extract \(\beta_i\) from the ElasticNet model: \(Z \sim \sum \beta_i G_i + \epsilon\)
    • Prepare kernel \(K_W\) for use in SKAT, where \(K_W\) is weighted according to the contribution of each variant in the ElasticNet model: \(K_W = GWG'\)
    • Conduct Q score test from SKAT using TWAS-informed kernel \(K_W\), testing the hypothesis that the variance components explained by local genetic region are uniformly zero (Q follows a mixture of chi-squared distributions under the null hypothesis): \(Q = y'K_Wy\)
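The three kTWAS steps can be sketched as follows. The coefficients and the choice of squared coefficients as kernel weights are assumptions for illustration; the paper defines the exact form of \(W\):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 15
G = rng.binomial(2, 0.25, size=(n, p)).astype(float)
G -= G.mean(axis=0)                 # centralized genotype matrix

# Step 1: hypothetical ElasticNet coefficients, only 3 variants contribute
beta = np.zeros(p)
beta[[0, 4, 9]] = [0.9, -0.6, 0.3]

# Step 2: TWAS-informed kernel, weighting variants by their contribution
W = np.diag(beta ** 2)              # squared-coefficient weighting (assumption)
K_W = G @ W @ G.T                   # K_W = G W G'

# Step 3: SKAT-style score test with the weighted kernel
y = G @ beta + rng.normal(size=n)   # phenotype driven by the weighted variants
Q = y @ K_W @ y                     # kTWAS statistic Q = y' K_W y
print(Q)
```

Since \(W\) is diagonal and non-negative, \(K_W\) remains positive semi-definite, so \(Q\) keeps the mixture-of-chi-squared null distribution from SKAT while the weights channel the TWAS feature selection into the test.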