Chapter 2 Readings

2.1 DIMM-SC

A Dirichlet mixture model for clustering droplet-based single cell transcriptomic data

2.1.1 Summary

  • Droplet-based technologies utilize Unique Molecular Identifiers (UMIs) to tag the 3’ end of each transcript, reducing PCR amplification bias
  • How to identify cell subtypes from heterogeneous data in RNA-seq?
  • Most methods are developed for continuous data rather than UMI data
  • DIMM-SC models within-cluster and between-cluster variance of UMI count data
    • Cluster uncertainty for each cell
  • Compared work to CellTree and Seurat
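DIMM-SC itself fits cluster-specific Dirichlet priors via an EM-type algorithm; the per-cell cluster-uncertainty idea can be sketched under simplifying assumptions (known cluster gene-expression profiles and mixing weights, all numbers illustrative):

```python
import numpy as np

def cluster_posteriors(counts, gene_probs, weights):
    """Posterior cluster-membership probabilities for one cell's UMI counts.

    counts     : (G,) UMI counts for one cell
    gene_probs : (K, G) per-cluster gene expression probabilities (rows sum to 1)
    weights    : (K,) cluster mixing proportions
    """
    # Multinomial log-likelihood up to a constant that cancels across clusters
    log_lik = counts @ np.log(gene_probs).T            # (K,)
    log_post = np.log(weights) + log_lik
    log_post -= log_post.max()                         # stabilize before exp
    post = np.exp(log_post)
    return post / post.sum()

# Toy example: 2 clusters over 3 genes
gene_probs = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.2, 0.7]])
weights = np.array([0.5, 0.5])
cell = np.array([8, 2, 0])                             # counts dominated by gene 1
print(cluster_posteriors(cell, gene_probs, weights))   # cluster 0 heavily favored
```

The returned vector is exactly the "cluster uncertainty for each cell" noted above: a soft assignment rather than a hard label.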

2.2 Bayesian Hierarchical Models and GLM

2.2.1 1. Introduction

  • Challenges associated with analyzing high dimensional molecular and clinical data
    1. How to select predictive factors (among many candidates)
    2. Building a high-dimensional predictive model w/ accurate estimation of the parameters
    3. Integration of biological information in a meaningful manner
  • Hierarchical modeling is commonly used to analyze large-scale/structured data
  • They developed Bayesian hierarchical models for such data, through BhGLM

2.2.2 2. Models and algorithms

  • Bayesian: \(p(\beta, \theta | y, X) = C\, p(y|\beta, \theta, X)\, p(\beta, \theta)\), where \(C\) is a normalizing constant
  • BhGLM provides uniform priors on \(\theta\), and four types of informative priors on \(\beta\)
  • The first two priors are double exponential and Student-t:
    • \(\beta_j \sim \textsf{de}(0, \theta s_j)\), and \(\beta_j \sim t_v(0, \theta s_j)\), where a smaller scale \(s_j\) induces stronger shrinkage on \(\beta_j\)
    • The Student-\(t\) distribution includes the normal and Cauchy distributions as special cases
    • The last two types of priors are spike-and-slab mixture double-exponential and student-\(t\) distributions:
      • \(\beta_j \sim \textsf{de}(0, (1 - \gamma_j)s_0 + \gamma_j s_1)\), and \(\beta_j \sim t_v(0, (1 - \gamma_j)s_0 + \gamma_j s_1)\), where \(\gamma_j\) is an indicator variable selecting the spike scale \(s_0\) or the slab scale \(s_1\)
  • BhGLM package estimates parameters by maximizing the log joint posterior density: \(\textsf{log}[p(\beta, \theta | y, X)]\)
  • Functions in the package provide point estimates and p-values
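Under the double-exponential prior, maximizing the log joint posterior is equivalent to L1 (lasso-type) penalized regression. A one-coefficient sketch of that equivalence (not the package's actual algorithm; the scale `s` and noise variance `sigma2` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)          # true beta = 0.5

def neg_log_posterior(beta, s=0.1, sigma2=1.0):
    # Gaussian likelihood + double-exponential (Laplace) prior de(0, s):
    # -log p(beta | y, x), up to additive constants
    resid = y - beta * x
    return resid @ resid / (2 * sigma2) + abs(beta) / s

# MAP estimate by grid search over a fine grid
grid = np.linspace(-1, 1, 2001)
vals = [neg_log_posterior(b) for b in grid]
beta_map = grid[int(np.argmin(vals))]
print(beta_map)   # shrunk toward zero relative to the OLS estimate
```

The `abs(beta) / s` term is exactly an L1 penalty with weight `1/s`, which is why a smaller scale induces stronger shrinkage.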

2.2.3 3. Features

  • The BhGLM package provides the modeling functions bglm, bpolr, bcoxph, and bmlasso
  • covariates allows users to transform predictors onto a common scale and fill in missing data
  • summary.bh summarizes output from modeling functions
  • predict.bh predicts from a model
  • cv.bh performs k-fold cross validation
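cv.bh is an R function; the k-fold scheme it implements can be sketched generically in Python (function names and the least-squares model below are illustrative, not the package's API):

```python
import numpy as np

def kfold_cv_mse(X, y, fit, predict, k=5, seed=0):
    """Generic k-fold cross-validation returning the mean held-out MSE."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        pred = predict(model, X[test])
        errs.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(errs))

# Usage with a plain least-squares fit standing in for a Bayesian GLM
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda beta, X: X @ beta
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.0, -1.0]) + rng.normal(size=100)
print(kfold_cv_mse(X, y, fit, predict))   # near the noise variance of 1
```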

2.3 kTWAS: Integrating kernel-machine with transcriptome-wide association studies improves statistical power and reveals novel genes

2.3.1 Introduction

  • TWAS is an important technique for associating genetic variants and phenotypic changes

  • Often conducted in two steps

    1. Model is trained to predict gene expression from genotypes (using a reference dataset w/ paired expression and genotype data)
      • Techniques such as ElasticNet, Bayesian sparse linear mixed models, deep autoencoder models, and deep learning regression models are used
    2. Model is used to predict expression activity from the main dataset for genotype–phenotype association mapping (GWAS)
      • Contains genotype and phenotype data w/o expression data
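The two steps above can be sketched with simulated data (plain least squares stands in for the elastic-net expression model; all sample sizes and effect sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Step 1: reference panel with paired genotype + expression data.
n_ref, n_gwas, p = 300, 500, 10
G_ref = rng.binomial(2, 0.3, size=(n_ref, p)).astype(float)    # 0/1/2 genotypes
beta_true = np.array([0.8, -0.5] + [0.0] * (p - 2))
expr = G_ref @ beta_true + rng.normal(size=n_ref)
beta_hat = np.linalg.lstsq(G_ref, expr, rcond=None)[0]         # expression model

# Step 2: GWAS cohort with genotypes + phenotype but no expression data.
G_gwas = rng.binomial(2, 0.3, size=(n_gwas, p)).astype(float)
grex = G_gwas @ beta_hat                       # predicted (imputed) expression
pheno = 0.8 * (G_gwas @ beta_true) + rng.normal(size=n_gwas)

# Association: correlate imputed expression with the phenotype
r = np.corrcoef(grex, pheno)[0, 1]
print(r)   # clearly positive when the gene drives the phenotype
```

In a real analysis, step 2 would report a regression p-value per gene rather than a raw correlation.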
  • Meta-analysis methods for conducting TWAS using summary statistics from relevant GWAS have also been created

  • Transcriptomic data may be used to select for genetic variants critical to gene expression, such as eQTLs, improving the quality of downstream GWAS

  • TWAS collapses many genetic variants into a smaller number of meaningful linear combinations (similar to PCA)

  • Predicted gene expressions capture genetic component of expressions more precisely than actual expressions (which includes experimental artifacts/environmental factors)

  • TWAS is being used more than the sequence kernel association test (SKAT)

    • The kernel models the aggregated effects of many genetic variants and captures genetic interactions within a local region (TWAS uses expression data; kernel methods use genotype data)
  • Machine learning analogy

    • TWAS: akin to feature selection/pruning
    • Kernel: Feature modeling
  • Lack of comparison between the two in the literature

  • kTWAS aims to use expression data via TWAS feature selection alongside a kernel-based test (robust to non-linear genetic architecture)

  • kTWAS tends to outperform alternatives w/significant margins

2.3.2 Materials and methods

2.3.2.1 Mathematical details of SKAT, PrediXcan, and kTWAS

  • SKAT score test: \(Q = y'Ky\)

    • Where \(y\) is a vector of phenotype values
    • \(K\) is a kernel calculated from the centralized genotype matrix \(G\)
      • Where \(G_{ij}\) is the genotype at the \(j\)-th genomic position in the GWAS focal region for the \(i\)-th individual
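The score statistic is cheap to form once genotypes are centered: under the linear kernel \(K = GG'\), \(Q = y'Ky = \lVert G'y \rVert^2\). A minimal sketch with simulated data (sizes and allele frequency are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 15
G = rng.binomial(2, 0.25, size=(n, p)).astype(float)
G -= G.mean(axis=0)                 # centralize the genotype matrix
y = rng.normal(size=n)              # phenotype (pure noise here, i.e. the null)

K = G @ G.T                         # linear kernel from centralized genotypes
Q = y @ K @ y                       # SKAT score statistic Q = y'Ky
print(Q)
```

Because \(K\) is positive semi-definite, \(Q\) is always non-negative; its null distribution is a mixture of chi-squared distributions.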
  • Paper only focuses on common variants to be more comparable to TWAS (other extensions of SKAT focus on rare variants, removal of cofactors, etc)

  • Earliest TWAS method: PrediXcan

  • A linear model is trained to predict genetically regulated gene expression using a reference panel

  • Uses ElasticNet to train the regression parameters

  • GReX (genetically regulated expression) values are estimated for genotypes from the GWAS dataset

    • \(\hat{Z} = \sum \hat{\beta}_i G_i\)
  • Estimated values are associated to the phenotype

    • \(Y \sim \hat{Z} + \epsilon\)
  • kTWAS method

    • First extract \(\beta_i\) from the ElasticNet model: \(Z \sim \sum \beta_i G_i + \epsilon\)
    • Prepare kernel \(K_W\) for use in SKAT, where \(K_W\) is weighted according to the contribution of each variant in the ElasticNet model: \(K_W = GWG'\)
    • Conduct Q score test from SKAT using TWAS-informed kernel \(K_W\), testing the hypothesis that the variance components explained by local genetic region are uniformly zero (Q follows a mixture of chi-squared distributions under the null hypothesis): \(Q = y'K_Wy\)
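The three kTWAS steps can be sketched as follows. The coefficients and the choice of squared coefficients as kernel weights are assumptions for illustration; the paper defines the exact form of \(W\):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 15
G = rng.binomial(2, 0.25, size=(n, p)).astype(float)
G -= G.mean(axis=0)                 # centralized genotype matrix

# Step 1: hypothetical ElasticNet coefficients, only 3 variants contribute
beta = np.zeros(p)
beta[[0, 4, 9]] = [0.9, -0.6, 0.3]

# Step 2: TWAS-informed kernel, weighting variants by their contribution
W = np.diag(beta ** 2)              # squared-coefficient weighting (assumption)
K_W = G @ W @ G.T                   # K_W = G W G'

# Step 3: SKAT-style score test with the weighted kernel
y = G @ beta + rng.normal(size=n)   # phenotype driven by the weighted variants
Q = y @ K_W @ y                     # kTWAS statistic Q = y' K_W y
print(Q)
```

Since \(W\) is diagonal and non-negative, \(K_W\) remains positive semi-definite, so \(Q\) keeps the mixture-of-chi-squared null distribution from SKAT while the weights channel the TWAS feature selection into the test.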