Chapter 6 High-Dimensional Feature Assessment

A frequent task is to find features which differ with respect to one or more experimental factors. We illustrate this type of analysis using a gene expression experiment (\(p=15923\) genes) performed with \(12\) randomly selected mice from two strains. The features are the \(p=15923\) genes and the strain (A vs B) is the experimental factor.

A commonly used format for gene expression data is the ExpressionSet class from the Biobase package. The actual expressions are retrieved using the function exprs. Information on the phenotypes is obtained using pData and with fData we get more information on the genes (“features”).

We load the ExpressionSet.

esetmouse <- readRDS(file="data/esetmouse.rds")
class(esetmouse)
## [1] "ExpressionSet"
## attr(,"package")
## [1] "Biobase"
dim(esetmouse)
## Features  Samples 
##    15923       24

We can look at the expression values of the first sample and the first 6 genes.

exprs(esetmouse)[1:6,1]
## 1367452_at 1367453_at 1367454_at 1367455_at 1367456_at 1367457_at 
##  10.051651  10.163334  10.211724  10.334899  10.889349   9.666755

An overview on the phenotype data can be obtained using the following commands.

table(pData(esetmouse)$strain)
## 
##  A  B 
## 12 12

6.1 Gene-wise Two-sample Comparison

We are interested in comparing gene expression between the mice strains A and B.

x <- esetmouse$strain # strain information
y <- t(exprs(esetmouse)) # gene expressions matrix (columns refer to genes)

We start by visualizing the expression of gene \(j=11425\).

boxplot(y[,11425]~x)

This gene seems to be higher expressed in A. To nail down this observeration we can do a more formal hypothesis test. We build the ordinary t-statistic

\[\begin{align*} t_{j}&=\frac{\overline{y}_{j}^B-\overline{y}_{j}^A}{s_{j}\sqrt{\frac{1}{n_A}+\frac{1}{n_B}}}. \end{align*}\]

We can calculate the two-sided p-value

\[\begin{align*} q_j&=2\left(1-F(|t_j|,\nu=n_A+n_B-2)\right). \end{align*}\]

In R we can perform a two-sample t-test using the function t.test.

ttest <- t.test(y[,11425]~x,var.equal=TRUE)
ttest$statistic #tscore
##        t 
## 1.774198
ttest$p.value
## [1] 0.0898726

We obtain \(q_{11425}\)=0.09 and based on that we would not reject the null-hypothesis for this specific gene at the \(\alpha=0.05\) level. What about the other genes? We continue by repeating the analysis for all \(p=\) 15923 genes. We save the results in a data frame.

pvals <- apply(y,2,FUN=
                 function(y){
                   t.test(y~x,var.equal=TRUE)$p.value
                 })
tscore <- apply(y,2,FUN=
                  function(y){
                    t.test(y~x,var.equal=TRUE)$statistic
                  })
res.de <- data.frame(p.value=pvals,
                     t.score=tscore,
                     geneid=names(tscore))

Next we count the number of significant genes.

sum(res.de$p.value<0.05)
## [1] 2908

According to this analysis 2908 genes are differentially expressed between strains A and B. This is 18.3% of all genes. In the next section we will explain that this analysis misses an important point, namely it neglects the issue of multiple testing.

6.2 Multiple Testing

To illustrate the multiple testing problem we create an artificial gene expression data set where we are certain that none of the genes is differentially expressed.

set.seed(1)
p <- ncol(y)
n <- nrow(y)
ysim <- matrix(rnorm(n*p),n,p)

Now we repeat the gene-wise two-sample comparisons for the artificial data set.

pvals.sim <- apply(ysim,2,FUN=
                 function(y){
                   t.test(y~x,var.equal=TRUE)$p.value
                 })
tscore.sim <- apply(ysim,2,FUN=
                  function(y){
                    t.test(y~x,var.equal=TRUE)$statistic
                  })
res.de.sim <- data.frame(p.value=pvals.sim,t.score=tscore.sim)

We count the number of significant genes.

sum(res.de.sim$p.value<0.05)
## [1] 773

This is a surprise! According to the analysis 773 genes are differentially expressed. However, we know that this cannot be true. What did we miss? The reason for the large number of falsely declared significant genes is that we performed multiple significance tests simultaneously. Each test is associated with an error which accumulate over the various test. In particular, we re-call that the probability of falsely rejecting the null-hypothesis (=Type-I error) is \[ {\rm Prob}(q_j<\alpha)\leq \alpha. \]

We performed a significance test for each gene which makes the expected number of falsely rejected null-hypotheses \(p\times\alpha\)=796.15. Under the null hypothesis we would expect the p-values to follow a uniform distribution. Indeed, that is what we observe in our simulation example.

hist(res.de.sim$p.value)

The distribution of p-values obtained from the real example has a peak near zero which indicates that some genes are truly differentially expressed between strains A and B.

hist(res.de$p.value)

In the next section we will discuss p-value adjustment which is a method to counteract the issue of multiple testing.

6.3 P-value Adjustment

Our previous consideration suggest that we could adjust the p-values by multiplying with the number \(p\) of performed tests, i.e.

\[q_{j}^{\rm adjust}=p\times q_j.\]

This adjustment method is known as the Bonferroni correction. The method has the property that it controls the so-called family-wise-error rate (FWER). Let’s assume that \(p_0\) is the number of true null hypotheses (unknown to the researcher), then we can show

\[\begin{align*} {\rm FWER}&={\rm Prob}({\rm at\;least\;one\;false\;positive})\\ &={\rm Prob}(\min_{j=1..p_0} q^{\rm adjust}_j\leq \alpha)\\ &={\rm Prob}(\min_{j=1..p_0} q_j\leq \alpha/p)\\ &\leq \sum_{j=1}^{p_0} {\rm Prob}(q_j\leq \alpha/p)\\ &={p_0}\frac{\alpha}{p}\leq\alpha. \end{align*}\]

In our example we calculate the Bonferroni adjusted p-values.

res.de$p.value.bf <- p*res.de$p.value
res.de.sim$p.value.bf <- p*res.de.sim$p.value

The number of significant genes in the real and simulated data are provided next. Note that none of the genes is significant in the simulated data which is in line with our expectations.

sum(res.de$p.value.bf<0.05)
## [1] 82
sum(res.de.sim$p.value.bf<0.05)
## [1] 0

The R function p.adjust offers various adjustment procedures. The different methods are based on different assumptions and/or they control a different error measure. The Bonferroni correction is the most conservative approach and often leads to too few significant result (loss of statistical power). Less conservative is the so-called FDR approach which controls the False Discovery Rate (instead of FWER). We calculate the FDR adjusted p-values and print the number of significant genes.

res.de$p.value.fdr <- p.adjust(res.de$p.value,method="fdr")
res.de.sim$p.value.fdr <- p.adjust(res.de.sim$p.value,method="fdr")
sum(res.de$p.value.fdr<0.05)
## [1] 1123
sum(res.de.sim$p.value.fdr<0.05)
## [1] 0

6.4 Volcano Plot

It is important to effectively display statistical results obtained from high-dimensional data. We have discussed how to calculate p-values and how to adjust them for multiplicity. However, the p-value is often not the only quantity of interest. In differential gene expression analysis we are also interested in the magnitude of change in expression.

magn<- apply(y,2,FUN=
             function(y){
               mba <- tapply(y,x,mean)
               return(mba[2]-mba[1])
             })
magn.sim <- apply(ysim,2,FUN=
                  function(y){
                    mba <- tapply(y,x,mean)
                    return(mba[2]-mba[1])
                  })
res.de$magn <- magn
res.de.sim$magn <- magn.sim

A frequently used display is the volcano plot which shows on the y-axis the \(-\log_{10}\) p-values and on the x-axis the magnitude of change. By using \(-\log_{10}\), the “highly significant” features appear at the top of the plot. Using log also permits us to better distinguish between small and very small p-values. We can further highlight the “top genes” as those with adjusted p-value <0.05 and magnitude of change \(>1\).

res.de%>%
  dplyr::mutate(topgene=ifelse(p.value.fdr<0.05&abs(magn)>1,
                               "top",
                               "other")
                )%>%
  ggplot(.,aes(x=magn,y=-log10(p.value),col=topgene))+
  geom_point()+
  scale_color_manual(values = c("top"="red","other"="black"))+
  theme_bw()+
  theme(legend.position = "none")+
  xlim(-3,3)+ylim(0,10)+
  geom_vline(xintercept = 1)+
  geom_vline(xintercept = -1)

We repeat the same plot with the simulated data.

res.de.sim%>%
  dplyr::mutate(topgene=ifelse(p.value.fdr<0.05&abs(magn)>1,
                               "top",
                               "other")
                )%>%
  ggplot(.,aes(x=magn,y=-log10(p.value),col=topgene))+
  geom_point()+
  scale_color_manual(values = c("top"="red","other"="black"))+
  theme_bw()+
  theme(legend.position = "none")+
  xlim(-3,3)+ylim(0,10)+
  geom_vline(xintercept = 1)+
  geom_vline(xintercept = -1)

6.5 Variance Shrinkage and Empirical Bayes

The basis of the statistical analyses are the t-statistics

\[\begin{align*} t_{j}&=\frac{\overline{y}_{j}^B-\overline{y}_{j}^A}{s_{j}\sqrt{\frac{1}{n_A}+\frac{1}{n_B}}}. \end{align*}\]

In a small sample size setting the estimated standard deviations exhibit high variability which can lead to large t-statistics. Extensive statistical methodology has been developed to counteract this challenge. The key idea of those methods is to shrink the gene-wise variances \(\rm s^2_{j}\) towards a common variance \(s^2_0\) (\(s^2_0\) is estimated from the data)

\[\begin{align*} \widetilde{s}_{j}^2&=\frac{d_0 s_0^2+d s^2_{j}}{d_0+d}. \end{align*}\]

A so-called moderated t-statistic is obtained by replacing in the denominator \(s_{j}\) with the “shrunken” \(\widetilde{s}_{j}\). The moderated t-statistic has favourable statistical properties in the small \(n\) setting. The statistical methodology behind the approach is referred to as empirical Bayes and is implemented in the function eBayes of the limma package. Limma starts with running gene-wise linear regression using the lmFit function.

library(limma)
## 
## Attaching package: 'limma'
## The following object is masked from 'package:BiocGenerics':
## 
##     plotMA
# first argument: gene expression matrix with genes in rows and sample in columns
# second argument: design matrix
fit <- lmFit(t(y), design=model.matrix(~ x)) 
head(coef(fit))
##            (Intercept)           xB
## 1367452_at   10.027453  0.092544985
## 1367453_at   10.173732  0.026867630
## 1367454_at   10.275137  0.003017421
## 1367455_at   10.371786 -0.101727288
## 1367456_at   10.815641 -0.006899555
## 1367457_at    9.607297  0.038318498

We can compare it with a standard lm fit.

coef(lm(y[,1]~x)) # gene 1
## (Intercept)          xB 
## 10.02745295  0.09254499
coef(lm(y[,2]~x)) # gene 2
## (Intercept)          xB 
## 10.17373179  0.02686763

Next, we use the eBayes function to calculate the moderated t statistics and p-values.

ebfit <- eBayes(fit)
head(ebfit$t) # moderated t statistics
##            (Intercept)          xB
## 1367452_at    182.4496  1.19066640
## 1367453_at    221.0103  0.41271149
## 1367454_at    184.0507  0.03821825
## 1367455_at    195.0270 -1.35258223
## 1367456_at    257.5965 -0.11619667
## 1367457_at    196.7628  0.55492613
head(ebfit$p.value) # p.values based on moderated t statistics
##             (Intercept)        xB
## 1367452_at 3.871319e-43 0.2441871
## 1367453_at 2.236672e-45 0.6830894
## 1367454_at 3.061059e-43 0.9697959
## 1367455_at 6.452172e-44 0.1874515
## 1367456_at 3.639411e-47 0.9083599
## 1367457_at 5.084786e-44 0.5835310

We can also retrieve the “shrunken” standard deviations

head(sqrt(ebfit$s2.post)) # shrunken standard deviations
## 1367452_at 1367453_at 1367454_at 1367455_at 1367456_at 1367457_at 
##  0.1903875  0.1594624  0.1933930  0.1842254  0.1454464  0.1691410