Chapter 5 High-dimensional feature assessment

A frequent task is to find features which differ with respect to one or more experimental factors. We illustrate this type of analysis using a gene expression experiment performed with \(12\) randomly selected mice from each of two strains (\(24\) samples in total). The features are the \(p=15923\) genes and the strain (A vs B) is the experimental factor.

A commonly used format for gene expression data is the ExpressionSet class. The expression values are retrieved using the function exprs, information on the phenotypes is obtained using pData, and with fData we get more information on the genes (“features”).

We load the ExpressionSet.

load("data/maPooling_sub.rda")
class(maPooling_sub)
## [1] "ExpressionSet"
## attr(,"package")
## [1] "Biobase"
dim(maPooling_sub)
## Features  Samples 
##    15923       24

We can look at the expression values for the first 6 genes.

head(exprs(maPooling_sub))
##                  a10       a11       a12       a14        a2        a3
## 1367452_at 10.051651  9.772611 10.100564 10.307930 10.120983  9.877421
## 1367453_at 10.163334 10.126906 10.401630 10.473059 10.284476  9.745209
## 1367454_at 10.211724 10.364283 10.267256 10.410698 10.519770  9.945073
## 1367455_at 10.334899 10.182461 10.490526 10.730913 10.491253 10.050945
## 1367456_at 10.889349 10.716313 10.804297 11.163798 10.779459 10.552024
## 1367457_at  9.666755  9.554593  9.618053  9.788813  9.479614  9.358437
##                   a4        a5        a6        a7        a8        a9
## 1367452_at  9.885281 10.124788  9.970484  9.834936  9.972449 10.310338
## 1367453_at 10.007384 10.321689 10.061782 10.049299 10.157477 10.292536
## 1367454_at 10.091313 10.388906 10.149787 10.209088 10.375866 10.367887
## 1367455_at 10.331264 10.476248 10.244007 10.305424 10.364826 10.458670
## 1367456_at 10.924683 10.789990 10.862684 10.829804 10.705432 10.769861
## 1367457_at  9.596411  9.570945  9.725613  9.589732  9.731538  9.607054
##                  b10       b11       b12       b13       b14       b15
## 1367452_at  9.930178 10.049333 10.226445 10.212999  9.952872 10.374269
## 1367453_at 10.186875 10.253727 10.213584 10.117653 10.172536 10.230923
## 1367454_at 10.085802 10.371507 10.279507 10.408542 10.002413 10.507157
## 1367455_at 10.096736 10.285130 10.270176 10.424791 10.106323 10.610902
## 1367456_at 10.716026 10.816883 10.906947 10.819123 10.866374 11.011218
## 1367457_at  9.641134  9.598255  9.722557  9.884596  9.539368  9.946626
##                   b2        b3        b5        b6        b8        b9
## 1367452_at 10.300247  9.643475  9.929848 10.245458 10.330313 10.244539
## 1367453_at 10.255677  9.933902 10.115643 10.358207 10.328802 10.239663
## 1367454_at 10.567162  9.849516 10.003595 10.364660 10.500085 10.397912
## 1367455_at 10.402120  9.904367 10.124746 10.280486 10.554032 10.180899
## 1367456_at 10.836565 10.443976 10.798685 10.792686 10.937668 10.758750
## 1367457_at  9.671455  9.104227  9.481612  9.705924  9.729583  9.722043

An overview of the phenotype data can be obtained using the following command (fData gives the analogous view of the feature data).

head(pData(maPooling_sub))
##     a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a14 b2 b3 b5 b6 b8 b9 b10 b11 b12 b13
## a10  0  0  0  0  0  0  0  0   1   0   0   0  0  0  0  0  0  0   0   0   0   0
## a11  0  0  0  0  0  0  0  0   0   1   0   0  0  0  0  0  0  0   0   0   0   0
## a12  0  0  0  0  0  0  0  0   0   0   1   0  0  0  0  0  0  0   0   0   0   0
## a14  0  0  0  0  0  0  0  0   0   0   0   1  0  0  0  0  0  0   0   0   0   0
## a2   1  0  0  0  0  0  0  0   0   0   0   0  0  0  0  0  0  0   0   0   0   0
## a3   0  1  0  0  0  0  0  0   0   0   0   0  0  0  0  0  0  0   0   0   0   0
##     b14 b15 strain
## a10   0   0      A
## a11   0   0      A
## a12   0   0      A
## a14   0   0      A
## a2    0   0      A
## a3    0   0      A

5.1 Gene-wise two-sample comparison

We are interested in comparing gene expression between the mice strains A and B.

x <- maPooling_sub$strain # strain information
y <- t(exprs(maPooling_sub)) # gene expressions matrix (columns refer to genes)

We start by visualizing the expression of gene \(j=11425\).

boxplot(y[,11425]~x)

This gene seems to be more highly expressed in strain A. To nail down this observation we can perform a more formal hypothesis test. We build the z-statistic

\[\begin{align*} Z_{j}&=\frac{\bar{Y}_{j}^B-\bar{Y}_{j}^A}{S_{j}}. \end{align*}\]

Approximately, \(Z_j\sim N(0,1)\), and we can calculate the two-sided p-value

\[\begin{align*} P_j&=2(1-\Phi(|Z_j|)). \end{align*}\]
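
To make the formula concrete, the following sketch computes \(Z_j\) and \(P_j\) by hand for gene \(j=11425\), taking \(S_j\) to be the pooled standard error of the mean difference (our reading of the denominator). Up to sign this equals the equal-variance t statistic, and the normal-approximation p-value is slightly smaller than the t-distribution-based one reported by t.test below.

yj <- y[,11425]
yA <- yj[x=="A"]; yB <- yj[x=="B"]
nA <- length(yA); nB <- length(yB)
s2.pool <- ((nA-1)*var(yA)+(nB-1)*var(yB))/(nA+nB-2) # pooled variance
Sj <- sqrt(s2.pool*(1/nA+1/nB)) # pooled standard error of the mean difference
Zj <- (mean(yB)-mean(yA))/Sj # z-statistic: effect (B minus A) divided by its standard error
Pj <- 2*(1-pnorm(abs(Zj))) # two-sided p-value under N(0,1)
c(Z=Zj,P=Pj)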

In R we can perform a two-sample t-test using the function t.test.

ttest <- t.test(y[,11425]~x,var.equal=TRUE)
ttest$statistic #zscore
##        t 
## 1.774198
ttest$p.value
## [1] 0.0898726

We obtain \(P_{11425}=0.09\) and based on that we would not reject the null hypothesis for this specific gene at the \(\alpha=0.05\) level. What about the other genes? We continue by repeating the analysis for all \(p=15923\) genes and save the results in a data frame.

pvals <- apply(y,2,FUN=
                 function(y){
                   t.test(y~x,var.equal=TRUE)$p.value
                 })
zscore <- apply(y,2,FUN=
                  function(y){
                    t.test(y~x,var.equal=TRUE)$statistic
                  })
res.de <- data.frame(p.value=pvals,z.score=zscore,geneid=names(zscore))
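
As a side note, the code above runs t.test twice per gene. A single pass that extracts both the statistic and the p-value at once could look as follows (a sketch; res.de2 is just an alternative name for the same result).

res.list <- apply(y,2,FUN=
                    function(yj){
                      tt <- t.test(yj~x,var.equal=TRUE)
                      c(z.score=unname(tt$statistic),p.value=tt$p.value)
                    })
res.de2 <- data.frame(t(res.list),geneid=colnames(y))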

Next we count the number of significant genes.

sum(res.de$p.value<0.05)
## [1] 2908

According to this analysis 2908 genes are differentially expressed between strains A and B, i.e. 18.3% of all genes. In the next section we will see that this analysis misses an important point, namely that it neglects the issue of multiple testing.

5.2 Multiple testing

To illustrate the multiple testing problem we create an artificial gene expression data set where we are certain that none of the genes is differentially expressed.

set.seed(1)
p <- ncol(y) # number of genes
n <- nrow(y) # number of samples
ysim <- matrix(rnorm(n*p),n,p) # pure noise: no gene is associated with the strain

Now we repeat the gene-wise two-sample comparisons for the artificial data set.

pvals.sim <- apply(ysim,2,FUN=
                 function(y){
                   t.test(y~x,var.equal=TRUE)$p.value
                 })
zscore.sim <- apply(ysim,2,FUN=
                  function(y){
                    t.test(y~x,var.equal=TRUE)$statistic
                  })
res.de.sim <- data.frame(p.value=pvals.sim,z.score=zscore.sim)

We count the number of significant genes.

sum(res.de.sim$p.value<0.05)
## [1] 773

This is a surprise! According to the analysis 773 genes are differentially expressed. However, we know that this cannot be true. What did we miss? The reason for the large number of falsely declared significant genes is that we performed many significance tests simultaneously. Each test carries a risk of error, and these errors accumulate over the many tests. In particular, we recall that the probability of falsely rejecting the null hypothesis (= Type-I error) is \[ {\rm Prob}(P_j<\alpha)\leq \alpha. \]

We performed a significance test for each gene, which makes the expected number of falsely rejected null hypotheses \(p\times\alpha=796.15\). Under the null hypothesis we would expect the p-values to follow a uniform distribution. Indeed, that is what we observe in our simulation example.

hist(res.de.sim$p.value)
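
As a quick numerical check, the expected number of false rejections under the global null is close to the count observed in the pure-noise simulation.

p*0.05 # expected number of falsely rejected null hypotheses at alpha = 0.05
## [1] 796.15
sum(res.de.sim$p.value<0.05) # observed in the simulation
## [1] 773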

The distribution of p-values obtained from the real example has a peak near zero which indicates that some genes are truly differentially expressed between strains A and B.

hist(res.de$p.value)

In the next section we will discuss p-value adjustment which is a method to counteract the issue of multiple testing.

5.3 P-value adjustment

Our previous considerations suggest that we could adjust the p-values by multiplying them by the number \(p\) of tests performed, i.e.

\[P_{j}^{\rm adjust}=p\times P_j.\]

This adjustment method is known as the Bonferroni correction. It has the property that it controls the so-called family-wise error rate (FWER). The derivation goes as follows

\[\begin{align*} {\rm FWER}&={\rm Prob}({\rm at\;least\;one\;false\;rejection})\\ &={\rm Prob}({\rm min}_j\; P^{\rm adjust}_j\leq \alpha)\\ &={\rm Prob}({\rm min}_j\; P_j\leq \alpha/p)\\ &\leq \sum_{j=1}^p {\rm Prob}(P_j\leq \alpha/p)\\ &\leq p\frac{\alpha}{p}=\alpha, \end{align*}\]

where we assume for simplicity that all null hypotheses are true, the first inequality is the union bound, and the last step uses \({\rm Prob}(P_j\leq \alpha/p)\leq \alpha/p\).

In our example we calculate the Bonferroni adjusted p-values.

res.de$p.value.bf <- p*res.de$p.value
res.de.sim$p.value.bf <- p*res.de.sim$p.value

The numbers of significant genes in the real and simulated data are provided next. In particular, none of the genes is significant in the simulated data, which is in line with our expectations.

sum(res.de$p.value.bf<0.05)
## [1] 82
sum(res.de.sim$p.value.bf<0.05)
## [1] 0
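
The same counts can be reproduced with the built-in Bonferroni option of p.adjust (discussed next). Unlike the manual multiplication above, p.adjust caps the adjusted p-values at 1, which does not change the counts at the 0.05 threshold.

sum(p.adjust(res.de$p.value,method="bonferroni")<0.05) # should again give 82
sum(p.adjust(res.de.sim$p.value,method="bonferroni")<0.05) # should again give 0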

The R function p.adjust offers various adjustment procedures. The different methods are based on different assumptions and/or control different error measures. The Bonferroni correction is the most conservative approach and often leads to too few significant results (loss of statistical power). Less conservative is the so-called FDR approach, which controls the False Discovery Rate instead of the FWER. We calculate the FDR-adjusted p-values and print the number of significant genes.

res.de$p.value.fdr <- p.adjust(res.de$p.value,method="fdr")
res.de.sim$p.value.fdr <- p.adjust(res.de.sim$p.value,method="fdr")
sum(res.de$p.value.fdr<0.05)
## [1] 1123
sum(res.de.sim$p.value.fdr<0.05)
## [1] 0
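
The “fdr” method implements the Benjamini–Hochberg step-up procedure: sort the p-values, find the largest \(k\) with \(P_{(k)}\leq \frac{k}{p}\alpha\), and reject the hypotheses with the \(k\) smallest p-values. A minimal sketch of this rule, which should reproduce the 1123 significant genes found above:

alpha <- 0.05
p.sorted <- sort(res.de$p.value)
below <- which(p.sorted<=(seq_len(p)/p)*alpha) # indices meeting the BH threshold
k <- if(length(below)==0) 0 else max(below) # number of rejected null hypotheses
k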

5.4 The Volcano plot

It is important to effectively display statistical results obtained from high-dimensional data. We have discussed how to calculate p-values and how to adjust them for multiplicity. However, the p-value is often not the only quantity of interest. In our example we are also interested in the effect size, that is, the magnitude of the difference in expression between strains A and B. The following code calculates the effect sizes.

ef<- apply(y,2,FUN=
             function(y){
               mba <- tapply(y,x,mean) # strain-wise means
               return(mba[2]-mba[1]) # effect size: mean of B minus mean of A
             })
ef.sim <- apply(ysim,2,FUN=
                  function(y){
                    mba <- tapply(y,x,mean)
                    return(mba[2]-mba[1])
                  })
res.de$effectsize <- ef
res.de.sim$effectsize <- ef.sim

A frequently used display is the volcano plot, which shows the \(-\log_{10}\) p-values on the y-axis and the effect sizes on the x-axis. By using \(-\log_{10}\), the “highly significant” features appear at the top of the plot. Using the log scale also permits us to better distinguish between small and very small p-values. We can further highlight the “top genes” as those with adjusted p-value \(<0.05\) and absolute effect size \(>1\).

library(ggplot2)
library(dplyr) # provides the pipe operator %>%
res.de%>%
  dplyr::mutate(topgene=ifelse(p.value.fdr<0.05&abs(effectsize)>1,"top","other"))%>%
  ggplot(.,aes(x=effectsize,y=-log10(p.value),col=topgene))+
  geom_point()+
  scale_color_manual(values = c("top" = "red", "other" = "black"))+
  theme_bw()+
  theme(legend.position = "none")+
  xlim(-3,3)+ylim(0,10)+
  geom_vline(xintercept = 1)+
  geom_vline(xintercept = -1)

We repeat the same plot with the simulated data.

res.de.sim%>%
  dplyr::mutate(topgene=ifelse(p.value.fdr<0.05&abs(effectsize)>1,"top","other"))%>%
  ggplot(.,aes(x=effectsize,y=-log10(p.value),col=topgene))+
  geom_point()+
  scale_color_manual(values = c("top" = "red", "other" = "black"))+
  theme_bw()+
  theme(legend.position = "none")+
  xlim(-3,3)+ylim(0,10)+
  geom_vline(xintercept = 1)+
  geom_vline(xintercept = -1)

5.5 Variance shrinkage and empirical Bayes

The statistical analyses above are based on the Z-scores

\[\begin{align*} Z_{j}&=\frac{\bar{Y}_{j}^B-\bar{Y}_{j}^A}{S_{j}}. \end{align*}\]

with the effect size in the numerator and the pooled standard error in the denominator. In a small sample size setting the estimated standard deviations exhibit high variability, which can lead to spuriously large Z-scores. Extensive statistical methodology has been developed to counteract this challenge. The key idea of these methods is to shrink the gene-wise variances \(S^2_{j}\) towards a common variance \(S^2_0\)

\[\begin{align*} \tilde{S}_{j}^2&=\frac{d_0 S_0^2+d S^2_{j}}{d_0+d}. \end{align*}\]

A so-called moderated Z-score is obtained by replacing the denominator with the “shrunken” \(\tilde{S}_{j}\). The moderated Z-score has favourable statistical properties in the small-\(n\) setting. The statistical methodology behind this approach is referred to as empirical Bayes and is implemented in the function eBayes of the limma package. limma starts by fitting gene-wise linear regressions using the lmFit function.

library(limma)
## 
## Attaching package: 'limma'
## The following object is masked from 'package:BiocGenerics':
## 
##     plotMA
# first argument: gene expression matrix with genes in rows and sample in columns
# second argument: design matrix
fit <- lmFit(t(y), design=model.matrix(~ x)) 
head(coef(fit))
##            (Intercept)           xB
## 1367452_at   10.027453  0.092544985
## 1367453_at   10.173732  0.026867630
## 1367454_at   10.275137  0.003017421
## 1367455_at   10.371786 -0.101727288
## 1367456_at   10.815641 -0.006899555
## 1367457_at    9.607297  0.038318498

We can compare it with a standard lm fit.

coef(lm(y[,1]~x)) # gene 1
## (Intercept)          xB 
## 10.02745295  0.09254499
coef(lm(y[,2]~x)) # gene 2
## (Intercept)          xB 
## 10.17373179  0.02686763
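
Note that the xB coefficient is exactly the effect size (mean of strain B minus mean of strain A) computed in Section 5.4; a quick check:

all.equal(unname(coef(fit)[,"xB"]),unname(ef)) # should be TRUE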

Next, we use the eBayes function to calculate the moderated t statistics and p-values.

ebfit <- eBayes(fit)
head(ebfit$t) # moderated t statistics
##            (Intercept)          xB
## 1367452_at    182.4496  1.19066640
## 1367453_at    221.0103  0.41271149
## 1367454_at    184.0507  0.03821825
## 1367455_at    195.0270 -1.35258223
## 1367456_at    257.5965 -0.11619667
## 1367457_at    196.7628  0.55492613
head(ebfit$p.value) # p.values based on moderated t statistics
##             (Intercept)        xB
## 1367452_at 3.871319e-43 0.2441871
## 1367453_at 2.236672e-45 0.6830894
## 1367454_at 3.061059e-43 0.9697959
## 1367455_at 6.452172e-44 0.1874515
## 1367456_at 3.639411e-47 0.9083599
## 1367457_at 5.084786e-44 0.5835310
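
To see the effect of variance shrinkage, we can compare the ordinary t statistics from Section 5.1 with the moderated ones (a sketch; note that t.test(y~x) reports the A-minus-B contrast, so we flip the sign to match limma’s xB coefficient).

plot(-zscore,ebfit$t[,"xB"],
     xlab="ordinary t statistic",ylab="moderated t statistic")
abline(0,1,col="red") # reference line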

We can also retrieve the “shrunken” variances (note that s2.post contains variances, not standard deviations).

head(ebfit$s2.post) # shrunken (posterior) variances
## 1367452_at 1367453_at 1367454_at 1367455_at 1367456_at 1367457_at 
## 0.03624740 0.02542827 0.03740085 0.03393898 0.02115466 0.02860868
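
As a check on the shrinkage formula from the beginning of this section, the posterior variances should be the degrees-of-freedom weighted combination of the common prior variance \(S_0^2\) (s2.prior, with \(d_0=\) df.prior) and the gene-wise residual variances (sigma squared, with \(d=\) df.residual), provided the estimated \(d_0\) is finite. A minimal sketch:

s2.manual <- (ebfit$df.prior*ebfit$s2.prior+ebfit$df.residual*ebfit$sigma^2)/
  (ebfit$df.prior+ebfit$df.residual)
all.equal(unname(s2.manual),unname(ebfit$s2.post)) # should be TRUE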