# Chapter 11 Demo

## 11.1 Project Setup

I suggest building your data analysis projects in RStudio(Click File - New project - New dictionary - Empty project). Then assign a name for your project. I also recommend the following tips if you are familiar with it.

• Use git/github to make version control of your code and sync your project online.

• Don’t use your name for your project because other peoples might cooperate with you and someone might check your data when you publish your papers. Each project should be a work for one paper or one chapter in your thesis.

• Use workflow document(txt or doc) in your project to record all of the steps and code you performed for this project. Treat this document as digital version of your experiment notebook

• Use data folder in your project folder for the raw data and the results you get in data analysis

• Use figure folder in your project folder for the figure

• Use munuscript folder in your project folder for the manuscript (you could write paper in rstudio with the help of template in Rmarkdown)

• Just double click [yourprojectname].Rproj to start your project

## 11.2 Data input

xcms does not support all of the raw files formate from every mass spectrometry. You need to convert your raw data into some open-source data format such as mzData, mzXML or CDF files. The tool is MSconvert from ProteoWizard.

## 11.3 Optimization

IPO package could be used to optimaze the parameters for XCMS. Try the following code and here we employ 5 NIST 1950 samples and 6 matrix blank samples as demodata from rmwf package.

library(IPO)
library(xcms)
library(rmwf)
path <- system.file("extdata/data", package = "rmwf")
files <- list.files(path,recursive = T,full.names = T)

# BiocManager::install("IPO")
library(IPO)
library(xcms)
peakpickingParameters <- getDefaultXcmsSetStartingParams('centWave')
# Demo data
path <- system.file("extdata/data", package = "rmwf")
files <- list.files(path,recursive = T,full.names = T)
# Uncomment this line to use your own data(suggested 3-5 pooled QC samples)
# path <- 'path/to/your/files'
# change to 5 for obitrap
peakpickingParameters$ppm <- 10 resultPeakpicking <- optimizeXcmsSet(files = files[c(7,9,11)], params = peakpickingParameters, plot = F, subdir = NULL) optimizedXcmsSetObject <- resultPeakpicking$best_settings$xset retcorGroupParameters <- getDefaultRetGroupStartingParams() resultRetcorGroup <- optimizeRetGroup(xset = optimizedXcmsSetObject, params = retcorGroupParameters, plot = F, subdir = NULL) para <- c(resultPeakpicking$best_settings$parameters, resultRetcorGroup$best_settings)
save(para,file = 'para.RData')
sessionInfo()

## 11.4 Wrap function

Here we would use the optimized parameters for peak picking, retention time correction and peaks filling.

load('para.RData')
library(xcms)
library(stringr)
getrtmz <- function(path,index = NULL){
peakwidth <- c(para$min_peakwidth,para$max_peakwidth)
ppm <- para$ppm noise <- para$noise
snthresh <- para$snthresh mzdiff <- para$mzdiff
prefilter <- c(para$prefilter,para$value_of_prefilter)
integrate <- para$integrate profStep <- para$profStep
center <- para$center response <- para$response
gapInit <- para$gapInit gapExtend <- para$gapExtend
factorDiag <- para$factorDiag factorGap <- para$factorGap
localAlignment <- para$localAlignment bw <- para$bw
mzwid <- para$mzwid minfrac <- para$minfrac
minsamp <- para$minsamp max <- para$max
distFunc <- para$distFunc plottype <- para$plottype
retcorMethod <- para$retcorMethod mzCenterFun <- para$mzCenterFun
fitgauss <- para$fitgauss verboseColumns <- para$verbose.columns
files <- list.files(path,full.names = T,recursive = T)
if(!is.null(index)){
files <- files[index]
}
xset <- xcmsSet(files,
method = "centWave",
peakwidth       = peakwidth,
ppm             = ppm,
noise           = noise,
snthresh        = snthresh,
mzdiff          = mzdiff,
prefilter       = prefilter,
mzCenterFun     = mzCenterFun,
integrate       = integrate,
fitgauss        = fitgauss,
verbose.columns = verboseColumns)
xset <- retcor(
xset,
method         = retcorMethod,
plottype       = plottype,
distFunc       = distFunc,
profStep       = profStep,
center         = center,
response       = response,
gapInit        = gapInit,
gapExtend      = gapExtend,
factorDiag     = factorDiag,
factorGap      = factorGap,
localAlignment = localAlignment)
xset <- group(
xset,
method  = "density",
bw      = bw,
mzwid   = mzwid,
minfrac = minfrac,
minsamp = minsamp,
max     = max)

xset <- fillPeaks(xset)
return(xset)
}

### 11.4.1 Peak picking

The first step to process the MS data is that find the peaks against the noises. In xcms, all of related staffs are handled by xcmsSet function.

For any functions in xcms or R, you could get their documents by type ? before certain function. Another geek way is input the name of the function in the console of Rstudio and press F1 for help.

?xcmsSet

In the document of xcmsset, we could set the sample classes, profmethod, profparam, polarity,etc. In the online version, such configurations are shown in certain windows. In the local analysis environment, such parameters are setup by yourselves. However, I think the default configurations could satisfied most of the analysis because related information should have been recorded in your Raw data and xcms could find them. All you need to do is that show the data dictionary for xcmsSet.

If your data have many groups such as control and treated group, just put them in separate subfolder of the data folder and xcmsSet would read them as separated groups.

### 11.4.2 Data correction

Reasons of data correction might come from many aspects such as the unstable instrument and pollution on column. In xcms, the most important correction is retention time correction. Remember the original retention time might changed and use another object to save the new object.

### 11.4.3 Peaks filling

After the retention time correction, filling the missing peaks could be done by fillpeaks. Peaks filling could avoid two many NAs for false statistical analysis. The algorithm could use the baseline signal to impute the data.

## 11.5 Peaks list

Then we could extract the peaks list from xcmsSet objects.

library(enviGCMS)
# get the xcmsset object
path <- system.file("extdata/data", package = "rmwf")
# path <- 'path/to/your/file'
srm <- getrtmz(path)
# back up the xcmsset object, xcmsEIC object and peak list
mzrt <- enviGCMS::getmzrt(srm, name = 'srm', eic = T, type = 'mapo')

## 11.6 Peaks filtering

After we get the peaks list, it’s nessary to filter the peaks list to retain the peaks with high quality for further analysis. Normally, we could use the relative standard deviation of blank and pooled QC samples to control the peaks list.

load('srmmzrt.RDS')
# get the peak intensity, m/z, retention time and group information as list
mzrt <- enviGCMS::getmzrt(srm)
data(mzrt)
# get the mean and rsd for each group
mzrtm <- enviGCMS::getdoe(mzrt)
gm <- mzrtm$groupmean gr <- mzrtm$grouprsd
# find the blank group and pool QC group, demo data only have matrix blank
srm <- grepl('NIST',colnames(gm))
blk <- grepl('Matrix',colnames(gm))
# pqc <- grepl('pool',colnames(gm))
# filter by pool QC and blank's group mean intensity(pool QC should larger than three times of blank), return numbers and index
# in demo data, use sample average intensity for each peak
sum(indexmean <- apply(gm,1,function(x) all(x[srm]>= 3*x[blk])))
# filter by pool qc rsd%, return numbers and index
# in demo data, use sample average intensity for each peak
# mean rsd analysis
# filter by pool qc rsd%, return numbers and index, here we use SRM samples
rsdcf <- 30
sum(indexrsd <- apply(gr,1,function(x) ifelse(is.na(x[srm]),T,x[srm]<rsdcf)))
# overlap with rsd% and mean filter
sum(index <- indexmean&indexrsd)
# new list, update group and remove pool qc/blk
# qcindex <- grepl('k',mzrt$group$class) | grepl('q',mzrt$group$class)
# mzrtfilter <- mzrtfilter <- enviGCMS::getfilter(mzrt,rowindex = index,colindex = !qcindex, name = 'lif', type = 'm')
mzrtfilter <- mzrtfilter <- enviGCMS::getfilter(mzrt,rowindex = index, name = 'lif', type = 'm')

## 11.7 Normalization (Optional)

Normailizaiton is nesscery if you data are collected at different batch or runned in differen instrument parameters.

# visulize the batch effect
mzrtsim::rlaplot(mzrt$data,lv = mzrt$group$class) mzrtsim::ridgesplot(mzrt$data,lv = mzrt$group$class)
# get the simulation data and test on NOREVA
sim <- mzrtsim::simmzrt(mzrt$data) mzrtsim::simdata(sim) # correct the batch effect by sva mzrtcor <- mzrtsim::svacor(mzrt$data,lv = mzrt$group$class)
# visulize the batch effect correction
li <- mzrtsim::limmaplot(mzrtcor,lv = mzrt$group$class)
# return the corrected data
mzrt$data <- mzrtcor$dataCorrected

## 11.8 Statistic analysis

Here we could use caret package to perform statistical analysis.

library(caret)
## Spliting data
trainIndex <- createDataPartition(mzrtfilter$data, p = .8, list = FALSE, times = 1) ## Get the training and testing datasets Train <- data[ trainIndex,] Test <- data[-trainIndex,] ## Set the cross validation method fitControl <- trainControl(## 10-fold CV method = "repeatedcv", number = 10, ## repeated ten times repeats = 10) # extra papameters for GBM gbmGrid <- expand.grid(interaction.depth = c(1, 5, 9), n.trees = (1:30)*50, shrinkage = 0.1, n.minobsinnode = 20) set.seed(825) gbmFit <- train(mzrtfilter$group ~ ., data = training,
method = "gbm",
trControl = fitControl,
verbose = FALSE,
## Now specify the exact models
## to evaluate:
tuneGrid = gbmGrid)
# show the fitting process
plot(gbmFit)
# ANOVA analysis for model selection
anova(fit1,fit2)
# find the important variables
Imp <- varImp(fit)
plot(Imp)

## 11.9 Annotation

Here I use xMSannotator package to make annotation with HMDB as reference database.

library(xMSannotator)
num_nodes = 10
anno <- xsetplus::fanno(negsub,outloc = 'D:/metademo/anno',mode = 'neg')

## 11.10 Pathway Analysis

We could output the files for online pathway analysis with mummichog algorithm.

# get the file
xsetplus::mumdata(neg,lv = mzrt$group$class)
# http://mummichog.org/index.html
# mummichog1 -f 'test.txt' -o myResult

## 11.11 MetaboAnalyst

Actully, after you perform data correction, you have got the data matrix for statistic analysis. You might choose MetaboAnalyst online or offline to make furthor analysis, which supplied more statistical choices than xcms.

The input data format for MetaboAnalyst should be rows for peaks and colomns for samples. You could also add groups infomation if possible. Use the following code to get the data for analysis.

# get the csv file for Metaboanalyst.ca
enviGCMS::getupload(neg,name = 'metabo')

## 11.12 Summary

This is the offline metaboliomics data process workflow. For each study, details would be different and F1 is always your best friend.

Enjoy yourself in data mining!