Chapter 5 Raw data pretreatment

Raw data from instruments such as LC-MS or GC-MS are hard to analyze directly. To make this clear, the structure of such data can be summarised as:

  • Indexed scan with time-stamp

  • Each scan contains a full-scan mass spectrum

Common open formats for mass spectrometry data are mzXML, mzML and CDF. In addition, MassComp can shrink the data size (Yang, Chen, and Ochoa 2019).

5.1 Peak extraction

GC/LC-MS data are usually shown as a matrix, with columns standing for retention times and rows standing for masses, after binning the signals into small cells.

Figure 5.1: Demo of GC/LC-MS data
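The binning step can be sketched in a few lines of plain Python. This is only an illustration under assumed conventions: the function name `bin_scans`, the scan layout (a retention time plus a list of m/z and intensity pairs) and the bin width are all hypothetical choices, not part of any package mentioned in this chapter.

```python
def bin_scans(scans, mz_min, mz_max, bin_width):
    """Bin scans into an intensity matrix: rows are m/z bins, columns are scans."""
    n_bins = int(round((mz_max - mz_min) / bin_width))
    matrix = [[0.0] * len(scans) for _ in range(n_bins)]
    for col, (_rt, peaks) in enumerate(scans):
        for mz, intensity in peaks:
            if mz_min <= mz < mz_max:
                # The clamp guards against floating-point rounding at the upper edge.
                row = min(int((mz - mz_min) / bin_width), n_bins - 1)
                matrix[row][col] += intensity
    retention_times = [rt for rt, _peaks in scans]
    return retention_times, matrix


# Hypothetical toy data: each scan is (retention time in s, [(m/z, intensity), ...]).
scans = [
    (0.5, [(100.02, 10.0), (100.04, 5.0), (100.55, 7.0)]),
    (1.0, [(100.03, 20.0)]),
]
retention_times, matrix = bin_scans(scans, mz_min=100.0, mz_max=101.0, bin_width=0.1)
```

Signals falling into the same cell are summed here. Keeping the maximum instead would be an equally plausible choice for a sketch like this.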

Converting the mass-retention time matrix into a vector of selected MS peaks at certain retention times is the basic idea of peak extraction. You can build an extracted ion chromatogram (EIC) for each mass-to-charge ratio and use the change of the trace slope to determine whether there is a peak or not. The peak can then be integrated to get its peak area and retention time.

Figure 5.2: Demo of EIC with peak
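A minimal sketch of the EIC idea follows, assuming the same hypothetical scan layout of a retention time plus a list of m/z and intensity pairs. The helper names, the m/z tolerance and the slope threshold are illustrative assumptions; real peak pickers add smoothing and noise handling.

```python
def extract_eic(scans, target_mz, tol):
    """Sum the intensities within target_mz +/- tol for every scan."""
    rts, trace = [], []
    for rt, peaks in scans:
        rts.append(rt)
        trace.append(sum(i for mz, i in peaks if abs(mz - target_mz) <= tol))
    return rts, trace


def find_peak(rts, trace, min_slope):
    """Return (apex retention time, area) of the first peak, or None if there is none."""
    slopes = [(trace[k + 1] - trace[k]) / (rts[k + 1] - rts[k])
              for k in range(len(trace) - 1)]
    # Peak start: first interval that rises faster than min_slope.
    start = next((k for k, s in enumerate(slopes) if s > min_slope), None)
    if start is None:
        return None
    # Apex: the point where the slope stops being positive.
    apex = start + 1
    while apex < len(slopes) and slopes[apex] > 0:
        apex += 1
    # Peak end: the point where the trace stops falling faster than min_slope.
    end = apex
    while end < len(slopes) and slopes[end] < -min_slope:
        end += 1
    # Trapezoidal integration between start and end.
    area = sum((trace[k] + trace[k + 1]) / 2 * (rts[k + 1] - rts[k])
               for k in range(start, end))
    return rts[apex], area


# Hypothetical toy run: one ion at m/z 200.001 that elutes around 3 s.
intensities = [0.0, 0.0, 2.0, 6.0, 2.0, 0.0, 0.0]
scans = [(float(t), [(200.001, i), (350.0, 1.0)]) for t, i in enumerate(intensities)]
rts, trace = extract_eic(scans, target_mz=200.0, tol=0.01)
peak = find_peak(rts, trace, min_slope=1.0)
```

For this toy trace the sketch reports an apex at 3.0 s and an area of 10.0.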

However, because of the limited mass accuracy of the instrument, the detected mass-to-charge ratio shifts slightly between scans, and the EIC fails if different scans record the intensity at different mass-to-charge ratios.

The matchedFilter algorithm (Smith et al. 2006) solves this issue by binning the data in the m/z dimension. Adjacent chromatographic slices can be combined to find a clean signal, which is fitted with a fixed second-derivative Gaussian with a full width at half-maximum (fwhm) of 30 s to find peaks with about 1.5-4 times the signal peak width. The integration is then performed on the fitted area.

Figure 5.3: Demo of matchedfilter
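The filtering step can be illustrated with a small sketch that slides a second-derivative Gaussian (sign-flipped so that peaks give positive responses) over one binned chromatographic slice. This is not the xcms implementation: the function names, the 1 s scan spacing and the synthetic trace are assumptions made for the example.

```python
import math


def second_derivative_gaussian(fwhm, step, half_width_factor=4):
    """Negated second derivative of a Gaussian, sampled every `step` seconds."""
    sigma = fwhm / (2 * math.sqrt(2 * math.log(2)))
    n = int(half_width_factor * sigma / step)
    return [(1 - (k * step / sigma) ** 2) * math.exp(-((k * step) ** 2) / (2 * sigma ** 2))
            for k in range(-n, n + 1)]


def matched_filter(trace, kernel):
    """Slide the (symmetric) kernel over the trace; out-of-range points are skipped."""
    half = len(kernel) // 2
    filtered = []
    for i in range(len(trace)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(trace):
                acc += weight * trace[idx]
        filtered.append(acc)
    return filtered


# Synthetic slice: constant baseline of 5 plus a Gaussian peak (fwhm 30 s) at scan 100.
sigma = 30.0 / (2 * math.sqrt(2 * math.log(2)))
trace = [5.0 + 100.0 * math.exp(-((t - 100) ** 2) / (2 * sigma ** 2)) for t in range(200)]
kernel = second_derivative_gaussian(fwhm=30.0, step=1.0)
filtered = matched_filter(trace, kernel)
apex_index = filtered.index(max(filtered))
```

Because the kernel sums to roughly zero, the constant baseline is largely suppressed and the filtered trace peaks where the chromatographic peak sits.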

The centWave algorithm (Tautenhahn, Böttcher, and Neumann 2008), based on the detection of regions of interest (ROI) followed by a Continuous Wavelet Transform (CWT), is preferred for high-resolution mass spectra. An ROI is a region with a stable mass over a certain time. Once the ROIs are found, the peak shape is evaluated and an ROI can be extended if needed. The algorithm uses a prefilter to speed up processing: a prefilter of 3 and 100 means that an ROI should contain at least 3 scans with intensity above 100. centWave uses a peak width range, which should be checked on pooled QC samples. Another important parameter is ppm, the maximum allowed m/z deviation between scans when locating ROIs. It is different from the mass accuracy quoted by the vendor, and you usually need to set it larger than the company claims. The profparam parameter is used for filling or aligning peaks rather than for peak picking. snthresh is the cutoff of the signal-to-noise ratio.
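The ROI search and the prefilter can be sketched as below. This is a simplified illustration rather than the centWave code: the function names and the dictionary layout are hypothetical, each scan is assumed to hold centroided (m/z, intensity) pairs, and the CWT step that follows ROI detection is left out.

```python
def ppm_diff(mz, ref):
    """Absolute m/z deviation from ref, in parts per million."""
    return abs(mz - ref) / ref * 1e6


def find_rois(scans, ppm, prefilter=(3, 100.0)):
    """Collect runs of consecutive scans whose m/z stays within `ppm` of the ROI mean."""
    k_min, i_min = prefilter
    open_rois, closed = [], []
    for rt, centroids in scans:
        still_open, used = [], set()
        for roi in open_rois:
            mean_mz = sum(roi["mz"]) / len(roi["mz"])
            match = next((c for c, (mz, _i) in enumerate(centroids)
                          if c not in used and ppm_diff(mz, mean_mz) <= ppm), None)
            if match is None:
                closed.append(roi)  # the mass trace is interrupted: close the ROI
            else:
                used.add(match)
                mz, inten = centroids[match]
                roi["mz"].append(mz)
                roi["intensity"].append(inten)
                roi["rt"].append(rt)
                still_open.append(roi)
        # Every unmatched centroid starts a new candidate ROI.
        for c, (mz, inten) in enumerate(centroids):
            if c not in used:
                still_open.append({"mz": [mz], "intensity": [inten], "rt": [rt]})
        open_rois = still_open
    closed.extend(open_rois)
    # Prefilter: keep ROIs with at least k_min points of intensity >= i_min.
    return [r for r in closed if sum(i >= i_min for i in r["intensity"]) >= k_min]


# Hypothetical centroided scans: (retention time, [(m/z, intensity), ...]).
scans = [
    (1.0, [(300.0000, 50.0), (500.00, 300.0)]),
    (2.0, [(300.0010, 150.0), (500.02, 300.0)]),   # 500.02 is 40 ppm off: new ROI
    (3.0, [(300.0005, 400.0), (450.10, 500.0)]),
    (4.0, [(299.9995, 120.0)]),
]
rois = find_rois(scans, ppm=10, prefilter=(3, 100.0))
```

Only the stable trace near m/z 300 survives. The jumping signal near m/z 500 and the single intense point at m/z 450.1 are dropped by the prefilter.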

5.2 Retention Time Correction

A single file gives us a list of peaks. However, the peaks must be aligned across samples for subsequent analysis, so retention time correction should be performed. The basic idea behind retention time correction is to use high-quality grouped peaks to build a new retention time scale. You might choose the obiwarp method (for dramatic shifts) or loess regression (fast) to get the corrected retention times for all of the samples. Remember that the original retention times might be changed and you might need to cross-check the data. After the correction, you can group the peaks again for a better cross-sample peak list. However, if you use obiwarp directly, you don't have to group peaks before correction.

Fu et al. (2017) describe a MATLAB-based shift correction method.
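The grouped-peak idea behind retention time correction can be sketched as follows. Note the substitution: instead of loess or obiwarp, this example uses plain piecewise-linear interpolation between anchor peaks, which is easier to show in a few lines. The function name, the use of the group median as the reference time and the toy numbers are all assumptions.

```python
from bisect import bisect_right
from statistics import median


def build_rt_corrector(anchors):
    """anchors: (observed_rt, reference_rt) pairs with distinct observed_rt values."""
    pts = sorted(anchors)
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]

    def correct(rt):
        # Outside the anchor range, apply the shift of the nearest anchor.
        if rt <= xs[0]:
            return rt + (ys[0] - xs[0])
        if rt >= xs[-1]:
            return rt + (ys[-1] - xs[-1])
        # Inside the range, interpolate linearly between the two neighbours.
        j = bisect_right(xs, rt)
        x0, x1, y0, y1 = xs[j - 1], xs[j], ys[j - 1], ys[j]
        return y0 + (y1 - y0) * (rt - x0) / (x1 - x0)

    return correct


# Hypothetical well-behaved peak groups: one row per group, one column per sample.
groups = [
    [60.0, 62.0, 64.0],
    [120.0, 121.0, 122.5],
    [300.0, 300.0, 301.0],
]
reference = [median(g) for g in groups]                 # reference time per group
anchors_sample0 = [(g[0], ref) for g, ref in zip(groups, reference)]
correct_sample0 = build_rt_corrector(anchors_sample0)
corrected = correct_sample0(90.0)
```

A peak observed at 90.0 s in the first sample is moved to 91.5 s, halfway between the shifts of its two neighbouring anchors.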

5.3 Filling missing values

Too many zeros or NAs in the peak list are problematic for statistics. We therefore usually need to integrate the area where a peak should exist. xcms 3 can use the profile matrix to fill in the blanks. It also has functions to impute NA values by replacing missing values with a proportion of the row minimum, or with random numbers based on the row minimum. It is up to the user to select the imputation method and to control the minimum fraction of samples within a single group in which a feature must appear.

Figure 5.4: Peak filling of GC/LC-MS data
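The two user choices mentioned above, the minimum fraction within a group and the row-minimum imputation, can be sketched in plain Python. The function names, the 0.5 proportion and the use of `None` for missing values are assumptions for this illustration and do not reproduce the xcms functions.

```python
def passes_min_fraction(values, groups, min_frac):
    """True if the feature is detected in >= min_frac of the samples of any one group."""
    for name in set(groups):
        members = [v for v, g in zip(values, groups) if g == name]
        detected = sum(v is not None for v in members)
        if detected / len(members) >= min_frac:
            return True
    return False


def impute_row_min(values, proportion=0.5):
    """Replace missing values by a proportion of the smallest observed value in the row."""
    fill = min(v for v in values if v is not None) * proportion
    return [fill if v is None else v for v in values]


# Hypothetical peak list rows (one feature each) for two groups of three samples.
groups = ["a", "a", "a", "b", "b", "b"]
feature_1 = [None, 200.0, 400.0, None, None, None]
feature_2 = [None, None, 50.0, None, None, 60.0]

keep_1 = passes_min_fraction(feature_1, groups, min_frac=0.5)
keep_2 = passes_min_fraction(feature_2, groups, min_frac=0.5)
filled_1 = impute_row_min(feature_1, proportion=0.5)
```

The first feature is kept because it appears in two of three samples of group a, and its gaps are filled with 100.0. The second feature is detected in only one of three samples in each group and would be dropped.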

With many groups of samples, raw data pretreatment yields another data matrix, with columns standing for peaks at certain retention times and rows standing for samples.

Figure 5.5: Demo of many GC/LC-MS data

5.4 Spectral deconvolution

Without fragment information about a certain compound, peak extraction suffers from interference by other compounds. At the same retention time, co-eluting compounds might share similar masses. Hard ionization methods such as electron impact ionization (EI) and APPI suffer from this issue. It is then hard to distinguish the origin of co-eluting peaks, and a deconvolution method (Du and Zeisel 2013) can be used to separate them into groups according to similar chromatographic behaviour. Another computational tool, eRah, could be a better solution for the whole process (Domingo-Almenara et al. 2016). ADAP-GC 3.0 could also be helpful for this issue (Ni et al. 2016).
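Grouping ions by similar chromatographic behaviour can be illustrated with a simple correlation-based sketch. This is not the algorithm of any tool cited above: the greedy grouping, the Pearson threshold of 0.9 and the function names are assumptions, and traces are assumed to be non-constant.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equally long, non-constant traces."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def group_by_profile(eics, threshold=0.9):
    """Greedily group m/z values whose EICs correlate with the first member of a group."""
    groups = []
    for mz, trace in eics.items():
        for group in groups:
            if pearson(trace, eics[group[0]]) >= threshold:
                group.append(mz)
                break
        else:
            groups.append([mz])
    return groups


# Hypothetical EICs of four ions from two co-eluting compounds (apexes two scans apart).
eics = {
    73.0: [0, 1, 5, 9, 5, 1, 0, 0, 0],
    147.0: [0, 2, 10, 18, 10, 2, 0, 0, 0],
    91.0: [0, 0, 0, 1, 5, 9, 5, 1, 0],
    65.0: [0, 0, 0, 2, 9, 19, 10, 2, 0],
}
components = group_by_profile(eics, threshold=0.9)
```

Ions whose traces rise and fall together end up in the same component, here 73 with 147 and 91 with 65, even though all four overlap in retention time.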

5.5 Dynamic Range

Another issue is the dynamic range. In metabolomics, peaks can be below the detection limit or above it. Such dynamic range issues can cause loss of information.

5.5.1 Non-detects

Some of the data are limited by the limit of detection. We therefore need methods to impute the data if we don't want to lose information by deleting the NAs or zeros.

Two major kinds of imputation can be used. The first is a model-free method, such as replacing missing values with half the minimum of the values across the data, 0, 1, or the mean/median across the data (the enviGCMS package can do this via the getimputation function). The second is a model-based method such as a linear model, random forest, KNN or PCA. Try the simputation package for various imputation methods. As mentioned before, you can also use imputeRowMin or imputeRowMinRand within the xcms package to perform imputation.

Tobit regression is preferred for censored data. You might also choose maximum likelihood estimation (estimate the mean and standard deviation by MLE, create 10 complete samples, and pool the results from the 10 individual analyses).

## Call:
## tobit(formula = y ~ x, left = 0)
## Observations:
##          Total  Left-censored     Uncensored Right-censored 
##           1000              0           1000              0 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.0000     0.4415   2.265   0.0235 *  
## x            10.0000     0.3162  31.623   <2e-16 ***
## Log(scale)    2.1449     0.0000     Inf   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Scale: 8.541 
## Gaussian distribution
## Number of Newton-Raphson Iterations: 1 
## Log-likelihood: -3064 on 3 Df
## Wald-statistic:  1000 on 1 Df, p-value: < 2.22e-16

According to Ronald Hites's simulation (Hites 2019), replacing measurements below the LOD (even missing measurements) with LOD/2 or with \(LOD/\sqrt2\) causes little bias, and “Any time you have a % non-detected >20%, for whatever reason, it is unlikely that the data set can give useful results.”
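The substitution rule and the 20% check can be written as a short sketch. The function name, the use of `None` for missing measurements and the toy values are assumptions. The divisor can be set to 2 or to the square root of 2, matching the two options above.

```python
import math


def substitute_non_detects(values, lod, divisor=math.sqrt(2)):
    """Replace missing or below-LOD values by lod / divisor; also return the non-detect fraction."""
    is_non_detect = [v is None or v < lod for v in values]
    filled = [lod / divisor if nd else v for v, nd in zip(values, is_non_detect)]
    return filled, sum(is_non_detect) / len(values)


# Hypothetical measurements of one peak with a limit of detection of 1.0.
values = [None, 0.4, 2.0, 3.5, 5.0, 8.0, 1.2, 6.1, 2.2, 9.0]
filled, fraction = substitute_non_detects(values, lod=1.0, divisor=2)
worth_analysing = fraction <= 0.2   # threshold taken from the quotation above
```

Here two of ten values are non-detects, so both are replaced by 0.5 and the peak sits exactly at the 20% boundary.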

5.5.2 Over Detection Limit

CorrectOverloadedPeaks can be used to correct peaks exceeding the detection limit (Lisec et al. 2016).

5.6 RSD/fold change Filter

Some peaks need to be ruled out because of a high RSD% or a small fold change compared with blank samples.
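A minimal sketch of such a filter is shown below. It assumes that RSD% is computed on repeated QC injections and that the fold change is the ratio of mean sample intensity to mean blank intensity. The cutoffs of 30% and 3-fold, like the function names, are hypothetical defaults rather than recommendations from this chapter.

```python
import statistics


def rsd_percent(values):
    """Relative standard deviation in percent (sample standard deviation / mean)."""
    return statistics.stdev(values) / statistics.mean(values) * 100


def keep_peak(qc, samples, blanks, max_rsd=30.0, min_fold=3.0):
    """Keep a peak only if it is reproducible in QC and well above the blank level."""
    blank_mean = statistics.mean(blanks)
    # A peak that is absent from every blank counts as an infinite fold change.
    fold = float("inf") if blank_mean == 0 else statistics.mean(samples) / blank_mean
    return rsd_percent(qc) <= max_rsd and fold >= min_fold


# Hypothetical intensities for three peaks.
good = keep_peak(qc=[100, 110, 90, 105], samples=[100, 120], blanks=[10, 12])
noisy = keep_peak(qc=[100, 200, 50, 150], samples=[100, 120], blanks=[10, 12])
background = keep_peak(qc=[100, 110, 90, 105], samples=[100, 120], blanks=[90, 100])
```

Only the first peak passes: the second fails the RSD% cutoff and the third is barely above the blank level.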

5.7 Power Analysis Filter

As shown in the Experimental design (DoE) chapter, power analysis in metabolomics is ad hoc, since you don't know much before you perform the experiment. However, we can perform power analysis after the experiment is done. That is, we simply rule out the peaks with low power under the existing experimental design.
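One way to sketch such a post hoc filter is a normal approximation to the power of a two-sample comparison, computed per peak from the observed effect size. This is an assumption-laden illustration: the approximation ignores the small lower-tail term and the t distribution, no multiple-testing correction is applied, the groups are assumed to have non-zero variance, and the 0.8 cutoff is a conventional but arbitrary choice.

```python
import math
from statistics import NormalDist, mean, stdev


def approx_power(group_a, group_b, alpha=0.05):
    """Approximate power of a two-sided two-sample test at the observed effect size."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * stdev(group_a) ** 2 +
                  (n_b - 1) * stdev(group_b) ** 2) / (n_a + n_b - 2)
    effect = abs(mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    noncentrality = effect * math.sqrt(n_a * n_b / (n_a + n_b))
    return NormalDist().cdf(noncentrality - z_crit)


# Hypothetical intensities of two peaks in control and treated samples.
strong = approx_power([10, 11, 9, 10, 12], [20, 21, 19, 22, 20])
weak = approx_power([10, 11, 9, 10, 12], [10.5, 11, 9.5, 10, 12.5])
kept = [name for name, power in (("strong", strong), ("weak", weak)) if power >= 0.8]
```

Peaks such as the second one, whose observed difference is small relative to its spread, fall below the cutoff and would be ruled out.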

5.8 Software

5.8.1 Peak picking

  • ProteoWizard Toolkit provides a set of open-source, cross-platform software libraries and tools (Chambers et al. 2012). Msconvert is one tool in this toolkit.

  • xcms: LC/MS and GC/MS data analysis (Smith et al. 2006)

  • apLCMS generates peak lists (Yu et al. 2009)

  • X13CMS: global tracking of isotopic labels in untargeted metabolomics (Huang et al. 2014)

  • FTMSVisualization is a suite of tools for visualizing complex mixture FT-MS data (Kew et al. 2017)

  • MZmine is open-source software for mass spectrometry data processing, with the main focus on LC-MS data (Pluskal et al. 2010)

  • MS-DIAL is a universal program for untargeted metabolomics and lipidomics, supporting any type of chromatography/mass spectrometry method (GC/MS, GC-MS/MS, LC/MS, LC-MS/MS, etc.) (Tsugawa et al. 2015)

  • OpenMS is an open-source C++ software library for LC/MS data management and analysis (Röst et al. 2016)

  • MZmatch is a Java collection of small command-line tools specific to metabolomics MS data analysis (Scheltema et al. 2011; Creek et al. 2012)

  • iMet-Q is an automated tool with a friendly user interface for quantifying metabolites in full-scan liquid chromatography-mass spectrometry (LC-MS) data (Chang et al. 2016)

  • MAVEN is an open-source, cross-platform metabolomics data analyser (Melamud, Vastag, and Rabinowitz 2010)

5.8.2 For MS/MS

  • MS-DIAL: data-independent MS/MS deconvolution for comprehensive metabolome analysis (Tsugawa et al. 2015)

  • decoMS2: an untargeted metabolomic workflow to improve structural characterization of metabolites (Nikolskiy et al. 2013)

  • msPurity: automated evaluation of precursor ion purity for mass spectrometry-based fragmentation in metabolomics (Lawson et al. 2017)

  • ULSA: a MATLAB-based deconvolution algorithm and universal library search algorithm (ULSA) for the analysis of complex spectra generated via data-independent acquisition (Samanipour et al. 2018)

5.8.3 Improved Peak picking

  • IPO: a tool for automated optimization of XCMS parameters (Libiseller et al. 2015)

  • Warpgroup is used for chromatogram subregion detection, consensus integration bound determination and accurate missing value integration (Mahieu, Spalding, and Patti 2016)

  • xMSanalyzer: improved peak picking for xcms and apLCMS (Uppal et al. 2013)

  • MS-FLO: a tool to minimize false positive peak reports in untargeted liquid chromatography–mass spectroscopy (LC-MS) data processing (DeFelice et al. 2017)


Chambers, Matthew C., Brendan Maclean, Robert Burke, Dario Amodei, Daniel L. Ruderman, Steffen Neumann, Laurent Gatto, et al. 2012. “A Cross-Platform Toolkit for Mass Spectrometry and Proteomics.” Nat. Biotechnol. 30 (October): 918–20.

Chang, Hui-Yin, Ching-Tai Chen, T. Mamie Lih, Ke-Shiuan Lynn, Chiun-Gung Juo, Wen-Lian Hsu, and Ting-Yi Sung. 2016. “iMet-Q: A User-Friendly Tool for Label-Free Metabolomics Quantitation Using Dynamic Peak-Width Determination.” PLOS ONE 11 (1): e0146112.

Creek, Darren J., Andris Jankevics, Karl E. V. Burgess, Rainer Breitling, and Michael P. Barrett. 2012. “IDEOM: An Excel Interface for Analysis of LCMS-Based Metabolomics Data.” Bioinformatics 28 (7): 1048–9.

DeFelice, Brian C., Sajjan Singh Mehta, Stephanie Samra, Tomáš Čajka, Benjamin Wancewicz, Johannes F. Fahrmann, and Oliver Fiehn. 2017. “Mass Spectral Feature List Optimizer (MS-FLO): A Tool to Minimize False Positive Peak Reports in Untargeted Liquid Chromatography–Mass Spectroscopy (LC-MS) Data Processing.” Anal. Chem. 89 (6): 3250–5.

Domingo-Almenara, Xavier, Jesus Brezmes, Maria Vinaixa, Sara Samino, Noelia Ramirez, Marta Ramon-Krauel, Carles Lerin, et al. 2016. “eRah: A Computational Tool Integrating Spectral Deconvolution and Alignment with Quantification and Identification of Metabolites in GC/MS-Based Metabolomics.” Anal. Chem. 88 (19): 9821–9.


Fu, Hai-Yan, Ou Hu, Yue-Ming Zhang, Li Zhang, Jing-Jing Song, Peang Lu, Qing-Xia Zheng, et al. 2017. “Mass-Spectra-Based Peak Alignment for Automatic Nontargeted Metabolic Profiling Analysis for Biomarker Screening in Plant Samples.” Journal of Chromatography A 1513 (Supplement C): 201–9.

Hites, Ronald A. 2019. “Correcting for Censored Environmental Measurements.” Environ. Sci. Technol., September.

Huang, Xiaojing, Ying-Jr Chen, Kevin Cho, Igor Nikolskiy, Peter A. Crawford, and Gary J. Patti. 2014. “X13CMS: Global Tracking of Isotopic Labels in Untargeted Metabolomics.” Anal. Chem. 86 (3): 1632–9.

Kew, William, John W. T. Blackburn, David J. Clarke, and Dušan Uhrín. 2017. “Interactive van Krevelen Diagrams Visualisation of Mass Spectrometry Data of Complex Mixtures.” Rapid Commun. Mass Spectrom. 31 (7): 658–62.

Lawson, Thomas N., Ralf J. M. Weber, Martin R. Jones, Andrew J. Chetwynd, Giovanny Rodríguez-Blanco, Riccardo Di Guida, Mark R. Viant, and Warwick B. Dunn. 2017. “msPurity: Automated Evaluation of Precursor Ion Purity for Mass Spectrometry-Based Fragmentation in Metabolomics.” Anal. Chem. 89 (4): 2432–9.

Libiseller, Gunnar, Michaela Dvorzak, Ulrike Kleb, Edgar Gander, Tobias Eisenberg, Frank Madeo, Steffen Neumann, et al. 2015. “IPO: A Tool for Automated Optimization of XCMS Parameters.” BMC Bioinformatics 16 (April): 118.

Lisec, Jan, Friederike Hoffmann, Clemens Schmitt, and Carsten Jaeger. 2016. “Extending the Dynamic Range in Metabolomics Experiments by Automatic Correction of Peaks Exceeding the Detection Limit.” Anal. Chem. 88 (15): 7487–92.

Mahieu, Nathaniel G., Jonathan L. Spalding, and Gary J. Patti. 2016. “Warpgroup: Increased Precision of Metabolomic Data Processing by Consensus Integration Bound Analysis.” Bioinformatics 32 (2): 268–75.

Melamud, Eugene, Livia Vastag, and Joshua D. Rabinowitz. 2010. “Metabolomic Analysis and Visualization Engine for LC-MS Data.” Anal. Chem. 82 (23): 9818–26.

Ni, Yan, Mingming Su, Yunping Qiu, Wei Jia, and Xiuxia Du. 2016. “ADAP-GC 3.0: Improved Peak Detection and Deconvolution of Co-Eluting Metabolites from GC/TOF-MS Data for Metabolomics Studies.” Anal. Chem. 88 (17): 8802–11.

Nikolskiy, Igor, Nathaniel G. Mahieu, Ying-Jr Chen, Ralf Tautenhahn, and Gary J. Patti. 2013. “An Untargeted Metabolomic Workflow to Improve Structural Characterization of Metabolites.” Anal. Chem. 85 (16): 7713–9.

Pluskal, Tomáš, Sandra Castillo, Alejandro Villar-Briones, and Matej Orešič. 2010. “MZmine 2: Modular Framework for Processing, Visualizing, and Analyzing Mass Spectrometry-Based Molecular Profile Data.” BMC Bioinformatics 11: 395.

Röst, Hannes L., Timo Sachsenberg, Stephan Aiche, Chris Bielow, Hendrik Weisser, Fabian Aicheler, Sandro Andreotti, et al. 2016. “OpenMS: A Flexible Open-Source Software Platform for Mass Spectrometry Data Analysis.” Nat Meth 13 (9): 741–48.

Samanipour, Saer, Malcolm J. Reid, Kine Bæk, and Kevin V. Thomas. 2018. “Combining a Deconvolution and a Universal Library Search Algorithm for the Nontarget Analysis of Data-Independent Acquisition Mode Liquid Chromatography-High-Resolution Mass Spectrometry Results.” Environ. Sci. Technol. 52 (8): 4694–4701.

Scheltema, Richard A., Andris Jankevics, Ritsert C. Jansen, Morris A. Swertz, and Rainer Breitling. 2011. “PeakML/mzMatch: A File Format, Java Library, R Library, and Tool-Chain for Mass Spectrometry Data Analysis.” Anal. Chem. 83 (7): 2786–93.

Smith, Colin A., Elizabeth J. Want, Grace O’Maille, Ruben Abagyan, and Gary Siuzdak. 2006. “XCMS: Processing Mass Spectrometry Data for Metabolite Profiling Using Nonlinear Peak Alignment, Matching, and Identification.” Anal. Chem. 78 (3): 779–87.

Tautenhahn, Ralf, Christoph Böttcher, and Steffen Neumann. 2008. “Highly Sensitive Feature Detection for High Resolution LC/MS.” BMC Bioinformatics 9: 504.

Tsugawa, Hiroshi, Tomas Cajka, Tobias Kind, Yan Ma, Brendan Higgins, Kazutaka Ikeda, Mitsuhiro Kanazawa, Jean VanderGheynst, Oliver Fiehn, and Masanori Arita. 2015. “MS-DIAL: Data-Independent MS/MS Deconvolution for Comprehensive Metabolome Analysis.” Nat Meth 12 (6): 523–26.

Uppal, Karan, Quinlyn A. Soltow, Frederick H. Strobel, W. Stephen Pittard, Kim M. Gernert, Tianwei Yu, and Dean P. Jones. 2013. “xMSanalyzer: Automated Pipeline for Improved Feature Detection and Downstream Analysis of Large-Scale, Non-Targeted Metabolomics Data.” BMC Bioinformatics 14 (1): 15.

Yang, Ruochen, Xi Chen, and Idoia Ochoa. 2019. “MassComp, a Lossless Compressor for Mass Spectrometry Data.” BMC Bioinformatics 20 (1): 368.

Yu, Tianwei, Youngja Park, Jennifer M. Johnson, and Dean P. Jones. 2009. “apLCMS–Adaptive Processing of High-Resolution LC/MS Data.” Bioinformatics 25 (15): 1930–6.