2 Data Description
Computing the sample quality index outputs and plots requires several input datasets, such as genelist, the coverage pileup, and sampleInfo.
Descriptions of the input data and variable names are provided below.
genelist: a vector of gene names
pileupPath: a vector of file paths to the coverage pileupData, including the .RData file names
geneInfo: a data frame of gene information including gene ID and properties
    gene_id: Ensembl gene ID
    geneSymbol: gene names
    merged: gene length
    exon.wtpct_gc: weighted percentage of GC from exon level data
    subcategory: protein coding or lncRNA
sampleInfo: a data frame of sample information including sample ID and properties from Picard RnaSeqMetrics
    SampleID: sample ID
    PF_BASES: the total number of bases within the PF_READS of the SAM or BAM file to be examined
    PF_ALIGNED_BASES: the total number of aligned bases, in all mapped PF reads, that are aligned to the reference sequence
    RIBOSOMAL_BASES: number of bases in primary alignments that align to ribosomal sequence
    CODING_BASES: number of bases in primary alignments that align to a non-UTR coding base for some gene, and not ribosomal sequence
    UTR_BASES: number of bases in primary alignments that align to a UTR base for some gene, and not a coding base
    INTRONIC_BASES: number of bases in primary alignments that align to an intronic base for some gene, and not a coding or UTR base
    INTERGENIC_BASES: number of bases in primary alignments that do not align to any gene
    RINs: RIN value
TPM: a data frame for TPM normalization
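As a rough illustration of the input structure described above, the sketch below builds small toy versions of geneInfo and sampleInfo and checks that the expected columns are present. All values are made up, `check_columns()` is a hypothetical helper and not part of any package, and the labels used for subcategory as well as the scale of exon.wtpct_gc are assumptions.

```r
# Toy geneInfo with the documented column names; every value is invented.
geneInfo_toy <- data.frame(
  gene_id       = c("TOY_GENE_ID_1", "TOY_GENE_ID_2"),  # placeholders, not real Ensembl IDs
  geneSymbol    = c("GENE_A", "GENE_B"),
  merged        = c(2500, 1800),                        # gene length
  exon.wtpct_gc = c(0.48, 0.55),                        # scale is an assumption
  subcategory   = c("protein_coding", "lncRNA"),        # labels are assumptions
  stringsAsFactors = FALSE
)

# Toy sampleInfo with the documented Picard RnaSeqMetrics columns.
sampleInfo_toy <- data.frame(
  SampleID         = c("S1", "S2", "S3"),
  PF_BASES         = c(1000, 1200, 900),
  PF_ALIGNED_BASES = c(900, 1100, 800),
  RIBOSOMAL_BASES  = c(0, 1, 0),
  CODING_BASES     = c(300, 400, 250),
  UTR_BASES        = c(200, 250, 150),
  INTRONIC_BASES   = c(250, 300, 250),
  INTERGENIC_BASES = c(150, 149, 150),
  RINs             = c(7.2, NA, 5.1),                   # RIN may be missing
  stringsAsFactors = FALSE
)

# Hypothetical helper: stop with a clear message if a required column is absent.
check_columns <- function(df, required) {
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

gene_cols <- c("gene_id", "geneSymbol", "merged", "exon.wtpct_gc", "subcategory")
sample_cols <- c("SampleID", "PF_BASES", "PF_ALIGNED_BASES", "RIBOSOMAL_BASES",
                 "CODING_BASES", "UTR_BASES", "INTRONIC_BASES",
                 "INTERGENIC_BASES", "RINs")

check_columns(geneInfo_toy, gene_cols)
check_columns(sampleInfo_toy, sample_cols)
```

Running a check like this before the analysis makes a misnamed or missing column fail early with a readable message.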
Alliance/CALGB
This example consists of 1,000 genes selected from protein coding and lncRNA genes, and 171 fresh frozen, total RNA-seq (FFT) samples. Of these samples, 156 are tumor and the remaining 15 are normal.
The summary table from summarytools shows descriptive statistics for the provided datasets, which helps in reviewing their distributions and missing values.
library(summarytools)
library(dplyr)
descr(
sampleInfo %>% select(-c(SampleID)),
stats = c("min", "med", "max", "n.valid"),
transpose = TRUE,
headings = FALSE
)
##
##                                    Min           Median              Max   N.Valid
## ---------------------- --------------- ---------------- ---------------- ---------
##           CODING_BASES    360227386.00    3208673516.00    7201831541.00    171.00
##       INTERGENIC_BASES   1273275706.00    2567603596.00   16945505453.00    171.00
##         INTRONIC_BASES    434293079.00    5593128001.00   10076495286.00    171.00
##       PF_ALIGNED_BASES   5448216119.00   14566862147.00   23291961710.00    171.00
##               PF_BASES   6481274100.00   16315945500.00   26148083100.00    171.00
##        RIBOSOMAL_BASES            0.00           150.00          6600.00    171.00
##                   RINs            1.10             5.30             9.20    167.00
##              UTR_BASES    381863999.00    2783953295.00    4803719959.00    171.00
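In the table above, RINs has 167 valid values out of 171 samples, so four samples have no RIN value. One way to list such gaps per column is sketched below with base R. It uses a small toy data frame (`sampleInfo_toy`, with invented values) rather than the real sampleInfo.

```r
# Toy data frame with invented values; RINs is missing for two samples.
sampleInfo_toy <- data.frame(
  SampleID = c("S1", "S2", "S3", "S4"),
  PF_BASES = c(100, 200, 300, 400),
  RINs     = c(7.1, NA, 5.3, NA),
  stringsAsFactors = FALSE
)

# Drop the ID column, then count missing and valid values per column.
metrics   <- sampleInfo_toy[, setdiff(names(sampleInfo_toy), "SampleID"), drop = FALSE]
n_missing <- colSums(is.na(metrics))
n_valid   <- nrow(metrics) - n_missing

# IDs of samples with a missing RIN value.
missing_rin_ids <- sampleInfo_toy$SampleID[is.na(sampleInfo_toy$RINs)]

print(data.frame(n_valid = n_valid, n_missing = n_missing))
print(missing_rin_ids)
```

Applied to the real sampleInfo, the same pattern would identify which samples lack a RIN value before any step that depends on it.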