3  Explore composition

3.1 Pre-processing

The trans() function contains many normalization methods, suitable for pre-processing of different omics, some refer to vegan::decostand().

Table 3.1: Normalization methods used in omics
Method Description
cpm Counts per million
minmax linear transfer to (min, max)
acpm Counts per million, then asinh transfer
log1 $log(n+1)$ transformat
total divide by total
max divide by maximum
frequency divide by total and multiply by the number of non-zero items, so that the average of non-zero entries is one.
normalize make margin sum of squares equal to one
range standardize values into range (0,1) (same to minmax(0,1)). If all values are constant, they will be transformed to 0.
rank rank replaces abundance values by their increasing ranks leaving zeros unchanged.
rrank rrank is similar but uses relative ranks with maximum 1.
pa scale x to presence/absence scale (0/1).
standardize scale x to zero mean and unit variance.
hellinger square root of method = "total"
log logarithmic transformation as suggested by Anderson et al. (2006): $log_b(x)+1$ for x>0, where b is the base of the logarithm; zeros are left as zeros.
alr Additive log ratio ('alr') transformation (Aitchison 1986) reduces data skewness and compositionality bias.
clr centered log ratio ('clr') transformation proposed by Aitchison (1986) reduces data skewness and compositionality bias.
rclr robust clr ('rclr') is similar to regular clr (see above) but allows data that contains zeroes.
trans(otutab, method = "log1") %>% head()
NS1 NS2 NS3 NS4 NS5 NS6 WS1 WS2 WS3 WS4 WS5 WS6 CS1 CS2 CS3 CS4 CS5 CS6
s__un_f__Thermomonosporaceae 6.996682 7.560601 6.698268 7.211557 6.970730 6.976348 7.133296 7.376508 7.193686 6.848005 7.118016 6.919684 7.746733 7.831617 7.444249 7.588830 7.266827 7.331715
s__Pelomonas_puraquae 7.582229 7.118826 7.767687 7.712891 7.973844 7.512071 6.469250 6.206576 7.115582 7.158514 6.860664 6.455199 7.174724 7.324490 6.739337 7.029088 7.302496 7.069023
s__Rhizobacter_bergeniae 6.378426 6.129050 6.791221 6.804614 7.112327 6.749931 6.405228 6.154858 6.976348 6.936343 6.741701 6.508769 6.937314 7.497207 6.910751 7.090910 7.085902 6.637258
s__Flavobacterium_terrae 5.501258 5.459586 7.501634 6.513230 7.276556 6.198479 5.765191 7.563720 7.309212 6.903747 6.359574 5.886104 6.985642 7.105786 6.626718 6.049734 6.940222 7.253470
s__un_g__Rhizobacter 7.267525 6.023448 6.280396 6.633318 7.162397 6.228511 6.222576 6.381816 6.100319 6.431331 6.489205 6.063785 7.032624 7.277939 6.311735 6.369901 7.008505 6.806829
s__un_o__Burkholderiales 6.787845 6.527958 6.715383 6.816736 7.315218 6.937314 5.463832 5.533390 5.886104 5.945421 5.961005 5.863631 6.313548 6.293419 6.169611 6.327937 6.242223 6.208590

rm_low(), guolv() and hebing() functions can help filter or aggregate the omics data.

3.2 Stackplot

# sum the abundance to Phylum level
hebing(otutab, taxonomy$Phylum, margin = 1, act = "sum") -> phylum
stackplot(phylum, metadata, group = "Group", topN = 10) +
  scale_fill_manual(values = get_cols(10))
stackplot(phylum, metadata,
  group = "Group", style = "sample",
  group_order = TRUE, flow = TRUE, relative = FALSE
) +
  scale_fill_manual(values = get_cols(10))
Figure 3.1: Stack plot
Figure 3.2: Stack flow plot with sample style

The stackplot function offers a wide range of parameters to facilitate the adjustment of the final graphical output. For further details and examples on drawing bar plots and their transformations in R, please refer to R Draw the bar plot and its transformation.

3.3 Rarefaction

Rarefaction is a method used to assess the relationship between sample community data sampling effort and species richness within the sample. Commonly used in ecology, it evaluates whether the observed species diversity saturates given a certain number of samples or sequences, and whether additional sampling is needed to fully capture species richness within the sample.

For sample data, Rarefaction curves display the number of observed species at different levels of sampling effort (or sequence numbers). By plotting Rarefaction curves, one can assess how species diversity changes with varying sampling depths and determine if saturation has been reached.

For species data, Rarefaction curves display the number of observed samples at different levels of sampling effort (or sequence numbers). Through this method, one can evaluate how sample numbers change with different sampling depths and determine if additional samples are necessary to fully capture species richness.

a <- rare_curve_sample(otutab)
plot(a)

a <- rare_curve_species(otutab, mode = 1)
plot(a)
Figure 3.3: Rarefaction curves of sample data
Figure 3.4: Rarefaction curves of species data

3.4 Phylogenetic tree

ann_tree(taxonomy, otutab) -> tree
easy_tree(tree, add_abundance = FALSE)
Figure 3.5: Phylogenetic tree
easy_tree(tree, add_tiplab = FALSE) -> p
some_tax <- table(taxonomy$Phylum) %>%
  sort(decreasing = TRUE) %>%
  head(5) %>%
  names()
add_strip(p, some_tax)
Figure 3.6: Phylogenetic tree with some strips

You can refer to Basic phylogenetic tree and Advanced phylogenetic tree to plot the phylogenetic tree using ggtree.

In addition, you can also choose to use online platforms like iPhylo and iTOL for interactive plotting.

iPhylo is a free online platform for generating, annotating, and visualizing phylogenetic trees. It allows you to plot various tree diagrams for species, compounds, and other hierarchical structures and conveniently add complex annotation information.

iPhylo Website: https://iphylo.net

For more information, you can check out our Online Guide Book.

sangji_plot(tree, width = 900, height = 700)
Figure 3.7: Sankey plot of taxanomy
sunburst(tree)
Figure 3.8: Sunburst plot of taxanomy

3.5 Rtaxonkit

Taxonkit is a Practical and Efficient NCBI Taxonomy Toolkit (1).

We recommend you download this excellent software to help next analysis. Or you can use Taxonkit in R by pctax interface as followed:

# 1. This function help you install suitable version taxonkit
install_taxonkit()
# taxonkit has been successfully installed!

# 2. Then download the NCBI taxonomy database.
download_taxonkit_dataset()
# Taxonkit files downloaded and copied successfully.

# 3. Check whether taxonkit is ready
check_taxonkit()
# ==============Taxonkit is available if there is help message above==============
# =========================Taxonkit dataset is available!=========================

Then you can use taxonkit in R just like in terminal.

?taxonkit_lineage

# taxonkit_list
# taxonkit_reformat
# taxonkit_name2taxid
# taxonkit_filter
# taxonkit_lca
lineage <- taxonkit_lineage("9606\n63221", show_name = TRUE, show_rank = TRUE, text = T)
lineage
[1] "9606\tcellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens\tHomo sapiens\tspecies"                                                   
[2] "63221\tcellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens;Homo sapiens neanderthalensis\tHomo sapiens neanderthalensis\tsubspecies"