2  Construction

2.1 Pre-processing

The trans() function provides many normalization methods suitable for pre-processing different omics data; some of them refer to vegan::decostand().

Table 2.1: Normalization methods used in omics

| Method | Description |
|---|---|
| cpm | Counts per million. |
| minmax | Linear transformation to the range (min, max). |
| acpm | Counts per million, then asinh transformation. |
| log1 | $log(n+1)$ transformation. |
| total | Divide by the total. |
| max | Divide by the maximum. |
| frequency | Divide by the total and multiply by the number of non-zero items, so that the average of non-zero entries is one. |
| normalize | Make the margin sum of squares equal to one. |
| range | Standardize values into the range (0, 1) (same as minmax(0, 1)). If all values are constant, they are transformed to 0. |
| rank | Replace abundance values by their increasing ranks, leaving zeros unchanged. |
| rrank | Similar to rank, but uses relative ranks with maximum 1. |
| pa | Scale x to the presence/absence scale (0/1). |
| standardize | Scale x to zero mean and unit variance. |
| hellinger | Square root of method = "total". |
| log | Logarithmic transformation as suggested by Anderson et al. (2006): $log_b(x)+1$ for $x>0$, where $b$ is the base of the logarithm; zeros are left as zeros. |
| alr | Additive log ratio (alr) transformation (Aitchison 1986); reduces data skewness and compositionality bias. |
| clr | Centered log ratio (clr) transformation (Aitchison 1986); reduces data skewness and compositionality bias. |
| rclr | Robust clr (rclr); similar to regular clr (see above) but allows data that contain zeros. |
trans(otutab, method = "log1") %>% head()
NS1 NS2 NS3 NS4 NS5 NS6 WS1 WS2 WS3 WS4 WS5 WS6 CS1 CS2 CS3 CS4 CS5 CS6
s__un_f__Thermomonosporaceae 6.996682 7.560601 6.698268 7.211557 6.970730 6.976348 7.133296 7.376508 7.193686 6.848005 7.118016 6.919684 7.746733 7.831617 7.444249 7.588830 7.266827 7.331715
s__Pelomonas_puraquae 7.582229 7.118826 7.767687 7.712891 7.973844 7.512071 6.469250 6.206576 7.115582 7.158514 6.860664 6.455199 7.174724 7.324490 6.739337 7.029088 7.302496 7.069023
s__Rhizobacter_bergeniae 6.378426 6.129050 6.791221 6.804614 7.112327 6.749931 6.405228 6.154858 6.976348 6.936343 6.741701 6.508769 6.937314 7.497207 6.910751 7.090910 7.085902 6.637258
s__Flavobacterium_terrae 5.501258 5.459586 7.501634 6.513230 7.276556 6.198479 5.765191 7.563720 7.309212 6.903747 6.359574 5.886104 6.985642 7.105786 6.626718 6.049734 6.940222 7.253470
s__un_g__Rhizobacter 7.267525 6.023448 6.280396 6.633318 7.162397 6.228511 6.222576 6.381816 6.100319 6.431331 6.489205 6.063785 7.032624 7.277939 6.311735 6.369901 7.008505 6.806829
s__un_o__Burkholderiales 6.787845 6.527958 6.715383 6.816736 7.315218 6.937314 5.463832 5.533390 5.886104 5.945421 5.961005 5.863631 6.313548 6.293419 6.169611 6.327937 6.242223 6.208590

The rm_low(), guolv() and hebing() functions can help filter or aggregate omics data.
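
The sketch below shows how these helpers might be called; the argument forms and the taxonomy mapping table are assumptions, so check ?rm_low, ?guolv and ?hebing for the real signatures.

# Hypothetical usage sketch; argument names and defaults are assumptions.
rm_low(otutab) -> otutab_f                   # drop low-abundance features
guolv(otutab) -> otutab_f2                   # "guolv" means filter
hebing(otutab, taxonomy$Genus) -> genus_tab  # "hebing" means merge; `taxonomy` is a hypothetical feature-to-genus table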

2.2 Pairwise relationship

Pairwise relationship is the basis of network construction.

How do we determine pairwise relationships? Because direct experimental evidence is generally scarce, we mainly rely on statistical inference.

At present, there are two main statistical approaches:

  • Similarity or correlation (1). For example: Spearman, Pearson, Bray-Curtis, etc. A similarity matrix between paired features is calculated, randomized data are used to repeatedly estimate significance, and only meaningful similarities are retained in the network (see the sketch after this list).

  • Regression. Divide features into sources and targets, and use multiple regression to estimate the relationships between features.
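
To make the significance step concrete, here is a minimal base-R sketch of a permutation test for a single feature pair; it only illustrates the idea and is not MetaNet's internal implementation.

# Shuffle one vector repeatedly and compare the observed correlation
# against the resulting null distribution.
set.seed(1)
x <- rnorm(50)
y <- x + rnorm(50)
obs <- cor(x, y, method = "spearman")
perm <- replicate(999, cor(x, sample(y), method = "spearman"))
p_value <- (sum(abs(perm) >= abs(obs)) + 1) / (length(perm) + 1)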

Some tools, such as SparCC, use specialized methods to optimize network construction.

Correlation

Correlation is a statistical term describing the degree to which two variables move in coordination with one another.

Correlation calculation is the first step in all omics network analysis software, and there are many methods to obtain \(\rho\) and the \(p\)-value. However, as omics datasets grow larger and larger, some methods become very time- and resource-consuming.

Here, we provide the c_net_calculate() function to calculate correlations quickly for a single table or between two tables (Figure 2.2); it returns a corr object containing \(\rho\) and the \(p\)-values.

# single table
t(otutab) -> totu
c_net_calculate(totu, method = "spearman") -> corr
corr
## Correlation table:
## Table dimensions: 485 rows, 485 columns
## Use method:  spearman
# two tables
metadata[, 3:10] -> env
c_net_calculate(totu, env, method = "spearman") -> corr2
corr2
## Correlation table:
## Table dimensions: 485 rows, 8 columns
## Use method:  spearman

If your input table is very large, the calculation may take a long time; you can save the result and read it back later.

c_net_calculate(matrix(rnorm(1000 * 100), ncol = 1000), method = "spearman") -> corr1

# save the result (a new name is used so the `corr` object above is kept for later sections)
save_corr(corr1, "../saved_corr1")
read_corr("../saved_corr1") -> corr_new

# they are identical
identical(corr1, corr_new)

According to this test, a table with 1,000 features takes about 0.2 s to calculate the correlation, and the saved .corr file is about 6 MB.

Time, memory consumption, and saved file size increase approximately linearly with the square of the number of features, as shown in the following figure. They also increase slightly with the number of samples.

Code
library(reshape2) # for melt()
library(ggplot2)

feature_number <- c(100, 500, 1000, 2000, 3000, 4000)
time <- c()
memory <- c()
save_size <- c()
for (i in feature_number) {
  print(i)
  matrix(rnorm(i * 50), ncol = i) -> tmp
  system.time({
    c_net_calculate(tmp, method = "spearman") -> tmp_corr
  }) -> time1
  time <- c(time, time1[3])
  memory <- c(memory, object.size(tmp_corr))
  save_corr(tmp_corr, "../tmp_corr.corr")
  save_size <- c(save_size, file.size("../tmp_corr.corr"))
}

plot_data <- data.frame(
  feature_number = feature_number,
  `time(s)` = time, `memory(MB)` = memory / 1024 / 1024,
  `save_size(MB)` = save_size / 1024 / 1024, check.names = F
)
plot_data <- melt(plot_data, id.vars = "feature_number")

ggplot(plot_data, aes(x = feature_number, y = value, color = variable)) +
  geom_line() +
  geom_point() +
  facet_wrap(~variable, scales = "free") +
  labs(
    title = "Time and memory consumption of `c_net_calculate`",
    x = "feature number", y = "value"
  ) +
  theme_bw() +
  theme(legend.position = "none")
Figure 2.1: Time and memory consumption of c_net_calculate

c_net_calculate() is much faster than other R packages, especially when the number of features is large:

Figure 2.2: Comparison of correlation calculation speed between different packages
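
To get a rough feel for the speed difference locally, one can time c_net_calculate() against base R's cor(), for example:

# Rough benchmark sketch; cor() returns only the coefficients (no p-values),
# so this understates the gap versus packages that also compute significance.
tmp <- matrix(rnorm(2000 * 50), ncol = 2000)
system.time(c_net_calculate(tmp, method = "spearman"))
system.time(cor(tmp, method = "spearman"))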

SparCC

SparCC fits a Dirichlet distribution to the observed data and iteratively re-estimates the species proportions and correlations several times; the resulting correlation is the median of the distribution. The \(p\)-values are calculated using a bootstrap method.

This metric is said to be more useful with non-normal microbiome data.

\[D(x_i, x_j) = \mathrm{var}\left(\log\frac{x_i}{x_j}\right)\]
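
A minimal base-R sketch of this log-ratio variance statistic (the pseudocount of 1 to avoid log(0) is an arbitrary choice, and this is only the pairwise statistic, not the full SparCC iteration):

# Variance of the log-ratio between two feature vectors; a small value
# suggests the two features stay proportional across samples.
lr_var <- function(xi, xj, pseudo = 1) {
  var(log((xi + pseudo) / (xj + pseudo)))
}
lr_var(totu[, 1], totu[, 2])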

c_net_calculate(totu, method = "sparcc")
## NULL corr object.

Distance

We should first understand the dissimilarity indices (distances) used for community data. The vegan::vegdist() function provides a wide range of abundance-based distance metrics (Table 2.2). Additionally, there are distance metrics that incorporate phylogenetic information, such as "unifrac", "beta_mpd", "beta_mntd", and "phylosor".

Similarity indices are the complement of distance indices, often computed as \(1-\text{distance}\). We can also use c_net_calculate() to calculate distances. Similarity indices are more suitable for describing relationships between samples.

c_net_calculate(otutab, method = "bray") -> sim_df
sim_df
## Similarity (1-Distance) table:
## Table dimensions: 18 rows, 18 columns
## Use method:  bray
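
The same Bray-Curtis similarity can also be obtained from vegan directly:

# vegdist() computes distances between rows, so samples must be rows;
# similarity is then 1 - distance.
library(vegan)
bray_d <- vegdist(t(otutab), method = "bray")
bray_sim <- 1 - as.matrix(bray_d)  # 18 x 18 sample similarity matrix
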
Table 2.2: Some dissimilarity indices for community

| Distance coefficient | Description |
|---|---|
| Manhattan | Manhattan distance calculates the sum of absolute differences between coordinates in each dimension. It is suitable for data with categorical variables or attributes. |
| Euclidean | Euclidean distance calculates the straight-line distance between two points in a multidimensional space. It is commonly used when data are continuous and have no categorical attributes. |
| Canberra | Canberra distance is a weighted version of Manhattan distance that emphasizes the importance of small differences between coordinates. It is often used for ecological data analysis. |
| Clark | Clark distance measures the proportion of non-zero attributes that are different between two samples. |
| Bray | Bray distance calculates the dissimilarity between two samples based on the sum of absolute differences in abundances. |
| Kulczynski | Kulczynski distance measures the similarity between two samples based on the arithmetic mean of proportions of common attributes. |
| Jaccard | Jaccard distance measures dissimilarity between two samples based on the proportion of unique attributes. It is commonly used in ecology for binary data. |
| Gower | Gower distance is a generalization of Manhattan distance for mixed data types, including categorical and continuous variables. |
| AltGower | AltGower distance is an alternative form of Gower distance that uses an alternative method to standardize continuous variables. |
| Morisita | Morisita distance measures dissimilarity between two samples based on the frequency of shared attributes, with emphasis on rare attributes. |
| Horn | Horn distance measures dissimilarity between two samples based on the geometric mean of proportions of shared attributes. |
| Mountford | Mountford distance measures dissimilarity between two samples based on the reciprocal of the arithmetic mean of proportions of shared attributes. |
| Raup | Raup distance measures dissimilarity between two samples based on the probability of shared attributes. |
| Binomial | Binomial distance measures dissimilarity between two samples based on the probability of observing shared attributes. |
| Chao | Chao distance is a modification of Jaccard distance that adjusts for undersampling of rare species. |
| Cao | Cao distance is a modification of Chao distance that incorporates species abundance information. |
| Mahalanobis | Mahalanobis distance measures the distance between two samples in multidimensional space, accounting for correlation between variables. |
| Chisq | Chisq distance calculates the dissimilarity between two samples based on the chi-squared distance between categorical variables. |
| Chord | Chord distance calculates dissimilarity between two samples based on the angles between points in multidimensional space. |
| Hellinger | Hellinger distance measures dissimilarity between two samples based on the square root of the sum of squared differences in square root-transformed abundances. |
| Aitchison | Aitchison distance is a transformation of compositional data that allows for Euclidean distance calculation in log-ratio space. |
| Robust Aitchison | Robust Aitchison distance is a robust version of Aitchison distance that reduces the influence of outliers in the data. |
| Unifrac | Unifrac distance measures dissimilarity between microbial communities based on evolutionary distances in a phylogenetic tree. |
| Beta MPD | Beta MPD (Mean Phylogenetic Distance) measures the phylogenetic diversity between two communities based on the mean phylogenetic distance of species pairs. |
| Beta MNTD | Beta MNTD (Mean Nearest Taxon Distance) measures the phylogenetic turnover between two communities based on the mean nearest taxon distance. |
| Phylosor | Phylosor distance measures the dissimilarity between communities based on the proportion of shared branches in a phylogenetic tree. |

Others

There are some other methods available for network construction in NetCoMi:

  • Biweight Midcorrelation: bicor() from the WGCNA package (see the sketch after this list)

  • CCLasso (R code on GitHub)

  • CCREPE (ccrepe package)

  • SPRING (SPRING package)

  • gCoda (R code on GitHub)

  • propr (propr package)

  • Kullback-Leibler divergence (KLD) (KLD() from LaplacesDemon package)

  • Jeffrey divergence (own code using KLD() from LaplacesDemon package)

  • Jensen-Shannon divergence (own code using KLD() from LaplacesDemon package)
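
For example, the biweight midcorrelation between features can be computed directly, assuming WGCNA is installed:

# Biweight midcorrelation between the columns (features) of totu;
# more robust to outliers than Pearson correlation.
library(WGCNA)
bicor_mat <- bicor(totu)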

2.3 Build network

Standard build

If you have completed the c_net_calculate() step, you can easily get a network (a metanet object, which inherits from the igraph class) with c_net_build(). Some common attributes are set automatically.

c_net_build(corr, r_thres = 0.6, p_thres = 0.05, delete_single = T) -> co_net
co_net
## =================================== metanet ==================================== 
## IGRAPH 1fe7058 UNW- 483 1513 -- 
## + attr: n_type (g/c), name (v/c), v_group (v/c), v_class (v/c), size
## | (v/n), label (v/c), shape (v/c), color (v/c), id (e/n), from (e/c),
## | to (e/c), weight (e/n), cor (e/n), p.value (e/n), e_type (e/c), width
## | (e/n), color (e/c), e_class (e/c), lty (e/n)
## + edges from 1fe7058 (vertex names):
## [1] s__un_f__Thermomonosporaceae--s__Actinocorallia_herbida    
## [2] s__un_f__Thermomonosporaceae--s__Kribbella_catacumbae      
## [3] s__un_f__Thermomonosporaceae--s__Kineosporia_rhamnosa      
## [4] s__un_f__Thermomonosporaceae--s__un_f__Micromonosporaceae  
## [5] s__un_f__Thermomonosporaceae--s__Flavobacterium_saliperosum
## + ... omitted several edges
plot(co_net)
Figure 2.3: Simple co-occurrence network
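
Because metanet inherits from igraph, the usual igraph accessors can be used to inspect the attributes that c_net_build() sets, for example:

# Inspect edge attributes of the built network
library(igraph)
head(E(co_net)$cor)      # correlation coefficients of retained edges
head(E(co_net)$p.value)  # corresponding p-values
table(E(co_net)$e_type)  # edge types (values depend on the network, e.g. positive/negative)
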
c_net_build(corr2) -> co_net2
plot(co_net2)
Figure 2.4: Simple bipartite network

Multi-tables

When you have more than two tables for correlation network analysis, you can use multi_net_build() to calculate correlations and build the network.

For subsequent multi-omics analysis, see Chapter 7.

data("multi_test", package = "MetaNet")
# microbiome
dim(micro)
## [1] 18 50
# metabolome
dim(metab)
## [1] 18 50
# transcriptome
dim(transc)
## [1] 18 50

multi_net_build(list(
  Microbiome = micro,
  Metabolome = metab,
  Transcriptome = transc
)) -> multi1

plot(multi1)
Figure 2.5: Simple multi-omics network

Edgelist

If you have already obtained pairwise relationships from other approaches (e.g. a database), you can format them as an edgelist and use c_net_from_edgelist() to build a network. This is useful for subsequent analysis.

data("edgelist", package = "MetaNet")
dnet <- c_net_from_edgelist(arc_count, direct = T)
plot(dnet)
Figure 2.6: Simple directed network

Others

If you have a network in another format, such as pajek, lgl, graphml, dimacs, graphdb, gml or dl, you can use read.graph() to read it as an igraph object. You can also use the graph_from_* functions to create an igraph object, then use c_net_update() to convert it into a metanet object.
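
For example, a GraphML file could be read and converted like this (the file name is a placeholder):

# "my_network.graphml" is a hypothetical path
library(igraph)
g <- read.graph("my_network.graphml", format = "graphml")
c_net_update(g) -> my_net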

g1 <- graph_from_lcf(12, c(5, -5), 6)
class(g1)
## [1] "igraph"
plot(g1)

c_net_update(g1) -> g2
class(g2)
## [1] "metanet" "igraph"
plot(g2)
(a) igraph plot
(b) metanet plot
Figure 2.7: igraph plot and metanet plot

2.4 RMT optimization

The correlation-based relevance network method is the most commonly used, largely due to its simple calculation procedure and noise tolerance. However, most studies involving relevance network analysis use arbitrary thresholds (usually r > 0.6, p < 0.05), so the constructed networks are subjective rather than objective.

This problem has been partially solved by a random matrix theory (RMT)-based approach (Figure 2.8), which can automatically identify a proper threshold for network construction from microarray data (and other omics data) (2).

Figure 2.8: Random matrix theory (RMT)-based approach

Using RMT_threshold(), we can find a proper r_threshold to build the network with the least noise.

A larger log_LE, a smaller log_LW, a smaller log_see, and a larger p_ks_test indicate a better r_threshold for meaningful network construction. You can adjust the threshold range to search more finely.

RMT_threshold(corr,
  out_dir = "../RMT_temp/",
  min_threshold = 0.5, max_threshold = 0.9, step = 0.02,
  verbose = T
) -> rmt_res
plot(rmt_res)
Figure 2.9: RMT_threshold result from 0.5 to 0.9

You can set gif = TRUE in RMT_threshold() to produce a GIF file showing the distribution of eigenvalue spacings at different r thresholds.

Figure 2.10: The distribution of eigenvalue spacing from 0.5 to 0.9