2.2 Correlation analysis

An essential step in the early phases of a mixture analysis is the assessment of the correlation between mixture components. This preliminary analysis gives a sense of the relationship between exposures, allows a first assessment of exposure patterns and clusters, and provides information that might indicate which method is better suited for subsequent modeling.

Given two continuous covariates, their relationship can be checked with a simple two-way scatter plot. Here we show a set of three pairwise comparisons, with a lowess trend line added on top of each scatter plot.

Figure 2.2: Scatter plots of pairs of exposures in the simulated data

We see that some pairs of covariates are highly correlated (like \(X_3\) and \(X_4\)), while other exposures seem to be independent (e.g. \(X_1\) and \(X_5\)).
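A plot of this kind could be produced along the following lines. This is only a sketch: the data frame `dat`, its three made-up exposures, and the helper `scatter_lowess` are assumptions introduced for illustration and are not the simulated data set used in the figure.

```r
# Hypothetical stand-in for the simulated data: three made-up exposures.
set.seed(1)
n  <- 200
x3 <- rnorm(n)
x4 <- x3 + rnorm(n, sd = 0.1)   # built to be strongly related to x3
x1 <- rnorm(n)                  # built to be unrelated to the other two
dat <- data.frame(x1, x3, x4)

# Draw one scatter plot and overlay a lowess trend line (hypothetical helper).
scatter_lowess <- function(data, xvar, yvar) {
  plot(data[[xvar]], data[[yvar]], xlab = xvar, ylab = yvar,
       pch = 16, col = "grey40")
  fit <- lowess(data[[xvar]], data[[yvar]])
  lines(fit, col = "red", lwd = 2)
  invisible(fit)                # return the smoothed curve for inspection
}

op <- par(mfrow = c(1, 3))      # three panels side by side
fit_34 <- scatter_lowess(dat, "x3", "x4")
fit_13 <- scatter_lowess(dat, "x1", "x3")
fit_14 <- scatter_lowess(dat, "x1", "x4")
par(op)                         # restore the previous layout
```

A nearly straight, steep trend line suggests a strong linear relationship, while a flat line through a shapeless cloud of points suggests little or no association.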

A correlation coefficient and a correlation test additionally provide a quantitative assessment of this relationship. The Pearson correlation (\(r\)) measures the linear dependence between two variables; the associated significance test assumes that both covariates are approximately normally distributed:

\[r=\frac{\sum(x-m_x)(y-m_y)}{\sqrt{\sum(x-m_x)^2\sum(y-m_y)^2}}\]

where \(m_x\) and \(m_y\) are the means of the two covariates \(x\) and \(y\).
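As a quick check of the definition of \(r\), the coefficient can be computed by hand and compared with R's built-in `cor`. The vectors `x` and `y` and the function `pearson_manual` below are made up for illustration.

```r
# Made-up data: y depends linearly on x plus noise.
set.seed(2)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)

# Pearson correlation from its definition (hypothetical helper).
pearson_manual <- function(x, y) {
  dx <- x - mean(x)             # deviations from the mean of x
  dy <- y - mean(y)             # deviations from the mean of y
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual  <- pearson_manual(x, y)
r_builtin <- cor(x, y, method = "pearson")
print(all.equal(r_manual, r_builtin))

# Test of the null hypothesis of no correlation.
test_r <- cor.test(x, y, method = "pearson")
print(test_r$p.value)
```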

The Spearman correlation (\(\rho\)) measures the correlation between the ranks \(x'\) and \(y'\) of the two covariates \(x\) and \(y\):

\[\rho=\frac{\sum(x'-m_{x'})(y'-m_{y'})}{\sqrt{\sum(x'-m_{x'})^2\sum(y'-m_{y'})^2}}\]

where \(x'\) and \(y'\) are the ranks of \(x\) and \(y\), and \(m_{x'}\) and \(m_{y'}\) are the means of those ranks. This correlation test is non-parametric and does not require assuming normality for the two evaluated covariates. Both \(r\) and \(\rho\) are bounded between -1 (perfect negative correlation) and 1 (perfect positive correlation), and a coefficient equal to 0 indicates no correlation between the covariates. Tests for significance of the correlation coefficient are available for both \(r\) and \(\rho\), testing the null hypothesis of no correlation.
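The rank-based definition can be illustrated with a short sketch: the Spearman coefficient is the Pearson coefficient computed on the ranks, so it is unchanged by monotone transformations of a covariate. The skewed variables below are made up for illustration.

```r
# Made-up, skewed data with a monotone but non-linear relationship.
set.seed(3)
x <- rexp(100)
y <- x^2 + rnorm(100, sd = 0.5)

# Spearman's rho as the Pearson correlation of the ranks.
rho_manual  <- cor(rank(x), rank(y), method = "pearson")
rho_builtin <- cor(x, y, method = "spearman")
print(all.equal(rho_manual, rho_builtin))

# A monotone transformation (here the log) leaves the ranks,
# and therefore rho, unchanged.
rho_log <- cor(log(x), y, method = "spearman")
print(all.equal(rho_builtin, rho_log))

# Non-parametric test of the null hypothesis of no correlation.
test_rho <- cor.test(x, y, method = "spearman")
print(test_rho$p.value)
```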

When evaluating the correlation between several exposures we can create a correlation matrix, displayed in Table 2.1. This can be graphically displayed as a correlation plot (or correlogram) using the package corrplot. Note that the command requires as input the correlation matrix previously defined.

# Spearman correlation matrix of the 14 exposures (columns 3 to 16 of data2)
cor.matrix <- cor(data2[, 3:16], method = "spearman")
Table 2.1: Correlation matrix from the simulated data
       x1    x2    x3    x4    x5    x6    x7    x8    x9   x10   x11   x12   x13   x14
x1   1.00  0.30 -0.04 -0.03 -0.03  0.04  0.16 -0.05  0.15 -0.10 -0.11  0.35  0.34  0.01
x2   0.30  1.00  0.05  0.06  0.08  0.07  0.18  0.03  0.14 -0.02 -0.03  0.39  0.38  0.05
x3  -0.04  0.05  1.00  0.99  0.93  0.60  0.28  0.74  0.17  0.40  0.56 -0.11 -0.14  0.70
x4  -0.03  0.06  0.99  1.00  0.94  0.61  0.29  0.74  0.18  0.41  0.57 -0.09 -0.13  0.71
x5  -0.03  0.08  0.93  0.94  1.00  0.59  0.29  0.72  0.17  0.42  0.56 -0.10 -0.14  0.69
x6   0.04  0.07  0.60  0.61  0.59  1.00  0.46  0.64  0.38  0.45  0.54  0.06  0.03  0.62
x7   0.16  0.18  0.28  0.29  0.29  0.46  1.00  0.39  0.70  0.36  0.41  0.46  0.48  0.41
x8  -0.05  0.03  0.74  0.74  0.72  0.64  0.39  1.00  0.37  0.55  0.64  0.01 -0.03  0.74
x9   0.15  0.14  0.17  0.18  0.17  0.38  0.70  0.37  1.00  0.32  0.36  0.50  0.50  0.40
x10 -0.10 -0.02  0.40  0.41  0.42  0.45  0.36  0.55  0.32  1.00  0.77  0.00 -0.05  0.42
x11 -0.11 -0.03  0.56  0.57  0.56  0.54  0.41  0.64  0.36  0.77  1.00 -0.01 -0.07  0.54
x12  0.35  0.39 -0.11 -0.09 -0.10  0.06  0.46  0.01  0.50  0.00 -0.01  1.00  0.90  0.11
x13  0.34  0.38 -0.14 -0.13 -0.14  0.03  0.48 -0.03  0.50 -0.05 -0.07  0.90  1.00  0.07
x14  0.01  0.05  0.70  0.71  0.69  0.62  0.41  0.74  0.40  0.42  0.54  0.11  0.07  1.00
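library(corrplot)

# Correlogram of the correlation matrix, with variables ordered by hierarchical clustering
corrplot(cor.matrix,
         method = "circle",
         order = "hclust",
         addrect = 10,       # draw rectangles around 10 clusters
         tl.pos = "l",
         tl.col = "black",
         sig.level = 0.05)   # only has an effect when p.mat is also supplied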
Figure 2.3: Correlation Plot from simulated data

The corrplot package documentation provides a very useful description of the many corrplot options.
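One option worth noting is `sig.level`, which is only taken into account when a matrix of p-values is passed through the `p.mat` argument. A minimal base-R sketch of how such a matrix might be built is shown below; the data frame `dat` and the helper `cor_pmat` are made up for illustration (the corrplot package also ships its own helper, `cor.mtest`, for this purpose).

```r
# Made-up data: a and b share a common component, c is independent noise.
set.seed(4)
n <- 150
z <- rnorm(n)
dat <- data.frame(a = z + rnorm(n, sd = 0.3),
                  b = z + rnorm(n, sd = 0.3),
                  c = rnorm(n))

# Matrix of p-values from pairwise correlation tests (hypothetical helper).
cor_pmat <- function(data, method = "spearman") {
  p  <- ncol(data)
  pm <- matrix(NA_real_, p, p, dimnames = list(names(data), names(data)))
  diag(pm) <- 0                 # a variable is trivially correlated with itself
  for (i in seq_len(p - 1)) {
    for (j in (i + 1):p) {
      pv <- cor.test(data[[i]], data[[j]], method = method)$p.value
      pm[i, j] <- pm[j, i] <- pv   # fill both triangles
    }
  }
  pm
}

p.mat <- cor_pmat(dat)
print(round(p.mat, 4))
```

A matrix like `p.mat` could then be supplied as `p.mat = p.mat` in the corrplot call, so that correlations with a p-value above `sig.level` are flagged in the plot. With many exposures, the number of pairwise tests grows quickly, so these p-values are best read descriptively.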

The correlation plot displayed in Figure 2.3 provides several important pieces of information. First, we see a cluster of very highly correlated exposures (\(X_3\), \(X_4\), \(X_5\); \(\rho\) between 0.93 and 0.99) and a second, smaller cluster of highly correlated exposures (\(X_{12}\), \(X_{13}\); \(\rho\) = 0.90). In addition, several other pairs of exposures exhibit low to moderate levels of correlation, and it is not straightforward to define further subgroups of exposures.
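The clusters read off the plot can also be extracted numerically with hierarchical clustering. The sketch below uses six exposures whose correlations are copied from Table 2.1; the dissimilarity \(1-\rho\), the average linkage, and the choice of three groups are assumptions made here for illustration, not settings taken from the original analysis.

```r
# Correlations for six exposures, copied from Table 2.1.
vars <- c("x1", "x3", "x4", "x5", "x12", "x13")
rho <- matrix(c(
   1.00, -0.04, -0.03, -0.03,  0.35,  0.34,
  -0.04,  1.00,  0.99,  0.93, -0.11, -0.14,
  -0.03,  0.99,  1.00,  0.94, -0.09, -0.13,
  -0.03,  0.93,  0.94,  1.00, -0.10, -0.14,
   0.35, -0.11, -0.09, -0.10,  1.00,  0.90,
   0.34, -0.14, -0.13, -0.14,  0.90,  1.00),
  nrow = 6, byrow = TRUE, dimnames = list(vars, vars))

# Turn correlations into dissimilarities: highly correlated = close together.
d <- as.dist(1 - rho)

# Agglomerative clustering, then cut the tree into three groups.
tree   <- hclust(d, method = "average")
groups <- cutree(tree, k = 3)

# List the exposures falling in each group.
print(split(names(groups), groups))
```

With these values, x3, x4 and x5 fall in one group, x12 and x13 in another, and x1 is left on its own, which matches the visual impression from the correlation plot.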

2.2.1 Weighted correlation network analysis

Network analysis is emerging as a flexible and powerful technique in different fields. In a nutshell, a network is a structure made of variables, called nodes, and the relationships between them, called edges. Correlation networks define these relationships from the pairwise correlations between nodes, and are increasingly used in biology to analyze high-dimensional data sets. Weighted correlation networks, in particular, preserve the continuous nature of the underlying correlations rather than dichotomizing them. The theory behind network analysis is beyond the scope of this course, and we refer to other publications for further details (Langfelder and Horvath 2008; Hevey 2018). It is useful to mention here that these networks can be used in descriptive analyses to graphically display the relationships between the exposures in our mixture, based on their correlation structure. Such plots can be obtained with several R packages, including qgraph (documented at https://github.com/SachaEpskamp/qgraph), which was used to derive the plot in Figure 2.4.

Figure 2.4: Weighted correlation network of exposures in the simulated data

This network confirms our finding from the correlation plot, but provides a different and possibly better way of representing and visualizing the relationships between components of the mixture.
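To make the idea of a weighted network concrete, the sketch below turns a small correlation matrix into a list of weighted edges and, if qgraph happens to be installed, draws it. The five exposures and their correlations are copied from Table 2.1; the variable names `W` and `edges`, the spring layout, and the other plotting choices are assumptions for illustration and are not the settings used for Figure 2.4.

```r
# Correlations for five exposures, copied from Table 2.1.
vars <- c("x3", "x4", "x5", "x12", "x13")
W <- matrix(c(
   1.00,  0.99,  0.93, -0.11, -0.14,
   0.99,  1.00,  0.94, -0.09, -0.13,
   0.93,  0.94,  1.00, -0.10, -0.14,
  -0.11, -0.09, -0.10,  1.00,  0.90,
  -0.14, -0.13, -0.14,  0.90,  1.00),
  nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Each pair of nodes is one edge; its weight is the correlation itself,
# kept on its continuous scale instead of being thresholded to 0/1.
idx <- which(upper.tri(W), arr.ind = TRUE)
edges <- data.frame(from   = vars[idx[, "row"]],
                    to     = vars[idx[, "col"]],
                    weight = W[idx])
edges <- edges[order(-abs(edges$weight)), ]   # strongest edges first
print(edges)

# Optional plot, only attempted when the qgraph package is available.
if (requireNamespace("qgraph", quietly = TRUE)) {
  qgraph::qgraph(W, layout = "spring", labels = vars)
}
```

In the resulting plot, edge thickness reflects the absolute correlation, so tightly connected groups of nodes correspond to the clusters seen in the correlation plot.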

References

Hevey, David. 2018. “Network Analysis: A Brief Overview and Tutorial.” Health Psychology and Behavioral Medicine 6 (1): 301–28.
Langfelder, Peter, and Steve Horvath. 2008. “WGCNA: An R Package for Weighted Correlation Network Analysis.” BMC Bioinformatics 9 (1): 1–13.