34.1 Correlation coefficients
Describing the linear relationship between two quantitative variables requires a description of the form, direction and variation. When the form is linear, a correlation coefficient is a single number that captures the direction and the variation (that is, the strength) of the relationship.
In the population, the unknown value of the correlation coefficient is denoted \(\rho\) (‘rho’); in the sample the value of the correlation coefficient is denoted \(r\). As usual, \(r\) (the statistic) is an estimate of \(\rho\) (the parameter), and the value of \(r\) is likely to be different in every sample (that is, sampling variation exists).
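The idea of sampling variation in \(r\) can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the population value \(\rho = 0.6\), the sample size of 30, and the helper names `pearson_r` and `draw_sample` are all invented for this example and do not come from any data set in this book.

```python
import math
import random

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def draw_sample(rho, n, rng):
    """Draw n (x, y) pairs from a simulated population with correlation rho."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        xs.append(x)
        # Mixing x with independent noise gives a population correlation of rho
        ys.append(rho * x + math.sqrt(1 - rho ** 2) * noise)
    return xs, ys

rng = random.Random(1)  # fixed seed so the run is repeatable
RHO = 0.6               # assumed population value, for illustration only

# Five different samples from the same population give five different values of r
sample_rs = [pearson_r(*draw_sample(RHO, 30, rng)) for _ in range(5)]
for i, r in enumerate(sample_rs, start=1):
    print(f"Sample {i}: r = {r:.3f}")
```

Each sample comes from the same simulated population, yet each produces a different value of \(r\); none is likely to equal \(\rho\) exactly.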
Correlation coefficients only apply if the form is approximately linear, so first checking that the relationship is linear (using a scatterplot) is important. Here, the Pearson correlation coefficient is discussed, which is suitable for describing linear relationships between quantitative data [16].
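To see why checking the form matters, consider the following sketch, which uses made-up data (not data from this book). In the first case, \(y\) is completely determined by \(x\) through a curve, yet the Pearson correlation coefficient is zero. In the second case, the relationship is monotonic but curved, so a rank-based (Spearman-type) coefficient describes it better than the Pearson coefficient. The helpers `pearson_r` and `ranks` are hypothetical names written for this example, and the rank calculation assumes no tied values.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank the values from 1 (smallest) upwards; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

x = [-3, -2, -1, 0, 1, 2, 3]

# A U-shaped (non-linear) relationship: y depends entirely on x
y_curved = [v ** 2 for v in x]
r_curved = pearson_r(x, y_curved)

# A monotonic but curved relationship
y_exp = [math.exp(v) for v in x]
r_exp = pearson_r(x, y_exp)
# Spearman-type coefficient: the Pearson coefficient computed on the ranks
rs_exp = pearson_r(ranks(x), ranks(y_exp))

print(f"U-shaped:  Pearson r = {r_curved:.2f}")
print(f"Monotonic: Pearson r = {r_exp:.2f}, rank-based = {rs_exp:.2f}")
```

A value of \(r\) near zero here does not mean the variables are unrelated; it means only that a straight line does not describe the relationship, which is why the scatterplot should be examined first.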
The values of \(\rho\) and \(r\) are always between \(-1\) and \(+1\). The sign indicates whether the linear association is positive or negative, and the magnitude of the correlation coefficient (how close it is to \(-1\) or \(+1\)) tells us the strength of the relationship:
- \(r=0\) means no linear relationship between the two variables: knowing how the value of \(x\) changes tells us nothing about how the value of \(y\) changes.
- \(r=+1\) means a perfect, positive relationship: knowing the value of \(x\) means we can perfectly predict the value of \(y\) (and larger values of \(y\) are associated with larger values of \(x\), in general).
- \(r=-1\) means a perfect, negative relationship: knowing the value of \(x\) means we can perfectly predict the value of \(y\) (and larger values of \(y\) are associated with smaller values of \(x\), in general).
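The three cases in the list above can be checked numerically. The following is a minimal sketch using tiny made-up data sets chosen for illustration; the helper `pearson_r` is a hypothetical name that applies the usual definition of the Pearson coefficient (the sum of products of deviations, divided by the square root of the product of the sums of squared deviations).

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4]

y_pos = [3, 5, 7, 9]    # points lie exactly on an increasing line (y = 2x + 1)
y_neg = [7, 4, 1, -2]   # points lie exactly on a decreasing line (y = 10 - 3x)
y_none = [3, 1, 4, 2]   # values chosen so there is no linear trend with x

r_pos = pearson_r(x, y_pos)
r_neg = pearson_r(x, y_neg)
r_none = pearson_r(x, y_none)

print(f"Perfect positive: r = {r_pos:+.2f}")
print(f"Perfect negative: r = {r_neg:+.2f}")
print(f"No linear trend:  r = {r_none:+.2f}")
```

Real data rarely produce values of exactly \(-1\), \(0\) or \(+1\); most values of \(r\) fall somewhere in between.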
The animation below demonstrates how the values of the correlation coefficient work.
Numerous example scatterplots were shown in Sect. 33.3; a correlation coefficient is not relevant for Plots C, D, E or H, as those relationships are not linear. In Plot A, the correlation coefficient will be positive, and reasonably close to one. In Plot B, the correlation coefficient will be negative, but not that close to \(-1\). In Plot F, the correlation coefficient will be close to zero.
Think 34.2 (Guess the value of \(r\)) Earlier, we looked at the NHANES data to explore the relationship between direct HDL cholesterol and current smoking status. The NHANES project is an observational study, so confounding is a potential issue. For this reason, relationships between the response and extraneous variables, and between explanatory and extraneous variables, should be examined.
For example, the relationship between Age (an extraneous variable) and direct HDL cholesterol (the response variable) is shown in Fig. 34.4.
How would you describe the relationship? What do you guess for the value of \(r\)?
The web page http://guessthecorrelation.com makes a game out of trying to guess the correlation!
[16] Other types of correlation coefficients also exist, such as the Spearman correlation, which may be used for monotonic, non-linear relationships.