16 Correlations between quantitative variables
So far, you have learnt to ask a RQ, design a study, collect the data, describe the data, and summarise data.
In this chapter, you will learn to:
- describe the relationships between two quantitative variables.
- compute and interpret correlation coefficients and \(R^2\).
16.1 Introduction
So far, RQs about single variables and RQs for comparing two groups have been studied. In this chapter, the relationship between two quantitative variables is studied; that is, correlational RQs.
16.2 Graphs
Scatterplots display the relationship between two quantitative variables. Conventionally, and when appropriate, the response variable is shown on the vertical axis (and denoted \(y\)), and the explanatory variable is shown on the horizontal axis (and denoted \(x\)). Two quantitative variables are measured on each individual, and a point is placed on the scatterplot for each individual at the values of the two variables. In some cases, a relationship is explored, but which variable is denoted \(x\) and which is \(y\) is not important (e.g., see Exercise 37.5.)
As with any graph, describing the message in the graph is important, because the purpose of a graph is to display the information in the clearest, simplest possible way.
Example 16.1 (Red-deer data) Holgate (1965) examined the relationship between the age of \(n = 78\) male red deer and the weight of their molars. The data (below) comprises two quantitative variables, and both measurements are made on the same individuals (i.e., male red deer).
The scatterplot (Fig. 16.2) shows one dot for each deer (individual). The response variable is the molar weight, which is on the vertical axis and denoted \(y\). The explanatory variable is the deer age, which is on the horizontal axis and denoted \(x\).
For instance, one deer is just over \(4\) years of age (so \(x\) has a value a bit larger than \(4\)), and has a molar weight of \(2.42\)(so that \(y = 2.42\)). This is the first deer listed in the data above.
16.3 Describing scatterplots
The purpose of a graph is to facilitate understanding of the data. For a scatterplot, the form, direction, and variation in the relationship (or the strength of the relationship) are described:
- Form: The overall form or structure of the relationship (e.g., linear; curved upwards; etc.).
-
Direction: The direction of the relationship (sometimes not relevant if the relationship is non-linear):
- A positive association exists if high values of one variable accompany high values of the other variable, in general.
- A negative association exists if high values of one variable accompany low values of the other variable, in general.
- Variation: The amount of variation in the relationship. A small amount of variation in the response variable for given values of the explanatory variable means the relationship is strong; a lot of variation in the response variable for given values of the explanatory variable means the relationship is less strong. Describing the variation can be difficult; an objective, numerical way to do so is explained in Sect. 16.4.
Anything unusual or noteworthy should also be discussed. These features explain the type of relationship (form; direction), and the strength of that relationship (variation). Examples are shown in the carousel below (click to move through the scatterplots).
Example 16.2 (Scatterplots) For the red deer data (Fig. 16.2), the relationship is approximately linear (form) with a negative direction (older deer generally have lighter teeth); the variation is... perhaps moderate.
Example 16.3 (Describing scatterplots) A study (Tager et al. 1979; Kahn 2005) measured the lung capacity of children in Boston (using forced expiratory volume, FEV, in litres). The scatterplot (Fig. 16.3) is curved (form), where older children have larger FEVs, in general (direction). The variation in FEV gets larger for taller youth.
16.4 Summarising the relationship: correlation and \(R^2\)
16.4.1 Correlation coefficients
In general, summarising the relationship between two quantitative variables is difficult, because the possible relationships vary greatly (consider the variety in the scatterplots shown in the carousel above). However, if we focus only on approximately linear relationships, the best way to numerically summarise the relationship between the variables is to use a correlation coefficient. Each variable can also be numerically summarised individually.
Definition 16.1 (Correlation coefficient) The Pearson correlation coefficient measures the strength and direction of the linear relationship between two quantitative variables. Its value is always between \(-1\) and \(1\).
Pearson correlation coefficients only apply if the form is approximately linear, so checking the scatterplot first is important. All correlations in this book are Pearson correlation coefficients. Only the Pearson correlation coefficient is discussed here, but other correlation coefficients also exist (such as the Spearman correlation coefficient or Kendall correlation coefficient, which may be used for increasing-only or decreasing-only non-linear relationships).
The Pearson correlation coefficient only makes sense if the relationship is approximately linear.
In the population, the unknown value of the correlation coefficient is denoted \(\rho\) ('rho'); in the sample, the value of the correlation coefficient is denoted \(r\). As usual, \(r\) (the statistic) is an estimate of \(\rho\) (the parameter), and the value of \(r\) is likely to be different in every sample (that is, sampling variation exists).
The symbol \(\rho\) is the Greek letter 'rho', pronounced 'row', as in 'row your boat'.
The values of \(\rho\) and \(r\) are always between \(-1\) and \(+1\). The sign indicates whether the relationship has a positive or negative linear association, and the value of the correlation coefficient describes the strength of the relationship:
- \(r = -1\) indicates a perfect, negative relationship: by 'perfect', we mean that knowing the value of \(x\) produces an exact prediction of the value of \(y\); the negative value means larger values of \(y\) are associated with smaller values of \(x\).
- Values of \(r\) between \(-1\) and \(0\) indicates a negative relationship: knowing the value of \(x\) produces a prediction of the value of \(y\), and larger values of \(y\) are associated with smaller values of \(x\) (in general).
- \(r = 0\) indicates no linear relationship between the variables: knowing how the value of \(x\) changes tells us nothing about how the corresponding value of \(y\) changes. The best prediction of \(y\) for any value of \(x\) is the mean of \(y\); i.e., \(\bar{y}\).
- Values of \(r\) between \(0\) and \(+1\) indicates a positive relationship: knowing the value of \(x\) produces a prediction of the value of \(y\), and larger values of \(y\) are associated with larger values of \(x\) (in general).
- \(r = +1\) indicates a perfect, positive relationship: by 'perfect', we mean that knowing the value of \(x\) produces an exact prediction of the value of \(y\); the positive value means larger values of \(y\) are associated with larger values of \(x\).
Almost all values of \(r\) are between the extremes of \(r = -1\) and \(r = +1\). The animation below demonstrates how the values of the correlation coefficient work.
Numerous example scatterplots were shown in Sect. 16.3. A correlation coefficient is not relevant for Plots C, D, E or H, as those relationships are not linear. For the others:
- Plot A: the correlation coefficient is positive, and reasonably close to one.
- Plot B: the correlation coefficient is negative, but not near \(-1\).
- Plot F: the correlation coefficient is close to zero.
Example 16.4 (Correlation coefficients) Leuchtenberger et al. (2022) and Nishizaki et al. (2022) explored the relationship between water temperature and fertilization rates for sand dollars (Fig. 16.4). The correlation coefficient is \(r = -0.71\) (left panel), which might suggest that higher temperatures result in lower fertilization rates. However, a curved relationship is apparent (right panel), and so the relationship is more complex: the fertilization rate increases up to about \(18\)oC, and then starts falling again.
A Pearson correlation coefficient is not suitable for describing the relationship.
Formulas exist to compute the value of \(r\), but are tedious to use manually. We will use software output to obtain values of \(r\).
Example 16.5 (Correlation coefficients) For the red-deer data (Fig. 16.2), the relationship is approximately linear, and the software output shows that \(r = -0.584\) (Fig. 16.5). The value of \(r\) is negative because, in general, older deer (\(x\)) are associated with smaller weight molars (\(y\)). The relationship may be described as 'moderately strong' perhaps.
Example 16.6 (Correlation coefficients) Tager et al. (1979) studied the lung capacity (forced expiratory volume; FEV) of children in Boston (Kahn 2005). The scatterplot in Fig. 16.3 is not linear, so a correlation coefficient is inappropriate.
The web page http://guessthecorrelation.com makes a game out of trying to guess the correlation coefficient from a scatterplot. It's very difficult!
16.4.2 R-squared (\(R^2\))
While \(r\) describes the strength and direction of the linear relationship, knowing exactly what the value means is tricky. Interpretation is easier using \(R^2\): the square of the value of \(r\). The animation below shows some values of \(R^2\).
Definition 16.2 ($R^2$) The value of \(R^2\) is how much the unexplained variation in the values of \(y\) is reduced due to using the extra information in the values of \(x\).
\(R^2\) is pronounced as '\(r\)-squared'.
The value of \(R^2\) is never negative, and is usually multiplied by \(100\) and expressed as a percentage.
The value of \(R^2\) is never negative!
However, you need to be careful using your calculator.
On most calculators, entering -0.5^2
returns an answer of -0.25
.
The calculator interprets your input as meaning -(0.5^2)
.
Use brackets; (-0.5)^2
gives the correct answer of 0.25
(or \(25\)%).
Example 16.7 (Values of $R$-squared) For the red-deer data (Fig. 16.2), the value of \(R^2\) is \(R^2 = (-0.584)^2 = 0.341\), usually written as a percentage: \(34.1\)%. The value of \(R^2\) is positive, even though the value of \(r\) is negative.
This means a reduction of about \(34.1\)% in the unexplained variation of the molar weights, due to using the information in the age of the deer. The rest of the variation in molar weights is due to chance, and to extraneous variables, such as weight, diet, amount of exercise, genetics, etc.
\(R^2\) measures the reduction in the unexplained variation in values of \(y\) because the value of \(x\) is known. If the values of \(x\) were unknown, the best summary of the \(y\)-values is the mean of the \(y\)-values. However, a relationship between the values of \(x\) and \(y\) suggests (potentially) that better estimates of the value of \(y\) could be made by knowing the value of \(x\). That means that less variation should be left unexplained.
When expressed as a percentage, \(R^2\) measures how much the unexplained variation reduces due to our knowledge of the linear relationship. If \(R\)-squared is zero, then the amount of unexplained variation has not reduced at all, and exploring the relationship between \(x\) and \(y\) has no value.
Example 16.8 (Unknown variation in $y$) For the red-deer data, the unexplained variation in the values of \(y\) (molar weight), without knowing anything about the age of the deer, is the variation in the distances from the mean to each observation (Fig. 16.6, left panels). Effectively, the unexplained variation is the standard deviation of the molar weights (\(s = 0.7263\)).
If the age of the deer (\(x\)) is used, the unexplained variation in the values of \(y\) is now the variation in the distances from the line explaining the relationship to each observation (Fig. 16.6, right panel). The distances are much shorter, in general, showing a decrease in the unexplained variation. Effectively, the unexplained variation is the standard deviation of the distances from the line to the observations (\(s = 0.5895\)).
Hence, the reduction in the square of the standard deviations is \((0.7263^2 - 0.5895^2)/0.7263^2 = 0.341\), or \(34.1\)%. This is the value of \(R^2\).
16.5 Numerical summary tables
In general, numerically summarising the relationship between two quantitative variables is difficult because of the many types of possible relationships (Sect. 16.3). However, for linear relationships, both quantitative variables can be summarised, and the correlation coefficient can be given (Table 16.1).
Mean | Standard deviation | Sample size | Correlation | |
---|---|---|---|---|
Age (in years) | \(\phantom{0}7.7\) | \(\phantom{0}2.34\) | \(\phantom{0}78\) | \(-0.584\) |
Molar weight (in g) | \(\phantom{0}3.0\) | \(\phantom{0}0.73\) | \(\phantom{0}78\) |
16.6 Example: removal efficiency
In wastewater treatment facilities, air from biofiltration is passed through a membrane and dissolved in water, and is transformed into harmless byproducts. The removal efficiency \(y\) (in %) may depend on the inlet temperature (in oC; \(x\)). Chitwood and Devinny (2001) asked:
In treating biofiltation wastewater, is the removal efficiency linearly associated with the inlet temperature?
A scatterplot of \(n = 32\) observations (Fig. 16.7, left panel) suggests an approximately linear relationship (Devore and Berk 2007). The direction is positive: larger inlet temperatures are associated with a larger removal efficiency, in general. The variation is always hard to describe, but is perhaps 'reasonably small'.
A more precise way to measure the strength of the linear association is to use the correlation coefficient. Using software output (Fig. 16.7, right panel), \(r = 0.891\), and so \(R^2 = (0.891)^2 = 79.4\)%. This means that about \(79.4\)% of the variation in removal efficiency can be explained by knowing the inlet temperature.
16.7 Chapter summary
A scatterplot displays the relationship between two quantitative variables (the response denoted \(y\); the explanatory denoted \(x\)). The relationship is described by the form (linear, or otherwise), the direction of the relationship (sometimes not relevant if the graph is not linear), and the variation in the relationship (or the strength of the relationship).
Linear relationships are measured numerically using the correlation coefficient and \(R^2\). Correlation coefficients (denoted \(r\) in the sample; \(\rho\) in the population) are always between \(-1\) and \(+1\). Positive values denote positive relationships between the two variables: as one values gets larger, the other tends to get larger too. Negative values denote negative relationships between the two variables: as one values gets larger, the other tends to get smaller. Values close to \(-1\) or \(+1\) are very strong relationships; values near zero shows very little linear relationship between the variables.
Sometimes, \(R^2\) is used to describe the relationship: it indicates what percentage of the variation in the response variable can be explained by knowing the value of the explanatory variables.
16.8 Quick review questions
A study of onion growth (Mead 1970) produced the scatterplot shown in Fig. 16.8.
- What is the \(x\)-variable?
- What is the best description for the form of the relationship?
- What is the best description for the direction of the relationship?
- What is the best description for the variation in the relationship?
16.9 Exercises
Answers to odd-numbered exercises are available in App. E.
Exercise 16.1 Draw a scatterplot with:
- a negative correlation coefficient, with \(r\) very close to (but not equal to) \(-1\).
- a positive correlation coefficient, with \(r\) very close to (but not equal to) \(+1\).
- a correlation coefficient very close to \(0\).
Exercise 16.2 Estimate the correlation coefficients from scatterplots in Fig. 16.9. (You can only give very rough estimates!)
Exercise 16.3 Hacisalihoglu, Beisel, and Settles (2021) studied of the nutritional content of peas (Pisum sativum), and measured the quantities of various minerals. In these plots, it does not matter which of the pair of variables is used on the horizontal axis and which is used on the vertical axis.
From Fig. 16.10 (left panel), estimate the value of \(r\).
Exercise 16.4 Hacisalihoglu, Beisel, and Settles (2021) studied of the nutritional content of peas (Pisum sativum), and measured the quantities of various minerals. In these plots, it does not matter which of the pair of variables is used on the horizontal axis and which is used on the vertical axis.
From Fig. 16.10 (right panel), estimate the value of \(r\).
Exercise 16.5 Schepaschenko et al. (2017a) measured the diameter and the age of \(385\) small-leaved lime trees (Fig. 16.11).
- What does each point on the scatterplot represent?
- Describe the scatterplot.
- Would a correlation coefficient be appropriate? Explain.
Exercise 16.6 K. Y. Kim and Kim (2022) measured numerous data from \(969\) South Korean subjects, including \(517\) females. A scatterplot of the height and age for females is shown in Fig. 16.12.
- What does each point on the scatterplot represent?
- Describe the relationship.
- Would a correlation coefficient be appropriate? Explain.
- Does the scatterplot suggest than people become less tall as they age? Explain.
Exercise 16.7 Montgomery and Peck (1992) examined the time taken to deliver soft drinks to vending machines.
- Describe the relationship (Fig. 16.13, left panel).
- What does each point represent?
- Describe the relationship.
- Would a correlation coefficient be appropriate? Explain.
Exercise 16.8 Royston and Altman (1994) examined the mandible length and gestational age for \(167\) foetuses from the \(12\)th week of gestation onward. Describe the relationship (Fig. 16.13, right panel).
Exercise 16.9 Wright et al. (2021) recorded the chest-beating rate of \(25\) gorillas, and the gorillas' size (measured by the breadth of the gorillas' backs). Describe the relationship (Fig. 16.14, left panel).
Exercise 16.10 Joglekar, Scheunemyer, and LaRiccia (1989) examined the relationship between direct current (DC) generated by a windmill and wind speed (Hand et al. 1996). Describe the relationship (Fig. 16.14, right panel).
Exercise 16.11 Lambie, Mudge, and Stevenson (2021) recorded the percentage carbon (C) and the percentage nitrogen (N) in \(28\) irrigated farming plots (Fig. 16.15, left panel).
- Describe the relationship.
- Does it matter which variable is \(x\) and which is \(y\)? Explain.
- What does each point represent?
Exercise 16.12 The weights of students starting at university (Week 1) and in Week 12 are shown in Fig 16.15 (right panel).
- Describe the relationship.
- What does each point represent?
Exercise 16.13 The relationship (P. K. Dunn and Smyth 2018) between the number of cyclones \(y\) in the Australian region each year from 1969 to 2005, and a climatological index called the Ocean Niño Index (ONI, \(x\)), is shown in Fig. 16.16.
From software, \(r = -0.682\). What is the value of \(R^2\)? What does it mean?