17 Correlations between quantitative variables

So far, you have learnt about the research process, including analysing data using confidence intervals and hypothesis tests. Specifically, you have learnt to construct confidence intervals, and perform hypothesis tests, for one group and for comparing two separate groups.

In this chapter, you will learn to:

  • describe the relationships between two quantitative variables.
  • compute and interpret correlation coefficients and \(R^2\).

17.1 Introduction

So far, RQs about single variables and RQs for comparing two groups have been studied. In this chapter, the relationship between two quantitative variables is studied; that is, correlational RQs.

17.2 Graphs

Scatterplots display the relationship between two quantitative variables. Conventionally, and when appropriate, the response variable is shown on the vertical axis (and denoted \(y\)), and the explanatory variable is shown on the horizontal axis (and denoted \(x\)). Two quantitative variables are measured on each individual, and a point is placed on the scatterplot for each individual at the values of the two variables. In some cases, only a simple relationship is being explored, and which variable is denoted \(x\) and which is \(y\) is not important (for example, see Exercise 38.4).

As with any graph, describing the message in the graph is important, because the purpose of a graph is to display the information in the clearest, simplest possible way.

Example 17.1 (Red-deer data) Holgate (1965) examined the relationship between the age of \(n = 78\) male red deer and the weight of their molars. The data (below) comprises two quantitative variables, and both measurements are made on the same individuals (i.e., male red deer).

The scatterplot (Fig. 17.2) shows one dot for each deer (individual). The response variable is the molar weight, which is on the vertical axis and denoted \(y\). The explanatory variable is the deer age, which is on the horizontal axis and denoted \(x\).

For instance, one deer is just over \(2\) years of age (so \(x\) has a value a bit larger than \(2\)), and has a molar weight of \(2.42\) g (so that \(y = 2.42\)). This is the first deer listed in the data above.

FIGURE 17.1: The male red deer data

FIGURE 17.2: A plot of the red-deer data

17.3 Describing scatterplots

The purpose of a graph is to facilitate understanding of the data (Sect. 18.2). For a scatterplot, the form, direction, and variation in the relationship (or the strength of the relationship) are described:

  1. Form: The overall form or structure of the relationship (e.g., linear; curved upwards; etc.).
  2. Direction: The direction of the relationship (sometimes not relevant if the relationship is non-linear):
    • A positive association exists if high values of one variable accompany high values of the other variable, in general.
    • A negative association exists if high values of one variable accompany low values of the other variable, in general.
  3. Variation: The amount of variation in the relationship. A small amount of variation in the response variable for given values of the explanatory variable means the relationship is strong; a lot of variation in the response variable for given values of the explanatory variable means the relationship is less strong. Describing the variation is difficult; an objective, numerical way to do so is explained in Sect. 17.4.

Anything unusual or noteworthy should also be discussed. These features explain the type of relationship (form; direction), and the strength of that relationship (variation). Examples are shown in the carousel below.

Example 17.2 (Scatterplots) For the red deer data (Fig. 17.2), the relationship is approximately linear (form) with a negative direction (older deer generally have lighter teeth); the variation is... perhaps moderate.

Example 17.3 (Describing scatterplots) A study (Tager et al. 1979; Kahn 2005) measured the lung capacity of children in Boston (using forced expiratory volume, FEV, in litres). The scatterplot (Fig. 17.3) is curved (form), where taller children have larger FEVs, in general (direction). The variation in FEV gets larger for taller children.

FIGURE 17.3: FEV plotted against height for children in Boston

17.4 Summarising the relationship: correlation and \(R^2\)

17.4.1 Correlation coefficients

In general, summarising the relationship between two quantitative variables is difficult, because the possible relationships vary greatly (consider the scatterplots shown in the carousel above). However, if we focus only on linear relationships, the best way to summarise the relationship between the variables is to use a correlation coefficient. Each variable can also be numerically summarised.

The Pearson correlation coefficient measures the strength and direction of the linear relationship between two quantitative variables. Pearson correlation coefficients only apply if the form is approximately linear, so checking the scatterplot first is important.

The Pearson correlation coefficient only makes sense if the relationship is approximately linear.

In the population, the unknown value of the correlation coefficient is denoted \(\rho\) ('rho'); in the sample the value of the correlation coefficient is denoted \(r\). As usual, \(r\) (the statistic) is an estimate of \(\rho\) (the parameter), and the value of \(r\) is likely to be different in every sample (that is, sampling variation exists).

The symbol \(\rho\) is the Greek letter 'rho', pronounced 'row', as in 'row your boat'.

The values of \(\rho\) and \(r\) are always between \(-1\) and \(+1\). The sign indicates whether the relationship has a positive or negative linear association, and the value of the correlation coefficient describes the strength of the relationship:

  • \(r = -1\) means a perfect, negative relationship: knowing the value of \(x\) produces a perfect prediction of the value of \(y\), and larger values of \(y\) are associated with smaller values of \(x\).
  • Values of \(r\) between \(-1\) and \(0\) mean a negative relationship: knowing the value of \(x\) produces a prediction of the value of \(y\), and larger values of \(y\) are associated with smaller values of \(x\) (in general).
  • \(r = 0\) means no linear relationship between the variables: knowing how the value of \(x\) changes tells us nothing about how the corresponding value of \(y\) changes. The best prediction of \(y\) for any value of \(x\) is the mean of \(y\).
  • Values of \(r\) between \(0\) and \(+1\) mean a positive relationship: knowing the value of \(x\) produces a prediction of the value of \(y\), and larger values of \(y\) are associated with larger values of \(x\) (in general).
  • \(r = +1\) means a perfect, positive relationship: knowing the value of \(x\) produces a perfect prediction of the value of \(y\), and larger values of \(y\) are associated with larger values of \(x\).
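
The interpretations in the list above can be checked numerically. The following is a minimal Python sketch, not part of the original text: it assumes the standard formula for the sample Pearson correlation coefficient, and uses small made-up datasets (the names `pearson_r`, `perfect_positive`, `perfect_negative` and `curved` are hypothetical and chosen only for illustration).

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation coefficient for paired lists x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products and squares of the deviations from the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up data, for illustration only
x = [-2, -1, 0, 1, 2]
perfect_positive = [2 * xi + 1 for xi in x]  # points lie exactly on a rising line
perfect_negative = [5 - 3 * xi for xi in x]  # points lie exactly on a falling line
curved = [xi ** 2 for xi in x]               # a symmetric curve: no linear trend

r_pos = pearson_r(x, perfect_positive)
r_neg = pearson_r(x, perfect_negative)
r_curved = pearson_r(x, curved)

print(round(r_pos, 3), round(r_neg, 3), round(r_curved, 3))
```

For these made-up data, the two straight-line datasets give \(r = +1\) and \(r = -1\), while the curved dataset gives \(r = 0\) even though \(y\) is completely determined by \(x\). This is one reason why the scatterplot should be checked before a correlation coefficient is computed.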

Most values of \(r\) are between the extremes of \(r = -1\) and \(r = +1\).

Numerous example scatterplots were shown in Sect. 17.3. A correlation coefficient is not relevant for Plots C, D, E or H, as those relationships are not linear. For the others:

  • Plot A: the correlation coefficient will be positive, and reasonably near one.
  • Plot B: the correlation coefficient will be negative, but not near \(-1\).
  • Plot F: the correlation coefficient will be close to zero.

Example 17.4 (Correlation coefficients) Leuchtenberger et al. (2022) and Nishizaki et al. (2022) explored the relationship between water temperature and fertilization rates for sand dollars (Fig. 17.4). The correlation coefficient is \(r = -0.49\) (left panel), which might suggest that higher temperatures result in lower fertilization rates. However, a curved relationship is apparent (right panel), and so the relationship is more involved: the fertilization rate increases up to about \(18^\circ\)C, and then starts falling again.

A Pearson correlation coefficient is not suitable for describing the relationship.

FIGURE 17.4: Water temperature vs fertilization rates for sand dollars. Left: a linear relationship. Right: a curved relationship

Formulas exist to compute the value of \(r\), but are tedious to use manually. We will use software to obtain values of \(r\).
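
For reference only, one common form of the formula is shown below. It is the standard definition of the sample Pearson correlation coefficient rather than something taken from this chapter, and the notation is an assumption: \(x_i\) and \(y_i\) denote the two values recorded for individual \(i\), \(\bar{x}\) and \(\bar{y}\) denote the sample means, and \(n\) is the sample size.

\[
  r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
           {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]

The numerator is positive when large values of \(x\) tend to occur with large values of \(y\), and negative when large values of \(x\) tend to occur with small values of \(y\); dividing by the denominator keeps \(r\) between \(-1\) and \(+1\).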

Example 17.5 (Correlation coefficients) For the red-deer data (Fig. 17.2), the relationship is approximately linear, and the output from jamovi shows that \(r = -0.584\) (Fig. 17.5). The value of \(r\) is negative because, in general, older deer (\(x\)) tend to have lighter molars (\(y\)). The relationship may be described as 'moderately strong'.

Example 17.6 (Correlation coefficients) Tager et al. (1979) studied the lung capacity (forced expiratory volume; FEV) of children in Boston (Kahn 2005). The scatterplot in Fig. 17.3 is not linear, so a correlation coefficient is inappropriate.

In addition, the variation in the FEV increases as children get taller. For instance, the variation in FEV for children about \(50\) cm tall is much smaller than the variation in FEV for children about \(70\) cm tall.

The web page http://guessthecorrelation.com makes a game out of trying to guess the correlation coefficient from a scatterplot. It's very difficult!

FIGURE 17.5: jamovi correlation output for the red-deer data

17.4.2 R-squared (\(R^2\))

The value of \(r\) describes the strength and direction of the linear relationship, but knowing exactly what the value means is tricky. Interpretation is easier using \(R^2\), or 'R-squared': the square of the value of \(r\).

The value of \(R^2\) is never negative, and is usually multiplied by \(100\) and expressed as a percentage.

The value of \(R^2\) is never negative! However, you need to be careful using your calculator. On most calculators, entering -0.5^2 returns an answer of -0.25. The calculator interprets your input as meaning -(0.5^2).

Use brackets; (-0.5)^2 gives the correct answer of 0.25.
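
The same trap exists in some programming languages. As a small illustration (Python is used here only as an example of software; it is not used elsewhere in this chapter), the exponent operator `**` is applied before the leading minus sign, just as on most calculators. The variable names are hypothetical.

```python
r = -0.5

without_brackets = -0.5 ** 2   # read as -(0.5 ** 2), so the result is negative
with_brackets = (-0.5) ** 2    # the negative number itself is squared
from_variable = r ** 2         # squaring a stored value of r works as intended

print(without_brackets, with_brackets, from_variable)
```

Storing the value of \(r\) first and then squaring it, or using brackets, avoids the problem.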

\(R^2\) measures the reduction in the unexplained variation in values of \(y\) because the value of \(x\) is known. If the values of \(x\) are unknown, the best summary of the \(y\)-values is the mean of the \(y\)-values.

However, a relationship between the values of \(x\) and \(y\) suggests we (potentially) can make better estimates of the value of \(y\) by knowing the value of \(x\). That means that less variation should be unexplained. When expressed as a percentage, \(R^2\) measures how much the unexplained variation reduces due to our knowledge of the linear relationship. If \(R^2\) is zero, then the amount of unexplained variation has not reduced at all, and exploring the relationship has no value.

Example 17.7 (Unknown variation in \(y\)) For the red-deer data, the unexplained variation in the values of \(y\) (molar weight), without knowing anything about the age of the deer, is the variation in the distances from the mean to each observation (Fig. 17.6, left panels). Effectively, the unexplained variation is the standard deviation of the molar weights (\(s = 0.7263\)).

If the age of the deer (\(x\)) is used, the unexplained variation in the values of \(y\) is now the variation in the distances from the line explaining the relationship to each observation (Fig. 17.6, right panels). The distances are much shorter, in general, showing a decrease in the unexplained variation. Effectively, the unexplained variation is the standard deviation of the distance from the line to the observations (\(s = 0.5895\)).

Hence, the reduction in the square of the standard deviations is \((0.7263^2 - 0.5895^2)/0.7263^2 = 0.341\), or \(34.1\)%. This is the value of \(R^2\).
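
The arithmetic in this example can be checked with a few lines of code. This is a minimal Python sketch using only the values quoted in this chapter (the two standard deviations above, and \(r = -0.584\) from the jamovi output); the variable names are hypothetical.

```python
s_about_mean = 0.7263   # std. dev. of molar weights, ignoring age
s_about_line = 0.5895   # std. dev. of the distances from the line, using age
r = -0.584              # correlation coefficient from the jamovi output

# Proportional reduction in the squared standard deviations
r_squared_from_sd = (s_about_mean**2 - s_about_line**2) / s_about_mean**2

# Squaring the correlation coefficient directly
r_squared_from_r = r**2

print(round(r_squared_from_sd, 3), round(r_squared_from_r, 3))
```

Both calculations give \(0.341\) when rounded to three decimal places, which is the same value of \(R^2\) reached by two different routes.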

FIGURE 17.6: The unexplained variation for the red-deer data. Left panels: when no information about the age of the deer is used, the mean (the horizontal grey line in the top panel) is the best summary of the molar weight. Right panels: when information about the age of the deer is used (as shown by the grey line in the top panel), the distances are shorter in general.

Example 17.8 (Values of \(R\)-squared) For the red-deer data (Fig. 17.2), the value of \(R^2\) is \(R^2 = (-0.584)^2 = 0.341\), usually written as a percentage: \(34.1\)%. The value of \(R^2\) is positive, even though the value of \(r\) is negative.

This means a reduction of about \(34.1\)% in the unexplained variation of the molar weights, due to using the information in the age of the deer. The rest of the variation in molar weights is due to chance, and to extraneous variables, such as weight, diet, amount of exercise, genetics, etc.

17.5 Numerical summary tables

In general, numerically summarising the relationship between two quantitative variables is difficult because of the many types of possible relationships (Sect. 17.3). However, for linear relationships, each quantitative variable can be summarised, and the correlation coefficient reported (Table 17.1).

TABLE 17.1: A numerical summary of the red deer data

| | Mean | Std. dev. | Sample size | Correlation |
|---|---|---|---|---|
| Age (years) | \(7.7\) | \(2.34\) | \(78\) | \(-0.584\) |
| Molar weight (g) | \(3.0\) | \(0.73\) | \(78\) | |

17.6 Chapter summary

A scatterplot displays the relationship between two quantitative variables (the response denoted \(y\); the explanatory denoted \(x\)). The relationship is described by the form (linear, or otherwise), the direction of the relationship (sometimes not relevant if the graph is not linear), and the variation in the relationship (or the strength of the relationship).

Linear relationships are measured numerically using correlation coefficients and \(R^2\). Correlation coefficients (denoted \(r\) in the sample; \(\rho\) in the population) are always between \(-1\) and \(+1\). Positive values denote positive relationships between the two variables: as one value gets larger, the other tends to get larger too. Negative values denote negative relationships between the two variables: as one value gets larger, the other tends to get smaller. Values close to \(-1\) or \(+1\) indicate very strong relationships; values near zero show very little linear relationship between the variables.

Sometimes, \(R^2\) is used to describe the relationship: it indicates what percentage of the variation in the response variable can be explained by knowing the value of the explanatory variable.

17.7 Quick review questions

A study of onion growth (Mead 1970) produced the scatterplot shown in Fig. 17.7.

FIGURE 17.7: Onion yield plotted against planting density

  1. What is the \(x\)-variable?
  2. What is the best description for the form of the relationship?
  3. What is the best description for the direction of the relationship?
  4. What is the best description for the variation in the relationship?

17.8 Exercises

Answers to odd-numbered exercises are available in App. E.

Exercise 17.1 Draw a scatterplot with:

  1. a negative correlation coefficient, with \(r\) very close to (but not equal to) \(-1\).
  2. a positive correlation coefficient, with \(r\) very close to (but not equal to) \(+1\).
  3. a correlation coefficient very close to \(0\).

Exercise 17.2 Estimate the correlation coefficients from the scatterplots in Fig. 17.8. (You can only give very broad estimates!)

FIGURE 17.8: Four plots: estimate the correlation coefficients

Exercise 17.3 A study of the nutritional content of peas (Pisum sativum) measured the quantities of various minerals in pea seeds (Hacisalihoglu, Beisel, and Settles 2021). In these plots, it does not matter which of the pair of variables is used on the horizontal axis and which is used on the vertical axis.

From the left scatterplot in Fig. 17.9, estimate the value of \(r\).

Exercise 17.4 A study of the nutritional content of peas (Pisum sativum) measured the quantities of various minerals in pea seeds (Hacisalihoglu, Beisel, and Settles 2021). In these plots, it does not matter which of the pair of variables is used on the horizontal axis and which is used on the vertical axis.

From the right scatterplot in Fig. 17.9, estimate the value of \(r\).

FIGURE 17.9: The relationship between some minerals in pea seeds

Exercise 17.5 Schepaschenko et al. (2017a) measured the diameter and the age of \(385\) small-leaved lime trees (Fig. 17.10).

  1. Describe the scatterplot.
  2. What does each point on the scatterplot represent?
  3. Describe the relationship.
  4. Would a correlation coefficient be appropriate? Explain.

FIGURE 17.10: The age and foliage biomass of small-leaved lime trees grown in Russia (\(n = 385\)). The solid line on the left panel displays the linear relationship.

Exercise 17.6 Montgomery and Peck (1992) examined the time taken to deliver soft drinks to vending machines.

  1. Describe the relationship (Fig. 17.11, left panel).
  2. What does each point represent?
  3. Describe the relationship.
  4. Would a correlation coefficient be appropriate? Explain.

Exercise 17.7 Royston and Altman (1994) examined the mandible length and gestational age for \(167\) foetuses from the \(12\)th week of gestation onward. Describe the relationship (Fig. 17.11, right panel).

FIGURE 17.11: Two scatterplots. Left: the time taken to deliver soft drinks to vending machines. Right: the relationship between gestational age and mandible length. In both plots, the solid line displays the linear relationship.

Exercise 17.8 Wright et al. (2021) recorded the chest beating of \(25\) gorillas, and the gorillas' size (measured by the breadth of the gorillas' backs). Describe the relationship (Fig. 17.12, left panel).

FIGURE 17.12: Two scatterplots. Left: chest beating in gorillas. Right: the relationship between DC output and wind speed.

Exercise 17.9 Joglekar, Scheunemyer, and LaRiccia (1989) examined the relationship between direct current generated by a windmill and wind speed (Hand et al. 1996). Describe the relationship (Fig. 17.12, right panel).

Exercise 17.10 Lambie, Mudge, and Stevenson (2021) recorded the percentage carbon (C) and the percentage nitrogen (N) in \(28\) irrigated farming plots (Fig. 17.13, left panel).

  1. Describe the relationship.
  2. Does it matter which variable is \(x\) and which is \(y\)? Explain.
  3. What does each point represent?

FIGURE 17.13: Left: the percentage N and percentage C in irrigated plots. Right: the weight of students in Week 1 and Week 12 of the university semester.

Exercise 17.11 The weights of students starting at university (Week 1) and in Week 12 are shown in Fig. 17.13 (right panel).

  1. Describe the relationship.
  2. What does each point represent?

Exercise 17.12 The relationship (P. K. Dunn and Smyth 2018) between the number of cyclones \(y\) in the Australian region each year from 1969 to 2005, and a climatological index called the Ocean Niño Index (ONI, \(x\)), is shown in Fig. 17.14.

From software, \(r = -0.682\). What is the value of \(R^2\)? What does it mean?

FIGURE 17.14: The number of cyclones in the Australian region each year from 1969 to 2005, and the ONI for October, November, December