35 Correlation

So far, you have learnt to ask a RQ, identify different ways of obtaining data, design the study, collect the data, describe the data, summarise data graphically and numerically, understand the tools of inference, form confidence intervals, and perform hypothesis tests.

In this chapter, you will learn about correlation. You will learn to:

  • produce correlation coefficients for exploring the relationship between two quantitative variables.
  • produce and interpret \(R^2\).
  • conduct hypothesis tests for correlation coefficients.

35.1 Correlation coefficients

Describing the linear relationship between two quantitative variables requires a description of the form, direction and variation. A correlation coefficient is a single number encapsulating all this information.

In the population, the unknown value of the correlation coefficient is denoted \(\rho\) ('rho'); in the sample the value of the correlation coefficient is denoted \(r\). As usual, \(r\) (the statistic) is an estimate of \(\rho\) (the parameter), and the value of \(r\) is likely to be different in every sample (that is, sampling variation exists).

The symbol \(\rho\) is the Greek letter 'rho', pronounced 'row', as in 'row your boat'.

Correlation coefficients only apply if the form is approximately linear, so checking if the relationship is linear first (using a scatterplot) is important. Here, the Pearson correlation coefficient is discussed, which is suitable for describing linear relationships between quantitative data506.

The Pearson correlation coefficient only makes sense if the relationship is approximately linear.

The values of \(\rho\) and \(r\) are always between \(-1\) and \(+1\). The sign indicates whether the relationship has a positive or negative linear association, and the value of the correlation coefficient tells us the strength of the relationship:

  • \(r = 0\) means no linear relationship between the two variables: Knowing how the value of \(x\) changes tells us nothing about how the value of \(y\) changes.
  • \(r= +1\) means a perfect, positive relationship: knowing the value of \(x\) means we can perfectly predict the value of \(y\) (and larger values of \(y\) are associated with larger values of \(x\), in general).
  • \(r = -1\) means a perfect, negative relationship: knowing the value of \(x\) means we can perfectly predict the value of \(y\) (and larger values of \(y\) are associated with smaller values of \(x\), in general).

Numerous example scatterplots were shown in Sect. 34.3; a correlation coefficient is not relevant for Plots C, D, E or H, as those relationships are not linear.

  • In Plot A, the correlation coefficient will be positive, and reasonably close to one.
  • In Plot B, the correlation coefficient will be negative, but not that close to \(-1\).
  • In Plot F, the correlation coefficient will be close to zero.

Example 35.1 (Correlation coefficients) For the red deer data (Fig. 34.2), \(r = -0.584\).

The value of \(r\) is negative because, in general, older deer (\(x\)) are associated with lighter molars (\(y\)).

FIGURE 35.1: Scatterplot for the sheep-food data

Example 35.2 (Correlation coefficients) Consider the plot in Fig. 35.2 from the NHANES data. The relationship between diastolic BP and age shown in this scatterplot is not linear, so a correlation coefficient is not appropriate.

FIGURE 35.2: A scatterplot of the diastolic blood pressure against age for the NHANES data

Example 35.3 (Correlation coefficients) Consider the plot in Fig. 35.3 from the NHANES data. The relationship between systolic BP and age shown in this scatterplot is approximately linear, so a correlation coefficient is appropriate. The correlation coefficient is \(r = 0.532\).

FIGURE 35.3: A scatterplot of the systolic blood pressure against age for the NHANES data

A study evaluated various food mixtures for sheep.507 One combination of variables that was assessed is shown in Fig. 35.1.

Estimate the value of \(r\).

\(r\) will be a positive number (since the scatterplot shows a positive linear relationship), and its value will be close to 1 as the relationship looks very strong.

Earlier, we looked at the NHANES data to explore the relationship between direct HDL cholesterol and current smoking status.

The NHANES project is an observational study, so confounding is a potential issue. For this reason, relationships between the response and extraneous variables, and between explanatory and extraneous variables, should be examined.

For example, the relationship between age (an extraneous variable) and direct HDL cholesterol (the response variable) is shown in Fig. 35.4.

How would you describe the relationship? What do you guess for the value of \(r\)?

Not much relationship: the mean of the direct HDL cholesterol concentration is similar for any age.

Perhaps describe the scatterplot as 'little relationship'.

We cannot make a good guess about the value of \(r\), but it will be near zero.

FIGURE 35.4: Direct HDL cholesterol plotted against age for the NHANES data

The web page http://guessthecorrelation.com makes a game out of trying to guess the correlation coefficient!

35.2 Using software

Software is used to compute the value of \(r\). For the red deer data (Fig. 34.2), the relationship is approximately linear, and the jamovi output (Fig. 35.5) and SPSS output (Fig. 35.6) show that \(r = -0.584\):

  • Direction: The sign of \(r\) indicates the direction. Here we see a negative relationship: Higher ages are associated with lighter molars (in general), which makes sense.
  • Variation: The value of \(r\) indicates the strength of the relationship. Here, perhaps we could describe the variation as "moderate".

FIGURE 35.5: jamovi correlation output for the red deer data

FIGURE 35.6: SPSS correlation output for the red deer data
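
For readers working without jamovi or SPSS, the calculation behind \(r\) can be sketched in a few lines of Python. This is only a minimal sketch using the standard library: the function name `pearson_r` and the small data set are hypothetical (they are not the red deer data), chosen so that older ages go with lighter molars.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations, and sums of squared deviations
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical illustration data (NOT the red deer data)
age = [2, 4, 5, 7, 9, 11, 13, 15]
molar_weight = [5.1, 4.6, 4.9, 4.0, 4.2, 3.5, 3.6, 3.0]

r = pearson_r(age, molar_weight)
print(f"r = {r:.3f}")  # negative: older ages go with lighter molars
```

As in the software output, the sign of the result gives the direction and the size gives the strength; the scatterplot should still be checked for linearity first.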

35.3 R-squared (\(R^2\))

While using \(r\) tells us about the strength and direction of the linear relationship, knowing exactly what the value means is tricky. Interpretation is easier using \(R^2\), or 'R-squared': the square of the value of \(r\).

The value of \(R^2\) is never negative, and is usually expressed as a percentage.

The value of \(R^2\) is never negative. However, you need to be careful when using your calculator!

With most calculators, entering -0.5^2 will return -0.25. This is correct, because the calculator interprets your input as meaning -(0.5^2).

You need to enter (-0.5)^2. This will give you the expected answer of 0.25.
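
The same trap exists in many programming languages. As a small illustration (Python is used here only as an example; the variable name `r` is just for this sketch), the power operator is applied before the minus sign unless brackets are used:

```python
r = -0.5

print(-0.5 ** 2)    # -0.25: interpreted as -(0.5 ** 2)
print((-0.5) ** 2)  # 0.25: brackets force the negative value to be squared
print(r ** 2)       # 0.25: the variable already holds the negative value
```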

The value of \(R^2\) is the percentage reduction in the unknown variation of \(y\) because the value of \(x\) is known.

In other words, \(R^2\) is the percentage of the variation in \(y\) explained by using the linear relationship, rather than just the mean value of \(y\).
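
To see this interpretation in action, the sketch below compares the variation of \(y\) about its mean with the variation of \(y\) about a fitted straight line, and checks that the percentage reduction equals the square of \(r\). The data are hypothetical, and the straight line used is the least-squares line (an assumption of this sketch; fitting such lines is not covered in this chapter).

```python
# Hypothetical illustration data (NOT from the chapter)
x = [2, 4, 5, 7, 9, 11, 13, 15]
y = [5.1, 4.6, 4.9, 4.0, 4.2, 3.5, 3.6, 3.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)

# Variation in y when only the mean of y is used
variation_about_mean = syy

# Variation in y left over after using the least-squares straight line
slope = sxy / sxx
intercept = mean_y - slope * mean_x
variation_about_line = sum(
    (yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)
)

reduction = 1 - variation_about_line / variation_about_mean
r = sxy / (sxx * syy) ** 0.5

print(f"Reduction in variation: {reduction:.1%}")
print(f"r squared:              {r ** 2:.1%}")
```

The two printed percentages agree, which is why \(R^2\) can be read as the percentage of the variation in \(y\) explained by the linear relationship.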

Example 35.4 (Values of R-squared) For the red deer data (Fig. 34.2), the value of \(R^2\) from the software output (Fig. 35.5; Fig. 35.6) is \(R^2 = (-0.584)^2 = 0.341\), usually written as a percentage: 34.1%.

The value of \(R^2\) is positive, even though the value of \(r\) is negative.

For the red deer data, \(R^2\) means that about 34.1% of the variation in molar weights can be explained by variation in the age of the deer. The rest of the variation in molar weights is due to extraneous variables, such as weight, diet, amount of exercise, genetics, etc.

From Example 35.3, the correlation coefficient between the systolic blood pressure and age in the NHANES data is \(r = 0.532\).

What is the value of \(R^2\)? What does it mean?

\(R^2 = (0.532)^2 = 0.283\): about 28.3% of the variation in systolic BP can be explained by age; extraneous variables (weight, gender, amount of exercise, genetics, etc.) explain the remaining 71.7% of the variation in systolic BP values.

35.4 Hypothesis testing

35.4.1 Introduction

For the red deer data (Sect. 34.2; Sect. 35.1), the population correlation coefficient between the weight of molars \(y\) and age of the deer \(x\) is unknown and denoted by \(\rho\).

The sample correlation coefficient is \(r = -0.584\), but the value of \(r\) varies from sample to sample (there is sampling variation).

The size of the sampling variation is measured with a standard error. However, there is a complication for correlation coefficients508, so we will not produce CIs for the correlation coefficient.

35.4.2 Hypothesis testing details

As usual, questions can be asked about the relationship between the variables, as measured by the unknown population correlation coefficient:

Is the population correlation coefficient zero, or not?

In the context of the red deer data:

In male red deer, is there a correlation between age and the weight of molars?

The RQ is about the population parameter \(\rho\). Clearly, the sample correlation coefficient \(r\) is not zero, and the RQ is asking whether this could be attributed to sampling variation. The null hypothesis is:

  • \(H_0\): \(\rho = 0\)

The parameter is \(\rho\), the population correlation between the age and molar weight in the red deer.

This is the usual 'no relationship' position, which proposes that the population correlation coefficient is zero. The alternative hypothesis is:

  • \(H_1\): \(\rho \ne 0\)

This is a two-tailed test, based on the RQ.

The approach is to assume that \(\rho=0\) (from \(H_0\)), then describe what values of \(r\) could be expected, under that assumption, just through sampling variation. Then the observed value of \(r\) is compared to the expected values to determine if the value of \(r\) supports or contradicts the assumption.
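
One way to make this logic concrete is a permutation (shuffling) approach, sketched below. This is not the method jamovi or SPSS use; it is a swapped-in illustration of the same idea. Shuffling the \(y\)-values breaks any link with \(x\), so the shuffled data show what values of \(r\) arise through chance alone when there is no relationship. The data, the function names and the number of shuffles are all hypothetical choices for this sketch.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

def permutation_p_value(x, y, n_shuffles=5000, seed=1):
    """Approximate two-tailed P-value for H0: no relationship."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    r_observed = pearson_r(x, y)
    y_shuffled = list(y)
    count_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(y_shuffled)  # break any link between x and y
        # Two-tailed: count shuffles at least as far from zero as observed
        if abs(pearson_r(x, y_shuffled)) >= abs(r_observed):
            count_extreme += 1
    # Add one to numerator and denominator so the estimate is never zero
    return r_observed, (count_extreme + 1) / (n_shuffles + 1)

# Hypothetical illustration data (NOT the red deer data)
age = [2, 4, 5, 7, 9, 11, 13, 15]
molar_weight = [5.1, 4.6, 4.9, 4.0, 4.2, 3.5, 3.6, 3.0]

r_observed, p_value = permutation_p_value(age, molar_weight)
print(f"r = {r_observed:.3f}, approximate two-tailed P = {p_value:.4f}")
```

A small \(P\)-value here means that shuffled data rarely produce a value of \(r\) as far from zero as the observed value, which contradicts the assumption that \(\rho = 0\).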

Software is used to test the hypotheses; the output in Figs. 35.5 (jamovi) and 35.6 (SPSS) contains the relevant \(P\)-value (twice in the SPSS output!).

The two-tailed \(P\)-value for the test (labelled Sig. by SPSS) is less than 0.001 (0.000 in SPSS). That is, the \(P\)-value is zero to three decimal places, so there is very strong evidence to support \(H_1\) (that the correlation in the population is not zero).

We write:

The sample presents very strong evidence (two-tailed \(P < 0.001\)) of a correlation between molar weight and the age of the male red deer (\(r = -0.584\); \(n = 78\)) in the population.

Notice the three features of writing conclusions again: An answer to the RQ; evidence to support the conclusion ('two-tailed \(P < 0.001\)'; no test statistic is given); and some sample summary information ('\(r = -0.584\); \(n = 78\)').
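
Although the output shown does not display a test statistic, the usual test of \(\rho = 0\) is based on a \(t\)-score with \(n - 2\) degrees of freedom. The sketch below computes that \(t\)-score from \(r\) and \(n\) for the red deer values quoted above. The function name is hypothetical, and the sketch assumes (without confirmation from the output) that the software uses this standard \(t\)-test.

```python
from math import sqrt

def correlation_t_statistic(r, n):
    """t-score for testing H0: rho = 0, with n - 2 degrees of freedom."""
    if not -1 < r < 1:
        raise ValueError("r must be strictly between -1 and +1")
    if n < 3:
        raise ValueError("n must be at least 3")
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

# Values quoted in the chapter for the red deer data
t_deer = correlation_t_statistic(-0.584, 78)
print(f"t = {t_deer:.2f}")  # about -6.27
```

A \(t\)-score this far from zero is very unusual when \(\rho = 0\), which is consistent with the very small \(P\)-value reported.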

The evidence suggests that the correlation is not zero (in the population). However, a non-zero correlation doesn't necessarily mean a strong correlation.

The correlation may be weak in the population (as estimated by the value of \(r\)), but there is evidence that it is not zero in the population.

This may be a useful analogy: If a rain forecast says 'there is a very high chance of rain tomorrow', it doesn't mean there will be a lot of rain, just a high chance of some rain.

35.4.3 Statistical validity conditions

As usual, these results hold only when certain conditions are met. The conditions for which the test is statistically valid are:

  1. The relationship is approximately linear.
  2. The variation in the response variable is approximately constant for all values of the explanatory variable.
  3. The sample size is at least 25.

The sample size of 25 is a rough figure here, and some books give other values.

In addition to the statistical validity conditions, the test will be externally valid if the sample is a simple random sample from the population. The test will also be internally valid if the study was well designed.

Example 35.5 (Statistical validity) For the red deer data, the scatterplot (Fig. 34.2) shows that the relationship is approximately linear, and the variation in molar weights doesn't seem to be obviously getting larger or smaller for older deer, so correlations are sensible. The sample size is also greater than 25.

The test in Sect. 35.4 will be statistically valid.

Example 35.6 (Statistical validity) A study509 examined the foliage biomass of small-leaved lime trees. A plot of the foliage biomass against diameter (Fig. 35.7) shows that the relationship is non-linear. In addition, the variation in foliage biomass increases for larger diameters (for values of \(x\) near 10, the values of \(y\) do not vary much at all, but for values of \(x\) near 30, the values of \(y\) vary greatly).

Both of these issues mean that correlations are not appropriate. A hypothesis test similar to that in Sect. 35.4 is inappropriate.

FIGURE 35.7: Foliage biomass plotted against diameter for small-leaved lime trees

Example 35.7 (Phu Quoc ridgeback dogs) A study of Phu Quoc Ridgeback dogs (Canis familiaris) recorded many measurements of the dogs, including body length and body height.510

The scatterplot (Fig. 35.8) shows an approximate linear relationship. We know that each sample could produce a different sample correlation coefficient. We expect that taller dogs would also be longer, so we may ask:

For these dogs, are longer dogs also taller dogs, in general?

The hypotheses are:

  • \(H_0\): \(\rho = 0\)
  • \(H_1\): \(\rho > 0\) (i.e., one-tailed)

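The correlation coefficient is \(r = 0.837\), and software reports a two-tailed \(P < 0.001\), based on \(n = 30\) dogs. Because \(r\) is positive (the direction proposed by \(H_1\)), the one-tailed \(P\)-value is half the two-tailed \(P\)-value, so it is also less than 0.001.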

We write:

There is very strong evidence that longer Phu Quoc ridgeback dogs are also taller (\(r = 0.837\); one-tailed \(P<0.001\); \(n = 30\)).

Since (a) the sample size is larger than 25; (b) the relationship is approximately linear; and (c) the variation in heights does not seem to differ for different lengths, the test is statistically valid.

FIGURE 35.8: Scatterplot of the body height vs body length for Phu Quoc ridgeback dogs

Example 35.8 (Drug calculations) A study of \(n = 30\) paramedicine students examined (among other things) the relationship between the amount of stress experienced (measured using the State–Trait Anxiety Inventory (STAI)) while performing drug-dose calculation, and length of work experience.511

The hypotheses are:

  • \(H_0\): \(\rho = 0\)
  • \(H_1\): \(\rho \ne 0\)

The correlation coefficient is given as \(r = 0.346\) and \(P = 0.18\).

No scatterplot is provided, so the test is statistically valid only if the relationship is approximately linear and the variation in STAI scores is approximately constant across levels of work experience. The sample size is larger than 25, however.

We write:

There is no evidence (\(r = 0.346\); two-tailed \(P = 0.18\)) that the length of work experience is associated with STAI stress levels when performing drug-dose calculations.

35.5 Example: Removal efficiency

In wastewater treatment facilities, air from biofiltration is passed through a membrane and dissolved in water, and is transformed into harmless byproducts. The removal efficiency \(y\) (in %) may depend on the inlet temperature (in \(^\circ\)C; \(x\)).

The RQ is

In treating biofiltration wastewater, is the removal efficiency associated with the inlet temperature?

The population parameter is \(\rho\), the correlation between the removal efficiency and inlet temperature.

A scatterplot of \(n=32\) samples (Fig. 35.9) suggests an approximately linear relationship.512

The output (jamovi: Fig. 35.10; SPSS: Fig. 35.11) shows that the sample correlation coefficient is \(r=0.891\), and so \(R^2 = (0.891)^2 = 0.794\), or 79.4%. This means that about 79.4% of the variation in removal efficiency can be explained by knowing the inlet temperature.

FIGURE 35.9: The relationship between removal efficiency and inlet temperature

To test if a relationship exists in the population, write:

  • \(H_0\): \(\rho = 0\);
  • \(H_1\): \(\rho \ne 0\): Two-tailed (as implied by the RQ).

The software output (jamovi: Fig. 35.10; SPSS: Fig. 35.11) shows that \(P < 0.001\) (which is what \(P = 0.000\) in SPSS means). We conclude:

The sample presents very strong evidence (two-tailed \(P < 0.001\)) that removal efficiency is associated with the inlet temperature (\(r = 0.891\); \(n = 32\)) in the population.

The relationship is approximately linear and there is no obvious non-constant variance, and the sample size is larger than 25, so the hypothesis test results will be statistically valid.

FIGURE 35.10: jamovi output for the removal-efficiency data

FIGURE 35.11: SPSS output for the removal-efficiency data

35.6 Summary

In this chapter, correlation was used to describe the strength and direction of linear relationships between two quantitative variables.

Correlation coefficients (denoted \(r\) in the sample; \(\rho\) in the population) are always between \(-1\) and \(+1\).

Positive values denote positive relationships between the two variables: as one value gets larger, the other tends to get larger too.

Negative values denote negative relationships between the two variables: as one value gets larger, the other tends to get smaller. Values close to \(-1\) or \(+1\) indicate very strong relationships; values near zero show very little linear relationship between the variables. Hypothesis tests for \(\rho\) can be conducted using software.

Sometimes, \(R^2\) is used to describe the relationship: it indicates what percentage of the variation in the response variable can be explained by knowing the value of the explanatory variable.

35.7 Quick review questions

A study of Chinese paediatric patients513 studied the relationship between the 6-minute walk distance (6MWD) and maximum oxygen uptake (VO2max) for \(n = 29\) patients.

The correlation coefficient is reported as \(r = 0.457\), and the corresponding \(P\)-value as \(P = 0.013\).

  1. The \(x\)-variable is

  2. True or false: Since the \(P\)-value is small, the correlation must be quite strong.

  3. The relationship is best described as

  4. The value of \(R^2\) (to one decimal place, expressed as a percentage) is:

  5. For statistical validity, we need to assume that:

35.8 Exercises

Selected answers are available in Sect. D.32.

Exercise 35.1 Draw a scatterplot with:

  1. A negative correlation coefficient, with the value of \(r\) very close to (but not equal to) \(-1\).
  2. A positive correlation coefficient, with the value of \(r\) very close to (but not equal to) \(+1\).
  3. A correlation coefficient very close to \(0\).

Exercise 35.2 A study (Raymond H. Myers514, p. 75) of American footballers measured the right-leg strengths \(x\) of 13 players (using a weight lifting test), and the distance \(y\) they punted a football (with their right leg) (Fig. 35.12).

  1. The value of the correlation coefficient is 0.881. Compute the value of \(R^2\), and explain what this means.
  2. jamovi was used to study the correlation (Fig. 35.13). Using this output, perform a hypothesis test to determine if a correlation exists between punting distance and right-leg strength.

FIGURE 35.12: Punting distance and right leg strength

FIGURE 35.13: jamovi output for the punting data

Exercise 35.3 A study examined the time taken to deliver soft drinks to vending machines515 using a sample of size \(n = 25\) (Fig. 35.14).

To perform a test of the correlation coefficient, are the statistical validity conditions met?

FIGURE 35.14: The time taken to deliver soft drinks to vending machines

Exercise 35.4 A study of hot mix asphalt516 created \(n = 42\) samples of asphalt and measured the volume of air voids and the bitumen content by weight (Fig. 35.15).

  1. Using the plot, estimate the value of \(r\).
  2. The value of \(R^2\) is 99.29%. What is the value of \(r\)? (Hint: Be careful!)
  3. Would you expect the \(P\)-value testing \(H_0\): \(\rho=0\) to be small or large? Explain.
  4. Would the test be statistically valid?

FIGURE 35.15: Air voids in bitumen samples

Exercise 35.5 The California Bearing Ratio (CBR) value is used to describe the soil subgrade for flexible pavements (such as in the design of airfield runways).

One study517 examined the relationship between CBR and other properties of soil, including the plasticity index (PI, a measure of the plasticity of the soil).

The scatterplot from 16 different soil samples from Assam, India, is shown in Fig. 35.16.

  1. Using the plot, estimate the value of \(r\).
  2. The value of \(R^2\) is 67.07%. What is the value of \(r\)? (Hint: Be careful!)
  3. Would you expect the \(P\)-value testing \(H_0\): \(\rho=0\) to be small or large? Explain.
  4. Would the test be statistically valid?

FIGURE 35.16: The relationship between CBR and PI in sixteen soil samples