Chapter 8 Correlation

Association (or “correlation,” in the broad sense) is a statistical concept that describes how two variables change together, indicating whether and how strongly they are related. Two variables are associated (correlated) when knowing the value of one provides information about the other.

8.1 Scatter plot

A scatterplot is one of the simplest and most powerful tools for exploring the relationship between two quantitative variables. It displays data points on a two-dimensional plane: the horizontal axis (x-axis) represents one variable, and the vertical axis (y-axis) represents the other.

Each point on the plot corresponds to a single observation.

Scatterplots allow us to visually assess:

  1. Direction of association: an upward trend indicates positive correlation (as X increases, Y tends to increase), while a downward trend indicates negative correlation (as X increases, Y tends to decrease).

  2. Form of relationship: linear vs nonlinear patterns.

  3. Strength of association: points tightly clustered along a (straight) line indicate strong (linear) correlation.

  4. Potential outliers and influential observations: these are points that deviate significantly from the overall pattern.
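Direction and strength can also be checked numerically. The following sketch (not part of the chapter's materials; it assumes Python with numpy is available) simulates three clouds of points — a tight upward cloud, a loose upward cloud, and a tight downward cloud — and reports the correlation for each:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)

# The noise scale controls how tightly the points cluster around the line.
y_strong = 2 * x + rng.normal(scale=0.5, size=500)   # tight upward cloud
y_weak   = 2 * x + rng.normal(scale=5.0, size=500)   # loose upward cloud
y_neg    = -2 * x + rng.normal(scale=0.5, size=500)  # tight downward cloud

for label, y in [("strong positive", y_strong),
                 ("weak positive", y_weak),
                 ("strong negative", y_neg)]:
    r = np.corrcoef(x, y)[0, 1]
    print(f"{label}: r = {r:.2f}")
```

The tight clouds give correlations near 1 (or -1), while the loose cloud gives a value much closer to zero, matching points 1 and 3 above.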

Figure 8.1: Scatter plot showing heights of 1078 fathers and their adult sons, England, ca. 1900. Dataset used by Pearson to illustrate correlation.

8.2 Pearson correlation coefficient

The Pearson correlation coefficient¹ measures the strength of linear association between two quantitative variables. The correlation coefficient can take values in the range from -1 to 1 (inclusive).

A correlation equal to exactly 1 or -1 indicates an exact linear (functional) relationship between the two variables. The sign depends on the sign of the slope of the line that maps one variable onto the other.

The more tightly the points are clustered around a straight line, the closer the Pearson correlation coefficient is to 1 or -1. The more “loosely” the points are clustered, the closer the coefficient is to zero.

Figure 8.2 shows example scatter plots based on simulated data for various levels of non-negative (\(\ge 0\)) correlation coefficients.


Figure 8.2: Example scatter plots for non-negative values of the correlation coefficient

In the case of negative correlation (as illustrated in Figure 8.3), the “clouds of points” slope downward:


Figure 8.3: Example scatter plots for negative values of the correlation coefficient

If the points on the scatter plot are arranged as shown in Figures 8.2 and 8.3, that is, grouped around a straight line (with the cloud resembling a tilted ellipse when the interdependence is weaker), then the joint distribution of the two variables can be described using five numbers:

  • the mean of variable X,
  • the standard deviation of variable X,
  • the mean of variable Y,
  • the standard deviation of variable Y,
  • and the correlation coefficient between variable X and variable Y.
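These five numbers are easy to compute. A minimal sketch (assuming numpy; the heights below are simulated, loosely echoing the father-son data of Figure 8.1, not the actual Pearson dataset):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=68, scale=2.7, size=1000)            # fathers' heights (simulated)
y = 0.5 * x + rng.normal(loc=35, scale=2.3, size=1000)  # sons' heights (simulated)

# The five-number description of the joint distribution:
summary = {
    "mean_x": x.mean(),
    "sd_x":   x.std(),    # "sigma" (divide-by-n) standard deviation
    "mean_y": y.mean(),
    "sd_y":   y.std(),
    "r":      np.corrcoef(x, y)[0, 1],
}
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```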

8.2.1 Correlation coefficient — formula

The formula for the correlation coefficient can be written in several equivalent ways.

Using only the values of variables X and Y and their means, the Pearson correlation (denoted here as \(r(X,Y)\) or \(r_{xy}\)) can be calculated as:

\[r(X,Y) = \frac{\sum_i{(x_i-\bar{x})(y_i-\bar{y})}}{\sqrt{\sum_i(x_i-\bar{x})^2\sum_i(y_i-\bar{y})^2}} \tag{8.1} \]
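Equation (8.1) translates directly into code. A minimal sketch (assuming Python with numpy; the function name is illustrative):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation via equation (8.1): sum of centered cross-products
    divided by the square root of the product of the sums of squares."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly linear → 1.0
```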

Another formula, which is much easier to remember, is the following: the correlation coefficient is the average product of the standardized values (z-scores, see (5.1)) of X and Y²:

\[r(X,Y) = \frac{1}{n}\sum_{i=1}^n z_{x_i} z_{y_i} = \frac{1}{n}\sum_{i=1}^n\left(\frac{x_i-\bar{x}}{\widehat{\sigma}_X}\right)\left(\frac{y_i-\bar{y}}{\widehat{\sigma}_Y}\right) \tag{8.2} \]

Correlation for a sample is often denoted by \(r\), while correlation for a population can be denoted by \(\rho\) (“rho”). If this is not clear from the context, it is worth specifying which variables the correlation refers to (e.g., by writing \(r(X,Y)\), \(r_{xy}\), \(\rho_{XY}\), etc.).
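The z-score formula (8.2) can be checked against numpy's built-in `np.corrcoef`; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def pearson_r_z(x, y):
    """Pearson correlation via equation (8.2): the mean product of z-scores,
    using the "sigma" (divide-by-n) standard deviation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    zx = (x - x.mean()) / x.std()   # np.std divides by n by default
    zy = (y - y.mean()) / y.std()
    return (zx * zy).mean()

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = x + rng.normal(size=200)
print(np.isclose(pearson_r_z(x, y), np.corrcoef(x, y)[0, 1]))  # True
```

Both formulas agree because dividing by the same n in the z-scores and in the average cancels out against equation (8.1).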

8.2.2 Pearson correlation coefficient – properties

  • Pearson correlation coefficient is dimensionless: it does not depend on the units of measurement and is always between -1 and 1.

  • Pearson correlation coefficient is symmetrical:

\[r(X, Y) = r(Y, X)\]

  • Pearson correlation coefficient is sensitive to outliers: a single extreme value can significantly change the correlation.

  • Pearson correlation coefficient is invariant under positive linear transformations: adding a constant to X and/or Y, or multiplying them by a positive constant, does not change the correlation (multiplying by a negative constant flips its sign).

  • Pearson correlation measures linear association only; a strong nonlinear relationship can still yield a correlation near zero.
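Several of these properties can be verified directly; a minimal sketch, assuming numpy and simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = x + rng.normal(scale=0.5, size=100)
r = np.corrcoef(x, y)[0, 1]

# Symmetry: r(X, Y) == r(Y, X).
assert np.isclose(r, np.corrcoef(y, x)[0, 1])

# Invariance: a positive linear change of units (e.g. inches -> centimetres)
# leaves the correlation unchanged.
assert np.isclose(r, np.corrcoef(2.54 * x + 100, y)[0, 1])

# Sensitivity to outliers: a single extreme point can change r noticeably.
x_out = np.append(x, 10)
y_out = np.append(y, -10)
print(f"r = {r:.2f}, with one outlier: r = {np.corrcoef(x_out, y_out)[0, 1]:.2f}")
```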

8.2.3 Covariance

Covariance is another measure that indicates how two variables vary together.

The formula for covariance, often called the “sample formula”, is:

\[s_{xy} = \frac{\sum_{i=1}^n \left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)}{n-1} \tag{8.3}\]

The “sigma” formula (sometimes referred to as the “population covariance”, though occasionally used for samples) is:

\[ \widehat{\sigma}_{xy} = \frac{\sum_{i=1}^n \left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)}{n} \tag{8.4} \]

The Pearson linear correlation coefficient can be interpreted (and computed) as a standardized covariance:

\[r_{xy} = \frac{s_{xy}}{s_x s_y} \tag{8.5} \]

or, using \(\sigma\) versions:

\[r_{xy} = \frac{\widehat{\sigma}_{xy}}{\widehat{\sigma}_x \widehat{\sigma}_y} \tag{8.6} \]

In these formulas, \(s_{xy}\) is the “sample” covariance (8.3), \(\widehat{\sigma}_{xy}\) is the “sigma” covariance (8.4); \(s_x\), \(s_y\), \(\widehat{\sigma}_x\), \(\widehat{\sigma}_y\) are the standard deviations (see (4.1) and (4.2)) of variables \(X\) and \(Y\).
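Both routes to the correlation — (8.5) with the n-1 versions and (8.6) with the "sigma" versions — give the same value, because the n vs n-1 factors cancel. A minimal sketch, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=50)
y = 0.7 * x + rng.normal(size=50)
n = len(x)

# Covariance: "sample" (n-1) version, eq. (8.3), and "sigma" (n) version, eq. (8.4).
s_xy     = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)
sigma_xy = ((x - x.mean()) * (y - y.mean())).sum() / n

# Standardizing each covariance by the matching standard deviations
# gives the same correlation, eqs. (8.5) and (8.6).
r_sample = s_xy / (x.std(ddof=1) * y.std(ddof=1))    # ddof=1: divide by n-1
r_sigma  = sigma_xy / (x.std() * y.std())            # default: divide by n
print(np.isclose(r_sample, r_sigma))                  # True
print(np.isclose(r_sample, np.corrcoef(x, y)[0, 1]))  # True
```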

8.3 Association and causation

When X and Y are correlated, it is possible that:

  • X might cause Y, directly or through a mediating variable;
  • Y might cause X (directly or indirectly);
  • there may be a third variable, a confounder, influencing both;
  • the causal structure may be more complex;
  • the correlation may be completely spurious, due to chance or to coincident trends.

“Correlation does not imply causation” is a fundamental principle in statistics. Just because two variables move together does not mean that one causes the other. For example, ice cream sales and drowning incidents both increase in summer, but buying ice cream does not cause drowning — the underlying factor is the season. Confusing correlation with causation can lead to misleading conclusions, so careful analysis, controlled experiments, or additional evidence are needed before claiming a causal relationship.

8.5 Questions

8.5.1 Discussion questions

Question 8.1 (Freedman, Pisani, and Purves 2007) According to your intuition, what is the Pearson correlation coefficient for the data illustrated in this scatter plot? Is there a correlation (association) between x and y?

Question 8.2 (Freedman, Pisani, and Purves 2007) For a certain data set, the correlation coefficient is 0.57. Say whether each of the following statements is true or false, and explain briefly; if you need more information, say what you need, and why.

  • There are no outliers.

  • There is a non-linear association.

Question 8.3 (Freedman, Pisani, and Purves 2007) For school children, shoe size is strongly correlated with reading skills. How is this possible?

Question 8.4 (Freedman, Pisani, and Purves 2007) The correlation between height and weight among men aged 18-74 in the U.S. is about 0.40. Say whether each conclusion below follows from the data; explain your answer.

  • Taller men tend to be heavier.

  • The correlation between weight and height for men aged 18-74 is about 0.40.

  • Heavier men tend to be taller.

  • If someone eats more and puts on 10 pounds, he is likely to get somewhat taller.

Question 8.5 (Freedman, Pisani, and Purves 2007) Studies find a negative correlation between hours spent watching TV and scores on reading tests. Does watching TV make people less able to read?

Question 8.6 (Freedman, Pisani, and Purves 2007) Many studies have found an association (and causation!) between cigarette smoking and heart disease. One study found an association between coffee drinking and heart disease. Should you conclude that coffee drinking causes heart disease? Or is there some other explanation?

8.5.2 Test questions

Question 8.7 A circle of diameter d has area \(\frac{1}{4}\pi d^2\). We plot a scatter diagram of area against diameter for a sample of circles.

The correlation coefficient is:

8.6 Exercises

Exercise 8.1 Examine the data from the 2000 U.S. presidential election in Florida, available in the UsingR package as the florida dataset. Create a scatter plot with Al Gore’s (Democratic Party) votes on the x‑axis and Pat Buchanan’s (Reform Party) votes on the y‑axis. Do you notice any outliers? Investigate the story behind the outlier.

Data for Gore and Buchanan votes can also be found in the appendix of La certeza absoluta y otras ficciones: los secretos de la estadística by Pere Grima.

Literature

Freedman, David, Robert Pisani, and Roger Purves. 2007. Statistics, 4th Edition. New York: W. W. Norton & Company.

  1. If someone uses the term correlation without further clarification, it usually refers to the Pearson correlation coefficient.

  2. Please note that the equation uses \(\widehat{\sigma}\), the “population” formula of the standard deviation; see (4.2).