Chapter 10 Correlation
Our book uses “measurement variables” for quantitative variables at the ordinal, interval, or ratio level of measurement. The majority of our examples will be from the strongest level (ratio).
I’m choosing to include material on correlation as chapter 10 and linear regression as chapter 11 of my notes, although your textbook scatters this material throughout these two chapters.
10.1 Univariate Statistics vs Bivariate Statistics
Data set: Variable \(X\) is the ACT composite score; the \(X\) variable will be referred to as the explanatory variable, predictor variable, or independent variable.
Variable \(Y\) is the freshman college GPA; the \(Y\) variable will be referred to as the response variable or dependent variable.
In statistics, we often want to know if two variables \(X\) and \(Y\) are mathematically related to each other, and eventually whether we can form a mathematical model to explain or predict variable \(Y\) based on variable \(X\).
This is a sample of \(n=10\) college students. I have computed the means and standard deviations of both variables.
\(X\) | \(Y\) |
---|---|
32 | 4.0 |
28 | 3.5 |
26 | 1.2 |
24 | 3.3 |
22 | 3.0 |
21 | 2.8 |
20 | 2.6 |
20 | 2.1 |
19 | 3.5 |
18 | 2.4 |
## [1] "Variable X (ACT Score)"
## mean sd
## 23 4.472136
## [1] "Variable Y (GPA)"
## mean sd
## 2.84 0.8126773
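As a minimal R sketch (the vector names act and gpa are my own choice, not from the notes), these summary statistics could be reproduced as follows:

```r
# Enter the sample of n = 10 students as two vectors
act <- c(32, 28, 26, 24, 22, 21, 20, 20, 19, 18)            # X: ACT composite score
gpa <- c(4.0, 3.5, 1.2, 3.3, 3.0, 2.8, 2.6, 2.1, 3.5, 2.4)  # Y: freshman GPA

mean(act); sd(act)   # 23 and about 4.472
mean(gpa); sd(gpa)   # 2.84 and about 0.813
```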
10.2 Scatterplot
We will construct a scatterplot with the explanatory variable on the horizontal \(x\)-axis and the response variable on the vertical \(y\)-axis. I will draw it by hand on the board and reproduce it below with my software.
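One way to produce such a scatterplot in base R, reusing the act and gpa vectors defined above (the axis labels are my own wording), is sketched below:

```r
# Scatterplot: explanatory variable (ACT) on the x-axis,
# response variable (GPA) on the y-axis
plot(act, gpa,
     xlab = "ACT Composite Score",
     ylab = "Freshman College GPA",
     main = "Freshman GPA vs. ACT Score")
```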
What sort of mathematical model could we use to try to explain the student’s freshman GPA, using their ACT score?
Are there any points that seem to be “outliers”?
10.3 The Correlation Coefficient
A statistic that is commonly used to quantify the strength of a linear relationship between two variables is the correlation coefficient. There are many such coefficients; the most common one, which we will use in this course, is sometimes called Pearson’s correlation coefficient.
If our bivariate data represent an entire population, we use the Greek letter “rho”, \(\rho\), to represent the population correlation as a parameter.
More commonly, our data is a sample and we compute the sample statistic \(r\) as our estimate of the population correlation \(\rho\), similarly to using \(\bar{x}\) to estimate \(\mu\) or \(s^2\) to estimate \(\sigma^2\).
The correlation coefficient has the property that it will always take on a numerical value between \(-1\) and \(+1\). \[-1 \leq r \leq +1\]
If the correlation is \(r=+1\), this is perfect positive correlation and all points lie exactly on a straight line with positive slope.
Similarly for \(r=-1\), except the line will have negative slope in order to have perfect negative correlation.
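As a quick numerical illustration (a toy example of my own, not from the notes), points generated from an exact line give a correlation of exactly \(\pm 1\):

```r
# Points exactly on a line with positive slope give r = +1;
# a line with negative slope gives r = -1
x <- 1:10
cor(x,  2 * x + 3)   # 1 (up to rounding)
cor(x, -2 * x + 3)   # -1
```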
For the correlation to be exactly zero, there must be no linear relationship between the two variables. On the plot, you wouldn’t even be able to tell if the “line of best fit” would have a positive or negative slope.
Note that this does not preclude that there is some nonlinear relationship, such as in this graph with \(r=0\).
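For example (again a toy sketch of my own), a perfect quadratic relationship can still have a correlation of zero:

```r
# A perfect nonlinear (quadratic) relationship can still have r = 0
x <- -5:5
y <- x^2
cor(x, y)    # 0: no linear relationship at all
plot(x, y)   # but the scatterplot shows a clear curved pattern
```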
Most data sets of interest will have a correlation that is not exactly \(\pm 1\) or \(0\). Generally, we are interested in the magnitude and the direction of a correlation. I will demonstrate some examples from data sets built into my statistical software package, R.
The data set faithful has data from Old Faithful Geyser in Yellowstone National Park. The park rangers use a linear regression model to predict the waiting time until the next eruption, using the length of the previous eruption. The scatterplot below is for \(n=272\) eruptions of this geyser.
If you would like to see a livestream of the geyser (along with a prediction of the next eruption), go to https://www.nps.gov/yell/learn/photosmultimedia/webcams.htm .
The correlation is strong and positive, \(r=+0.901\).
## eruptions waiting
## 41 4.350 80
## 57 3.717 71
## 220 4.150 76
## 245 4.583 85
## 256 3.817 80
## [1] "Length of Eruption (minutes)"
## mean sd n
## 3.487783 1.141371 272
## [1] "Waiting Time until next eruprtion (minutes)"
## mean sd n
## 70.89706 13.59497 272
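Since faithful is built into R, this correlation can be checked directly (a brief sketch):

```r
# Built-in Old Faithful data: eruption length vs. waiting time
head(faithful)
cor(faithful$eruptions, faithful$waiting)   # about +0.901
```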
Using data from Motor Trend magazine, the scatterplot below shows the relationship between the weight of the vehicle (measured in thousands of pounds) and the gas mileage (measured in miles per gallon) for a sample of \(n=32\) cars.
What sort of correlation do you expect?
## mpg cyl disp hp drat wt qsec vs am gear carb
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## [1] "Weight of Vehicle (thousands of pounds"
## mean sd n
## 3.21725 0.9784574 32
## [1] "Gas Mileage (miles per gallon)"
## mean sd n
## 20.09062 6.026948 32
The correlation is negative and strong, with \(r=-0.868\).
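This is the built-in mtcars data set, so the correlation can be verified in one line (a sketch):

```r
# Built-in Motor Trend data: vehicle weight (1000s of pounds) vs. gas mileage (mpg)
cor(mtcars$wt, mtcars$mpg)   # about -0.868
```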
The scatterplot below shows the relationship between wind speed (miles per hour) and ozone level in the air (parts per billion) in New York City.
What sort of correlation do you expect? (probably harder to answer without looking at the graph unless you know a lot about the science of ozone levels)
## Ozone Solar.R Wind Temp Month Day
## 15 18 65 13.2 58 5 15
## 51 13 137 10.3 76 6 20
## 76 7 48 14.3 80 7 15
## 85 80 294 8.6 86 7 24
## 108 22 71 10.3 77 8 16
## [1] "Wind Speed (miles per hour)"
## mean sd n
## 9.93964 3.557713 111
## [1] "Ozone Level (parts per billion)"
## mean sd n
## 42.0991 33.27597 111
What about the correlation between Ozone and Temperature?
## [1] "Temperature (degrees Fahrenheit)"
## mean sd n
## 77.79279 9.529969 111
## [1] "Ozone Level (parts per billion)"
## mean sd n
## 42.0991 33.27597 111
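These summaries come from R's built-in airquality data; a sketch of how both correlations could be computed, keeping only the \(n=111\) complete cases, is below. Run it to see the values; the Wind correlation should come out negative and the Temperature correlation positive.

```r
# Built-in New York air quality data; drop rows with missing values
# so that n = 111 complete cases remain
aq <- na.omit(airquality)
nrow(aq)                 # 111
cor(aq$Wind, aq$Ozone)   # negative: more wind, less ozone
cor(aq$Temp, aq$Ozone)   # positive: hotter days, more ozone
```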
What about the relationship between the weight of a college student’s backpack and their body weight?
## BackpackWeight BodyWeight Ratio BackProblems Major Year Sex
## 1 9 125 0.0720000 1 Bio 3 Female
## 2 8 195 0.0410256 0 Philosophy 5 Male
## 33 9 135 0.0666667 0 LS 6 Female
## 42 13 135 0.0962963 1 SOCS 3 Female
## 100 15 170 0.0882353 0 History 5 Male
## [1] "Weight of Backpack (pounds)"
## mean sd n
## 11.66 5.765134 100
## [1] "Body Weight (pounds)"
## mean sd n
## 153.05 29.39744 100
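If these are the Backpack data from the Stat2Data package (an assumption on my part, based on the column names shown above), the correlation could be computed like this:

```r
# Assumption: the data shown above are Stat2Data's Backpack data set
# install.packages("Stat2Data")   # if not already installed
library(Stat2Data)
data(Backpack)
cor(Backpack$BodyWeight, Backpack$BackpackWeight)
plot(Backpack$BodyWeight, Backpack$BackpackWeight,
     xlab = "Body Weight (pounds)",
     ylab = "Weight of Backpack (pounds)")
```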
Using my class data, do you think there is a strong, moderate, or weak correlation between Texts and Height? Try to guess the correlation.
## Gender Color Texts Chocolate HSAlgebra Pizza Sushi Tacos Temp Height
## 1 F yellow 3 10 8 3 1 2 85 62
## 3 M orange 21 9 3 3 1 2 88 66
## 5 M blue 0 7 5 1 3 2 80 74
## 13 M green 0 6 4 3 1 2 85 72
## 18 M purple 0 7 9 2 1 3 87 71
## 27 F blue 31 8 6 1 3 2 80 65
## 31 F blue 3 8 3 1 3 2 89 64
## 33 F coral 6 7 1 2 3 1 88 64
## 40 F pink 10 5 8 1 3 2 87 67
## 42 F purple 11 9 8 1 3 2 80 66
## [1] "Number of Text Messages"
## mean sd n
## 12.59091 18.51878 44
## [1] "Height (inches)"
## mean sd n
## 67.22727 4.192261 44
10.4 Correlation does not imply causation
As we have discussed before, just because there is an association or relationship between two measurement variables, as indicated by looking at a scatterplot or by computing the correlation coefficient \(r\), does not mean that there is a causal relationship. There are many situations where a strong correlation between variables \(X\) and \(Y\) is found even though \(X\) does not cause \(Y\) (or vice versa).
Shark attacks vs ice cream sales
Number of churches vs number of liquor stores
Coffee consumption vs heart attacks
What are the confounding (or lurking) variables in the above examples?
10.5 Calculating the correlation coefficient
\[\Large{r=\frac{1}{n-1} \sum_{i=1}^n [\frac{(x_i-\bar{x})}{s_x}] [\frac{(y_i-\bar{y})}{s_y}]}\]
This estimates the population correlation coefficient \(\rho\) (the Greek letter “rho”). A different, but algebraically equivalent version of this formula is in your book. In reality, you would use software or a statistical calculator to compute \(r\).
Let’s go back to our sample of \(n=10\).
\(X\) | \(Y\) |
---|---|
32 | 4.0 |
28 | 3.5 |
26 | 1.2 |
24 | 3.3 |
22 | 3.0 |
21 | 2.8 |
20 | 2.6 |
20 | 2.1 |
19 | 3.5 |
18 | 2.4 |
We computed: \[\Large{\bar{x}=23, s_x=4.472136, \bar{y}=2.84, s_y=0.8126773}\]
Our formula could be written:
\[\Large{r=\frac{1}{n-1} \sum_{i=1}^n Z_X \times Z_Y}\]
If you were working this by hand, you would find the following (a spreadsheet would be nice here):
\[\Large{Z_X = \frac{x_i-\bar{x}}{s_x}}\]
\[\Large{Z_Y = \frac{y_i-\bar{y}}{s_y}}\]
## X Y Z_X Z_Y Z_X.Z_Y
## [1,] 32 4.0 2.012 1.427 2.873
## [2,] 28 3.5 1.118 0.812 0.908
## [3,] 26 1.2 0.671 -2.018 -1.354
## [4,] 24 3.3 0.224 0.566 0.127
## [5,] 22 3.0 -0.224 0.197 -0.044
## [6,] 21 2.8 -0.447 -0.049 0.022
## [7,] 20 2.6 -0.671 -0.295 0.198
## [8,] 20 2.1 -0.671 -0.911 0.611
## [9,] 19 3.5 -0.894 0.812 -0.726
## [10,] 18 2.4 -1.118 -0.541 0.605
The summation part of the formula is the sum of the last column, \(Z_X \times Z_Y\) (List L5 if you are working on a calculator).
\[\Large{r=\frac{1}{10-1}(3.218)=+0.358}\]
We have a moderate to weak positive correlation between ACT score and freshman GPA.
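A sketch of the same calculation in R, reusing the act and gpa vectors defined earlier, confirms the hand computation and matches R's built-in cor() function:

```r
# Reproduce the hand calculation: standardize X and Y, multiply, sum,
# and divide by n - 1
zx <- (act - mean(act)) / sd(act)
zy <- (gpa - mean(gpa)) / sd(gpa)
sum(zx * zy) / (length(act) - 1)   # about 0.358
cor(act, gpa)                      # same answer from the built-in function
```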
Let’s look at our plot again. Notice the dashed lines are drawn at \(\bar{x}=23\) and \(\bar{y}=2.84\) and serve to divide our graph into four “quadrants”. What is true about the points (the students) in the upper right-hand quadrant?
What if we deleted the outlier or influential point (the student with the unusually low GPA)? What would happen to the correlation?