Chapter 4 Correlation and Regression

4.1 Covariance

4.1.1 🧠Refresher🧠

The covariance between two variables is a stepping stone towards the correlation between two variables. Both characterize the degree to which two variables fluctuate generally in the same direction. Covariance is to correlation as variance is to standard deviation.

|               | Variance                                                 | Covariance                                                               |
|---------------|----------------------------------------------------------|--------------------------------------------------------------------------|
| What it shows | Spread of one variable                                   | How two variables change together                                        |
| Formula       | \(\mathrm{Var}(x) = \frac{\sum (x - \bar{x})^2}{df}\)    | \(\mathrm{Cov}(x, y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{df}\)      |
  • \(x\) is an individual observation from variable 1
  • \(\bar{x}\) is the sample mean of variable 1
  • \(y\) is an individual observation from variable 2
  • \(\bar{y}\) is the sample mean of variable 2
  • \(df\) stands for degrees of freedom, which differ based on whether you’re computing population or sample statistics
  • For population variance or covariance, \(df = N\)
  • For sample variance or covariance, \(df = n - 1\)

Note that the “n” in the degrees-of-freedom calculation can be slightly ambiguous. For variance and standard deviation, “n” refers to the number of observations. For covariance and correlation, “n” refers to the number of pairs of observations. In the worked example below, each student is one “observation” that comes with two data points: hours studied and an exam score.

Also note that the variance formula can be rewritten as \(\frac{\Sigma(x-\bar{x})(x-\bar{x})}{df}\). This makes it look even more like the covariance formula.
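If you like seeing formulas as code, here's a short Python sketch of both definitions (the data are made up for illustration). It also confirms the point above: the covariance of a variable with itself is just its variance.

```python
# Sketch: sample variance and covariance straight from their definitions.

def sample_variance(xs):
    n = len(xs)  # n = number of observations
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_covariance(xs, ys):
    n = len(xs)  # n = number of (x, y) pairs
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

x = [1, 2, 3, 4, 5]       # hypothetical data
print(sample_variance(x))          # spread of one variable: 2.5
print(sample_covariance(x, x))     # Cov(x, x) equals Var(x): 2.5
```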

4.1.2 💪Worked example💪

We are examining the relationship between Hours Studied \((M = 3.00, SD = 1.58)\) and Test Score \((M = 67.40, SD = 12.90)\) for 5 students. Let’s calculate the sample covariance for these students.

|           | Hours Studied | Test Score |
|-----------|---------------|------------|
| Student A | 1             | 52         |
| Student B | 2             | 60         |
| Student C | 3             | 65         |
| Student D | 4             | 75         |
| Student E | 5             | 85         |



We want to calculate “deviations from the mean” for both variables. Since the mean of Hours Studied, \(\bar{x}\), is 3, and the mean of Test Score, \(\bar{y}\), is 67.4…

|           | Deviation x  | Deviation y       |
|-----------|--------------|-------------------|
| Student A | 1 - 3 = -2   | 52 - 67.4 = -15.4 |
| Student B | 2 - 3 = -1   | 60 - 67.4 = -7.4  |
| Student C | 3 - 3 = 0    | 65 - 67.4 = -2.4  |
| Student D | 4 - 3 = 1    | 75 - 67.4 = 7.6   |
| Student E | 5 - 3 = 2    | 85 - 67.4 = 17.6  |



Now, since the numerator in the covariance is \((x-\bar{x})(y-\bar{y})\), that means we need to multiply the deviations for each pair of xy observations:

|           | Deviation x | Deviation y | \((x - \bar{x})(y - \bar{y})\) |
|-----------|-------------|-------------|--------------------------------|
| Student A | -2          | -15.4       | 30.8                           |
| Student B | -1          | -7.4        | 7.4                            |
| Student C | 0           | -2.4        | 0                              |
| Student D | 1           | 7.6         | 7.6                            |
| Student E | 2           | 17.6        | 35.2                           |



Now we have a list of the cross-products of deviations for x and y, (30.8, 7.4, 0, 7.6, 35.2). “\(\Sigma (x-\bar{x})(y-\bar{y})\)” means “add those cross products together”. 30.8 + 7.4 + 0 + 7.6 + 35.2 = 81.

Now we just divide by degrees of freedom, \(df\). We have 5 students, so \(df=5-1=4\).

\(\mathrm{Cov}(x, y) = \frac{81}{4} = 20.25\)
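To double-check the arithmetic, here is the same calculation in Python:

```python
# Check of the worked example: covariance of hours studied and test scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 65, 75, 85]

n = len(hours)            # 5 pairs of observations
mean_x = sum(hours) / n   # 3.0
mean_y = sum(scores) / n  # 67.4

# Cross-products of deviations, then divide their sum by df = n - 1.
cross_products = [(x - mean_x) * (y - mean_y) for x, y in zip(hours, scores)]
cov_xy = sum(cross_products) / (n - 1)

print(round(sum(cross_products), 2))  # 81.0
print(round(cov_xy, 2))               # 20.25
```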

4.1.3 📝Homework problems📝

  1. We are interested in the relationship between Hours Spent on Social Media and Depression Score. Four students reported their data:
| Student | Social Media Hours (x) | Depression Score (y) |
|---------|------------------------|----------------------|
| A       | 1                      | 5                    |
| B       | 3                      | 10                   |
| C       | 2                      | 8                    |
| D       | 4                      | 15                   |

The average Social Media Hours, \(\bar{x}\), is 2.5. The average Depression Score, \(\bar{y}\), is 9.5.

Now you have all the info you need to calculate the deviations and cross products:



| Student | \(x\) | \(y\) | \(x - \bar{x}\) | \(y - \bar{y}\) | \((x - \bar{x})(y - \bar{y})\) |
|---------|-------|-------|-----------------|-----------------|--------------------------------|
| A       | 1     | 5     |                 |                 |                                |
| B       | 3     | 10    |                 |                 |                                |
| C       | 2     | 8     |                 |                 |                                |
| D       | 4     | 15    |                 |                 |                                |



The sum of the cross products, \(\Sigma (x - \bar{x})(y - \bar{y})\):

The sample covariance, \(\frac{\Sigma (x - \bar{x})(y - \bar{y})}{df}\):

  2. A small study looked at how much students exercise per week and their reported stress levels.

| Participant | \(x\) (Hours of Exercise) | \(y\) (Stress Score) | \(x - \bar{x}\) | \(y - \bar{y}\) | \((x - \bar{x})(y - \bar{y})\) |
|-------------|---------------------------|----------------------|-----------------|-----------------|--------------------------------|
| A           | 2.0                       | 18                   |                 |                 |                                |
| B           | 4.0                       | 14                   |                 |                 |                                |
| C           | 6.0                       | 12                   |                 |                 |                                |

Given:

  • Mean of \(x\): \(\bar{x} = 4.0\)
  • Mean of \(y\): \(\bar{y} = 14.7\)

Use the table to calculate the sample covariance: \[ s_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1} \]

4.2 Correlation coefficient

The correlation coefficient is a standardized version of the covariance. Let’s look at them side-by-side:

4.2.1 🧠Correlation coefficient refresher🧠

| Covariance                                                       | Correlation                                  |
|------------------------------------------------------------------|----------------------------------------------|
| \(COV_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1}\)     | \(r_{xy} = \frac{COV_{xy}}{s_x s_y}\)        |

(Note: \(s_x\) and \(s_y\) are the sample standard deviations of x and y, respectively.)

See? The correlation is just a slight modification of the covariance.

4.2.2 💪Worked example💪

Suppose a researcher collected data on hours of sleep (x) and mood scores (y) for a group of participants. They calculated the following:

  • Covariance between hours slept and mood: 2.1
  • Standard deviation of hours slept: 1.4
  • Standard deviation of mood scores: 1.75

We can calculate the Pearson correlation coefficient using the formula:

\[ r = \frac{\text{cov}(x, y)}{s_x \cdot s_y} \]

Where:

  • \(\text{cov}(x, y)\) is the covariance between x and y
  • \(s_x\) is the standard deviation of x
  • \(s_y\) is the standard deviation of y

\[ r = \frac{2.1}{1.4 \times 1.75} \]

\[ r = \frac{2.1}{2.45} \approx 0.857 \]

Final Answer:

\[ \boxed{r \approx 0.86} \]

Note: When calculating correlation coefficients by hand, it’s easy to get incorrect answers due to rounding errors. The formulas we use must have accurate inputs—especially for the covariance and standard deviations. If you round too early or too much, you might end up with a correlation value that falls outside the possible range (i.e., less than -1 or greater than 1), which is a sign that something went wrong. Always carry enough decimal places during intermediate steps, and only round at the very end.
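Here is the same computation as a quick Python sketch, using the numbers from the worked example. The final check catches answers that fall outside the possible range:

```python
# Pearson r from a covariance and two standard deviations
# (values from the sleep-and-mood example above).
cov_xy = 2.1  # covariance of hours slept and mood
s_x = 1.4     # SD of hours slept
s_y = 1.75    # SD of mood scores

r = cov_xy / (s_x * s_y)

# A correlation outside [-1, 1] means something went wrong upstream.
assert -1 <= r <= 1
print(round(r, 2))  # 0.86
```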

4.2.3 📝Homework problems📝

  1. You are studying the relationship between hours spent on social media and self-reported anxiety levels. You are given the following statistics:

  • Covariance between the two variables: 8.6
  • Standard deviation of social media use: 3.2
  • Standard deviation of anxiety scores: 4.1

Calculate the Pearson correlation coefficient (r) between social media use and anxiety.



  2. You’re told that the correlation between number of caffeinated drinks consumed and hours of sleep is -0.45. The standard deviation of drinks consumed is 1.8, and the standard deviation of sleep hours is 2.4. What is the covariance between caffeine consumption and sleep?

4.3 t-test for correlation coefficient

4.3.1 🧠t-test for correlation refresher🧠

When we compute a correlation coefficient (r) from sample data, we can test whether it differs significantly from a hypothesized population correlation \(\rho\), most often zero.

\[ H_0: \rho = 0 \]

The t-statistic considers the difference between the sample estimate and the hypothesized population value (\(r - \rho\)).

\[ t = \frac{r - \rho}{SE_r} \]

where \(SE_r\) is the standard error of r.

Since the null hypothesis (\(H_0\)) assumes \(\rho = 0\), the equation above translates to:

\[ t = \frac{r - 0}{SE_r} = \frac{r}{SE_r} \]

Since \(SE_r = \sqrt{\frac{1 - r^2}{n - 2}}\), we can sub that in to get

\[ t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}} = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} \] Believe it or not, \(\frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}\) is about the simplest we can make it 🙄

Once we’ve calculated our t-value, we compare it to the t-distribution. The t-distribution needs you to specify degrees of freedom (\(df\)). In this context,

\[ df = n - 2 \]

The number of pairs of observations minus 2.

This t value indicates how many standard errors r is away from the hypothesized population correlation \(\rho\). You can compare it to a critical t value to assess significance.

4.3.2 💪Worked example💪

Suppose a researcher calculates a sample correlation between hours studied and exam score:

  • Sample correlation: \(r = 0.60\)
  • Sample size: \(n = 15\)

We want to test if this correlation is significantly different from zero.

\[ t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} = \frac{0.60 \times \sqrt{15 - 2}}{\sqrt{1 - 0.60^2}} = \frac{0.60 \times \sqrt{13}}{\sqrt{1 - 0.36}} \] \[ t = \frac{0.60 \times 3.606}{\sqrt{0.64}} = \frac{2.163}{0.80} \approx 2.70 \] Degrees of freedom is \(n - 2\), which comes out to \(15 - 2 = 13\).

The calculated t-value is approximately 2.70 with 13 degrees of freedom. You would compare this to critical values from the t-distribution to assess significance.
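As a check on the arithmetic (and on the rounding warning from earlier), here's the same test statistic in Python, carrying full precision until the very end:

```python
import math

# t statistic for testing H0: rho = 0, given r = 0.60 and n = 15.
r = 0.60
n = 15

t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
df = n - 2

print(round(t, 2), df)  # full precision gives t ≈ 2.70 with df = 13
```

Rounding \(\sqrt{13}\) at an intermediate step can nudge the hand-computed answer by a hundredth or so, which is exactly the drift the note above warns about.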

4.3.3 📝Homework problems📝

  1. Given:
  • Sample correlation, \(r = 0.45\)
  • Sample size, \(n = 20\)
    Calculate the t-value and degrees of freedom.

  2. Given:
  • Sample correlation, \(r = -0.10\)
  • Sample size, \(n = 50\)
    Calculate the t-value and degrees of freedom.

  3. Given:
  • \(t = 3.2\)
  • \(r = 0.55\)
    Calculate the sample size \(n\).

  4. Given:
  • \(t = 2.8\)
  • \(n = 20\)
    Calculate the sample correlation \(r\).

  5. Given:
  • \(r = -0.42\)
  • \(t = -1.75\)
    Calculate the degrees of freedom \(df\).

4.4 Simple linear regression

4.4.1 🧠Simple linear regression refresher🧠

In simple linear regression, we model the relationship between a predictor x and an outcome y using a straight line:

\[ \hat{y} = b_0 + b_1x \]

Where:
- \(\hat{y}\) is the predicted value of y
- \(b_0\) is the intercept (the predicted value when x = 0)
- \(b_1\) is the slope (how much y changes for each 1-unit increase in x)

4.4.2 💪Worked example💪

Suppose the regression equation for predicting depression score (y) from weekly hours on social media (x) is:

\[ \hat{y} = 5 + 0.6x \]

Let’s use the equation above to predict the depression score for someone who spends 18 hours per week on social media.

\[ \hat{y} = 5 + 0.6 \cdot 18 = 5 + 10.8 = 15.8 \]

Order of operations is very important here. When using a regression equation like \(\hat{y} = b_0 + b_1x\), make sure to multiply before you add. This follows the standard order of operations (PEMDAS): First, multiply the slope \(b_1\) by the value of x. Then, add the intercept \(b_0\). If you mix up the order, you’ll get the wrong prediction!
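The prediction step can be sketched in a couple of lines of Python; note that `b1 * x` happens before the intercept is added, matching PEMDAS:

```python
# Plugging into y-hat = b0 + b1 * x: multiply the slope by x first, then add.
def predict(b0, b1, x):
    """Predicted value from a simple linear regression."""
    return b0 + b1 * x

# The worked example above: 18 hours of social media per week.
print(round(predict(5, 0.6, 18), 1))  # 15.8
```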

4.4.3 📝Homework problems📝

  1. A simple linear regression equation predicting test grade from hours studied is:
    \(\hat{y} = 55 + 8x\)
    What is the predicted test grade for someone who studied for 3 hours?

  2. A researcher found that daily mood rating can be predicted from hours of sleep using the equation:
    \(\hat{y} = 20 + 6.5x\)
    What is the predicted mood rating for someone who slept 7 hours?

  3. Given: \(\hat{y} = ?\), Intercept = 4.5, Slope = 0.8, \(x = 10\)
    What is the predicted value of \(\hat{y}\)?

  4. Given: \(\hat{y} = 22.6\), Intercept = ?, Slope = 1.3, \(x = 14\)
    What is the intercept?

  5. Given: \(\hat{y} = 19.2\), Intercept = 3, Slope = ?, \(x = 12\)
    What is the slope?

  6. Given: \(\hat{y} = 17\), Intercept = 2, Slope = 1.5, \(x = ?\)
    What is the value of \(x\)?

  7. Given: \(\hat{y} = ?\), Intercept = -1.2, Slope = 0.4, \(x = 25\)
    What is the predicted value of \(\hat{y}\)?



4.5 Relationship between slope, correlation, and R^2

4.5.1 🧠Refresher🧠

In simple linear regression, we only have one predictor (x). Here’s something important to remember: The slope doesn’t give us much new information if we already know the correlation between x and y.

Here’s the equation for the slope:

\[ b_1 = r \cdot \frac{s_y}{s_x} \]

This means the slope depends on the correlation between x and y as well as the standard deviations of x and y.

If the correlation between x and y is strong enough to be statistically significant, then the slope in the regression equation will also be significant.

Even if we flipped x and y (switching predictor and predicted), the new slope would still be significant. The number would change, but it still tells us the same thing: how strongly x and y are related.

Also remember: the correlation tells us how well the regression model fits. The value \(R^2\) tells us how much of the variation in y can be “accounted for” by x. In simple linear regression, \(r^2 = R^2\), so the size of the correlation is \(|r| = \sqrt{R^2}\) (the sign of r comes from the direction of the relationship). That means the correlation squared gives us the proportion of variance explained by the model.
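A tiny numeric sketch of these identities, with made-up values:

```python
# Slope = correlation rescaled by the two SDs; R^2 = r^2.
# All numbers here are hypothetical.
r = 0.5    # correlation between x and y
s_x = 2.0  # SD of x
s_y = 8.0  # SD of y

b1 = r * (s_y / s_x)  # slope for predicting y from x
R2 = r ** 2           # proportion of variance in y accounted for by x

print(b1)  # 2.0
print(R2)  # 0.25
```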

4.5.2 💪Worked example💪

George tries using GPA (M = 2.89, SD = 0.45) to predict people’s level of art appreciation (M = 17.7, SD = 1.23). There’s a correlation (\(r = 0.15\)) between the two.

He’s bummed to find that the regression slope, \(b_1 = 0.41\), comes with a p-value of .33. Not statistically significant! 😕

He wonders: What if he flips things around and uses art appreciation to predict GPA instead? Might that give him a better result?

Conceptually, an astute reader will know the answer is no. Reversing the variables in a simple linear regression changes the slope, but not the strength or significance of the relationship. Let’s do the math anyway.

In simple linear regression, the slope is given by:

\[ b_1 = r \cdot \frac{s_y}{s_x} \]

Where:

  • \(r = 0.15\)
  • \(s_y = 1.23\) (the SD of art appreciation, currently the outcome)
  • \(s_x = 0.45\) (the SD of GPA, currently the predictor)

So the original slope (predicting art appreciation from GPA) is:

\[ b_1 = 0.15 \cdot \frac{1.23}{0.45} = 0.41 \]

For current purposes, we might want to relabel some things:

\[ b_1 = r \cdot \frac{s_{outcome}}{s_{predictor}} \]

We want to reverse the regression—predict GPA from art appreciation.

Let’s call the new slope \(\tilde{b}_1\):

\[ \tilde{b}_1 = r \cdot \frac{0.45}{1.23} \]

Plug in the values:

\[ \tilde{b}_1 = 0.15 \cdot \frac{0.45}{1.23} \approx 0.0549 \]

  • The slope is now smaller because the units flipped. GPA has a smaller range of numbers compared to art appreciation.
  • The p-value would be the same, because the size of the correlation hasn’t changed. Importantly, neither has the sample size.

Poor George! 😈 He also observed a measly \(R^2\) of 0.0225 in his original analysis. That means his regression equation accounts for only 2.25% of the variance in art appreciation. He might hope to increase the predictive power of his model if he flips the script, but we know this won’t work.

\(r = \sqrt{R^2}\) in simple linear regression. The correlation between two variables is non-directional. The correlation between x and y is 0.15 and the correlation between y and x is 0.15.
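You can verify George's situation numerically. Recomputing both slopes from \(r\) and the two SDs shows that the flip changes the scale of the slope, but the correlation (and with it \(R^2\) and the p-value) stays put:

```python
import math

# George's numbers: flipping predictor and outcome rescales the slope
# but leaves the correlation unchanged.
r = 0.15
s_gpa = 0.45   # SD of GPA
s_art = 1.23   # SD of art appreciation

b1 = r * s_art / s_gpa          # predict art appreciation from GPA
b1_flipped = r * s_gpa / s_art  # predict GPA from art appreciation

print(round(b1, 2))          # 0.41
print(round(b1_flipped, 4))  # 0.0549
print(round(math.sqrt(b1 * b1_flipped), 2))  # 0.15 -- the correlation, recovered
print(round(r ** 2, 4))      # 0.0225 -- R^2, either direction
```

Multiplying the two slopes together cancels the SDs entirely, leaving \(r^2\), which is why neither direction can "explain" more variance than the other.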

4.5.3 📝Homework problems📝

  1. The correlation between hours of sleep and test scores is \(r=0.4\). The standard deviation of sleep is 2 hours, and the standard deviation of test scores is 10 points. What is the slope of the regression line if we use sleep to predict test scores?

  2. The correlation between GPA and stress levels is \(r=−0.3\). The SD of GPA is 0.5, and the SD of stress scores is 8. What is the slope of the regression line if GPA is the predictor and stress is the outcome?

  3. A researcher reports that the regression slope for predicting job satisfaction from years at a company is 0.25. The SD of job satisfaction is 1.5, and the SD of years at a company is 6. What is the correlation between the two variables?

  4. A researcher finds that the correlation between study hours and exam scores is \(r=0.6\). What proportion of variance in exam scores is explained by study hours?

  5. A researcher reports that 64% of the variance in cholesterol levels can be accounted for by age. What is the correlation between age and cholesterol level?