Chapter 5 Sum of squares and general model fitting

5.1 Regression residuals

5.1.1 🧠💪Refresher + Worked example💪🧠

In simple linear regression, you have a set of \(x\) values and a set of \(y\) values. Let’s say \(x\) represents GPA and \(y\) represents ACT scores for three students.

Here are the data for the three students:

| Student | GPA (\(x\)) | ACT (\(y\)) |
|---|---|---|
| Student 1 | 3.00 | 23 |
| Student 2 | 3.50 | 25 |
| Student 3 | 2.75 | 17 |

A regression equation derived from these data would look like this:

\[ \hat{y} = -8.286 + 9.714 \cdot \text{GPA} \]

Plug each student’s GPA into the equation to get their predicted ACT score:

| Student | GPA | Actual ACT (\(y\)) | Predicted ACT (\(\hat{y}\)) |
|---|---|---|---|
| Student 1 | 3.00 | 23 | \(-8.286 + 9.714 \cdot 3.00 = 20.856\) |
| Student 2 | 3.50 | 25 | \(-8.286 + 9.714 \cdot 3.50 = 25.713\) |
| Student 3 | 2.75 | 17 | \(-8.286 + 9.714 \cdot 2.75 = 18.4275\) |
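To make the arithmetic concrete, here is a minimal Python sketch of these predictions (the helper name `predict_act` is just for illustration):

```python
# Predicted ACT score from the fitted line: y-hat = -8.286 + 9.714 * GPA
def predict_act(gpa):
    return -8.286 + 9.714 * gpa

for gpa in [3.00, 3.50, 2.75]:
    print(f"GPA {gpa:.2f} -> predicted ACT {predict_act(gpa):.4f}")
```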

The regression equation isn’t perfect. It underpredicts some students’ scores and overpredicts others. The difference between each actual value and its predicted value is called a residual:

\[ \text{Residual} = y - \hat{y} \]

Below are the residuals for each student:

| Student | \(y\) | \(\hat{y}\) | Residual (\(y - \hat{y}\)) |
|---|---|---|---|
| Student 1 | 23 | 20.856 | \(2.144\) |
| Student 2 | 25 | 25.713 | \(-0.713\) |
| Student 3 | 17 | 18.4275 | \(-1.4275\) |

If you add the residuals:

\[ \sum (y - \hat{y}) = 2.144 + (-0.713) + (-1.4275) = 0.0035 \]

The result is very close to zero, and that is no coincidence. When a least squares regression includes an intercept, the residuals always sum to exactly zero; the tiny leftover here comes from rounding the slope and intercept.
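A quick Python sketch, recomputing the predictions from the rounded coefficients, confirms that the residuals nearly cancel:

```python
# Residuals from the fitted line y-hat = -8.286 + 9.714 * GPA
gpas   = [3.00, 3.50, 2.75]
actual = [23, 25, 17]

predicted = [-8.286 + 9.714 * x for x in gpas]
residuals = [y - yhat for y, yhat in zip(actual, predicted)]

print([round(r, 4) for r in residuals])  # one residual per student
print(round(sum(residuals), 4))          # very close to zero
```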

But if we want to measure overall error, adding positives and negatives won’t help, because they cancel out. So instead, we square the residuals:

\[ 2.144^2 + (-0.713)^2 + (-1.4275)^2 = 4.5967 + 0.5084 + 2.0378 = 7.1429 \]

This is called the sum of squared residuals, or \(SS_{\text{residual}}\).

The simple linear regression model chooses the slope and intercept that minimize this quantity. In other words, the “best-fitting line” is the one that makes \(SS_{\text{residual}}\) as small as possible.
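One way to see this concretely is with the standard closed-form least squares formulas, \(b_1 = S_{xy}/S_{xx}\) and \(b_0 = \bar{y} - b_1 \bar{x}\). A minimal pure-Python sketch (variable names are illustrative):

```python
# Closed-form least-squares fit for the GPA/ACT data
x = [3.00, 3.50, 2.75]   # GPA
y = [23, 25, 17]         # ACT

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
slope = sxy / sxx
intercept = ybar - slope * xbar
print(round(slope, 3), round(intercept, 3))  # matches the equation above

def ss_residual(b0, b1):
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Any other line yields a larger sum of squared residuals
print(round(ss_residual(intercept, slope), 4))
print(ss_residual(intercept, slope) < ss_residual(intercept + 0.5, slope))  # True
```

Nudging the intercept (or slope) in either direction always increases \(SS_{\text{residual}}\), which is exactly what “best-fitting line” means here.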

5.1.2 📝Homework problems📝

  1. Suppose the regression equation for predicting exam score from hours studied is:

\[ \hat{y} = 50 + 5x \]

Complete the table below and calculate the residuals.

| Student | Hours Studied (\(x\)) | Actual Score (\(y\)) | Predicted Score (\(\hat{y}\)) | Residual (\(y - \hat{y}\)) |
|---|---|---|---|---|
| A | 2 | 62 | | |
| B | 4 | 75 | | |
| C | 3 | 64 | | |



  2. A regression equation is given by:

\[ \hat{y} = 100 - 8x \]

Use it to fill in the missing values in the table below.

| Observation | \(x\) | Actual \(y\) | \(\hat{y}\) | Residual |
|---|---|---|---|---|
| 1 | 2 | 85 | | |
| 2 | | 60 | 64 | |
| 3 | 3 | | 76 | \(-4\) |



  3. Suppose a regression line is given by:

\[ \hat{y} = 4 + 2x \]

If \(x = 5\), what is the predicted value \(\hat{y}\)? If the observed value \(y = 12\), what is the residual?



  4. Suppose you are given the following observed values (\(y\)) and corresponding predicted values (\(\hat{y}\)):

| Observation | \(y\) | \(\hat{y}\) |
|---|---|---|
| 1 | 8 | 7 |
| 2 | 6 | 5 |
| 3 | 9 | 9 |
| 4 | 4 | 5 |
| 5 | 7 | 8 |

Without computing each residual, what is the sum of the residuals? Briefly explain why.



  5. A dataset has 12 observations, and the residuals from a simple linear regression are:

\[ 1.5,\ -0.5,\ 0.3,\ -0.3,\ -1.0,\ 0.7,\ -1.2,\ 0.2,\ 0.8,\ -0.7,\ -0.8,\ x \]

What must the missing residual \(x\) be? Justify your answer.



5.2 Generalizing Sum of Squares (SS) for assessing model fit

5.2.1 🧠💪Refresher + Worked example💪🧠

There are three major sources of variance that we examine in regression:

  - \(SS_{\text{Total}}\): the total amount of variability in the outcome (DV)
  - \(SS_{\text{Model}}\): the portion of that variability your model can explain
  - \(SS_{\text{Residual}}\): the portion of variability your model cannot explain

Let’s say we have a very simple model for predicting GPA based on whether a student is in the honors program:

\[ \hat{y} = \begin{cases} 3.5 & \text{if } x = \text{Honors student} \\ 2.8 & \text{if } x = \text{Not an honors student} \end{cases} \]

This model says: “If someone’s in honors, predict a 3.5 GPA; otherwise, predict 2.8.”

We use the following data:

| Student | Actual GPA (\(y\)) | In Honors? | Predicted GPA (\(\hat{y}\)) | Residual (\(y - \hat{y}\)) |
|---|---|---|---|---|
| 1 | 3.42 | Yes | 3.5 | \(-0.08\) |
| 2 | 2.75 | No | 2.8 | \(-0.05\) |
| 3 | 2.92 | No | 2.8 | \(0.12\) |
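This kind of conditional prediction is easy to express in code. A minimal sketch (the helper `predict_gpa` is illustrative, not from the text):

```python
# Group-based model: predict 3.5 GPA for honors students, 2.8 otherwise
def predict_gpa(honors):
    return 3.5 if honors else 2.8

data = [(3.42, True), (2.75, False), (2.92, False)]  # (actual GPA, in honors?)
residuals = [gpa - predict_gpa(honors) for gpa, honors in data]
print([round(r, 2) for r in residuals])  # matches the table above
```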

5.2.2 Step 1: \(SS_{\text{Residual}}\)

This is the sum of squared residuals:

\[ SS_{\text{Residual}} = \sum (y - \hat{y})^2 = (-0.08)^2 + (-0.05)^2 + (0.12)^2 \]

\[ = 0.0064 + 0.0025 + 0.0144 = 0.0233 \]

This represents the portion of variability in GPA not explained by the model.

Next, compute \(SS_{\text{Total}}\): the total variability in GPA, regardless of the model:

\[ SS_{\text{Total}} = \sum (y - \bar{y})^2 \]

First, compute the mean GPA:

\[ \bar{y} = \frac{3.42 + 2.75 + 2.92}{3} = \frac{9.09}{3} = 3.03 \]

Now:

\[ SS_{\text{Total}} = (3.42 - 3.03)^2 + (2.75 - 3.03)^2 + (2.92 - 3.03)^2 \]

\[ = 0.1521 + 0.0784 + 0.0121 = 0.2426 \]

Since \(SS_{\text{Total}} = SS_{\text{Model}} + SS_{\text{Residual}}\), a simple rearrangement gives:

\[ SS_{\text{Model}} = SS_{\text{Total}} - SS_{\text{Residual}} = 0.2426 - 0.0233 = \boxed{0.2193} \]

One of the key metrics of model fit is \(R^2\), which acts like a test grade for your model. It’s the proportion of variance in the dependent variable that’s accounted for by the model:

\[ R^2 = \frac{SS_{\text{Model}}}{SS_{\text{Total}}} = \frac{0.2193}{0.2426} \approx \boxed{0.904} \]

That’s 90.4% of the variability in GPA explained by honors status. A pretty solid “grade” for a model!

🔎 In the social sciences, don’t expect to see \(R^2\) values this high very often. But this example is meant to show the math mechanics in a clean, simplified way.
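The whole decomposition for this example can be verified in a few lines of Python (a sketch using the numbers above):

```python
# Sum-of-squares decomposition for the honors example
y    = [3.42, 2.75, 2.92]  # actual GPAs
yhat = [3.5, 2.8, 2.8]     # model predictions

ybar = sum(y) / len(y)
ss_total    = sum((yi - ybar) ** 2 for yi in y)
ss_residual = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
ss_model    = ss_total - ss_residual
r_squared   = ss_model / ss_total

print(round(ss_total, 4))     # total variability in GPA
print(round(ss_residual, 4))  # variability the model misses
print(round(ss_model, 4))     # variability the model explains
print(round(r_squared, 3))    # proportion explained
```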

5.2.3 📝Homework problems📝

  1. Below are three observations with their actual values and predicted values. Calculate the Sum of Squared Residuals (\(SS_{\text{Residual}}\)).

| Observation | Actual \(y\) | Predicted \(\hat{y}\) |
|---|---|---|
| 1 | 2.5 | 2.7 |
| 2 | 3.2 | 3.0 |
| 3 | 1.8 | 1.9 |



  2. Below are three observations with their actual \(y\) values and the mean \(\bar{y} = 3.0\). Calculate the Total Sum of Squares (\(SS_{\text{Total}}\)).

| Observation | Actual \(y\) | Mean \(\bar{y}\) |
|---|---|---|
| 1 | 2.6 | 3.0 |
| 2 | 3.5 | 3.0 |
| 3 | 2.9 | 3.0 |



  3. You are told that \(SS_{Total} = 12.4\) and \(SS_{Residual} = 3.1\). What is \(SS_{Model}\)?



  4. If \(SS_{Model} = 6.3\) and \(SS_{Total} = 10.5\), what is \(SS_{Residual}\)?



  5. If \(SS_{Residual} = 2.2\) and \(R^2 = 0.75\), what is \(SS_{Total}\)?



  6. If \(SS_{Total} = 16.0\) and \(R^2 = 0.40\), what is \(SS_{Model}\)?



  7. If \(SS_{Model} = 8.0\) and \(SS_{Residual} = 12.0\), what is \(R^2\)?