Chapter 5 Sum of squares, general model fitting
5.1 Regression residuals
5.1.1 🧠💪Refresher + Worked example💪🧠
In simple linear regression, you have a set of \(x\) values and a set of \(y\) values. Let’s say \(x\) represents GPA and \(y\) represents ACT scores for three students.
Here are the data for the three students:
Student | GPA (\(x\)) | ACT (\(y\)) |
---|---|---|
Student 1 | 3.00 | 23 |
Student 2 | 3.50 | 25 |
Student 3 | 2.75 | 17 |
A regression equation derived from these data would look like this:
\[ \hat{y} = -8.286 + 9.714 \cdot \text{GPA} \]
Plug each student’s GPA into the equation to get their predicted ACT score:
Student | GPA | Actual ACT (\(y\)) | Predicted ACT (\(\hat{y}\)) |
---|---|---|---|
Student 1 | 3.00 | 23 | \(-8.286 + 9.714 \cdot 3.00 = 20.856\) |
Student 2 | 3.50 | 25 | \(-8.286 + 9.714 \cdot 3.50 = 25.713\) |
Student 3 | 2.75 | 17 | \(-8.286 + 9.714 \cdot 2.75 = 18.4275\) |
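The table's arithmetic can be sketched in a few lines of plain Python, plugging each GPA into the rounded coefficients from the equation above:

```python
# Prediction step: plug each GPA into the fitted equation
# (coefficients are the rounded values from the chapter).
intercept, slope = -8.286, 9.714
gpas = [3.00, 3.50, 2.75]

# Predicted ACT score for each student
predicted = [intercept + slope * gpa for gpa in gpas]
print([round(p, 4) for p in predicted])
```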
The regression equation isn’t perfect. It underpredicts some students’ scores and overpredicts others. The difference between each actual value and its predicted value is called a residual:
\[ \text{Residual} = y - \hat{y} \]
Below are the residuals for each student:
Student | \(y\) | \(\hat{y}\) | Residual (\(y - \hat{y}\)) |
---|---|---|---|
Student 1 | 23 | 20.856 | \(2.144\) |
Student 2 | 25 | 25.713 | \(-0.713\) |
Student 3 | 17 | 18.4275 | \(-1.4275\) |
If you add the residuals:
\[ \sum (y - \hat{y}) = 2.144 + (-0.713) + (-1.4275) = 0.0035 \]
The result is very close to zero, and that is no coincidence. In least squares regression (with an intercept), the residuals always sum to exactly zero; the small leftover here is only due to rounding the coefficients.
But if we want to measure overall error, adding positives and negatives won’t help — they cancel out. So instead, we square the residuals:
\[ 2.144^2 + (-0.713)^2 + (-1.4275)^2 = 4.5967 + 0.5084 + 2.0378 = 7.1429 \]
This is called the sum of squared residuals, or \(SS_{\text{residual}}\).
The simple linear regression model chooses the slope and intercept that minimize this quantity. In other words, the “best-fitting line” is the one that makes \(SS_{\text{residual}}\) as small as possible.
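To see the whole pipeline in one place, here is a minimal sketch (no libraries) that fits the line itself using the closed-form least-squares formulas, then verifies the two claims above: the residuals sum to zero, and \(SS_{\text{residual}}\) matches the hand calculation up to rounding.

```python
# Closed-form least-squares fit for the GPA -> ACT example,
# followed by the residual checks from the text.
x = [3.00, 3.50, 2.75]   # GPA
y = [23, 25, 17]         # ACT

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope = S_xy / S_xx; intercept makes the line pass
# through the point of means (x_bar, y_bar).
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
        / sum((xi - x_bar) ** 2 for xi in x)
intercept = y_bar - slope * x_bar

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(round(slope, 3), round(intercept, 3))      # ~9.714, ~-8.286
print(sum(residuals))                            # ~0 (exactly 0 up to floating point)
print(round(sum(r * r for r in residuals), 4))   # SS_residual, ~7.1429
```

With the unrounded coefficients the residual sum is zero to machine precision, which is why the hand calculation above lands so close to zero despite rounding.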
5.1.2 📝Homework problems📝
- Suppose the regression equation for predicting exam score from hours studied is:
\[ \hat{y} = 50 + 5x \]
Complete the table below and calculate the residuals.
Student | Hours Studied (\(x\)) | Actual Score (\(y\)) | Predicted Score (\(\hat{y}\)) | Residual (\(y - \hat{y}\)) |
---|---|---|---|---|
A | 2 | 62 | | |
B | 4 | 75 | | |
C | 3 | 64 | | |
- A regression equation is given by:
\[ \hat{y} = 100 - 8x \]
Use it to fill in the missing values in the table below.
Observation | \(x\) | Actual \(y\) | \(\hat{y}\) | Residual |
---|---|---|---|---|
1 | 2 | 85 | | |
2 | | 60 | 64 | |
3 | 3 | | 76 | -4 |
- Suppose a regression line is given by:
\[ \hat{y} = 4 + 2x \]
If \(x = 5\), what is the predicted value \(\hat{y}\)? If the observed value \(y = 12\), what is the residual?
- Suppose you are given the following observed values (\(y\)) and corresponding predicted values (\(\hat{y}\)):
Observation | \(y\) | \(\hat{y}\) |
---|---|---|
1 | 8 | 7 |
2 | 6 | 5 |
3 | 9 | 9 |
4 | 4 | 5 |
5 | 7 | 8 |
Without computing each residual, what is the sum of the residuals? Briefly explain why.
- A dataset has 12 observations, and the residuals from a simple linear regression are:
\[ 1.5,\ -0.5,\ 0.3,\ -0.3,\ -1.0,\ 0.7,\ -1.2,\ 0.2,\ 0.8,\ -0.7,\ -0.8,\ x \]
What must the missing residual \(x\) be? Justify your answer.
5.2 Generalizing Sum of Squares (SS) for assessing model fit
5.2.1 🧠💪Refresher + Worked example💪🧠
There are three major sources of variance that we examine in regression:
- \(SS_{\text{Total}}\): The total amount of variability in the outcome (DV)
- \(SS_{\text{Model}}\): The portion of that variability your model can explain
- \(SS_{\text{Residual}}\): The portion of variability your model cannot explain
Let’s say we have a very simple model for predicting GPA based on whether a student is in the honors program:
\[ \hat{y} = \begin{cases} 3.5 & \text{if } x = \text{Honors student} \\ 2.8 & \text{if } x = \text{Not an honors student} \end{cases} \]
This model says: “If someone’s in honors, predict a 3.5 GPA; otherwise, predict 2.8.”
We use the following data:
Student | Actual GPA (\(y\)) | In Honors? | Predicted GPA (\(\hat{y}\)) | Residual (\(y - \hat{y}\)) |
---|---|---|---|---|
1 | 3.42 | Yes | 3.5 | -0.08 |
2 | 2.75 | No | 2.8 | -0.05 |
3 | 2.92 | No | 2.8 | 0.12 |
5.2.2 Step 1: \(SS_{\text{Residual}}\)
This is the sum of squared residuals:
\[ SS_{\text{Residual}} = \sum (y - \hat{y})^2 = (-0.08)^2 + (-0.05)^2 + (0.12)^2 \]
\[ = 0.0064 + 0.0025 + 0.0144 = 0.0233 \]
This represents the portion of variability in GPA not explained by the model.
\(SS_{\text{Total}}\) is the total variability in GPA, regardless of the model:
\[ SS_{\text{Total}} = \sum (y - \bar{y})^2 \]
First, compute the mean GPA:
\[ \bar{y} = \frac{3.42 + 2.75 + 2.92}{3} = \frac{9.09}{3} = 3.03 \]
Now:
\[ SS_{\text{Total}} = (3.42 - 3.03)^2 + (2.75 - 3.03)^2 + (2.92 - 3.03)^2 \]
\[ = 0.1521 + 0.0784 + 0.0121 = 0.2426 \]
Since \(SS_{\text{Total}} = SS_{\text{Model}} + SS_{\text{Residual}}\), a simple rearrangement gives…
\[ SS_{\text{Model}} = SS_{\text{Total}} - SS_{\text{Residual}} = 0.2426 - 0.0233 = \boxed{0.2193} \]
One of the key metrics of model fit is \(R^2\), which acts like a test grade for your model. It’s the proportion of variance in the dependent variable that’s accounted for by the model:
\[ R^2 = \frac{SS_{\text{Model}}}{SS_{\text{Total}}} = \frac{0.2193}{0.2426} \approx \boxed{0.904} \]
That’s 90.4% of the variability in GPA explained by honors status. A pretty solid “grade” for a model!
🔎 In the social sciences, don’t expect to see \(R^2\) values this high very often. But this example is meant to show the math mechanics in a clean, simplified way.
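The full decomposition from this section can be sketched in plain Python with the honors-program data above:

```python
# SS decomposition for the honors-program example:
# SS_residual from the predictions, SS_total from the mean,
# SS_model by subtraction, and R^2 as their ratio.
y = [3.42, 2.75, 2.92]      # actual GPA
y_hat = [3.5, 2.8, 2.8]     # model's predicted GPA

y_bar = sum(y) / len(y)     # mean GPA, 3.03

ss_residual = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_model = ss_total - ss_residual
r_squared = ss_model / ss_total

print(round(ss_residual, 4))   # 0.0233
print(round(ss_total, 4))      # 0.2426
print(round(ss_model, 4))      # 0.2193
print(round(r_squared, 3))     # 0.904
```

The same four lines of arithmetic work for any model that produces predictions, which is exactly why this decomposition generalizes beyond simple linear regression.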
5.2.3 📝Homework problems📝
- Below are three observations with their actual values and predicted values. Calculate the Sum of Squared Residuals (\(SS_{Residual}\)).
Observation | Actual \(y\) | Predicted \(\hat{y}\) |
---|---|---|
1 | 2.5 | 2.7 |
2 | 3.2 | 3.0 |
3 | 1.8 | 1.9 |
- Below are three observations with their actual \(y\) values and the mean \(\bar{y} = 3.0\). Calculate the Total Sum of Squares (\(SS_{Total}\)).
Observation | Actual \(y\) | Mean \(\bar{y}\) |
---|---|---|
1 | 2.6 | 3.0 |
2 | 3.5 | 3.0 |
3 | 2.9 | 3.0 |
- You are told that \(SS_{Total} = 12.4\) and \(SS_{Residual} = 3.1\). What is \(SS_{Model}\)?
- If \(SS_{Model} = 6.3\) and \(SS_{Total} = 10.5\), what is \(SS_{Residual}\)?
- If \(SS_{Residual} = 2.2\) and \(R^2 = 0.75\), what is \(SS_{Total}\)?
- If \(SS_{Total} = 16.0\) and \(R^2 = 0.40\), what is \(SS_{Model}\)?
- If \(SS_{Model} = 8.0\), \(SS_{Residual} = 12.0\), what is \(R^2\)?