Chapter 5 Sum of squares, general model fitting
5.1 Regression residuals
5.1.1 🧠💪Refresher + Worked example💪🧠
In simple linear regression, you have a set of \(x\) values and a set of \(y\) values. Let’s say \(x\) represents GPA and \(y\) represents ACT scores for three students.
Here are the data for the three students:
Student | GPA (\(x\)) | ACT (\(y\)) |
---|---|---|
Student 1 | 3.00 | 23 |
Student 2 | 3.50 | 25 |
Student 3 | 2.75 | 17 |
A regression equation derived from these data would look like this:
\[ \hat{y} = -8.286 + 9.714 \cdot \text{GPA} \]
Plug each student’s GPA into the equation to get their predicted ACT score:
Student | GPA | Actual ACT (\(y\)) | Predicted ACT (\(\hat{y}\)) |
---|---|---|---|
Student 1 | 3.00 | 23 | \(-8.286 + 9.714 \cdot 3.00 = 20.856\) |
Student 2 | 3.50 | 25 | \(-8.286 + 9.714 \cdot 3.50 = 25.713\) |
Student 3 | 2.75 | 17 | \(-8.286 + 9.714 \cdot 2.75 = 18.4275\) |
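The table's arithmetic can be sketched in a few lines of plain Python, plugging each GPA into the rounded coefficients from the equation above:

```python
# Prediction step: plug each GPA into the fitted equation
# (coefficients are the rounded values from the chapter).
intercept, slope = -8.286, 9.714
gpas = [3.00, 3.50, 2.75]

# Predicted ACT score for each student
predicted = [intercept + slope * gpa for gpa in gpas]
print([round(p, 4) for p in predicted])
```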
The regression equation isn’t perfect. It underpredicts some students’ scores and overpredicts others. The difference between each actual value and its predicted value is called a residual:
\[ \text{Residual} = y - \hat{y} \]
Below are the residuals for each student:
Student | \(y\) | \(\hat{y}\) | Residual (\(y - \hat{y}\)) |
---|---|---|---|
Student 1 | 23 | 20.856 | \(2.144\) |
Student 2 | 25 | 25.713 | \(-0.713\) |
Student 3 | 17 | 18.4275 | \(-1.4275\) |
If you add the residuals:
\[ \sum (y - \hat{y}) = 2.144 + (-0.713) + (-1.4275) = 0.0035 \]
The result is very close to zero, and that is no coincidence. In least squares regression (with an intercept), the residuals always sum to exactly zero; the small leftover here is only due to rounding the coefficients.
But if we want to measure overall error, adding positives and negatives won’t help — they cancel out. So instead, we square the residuals:
\[ 2.144^2 + (-0.713)^2 + (-1.4275)^2 = 4.5967 + 0.5084 + 2.0378 = 7.1429 \]
This is called the sum of squared residuals, or \(SS_{\text{residual}}\).
The simple linear regression model chooses the slope and intercept that minimize this quantity. In other words, the “best-fitting line” is the one that makes \(SS_{\text{residual}}\) as small as possible.
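To see the whole pipeline in one place, here is a minimal sketch (no libraries) that fits the line itself using the closed-form least-squares formulas, then verifies the two claims above: the residuals sum to zero, and \(SS_{\text{residual}}\) matches the hand calculation up to rounding.

```python
# Closed-form least-squares fit for the GPA -> ACT example,
# followed by the residual checks from the text.
x = [3.00, 3.50, 2.75]   # GPA
y = [23, 25, 17]         # ACT

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope = S_xy / S_xx; intercept makes the line pass
# through the point of means (x_bar, y_bar).
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
        / sum((xi - x_bar) ** 2 for xi in x)
intercept = y_bar - slope * x_bar

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(round(slope, 3), round(intercept, 3))      # ~9.714, ~-8.286
print(sum(residuals))                            # ~0 (exactly 0 up to floating point)
print(round(sum(r * r for r in residuals), 4))   # SS_residual, ~7.1429
```

With the unrounded coefficients the residual sum is zero to machine precision, which is why the hand calculation above lands so close to zero despite rounding.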
5.1.2 📝Homework problems📝
- Suppose the regression equation for predicting exam score from hours studied is:
\[ \hat{y} = 50 + 5x \]
Complete the table below and calculate the residuals.
Student | Hours Studied (\(x\)) | Actual Score (\(y\)) | Predicted Score (\(\hat{y}\)) | Residual (\(y - \hat{y}\)) |
---|---|---|---|---|
A | 2 | 62 | | |
B | 4 | 75 | | |
C | 3 | 64 | | |
- A regression equation is given by:
\[ \hat{y} = 100 - 8x \]
Use it to fill in the missing values in the table below.
Observation | \(x\) | Actual \(y\) | \(\hat{y}\) | Residual |
---|---|---|---|---|
1 | 2 | 85 | | |
2 | | 60 | 64 | |
3 | 3 | | 76 | -4 |
- Suppose a regression line is given by:
\[ \hat{y} = 4 + 2x \]
If \(x = 5\), what is the predicted value \(\hat{y}\)? If the observed value \(y = 12\), what is the residual?
- Suppose you are given the following observed values (\(y\)) and corresponding predicted values (\(\hat{y}\)):
Observation | \(y\) | \(\hat{y}\) |
---|---|---|
1 | 8 | 7 |
2 | 6 | 5 |
3 | 9 | 9 |
4 | 4 | 5 |
5 | 7 | 8 |
Without computing each residual, what is the sum of the residuals? Briefly explain why.
- A dataset has 12 observations, and the residuals from a simple linear regression are:
\[ 1.5,\ -0.5,\ 0.3,\ -0.3,\ -1.0,\ 0.7,\ -1.2,\ 0.2,\ 0.8,\ -0.7,\ -0.8,\ x \]
What must the missing residual \(x\) be? Justify your answer.
5.2 Generalizing Sum of Squares (SS) for assessing model fit
5.2.1 🧠💪Refresher + Worked example💪🧠
There are three major sources of variance that we examine in regression:
- \(SS_{\text{Total}}\): The total amount of variability in the outcome (DV)
- \(SS_{\text{Model}}\): The portion of that variability your model can explain
- \(SS_{\text{Residual}}\): The portion of variability your model cannot explain
Let’s say we have a very simple model for predicting GPA based on whether a student is in the honors program:
\[ \hat{y} = \begin{cases} 3.5 & \text{if } x = \text{Honors student} \\ 2.8 & \text{if } x = \text{Not an honors student} \end{cases} \]
This model says: “If someone’s in honors, predict a 3.5 GPA; otherwise, predict 2.8.”
We use the following data:
Student | Actual GPA (\(y\)) | In Honors? | Predicted GPA (\(\hat{y}\)) | Residual (\(y - \hat{y}\)) |
---|---|---|---|---|
1 | 3.42 | Yes | 3.5 | -0.08 |
2 | 2.75 | No | 2.8 | -0.05 |
3 | 2.92 | No | 2.8 | 0.12 |
5.2.2 Step 1: \(SS_{\text{Residual}}\)
This is the sum of squared residuals:
\[ SS_{\text{Residual}} = \sum (y - \hat{y})^2 = (-0.08)^2 + (-0.05)^2 + (0.12)^2 \]
\[ = 0.0064 + 0.0025 + 0.0144 = 0.0233 \]
This represents the portion of variability in GPA not explained by the model.
\(SS_{\text{Total}}\) is the total variability in GPA, regardless of the model:
\[ SS_{\text{Total}} = \sum (y - \bar{y})^2 \]
First, compute the mean GPA:
\[ \bar{y} = \frac{3.42 + 2.75 + 2.92}{3} = \frac{9.09}{3} = 3.03 \]
Now:
\[ SS_{\text{Total}} = (3.42 - 3.03)^2 + (2.75 - 3.03)^2 + (2.92 - 3.03)^2 \]
\[ = 0.1521 + 0.0784 + 0.0121 = 0.2426 \]
Since \(SS_{\text{Total}} = SS_{\text{Model}} + SS_{\text{Residual}}\), a simple rearrangement gives…
\[ SS_{\text{Model}} = SS_{\text{Total}} - SS_{\text{Residual}} = 0.2426 - 0.0233 = \boxed{0.2193} \]
One of the key metrics of model fit is \(R^2\), which acts like a test grade for your model. It’s the proportion of variance in the dependent variable that’s accounted for by the model:
\[ R^2 = \frac{SS_{\text{Model}}}{SS_{\text{Total}}} = \frac{0.2193}{0.2426} \approx \boxed{0.904} \]
That’s 90.4% of the variability in GPA explained by honors status. A pretty solid “grade” for a model!
🔎 In the social sciences, don’t expect to see \(R^2\) values this high very often. But this example is meant to show the math mechanics in a clean, simplified way.
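The full decomposition from this section can be sketched in plain Python with the honors-program data above:

```python
# SS decomposition for the honors-program example:
# SS_residual from the predictions, SS_total from the mean,
# SS_model by subtraction, and R^2 as their ratio.
y = [3.42, 2.75, 2.92]      # actual GPA
y_hat = [3.5, 2.8, 2.8]     # model's predicted GPA

y_bar = sum(y) / len(y)     # mean GPA, 3.03

ss_residual = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_model = ss_total - ss_residual
r_squared = ss_model / ss_total

print(round(ss_residual, 4))   # 0.0233
print(round(ss_total, 4))      # 0.2426
print(round(ss_model, 4))      # 0.2193
print(round(r_squared, 3))     # 0.904
```

The same four lines of arithmetic work for any model that produces predictions, which is exactly why this decomposition generalizes beyond simple linear regression.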
5.2.3 📝Homework problems📝
- Below are three observations with their actual values and predicted values. Calculate the Sum of Squared Residuals (\(SS_{Residual}\)).
Observation | Actual \(y\) | Predicted \(\hat{y}\) |
---|---|---|
1 | 2.5 | 2.7 |
2 | 3.2 | 3.0 |
3 | 1.8 | 1.9 |
- Below are three observations with their actual \(y\) values and the mean \(\bar{y} = 3.0\). Calculate the Total Sum of Squares (\(SS_{Total}\)).
Observation | Actual \(y\) | Mean \(\bar{y}\) |
---|---|---|
1 | 2.6 | 3.0 |
2 | 3.5 | 3.0 |
3 | 2.9 | 3.0 |
- You are told that \(SS_{Total} = 12.4\) and \(SS_{Residual} = 3.1\). What is \(SS_{Model}\)?
- If \(SS_{Model} = 6.3\) and \(SS_{Total} = 10.5\), what is \(SS_{Residual}\)?
- If \(SS_{Residual} = 2.2\) and \(R^2 = 0.75\), what is \(SS_{Total}\)?
- If \(SS_{Total} = 16.0\) and \(R^2 = 0.40\), what is \(SS_{Model}\)?
- If \(SS_{Model} = 8.0\), \(SS_{Residual} = 12.0\), what is \(R^2\)?