Data preview of the analysis data `dat` (one row per child):
Rows: 400
Columns: 3
$ score <dbl> 120, 89, 78, 42, 115, 97, 94, 68, 103, 94, 117, 64, 72, 104, 78,…
$ edu <dbl> 2, 1, 2, 1, 4, 1, 1, 2, 3, 3, 2, 4, 2, 2, 3, 1, 1, 3, 3, 2, 2, 2…
$ age <dbl> 21, 17, 19, 20, 26, 20, 20, 24, 19, 24, 24, 29, 19, 23, 26, 22, …
Module 0: Multiple Linear Regression Review
Readings
- Review matrix algebra (e.g., The Matrix Cookbook).
- BIOS 509 notes on Applied Linear Regression.
- Gelman & Hill-style regression intuition (figure referenced in class).
Overview & Concepts: In this module you will get a brief review of concepts learned in BIOS 509 including the multiple linear regression (MLR) framework for estimation, interpretation, prediction, and diagnostics. This material will not be included in the labs or quizzes, but is simply a review guide for you to brush up on material needed for Modules 1-3.
Linear regression model in matrix notation
Inference for regression coefficient estimates, expected values, and predictions.
Dummy variables
Effect modification and confounding
Motivation and Examples
Multiple Linear Regression (MLR) is the standard extension of simple linear regression to include two or more predictors.
Formally:
\[ y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i, \quad \epsilon_i \sim N(0,\sigma^2) \]
\(y_i\) = outcome (continuous response)
\(x_{ij}\) = predictor variables (continuous or categorical)
\(\beta_j\) = regression coefficients (effects of predictors, holding others constant)
\(\epsilon_i\) = error term
Matrix form
\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \quad \mathbb{E}[\boldsymbol{\epsilon}] = \mathbf{0}, \quad \operatorname{Var}(\boldsymbol{\epsilon}) = \sigma^2 \mathbf{I} \]
Key points about MLR:
Goal: quantify how multiple predictors are simultaneously associated with an outcome.
Interpretation: each \(\beta_j\) is the average change in \(y\) for a one-unit change in \(x_{j}\), holding other predictors fixed.
Uses: adjust for confounding, test effect modification (interactions), improve prediction accuracy.
Assumptions: linearity, independence, constant variance (homoscedasticity), and normally distributed errors.
In short: MLR models a continuous outcome as a linear combination of multiple explanatory variables, enabling adjusted and interpretable population-level associations.
Motivating Example:(NLSY): maternal traits vs. child cognitive score (age 3).
We want to answer the question: What maternal traits are associated with a child’s cognitive score at age 3?
Variables:
- score: cognitive test score (outcome)
- age: maternal age at delivery (numeric)
- edu: maternal education — 1: <HS, 2: HS, 3: some college, 4: college+
Model (baseline):
\[y_i = \beta_0 + \beta_1 \text{age}_i + \epsilon_i\]
MLR (with confounding):
\[ y_i = \beta_0 + \beta_1\text{age}_i + \beta_2\mathbb{1}(\text{edu}=2) + \beta_3\mathbb{1}(\text{edu}=3) + \beta_4\mathbb{1}(\text{edu}=4) + \epsilon_i \]
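As a quick sketch (assuming the data frame `dat` previewed above, with columns `score`, `age`, and `edu`), we can ask R to build the dummy-coded design matrix implied by this adjusted model:

#Sketch: inspect the design matrix R builds for the adjusted model.
#factor(edu) creates the indicators 1(edu=2), 1(edu=3), 1(edu=4); edu=1 is the reference.
X_adj <- model.matrix(score ~ age + factor(edu), data = dat)
head(X_adj)   #columns: (Intercept), age, factor(edu)2, factor(edu)3, factor(edu)4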
Below we give some quick visuals
Simple Linear Regression
Linear regression is a method that summarizes how the average values of a numerical outcome variable vary over subpopulations (covariate values) defined by linear functions of predictors. Two goals of linear regression:
Prediction of outcome values given fixed predictor values.
Assessment of impact of predictors on the outcome (Estimation).
Let \(y_i\) denote the test score for a given child \(i\). \(age_i\) denotes the corresponding maternal age at delivery for mom \(i\) corresponding to child \(i\), and \(\epsilon_i\) denotes the error term. We will consider the following linear model:
\[y_i = \beta_0 + \beta_1 \cdot age_i + \epsilon_i, \quad i = 1,2,3,...,400 \tag{1}\]
We can write Eq. 1 in matrix form by stacking observations by row:
\[\begin{align} \mathbf{Y}& = \begin{bmatrix} y_1 \\ y_2\\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} \beta_0 + \beta_1 x_1\\ \beta_0 + \beta_1 x_2 \\ \vdots \\ \beta_0 + \beta_1 x_n \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}\\ & = \begin{bmatrix} 1 & x_1\\1 & x_2\\ \vdots \\ 1 & x_n\end{bmatrix} \begin{bmatrix}\beta_0 \\ \beta_1\end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}\\ &= \mathbf{X\beta + \epsilon} \end{align}\]
Clearly, we cannot find \(\mathbf{\beta}\) such that this model fits all of our data perfectly.
Model Assumptions:
\(y_i\) is a linear function of \(age_i\); i.e., \(E(\mathbf{Y}) = \mathbf{X}\boldsymbol{\beta}\) is linear in the \(\beta\)'s, where here \(\mathbf{X} = [\mathbf{1}, \mathbf{age}]\).
\(\beta_0=\) the expected test score for a child whose mother delivered at age \(=0\) (not meaningful!)
\(\beta_1=\) the average increase in test score associated with a one-year increase in maternal age.
\(E(\epsilon_i) = 0\) [Expectation of the errors is zero].
Least Squares Solution
Our aim is to find estimates of \(\beta_0\) and \(\beta_1\) that “best” fit the data, but what does “best” mean?
We seek \((\hat{\beta}_0, \hat{\beta}_1)\) that minimizes a loss function.
Our approach is to minimize the sum of squared differences between \(\mathbf{Y}\) and the fitted values \(\hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\), i.e., the residual sum of squares \(\sum_i (y_i - \hat{y}_i)^2\) evaluated at a given set of \(\beta\)'s.
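As a small sketch (assuming `dat` as above), we can minimize this sum of squared differences numerically and check that it recovers the closed-form OLS solution:

#Sketch: minimize the residual sum of squares numerically with optim()
rss <- function(b) sum((dat$score - (b[1] + b[2] * dat$age))^2)
optim(c(mean(dat$score), 0), rss)$par   #numerical minimizer (approximate)
coef(lm(score ~ age, data = dat))       #closed-form OLS estimates, for comparison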
Geometric Interpretation of the OLS estimate
The fitted values \(\hat{y}\) are a linear transformation of \(y\). We see from the figure that \(y\) in real space is projected onto \(\hat{y}\) in model space.
The HAT matrix is defined as \(\mathbf{H} = \mathbf{X(X^{T} X)^{-1}X^T}\). It is a projection matrix that projects \(y\) \(\rightarrow\) \(\hat{y}\).
\[ \begin{align} \mathbf{\hat{y}} &= \mathbf{X\hat{\beta}}\\ &= \mathbf{X(X'X)^{-1}X'y} \end{align} \]
The projection matrix is idempotent, i.e., \(HH=H\).
The trace equals the rank of \(X\), i.e., the number of covariates plus the intercept: \(tr(H) = p\).
\(\hat{\mathbf{Y}}\) and \(\hat{\boldsymbol{\epsilon}}\) are orthogonal (perpendicular in model space); under normality they are also independent.
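These properties are easy to verify numerically; here is a minimal sketch assuming `dat` and the age-only design matrix:

#Sketch: numerical check of the hat-matrix properties
X <- cbind(1, dat$age)                 #design matrix (intercept + age)
H <- X %*% solve(t(X) %*% X) %*% t(X)  #hat matrix
max(abs(H %*% H - H))                  #idempotent: HH = H (up to rounding error)
sum(diag(H))                           #trace = p = 2
yhat <- H %*% dat$score
ehat <- dat$score - yhat
sum(yhat * ehat)                       #fitted values orthogonal to residuals (approx. 0)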
Properties of the OLS Estimator \(\hat{\beta}\)
- If \(E(Y) = X\beta\) then \(\hat{\beta}\) is unbiased for \(\beta\).
\[ \begin{align} E(\hat{\beta}) & = E[(X^TX)^{-1} X^Ty]\\ & = (X^T X)^{-1} X^T E(y)\\ & =(X^T X)^{-1} X^T X \beta\\ & = \beta \end{align} \]
\(\hat{\beta}\) is a consistent estimator of \(\beta\) under very general assumptions, i.e., \(\lim_{n\rightarrow \infty} P(|\hat{\beta} - \beta|>\epsilon) = 0\) for any \(\epsilon > 0\).
\(\mathbf{\hat{Y}}\) is an unbiased estimator of the mean trend \(\mathbf{X\beta}\).
If \(cov(y) = \sigma^2 I\), the covariance of the estimator \(cov(\mathbf{\hat{\beta}}) = \sigma^2 (X'X)^{-1}\).
Find the \(\operatorname{Cov}(\hat{\beta})\)
We have the linear model \[ \mathbf{y} = \mathbf{X}\beta + \boldsymbol{\epsilon}, \qquad \mathbb{E}[\boldsymbol{\epsilon}] = \mathbf{0}, \qquad \operatorname{Var}(\boldsymbol{\epsilon}) = \sigma^2 \mathbf{I}. \]
The OLS estimator is \[ \hat{\beta} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y} = \beta + (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \boldsymbol{\epsilon}. \]
Therefore,
\[ \begin{aligned} \operatorname{Cov}(\hat{\beta}) &= \operatorname{Var}\!\left[(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \boldsymbol{\epsilon}\right] \\ &= (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \underbrace{\operatorname{Var}(\boldsymbol{\epsilon})}_{\sigma^2 \mathbf{I}} \mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1} \\ &= \sigma^2 \, (\mathbf{X}^\top \mathbf{X})^{-1}. \end{aligned} \]
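A quick numerical sketch of this result (assuming `dat`; \(\sigma^2\) is replaced by its unbiased estimate, so the comparison is with R's `vcov()`):

#Sketch: compare sigma^2 (X'X)^{-1} with vcov() for the age-only model
fit0 <- lm(score ~ age, data = dat)
X <- model.matrix(fit0)
sigma2_hat <- sum(resid(fit0)^2) / (nrow(X) - ncol(X))  #unbiased estimate of sigma^2
sigma2_hat * solve(t(X) %*% X)                          #formula-based covariance
vcov(fit0)                                              #should match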
OLS Calculation in R
#Calculate beta manually
X = cbind(1, dat$age)   #Design Matrix
Y = dat$score           #Response/Outcome Vector
beta = solve(t(X) %*% X) %*% t(X) %*% Y
beta
[,1]
[1,] 67.7826813
[2,] 0.8402729
#Visualize the univariate association
plot(score ~ age, data = dat, xlab = "Age", ylab = "Score")
abline(beta, lwd = 3, col = 2)
We see that the regression coefficient estimate for age is \(0.840\). This means that for every 1-year increase in \(age\), the expected child score increases by \(0.840\) points.
Statistical Linear Regression & Inference
The relation between OLS, Normality, and Maximum Likelihood
So far, we have not discussed the distribution of the errors. For OLS we did not need to make any distributional assumption.
\[ \hat{\beta}_{OLS} = (X^TX)^{-1}X^Ty \]
But for inference (SEs, CIs, p-values), we need to make some distributional assumptions for the errors. Let’s now assume…
\[ y_i = \beta_0 + \beta_1 age_i + \epsilon_i, \quad i=1, ..., 400 \] where \(\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)\).
What does i.i.d. mean? Independent and identically distributed. This implies the following model:
\(y_i\) is normally distributed.
This parametrization is equivalent to
\[y_i \sim N(\beta_0 + \beta_1 x_i,\ \sigma^2),\]
where \(\beta_0 + \beta_1 x_i\) captures the trend,
and \(\sigma^2\) captures the variability (uncertainty) around the trend. It is constant, i.e., the same across all observations.
In matrix form: \(\mathbf{Y} \sim N(\mathbf{X\beta}, \Sigma)\) where \(\Sigma = \sigma^2I\).
\(\sigma^2\) is unknown!
The Maximum Likelihood
The likelihood of the normal distribution is given by
\[ L(\beta, \sigma^2; y) = (2\pi \sigma^2)^{-n/2} \exp\left(\frac{-1}{2\sigma^2} (Y-X\beta)'(Y-X\beta)\right) \]
Taking the log of the likelihood, we get:
\[ \ell(\beta, \sigma^2; y) = \frac{-n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} (Y-X\beta)'(Y-X\beta) \]
To find the Maximum Likelihood Estimators, we take partial derivatives and set them equal to zero to find the estimators that maximize the log-likelihood.
What do we get?
\(\hat{\beta}_{MLE} = (X'X)^{-1}X'y\) so \(\hat{\beta}_{MLE} = \hat{\beta}_{OLS}\) under normality.
\(\hat{\sigma}^2_{MLE} = \frac{1}{n} (Y-X\hat{\beta})'(Y-X\hat{\beta})\).
\(E(\hat{\sigma}^2_{MLE}) \neq \sigma^2\) so it is NOT unbiased.
An unbiased estimator for \(\sigma^2\) is given by \(s^2 = \frac{1}{n-p} (Y-X\hat{\beta})'(Y-X\hat{\beta})\) (a small simulation illustrating the bias appears after the table below).
|  | OLS | MLE |
|---|---|---|
| Distributional assumption | NONE | Normal: \(Y\sim N(X\beta, \sigma^2I)\) |
| Estimator \(\hat{\beta}\) | \((X'X)^{-1}X'y\) | \((X'X)^{-1}X'y\) |
| \(cov(\hat{\beta})\) | \(\sigma^2 (X'X)^{-1}\) | \(\sigma^2 (X'X)^{-1}\) |
| Estimator \(\hat{\sigma}^2\) | Cannot get | \(\frac{1}{n} (Y-X\hat{\beta})'(Y-X\hat{\beta})\) (biased) |
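A small simulation sketch (with made-up true values) illustrates the bias of the \(1/n\) estimator relative to the \(1/(n-p)\) version:

#Sketch: simulate many datasets and compare the two variance estimators
set.seed(1)
n <- 30; sigma2 <- 4
x <- rnorm(n)
est <- replicate(2000, {
  y <- 1 + 2 * x + rnorm(n, sd = sqrt(sigma2))
  f <- lm(y ~ x)
  c(mle = sum(resid(f)^2) / n, unbiased = sum(resid(f)^2) / (n - 2))
})
rowMeans(est)   #the MLE average falls below sigma^2 = 4; the 1/(n-p) version is close to 4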
Estimating the Residual Error
We saw above that the MLE of the residual error variance is \(\hat{\sigma}^2_{MLE} = \frac{1}{n} (Y-X\hat{\beta})^T(Y-X\hat{\beta})\). But this estimator is biased, i.e., \(E(\hat{\sigma}^2_{MLE}) \neq \sigma^2\).
The unbiased estimator of the error variance is the mean squared error (MSE) of the residuals: \(\hat{\sigma}^2 = \frac{1}{n-p} (Y-X\hat{\beta})'(Y-X\hat{\beta})\).
where \(p\) is the number of fitted coefficients (including the intercept). Dividing by \(n-p\) (the residual degrees of freedom) corrects the small-sample bias that would arise if we divided by \(n\).
Its square root, \(\hat\sigma\), is the residual standard error—the typical deviation of observations around the regression line, reported in outcome units.
library(ggplot2)
#Here p = 2
fit = lm(score ~ age, data = dat)
summary(fit)
Call:
lm(formula = score ~ age, data = dat)
Residuals:
Min 1Q Median 3Q Max
-67.109 -11.798 2.971 14.860 55.210
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.7827 8.6880 7.802 5.42e-14 ***
age 0.8403 0.3786 2.219 0.027 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.34 on 398 degrees of freedom
Multiple R-squared: 0.01223, Adjusted R-squared: 0.009743
F-statistic: 4.926 on 1 and 398 DF, p-value: 0.02702
#Can we confirm this manually
yhat = fit$fitted.values               #Predicted Values
y = dat$score                          #Observed Values
N = length(y)
P = 2                                  #Number of columns in design matrix
sq_error = (y - yhat)^2                #Squared errors
sigma_squared_est <- sum(sq_error)/(N - P)   #Unbiased est of the variance
sigma_est = sqrt(sigma_squared_est)          #Unbiased est of the std. error
sigma_est
[1] 20.34027
Indeed we see that the estimated residual standard error is \(20.3\). But what does this really mean?
We can visualize uncertainty in the regression line (mean). The pointwise 95% confidence band for the mean at covariate value \(x\) is:
\[ \hat{y}(x)\ \pm\ t_{n-2,\ 0.975}\ \hat{\sigma}\, \sqrt{\frac{1}{n} + \frac{(x-\bar x)^2}{S_{xx}}}\,, \] where
\[ \hat{y}(x)=\hat\beta_0+\hat\beta_1 x,\quad \bar x=\frac{1}{n}\sum_{i=1}^n x_i,\quad S_{xx}=\sum_{i=1}^n (x_i-\bar x)^2,\quad \hat{\sigma}^2=\frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{n-2}. \]
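As a sketch, we can compute this band by hand at a single, illustrative covariate value (here age = 25) and compare it with `predict()` (assuming `fit` from `lm(score ~ age, data = dat)`):

#Sketch: 95% CI for the mean at age = 25, computed from the formula above
x0 <- 25                               #illustrative age value
n <- nrow(dat)
sigma_hat <- summary(fit)$sigma
Sxx <- sum((dat$age - mean(dat$age))^2)
yhat0 <- unname(coef(fit)[1] + coef(fit)[2] * x0)
half <- qt(0.975, df = n - 2) * sigma_hat * sqrt(1/n + (x0 - mean(dat$age))^2 / Sxx)
c(fit = yhat0, lwr = yhat0 - half, upr = yhat0 + half)
predict(fit, newdata = data.frame(age = x0), interval = "confidence")  #should agree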
#Include 95% confidence intervals in your summary
nd <- data.frame(age = seq(min(dat$age), max(dat$age), length.out = 200))
ci <- as.data.frame(predict(fit, newdata = nd, interval = "confidence", level = 0.95))
plt <- cbind(nd, ci)
ggplot() +
geom_point(data = dat, aes(age, score), alpha = 0.6) +
geom_ribbon(data = plt, aes(x = age, ymin = lwr, ymax = upr), alpha = 0.2) +
geom_line(data = plt, aes(x = age, y = fit), linewidth = 1.1, color = "darkgreen") +
labs(x = "Maternal age (years)", y = "Child score",
title = "OLS fit with 95% CI for the mean response") +
theme_minimal()
Model Interpretation: A one year increase in mother’s age at delivery was associated with a 0.84 (p-value= 0.027) increase in the child’s average test score, where test score was measured at age 3. The association was statistically significant at a type I error rate of 0.05.
There was considerable heterogeneity in test scores \((\hat{\sigma} = 20.34)\). Therefore, the regression model does not predict individual test scores well \((R^2 = 0.012)\): only 1.2% of the variance in scores is explained by mother’s age.
Prediction Interval for a New Observation
Let \(\tilde{y}_i\) be a new observation with covariate value \(age_i\). How can you obtain its predictive uncertainty? We want to capture the uncertainty in our predicted estimate of the expected value \(\hat{y}_i = X\hat{\beta}\) PLUS the uncertainty due to measurement error \(\epsilon_i\).
\[ \tilde{y}_i = \hat{\beta}_0 + \hat{\beta}_1 age_i + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2) \]
The total uncertainty surrounding the predicted value \(\tilde{y}_i\) can be broken down as the sum of two components.
\[ Var( \tilde{y}_i) = Var(\hat{\beta}_0 + \hat{\beta}_1 age_i) + Var(\epsilon_i) \]
- We know the first term on the right: since \(Var(\hat{\beta}) = \sigma^2 (X'X)^{-1}\), it follows that \(Var(\hat{\beta}_0 + \hat{\beta}_1 age_i) = Var(x_i'\hat{\beta}) = \sigma^2\, x_i' (X'X)^{-1} x_i\), where \(x_i = (1, age_i)'\).
- We know the variance of the error term \(Var(\epsilon_i) = \sigma^2\).
We plug in our estimators to get the predicted variance
\[ \tilde{\sigma}^2 = \widehat{\mathrm{Var}}(\tilde{y}_i) = \hat{\sigma}^2\, x_i^\top (X^\top X)^{-1} x_i + \hat{\sigma}^2,\quad \hat{y}_i \pm t_{n-p,\,0.975}\sqrt{\widehat{\mathrm{Var}}(\tilde{y}_i)}\]
From this derivation we can see that \(\tilde{\sigma}^2 \geq \hat{\sigma}^2\). So the 95% prediction interval is going to be wider than the 95% confidence interval.
The approximate 95% prediction interval is given by \(\hat{y}_i \pm 1.96\, \tilde{\sigma}\).
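Here is a sketch of the same calculation using the matrix form above, again at an illustrative age of 25, compared with `predict(..., interval = "prediction")`:

#Sketch: prediction variance at age = 25 via sigma^2_hat * (x0'(X'X)^{-1}x0 + 1)
x0 <- c(1, 25)                              #illustrative covariate vector (intercept, age)
X <- model.matrix(fit)
sigma2_hat <- summary(fit)$sigma^2
var_pred <- sigma2_hat * (drop(t(x0) %*% solve(t(X) %*% X) %*% x0) + 1)
yhat0 <- drop(t(x0) %*% coef(fit))
tcrit <- qt(0.975, df = nrow(X) - ncol(X))
c(fit = yhat0, lwr = yhat0 - tcrit * sqrt(var_pred), upr = yhat0 + tcrit * sqrt(var_pred))
predict(fit, newdata = data.frame(age = 25), interval = "prediction")  #should agree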
# 1) Fit the model
fit <- lm(score ~ age, data = dat)

# 2) Make a smooth grid over age
nd <- data.frame(age = seq(min(dat$age), max(dat$age), length.out = 200))

# 3) Get intervals
ci <- as.data.frame(predict(fit, newdata = nd, interval = "confidence", level = 0.95))
pi <- as.data.frame(predict(fit, newdata = nd, interval = "prediction", level = 0.95))

plot_df <- cbind(nd,
                 fit = ci$fit,
                 lwr_ci = ci$lwr, upr_ci = ci$upr,
                 lwr_pi = pi$lwr, upr_pi = pi$upr)
# 4) Plot: points, prediction band (wider), mean CI band (narrower), fitted line
ggplot() +
# prediction band (wider)
geom_ribbon(data = plot_df,
aes(x = age, ymin = lwr_pi, ymax =upr_pi), alpha = 0.12, fill = "green") +
# mean CI band (narrower)
geom_ribbon(data = plot_df,
aes(x = age, ymin = lwr_ci, ymax =upr_ci),
alpha = 0.25) +
# fitted line
geom_line(data = plot_df,
aes(x = age, y = fit),
linewidth = 1.1) +
# points last so they sit on top
geom_point(data = dat,
aes(x = age, y = score),
alpha = 0.6) +
labs(x = "Maternal age (years)", y = "Child score",
title = "OLS: 95% CI (mean) and 95% PI (individual)") +
theme_minimal()
Model Diagnostics
Residual vs. Fitted
\(\hat{\epsilon}_i\) should be independent of \(\hat{y}_i\), i.e., no patterns.
Linearity: the red line should be flat, i.e., banded around 0.
Constant variance (homoscedasticity).
Normal Q-Q Plot
The standardized residuals \(\hat{\epsilon}_i / (\hat{\sigma} \sqrt{1-h_{ii}})\) should be approximately standard normal.
We expect the points to follow a straight line; deviations indicate non-normality, particularly heavy or skewed tails.
Scale-Location
- Similar to residuals vs. fitted, but uses standardized residuals. Diagnoses heteroscedasticity (e.g., an increasing red line).
Residual vs Leverage
- Leverage \(h_{ii}\) measures how far \(x_i\) is from the other \(x_j\)'s; equivalently, \(h_{ii} = \frac{\partial \hat{y}_i}{\partial y_i}\).
- Cook’s Distance: measures how much the fitted model changes when the \(i\)th point is removed; values \(>1\) are problematic.
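A minimal sketch of extracting these quantities directly (assuming `fit` from the age-only model), before looking at the standard diagnostic panel below:

#Sketch: leverages and Cook's distances from the fitted model
h <- hatvalues(fit)        #leverages h_ii (diagonal of the hat matrix)
d <- cooks.distance(fit)   #Cook's distance for each observation
summary(h)
which(d > 1)               #flag observations with Cook's distance > 1, if any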
par(mfrow=c(2,2))
plot(fit)
Hypothesis Testing
With the linear regression model established, we now turn to hypothesis testing to determine whether the relationships observed between the independent and dependent variables are statistically significant.
(Intercept): \(H_0 : \beta_0 = 0\), i.e., the mean test score is zero for a mother with \(age=0\), versus \(H_a: \beta_0 \neq 0\).
Age: \(H_0: \beta_1=0\), i.e., the slope of age is zero (no linear association), versus \(H_a: \beta_1 \neq 0\) (there is a linear association).
For a given coefficient \(\beta_j\), we have (under the model assumptions):
\[ \hat{\beta}_j \sim \mathcal{N}(\beta_j, \mathrm{Var}(\hat{\beta}_j)) \]
The variance of \(\hat{\beta}_j\) is:
\[ \mathrm{Var}(\hat{\beta}_j) = \sigma^2 (X^TX)^{-1}_{jj} \]
Since \(\sigma^2\) is unknown, we estimate it using the residual sum of squares:
\[ \hat{\sigma}^2 = \frac{1}{n - p} \sum_{i=1}^{n} \hat{\varepsilon}_i^2 = \frac{1}{n - p} \| y - X\hat{\beta} \|^2 \]
The standard error of \(\hat{\beta}_j\) is then:
\[ \mathrm{SE}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 (X^TX)^{-1}_{jj}} \]
We test the null hypothesis:
\[ H_0: \beta_j = 0 \quad \text{vs.} \quad H_a: \beta_j \ne 0 \]
The test statistic follows a \(t\)-distribution under the null hypothesis:
\[ t = \frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)} \sim t_{n - p} \]
We compare this statistic to critical values of the \(t\)-distribution or use its corresponding p-value to decide whether to reject \(H_0\).
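As a sketch, these quantities can be reproduced by hand (assuming `fit` from `lm(score ~ age, data = dat)`) and compared with the `summary(fit)` output below:

#Sketch: coefficient SEs, t statistics, and p-values computed manually
X <- model.matrix(fit)
n <- nrow(X); p <- ncol(X)
sigma2_hat <- sum(resid(fit)^2) / (n - p)
se <- sqrt(diag(sigma2_hat * solve(t(X) %*% X)))
tval <- coef(fit) / se
pval <- 2 * pt(abs(tval), df = n - p, lower.tail = FALSE)
cbind(Estimate = coef(fit), Std.Error = se, t.value = tval, p.value = pval)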
summary(fit)
Call:
lm(formula = score ~ age, data = dat)
Residuals:
Min 1Q Median 3Q Max
-67.109 -11.798 2.971 14.860 55.210
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.7827 8.6880 7.802 5.42e-14 ***
age 0.8403 0.3786 2.219 0.027 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.34 on 398 degrees of freedom
Multiple R-squared: 0.01223, Adjusted R-squared: 0.009743
F-statistic: 4.926 on 1 and 398 DF, p-value: 0.02702
- Let \(H=X(X'X)^{-1}X'\) and let \(\hat{Y} = HY\). Then \(H\hat{Y} = ?\)
- Show the derivation for \(cov(\hat{\beta})\).
- What is the unbiased estimator of \(\sigma^2\)?
- The unbiased estimator for \(\sigma^2\) contains a parameter \(p\). What does \(p\) represent?
- Code is given below to simulate data from a “true” model. Run the code to obtain the simulated covariates \(X\) and outcomes \(Y\).
- What is the “true” intercept in the below simulation?
- What is the “true” covariate coefficient value in the below simulation?
- Fit the linear model. Find the estimates of the \(\beta’s\) and global variance \(\sigma^2\).
- Interpret the findings of the model in terms of null versus alternative hypotheses.
- Create diagnostic plots and determine if model assumptions have been met.
#Set seed for reproducibility
set.seed(1232)
#Set the "true" values of the betas.
beta0 = 0.1
beta1 = 0.5
beta2 = -0.2

#Simulated data from two normals
x1 <- rnorm(n=100, 0, 2)
x2 <- rnorm(n=100, 5, 5)
error <- rnorm(n=100, 0, 1)

#Generate the simulated y values
X = cbind(1, x1, x2)
Beta = c(beta0, beta1, beta2)
Y = X %*% Beta + error
###Finish this code to run the proposed model and diagnostics.
Confounding
We found that older mothers had children on average with higher test scores. However, higher maternal education was associated with higher age as shown below.
How do we estimate the age association with outcome accounting for the effect (confounding) of education?
ggplot(dat) +
geom_boxplot(aes(as.factor(edu),age)) +
labs(x = "Mother Education", y= "Age at Delivery")
Confounding
Causes spurious correlation.
Referred to as “omitted variable bias”.
We want to control/account for confounding associations!
How do we account for confounding?
Include confounders in the regression equation \(\rightarrow\) MULTIPLE LINEAR REGRESSION: \(y \sim x+z\)
Assumptions of linear regression plus omitted variable assumption:
Residuals are independent AND normal.
Global variance (homoscedasticity).
Linearity of all included covariates.
All variables that are not included in the model have coefficient equal to zero (omitted variable bias = 0).
Adjusting for Confounding:
Consider the above model: \(y_i = \beta_0 + \beta_1 E_{2i} + \beta_2 E_{3i} + \beta_3 E_{4i} + \beta_4 age_i + \epsilon_i\).
Each education group has its own coefficient.
The slope of maternal age is constant across education groups (parallel lines).
\(\beta_4\) is the association between score and age, accounting for different averages in score in different education groups.
fit = lm(score ~ factor(edu) + age, data = dat)
summary(fit)
Call:
lm(formula = score ~ factor(edu) + age, data = dat)
Residuals:
Min 1Q Median 3Q Max
-58.853 -12.112 1.779 14.687 58.298
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 72.2360 8.8700 8.144 5.07e-15 ***
factor(edu)2 9.9365 2.5953 3.829 0.000150 ***
factor(edu)3 8.8416 3.2203 2.746 0.006316 **
factor(edu)4 17.6809 4.7065 3.757 0.000198 ***
age 0.2877 0.3985 0.722 0.470736
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.92 on 395 degrees of freedom
Multiple R-squared: 0.0598, Adjusted R-squared: 0.05028
F-statistic: 6.28 on 4 and 395 DF, p-value: 6.59e-05
- Age effect is no longer statistically significant after accounting for mother’s education level.
- This model assumes an identical effect of age across ALL education groups.
- Is this assumption realistic?
Categorical Variables
First we examine the effects of education on scores.
\(edu\) is coded as 1,2,3,4. We do NOT want to code this as continuous!!
Approach 1: an indicator (dummy) for each group.
\[ X_{1i} = 1_{[edu_i=1]}, \quad X_{2i} = 1_{[edu_i=2]}, \quad X_{3i} = 1_{[edu_i=3]}, \quad X_{4i} = 1_{[edu_i=4]} \]
If we have \(edu_i = [1,1,2,2,3,3,4,4]\), the design matrix is
\[ X\beta = \begin{pmatrix} 1 & 0 & 0 & 0\\ 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1\\ 0 & 0 & 0 & 1 \end{pmatrix} \cdot \begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{pmatrix} \]
- Note we remove the intercept. If we try to include an intercept, there is linear dependency between columns and there is no unique solution.
E1 = as.numeric(dat$edu == 1)
E2 = as.numeric(dat$edu == 2)
E3 = as.numeric(dat$edu == 3)
E4 = as.numeric(dat$edu == 4)

E1[1:6]
[1] 0 1 0 1 0 1
fit = lm(dat$score ~ E1 + E2 + E3 + E4 - 1) #use -1 to remove the intercept
summary(fit)
Call:
lm(formula = dat$score ~ E1 + E2 + E3 + E4 - 1)
Residuals:
Min 1Q Median 3Q Max
-58.45 -12.70 1.61 15.23 57.55
Coefficients:
Estimate Std. Error t value Pr(>|t|)
E1 78.447 2.159 36.33 <2e-16 ***
E2 88.703 1.367 64.88 <2e-16 ***
E3 87.789 2.284 38.44 <2e-16 ***
E4 97.333 3.831 25.41 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.91 on 396 degrees of freedom
Multiple R-squared: 0.9508, Adjusted R-squared: 0.9503
F-statistic: 1913 on 4 and 396 DF, p-value: < 2.2e-16
- What does each element of \(\beta\) vector mean?
- \(\hat{\beta}_1\) is equivalent to the average score for children whose mothers have \(edu=1\). So each \(\hat{\beta}_j\) is simply the mean score for the corresponding education group.
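A quick sketch of this check, comparing the no-intercept coefficients with the raw group means:

#Sketch: the no-intercept dummy coefficients equal the education-group means
tapply(dat$score, dat$edu, mean)   #group means of score by edu
coef(fit)                          #same four values, labelled E1-E4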
Approach 2: Intercept plus three indicators for groups 2,3, and 4.
\(X_{1i} = 1,\ X_{2i} = 1_{[edu_i=2]},\ X_{3i} = 1_{[edu_i=3]},\ X_{4i} = 1_{[edu_i=4]}\)
fit = lm(dat$score ~ factor(dat$edu))
summary(fit)
Call:
lm(formula = dat$score ~ factor(dat$edu))
Residuals:
Min 1Q Median 3Q Max
-58.45 -12.70 1.61 15.23 57.55
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 78.447 2.159 36.330 < 2e-16 ***
factor(dat$edu)2 10.256 2.556 4.013 7.18e-05 ***
factor(dat$edu)3 9.342 3.143 2.973 0.00313 **
factor(dat$edu)4 18.886 4.398 4.294 2.21e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.91 on 396 degrees of freedom
Multiple R-squared: 0.05856, Adjusted R-squared: 0.05142
F-statistic: 8.21 on 3 and 396 DF, p-value: 2.59e-05
- Each \(\hat{\beta}_j\) (for \(j \ge 2\)) represents the difference in average score relative to the \(edu=1\) reference group; e.g., 10.256 is the average score difference between the high school group and the no-high-school group.
Variance Inflation Factors
One must always be on the lookout for issues with multicollinearity!
The general issue is that when two variables are highly correlated, it is hard to disentangle their effects.
Mathematically, the standard errors are inflated. Suppose we have a design matrix \(\mathbf{X} = [x_1, ..., x_p]\) and we want to calculate the variance inflation factor (VIF). We regress \(x_1\) against \(x_2, ..., x_p\).
Let \(R_1^2\) be the associated R-squared for \(x_1\). Then \(VIF_1 = \frac{1}{1-R_1^2}\).
It can be shown that \(Var(\hat{\beta}_j) = VIF_j \cdot \frac{\sigma^2}{(n-1)s_{x_j}^2}\).
Rules of thumb differ, but \(VIF > 5\) is generally taken as cause for concern.
library(car)
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
#Create some correlated data
dat$age2 = dat$age^2 + rnorm(n=400, 0, 10)

fit = lm(dat$score ~ factor(dat$edu) + dat$age + dat$age2)
vif(fit)
GVIF Df GVIF^(1/(2*Df))
factor(dat$edu) 1.189686 3 1.029371
dat$age 94.039443 1 9.697394
dat$age2 94.128083 1 9.701963
# Is this bigger than 5? Age and age2 are highly collinear, as expected.
#Now try it without age2. Do you see any problems?
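As a sketch, the VIF for age can be reproduced from its auxiliary regression on the other predictors; for a one-degree-of-freedom term this should match (up to rounding) the GVIF reported above:

#Sketch: VIF for age by hand, VIF = 1 / (1 - R^2) from the auxiliary regression
aux <- lm(age ~ factor(edu) + age2, data = dat)
1 / (1 - summary(aux)$r.squared)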
Interaction Effects: Effect Modification
- We can relax the assumption of identical age effect by including interaction terms.
- Again the reference group is \(edu=1\) or \(E_{1i}\).
\[ \begin{align*} y_i &= \beta_0 +\beta_1 E_{2i} + \beta_2 E_{3i} + \beta_3 E_{4i} + \beta_4 age_i + \\ & \beta_5 E_{2i} age_i + \beta_6 E_{3i} age_i + \beta_7 E_{4i} age_i + \epsilon_i \end{align*} \]
- For each edu group: we assess effect modification by examining whether \(\beta_5\), \(\beta_6\), and \(\beta_7\) are zero. The model implies the following mean structure:
\[ E(y_i) = \begin{cases} \beta_0 + \beta_4\, age_i &\text{ if edu}=1\\ \beta_0 + \beta_1 + (\beta_4 + \beta_5)\, age_i& \text{ if edu}=2\\ \beta_0 + \beta_2 + (\beta_4 + \beta_6)\, age_i &\text{ if edu}=3\\ \beta_0 + \beta_3 + (\beta_4 + \beta_7)\, age_i &\text{ if edu}=4 \end{cases} \]
library(car)
fit = lm(score ~ factor(edu) * age, data = dat)
summary(fit)
Call:
lm(formula = score ~ factor(edu) * age, data = dat)
Residuals:
Min 1Q Median 3Q Max
-56.70 -11.80 2.07 14.58 54.34
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 105.2202 17.6127 5.974 5.2e-09 ***
factor(edu)2 -33.0929 21.5732 -1.534 0.1258
factor(edu)3 -53.4970 27.9460 -1.914 0.0563 .
factor(edu)4 36.4537 49.5065 0.736 0.4620
age -1.2402 0.8097 -1.532 0.1264
factor(edu)2:age 1.9704 0.9764 2.018 0.0443 *
factor(edu)3:age 2.7862 1.2293 2.266 0.0240 *
factor(edu)4:age -0.4799 1.9635 -0.244 0.8070
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.81 on 392 degrees of freedom
Multiple R-squared: 0.07705, Adjusted R-squared: 0.06057
F-statistic: 4.675 on 7 and 392 DF, p-value: 4.756e-05
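From these coefficients we can recover the education-specific age slopes implied by the model (a sketch; `fit` is the interaction model above):

#Sketch: education-specific age slopes beta_4, beta_4+beta_5, beta_4+beta_6, beta_4+beta_7
b <- coef(fit)
slopes <- c(b["age"],
            b["age"] + b["factor(edu)2:age"],
            b["age"] + b["factor(edu)3:age"],
            b["age"] + b["factor(edu)4:age"])
names(slopes) <- paste0("edu", 1:4)
slopes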
vif(fit)
there are higher-order terms (interactions) in this model
consider setting type = 'predictor'; see ?vif
GVIF Df GVIF^(1/(2*Df))
factor(edu) 9.093097e+05 3 9.842800
age 4.821946e+00 1 2.195893
factor(edu):age 1.039839e+06 3 10.065322
Note how the variances are inflated.
This is common with interaction variables since by construction there is dependence between an interaction variable and the main effects.
In general, it means we need larger sample sizes to examine interactions.
It is important to visualize the data and model results.
We see the lines are not parallel and the intercept differs across groups.
library(ggplot2)
ggplot(dat, aes(age, score)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~ edu, ncol = 2, scales = "free")
`geom_smooth()` using formula = 'y ~ x'
F-test of interaction
To test whether the interaction is significant, we examine whether including the interaction significantly “improves” model fit.
- \(H_0\): The interaction between education and age does not improve model fit.
- \(H_A\): The interaction improves model fit.
Suppose we want to choose between these two models, i.e., Full model vs. Reduced model. We use the F-statistic:
The F statistic is a ratio of two independent (scaled) chi-square random variables and tests whether the full model provides additional information, i.e., whether \(\beta_5 = \beta_6 = \beta_7 = 0\).
\[ F = \frac{ \text{RSS}_{\text{reduced}} - \text{RSS}_{\text{full}} }{ p_{\text{full}} - p_{\text{reduced}} } \Bigg/ \frac{ \text{RSS}_{\text{full}} }{ n - p_{\text{full}} }. \] Another parametrization of the F-test, in terms of the hat matrices, is:
\[ F = \frac{y'(H_{\text{full}} - H_{\text{reduced}})y \,/\, (p_{\text{full}} - p_{\text{reduced}})}{y'(I-H_{\text{full}})y \,/\, (n - p_{\text{full}})} = \frac{\chi^2_{p_{\text{full}} - p_{\text{reduced}}} \,/\, (p_{\text{full}} - p_{\text{reduced}})}{\chi^2_{n - p_{\text{full}}} \,/\, (n - p_{\text{full}})} \sim F_{p_{\text{full}} - p_{\text{reduced}},\; n - p_{\text{full}}} \text{ under } H_0, \] and we reject \(H_0\) at level \(\alpha\) when \(F \geq F_{\alpha,\, p_{\text{full}} - p_{\text{reduced}},\, n - p_{\text{full}}}\).
fit_inter = lm(score ~ factor(edu) * age, data = dat) #Full Model
fit_nointer = lm(score ~ factor(edu) + age, data = dat) #Reduced Model
anova(fit_nointer, fit_inter)
Analysis of Variance Table
Model 1: score ~ factor(edu) + age
Model 2: score ~ factor(edu) * age
Res.Df RSS Df Sum of Sq F Pr(>F)
1 395 156733
2 392 153857 3 2876.5 2.4429 0.06376 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- Overall, there is only limited statistical evidence of an interaction!
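As a check (a sketch reusing `fit_inter` and `fit_nointer`), the F statistic can be reproduced directly from the two residual sums of squares:

#Sketch: reproduce the partial F statistic from the two RSS values
rss_red  <- sum(resid(fit_nointer)^2)                      #reduced model (no interaction)
rss_full <- sum(resid(fit_inter)^2)                        #full model (with interaction)
df1 <- df.residual(fit_nointer) - df.residual(fit_inter)   #numerator df (3)
df2 <- df.residual(fit_inter)                              #denominator df (n - p_full)
Fstat <- ((rss_red - rss_full) / df1) / (rss_full / df2)
c(F = Fstat, p = pf(Fstat, df1, df2, lower.tail = FALSE))  #matches anova() above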
Summary
Section | Description |
---|---|
Model (MLR) | \(\mathbf y=\mathbf X\beta+\boldsymbol\epsilon\), with \(E[\boldsymbol\epsilon]=\mathbf 0\), \(\operatorname{Var}(\boldsymbol\epsilon)=\sigma^2\mathbf I\). Each \(\beta_j\) is an adjusted effect. |
Estimation (OLS) | Minimize RSS \(\sum (y_i-\hat y_i)^2\) ⇒ \(\hat\beta=(\mathbf X^\top\mathbf X)^{-1}\mathbf X^\top\mathbf y\). Fitted values \(\hat{\mathbf y}=\mathbf H\mathbf y\), \(\mathbf H=\mathbf X(\mathbf X^\top\mathbf X)^{-1}\mathbf X^\top\). |
Key properties | \(E[\hat\beta]=\beta\); \(\operatorname{Cov}(\hat\beta)=\sigma^2(\mathbf X^\top\mathbf X)^{-1}\); residuals are orthogonal to columns of \(\mathbf X\). |
Normality for inference | For SEs/CIs/p-values assume \(\boldsymbol\epsilon\sim\mathcal N(\mathbf 0,\sigma^2\mathbf I)\). Point estimates don’t require normality. |
Residual variance | \(\hat\sigma^2=\mathrm{SSE}/(n-p)\) with \(\mathrm{SSE}=\sum (y_i-\hat y_i)^2\), \(p\) includes the intercept. \(\hat\sigma\) = residual standard error. |
Mean CI & Prediction PI | 95% CI bands quantify uncertainty in the mean \(\hat y(x)\); 95% PI adds error variance and is wider (for new observations). |
Diagnostics | Residuals vs fitted (linearity/variance), Normal Q–Q (normality), Scale–Location (homoscedasticity), Residuals vs leverage/Cook’s \(D\) (influence). |
Hypothesis tests | \(t\)-tests for individual coefficients (e.g., slope of age). Partial \(F\)-tests for blocks of terms (e.g., interactions). |
Confounding | Adjust by including confounders in \(\mathbf X\) to obtain population-level, adjusted associations. |
Categorical predictors | Use \(k-1\) indicators with an intercept (reference coding) for group mean differences. |
Multicollinearity (VIF) | \(\text{VIF}_j = 1/(1-R_j^2)\); values \(\gtrsim 5\) are a concern (rule of thumb in these notes). |
Interactions | Add terms like \(x:\text{group}\) for group-specific slopes; without interactions slopes are common across groups. |
F-test of interaction | Compare full vs reduced models to test whether interaction terms improve fit. |