Review matrix algebra, see materials in Module 0 on Canvas. For more advanced review see The Matrix Cookbook.
Review notes from Applied Linear Regression (BIOS 509)
Other useful materials:
Concepts
Linear regression model in matrix notation.
Inference for regression coefficient estimates, expected values, and predictions.
Dummy variables.
Effect modification, interactions, and confounding.
Acknowledgements
Lecture notes are built upon materials from Prof. Benjamin Risk and Prof. Howard Chang.
Motivating Example
We are interested in the maternal traits associated with children’s cognitive test scores at age 3, using data on adult American women and their children from the National Longitudinal Survey of Youth.
Variables:
score: cognitive test score at age 3.
age: maternal age at delivery
edu: maternal education (1) less than high school, (2) high school, (3) some college, (4) college and above.
Rows: 400 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): score, edu, age
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(dat$score)
Min. 1st Qu. Median Mean 3rd Qu. Max.
20.00 74.00 90.00 86.93 102.00 144.00
table(dat$edu)
1 2 3 4
85 212 76 27
summary(dat$age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
17.00 21.00 23.00 22.79 25.00 29.00
par(mfrow = c(1, 2))
plot(score ~ age, data = dat, xlab = "Age", ylab = "Score")
boxplot(score ~ edu, xlab = "Education", data = dat)
Simple Linear Regression
Linear regression is a method that summarizes how the average values of a numerical outcome variable vary over subpopulations defined by linear functions of predictors. Two goals of linear regression:
Prediction of outcome values given fixed predictor values.
Assessment of impact of predictors on the outcome (Estimation).
Let \(y_i\) denote the test score for child \(i\), let \(age_i\) denote the maternal age at delivery for the mother of child \(i\), and let \(\epsilon_i\) denote the error term. We will consider the following linear model:
\[ y_i = \beta_0 + \beta_1 \, age_i + \epsilon_i, \qquad i = 1, \ldots, n. \]
Define the different components \((Y, X, \beta, \epsilon)\).
What is the expected value of \(\mathbf{Y}\)?
What is the variance of \(\mathbf{Y}\)?
What is the covariance \(cov(y_i, y_j)\) for \(i \neq j\)?
Assumptions of the above model
\(y_i\) is a linear function of \(age_i\), i.e., \(E(\mathbf{Y}) = \mathbf{X\beta}\) is linear, where \(\mathbf{X}\) contains an intercept column and \(age\).
\(\beta_0=\) the expected test score for a child born to a mother with \(age=0\) (not meaningful!)
\(\beta_1=\) the expected increase in test score associated with a one-year increase in maternal age.
\(E(\epsilon_i) = 0\): the expectation of the errors is zero.
Least Squares Solution
Our aim is to find estimates of \(\beta_0\) and \(\beta_1\) that “best” fit the data, but what does “best” mean?
We seek \((\hat{\beta}_0, \hat{\beta}_1)\) that minimizes a loss function.
Our approach is to minimize the sum of squared differences between the observed \(\mathbf{Y}\) and the predicted values \(\hat{\mathbf{Y}} = \mathbf{X\hat{\beta}}\) computed from a given set of \(\beta\)’s.
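Writing this out, the least-squares criterion to be minimized over \(\beta\) is
\[ RSS(\beta) = \sum_{i=1}^{n} \left(y_i - x_i^T\beta\right)^2 = (\mathbf{Y} - \mathbf{X}\beta)^T(\mathbf{Y} - \mathbf{X}\beta). \]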
Theorem
If \(\mathbf{Y} = \mathbf{X\beta} + \mathbf{\epsilon}\) then the value of \(\hat{\beta}\) that minimizes \((\mathbf{Y - \hat{Y}})^T(\mathbf{Y - \hat{Y}})\) is \(\hat{\beta} = (X'X)^{-1} X'y\).
\(\mathbf{\hat{Y}} = \mathbf{X \hat{\beta}}\) is the fitted (predicted) value.
\(\mathbf{\hat{\epsilon}} = \mathbf{Y} - \mathbf{X\hat{\beta}}\) defines the residual error.
In-class exercise
Let’s show this!
For OLS, we have a nice closed-form solution.
Other loss functions or models often require iterative optimization routines.
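As a quick numerical check, the closed-form solution can be computed directly and compared with `lm()`; a minimal sketch, assuming `dat` from the motivating example is loaded:

# Design matrix: intercept column plus maternal age
X <- cbind(1, dat$age)
Y <- dat$score

# Closed-form OLS solution: beta_hat = (X'X)^{-1} X'Y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% Y
beta_hat

# Should agree with the coefficients reported by lm()
coef(lm(score ~ age, data = dat))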
Geometric Interpretation of the OLS estimate
The fitted values \(\hat{y}\) are a linear transformation of \(y\): \(y\) in data space is projected onto \(\hat{y}\) in model space.
The HAT matrix is defined as \(\mathbf{H} = \mathbf{X(X^{T} X)^{-1}X^T}\). It is a projection matrix that projects \(y\) onto \(\hat{y}\).
The projection matrix is idempotent, i.e., \(HH=H\).
The trace of the projection matrix equals the number of covariates plus the intercept, i.e., \(tr(H) = p\).
\(\hat{\mathbf{Y}}\) and \(\mathbf{\hat{\epsilon}}\) are independent, i.e., orthogonal (perpendicular in model space).
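These properties can be checked numerically; a small sketch, assuming the design matrix for the simple regression of score on age:

X <- cbind(1, dat$age)                  # n x p design matrix (p = 2)
H <- X %*% solve(t(X) %*% X) %*% t(X)   # HAT matrix
max(abs(H %*% H - H))                   # idempotent: HH = H (differences ~ 0)
sum(diag(H))                            # trace equals p = 2
yhat <- H %*% dat$score                 # projection of y onto the model space
res <- dat$score - yhat                 # residuals
sum(yhat * res)                         # orthogonality of fitted values and residuals (~ 0)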
Properties of the OLS Estimator \(\hat{\beta}\)
If \(E(Y) = X\beta\) then \(\hat{\beta}\) is unbiased for \(\beta\).
\(\hat{\beta}\) is a consistent estimator of \(\beta\) under very general assumptions, i.e., \(\lim_{n\rightarrow \infty} P(|\hat{\beta} - \beta|>\epsilon) = 0\) for any \(\epsilon > 0\).
\(\mathbf{\hat{Y}}\) is an unbiased estimator of the mean trend \(\mathbf{X\beta}\).
If \(cov(y) = \sigma^2 I\), the covariance of the estimator \(cov(\mathbf{\hat{\beta}}) = \sigma^2 (X'X)^{-1}\).
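For instance, unbiasedness (the first property above) follows directly from linearity of expectation:
\[ E(\hat{\beta}) = E\left[(X^TX)^{-1}X^T \mathbf{Y}\right] = (X^TX)^{-1}X^T E(\mathbf{Y}) = (X^TX)^{-1}X^T X\beta = \beta. \]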
In-class exercise
Find the \(cov(\hat{\beta})\).
Gauss-Markov Theorem
\(\mathbf{Y} = \mathbf{X\beta} +\epsilon\)
\(E(\epsilon) = 0\) and \(cov(\epsilon) = \sigma^2 I\)
\(var(\hat{\beta}) = \sigma^2 (X'X)^{-1}\)
\(\hat{\beta}\) is the best (minimum variance) linear unbiased estimator (BLUE) for \(\mathbf{\beta}\).
\(\hat{\beta} = (X'X)^{-1} X'y\) is the best (smallest variance) linear unbiased estimator of the true \(\beta\).
\(E(\hat{\beta}) = \beta\) and \(cov(\hat{\beta}) = \sigma^2 (X'X)^{-1}\)
The HAT matrix \(X(X'X)^{-1}X'\) is a projection matrix which projects \(y\) to \(\hat{y} = X\hat{\beta}\).
Interpretation
\(\beta_0\) is the global value of \(E(Y)\) when \(X=0\).
\(\beta_1\) is the global change in \(Y\) for every one-unit change in \(X\).
\(\sigma^2\) captures the total variance across all observations.
Statistical Linear Regression & Inference
The relation between OLS, Normality, and Maximum Likelihood
So far, we have not discussed the distribution of the errors; OLS does not require any distributional assumption. For inference, let’s now assume \(\epsilon \sim N(\mathbf{0}, \sigma^2 \mathbf{I})\), i.e., \(\mathbf{Y} \sim N(\mathbf{X\beta}, \sigma^2 \mathbf{I})\). Under this assumption the maximum likelihood estimator of \(\beta\) coincides with the OLS estimator, and the MLE of \(\sigma^2\) is \(\hat{\sigma}^2_{MLE} = \frac{1}{n}(Y - X\hat{\beta})'(Y - X\hat{\beta})\).
\(E(\hat{\sigma}^2_{MLE}) \neq \sigma^2\) so it is NOT unbiased.
An unbiased estimator for \(\sigma^2\) is given by \(s^2 = \frac{1}{n-p} (Y-X\hat{\beta})'(Y-X\hat{\beta})\).
Summary of OLS vs. MLE approaches
Distributional assumption: OLS requires none; MLE assumes \(Y\sim N(X\beta, \sigma^2I)\).
Estimator \(\hat{\beta}\): \((X'X)^{-1}X'y\) for both OLS and MLE.
\(cov(\hat{\beta})\): \(\sigma^2 (X'X)^{-1}\) for both.
Estimator \(\hat{\sigma}^2\): not available from OLS; the MLE \(\frac{1}{n} (Y-X\hat{\beta})'(Y-X\hat{\beta})\) is biased.
Residual Error
Above we determined that the unbiased estimator of \(\sigma^2\) is \(\frac{1}{n-p} (Y-X\hat{\beta})'(Y-X\hat{\beta})\).
library(ggplot2)
# Here p = 2
fit = lm(score ~ age, data = dat)
summary(fit)
Call:
lm(formula = score ~ age, data = dat)
Residuals:
Min 1Q Median 3Q Max
-67.109 -11.798 2.971 14.860 55.210
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.7827 8.6880 7.802 5.42e-14 ***
age 0.8403 0.3786 2.219 0.027 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.34 on 398 degrees of freedom
Multiple R-squared: 0.01223, Adjusted R-squared: 0.009743
F-statistic: 4.926 on 1 and 398 DF, p-value: 0.02702
# Can we confirm this manually?
yhat = fit$fitted.values   # Predicted values
y = dat$score              # Observed values
N = length(y)
P = 2                      # Number of columns in the design matrix
sq_error = (y - yhat)^2    # Squared errors
sigma_squared_est <- sum(sq_error) / (N - P)  # Unbiased est. of the variance
sigma_est = sqrt(sigma_squared_est)           # Unbiased est. of the residual std. error
sigma_est
[1] 20.34027
# Include 95% confidence intervals in your summary
confidence_intervals <- predict(fit, interval = "confidence", level = 0.95)
confidence_intervals[1:3, ]
est <- fit$coefficients[1] + fit$coefficients[2] * dat$age
# Now add the 95% confidence intervals surrounding the fitted values
plot_data <- data.frame(dat$age, dat$score, confidence_intervals, est)
ggplot(plot_data) +
  geom_point(aes(fit, dat.score)) +
  geom_line(aes(fit, est), col = "darkgreen", lwd = 2) +
  geom_ribbon(aes(x = fit, ymin = lwr, ymax = upr), fill = "green", alpha = 0.3) +
  labs(x = "Fitted value", y = "Observed value")
ggplot(plot_data) +
  geom_point(aes(dat.age, dat.score)) +
  geom_line(aes(dat.age, est), col = "darkgreen", lwd = 2) +
  geom_ribbon(aes(x = dat.age, ymin = lwr, ymax = upr), fill = "green", alpha = 0.3) +
  labs(x = "Age", y = "Score Value")
Model Interpretation
fit = lm(score ~ age, data = dat)
summary(fit)
Call:
lm(formula = score ~ age, data = dat)
Residuals:
Min 1Q Median 3Q Max
-67.109 -11.798 2.971 14.860 55.210
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.7827 8.6880 7.802 5.42e-14 ***
age 0.8403 0.3786 2.219 0.027 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.34 on 398 degrees of freedom
Multiple R-squared: 0.01223, Adjusted R-squared: 0.009743
F-statistic: 4.926 on 1 and 398 DF, p-value: 0.02702
Interpretation: A one-year increase in mother’s age at delivery was associated with a X (95% CI = ) increase in the child’s average test score, where test score was measured at age 3. The association was statistically significant at a type I error rate of 0.05.
There was considerable heterogeneity in test scores \((\hat{\sigma} = 20.34)\). Therefore, the regression model does not predict individual test scores well \((R^2 = 0.012)\): only 1.2% of the variance in scores is explained by mother’s age.
Prediction Interval for a New Observation
Let \(\tilde{y}_i\) be a new observation with covariate value \(age_i\). How can you obtain its uncertainty? We want to capture the uncertainty in our predicted estimate of the expected value \(\hat{y}_i = x_i^T\hat{\beta}\) PLUS the uncertainty due to measurement error \(\epsilon_i\).
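In outline (consistent with the summary at the end of these notes and the code below), for a new observation with covariate row \(x_i\),
\[ \tilde{\sigma}_i^2 = Var(x_i^T\hat{\beta}) + Var(\epsilon_i) = \hat{\sigma}^2\, x_i^T (X^TX)^{-1} x_i + \hat{\sigma}^2. \]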
From this derivation we can see that \(\tilde{\sigma}^2 \geq \hat{\sigma}^2\). So the 95% prediction interval is going to be wider than the 95% confidence interval.
The approximate 95% prediction interval is given by \(\tilde{y}_i \pm 1.96 \tilde{\sigma}\).
X = cbind(1, dat$age)
Y = dat$score
invXtX = solve(t(X) %*% X)
beta = invXtX %*% t(X) %*% Y
sigmahat = sum((Y - X %*% beta)^2) / (length(Y) - 2)
V = sigmahat * invXtX
var_pred = diag(X %*% V %*% t(X)) + sigmahat
lwr_pred = fit$fitted.values - 1.96 * sqrt(var_pred)
uppr_pred = fit$fitted.values + 1.96 * sqrt(var_pred)
# Now add the 95% prediction intervals surrounding the fitted values
plot_data2 <- data.frame(dat$age, dat$score, confidence_intervals, est, lwr_pred, uppr_pred)
ggplot(plot_data2) +
  geom_point(aes(fit, dat.score)) +
  geom_line(aes(fit, est), col = "darkgreen", lwd = 2) +
  geom_ribbon(aes(x = fit, ymin = lwr, ymax = upr), fill = "green", alpha = 0.3) +
  geom_ribbon(aes(x = fit, ymin = lwr_pred, ymax = uppr_pred), fill = "orange", alpha = 0.3) +
  labs(x = "Predicted value", y = "Observed value")
ggplot(plot_data) +
  geom_point(aes(dat.age, dat.score)) +
  geom_line(aes(dat.age, est), col = "darkgreen", lwd = 2) +
  geom_ribbon(aes(x = dat.age, ymin = lwr, ymax = upr), fill = "green", alpha = 0.3) +
  geom_ribbon(aes(x = dat.age, ymin = lwr_pred, ymax = uppr_pred), fill = "orange", alpha = 0.3) +
  labs(x = "Age", y = "Predicted Value")
Model Diagnostics
Residual vs. Fitted
\(\hat{\epsilon}_i\) should be independent of \(\hat{y}_i\), i.e., no patterns.
Linearity: the red line should be flat, i.e., banded around 0.
Constant variance (homoscedasticity).
Normal Q-Q Plot
Standardized residual \(\hat{\epsilon}_i/ (\hat{ \sigma} \sqrt{1-h_{ii}})\) should be a standard normal.
We expect the points to follow the straight line; check for non-normality, particularly skewed tails.
Scale-Location
Similar to residuals vs. fitted, but using standardized residuals. Used to diagnose heteroscedasticity (e.g., an increasing red line).
Residual vs Leverage
Leverage \(h_{ii}\) measures how far \(x_i\) is from the other \(x_j\), where \(h_{ii} = \frac{\partial \hat{y}_i}{\partial y_i}\).
Cook’s distance measures how much the model changes when the \(i\)th point is removed; values > 1 are problematic.
par(mfrow = c(2, 2))
plot(fit)
Hypothesis Testing
(Intercept): \(H_0 : \beta_0 = 0\), i.e., the expected test score is zero for a mother with \(age=0\), versus \(H_a: \beta_0 \neq 0\).
Age: \(H_0: \beta_1=0\), i.e., the slope of age is zero and there is no linear effect, versus \(H_a: \beta_1 \neq 0\), i.e., there is a linear effect.
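The t value reported for each coefficient in the summary output is
\[ t = \frac{\hat{\beta}_j}{\widehat{SE}(\hat{\beta}_j)}, \]
which follows a \(t_{n-p}\) distribution under \(H_0\) (here \(n - p = 398\) degrees of freedom, matching the output below).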
summary(fit)
Call:
lm(formula = score ~ age, data = dat)
Residuals:
Min 1Q Median 3Q Max
-67.109 -11.798 2.971 14.860 55.210
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.7827 8.6880 7.802 5.42e-14 ***
age 0.8403 0.3786 2.219 0.027 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.34 on 398 degrees of freedom
Multiple R-squared: 0.01223, Adjusted R-squared: 0.009743
F-statistic: 4.926 on 1 and 398 DF, p-value: 0.02702
Tip
In class exercise:
What conclusions can we make about the intercept and the impact of age?
Summary II
What is the model?
In multiple linear regression, we regress a vector of outcome values \(Y\) on multiple vectors of predictor values \(X = (X_1, X_2, ..., X_p)\) with:
\(Y = X\beta + \epsilon\)
\(\mathbf{\beta}\) is a \((p+1) \times 1\) vector of regression coefficients: (1) the intercept \(\beta_0\), and (2) the covariate coefficients \((\beta_1, \beta_2, ..., \beta_p)\).
Model Assumptions
The errors \(\epsilon_i\) are i.i.d. normally distributed, \(\epsilon_i \sim N(0, \sigma^2)\), so \(E(\epsilon) = 0\).
\(\hat{\beta} = (X'X)^{-1} X'y\) is the best (smallest variance) linear unbiased estimator of the true \(\beta\) and coincides with the maximum likelihood estimator under normality.
\(E(\hat{\beta}) = \beta\) and \(cov(\hat{\beta}) = \sigma^2 (X'X)^{-1}\)
\(\hat{\sigma}^2_{MLE} = \frac{1}{n} (Y-X\hat{\beta})^T (Y-X\hat{\beta})\) is biased
\(s^2 = \frac{1}{n-p} (Y-X\hat{\beta})^T (Y-X\hat{\beta})\) is an unbiased estimator
Interpretation
\(\beta_0\) is the global value of \(E(Y)\) when \(X=0\).
\(\beta_1\) is the global change in \(Y\) for every one-unit change in \(X\).
\(\sigma^2\) captures the total variance across all observations; its MLE \(\hat{\sigma}^2_{MLE}\) is a biased estimator.
Prediction
The predicted value for a new observation with covariates \(x_i\) is \(\tilde{y}_i = x_i^T\hat{\beta}\).
The uncertainty around the predicted value is given by \(\tilde{y}_i \pm 1.96\, \tilde{\sigma}\).
\(\tilde{\sigma}^2\) is the sum of the data variance \(\hat{\sigma}^2 x_i^T (X^TX)^{-1} x_i\) and the global variance \(\hat{\sigma}^2\), so \(\tilde{\sigma}^2 \geq \hat{\sigma}^2\).
Diagnostics
Linearity: Residuals vs Fitted should be banded around 0 with constant variance, i.e., no patterns.
Normality: points in the Normal Q-Q plot should follow the straight line; watch for skewed tails indicating non-normality.
Constant variance: the standardized residuals (Scale-Location plot) should show a flat red line.
Residuals vs. Leverage: shows the impact of outlying, influential points; Cook’s distance > 1 is problematic.
Hypothesis Testing
\(H_0: \beta=0\)is the null hypothesis of no effect.
\(H_A: \beta \neq 0\)is the alternative hypothesis of a significant effect not equal to zero.
BREAKOUT SESSION 1
Let \(H=X(X'X)^{-1}X'\) and let \(\hat{Y} = HY\). Then \(H\hat{Y} = ?\)
Show the derivation for \(cov(\hat{\beta})\).
What is the unbiased estimator of \(\sigma^2\)?
The unbiased estimator for \(\sigma^2\) contains a parameter \(p\). What does \(p\) represent?
Code is given below to simulate data from a “true” model. Run the code to get simulated covariates \(X\) and outcomes \(Y\).
What is the “true” intercept in the below simulation?
What is the “true” covariate coefficient value in the below simulation?
Fit the linear model. Find the estimates of the \(\beta’s\) and global variance \(\sigma^2\).
Interpret the findings of the model in terms of null versus alternative hypotheses.
Create diagnostic plots and determine if model assumptions have been met.
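Returning to the motivating data, the model below (the first dummy-variable coding approach, with the intercept removed) regresses score on one indicator per education group. The indicators E1–E4 are not constructed in this excerpt; a minimal sketch of how they could be defined from `dat$edu` (an assumed reconstruction, not the original code):

# Indicator (dummy) variables for the four maternal education groups
E1 <- as.numeric(dat$edu == 1)   # less than high school
E2 <- as.numeric(dat$edu == 2)   # high school
E3 <- as.numeric(dat$edu == 3)   # some college
E4 <- as.numeric(dat$edu == 4)   # college and above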
fit = lm(dat$score ~ E1 + E2 + E3 + E4 - 1)  # use -1 to remove the intercept
summary(fit)
Call:
lm(formula = dat$score ~ E1 + E2 + E3 + E4 - 1)
Residuals:
Min 1Q Median 3Q Max
-58.45 -12.70 1.61 15.23 57.55
Coefficients:
Estimate Std. Error t value Pr(>|t|)
E1 78.447 2.159 36.33 <2e-16 ***
E2 88.703 1.367 64.88 <2e-16 ***
E3 87.789 2.284 38.44 <2e-16 ***
E4 97.333 3.831 25.41 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.91 on 396 degrees of freedom
Multiple R-squared: 0.9508, Adjusted R-squared: 0.9503
F-statistic: 1913 on 4 and 396 DF, p-value: < 2.2e-16
What does each element of \(\beta\) vector mean?
\(\hat{\beta}_1\) is equivalent to taking the average score for kids whose mothers have \(edu=1\). So each \(\hat{\beta}_j\) is equivalent to calculating the mean score for the corresponding education group.
In class exercise
What is the estimated difference in average score between mothers without high school and mothers with college and above?
Approach 2: Intercept plus three indicators for groups 2,3, and 4.
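The output below comes from fitting education as a factor, which uses \(edu=1\) as the reference level; the call (taken from the Call line of the output) is:

fit = lm(dat$score ~ factor(dat$edu))
summary(fit)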
Call:
lm(formula = dat$score ~ factor(dat$edu))
Residuals:
Min 1Q Median 3Q Max
-58.45 -12.70 1.61 15.23 57.55
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 78.447 2.159 36.330 < 2e-16 ***
factor(dat$edu)2 10.256 2.556 4.013 7.18e-05 ***
factor(dat$edu)3 9.342 3.143 2.973 0.00313 **
factor(dat$edu)4 18.886 4.398 4.294 2.21e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.91 on 396 degrees of freedom
Multiple R-squared: 0.05856, Adjusted R-squared: 0.05142
F-statistic: 8.21 on 3 and 396 DF, p-value: 2.59e-05
Each \(\hat{\beta}_j\) for groups 2–4 represents the difference in average score with respect to the \(edu=1\) group; e.g., 10.256 is the average difference in score between the high school group and the no-high-school (reference) group.
The slope of maternal age is constant across education groups (parallel lines).
\(\beta_4\) is the association between score and age, accounting for different averages in score in different education groups.
fit = lm(score ~ factor(edu) + age, data = dat)
summary(fit)
Call:
lm(formula = score ~ factor(edu) + age, data = dat)
Residuals:
Min 1Q Median 3Q Max
-58.853 -12.112 1.779 14.687 58.298
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 72.2360 8.8700 8.144 5.07e-15 ***
factor(edu)2 9.9365 2.5953 3.829 0.000150 ***
factor(edu)3 8.8416 3.2203 2.746 0.006316 **
factor(edu)4 17.6809 4.7065 3.757 0.000198 ***
age 0.2877 0.3985 0.722 0.470736
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.92 on 395 degrees of freedom
Multiple R-squared: 0.0598, Adjusted R-squared: 0.05028
F-statistic: 6.28 on 4 and 395 DF, p-value: 6.59e-05
Age effect is no longer statistically significant after accounting for mother’s education level.
This model assumes an identical effect of age across ALL education groups.
Is this assumption realistic?
Variance Inflation Factors
One must always be on the lookout for issues with multicollinearity!
The general issue is that when two variables are highly correlated, it is hard to disentangle their effects.
Mathematically, the standard errors are inflated. Suppose we have a design matrix \(\mathbf{X} = [x_1, ..., x_p]\) and we want to calculate the variance inflation factor (VIF). We regress \(x_1\) against \(x_2, ..., x_p\).
Let \(R_1^2\) be the associated R-squared for \(x_1\). Then \(VIF_1 = \frac{1}{1-R_1^2}\).
It can be shown that \(Var(\hat{\beta}_j) = VIF_j \cdot \frac{\sigma^2}{(n-1)s_{x_j}^2}\).
By various rules of thumb, \(VIF > 5\) is cause for concern.
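To make the recipe above concrete, here is a small self-contained sketch computing a VIF by hand; x1 and x2 are simulated, hypothetical predictors (not from the course data):

# Illustrative only: VIF for x1 in a design with two correlated predictors
set.seed(1)
x2 <- rnorm(100)
x1 <- 0.8 * x2 + rnorm(100, sd = 0.5)    # x1 is correlated with x2

r2_1 <- summary(lm(x1 ~ x2))$r.squared   # regress x1 on the other predictor(s)
vif_1 <- 1 / (1 - r2_1)                  # VIF_1 = 1 / (1 - R_1^2)
vif_1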
To test whether the interaction was significant, look at whether the inclusion of the interaction significantly “improved” model fit.
\(H_0\): The interaction between education and age does not improve model fit.
\(H_A\): The interaction improves model fit.
Suppose we want to choose between these two models, i.e., Full model vs. Reduced model. We use the F-statistic:
F-statistic to perform model comparison
The F statistic is a ratio of two (scaled) chi-square statistics and tests whether the full model provides additional information, e.g., whether \(\beta_5 = \beta_6 = \beta_7 = 0\).
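Concretely, letting \(RSS_R\) and \(RSS_F\) denote the residual sums of squares of the reduced and full models with \(p_R < p_F\) parameters,
\[ F = \frac{(RSS_R - RSS_F)/(p_F - p_R)}{RSS_F/(n - p_F)}, \]
which follows an \(F_{p_F - p_R,\; n - p_F}\) distribution under \(H_0\). In R, this comparison can be carried out with anova(fit_reduced, fit_full), where fit_reduced and fit_full are placeholder names for the two fitted lm objects.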