## 18.2 Linear model (Regression)

• lm(): R function to estimate a linear model (lm stands for "linear model")
• lm(y ~ x): Use outcome variable y and explanatory variable x as input
• lm(y ~ x, data = MyData): If both variables are located in the data frame "MyData"
• lm(y ~ x1 + x2 + x3, data = MyData): Several independent variables
• lm(y ~ x1 + x2 + x1:x2, data = MyData): Interaction effects
• lm(y ~ x1 * x2, data = MyData): Same as above
• result <- lm(y ~ x): Save the results
• Access results by referring to the object name, e.g. result or summary(result)
• An object of class "lm" contains different components
• coef(result): The coefficients
• residuals(result): The residuals
• IMPORTANT: The option na.action = na.omit (the default) omits missing values listwise
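A minimal sketch of this workflow, using the built-in swiss data set that the example below introduces:

```r
# Fit a linear model, save the result, and inspect the pieces listed above
data(swiss)                                       # built-in data frame
result <- lm(Fertility ~ Education, data = swiss)

coef(result)        # named coefficient vector: (Intercept), Education
residuals(result)   # one residual per observation
class(result)       # "lm"
summary(result)     # regression table with t-values and p-values
```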

### 18.2.1 Example: Fertility and socio-economic indicators

# Example: Fertility, education, and religion (share of Catholics)

?swiss
summary(swiss$Catholic)
summary(swiss$Fertility)
summary(swiss$Education)

fit <- lm(Fertility ~ Education + Catholic, data=swiss) # default: na.action = na.omit

# summary() is a generic function. If you input an lm() object,
# it knows you want a summary of the regression statistics
summary(fit)
# Number of observations: nrow(swiss) = 47
# DF (degrees of freedom): 47 observations - 3 parameters (to estimate) = 44
# 95 percent confidence interval:
# 74.23 (point estimate) +- 2.35 (standard error) * 1.96 (use 2 as a rule of thumb)
74.23 + 2.35*1.96
# t value = point estimate divided by its standard error
# Pr(>|t|): p-value, i.e. the probability of observing a t-value at least this
# large (in absolute value) given that beta = 0

# Components of the lm object
ls(fit)
names(fit)
class(fit)

# Access coefficients and other parts
fit$fitted.values # y hat = predicted values
fit$residuals # residuals = observed - predicted
coef(fit) # coefficients beta_0, beta_1, beta_2
fit
fit$coefficients # access subelements of the object
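The rule-of-thumb interval above (estimate ± 1.96 standard errors) can be checked against R's exact confidence intervals via confint(); a minimal sketch:

```r
fit <- lm(Fertility ~ Education + Catholic, data = swiss)
confint(fit, level = 0.95)   # exact 95% intervals (based on the t-distribution)

# compare with the rule of thumb: estimate +- 1.96 * standard error
est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]
cbind(lower = est - 1.96 * se, upper = est + 1.96 * se)
```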

# Scatterplot
fit <- lm(Fertility ~ Education,data=swiss)
plot(Fertility ~ Education,data=swiss)
abline(fit,col="red")

install.packages("sp")
install.packages("maptools") # note: maptools was archived on CRAN in 2023; pointLabel() now lives in the car package
library(maptools)
pointLabel(swiss$Education, swiss$Fertility, rownames(swiss),
           cex = 0.7, method = "SANN", allowSmallOverlap = FALSE,
           trace = FALSE, doPlot = TRUE)

# 3D SCATTERPLOT
install.packages("car")
library(car)

# A spinning 3D scatterplot
scatter3d(swiss$Catholic, swiss$Fertility, swiss$Education, fit = "linear")

# adding a quadratic fit for fun
scatter3d(swiss$Catholic, swiss$Fertility, swiss$Education,
          fit = c("linear", "quadratic"))

### 18.2.3 Example: Testing the assumptions of the linear regression model (OLS)

## TESTING ASSUMPTIONS OF THE LINEAR REGRESSION MODEL (e.g. Gelman & Hill 2007, 45-47)

# Checking robustness and assumptions
# Good overview: http://www.stat.uni-muenchen.de/~helmut/limo09/limo_09_kap6.pdf
fit <- lm(Fertility ~ Education + Catholic, data=swiss)

# Linearity
# Cause: non-linear relationship between X and Y
# Problem: the linear model does not produce a good fit
# Diagnosis: e.g. a partial residual plot for every variable
crPlots(fit) # partial residual plot
# Formula: http://en.wikipedia.org/wiki/Partial_residual_plot
# Residuals + beta_i*X_i (y-axis) vs. X_i (x-axis)
# beta_i is estimated and plotted while controlling for all other variables
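The quantity that crPlots() draws can also be computed by hand from the formula above; a sketch for the Education term:

```r
fit <- lm(Fertility ~ Education + Catholic, data = swiss)

# partial residuals for Education: residuals + beta_Education * Education
pres <- residuals(fit) + coef(fit)["Education"] * swiss$Education
plot(swiss$Education, pres,
     xlab = "Education", ylab = "Partial residuals (Education)")
abline(a = 0, b = coef(fit)["Education"], col = "red") # slope = beta_Education
```

Regressing the partial residuals on Education recovers exactly the multiple-regression coefficient, because OLS residuals are orthogonal to each regressor.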

ceresPlots(fit)
# Generalization of the component+residual (partial residual) plot

# Independence of errors
# Causes: autocorrelated time series or clustered data
# Problem: wrong t-values
# Diagnosis: among others the Durbin-Watson test, http://de.wikipedia.org/wiki/Durbin-Watson-Test
durbinWatsonTest(fit)
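The Durbin-Watson statistic itself is easy to compute by hand (values near 2 indicate no first-order autocorrelation, values near 0 or 4 indicate positive or negative autocorrelation); a minimal sketch:

```r
fit <- lm(Fertility ~ Education + Catholic, data = swiss)
e <- residuals(fit)

# DW = sum of squared successive differences / sum of squared residuals
dw <- sum(diff(e)^2) / sum(e^2)
dw   # near 2 -> no evidence of first-order autocorrelation
```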

# Constant variance of errors (homoskedasticity)
# Var(e_i) must not depend on i
# Causes: the error variance often increases with x
# Problem: wrong t-values -> problem for hypothesis testing
# Diagnosis: plot e_i vs. fitted values
plot(fit$fitted.values, fit$residuals) # residuals vs. fitted values
# "Solution": calculate robust standard errors (library(sandwich))

# Normality of errors
# Cause: wrong distributional assumption / wrong model
# Problem: wrong t-values, bad predictions
# Diagnosis: plot the residuals (are they normally distributed?)
# QQ plot of studentized residuals vs. theoretical quantiles
qqPlot(fit, main="QQ Plot")

# Outliers
# Cause: special cases, data errors
# Problem: outliers may bias the coefficients
# Diagnosis: e.g. Cook's D plot - http://en.wikipedia.org/wiki/Cook's_distance
# measures the effect of deleting an observation
# identify D values > 4/(n-k-1)
cutoff <- 4/(nrow(swiss) - length(fit$coefficients) - 2)
plot(fit, which=4, cook.levels=cutoff)
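One way to implement the "solution" of robust standard errors mentioned above. The usual route is lmtest::coeftest(fit, vcov = sandwich::vcovHC(fit, type = "HC1")); the dependency-free sketch below computes the same HC1 ("White") covariance matrix by hand:

```r
fit <- lm(Fertility ~ Education + Catholic, data = swiss)

# HC1 robust covariance: (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}, scaled by n/(n-k)
X <- model.matrix(fit)
e <- residuals(fit)
n <- nrow(X); k <- ncol(X)
XtXinv <- solve(t(X) %*% X)
meat <- t(X) %*% (X * e^2)            # X' diag(e^2) X
vc <- (n / (n - k)) * XtXinv %*% meat %*% XtXinv
robust_se <- sqrt(diag(vc))

cbind(classical = sqrt(diag(vcov(fit))), robust = robust_se)
# equivalent: lmtest::coeftest(fit, vcov = sandwich::vcovHC(fit, type = "HC1"))
```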

library(car)
avPlots(fit) # added-variable plots

### 18.2.4 Exercise: Linear regression in R

1. The relationship between child mortality and GDP
1. Load the package car
2. Inspect the data set UN (?UN)
3. Estimate a linear regression of mortality on GDP and keep the estimation results
4. Draw a scatter plot for this relationship
2. Anscombe’s Quartet
1. Load the package car and inspect the data set Quartet
2. Use the data set Quartet to estimate several regressions and compare the coefficients of the following regression models: y1 on x (Model 1), y2 on x (Model 2), y3 on x (Model 3), and y4 on x4 (Model 4)
3. Generate 4 scatter plots plus a regression line for each of the regression models above. What would you conclude when comparing the regression results with the graphs?
# Tip: par(mfrow=c(2,2)) can be used to plot multiple plots in one window!
library(car)
par(mfrow=c(2,2))
plot(Quartet$x, Quartet$y1) # Scatterplot for the two variables
abline(lm(Quartet$y1 ~ Quartet$x))
plot(Quartet$x, Quartet$y2) # Scatterplot for the two variables
abline(lm(Quartet$y2 ~ Quartet$x))
...