18.2 Linear model (Regression)

  • lm(): R function to estimate a linear model (lm stands for “linear model”)
  • lm(y ~ x): Use outcome variable y and explanatory variable x as input
  • lm(y ~ x, data = MyData): If both variables are located in the data frame “MyData”
  • lm(y ~ x1 + x2 + x3, data = MyData): Several independent variables
  • lm(y ~ x1 + x2 + x1:x2, data = MyData): Interaction effects
  • lm(y ~ x1 * x2, data = MyData): Same as above (main effects plus interaction)
  • result <- lm(y ~ x): Save the results
  • Access results by referring to the object name, e.g. result or summary(result)
  • An object of class "lm" contains different components
    • coef(result): The coefficients
    • residuals(result): The residuals
  • IMPORTANT: The option na.action = na.omit (the default) drops cases with missing values listwise; a minimal example follows below
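
A minimal sketch pulling these pieces together, using simulated data (all variable and object names here are hypothetical):

  set.seed(1)
  MyData <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
  MyData$y <- 1 + 2*MyData$x1 - 0.5*MyData$x2 + rnorm(100)

  result <- lm(y ~ x1 + x2, data = MyData, na.action = na.omit)
  coef(result)            # estimated coefficients (intercept, x1, x2)
  head(residuals(result)) # first few residuals
  summary(result)         # full regression table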


18.2.1 Example: Fertility and socio-economic indicators

# Example: Fertility and socio-economic indicators in Swiss provinces (swiss data)

  ?swiss
  summary(swiss$Catholic)
  summary(swiss$Fertility)
  summary(swiss$Education)
  
  fit <- lm(Fertility ~ Education + Catholic, data=swiss) # na.action = na.omit is the default

# summary() is a generic function: given an lm object, it prints a table of regression statistics
  summary(fit)

# Number of observations: nrow(swiss) = 47
# DF (degrees of freedom): 47 observations - 3 estimated parameters = 44
# 95% confidence interval:
# 74.23 (point estimate) +- 2.35 (standard error) * 1.96 (use 2 as a rule of thumb)
  74.23 + 2.35*1.96
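
# confint() computes both bounds of the interval exactly from the t distribution
  confint(fit, level = 0.95)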
  
# t value = estimate divided by its standard error (here 74.23 / 2.35 ≈ 31.6)
# Pr(>|t|): p value, i.e. the probability of observing a t value at least this extreme,
  # given that beta = 0
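
# Hedged sketch: reproduce the t and p values by hand from the summary table
  est  <- coef(summary(fit)) # matrix: Estimate, Std. Error, t value, Pr(>|t|)
  tval <- est[, "Estimate"] / est[, "Std. Error"]
  2 * pt(-abs(tval), df = fit$df.residual) # two-sided p values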

  
# Components of the lm object
  ls(fit)
  names(fit)
  class(fit)

# Access coefficients and other parts
  fit$fitted.values # ^y = y hat = predicted values
  fit$residuals     # residuals = observed - predicted values
  coef(fit)         # Coefficients: intercept (beta0) and slopes (beta1, beta2)
  fit[1]
  fit[1]$coefficients[1] # Access subelements of object
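
# predict() applies the stored coefficients to new data; the values 10 and 50
  # below are hypothetical, chosen only for illustration
  predict(fit, newdata = data.frame(Education = 10, Catholic = 50))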

  
# Scatterplot
  fit <- lm(Fertility ~ Education,data=swiss)
  plot(Fertility ~ Education,data=swiss)
  abline(fit,col="red")

  # Add labels
  install.packages("sp")
  install.packages("maptools")
  library(maptools)
  pointLabel(swiss$Education , swiss$Fertility, rownames(swiss), 
             cex = 0.7, method = "SANN", allowSmallOverlap = FALSE, 
             trace = FALSE, doPlot = TRUE)
  

# 3D SCATTERPLOT
# Load/install package "car"
  install.packages("car")
  library(car)

# Spinning 3D scatterplot (opens an rgl window)
  scatter3d(swiss$Catholic,swiss$Fertility,swiss$Education, 
            fit="linear") 
# add a quadratic fit, just for fun
  scatter3d(swiss$Catholic,swiss$Fertility,swiss$Education, 
            fit=c("linear","quadratic")) 


18.2.2 Checking the assumptions of the linear regression model (OLS)

18.2.3 Example: Testing the assumptions of the linear regression model (OLS)

## TESTING ASSUMPTIONS OF THE LINEAR REGRESSION MODEL (e.g. Gelman & Hill 2007, 45-47)

# Checking robustness and assumptions
  # Good overview: http://www.stat.uni-muenchen.de/~helmut/limo09/limo_09_kap6.pdf
  fit <- lm(Fertility ~ Education + Catholic, data=swiss)

# Linearity
  # Cause: non-linear relationship between X and Y
  # Problem: the linear model does not fit well
  # Diagnosis: e.g. a partial residual plot for every variable
  crPlots(fit) # partial residual plots
  # Formula: http://en.wikipedia.org/wiki/Partial_residual_plot
  # Residuals + beta_i*X_i (y-axis) vs. X_i (x-axis)
  # beta_i is estimated and plotted while controlling for all other variables
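  # Hand-built partial residual for Education (the quantity crPlots displays):
  pr <- residuals(fit) + coef(fit)["Education"] * swiss$Education
  plot(swiss$Education, pr) # compare with the crPlots panel for Education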
  
  ceresPlots(fit)
  # CERES plots: a generalization of component+residual (partial residual) plots



# Independence of errors
  # Causes: autocorrelated time series or clustered data
  # Problem: wrong t values
  # Diagnosis: among others, the Durbin-Watson test http://de.wikipedia.org/wiki/Durbin-Watson-Test
  durbinWatsonTest(fit)



# Constant variance of errors (homoskedasticity)
  # Violation: the variance of e_i depends on i (heteroskedasticity)
  # Causes: the error variance often increases with x
  # Problem: wrong t values -> problem for hypothesis testing
  # Diagnosis: plot the residuals e_i against the fitted values
  spreadLevelPlot(fit)
  summary(fit$fitted.values) 
  # "Solution": calculate robust  standard errors (library(sandwich))
 
# Normality of errors
  # Cause: wrong distributional assumption / wrong model
  # Problem: wrong t values, bad predictions
  # Diagnosis: plot the residuals (are they normally distributed?)
    # QQ plot of studentized residuals vs. theoretical quantiles
    qqPlot(fit, main="QQ Plot")
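
    # A formal complement to the plot: Shapiro-Wilk test on the residuals (base R)
    shapiro.test(residuals(fit))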

    
# Outliers
  # Cause: special cases, data errors
  # Problem: outliers may bias the coefficients
  # Diagnosis: e.g. Cook's D plot - http://en.wikipedia.org/wiki/Cook's_distance
  # measures the effect of deleting an observation
  # a common rule: flag observations with D > 4/(n-k-1), k = number of predictors
  cutoff <- 4/(nrow(swiss) - length(coef(fit))) # n - k - 1 = 47 - 2 - 1 = 44
  plot(fit, which=4, cook.levels=cutoff)
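
  # List the observations whose Cook's distance exceeds the cutoff
  cooks <- cooks.distance(fit) # base R
  cooks[cooks > cutoff]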
  
  # Added variable plots
  library(car)
  avPlots(fit)


18.2.4 Exercise: Linear regression in R

  1. The relationship between Child Mortality and GDP (a starting sketch follows below)
    1. Load the package car
    2. Inspect the data set UN (?UN)
    3. Estimate a linear regression of mortality on GDP and keep the estimation results
    4. Draw a scatter plot for this relationship
  2. Anscombe’s Quartet
    1. Load the package car and inspect the data set Quartet
    2. Use the data set Quartet to estimate several regressions and compare the coefficients of the following regression models: y1 on x (Model 1), y2 on x (Model 2), y3 on x (Model 3) and y4 on x4 (Model 4)!
    3. Generate 4 scatter plots + regression line for the regression models above. What would you conclude comparing the results from the regression with the graphs?
# Tip: par(mfrow=c(2,2)) can be used to plot multiple plots in one window! 
library(car)
par(mfrow=c(2,2))
plot(Quartet$x, Quartet$y1) # Scatterplot for the two variables
abline(lm(Quartet$y1 ~ Quartet$x))  
plot(Quartet$x, Quartet$y2) # Scatterplot for the two variables
abline(lm(Quartet$y2 ~ Quartet$x))  
...
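
A possible starting point for exercise 1 (hedged: the variable names in the UN data set differ across versions of car; older versions use infant.mortality and gdp, newer ones ship the data in carData as infantMortality and ppgdp):

library(car)
?UN
fit.un <- lm(infant.mortality ~ gdp, data = UN) # adjust names to your car version
plot(infant.mortality ~ gdp, data = UN)
abline(fit.un, col = "red")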

18.2.5 Solution: Linear regression in R