Introduction

The file marketing.rda contains data on advertising budgets for youtube, facebook and newspaper (in thousands of USD) plus sales (in thousands of units).
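The code below assumes the following setup (a sketch; the same dataset also ships as marketing in the datarium package):

library("dplyr")   # provides the %>% pipe
library("caret")   # provides createDataPartition
load("marketing.rda")
## alternatively: data("marketing", package = "datarium")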

Divide the data into training and test sets:

training.samples <- marketing$sales %>%
  createDataPartition(p = 0.8, list = FALSE)

train.data <- marketing[training.samples, ]
test.data <- marketing[ -training.samples, ]

Plot

## library("ggplot2")
plot(train.data)

## not working: there is no autoplot() method for a plain data frame
## autoplot(train.data)

Comments here…

Simple regression (we use LaTeX syntax for the math formulas; https://en.wikipedia.org/wiki/LaTeX):

\[y = a + b \cdot x + \epsilon\]

The first argument of the lm function is the formula, which specifies the model. The ~ sign separates the response (dependent) variable on its left from the explanatory (independent) variables on its right.

+ var adds a variable, - var removes a variable, and 1 denotes the intercept (included by default). For example:
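## formula examples (illustrative only):
## sales ~ youtube + facebook   # two regressors
## sales ~ youtube - 1          # drop the intercept (regression through the origin)
## sales ~ .                    # all remaining columns as regressors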

model1 <- lm(sales ~ youtube, data = train.data)

# components of the fitted lm model (formally, an R object):
names(model1)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

Summary statistics of the model are provided by the summary function. With them we can assess (among other things) how accurately the model fits the data.

summary(model1)
## 
## Call:
## lm(formula = sales ~ youtube, data = train.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3136 -2.4114 -0.2963  2.3895  8.8189 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 8.765816   0.603438   14.53 <0.0000000000000002 ***
## youtube     0.045602   0.002963   15.39 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.826 on 160 degrees of freedom
## Multiple R-squared:  0.5969, Adjusted R-squared:  0.5944 
## F-statistic: 236.9 on 1 and 160 DF,  p-value: < 0.00000000000000022

Residual standard error: \(\sqrt{RSS/DF}\), where the residual sum of squares is \(RSS = \sum (y - \hat y)^2\).

\(DF = n - k\), where \(k\) denotes the number of estimated parameters (independent variables plus the intercept); equivalently \(DF = n - k - 1\) if \(k\) counts the independent variables without the intercept. Either way, DF equals the number of observations minus the number of parameters estimated.
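A quick hand check against the summary output above (a minimal sketch):

rss <- sum(residuals(model1)^2)
sqrt(rss / df.residual(model1))   # residual standard error, ~3.826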

In theory the residuals should be normally distributed with mean 0.

Multiple R-squared: the coefficient of determination (współczynnik determinacji in Polish); \(R^2\) is the ratio of the variation explained by the regression line (ESS) to the total variation (TSS), i.e.

\[R^2 = ESS/TSS = \sum (\hat y - \bar y)^2 / \sum (y - \bar y)^2\]

BTW, \(R^2\) is also the squared correlation between \(y\) and \(\hat y\).

BTW2, \(\bar y = \bar{\hat y}\) (a property of the regression line).

Adjusted \(R^2 = 1 - (1 - R^2)\times (n-1)/(n-k-1)\), where \(k\) counts the independent variables without the intercept.

The \(F\)-statistic tests the hypothesis \(H_0\): all coefficients (except the intercept) are equal to 0. Under \(H_0\) it follows an \(F(k-1, n-k)\) distribution (here \(k\) includes the intercept); for small \(p\)-values \(H_0\) should be rejected.
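These statistics can be reproduced by hand (a sketch; k = 2 counts the intercept plus the youtube slope):

y <- train.data$sales; y.hat <- fitted(model1)
n <- length(y); k <- 2
ess <- sum((y.hat - mean(y))^2)
tss <- sum((y - mean(y))^2)
rss <- sum((y - y.hat)^2)
ess / tss                              # multiple R-squared, ~0.597
cor(y, y.hat)^2                        # the same value
1 - (1 - ess/tss) * (n - 1) / (n - k)  # adjusted R-squared
(ess / (k - 1)) / (rss / (n - k))      # F-statistic, ~236.9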

## residuals
residuals(model1)
##            2            3            4            5            6            8 
##  1.279039716  1.452959922  5.143748069 -3.179616840 -0.601900088  0.496557681 
##            9           10           11           12           13           15 
## -3.476427853 -6.979341526 -2.062963084  0.365295432  0.971792399  2.865352362 
##           16           17           19           21           22           24 
##  7.421436823  2.524008918  1.007397626  0.882822730 -6.756901955 -2.658928553 
##           27           28           29           31           32           33 
##  1.414360295 -2.824652305  0.299263234  0.886007518 -0.663969150 -2.564828226 
##           34           35           36           38           39           41 
## -2.420072277 -2.602744698 -9.313603308  4.786424690  0.995651009  0.072908124 
##           42           43           47           48           49           50 
##  2.068328097  0.007701871 -0.954410587  5.946292165 -3.438733966 -0.786740965 
##           51           52           55           56           57           58 
## -6.019341526 -1.419939752  1.098622543  8.789908591 -2.565288795 -0.378999948 
##           59           60           61           62           63           64 
##  8.258712604  1.784184839 -1.973461450  5.975233836 -3.020874423  2.414198839 
##           65           66           67           68           69           70 
##  5.660084046 -1.381657904  0.910430290 -0.308639239  0.923098045  6.130378493 
##           71           72           73           74           75           76 
##  2.298964121  0.105670141  0.327625344 -2.646887956 -0.043565511  0.749376627 
##           77           78           79           80           81           82 
## -1.990680302  1.680140976 -2.701316327 -1.913608441  1.213396692 -7.128235599 
##           83           84           85           86           87           88 
##  0.673591279  3.811175507  5.590962254 -1.098174003  1.458868927  4.376420024 
##           92           93           94           95           97           98 
## -1.570874889  2.601128376  4.144346295 -0.842996215 -5.538952351 -0.283978483 
##           99          100          101          102          103          104 
##  5.861119043  4.475722404 -6.896066677  3.574479286 -6.339018614 -1.408145539 
##          106          107          108          109          110          111 
##  6.727972054 -1.493874423 -3.272716234 -3.122678436  1.018095712 -5.042122674 
##          112          113          114          115          116          117 
##  4.167791932 -1.460532846 -1.155620574  4.474896459  2.244535750 -1.743167004 
##          118          119          120          121          122          123 
## -1.666603308  3.435584746 -1.907429252  2.101916057 -1.394595841 -7.103622440 
##          124          125          126          127          128          129 
##  2.737862861  2.315404625 -0.817604708 -1.272649971 -2.594548245  8.818850262 
##          130          132          133          134          135          136 
## -0.387267797 -8.038183336 -2.385483382  2.726211438  2.174929590  2.511094779 
##          137          138          139          140          141          142 
##  1.233292166  1.216676673  0.401123244  5.956021517  0.297563748  3.674464821 
##          143          147          149          150          151          152 
##  3.287905791 -6.064652305  2.234735003  0.908095246 -4.806379790 -1.467220200 
##          154          155          157          158          159          160 
##  4.660245502 -0.322673303  4.455755535 -4.843223933 -0.646067143 -0.492749365 
##          161          162          164          165          166          167 
## -0.925421320  2.504478820  3.887079846 -0.899275263 -7.318207135 -0.145345725 
##          168          170          171          172          174          175 
## -5.442397988 -6.323380257 -1.421933219 -0.367642505 -3.941059678 -7.136066677 
##          176          178          180          181          182          183 
##  8.481565147 -4.039559911 -2.707837092 -4.735335926 -6.082649505 -1.401211800 
##          184          185          186          187          190          192 
##  6.936035982 -1.534348525  7.136102245 -4.039583709 -1.749123606 -1.017353191 
##          194          195          197          198          199          200 
##  5.626496085  3.802248302 -2.280661171 -3.091671903  6.314925390 -5.386873490
## fitted values
fitted(model1)
##         2         3         4         5         6         8         9        10 
## 11.200960  9.707040 17.056252 18.659617  9.241900 15.343442  9.236428 19.699342 
##        11        12        13        15        16        17        19        21 
## 12.382963 20.514705 10.068208 19.934648 19.458563 12.475991 12.552602 20.717177 
##        22        24        27        28        29        31        32        33 
## 21.756902 21.258929 16.585640 21.904652 22.380737 24.793992 14.943969 14.084828 
##        34        35        36        38        39        41        42        43 
## 23.300072 14.002745 24.673603 12.853575 11.124349 19.847092 18.451672 24.832298 
##        47        48        49        50        51        52        55        56 
## 13.674411 21.893708 21.198734 12.426741 19.699342 14.259940 23.141377 19.650091 
##        57        58        59        60        61        62        63        64 
##  9.165289 16.219000 20.301287 20.295815 11.693461 23.064766 21.860874 14.385801 
##        65        66        67        68        69        70        71        72 
## 15.939916 12.541658 10.489570 16.388639 21.756902 20.629622 19.661036 14.774330 
##        73        74        75        76        77        78        79        80 
## 10.232375 15.846888 20.443566  9.690623 10.270680 15.359859  9.061316 15.113608 
##        81        82        83        84        85        86        87        88 
## 12.946603 21.888236 12.886409 12.508824 20.449038 19.338174 12.941131 14.823580 
##        92        93        94        95        97        98        99       100 
## 10.330875 20.678872 22.495654 14.642996 19.578952 18.883978 24.618881 16.164278 
##       101       102       103       104       106       107       108       109 
## 20.936067 24.985521 24.099019 19.048146 16.312028 10.133874 13.712716  9.482678 
##       110       111       112       113       114       115       116       117 
## 22.741904 21.122123 21.992208 18.380533 20.235621 13.045104 12.875464 16.383167 
##       118       119       120       121       122       123       124       125 
## 12.946603 15.644415  9.827429 16.498084  9.794596 21.023622 15.502137 21.324595 
##       126       127       128       129       130       132       133       134 
## 13.537605  9.192650 13.154548 20.821150 12.027268 23.278183  9.225483 20.793789 
##       135       136       137       138       139       140       141       142 
## 10.785070 11.408905 10.166708 23.743323 11.118877 18.883978 12.782436 19.365535 
##       143       147       149       150       151       152       154       155 
## 20.832094 21.904652 10.845265 11.211905 24.126380 15.387220 18.139754 19.042673 
##       157       158       159       160       161       162       164       165 
## 13.904244 16.963224  9.406067 15.972749 18.205421 13.455521 17.712920 15.179275 
##       166       167       168       170       171       172       174       175 
## 21.598207  9.745346 20.082398 24.323380 11.501933 17.767643 17.981060 20.936067 
##       176       178       180       181       182       183       184       185 
## 23.918435 18.079560 17.827837 17.335336 20.722650 11.841212 24.503964 22.654349 
##       186       187       190       192       194       195       197       198 
## 19.983898 16.399584  9.789124 12.897353 17.893504 16.957752 13.920661 18.451672 
##       199       200 
## 24.285075 21.466873
##
coefficients(model1)["youtube"]
##    youtube 
## 0.04560196
## interval estimates of the coefficients
confint(model1)
##                  2.5 %     97.5 %
## (Intercept) 7.57408431 9.95754694
## youtube     0.03975106 0.05145286
b.est <- coefficients(model1)["youtube"]
a.est <- coefficients(model1)["(Intercept)"]

The fitted equation is: sales = 8.7658 + 0.0456 · youtube. So a unit increase in youtube advertising (one thousand USD) results in a 0.0456 change in sales (thousand units).
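To predict by hand from the stored coefficients (using a.est and b.est defined above):

a.est + b.est * 100                          # predicted sales for youtube = 100
## equivalently:
predict(model1, newdata = data.frame(youtube = 100))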

Default diagnostic plot (the which argument selects the type of plot; 1 = residuals vs fitted):

## which = 3 gives the scale-location plot
plot(model1, which = 1)

More elaborate plots with ggplot2:

library("ggplot2")
library('ggfortify')
# install.packages('ggfortify')
autoplot(model1)

Influential values

plot(model1, which = 4)   # Cook's distance

The main package for specification testing of linear regressions in R is the lmtest package.

Testing for heteroskedasticity in R can be done with the bptest function.

By default (using a regression object as an argument) bptest performs the (generalized) Breusch-Pagan test:

library(lmtest)
bptest(model1)
## 
##  studentized Breusch-Pagan test
## 
## data:  model1
## BP = 40.252, df = 1, p-value = 0.0000000002232

Since the \(p\)-value is (much) less than 0.05, we reject the null hypothesis of homoskedasticity: there is evidence of heteroskedasticity in the data.
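Given this result, one common remedy (a sketch, not part of the original analysis) is to report heteroskedasticity-consistent standard errors from the sandwich package:

library("sandwich")                       # provides vcovHC
coeftest(model1, vcov = vcovHC(model1))   # robust (HC3) standard errors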

The Ramsey RESET test checks the functional form by evaluating whether higher-order terms have any explanatory value:

resettest(model1)
## 
##  RESET test
## 
## data:  model1
## RESET = 1.8041, df1 = 2, df2 = 158, p-value = 0.168

A low \(p\)-value indicates functional form misspecification; here \(p = 0.168\), so there is no evidence of misspecification.

Testing for autocorrelation: Breusch-Godfrey test. \(H_0\): There is no autocorrelation at any order less than or equal to \(p\).

bgtest(model1)
## 
##  Breusch-Godfrey test for serial correlation of order up to 1
## 
## data:  model1
## LM test = 0.39912, df = 1, p-value = 0.5275

If the \(p\)-value were less than 0.05, we could reject the null hypothesis and conclude that the residuals are autocorrelated; here \(p = 0.53\), so we do not reject \(H_0\).

Autocorrelation: the Durbin-Watson test examines whether the residuals are autocorrelated with themselves. \(H_0\) states that the residuals are not autocorrelated. This test is especially useful in time series regressions.

dwtest(model1)
## 
##  Durbin-Watson test
## 
## data:  model1
## DW = 2.0856, p-value = 0.7081
## alternative hypothesis: true autocorrelation is greater than 0
## or, equivalently:
## library("car")
## durbinWatsonTest(model1)

Again, a \(p\)-value below 0.05 would indicate autocorrelated residuals; here \(p = 0.71\), so we do not reject \(H_0\).

Multicollinearity. VIFs exceeding 4 warrant further investigation, while VIFs exceeding 10 are signs of serious multicollinearity requiring correction.

## vif() is not applicable to a model with a single regressor
## vif(model1)

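The VIF of regressor \(j\) is \(1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing regressor \(j\) on the remaining regressors. A minimal hand computation for the two marketing regressors (anticipating the two-regressor model fitted below):

r2.j <- summary(lm(youtube ~ facebook, data = train.data))$r.squared
1 / (1 - r2.j)    # should match vif() for the model with both regressors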

Predictions

predictions1 <- predict(model1, test.data)

predictions1
##         1         7        14        18        20        23        25        26 
## 21.357429 11.912351 14.101245 24.164685 16.826418  9.488151 12.175018 23.152322 
##        30        37        40        44        45        46        53        54 
## 12.629214 23.371211 21.242512 20.087870 10.139347 18.347699 20.607733 18.758117 
##        89        90        91        96       105       131       144       145 
## 13.597799 14.774330 16.115027 17.701976 21.800680  8.804121 14.489774 14.030106 
##       146       148       153       156       163       169       173       177 
## 16.443362 22.074292 19.578952  8.990177 19.075507 20.553010  9.838374 22.358848 
##       179       188       189       191       193       196 
## 23.907490 19.223257 24.416408 10.927349  9.707040 10.856209
plot(sales ~ youtube, data = train.data)
points(predictions1 ~ test.data$youtube, col = 'blue')
## sort by youtube so the prediction line is drawn left to right:
o <- order(test.data$youtube)
lines(predictions1[o] ~ test.data$youtube[o], col = 'red')
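To quantify out-of-sample accuracy (a minimal sketch; caret also offers the RMSE() and R2() helpers):

sqrt(mean((test.data$sales - predictions1)^2))   # test RMSE
cor(test.data$sales, predictions1)^2             # test R-squared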

Multiple regression

Add facebook as 2nd regressor:

model2 <- lm(sales ~ youtube + facebook, data=train.data)

summary(model2)
## 
## Call:
## lm(formula = sales ~ youtube + facebook, data = train.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3818 -1.1099  0.2868  1.3017  3.2805 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 3.680058   0.374204   9.834 <0.0000000000000002 ***
## youtube     0.045175   0.001459  30.954 <0.0000000000000002 ***
## facebook    0.185245   0.008281  22.371 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.884 on 159 degrees of freedom
## Multiple R-squared:  0.9028, Adjusted R-squared:  0.9016 
## F-statistic: 738.5 on 2 and 159 DF,  p-value: < 0.00000000000000022
## multicollinearity
library("car")
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
vif(model2)
##  youtube facebook 
## 1.000171 1.000171
## correlation matrix

corMatrix <- cor(train.data)
corMatrix
##              youtube   facebook  newspaper     sales
## youtube   1.00000000 0.01308086 0.05246311 0.7725942
## facebook  0.01308086 1.00000000 0.35875164 0.5631501
## newspaper 0.05246311 0.35875164 1.00000000 0.2110232
## sales     0.77259423 0.56315013 0.21102318 1.0000000
##
library(corrplot)
## corrplot 0.84 loaded
corrplot(corMatrix, method = 'number')

corrplot(corMatrix, method = 'color', order = 'alphabet')

Model comparison/selection

The anova function compares nested models. For linear regression models we apply the F-test; for logistic regression models we apply the chi-square test.

By nested we mean that the independent variables of the simpler model are a subset of those of the more complex model. Note that the models must be fitted on the same dataset.

We test if we should include facebook or not:

anova(model2, model1, test="F") 
## Analysis of Variance Table
## 
## Model 1: sales ~ youtube + facebook
## Model 2: sales ~ youtube
##   Res.Df     RSS Df Sum of Sq      F                Pr(>F)    
## 1    159  564.57                                              
## 2    160 2341.60 -1     -1777 500.47 < 0.00000000000000022 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We should prefer model2, i.e. the model that includes facebook.

model3 <- lm(sales ~ youtube + facebook + newspaper, data=train.data)
summary(model3)
## 
## Call:
## lm(formula = sales ~ youtube + facebook + newspaper, data = train.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1078 -1.0636  0.2746  1.3225  3.3259 
## 
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  3.822044   0.392042   9.749 <0.0000000000000002 ***
## youtube      0.045264   0.001459  31.017 <0.0000000000000002 ***
## facebook     0.189049   0.008858  21.341 <0.0000000000000002 ***
## newspaper   -0.007113   0.005938  -1.198               0.233    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.882 on 158 degrees of freedom
## Multiple R-squared:  0.9037, Adjusted R-squared:  0.9019 
## F-statistic: 494.2 on 3 and 158 DF,  p-value: < 0.00000000000000022
## already tested
##anova(model2, model1, test="F") 

anova(model3, model2, model1, test="F") 
## Analysis of Variance Table
## 
## Model 1: sales ~ youtube + facebook + newspaper
## Model 2: sales ~ youtube + facebook
## Model 3: sales ~ youtube
##   Res.Df     RSS Df Sum of Sq        F              Pr(>F)    
## 1    158  559.49                                              
## 2    159  564.57 -1     -5.08   1.4348              0.2328    
## 3    160 2341.60 -1  -1777.03 501.8375 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
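The newspaper term adds nothing (\(p \approx 0.23\)), so model2 remains the preferred model. The same ordering can be cross-checked with AIC (a sketch; smaller is better, see the AIC discussion further below):

AIC(model1, model2, model3)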

Multiple regression (another example)

library("AER")
data("CPS1988")

str(CPS1988)
## 'data.frame':    28155 obs. of  7 variables:
##  $ wage      : num  355 123 370 755 594 ...
##  $ education : int  7 12 9 11 12 16 8 12 12 14 ...
##  $ experience: int  45 1 9 46 36 22 51 34 0 18 ...
##  $ ethnicity : Factor w/ 2 levels "cauc","afam": 1 1 1 1 1 1 1 1 1 1 ...
##  $ smsa      : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ region    : Factor w/ 4 levels "northeast","midwest",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ parttime  : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
summary(CPS1988)
##       wage            education       experience   ethnicity     smsa      
##  Min.   :   50.05   Min.   : 0.00   Min.   :-4.0   cauc:25923   no : 7223  
##  1st Qu.:  308.64   1st Qu.:12.00   1st Qu.: 8.0   afam: 2232   yes:20932  
##  Median :  522.32   Median :12.00   Median :16.0                           
##  Mean   :  603.73   Mean   :13.07   Mean   :18.2                           
##  3rd Qu.:  783.48   3rd Qu.:15.00   3rd Qu.:27.0                           
##  Max.   :18777.20   Max.   :18.00   Max.   :63.0                           
##        region     parttime   
##  northeast:6441   no :25631  
##  midwest  :6863   yes: 2524  
##  south    :8760              
##  west     :6091              
##                              
## 

Inspect some correlations

## use dplyr
## https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
corMatrix <- CPS1988 %>% select (wage, education, experience) %>% cor()
## wage: USD per week
## education and experience are measured in years
## ethnicity is a factor

corMatrix
##                 wage  education experience
## wage       1.0000000  0.3016440  0.1942204
## education  0.3016440  1.0000000 -0.2867064
## experience 0.1942204 -0.2867064  1.0000000
corrplot(corMatrix, method = 'color', order = 'alphabet')

corrplot(corMatrix, method = 'number')

Categorical variables

Ethnicity is categorical

levels(CPS1988$ethnicity)
## [1] "cauc" "afam"
## Parttime
levels(CPS1988$parttime)
## [1] "no"  "yes"
## this is pretty big
nrow(CPS1988)
## [1] 28155
## later
##plot(CPS1988)
##

The model

model3w <- lm(log(wage) ~ experience + I(experience^2) + education + ethnicity,
              data = CPS1988)

NOTE: inside a formula, the operators +, *, / and ^ have special meanings. To include a literal arithmetic term such as x^2, one has to wrap it in the I() function, as in the example above.
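An equivalent way to include the quadratic term (a sketch) uses poly() with raw = TRUE:

model3p <- lm(log(wage) ~ poly(experience, 2, raw = TRUE) + education + ethnicity,
              data = CPS1988)
## the coefficients match model3w; raw = TRUE keeps untransformed powers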

summary(model3w)
## 
## Call:
## lm(formula = log(wage) ~ experience + I(experience^2) + education + 
##     ethnicity, data = CPS1988)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9428 -0.3162  0.0580  0.3756  4.3830 
## 
## Coefficients:
##                    Estimate  Std. Error t value            Pr(>|t|)    
## (Intercept)      4.32139500  0.01917421  225.38 <0.0000000000000002 ***
## experience       0.07747323  0.00088005   88.03 <0.0000000000000002 ***
## I(experience^2) -0.00131607  0.00001899  -69.31 <0.0000000000000002 ***
## education        0.08567282  0.00127219   67.34 <0.0000000000000002 ***
## ethnicityafam   -0.24336430  0.01291812  -18.84 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5839 on 28150 degrees of freedom
## Multiple R-squared:  0.3347, Adjusted R-squared:  0.3346 
## F-statistic:  3541 on 4 and 28150 DF,  p-value: < 0.00000000000000022

ethnicityafam means that afam is coded 1 and cauc 0, so the negative coefficient can be interpreted as the difference in (log) wage between afam (1) and cauc (0).
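In a log-wage model a dummy coefficient translates into an approximate percentage difference; the exact figure is \(e^{coef} - 1\) (a quick check):

(exp(coef(model3w)["ethnicityafam"]) - 1) * 100   # about -21.6 percent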

In R, factors are automatically handled as a set of dichotomous (dummy) variables. If a factor contains \(l > 2\) levels, \(l-1\) dummy variables are created.
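The expansion can be inspected with model.matrix (a sketch):

head(model.matrix(~ ethnicity + region, data = CPS1988), 3)
## ethnicity (2 levels) -> 1 dummy; region (4 levels) -> 3 dummies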

## Simpler model w/o ethnicity
model3a <- lm(log(wage) ~ experience + I(experience^2) + education,
              data = CPS1988)
summary(model3a)
## 
## Call:
## lm(formula = log(wage) ~ experience + I(experience^2) + education, 
##     data = CPS1988)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9333 -0.3176  0.0594  0.3800  4.4178 
## 
## Coefficients:
##                    Estimate  Std. Error t value            Pr(>|t|)    
## (Intercept)      4.27808974  0.01915521  223.34 <0.0000000000000002 ***
## experience       0.07752025  0.00088556   87.54 <0.0000000000000002 ***
## I(experience^2) -0.00131597  0.00001911  -68.88 <0.0000000000000002 ***
## education        0.08744110  0.00127667   68.49 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5876 on 28151 degrees of freedom
## Multiple R-squared:  0.3264, Adjusted R-squared:  0.3263 
## F-statistic:  4546 on 3 and 28151 DF,  p-value: < 0.00000000000000022

BTW, setting reference groups for factors: the first level of the factor is treated as the omitted reference group; use the relevel() function to change it, for example:
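A non-destructive sketch (refitting with eth2 would make afam the baseline and flip the sign of the ethnicity coefficient):

eth2 <- relevel(CPS1988$ethnicity, ref = "afam")
levels(eth2)   # "afam" "cauc"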

Note that the models are nested:

anova(model3a, model3w)
## Analysis of Variance Table
## 
## Model 1: log(wage) ~ experience + I(experience^2) + education
## Model 2: log(wage) ~ experience + I(experience^2) + education + ethnicity
##   Res.Df    RSS Df Sum of Sq      F                Pr(>F)    
## 1  28151 9719.6                                              
## 2  28150 9598.6  1    121.02 354.91 < 0.00000000000000022 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Akaike information criterion (\(AIC = 2k - 2\ln\hat L\)) is a metric used to compare the fit of several regression models; the smaller the AIC, the better the fit.

AIC(model3a, model3w)
##         df      AIC
## model3a  5 49965.43
## model3w  6 49614.68

Ethnicity is significant at any reasonable level.

Categorical variables with more than two levels

Nine-month salary of academic staff in the US (Salaries dataset from the car library):

library("car")
head(Salaries, n=9)
##        rank discipline yrs.since.phd yrs.service  sex salary
## 1      Prof          B            19          18 Male 139750
## 2      Prof          B            20          16 Male 173200
## 3  AsstProf          B             4           3 Male  79750
## 4      Prof          B            45          39 Male 115000
## 5      Prof          B            40          41 Male 141500
## 6 AssocProf          B             6           6 Male  97000
## 7      Prof          B            30          23 Male 175000
## 8      Prof          B            45          45 Male 147765
## 9      Prof          B            21          20 Male 119250
model.9 <- lm(salary ~ yrs.service + rank + discipline + sex, data=Salaries)
summary(model.9)
## 
## Call:
## lm(formula = salary ~ yrs.service + rank + discipline + sex, 
##     data = Salaries)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -64202 -14255  -1533  10571  99163 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   68351.67    4482.20  15.250 < 0.0000000000000002 ***
## yrs.service     -88.78     111.64  -0.795             0.426958    
## rankAssocProf 14560.40    4098.32   3.553             0.000428 ***
## rankProf      49159.64    3834.49  12.820 < 0.0000000000000002 ***
## disciplineB   13473.38    2315.50   5.819         0.0000000124 ***
## sexMale        4771.25    3878.00   1.230             0.219311    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22650 on 391 degrees of freedom
## Multiple R-squared:  0.4478, Adjusted R-squared:  0.4407 
## F-statistic: 63.41 on 5 and 391 DF,  p-value: < 0.00000000000000022
anova(model.9)
## Analysis of Variance Table
## 
## Response: salary
##              Df       Sum Sq     Mean Sq  F value                Pr(>F)    
## yrs.service   1  40709289442 40709289442  79.3405 < 0.00000000000000022 ***
## rank          2 103577124570 51788562285 100.9335 < 0.00000000000000022 ***
## discipline    1  17617147945 17617147945  34.3350        0.000000009861 ***
## sex           1    776686259   776686259   1.5137                0.2193    
## Residuals   391 200620394344   513095638                                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interactions

In a formula, a*b expands to a + b + a:b, i.e. both main effects plus their interaction (a:b alone denotes just the interaction term):

model3b <- lm(log(wage) ~ experience + I(experience^2) + education*ethnicity,
              data = CPS1988)
summary(model3b)
## 
## Call:
## lm(formula = log(wage) ~ experience + I(experience^2) + education * 
##     ethnicity, data = CPS1988)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9451 -0.3162  0.0578  0.3761  4.3929 
## 
## Coefficients:
##                            Estimate  Std. Error t value            Pr(>|t|)    
## (Intercept)              4.31305918  0.01958965 220.170 <0.0000000000000002 ***
## experience               0.07751973  0.00088028  88.063 <0.0000000000000002 ***
## I(experience^2)         -0.00131787  0.00001901 -69.339 <0.0000000000000002 ***
## education                0.08631178  0.00130887  65.944 <0.0000000000000002 ***
## ethnicityafam           -0.12388689  0.05902575  -2.099              0.0358 *  
## education:ethnicityafam -0.00964814  0.00465096  -2.074              0.0380 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5839 on 28149 degrees of freedom
## Multiple R-squared:  0.3348, Adjusted R-squared:  0.3347 
## F-statistic:  2834 on 5 and 28149 DF,  p-value: < 0.00000000000000022
anova(model3b, model3w)
## Analysis of Variance Table
## 
## Model 1: log(wage) ~ experience + I(experience^2) + education * ethnicity
## Model 2: log(wage) ~ experience + I(experience^2) + education + ethnicity
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1  28149 9597.2                              
## 2  28150 9598.6 -1   -1.4672 4.3033 0.03805 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction education*ethnicity is significant at the 5% level (\(p \approx 0.038\)).

References

https://www.zeileis.org/teaching/AER/

See also https://rpubs.com/dvallslanaquera/lm_cigarette