marketing.rda
contains data on advertising budgets for youtube, facebook and newspaper (in thousands of USD) plus sales (in thousands of units).
Divide data into training and test sets
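The split below relies on createDataPartition() from the caret package and the %>% pipe from dplyr; a minimal setup sketch, assuming marketing.rda sits in the working directory (the same data set also ships with the datarium package):
library(caret)        # createDataPartition()
library(dplyr)        # the %>% pipe
load("marketing.rda") # loads the 'marketing' data frame
set.seed(123)         # added here so the 80/20 split is reproducible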
training.samples <- marketing$sales %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- marketing[training.samples, ]
test.data <- marketing[ -training.samples, ]
Plot
## library("ggplot2")
plot(train.data)
## Not working :-(
##autoplot(train.data)
Simple regression (We use LaTeX syntax for composite math equations; https://en.wikipedia.org/wiki/LaTeX):
\[y = a + b \cdot x + \epsilon\]
The first argument of the lm function is called the ‘formula’ argument and specifies the model. The ~ sign separates the response/dependent variable (left side) from the explanatory/independent variables (right side). Within the formula, + var adds a variable, - var removes a variable, and 1 denotes the intercept (included by default).
model1 <- lm(sales ~ youtube, data = train.data)
# components of lm model (formally R `object`):
names(model1)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
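Each listed component can be accessed with $ or with the matching extractor function; a short sketch:
model1$coefficients        # same as coefficients(model1)
head(model1$fitted.values) # same as head(fitted(model1))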
Summary statistics of the model are provided by the summary function in R. With them, we can assess (among other things) the prediction accuracy of the model.
summary(model1)
##
## Call:
## lm(formula = sales ~ youtube, data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3136 -2.4114 -0.2963 2.3895 8.8189
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.765816 0.603438 14.53 <0.0000000000000002 ***
## youtube 0.045602 0.002963 15.39 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.826 on 160 degrees of freedom
## Multiple R-squared: 0.5969, Adjusted R-squared: 0.5944
## F-statistic: 236.9 on 1 and 160 DF, p-value: < 0.00000000000000022
Call: model specification
Residuals: distribution of residuals
Coefficients: estimated coefficients, their standard errors, t-values and probabilities
Residual standard error: \(\sqrt { RSS/DF }\); Residual sum of squares \(RSS = \sum ( y - \hat y)^2\)
\(DF = n - k\), where \(k\) denotes the number of independent variables plus the intercept (or \(n - k - 1\) if \(k\) counts the independent variables without the intercept); either way, DF is the number of observations minus the number of parameters estimated in the model
Distribution of residuals in theory should be normal with mean 0
Multiple R-squared: coefficient of determination (Polish: współczynnik determinacji); \(R^2\) is the proportion of the variation explained by the regression line (ESS) relative to the total variation (TSS), i.e.
\[R^2 = ESS/TSS = \sum (\hat y - \bar y)^2 / \sum (y - \bar y)^2\]
BTW \(R^2\) is also squared correlation between \(y\) and \(\hat y\)
BTW2 \(\bar y = \bar{\hat y}\) (property of regression line)
Adjusted \(R^2 = 1 - (1- R^2)\times (n-1)/(n- k -1)\) (here \(k\) counts the independent variables without the intercept)
\(F\)-statistic: tests the hypothesis \(H_0:\) all coefficients (except the intercept) are equal to 0; it follows the \(F(k-1, n-k)\) distribution, and for a small corresponding probability \(H_0\) should be rejected (these statistics are recomputed by hand in the sketch below)
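As a sanity check, the headline numbers from summary(model1) can be recomputed directly from the formulas above; a minimal sketch using only objects already defined:
y <- train.data$sales
y.hat <- fitted(model1)
rss <- sum((y - y.hat)^2)        # residual sum of squares
tss <- sum((y - mean(y))^2)      # total variation
n <- length(y); p <- 2           # parameters estimated: intercept + slope
sqrt(rss / (n - p))              # residual standard error (3.826 above)
r2 <- 1 - rss / tss              # multiple R-squared (0.5969 above)
1 - (1 - r2) * (n - 1) / (n - p) # adjusted R-squared (0.5944 above)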
## residuals
residuals(model1)
## 2 3 4 5 6 8
## 1.279039716 1.452959922 5.143748069 -3.179616840 -0.601900088 0.496557681
## 9 10 11 12 13 15
## -3.476427853 -6.979341526 -2.062963084 0.365295432 0.971792399 2.865352362
## 16 17 19 21 22 24
## 7.421436823 2.524008918 1.007397626 0.882822730 -6.756901955 -2.658928553
## 27 28 29 31 32 33
## 1.414360295 -2.824652305 0.299263234 0.886007518 -0.663969150 -2.564828226
## 34 35 36 38 39 41
## -2.420072277 -2.602744698 -9.313603308 4.786424690 0.995651009 0.072908124
## 42 43 47 48 49 50
## 2.068328097 0.007701871 -0.954410587 5.946292165 -3.438733966 -0.786740965
## 51 52 55 56 57 58
## -6.019341526 -1.419939752 1.098622543 8.789908591 -2.565288795 -0.378999948
## 59 60 61 62 63 64
## 8.258712604 1.784184839 -1.973461450 5.975233836 -3.020874423 2.414198839
## 65 66 67 68 69 70
## 5.660084046 -1.381657904 0.910430290 -0.308639239 0.923098045 6.130378493
## 71 72 73 74 75 76
## 2.298964121 0.105670141 0.327625344 -2.646887956 -0.043565511 0.749376627
## 77 78 79 80 81 82
## -1.990680302 1.680140976 -2.701316327 -1.913608441 1.213396692 -7.128235599
## 83 84 85 86 87 88
## 0.673591279 3.811175507 5.590962254 -1.098174003 1.458868927 4.376420024
## 92 93 94 95 97 98
## -1.570874889 2.601128376 4.144346295 -0.842996215 -5.538952351 -0.283978483
## 99 100 101 102 103 104
## 5.861119043 4.475722404 -6.896066677 3.574479286 -6.339018614 -1.408145539
## 106 107 108 109 110 111
## 6.727972054 -1.493874423 -3.272716234 -3.122678436 1.018095712 -5.042122674
## 112 113 114 115 116 117
## 4.167791932 -1.460532846 -1.155620574 4.474896459 2.244535750 -1.743167004
## 118 119 120 121 122 123
## -1.666603308 3.435584746 -1.907429252 2.101916057 -1.394595841 -7.103622440
## 124 125 126 127 128 129
## 2.737862861 2.315404625 -0.817604708 -1.272649971 -2.594548245 8.818850262
## 130 132 133 134 135 136
## -0.387267797 -8.038183336 -2.385483382 2.726211438 2.174929590 2.511094779
## 137 138 139 140 141 142
## 1.233292166 1.216676673 0.401123244 5.956021517 0.297563748 3.674464821
## 143 147 149 150 151 152
## 3.287905791 -6.064652305 2.234735003 0.908095246 -4.806379790 -1.467220200
## 154 155 157 158 159 160
## 4.660245502 -0.322673303 4.455755535 -4.843223933 -0.646067143 -0.492749365
## 161 162 164 165 166 167
## -0.925421320 2.504478820 3.887079846 -0.899275263 -7.318207135 -0.145345725
## 168 170 171 172 174 175
## -5.442397988 -6.323380257 -1.421933219 -0.367642505 -3.941059678 -7.136066677
## 176 178 180 181 182 183
## 8.481565147 -4.039559911 -2.707837092 -4.735335926 -6.082649505 -1.401211800
## 184 185 186 187 190 192
## 6.936035982 -1.534348525 7.136102245 -4.039583709 -1.749123606 -1.017353191
## 194 195 197 198 199 200
## 5.626496085 3.802248302 -2.280661171 -3.091671903 6.314925390 -5.386873490
## fitted values
fitted(model1)
## 2 3 4 5 6 8 9 10
## 11.200960 9.707040 17.056252 18.659617 9.241900 15.343442 9.236428 19.699342
## 11 12 13 15 16 17 19 21
## 12.382963 20.514705 10.068208 19.934648 19.458563 12.475991 12.552602 20.717177
## 22 24 27 28 29 31 32 33
## 21.756902 21.258929 16.585640 21.904652 22.380737 24.793992 14.943969 14.084828
## 34 35 36 38 39 41 42 43
## 23.300072 14.002745 24.673603 12.853575 11.124349 19.847092 18.451672 24.832298
## 47 48 49 50 51 52 55 56
## 13.674411 21.893708 21.198734 12.426741 19.699342 14.259940 23.141377 19.650091
## 57 58 59 60 61 62 63 64
## 9.165289 16.219000 20.301287 20.295815 11.693461 23.064766 21.860874 14.385801
## 65 66 67 68 69 70 71 72
## 15.939916 12.541658 10.489570 16.388639 21.756902 20.629622 19.661036 14.774330
## 73 74 75 76 77 78 79 80
## 10.232375 15.846888 20.443566 9.690623 10.270680 15.359859 9.061316 15.113608
## 81 82 83 84 85 86 87 88
## 12.946603 21.888236 12.886409 12.508824 20.449038 19.338174 12.941131 14.823580
## 92 93 94 95 97 98 99 100
## 10.330875 20.678872 22.495654 14.642996 19.578952 18.883978 24.618881 16.164278
## 101 102 103 104 106 107 108 109
## 20.936067 24.985521 24.099019 19.048146 16.312028 10.133874 13.712716 9.482678
## 110 111 112 113 114 115 116 117
## 22.741904 21.122123 21.992208 18.380533 20.235621 13.045104 12.875464 16.383167
## 118 119 120 121 122 123 124 125
## 12.946603 15.644415 9.827429 16.498084 9.794596 21.023622 15.502137 21.324595
## 126 127 128 129 130 132 133 134
## 13.537605 9.192650 13.154548 20.821150 12.027268 23.278183 9.225483 20.793789
## 135 136 137 138 139 140 141 142
## 10.785070 11.408905 10.166708 23.743323 11.118877 18.883978 12.782436 19.365535
## 143 147 149 150 151 152 154 155
## 20.832094 21.904652 10.845265 11.211905 24.126380 15.387220 18.139754 19.042673
## 157 158 159 160 161 162 164 165
## 13.904244 16.963224 9.406067 15.972749 18.205421 13.455521 17.712920 15.179275
## 166 167 168 170 171 172 174 175
## 21.598207 9.745346 20.082398 24.323380 11.501933 17.767643 17.981060 20.936067
## 176 178 180 181 182 183 184 185
## 23.918435 18.079560 17.827837 17.335336 20.722650 11.841212 24.503964 22.654349
## 186 187 190 192 194 195 197 198
## 19.983898 16.399584 9.789124 12.897353 17.893504 16.957752 13.920661 18.451672
## 199 200
## 24.285075 21.466873
##
coefficients(model1)["youtube"]
## youtube
## 0.04560196
## confidence intervals for the coefficients
confint(model1)
## 2.5 % 97.5 %
## (Intercept) 7.57408431 9.95754694
## youtube 0.03975106 0.05145286
b.est <- coefficients(model1)["youtube"]
a.est <- coefficients(model1)["(Intercept)"]
The estimated equation is: sales = 8.7658156 + 0.045602 · youtube. So a unit (thousand USD) increase in advertising on youtube results in a 0.045602 (thousand units) increase in sales.
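With the stored coefficients a prediction can be computed by hand; a small sketch using a hypothetical budget of 100 (thousand USD):
a.est + b.est * 100 # predicted sales, by hand
## equivalently
predict(model1, newdata = data.frame(youtube = 100))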
Default plot (2nd argument indicates type of plot; 1 – residuals vs fitted):
## which=3
plot(model1, which = 1)
More elaborate plots with ggplot2:
library("ggplot2")
library('ggfortify')
# install.packages('ggfortify')
autoplot(model1)
Influential values (which=4 plots Cook’s distance):
plot(model1, which = 4)
The main package for specification testing of linear regressions in R is the lmtest package.
Testing for heteroskedasticity in R can be done with the bptest function. By default (using a regression object as an argument) bptest performs the (generalized) Breusch-Pagan test:
library(lmtest)
bptest(model1)
##
## studentized Breusch-Pagan test
##
## data: model1
## BP = 40.252, df = 1, p-value = 0.0000000002232
Since the \(p\)-value is (much) less than 0.05, we reject the null hypothesis of homoskedasticity: there is heteroskedasticity in the data.
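A common remedy (an addition here, not part of the original notes) is to keep the OLS coefficients but report heteroskedasticity-consistent standard errors, e.g. via the sandwich package:
library(sandwich)
## coefficient table with robust (HC1) standard errors
coeftest(model1, vcov = vcovHC(model1, type = "HC1"))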
The Ramsey RESET test checks the functional form by evaluating whether higher-order terms have any explanatory value:
resettest(model1)
##
## RESET test
##
## data: model1
## RESET = 1.8041, df1 = 2, df2 = 158, p-value = 0.168
A low \(p\)-value indicates functional form misspecification; here \(p = 0.168\), so there is no evidence of misspecification.
Testing for autocorrelation: Breusch-Godfrey test. \(H_0\): There is no autocorrelation at any order less than or equal to \(p\).
bgtest(model1)
##
## Breusch-Godfrey test for serial correlation of order up to 1
##
## data: model1
## LM test = 0.39912, df = 1, p-value = 0.5275
If the \(p\)-value is less than 0.05, we can reject the null hypothesis and conclude that the residuals in the regression model are autocorrelated; here \(p = 0.5275\), so there is no evidence of autocorrelation.
Autocorrelation: the Durbin-Watson test examines whether residuals are autocorrelated with themselves. The \(H_0\) states that residuals are not autocorrelated. This test can be especially useful when you conduct a multiple (time series) regression.
dwtest(model1)
##
## Durbin-Watson test
##
## data: model1
## DW = 2.0856, p-value = 0.7081
## alternative hypothesis: true autocorrelation is greater than 0
## or
#library("car")
#durbinWatsonTest(model1)
If the \(p\)-value is less than 0.05, we can reject the null hypothesis and conclude that the residuals in the regression model are autocorrelated; here \(p = 0.7081\), so again there is no evidence of autocorrelation.
Multicollinearity. VIFs exceeding 4 warrant further investigation, while VIFs exceeding 10 are signs of serious multicollinearity requiring correction.
## Not appropriate for model with 1 regressor
##vif(model1)
predictions1 <- predict(model1, test.data)
predictions1
## 1 7 14 18 20 23 25 26
## 21.357429 11.912351 14.101245 24.164685 16.826418 9.488151 12.175018 23.152322
## 30 37 40 44 45 46 53 54
## 12.629214 23.371211 21.242512 20.087870 10.139347 18.347699 20.607733 18.758117
## 89 90 91 96 105 131 144 145
## 13.597799 14.774330 16.115027 17.701976 21.800680 8.804121 14.489774 14.030106
## 146 148 153 156 163 169 173 177
## 16.443362 22.074292 19.578952 8.990177 19.075507 20.553010 9.838374 22.358848
## 179 188 189 191 193 196
## 23.907490 19.223257 24.416408 10.927349 9.707040 10.856209
plot(sales ~ youtube, data = train.data)
points(predictions1 ~ test.data$youtube, col = 'blue')
lines(predictions1 ~ test.data$youtube, col = 'red')
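Since the data were split precisely so that the model can be judged out of sample, a minimal sketch of test-set accuracy (these two metrics are an addition to the original notes):
sqrt(mean((test.data$sales - predictions1)^2)) # RMSE on the test set
cor(test.data$sales, predictions1)^2           # test-set R-squared (squared correlation)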
Add facebook as 2nd regressor:
model2 <- lm(sales ~ youtube + facebook, data=train.data)
summary(model2)
##
## Call:
## lm(formula = sales ~ youtube + facebook, data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3818 -1.1099 0.2868 1.3017 3.2805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.680058 0.374204 9.834 <0.0000000000000002 ***
## youtube 0.045175 0.001459 30.954 <0.0000000000000002 ***
## facebook 0.185245 0.008281 22.371 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.884 on 159 degrees of freedom
## Multiple R-squared: 0.9028, Adjusted R-squared: 0.9016
## F-statistic: 738.5 on 2 and 159 DF, p-value: < 0.00000000000000022
## multicollinearity
library("car")
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
vif(model2)
## youtube facebook
## 1.000171 1.000171
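VIF can also be computed by hand as \(1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing one regressor on the remaining ones; a small sketch for youtube:
r2.youtube <- summary(lm(youtube ~ facebook, data = train.data))$r.squared
1 / (1 - r2.youtube) # should match vif(model2)["youtube"]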
## correlation matrix
corMatrix <- cor(train.data)
corMatrix
## youtube facebook newspaper sales
## youtube 1.00000000 0.01308086 0.05246311 0.7725942
## facebook 0.01308086 1.00000000 0.35875164 0.5631501
## newspaper 0.05246311 0.35875164 1.00000000 0.2110232
## sales 0.77259423 0.56315013 0.21102318 1.0000000
##
library(corrplot)
## corrplot 0.84 loaded
corrplot(corMatrix, method = 'number')
corrplot(corMatrix, method = 'color', order = 'alphabet')
The anova function compares nested models. For linear regression models we apply the F-test; for logistic regression models, the chi-square test.
By nested we mean that the independent variables of the simpler model are a subset of those of the more complex model. Note that the models must be fitted on the same dataset.
We test if we should include facebook or not:
anova(model2, model1, test="F")
## Analysis of Variance Table
##
## Model 1: sales ~ youtube + facebook
## Model 2: sales ~ youtube
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 159 564.57
## 2 160 2341.60 -1 -1777 500.47 < 0.00000000000000022 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We should prefer the larger model (model2): dropping facebook significantly worsens the fit.
model3 <- lm(sales ~ youtube + facebook + newspaper, data=train.data)
summary(model3)
##
## Call:
## lm(formula = sales ~ youtube + facebook + newspaper, data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1078 -1.0636 0.2746 1.3225 3.3259
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.822044 0.392042 9.749 <0.0000000000000002 ***
## youtube 0.045264 0.001459 31.017 <0.0000000000000002 ***
## facebook 0.189049 0.008858 21.341 <0.0000000000000002 ***
## newspaper -0.007113 0.005938 -1.198 0.233
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.882 on 158 degrees of freedom
## Multiple R-squared: 0.9037, Adjusted R-squared: 0.9019
## F-statistic: 494.2 on 3 and 158 DF, p-value: < 0.00000000000000022
## already tested
##anova(model2, model1, test="F")
anova(model3, model2, model1, test="F")
## Analysis of Variance Table
##
## Model 1: sales ~ youtube + facebook + newspaper
## Model 2: sales ~ youtube + facebook
## Model 3: sales ~ youtube
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 158 559.49
## 2 159 564.57 -1 -5.08 1.4348 0.2328
## 3 160 2341.60 -1 -1777.03 501.8375 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library("AER")
data("CPS1988")
str(CPS1988)
## 'data.frame': 28155 obs. of 7 variables:
## $ wage : num 355 123 370 755 594 ...
## $ education : int 7 12 9 11 12 16 8 12 12 14 ...
## $ experience: int 45 1 9 46 36 22 51 34 0 18 ...
## $ ethnicity : Factor w/ 2 levels "cauc","afam": 1 1 1 1 1 1 1 1 1 1 ...
## $ smsa : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ region : Factor w/ 4 levels "northeast","midwest",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ parttime : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
summary(CPS1988)
## wage education experience ethnicity smsa
## Min. : 50.05 Min. : 0.00 Min. :-4.0 cauc:25923 no : 7223
## 1st Qu.: 308.64 1st Qu.:12.00 1st Qu.: 8.0 afam: 2232 yes:20932
## Median : 522.32 Median :12.00 Median :16.0
## Mean : 603.73 Mean :13.07 Mean :18.2
## 3rd Qu.: 783.48 3rd Qu.:15.00 3rd Qu.:27.0
## Max. :18777.20 Max. :18.00 Max. :63.0
## region parttime
## northeast:6441 no :25631
## midwest :6863 yes: 2524
## south :8760
## west :6091
Inspect some correlations
## use dplyr
## https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
corMatrix <- CPS1988 %>% select (wage, education, experience) %>% cor()
## wage in USD per week
## education and experience are measured in years
## ethnicity: factor
corMatrix
## wage education experience
## wage 1.0000000 0.3016440 0.1942204
## education 0.3016440 1.0000000 -0.2867064
## experience 0.1942204 -0.2867064 1.0000000
corrplot(corMatrix, method = 'color', order = 'alphabet')
corrplot(corMatrix, method = 'number')
Ethnicity is categorical
levels(CPS1988$ethnicity)
## [1] "cauc" "afam"
## Parttime
levels(CPS1988$parttime)
## [1] "no" "yes"
## this is pretty big
nrow(CPS1988)
## [1] 28155
## later
##plot(CPS1988)
##
The model
model3w <- lm(log(wage) ~ experience + I(experience^2) + education + ethnicity,
              data = CPS1988)
NOTE: Inside a formula the operators +, *, /, and ^ have special meanings. If one needs the literal x^2, one has to wrap it in the I() function, as in the example above.
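To see why I() matters, compare the two fits below (a sketch added for illustration): without I(), ^ is interpreted in formula algebra, where experience^2 collapses to just experience.
## no quadratic term: formula algebra reduces experience^2 to experience
lm(log(wage) ~ experience^2, data = CPS1988)
## literal square of experience, thanks to I()
lm(log(wage) ~ I(experience^2), data = CPS1988)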
summary(model3w)
##
## Call:
## lm(formula = log(wage) ~ experience + I(experience^2) + education +
## ethnicity, data = CPS1988)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9428 -0.3162 0.0580 0.3756 4.3830
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.32139500 0.01917421 225.38 <0.0000000000000002 ***
## experience 0.07747323 0.00088005 88.03 <0.0000000000000002 ***
## I(experience^2) -0.00131607 0.00001899 -69.31 <0.0000000000000002 ***
## education 0.08567282 0.00127219 67.34 <0.0000000000000002 ***
## ethnicityafam -0.24336430 0.01291812 -18.84 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5839 on 28150 degrees of freedom
## Multiple R-squared: 0.3347, Adjusted R-squared: 0.3346
## F-statistic: 3541 on 4 and 28150 DF, p-value: < 0.00000000000000022
ethnicityafam means afam is coded 1 and cauc 0, so the negative coefficient can be interpreted as the difference in (log) wage between afam (1) and cauc (0).
In R, factors are automatically handled as a set of dichotomous (dummy) variables. If a factor contains more than 2 levels, say \(l\), then \(l-1\) dichotomous variables are created; the coding can be inspected as in the sketch below.
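A quick way to inspect the coding (contrasts() and model.matrix() are base R):
contrasts(CPS1988$ethnicity)                    # afam column: cauc = 0, afam = 1
head(model.matrix(~ ethnicity, data = CPS1988)) # the dummy column lm() actually uses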
## Simpler model w/o ethnicity
model3a <- lm(log(wage) ~ experience + I(experience^2) + education,
              data = CPS1988)
summary(model3a)
##
## Call:
## lm(formula = log(wage) ~ experience + I(experience^2) + education,
## data = CPS1988)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9333 -0.3176 0.0594 0.3800 4.4178
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.27808974 0.01915521 223.34 <0.0000000000000002 ***
## experience 0.07752025 0.00088556 87.54 <0.0000000000000002 ***
## I(experience^2) -0.00131597 0.00001911 -68.88 <0.0000000000000002 ***
## education 0.08744110 0.00127667 68.49 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5876 on 28151 degrees of freedom
## Multiple R-squared: 0.3264, Adjusted R-squared: 0.3263
## F-statistic: 4546 on 3 and 28151 DF, p-value: < 0.00000000000000022
BTW, setting reference groups for factors: the first level of the factor is treated as the omitted reference group. Use the relevel() function to change it, as in the sketch below.
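A one-line sketch (the name ethnicity2 is hypothetical):
## make 'afam' the reference level instead of 'cauc'
CPS1988$ethnicity2 <- relevel(CPS1988$ethnicity, ref = "afam")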
Note: models are ‘nested’
anova(model3a, model3w)
## Analysis of Variance Table
##
## Model 1: log(wage) ~ experience + I(experience^2) + education
## Model 2: log(wage) ~ experience + I(experience^2) + education + ethnicity
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 28151 9719.6
## 2 28150 9598.6 1 121.02 354.91 < 0.00000000000000022 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Akaike information criterion (AIC) is a metric used to compare the fit of several regression models; a smaller AIC indicates a better fit.
AIC(model3a, model3w)
## df AIC
## model3a 5 49965.43
## model3w 6 49614.68
Ethnicity is significant at any reasonable level (both the F-test above and AIC favor the model with ethnicity).
Nine-month academic salary of academic staff in the US (the Salaries data set from the car library):
library("car")
head(Salaries, n=9)
## rank discipline yrs.since.phd yrs.service sex salary
## 1 Prof B 19 18 Male 139750
## 2 Prof B 20 16 Male 173200
## 3 AsstProf B 4 3 Male 79750
## 4 Prof B 45 39 Male 115000
## 5 Prof B 40 41 Male 141500
## 6 AssocProf B 6 6 Male 97000
## 7 Prof B 30 23 Male 175000
## 8 Prof B 45 45 Male 147765
## 9 Prof B 21 20 Male 119250
model.9 <- lm(salary ~ yrs.service + rank + discipline + sex, data=Salaries)
summary(model.9)
##
## Call:
## lm(formula = salary ~ yrs.service + rank + discipline + sex,
## data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64202 -14255 -1533 10571 99163
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68351.67 4482.20 15.250 < 0.0000000000000002 ***
## yrs.service -88.78 111.64 -0.795 0.426958
## rankAssocProf 14560.40 4098.32 3.553 0.000428 ***
## rankProf 49159.64 3834.49 12.820 < 0.0000000000000002 ***
## disciplineB 13473.38 2315.50 5.819 0.0000000124 ***
## sexMale 4771.25 3878.00 1.230 0.219311
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22650 on 391 degrees of freedom
## Multiple R-squared: 0.4478, Adjusted R-squared: 0.4407
## F-statistic: 63.41 on 5 and 391 DF, p-value: < 0.00000000000000022
anova(model.9)
## Analysis of Variance Table
##
## Response: salary
## Df Sum Sq Mean Sq F value Pr(>F)
## yrs.service 1 40709289442 40709289442 79.3405 < 0.00000000000000022 ***
## rank 2 103577124570 51788562285 100.9335 < 0.00000000000000022 ***
## discipline 1 17617147945 17617147945 34.3350 0.000000009861 ***
## sex 1 776686259 776686259 1.5137 0.2193
## Residuals 391 200620394344 513095638
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model3b <- lm(log(wage) ~ experience + I(experience^2) + education*ethnicity,
              data = CPS1988)
summary(model3b)
##
## Call:
## lm(formula = log(wage) ~ experience + I(experience^2) + education *
## ethnicity, data = CPS1988)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9451 -0.3162 0.0578 0.3761 4.3929
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.31305918 0.01958965 220.170 <0.0000000000000002 ***
## experience 0.07751973 0.00088028 88.063 <0.0000000000000002 ***
## I(experience^2) -0.00131787 0.00001901 -69.339 <0.0000000000000002 ***
## education 0.08631178 0.00130887 65.944 <0.0000000000000002 ***
## ethnicityafam -0.12388689 0.05902575 -2.099 0.0358 *
## education:ethnicityafam -0.00964814 0.00465096 -2.074 0.0380 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5839 on 28149 degrees of freedom
## Multiple R-squared: 0.3348, Adjusted R-squared: 0.3347
## F-statistic: 2834 on 5 and 28149 DF, p-value: < 0.00000000000000022
anova(model3b, model3w)
## Analysis of Variance Table
##
## Model 1: log(wage) ~ experience + I(experience^2) + education * ethnicity
## Model 2: log(wage) ~ experience + I(experience^2) + education + ethnicity
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 28149 9597.2
## 2 28150 9598.6 -1 -1.4672 4.3033 0.03805 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The interaction education:ethnicity is significant at the 5% level (\(p \approx 0.038\)), but not at the 1% level.