5.12 Which to use when? car::Anova(), anova(), gmodels::estimable(), or predict()
It is easy to get car::Anova(), anova(), gmodels::estimable(), and predict() mixed up. This section summarizes when to use each, followed by some examples.
In general, car::Anova() and anova() are used to test whether one or more regression coefficients are equal to 0, whereas gmodels::estimable() and predict() are used to estimate a sum of regression coefficients each multiplied by some number.
Use car::Anova() with type = 3 to test whether all the adjusted regression coefficients associated with a single term in the model (e.g., a single predictor, a single interaction) are simultaneously zero. For binary categorical predictors, continuous predictors, and interactions between them, the car::Anova() output is redundant with the regression coefficient table output from summary() since each such term has only one corresponding regression coefficient. However, you must use car::Anova() to test the significance of a categorical predictor with more than two levels, or of an interaction that involves a categorical predictor with more than two levels. car::Anova() carries out comparisons for each of a set of specific pairs of nested models – each comparison is between the full model and a reduced model with one term (a predictor or interaction) removed.
Use anova() to compare two nested models – a full model compared to a reduced model with some terms removed from the full model. The tests possible with car::Anova() are also possible with anova(), but the reverse is not true. For example, you must use anova() to test the significance of a term involved in an interaction by comparing the full model to the reduced model that lacks both that term's main effect and interaction. When you can use car::Anova(), however, it is often more convenient than anova() because it carries out more than one test at the same time, one for each term in the model (comparing the full model to the reduced model that lacks that term).
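Since the NHANES data used in this chapter are not reproduced here, the following minimal sketch uses R's built-in mtcars data to illustrate the nested-model comparison that anova() carries out for a factor with more than two levels (the models and variable names are illustrative, not the chapter's examples):

```r
# Toy illustration of a nested-model comparison with anova(),
# using built-in data (mtcars) rather than the chapter's NHANES data
mtcars$cyl.f <- factor(mtcars$cyl)   # 3-level factor: 4, 6, 8 cylinders

# Full model: mpg regressed on weight and the 3-level factor
fit.full    <- lm(mpg ~ wt + cyl.f, data = mtcars)
# Reduced model: drop cyl.f, which removes BOTH of its dummy variables
fit.reduced <- lm(mpg ~ wt,         data = mtcars)

# 2-df F test that both cyl.f coefficients are simultaneously zero --
# the same test car::Anova(fit.full, type = 3) reports for the cyl.f term
anova(fit.reduced, fit.full)
```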
Use gmodels::estimable() to estimate and test the significance of a linear combination of coefficients. For example, use gmodels::estimable() to estimate the effect of one predictor in an interaction at a specific level of the other predictor (since this effect is the sum of a main effect and an interaction term).
Use predict() to estimate the mean outcome at specified levels of the predictors. Since a prediction is a linear combination of coefficients, you could use gmodels::estimable(), but for this purpose predict() is simpler to use since you do not have to remember the order of the terms or which levels are reference levels.
Both prediction (using predict()) and estimating an effect (using gmodels::estimable()) involve multiplying regression coefficients by numbers and then adding them up. However, there are a few differences. When using predict(), you supply values for every predictor in the model and you do not supply a value for the intercept. The intercept is always assumed to be included; otherwise, the result would not be a prediction. When using gmodels::estimable(), however, you supply the numbers by which to multiply the regression coefficients. Typically a zero is supplied for the intercept, but not necessarily. Technically, you could use gmodels::estimable() to compute any prediction by supplying numbers corresponding to the values of predictors and 1 for the intercept. However, the reverse is not true – there are some effects which are not estimable using a single call to predict().
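To make the shared computation concrete, here is a base-R sketch of what both functions ultimately do – take a linear combination of coefficients and compute its standard error – again using built-in data (mtcars) rather than the chapter's fitted models; the formula and values are illustrative assumptions:

```r
# What gmodels::estimable() computes, sketched in base R on a toy model:
# a linear combination of coefficients, its SE, and a t-based CI
fit <- lm(mpg ~ wt + factor(am), data = mtcars)
b <- coef(fit)    # (Intercept), wt, factor(am)1
V <- vcov(fit)    # variance-covariance matrix of the coefficients

# Estimate the mean mpg at wt = 3 for a manual transmission (am = 1):
# 1*(Intercept) + 3*wt + 1*factor(am)1
L <- c("(Intercept)" = 1, "wt" = 3, "factor(am)1" = 1)

est <- sum(L * b)                       # point estimate
se  <- sqrt(drop(t(L) %*% V %*% L))     # SE of the linear combination
ci  <- est + c(-1, 1) * qt(0.975, df = fit$df.residual) * se

# Because this combination includes 1 for the intercept, predict()
# gives the same estimate and CI
predict(fit, data.frame(wt = 3, am = 1), interval = "confidence")
```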
Examples
- Test the significance of smoking status (a factor with three levels) in the model for fasting glucose that includes waist circumference, smoking status, age, gender, race/ethnicity, and income (fit.ex5.1).
car::Anova(fit.ex5.1, type = 3)
## Anova Table (Type III tests)
##
## Response: LBDGLUSI
## Sum Sq Df F value Pr(>F)
## (Intercept) 142 1 62.30 0.0000000000000091 ***
## BMXWAIST 146 1 64.02 0.0000000000000040 ***
## smoker 7 2 1.44 0.23638
## RIDAGEYR 131 1 57.42 0.0000000000000926 ***
## RIAGENDR 22 1 9.73 0.00188 **
## race_eth 39 3 5.68 0.00075 ***
## income 1 2 0.23 0.79082
## Residuals 1929 846
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Same test using anova()
# Fit model without smoker
fit0 <- lm(LBDGLUSI ~ BMXWAIST + RIDAGEYR +
RIAGENDR + race_eth + income,
data = nhanesf.complete)
# Compare reduced and full models
anova(fit0, fit.ex5.1)
## Analysis of Variance Table
##
## Model 1: LBDGLUSI ~ BMXWAIST + RIDAGEYR + RIAGENDR + race_eth + income
## Model 2: LBDGLUSI ~ BMXWAIST + smoker + RIDAGEYR + RIAGENDR + race_eth +
## income
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 848 1936
## 2 846 1929 2 6.59 1.44 0.24
Although anova() gives the same test in this case, the car::Anova() code is more succinct and provides tests for all the other terms as well.
- Test the significance of the interaction (which has two degrees of freedom since income has three levels) in the model for fasting glucose that includes gender, income, and their interaction (fit.ex5.3.int) (Section 5.9.10).
car::Anova(fit.ex5.3.int, type = 3)
## Anova Table (Type III tests)
##
## Response: LBDGLUSI
## Sum Sq Df F value Pr(>F)
## (Intercept) 2603 1 977.55 <0.0000000000000002 ***
## RIAGENDR 2 1 0.67 0.413
## income 1 2 0.11 0.898
## RIAGENDR:income 17 2 3.17 0.042 *
## Residuals 2266 851
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Same test using anova()
# Fit model without the interaction
fit0 <- lm(LBDGLUSI ~ RIAGENDR + income,
data = nhanesf.complete)
# Compare reduced and full models
anova(fit0, fit.ex5.3.int)
## Analysis of Variance Table
##
## Model 1: LBDGLUSI ~ RIAGENDR + income
## Model 2: LBDGLUSI ~ RIAGENDR + income + RIAGENDR:income
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 853 2283
## 2 851 2266 2 16.9 3.17 0.042 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- Test the significance of income in fit.ex5.3.int by comparing the full model to the reduced model that lacks both income and the interaction (Section 5.9.11). Is it possible to do this test using car::Anova()?
# Fit model without the main effect or interaction
fit0 <- lm(LBDGLUSI ~ RIAGENDR, data = nhanesf.complete)
# Compare reduced and full models
anova(fit0, fit.ex5.3.int)
## Analysis of Variance Table
##
## Model 1: LBDGLUSI ~ RIAGENDR
## Model 2: LBDGLUSI ~ RIAGENDR + income + RIAGENDR:income
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 855 2295
## 2 851 2266 4 29.6 2.78 0.026 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The above test is not possible using car::Anova() since the reduced model has two terms removed, income and RIAGENDR:income, which are not simply all the levels of a single factor variable or all the terms of a single interaction.
- In fit.ex5.3.int, estimate the gender effect among those with income $25,000 to <$55,000 (Sections 5.9.7 and 5.9.9.2). Is it possible to use a single call to predict() to estimate this effect?
# Number of regression coefficients
length(coef(fit.ex5.3.int))
## [1] 6
# Coefficient names and estimates (needed to specify estimable() below)
cbind("Estimate" = round(coef(fit.ex5.3.int), 4))
## Estimate
## (Intercept) 6.2328
## RIAGENDRFemale 0.2158
## income$25,000 to <$55,000 0.1007
## income$55,000+ 0.1021
## RIAGENDRFemale:income$25,000 to <$55,000 -0.7122
## RIAGENDRFemale:income$55,000+ -0.7418
# Estimate the effect
gmodels::estimable(fit.ex5.3.int,
c("RIAGENDRFemale" = 1,
"RIAGENDRFemale:income$25,000 to <$55,000" = 1),
conf.int = 0.95)
## Estimate Std. Error t value DF Pr(>|t|) Lower.CI Upper.CI
## (0 1 0 0 1 0) -0.4964 0.2197 -2.259 851 0.02412 -0.9276 -0.06513
You cannot use a single call to predict() to estimate this effect because there is no way to exclude the intercept. To estimate the gender effect at the specified level of income, you would need two calls to predict(), as shown below. The first call estimates the mean for females at the middle level of income and the second for males. Subtracting these provides the correct gender effect (females vs. males). However, the subtraction results in the wrong CI since the difference of CIs is not equal to the CI of the difference.
# Estimated mean outcome for females at middle income level
Y1 <- predict(fit.ex5.3.int,
data.frame(RIAGENDR = "Female",
income = "$25,000 to <$55,000"),
interval = "confidence")
# Estimated mean outcome for males at middle income level
Y2 <- predict(fit.ex5.3.int,
data.frame(RIAGENDR = "Male",
income = "$25,000 to <$55,000"),
interval = "confidence")
# Correct effect estimate but NOT the correct CI
Y1 - Y2
## fit lwr upr
## 1 -0.4964 -0.4839 -0.5088
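The correct CI for a difference of two predictions can also be computed in base R from the variance of the corresponding contrast of coefficients; the sketch below uses a toy interaction model on built-in data, since the chapter's fit.ex5.3.int object is not reconstructed here, but the same approach applies:

```r
# Correct CI for a difference of two predictions, via a contrast vector
# (toy interaction model on mtcars; illustrative only)
fit <- lm(mpg ~ wt * factor(am), data = mtcars)

# Model-matrix rows for the two predictions being contrasted
X <- model.matrix(~ wt * factor(am),
                  data.frame(wt = c(3, 3), am = c(1, 0)))
L <- X[1, ] - X[2, ]              # contrast: prediction 1 minus prediction 2

est <- drop(L %*% coef(fit))      # difference of the two predictions
se  <- sqrt(drop(t(L) %*% vcov(fit) %*% L))
est + c(-1, 1) * qt(0.975, df = fit$df.residual) * se   # correct CI
```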
- Estimate the mean fasting glucose among individuals with a waist circumference of 110 cm who are current smokers, age 50 years, male, non-Hispanic White, and have an income <$25,000, in fit.ex5.1 (Section 5.10).
predict(fit.ex5.1,
data.frame(BMXWAIST = 110,
smoker = "Current",
RIDAGEYR = 50,
RIAGENDR = "Male",
race_eth = "Non-Hispanic White",
income = "<$25,000"),
interval = "confidence")
## fit lwr upr
## 1 6.511 6.156 6.866
# Same result using gmodels::estimable()
gmodels::estimable(fit.ex5.1,
c("(Intercept)" = 1,
"BMXWAIST" = 110,
"smokerPast" = 0,
"smokerCurrent" = 1,
"RIDAGEYR" = 50,
"RIAGENDRFemale" = 0,
"race_ethNon-Hispanic White" = 1,
"race_ethNon-Hispanic Black" = 0,
"race_ethNon-Hispanic Other" = 0,
"income$25,000 to <$55,000" = 0,
"income$55,000+" = 0),
conf.int = 0.95)
## Estimate Std. Error t value DF Pr(>|t|) Lower.CI Upper.CI
## (1 110 0 1 50 0 1 0 0 0 0) 6.511 0.1809 36 846 0 6.156 6.866
Although gmodels::estimable() can provide the correct answer, the code required is more cumbersome. Additionally, the output includes a test of the null hypothesis that the prediction is 0, a test that is not generally of interest.