6.6 Estimating an OR using logistic regression

Unadjusted, categorical predictor

We can also estimate an OR using logistic regression via the glm() function (“generalized linear model”).

Example 6.2 (continued): Use logistic regression to estimate the OR comparing the odds of lifetime marijuana use (mj_lifetime) between males and females (demog_sex).

First, re-level demog_sex to make “Female” the reference level in order to estimate the OR comparing males to females.

nsduh <- nsduh %>% 
  mutate(demog_sex = relevel(demog_sex, ref = "Female"))

Next, use glm() with family = binomial to fit a logistic regression and summary() to view the output.

fit.ex6.2 <- glm(mj_lifetime ~ demog_sex,
                 family = binomial, data = nsduh)
summary(fit.ex6.2)
## 
## Call:
## glm(formula = mj_lifetime ~ demog_sex, family = binomial, data = nsduh)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)    -0.1350     0.0867   -1.56   0.1195   
## demog_sexMale   0.3678     0.1274    2.89   0.0039 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1386.0  on 999  degrees of freedom
## Residual deviance: 1377.6  on 998  degrees of freedom
## AIC: 1382
## 
## Number of Fisher Scoring iterations: 3

We showed previously that the intercept (-0.13504) is the log-odds of lifetime marijuana use when all the predictors are zero or at their reference level. Use ilogit() to convert a log-odds to a probability.

ilogit(coef(fit.ex6.2)["(Intercept)"])
## (Intercept) 
##      0.4663

Thus, we estimate that the probability of marijuana use among females (the reference level) is \(p = e^{-0.13504} / (1 + e^{-0.13504}) = 0.47\) (the same value we computed using the 2 \(\times\) 2 table).
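
The ilogit() function used above is not part of base R (versions exist in several packages); if none is loaded in your session, a minimal equivalent can be defined directly from this formula (base R's plogis() computes the same quantity):

# A minimal inverse logit function, equivalent to the formula above
# (only needed if no ilogit() is already available in your session)
ilogit <- function(x) exp(x) / (1 + exp(x))
ilogit(-0.13504)
## [1] 0.4663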

The regression coefficient for demog_sexMale (0.36784) represents the log of the OR for lifetime marijuana use comparing males to females. To get the OR itself, exponentiate the coefficient using exp(). Thus, we estimate the OR to be \(e^{0.36784}\) = 1.44 (again, the same value we computed using the 2 \(\times\) 2 table).

# [-1] drops the first element, which corresponds to the intercept
exp(coef(fit.ex6.2)[-1])
## demog_sexMale 
##         1.445

Use confint() to compute 95% CIs for the regression coefficients, and exponentiate these to get 95% CIs for the ORs (dropping the intercept, since \(e^{\beta_0}\) is an odds, not an OR).

NOTE: confint() uses a profile likelihood method rather than the Wald method used by summary(). Thus, the 95% CIs may not correspond exactly to the p-values. That is, you may have p < .05 but a 95% CI that contains 1, or p > .05 but a 95% CI that does not contain 1.
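
For comparison, here is a minimal sketch of the Wald 95% CI for the OR in this example: exponentiate the estimate plus or minus 1.96 standard errors. This is only an illustration of the Wald method; the text below continues to use confint().

# Wald 95% CI for the OR: exponentiate the estimate +/- 1.96 SEs
b  <- summary(fit.ex6.2)$coef["demog_sexMale", "Estimate"]
se <- summary(fit.ex6.2)$coef["demog_sexMale", "Std. Error"]
exp(b + c(-1, 1) * qnorm(0.975) * se)
# (roughly 1.13 to 1.85, close to the profile likelihood CI below)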

# CIs for regression coefficients
confint(fit.ex6.2)
##                 2.5 %  97.5 %
## (Intercept)   -0.3055 0.03476
## demog_sexMale  0.1186 0.61808
# CI for the OR: use [-1, ] to drop the first row, which corresponds
# to the intercept, and drop=F to retain the rownames since there is
# only one predictor term
exp(confint(fit.ex6.2))[-1, , drop=F]
##               2.5 % 97.5 %
## demog_sexMale 1.126  1.855

In this example, the categorical predictor has just 2 levels, so the p-value testing its association with the outcome can be read directly from the regression coefficient table (0.00388). In general, use car::Anova() to obtain the p-value for any predictor, continuous or categorical with any number of levels. Unlike with linear regression, there are two options here: a likelihood ratio test or a Wald test. We use the Wald test here to be consistent with the p-values in the regression coefficient table (which are equivalent to Wald tests); see Section 6.18 for how to obtain likelihood ratio tests throughout.

car::Anova(fit.ex6.2, type = 3, test.statistic = "Wald")
## Analysis of Deviance Table (Type III tests)
## 
## Response: mj_lifetime
##             Df Chisq Pr(>Chisq)   
## (Intercept)  1  2.42     0.1195   
## demog_sex    1  8.34     0.0039 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
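
For reference, the likelihood ratio version of these tests is requested from the same function by changing test.statistic (see Section 6.18):

# Likelihood ratio alternative to the Wald test
car::Anova(fit.ex6.2, type = 3, test.statistic = "LR")
# (output not shown)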

Conclusion: Males have significantly greater odds of lifetime marijuana use than females (OR = 1.44; 95% CI = 1.13, 1.86; p = .004). In other words, males have 44% greater odds of lifetime marijuana use than females.

For this simple case with a binary outcome and a binary predictor, the OR is exactly the same as that obtained via the cross-product from a 2 \(\times\) 2 table. An advantage of using logistic regression, however, is that it provides the ability to adjust for other predictors to obtain an adjusted OR (AOR), as we will see in Section 6.6.3.

Unadjusted, continuous predictor

Next, let’s fit a logistic regression with a continuous predictor and interpret the results.

Example 6.3: Using the 2019 National Survey of Drug Use and Health (NSDUH) teaching dataset (Section A.5), what is the association between lifetime marijuana use (mj_lifetime) and age at first use of alcohol (alc_agefirst)?

Fit the model and interpret the coefficients.

fit.ex6.3 <- glm(mj_lifetime ~ alc_agefirst, family = binomial, data = nsduh)
round(summary(fit.ex6.3)$coef, 4)
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)    5.3407     0.4747   11.25        0
## alc_agefirst  -0.2835     0.0267  -10.62        0
ilogit(coef(fit.ex6.3)["(Intercept)"])
## (Intercept) 
##      0.9952

The inverse logit of the intercept (5.3407) is the probability of the outcome when all predictors are zero or at their reference level. Thus, 0.995 is the estimated probability of lifetime marijuana use among those who started drinking alcohol at age 0 years.

Examining a summary of the predictor, we see that the range of age of first use of alcohol is 3 to 45 years, and the median value is 17 years.

# Examine the predictor
summary(nsduh$alc_agefirst)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     3.0    15.0    17.0    17.5    19.0    45.0     157

Thus, the probability of marijuana use for those who first drank alcohol at age 0 years is an extrapolation. Had we centered the predictor, the intercept would be more interpretable. For example, had we centered it at the median, 17 years, the intercept would be interpreted as the log-odds of lifetime marijuana use at the median age of first alcohol use, with a corresponding estimated prevalence of 0.63 among those who first drank alcohol at age 17 years.

nsduh <- nsduh %>% 
  mutate(calc_agefirst = alc_agefirst - 17)
fit.ex6.3.centered <- glm(mj_lifetime ~ calc_agefirst,
                          family = binomial, data = nsduh)
ilogit(coef(fit.ex6.3.centered)["(Intercept)"])
## (Intercept) 
##      0.6273

The regression coefficient for alc_agefirst (-0.2835) is the difference in log-odds associated with a 1-year difference in age at first use of alcohol. Converting this to an OR by exponentiating, we get \(e^{-0.2835} = 0.753\).

exp(coef(fit.ex6.3)[-1])
## alc_agefirst 
##       0.7531

Thus, starting drinking alcohol 1 year later is associated with 24.7% lower odds of lifetime marijuana use (since 1 - 0.753 = 0.247). Finally, compute a 95% CI for the OR using confint() and exponentiating.

exp(confint(fit.ex6.3)[-1,])
##  2.5 % 97.5 % 
## 0.7135 0.7923

Conclusion: Age at first alcohol use is significantly negatively associated with lifetime marijuana use (OR = 0.753; 95% CI = 0.713, 0.792; p < .001). Individuals who first used alcohol at, say, age 19 years have 24.7% lower odds of having ever used marijuana than those who first used alcohol at age 18 years.

6.6.1 OR associated with other than a 1-unit difference

If, after fitting the model, you want to know the increase or reduction in odds associated with other than a 1-unit difference in a continuous predictor, multiply the regression coefficient by that factor before exponentiating. Similarly, to compute the corresponding confidence interval, multiply the confidence limits for the regression coefficient by that factor prior to exponentiating.

If you want to have the logistic regression model itself estimate an OR corresponding to other than a 1-unit difference then, prior to fitting the model, divide the continuous predictor by the desired factor. In this way a 1-unit difference on the new scale will correspond to the desired difference on the original scale.

Example 6.3 (continued): To estimate the reduction in odds associated with starting drinking alcohol 3 years later, do not multiply 24.7% by 3 (which would be 74.1%). Instead, first compute the OR as \(e^{-0.2835 \times 3} = 0.427\) and then compute the percent change in odds. In other words, those who started using alcohol 3 years later have 57.3% lower odds of lifetime marijuana use (not 74.1%).

# OR for a 3-unit difference
exp(3*coef(fit.ex6.3))[-1]
## alc_agefirst 
##       0.4272
# 95% CI for the OR for a 3-unit difference
exp(3*confint(fit.ex6.3))[-1,]
##  2.5 % 97.5 % 
## 0.3632 0.4973
# Same result accomplished by transforming the
# predictor prior to fitting the model
tmpdat <- nsduh %>% 
  mutate(alc_agefirst3 = alc_agefirst/3)

tmpfit <- glm(mj_lifetime ~ alc_agefirst3,
              family = binomial, data = tmpdat)

exp(coef(tmpfit))[-1]
## alc_agefirst3 
##        0.4272
exp(confint(tmpfit))[-1,]
##  2.5 % 97.5 % 
## 0.3632 0.4973

6.6.2 Make sure you know what probability glm() is modeling

The outcome \(Y\) must be binary (exactly 2 unique values, not including NA) and either have numeric values 0 and 1 or be coded as a factor. If \(Y\) is numeric, then glm() will model \(p = P(Y = 1)\). If \(Y\) is a factor, then glm() will model \(p = P(Y = \textrm{the non-reference level})\). If the outcome is not coded to these specifications, or if you want to model the probability of the other level of the outcome, then recode it (see Section 4.4.1).
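
For example, here is a minimal sketch of recoding a factor outcome so that glm() models the probability of the other level; nsduh.flip is a hypothetical name used only for this illustration.

# Make "Yes" the reference level so that glm() would model
# P(Y = "No") instead of P(Y = "Yes")
nsduh.flip <- nsduh %>% 
  mutate(mj_lifetime = relevel(mj_lifetime, ref = "Yes"))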

It is essential to know which level of the outcome corresponds to the probability \(p\) when using logistic regression. Use is.factor() to check if \(Y\) is a factor, and levels() to examine the levels (if it is a factor). The first level listed by levels() is the reference level, so the second is the level whose probability is being modeled. If the outcome is not a factor, use unique() to check the unique values – they should be 0 and 1.

Example 6.3 (continued): For the outcome \(Y\) = mj_lifetime, which has possible values No (reference) and Yes, glm() models \(p = P(Y = \textrm{Yes})\).

is.factor(nsduh$mj_lifetime)
## [1] TRUE
# If it is a factor, check the levels
levels(nsduh$mj_lifetime)
## [1] "No"  "Yes"
# If it is not a factor, check
# the unique values.
# Should be 0 and 1
# unique(x)

6.6.3 Adjusted OR

Just as you can move from simple to multiple linear regression by adding more predictors, so you can move from simple to multiple logistic regression. When there are multiple predictors in a logistic regression model, the resulting odds ratios are called adjusted odds ratios (AOR).

Example 6.3 (continued): What is the association between lifetime marijuana use (mj_lifetime) and age at first use of alcohol (alc_agefirst), adjusted for age (demog_age_cat6), sex (demog_sex), and income (demog_income)? Fit the model and compute the AOR, its 95% CI, and the p-value that tests the significance of age at first use of alcohol. While we are not primarily interested in the confounders, it is common to also report their AORs, 95% CIs, and p-values. Since there are some categorical confounders with more than two levels, use car::Anova() to compute multiple df p-values.

fit.ex6.3.adj <- glm(mj_lifetime ~ alc_agefirst + demog_age_cat6 + demog_sex +
                     demog_income, family = binomial, data = nsduh)
# Regression coefficient table
round(summary(fit.ex6.3.adj)$coef, 4)
##                               Estimate Std. Error z value Pr(>|z|)
## (Intercept)                     6.2542     0.5914 10.5759   0.0000
## alc_agefirst                   -0.2754     0.0276 -9.9922   0.0000
## demog_age_cat626-34            -0.2962     0.3286 -0.9012   0.3675
## demog_age_cat635-49            -0.8043     0.2966 -2.7120   0.0067
## demog_age_cat650-64            -0.6899     0.2985 -2.3109   0.0208
## demog_age_cat665+              -1.2748     0.3043 -4.1893   0.0000
## demog_sexMale                  -0.0609     0.1618 -0.3763   0.7067
## demog_income$20,000 - $49,999  -0.5309     0.2664 -1.9927   0.0463
## demog_income$50,000 - $74,999  -0.0793     0.3049 -0.2601   0.7948
## demog_income$75,000 or more    -0.3612     0.2532 -1.4264   0.1538
# AORs and 95% CIs
# Use cbind to view the AORs and CIs side-by-side
# Add [-1,] to drop the row for the intercept
# since exp(intercept) is not an odds ratio
OR.CI <- cbind("AOR" = exp(coef(fit.ex6.3.adj)),
                       exp(confint(fit.ex6.3.adj)))[-1,]
round(OR.CI, 3)
##                                 AOR 2.5 % 97.5 %
## alc_agefirst                  0.759 0.718  0.800
## demog_age_cat626-34           0.744 0.387  1.409
## demog_age_cat635-49           0.447 0.247  0.791
## demog_age_cat650-64           0.502 0.275  0.891
## demog_age_cat665+             0.279 0.152  0.502
## demog_sexMale                 0.941 0.684  1.291
## demog_income$20,000 - $49,999 0.588 0.347  0.987
## demog_income$50,000 - $74,999 0.924 0.507  1.680
## demog_income$75,000 or more   0.697 0.421  1.139
# P-values
car::Anova(fit.ex6.3.adj, type = 3, test.statistic = "Wald")
## Analysis of Deviance Table (Type III tests)
## 
## Response: mj_lifetime
##                Df  Chisq           Pr(>Chisq)    
## (Intercept)     1 111.85 < 0.0000000000000002 ***
## alc_agefirst    1  99.84 < 0.0000000000000002 ***
## demog_age_cat6  4  23.01              0.00013 ***
## demog_sex       1   0.14              0.70669    
## demog_income    3   5.44              0.14197    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation of the output:

  • The AOR for our primary predictor alc_agefirst is 0.759. Among those with the same age, sex, and income, this AOR compares the odds of lifetime marijuana use between those who differ by one year in age at first use of alcohol.
  • The remaining AORs compare levels of categorical predictors to their reference level, adjusted for the other predictors in the model. For example, comparing individuals with the same age of first alcohol use, sex, and income, 35-49 year-olds have 55.3% lower odds of lifetime marijuana use than 18-25 year-olds (AOR = 0.447; 95% CI = 0.247, 0.791; p = .007). The p-value for this specific comparison of ages comes from the Coefficients table. An overall 4-df p-value for age can be read from the Type III tests table (0.00013). Comparisons between two non-reference levels can also be computed; see the sketch after this list.
  • The Type III tests output contains the multiple df Wald tests for categorical predictors with more than two levels. For continuous predictors, or for categorical predictors with exactly two levels, the Type III Wald test p-values are identical to those in the Coefficients table.
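
As a sketch of the comparison mentioned above: the point estimate of an AOR comparing two non-reference levels is the exponentiated difference of their coefficients. (A CI for this comparison would additionally require the coefficient covariance matrix from vcov(), not shown here.)

# Point estimate of the AOR comparing 50-64 to 35-49 year-olds
b <- coef(fit.ex6.3.adj)
exp(b["demog_age_cat650-64"] - b["demog_age_cat635-49"])
# (about 1.12: 50-64 year-olds have slightly greater adjusted odds
#  of lifetime marijuana use than 35-49 year-olds)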

Conclusion: After adjusting for age, sex, and income, age at first alcohol use is significantly negatively associated with lifetime marijuana use (AOR = 0.759; 95% CI = 0.718, 0.800; p < .001). Individuals who first used alcohol at a given age have 24.1% lower odds of having ever used marijuana than those who first used alcohol one year earlier.

NOTE: The steps of extracting the coefficients and CIs and exponentiating them are demonstrated here to make it clear where the numbers come from. However, the car::S() function (Fox, Weisberg, and Price 2023; Fox and Weisberg 2019) can be used instead to conveniently produce the summary() output along with the AORs and their 95% CIs (you still need car::Anova() for the Type III Wald tests).

car::S(fit.ex6.3.adj)
# (output not shown)

References

Fox, John, and Sanford Weisberg. 2019. An R Companion to Applied Regression. 3rd ed. Los Angeles: Sage Publications, Inc.
Fox, John, Sanford Weisberg, and Brad Price. 2023. Car: Companion to Applied Regression. https://r-forge.r-project.org/projects/car/.