6.17 Writing it up

This section demonstrates how to write up the methods and results sections corresponding to a logistic regression analysis.

Example 6.3 (continued): Using the 2019 National Survey of Drug Use and Health (NSDUH) teaching dataset (Section A.5), what is the association between lifetime marijuana use (mj_lifetime) and age at first use of alcohol (alc_agefirst), adjusted for age (demog_age_cat6), sex (demog_sex), and income (demog_income)? Write up the Methods and Results for the analyses that answered this research question using models without (fit.ex6.3.adj) and with an interaction (fit.ex6.3.int).

6.17.1 Writing up logistic regression results (no interaction)

Below are a few lines of code to extract the information we need for our write-up.

Outcome, predictors, and dataset name: The call component of the glm object provides the call to glm, including the variable and dataset names.

fit.ex6.3.adj$call
## glm(formula = mj_lifetime ~ alc_agefirst + demog_age_cat6 + demog_sex + 
##     demog_income, family = binomial, data = nsduh)

Number of observations: If we used a complete case analysis, then the number of rows in the dataset (nrow(nsduh)) is the number of cases used in the analysis. But counting the number of residuals to get the sample size will work even when you have missing data.

length(fit.ex6.3.adj$residuals)
## [1] 843

Selected descriptive statistics: Extract any descriptive summaries you might want to add to your write-up. For example, the proportion of individuals in each age group. In this example, we did not remove missing data first before fitting the model, so we need to do that here before creating our summary. We want our summary to be based on the same sample as our regression model, and a regression will always carry out a complete case analysis whether you removed the cases with missing values first or not.

tmp <- nsduh %>%
  drop_na(mj_lifetime, alc_agefirst, demog_age_cat6,
         demog_sex, demog_income)
round(100*prop.table(table(tmp$demog_age_cat6)), 1)
## 
## 18-25 26-34 35-49 50-64   65+ 
##  12.1  14.7  27.3  24.1  21.8

Regression coefficients, p-values, AORs, and 95% confidence intervals: The p-values in the regression coefficient table test the association with the outcome for continuous predictors and categorical predictors with exactly two levels. For categorical predictors, each p-value tests the difference in log-odds of the outcome between the level shown in that row and the reference level.

# Regression coefficients, p-values, AORs, 95% CIs for AORs
car::S(fit.ex6.3.adj)
# 95% CIs for the regression coefficients
confint(fit.ex6.3.adj)
# (results not shown)

Type III Wald test p-values: For each categorical predictor with more than two levels, the Type III Wald test provides the multiple degree of freedom p-value for the overall test of that predictor’s association with the outcome.

car::Anova(fit.ex6.3.adj, type = 3, test.statistic = "Wald")
# (results not shown)

Methods: We used data from a teaching dataset including 843 participants in the 2019 National Survey of Drug Use and Health aged 18 years and older (18-25 = 12.1%, 26-34 = 14.7%, 35-49 = 27.3%, 50-64 = 24.1%, 65+ = 21.8%) to assess the association between lifetime marijuana use and age at first use of alcohol, adjusted for age group (ref = 18-25), sex (Female (ref), Male), and income (Less than $20,000 (ref), $20,000 to $49,999, $50,000 to $74,999, $75,000 or more) using binary logistic regression. We assessed the linearity assumption and checked for separation, collinearity, and outliers (no issues were found). We evaluated the robustness of our results to the presence of a few potentially influential observations using a sensitivity analysis (we did not carry this out in this chapter, but this demonstrates where you would report such an analysis). A forest plot is provided to illustrate the AORs and their 95% CIs.

Results: After adjusting for age, sex, and income, age at first alcohol use was significantly negatively associated with lifetime marijuana use (AOR = 0.759; 95% CI = 0.718, 0.800; p <.001). Individuals who first used alcohol at a given age have 24.1% lower odds of having ever used marijuana than those who first used alcohol one year earlier. Current age was also significantly associated with the outcome, with older individuals having lower odds of lifetime marijuana use.

See Table 6.2 for full regression results and Figure 6.3 for a forest plot of the AORs and their 95% CIs.

The code below illustrates the use of gtsummary::tbl_merge() (Sjoberg et al. 2021, 2023) to merge together two regression tables. In this case, we are merging one table that includes the regression coefficients (which are on the logit scale) and another table that includes the AORs (the exponentiated regression coefficients). The intercept, however, is not a log-odds ratio, like the other regression coefficients; it is the log-odds of the outcome when all predictors are 0 or at their reference level. This distinction is added in a footnote to the table. To make the intercept more interpretable, the one continuous variable in this model was centered (age of first alcohol use, centered at age 21 years).

Notice how some features are included in one table but not the other to avoid redundancies (e.g., p-values only in the second table). The labels, however, must be identical in both tables in order for the merge to work correctly. Notice also (in the code comments) the difference between using style_sigfig(), which keeps a specified number of significant digits, and style_number(), which rounds to a specified number of decimal places. For example, the number 1.532 with two significant digits would be 1.5, while rounded to two decimal places would be 1.53. In this particular table, rounding makes everything line up nicely. In other tables, especially when different regression terms are on different scales, specifying the number of significant digits may be preferred.

# Re-fit the model with a centered version of alc_agefirst
tmpdat <- nsduh %>% 
  mutate(calc_agefirst = alc_agefirst - 21)

fit.ex6.3.adj.b <- glm(mj_lifetime ~ calc_agefirst + demog_age_cat6 + demog_sex +
                     demog_income, family = binomial, data = tmpdat)

NOBS <- length(fit.ex6.3.adj.b$residuals)

library(gtsummary)
t1 <- fit.ex6.3.adj.b %>%
  tbl_regression(intercept = T,
                 # Use style_number to round to 3 digits (e.g., 1.27)
                 estimate_fun = function(x) style_number(x, digits = 3),
                 # Use style_sigfig to keep 3 significant digits (e.g., 1.3)
                 # estimate_fun = function(x) style_sigfig(x, digits = 3),
                 label  = list('(Intercept)'  ~ "Intercept",
                               calc_agefirst  ~ "Age First Alc - 21y (y)",
                               demog_age_cat6 ~ "Age (y)",
                               demog_sex      ~ "Sex",
                               demog_income   ~ "Income")) %>%
  modify_column_hide(p.value) %>%
  modify_caption(paste("Logistic regression results for lifetime marijuana use vs.
                   age at first alcohol use (years) (N = ", NOBS, ")", sep=""))

t2 <- fit.ex6.3.adj.b %>%
  tbl_regression(intercept = T,
                 exponentiate = T,  # OR = exp(B)
                 # Use style_number to round to 3 digits (e.g., 1.27)
                 estimate_fun = function(x) style_number(x, digits = 3),
                 # Use style_sigfig to keep 3 significant digits (e.g., 1.3)
                 # estimate_fun = function(x) style_sigfig(x, digits = 3),
                 pvalue_fun   = function(x) style_pvalue(x, digits = 3),
                 label  = list('(Intercept)'  ~ "Intercept",
                               calc_agefirst  ~ "Age First Alc - 21y (y)",
                               demog_age_cat6 ~ "Age (y)",
                               demog_sex      ~ "Sex",
                               demog_income   ~ "Income")) %>%
  # Add test.statistic = "Wald" here to get Wald Type III tests
  add_global_p(keep = T, test.statistic = "Wald")
TABLE <- tbl_merge(
  tbls = list(t1, t2),
  tab_spanner = c("**Adjusted Coefficient**", "**Adjusted Odds Ratio**")
  ) %>% 
  modify_footnote(everything() ~ NA, abbreviation = TRUE) %>% 
  modify_header(update = list(
    estimate_1 = "**log(AOR)**",
    estimate_2 = "**AOR**"
))
TABLE
Table 6.2: Logistic regression results for lifetime marijuana use vs. age at first alcohol use (years) (N = 843)
Characteristic Adjusted Coefficient Adjusted Odds Ratio
log(AOR) 95% CI AOR 95% CI p-value
Intercept 0.470 -0.150, 1.120 1.601 0.860, 3.064 0.145
Age First Alc - 21y (y) -0.275 -0.331, -0.223 0.759 0.718, 0.800 <0.001
Age (y)



<0.001
    18-25
    26-34 -0.296 -0.949, 0.343 0.744 0.387, 1.409 0.367
    35-49 -0.804 -1.400, -0.234 0.447 0.247, 0.791 0.007
    50-64 -0.690 -1.289, -0.116 0.502 0.275, 0.891 0.021
    65+ -1.275 -1.885, -0.689 0.279 0.152, 0.502 <0.001
Sex



0.707
    Female
    Male -0.061 -0.379, 0.256 0.941 0.684, 1.291 0.707
Income



0.142
    Less than $20,000
    $20,000 - $49,999 -0.531 -1.059, -0.013 0.588 0.347, 0.987 0.046
    $50,000 - $74,999 -0.079 -0.679, 0.519 0.924 0.507, 1.680 0.795
    $75,000 or more -0.361 -0.864, 0.130 0.697 0.421, 1.139 0.154

6.17.2 Writing up logistic regression results (with an interaction)

For the model with an interaction (fit.ex6.3.int), more information must be added to the Methods and Results. Specifically, mention the inclusion of an interaction, the test of interaction, the overall test of the primary predictor of interest that tested the main effect and the interaction at the same time (Section 6.9.1), and the separate AORs for the primary predictor of interest at the levels of the other predictor in the interaction (Section 6.9.2).

The following code displays the raw regression coefficients, 95% CIs, and p-values.

# Regression coefficients, p-values, AORs, 95% CIs for AORs
car::S(fit.ex6.3.int)
# 95% CIs for the regression coefficients
confint(fit.ex6.3.int)
# Pvalues
car::Anova(fit.ex6.3.int, type = 3, test.statistic = "Wald")
# (Results not shown)

The following code displays the overall test of the predictor.

# Overall test of alc_agefirst
fit0 <- glm(mj_lifetime ~ demog_age_cat6 + demog_sex + demog_income,
           family = binomial, data = nsduh,
           subset = complete.cases(alc_agefirst))
anova(fit0, fit.ex6.3.int, test = "Chisq")
# (Results not shown)

The following code displays the AORs at each level of the other term in the interaction. We store these as the objects EST.M and EST.F because we will be re-using them below when we create a table illustrating the interaction effect.

# AOR for Males
EST.M <- gmodels::estimable(fit.ex6.3.int,
        c("alc_agefirst"               = 1,
          "alc_agefirst:demog_sexMale" = 1),
        conf.int = 0.95)
# AOR for Females
EST.F <- gmodels::estimable(fit.ex6.3.int,
        c("alc_agefirst"               = 1),
        conf.int = 0.95)
# (Results not shown)

Methods: (Same as for the model with no interaction with the following additions.) An interaction between age at first alcohol use and sex was included in the model to assess if the association with the outcome differed between females and males. The overall significance of age at first alcohol use was assessed by comparing the full model to a reduced model with both its main effect and the interaction removed. Additionally, we estimated the AOR for age at first alcohol use separately for females and for males.

Results: The interaction effect was statistically significant (B = 0.119; 95% CI = 0.011, 0.229; p = .032). Overall, after adjusting for age, sex, and income, age at first alcohol use was significantly associated with lifetime marijuana use (p <.001). The association was significant for both females (OR = 0.713; 95% CI = 0.656, 0.776; p <.001) and males (OR = 0.803; 95% CI = 0.748, 0.863; p <.001). Current age was also significantly associated with the outcome.

See Table 6.3 for full regression results.

NOTE: Be careful interpreting the “AOR” for the interaction term (1.126). It is not an AOR for a specific predictor. Rather, it is a ratio of odds ratios and can be interpreted in two ways. First, it is the ratio of the AOR for a 1-unit difference in age of first alcohol use for males divided by that AOR for females (0.803 / 0.713). Second, it is the ratio of the AOR comparing males to females for those of a given age of first alcohol use divided by that AOR for those who started one year earlier. For example, for age of first use 17 years, this ratio can be derived as exp(0.359 + (17-21) \(\times\) 0.119) / exp(0.359 + (16-21) \(\times\) 0.119) = 10.817 / 9.605.

Table 6.3: Logistic regression results for lifetime marijuana use vs. age at first alcohol use (years) including an interaction with sex (N = 843)
Characteristic Adjusted Coefficient Adjusted Odds Ratio
log(AOR) 95% CI AOR 95% CI p-value
Intercept 0.279 -0.380, 0.958 1.322 0.684, 2.607 0.411
Age First Alc - 21y (y) -0.338 -0.425, -0.259 0.713 0.654, 0.772 <0.001
Age (y)



<0.001
    18-25
    26-34 -0.295 -0.954, 0.350 0.744 0.385, 1.419 0.373
    35-49 -0.816 -1.417, -0.242 0.442 0.243, 0.785 0.006
    50-64 -0.687 -1.291, -0.108 0.503 0.275, 0.897 0.022
    65+ -1.260 -1.875, -0.671 0.284 0.153, 0.511 <0.001
Sex



0.156
    Female
    Male 0.359 -0.137, 0.858 1.432 0.872, 2.359 0.156
Income



0.156
    Less than $20,000
    $20,000 - $49,999 -0.530 -1.062, -0.009 0.588 0.346, 0.991 0.048
    $50,000 - $74,999 -0.093 -0.697, 0.510 0.911 0.498, 1.665 0.762
    $75,000 or more -0.363 -0.869, 0.131 0.696 0.419, 1.140 0.154
Age First Alc - 21y (y) * Sex



0.032
    Age First Alc - 21y (y) * Male 0.119 0.011, 0.229 1.126 1.011, 1.257 0.032

Since we have an interaction, it is helpful to also table the AORs for each term in the interaction at specific levels of the other term. We have already done this for calc_agefirst at each level of demog_sex (0.803 and 0.713 in Results above), but we compute them again here. We now also do this for demog_sex at specific values of calc_agefirst. You can choose any values for calc_agefirst within the range of the data, whatever you feel best illustrates how the AOR varies between meaningfully different predictor values; here we use 15, 18, and 21 years (and subtract 21 from each since we centered this variable at 21 years).

# calc_agefirst at demog_sex = "Male"
EST.M <- gmodels::estimable(fit.ex6.3.int.b,
        c("calc_agefirst"               = 1,
          "calc_agefirst:demog_sexMale" = 1),
        conf.int = 0.95)

# calc_agefirst at demog_sex = "Female"
EST.F <- gmodels::estimable(fit.ex6.3.int.b,
        c("calc_agefirst"               = 1),
        conf.int = 0.95)

# demog_sex effect at calc_agefirst = 15
EST.15 <- gmodels::estimable(fit.ex6.3.int.b,
          c("demog_sexMale"              = 1,
            "calc_agefirst:demog_sexMale" = 15 - 21),
          conf.int = 0.95)

# demog_sex effect at calc_agefirst = 18
EST.18 <- gmodels::estimable(fit.ex6.3.int.b,
          c("demog_sexMale"              = 1,
            "calc_agefirst:demog_sexMale" = 18 - 21),
          conf.int = 0.95)

# demog_sex effect at calc_agefirst = 21
EST.21 <- gmodels::estimable(fit.ex6.3.int.b,
          c("demog_sexMale"              = 1,
            "calc_agefirst:demog_sexMale" = 21 - 21),
          conf.int = 0.95)

# Exponentiate the AORs and 95% CIs but not the p-values
DF <- cbind(
        rbind(exp(EST.F[ c("Estimate", "Lower.CI", "Upper.CI")]),
              exp(EST.M[ c("Estimate", "Lower.CI", "Upper.CI")]),
              exp(EST.15[c("Estimate", "Lower.CI", "Upper.CI")]),
              exp(EST.18[c("Estimate", "Lower.CI", "Upper.CI")]),
              exp(EST.21[c("Estimate", "Lower.CI", "Upper.CI")])),
        # P-values
        rbind(EST.F["Pr(>|X^2|)"],
              EST.M["Pr(>|X^2|)"],
              EST.15["Pr(>|X^2|)"],
              EST.18["Pr(>|X^2|)"],
              EST.21["Pr(>|X^2|)"])
      )

names(DF)    <- c("AOR", "Lower", "Upper", "p-value")

rownames(DF) <- c("Age of first alcohol use effect @ Sex = Female",
                  "Age of first alcohol use effect @ Sex = Male",
                  "Male vs. Female @ Age of first alcohol use = 15y",
                  "Male vs. Female @ Age of first alcohol use = 18y",
                  "Male vs. Female @ Age of first alcohol use = 21y")

round(DF, 3)
##                                                    AOR Lower Upper p-value
## Age of first alcohol use effect @ Sex = Female   0.713 0.656 0.776   0.000
## Age of first alcohol use effect @ Sex = Male     0.803 0.748 0.863   0.000
## Male vs. Female @ Age of first alcohol use = 15y 0.702 0.460 1.072   0.097
## Male vs. Female @ Age of first alcohol use = 18y 1.003 0.723 1.390   0.987
## Male vs. Female @ Age of first alcohol use = 21y 1.432 0.866 2.369   0.156

References

Sjoberg, Daniel D., Joseph Larmarange, Michael Curry, Jessica Lavery, Karissa Whiting, and Emily C. Zabor. 2023. Gtsummary: Presentation-Ready Data Summary and Analytic Result Tables. https://github.com/ddsjoberg/gtsummary.
Sjoberg, Daniel D., Karissa Whiting, Michael Curry, Jessica A. Lavery, and Joseph Larmarange. 2021. “Reproducible Summary Tables with the Gtsummary Package.” The R Journal 13: 570–80. https://doi.org/10.32614/RJ-2021-053.