3.3 Creating a “Table 1”

In most published articles, there is a “Table 1” containing descriptive statistics for the sample. This may include, for example, the mean and standard deviation for continuous variables, the frequency and proportion for categorical variables, and perhaps also the number of missing values.

The brute force method of creating such a table would be to compute each statistic for each variable of interest and then copy and paste the results into a table. Having done this even once you will wish for an easier method! There are many possible solutions, but one that is quick and easy to use is demonstrated here – the gtsummary package (Sjoberg et al. 2021, 2023). Many examples can be found on the gtsummary package GitHub site, in Tutorial:tbl_summary, and in Table Gallery.

Other R packages not covered here that facilitate table creation include flextable (Gohel and Skintzos 2023), tableone (Yoshida and Bartel 2022), and table1 (Rich 2023).
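To make the brute-force approach described above concrete, the following is a minimal base R sketch of computing each statistic by hand. The data frame dat and its values are simulated stand-ins (an assumption for illustration), not the NHANES data used in this chapter.

```r
# Simulated stand-in data (hypothetical values, not the chapter's NHANES data)
set.seed(1)
dat <- data.frame(
  sbp    = c(rnorm(95, mean = 120, sd = 15), rep(NA, 5)),
  gender = factor(sample(c("Male", "Female"), 100, replace = TRUE))
)

# Continuous variable: mean, standard deviation, and number missing
sbp_summary <- c(
  mean    = mean(dat$sbp, na.rm = TRUE),
  sd      = sd(dat$sbp, na.rm = TRUE),
  missing = sum(is.na(dat$sbp))
)
round(sbp_summary, 2)

# Categorical variable: frequency and proportion
gender_n <- table(dat$gender)
gender_p <- prop.table(gender_n)
data.frame(level = names(gender_n),
           n     = as.vector(gender_n),
           pct   = round(100 * as.vector(gender_p), 1))
```

Each result would then have to be copied into a table by hand, and the whole process repeated for every variable, which is the step a table-building package removes.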

3.3.1 Overall

Example 3.1 (continued): Create a table of summary statistics for all the variables we have been summarizing in this chapter for the entire sample (“overall”).

The code below loads the gtsummary library and uses tbl_summary() with default settings to generate Table 3.1.

The default settings produce a table with the frequency and proportion for categorical variables, the median and interquartile range (IQR) for continuous variables (displayed here as the 25th and 75th percentiles, not their difference), and the number of missing values, if any (shown in rows headed by “Unknown”).

library(gtsummary)

nhanes %>% 
  # Select the variables to be included in the table
  select(sbp, RIDAGEYR, RIAGENDR, income) %>%
  tbl_summary()
Table 3.1: Table 1 using tbl_summary() with default settings
Characteristic                         N = 1,000¹
Systolic BP (mean of 2nd and 3rd)      121 (111, 134)
    Unknown                            42
Age in years at screening              47 (32, 61)
Gender
    Male                               482 (48%)
    Female                             518 (52%)
Annual household income
    < $25,000                          156 (18%)
    $25,000 to < $55,000               254 (29%)
    $55,000+                           480 (54%)
    Unknown                            110
¹ Median (IQR); n (%)

The default is to include missing values; this can be overridden by setting the missing option to “no”, as in the code below. The resulting table is not shown, but it is identical to the table above except that the “Unknown” rows are omitted. In particular, it is still based on the full sample size (1000 observations).

nhanes %>% 
  # Select the variables to be included in the table
  select(sbp, RIDAGEYR, RIAGENDR, income) %>%
  tbl_summary(
    missing = "no"
  )
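The counts in Table 3.1 suggest that the percentages for a variable with missing values are computed among the non-missing observations only. The following quick check, which uses only numbers copied from Table 3.1, illustrates this; the object names are hypothetical.

```r
# Counts copied from Table 3.1
n_total   <- 1000
n_unknown <- 110
income_n  <- c("< $25,000"            = 156,
               "$25,000 to < $55,000" = 254,
               "$55,000+"             = 480)

# Denominator = observations with non-missing income
n_nonmissing <- n_total - n_unknown
pct_nonmissing <- round(100 * income_n / n_nonmissing)
pct_nonmissing   # agrees with the 18%, 29%, 54% shown in Table 3.1

# For comparison: using the full sample as the denominator
pct_full <- round(100 * income_n / n_total)
pct_full         # does not agree with Table 3.1
```

So omitting the “Unknown” rows changes what is displayed, not how the displayed percentages are computed.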

To create a table for a complete case analysis sample, start with a complete case analysis dataset, as in the code below (which uses complete.dat, the dataset created in Section 3.2.1). Additional options are demonstrated below. See ?gtsummary::tbl_summary for more options.

  • statistic: The default for continuous variables is the median and IQR. The default for categorical variables is the frequency and proportion. Below, this option is used to instead compute the mean and standard deviation for continuous variables (and the default for categorical variables is coded explicitly).
  • digits: tbl_summary() guesses the number of digits to which to round. Use this option to set the number yourself. Since two statistics are specified in the statistic option, two numbers are specified here, one for each statistic.
  • type: tbl_summary() guesses whether a variable is continuous or categorical based on its distribution. Use this option to specify the types yourself, in particular if you need to override the default.
  • label: The default row labels are the variable names or labels (if the dataset has been labeled, for example, using the Hmisc library label() function). Use this option to change the row headers.
  • modify_header(): The default column header is “Characteristic”. Use this function to change the column header. Surrounding text with ** results in a bold font.
  • modify_caption(): Use this function to add a table caption.
  • bold_labels(): Use this function to display the row labels in a bold font.

The results are shown in Table 3.2.

complete.dat %>% 
  select(sbp, RIDAGEYR, RIAGENDR, income) %>%
  tbl_summary(
    statistic = list(all_continuous()  ~ "{mean} ({sd})",
                     all_categorical() ~ "{n}    ({p}%)"),
    digits = list(all_continuous()  ~ c(2, 2),
                  all_categorical() ~ c(0, 1)),
    type = list(sbp      ~ "continuous",
                RIDAGEYR ~ "continuous",
                RIAGENDR ~ "categorical",
                income   ~ "categorical"),
    label  = list(sbp      ~ "SBP (mmHg)",
                  RIDAGEYR ~ "Age (years)",
                  RIAGENDR ~ "Gender",
                  income   ~ "Annual Income")
  ) %>%
  modify_header(label = "**Variable**") %>%
  modify_caption("Participant characteristics  (complete case analysis)") %>%
  bold_labels()
Table 3.2: Participant characteristics (complete case analysis)
Variable                      N = 855¹
SBP (mmHg)                    123.57 (17.57)
Age (years)                   48.11 (17.43)
Gender
    Male                      425 (49.7%)
    Female                    430 (50.3%)
Annual Income
    < $25,000                 148 (17.3%)
    $25,000 to < $55,000      248 (29.0%)
    $55,000+                  459 (53.7%)
¹ Mean (SD); n (%)

NOTE: If a variable is a factor with exactly two levels labeled “Yes” and “No”, then tbl_summary() by default will only include the row corresponding to “Yes”. The same applies with variables that have values TRUE/FALSE or 1/0. Use the value option to change the row displayed (see ?gtsummary::tbl_summary for details). Alternatively, set type to “categorical” to display both rows.
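The behavior described in the NOTE can be sketched as follows. The data frame toy and its smoker variable are hypothetical and simulated for illustration; they are not part of the NHANES example.

```r
library(dplyr)
library(gtsummary)

# Hypothetical data: a factor with exactly two levels, "No" and "Yes"
set.seed(2)
toy <- data.frame(
  smoker = factor(sample(c("No", "Yes"), 50, replace = TRUE),
                  levels = c("No", "Yes"))
)

# Default: a single row summarizing "Yes"
tbl_default <- toy %>% tbl_summary()

# value: display the "No" row instead
tbl_no <- toy %>% tbl_summary(value = list(smoker ~ "No"))

# type = "categorical": display one row per level
tbl_both <- toy %>% tbl_summary(type = list(smoker ~ "categorical"))
tbl_both
```

Setting type is often the safer choice in a Table 1, since readers then see both levels and can confirm that the percentages sum to 100%.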

3.3.2 By outcome or exposure

In many published research articles, descriptive statistics are presented not only “overall” (over the entire sample) but also by the outcome or exposure. If the “by” variable is continuous then, for the purpose of the descriptive table only, create a categorical version with as many levels as you would like “by” columns, where each level corresponds to a range of values. A common method, demonstrated here, is to use a median split in which a binary variable is created based on whether the value of the continuous variable is below or at least as large as the median value. This results in a “by” variable with two levels and approximately equal sample sizes in each level.

To create a table of descriptive statistics by outcome or exposure, use the by argument. To also include a column with the “overall” summaries, use add_overall(). To stratify by more than one variable, use tbl_strata() (see ?gtsummary::tbl_strata for more information).
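As a rough sketch of stratifying by more than one variable, the code below follows the pattern shown in the tbl_strata() help page. It uses trial, an example dataset that ships with gtsummary, rather than the NHANES data, so the variable names (age, grade, stage, trt) are not from this chapter.

```r
library(dplyr)
library(gtsummary)

# `trial` is gtsummary's bundled example dataset (not the NHANES data)
tbl_by_two <- trial %>%
  select(age, grade, stage, trt) %>%
  # Make the strata headers more readable
  mutate(grade = paste("Grade", grade)) %>%
  tbl_strata(
    # Outer grouping variable: one block of columns per grade
    strata = grade,
    # Within each stratum, build a table by treatment group
    .tbl_fun = ~ .x %>%
      tbl_summary(by = trt, missing = "no")
  )
tbl_by_two
```

The function passed to .tbl_fun is applied to each stratum's subset of the data, and the resulting tables are merged side by side.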

Categorical outcome or exposure

Example 3.1 (continued): Create a table of summary statistics overall and by gender.

The following code produces Table 3.3 displaying descriptive statistics by gender.

NOTES:

  • The all_stat_cols() option in modify_header() adds the frequency and proportion of the “by” variable in the column header.
  • The “by” variable (RIAGENDR) was omitted from the type and label options since leaving it in results in an error.
  • The code below illustrates how to assign the table to an object (TABLE1), and then view it by typing the name of the object. This is not actually necessary for this example, but will facilitate exporting the table to an external file (see Section 3.3.3).
TABLE1 <- complete.dat %>% 
  select(sbp, RIDAGEYR, RIAGENDR, income) %>%
  tbl_summary(
    # The "by" variable
    by = RIAGENDR,
    statistic = list(all_continuous()  ~ "{mean} ({sd})",
                     all_categorical() ~ "{n}    ({p}%)"),
    digits = list(all_continuous()  ~ c(2, 2),
                  all_categorical() ~ c(0, 1)),
    type = list(sbp      ~ "continuous",
                RIDAGEYR ~ "continuous",
                income   ~ "categorical"),
    label  = list(sbp      ~ "SBP (mmHg)",
                  RIDAGEYR ~ "Age (years)",
                  income   ~ "Annual Income")
  ) %>%
  modify_header(
    label = "**Variable**",
    # The following adds the % to the column total label
    # <br> is the location of a line break
    all_stat_cols() ~ "**{level}**<br>N = {n} ({style_percent(p, digits=1)}%)"
  ) %>%
  modify_caption("Participant characteristics, by gender") %>%
  bold_labels()  %>%
  # Include an "overall" column
  add_overall(
    last = FALSE,
    # The ** make it bold
    col_label = "**All participants**<br>N = {N}"
  )
TABLE1
Table 3.3: Participant characteristics, by gender
Variable                      All participants    Male                Female
                              N = 855¹            N = 425 (49.7%)¹    N = 430 (50.3%)¹
SBP (mmHg)                    123.57 (17.57)      124.55 (16.01)      122.60 (18.96)
Age (years)                   48.11 (17.43)       47.26 (17.38)       48.94 (17.45)
Annual Income
    < $25,000                 148 (17.3%)         65 (15.3%)          83 (19.3%)
    $25,000 to < $55,000      248 (29.0%)         127 (29.9%)         121 (28.1%)
    $55,000+                  459 (53.7%)         233 (54.8%)         226 (52.6%)
¹ Mean (SD); n (%)

Median split for a continuous outcome or exposure

Example 3.1 (continued): Create a table of summary statistics overall and by SBP, using a median split to create two SBP groups.

The code below creates a dichotomous version of sbp based on a median split and then uses this new variable as the by variable to produce Table 3.4.

MEDIAN <- median(complete.dat$sbp)
LABEL0 <- paste("SBP <", MEDIAN)
LABEL1 <- paste("SBP >=", MEDIAN)

complete.dat <- complete.dat %>% 
  mutate(sbp_median_split = as.numeric(sbp >= MEDIAN),
         sbp_median_split = factor(sbp_median_split,
                                   levels = 0:1,
                                   labels = c(LABEL0, LABEL1)))
# Checking derivation
tapply(complete.dat$sbp, complete.dat$sbp_median_split, range)
## $`SBP < 121`
## [1]  89 120
## 
## $`SBP >= 121`
## [1] 121 234
# Create table
TABLE1 <- complete.dat %>% 
  # Select the median split variable, not the original variable
  select(sbp_median_split, RIDAGEYR, RIAGENDR, income) %>%
  tbl_summary(
    # Use the median split variable as the "by" variable
    by = sbp_median_split,
    statistic = list(all_continuous()  ~ "{mean} ({sd})",
                     all_categorical() ~ "{n}    ({p}%)"),
    digits = list(all_continuous()  ~ c(2, 2),
                  all_categorical() ~ c(0, 1)),
    type = list(RIDAGEYR ~ "continuous",
                RIAGENDR ~ "categorical",
                income   ~ "categorical"),
    label  = list(RIDAGEYR ~ "Age (years)",
                  RIAGENDR ~ "Gender",
                  income   ~ "Annual Income")
  ) %>%
  modify_header(
    label = "**Variable**",
    all_stat_cols() ~ "**{level}**<br>N = {n} ({style_percent(p, digits=1)}%)"
  ) %>%
  modify_caption("Participant characteristics, by SBP") %>%
  bold_labels() %>% 
  add_overall(last = FALSE,
              col_label = "**All participants**<br>N = {N}")
TABLE1
Table 3.4: Participant characteristics, by SBP
Variable                      All participants    SBP < 121           SBP >= 121
                              N = 855¹            N = 412 (48.2%)¹    N = 443 (51.8%)¹
Age (years)                   48.11 (17.43)       41.57 (15.84)       54.19 (16.63)
Gender
    Male                      425 (49.7%)         180 (43.7%)         245 (55.3%)
    Female                    430 (50.3%)         232 (56.3%)         198 (44.7%)
Annual Income
    < $25,000                 148 (17.3%)         67 (16.3%)          81 (18.3%)
    $25,000 to < $55,000      248 (29.0%)         120 (29.1%)         128 (28.9%)
    $55,000+                  459 (53.7%)         225 (54.6%)         234 (52.8%)
¹ Mean (SD); n (%)
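The same idea extends to more than two “by” columns. The following is a minimal base R sketch that splits a continuous variable into three groups at its tertiles. The sbp values here are simulated (hypothetical), and the use of cut() with quantile() is one possible approach rather than the method used in this chapter.

```r
# Hypothetical SBP values (simulated, not the NHANES data)
set.seed(3)
sbp <- round(rnorm(300, mean = 123, sd = 17))

# Cutpoints at the minimum, the two tertiles, and the maximum
cuts <- quantile(sbp, probs = c(0, 1/3, 2/3, 1))

# right = FALSE gives intervals of the form [lower, upper), mirroring the
# ">= cutpoint" convention of the median split; include.lowest = TRUE
# closes the last interval so the maximum value is not dropped
sbp_tertile <- cut(sbp, breaks = cuts, include.lowest = TRUE, right = FALSE)

# Checking derivation
tapply(sbp, sbp_tertile, range)
table(sbp_tertile, useNA = "ifany")
```

Once added to the dataset, a variable like sbp_tertile could be supplied to the by argument in the same way as the median split variable.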

3.3.3 Exporting to an external file

To export a gtsummary table to a Microsoft Word or HTML file, use the following syntax which starts with the tbl_summary object (called TABLE1 above) and then uses the flextable (Gohel and Skintzos 2023) or gt (Iannone et al. 2023) package to do the exporting.

# Make sure these are installed:
# install.packages(c("Rcpp", "gtsummary", "flextable", "gt"))

TABLE1 %>% 
  as_flex_table() %>% 
  flextable::save_as_docx(path = "MyTable1.docx")

TABLE1 %>% 
  as_gt() %>% 
  gt::gtsave(filename = "MyTable1.html")
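If a plain data file is preferred over a formatted document, one possible alternative (not covered above) is to convert the table to a data frame and write it with base R. This sketch assumes the as.data.frame() method that gtsummary provides for its table objects, and it builds a stand-in table from gtsummary's bundled trial dataset in place of TABLE1.

```r
library(gtsummary)

# Stand-in table from gtsummary's bundled `trial` data;
# in this chapter the object would be TABLE1
tbl <- tbl_summary(trial, include = c(age, grade), by = trt)

# Convert the formatted table to a plain data frame of character columns
tbl_df <- as.data.frame(tbl)

# Write to a CSV file (a temporary folder is used here for illustration)
out_file <- file.path(tempdir(), "MyTable1.csv")
write.csv(tbl_df, out_file, row.names = FALSE)
```

The CSV file keeps the cell contents but none of the formatting, such as bold labels or footnotes.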

3.3.4 Adding p-values to Table 1

Often, in a published research article, a Table 1 that displays descriptive statistics by the outcome or an exposure includes p-values that test, for each variable, the null hypothesis that the variable has the same mean (or median or proportion) across all groups in the population. P-values can be easily added to a tbl_summary() table using add_p() (see ?gtsummary::add_p.tbl_summary for all the options, including the various statistical tests available).

Example 3.1 (continued): Create a table of descriptive statistics, by SBP, including t-tests for continuous variables and chi-square tests for categorical variables.

The code below uses tbl_summary() with the by argument, followed by add_p() with the tests specified explicitly, to generate Table 3.5.

complete.dat %>% 
  select(sbp_median_split, RIDAGEYR, RIAGENDR, income) %>%
  tbl_summary(
    by = sbp_median_split,
    statistic = list(all_continuous()  ~ "{mean} ({sd})",
                     all_categorical() ~ "{n}    ({p}%)"),
    digits = list(all_continuous()  ~ c(2, 2),
                  all_categorical() ~ c(0, 1))
  ) %>% 
  add_p(
    test = list(all_continuous()  ~ "t.test",
                all_categorical() ~ "chisq.test"),
    pvalue_fun = function(x) style_pvalue(x, digits = 3)
  )
Table 3.5: Participant characteristics, by SBP, including p-values
Characteristic                SBP < 121, N = 412¹    SBP >= 121, N = 443¹    p-value²
Age in years at screening     41.57 (15.84)          54.19 (16.63)           <0.001
Gender                                                                       <0.001
    Male                      180 (43.7%)            245 (55.3%)
    Female                    232 (56.3%)            198 (44.7%)
Annual household income                                                      0.728
    < $25,000                 67 (16.3%)             81 (18.3%)
    $25,000 to < $55,000      120 (29.1%)            128 (28.9%)
    $55,000+                  225 (54.6%)            234 (52.8%)
¹ Mean (SD); n (%)
² Welch Two Sample t-test; Pearson’s Chi-squared test
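To see what the chi-square p-values in Table 3.5 correspond to, the sketch below applies base R's chisq.test() directly to the cell counts copied from the table. The object names are hypothetical. Note that chisq.test() applies a continuity correction to 2 x 2 tables by default; whether add_p() does the same for the gender row is an assumption here.

```r
# Counts copied from Table 3.5
# (rows = levels, columns = SBP < 121 and SBP >= 121)
gender_counts <- matrix(c(180, 245,
                          232, 198),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(c("Male", "Female"),
                                        c("SBP < 121", "SBP >= 121")))

income_counts <- matrix(c( 67,  81,
                          120, 128,
                          225, 234),
                        nrow = 3, byrow = TRUE,
                        dimnames = list(c("< $25,000",
                                          "$25,000 to < $55,000",
                                          "$55,000+"),
                                        c("SBP < 121", "SBP >= 121")))

# Pearson's chi-squared tests of independence
p_gender <- chisq.test(gender_counts)$p.value
p_income <- chisq.test(income_counts)$p.value

p_gender < 0.001      # compare with "<0.001" in Table 3.5
round(p_income, 3)    # compare with 0.728 in Table 3.5
```

The t-test for age cannot be reproduced exactly in this way, since it requires the individual observations rather than the rounded means and standard deviations shown in the table.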

3.3.5 Should p-values be added to a Table 1?

It is very easy to add p-values to a Table 1, but are they recommended? In general, no, regardless of whether the data arose from an observational study (Vandenbroucke et al. 2007) or a randomized trial (Moher et al. 2010).

If displaying sample characteristics by the outcome for the purpose of providing crude (unadjusted) tests of association between a set of predictors and the outcome, then including p-values in Table 1 does make some sense. However, typically the subsequent regression analysis will provide adjusted tests of association, and conclusions will be drawn from that adjusted analysis, so unadjusted tests may not be relevant.

A potential reason for displaying p-values in a Table 1 of participant characteristics by the primary exposure of interest is to attempt to demonstrate the extent to which the characteristics differ between exposure groups and may therefore confound the outcome-exposure relationship. But what matters for confounding is not if the groups differ in the population (which is what the p-values are testing) but how much they differ in this sample. P-values, in this context, are not relevant; what matters are the magnitudes of differences in characteristics between the exposure groups, as well as the magnitude of association between characteristics and the outcome (Vandenbroucke et al. 2007).

Yes, p-values are related to the magnitude of differences, but they also depend heavily on the sample size. In a small sample, even a meaningfully large difference might not lead to a small p-value, resulting in an incorrect conclusion of “no confounding.” Conversely, in a large sample, even a small, non-meaningful difference might result in a small p-value, resulting in an incorrect conclusion of “confounding.” Yet another error can occur in the case of a small, seemingly non-meaningful, difference that is not statistically significant but which is for a predictor that is very strongly associated with the outcome. The p-value would lead to a conclusion of “no confounding” yet even a small difference between exposure groups in a predictor strongly associated with the outcome can result in meaningful confounding (Dales and Ury 1978; Vandenbroucke et al. 2007). Even worse, if the exposure groups were determined using randomization (e.g., a randomized clinical trial), then we already know the null hypothesis is true, so p-values are irrelevant (Moher et al. 2010; Altman 1985; Senn 1994). Under randomization, any differences observed between groups in the sample are entirely due to randomness, not to any underlying difference between the groups.
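The dependence of the p-value on sample size can be illustrated with a small sketch. The counts below are hypothetical: the same observed difference (45% vs. 55% with some characteristic) is tested at two different sample sizes, and p_value_for() is a helper written for this illustration only.

```r
# Same observed proportions in two exposure groups (45% vs. 55%),
# evaluated at different numbers of participants per group
p_value_for <- function(n_per_group) {
  counts <- matrix(round(c(0.45, 0.55,
                           0.55, 0.45) * n_per_group),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("Group A", "Group B"),
                                   c("Yes", "No")))
  chisq.test(counts)$p.value
}

p_small <- p_value_for(40)     # 18 vs. 22 out of 40 per group
p_large <- p_value_for(4000)   # 1800 vs. 2200 out of 4000 per group

c(n_40 = p_small, n_4000 = p_large)
```

The magnitude of the difference between groups is identical in the two scenarios, yet only the larger sample yields a small p-value.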

Thus, using p-values to provide evidence for or against confounding can be misleading or even nonsensical. In a confirmatory analysis (see Section 5.22), potential confounders are identified using subject-matter knowledge based on prior research and included in a regression model regardless of their observed associations.

In summary, including p-values in a Table 1 is easy to do, but may not be relevant and can, at times, be misleading or meaningless.

References

Altman, Douglas G. 1985. “Comparability of Randomised Groups.” Journal of the Royal Statistical Society. Series D (The Statistician) 34 (1): 125–36. https://doi.org/10.2307/2987510.
Dales, L. G., and H. K. Ury. 1978. “An Improper Use of Statistical Significance Testing in Studying Covariables.” International Journal of Epidemiology 7 (4): 373–75. https://doi.org/10.1093/ije/7.4.373.
Gohel, David, and Panagiotis Skintzos. 2023. Flextable: Functions for Tabular Reporting. https://ardata-fr.github.io/flextable-book/.
Iannone, Richard, Joe Cheng, Barret Schloerke, Ellis Hughes, Alexandra Lauer, and JooYoung Seo. 2023. Gt: Easily Create Presentation-Ready Display Tables. https://gt.rstudio.com.
Moher, D., S. Hopewell, K. F. Schulz, V. Montori, P. C. Gotzsche, P. J. Devereaux, D. Elbourne, M. Egger, and D. G. Altman. 2010. “CONSORT 2010 Explanation and Elaboration: Updated Guidelines for Reporting Parallel Group Randomised Trials.” BMJ 340 (March): c869. https://doi.org/10.1136/bmj.c869.
Rich, Benjamin. 2023. Table1: Tables of Descriptive Statistics in HTML. https://github.com/benjaminrich/table1.
Senn, S. 1994. “Testing for Baseline Balance in Clinical Trials.” Statistics in Medicine 13 (17): 1715–26. https://doi.org/10.1002/sim.4780131703.
Sjoberg, Daniel D., Joseph Larmarange, Michael Curry, Jessica Lavery, Karissa Whiting, and Emily C. Zabor. 2023. Gtsummary: Presentation-Ready Data Summary and Analytic Result Tables. https://github.com/ddsjoberg/gtsummary.
Sjoberg, Daniel D., Karissa Whiting, Michael Curry, Jessica A. Lavery, and Joseph Larmarange. 2021. “Reproducible Summary Tables with the Gtsummary Package.” The R Journal 13: 570–80. https://doi.org/10.32614/RJ-2021-053.
Vandenbroucke, Jan P., Erik von Elm, Douglas G. Altman, Peter C. Gøtzsche, Cynthia D. Mulrow, Stuart J. Pocock, Charles Poole, James J. Schlesselman, Matthias Egger, and for the STROBE Initiative. 2007. “Strengthening the Reporting of Observational Studies in Epidemiology (STROBE): Explanation and Elaboration.” Epidemiology 18 (6): 805–35. https://doi.org/10.1097/EDE.0b013e3181577511.
Yoshida, Kazuki, and Alexander Bartel. 2022. Tableone: Create Table 1 to Describe Baseline Characteristics with or Without Propensity Score Weights. https://github.com/kaz-yos/tableone.