4.3 SLR model with a continuous predictor

Now you are ready to use R to fit your first regression model! To fit a linear regression, use the lm() function (“lm” stands for “linear model”). The syntax is lm(y ~ x, data = dat) where dat is a data.frame containing variables with the names y and x. If your dataset, outcome, and/or predictor have different names, then use those names instead of dat, y, and/or x, respectively. After fitting the model, use summary() to view the output.

Example 4.1 (continued): Regress fasting glucose on the continuous predictor waist circumference. Typically, the output of lm() is assigned to an object, in this case fit.ex4.1. The name you give the object is arbitrary. Assigning the output to an object allows you to use summary() to later view formatted output and to use $ notation to extract elements of the output.

fit.ex4.1 <- lm(LBDGLUSI ~ BMXWAIST, data = nhanesf)
summary(fit.ex4.1)
## 
## Call:
## lm(formula = LBDGLUSI ~ BMXWAIST, data = nhanesf)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.012 -0.685 -0.280  0.168 13.184 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   3.3041     0.2950   11.20 <0.0000000000000002 ***
## BMXWAIST      0.0278     0.0029    9.59 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.52 on 963 degrees of freedom
##   (35 observations deleted due to missingness)
## Multiple R-squared:  0.0871, Adjusted R-squared:  0.0862 
## F-statistic: 91.9 on 1 and 963 DF,  p-value: <0.0000000000000002

Estimated regression coefficients

The Coefficients: section of the output contains estimates of and hypothesis tests for the intercept \((\beta_0)\), labeled (Intercept), and the coefficient for waist circumference (the slope, \(\beta_1\)), labeled BMXWAIST. The Estimate column displays the estimated regression coefficients. Thus, the best fit regression line has an intercept of 3.304 and a slope of 0.0278. If you want to extract just the Coefficients table, use summary(fit.ex4.1)$coef. Additionally, you can extract any element of the table using bracket [] notation or row and column headers, as shown below.

summary(fit.ex4.1)$coef
##             Estimate Std. Error t value                         Pr(>|t|)
## (Intercept)  3.30409   0.295020  11.200 0.000000000000000000000000001873
## BMXWAIST     0.02776   0.002895   9.588 0.000000000000000000007392885480
# Extract the slope using [] notation
summary(fit.ex4.1)$coef[2,1]
## [1] 0.02776
# Extract the slope using row and column headers
summary(fit.ex4.1)$coef["BMXWAIST", "Estimate"]
## [1] 0.02776
# Extract the p-value for the slope
summary(fit.ex4.1)$coef["BMXWAIST", "Pr(>|t|)"]
## [1] 0.000000000000000000007393

The Coefficients table also displays the standard error of each Estimate, its t value (= Estimate/Std. Error), and its p-value (labeled Pr(>|t|)). The t value tells you how many standard deviations away from 0 the estimate falls. The p-value for the intercept is usually not of interest, but the p-value for the predictor tests the null hypothesis that the outcome has no association with the predictor or, equivalently, that the slope is zero. The null hypothesis is that the best fit line is a horizontal line, indicating that the expected mean outcome is the same at all values of the predictor (that is, there is no association between the outcome and the predictor). Using the expected value notation, if fasting glucose and waist circumference are unrelated, then \(E(\textrm{FG}|\textrm{WC}) = E(\textrm{FG})\), that is, the average fasting glucose does not vary with waist circumference.

How is the p-value interpreted?

The p-value is computed under two assumptions: (1) Equation (4.1) is the model that is generating the data and (2) the slope (\(\beta_1\)) is zero. Under these conditions, the p-value is the probability of observing a sample that results in a slope with a magnitude at least as large as what was observed. In other words, the p-value answer the question, “If there really is no association between the outcome and predictor, how likely is it that we would observe an association at least this strong?”

In this example, p < .001 which means that if the true model really is a horizontal line there is very little chance we would observe such a steep slope (such a strong association). In epidemiology, the p-value is often said to be used to attempt to “rule out chance” as an explanation for an observed association. The p-value, however, is not the probability that random chance produced the observed association (a common misconception). More specifically, it is not the probability that the null hypothesis of no association is true. Rather, the p-value is the probability of observing a slope so large assuming the null is true (assuming chance, or assuming no association). You cannot compute the probability of something you are assuming to be true, but you can assume it is true and then compute the probability of observing an extreme result.

The remainder of the summary indicates the following:

  • Residual standard (SE) error: The Residual standard error is 1.523. This is an estimate of \(\sigma\), the standard deviation of the model error term \(\epsilon\). To extract this estimate from the output, use summary(fit.ex4.1)$sigma.
  • Residual SE degrees of freedom (df): The residual SE df is 963 and corresponds to the sample size minus the number of regression parameters. In this example, there are 2 regression parameters (the intercept and the slope). In general, the number of regression parameters is the number of rows in the Coefficients section. Therefore, the sample size with no missing data is 963 + 2 = 965. To extract the residual SE df from the output, use fit.ex4.1$df.residual.
  • Number of observations excluded: A note that 35 observations that were deleted due to “missingness” – they had a missing value for at least one of the variables in the regression model. To extract the sample size used in the analysis (the number of observations in the dataset with no missing values on any analysis variable), use length(fit.ex4.1$residuals).
  • Multiple R2: The Multiple R-squared value (.08714), which is a measure of goodness of fit ranging from 0 (no association) to 1 (perfect linear association), and is the square of the Pearson correlation between the outcome and predictor. It is interpreted as the proportion of variation in the outcome explained by the model. To extract R2 from the output, use summary(fit.ex4.1)$r.squared.
  • Adjusted R2: The Adjusted R-squared (.0862) is an alternative that penalizes R2 based on how many predictors are in the model. To extract adjusted R2 from the output, use summary(fit.ex4.1)$adj.r.squared.
  • Global F test: The F-statistic: section provides a global F test of the null hypothesis that all the coefficients other than the intercept are 0, and this section can typically be ignored. For SLR with a continuous predictor, this provides no new information – the F statistic is the square of the slope’s t value, and the F statistic p-value is identical to the slope’s p-value. When there is more than one predictor, as there will be in Chapter 5, the global F test is usually not of interest. Typically, we are interested in tests of individual predictors, not a comparison of a model with all the predictors to a model with none. To extract the F statistic and corresponding df values, use summary(fit.ex4.1)$fstatistic.

Use confint() to get confidence intervals (CI) for the intercept and slope. It is always a good idea to report a CI along with any estimate.

confint(fit.ex4.1)
##               2.5 %  97.5 %
## (Intercept) 2.72514 3.88305
## BMXWAIST    0.02208 0.03344

4.3.1 Writing it up

Below is an example of how to write up these results for a report or publication.

Example 4.1 (continued): Linear regression was used to test the association between fasting glucose (mmol/L) and waist circumference (cm) using data from 965 adults from NHANES 2017-2018. 8.71% of the variation in fasting glucose was explained by waist circumference (R2 = .0871). There was a significant positive association between fasting glucose and waist circumference (B = 0.0278; 95% CI = 0.0221, 0.0334; p < .001). On average, for every 1-cm difference in waist circumference, adults differ in mean fasting glucose by 0.0278 mmol/L. Table 4.1 summarizes the regression results.

In Chapter 3 tbl_summary() from the gtsummary library (Sjoberg et al. 2021, 2023) was used to produce a table of descriptive statistics. The tbl_regression() function easily creates nice regression tables (see Tutorial:tbl_regression and Table Gallery for more information).

library(gtsummary)
fit.ex4.1 %>%
  tbl_regression(intercept = T,
                 estimate_fun = function(x) style_sigfig(x, digits = 4),
                 pvalue_fun   = function(x) style_pvalue(x, digits = 3)) %>%
  modify_caption("Regression of fasting glucose (mmol/L) on waist circumference")
Table 4.1: Regression of fasting glucose (mmol/L) on waist circumference
Characteristic Beta 95% CI1 p-value
(Intercept) 3.304 2.725, 3.883 <0.001
Waist Circumference (cm) 0.0278 0.0221, 0.0334 <0.001
1 CI = Confidence Interval

To save the table to a Microsoft Word document, use the following code.

fit.ex4.1 %>%
  tbl_regression(intercept = T,
                 estimate_fun = function(x) style_sigfig(x, digits = 4),
                 pvalue_fun   = function(x) style_pvalue(x, digits = 3)) %>%
  as_flex_table() %>%
  flextable::save_as_docx(path = "MyTable2.docx")

NOTES:

  • This is cross-sectional data, so the write-up was carefully worded to not make any kind of causal claim. We did not write that a “change” or “increase” in waist circumference “causes” or “leads to” greater fasting glucose. We specifically use the words “differ” or “difference” because in cross-sectional data we can only draw conclusions about differences in the mean outcome between individuals with different values of the predictor, not what happens as an individual changes over time.
  • As described in Appendix A.1, the NHANES data we are using are a subsample of the full data and intended only for use in teaching regression methods. No conclusions should be drawn from these analyses about the outcomes we are analyzing. Also, the NHANES complex survey design was not accounted for in these analyses; we will discuss how to account for complex survey designs in Chapter 8.
  • Even if we had used the full NHANES data and appropriately accounted for the complex survey design, we would still generally not want to draw any substantive conclusions from a simple linear regression as this is observational data. This is an unadjusted analysis and may be biased due to confounding. We will use multiple linear regression later to carry out adjusted analyses in Chapter 5. This is one of the main features of regression that makes it useful in epidemiological research – the ability to control for confounding by adjusting for potential confounders in a multiple regression model.

4.3.2 Centering a continuous predictor

“Centering” refers to replacing a continuous predictor with its value after subtracting off some value, typically the mean or median (although you can center at any number). The primary purpose of centering is to make the intercept interpretable as the mean outcome when the predictor is at its centering value.

To center a continuous predictor \(X\) at its mean, compute \(cX = X - \bar{X}\) where \(\bar{X}\) is the mean of \(X\) and enter \(cX\) into the model rather than \(X\). To center at a different value, replace \(\bar{X}\) with that value (e.g., the median).

Example 4.1 (continued): Refit the regression of fasting glucose on waist circumference after centering waist circumference at its mean.

MEAN <- mean(nhanesf$BMXWAIST, na.rm = T)
nhanesf$cBMXWAIST <- nhanesf$BMXWAIST - MEAN
fit.ex4.1.centered <- lm(LBDGLUSI ~ cBMXWAIST, data = nhanesf)
MEAN
## [1] 100.5

The uncentered output was:

round(summary(fit.ex4.1)$coef, 6)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.30409   0.295020  11.200        0
## BMXWAIST     0.02776   0.002895   9.588        0

and the new, centered output is:

round(summary(fit.ex4.1.centered)$coef, 6)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  6.09342   0.049041 124.251        0
## cBMXWAIST    0.02776   0.002895   9.588        0

The only difference between these two outputs is the intercept. Without centering, the intercept was the mean fasting glucose at waist circumference = 0 cm, which was not meaningful since no such humans exist (or could exist!). After centering, the intercept is the mean fasting glucose at waist circumference = 100 cm. Centering waist circumference made the intercept interpretable but had no effect on the regression coefficient for waist circumference.

References

Sjoberg, Daniel D., Joseph Larmarange, Michael Curry, Jessica Lavery, Karissa Whiting, and Emily C. Zabor. 2023. Gtsummary: Presentation-Ready Data Summary and Analytic Result Tables. https://github.com/ddsjoberg/gtsummary.
Sjoberg, Daniel D., Karissa Whiting, Michael Curry, Jessica A. Lavery, and Joseph Larmarange. 2021. “Reproducible Summary Tables with the Gtsummary Package.” The R Journal 13: 570–80. https://doi.org/10.32614/RJ-2021-053.