Chapter 5 Simple linear regression

5.1 Introduction

In general, regression is used to estimate the association between an outcome \((Y)\) and one or more predictors \((X_1,X_2,…,X_p)\). Linear regression is used when \(Y\) is continuous and approximately normally distributed. The term simple linear regression (SLR) is typically used for the setting where there is only one predictor (X) and that predictor is continuous. In this text, I will use the term SLR to refer to any linear regression with just one predictor, whether that predictor is continuous or categorical, and also for the situation where the predictor is modeled as a curve (e.g., quadratic polynomial). In Chapter 6 we will extend SLR to the case of multiple predictors.

This chapter is not meant to be an exhaustive treatment of SLR. Rather, it is meant to give just enough of an introduction to the concepts of linear regression to set the stage for multiple linear regression, at which point we will go into more detail on aspects such as getting predictions from the model and checking assumptions.

If you are only interested in a measure of association between the outcome and a continuous predictor, then estimating their correlation is sufficient. SLR allows you to estimate not only their correlation, but also the best-fitting line describing the linear relationship between the variables. In the case of a categorical predictor, SLR will estimate \(E(Y|X=x)\), the mean outcome at each level of the predictor. The notation \(E(Y)\) refers to the “expected value,” or mean, of a random variable \(Y\); the notation \(E(Y|X=x)\) denotes the mean of \(Y\) given that \(X\) is equal to the value \(x\).

Example 1: What is the linear relationship between total cholesterol and age among adult participants in NHANES 2017-2018? Cholesterol is the outcome and age is a continuous predictor.

load("Data/NHANES-2017-2018-sub20.Rdata")
library(tidyverse)
nhanes %>% 
  ggplot(aes(y = LBXTC, x = RIDAGEYR)) +
  geom_point(col = "darkgray") +
  geom_smooth(method = "lm", se = F, col = "black", lwd = 2) +
  labs(x = "Age (years)", y = "Total Cholesterol (mg/dL)")
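As noted in the introduction, if all you want is a single measure of the linear association visualized above, the Pearson correlation is enough. A minimal sketch using the same two variables (its square should match the Multiple R-squared from the SLR fit later in this chapter):

# Pearson correlation between total cholesterol and age,
# using only cases with both values observed
cor(nhanes$LBXTC, nhanes$RIDAGEYR, use = "complete.obs")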

When the predictor is categorical, rather than fit a straight line, SLR estimates the mean \(Y\) at each level of \(X\).

NOTE: If you are doing a complete case analysis, make sure to use the dataset from which all cases with missing data have been removed.

Example 2: What is the relationship between systolic blood pressure (SBP) and smoking (never, past, current) among adult participants in NHANES 2017-2018? SBP is the outcome and smoking status is a categorical predictor.

# Compute mean Y by X
MEANS <- nhanes %>% 
  group_by(smoker) %>%
  # Add na.rm=T in case there are missing values
  summarize(mean = mean(BPXSY1, na.rm = T))
# Plot
nhanes %>% 
  ggplot(aes(y = BPXSY1, x = smoker)) +
  geom_point(col = "darkgray") +
  geom_point(data = MEANS, aes(y = mean, x = smoker), col = "black", cex = 5) +
  geom_line( data = MEANS, aes(y = mean, x = smoker, group = 1)) +
  labs(x = "Smoking Status", y = "Systolic BP (mm Hg)") +
  theme(legend.position = "none")

SLR in this case estimates the mean \(Y\) at each level of \(X\), as shown by the large dots above. The line connecting them is not estimated by SLR; it is just added here to demonstrate that with a categorical predictor the means do not necessarily fall on a straight line.

5.2 Notation and interpretation

The data are \(n\) pairs of observed values of the outcome \(Y\) and predictor \(X\). The data for the \(i^{th}\) case are denoted \((y_i,x_i)\). If \(X\) is continuous, the linear regression model is written \(Y = \beta_0 + \beta_1 X + \epsilon\) where \(\beta_0\) and \(\beta_1\) are the intercept and slope, respectively, of the best fit line, and \(\epsilon\) is the error term (its estimated values are the residuals), which is assumed to have a normal distribution with mean zero and the same variance at all values of \(X\). We denote this assumption about the error term using the notation \(\epsilon \sim N(0, \sigma^2)\). This model assumes that the relationship between \(Y\) and \(X\) is linear; the error term captures random measurement error and model misspecification (e.g., if there really should be additional predictors in the model, or if \(X\) should be modeled using a curve).

For a continuous predictor, the intercept (\(\beta_0\)) is interpreted as the mean (average) outcome (\(Y\)) when \(X=0\) and the slope \((\beta_1)\) is interpreted as the difference in mean outcome associated with a one-unit difference in \(X\).

Example 1 (continued): For the regression of total cholesterol on age, \(\beta_0\) corresponds to the mean total cholesterol among those with age = 0. In this case, that is not a meaningful quantity, but that does not mean there is anything wrong with the regression fit. Later, we will discuss centering a predictor so the intercept is more meaningful. \(\beta_1\) corresponds to the slope of the best fit line.

When the predictor is categorical, the notation is very different. Consider a categorical predictor \(X\) with three levels 1, 2, and 3. The linear regression model is written

\[Y = \beta_0 + \beta_1 I(X=2) + \beta_2 I(X=3) + \epsilon\] where the indicator function \(I\)(logical condition) equals 1 when the condition is true and 0 when it is false. Thus,

\[\begin{array}{rcl} E(Y|X=1) & = & \beta_0 \\ E(Y|X=2) & = & \beta_0 + \beta_1 \\ E(Y|X=3) & = & \beta_0 + \beta_2 \end{array}\]

Thus, for a categorical predictor, the individual \(\beta\) parameters do not correspond to the intercept and slope of a line. Rather, \(\beta_0\) corresponds to the mean outcome at the reference level and the other \(\beta\)s correspond to differences in the mean outcome between other levels and the reference level, as shown below:

\[\begin{array}{rcl} \beta_0 & = & E(Y|X=1) \\ \beta_1 & = & E(Y|X=2) - E(Y|X=1) \\ \beta_2 & = & E(Y|X=3) - E(Y|X=1) \end{array}\]

This difference in interpretation of regression coefficients between the continuous and categorical predictor settings is very important and will come up over and over again throughout this text.

Example 2 (continued): Using “Never” as the reference level, the regression of SBP on smoking status can be written as

\[Y = \beta_0 + \beta_1 I(X=Past) + \beta_2 I(X=Current) + \epsilon\]

\(\beta_0\) corresponds to the mean SBP among never smokers, \(\beta_1\) corresponds to the difference in mean SBP between past and never smokers, and \(\beta_2\) corresponds to the difference in mean SBP between current and never smokers.

The indicator functions \(I()\) are also referred to as dummy variables. If a variable is coded as a factor in R then the creation of dummy variables is done for you automatically when you run a regression. For a categorical variable with \(k\) levels, \(k-1\) dummy variables (each a 0/1 variable) are created, one for each level other than the reference level, and automatically included in the regression model. For example, for smoking status, two dummy variables are created, one that is 1 when smoking status is Past and 0 otherwise, and one that is 1 when smoking status is Current and 0 otherwise.
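If you want to see the dummy variables R will create, you can inspect the model matrix directly. A minimal sketch (assuming smoker is a factor with levels Never, Past, and Current, as above):

# First few rows of the design matrix lm() would build:
# an intercept column plus one 0/1 dummy variable per non-reference level
head(model.matrix(~ smoker, data = nhanes))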

5.3 Visualizing the relationship

In Chapter 3 we looked at graphical summaries of each variable individually. Before fitting a linear regression, always visualize the relationship between \(Y\) and \(X\). For now, we are just going to use the plot to get an idea of what the relationship looks like. Later, when we discuss the assumptions of the linear regression model in more detail, you will be able to look at this plot and diagnose if there are any issues with the data that need to be dealt with.

Earlier in this Chapter, we actually produced these same figures using tidyverse functions. Here, we will use base R functions. Which you use is a matter of preference.

# Continuous predictor
plot(nhanes$RIDAGEYR, nhanes$LBXTC, col = "darkgray",
     xlab = "Age (years)", ylab = "Total Cholesterol (mg/dL)",
     las = 1,  # Rotate the axis labels so they are all horizontal
     pch = 20, # Change the plotting symbol to solid circles 
     font.lab = 2, font.axis = 2, # Bold font
     main = "Total Cholesterol vs. Age (NHANES 2017-2018)")
# The lm() function fits a linear regression (discussed more later)
abline(lm(LBXTC ~ RIDAGEYR, data = nhanes), lwd = 2)

# Categorical predictor
# NOTE: If you use plot here, you will get side-by-side boxplots
# To get the plot we want, use plot.default
plot.default(nhanes$smoker, nhanes$BPXSY1, col = "darkgray",
             xlab = "Smoking Status", ylab = "Systolic BP (mm Hg)",
             las = 1, pch = 20,
             font.lab = 2, font.axis = 2,
             xaxt = "n", # Suppresses the x-axis
             main = "Systolic BP vs. Smoking Status (NHANES 2017-2018)")
# Add custom x-axis
axis(1, at = 1:3, labels = levels(nhanes$smoker), font = 2)
# Compute means
MEANS <- tapply(nhanes$BPXSY1, nhanes$smoker, mean, na.rm = T)
# Add means to the plot
points(1:3, MEANS, pch = 20, cex = 3)
lines( 1:3, MEANS)

5.4 Fitting the SLR model

Finally, we are ready to fit our first regression model! To fit a linear regression in R, use the lm() function (lm stands for “linear model”). The syntax is lm(y ~ x, data = dat) where dat is a data.frame (in base R) or tibble (in tidyverse) containing variables with the names y and x. Then we can use summary() to view the output.

5.4.1 Continuous predictor

Fitting SLR models for continuous and categorical predictors is very similar, but the interpretations are different, as discussed previously. Below we fit the regression models corresponding to the plots we have been looking at above: the regression of total cholesterol on age, and the regression of SBP on smoking status. Let’s start with total cholesterol on age (Example 1):

fit.ex1 <- lm(LBXTC ~ RIDAGEYR, data = nhanes)
summary(fit.ex1)
## 
## Call:
## lm(formula = LBXTC ~ RIDAGEYR, data = nhanes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -111.93  -27.27   -2.75   25.33  200.32 
## 
## Coefficients:
##             Estimate Std. Error t value
## (Intercept) 174.6149     3.6953   47.25
## RIDAGEYR      0.2590     0.0687    3.77
##                         Pr(>|t|)    
## (Intercept) < 0.0000000000000002 ***
## RIDAGEYR                 0.00017 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.2 on 1024 degrees of freedom
##   (145 observations deleted due to missingness)
## Multiple R-squared:  0.0137, Adjusted R-squared:  0.0127 
## F-statistic: 14.2 on 1 and 1024 DF,  p-value: 0.000174

Looking at the Coefficients section of the output, there is a row for the intercept \(\beta_0\) labeled (Intercept), and a row for the coefficient for the predictor, or the slope \(\beta_1\), labeled RIDAGEYR. The Estimate column displays the estimated regression coefficients. Thus, the best fit regression line has an intercept of 174.6 and a slope of 0.26.

The Coefficients table also displays the standard error of each estimate, its t value, and its p-value (labeled Pr(>|t|)). The t value is the estimate divided by its standard error and tells you how many standard errors away from 0 the estimate falls.
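For example, you can reproduce the t values by hand from the Coefficients table (a minimal sketch using the fit from above):

# Divide each estimate by its standard error; this matches the t value column
ctab <- summary(fit.ex1)$coefficients
ctab[, "Estimate"] / ctab[, "Std. Error"]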

NOTE:
  • The p-value for the intercept is usually not of interest.
  • The p-value for the predictor tests the null hypothesis that the outcome has no association with the predictor, that is, that the slope is zero. This null hypothesis corresponds to the best fit line being horizontal, indicating that the mean outcome is the same at all values of the predictor (no association).

How do we interpret the p-value? If this is the true model generating the data, and if the slope really is zero, then the p-value is the probability of observing a sample slope with a magnitude at least as large as the one we observed. In this example, p = 0.0001741, which means that if the true model really is a horizontal line, we would be very unlikely to observe such a steep slope (such a strong association). In epidemiology, the p-value is what you use to attempt to “rule out chance” as an explanation for an observed association. The p-value is not the probability that chance produced the data, however. It is not the probability that the null hypothesis is true - it is the probability of seeing a slope this large assuming the null is true (assuming chance, or assuming no association).

The remainder of the summary indicates the following (a short sketch after this list shows how to extract these quantities directly in R):

  • The Residual standard error, which is an estimate of \(\sigma\), the standard deviation of the model error term \(\epsilon\).
  • The residual degrees of freedom, which equal the sample size minus the number of regression parameters = 1024. Since this is SLR, there are 2 regression parameters (the intercept and the slope, or the number of rows in the Coefficients section). Therefore, the sample size with no missing data is \(1024 + 2 = 1026\).
  • A note that 145 observations were omitted because they had a missing value for at least one of the variables in the regression model. As mentioned previously, R automatically carries out a complete case analysis when there are missing values.
  • The Multiple R-squared value, which is a measure of goodness of fit ranging from 0 (no association) to 1 (perfect linear association) and is the square of the Pearson correlation between the outcome and predictor. It is interpreted as the proportion of variation in the outcome explained by the model.
  • Adjusted R-squared is an alternative that penalizes \(R^2\) based on how many predictors are in the model.
  • For SLR, the F-statistic section provides no new information since the statistic is the square of the t value for the slope and its p-value is the same as the p-value for the slope. When there is more than one predictor, however, as there will be in Chapter 6, the F statistic tests the null hypothesis that the coefficients other than the intercept are all 0. This test is typically not of interest.
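The quantities described above can also be extracted directly from the fitted model object rather than read off the printed summary (a minimal sketch):

sigma(fit.ex1)                  # Residual standard error (estimate of sigma)
df.residual(fit.ex1)            # Residual degrees of freedom
nobs(fit.ex1)                   # Number of complete cases used in the fit
summary(fit.ex1)$r.squared      # Multiple R-squared
summary(fit.ex1)$adj.r.squared  # Adjusted R-squared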

We can use confint() to get confidence intervals (CI) for the intercept and slope. It is always a good idea to report a CI along with any estimate.

confint(fit.ex1)
##                2.5 %   97.5 %
## (Intercept) 167.3636 181.8661
## RIDAGEYR      0.1241   0.3939

Writing it Up

Below is an example of how to write up these results for a report or publication.

Example 1 (continued): We used linear regression to test the association between total cholesterol (mg/dL) and age (years) using data from 1026 adults from NHANES 2017-2018. 1.4% of the variation in total cholesterol was explained by age (\(R^2\) = 0.014), and there was a significant positive association between total cholesterol and age (B = 0.26; 95% CI = 0.12, 0.39; p < .001). On average, for every 1-year difference in age, adults differ in mean total cholesterol by 0.26 mg/dL. Table 5.1. below summarizes the regression results.

Table 5.1: Linear regression results for total cholesterol (mg/dL) vs. age (N = 1026)
Predictor      B      95% CI         p
Intercept      174.6  167.4, 181.9   < .001
Age (years)    0.26   0.12, 0.39     < .001

NOTE: This is cross-sectional data, so the results are carefully worded to not make any kind of causal claim. We did not write that a “change” or “increase” in age “causes” or “leads to” greater total cholesterol. We specifically used the word “difference” because in cross-sectional data we can only draw conclusions about differences in the mean outcome between individuals with different values of the predictor, not what happens as an individual changes over time. These are not necessarily the same; in this example, if there were a secular trend in total cholesterol (differences between generations), then mean total cholesterol might differ between individuals of different ages even if it does not change with age within an individual.

WARNING: Survey weights were not incorporated for this analysis. The results presented here may be biased and will tend to have p-values that are too small as they assume simple random sampling rather than accounting for the complex survey design of NHANES. We will discuss how to account for the survey design in Chapter 10.

5.4.2 Categorical predictor

Above we showed that when you have a categorical predictor linear regression is essentially comparing the mean outcome between levels of that predictor. This may ring a bell from your introductory statistics class where you compared means between independent samples using a two-sample t-test (when there were two groups) or one-way analysis of variance (ANOVA) when there were three or more groups. These tests are equivalent to a linear regression with a single categorical predictor. The advantage of using linear regression, however, is that it is easy to adjust for additional variables by adding more predictors to the model (as we will do in Chapter 6).
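If you want to verify this equivalence yourself, you can fit the one-way ANOVA directly; its F statistic and p-value should match the F-statistic line in the lm() output for Example 2 below (a minimal sketch):

# One-way ANOVA of SBP by smoking status; the F test is the same as the
# overall F-statistic from the regression of BPXSY1 on smoker
summary(aov(BPXSY1 ~ smoker, data = nhanes))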

Let’s continue with Example 2 and regress SBP on smoking status, a categorical variable with three levels, Never, Past, and Current (as seen using the levels() function below). This variable is coded as a factor variable. The syntax is the same as for a continuous predictor:

levels(nhanes$smoker)
## [1] "Never"   "Past"    "Current"
fit.ex2 <- lm(BPXSY1 ~ smoker, data = nhanes)
summary(fit.ex2)
## 
## Call:
## lm(formula = BPXSY1 ~ smoker, data = nhanes)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42.69 -14.00  -3.33  10.00  85.31 
## 
## Coefficients:
##               Estimate Std. Error t value
## (Intercept)    125.330      0.832  150.71
## smokerPast       5.356      1.518    3.53
## smokerCurrent    0.670      1.680    0.40
##                           Pr(>|t|)    
## (Intercept)   < 0.0000000000000002 ***
## smokerPast                 0.00044 ***
## smokerCurrent              0.69005    
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.7 on 986 degrees of freedom
##   (182 observations deleted due to missingness)
## Multiple R-squared:  0.0128, Adjusted R-squared:  0.0108 
## F-statistic:  6.4 on 2 and 986 DF,  p-value: 0.00174
confint(fit.ex2)
##                 2.5 %  97.5 %
## (Intercept)   123.698 126.962
## smokerPast      2.378   8.334
## smokerCurrent  -2.627   3.967

NOTE: The above syntax will not work if smoker is not coded as a factor in R! If it were coded as a numeric variable with levels, say, 1, 2, and 3, then lm() would treat it incorrectly as a continuous variable. To confirm this variable is a factor, you can use is.factor(nhanes$smoker). If that returns FALSE then convert the variable to a factor first. See Chapter [INSERT CROSS-REFERENCE] for more information on factor variables in R.
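A minimal sketch of checking and, if needed, converting (the level names here are illustrative; list the levels in whatever order you want, with the reference level first):

# Check whether smoker is a factor
is.factor(nhanes$smoker)
# If it is not, convert it before fitting the model, e.g.:
# nhanes <- nhanes %>%
#   mutate(smoker = factor(smoker, levels = c("Never", "Past", "Current")))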

The primary way in which this differs from the output for a continuous variable is that the Coefficients table now has, in addition to the (Intercept) row, \(K-1\) rows (\(K\) = # of levels of the categorical variable), each one corresponding to the difference in the estimated mean outcome between a level and the reference level. By default, lm() sets the reference level to be the first level of the categorical variable.

This output states that:

  • The estimated mean SBP among never smokers is 125.3 (for a categorical predictor the intercept is the mean at the reference level).
  • Past smokers have significantly greater mean SBP than never smokers (B = 5.36; 95% CI = 2.38, 8.33; p < .001), but current and never smokers do not have significantly different mean SBP (B = 0.67; 95% CI = -2.63, 3.97; p = .690) (the coefficients corresponding to the levels of the categorical predictor represent the difference in mean outcome between that level and the reference level).

Re-Leveling

What if you want to compare the mean outcome between levels of smoking status where neither is the reference level? You can re-level the variable to change the reference level. For example, to compare current and past smokers:

nhanes.releveled <- nhanes %>% 
  mutate(smoker = relevel(smoker, ref = "Past"))
# Check before (rows) vs. after (columns)
table(nhanes$smoker, nhanes.releveled$smoker, exclude = NULL)
##          
##           Past Never Current
##   Never      0   680       0
##   Past     282     0       0
##   Current    0     0     209
fit.ex2.releveled <- lm(BPXSY1 ~ smoker, data = nhanes.releveled)
summary(fit.ex2.releveled)
## 
## Call:
## lm(formula = BPXSY1 ~ smoker, data = nhanes.releveled)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42.69 -14.00  -3.33  10.00  85.31 
## 
## Coefficients:
##               Estimate Std. Error t value
## (Intercept)     130.69       1.27  102.94
## smokerNever      -5.36       1.52   -3.53
## smokerCurrent    -4.69       1.93   -2.42
##                           Pr(>|t|)    
## (Intercept)   < 0.0000000000000002 ***
## smokerNever                0.00044 ***
## smokerCurrent              0.01561 *  
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.7 on 986 degrees of freedom
##   (182 observations deleted due to missingness)
## Multiple R-squared:  0.0128, Adjusted R-squared:  0.0108 
## F-statistic:  6.4 on 2 and 986 DF,  p-value: 0.00174
confint(fit.ex2.releveled)
##                 2.5 %   97.5 %
## (Intercept)   128.195 133.1772
## smokerNever    -8.334  -2.3780
## smokerCurrent  -8.482  -0.8894

Re-leveling does not change the overall fit of the model, as indicated by the residual standard error, \(R^2\), and F-statistic. However, it does change the interpretation of the coefficients. The (Intercept) is now the mean SBP among past smokers (the new reference level) and the other coefficients correspond to differences in mean outcome between each level and past smokers. Thus, we conclude that current smokers have significantly lower SBP than past smokers (B = -4.69; 95% CI = -8.48, -0.89; p = .016).

Recall that the formula for this model can be written with indicator functions as:

\[Y = \beta_0 + \beta_1 I(Smoker=Past) + \beta_2 I(Smoker=Current) + \epsilon\]

As mentioned before, when a categorical variable coded as a factor is included in a model, lm() automatically creates indicator functions corresponding to all but one level of the categorical variable, also known as dummy variables. To see which levels it created indicator functions for, you can look in the output, or you can use the contrasts() function. Here, we will examine how fit.ex2 and fit.ex2.releveled differ in their indicator functions by using contrasts() to examine the variables that were used in each of those fits:

contrasts(nhanes$smoker)
##         Past Current
## Never      0       0
## Past       1       0
## Current    0       1

In the original variable, Never was the reference level for smoker. Including nhanes$smoker in the model is the same as including two indicator functions called Past and Current: \(I(Smoker=Past)\) which is 1 for past smokers and 0 otherwise (the first column in the output above) and \(I(Smoker=Current)\) which is 1 for current smokers and 0 otherwise (the second column in the output above).

When using the re-leveled factor nhanes.releveled$smoker instead, the indicator functions change accordingly. Since now “Past” is the reference level, we have indicator functions for Never and Current:

contrasts(nhanes.releveled$smoker)
##         Never Current
## Past        0       0
## Never       1       0
## Current     0       1

Multiple DF Test for a Categorical Predictor

The p-values for the individual smoker parameters give us tests of specific pairwise comparisons between levels. But what if you want a single overall test of significance for “smoking status”? This requires testing the null hypothesis that \(\beta_1 = 0\) and \(\beta_2 = 0\) simultaneously. Since we are testing two parameters at once, this is a “2 df” (two degree of freedom) test, and it is equivalent to testing the null hypothesis that the mean outcome is the same at all levels of the predictor (if all the \(\beta\)s other than the intercept are zero, then there is no difference in the mean outcome between any pair of levels of the categorical variable). We can obtain the p-value corresponding to this multiple degree of freedom (df) test by calling the Anova() function in the car library (Fox and Weisberg 2019). The type = 3 argument specifies that we want a Type III test, which is a test of a parameter after adjusting for all other terms in the model (there are no other terms in SLR, but there will be when we get to multiple linear regression).

# We can access the Anova function without loading the entire car library via the `::` syntax
car::Anova(fit.ex2, type = 3)
## Anova Table (Type III tests)
## 
## Response: BPXSY1
##              Sum Sq  Df F value              Pr(>F)
## (Intercept) 8859061   1 22714.3 <0.0000000000000002
## smoker         4989   2     6.4              0.0017
## Residuals    384561 986                            
##                
## (Intercept) ***
## smoker      ** 
## Residuals      
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Type III test states that the 2 degree of freedom test of smoker is statistically significant (p = .002). In other words, we reject the null hypothesis that mean SBP is the same for never, past, and current smokers. If reporting this test formally, we would write “F[2, 986] = 6.40, p = .002” where 2 and 986 are the numerator (smoker) and residual degrees of freedom, respectively.

In the case of a single categorical predictor, the Type III test is equivalent to the F-test shown in the lm() output above. This will not be the case in multiple linear regression. car::Anova() is used to test individual predictors adjusted for all others, whereas the global F-test in the lm() output tests all predictors simultaneously by comparing the fitted model to the null model with just an intercept.

Writing it Up

Below is an example of how to write up these results for a report or publication.

Notice how this differs from when the predictor was continuous - here we report the overall association plus some or all of the pairwise differences (highlighting what is most interesting). We need to include an adjustment for multiple testing here, as well, which accounts for the fact that we are carrying out three pairwise comparisons (Past vs. Never, Current vs. Never, Current vs. Past). For now, we will use the simple Bonferroni correction, in which we multiply each p-value by the number of pairwise comparisons before comparing to the significance level \(\alpha = .05\).

NOTE: Multiple comparisons between levels of a categorical variable are referred to as post hoc pairwise comparisons and require an adjustment for multiple testing; otherwise, your Type I error rate will be greater than 5%. The simplest method of adjusting p-values is the Bonferroni correction, in which you multiply each p-value by the number of pairwise comparisons before comparing it to the \(\alpha\) level (equivalently, divide \(\alpha\) by the number of tests before comparing it to the raw p-values). Multiple testing will be discussed in more detail in Section 6.12.
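A minimal sketch of applying the correction in R with p.adjust(), using the three raw pairwise p-values from the two fits above (Past vs. Never, Current vs. Never, Current vs. Past):

# Bonferroni: multiply each p-value by the number of comparisons (capped at 1)
p.raw <- c(0.00044, 0.69005, 0.01561)
p.adjust(p.raw, method = "bonferroni")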

Example 2 (continued): We used linear regression to test the association between systolic blood pressure (SBP, mmHg) and smoking status (Never, Past, Current) using data from 989 adults from NHANES 2017-2018. 1.3% of the variation in SBP was explained by smoking status (\(R^2\) = 0.013), and there was a significant association between SBP and smoking status (F[2, 986] = 6.40, p = .002). After adjusting for multiple comparisons using a Bonferroni correction, past smokers had significantly greater mean SBP than never smokers (B = 5.36; 95% CI = 2.38, 8.33; p < .001) and significantly lower mean SBP than current smokers (B = -4.69; 95% CI = -8.48, -0.89; p = .016). Table 5.2. below summarizes the regression results.

Table 5.2: Linear regression results for systolic blood pressure (mmHg) vs. smoking status (N = 989)
Predictor                      Level     B      95% CI         p
Intercept                                125.3  123.7, 127.0   < .001
Smoking status (ref = Never)                                   .002
                               Past      5.36   2.38, 8.33     < .001
                               Current   0.67   -2.63, 3.97    .690

The p-value in the “Smoking status (ref = Never)” row corresponds to the 2 df test of smoking status we obtained using car::Anova().

NOTE: We probably should not draw any substantive conclusions regarding the association between smoking and SBP from this analysis; a simple linear regression is an unadjusted analysis and may be biased due to confounding. We will use multiple linear regression later to carry out adjusted analyses. This is one of the main features of regression that makes it useful in epidemiological research - the ability to control for confounding by adjusting for potential confounders in a multiple regression model.

Special Case: Binary Predictor

If a categorical predictor has only two levels, it will only have one indicator function and there is no need for a Type III test - we can simply read the p-value from the Coefficients table. The comparison of mean outcome between the two levels is the same as the overall test of a binary predictor. For example, suppose we regressed SBP on gender (for which NHANES only reports responses of male and female).

fit.binary <- lm(BPXSY1 ~ RIAGENDR, data = nhanes)
summary(fit.binary)$coef
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)     127.874     0.8924 143.286  0.00000
## RIAGENDRFemale   -2.218     1.2615  -1.758  0.07902
car::Anova(fit.binary, type = 3)
## Anova Table (Type III tests)
## 
## Response: BPXSY1
##              Sum Sq  Df  F value              Pr(>F)
## (Intercept) 8077832   1 20530.84 <0.0000000000000002
## RIAGENDR       1216   1     3.09               0.079
## Residuals    388334 987                             
##                
## (Intercept) ***
## RIAGENDR    .  
## Residuals      
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

You can see that the p-value in the regression coefficient table is the same as the p-value in the Anova Type III table.

5.5 Fitting curves using polynomials

If the relationship between the outcome and a continuous predictor is non-linear, a curve may fit better than a straight line. A common way to fit a curve is to use a polynomial function, such as a quadratic or cubic. A regression curve does not have a single slope - the rate of change in \(Y\) is not constant and depends on \(X\).

NOTE: It is best, when fitting polynomial models, to first center the predictor using the transformation \(cX = X - \bar{X}\) where \(\bar{X}\) is the mean of \(X\) and then enter \(cX\) into the model rather than \(X\). This makes the terms less correlated, leading to more stable parameter and standard error estimates that are less likely to cause numerical problems. It has the added benefit of making the intercept interpretable as the mean outcome at the mean value of the predictor.

The linear, quadratic, and cubic models are, respectively:

\[\begin{array}{lrcl} \text{Linear:} & Y & = & \beta_0 + \beta_1 (X-\bar{X}) + \epsilon \\ \text{Quadratic:} & Y & = & \beta_0 + \beta_1 (X-\bar{X}) + \beta_2 (X-\bar{X})^2 + \epsilon \\ \text{Cubic:} & Y & = & \beta_0 + \beta_1 (X-\bar{X}) + \beta_2 (X-\bar{X})^2 + \beta_3 (X-\bar{X})^3 + \epsilon \end{array}\]

Example 3: We will use nation-level data to regress \(Y\) = Crude death rate/1000 people [death] and \(X\) = Secondary school enrollment [school2].

load("Data/nations.rData")
# Create centered version of the predictor
nations$cschool2 <- nations$school2 - mean(nations$school2, na.rm = T)
# Fit linear, quadratic, and cubic models
fit.poly.1 <- lm(death ~ cschool2, data = nations)
fit.poly.2 <- lm(death ~ cschool2 + I(cschool2^2), data = nations)
fit.poly.3 <- lm(death ~ cschool2 + I(cschool2^2) + I(cschool2^3), data = nations)
# NOTE: You must use I(X^2) rather than X^2 in the model.
# Despite the similar notation, I() here is not an indicator function;
# it means "as is": it returns its argument unchanged so that ^ is
# treated as arithmetic rather than as formula syntax.

We can test to see if the quadratic curve fits better than linear by testing the null hypothesis \(\beta_2=0\) in the quadratic fit. You can test to see if the cubic curve fits better than quadratic by testing the null hypothesis that \(\beta_3=0\) in the cubic fit. Instead of looking at all the output, we can just extract the Coefficients table from the output:

round(summary(fit.poly.1)$coef, 4)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  11.5481     0.3573  32.316        0
## cschool2     -0.1142     0.0121  -9.407        0
round(summary(fit.poly.2)$coef, 4)
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)     9.0876     0.4391  20.695        0
## cschool2       -0.1388     0.0104 -13.412        0
## I(cschool2^2)   0.0028     0.0004   7.436        0
round(summary(fit.poly.3)$coef, 4)
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)     8.9875     0.4443  20.227   0.0000
## cschool2       -0.1122     0.0229  -4.892   0.0000
## I(cschool2^2)   0.0031     0.0004   7.163   0.0000
## I(cschool2^3)   0.0000     0.0000  -1.301   0.1964

A rounded p-value of 0 in this case is p < 0.0001 since we rounded to 4 digits. Based on the statistical tests, the quadratic term in the quadratic model is statistically significant (p < .001) but the cubic term in the cubic model is not (p = .196). Thus, we would conclude that the best fitting polynomial is a quadratic.

NOTE: With a large sample size, even a small deviation from linearity will have a small p-value. In that case, a visual comparison is more helpful than a statistical hypothesis test. Whatever the sample size, it is typically a good idea to judge the comparison of these polynomials more on the visual difference or similarity, not on p-values.

To plot the fits, the approach to take is to plot the data, and then overlay lines or curves based on predictions from the model over a range of predictor values. We will talk more about prediction in the next section. For now, just know that the predict() function will do what we need.

# Plot the raw X variable, not the centered one!
plot(nations$school2, nations$death, col = "darkgray",
     xlab = "Secondary School Enrollment",
     ylab = "Crude death rate/1000",
     las = 1, pch = 20, font.lab = 2, font.axis = 2)
# Get predicted values over a range for plotting
# Add na.rm=T in case there are missing values
X <- seq(min(nations$school2, na.rm = T), max(nations$school2, na.rm = T), by = 1)
# Center X just like the predictor
cX <- X - mean(nations$school2, na.rm = T)
PRED1 <- predict(fit.poly.1, data.frame(cschool2 = cX))
PRED2 <- predict(fit.poly.2, data.frame(cschool2 = cX))
PRED3 <- predict(fit.poly.3, data.frame(cschool2 = cX))
# Add lines to the plot
lines(X, PRED1, lwd = 2, lty = 1)
lines(X, PRED2, lwd = 2, lty = 2)
lines(X, PRED3, lwd = 2, lty = 3)
# Add a legend
par(font = 2)
legend(60, 25, c("Linear", "Quadratic", "Cubic"),
       lwd = c(2,2,2), lty = c(1,2,3), seg.len = 4,
       bty = "n", title = "Fit")

par(font = 1)

In this example, the scatterplot confirms our conclusions from the statistical tests - the quadratic curve fits the data much better than a straight line and the cubic is not an improvement over the quadratic. So what is the relationship between nations’ crude death rate and secondary school enrollment? Based on this data, it appears that the crude death rate decreases with secondary school enrollment, but only for countries with low enrollment. Among countries with higher enrollment, there is less of a relationship (the curve levels off at the higher values of \(X\)).

What if we want a single overall test for the predictor after fitting it as a quadratic polynomial? When it was just linear, this was easy - we read the p-value from the regression output. But when you have a polynomial, the variable is modeled using more than one term. As when we tested a categorical variable with more than two levels, we need an overall multiple degree of freedom test of all the terms at the same time. In the above example, we would want to compare the quadratic model to the model with just an intercept. The anova() function can accomplish this by comparing the intercept-only fit to the quadratic fit.

# Intercept-only model:
fit.poly.0 <- lm(death ~ 1, data = nations)
# Compare Intercept-only to Quadratic
# Use try() because we want to see the error without the program crashing
try(anova(fit.poly.0, fit.poly.2))
## Error in anova.lmlist(object, ...) : 
##   models were not all fitted to the same size of dataset

What went wrong? fit.poly.2 included the outcome death as well as the predictor cschool2 whereas fit.poly.0 only has death. There were missing values in cschool2 that resulted in the available cases being different for the two fits. How do we solve this? lm() has a subset option which we can use to remove cases with missing values for the variables that were used in fit.poly.2. (Had we removed missing data before doing any of these analyses, this would not have been an issue.)

# Intercept-only model:
fit.poly.0 <- lm(death ~ 1, data = nations, subset = !is.na(death) & !is.na(cschool2))
# Compare Intercept-only to Quadratic
anova(fit.poly.0, fit.poly.2)
## Analysis of Variance Table
## 
## Model 1: death ~ 1
## Model 2: death ~ cschool2 + I(cschool2^2)
##   Res.Df  RSS Df Sum of Sq    F              Pr(>F)
## 1    103 2530                                      
## 2    101  875  2      1654 95.5 <0.0000000000000002
##      
## 1    
## 2 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The “Df” column has a “2.” This is a 2 degree of freedom test because we are testing 2 terms simultaneously – the linear term cschool2 and the quadratic term I(cschool2^2). According to the anova table, we can reject the null hypothesis that both these terms have zero coefficients and conclude that there is a significant association between secondary school enrollment and death rate (p < .001).

5.6 Predictions from the model

To get a prediction of the population mean outcome (\(Y\)) at a specific predictor value (\(X\)), you could manually enter the predictor value into the model and compute the prediction, as in the following three examples.

Example 1 (continued): The predicted mean total cholesterol for those age 65 years is 174.6 + 0.26 \(\times\) 65 = 191.5 mg/dL. This is not exact because we rounded each coefficient.

Example 2 (continued): The predicted mean SBP for Current smokers is 125.3 + 5.36 \(\times\) 0 + 0.67 \(\times\) 1 = 126.0 mm Hg. This is not exact because we rounded each coefficient.

Example 3 (continued): The predicted mean crude death rate per 1000 in countries with a centered secondary school enrollment (cschool2) of 60 is 9.09 - 0.14 \(\times\) 60 + 0.0028 \(\times\) \(60^2\) = 10.77 per 1000.

Or, you could let the software do the computation for you. We saw above how to use predict() to help us with plotting fitted curves. We can use it in general to obtain a prediction from the model at any given predictor value. Let’s repeat the three examples above, this time using predict(), along with interval = "confidence" which will compute a 95% confidence interval for the prediction.

# Example 1
# Reminder of what the model looked like:
# fit.ex1 <- lm(LBXTC ~ RIDAGEYR, data = nhanes)
predict(fit.ex1, newdata = data.frame(RIDAGEYR = 65), interval = "confidence")
##     fit   lwr   upr
## 1 191.5 188.3 194.7
# Example 2
# Reminder of what the model looked like:
# fit.ex2 <- lm(BPXSY1 ~ smoker, data = nhanes)
predict(fit.ex2, newdata = data.frame(smoker = "Current"), interval = "confidence")
##   fit   lwr   upr
## 1 126 123.1 128.9
# Example 3
# Reminder of what the model looked like:
# fit.poly.2 <- lm(death ~ cschool2 + I(cschool2^2), data = nations)
predict(fit.poly.2, newdata = data.frame(cschool2 = 60), interval = "confidence")
##     fit   lwr  upr
## 1 10.98 8.859 13.1

In some cases, our manual calculation was spot on, in others not so much. If you want an exact answer, use predict().

5.7 Confidence intervals, confidence bands, prediction intervals

Whenever you estimate a parameter, you also want to report a confidence interval for it. A 95% confidence interval (CI) for a population parameter is a random interval that has a 95% probability of containing the true parameter. For example, if (5.1%, 7.3%) is a 95% CI for the incidence of smoking-related morbidity in a particular population in 2019, that means that there is a 95% chance that this interval contains the true incidence.

We are going to look at three different kinds of CIs that you can create in simple linear regression: (1) CI for regression coefficients, (2) CI for the mean outcome, and (3) CI for the outcome for an individual observation.

  • The regression coefficients are the intercept and slope (for a continuous predictor) or the intercept and differences in mean outcome between levels and the reference level (for a categorical predictor).
  • The mean outcome is the value of the outcome on the regression line, which we computed in the previous section on predictions. A point on the regression line is the mean outcome at a given value of the predictor, and is also called the fitted value.
  • The outcome for an individual observation is also the value of the outcome on the regression line, but its CI will be wider than the CI for the mean outcome because individual observations are more variable than the mean.

5.7.1 CI for regression coefficients

In a previous section, we computed CIs for each of the individual regression coefficients using confint(fit). For example, with a continuous predictor:

# Continuous predictor
confint(fit.ex1)
##                2.5 %   97.5 %
## (Intercept) 167.3636 181.8661
## RIDAGEYR      0.1241   0.3939

Based on this output, we would say that a 95% CI for the population slope (\(\beta_1\)) of the regression of total cholesterol (mg/dL) on age (years) is 0.12, 0.39. Combining this with the regression coefficient and p-value from the earlier regression output, we would report this as follows: Age is significantly associated with total cholesterol (B = 0.26; 95% CI = 0.12, 0.39; p < .001).

For a categorical predictor,

# Categorical predictor
confint(fit.ex2)
##                 2.5 %  97.5 %
## (Intercept)   123.698 126.962
## smokerPast      2.378   8.334
## smokerCurrent  -2.627   3.967

Recall that the intercept (\(\beta_0\)) is the mean for the reference level, and the other parameters are differences in mean between those levels and the reference level. Thus, a 95% CI for the population mean SBP (mm Hg) among never smokers is 123.7, 127.0. To get a CI for the mean at another level, use that level as the reference level before fitting the model. A 95% CI for the population difference in mean SBP between past and never smokers is 2.38, 8.33. We would report this as follows: Mean SBP is significantly greater among past smokers than among never smokers (B = 5.36; 95% CI = 2.38, 8.33; p < .001). To get a CI for a pairwise comparison between levels where neither is the reference level, change the reference level to one of the levels you want to compare before fitting the model.

5.7.2 CI for mean outcome

We have actually already done this earlier - we used predict() along with interval = "confidence" to get a 95% CI for a prediction, and the “prediction” is the same as the “mean outcome” (at a given value of the predictor). By way of reminder,

predict(fit.ex1, newdata = data.frame(RIDAGEYR = 65), interval = "confidence")
##     fit   lwr   upr
## 1 191.5 188.3 194.7

Taken together, the CIs for the mean over the entire range of \(X\) values are referred to as a confidence band, since they visually form a band around the regression line. To visualize the confidence band for a continuous predictor, compute the CIs over the range of possible \(X\) values.

X <- seq(min(nhanes$RIDAGEYR, na.rm=T), max(nhanes$RIDAGEYR, na.rm=T), by = 1)
mypred.cont <- predict(fit.ex1, newdata = data.frame(RIDAGEYR=X), interval="confidence")

Let’s look at the first few lines of mypred.cont so we can see the column names:

head(cbind(X, mypred.cont))
##    X   fit   lwr   upr
## 1 18 179.3 174.2 184.3
## 2 19 179.5 174.6 184.5
## 3 20 179.8 175.0 184.6
## 4 21 180.1 175.4 184.8
## 5 22 180.3 175.7 184.9
## 6 23 180.6 176.1 185.0

The plot of fit vs. X will be the regression line (solid line), and the plots of lwr and upr vs. X will be the confidence bands (dashed lines). With a reasonably large sample size confidence bands are quite narrow, so below we will plot both on the original scale and zoomed in so you can see the bands more clearly.

par(mfrow=c(1,2))
plot(nhanes$RIDAGEYR, nhanes$LBXTC, col="gray",
     xlab = "Age (years)", ylab = "Total Cholesterol (mg/dL)",
     las = 1, pch = 20, font.lab = 2, font.axis = 2,
     main = "Confidence Band\nContinuous Predictor")
lines(X, mypred.cont[, "fit"], col = "red",  lty = 1, lwd = 2)
lines(X, mypred.cont[, "upr"], col = "blue", lty = 5, lwd = 2)
lines(X, mypred.cont[, "lwr"], col = "blue", lty = 5, lwd = 2)

plot(nhanes$RIDAGEYR, nhanes$LBXTC, col="gray", ylim = c(170, 200),
     xlab = "Age (years)", ylab = "Total Cholesterol (mg/dL)",
     las = 1, pch = 20, font.lab = 2, font.axis = 2,
     main = "(zoomed in)")
lines(X, mypred.cont[, "fit"], col = "red",  lty = 1, lwd = 2)
lines(X, mypred.cont[, "upr"], col = "blue", lty = 5, lwd = 2)
lines(X, mypred.cont[, "lwr"], col = "blue", lty = 5, lwd = 2)

par(mfrow=c(1,1))

In general, the confidence bands will be narrower in the middle of the predictor range and wider at the edges because we are more confident about the location of the middle of the line than about the location of its ends. This can be shown mathematically (see the formula below), but conceptually, consider that a slightly different sample of points would result in a slightly shifted and rotated line, and any rotation has a greater impact on the ends of the line than on the middle.
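For reference, in SLR with a continuous predictor the estimated standard error of the fitted mean at a predictor value \(x_0\) is

\[\widehat{SE}\left(\hat{E}(Y|X=x_0)\right) = \hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}\]

where \(\hat{\sigma}\) is the residual standard error. The \((x_0-\bar{x})^2\) term is what makes the band widen as \(x_0\) moves away from the mean of the predictor.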

If the predictor is categorical, then rather than plotting over a continuous range of \(X\) values, we plot over the set of unique levels of \(X\).

X <- levels(nhanes$smoker)
mypred.cat <- predict(fit.ex2, newdata = data.frame(smoker=X), interval="confidence")
data.frame(X, mypred.cat)
##         X   fit   lwr   upr
## 1   Never 125.3 123.7 127.0
## 2    Past 130.7 128.2 133.2
## 3 Current 126.0 123.1 128.9
par(mfrow=c(1,2))
plot.default(nhanes$smoker, nhanes$BPXSY1,  col="gray",
             xlab = "Smoking Status", ylab = "Systolic BP (mm Hg)",
             las = 1, pch = 20,
             font.lab = 2, font.axis = 2, xaxt = "n",
             main = "Confidence Band\nCategorical Predictor")
axis(1, at = 1:3, labels = levels(nhanes$smoker), font = 2)
points(1:3, mypred.cat[, "fit"], col = "red", pch = 20, cex = 1.5)
lines( 1:3, mypred.cat[, "fit"], col = "red")
lines( 1:3, mypred.cat[, "upr"], col = "blue", lty = 5, lwd = 2)
lines( 1:3, mypred.cat[, "lwr"], col = "blue", lty = 5, lwd = 2)

plot.default(nhanes$smoker, nhanes$BPXSY1,  col="gray", ylim = c(120, 140),
             xlab = "Smoking Status", ylab = "Systolic BP (mm Hg)",
             las = 1, pch = 20,
             font.lab = 2, font.axis = 2, xaxt = "n",
             main = "(zoomed in)")
axis(1, at = 1:3, labels = levels(nhanes$smoker), font = 2)
points(1:3, mypred.cat[, "fit"], col = "red", pch = 20, cex = 1.5)
lines( 1:3, mypred.cat[, "fit"], col = "red")
lines( 1:3, mypred.cat[, "upr"], col = "blue", lty = 5, lwd = 2)
lines( 1:3, mypred.cat[, "lwr"], col = "blue", lty = 5, lwd = 2)

par(mfrow=c(1,1))

Unlike for a continuous predictor, the width of the confidence band is not necessarily greater on the “ends” when you have a categorical predictor. The width of the band now is more related to the sample size at each level of \(X\).
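You can check those group sizes directly; only cases with a non-missing outcome contribute to the fit (a minimal sketch):

# Number of non-missing SBP values at each level of smoking status
table(nhanes$smoker[!is.na(nhanes$BPXSY1)])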

5.7.3 CI for individual observations

The third type of CI of interest is a CI for the outcome for an individual observation at a specific X value. This is not the same as the previous section where we computed a CI for the mean. The CI for an individual’s outcome will be wider than the CI for the mean - one observation is more variable than the mean of many observations; individuals are more variable than the average of individuals. The CI for an individual’s outcome is sometimes called a prediction interval. This can be confusing… the word “prediction” is used for both the predicted mean outcome and the prediction for an individual, and they are always the same value. But the term “prediction interval” is used only for the CI for an individual.

A 95% prediction interval for the outcome (\(Y\)) for an individual with a specific predictor value \(X=x\) is interpreted as follows: Among all individuals in the population with \(X=x\), we expect 95% of their \(Y\) values to be within the 95% prediction interval.

In predict(), add interval = "prediction".

predict(fit.ex1, newdata = data.frame(RIDAGEYR = 65), interval = "prediction")  
##     fit   lwr   upr
## 1 191.5 110.6 272.3
predict(fit.ex2, newdata = data.frame(smoker = "Current"), interval = "prediction") 
##   fit   lwr   upr
## 1 126 87.14 164.9

Example 1 (continued): The average total cholesterol for individuals in the population who are age 65 years is 191.5 mg/dL, and we expect 95% of such individuals to have total cholesterol between 110.6 and 272.3 mg/dL.

Example 2 (continued): The average SBP for individuals who are current smokers is 126.0 mmHg, and we expect 95% of current smokers to have a SBP between 87.1 and 164.9 mmHg.

The CIs for individual observations over a range of \(X\) values form a prediction band. To visualize the prediction band for a continuous predictor, plot the prediction interval lower and upper bounds over the range of predictor values. The code to accomplish this is exactly the same as in the previous section, but with interval = "prediction" instead of interval = "confidence" in the call to predict(), as sketched below.
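A minimal sketch for the continuous predictor, following the same pattern as the confidence band code above:

# Recompute the grid of ages (X was reused for the categorical example above)
X <- seq(min(nhanes$RIDAGEYR, na.rm = T), max(nhanes$RIDAGEYR, na.rm = T), by = 1)
# Prediction intervals (instead of confidence intervals) over that range
mypred.ind <- predict(fit.ex1, newdata = data.frame(RIDAGEYR = X),
                      interval = "prediction")
plot(nhanes$RIDAGEYR, nhanes$LBXTC, col = "gray",
     ylim = range(nhanes$LBXTC, mypred.ind, na.rm = TRUE), # make room for the band
     xlab = "Age (years)", ylab = "Total Cholesterol (mg/dL)",
     las = 1, pch = 20, font.lab = 2, font.axis = 2,
     main = "Prediction Band\nContinuous Predictor")
lines(X, mypred.ind[, "fit"], col = "red",  lty = 1, lwd = 2)
lines(X, mypred.ind[, "upr"], col = "blue", lty = 5, lwd = 2)
lines(X, mypred.ind[, "lwr"], col = "blue", lty = 5, lwd = 2)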

There is no need to zoom in this time because the prediction band is quite wide. Since we are less confident in the outcome for an individual observation than for the mean of individual observations, a prediction band will always be wider than a confidence band.
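For SLR with a continuous predictor, the standard error used by the prediction interval adds 1 under the square root relative to the confidence-band formula shown earlier:

\[\widehat{SE}_{pred}(x_0) = \hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}\]

Because of the leading 1, this standard error can never drop below \(\hat{\sigma}\), no matter how large the sample.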

NOTE: As the sample size increases, the prediction band will become more accurate but will always be wide since it captures 95% of the observations. The confidence band, however, will get closer and closer to the regression line as the sample size increases.

Similarly, for a categorical predictor, compute the prediction intervals at each level of the predictor rather than over a continuous range of values.
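A minimal sketch, mirroring the confidence band code for the categorical predictor above:

# Prediction intervals for SBP at each level of smoking status
X.cat <- levels(nhanes$smoker)
predict(fit.ex2, newdata = data.frame(smoker = X.cat), interval = "prediction")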

In summary, you can obtain three kinds of CIs from a regression – for the population regression coefficients, for the population mean outcome, and for the outcome for an individual observation. For the mean and individual observations, the predictions are the same, but the CI for an individual observation is wider than the CI for the mean.

5.8 Assumptions

The validity of regression coefficient estimates, confidence intervals, and significance tests based on a regression model depend on a number of assumptions: independent observations, linearity, normality of residuals, and constant variance. Additionally, individual cases might be outliers (points that fall very far from the regression line) or influential (cases that have a large impact on the magnitude of the intercept and slope). These issues will all be covered in the following chapter on multiple linear regression.

5.9 Exercises

References

Fox, John, and Sanford Weisberg. 2019. An R Companion to Applied Regression. Third edition. Los Angeles: SAGE Publications, Inc.