Chapter 7 Linear Regression

7.1 Objectives

After completing this chapter, the readers are expected to

  • understand the concept of simple and multiple linear regression
  • perform simple linear regression
  • perform multiple linear regression
  • perform model fit assessment of linear regression models
  • present and interpret the results of linear regression analyses

7.2 Introduction

Linear regression is one of the most common statistical analyses in medical and health sciences. Linear regression models the linear (i.e. straight line) relationship between:

  • outcome: numerical variable (e.g. blood pressure, BMI, cholesterol level).
  • predictors/independent variables: numerical variables and categorical variables (e.g. gender, race, education level).

In simple words, we might be interested in the relationship between the cholesterol level and its associated factors, for example gender, age, BMI and lifestyle. This can be explored by a linear regression analysis.

7.3 Linear regression models

Linear regression is a type of generalized linear model (GLM), a family that also covers other outcome types, for example categorical and count outcomes. In subsequent chapters, we will cover these outcome types in the form of logistic regression and Poisson regression. Basically, the relationship between the outcome and predictors in a linear regression is structured as follows,

\[\begin{aligned} numerical\ outcome = &\ numerical\ predictors \\ & + categorical\ predictors \end{aligned}\]

More appropriate forms of this relationship will be explained later in the simple and multiple linear regression sections.

7.4 Prepare R Environment for Analysis

7.4.1 Libraries

For this chapter, we will be using the following packages:

  • foreign: for reading SPSS and STATA datasets
  • tidyverse: a general and powerful package for data transformation
  • psych: for descriptive statistics
  • gtsummary: for coming up with nice, publication-ready tables of results
  • ggplot2, ggpubr, GGally: for plotting the graphs
  • rsq: for getting \(R^2\) value from a GLM model
  • broom: for tidying up the results
  • car: for vif() function

These are loaded as follows using the function library(),

library(foreign)
library(tidyverse)
library(psych)
library(gtsummary)
library(ggplot2)
library(ggpubr)
library(GGally)
library(rsq)
library(broom)
library(car)

7.4.2 Dataset

We will use the coronary.dta dataset in STATA format. The dataset contains the subjects' total cholesterol levels, their individual characteristics and the intervention groups in a hypothetical clinical trial. The dataset contains 200 observations of nine variables:

  1. id: Subjects’ ID.
  2. cad: Coronary artery disease status (categorical) {no cad, cad}.
  3. sbp: Systolic blood pressure in mmHg (numerical).
  4. dbp: Diastolic blood pressure in mmHg (numerical).
  5. chol: Total cholesterol level in mmol/L (numerical).
  6. age: Age in years (numerical).
  7. bmi: Body mass index (numerical).
  8. race: Race of the subjects (categorical) {malay, chinese, indian}.
  9. gender: Gender of the subjects (categorical) {woman, man}.

The dataset is loaded as follows,

coronary = read.dta("data/coronary.dta")

We then look at the basic structure of the dataset,

str(coronary)
## 'data.frame':    200 obs. of  9 variables:
##  $ id    : num  1 14 56 61 62 64 69 108 112 134 ...
##  $ cad   : Factor w/ 2 levels "no cad","cad": 1 1 1 1 1 1 2 1 1 1 ...
##  $ sbp   : num  106 130 136 138 115 124 110 112 138 104 ...
##  $ dbp   : num  68 78 84 100 85 72 80 70 85 70 ...
##  $ chol  : num  6.57 6.33 5.97 7.04 6.66 ...
##  $ age   : num  60 34 36 45 53 43 44 50 43 48 ...
##  $ bmi   : num  38.9 37.8 40.5 37.6 40.3 ...
##  $ race  : Factor w/ 3 levels "malay","chinese",..: 3 1 1 1 3 1 1 2 2 2 ...
##  $ gender: Factor w/ 2 levels "woman","man": 1 1 1 1 2 2 2 1 1 2 ...
##  - attr(*, "datalabel")= chr "Written by R.              "
##  - attr(*, "time.stamp")= chr ""
##  - attr(*, "formats")= chr [1:9] "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
##  - attr(*, "types")= int [1:9] 100 108 100 100 100 100 100 108 108
##  - attr(*, "val.labels")= chr [1:9] "" "cad" "" "" ...
##  - attr(*, "var.labels")= chr [1:9] "id" "cad" "sbp" "dbp" ...
##  - attr(*, "version")= int 7
##  - attr(*, "label.table")=List of 3
##   ..$ cad   : Named int [1:2] 1 2
##   .. ..- attr(*, "names")= chr [1:2] "no cad" "cad"
##   ..$ race  : Named int [1:3] 1 2 3
##   .. ..- attr(*, "names")= chr [1:3] "malay" "chinese" "indian"
##   ..$ gender: Named int [1:2] 1 2
##   .. ..- attr(*, "names")= chr [1:2] "woman" "man"

7.5 Simple Linear Regression

7.5.1 About Simple Linear Regression

Simple linear regression (SLR) models the linear (straight line) relationship between:

  • outcome: numerical variable.
  • ONE predictor: numerical/categorical variable.

Note: When the predictor is a categorical variable, this is typically analyzed by one-way ANOVA. However, SLR can also handle a categorical variable within the GLM framework, as sketched below.
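
For illustration, a minimal sketch of an SLR with a categorical predictor, using the race variable from the coronary dataset loaded in Section 7.4.2; glm() codes the factor automatically:

slr_chol_race = glm(chol ~ race, data = coronary)  # categorical predictor
summary(slr_chol_race)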

We may formally represent SLR in form of an equation as follows,

\[numerical\ outcome = intercept + coefficient \times predictor\] or in a shorter form using mathematical notations,

\[\hat y = b_0 + b_1x_1\] where \(\hat y\) (pronounced y hat) is the predicted value of the outcome y.

7.5.2 Data exploration

Let's say, for the SLR, we are interested in whether diastolic blood pressure (predictor) is associated with the cholesterol level (outcome). We explore the variables by obtaining the descriptive statistics and plotting the data distribution.

We obtain the descriptive statistics of the variables,

coronary %>% select(chol, dbp) %>% describe()
##      vars   n  mean    sd median trimmed   mad min    max range skew kurtosis
## chol    1 200  6.20  1.18   6.19    6.18  1.18   4   9.35  5.35 0.18    -0.31
## dbp     2 200 82.31 12.90  80.00   81.68 14.83  56 120.00 64.00 0.42    -0.33
##        se
## chol 0.08
## dbp  0.91

and the histograms and box-and-whiskers plots,

hist_chol = ggplot(coronary, aes(chol)) + 
  geom_histogram(color = "blue", fill = "white")
hist_dbp = ggplot(coronary, aes(dbp)) + 
  geom_histogram(color = "red", fill = "white")
bplot_chol = ggplot(coronary, aes(chol)) + 
  geom_boxplot(color = "blue")
bplot_dbp = ggplot(coronary, aes(dbp)) + 
  geom_boxplot(color = "red")
ggarrange(hist_chol, bplot_chol, hist_dbp, bplot_dbp)

FIGURE 7.1: Histograms and box-and-whiskers plots for chol and dbp.

7.5.3 Univariable analysis

For the analysis, we fit the SLR model, which consists of only one predictor (univariable). Here, chol is specified as the outcome, and dbp as the predictor. In glm, the formula is specified as outcome ~ predictor. Here, we specify chol ~ dbp as the formula in glm.

We fit and view the summary information of the model as,

slr_chol = glm(chol ~ dbp, data = coronary)
summary(slr_chol)
## 
## Call:
## glm(formula = chol ~ dbp, data = coronary)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9967  -0.8304  -0.1292   0.7734   2.8470  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.995134   0.492092   6.087 5.88e-09 ***
## dbp         0.038919   0.005907   6.589 3.92e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.154763)
## 
##     Null deviance: 278.77  on 199  degrees of freedom
## Residual deviance: 228.64  on 198  degrees of freedom
## AIC: 600.34
## 
## Number of Fisher Scoring iterations: 2

We can tidy up the glm output and obtain the 95% confidence interval (CI) using tidy() from the broom package,

tidy(slr_chol, conf.int = TRUE)
## # A tibble: 2 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)   3.00     0.492        6.09 5.88e- 9   2.03      3.96  
## 2 dbp           0.0389   0.00591      6.59 3.92e-10   0.0273    0.0505

From the output above, we pay attention to these results:

  • coefficients, \(b\) – column estimate.
  • 95% CI – columns conf.low and conf.high.
  • P-value – column p.value.

7.5.4 Model fit assessment

It is important to assess to what extent the SLR model reflects the data. First, we can assess this by \(R^2\), which is the percentage of the variance in the outcome that is explained by the predictor; in simpler words, to what extent the variation in the outcome values is explained by the predictor. This ranges from 0% (the predictor does not explain the outcome at all) to 100% (the predictor explains the outcome perfectly). Here, we obtain the \(R^2\) value,

rsq(slr_chol)
## [1] 0.1798257
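
The same value can be computed by hand, since for a Gaussian GLM \(R^2 = 1 - \text{residual deviance} / \text{null deviance}\); a quick cross-check using the deviances printed by summary(slr_chol) above:

with(slr_chol, 1 - deviance / null.deviance)  # 1 - 228.64/278.77, about 0.18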

Next, we can assess the model fit by a scatter plot,

plot_slr = ggplot(coronary, aes(x = dbp, y = chol)) + 
  geom_point() + geom_smooth(method = lm)
plot_slr

FIGURE 7.2: Scatter plot of chol (outcome) vs dbp (predictor).

This plot allows the assessment of normality, linearity and equal variance assumptions. We expect an elliptical/oval shape (normality) and equal scatter of dots above and below the prediction line (equal variance). These aspects indicate a linear relationship between chol and dbp (linearity).

7.5.5 Presentation and interpretation

To present the result, we can use tbl_regression() from the gtsummary package to come up with a nice table; we supply it with the glm output slr_chol.

tbl_regression(slr_chol) 
Characteristic   Beta   95% CI¹      p-value
dbp              0.04   0.03, 0.05   <0.001
¹ CI = Confidence Interval

It is also very informative to present the model equation,

\[chol = 3.0 + 0.04\times dbp\] where we obtain the intercept value from summary(slr_chol).
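
To see the equation at work, a minimal sketch that extracts the fitted coefficients and predicts the cholesterol level at, say, DBP = 80 mmHg (an arbitrary illustrative value):

b = coef(slr_chol)        # b[1] is the intercept, b[2] is the dbp coefficient
unname(b[1] + b[2] * 80)  # predicted chol at DBP of 80 mmHg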

Based on the \(R^2\) (which was 0.18), the table and the model equation, we may interpret the results as follows:

  • A 1 mmHg increase in DBP is associated with a 0.04 mmol/L increase in cholesterol level.
  • DBP explains 18% of the variance in cholesterol level.

7.6 Multiple Linear Regression

7.6.1 About Multiple Linear Regression

Multiple linear regression (MLR) models the linear relationship between:

  • outcome: numerical variable.
  • MORE than one predictor: numerical and categorical variables.

We may formally represent MLR in form of an equation,

\[\begin{aligned} numerical\ outcome = &\ intercept \\ & + coefficients \times numerical\ predictors \\ & + coefficients \times categorical\ predictors \end{aligned}\] or in a shorter form,

\[\hat y = b_0 + b_1x_1 + b_2x_2 + ... + b_px_p\] where we have p predictors.

Whenever the predictor is a categorical variable with more than two levels, we use dummy variable(s). There is no issue with a binary categorical variable. For a variable with more than two levels, the number of dummy variables (i.e. the number of binary variables it is turned into) equals the number of levels minus one. For example, whenever we have four levels, we obtain three dummy (binary) variables. As we will see later, glm will do this automatically for any factor variable and provide separate estimates for each dummy variable; the sketch below shows what this coding looks like.
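
To see the dummy coding concretely, a minimal sketch that inspects the coding R generates for the three-level race variable in our dataset:

head(model.matrix(~ race, data = coronary))  # columns racechinese, raceindian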

7.6.2 Data exploration

Now, for the MLR we are no longer restricted to one predictor. Let's say we are interested in the relationship between blood pressure (SBP and DBP), age, BMI, race and gender as the predictors and the cholesterol level (outcome). As before, we explore the variables by the descriptive statistics,

# numerical
coronary %>% select(-id, -cad, -race, -gender) %>% describe()
##      vars   n   mean    sd median trimmed   mad   min    max range  skew
## sbp     1 200 130.18 19.81 126.00  128.93 17.79 88.00 187.00 99.00  0.53
## dbp     2 200  82.31 12.90  80.00   81.68 14.83 56.00 120.00 64.00  0.42
## chol    3 200   6.20  1.18   6.19    6.18  1.18  4.00   9.35  5.35  0.18
## age     4 200  47.33  7.34  47.00   47.27  8.15 32.00  62.00 30.00  0.05
## bmi     5 200  37.45  2.68  37.80   37.65  2.37 28.99  45.03 16.03 -0.55
##      kurtosis   se
## sbp     -0.37 1.40
## dbp     -0.33 0.91
## chol    -0.31 0.08
## age     -0.78 0.52
## bmi      0.42 0.19
# categorical
coronary %>% dplyr::select(race, gender) %>% tbl_summary() 
Characteristic   N = 200¹
race
    malay        73 (36%)
    chinese      64 (32%)
    indian       63 (32%)
gender
    woman        100 (50%)
    man          100 (50%)
¹ n (%)

and the pairs plot, where we focus on the distribution of the data by histograms and box-and-whiskers plots. The pairs plot also includes information on the bivariate correlation statistics between the numerical variables.

coronary %>% select(-id, -cad) %>% ggpairs()

FIGURE 7.3: Pairs plot for all variables.

7.6.3 Univariable analysis

For the univariable analysis in the context of MLR, we aim to select variables that are worth including in the multivariable model.

In the context of exploratory research, we want to choose only variables with P-values < 0.25 to be included in the MLR. To obtain the P-values, you may perform a separate SLR for each of the predictors (on your own; a sketch is given at the end of this section). However, obtaining the P-value for each predictor is easy with the add1() function. Here, we use the likelihood ratio test (LRT) via the test = "LRT" option to obtain the P-values. We start with an intercept-only model slr_chol0, specified by the chol ~ 1 formula in glm, followed by add1(). add1() will test each predictor one by one.

slr_chol0 = glm(chol ~ 1, data = coronary)  # intercept only model
add1(slr_chol0, scope = ~ sbp + dbp + age + bmi + race + gender, 
     test = "LRT")
## Single term additions
## 
## Model:
## chol ~ 1
##        Df Deviance    AIC scaled dev.  Pr(>Chi)    
## <none>      278.77 637.99                          
## sbp     1   235.36 606.14      33.855 5.938e-09 ***
## dbp     1   228.64 600.34      39.648 3.042e-10 ***
## age     1   243.68 613.08      26.911 2.130e-07 ***
## bmi     1   272.17 635.20       4.792   0.02859 *  
## race    2   241.68 613.43      28.561 6.280e-07 ***
## gender  1   277.45 639.04       0.952   0.32933    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the output, all variables are important with P-values < .25, except gender. These variables, excluding gender, are the candidates selected in this variable selection step.

However, please keep in mind that in the context of confirmatory research, the variables that we want to include are not merely based on P-values alone. It is important to consider expert judgement as well.
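
For reference, the separate SLRs mentioned at the start of this section can be run in one short loop; a minimal sketch using base R and tidy() from broom:

predictors = c("sbp", "dbp", "age", "bmi", "race", "gender")
lapply(predictors, function(var) {
  fml = as.formula(paste("chol ~", var))  # e.g. chol ~ sbp
  tidy(glm(fml, data = coronary))         # coefficients and P-values per SLR
})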

7.6.4 Multivariable analysis

Multivariable analysis involves more than one predictor. In the univariable variable selection, we decided on several potential predictors. For the MLR, we (judiciously) include these variables in an MLR model. In the present dataset, we have the following considerations:

  • including both SBP and DBP is redundant, because both represent the blood pressure and are highly correlated. This is indicated by the correlation coefficient, r = 0.828, and the scatter plot for the SBP-DBP pair in the pairs plot from the data exploration step (see also the vif() check after this list).
  • gender was not significant, thus we may exclude the variable.
  • let's say, as advised by experts in the field, we should exclude age from the modelling.
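
As a side check on the first consideration, we may quantify the SBP-DBP multicollinearity with vif() from the car package (loaded earlier). A minimal sketch, deliberately fitting a model with both blood pressures; note that for factors such as race, vif() reports generalized VIFs, and values above 10 are commonly taken as a sign of problematic collinearity:

vif(glm(chol ~ sbp + dbp + bmi + race, data = coronary))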

Now, given these considerations, we perform MLR with the selected variables,

mlr_chol = glm(chol ~ dbp + bmi + race, data = coronary)
summary(mlr_chol)
## 
## Call:
## glm(formula = chol ~ dbp + bmi + race, data = coronary)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.18698  -0.73076  -0.01935   0.63476   2.91524  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.870859   1.245373   3.911 0.000127 ***
## dbp          0.029500   0.006203   4.756 3.83e-06 ***
## bmi         -0.038530   0.028099  -1.371 0.171871    
## racechinese  0.356642   0.181757   1.962 0.051164 .  
## raceindian   0.724716   0.190625   3.802 0.000192 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.083909)
## 
##     Null deviance: 278.77  on 199  degrees of freedom
## Residual deviance: 211.36  on 195  degrees of freedom
## AIC: 590.63
## 
## Number of Fisher Scoring iterations: 2

From the output above, for each variable, we focus on these results:

  • coefficients, \(b\)s – column estimate.
  • 95% CIs – columns conf.low and conf.high.
  • P-values – column p.value.

Note that for a categorical variable with more than two categories, an estimate is obtained for each dummy variable. In our case, race consists of Malay, Chinese and Indian. From the output, the dummy variables are racechinese, representing Chinese vs Malay, and raceindian, representing Indian vs Malay, where Malay is set as the baseline comparison group.

We also notice that some variables are not significant at the 0.05 significance level, namely bmi and racechinese. As for the racechinese dummy variable, because it forms part of the race variable, we keep race in the model: racechinese is only marginally non-significant (P = 0.0512 vs 0.05) and the other dummy variable, raceindian, is significant.

Stepwise automatic variable selection

We noted that not all variables included in the model are significant. In our case, we may remove bmi because it is not statistically significant. But in exploratory research where we have hundreds of variables, it is impossible to select variables by eyeballing each one. So, in this case, how do we perform the variable selection? To explore the significance of the variables, we may perform stepwise automatic selection. It is important to know that stepwise selection is meant for exploratory research; for confirmatory analysis, it is important to rely on expert opinion for the variable selection. We may perform forward selection, backward selection, or both combined.

Forward selection starts with an intercept-only (empty) model without any variables. It proceeds by adding one variable after another. In R, the Akaike information criterion (AIC) is used as the comparative measure of goodness of fit and model quality. The stepwise selection iteratively seeks the model with the lowest AIC, and the steps are shown in the output. More information about AIC can be found in Hu (2007) and Wikipedia (2022); its general form is given below.
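
For reference, the AIC has a standard general form,

\[AIC = 2k - 2\ln(\hat{L})\]

where \(k\) is the number of estimated parameters and \(\hat{L}\) is the maximized likelihood of the model; a lower AIC indicates a better balance between fit and model complexity.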

# forward
mlr_chol_stepforward = step(slr_chol0, scope = ~ dbp + bmi + race, 
                            direction = "forward")
## Start:  AIC=637.99
## chol ~ 1
## 
##        Df Deviance    AIC
## + dbp   1   228.64 600.34
## + race  2   241.68 613.43
## + bmi   1   272.17 635.20
## <none>      278.77 637.99
## 
## Step:  AIC=600.34
## chol ~ dbp
## 
##        Df Deviance    AIC
## + race  2   213.40 590.55
## <none>      228.64 600.34
## + bmi   1   227.04 600.94
## 
## Step:  AIC=590.55
## chol ~ dbp + race
## 
##        Df Deviance    AIC
## <none>      213.40 590.55
## + bmi   1   211.36 590.63

Backward selection starts with the model containing all variables. It then proceeds by removing one variable after another, again seeking the model with the lowest AIC.

# backward
mlr_chol_stepback = step(mlr_chol, direction = "backward")
## Start:  AIC=590.63
## chol ~ dbp + bmi + race
## 
##        Df Deviance    AIC
## - bmi   1   213.40 590.55
## <none>      211.36 590.63
## - race  2   227.04 600.94
## - dbp   1   235.88 610.58
## 
## Step:  AIC=590.55
## chol ~ dbp + race
## 
##        Df Deviance    AIC
## <none>      213.40 590.55
## - race  2   228.64 600.34
## - dbp   1   241.68 613.43

Bidirectional selection, as implemented in R, starts with the model containing all variables. Then it proceeds by removing or adding variables, combining the forward and backward selection methods. It stops once it finds the model with the lowest AIC.

# both
mlr_chol_stepboth = step(mlr_chol, direction = "both")
## Start:  AIC=590.63
## chol ~ dbp + bmi + race
## 
##        Df Deviance    AIC
## - bmi   1   213.40 590.55
## <none>      211.36 590.63
## - race  2   227.04 600.94
## - dbp   1   235.88 610.58
## 
## Step:  AIC=590.55
## chol ~ dbp + race
## 
##        Df Deviance    AIC
## <none>      213.40 590.55
## + bmi   1   211.36 590.63
## - race  2   228.64 600.34
## - dbp   1   241.68 613.43

Preliminary model

Let's say, after considering the P-values, the stepwise selection (in exploratory research) and expert opinion, we decide that our preliminary model is,

chol ~ dbp + race

and we fit the model again to view basic information of the model,

mlr_chol_sel = glm(chol ~ dbp + race, data = coronary)
summary(mlr_chol_sel)
## 
## Call:
## glm(formula = chol ~ dbp + race, data = coronary)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1378  -0.7068  -0.0289   0.5997   2.7778  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.298028   0.486213   6.783 1.36e-10 ***
## dbp         0.031108   0.006104   5.096 8.14e-07 ***
## racechinese 0.359964   0.182149   1.976 0.049534 *  
## raceindian  0.713690   0.190883   3.739 0.000243 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.088777)
## 
##     Null deviance: 278.77  on 199  degrees of freedom
## Residual deviance: 213.40  on 196  degrees of freedom
## AIC: 590.55
## 
## Number of Fisher Scoring iterations: 2
rsq(mlr_chol_sel)
## [1] 0.2345037

7.6.5 Interaction

An interaction is a combination of predictors whose regression coefficients must be interpreted separately based on the levels of another predictor. For example, we would need a separate interpretation of the coefficient for dbp for each race group: Malay, Chinese or Indian. This makes interpreting our analysis complicated, as we can no longer interpret each coefficient on its own. So, most of the time, we pray not to have an interaction in our regression model.
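
In equation form, using the notation from the MLR equation earlier, a two-way interaction adds a product term,

\[\hat y = b_0 + b_1x_1 + b_2x_2 + b_3(x_1 \times x_2)\]

where \(b_3\) is the coefficient for the interaction. We fit the model with a two-way interaction term,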

summary(glm(chol ~ dbp * race, data = coronary))
## 
## Call:
## glm(formula = chol ~ dbp * race, data = coronary)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.10485  -0.77524  -0.02423   0.58059   2.74380  
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.11114    0.92803   2.275 0.024008 *  
## dbp              0.04650    0.01193   3.897 0.000134 ***
## racechinese      1.95576    1.28477   1.522 0.129572    
## raceindian       2.41530    1.25766   1.920 0.056266 .  
## dbp:racechinese -0.02033    0.01596  -1.273 0.204376    
## dbp:raceindian  -0.02126    0.01529  -1.391 0.165905    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.087348)
## 
##     Null deviance: 278.77  on 199  degrees of freedom
## Residual deviance: 210.95  on 194  degrees of freedom
## AIC: 592.23
## 
## Number of Fisher Scoring iterations: 2

From the output, there is no evidence for the presence of an interaction, because the interaction terms were not significant. In R, it is easy to fit an interaction with *, e.g. dbp * race automatically includes all the variables involved. This is equivalent to specifying glm(chol ~ dbp + race + dbp:race, data = coronary), where : includes only the interaction term.

7.6.6 Model fit assessment

For the MLR, we assess the model fit by \(R^2\) and by a histogram and scatter plots of the residuals. Residuals, in simple terms, are the discrepancies between the observed values (dots) and the values predicted by the fitted MLR model. So, the smaller the discrepancies, the better the model fit.

Percentage of variance explained, \(R^2\)

First, we obtain the \(R^2\) value. In comparison to the \(R^2\) obtained for the SLR, here we include adj = TRUE to obtain an adjusted \(R^2\). The adjusted \(R^2\) is the \(R^2\) with a penalty for the number of predictors p. This discourages including too many variables, which might be unnecessary.
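
For reference, the standard formula for the adjusted \(R^2\) is

\[R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}\]

where \(n\) is the sample size and \(p\) is the number of predictors.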

rsq(mlr_chol_sel, adj = TRUE)
## [1] 0.2227869

Histogram and box-and-whiskers plot

Second, we plot a histogram and a box-and-whiskers plot to assess the normality of the raw/unstandardized residuals of the MLR model. We expect normally distributed residuals, which indicate a good fit of the MLR model. Here, the residuals are normally distributed.

rraw_chol = resid(mlr_chol_sel)
hist(rraw_chol)

FIGURE 7.4: Histogram of raw residuals.

boxplot(rraw_chol)

FIGURE 7.5: Box-and-whiskers plot of raw residuals.

Scatter plots

Third, we plot the standardized residuals (Y-axis) against the standardized predicted values (X-axis). Similar to the plot for the SLR, this plot allows assessment of the normality, linearity and equal variance assumptions. The dots should form an elliptical/oval shape (normality) and be scattered roughly equally above and below the zero line (equal variance). Both of these indicate linearity. Our plot below shows that the assumptions are met.

rstd_chol = rstandard(mlr_chol_sel)  # standardized residuals
pstd_chol = scale(predict(mlr_chol_sel))  # standardized predicted values
plot(rstd_chol ~ pstd_chol, xlab = "Std predicted", ylab = "Std residuals")
abline(0, 0)  # normal, linear, equal variance

FIGURE 7.6: Scatter plot of standardized residuals vs standardized predicted values

In addition to the standardized residuals vs standardized predicted values plot, for numerical predictors we assess the linearity of the relationship between the raw residuals and the observed values of each numerical predictor. We plot the raw residuals vs the numerical predictor below. The plot is interpreted in a similar way to the standardized residuals vs standardized predicted values plot. The plot shows good linearity between the residuals and the numerical predictor.

plot(rraw_chol ~ coronary$dbp, xlab = "DBP", ylab = "Raw Residuals")
abline(0, 0)

FIGURE 7.7: Scatter plot of raw residuals vs DBP (numerical predictor).

7.6.7 Presentation and interpretation

After passing all the assumption checks, we may now decide on our final model. We may rename the preliminary model mlr_chol_sel to mlr_chol_final for easier reference.

mlr_chol_final = mlr_chol_sel

Similar to SLR, we use tbl_regression() to come up with a nice table to present the results.

tbl_regression(mlr_chol_final) 
Characteristic   Beta   95% CI¹      p-value
dbp              0.03   0.02, 0.04   <0.001
race
    malay        —
    chinese      0.36   0.00, 0.72   0.050
    indian       0.71   0.34, 1.1    <0.001
¹ CI = Confidence Interval

It is also useful to save the output in spreadsheet format for later use. We can use the tidy() function and export the result to a .csv file,

tib_mlr = tidy(mlr_chol_final, conf.int = TRUE)
write.csv(tib_mlr, "mlr_final.csv")

Then, we present the model equation. Cholesterol level in mmol/L can be predicted by its predictors as given by,

\[chol = 3.30 + 0.03\times dbp + 0.36\times race\ (chinese) + 0.71\times race\ (indian)\] Based on the adjusted \(R^2\), the table and the model equation, we may interpret the results as follows:

  • A 1 mmHg increase in DBP is associated with a 0.03 mmol/L increase in cholesterol, while controlling for the effect of race.
  • Likewise, a 10 mmHg increase in DBP is associated with a 0.03 x 10 = 0.3 mmol/L increase in cholesterol, while controlling for the effect of race.
  • Being Chinese is associated with a 0.36 mmol/L higher cholesterol level in comparison to Malay, while controlling for the effect of DBP.
  • Being Indian is associated with a 0.71 mmol/L higher cholesterol level in comparison to Malay, while controlling for the effect of DBP.
  • Together, DBP and race explain 22.3% of the variance in cholesterol.

For each of these interpretations, please keep in mind to also consider the 95% CI of each coefficient. For example, being Indian is associated with a 0.71 mmol/L higher cholesterol level in comparison to Malay, where this difference may range from 0.34 mmol/L to 1.1 mmol/L based on the 95% CI.

7.7 Prediction

In some situations, it is useful to use the SLR/MLR model for prediction. For example, we may want to predict the cholesterol level of a patient given some clinical characteristics. We can use the final model above for prediction. To start, let us view the predicted values for our sample,

coronary$pred_chol = predict(mlr_chol_final)
head(coronary)
##   id    cad sbp dbp   chol age  bmi   race gender pred_chol
## 1  1 no cad 106  68 6.5725  60 38.9 indian  woman  6.127070
## 2 14 no cad 130  78 6.3250  34 37.8  malay  woman  5.724461
## 3 56 no cad 136  84 5.9675  36 40.5  malay  woman  5.911109
## 4 61 no cad 138 100 7.0400  45 37.6  malay  woman  6.408839
## 5 62 no cad 115  85 6.6550  53 40.3 indian    man  6.655908
## 6 64 no cad 124  72 5.9675  43 37.6  malay    man  5.537812

Compare the predicted values with the observed cholesterol levels. Recall that we already checked this as part of the model fit assessment.

It is more useful to predict for newly observed data. Let us try predicting the cholesterol level for an Indian patient with DBP = 90 mmHg,

predict(mlr_chol_final, list(dbp = 90, race = "indian"))
##        1 
## 6.811448
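
As a sanity check, the same number can be obtained by plugging the values into the model equation; a minimal sketch using the fitted coefficients (names as printed in the summary output above):

b = coef(mlr_chol_final)
unname(b["(Intercept)"] + b["dbp"] * 90 + b["raceindian"])  # Indian, DBP = 90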

Now, we also do so for data with several more patients,

new_data = data.frame(dbp = c(90, 90, 90), 
                      race = c("malay", "chinese", "indian"))
predict(mlr_chol_final, new_data)
##        1        2        3 
## 6.097758 6.457722 6.811448
new_data$pred_chol = predict(mlr_chol_final, new_data)
new_data
##   dbp    race pred_chol
## 1  90   malay  6.097758
## 2  90 chinese  6.457722
## 3  90  indian  6.811448

7.8 Summary

In this chapter, we went through the basics of SLR and MLR. We performed each analysis and learned how to assess the model fit of the regression models. We learned how to present and interpret the results nicely. In addition, we also learned how to utilize the model for prediction. Excellent texts on regression include Biostatistics: Basic Concepts and Methodology for the Health Sciences (Daniel and Cross 2014) and more advanced books such as Applied Linear Statistical Models (Neter 2013) and Epidemiology: Study Design and Data Analysis (Woodward 2013).

References

Daniel, W. W., and C. L. Cross. 2014. Biostatistics: Basic Concepts and Methodology for the Health Sciences. Wiley Series in Probability and Statistics Series. Wiley. https://books.google.com.my/books?id=zrCkmwEACAAJ.
Hu, Shuhua. 2007. “Akaike Information Criterion.” Center for Research in Scientific Computation 93.
Neter, John, Michael Kutner, William Wasserman, and Christopher Nachtsheim. 2013. Applied Linear Statistical Models. 5th ed. McGraw Hill India. https://lead.to/amazon/com/?op=bt&la=en&cu=usd&key=1259064743.
Wikipedia. 2022. “Akaike Information Criterion, Wikipedia. The Free Encyclopedia.” https://en.wikipedia.org/wiki/Akaike_information_criterion.
Woodward, M. 2013. Epidemiology: Study Design and Data Analysis, Third Edition. Chapman & Hall/CRC Texts in Statistical Science. CRC Press. https://books.google.com.my/books?id=VJDSBQAAQBAJ.