4  Fitting linear models


4.1 The linear model

Bullet List 4.1 : Assumptions of the linear model

The assumptions of the linear model are:

  • Continuous response: The response is a numeric variable that is at least nominally continuous. (Qualitative response variables and count response variables require other regression models. See ?sec-chap06.)
  • Independence: The observations of the variables for one case are independent of the observations for all other cases. (If cases are dependent, linear mixed-effects models may be more appropriate. See ?sec-chap07.)
  • Linearity: The dependence of the response on the predictors is through the conditional expected value or mean function. (The {car} package contains many functions that can help you decide whether the assumption of linearity is reasonable in any given problem. See ?sec-chap08.)
  • Constant conditional variance: The conditional variance of the response given the regressors (or, equivalently, the predictors) is constant. (See ?sec-chap05-1, ?sec-chap05-3, and ?sec-chap08-5 for diagnostic tools.)

Changing assumptions changes the model. For example, it is common to add a normality assumption, producing the normal linear model. Another common extension of the linear model is to modify the constant variance assumption, producing the weighted linear model.
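Both extensions stay within lm(): the weighted linear model, for example, only needs the weights argument. The following minimal sketch is my own (not from the book) and uses simulated data and made-up case weights purely to show the syntax:

Code
set.seed(1)
dat <- data.frame(x = 1:20, y = 1:20 + rnorm(20))
w   <- rep(c(1, 2), length.out = nrow(dat))   # made-up case weights, for illustration only

lm(y ~ x, data = dat)                # ordinary linear model
lm(y ~ x, data = dat, weights = w)   # weighted linear model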

4.2 Linear least-square regression

4.2.1 Simple linear regression

The Davis data set in the {carData} package contains the measured and self-reported heights and weights of 200 men and women engaged in regular exercise. A few of the data values are missing, and consequently there are only 183 complete cases for the variables that are used in the analysis reported below.

The variables weight (measured weight) and repwt (reported weight) are in kilograms, and height (measured height) and repht (reported height) are in centimeters. We focus here on the regression of weight on repwt.

Example 4.1 : Simple linear regression

R Code 4.1 : A quick look at the Davis data

Listing / Output 4.1: Skim the Davis data set
Code
Davis <- carData::Davis

skimr::skim(Davis)
Data summary
Name Davis
Number of rows 200
Number of columns 5
_______________________
Column type frequency:
factor 1
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
sex 0 1 FALSE 2 F: 112, M: 88

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
weight 0 1.00 65.80 15.10 39 55.0 63.0 74.00 166 ▇▆▁▁▁
height 0 1.00 170.02 12.01 57 164.0 169.5 177.25 197 ▁▁▁▇▇
repwt 17 0.92 65.62 13.78 41 55.0 63.0 73.50 124 ▆▇▃▁▁
repht 17 0.92 168.50 9.47 148 160.5 168.0 175.00 200 ▃▇▇▃▁

Instead of using the base::summary() function, I applied the skim() function from the {skimr} package. See Package Profile A.19.

R Code 4.2 : Simple linear regression of Davis data

Listing / Output 4.2: Simple linear regression model of the Davis data
Code
glue::glue("Standard call of the lm function")
(davis_lm0 <- lm(weight ~ repwt, data = Davis))

save_data_file("chap04", davis_lm0, "davis_lm0.rds")

glue::glue("")
glue::glue("################################################################")
glue::glue("Suppressing the intercept to force the regression through the origin")
lm(weight ~ -1 + repwt, data = Davis)
#> Standard call of the lm function
#> 
#> Call:
#> lm(formula = weight ~ repwt, data = Davis)
#> 
#> Coefficients:
#> (Intercept)        repwt  
#>      5.3363       0.9278  
#> 
#> 
#> ################################################################
#> Suppressing the intercept to force the regression through the origin
#> 
#> Call:
#> lm(formula = weight ~ -1 + repwt, data = Davis)
#> 
#> Coefficients:
#> repwt  
#> 1.006

A model formula consists of three parts:

    1. The left-hand side of the model formula specifies the response variable; it is usually a variable name (weight, in the example) but may be an expression that evaluates to the response (e.g., sqrt(weight), log(income), or income/hours.worked).
    2. The tilde is a separator.
    3. The right-hand side is the most complex part of the formula. It is a special expression, including the names of the predictors, that R evaluates to produce the regressors for the model. **The arithmetic operators, +, -, *, /, and ^, have special meaning on the right-hand side of a model formula, but they retain their ordinary meaning on the left-hand side of the formula.** R will use any unadorned numeric predictor on the right-hand side of the model formula as a regressor, as is desired here for simple regression. The intercept \(\beta_{0}\) is included in the model without being specified directly.

We can suppress the intercept to force the regression through the origin using weight ~ -1 + repwt or weight ~ repwt - 1. More generally, a minus sign removes a term, here the intercept, from the predictor. Using \(0\) (zero) on the right-hand side, as in weight ~ 0 + repwt, suppresses the intercept as well; the two spellings are equivalent.
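A quick check of this equivalence (my own, not part of the book's code):

Code
## Both formulas drop the intercept, so the fitted slope is identical
coef(lm(weight ~ -1 + repwt, data = Davis))
coef(lm(weight ~ 0 + repwt, data = Davis))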

The stats::lm() function returns a linear-model object, which was assigned to the R variable davis_lm0. Printing this object produces relatively sparse output: just the command that created the model along with the estimated regression coefficients.

Caution 4.1: Difference in coefficient without intercept

There is a small difference in the repwt coefficient with and without the intercept. Where does this come from and what does it mean?

Is this the result of “forcing the regression through the origin”?
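A hint at the answer (my own reasoning, not from the book): without an intercept, the least-squares slope is \(\sum x_{i}y_{i} / \sum x_{i}^2\), so the line is forced through the origin instead of through the point of means, which shifts the slope slightly. A minimal check:

Code
## Without an intercept the OLS slope is sum(x * y) / sum(x^2),
## computed here on the complete cases of weight and repwt
d <- na.omit(Davis[, c("weight", "repwt")])
with(d, sum(weight * repwt) / sum(repwt^2))   # should reproduce the 1.006 slope above
coef(lm(weight ~ -1 + repwt, data = d))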

R Code 4.3 : Summary of the linear model davis_lm0

Listing / Output 4.3: Using stats::summary() for the davis_lm0 object
Code
summary(davis_lm0)
#> 
#> Call:
#> lm(formula = weight ~ repwt, data = Davis)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#>  -7.048  -1.868  -0.728   0.601 108.705 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)   5.3363     3.0369   1.757   0.0806
#> repwt         0.9278     0.0453  20.484   <2e-16
#> 
#> Residual standard error: 8.419 on 181 degrees of freedom
#>   (17 observations deleted due to missingness)
#> Multiple R-squared:  0.6986, Adjusted R-squared:  0.697 
#> F-statistic: 419.6 on 1 and 181 DF,  p-value: < 2.2e-16

For an explanation of the different printed parts see Listing / Output 4.4.

R Code 4.4 : Default usage of the more sophisticated summary with car::S()

Listing / Output 4.4: Default summary output of car::S()
Code
car::S(davis_lm0)
#> Call: lm(formula = weight ~ repwt, data = Davis)
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)   5.3363     3.0369   1.757   0.0806
#> repwt         0.9278     0.0453  20.484   <2e-16
#> 
#> Residual standard deviation: 8.419 on 181 degrees of freedom
#>   (17 observations deleted due to missingness)
#> Multiple R-squared: 0.6986
#> F-statistic: 419.6 on 1 and 181 DF,  p-value: < 2.2e-16 
#>     AIC     BIC 
#> 1303.06 1312.69

Compared with the base::summary() printout, car::S() omits the residual distribution but adds two figures, AIC and BIC, that can be used to compare and choose between different models.

  • Call: A brief header is printed first, repeating the command that created the regression model.
  • Coefficients: The part of the output marked Coefficients provides basic information about the estimated regression coefficients.
    • The row names (=first column) lists the names of the regressors fit by lm().
      • Intercept: The intercept, if present, is named (Intercept).
      • Estimate: The column labeled Estimate provides the least-squares estimates of the regression coefficients.
      • Std. Error: The column marked Std. Error displays the standard errors of the estimated coefficients. The default standard errors are computed as \(\sigma\) times a function of the regressors, as presented in any regression text, but the car::S() function also allows you to use other methods for computing standard errors.
      • t-value: The column marked t value is the ratio of each estimate to its standard error and is a Wald test statistic for the null hypothesis that the corresponding population regression coefficient is equal to zero. If assumptions hold and the errors are normally distributed or the sample size is large enough, then these t-statistics computed with the default standard error estimates are distributed under the null hypothesis as t random variables with degrees of freedom (df ) equal to the residual degrees of freedom under the model.
      • Pr (>|t|): The column marked Pr (>|t|) is the two-sided p-value for the null hypothesis that the corresponding population coefficient is zero, assuming that the t-distribution is appropriate. (Example: The hypothesis \(\beta_{0} = 0\) versus \(\beta_{0} \neq 0\), with the value of \(\beta_{1}\) unspecified, has a p-value of about .08, providing weak evidence against the null hypothesis that \(\beta_{0} = 0\), if the assumptions of the model hold.)
    • Below the coefficient table is additional summary information, including the residual standard deviation. (Example: This typical error in prediction is so large that, if correct, it is unlikely that the predictions of actual weight from reported weight would be of any practical value.)
    • The residual df are n − 2 for simple regression, here 183 − 2 = 181, and we’re alerted to the fact that 17 cases are removed because of missing data.
    • The Multiple R-squared, \(R^2 \approx 0.70\), is the square of the correlation between the response and the fitted values and is interpretable as the proportion of variation of the response variable around its mean accounted for by the regression.
    • F-statistic: The reported F-statistic provides a likelihood-ratio test of the general null hypothesis that all the population regression coefficients, except for the intercept, are equal to zero, versus the alternative that at least one of the \(\beta_{j}\) is nonzero. If the errors are normal or n is large enough, then this test statistic has an F-distribution with the degrees of freedom shown. Because simple regression has only one parameter beyond the intercept, the F-test is equivalent to the t-test that \(\beta_{1} = 0\), with \(t^{2} = F\). (In other models, such as GLMs or normal nonlinear models, Wald tests and the generalization of the likelihood-ratio F-test may test the same hypotheses, but they need not provide the same inferences.)
    • The AIC and BIC values at the bottom of the output are, respectively, the Akaike Information Criterion and the Bayesian Information Criterion, statistics that are sometimes used for model selection; a quick cross-check with stats::AIC() and stats::BIC() follows below this list.
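A minimal cross-check (my own) that these are the usual information criteria:

Code
## Should reproduce the AIC and BIC values printed by car::S() above
stats::AIC(davis_lm0)
stats::BIC(davis_lm0)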
Note 4.1

With the exception of the AIC and BIC values I cannot see any advantage of the default version of car::S() over base::summary(). I am curious what it means that car::S() “adds more functionality”.

One obvious addition is that car::S() allows different methods for computing standard errors.
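A sketch of that functionality (my own example; it assumes that car::S() accepts a vcov. argument and that car::hccm() supplies a heteroscedasticity-corrected coefficient covariance matrix, as documented in the {car} package):

Code
## Same summary, but with heteroscedasticity-corrected (sandwich) standard errors
car::S(davis_lm0, vcov. = car::hccm(davis_lm0))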

R Code 4.5 : Short summary of the lm object with car::brief()

Listing / Output 4.5: Brief summary of the lm object with car::brief()
Code
car::brief(davis_lm0)
#>            (Intercept)  repwt
#> Estimate          5.34 0.9278
#> Std. Error        3.04 0.0453
#> 
#>  Residual SD = 8.42 on 181 df, R-squared = 0.699

car::brief() for an lm object prints only the coefficient estimates and their standard errors.

R Code 4.6 : Confidence intervals for the estimates of davis_lm0

Listing / Output 4.6: Confidence intervals for the estimates of davis_lm0 with stats::confint()
Code
glue::glue("Confidence intervals with `stats::confint()`")
stats::confint(davis_lm0)
glue::glue(" ")
glue::glue("#############################################")
glue::glue("Cofidence intervals with `car::Confint()`")
car::Confint(davis_lm0)
#> Confidence intervals with `stats::confint()`
#>                  2.5 %    97.5 %
#> (Intercept) -0.6560394 11.328560
#> repwt        0.8384665  1.017219
#>  
#> #############################################
#> Confidence intervals with `car::Confint()`
#>              Estimate      2.5 %    97.5 %
#> (Intercept) 5.3362605 -0.6560394 11.328560
#> repwt       0.9278428  0.8384665  1.017219

Unlike stats::confint(), car::Confint() prints the coefficient estimates along with the confidence limits.

The separate confidence intervals do not address the hypothesis that simultaneously \(\beta_{0} = 0\) and \(\beta_{1} = 1\) versus the alternative that either or both of the intercept and slope differ from these values.
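Such a simultaneous hypothesis needs a joint test. A sketch (my own addition, not part of the listing above) using car::linearHypothesis():

Code
## Joint test of beta_0 = 0 and beta_1 = 1 against the alternative that
## at least one of the two differs from these values
car::linearHypothesis(davis_lm0, c("(Intercept) = 0", "repwt = 1"))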

We should have started with a visualization of the data set! Always start with Exploratory Data Analysis (EDA)!

Example 4.2 : Exploratory Data Analysis of Davis data set

R Code 4.7 : Scatterplot of davis_lm0

Listing / Output 4.7: Scatterplot of measured weight (weight) by reported weight (repwt) for Davis’s data. The solid line is the least-squares linear regression line, and the broken line is the line of unbiased reporting y = x.
Code
Davis <- carData::Davis

graphics::plot(weight ~ repwt, data = Davis)
graphics::abline(0, 1, lty = "dashed", lwd = 2)
graphics::abline(davis_lm0, lwd = 2)
graphics::legend(
  "bottomright",
  c("Unbiased Reporting", "Least Squares"),
  lty = c("dashed", "solid"),
  lwd = 2,
  inset = 0.02
)
base::with(Davis, car::showLabels(repwt, weight, n = 3, method = "mahal"))

#> 12 21 54 
#> 12 21 52

  • The graphics::plot() function draws the basic scatterplot, to which we use graphics::abline() to add the line of unbiased reporting (y = x, with intercept zero and slope 1, the broken line) and the least-squares line (the solid line).
  • The graphics::legend() command adds a legend at the lower right of the plot, inset from the corner by 2% of the plot’s size.
  • The car::showLabels() function identifies the three points with the largest Mahalanobis distances from the center of the data.

The resulting graph reveals an extreme outlier, Case 12. Case 21 also has unusually large values of both measured and reported weight but is in line with the rest of the data.

It seems bizarre that an individual who weighs more than 160 kg would report her weight as less than 60 kg, but there is a simple explanation: On data entry, Subject 12’s height in centimeters and weight in kilograms were inadvertently exchanged.
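Row 12 can be inspected directly to see the apparent swap (a quick look, my own addition):

Code
## Case 12: a measured weight of 166 kg paired with a measured height of 57 cm,
## consistent with the two values having been exchanged on data entry
Davis[12, ]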

R Code 4.8 : Update linear model after deleting data point 12

Listing / Output 4.8: Linear model of updated Davis data (davis_lm1)
Code
glue::glue("#####################################################")
glue::glue("Updated linear model `davis_lm1`")
glue::glue("#####################################################")
(davis_lm1 <- update(davis_lm0, subset = -12))

save_data_file("chap04", davis_lm1, "davis_lm1.rds")

glue::glue("#####################################################")
glue::glue("Display summary with `car::S()`")
glue::glue("#####################################################")
glue::glue(" ")
car::S(davis_lm1)
#> #####################################################
#> Updated linear model `davis_lm1`
#> #####################################################
#> 
#> Call:
#> lm(formula = weight ~ repwt, data = Davis, subset = -12)
#> 
#> Coefficients:
#> (Intercept)        repwt  
#>      2.7338       0.9584  
#> 
#> #####################################################
#> Display summary with `car::S()`
#> #####################################################
#>  
#> Call: lm(formula = weight ~ repwt, data = Davis, subset = -12)
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  2.73380    0.81479   3.355 0.000967
#> repwt        0.95837    0.01214  78.926  < 2e-16
#> 
#> Residual standard deviation: 2.254 on 180 degrees of freedom
#>   (17 observations deleted due to missingness)
#> Multiple R-squared: 0.9719
#> F-statistic:  6229 on 1 and 180 DF,  p-value: < 2.2e-16 
#>    AIC    BIC 
#> 816.27 825.88

R Code 4.9 : Compare coefficients and confidence intervals

Listing / Output 4.9: Using car::compareCoefs() to compare the estimated coefficients and their standard errors from the two fitted regressions, with and without Case 12, and car::Confint() to recompute the confidence intervals for the coefficients
Code
glue::glue("############################################################")
glue::glue("Compare estimated coefficients of both models")
glue::glue("############################################################")
glue::glue(" ")
car::compareCoefs(davis_lm0, davis_lm1)

glue::glue(" ")
glue::glue("############################################################")
glue::glue("Recomputing the confidence intervals for the coefficients")
glue::glue("############################################################")
glue::glue(" ")
car::Confint(davis_lm1)
#> ############################################################
#> Compare estimated coefficients of both models
#> ############################################################
#>  
#> Calls:
#> 1: lm(formula = weight ~ repwt, data = Davis)
#> 2: lm(formula = weight ~ repwt, data = Davis, subset = -12)
#> 
#>             Model 1 Model 2
#> (Intercept)   5.336   2.734
#> SE            3.037   0.815
#>                            
#> repwt        0.9278  0.9584
#> SE           0.0453  0.0121
#>                            
#>  
#> ############################################################
#> Recomputing the confidence intervals for the coefficients
#> ############################################################
#>  
#>              Estimate     2.5 %    97.5 %
#> (Intercept) 2.7338020 1.1260247 4.3415792
#> repwt       0.9583743 0.9344139 0.9823347

Both the intercept and the slope change, but not dramatically, when the outlier is removed. Case 12 is a relatively low-leverage point, because its value for repwt is near the center of the distribution of the regressor.
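The leverage claim can be checked from the hat values of the original fit (my own quick check):

Code
## Hat value (leverage) of case 12 versus the average leverage (k + 1)/n = 2/183
stats::hatvalues(davis_lm0)["12"]
2 / 183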

In contrast, there is a major change in the residual standard deviation, reduced from an unacceptable 8.4 kg to a possibly acceptable 2.2 kg.

As a consequence, the coefficient standard errors are also much smaller when the outlier is removed, and the value of \(R^2\) is greatly increased. Because the coefficient standard errors are now smaller, the hypothesized values \(\beta_{0} = 0\) and \(\beta_{1} = 1\) both fall outside of the recomputed 95% confidence intervals for the coefficients, even though the estimated intercept \(b_{0}\) is still close to zero and the estimated slope \(b_{1}\) is close to 1.¹

In the following experiment I am trying to reproduce the book’s Figure 4.1. I have to solve two challenges:

  1. Calculating the Mahalanobis distances and returning the largest ones.
  2. Using these values to print the row names (ID numbers in this case) near the coordinates of the points they belong to.

Experiment 4.1 : Interactive plot and labelled Mahalanobis distances

R Code 4.10 : Simple linear regression of the Davis data as an interactive plot produced with {ggiraph}

Listing / Output 4.10: Interactive plot of the linear model of weight regressed on repwt for the Davis data set. Interactivity created with {ggiraph}.
Code
Davis <- carData::Davis

gg_davis <- Davis |>
  tidyr::drop_na(repwt) |>
  tibble::rowid_to_column(var = "ID") |>
  ggplot2::ggplot(
        ggplot2::aes(
          x = repwt,
          y = weight
        )
  ) +
  ggiraph::geom_point_interactive(
    ggplot2::aes(
      x = repwt,
      y = weight,
      tooltip = ID,
      data_id = ID
    ),
    shape = 1,
    size = 2
  ) +
  ggplot2::geom_smooth(
    ggplot2::aes(
      linetype = "solid"
    ),
    formula = y ~ x,
    method = lm,
    se = FALSE,
    color = "black"
  ) +
    ggplot2::geom_smooth(
      ggplot2::aes(
        linetype = "dashed",
      ),
      formula = y ~ x,
      method = loess,
      se = FALSE,
      color = "black"
  ) +
  ggplot2::scale_linetype_discrete(
    name = "",
    label = c(
      "Least Squares",
      "Unbiased Reporting   "
    )
  ) +
  ggplot2::guides(
    linetype = ggplot2::guide_legend(position = "inside")
  ) +
  ggplot2::theme(
    legend.title = ggplot2::element_blank(),
    legend.position.inside = c(.8, .15),
    legend.box.background = ggplot2::element_rect(
                              color = "black", 
                              linewidth = 1
                              )
                 )

ggiraph::girafe(ggobj = gg_davis)

This was my first try to replicate the book’s Figure 4.1. There are two important qualifications to make:

1. Producing an interactive graphics with {ggiraph}

Instead of printing out the two points with the largest Mahalanobis distance, I have applied functions from the {ggiraph} package to produce an interactive plot. When you hover the cursor over a data point, you get a tooltip with the row number of that point. After you learn that the outlier with a weight over 160 kg is in row 12, you can delete this data point in the next step.

I used {ggiraph} for two reasons:

(a) {ggiraph} is easy to use

  • {ggiraph} (see Package Profile A.7) uses the {ggplot2} structure and names of commands. As Listing / Output 4.10 shows, there are only three changes for making a {ggplot2} graph interactive: replace the geom with its ggiraph::geom_*_interactive() counterpart, add the tooltip and data_id aesthetics, and render the plot with ggiraph::girafe(ggobj = ...).

    It therefore wasn’t necessary to invest much time in learning a new package, as would have been the case with {plotly}, another package for creating interactive web graphics (see Package Profile A.16).

(b) At first I didn’t know how to calculate Mahalanobis distances

I knew I could use the {ggrepel} package (see Package Profile A.9) to label the points, but I didn’t know how to find the two points with the largest Mahalanobis distance. Although I knew that base R provides the stats::mahalanobis() function, at first I didn’t succeed in supplying the necessary values for its arguments. Some days later I found the solution shown in Listing / Output 4.11.

2. New detailed specification for the {ggplot2} legend

This is the first time that I have moved the legend inside the graph and framed it with a box. I used the https://tidyverse.org/blog article ggplot2 3.5.0: legends and an article from https://r-graph-gallery.com, Building a nice legend with R and ggplot2. Helpful were also hints from a StackOverflow post.

R Code 4.11 : Simple linear regression of weight on repwt, with the points with the largest Mahalanobis distances labelled using {ggrepel}

Listing / Output 4.11: Replication of the book’s Figure 4.1 with {ggplot2} and {ggrepel}
Code
set.seed(420)
df_temp <- carData::Davis |> 
  dplyr::select(weight, repwt) |> 
  tidyr::drop_na() 

df <- df_temp |>
  dplyr::mutate(mahal =
        {
          stats::mahalanobis(
                df_temp,
                base::colMeans(df_temp),
                stats::cov(df_temp)
                )
        }
  ) |>
  dplyr::mutate(p =
      pchisq(
        q = mahal,
        df = 1,
        lower.tail = FALSE
        )
      ) |>
  tibble::rownames_to_column(var = "ID") |>
  dplyr::mutate(ID =
        dplyr::case_when(p < .001 ~ ID,
                         p >= .001 ~ "")
  )
  
df |>
  ggplot2::ggplot(
        ggplot2::aes(
                x = repwt,
                y = weight
        )
  ) +
  ggplot2::geom_point(
    shape = 1,
    size = 2
  ) +
  ggrepel::geom_text_repel(
    ggplot2::aes(label = ID)
  ) +
  ggplot2::stat_smooth(
    ggplot2::aes(linetype = "solid"),
    formula = y ~ x,
    method = lm,
    se = FALSE,
    color = "black"
  ) +
    ggplot2::geom_smooth(
      ggplot2::aes(
          linetype = "dashed"
        ),
      formula = y ~ x,
      method = loess,
      se = FALSE,
      color = "black"
  ) +
  ggplot2::scale_linetype_discrete(
    name = "",
    label = c(
      "Least Squares",
      "Unbiased Reporting   "
    )
  ) +
  ggplot2::guides(
    linetype = ggplot2::guide_legend(position = "inside")
  ) +
  ggplot2::theme(
    legend.title = ggplot2::element_blank(),
    legend.position.inside = c(.8, .15),
    legend.box.background = ggplot2::element_rect(
                              color = "black",
                              linewidth = 1
                              )
                 )


This is my best attempt to replicate the book’s Figure 4.1. Instead of using functions of the {car} package I used packages from the tidyverse ecosystem: {ggplot2} and {ggrepel} (see Package Profile A.9).

Resource 4.1 : Mahalanobis Distance

I didn’t know how to supply stats::mahalanobis() with the required parameters: the data frame with the regression variables, the vector of their column means, and their covariance matrix. Two almost identical blog articles showed me how to apply the base R function:

  • Finnstats: How to Calculate Mahalanobis Distance in R (I believe that this article was the original of the two.)
  • Statology: How to Calculate Mahalanobis Distance in R

The Mahalanobis distance is a distance measure for multivariate distributions and was developed in 1936 by Prasanta Chandra Mahalanobis. In contrast to the Euclidean distance, which works fine as long as the dimensions are equally weighted and independent of each other, the Mahalanobis distance takes different axis variances and correlations between axes into account.

Very instructive for me was the (German) article Mahalanobis-Distanz. It features an experiment in which a graphic shows that points lying on a circle around the center of the data cloud are not all equally distant in the Mahalanobis sense, because the standard deviations differ along the axes. The Mahalanobis distance takes into account the different standard deviations along the axes of the n-dimensional space as well as the correlations (variances and covariances) between the axes.
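The definition can also be checked by hand: for an observation \(x\), the squared Mahalanobis distance is \((x - \mu)^{T} \Sigma^{-1} (x - \mu)\), with \(\mu\) the vector of column means and \(\Sigma\) the covariance matrix. A minimal sketch (my own) that should agree with stats::mahalanobis():

Code
## Squared Mahalanobis distance for case 12, computed manually and via
## stats::mahalanobis(); both results should coincide
X   <- na.omit(carData::Davis[, c("weight", "repwt")])
mu  <- colMeans(X)
S   <- cov(X)
x12 <- as.numeric(X["12", ])

drop(t(x12 - mu) %*% solve(S) %*% (x12 - mu))
stats::mahalanobis(X, center = mu, cov = S)["12"]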

While working with {ggrepel} I encountered another difficulty: I received several {ggplot2} warnings. I could finally solve this problem after asking for help on StackOverflow.

4.3 Multiple linear regression

Multiple regression extends simple regression to more than one predictor.
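As a brief illustration of the formula syntax (my own example, not necessarily the one used in the book), additional predictors are simply joined with + on the right-hand side:

Code
## Two predictors instead of one: regress measured weight on reported weight
## and measured height (illustrative only)
lm(weight ~ repwt + height, data = carData::Davis)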


  1. The estimated regression coefficient replaces the Greek letter with the corresponding Roman letter, as in \(b_{1}\) for \(\beta_{1}\).↩︎