8  Regression model building and variable selection

Learning objectives

By the end of this week you should be able to:

  1. Understand how model diagnostics are used to compare regression models

  2. Build regression models suitable for prediction

  3. Build regression models suitable for isolating the effect of a single predictor

  4. Build regression models suitable for understanding multiple predictors

Learning activities

This week’s learning activities include:

Learning Activity        Learning objectives
Reading                  1
Independent exercise     1
Lecture 1                2
Lecture 2                3
Lecture 3                4

Model building

The previous weeks provide the tools to carry out a multiple linear regression in Stata or R and interpret the results. However, the process of choosing exactly which regression model best answers your research question is not always clear. This process involves choosing which covariates to include, what functional form they should take, and what (if any) transformations may be necessary. Before we illustrate these steps in three different contexts, it is helpful to first introduce several measures of regression model performance to help inform our decisions.

F-tests and Likelihood ratio tests for nested models

The first method of comparing regression models is one we have already been using throughout the course: P-values that test the inclusion or exclusion of a variable, or group of variables (e.g. the group of dummy variables associated with a categorical covariate), from the model. In linear regression this is achieved with an F-test, which for continuous and binary variables produces a P-value equivalent to the t-test P-value shown in standard regression output. For linear regression, this is also equivalent to a likelihood ratio test P-value. Although this method of comparison is the most intuitive, it is limited in that it can only compare nested models - models that differ only by the inclusion of one or more variables. It is therefore not useful for comparing models that differ in other ways, e.g. models with different methods of adjusting for non-linearity (categorisation, cubic splines, or log-transformation of the covariate). For these comparisons, different model comparison measures need to be employed.
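As a sketch of how such nested-model comparisons look in R (using the built-in mtcars data purely for illustration, not course data), anova() gives the F-test for the added terms and lrtest() from the lmtest package gives the corresponding likelihood ratio test.

```r
# Illustrative nested-model comparison in R (mtcars is a built-in dataset).
m_small <- lm(mpg ~ wt, data = mtcars)
m_large <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)

# F-test for the joint inclusion of hp and the cyl dummy variables
anova(m_small, m_large)

# Likelihood ratio test for the same comparison (requires the lmtest package)
# install.packages("lmtest")
lmtest::lrtest(m_small, m_large)
```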

\(R^2\)

We are already familiar with the coefficient of determination \(R^2\) from week 1’s reading. Recall that this is the proportion of the total variability of the outcome that can be explained by the covariates in the model, or alternatively, 1 minus the proportion of variability remaining unexplained.

\[ R^2 = \frac{\text{Model sum of squares}}{\text{Total sum of squares}} = 1 - \frac{\text{Residual sum of squares}}{\text{Total sum of squares}}\]

\(R^2\) therefore provides a natural and intuitive measure of regression performance, with higher values indicating better model performance (as more of the variability of the outcome can be explained). However, issues arise when \(R^2\) is used to compare models, as it will always favour more complex models regardless of whether that increased complexity is justified. This leads to over-parameterised, or overfitted, models. Therefore \(R^2\) is only a useful comparative measure for models of equal complexity, e.g. models with the same number of parameters.
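To see this overfitting behaviour concretely, here is a small illustration in R (again using the built-in mtcars data rather than course data): adding a purely random covariate still increases \(R^2\).

```r
# R^2 never decreases when a covariate is added, even one that is pure noise.
set.seed(42)
mtcars$noise <- rnorm(nrow(mtcars))   # an irrelevant, randomly generated covariate

m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + noise, data = mtcars)

summary(m1)$r.squared
summary(m2)$r.squared   # slightly larger, despite no genuine improvement
```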

Adjusted \(R^2\)

The adjusted \(R^2\) attempts to compensate for the overfitting issues associated with the unadjusted \(R^2\) by penalising the \(R^2\) calculation by the number of parameters of the model. There are several ways of doing this, and the exact method isn’t too important, so the common formula below is shown purely so you can compare the calculation to the regular \(R^2\).

\[ \text{adjusted } R^2 = 1 - \frac{\text{Residual sum of squares}/(n-p)}{\text{Total sum of squares}/(n-1)}\] where \(n\) is the number of observations and \(p\) is the number of parameters of the model (equal to the number of regression coefficients, including the intercept). So the adjusted \(R^2\) will only increase if the residual error in the more complex model reduces enough to compensate for the penalty of the extra parameter (increasing \(p\)).
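As a check on the formula, the short R sketch below reproduces the adjusted \(R^2\) reported by summary() by hand (using an illustrative model fitted to a built-in dataset).

```r
# Adjusted R^2 by hand, compared with R's own value, for an example model.
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nobs(fit)
p   <- length(coef(fit))                       # regression coefficients, including the intercept
rss <- sum(residuals(fit)^2)                   # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares

1 - (rss / (n - p)) / (tss / (n - 1))
summary(fit)$adj.r.squared                     # should agree
```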

AIC

The adjusted \(R^2\) is just one of many ways to penalise unnecessary complexity and avoid overfitting. Another popular method is the Akaike Information Criterion, commonly known as the AIC. Here, instead of quantifying model fit through least squares, the maximised likelihood is used to compare models - again with a penalty proportional to the number of parameters. Using a likelihood-based approach has the advantage that it can be applied to models not fitted by ordinary least squares (such as the logistic regression taught in the second half of this course). The AIC is calculated as

\[ \text{AIC} = 2p - 2\log(\mathcal{L}) \]

where \(p\) is the number of parameters and \(\mathcal{L}\) is the maximised likelihood of the fitted model. As we are subtracting twice the log-likelihood, lower AIC values indicate better models (i.e. higher likelihoods are better). The AIC can be either negative or positive, so it is important to remember that a “lower” AIC could mean either a smaller positive value or a more negative value.
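In R, the built-in AIC() computes this directly; the manual calculation below shows the correspondence with the formula (one caveat: for lm objects R also counts the residual variance as a parameter, so \(p\) is taken from logLik() rather than the number of coefficients). The models are again illustrative only.

```r
# AIC from the formula versus the built-in stats::AIC, for two example models.
m_small <- lm(mpg ~ wt, data = mtcars)
m_large <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)

ll <- logLik(m_large)
p  <- attr(ll, "df")          # parameters as counted by R (includes the residual variance)
2 * p - 2 * as.numeric(ll)    # AIC from the formula
AIC(m_large)                  # should agree

AIC(m_small, m_large)         # compare candidate models; lower is better
```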

BIC

The Bayesian Information Criterion (BIC) is very similar to the AIC; however, instead of penalising by \(2p\), it penalises by \(p\log(n)\), where \(n\) is the sample size.

\[ \text{BIC} = p\log(n) - 2\log(\mathcal{L})\]

This change in penalty between BIC and AIC is important for two reasons. Firstly, the BIC penalty is stricter than the AIC penalty for all but the smallest sample sizes (whenever \(\log(n) > 2\), i.e. \(n > 7\)). Secondly, the BIC penalty becomes progressively stricter as the sample size increases. Both of these generally lead to BIC favouring simpler models than AIC.
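If you want to compute it in R, the built-in BIC() works in the same way as AIC() (again using illustrative models fitted to a built-in dataset).

```r
# BIC for the same style of comparison; the log(n) penalty means BIC tends to
# favour simpler models than AIC, particularly in large samples.
m_small <- lm(mpg ~ wt, data = mtcars)
m_large <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)

BIC(m_small, m_large)         # lower is better

# Manual calculation for one model (using R's parameter count from logLik)
ll <- logLik(m_large)
log(nobs(m_large)) * attr(ll, "df") - 2 * as.numeric(ll)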

Independent exercise

Use the tools above to investigate the ideal number of knots for the week 7 investigation between HDL and BMI.
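One possible way to structure this in R is sketched below. The object and variable names (week7_data, hdl, bmi) are placeholders for the week 7 data, and ns() from the splines package is used as one way of fitting the spline - adapt this to whichever spline specification you used in week 7.

```r
# Sketch: fit natural cubic splines for BMI with increasing flexibility and
# compare the models with AIC and BIC (lower is better).
library(splines)

dfs  <- 2:6   # candidate spline degrees of freedom (larger df means more knots)
fits <- lapply(dfs, function(d) lm(hdl ~ ns(bmi, df = d), data = week7_data))

data.frame(df  = dfs,
           AIC = sapply(fits, AIC),
           BIC = sapply(fits, BIC))
```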

Lecture 1 - Prediction (explored further in week 10)

In this video, we will look at how the tools above can be used to help build regression models suitable for prediction, where the goal is to minimise the predictive error.

Download video here

Lecture 2 - Isolating the effect of a single predictor

In this video, we will look at how the tools above can be used to help build regression models suitable for measuring the effect of an exposure on an outcome, where the goal is to measure this effect without bias due to confounding.

Download video here

Lecture 3 - Understanding multiple predictors

In this video, we will look at how the tools above can be used in exploratory research, where the goal is to identify which covariates are associated with an outcome. It is common in this type of research for potential confounders or predictors of interest to be less well established.

Download video here

Summary

This week’s key concepts are:

  1. There are several measures available to help statistically compare regression models - including P-values from t-tests and F-tests, \(R^2\), adjusted \(R^2\), AIC and BIC.

  2. How these tools are applied will differ depending on the context of your research question.

  3. This course focuses on three contexts: prediction models, models for isolating the effect of a single exposure, and models for understanding multiple predictors.

  4. All types of model building should also take into account field-specific issues and norms relevant to your research context.

  5. You should not use automatic covariate selection algorithms in this course. Rather, build models where you are comfortable justifying the inclusion or exclusion of each covariate.