5.29 Summary of multiple linear regression

This chapter covered a lot of information. This section arranges the steps in the order in which you would carry them out in a real-life analysis. In practice, you may have to jump back and forth between steps because, invariably, things will not go as planned.

Think about your research question

What is your research question? Based on that question, what is your outcome and what is your predictor or predictors of interest?
For a given predictor of interest (see Section 5.9):
- Are there possible confounders or mediators? Adjust for confounders, but not mediators.
- Are there any moderators? If so, include corresponding interaction terms.
For confirmatory analyses (see Section 5.24):
- Do not base decisions about which predictors or interactions to include, or in what form to include them, on statistical significance or effect size. Base decisions on subject-matter knowledge, interpretability, meeting the regression assumptions, and avoiding overfitting (see Section 5.27).
- Determine if you need to adjust for multiple tests of outcomes or multiple comparisons between levels of a categorical predictor. To maximize power for the tests or comparisons of most interest, designate some as primary and others as secondary. Adjust for multiple testing over the primary (confirmatory) tests (see Section 5.25) and consider the secondary tests to be exploratory.

Assess and handle missing data

Summarize the data using summary() to assess the extent of missing data. Some variables may be unusable due to having a large number of cases with missing values. Some variables may need to be modified as, in many studies, missing values actually have meaning (e.g., a survey question was automatically skipped based on the answer to a previous question, in which case you may be able to logically infer the value).
Multiple imputation of missing data is covered in Chapter 9.
If not using multiple imputation, at this stage it is best to create a dataset with no missing values (complete-case analysis, see Section 5.3). This will ensure your “Table 1” of descriptive statistics will be based on the same sample size as your regression analysis. Also, if you decide to compare models with different sets of predictors, it is important that each model be fit to the same set of observations. If you add or remove variables in any of the following steps, you may need to return and repeat this step, as the number of observations with missing data may have changed. If using multiple imputation, the imputation model will need to be adjusted and re-fit.
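As a minimal sketch of this step, the following base R code counts missing values and builds a complete-case dataset. The data frame `dat` and its variables (`sbp`, `age`, `smoker`) are hypothetical and made up purely for illustration.

```r
# Hypothetical toy data; the variable names are illustrative only
dat <- data.frame(
  sbp    = c(120, 135, NA, 142, 118, 150, 127, NA),
  age    = c(34, 51, 47, NA, 29, 63, 45, 58),
  smoker = factor(c("No", "Yes", "No", "No", NA, "Yes", "No", "Yes"))
)

# summary() reports the number of NA values for each variable
summary(dat)

# Number of missing values per variable
n_missing <- colSums(is.na(dat))
n_missing

# Keep only the analysis variables, then drop rows with any missing value
analysis_vars <- c("sbp", "age", "smoker")
dat_cc <- dat[complete.cases(dat[, analysis_vars]), analysis_vars]

# Compare the original and complete-case sample sizes
nrow(dat)
nrow(dat_cc)
```

Restricting to the analysis variables before calling `complete.cases()` matters: otherwise, rows would be dropped for missing values in variables that are not in the model.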

Examine the analysis variables and adjust as needed

Create a histogram and numerical summary of each continuous variable (see Section 5.4).
- Any anomalous values? Check to see if they are data entry errors or should be set to missing based on being impossible values.
- If the outcome is highly skewed, consider a transformation.
- If a continuous predictor is highly skewed, consider a transformation to reduce the leverage of extreme values.
- If in doubt, wait until you have looked at regression diagnostics before deciding on the need for a transformation of either the outcome or any continuous predictors.
Create a frequency table for each categorical predictor (see Section 5.4).
- Any anomalous values? Check to see if they are data entry errors or should be set to missing based on being impossible values.
- Decide if you need to collapse any levels so there is sufficient sample size in each.
- Make sure each categorical predictor is coded as a factor (see Section 4.4.1).
Check for collinearity (see Section 5.21).
- Compute VIFs or aGSIFs.
- Remove and/or combine predictors to reduce redundancy.
Visualize the unadjusted relationships (see Section 5.5).
Create a “Table 1” of descriptive statistics (see Section 3.3).
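The checks in this step might look something like the following sketch, which uses simulated data and base R only. All variable names are hypothetical. The VIFs here are computed as the diagonal of the inverse correlation matrix, which applies only to a set of continuous predictors; generalized VIFs for categorical predictors require additional tools (for example, `vif()` in the car package) and are not computed here.

```r
# Simulated, hypothetical data; all variable names are illustrative
set.seed(1)
n <- 200
dat <- data.frame(
  age    = rnorm(n, mean = 50, sd = 10),
  income = rlnorm(n, meanlog = 10, sdlog = 1),  # right-skewed predictor
  group  = sample(c("A", "B", "C", "D"), n, replace = TRUE,
                  prob = c(0.45, 0.40, 0.12, 0.03))
)
dat$bmi <- 27 + 0.05 * (dat$age - 50) + rnorm(n, sd = 4)
dat$sbp <- 90 + 0.5 * dat$age + 0.8 * dat$bmi + rnorm(n, sd = 10)  # outcome

# Continuous variables: histogram and numerical summary
hist(dat$income, main = "Income", xlab = "Income")
summary(dat$income)

# A log transformation reduces the leverage of extreme values
dat$log_income <- log(dat$income)
hist(dat$log_income, main = "log(Income)", xlab = "log(Income)")

# Categorical predictor: frequency table, including any missing values
table(dat$group, useNA = "ifany")

# Code as a factor, then collapse the sparse levels C and D into one
dat$group <- factor(dat$group)
levels(dat$group)[levels(dat$group) %in% c("C", "D")] <- "C/D"
table(dat$group)

# VIFs for the continuous predictors: diagonal of the inverse
# correlation matrix, equal to 1 / (1 - R^2) for each predictor
X <- dat[, c("age", "bmi", "log_income")]
vifs <- diag(solve(cor(X)))
vifs

# Unadjusted relationships between the outcome and each predictor
plot(sbp ~ age, data = dat)
boxplot(sbp ~ group, data = dat)
```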

Fit the model

Fit the model, carry out Type III tests, and compute CIs for the regression coefficients (see Section 5.6).
If the model includes an interaction (see Section 5.10):
- Visualize the interaction (see Section 5.10.5).
- Test the significance of the interaction term (see Section 5.10.10).
- Estimate and test the significance of each predictor in the interaction at levels of the other (see Sections 5.10.7, 5.10.9.2, and 5.10.9.3).
- Test the overall significance of each predictor in the interaction (see Section 5.10.11).
Adjust for multiple testing, if necessary (for multiple outcomes and/or post-hoc comparisons between levels of a categorical predictor) (see Section 5.25).
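The following sketch illustrates these steps with simulated data and base R only. The variable names and the p-values passed to `p.adjust()` are hypothetical. Base R has no built-in Type III test function, so this sketch uses nested-model F tests and `drop1()` instead; `Anova(fit, type = 3)` from the car package is one common alternative and is assumed, not shown.

```r
# Simulated, hypothetical data with an age-by-smoker interaction
set.seed(2)
n <- 300
dat <- data.frame(
  age    = rnorm(n, mean = 50, sd = 10),
  smoker = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  group  = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
dat$sbp <- 100 + 0.5 * dat$age + 5 * (dat$smoker == "Yes") +
  0.3 * dat$age * (dat$smoker == "Yes") + rnorm(n, sd = 10)

# Fit the model and compute CIs for the regression coefficients
fit <- lm(sbp ~ age * smoker + group, data = dat)
summary(fit)$coefficients
confint(fit)

# F tests for the terms that can be dropped without violating
# marginality (here, group and the age:smoker interaction)
drop1(fit, test = "F")

# Test the interaction by comparing to the model without it
fit_main <- lm(sbp ~ age + smoker + group, data = dat)
p_int <- anova(fit_main, fit)[2, "Pr(>F)"]
p_int

# Overall test of smoker: drop its main effect and its interaction
fit_no_smoker <- lm(sbp ~ age + group, data = dat)
anova(fit_no_smoker, fit)

# Age slope at each level of smoker, with a 95% CI for smokers
b <- coef(fit)
V <- vcov(fit)
slope_no  <- b[["age"]]
slope_yes <- b[["age"]] + b[["age:smokerYes"]]
se_yes <- sqrt(V["age", "age"] + V["age:smokerYes", "age:smokerYes"] +
                 2 * V["age", "age:smokerYes"])
ci_yes <- slope_yes + c(-1, 1) * qt(0.975, df.residual(fit)) * se_yes

# Holm adjustment over a set of primary tests (made-up p-values)
p_primary <- c(0.010, 0.020, 0.040)
p_holm <- p.adjust(p_primary, method = "holm")
p_holm
```

The slope for smokers is a sum of two coefficients, so its standard error must include their covariance; this is why the sketch pulls the relevant entries from `vcov(fit)` rather than using the standard errors printed by `summary()`.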

Diagnostics

Check assumptions (independence, linearity, normality, constant variance) (see Sections 5.15 to 5.18).
Check for outliers and influential observations (see Sections 5.22 and 5.23).
Carry out a sensitivity analysis, if necessary (see Section 5.26).
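A rough sketch of these checks in base R follows, using simulated data with hypothetical variable names. The cutoffs used here (an absolute Studentized residual above 3, a Cook's distance above 4/n) are common rules of thumb adopted as assumptions for this example, not thresholds taken from this chapter. Independence is usually judged from the study design rather than from a plot, so it is not checked in the code.

```r
# Simulated, hypothetical data
set.seed(3)
n <- 200
dat <- data.frame(
  age = rnorm(n, mean = 50, sd = 10),
  bmi = rnorm(n, mean = 27, sd = 4)
)
dat$sbp <- 90 + 0.5 * dat$age + 0.8 * dat$bmi + rnorm(n, sd = 10)

fit <- lm(sbp ~ age + bmi, data = dat)

# Linearity and constant variance: residuals vs. fitted values
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality: QQ plot of standardized residuals
qqnorm(rstandard(fit))
qqline(rstandard(fit))

# Outliers: Studentized residuals (cutoff of 3 is an assumed rule of thumb)
rstud <- rstudent(fit)
which(abs(rstud) > 3)

# Influential observations: Cook's distance (4/n is an assumed rule of thumb)
cd <- cooks.distance(fit)
keep <- cd <= 4 / n
sum(!keep)  # number of observations flagged

# Sensitivity analysis: re-fit without the flagged observations
# and compare the coefficients side by side
fit_sens <- lm(formula(fit), data = dat[keep, ])
cbind(all = coef(fit), sensitivity = coef(fit_sens))
```

A logical index (`keep`) is used to subset rather than negative row numbers, because `dat[-which(...), ]` returns zero rows when no observations are flagged.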

Adjustments

Adjust the model, if necessary, to resolve assumption violations or other issues.
After re-fitting a model, re-do all diagnostic checks.
If any adjustment results in a predictor being added or removed, go back to the missing data step. The new complete-case dataset might have a different sample size or, if using multiple imputation, the imputation model will be different.
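One way to make it easy to re-do the diagnostic checks after each adjustment is to wrap them in a small helper function. The sketch below simulates a right-skewed outcome, fits the model before and after a log transformation, and runs the same checks on both. The helper `check_model()` and the variables `cost` and `age` are hypothetical names introduced only for this example.

```r
# Simulated, hypothetical data with a right-skewed outcome
set.seed(4)
n <- 200
dat <- data.frame(age = rnorm(n, mean = 50, sd = 10))
dat$cost <- exp(5 + 0.03 * dat$age + rnorm(n, sd = 0.5))

# Hypothetical helper: the same two diagnostic plots for any lm fit
check_model <- function(fit) {
  op <- par(mfrow = c(1, 2))
  on.exit(par(op))  # restore the plotting layout on exit
  plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)
  qqnorm(rstandard(fit))
  qqline(rstandard(fit))
  invisible(fit)
}

# Original model: expect non-constant variance and non-normal residuals
fit_raw <- lm(cost ~ age, data = dat)
check_model(fit_raw)

# Adjusted model: log-transform the outcome, then re-do the checks
fit_log <- lm(log(cost) ~ age, data = dat)
check_model(fit_log)
```

After a log transformation of the outcome, the coefficients are on the log scale, so their interpretation changes along with the diagnostics.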

Write up the methods and results (see Section 5.28)