Introduction to Regression Methods for Public Health Using R
Chapter 2
Overview of Regression Methods
In this chapter, you will learn:
A general definition of regression;
Reasons to use regression; and
How to distinguish between commonly used regression methods.