Chapter 3 Overview of Regression Methods

An introductory statistics class covers many methods for testing and assessing the magnitude of an association between two variables. “Regression” is the general term used in statistics for methods that assess the association between two variables, optionally adjusted for other variables. Adjustment is not required, however: a regression can involve just two variables, and many basic statistical methods are special cases of regression.

Unlike correlation, for example, regression requires you to designate one variable as the outcome, also known as the dependent variable. The other variables are predictors, or independent variables (in some contexts, predictors might be called covariates or confounders).

A simple linear regression equation takes the form:

\[Y = \beta_0 + \beta_1 X + \epsilon\]

This equation says that pairs of points \((X,Y)\) are scattered randomly around a line with intercept \(\beta_0\) and slope \(\beta_1\), where \(\epsilon\) is the random error.

3.1 Why use regression?

More complicated regression settings have different equations, but this simple equation illustrates the many purposes for which a regression can be used:

  • Testing a theory: A theory that implies a certain functional relationship between Y and X can be tested by comparing the hypothesized model to simpler or more complex models and seeing which model fits best.

  • Prediction: After fitting the model to observed data, specified values of \(X\) can be used to predict yet-to-be-observed values of \(Y\).

  • Testing for association: Is there a significant association between \(Y\) and \(X\)? In this case, the question is answered by testing the null hypothesis \(H_0:\beta_1=0\). Under the null hypothesis, the outcome does not depend on the predictor.

  • Estimating a rate of change: How does \(Y\) change as \(X\) changes? In this case, that is answered by estimating the magnitude of \(\beta_1\), the regression slope.

A multiple linear regression equation takes the form:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2+ \ldots + \beta_p X_p + \epsilon\]

Now there are \(p\) predictors rather than one. The above purposes can still be accomplished. In addition, the following purpose is often of interest:

  • Controlling for confounding: Is there an association between \(Y\) and \(X_1\) after adjusting for \(X_2, \ldots, X_p\)?

The nature of the outcome variable (for example, whether it is categorical or continuous) determines which regression method is appropriate. As mentioned previously, regression methods are extensions of many familiar basic statistical methods, as shown in the following table.

| \(Y\) (outcome) | Regression method | Special case | \(X\) (predictor) variable(s) for the special case |
|---|---|---|---|
| Continuous | Simple linear regression | Correlation | One continuous variable |
| Continuous | Simple linear regression | Two-sample t-test | One binary variable (categorical with 2 levels) |
| Continuous | Simple linear regression | One-Way ANOVA | One categorical variable with 3 or more levels |
| Continuous | Multiple linear regression | Two-Way ANOVA | Two categorical variables |
| Continuous | Multiple linear regression | ANCOVA | One categorical, one continuous |
| Binary | Binary logistic regression | Chi-square test | One categorical variable |
| Ordinal | Ordinal logistic regression | Spearman’s rho | One ordinal variable |
| Ordinal | Ordinal logistic regression | Kendall’s tau-b | One ordinal variable |
| Nominal | Multinomial logistic regression | Chi-square test | One categorical variable |
| Count | Poisson regression | | |
| Event time | Cox proportional hazards regression (survival analysis) | Mantel-Haenszel test | One binary variable |
| Event time | Cox proportional hazards regression (survival analysis) | Log-rank test | One categorical variable |

The above methods all assume that each observation is independent of the others, for example, if you have cross-sectional data. If you have repeated measures on the same individuals, or observations that are clustered in some way (e.g., you sample households and then interview everyone in each household), then the independence assumption does not hold and other methods are needed. The following methods extend those mentioned above to the case of dependent data.

  • Linear regression \(\rightarrow\) Generalized least squares, Linear mixed models

  • Logistic regression \(\rightarrow\) Generalized linear mixed models, Generalized estimating equations (GEE)

  • Poisson regression \(\rightarrow\) Generalized linear mixed models

A few other terms to be familiar with:

  • Simple or univariate regression refers to the case where there is just one predictor.

  • Multiple or multivariable regression refers to the case where there is more than one predictor. Multivariate regression technically refers to the situation where you have multiple outcome variables, although the term is often used to refer to the case of multiple predictors.

Here are a few examples of research questions for which you would use each of these methods:

  • Simple linear regression: Is there an association between triglycerides and body mass index?

  • Multiple linear regression: Is there an association between triglycerides and body mass index, adjusted for age, sex, and race/ethnicity?

  • Binary logistic regression: What is the magnitude of association between obesity status (BMI \(\ge\) 30 kg/m\(^2\)) and hours of weekly physical activity?

  • Ordinal logistic regression: What is the magnitude of association between overweight/obesity status (1 = “normal,” BMI \(<\) 25; 2 = “overweight,” 25 \(\le\) BMI \(<\) 30; 3 = “obese,” BMI \(\ge\) 30 kg/m\(^2\)) and hours of weekly physical activity?

  • Multinomial logistic regression: Is there an association between attending an educational seminar on the topic of cardiovascular risk and subsequent beverage choice (soda, tea, coffee, water)?

  • Poisson regression: Is there a difference in mean length of stay (number of days from hospital admission to discharge) between those with different insurance types (private, Medicare, Medicaid, uninsured)?

  • Cox proportional hazards regression: Is there a difference in time to recurrence between cancer patients given a new therapy and those given the standard of care?
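
The BMI examples above show that the same underlying measurement can be analyzed as a continuous, binary, or ordinal outcome, and that the coding chosen determines the regression method. A minimal sketch of the two categorical codings, using the cutoffs stated in the examples, follows. The function names and the BMI values are hypothetical.

```python
def obesity_binary(bmi):
    """Binary outcome: 1 if BMI >= 30 kg/m^2, otherwise 0."""
    return int(bmi >= 30)


def weight_status_ordinal(bmi):
    """Ordinal outcome: 1 = normal (BMI < 25),
    2 = overweight (25 <= BMI < 30), 3 = obese (BMI >= 30)."""
    if bmi < 25:
        return 1
    if bmi < 30:
        return 2
    return 3


# Hypothetical BMI values (kg/m^2), including values at the cutoffs
bmis = [22.4, 25.0, 29.9, 30.0, 35.1]

binary = [obesity_binary(b) for b in bmis]          # for binary logistic regression
ordinal = [weight_status_ordinal(b) for b in bmis]  # for ordinal logistic regression

print(binary)
print(ordinal)
```

Leaving BMI uncoded keeps it continuous, in which case linear regression would be the natural starting point.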

In subsequent chapters, you will learn about simple linear regression, multiple linear regression, binary logistic regression, Cox proportional hazards regression (survival analysis), and linear mixed models (longitudinal data analysis).

3.2 Basic statistical concepts

(Not sure if this fits here or elsewhere)

Put the 2x2 table explaining the relationship between N and p for GOF tests. Create a 2x2 table for Effect size and p for statistical vs. practical significance. Combine them in some way?

3.3 To Do

  • Components of a regression model (outcome, predictors, error or distribution)
  • Add some basic information?
    • Cross-sectional models and causality
    • Statistical vs. practical significance
    • Other things?