2.3 Taxonomy of regression methods
The nature of the outcome variable determines which regression method is appropriate. Below is a list of outcome types, along with at least one example of each.
- A continuous variable has a range of possible values and can take on any value in an interval (e.g., waist circumference, age, body mass index). In practice, continuous variables take on a finite set of values due to measurement instrument limitations and/or rounding, but the underlying variable can in theory take on any numeric value.
- A categorical variable has a small number of possible values (e.g., smoking status, race/ethnicity, presence/absence of a condition).
- A binary variable is a categorical variable with exactly two levels (e.g., presence/absence of disease).
- An ordinal variable is a categorical variable in which the levels have a natural ordering (e.g., depression categorized as none, mild, moderate, and severe).
- A nominal variable is a categorical variable in which the levels do not have a natural ordering (e.g., marital status with levels married, living with partner, separated, divorced, never married, and widowed).
- A count variable takes on non-negative integer values (0, 1, 2, …) and describes a number of items or events (e.g., days of hospitalization in a year).
- An event time variable describes the time from a time origin to an event (e.g., time from surgery to cancer remission).
Table 2.1 identifies the appropriate regression method for each outcome type, common statistical methods that are special cases of each regression method, and the predictor type associated with each special case. For example, “correlation” is a special case of simple linear regression with a continuous predictor.
Outcome | Regression Method | Special Case | Type of Predictor Variable(s) for Special Case |
---|---|---|---|
Continuous | Simple linear | Correlation | Continuous |
Continuous | Simple linear | Two-sample t-test | Binary |
Continuous | Simple linear | One-way ANOVA | Categorical with three or more levels |
Continuous | Multiple linear | Two-way ANOVA | Two categorical |
Continuous | Multiple linear | ANCOVA | One categorical, one continuous |
Binary | Binary logistic | Chi-square test | Categorical |
Ordinal | Ordinal logistic | Spearman’s rho | Ordinal |
Ordinal | Ordinal logistic | Kendall’s tau-b | Ordinal |
Nominal | Multinomial logistic | Chi-square test | Categorical |
Count | Poisson | | |
Event time | Cox (survival analysis) | Mantel-Haenszel | Binary |
Event time | Cox (survival analysis) | Log-rank test | Categorical |
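To make the "special case" idea concrete, here is a minimal Python sketch using only the standard library. The data values and the helper names `ols_slope` and `pearson_r` are invented for illustration and do not come from this text. It checks two facts about point estimates: the simple linear regression slope equals the correlation rescaled by the ratio of standard deviations, and with a 0/1 predictor the slope equals the difference in group means that a two-sample t-test compares.

```python
from math import sqrt
from statistics import mean, stdev


def ols_slope(x, y):
    """Least-squares slope for the simple linear regression of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx


def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical values, invented for illustration only.
bmi = [22.0, 25.0, 27.5, 30.0, 33.5, 36.0]
triglycerides = [90.0, 110.0, 105.0, 140.0, 150.0, 170.0]

# Continuous predictor: slope = r * (SD of outcome / SD of predictor).
slope = ols_slope(bmi, triglycerides)
rescaled_r = pearson_r(bmi, triglycerides) * stdev(triglycerides) / stdev(bmi)
print("slope:", slope, "rescaled correlation:", rescaled_r)

# Binary predictor (1 if BMI >= 30, else 0): slope = difference in group means.
obese = [1 if b >= 30 else 0 for b in bmi]
slope_binary = ols_slope(obese, triglycerides)
group1 = [t for t, g in zip(triglycerides, obese) if g == 1]
group0 = [t for t, g in zip(triglycerides, obese) if g == 0]
mean_diff = mean(group1) - mean(group0)
print("slope:", slope_binary, "difference in means:", mean_diff)
```

The sketch compares estimates only; it does not compute the test statistics or p-values.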
The methods above all assume that each observation is independent of the others, as is the case, for example, with cross-sectional data from a simple random sample of a population. If you have repeated measures on the same individuals (e.g., a cohort study with follow-up over time) or clustered observations (e.g., in a multi-hospital study, patients within the same hospital tend to be more similar to each other than to patients from different hospitals), then the independence assumption does not hold and other methods are needed. The following methods extend some of the methods above to handle dependent data.
- Generalized least squares and linear mixed models extend linear regression to handle dependent data.
- Generalized estimating equations (GEE) and generalized linear mixed models extend logistic and Poisson regression to handle dependent data.
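As a rough illustration of this decision, the following Python sketch flags data in which some unit (a person or a hospital, say) contributes more than one row and, if so, returns the extensions named above. The function names, the dictionary, and the example identifiers are hypothetical. Repeated identifiers are only a warning sign: whether observations are independent ultimately depends on the study design.

```python
from collections import Counter

# Extensions for dependent data, keyed by base regression method,
# transcribed from the two statements above.
DEPENDENT_DATA_EXTENSIONS = {
    "linear": ["generalized least squares", "linear mixed model"],
    "logistic": ["generalized estimating equations (GEE)",
                 "generalized linear mixed model"],
    "poisson": ["generalized estimating equations (GEE)",
                "generalized linear mixed model"],
}


def has_repeated_units(unit_ids):
    """Return True if any unit identifier appears on more than one row."""
    counts = Counter(unit_ids)
    return any(n > 1 for n in counts.values())


def candidate_methods(base_method, unit_ids):
    """Suggest methods given a base method and one unit identifier per row."""
    key = base_method.strip().lower()
    if key not in DEPENDENT_DATA_EXTENSIONS:
        raise ValueError(f"no extension listed for: {base_method!r}")
    if has_repeated_units(unit_ids):
        return DEPENDENT_DATA_EXTENSIONS[key]
    # No repeats found, so the base method's assumption is at least plausible.
    return [f"{key} regression"]


# Hypothetical example: six patients drawn from three hospitals.
hospital_ids = ["A", "A", "B", "B", "B", "C"]
print(candidate_methods("logistic", hospital_ids))
```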
A few other terms to be familiar with:
- Simple or univariate regression models have only one predictor. This is also referred to as bivariate regression, because two variables are involved (one outcome and one predictor).
- Multiple or multivariable regression models have more than one predictor. The term multivariate regression is often used for this scenario as well; technically, however, it refers to models with multiple outcomes.
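In symbols, the three cases for a linear model could be written as follows. The notation is assumed here for illustration and is not defined in this section: outcome \(Y\), predictors \(X_1, \ldots, X_K\), coefficients \(\beta\), error term \(\epsilon\), and, for the multivariate case, outcomes indexed by \(j\).

\[
\begin{aligned}
\text{Simple:} \quad & Y = \beta_0 + \beta_1 X + \epsilon \\
\text{Multiple (multivariable):} \quad & Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_K X_K + \epsilon \\
\text{Multivariate:} \quad & Y_j = \beta_{0j} + \beta_{1j} X_1 + \cdots + \beta_{Kj} X_K + \epsilon_j, \quad j = 1, \ldots, J
\end{aligned}
\]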
The following are examples of research questions for which you could use each of the regression methods listed in Table 2.1.
- Simple linear regression: Is there an association between (the outcome) triglycerides and (the predictor) body mass index?
- Multiple linear regression: Is there an association between triglycerides and body mass index, adjusted for age, sex, and race/ethnicity?
- Binary logistic regression: What is the magnitude of association between obesity status (BMI \(<\) 30 kg/m\(^2\) vs. \(\ge\) 30 kg/m\(^2\)) and hours of weekly physical activity?
- Ordinal logistic regression: What is the magnitude of association between overweight/obesity category (1 = “normal”, BMI \(<\) 25 kg/m\(^2\); 2 = “overweight”, 25 \(\le\) BMI \(<\) 30 kg/m\(^2\); 3 = “obese”, BMI \(\ge\) 30 kg/m\(^2\)) and hours of weekly physical activity?
- Multinomial logistic regression: After an educational seminar on cardiovascular risk, is there an association between the seminar content (standard content vs. standard content plus experimental content being evaluated for effectiveness) and the most-consumed non-alcoholic beverage (regular soda, diet soda, tea, coffee, energy drink, or water) in the following month?
- Poisson regression: Is there a difference in mean length of stay (number of days from hospital admission to discharge) between those with different insurance types (private, Medicare, Medicaid, uninsured)? Is the number of cases of COVID-19 per 100,000 associated with the proportion of individuals in a county who report wearing a mask most of the time? Note: Negative binomial regression, another regression method for counts, may be more appropriate when the data are overdispersed, that is, when the variance is larger than the Poisson model implies.
- Cox proportional hazards regression: Is there a difference in time from cancer remission to recurrence between patients given a new therapy and those given the standard of care?
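The pairing of each research question above with a method follows from the outcome type, as laid out in Table 2.1. A minimal Python sketch of that lookup is given below; the dictionary, the function name `choose_method`, and the outcome labels are illustrative choices rather than anything defined in this text.

```python
# Outcome type -> regression method, transcribed from Table 2.1.
REGRESSION_METHOD = {
    "continuous": "linear (simple or multiple)",
    "binary": "binary logistic",
    "ordinal": "ordinal logistic",
    "nominal": "multinomial logistic",
    "count": "Poisson",
    "event time": "Cox proportional hazards",
}


def choose_method(outcome_type):
    """Return the regression method listed for a given outcome type."""
    key = outcome_type.strip().lower()
    if key not in REGRESSION_METHOD:
        raise ValueError(f"unknown outcome type: {outcome_type!r}")
    return REGRESSION_METHOD[key]


# Outcomes from the example research questions, labeled by type.
example_outcomes = {
    "triglycerides": "continuous",
    "obesity status (two levels)": "binary",
    "overweight/obesity category (three ordered levels)": "ordinal",
    "most-consumed non-alcoholic beverage": "nominal",
    "length of stay in days": "count",
    "time from remission to recurrence": "event time",
}

for outcome, outcome_type in example_outcomes.items():
    print(f"{outcome}: {choose_method(outcome_type)} regression")
```

The lookup assumes independent observations; it does not account for the dependent-data extensions or for alternatives such as negative binomial regression.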
Of these methods, this text covers simple (Chapter 4) and multiple (Chapter 5) linear regression, binary and ordinal logistic regression (Chapter 6), and Cox proportional hazards regression (survival analysis) (Chapter 7). Before getting into the details of each regression method, however, Chapter 3 describes the necessary preliminary step of examining and summarizing the analysis variables.