6.1 Introduction

Binary logistic regression is similar in principle to linear regression: it models an outcome as a function of one or more predictors. The main distinction is the form of the outcome. In linear regression, the outcome is a continuous variable, able to take on any value in a range. In binary logistic regression, the outcome is binary, a categorical variable with exactly two possible values. Hereafter, unless a distinction is necessary, this method is referred to simply as “logistic regression.”

Given a binary outcome \(Y\) (e.g., “disease”) that takes on values 0 (“no”) and 1 (“yes”), and \(K\) predictors \(X_1, X_2, \ldots, X_K\), the logistic regression model

\[\begin{equation} \ln{\left( \frac{p}{1-p} \right)} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_K X_K \tag{6.1} \end{equation}\]

models \(p\), the probability that \(Y = 1\) (e.g., “has the disease”), as a function of the predictors. In practice, the coded values of \(Y\) can be any two values, not just 0 and 1. Unlike linear regression, there is no normally distributed error term in Equation (6.1). Instead, the model assumes that, given the predictors, the outcome \(Y\) follows a Bernoulli distribution (a binomial distribution with a single trial) with success probability \(p\).
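
Solving Equation (6.1) for \(p\) gives the inverse (logistic) transformation, which maps the log-odds back to a probability between 0 and 1:

\[p = \frac{e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_K X_K}}{1 + e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_K X_K}}\]

This is the transformation used to convert a raw model prediction on the log-odds scale into a predicted probability.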

Example 6.1: Logistic regression could be used to model, for example, the relationship between the binary outcome coronary heart disease status (yes or no) and age, adjusted for confounding due to other factors. As with linear regression, the model can be used to test the null hypothesis of no association, estimate the magnitude of the association along with its 95% confidence interval, and predict the outcome at a given age. For logistic regression, the raw prediction from the model is on the log-odds scale, but it can be transformed to produce a predicted probability, as in the sketch below. All of these steps are demonstrated in the sections that follow.
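
As a concrete illustration of these steps, here is a minimal sketch in Python using the statsmodels package. The data are simulated, and the variable names (`chd`, `age`) and effect sizes are hypothetical, chosen only to mirror the structure of Example 6.1.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate illustrative data: a binary CHD outcome whose log-odds
# increase with age (coefficients here are hypothetical).
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(30, 70, n)
logit_p = -5.0 + 0.08 * age
p = 1 / (1 + np.exp(-logit_p))
chd = rng.binomial(1, p)
df = pd.DataFrame({"chd": chd, "age": age})

# Fit the logistic regression ln(p / (1 - p)) = b0 + b1 * age
fit = smf.logit("chd ~ age", data=df).fit()

print(fit.summary())                      # includes a test of b1 = 0 (no association)
print(np.exp(fit.params["age"]))          # odds ratio per 1-year increase in age
print(np.exp(fit.conf_int().loc["age"]))  # 95% CI for the odds ratio

# Predicted probability of CHD at age 50; predict() returns probabilities,
# i.e., the inverse-logit of the linear predictor.
print(fit.predict(pd.DataFrame({"age": [50]})))
```

Exponentiating the `age` coefficient and its confidence limits converts them from the log-odds scale to an odds ratio, and `predict()` applies the inverse-logit transformation shown above to return a predicted probability.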