4.2 Notation and interpretation

While this text tries to avoid getting into too much mathematical detail, we do need some notation to keep everything straight. If you have never seen this sort of notation before, do not worry if it seems like a lot to digest at first. Over time, the notation will become more familiar. The initial descriptions that follow are dense, but we will unpack the details as we go.

The data are \(n\) independent pairs of observed values of the outcome \(Y\) and predictor \(X\). The data for the \(i^{th}\) case (or individual, or observation) are denoted \((y_i,x_i)\) – each case has an associated outcome value and predictor value. Equation (4.1) describes the simple linear regression model for \(Y\) with a continuous predictor \(X\).

\[\begin{equation} Y = \beta_0 + \beta_1 X + \epsilon \tag{4.1} \end{equation}\]

where \(\beta_0\) and \(\beta_1\) are the intercept and slope, respectively, of the best fit line, and \(\epsilon\) is the error term (estimated after fitting by the residuals), which is assumed to have a normal distribution with mean zero and the same variance at all values of \(X\). We denote this assumption about the error term using the notation \(\epsilon \sim N(0, \sigma^2)\). The model described by Equation (4.1) assumes that the relationship between \(Y\) and \(X\) is linear; the error term captures random variation between individuals and random measurement error.
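To make the model concrete, the following R sketch simulates data from Equation (4.1) and fits the model with `lm()`. The parameter values and variable names here are made up purely for illustration.

```r
# Simulate n = 100 pairs (y, x) from Y = beta0 + beta1*X + epsilon,
# with epsilon ~ N(0, sigma^2) (all values below are hypothetical)
set.seed(42)
n     <- 100
beta0 <- 10   # intercept
beta1 <- 2    # slope
sigma <- 3    # error standard deviation

x <- runif(n, 0, 10)
y <- beta0 + beta1 * x + rnorm(n, mean = 0, sd = sigma)

# Fit the simple linear regression and view the estimated coefficients
fit <- lm(y ~ x)
summary(fit)
```

The estimated intercept and slope should be close to (but not exactly) the true values of 10 and 2, with the discrepancy due to the random error.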

For a continuous predictor, the intercept (\(\beta_0\)) is \(E(Y|X=0)\), the mean (average) outcome when the predictor is zero; and the slope \((\beta_1)\) is the difference in mean outcome associated with a one-unit difference in \(X\).
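This interpretation of the slope follows directly from the model. Comparing the mean outcome at \(X = x + 1\) to the mean outcome at \(X = x\):

\[E(Y|X=x+1) - E(Y|X=x) = \left[ \beta_0 + \beta_1 (x+1) \right] - \left[ \beta_0 + \beta_1 x \right] = \beta_1\]

The difference does not depend on \(x\), which is why the one-unit interpretation holds at every value of the predictor.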

For a categorical predictor, however, the model takes a different form. Equation (4.2) describes the simple linear regression model for \(Y\) with a categorical predictor \(X\) with \(L\) levels.

\[\begin{equation} Y = \beta_0 + \beta_1 I(X=2) + \beta_2 I(X=3) + \ldots + \beta_{L-1} I(X=L) + \epsilon \tag{4.2} \end{equation}\]

where the indicator function \(I(\textrm{logical condition})\) equals 1 when the logical condition is true and 0 when it is false. For example, \(I(X=2) = 1\) when \(X = 2\) and \(I(X=2) = 0\) when \(X \ne 2\). One of the levels \((X=1)\) is left out of Equation (4.2) and is referred to as the reference level.
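To see how indicator variables work in practice, the following short R sketch computes them by hand for a hypothetical predictor coded 1, 2, 3 (in practice, R constructs these for you, as discussed at the end of this section).

```r
# A hypothetical categorical predictor coded 1, 2, 3 (level 1 = reference)
x <- c(1, 2, 2, 3, 1, 3)

# Indicator (dummy) variables for the non-reference levels
I_x2 <- as.numeric(x == 2)   # 1 when X = 2, 0 otherwise
I_x3 <- as.numeric(x == 3)   # 1 when X = 3, 0 otherwise

cbind(x, I_x2, I_x3)
```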

Suppose \(X\) has \(L = 3\) levels. If we plug each of 1, 2, and 3 in for \(X\) in Equation (4.2) and take the expectation of each side (recall that \(E(\epsilon) = 0\), so the error term drops out), we get the following three equations:

\[\begin{array}{rcl} E(Y|X=1) & = & \beta_0 \\ E(Y|X=2) & = & \beta_0 + \beta_1 \\ E(Y|X=3) & = & \beta_0 + \beta_2 \end{array}\]

Solving these equations for the \(\beta\)s results in:

\[\begin{array}{rcl} \beta_0 & = & E(Y|X=1) \\ \beta_1 & = & E(Y|X=2) - E(Y|X=1) \\ \beta_2 & = & E(Y|X=3) - E(Y|X=1) \end{array}\]
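For example, if the mean outcomes at levels 1, 2, and 3 were 90, 95, and 100 (hypothetical values, chosen only for illustration), then

\[\beta_0 = 90, \qquad \beta_1 = 95 - 90 = 5, \qquad \beta_2 = 100 - 90 = 10.\]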

So, for a categorical predictor, the individual \(\beta\) terms do not correspond to the intercept and slope of a line. Instead, \(\beta_0\) corresponds to the mean outcome at the reference level and the other \(\beta\)s correspond to differences in the mean outcome between other levels and the reference level.

The interpretations of regression coefficients, and the distinctions in their interpretations between continuous and categorical predictors, are very important to keep in mind when fitting and interpreting regression models. In summary:

- \(\beta_0\) is the mean outcome when the predictor is 0 (if the predictor is continuous) or at its reference level (if the predictor is categorical).
- Each other \(\beta\) is the difference in mean outcome between two groups of individuals. If the predictor \(X\) is continuous, \(\beta_1\) is the difference in mean outcome between groups that differ by one unit in \(X\). If the predictor \(X\) is categorical, each \(\beta\) other than the intercept is the difference in mean outcome between those at one of the non-reference levels of \(X\) and those at the reference level.

The following examples help make these concepts clearer.

Example 4.1 (continued): For the regression of fasting glucose on waist circumference (a continuous predictor), the regression equation is written as follows.

\[\textrm{FG} = \beta_0 + \beta_1 \textrm{WC} + \epsilon\]

\(\beta_0\) is the intercept of the best fit line and corresponds to the mean fasting glucose among those with a waist circumference of 0 cm. Clearly, this is not a meaningful quantity! However, that does not mean there is anything wrong with the regression fit. Later, we will discuss centering a predictor so the value of the intercept is more meaningful (see Section 4.3.2).

\(\beta_1\) is the slope of the best fit line and corresponds to the difference in mean fasting glucose between those who differ by 1 cm in waist circumference. In Figure 4.2, for every 1 cm you move to the right on the horizontal axis, the line “slopes” upwards by an amount equal to our best estimate of \(\beta_1\). Because the model assumes fasting glucose is linearly related to waist circumference, this difference between groups is the same regardless of the level of waist circumference (the slope is the same at all points on the line). The two groups could have waist circumferences of 90 and 91 cm or 120 and 121 cm; in either case, the two groups are assumed to differ in mean fasting glucose by the same amount (the slope). Relaxing this assumption can be accomplished by fitting a curve instead of a straight line (see Section 4.7).
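A minimal sketch of this fit in R, using simulated data since the values and coefficients below are made up for illustration rather than taken from the example dataset:

```r
# Simulated stand-in for the example data (values are hypothetical)
set.seed(1)
dat <- data.frame(wc = rnorm(500, mean = 100, sd = 12))  # waist circumference (cm)
dat$fg <- 40 + 0.6 * dat$wc + rnorm(500, sd = 15)        # fasting glucose

# Regression of fasting glucose on waist circumference
fit_wc <- lm(fg ~ wc, data = dat)
coef(fit_wc)   # (Intercept) = estimated beta0; wc = estimated beta1 (slope)
```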

Example 4.2 (continued): Using “Never” as the reference level, the regression of fasting glucose on smoking status (a categorical predictor) is written:

\[\textrm{FG} = \beta_0 + \beta_1 I(X=\textrm{Past}) + \beta_2 I(X=\textrm{Current}) + \epsilon\]

\(\beta_0\) corresponds to the mean fasting glucose among never smokers (the reference level) (see Figure 4.3), \(\beta_1\) corresponds to the difference in mean fasting glucose between past and never smokers, and \(\beta_2\) corresponds to the difference in mean fasting glucose between current and never smokers. Once again, for emphasis, the non-intercept \(\beta\)s are not mean outcomes at non-reference levels; they are differences in the mean outcome between non-reference levels and the reference level.
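Again as a sketch with simulated data (the group means and coefficient values below are made up, not estimates from the example dataset):

```r
# Simulated stand-in for the example data
set.seed(2)
smoker <- factor(sample(c("Never", "Past", "Current"), 300, replace = TRUE),
                 levels = c("Never", "Past", "Current"))  # "Never" = reference
fg <- 95 + 3 * (smoker == "Past") + 6 * (smoker == "Current") + rnorm(300, sd = 12)

fit_smoke <- lm(fg ~ smoker)
coef(fit_smoke)
# (Intercept)   = estimated mean FG among never smokers
# smokerPast    = estimated difference in mean FG, past vs. never
# smokerCurrent = estimated difference in mean FG, current vs. never
```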

The indicator functions \(I()\) are also referred to as dummy variables. If a variable is coded as a factor in R, the dummy variables are created for you automatically when you run a regression. For a categorical variable with \(L\) levels, \(L-1\) dummy variables (each a 0/1 variable) are created, one for each level other than the reference level, and automatically included in the regression model. For example, for smoking status, R creates two dummy variables: one that is 1 when smoking status is Past and 0 otherwise, and another that is 1 when smoking status is Current and 0 otherwise.
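The following sketch demonstrates this, using `relevel()` to set “Never” as the reference level and `model.matrix()` to display the dummy variables R constructs behind the scenes; the variable name is hypothetical.

```r
# Hypothetical smoking status variable
smoker <- factor(c("Never", "Past", "Current", "Never", "Current"))

# Set "Never" as the reference level (factor levels are ordered
# alphabetically by default, which would make "Current" the reference)
smoker <- relevel(smoker, ref = "Never")

# Display the intercept column and the two dummy variables
model.matrix(~ smoker)
```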