Chapter 9 Simple regression
9.1 Standard deviation line on a scatter plot
Before introducing linear regression, it is helpful to draw the standard deviation (SD) line on a scatter plot (keep in mind that the SD line is not the regression line, which will be introduced later).
The SD line passes through the point of means (\(\bar{x}, \bar{y}\)) and its slope is:
\[\text{slope of SD line} = \pm \frac{s_y}{s_x} \tag{9.1} \]
The sign depends on the direction of association between the variables:
- positive slope – if the Pearson correlation coefficient is positive (higher values of \(x\) tend to correspond to higher values of \(y\)),
- negative slope – if the Pearson correlation coefficient is negative.
Note that the SD line represents the direction along which linearly correlated points on the scatter plot tend to cluster.
Figure 9.1: Scatter plot showing the heights of 1,078 fathers and their adult sons in England, circa 1900, with the SD line (in orange).
Figure 9.2: Scatter plot of Z-standardized heights of fathers and sons. Note that the SD line becomes the line \(y = x\) when \(x\) and \(y\) are Z-scores.
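A minimal R sketch of how an SD line can be added to a scatter plot, following equation (9.1); the father–son heights below are simulated purely for illustration (they are not Pearson's data):

```r
# Sketch: scatter plot with the SD line (slope +- s_y / s_x through the
# point of means); father-son heights simulated purely for illustration.
set.seed(1)
father <- rnorm(1078, mean = 171, sd = 6.9)
son    <- 0.5 * father + 86 + rnorm(1078, sd = 6)

plot(father, son, pch = 16, cex = 0.5,
     xlab = "Father's height (cm)", ylab = "Son's height (cm)")

slope_sd     <- sign(cor(father, son)) * sd(son) / sd(father)  # equation (9.1)
intercept_sd <- mean(son) - slope_sd * mean(father)            # line through the point of means
abline(a = intercept_sd, b = slope_sd, col = "orange", lwd = 2)
```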
9.2 Conditional averages on a scatter plot
Linear regression can be viewed as a simplified way to describe how the conditional averages of one variable (\(Y\)) change as another variable (or set of variables) (\(X\)) varies. In Figure 9.3, the conditional averages of sons’ heights across intervals of fathers’ heights are illustrated. These averages provide an intuitive starting point for understanding the concept of regression, as they show the expected value of \(Y\) for different ranges of \(X\). If the conditional averages form an approximately straight line, then a linear regression model might provide a reasonable approximation of the relationship between the variables.
Figure 9.3: Heights of fathers and sons plotted as individual points, with group averages for sons based on fathers’ height categories highlighted.
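A minimal R sketch of how such conditional averages can be computed by binning the \(X\) variable; as above, the father–son heights are simulated for illustration only:

```r
# Sketch: conditional averages of sons' heights within bins of fathers' heights.
set.seed(1)
father <- rnorm(1078, mean = 171, sd = 6.9)
son    <- 0.5 * father + 86 + rnorm(1078, sd = 6)

brks      <- seq(floor(min(father)), ceiling(max(father)) + 5, by = 5)
bins      <- cut(father, breaks = brks, right = FALSE)   # 5-cm intervals of X
cond_mean <- tapply(son, bins, mean)                     # average son's height per interval

plot(father, son, pch = 16, cex = 0.5, col = "grey70",
     xlab = "Father's height (cm)", ylab = "Son's height (cm)")
points(head(brks, -1) + 2.5, cond_mean, pch = 19, cex = 1.2)  # group averages at bin midpoints
```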
9.3 Simple linear regression
9.3.1 Variable names
A simple linear regression is a statistical method used to describe the relationship between two quantitative variables:
- one dependent variable (\(Y\)) — the outcome we want to predict or explain. Other names include “explained variable”, “response variable”, “predicted variable”, “outcome variable”, “output variable”, or “target”.
- one independent variable (\(X\)) — the “predictor”, “regressor”, “explanatory”, or “input” variable.
9.3.2 Regression line on a scatter plot
A regression line represents the predicted values of \(Y\) for given values of \(X\) under the assumption of a linear relationship. It is drawn through the data points in such a way that it best fits the overall trend.
Figure 9.4: Heights of fathers and sons, group averages and the regression line.
9.3.3 Formula
The fitted regression line is expressed as:
\[\widehat{y_i} = \widehat{\beta}_0 + \widehat{\beta}_1 x_i, \tag{9.2}\]
where:
- \(x_i\) is the value of the independent variable \(X\),
- \(\widehat{y_i}\) is the predicted average of \(Y\) given \(X = x_i\),
- \(\widehat{\beta}_0\) is the intercept of the regression line,
- \(\widehat{\beta}_1\) is the slope of the regression line.
The intercept is the predicted average value of \(y\) when \(x_i=0\). In other words, it is the point where the fitted regression line crosses the \(Y\)-axis.
The slope indicates how the predicted average of \(Y\) changes for each one-unit difference in \(X\). A positive slope means \(Y\) tends to increase as \(X\) increases; a negative slope means \(Y\) tends to decrease.
Figure 9.5: Regression line – intercept and slope.
Many equivalent formulas for the slope of the fitted regression line exist, but the most intuitive expression for the slope links it to the correlation and standard deviations of \(X\) and \(Y\):
\[\widehat{\beta}_1 = r_{xy} \frac{s_y}{s_x} \tag{9.3}\]
For every one standard deviation increase in \(X\), the predicted average of \(Y\) changes by \(r_{xy}\) times the standard deviation of \(Y\). The formula and its practical implication are illustrated in Figure 9.6.
Figure 9.6: The slope of the regression line reflects the correlation between X and Y. On average, a one standard deviation increase in X corresponds to an increase of r times the standard deviation of Y. The blue line represents the regression line, illustrating the predicted relationship between X and Y. The orange line shows the SD line. The blue points mark (1) the point of averages (mean of X and Y) and (2) the predicted average value of Y when X is one standard deviation above its mean.
The fitted intercept is computed as:
\[\widehat{\beta}_0 = \bar{y} - \widehat{\beta}_1 \bar{x}. \tag{9.4}\]
Equation (9.4) ensures the regression line passes through the point of averages \((\bar{x}, \bar{y})\), just like the SD line (see Figure 9.6).
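Equations (9.3) and (9.4) are easy to verify in R. A minimal sketch on simulated data (illustrative only):

```r
# Sketch: slope and intercept from equations (9.3)-(9.4) vs. lm().
set.seed(2)
x <- rnorm(200, mean = 50, sd = 10)
y <- 3 + 0.5 * x + rnorm(200, sd = 5)

b1 <- cor(x, y) * sd(y) / sd(x)   # slope, equation (9.3)
b0 <- mean(y) - b1 * mean(x)      # intercept, equation (9.4)

c(intercept = b0, slope = b1)
coef(lm(y ~ x))                   # the same two numbers from lm()
```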
9.3.4 Residuals and least squares
Residuals (denoted \(e_i\) or \(\widehat{\varepsilon}_i\)) are the differences between observed values and predicted values:
\[e_i = y_i - \widehat{y}_i \tag{9.5}\]
They measure the vertical distance from each data point to the regression line. Analyzing residuals helps assess the goodness of fit and detect patterns that may violate linearity assumptions of the regression model.
The least squares method finds the straight line that minimizes the sum of squared residuals (SSR). For any line defined by \(\widehat{y}_i=b_0+b_1 x_i\), we seek the values of \(b_0\) and \(b_1\) that minimize:
\[ \sum_{i=1}^{n}(y_i - \widehat{y}_i)^2 = \sum_{i=1}^{n}(y_i - b_0 - b_1 x_i)^2 \tag{9.6}\]
The line that achieves this minimum is, by definition, the regression (least squares) line. This criterion makes the total squared distance between observed values (\(y_i\)) and predicted values (\(\widehat{y}_i\)) as small as possible. Squaring the residuals ensures that positive and negative deviations don’t cancel out, and it penalizes larger deviations more heavily than smaller ones.
Figure 9.7: Scatter plot, regression line, residuals and squared residuals. Regression line is the straight line minimizing the total area of the squares.
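The least-squares idea can also be checked numerically: minimizing the sum (9.6) with a general-purpose optimizer recovers (approximately) the same line as lm(). A minimal sketch on simulated data:

```r
# Sketch: minimizing the sum of squared residuals (9.6) numerically
# and comparing with the analytical least-squares solution from lm().
set.seed(3)
x <- rnorm(200, mean = 50, sd = 10)
y <- 3 + 0.5 * x + rnorm(200, sd = 5)

ssr <- function(b) sum((y - b[1] - b[2] * x)^2)   # b = c(intercept, slope)
fit_numeric <- optim(par = c(0, 0), fn = ssr)

fit_numeric$par   # numerically found intercept and slope (approximate)
coef(lm(y ~ x))   # exact least-squares coefficients
```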
9.3.5 R-squared
The coefficient of determination (\(R^2\), R-squared) is often described as the proportion of variation in \(Y\) explained by \(X\). This statement is correct, but only if we specify that this “variation” is quantified using the sum of squares. In other words, \(R^2\) measures how much of the total sum of squared deviations in \(Y\) is accounted for by the regression model.
\[R^2=1-\frac{\text{SS}_{res}}{\text{SS}_{tot}}, \tag{9.7}\]
where:
- \(\text{SS}_{res}\) is the sum of squared residuals (“residual sum of squares”):
\[\text{SS}_{res} = \sum_i{e_i^2} \tag{9.8}\]
- and \(\text{SS}_{tot}\) is the “total sum of squares” (sum of squared deviations from the mean):
\[\text{SS}_{tot} = \sum_i{(y_i-\bar{y})^2} \tag{9.9}\]
For simple linear regression, R-squared is equal to the squared Pearson correlation coefficient:
\[R^2 = r_{xy}^2 \tag{9.10}\]
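A minimal R sketch on simulated data, checking that the different routes to R-squared agree:

```r
# Sketch: three equivalent ways to obtain R-squared in simple regression.
set.seed(4)
x <- rnorm(200, mean = 50, sd = 10)
y <- 3 + 0.5 * x + rnorm(200, sd = 5)
model <- lm(y ~ x)

ss_res <- sum(residuals(model)^2)   # equation (9.8)
ss_tot <- sum((y - mean(y))^2)      # equation (9.9)

1 - ss_res / ss_tot                 # equation (9.7)
cor(x, y)^2                         # equation (9.10)
summary(model)$r.squared            # value reported by R
```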
9.4 Using and interpreting regression
9.4.1 Purpose of regression models
Linear regression and similar statistical models are used to:
- describe the association between \(X\) and \(Y\) in the data set (and estimate this association in the data generating process),
- predict the dependent variable \(Y\) based on the value(s) of explanatory variable(s) \(X\),
- explore potential causal structures between the \(X\)s and \(Y\), for example to assess the possibility of an intervention.
Linear regression is a relatively straightforward tool for describing relationships in data. By fitting a linear model, we can quantify how the explanatory variable is associated with the response, summarize the overall structure of the data, and examine the variation that remains unexplained.
Prediction usually assumes that the data generating process for new observations is similar to the one that generated the data set at hand. It provides a principled way to estimate unobserved values based on observed patterns. We use regression-based prediction for prognosis/forecasting, data imputation, interpolation and smoothing, anomaly detection, and in many other situations.
While regression itself does not prove causality (correlation does not imply causation, 8.3), it can suggest potential causal links if combined with domain knowledge, experimental design (e.g., randomized controlled trials), and control for confounders. In observational studies (1.1), regression can help identify associations that may be worth exploring further to understand possible causal relationships.
9.4.2 Interpretation of the fitted model
In simple linear regression, three numbers describe the structure of the data:
- the intercept (\(\widehat{\beta}_0\)),
- the slope (\(\widehat{\beta}_1\)),
- the “residual standard deviation” (denoted \(s_{e}\), \(\widehat{\sigma}_\varepsilon\), or \(\text{RSD}\)).
The intercept (\(\widehat{\beta}_0\)) represents the average value of \(Y\) when \(X=0\). Sometimes this value is meaningful, but in many cases it is purely theoretical – especially when 0 lies outside the range of observed \(X\). In all cases, however, the intercept remains a necessary component of the regression equation, ensuring that the fitted line best matches the data.
The slope of the regression line (\(\widehat{\beta}_1\)) represents the average difference in the dependent variable (\(Y\)) associated with a one-unit difference in \(X\). Consider two groups, one where \(X=x\) and another where \(X=x+1\). According to the regression model, the difference between the average \(Y\) values in these two groups equals \(\widehat{\beta}_1\). Importantly, this relationship describes association, not causation. We do not claim that a one-unit increase in \(X\) causes a change in Y. The fitted line summarizes how the variables co-vary in the data without implying any underlying causal mechanism.
The slope coefficient depends on units of measurement. Changing the units of \(X\) changes the numerical value of \(\widehat{\beta}_1\) proportionally. For instance, rescaling the height (\(X\)) from centimeters to meters multiplies the slope by 100. Likewise, converting income from Polish zloty to USD scales the slope by the exchange rate. Although the numerical value of the slope changes, its substantive interpretation does not.
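A quick illustration of this rescaling rule in R, using simulated height–weight data (illustrative only):

```r
# Sketch: changing the units of X rescales the slope but not its meaning.
set.seed(8)
height_cm <- rnorm(200, mean = 170, sd = 10)
weight_kg <- -60 + 0.75 * height_cm + rnorm(200, sd = 8)

coef(lm(weight_kg ~ height_cm))           # slope in kg per centimetre
coef(lm(weight_kg ~ I(height_cm / 100)))  # slope in kg per metre: 100 times larger
```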
The residual standard deviation (in R output referred to as the “residual standard error”) in the simple linear regression is computed as:
\[ \text{RSD} = \sqrt{\frac{1}{n-2}\sum_{i=1}^n\left(y_i-\widehat{y}_i\right)^2} \tag{9.11}\]
The residual standard deviation measures the typical size of the prediction errors in the units of the dependent variable. In other words, it tells us roughly how far the data points tend to fall from the regression line. A smaller RSD means the model fits the data more closely. A larger RSD means the model’s predictions are more variable and less precise.
Figure 9.8 illustrates the practical meaning of the RSD.
Figure 9.8: Illustration of the 68-95 rule for residual standard deviation (RSD). See also the interactive version.
Residual standard deviation is related to R-squared:
\[ \text{RSD} = \left(s_y \sqrt{1 - R^2} \right)\sqrt{\frac{n-1}{n-2}} \]
The following approximation may be useful:
\[ \text{RSD} \approx s_y \sqrt{1 - R^2} \]
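A minimal R sketch on simulated data, comparing the manual computation (9.11), R's sigma(), the exact relation above, and the approximation:

```r
# Sketch: RSD computed from equation (9.11), via sigma(), and from R-squared.
set.seed(5)
x <- rnorm(200, mean = 50, sd = 10)
y <- 3 + 0.5 * x + rnorm(200, sd = 5)
model <- lm(y ~ x)
n  <- length(y)
r2 <- summary(model)$r.squared

rsd_manual <- sqrt(sum(residuals(model)^2) / (n - 2))       # equation (9.11)
rsd_exact  <- sd(y) * sqrt(1 - r2) * sqrt((n - 1) / (n - 2)) # exact relation
rsd_approx <- sd(y) * sqrt(1 - r2)                           # rough approximation

c(rsd_manual, sigma(model), rsd_exact, rsd_approx)
```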
Example
Below you will find the printout from a regression model fitted in R. The data are presented in Figure 9.9. The model describes the association between height (\(X\)) and right hand span (\(Y\)) using data collected from 381 students.
Figure 9.9: Hand span vs. height – scatter plot and regression line. The data collected from 381 students.
##
## Call:
## lm(formula = right_hand_span ~ height, data = students)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.9125 -1.0503  0.0765  1.0521  4.4655
##
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)
## (Intercept) -3.890818   1.532902  -2.538              0.0115 *
## height       0.137796   0.008718  15.807 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.579 on 379 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.3973, Adjusted R-squared: 0.3957
## F-statistic: 249.8 on 1 and 379 DF, p-value: < 0.00000000000000022
The intercept \(\widehat{\beta}_0\) is -3.891 and the slope \(\widehat{\beta}_1\) is 0.138. The residual standard deviation is 1.579. The R-squared is 0.397.
Technically the intercept is the predicted hand span at a height of 0 cm. Obviously, this value has no meaningful real-world interpretation. The intercept value (-3.891) is simply the point where the regression line crosses the y-axis and in this case should be treated as a mathematical artifact rather than a meaningful physical prediction.
The slope has a clear interpretation. It indicates that, according to the model, the average hand span is greater by 0.138 cm (1.38 mm) for each additional centimetre of height.
The actual values of \(Y\) typically deviate from the regression line by 1.579 cm. The model explains 39.7% of the variation in \(Y\).
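For reference, a minimal sketch of how these quantities can be extracted from a fitted model in R; it assumes the book's `students` data frame (with columns `height` and `right_hand_span`) is available in the workspace:

```r
# Sketch: extracting the coefficients, RSD, and R-squared from the fitted model.
# The `students` data set is assumed to be loaded (it is not included here).
model <- lm(right_hand_span ~ height, data = students)

coef(model)                # intercept and slope
sigma(model)               # residual standard deviation (RSD)
summary(model)$r.squared   # R-squared
```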
9.4.3 Prediction
When predicting the \(Y\) variable for new observations based on known \(X\) values, we rely on the assumption that the underlying data generating process can be approximated by a linear relationship.
In practice, a scatter plot of \(Y\) against \(X\) can be used to see whether the points form a roughly elliptical or “melon-shaped” cloud, indicating that a linear approximation is reasonable.
In linear regression, we can make point predictions of average \(Y\) given \(X\). These are the \(\widehat{y}\)-values lying on the fitted regression line.
To account for uncertainty, we can construct interval predictions. A confidence interval gives a range in which the average \(Y\) is likely to lie conditional on \(X\). Even if we assume the underlying relationship is linear, we do not know whether the fitted line is exactly the true underlying line of the data generating process.
We will also be interested in prediction intervals. A prediction interval captures the likely range of a single (not average) new observation of \(Y\) for a given \(X\). It is wider than a confidence interval because it accounts for both the uncertainty in estimating the mean and the natural variability of individual observations.
The exact formulas for the confidence and prediction intervals are given in the statistics 2 course.
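Even without the explicit formulas, both kinds of intervals can be obtained in R with predict(). A minimal sketch on simulated data:

```r
# Sketch: point predictions, confidence intervals (for the average of Y),
# and prediction intervals (for a single new Y).
set.seed(6)
x <- rnorm(200, mean = 50, sd = 10)
y <- 3 + 0.5 * x + rnorm(200, sd = 5)
model <- lm(y ~ x)

new_x <- data.frame(x = c(45, 55, 65))
predict(model, newdata = new_x)                            # point predictions
predict(model, newdata = new_x, interval = "confidence")   # narrower: average Y given X
predict(model, newdata = new_x, interval = "prediction")   # wider: a single new Y given X
```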
When making predictions from a linear regression model, it is important to distinguish between interpolation and extrapolation. Interpolation refers to predicting \(Y\) for values of \(X\) that lie within the range of the observed data. In this case, the model is supported by the data, and the linear approximation is usually reasonable. In contrast, extrapolation involves predicting \(Y\) for values of \(X\) that are outside the range of the sample. Here, the model must rely entirely on the assumed functional form – such as linearity – and there is no data to validate that assumption. As a result, extrapolated predictions can be highly unreliable, and both confidence and prediction intervals tend to underestimate the true uncertainty.
Figure 9.10: When using regression for prediction, extrapolation is much riskier than interpolation.
9.5 Regression to the mean
The term regression originates from Francis Galton’s 19th-century observation that the heights of sons tended to “regress” toward the population average rather than perfectly matching their fathers’ heights. Very tall fathers usually had tall sons, but not as tall as themselves; similarly, very short fathers had sons who were short, but closer to the mean. Galton initially believed this was a special phenomenon, later termed regression to the mean. Subsequent research revealed that this is not unique but a mathematical consequence of imperfect correlation.
Figure 9.11 illustrates this concept using Galton’s data on fathers’ and sons’ heights. The black points represent group averages, the blue line is the regression line, and the orange line shows the SD line (perfect correlation). Notice how the regression line is flatter than the SD line, reflecting the tendency toward the mean.
Figure 9.11: Heights of fathers and sons, group averages, the regression line (blue) and the SD line (orange).
Regression to the mean occurs in many real-world contexts beyond Galton’s height example:
- Test–retest scores: Students who score extremely high or low on a test often score closer to the average on a repeated test. This happens because part of the extreme score is due to random factors (“luck”), which are unlikely to repeat.
- Assortative mating with imperfect correlation: Very intelligent women tend to marry men who are intelligent, but typically less intelligent than themselves. This reflects the fact that intelligence between partners is correlated, but the correlation coefficient is not 1.
- Medical measurements: Patients with extremely high blood pressure at one visit often show lower readings at the next visit, even without treatment. This is largely due to measurement variability and temporary conditions rather than a genuine improvement in the patients’ underlying condition.
9.6 Regression vs correlation
As you can see, simple linear regression is closely related to Pearson’s correlation:
Slope and correlation. The slope of the regression line depends directly on the correlation between \(X\) and \(Y\) – see equation (9.3). In fact, if both variables are standardized to z-scores, the slope equals the correlation coefficient (see the sketch below).
The sign of the regression slope is the same as the sign of the correlation coefficient.
Coefficient of determination. The \(R^2\) value in simple linear regression is equal to the squared Pearson correlation coefficient (see (9.10)).
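A minimal R sketch of the first of these connections, on simulated data:

```r
# Sketch: for Z-standardized variables the regression slope equals r.
set.seed(9)
x <- rnorm(200, mean = 50, sd = 10)
y <- 3 + 0.5 * x + rnorm(200, sd = 5)

coef(lm(scale(y) ~ scale(x)))[2]   # slope of the regression on Z-scores
cor(x, y)                          # Pearson correlation - the same value
```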
Correlation and regression are connected concepts, yet they differ in key ways:
Correlation is symmetric – swapping \(X\) and \(Y\) does not change the correlation. Regression, by contrast, is directional – it models one variable as dependent on the other: swapping \(X\) and \(Y\) changes the slope and intercept.
Correlation is unitless and always lies between -1 and 1. In the case of regression, units matter: the slope depends on the units of \(X\) and \(Y\).
Finally, correlation simply measures the strength and direction of a linear relationship, whereas regression goes beyond description: it provides an equation that can be used for prediction and explanation.
9.7 Variable transformation
Sometimes the relationship between two variables is not naturally linear, but variable transformations can reshape the data so that a straight-line model becomes more suitable for describing the relationship.
The most popular transformation is the (natural) logarithm: \(x^* = \log(x)\). This transformation is especially helpful when:
- the relationship between the variables is monotonic but curvilinear and close to a power-law pattern, such as the relation between height and weight of animals;
- values span several orders of magnitude, like incomes ranging from hundreds to millions;
- the relationship follows percentage-based patterns, like price changes;
- there is a rapid increase in the values of a variable, such as population growth or the spread of a viral post;
- variability increases with the level of a variable, as in sales data where revenues for big stores are more variable than for small ones.
Other useful transformations include the square root (\(x^* = \sqrt{x}\)) and the reciprocal (\(x^* = 1/x\)), each of which can help straighten curved patterns or stabilize variability.
9.7.1 Prediction Using a Log–Log Linear Model
In a log–log regression, both X and Y are transformed using logarithms before fitting the model. The fitted equation has the form:
\[\widehat{\log(y_i)} = \widehat{\beta}_0 + \widehat{\beta}_1\log(x_i)\]
To make a prediction for a new value of \(x\), you first compute the predicted \(\log(y)\) using the model, and then convert it back to the original scale by applying the exponential function \(\exp(a)=\mathrm{e}^a\):
\[\widehat{y_p} = \exp(\widehat{\beta}_0 + \widehat{\beta}_1\log(x_p))\]
This final step is important because the regression is performed on the log scale, but predictions are usually needed in the original units (such as kilograms or meters). In a log–log model, the slope \(\widehat{\beta}_1\) has a simple interpretation: it is (approximately) the percentage change in the average of \(Y\) associated with a 1% difference in \(X\).
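A minimal R sketch of the whole procedure, on simulated data that roughly follow a power law:

```r
# Sketch: a log-log model fitted to simulated power-law data
# (y roughly proportional to x^1.5), with prediction on the original scale.
set.seed(7)
x <- runif(200, min = 1, max = 100)
y <- 2 * x^1.5 * exp(rnorm(200, sd = 0.2))

model_ll <- lm(log(y) ~ log(x))
coef(model_ll)                                  # slope close to 1.5

x_new    <- data.frame(x = 50)
log_pred <- predict(model_ll, newdata = x_new)  # prediction on the log scale
exp(log_pred)                                   # back-transformed to the original units
```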
9.8 Limitations of linear regression
Although linear regression is widely used, it has several important limitations. It assumes that the relationship between X and Y is straight (linearity), that the variability of Y is similar across all values of X (homoskedasticity), and that there are no extreme outliers dominating the results.
If these assumptions are not met, the model’s predictions and interpretations can be misleading.
Linear regression captures only one type of relationship — linear dependence — and cannot detect more complex patterns unless the model is expanded or revised.
Finally, regression describes association, not causation. Even a strong linear relationship does not prove that changes in X cause changes in Y.
Figure 9.12: A scatter plot showing heteroskedasticity. Although a straight line fits the central trend well, the variability of the residuals changes across levels of X.
Figure 9.13: Outliers distort the regression line. A single outlier can change the fitted line and create a spurious slope.
9.9 Links
Scatter plot and regression line — web application: https://istats.shinyapps.io/ExploreLinReg/
Least squares rules – visualisation: https://college.cengage.com/nextbook/statistics/utts_13540/student/html/simulation3_1.html
Scatter plot, least squares and regression lines:
https://bankonomia.nazwa.pl/regressionclick/
Impact of outliers: https://college.cengage.com/nextbook/statistics/utts_13540/student/html/simulation3_5.html
Simple regression – yet another visualisation: https://observablehq.com/@yizhe-ang/interactive-visualization-of-linear-regression
Regression to the mean – an introduction with examples: https://fs.blog/regression-to-the-mean/
There are two regression lines: https://bankonomia.nazwa.pl/tworeglines/
9.10 Questions
9.10.1 Discussion questions
Question 9.1 Among the four plots, choose the one where a simple linear model would be most reasonable. For each of the remaining panels, describe the features that make simple regression less suitable. Discuss what could be done to address these issues in each case.

Question 9.2 In Pearson’s study, the sons of the 182-cm fathers only averaged 179 cm in height. True or false: if you take the 179-cm sons, their fathers will average about 182 cm in height. Explain briefly.
Question 9.3 For the men in Diagnoza Społeczna, the ones who were 190 cm tall averaged 92 kg in weight. True or false, and explain: the ones who weighed 92 kg must have averaged 190 cm in height.
Question 9.4 In each panel, manually sketch an approximate least-squares regression line based on the displayed data.

Question 9.5 (Freedman, Pisani, and Purves 2007) In a study of the stability of IQ scores, a large group of individuals is tested once at age 18 and again at age 35. The following results are obtained:
Age 18: average score = 100, SD = 15
Age 35: average score = 100, SD = 15, r = 0.80
Estimate the average score at age 35 for all the individuals who scored 115 at age 18.
Predict the score at age 35 for an individual who scored 115 at age 18.
What is the difference between estimation of the average for the existing group and prediction?
9.10.2 Test questions
Question 9.6 In a certain group of students, exam scores average 60 points with a standard deviation of 12; the laboratory test scores have a similar average and standard deviation. The correlation between the laboratory test scores and the exam scores is approximately 0.50. Using linear regression, estimate the mean exam score for students whose laboratory test scores were:
60 points:
72 points:
36 points:
54 points:
90 points:
Question 9.7 The following results have been obtained in a study of 1,000 married couples. Mean height of husbands: 178 cm; standard deviation of husbands’ heights: 7 cm; mean height of wives: 167 cm; standard deviation of wives’ heights: 6 cm. Correlation between wife’s height and husband’s height: ≈ 0.25.
Estimate the (average) height of wife if husband’s height is:
178 cm:
171 cm:
192 cm:
174.5 cm:
188.5 cm:
Literature
Please note that in English statistical texts, \(\text{log}\) stands for natural logarithm, \(\ln\).↩︎