Chapter 8 Regression Wisdom
8.1 Dangers of Extrapolation
Indiscriminate use of regression can yield predictions that are ridiculous.
The graph and regression line below show the relationship between the mean height (in cm) and age (in months) of young children in Kalama, Egypt.
## # A tibble: 12 × 2
## age height
## <int> <dbl>
## 1 18 76.1
## 2 19 77
## 3 20 78.1
## 4 21 78.2
## 5 22 78.8
## 6 23 79.7
## 7 24 79.9
## 8 25 81.1
## 9 26 81.2
## 10 27 81.8
## 11 28 82.8
## 12 29 83.5
##
## Call:
## lm(formula = height ~ age, data = Kalama)
##
## Coefficients:
## (Intercept) age
## 64.928 0.635
Let’s predict the height of a 50-year-old man.
\[X=50 \times 12 = 600\]
\[\hat{Y}=64.928 + 0.635 \times 600 \approx 446\]
The man is predicted to be about 446 cm, or 4.46 m, tall. Dividing by 2.54 cm per inch gives about 176 inches, or roughly 14 feet 8 inches.
This is obviously a terrible prediction, even though \(r^2=0.989\) for the model. The data cover only ages 18 to 29 months, and the line assumes that height keeps increasing by 0.635 cm every month indefinitely. Think about how humans actually grow compared with that assumption.
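The arithmetic above can be reproduced in R. This is a minimal sketch that re-enters the twelve rows of the table by hand; the data frame name `Kalama` follows the `lm()` call shown earlier, while `fit`, `new_man`, and `pred_cm` are names introduced here for illustration. Note that `predict()` returns a value for any age it is given and does not warn that 600 months is far outside the observed range.

```r
# Mean heights (cm) by age (months), re-entered from the table above
Kalama <- data.frame(
  age    = 18:29,
  height = c(76.1, 77, 78.1, 78.2, 78.8, 79.7,
             79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
)

fit <- lm(height ~ age, data = Kalama)
coef(fit)                 # intercept about 64.928, slope about 0.635

# The model was built on ages 18 to 29 months only
range(Kalama$age)

# A 50-year-old is 600 months old (illustrative object name)
new_man <- data.frame(age = 50 * 12)
pred_cm <- unname(predict(fit, newdata = new_man))
pred_cm                   # about 446 cm
pred_cm / 2.54            # about 176 inches
```

Comparing `range(Kalama$age)` with the age being predicted is a quick check for extrapolation before trusting any prediction.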
8.2 Lurking Variable
Look at the following scatterplot. What do you think is true about the magnitude and the direction of the relationship between \(X\) and \(Y\)?
The plot appears to show a moderate negative relationship.
Correlation: \(r=\) -0.561
Regression Equation: \(\hat{y}=\) 156.472 \(-\) 0.807 \(X\)
But what are \(X\) and \(Y\)? It’s impossible to determine if there is a cause-and-effect relationship between the variables unless you are provided some context.
It turns out that \(X\) is height and \(Y\) is hair length, both measured in centimeters, for a sample of \(n=28\) people, both male and female. What do you think is the actual reason for the correlation between \(X\) and \(Y\)?
Considering women and men separately, the correlations are much weaker.
Correlation (Women Only): \(r=\) 0.225
Correlation (Men Only): \(r=\) -0.045
Summary statistics by sex, for height (first table) and hair length (second table), both in cm:
## sex min Q1 median Q3 max mean sd n missing
## 1 F 148 154.75 158.0 163.25 178 159.3125 7.516371 16 0
## 2 M 164 170.00 172.5 178.50 190 174.8333 7.481290 12 0
## sex min Q1 median Q3 max mean sd n missing
## 1 F 22 27.25 32.0 38.5 48 33.312500 8.178987 16 0
## 2 M 0 2.75 4.5 7.0 35 8.083333 9.940352 12 0
The multiple linear regression equation predicting hair length based on both height and sex (0=female, 1=male) is:
\[\widehat{\text{Hair Length}} = 14.736 + 0.117 \, \text{Height} - 27.039 \, \text{Sex}\]
The equation has \(r^2=0.679\) and only the \(Sex\) predictor is statistically significant.
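To see how the two predictors contribute, the fitted equation can be evaluated for a woman and a man of the same height. This is a small sketch using only the coefficients printed above; the function name `predict_hair_length` and the example height of 170 cm are choices made here for illustration, since the raw data are not shown.

```r
# Coefficients copied from the fitted equation in the text
predict_hair_length <- function(height_cm, sex) {
  # sex: 0 = female, 1 = male
  14.736 + 0.117 * height_cm - 27.039 * sex
}

predict_hair_length(170, sex = 0)   # woman, 170 cm tall
predict_hair_length(170, sex = 1)   # man, 170 cm tall

# At any fixed height the two predictions differ by the Sex coefficient
predict_hair_length(170, 0) - predict_hair_length(170, 1)

# A 10 cm difference in height changes the prediction by only 10 * 0.117 cm
predict_hair_length(180, 0) - predict_hair_length(170, 0)
```

Once sex is in the model, the height coefficient is small and positive (0.117), whereas the simple regression on height alone had a slope of -0.807. The negative slope came from comparing taller men with shorter women, not from height itself.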
8.3 Residual Plots
Calculate the correlation coefficient \(r\) and the equation of the least squares regression model \(\hat{Y}=b_0 + b_1 X\) for each of the four data sets. Notice that the \(X\) values for the first three data sets are the same.
## x1 x2 x3 x4 y1 y2 y3 y4
## 1 10 10 10 8 8.04 9.14 7.46 6.58
## 2 8 8 8 8 6.95 8.14 6.77 5.76
## 3 13 13 13 8 7.58 8.74 12.74 7.71
## 4 9 9 9 8 8.81 8.77 7.11 8.84
## 5 11 11 11 8 8.33 9.26 7.81 8.47
## 6 14 14 14 8 9.96 8.10 8.84 7.04
## 7 6 6 6 8 7.24 6.13 6.08 5.25
## 8 4 4 4 19 4.26 3.10 5.39 12.50
## 9 12 12 12 8 10.84 9.13 8.15 5.56
## 10 7 7 7 8 4.82 7.26 6.42 7.91
## 11 5 5 5 8 5.68 4.74 5.73 6.89
We will examine the “Anscombe quartet”, four data sets with nearly the same correlation and regression equation but four very different trends, originally presented by Frank Anscombe in 1973. For each data set, I have constructed a residual plot of \(\hat{Y}\) versus \(e\) (predicted value vs residual). Ideally this plot looks “random”, without any sort of trend or correlation.
Sometimes the residual plot will plot \(X\) vs \(e\) (the actual value of explanatory variable vs residual) instead. This is commonly done if the explanatory variable is time and we wish to determine if there is a temporal trend in the data.
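The correlations, regression equations, and residual plots for the quartet can be computed with a short R sketch. It assumes the built-in `anscombe` data frame (part of base R's datasets package, with the same columns `x1` to `x4` and `y1` to `y4` as the table above); the helper names `fit_one`, `fits`, `quartet`, and `plot_quartet_residuals` are introduced here for illustration.

```r
# `anscombe` ships with base R: columns x1..x4 and y1..y4
fit_one <- function(i) {
  # Build the formula y_i ~ x_i and fit a least squares line
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
}
fits <- lapply(1:4, fit_one)

quartet <- data.frame(
  dataset   = 1:4,
  r         = sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]],
                                          anscombe[[paste0("y", i)]])),
  intercept = sapply(fits, function(f) coef(f)[[1]]),
  slope     = sapply(fits, function(f) coef(f)[[2]])
)
round(quartet, 3)   # the four rows agree to about two decimal places

# Residual plot (predicted value vs residual) for each data set
plot_quartet_residuals <- function() {
  op <- par(mfrow = c(2, 2))
  on.exit(par(op))
  for (i in 1:4) {
    plot(fitted(fits[[i]]), resid(fits[[i]]),
         xlab = "Predicted value", ylab = "Residual",
         main = paste("Data set", i))
    abline(h = 0, lty = 2)
  }
}
```

Calling `plot_quartet_residuals()` draws the four panels in a 2-by-2 grid. The second data set shows a clear curve, the third contains a single outlier, and in the fourth all but one point share the same predicted value.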
Here’s a more recent, entertaining example called the Datasaurus, which Alberto Cairo came up with. It’s similar to the Anscombe quartet in that data sets with virtually the same correlation and regression equation have wildly different patterns.
## # A tibble: 13 × 8
## dataset mean_x mean_y std_dev_x std_dev_y corr_x_y slope y.intercept
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 away 54.3 47.8 16.8 26.9 -0.0641 -0.103 53.4
## 2 bullseye 54.3 47.8 16.8 26.9 -0.0686 -0.110 53.8
## 3 circle 54.3 47.8 16.8 26.9 -0.0683 -0.110 53.8
## 4 dino 54.3 47.8 16.8 26.9 -0.0645 -0.104 53.5
## 5 dots 54.3 47.8 16.8 26.9 -0.0603 -0.0969 53.1
## 6 h_lines 54.3 47.8 16.8 26.9 -0.0617 -0.0992 53.2
## 7 high_lines 54.3 47.8 16.8 26.9 -0.0685 -0.110 53.8
## 8 slant_down 54.3 47.8 16.8 26.9 -0.0690 -0.111 53.8
## 9 slant_up 54.3 47.8 16.8 26.9 -0.0686 -0.110 53.8
## 10 star 54.3 47.8 16.8 26.9 -0.0630 -0.101 53.3
## 11 v_lines 54.3 47.8 16.8 26.9 -0.0694 -0.112 53.9
## 12 wide_lines 54.3 47.8 16.8 26.9 -0.0666 -0.107 53.6
## 13 x_shape 54.3 47.8 16.8 26.9 -0.0656 -0.105 53.6
8.4 Quadratic Regression
In this problem, we are trying to model the age (in years) of a tree based on the diameter (in inches) of the trunk. This will be useful for predicting the age of a tree without having to cut the tree down and count the growth rings. I’d suggest entering the diameter (\(X\)) in List L1 and the age (\(Y\)) in List L2.
## diameter age
## [1,] 1.8 4
## [2,] 1.8 5
## [3,] 2.2 8
## [4,] 4.4 8
## [5,] 6.6 8
## [6,] 4.4 10
## [7,] 7.7 10
## [8,] 10.8 12
## [9,] 7.7 13
## [10,] 5.5 14
## [11,] 9.9 16
## [12,] 10.1 18
## [13,] 12.1 20
## [14,] 12.8 22
## [15,] 10.3 23
## [16,] 14.3 25
## [17,] 13.2 28
## [18,] 9.9 29
## [19,] 13.2 30
## [20,] 15.4 30
## [21,] 17.6 33
## [22,] 14.3 34
## [23,] 15.4 35
## [24,] 11.0 38
## [25,] 15.4 38
## [26,] 16.5 40
## [27,] 16.5 42
Summary Statistics:
Means: \(\bar{x}=\) 10.4; \(\bar{y}=\) 21.963; Standard Deviations: \(s_x=\) 4.793; \(s_y=\) 11.902; Correlation: \(r=\) 0.888; \(r^2=\) 0.789
Regression Equation: \(\hat{y}=\) -0.974 \(+\) 2.206 \(X\)
While the correlation coefficient is close to 1, careful examination of the plot shows the trend is somewhat nonlinear. In the following graph, I have superimposed a curve that estimates the true relationship between tree diameter and age.
A plot called a residual plot can help us detect non-linearity and other violations of regression assumptions. In order to construct it, we must compute the predicted value \(\hat{y}\) and the residual \(e\) for each data point. We can do this on the TI calculator. Let List L3 hold your predicted values: L3=-0.974+2.206*L1. Let List L4 hold your residuals: L4=L2-L3.
I did this with my software as well. Then construct a scatterplot with the predicted values on the \(x\)-axis and the residuals on the \(y\)-axis.
Ideally, the points in the residual plot are uncorrelated and fall in a random pattern. In this case, there is a curved pattern in the residuals. Notice that most of the largest and oldest trees have their age underestimated (i.e. the residual \(e\) is positive). Our model is biased and will predict larger, older trees to be younger than they really are.
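The calculator steps can be mirrored in R. This is a sketch, not necessarily how the plots in these notes were produced: the data frame `tree` re-enters the table above by hand, and the names `linear_fit` and `plot_tree_residuals` are introduced here for illustration. It uses the unrounded fitted coefficients, so the values differ slightly from those obtained with -0.974 and 2.206.

```r
# Tree data re-entered from the table above (diameter in inches, age in years)
tree <- data.frame(
  diameter = c(1.8, 1.8, 2.2, 4.4, 6.6, 4.4, 7.7, 10.8, 7.7, 5.5, 9.9, 10.1,
               12.1, 12.8, 10.3, 14.3, 13.2, 9.9, 13.2, 15.4, 17.6, 14.3,
               15.4, 11.0, 15.4, 16.5, 16.5),
  age      = c(4, 5, 8, 8, 8, 10, 10, 12, 13, 14, 16, 18, 20, 22, 23, 25,
               28, 29, 30, 30, 33, 34, 35, 38, 38, 40, 42)
)

linear_fit <- lm(age ~ diameter, data = tree)
coef(linear_fit)                                   # about -0.974 and 2.206

tree$predicted <- unname(fitted(linear_fit))       # plays the role of List L3
tree$residual  <- tree$age - tree$predicted        # plays the role of List L4

# Residuals for the oldest trees: every one is positive (age underestimated)
tree[tree$age >= 34, c("diameter", "age", "residual")]

# Residual plot: predicted values on the x-axis, residuals on the y-axis
plot_tree_residuals <- function() {
  plot(tree$predicted, tree$residual,
       xlab = "Predicted age (years)", ylab = "Residual (years)")
  abline(h = 0, lty = 2)
}
```

Calling `plot_tree_residuals()` draws the residual plot with a dashed reference line at zero.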
The cross-sectional area of the tree trunk, rather than the diameter, might be a better predictor of age. Because area is proportional to the squared diameter, we could re-express the diameter by squaring it, using both \(X\) and \(X^2\) to predict age.
Then we fit a quadratic regression equation of the form \[\hat{y}=b_0 + b_1 X + b_2 X^2\] to the data. The QuadReg option on the TI calculator will fit this model. I’ll fit this model and graph the equation. It’s a better fit! One might also use this for the second data set from the Anscombe data.
Regression Equation: \(\hat{y}=\) 2.72 \(+\) 1.161 \(X +\) 0.055 \(X^2\)
\(r^2=0.799\)
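In R, a quadratic regression can be fit with `lm()` by adding a squared term wrapped in `I()`. The sketch below is one way to do it, assuming the tree data are re-entered by hand; the object names `tree`, `linear_fit`, and `quadratic_fit` are illustrative.

```r
# Tree data re-entered from the table above (diameter in inches, age in years)
tree <- data.frame(
  diameter = c(1.8, 1.8, 2.2, 4.4, 6.6, 4.4, 7.7, 10.8, 7.7, 5.5, 9.9, 10.1,
               12.1, 12.8, 10.3, 14.3, 13.2, 9.9, 13.2, 15.4, 17.6, 14.3,
               15.4, 11.0, 15.4, 16.5, 16.5),
  age      = c(4, 5, 8, 8, 8, 10, 10, 12, 13, 14, 16, 18, 20, 22, 23, 25,
               28, 29, 30, 30, 33, 34, 35, 38, 38, 40, 42)
)

linear_fit    <- lm(age ~ diameter, data = tree)

# I() protects the ^ so it means "square the diameter" inside a formula
quadratic_fit <- lm(age ~ diameter + I(diameter^2), data = tree)

coef(quadratic_fit)                 # compare with the equation reported above
summary(linear_fit)$r.squared       # r^2 for the straight line
summary(quadratic_fit)$r.squared    # r^2 for the quadratic; never lower than the line's
```

Because the straight line is a special case of the quadratic (with \(b_2=0\)), adding the squared term can never lower \(r^2\), so the residual plot of the new model is still worth checking.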
8.5 Log Transformation
Another very common re-expression is to take the logarithm of one or both regression variables. For example, if the response variable \(Y\) is increasing at an exponential, rather than a linear, rate, taking the log of \(Y\) will “straighten” out the scatterplot.
Here’s a simplified example, with \(X\)=hours and \(Y\)=number of bacteria in an experiment.
## hours bacteria
## 1 1 20
## 2 2 40
## 3 3 75
## 4 4 150
## 5 5 297
## 6 6 750
## (Intercept) hours
## 0.972 0.308
I doubt it would take a residual plot for you to be convinced that fitting a linear model would be a poor choice in this situation.
We can “straighten” an exponential growth trend by taking the logarithm of the \(Y\) variable. We can use either the base-10 log or the natural log. I’ll use the base-10 log in this example and do this with my calculator.
The equation, which has \(r^2=0.996\), is: \[\log(\hat{Y})=0.972+0.308X\]
If I want to predict the number of bacteria in the petri dish after \(X=2.5\) hours, I use the equation.
\(\log(\hat{Y})=0.972+0.308(2.5)=\) 1.742
This answer is in log-bacteria, not number of bacteria. Since the log is less than two, the actual number of bacteria will be less than \(10^2=100\). We “undo” the re-expression by taking the antilog.
\(\hat{Y}=10^{1.742}\approx\) 55.2 bacteria.
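The same fit, prediction, and back-transformation can be sketched in R. The data frame name `bacteria_data` and the objects `log_fit` and `log_pred` are introduced here for illustration; the six data points are copied from the table above.

```r
# Bacteria counts by hour, re-entered from the table above
bacteria_data <- data.frame(
  hours    = 1:6,
  bacteria = c(20, 40, 75, 150, 297, 750)
)

# Regress the base-10 log of the count on hours
log_fit <- lm(log10(bacteria) ~ hours, data = bacteria_data)
coef(log_fit)                  # about 0.972 and 0.308
summary(log_fit)$r.squared     # about 0.996

# Predict at 2.5 hours: the result is on the log10 scale
log_pred <- unname(predict(log_fit, newdata = data.frame(hours = 2.5)))
log_pred                       # about 1.742

# Undo the re-expression with the antilog to get a bacteria count
10^log_pred                    # about 55 bacteria
```

If the natural log had been used instead (`log()` in R), the antilog step would be `exp()` rather than `10^`.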