## 3.5 Model diagnostics

As we saw in Section 2.3, checking the assumptions of the multiple linear model through the visualization of the data becomes tricky even when $$p=2.$$ To solve this issue, a series of diagnostic tools have been designed in order to evaluate graphically and systematically the validity of the assumptions of the linear model.

We will illustrate them in the wine dataset, which is available in the wine.RData workspace:

load("wine.RData")
mod <- lm(Price ~ Age + AGST + HarvestRain + WinterRain, data = wine)
summary(mod)

Before going into specifics, recall the following general comment on performing model diagnostics.

When one assumption fails, it is likely that this failure will affect to other assumptions. For example, if linearity fails, then most likely homoscedasticity and normality will fail also. Therefore, identifying the root cause of the assumptions failure is key in order to try to find a patch.

### 3.5.1 Linearity

Linearity between the response $$Y$$ and the predictors $$X_1,\ldots,X_p$$ is the building block of the linear model. If this assumption fails, i.e., if there is a nonlinear trend linking $$Y$$ and at least one of the predictors $$X_1,\ldots,X_p$$ in a significant way, then the linear fit will yield flawed conclusions, with varying degrees of severity, about the explanation of $$Y.$$ Therefore, linearity is a crucial assumption.

How to check it

The so-called residuals vs. fitted values plot is the scatterplot of $$\{(\hat Y_i,\hat\varepsilon_i)\}_{i=1}^n$$ and is a very useful tool for detecting linearity departures using a single68 graphical device. Figure 3.15 illustrates how the residuals vs. fitted values plots behave for situations in which the linearity is known to be respected or violated.

Under linearity, we expect that there is no significant trend in the residuals $$\hat\varepsilon_i$$ with respect to $$\hat Y_i.$$ This can be assessed by checking if a flexible fit of the mean of the residuals is constant. Such fit is given by the red line produced by:

plot(mod, 1)

If nonlinearities are observed, it is worth to plot the regression terms of the model. These are the $$p$$ scatterplots of $$\{(X_{ij},Y_i)\}_{i=1}^n,$$ for $$j=1,\ldots,p,$$ that are accompanied by the regression lines $$y=\hat\beta_0+\hat\beta_jx_j,$$ where $$(\hat{\beta}_0,\hat{\beta}_1,\ldots,\hat{\beta}_p)'$$ come from the multiple linear fit, not from individual simple linear regressions. The regression terms help identifying which predictors have nonlinear effects on $$Y.$$

par(mfrow = c(2, 2)) # We have 4 predictors
termplot(mod, partial.resid = TRUE)

What to do if it fails

Using an adequate nonlinear transformation for the problematic predictors or adding interaction terms, as we saw in Section 3.4, might be helpful in linearizing the relation between $$Y$$ and $$X_1,\ldots,X_p.$$ Alternatively, one can consider a nonlinear transformation $$f$$ for the response $$Y,$$ at the expenses of:

1. Modeling $$W:=f(Y)$$ rather than $$Y,$$ thus slightly changing the setting of the original modeling problem.
2. Alternatively, untransforming the linear prediction $$\hat{W}$$ as $$\hat{Y}_f:=f^{-1}(\hat{W})$$ and performing biased predictions69 $$\mbox{}$$70 for $$Y.$$

Transformations of $$Y$$ are explored in Section 3.5.7. Chapter 5 is devoted to replacing the untransformed approach with a more principled one71.

Let’s see the transformation of predictors in the example that motivated Section 3.4.

par(mfrow = c(1, 2))
plot(lm(y ~ x, data = nonLinear), 1) # Nonlinear
plot(lm(y ~ I(x^2), data = nonLinear), 1) # Linear

### 3.5.2 Normality

The assumed normality of the errors $$\varepsilon_1,\ldots,\varepsilon_n$$ allows us to make exact inference in the linear model, in the sense that the distribution of $$\hat{\boldsymbol{\beta}}$$ given in (2.11) is exact for any $$n,$$ and not asymptotic with $$n\to\infty.$$ If normality does not hold, then the inference we did (CIs for $$\beta_j,$$ hypothesis testing, CIs for prediction) is to be somehow suspected. Why just somehow? Roughly speaking, the reason is that the central limit theorem will make $$\hat{\boldsymbol{\beta}}$$ asymptotically normal72, even if the errors are not. However, the speed of this asymptotic convergence greatly depends on how non-normal is the distribution of the errors. Hence the next rule of thumb:

Non-severe73 departures from normality yield valid (asymptotic) inference for relatively large sample sizes $$n.$$

Therefore, the failure of normality is typically less problematic than other assumptions.

How to check it

The QQ-plot (theoretical quantile vs. empirical quantile plot) allows to check if the standardized residuals follow a $$\mathcal{N}(0,1).$$ What it does is to compare the theoretical quantiles of a $$\mathcal{N}(0,1)$$ with the quantiles of the sample of standardized residuals.

plot(mod, 2)

Under normality, we expect the points to align with the diagonal line, which represents the expected position of the points if they were sampled from a $$\mathcal{N}(0,1).$$ It is usual to have larger departures from the diagonal in the extremes74 than in the center, even under normality, although these departures are more evident if the data is non-normal.

There are formal tests to check the null hypothesis of normality in our residuals. Two popular ones are the Shapiro–Wilk test, implemented by the shapiro.test function, and the Lilliefors test75, implemented by the nortest::lillie.test function:

# Shapiro-Wilk test of normality
shapiro.test(mod$residuals) ## ## Shapiro-Wilk normality test ## ## data: mod$residuals
## W = 0.95903, p-value = 0.3512
# We do not reject normality
# shapiro.test allows up to 5000 observations -- if dealing with more data
# points, randomization of the input is a possibility

# Lilliefors test -- the Kolmogorov-Smirnov adaptation for testing normality
nortest::lillie.test(mod$residuals) ## ## Lilliefors (Kolmogorov-Smirnov) normality test ## ## data: mod$residuals
## D = 0.13739, p-value = 0.2125
# We do not reject normality

What to do if it fails

Patching non-normality is not easy and most of the time requires the consideration of other models, like the ones seen in Chapter 5. If $$Y$$ is non-negative, one possibility is to transform $$Y$$ by means of the Box–Cox transformation:

\begin{align} y^{(\lambda)}:=\begin{cases} \frac{y^\lambda-1}{\lambda},& \lambda\neq0,\\ \log(y),& \lambda=0. \end{cases}\tag{3.6} \end{align}

This transformation alleviates the skewness76 of the data, therefore making it more symmetric and hence normal-like. The optimal data-dependent $$\hat\lambda$$ that makes the data most normal-like can be found through maximum likelihood77 on the transformed sample $$\{Y^{(\lambda)}_{i}\}_{i=1}^n.$$ If $$Y$$ is not non-negative, (3.6) cannot be applied. A possible patch is to shift the data by a positive constant $$m=-\min(Y_1,\ldots,Y_n)+\delta,$$ $$\delta>0,$$ such that transformation (3.6) becomes

\begin{align*} y^{(\lambda,m)}:=\begin{cases} \frac{(y+m)^\lambda-1}{\lambda},& \lambda\neq0,\\ \log(y+m),& \lambda=0. \end{cases} \end{align*}

A neat alternative to this shifting is to rely on a transformation that is already designed for real $$Y,$$ such as the Yeo–Johnson transformation:

\begin{align} y^{(\lambda)}:=\begin{cases} \frac{(y+1)^\lambda-1}{\lambda},& \lambda\neq0,y\geq 0,\\ \log(y+1),& \lambda=0,y\geq 0,\\ -\frac{(-y+1)^{2-\lambda}-1}{2-\lambda},& \lambda\neq2,y<0,\\ -\log(-y+1),& \lambda=2,y<0. \end{cases}\tag{3.7} \end{align}

The beauty of the Yeo–Johnson transformation is that it extends neatly the Box–Cox transformation, which appears as a particular case when $$Y$$ is non-negative and after performing a shift of $$1$$ units (see Figure 3.21). As with the Box–Cox transformation, the optimal $$\hat\lambda$$ is estimated through maximum likelihood on the transformed sample $$\{Y^{(\lambda)}_{i}\}_{i=1}^n.$$

N <- 200
y <- seq(-4, 4, length.out = N)
lambda <- c(0, 0.5, 1, 2, -0.5, -1, -2)
l <- length(lambda)
psi <- sapply(lambda, function(la) car::yjPower(U = y, lambda = la))
matplot(y, psi, type = "l", ylim = c(-4, 4), lwd = 2, lty = 1:l,
ylab = latex2exp::TeX("$y^{(\\lambda)}$"), col = 1:l, las = 1,
main = "Yeo-Johnson transformation")
abline(v = 0, h = 0)
abline(v = -1, h = 0, lty = 2)
legend("bottomright", lty = 1:l, lwd = 2, col = 1:l,
legend = latex2exp::TeX(paste0("$\\lambda = ", lambda, "$")))

Let’s see how to implement both transformations using the car::powerTransform, car::bcPower, and car::yjPower functions.

# Test data

# Predictors
n <- 200
set.seed(121938)
X1 <- rexp(n, rate = 1 / 5) # Non-negative
X2 <- rchisq(n, df = 5) - 3 # Real

# Response of a linear model
epsilon <- rchisq(n, df = 10) - 10 # Centered error, but not normal
Y <- 10 - 0.5 * X1 + X2 + epsilon

# Transformation of non-normal data to achieve normal-like data (no model)

# Optimal lambda for Box-Cox
BC <- car::powerTransform(lm(X1 ~ 1), family = "bcPower") # Maximum-likelihood fit
# Note we use a regression model with no predictors
(lambdaBC <- BC$lambda) # The optimal lambda ## Y1 ## 0.2412419 # lambda < 1, so positive skewness is corrected # Box-Cox transformation X1Transf <- car::bcPower(U = X1, lambda = lambdaBC) # Comparison par(mfrow = c(1, 2)) hist(X1, freq = FALSE, breaks = 10, ylim = c(0, 0.3)) hist(X1Transf, freq = FALSE, breaks = 10, ylim = c(0, 0.3))  # Optimal lambda for Yeo-Johnson YJ <- car::powerTransform(lm(X2 ~ 1), family = "yjPower") (lambdaYJ <- YJ$lambda)
##        Y1
## 0.5850791

# Yeo-Johnson transformation
X2Transf <- car::yjPower(U = X2, lambda = lambdaYJ)

# Comparison
par(mfrow = c(1, 2))
hist(X2, freq = FALSE, breaks = 10, ylim = c(0, 0.3))
hist(X2Transf, freq = FALSE, breaks = 10, ylim = c(0, 0.3))


# Transformation of non-normal response in a linear model

# Optimal lambda for Yeo-Johnson
YJ <- car::powerTransform(lm(Y ~ X1 + X2), family = "yjPower")
(lambdaYJ <- YJ$lambda) ## Y1 ## 0.9160924 # Yeo-Johnson transformation YTransf <- car::yjPower(U = Y, lambda = lambdaYJ) # Comparison for the residuals par(mfrow = c(1, 2)) plot(lm(Y ~ X1 + X2), 2) plot(lm(YTransf ~ X1 + X2), 2) # Slightly better Be aware that using the previous transformations implies modeling the transformed response rather than $$Y.$$ Normality is also possible to patch if it is a consequence of the failure of linearity or homoscedasticity, which translates the problem into fixing those assumptions. ### 3.5.3 Homoscedasticity The constant-variance assumption of the errors is also key for obtaining the inferential results we saw. For example, if the assumption does not hold, then the CIs for prediction will not respect the confidence for which they were built. How to check it Heteroskedasticity can be detected by looking into irregular vertical dispersion patterns in the residuals vs. fitted values plot. However, it is simpler to use the scale-location plot, where the standardized residuals are transformed by a square root of its absolute value, and inspect the deviations in the positive axis. plot(mod, 3) Under homoscedasticity, we expect the red line, a flexible fit of the mean of the transformed residuals, to be almost constant about $$1$$. If there are clear non-constant patterns, then there is evidence of heteroskedasticity. There are formal tests to check the null hypothesis of homoscedasticity in our residuals. For example, the Breusch–Pagan test implemented in car::ncvTest: # Breusch-Pagan test car::ncvTest(mod) ## Non-constant Variance Score Test ## Variance formula: ~ fitted.values ## Chisquare = 0.3254683, Df = 1, p = 0.56834 # We do not reject homoscedasticity The Breusch–Pagan test checks homoscedasticity against a non-constant variance in the residuals that is linearly increasing with respect to the predictors. This means that the test can be fooled by a nonlinear pattern in the variance of the residuals that results in a flat linear fit (e.g., a quadratic pattern). It is advised then to check the scale-location plot if performing the Breusch–Pagan test, in order to identify evident non-constant variances hidden behind tricky nonlinearities. The next code illustrates the previous warning with two examples. # Heteroskedastic models set.seed(123456) x <- rnorm(100) y1 <- 1 + 2 * x + rnorm(100, sd = x^2) y2 <- 1 + 2 * x + rnorm(100, sd = 1 + x * (x > 0)) modHet1 <- lm(y1 ~ x) modHet2 <- lm(y2 ~ x) # Heteroskedasticity not detected car::ncvTest(modHet1) ## Non-constant Variance Score Test ## Variance formula: ~ fitted.values ## Chisquare = 2.938652e-05, Df = 1, p = 0.99567 plot(modHet1, 3) # Heteroskedasticity correctly detected car::ncvTest(modHet2) ## Non-constant Variance Score Test ## Variance formula: ~ fitted.values ## Chisquare = 41.03562, Df = 1, p = 1.4948e-10 plot(modHet2, 3) What to do if it fails Using a nonlinear transformation for the response $$Y$$ may help to control the variance. Typical choices are $$\log(Y)$$ and $$\sqrt{Y},$$ which reduce the scale of the larger responses and leads to a reduction of heteroskedasticity. Keep in mind that these transformations, as the Box–Cox transformations, are designed for non-negative $$Y.$$ The Yeo–Johnson transformation can be used instead if $$Y$$ is real. Let’s see some examples. # Artificial data with heteroskedasticity set.seed(12345) X <- rchisq(500, df = 3) e <- rnorm(500, sd = sqrt(0.1 + 2 * X)) Y <- 1 + X + e # Original plot(lm(Y ~ X), 3) # Very heteroskedastic # Transformed plot(lm(I(log(abs(Y))) ~ X), 3) # Much less hereroskedastic, but at the price # of losing the signs in Y... # Shifted and transformed delta <- 1 # This is tunable m <- -min(Y) + delta plot(lm(I(log(Y + m)) ~ X), 3) # No signs loss # Transformed by Yeo-Johnson # Optimal lambda for Yeo-Johnson YJ <- car::powerTransform(lm(Y ~ X), family = "yjPower") (lambdaYJ <- YJ$lambda)
##        Y1
## 0.6932053

# Yeo-Johnson transformation
YTransf <- car::yjPower(U = Y, lambda = lambdaYJ)
plot(lm(YTransf ~ X), 3) # Slightly less hereroskedastic

### 3.5.4 Independence

Independence is also a key assumption: it guarantees that the amount of information that we have on the relationship between $$Y$$ and $$X_1,\ldots,X_p,$$ from $$n$$ observations, is maximal. If there is dependence, then information is repeated, and as a consequence the variability of the estimates will be larger. For example, $$95\%$$ CIs built under the assumption of independence will be shorter than the adequate, meaning that they will not contain with a $$95\%$$ confidence the unknown parameter, but with a lower confidence (say $$80\%$$).

An extreme case is the following: suppose we have two samples of sizes $$n$$ and $$2n,$$ where the $$2n$$-sample contains the elements of the $$n$$-sample twice. The information in both samples is the same, and so are the estimates for the coefficients $$\boldsymbol{\beta}.$$ Yet, in the $$2n$$-sample the length of the confidence intervals is $$C(2n)^{-1/2},$$ whereas in the $$n$$-sample they have length $$Cn^{-1/2}.$$ A reduction by a factor of $$\sqrt{2}$$ in the confidence interval has happened, but we have the same information! This will give us a wrong sense of confidence in our model, and the root of the evil is that the information that is actually in the sample is smaller due to dependence.

How to check it

The set of possible dependence structures on the residuals is immense, and there is no straightforward way of checking all of them. Usually what it is examined is the presence of autocorrelation, which appears when there is some kind of serial dependence in the measurement of observations. The serial plot of the residuals allows to detect time trends in them:

plot(mod$residuals, type = "o") Under uncorrelation, we expect the series to show no tracking of the residuals. That is, that the closer observations do not take similar values, but rather change without any kind of distinguishable pattern. Tracking is associated to positive autocorrelation, but negative autocorrelation, manifested as alternating small-large or positive-negative residuals, is also possible. The lagged plots of $$(\hat\varepsilon_{i-\ell},\hat\varepsilon_i),$$ $$i=\ell+1,\ldots,n,$$ obtained through lag.plot, allow us to detect both kinds of autocorrelations for a given lag $$\ell.$$ Under independence, we expect no trend in such plot. Here is an example: lag.plot(mod$residuals, lags = 1, do.lines = FALSE)

# No serious serial trend, but some negative autocorrelation is appreciated
cor(mod$residuals[-1], mod$residuals[-length(modresiduals)]) ## [1] -0.4267776 There are also formal tests for testing for the absence of autocorrelation, such as the Durbin–Watson test implemented in car::durbinWatsonTest: # Durbin-Watson test car::durbinWatsonTest(mod) ## lag Autocorrelation D-W Statistic p-value ## 1 -0.4160168 2.787261 0.054 ## Alternative hypothesis: rho != 0 # Does not reject at alpha = 0.05 What to do if it fails Little can be done if there is dependence in the data, once this has been collected. If the dependence is temporal, we must rely on the family of statistical models meant to deal with serial dependence: time series. Other kinds of dependence such as spatial dependence, spatio-temporal dependence, geometrically-driven dependencies, censorship, truncation, etc. need to be analyzed with a different set of tools to the ones covered in these notes. However, there is a simple trick worth mentioning to deal with serial dependence. If the observations of the response $$Y,$$ say $$Y_1,Y_2,\ldots,Y_n,$$ present serial dependence, a differentiation of the sample that yields $$Y_1-Y_2,Y_2-Y_3,\ldots,Y_{n-1}-Y_n$$ may lead to independent observations. These are called the innovations of the series of $$Y.$$ Load the dataset assumptions3D.RData and compute the regressions y.3 ~ x1.3 + x2.3, y.4 ~ x1.4 + x2.4, y.5 ~ x1.5 + x2.5, and y.8 ~ x1.8 + x2.8. Use the presented diagnostic tools to test the assumptions of the linear model and look out for possible problems. ### 3.5.5 Multicollinearity A common problem that arises in multiple linear regression is multicollinearity. This is the situation when two or more predictors are highly linearly related between them. Multicollinearitiy has important effects on the fit of the model: • It reduces the precision of the estimates. As a consequence, the signs of fitted coefficients may be even reversed and valuable predictors may appear as non-significant. In limit cases, numerical instabilities may appear since $$\mathbf{X}'\mathbf{X}$$ will be almost singular. • It is difficult to determine how each of the highly-related predictors affects the response, since one masks the other. Intuitively, multicollinearity can be visualized as a card (fitting plane) that is hold on its opposite corners and that spins on its diagonal, where the data is concentrated. Then, very different planes will fit the data almost equally well, which results in a large variability of the optimal plane. An approach to detect multicollinearity is to inspect the correlation matrix between the predictors. # Numerically round(cor(wine), 2) ## Year Price WinterRain AGST HarvestRain Age FrancePop ## Year 1.00 -0.46 0.05 -0.29 -0.06 -1.00 0.99 ## Price -0.46 1.00 0.13 0.67 -0.51 0.46 -0.48 ## WinterRain 0.05 0.13 1.00 -0.32 -0.27 -0.05 0.03 ## AGST -0.29 0.67 -0.32 1.00 -0.03 0.29 -0.30 ## HarvestRain -0.06 -0.51 -0.27 -0.03 1.00 0.06 -0.03 ## Age -1.00 0.46 -0.05 0.29 0.06 1.00 -0.99 ## FrancePop 0.99 -0.48 0.03 -0.30 -0.03 -0.99 1.00 # Graphically corrplot::corrplot(cor(wine), addCoef.col = "grey") Here we can see what we already knew from Section 2.1: Age and Year are perfectly linearly related and Age and FrancePop are highly linearly related. Then one approach will be to directly remove one of the highly-correlated predictors. However, it is not enough to inspect pairwise correlations in order to get rid of multicollinearity. Indeed, it is possible to build counterexamples that show non-suspicious pairwise correlations but problematic complex linear relations that remain hidden. For the sake of illustration, here is one: # Create predictors with multicollinearity: x4 depends on the rest set.seed(45678) x1 <- rnorm(100) x2 <- 0.5 * x1 + rnorm(100) x3 <- 0.5 * x2 + rnorm(100) x4 <- -x1 + x2 + rnorm(100, sd = 0.25) # Response y <- 1 + 0.5 * x1 + 2 * x2 - 3 * x3 - x4 + rnorm(100) data <- data.frame(x1 = x1, x2 = x2, x3 = x3, x4 = x4, y = y) # Correlations -- none seems suspicious corrplot::corrplot(cor(data), addCoef.col = "grey") A better approach to detect multicollinearity is to compute the Variance Inflation Factor (VIF) of each fitted coefficient $$\hat\beta_j.$$ This is a measure of how linearly dependent is $$X_j$$ on the rest of the predictors78 and it is defined as \begin{align*} \text{VIF}(\hat\beta_j)=\frac{1}{1-R^2_{X_j|X_{-j}}}, \end{align*} where $$R^2_{X_j|X_{-j}}$$ represents the $$R^2$$ from the regression of $$X_j$$ onto the remaining predictors $$X_1,\ldots,X_{j-1},X_{j+1}\ldots,X_p.$$ Clearly, $$\text{VIF}(\hat\beta_j)\geq 1.$$ The next simple rule of thumb gives direct insight into which predictors are multicollinear: • VIF close to $$1$$: absence of multicollinearity. • VIF larger than $$5$$ or $$10$$: problematic amount of multicollinearity. Advised to remove the predictor with largest VIF. VIF is computed with the car::vif function, which takes as an argument a linear model. Let’s see how it works in the previous example with hidden multicollinearity. # Abnormal variance inflation factors: largest for x4, we remove it modMultiCo <- lm(y ~ x1 + x2 + x3 + x4) car::vif(modMultiCo) ## x1 x2 x3 x4 ## 26.361444 29.726498 1.416156 33.293983 # Without x4 modClean <- lm(y ~ x1 + x2 + x3) # Comparison car::compareCoefs(modMultiCo, modClean) ## Calls: ## 1: lm(formula = y ~ x1 + x2 + x3 + x4) ## 2: lm(formula = y ~ x1 + x2 + x3) ## ## Model 1 Model 2 ## (Intercept) 1.062 1.058 ## SE 0.103 0.103 ## ## x1 0.922 1.450 ## SE 0.551 0.116 ## ## x2 1.640 1.119 ## SE 0.546 0.124 ## ## x3 -3.165 -3.145 ## SE 0.109 0.107 ## ## x4 -0.529 ## SE 0.541 ## confint(modMultiCo) ## 2.5 % 97.5 % ## (Intercept) 0.8568419 1.2674705 ## x1 -0.1719777 2.0167093 ## x2 0.5556394 2.7240952 ## x3 -3.3806727 -2.9496676 ## x4 -1.6030032 0.5446479 confint(modClean) ## 2.5 % 97.5 % ## (Intercept) 0.8526681 1.262753 ## x1 1.2188737 1.680188 ## x2 0.8739264 1.364981 ## x3 -3.3564513 -2.933473 # Summaries summary(modMultiCo) ## ## Call: ## lm(formula = y ~ x1 + x2 + x3 + x4) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.9762 -0.6663 0.1195 0.6217 2.5568 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.0622 0.1034 10.270 < 2e-16 *** ## x1 0.9224 0.5512 1.673 0.09756 . ## x2 1.6399 0.5461 3.003 0.00342 ** ## x3 -3.1652 0.1086 -29.158 < 2e-16 *** ## x4 -0.5292 0.5409 -0.978 0.33040 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.028 on 95 degrees of freedom ## Multiple R-squared: 0.9144, Adjusted R-squared: 0.9108 ## F-statistic: 253.7 on 4 and 95 DF, p-value: < 2.2e-16 summary(modClean) ## ## Call: ## lm(formula = y ~ x1 + x2 + x3) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.91297 -0.66622 0.07889 0.65819 2.62737 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.0577 0.1033 10.24 < 2e-16 *** ## x1 1.4495 0.1162 12.47 < 2e-16 *** ## x2 1.1195 0.1237 9.05 1.63e-14 *** ## x3 -3.1450 0.1065 -29.52 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.028 on 96 degrees of freedom ## Multiple R-squared: 0.9135, Adjusted R-squared: 0.9108 ## F-statistic: 338 on 3 and 96 DF, p-value: < 2.2e-16 # Variance inflation factors are normal car::vif(modClean) ## x1 x2 x3 ## 1.171942 1.525501 1.364878 Note that multicollinearity is another instance of the model correctness vs. usefulness. A model with multicollinearity might be perfectly valid in the sense of respecting the assumptions of the model. As we saw in Section 2.3, it does not matter whether the predictors are related or not, at least for the verification of the assumptions. But the model will be useless if the multicollinearity is high, since it can inflate the variability of the estimation without any kind of bound. In the extreme case in which the multicollinearity is perfect, then the model will not be identifiable, despite being correct. The next simulation study illustrates how the non-indentifiability of $$\boldsymbol\beta$$ worsens with increasing multicollinearity, and therefore so does the error on estimating $$\boldsymbol\beta.$$ We consider the linear model \begin{align*} Y=X_1+X_2+X_3+\varepsilon,\quad \varepsilon\sim\mathcal{N}(0, 1), \end{align*} where \begin{align*} X_1,X_2\sim\mathcal{N}(0,1),\quad X_3=b(X_1+X_2)+\delta,\quad \delta\sim \mathcal{N}(0, 0.25) \end{align*} and $$b\geq0$$ controls the degree of multicollinearity of $$X_3$$ with $$X_1$$ and $$X_2$$ (if $$b=0,$$ $$X_3$$ is uncorrelated with $$X_1$$ and $$X_2$$). We aim to study how $$\mathbb{E}[\|\hat{\boldsymbol{\beta}}-\boldsymbol{\beta}\|]$$ behaves when $$n$$ grows, for different values of $$b.$$ To do so, we consider $$n=2^\ell,$$ $$\ell=3,\ldots,12,$$ $$b=0,0.5,1,5,10,$$ and approximate the expected error by Monte Carlo using $$M=500$$ replicates. The results of the simulation study are shown in Figure 3.30, which illustrates that multicollinearity notably decreases the accuracy of $$\hat{\boldsymbol{\beta}},$$ at all sample sizes $$n.$$ Replicate Figure 3.30 by implementing the simulation study behind it. ### 3.5.6 Outliers and high-leverage points Outliers and high-leverage points are particular observations that have an important impact in the final linear model, either on the estimates or on the properties of the model. They are defined as follows: • Outliers are the observations with a response $$Y_i$$ far away from the regression plane. They typically do not affect the estimate of the plane, unless one of the predictors is also extreme (see next point). But they inflate $$\hat\sigma$$ and, as a consequence, they draw down the $$R^2$$ of the model and expand the CIs. • High-leverage points are observations with an extreme predictor $$X_{ij}$$ located far away from the rest of points. These observations are highly influential and may drive the linear model fit. The reason is the squared distance in the RSS: an individual extreme point may represent a large portion of the RSS. Both outliers and high-leverage points can be identified with the residuals vs. leverage plot: plot(mod, 5) The rules of thumb for declaring outliers and high-leverage points are: • If the standardized residual of an observation is larger than $$3$$ in absolute value, then it may be an outlier. • If the leverage statistic $$h_i$$ is greatly exceeding $$(p+1)/n\;$$79 (see below), then the $$i$$-th observation may be suspected of having a high leverage. Let’s see an artificial example. # Create data set.seed(12345) x <- rnorm(100) e <- rnorm(100, sd = 0.5) y <- 1 + 2 * x + e # Leverage expected value 2 / 101 # (p + 1) / n ## [1] 0.01980198 # Base model m0 <- lm(y ~ x) plot(x, y) abline(coef = m0coefficients, col = 2)

plot(m0, 5)

summary(m0)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.10174 -0.30139 -0.00557  0.30949  1.30485
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.01103    0.05176   19.53   <2e-16 ***
## x            2.04727    0.04557   44.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5054 on 98 degrees of freedom
## Multiple R-squared:  0.9537, Adjusted R-squared:  0.9532
## F-statistic:  2018 on 1 and 98 DF,  p-value: < 2.2e-16

# Make an outlier
x[101] <- 0; y[101] <- 30
m1 <- lm(y ~ x)
plot(x, y)
abline(coef = m1$coefficients, col = 2) plot(m1, 5) summary(m1) ## ## Call: ## lm(formula = y ~ x) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.3676 -0.5730 -0.2955 0.0941 28.6881 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.3119 0.2997 4.377 2.98e-05 *** ## x 1.9901 0.2652 7.505 2.71e-11 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.942 on 99 degrees of freedom ## Multiple R-squared: 0.3627, Adjusted R-squared: 0.3562 ## F-statistic: 56.33 on 1 and 99 DF, p-value: 2.708e-11 # Make a high-leverage point x[101] <- 10; y[101] <- 5 m2 <- lm(y ~ x) plot(x, y) abline(coef = m2$coefficients, col = 2)

plot(m2, 5)

summary(m2)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.2423 -0.6126  0.0373  0.7864  2.1652
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.09830    0.13676   8.031 2.06e-12 ***
## x            1.31440    0.09082  14.473  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.339 on 99 degrees of freedom
## Multiple R-squared:  0.6791, Adjusted R-squared:  0.6758
## F-statistic: 209.5 on 1 and 99 DF,  p-value: < 2.2e-16

The leverage statistic associated to the $$i$$-th datum corresponds to the $$i$$-th diagonal entry of the hat matrix $$\mathbf{H}$$:

\begin{align*} h_{i}:=\mathbf{H}_{ii}=(\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}')_{ii}. \end{align*}

It can be seen that $$\frac{1}{n}\leq h_i\leq 1$$ and that the mean $$\bar{h}=\frac{1}{n}\sum_{i=1}^nh_i=\frac{p+1}{n}.$$ This can be clearly seen in the case of simple linear regression, where the leverage statistic has the explicit form

\begin{align*} h_{i}=\frac{1}{n}+\frac{(X_i-\bar{X})^2}{\sum_{j=1}^n(X_j-\bar{X})^2}. \end{align*}

Interestingly, this expression shows that the leverage statistic is directly dependent on the distance to the center of the predictor. A precise threshold for determining when $$h_i$$ is greatly exceeding measure its expected value $$\bar{h}$$ can be given if the predictors are assumed to be jointly normal. In this case, $$nh_i-1\sim\chi^2_p$$ and hence the $$i$$-th point is declared as a potential high-leverage point if $$h_i>\frac{\chi^2_{p;\alpha}+1}{n},$$ where $$\chi^2_{p;\alpha}$$ is the $$\alpha$$-upper quantile of the $$\chi^2_p$$ distribution ($$\alpha$$ can be taken as $$0.05$$ or $$0.01$$).

The functions influence, hat, and rstandard allow performing a finer inspection of the leverage statistics.

# Access leverage statistics
head(influence(model = m2, do.coef = FALSE)\$hat)
##          1          2          3          4          5          6
## 0.01017449 0.01052333 0.01083766 0.01281244 0.01022209 0.03137304

# Another option
h <- hat(x = x)

# 1% most influential points
n <- length(x)
p <- 1
hist(h, breaks = 20)
abline(v = (qchisq(0.99, df = p) + 1) / n, col = 2)


# Standardized residuals
rs <- rstandard(m2)
plot(m2, 2) # QQ-plot
points(qnorm(ppoints(n = n)), sort(rs), col = 2, pch = '+') # Manually computed

### 3.5.7 Case study application

Moore’s law is an empirical law that states that the power of a computer doubles approximately every two years. Translated into a mathematical formula, Moore’s law is

\begin{align*} \text{transistors}\approx 2^{\text{years}/2}. \end{align*}

Applying logarithms to both sides gives

\begin{align*} \log(\text{transistors})\approx \frac{\log(2)}{2}\text{years}. \end{align*}

We can write the above formula more generally as

\begin{align*} \log(\text{transistors})=\beta_0+\beta_1 \text{years}+\varepsilon, \end{align*}

where $$\varepsilon$$ is a random error. This is a linear model!

The dataset cpus.txt (source, retrieved in September 2016) contains the transistor counts for the CPUs appeared in the time range 1971–2015. For this data, do the following:

1. Import conveniently the data and name it as cpus.
2. Show a scatterplot of Transistor.count vs. Date.of.introduction with a linear regression.
3. Are the assumptions verified in Transistor.count ~ Date.of.introduction? Which ones are more “problematic”?
4. Create a new variable, named Log.Transistor.count, containing the logarithm of Transistor.count.
5. Show a scatterplot of Log.Transistor.count vs. Date.of.introduction with a linear regression.
6. Are the assumptions verified in Log.Transistor.count ~ Date.of.introduction? Which ones are which are more “problematic”?
7. Regress Log.Transistor.count ~ Date.of.introduction.
8. Summarize the fit. What are the estimates $$\hat\beta_0$$ and $$\hat\beta_1$$? Is $$\hat\beta_1$$ close to $$\frac{\log(2)}{2}$$?
9. Compute the CI for $$\beta_1$$ at $$\alpha=0.05.$$ Is $$\frac{\log(2)}{2}$$ inside it? What happens at levels $$\alpha=0.10,0.01$$?
10. We want to forecast the average log-number of transistors for the CPUs to be released in 2024. Compute the adequate prediction and CI.
11. A new CPU design is expected for 2025. What is the range of log-number of transistors expected for it, at a $$95\%$$ level of confidence?
12. Compute the ANOVA table for Log.Transistor.count ~ Date.of.introduction. Is $$\beta_1$$ significant?

The dataset gpus.txt (source, retrieved in September 2016) contains the transistor counts for the GPUs appeared in the period 1997–2016. Repeat the previous analysis for this dataset.

Analyze the bias induced by transforming the response, fitting a linear model, and then untransforming. For that, consider the relation \begin{align} Y=\exp(\beta_0+\beta_1X+\varepsilon),\tag{3.8} \end{align} where $$X\sim\mathcal{N}(0,0.5)$$ and $$\varepsilon\sim\mathcal{N}(0,1)$$. Set $$\beta_0=-1$$ and $$\beta_1=1$$. Then, follow these steps:

1. Compute the exact conditional distribution $$Y|X=x.$$ Hint: use the lognormal distribution.
2. Based on Step 1, compute the true regression function $$m(x)=\mathbb{E}[Y|X=x]$$ exactly and plot it.
3. Simulate a sample $$\{(X_i,Y_i)\}_{i=1}^n$$ from (3.8), with $$n=200.$$
4. Define $$W:=\log(Y)$$ and fit the linear model $$W=\beta_0+$$ $$\beta_1X+\varepsilon$$ from the previous sample. Are $$(\beta_0,\beta_1)$$ properly estimated?
5. Estimate $$\mathbb{E}[W|X=0]$$ and $$\mathbb{E}[W|X=1]$$ by $$\hat w_0=\hat\beta_0$$ and $$\hat w_1=\hat\beta_0+\hat\beta_1,$$ respectively.
6. Compute $$\hat y_0=\exp(\hat w_0)$$ and $$\hat y_1=\exp(\hat w_1),$$ the predictions with the transformed model of $$\mathbb{E}[Y|X=0]$$ and $$\mathbb{E}[Y|X=1].$$ Compare the estimates with $$m(0)$$ and $$m(1).$$
7. Plot the fitted curve $$\hat y = \exp(\hat\beta_0+\hat\beta_1x)$$ and compare with $$m$$ in Step 2.
8. If $$n$$ is increased in Step 3, does the estimation in Step 6 improve? Summarize your takeaways from the exercise.

### References

Box, G. E. P., and D. R. Cox. 1964. “An Analysis of Transformations.” Journal of the Royal Statistical Society, Series B (Methodological) 26 (2): 211–52. https://doi.org/10.1111/j.2517-6161.1964.tb00553.x.
Moore, G. E. 1965. “Cramming More Components onto Integrated Circuits.” Electronics 38 (8): 114–17. https://doi.org/10.1109/JPROC.1998.658762.
Peña, D. 2002. Regresión y Diseño de Experimentos. Madrid: Alianza Editorial. https://www.alianzaeditorial.es/libro/manuales/regresion-y-diseno-de-experimentos-daniel-pena-9788420693897/.
Yeo, I.-K., and R. A. Johnson. 2000. “A New Family of Power Transformations to Improve Normality or Symmetry.” Biometrika 87 (4): 954–59. https://doi.org/10.1093/biomet/87.4.954.

1. If $$p=1,$$ then it is possible to inspect the scatterplot of the $$\{(X_{i1},Y_i)\}_{i=1}^n$$ in order to determine whether linearity is plausible. But the usefulness of this graphical check quickly decays with the dimension $$p,$$ as $$p$$ scatterplots need to be investigated. That is precisely the key point for relying in the residuals vs. fitted values plot.↩︎

2. Assume that $$f(Y)=\beta_0+\beta_1X+\varepsilon,$$ with $$f$$ nonlinear and invertible. Clearly, $$W=f(Y)$$ satisfies $$W=\beta_0+\beta_1X+\varepsilon,$$ and we can estimate $$\mathbb{E}[W|X=x]$$ with $$\hat{w}=\hat{\beta}_0+\hat{\beta}_1x$$ unbiasedly, i.e., $$\mathbb{E}[\hat{w}|X_1,\ldots,X_n]=\mathbb{E}[W|X=x].$$ Define $$\hat{y}_f:=f^{-1}(\hat{w}).$$ Because of the delta method (see next footnote), $$\mathbb{E}[\hat{y}_f|X_1,\ldots,X_n]$$ is asymptotically equal to $$f^{-1}(\mathbb{E}[W|X=x]).$$ However, $$f^{-1}(\mathbb{E}[W|X=x])\neq (\mathbb{E}[f^{-1}(W)|X=x])=\mathbb{E}[Y|X=x],$$ so $$\hat{y}_f$$ is a biased estimator of $$\mathbb{E}[Y|X=x].$$ Indeed, if $$f$$ is strictly convex (e.g., $$f=\log$$), then $$f^{-1}$$ is strictly concave, and Jensen’s inequality guarantees that $$f^{-1}(\mathbb{E}[W|X=x])\geq \mathbb{E}[f^{-1}(W)|X=x],$$ that is, $$\hat{y}_f$$ underestimates $$\mathbb{E}[Y|X=x].$$↩︎

3. Delta method. Let $$\hat{\theta}$$ be an estimator of $$\theta$$ such that $$\sqrt{n}(\hat{\theta}-\theta)$$ is asymptotically a $$\mathcal{N}(0,\sigma^2)$$ when $$n\to\infty.$$ If $$g$$ is a real-valued function such that $$g'(\theta)\neq0$$ exists, then $$\sqrt{n}(g(\hat{\theta})-g(\theta))$$ is asymptotically a $$\mathcal{N}(0,\sigma^2(g'(\theta))^2).$$↩︎

4. Given that removing the bias introduced by considering $$\hat{Y}_f=f^{-1}(\hat{W})$$ is challenging.↩︎

5. Recall that $$\hat\beta_j=\mathbf{e}_j'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}=:\sum_{i=1}^nw_{ij}Y_i,$$ where $$\mathbf{e}_j$$ is the canonical vector of $$\mathbb{R}^{p+1}$$ with $$1$$ in the $$j$$-th position and $$0$$ elsewhere. Therefore, $$\hat\beta_j$$ is a weighted sum of the random variables $$Y_1,\ldots,Y_n$$ (recall that we assume that $$\mathbf{X}$$ is given and therefore is deterministic). Even if $$Y_1,\ldots,Y_n$$ are not normal, the central limit theorem entails that $$\sqrt{n}(\hat\beta_j-\beta_j)$$ is asymptotically normally distributed when $$n\to\infty,$$ provided that linearity holds.↩︎

6. Distributions that are not heavy tailed, not heavily multimodal, and not heavily skewed.↩︎

7. For $$X\sim F,$$ the $$p$$-th quantile $$x_p=F^{-1}(p)$$ of $$X$$ is estimated through the sample quantile $$\hat{x}_p:=F_n^{(-1)}(p),$$ where $$F_n(x)=\frac{1}{n}\sum_{i=1}^n1_{\{X_i\leq x\}}$$ is the empirical cdf of $$X_1,\ldots,X_n$$ and $$F_n^{(-1)}$$ is the generalized inverse function of $$F_n.$$ If $$X\sim f$$ is continuous, then $$\sqrt{n}(\hat{x}_p-x_p)$$ is asymptotically a $$\mathcal{N}\left(0,\frac{p(1-p)}{f(x_p)^2}\right).$$ Therefore, the variance of $$\hat x_p$$ grows if $$p\to0$$ or $$p\to1,$$ hence more variability is expected on the extremes of the QQ-plot (see Figure 3.20).↩︎

8. This is the Kolmogorov–Smirnov test shown in Section A.1 but adapted to testing the normality of the data with unknown mean and variance. More precisely, the test tests the composite null hypothesis $$H_0:F=\Phi(\cdot;\mu,\sigma^2)$$ with $$\mu$$ and $$\sigma^2$$ unknown. Note that this is different from the simple null hypothesis $$H_0:F=F_0$$ of the Kolmogorov–Smirnov test in which $$F_0$$ is completely specified. Further tests of normality can be derived by adapting other tests for the simple null hypothesis $$H_0:F=F_0,$$ such as the Cramér–von Mises and the Anderson–Darling tests, and these are implemented in the functions nortest::cvm.test and nortest::ad.test.↩︎

9. Precisely, if $$\lambda<1,$$ positive skewness (or skewness to the right) is palliated (large values of $$Y$$ shrink, small values grow), whereas if $$\lambda>1$$ negative skewness (or skewness to the left) is corrected (large values of $$Y$$ grow, small values shrink).↩︎

10. Maximum likelihood for a $$\mathcal{N}(\mu,\sigma^2)$$ and potentially using the linear model structure if we are performing the transformation to achieve normality in errors of the linear model. Recall that optimally transforming $$Y$$ such that $$Y$$ is normal-like or $$Y|(X_1,\ldots,X_p)$$ is normal-like (the assumption in the linear model) are very different goals!↩︎

11. And therefore more informative on the linear dependence of the predictors than the correlations of $$X_j$$ with each of the remaining predictors.↩︎

12. This is the expected value for the leverage statistic $$h_i$$ if the linear model holds. However, the distribution of $$h_i$$ depends on the joint distribution of the predictors $$(X_1,\ldots,X_p).$$↩︎