## 2.7 Model fit

### 2.7.1 The $$R^2$$

The coefficient of determination $$R^2$$ is closely related to the ANOVA decomposition. It is defined as

\begin{align} R^2:=\frac{\text{SSR}}{\text{SST}}=\frac{\text{SSR}}{\text{SSR}+\text{SSE}}=\frac{\text{SSR}}{\text{SSR}+(n-p-1)\hat\sigma^2}. \tag{2.21} \end{align}

The $$R^2$$ measures the proportion of variation of the response variable $$Y$$ that is explained by the predictors $$X_1,\ldots,X_p$$ through the regression. Intuitively, $$R^2$$ measures the tightness of the data cloud around the regression plane. Check in Figure 2.17 how the value of $$\sigma^2$$⁴⁵ affects the $$R^2.$$

The sample correlation coefficient is intimately related to the $$R^2.$$ For example, if $$p=1,$$ then it can be seen (exercise below) that $$R^2=r^2_{xy}.$$ More importantly, $$R^2=r^2_{y\hat y}$$ for any $$p;$$ that is, the square of the sample correlation coefficient between $$Y_1,\ldots,Y_n$$ and $$\hat Y_1,\ldots,\hat Y_n$$ is $$R^2,$$ a fact that is not immediately evident. Let’s check this fact when $$p=1$$ by relying on $$R^2=r^2_{xy}.$$ First, by the form of $$\hat\beta_0$$ given in (2.3),

\begin{align} \hat Y_i&=\hat\beta_0+\hat\beta_1X_i\nonumber\\ &=(\bar Y-\hat\beta_1\bar X)+\hat\beta_1X_i\nonumber\\ &=\bar Y+\hat\beta_1(X_i-\bar X). \tag{2.22} \end{align}

Then, replacing (2.22) in $$r^2_{y\hat y}$$ and simplifying:

\begin{align*} r^2_{y\hat y}&=\frac{s_{y\hat y}^2}{s_y^2s_{\hat y}^2}\\ &=\frac{\left(\sum_{i=1}^n \big(Y_i-\bar Y \big)\big(\hat Y_i-\bar Y \big)\right)^2}{\sum_{i=1}^n \big(Y_i-\bar Y \big)^2\sum_{i=1}^n \big(\hat Y_i-\bar Y \big)^2}\\ &=\frac{\left(\sum_{i=1}^n \big(Y_i-\bar Y \big)\big(\bar Y+\hat\beta_1(X_i-\bar X)-\bar Y \big)\right)^2}{\sum_{i=1}^n \big(Y_i-\bar Y \big)^2\sum_{i=1}^n \big(\bar Y+\hat\beta_1(X_i-\bar X)-\bar Y \big)^2}\\ &=\frac{\hat\beta_1^2\left(\sum_{i=1}^n \big(Y_i-\bar Y \big)\big(X_i-\bar X\big)\right)^2}{\hat\beta_1^2\sum_{i=1}^n \big(Y_i-\bar Y \big)^2\sum_{i=1}^n \big(X_i-\bar X\big)^2}\\ &=r^2_{xy}. \end{align*}

As a consequence, $$r^2_{y\hat y}=r^2_{xy}=R^2$$ when $$p=1.$$
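These identities are easy to check numerically. Below is a minimal sketch (with simulated data and an arbitrary seed, so the numbers are merely illustrative) verifying that $$R^2=r^2_{xy}=r^2_{y\hat y}$$ when $$p=1,$$ and that $$R^2=r^2_{y\hat y}$$ also holds for $$p=2$$:

# Numerical check of R^2 = r_xy^2 = r_yhaty^2 for p = 1
set.seed(12345)
x <- rnorm(100)
y <- -1 + 0.5 * x + rnorm(100)
mod <- lm(y ~ x)
summary(mod)$r.squared       # R^2
cor(x, y)^2                  # r_xy^2
cor(y, mod$fitted.values)^2  # r_yhaty^2

# For p = 2, R^2 = r_yhaty^2 still holds
x2 <- rnorm(100)
mod2 <- lm(y ~ x + x2)
summary(mod2)$r.squared
cor(y, mod2$fitted.values)^2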

Show that $$R^2=r^2_{xy}$$ when $$p=1.$$ Hint: start from the definition of $$R^2$$ and use (2.3) to arrive at $$r^2_{xy}.$$

Trusting the $$R^2$$ blindly can lead to catastrophic conclusions. Here are a couple of counterexamples of linear regressions performed on data that clearly does not satisfy the assumptions discussed in Section 2.3 but that, despite this, produce linear models with large $$R^2$$’s. These counterexamples emphasize that inference, built on the validity of the model assumptions⁴⁶, will be problematic if these assumptions are violated, no matter the value of $$R^2.$$ For example, recall how biased the predictions and their associated CIs will be at $$x=0.35$$ and $$x=0.65.$$

# Simple linear model

# Create data that:
# 1) does not follow a linear model
# 2) the error is heteroskedastic
x <- seq(0.15, 1, l = 100)
set.seed(123456)
eps <- rnorm(n = 100, sd = 0.25 * x^2)
y <- 1 - 2 * x * (1 + 0.25 * sin(4 * pi * x)) + eps

# Great R^2!?
reg <- lm(y ~ x)
summary(reg)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.53525 -0.18020  0.02811  0.16882  0.46896
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.87190    0.05860   14.88   <2e-16 ***
## x           -1.69268    0.09359  -18.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.232 on 98 degrees of freedom
## Multiple R-squared:  0.7695, Adjusted R-squared:  0.7671
## F-statistic: 327.1 on 1 and 98 DF,  p-value: < 2.2e-16

# scatterplot is a quick alternative to
# plot(x, y)
# abline(coef = reg$coef, col = 3)

# But prediction is obviously problematic
car::scatterplot(y ~ x, col = 1, regLine = list(col = 2), smooth = FALSE)

# Multiple linear model

# Create data that:
# 1) does not follow a linear model
# 2) the error is heteroskedastic
x1 <- seq(0.15, 1, l = 100)
set.seed(123456)
x2 <- runif(100, -3, 3)
eps <- rnorm(n = 100, sd = 0.25 * x1^2)
y <- 1 - 3 * x1 * (1 + 0.25 * sin(4 * pi * x1)) + 0.25 * cos(x2) + eps

# Great R^2!?
reg <- lm(y ~ x1 + x2)
summary(reg)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.78737 -0.20946  0.01031  0.19652  1.05351
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.788812   0.096418   8.181  1.1e-12 ***
## x1          -2.540073   0.154876 -16.401  < 2e-16 ***
## x2           0.002283   0.020954   0.109    0.913
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3754 on 97 degrees of freedom
## Multiple R-squared:  0.744,  Adjusted R-squared:  0.7388
## F-statistic: 141 on 2 and 97 DF,  p-value: < 2.2e-16

We can visualize the fit of the latter multiple linear model, since we are in $$p=2.$$

# But prediction is obviously problematic
car::scatter3d(y ~ x1 + x2, fit = "linear")
rgl::rglwidget()

The previous counterexamples illustrate that a large $$R^2$$ means nothing in terms of inference if the assumptions of the model do not hold. Remember that:

• $$R^2$$ does not measure the correctness of a linear model but its usefulness, assuming the model is correct.
• $$R^2$$ is the proportion of variance of $$Y$$ explained by $$X_1,\ldots,X_p,$$ but, of course, only when the linear model is correct.

We conclude by pointing out a nice connection between the $$R^2,$$ the ANOVA decomposition, and the least squares estimator $$\hat{\boldsymbol{\beta}}.$$ The ANOVA decomposition gives another view on the least squares estimates: $$\hat{\boldsymbol{\beta}}$$ are the estimated coefficients that maximize the $$R^2$$ (among all the possible estimates we could think about). To see this, recall that

\begin{align*} R^2=\frac{\text{SSR}}{\text{SST}}=\frac{\text{SST} - \text{SSE}}{\text{SST}}=\frac{\text{SST} - \text{RSS}(\hat{\boldsymbol{\beta}})}{\text{SST}}. \end{align*}

Hence, since $$\text{RSS}(\hat{\boldsymbol{\beta}})=\min_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}\text{RSS}(\boldsymbol{\beta}),$$ $$R^2$$ is maximal for $$\hat{\boldsymbol{\beta}}.$$

### 2.7.2 The $$R^2_\text{Adj}$$

As we saw, these are equivalent forms for $$R^2$$:

\begin{align} R^2=&\,\frac{\text{SSR}}{\text{SST}}=\frac{\text{SST}-\text{SSE}}{\text{SST}}=1-\frac{\text{SSE}}{\text{SST}}\nonumber\\ =&\,1-\frac{\hat\sigma^2}{\text{SST}}\times(n-p-1).\tag{2.23} \end{align}

The SSE in the numerator of $$\text{SSE}/\text{SST}$$ always decreases as more predictors are added to the model, even if these are not significant. As a consequence, the $$R^2$$ always increases with $$p.$$ Why is this so? Intuitively, because the complexity, and hence the flexibility, of the model increases when we use more predictors to explain $$Y.$$ Mathematically, because when $$p$$ approaches $$n-1$$ the subtracted term in (2.23) shrinks and, as a consequence, $$R^2$$ grows. The adjusted $$R^2,$$ $$R^2_\text{Adj},$$ is an important quantity specifically designed to overcome this flaw of the $$R^2,$$ which is ubiquitous in multiple linear regression. The purpose is to have a better tool for comparing models, one that does not systematically favor more complex models⁴⁷.
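Before looking at its definition, here is a quick numerical illustration of the flaw that motivates it: adding a predictor that is pure noise still increases the $$R^2.$$ This is a minimal sketch with simulated data; the names and seed are arbitrary.

# R^2 never decreases when a useless predictor is added
set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
noise <- rnorm(50)  # Completely unrelated to y
summary(lm(y ~ x))$r.squared          # R^2 with the relevant predictor
summary(lm(y ~ x + noise))$r.squared  # Slightly larger, never smaller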
The $$R^2_\text{Adj}$$ is defined as

\begin{align} R^2_{\text{Adj}}:=&\,1-\frac{\text{SSE}/(n-p-1)}{\text{SST}/(n-1)}=1-\frac{\text{SSE}}{\text{SST}}\times\frac{n-1}{n-p-1}\nonumber\\ =&\,1-\frac{\hat\sigma^2}{\text{SST}}\times (n-1).\tag{2.24} \end{align}

Explicitly, the $$R^2_\text{Adj}$$ does not depend on $$p$$: (2.23) and (2.24) are identical except for the last factor, which in the latter does not involve $$p.$$ If $$p=1,$$ then $$R^2_\text{Adj}$$ is almost $$R^2$$ (practically identical if $$n$$ is large). Therefore, (2.24) will only increase if $$\hat\sigma^2$$ decreases with $$p;$$ in other words, if the new variables contribute to reducing the variability around the regression plane.

The different behavior of $$R^2$$ and $$R^2_\text{Adj}$$ can be visualized with the following simulation study. Suppose that we generate a random dataset $$\{(X_{i1},X_{i2},Y_i)\}_{i=1}^n,$$ with $$n=200$$ observations of two predictors $$X_1$$ and $$X_2$$ that are distributed as a $$\mathcal{N}(0,1),$$ and a response $$Y$$ generated by the linear model

\begin{align} Y_i=\beta_0+\beta_1X_{i1}+\beta_2X_{i2}+\varepsilon_i,\tag{2.25} \end{align}

where $$\varepsilon_i\sim\mathcal{N}(0,9).$$ To this data, we add $$196$$ “garbage” predictors $$X_j\sim\mathcal{N}(0,1)$$ that are completely independent of $$Y.$$ Therefore, we end up with $$p=198$$ predictors, where only the first two are relevant for explaining $$Y.$$ We now compute the $$R^2(j)$$ and $$R^2_\text{Adj}(j)$$ for the models

\begin{align} Y=\beta_0+\beta_1X_{1}+\cdots+\beta_jX_{j}+\varepsilon,\tag{2.26} \end{align}

with $$j=1,\ldots,p,$$ and we plot them as the curves $$(j,R^2(j))$$ and $$(j,R_\text{Adj}^2(j)).$$ Since $$R^2$$ and $$R^2_\text{Adj}$$ are random variables, we repeat this procedure $$M=500$$ times to get an idea of the variability of $$R^2$$ and $$R^2_\text{Adj}.$$

Figure 2.19 contains the results of this experiment. As can be seen, the $$R^2$$ increases linearly with the number of predictors considered, although only the first two were actually relevant. Thus, if we did not know about the random mechanism (2.25) that generated $$Y,$$ we would be tempted to believe that the most adequate models are the ones with a larger number of predictors. On the contrary, $$R^2_\text{Adj}$$ only increases for the first two variables and then exhibits a mild decaying trend, indicating that, on average, the best choice for the number of predictors is actually close to $$p=2.$$ However, note that $$R^2_\text{Adj}$$ has a huge variability when $$p$$ approaches $$n-2,$$ a consequence of the explosive variance of $$\hat\sigma^2$$ in that degenerate case⁴⁸. The experiment helps to visualize that $$R^2_\text{Adj}$$ is more adequate than the $$R^2$$ for evaluating the fit of a multiple linear regression⁴⁹.

An example of a simulated dataset considered in the experiment of Figure 2.19 is:

# Generate data
p <- 198
n <- 200
set.seed(3456732)
beta <- c(0.5, -0.5, rep(0, p - 2))
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
Y <- drop(X %*% beta + rnorm(n, sd = 3))
data <- data.frame(y = Y, x = X)

# Regression on the two meaningful predictors
summary(lm(y ~ x.1 + x.2, data = data))

# Adding 20 "garbage" variables
# R^2 increases and adjusted R^2 decreases
summary(lm(y ~ X[, 1:22], data = data))

Implement the simulation study behind Figure 2.19 in order to replicate it.
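Relatedly, the equivalent forms (2.23) and (2.24) can be double-checked against the output of summary by reusing the dataset simulated above (a small sketch; recall that n = 200 and that this fit has $$p=2$$):

# Check (2.23) and (2.24) against summary() for the p = 2 fit
mod12 <- lm(y ~ x.1 + x.2, data = data)
SSE <- sum(mod12$residuals^2)
SST <- sum((data$y - mean(data$y))^2)
c(1 - SSE / SST, summary(mod12)$r.squared)  # R^2, both coincide
c(1 - (SSE / (n - 2 - 1)) / (SST / (n - 1)),
  summary(mod12)$adj.r.squared)             # R^2_Adj, both coincide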
The $$R^2_\text{Adj}$$ no longer measures the proportion of variation of $$Y$$ explained by the regression, but rather the result of correcting this proportion by the number of predictors employed. As a consequence, $$R^2_\text{Adj}\leq1,$$ and $$R^2_\text{Adj}$$ can be negative. The next code illustrates a situation where we have two predictors that are completely independent of the response. The fitted model has a negative $$R^2_\text{Adj}.$$

# Three independent variables
set.seed(234599)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- 1 + rnorm(100)

# Negative adjusted R^2
summary(lm(y ~ x1 + x2))
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.5081 -0.5021 -0.0191  0.5286  2.4750
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.97024    0.10399   9.330 3.75e-15 ***
## x1           0.09003    0.10300   0.874    0.384
## x2          -0.05253    0.11090  -0.474    0.637
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.034 on 97 degrees of freedom
## Multiple R-squared:  0.009797,  Adjusted R-squared:  -0.01062
## F-statistic: 0.4799 on 2 and 97 DF,  p-value: 0.6203

For the previous example, construct more predictors (x3, x4, …) that are independent of y. Then check that, when these predictors are added to the model, the $$R^2_\text{Adj}$$ decreases and the $$R^2$$ increases.

Beware of the $$R^2$$ and $$R^2_\text{Adj}$$ for model fits with no intercept! If the linear model is fitted without an intercept, the summary function silently returns a ‘Multiple R-squared’ and an ‘Adjusted R-squared’ that do not correspond (see this question in the R FAQ) to the expressions seen so far,

\begin{align*} R^2=1-\frac{\text{SSE}}{\sum_{i=1}^n(Y_i-\bar{Y})^2},\quad R^2_{\text{Adj}}=1-\frac{\text{SSE}}{\text{SST}}\times\frac{n-1}{n-p-1}. \end{align*}

In the case with no intercept, summary rather returns

\begin{align*} R^2_0 :=1-\frac{\text{SSE}}{\sum_{i=1}^nY_i^2},\quad R^2_{0,\text{Adj}} :=1-\frac{\text{SSE}}{\sum_{i=1}^nY_i^2}\times\frac{n-1}{n-p-1}. \end{align*}

The reason is perhaps shocking: if the model is fitted without an intercept and neither the response nor the predictors are centered, then the ANOVA decomposition does not hold, in the sense that

\begin{align*} \mathrm{SST} \neq \mathrm{SSR} + \mathrm{SSE}. \end{align*}

The fact that the ANOVA decomposition does not hold for no-intercept models has serious consequences for the theory we built. In particular, the $$R^2$$ can be negative and $$R^2\neq r^2_{y\hat y},$$ both facts deriving from $$\sum_{i=1}^n\hat\varepsilon_i\neq0.$$ Therefore, the $$R^2$$ cannot be regarded as the “proportion of variance explained” if the model is fitted without an intercept. The $$R^2_0$$ and $$R^2_{0,\text{Adj}}$$ versions are considered because they are the ones that arise from the “no-intercept ANOVA decomposition”

\begin{align*} \mathrm{SST}_0 = \mathrm{SSR}_0 + \mathrm{SSE},\quad \mathrm{SST}_0:=\sum_{i=1}^nY_i^2,\quad \mathrm{SSR}_0:=\sum_{i=1}^n\hat{Y}_i^2, \end{align*}

and therefore the $$R^2_0$$ is guaranteed to be a quantity in $$[0,1].$$ It would indeed be the proportion of variance explained if the predictors and the response were centered (i.e., if $$\bar Y=0$$ and $$\bar X_j=0,$$ $$j=1,\ldots,p$$). The next chunk of code illustrates these concepts for the iris dataset.
# Model with intercept
mod1 <- lm(Sepal.Length ~ Petal.Width, data = iris)
mod1
##
## Call:
## lm(formula = Sepal.Length ~ Petal.Width, data = iris)
##
## Coefficients:
## (Intercept)  Petal.Width
##      4.7776       0.8886

# Model without intercept
mod0 <- lm(Sepal.Length ~ 0 + Petal.Width, data = iris)
mod0
##
## Call:
## lm(formula = Sepal.Length ~ 0 + Petal.Width, data = iris)
##
## Coefficients:
## Petal.Width
##       3.731

# Recall the different way of obtaining the estimators
X1 <- cbind(1, iris$Petal.Width)
X0 <- cbind(iris$Petal.Width) # No column of ones!
Y <- iris$Sepal.Length
betaHat1 <- solve(crossprod(X1)) %*% t(X1) %*% Y
betaHat0 <- solve(crossprod(X0)) %*% t(X0) %*% Y
betaHat1
##           [,1]
## [1,] 4.7776294
## [2,] 0.8885803
betaHat0
##          [,1]
## [1,] 3.731485

# Summaries: higher R^2 for the model with no intercept!?
summary(mod1)
##
## Call:
## lm(formula = Sepal.Length ~ Petal.Width, data = iris)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.38822 -0.29358 -0.04393  0.26429  1.34521
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  4.77763    0.07293   65.51   <2e-16 ***
## Petal.Width  0.88858    0.05137   17.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.478 on 148 degrees of freedom
## Multiple R-squared:  0.669,  Adjusted R-squared:  0.6668
## F-statistic: 299.2 on 1 and 148 DF,  p-value: < 2.2e-16
summary(mod0)
##
## Call:
## lm(formula = Sepal.Length ~ 0 + Petal.Width, data = iris)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.1556 -0.3917  1.0625  3.8537  5.0537
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## Petal.Width    3.732      0.150   24.87   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.609 on 149 degrees of freedom
## Multiple R-squared:  0.8058, Adjusted R-squared:  0.8045
## F-statistic: 618.4 on 1 and 149 DF,  p-value: < 2.2e-16

# Wait a moment... let's see the actual fit
plot(Sepal.Length ~ Petal.Width, data = iris)
abline(mod1, col = 2) # Obviously, much better
abline(mod0, col = 3)


# Compute the R^2 manually for mod1
SSE1 <- sum(mod1$residuals^2)
SST1 <- sum((mod1$model$Sepal.Length - mean(mod1$model$Sepal.Length))^2)
1 - SSE1 / SST1
## [1] 0.6690277

# Compute the R^2 manually for mod0
SSE0 <- sum(mod0$residuals^2)
SST0 <- sum((mod0$model$Sepal.Length - mean(mod0$model$Sepal.Length))^2)
1 - SSE0 / SST0
## [1] -8.926872
# It is negative!

# Recall that the mean of the residuals is not zero!
mean(mod0$residuals)
## [1] 1.368038

# What summary really returns if there is no intercept
n <- nrow(iris)
p <- 1
R0 <- 1 - sum(mod0$residuals^2) / sum(mod0$model$Sepal.Length^2)
R0Adj <- 1 - sum(mod0$residuals^2) / sum(mod0$model$Sepal.Length^2) *
  (n - 1) / (n - p - 1)
R0
## [1] 0.8058497
R0Adj
## [1] 0.8045379

# What if we centered the data previously?
irisCen <- data.frame(scale(iris[, -5], center = TRUE, scale = FALSE))
modCen1 <- lm(Sepal.Length ~ Petal.Width, data = irisCen)
modCen0 <- lm(Sepal.Length ~ 0 + Petal.Width, data = irisCen)

# No problem, "correct" R^2
summary(modCen1)
##
## Call:
## lm(formula = Sepal.Length ~ Petal.Width, data = irisCen)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.38822 -0.29358 -0.04393  0.26429  1.34521
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.050e-16  3.903e-02     0.0        1
## Petal.Width  8.886e-01  5.137e-02    17.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.478 on 148 degrees of freedom
## Multiple R-squared:  0.669,  Adjusted R-squared:  0.6668
## F-statistic: 299.2 on 1 and 148 DF,  p-value: < 2.2e-16
summary(modCen0)
##
## Call:
## lm(formula = Sepal.Length ~ 0 + Petal.Width, data = irisCen)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.38822 -0.29358 -0.04393  0.26429  1.34521
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## Petal.Width   0.8886     0.0512   17.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4764 on 149 degrees of freedom
## Multiple R-squared:  0.669,  Adjusted R-squared:  0.6668
## F-statistic: 301.2 on 1 and 149 DF,  p-value: < 2.2e-16

# But only if we center predictor and response...
summary(lm(iris$Sepal.Length ~ 0 + irisCen$Petal.Width))
##
## Call:
## lm(formula = iris$Sepal.Length ~ 0 + irisCen$Petal.Width)
##
## Residuals:
##    Min     1Q Median     3Q    Max
##  4.455  5.550  5.799  6.108  7.189
##
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)
## irisCen$Petal.Width   0.8886     0.6322   1.406    0.162
##
## Residual standard error: 5.882 on 149 degrees of freedom
## Multiple R-squared:  0.01308,    Adjusted R-squared:  0.006461
## F-statistic: 1.975 on 1 and 149 DF,  p-value: 0.1619
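To make the failure of the ANOVA decomposition concrete, it can also be checked numerically with the quantities already computed (a short sketch; SSR1 and SSR0 are ad hoc names for the regression sums of squares):

# SST = SSR + SSE holds for mod1 (intercept)...
SSR1 <- sum((mod1$fitted.values - mean(Y))^2)
SST1 - (SSR1 + SSE1)  # Approximately zero
# ...but fails for mod0 (no intercept)
SSR0 <- sum((mod0$fitted.values - mean(Y))^2)
SST0 - (SSR0 + SSE0)  # Clearly nonzero
# The no-intercept decomposition SST_0 = SSR_0 + SSE does hold
sum(Y^2) - (sum(mod0$fitted.values^2) + SSE0)  # Approximately zero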

Let’s play the evil practitioner and try to modify the $$R_0^2$$ returned by summary when the intercept is excluded. If we were really evil, we could use this knowledge to fool someone who is not aware of the difference between $$R_0^2$$ and $$R^2$$ into believing that any given model is incredibly good or bad in terms of the $$R^2$$!

1. For the previous example, display the $$R_0^2$$ as a function of a shift in the mean of the predictor Petal.Width. What can you conclude? Hint: you may use I() to perform the shift inside the model formula; see the starter snippet after this list.
2. What shift on Petal.Width would be necessary to achieve an $$R_0^2\approx0.95$$? Is this shift unique?
3. Do the same but for a shift in the mean of the response Sepal.Length. What shift would be necessary to achieve an $$R_0^2\approx0.50$$? Is there a single shift?
4. Consider the multiple linear model medv ~ nox + rm for the Boston dataset. We want to tweak the $$R_0^2$$ to set it to any number in $$[0,1].$$ Can we achieve this by only shifting rm or medv (only one of them)?
5. Systematically explore the $$R_0^2$$ for combinations of shifts of rm and medv, and obtain a combination that delivers an $$R_0^2\approx 0.30.$$ Hint: use filled.contour() for visualizing the $$R_0^2$$ surface.
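As a starting point for the first two items, a single mean shift of Petal.Width can be plugged in with I() as follows (a minimal sketch; the shift value 2 is arbitrary):

# R_0^2 for the original and for an arbitrarily shifted Petal.Width
summary(lm(Sepal.Length ~ 0 + Petal.Width, data = iris))$r.squared
summary(lm(Sepal.Length ~ 0 + I(Petal.Width + 2), data = iris))$r.squared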

### 2.7.3 Case study application

Coming back to the case study, we have studied so far three models:

# Fit models
modWine1 <- lm(Price ~ ., data = wine)
modWine2 <- lm(Price ~ . - FrancePop, data = wine)
modWine3 <- lm(Price ~ Age + WinterRain, data = wine)

# Summaries
summary(modWine1)
##
## Call:
## lm(formula = Price ~ ., data = wine)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.46541 -0.24133  0.00413  0.18974  0.52495
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.343e+00  7.697e+00  -0.304  0.76384
## WinterRain   1.153e-03  4.991e-04   2.311  0.03109 *
## AGST         6.144e-01  9.799e-02   6.270 3.22e-06 ***
## HarvestRain -3.837e-03  8.366e-04  -4.587  0.00016 ***
## Age          1.377e-02  5.821e-02   0.237  0.81531
## FrancePop   -2.213e-05  1.268e-04  -0.175  0.86313
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.293 on 21 degrees of freedom
## Multiple R-squared:  0.8278, Adjusted R-squared:  0.7868
## F-statistic: 20.19 on 5 and 21 DF,  p-value: 2.232e-07
summary(modWine2)
##
## Call:
## lm(formula = Price ~ . - FrancePop, data = wine)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.46024 -0.23862  0.01347  0.18601  0.53443
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.6515703  1.6880876  -2.163  0.04167 *
## WinterRain   0.0011667  0.0004820   2.420  0.02421 *
## AGST         0.6163916  0.0951747   6.476 1.63e-06 ***
## HarvestRain -0.0038606  0.0008075  -4.781 8.97e-05 ***
## Age          0.0238480  0.0071667   3.328  0.00305 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2865 on 22 degrees of freedom
## Multiple R-squared:  0.8275, Adjusted R-squared:  0.7962
## F-statistic: 26.39 on 4 and 22 DF,  p-value: 4.057e-08
summary(modWine3)
##
## Call:
## lm(formula = Price ~ Age + WinterRain, data = wine)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.88964 -0.51421 -0.00066  0.43103  1.06897
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.9830427  0.5993667   9.982 5.09e-10 ***
## Age         0.0360559  0.0137377   2.625   0.0149 *
## WinterRain  0.0007813  0.0008780   0.890   0.3824
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5769 on 24 degrees of freedom
## Multiple R-squared:  0.2371, Adjusted R-squared:  0.1736
## F-statistic:  3.73 on 2 and 24 DF,  p-value: 0.03884

The model modWine2 has the largest $$R^2_\text{Adj}$$ among the three models studied. It explains $$82.75\%$$ of the variability in a non-redundant way and with all its coefficients significant. Therefore, we have a formula for effectively explaining and predicting the quality of a vintage (answers Q1). Prediction and, importantly, quantification of the uncertainty in the prediction can be done through the predict function.
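A compact way of backing this comparison is to extract both measures from the three summaries (a small convenience snippet based on the models fitted above):

# R^2 and adjusted R^2 for the three models
mods <- list(modWine1 = modWine1, modWine2 = modWine2, modWine3 = modWine3)
sapply(mods, function(mod)
  c(R2 = summary(mod)$r.squared, R2Adj = summary(mod)$adj.r.squared))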

Furthermore, the interpretation of modWine2 agrees with well-known facts in viticulture that make perfect sense. Precisely (answers Q2):

• Higher temperatures are associated with better quality (higher priced) wine.
• Rain before the growing season is good for the wine quality, but during harvest is bad.
• The quality of the wine improves with age.

Although these conclusions could be regarded as common folklore, keep in mind that this model

• allows us to quantify the effect of each variable on the wine quality and
• provides a precise way of predicting the quality of future vintages.
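For instance, a prediction for a new vintage, together with a 95% prediction interval, could be obtained along these lines (a sketch: the predictor values below are hypothetical, not taken from the wine dataset):

# Hypothetical new vintage: point prediction and 95% prediction interval
newVintage <- data.frame(WinterRain = 500, AGST = 17,
                         HarvestRain = 100, Age = 30)
predict(modWine2, newdata = newVintage, interval = "prediction", level = 0.95)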

45. Which is not the $$\hat\sigma^2$$ in (2.21), but $$\hat\sigma^2$$ is obviously dependent on $$\sigma^2.$$↩︎

46. Which do not hold!↩︎

47. An informal way of regarding the difference between $$R^2$$ and $$R^2_\text{Adj}$$ is by thinking of $$R^2$$ as a measure of fit that “is not aware of the dangers of overfitting”. In this interpretation, $$R^2_\text{Adj}$$ is the overfitting-aware version of $$R^2.$$↩︎

48. This indicates that $$R^2_\text{Adj}$$ can act as a model selection device up to a certain point, as its effectiveness becomes too variable, too erratic, when the number of predictors is very high in comparison with $$n.$$↩︎

49. Coincidentally, the experiment also serves as a reminder about the randomness of $$R^2$$ and $$R^2_{\text{Adj}},$$ a fact that is sometimes overlooked.↩︎