2.4 Inference for model parameters

The assumptions introduced in the previous section allow us to specify the distribution of the random vector $\hat{\boldsymbol{\beta}}$. The distribution is derived conditionally on the predictors' sample $\mathbf{X}_1,\ldots,\mathbf{X}_n$. In other words, we assume that the randomness of $\mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}$ comes only from the error terms and not from the predictors.[25] To denote this, we employ lowercase for the predictors' sample $\mathbf{x}_1,\ldots,\mathbf{x}_n$.

2.4.1 Distributions of the fitted coefficients

The distribution of $\hat{\boldsymbol{\beta}}$ is:

\[
\hat{\boldsymbol{\beta}} \sim \mathcal{N}_{p+1}\left(\boldsymbol{\beta}, \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\right). \tag{2.11}
\]

This result can be obtained from the form of $\hat{\boldsymbol{\beta}}$ given in (2.7), the sample version of the model assumptions given in (2.10), and the linear transformation property of a normal given in (1.4). Equation (2.11) implies that the marginal distribution of $\hat\beta_j$ is

\[
\hat\beta_j \sim \mathcal{N}\left(\beta_j, \mathrm{SE}(\hat\beta_j)^2\right), \tag{2.12}
\]

where $\mathrm{SE}(\hat\beta_j)$ is the standard error of $\hat\beta_j$, $\mathrm{SE}(\hat\beta_j)^2:=\sigma^2 v_j$, and $v_j$ is the $j$-th element of the diagonal of $(\mathbf{X}'\mathbf{X})^{-1}$.

Recall that an equivalent form for (2.12) is (why?)

\[
\frac{\hat\beta_j-\beta_j}{\mathrm{SE}(\hat\beta_j)} \sim \mathcal{N}(0,1).
\]

The interpretation of (2.12) is simpler in the case with $p=1$, where

\[
\hat\beta_0 \sim \mathcal{N}\left(\beta_0, \mathrm{SE}(\hat\beta_0)^2\right), \qquad \hat\beta_1 \sim \mathcal{N}\left(\beta_1, \mathrm{SE}(\hat\beta_1)^2\right), \tag{2.13}
\]

with

\[
\mathrm{SE}(\hat\beta_0)^2=\frac{\sigma^2}{n}\left[1+\frac{\bar{X}^2}{s_x^2}\right], \qquad \mathrm{SE}(\hat\beta_1)^2=\frac{\sigma^2}{n s_x^2}. \tag{2.14}
\]
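A quick numerical check of (2.14) is possible by plugging the estimate $\hat\sigma^2$, introduced below in (2.15), in place of the unknown $\sigma^2$ and comparing with the standard errors reported by summary. This is only a sketch; the choice of the simple regression Price ~ AGST from the wine dataset of the case study is arbitrary.

# Check of (2.14) with sigma^2 replaced by its estimate
fit <- lm(Price ~ AGST, data = wine)
n <- nrow(wine)
sigma2Hat <- sum(fit$residuals^2) / (n - 2)  # (2.15) with p = 1
sx2 <- var(wine$AGST) * (n - 1) / n  # sample variance with 1/n factor
xBar <- mean(wine$AGST)
sqrt(c(sigma2Hat / n * (1 + xBar^2 / sx2), sigma2Hat / (n * sx2)))
summary(fit)$coefficients[, "Std. Error"]  # Same standard errors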

Some insights on (2.13) and (2.14), illustrated interactively in Figure 2.13, are the following:

  • Bias. Both estimators are unbiased: their expectations equal the true coefficients for any sample size $n$.

  • Variance. The variances $\mathrm{SE}(\hat\beta_0)^2$ and $\mathrm{SE}(\hat\beta_1)^2$ have interesting interpretations in terms of their components:

    • Sample size $n$. As the sample size grows, the precision of the estimators increases, since both variances decrease.

    • Error variance $\sigma^2$. The more dispersed the errors are, the less precise the estimates are, since more vertical variability is present.

    • Predictor variance $s_x^2$. If the predictor is spread out (large $s_x^2$), then it is easier to fit a regression line: we have information about the data trend over a long interval. If $s_x^2$ is small, then all the data are concentrated on a narrow vertical band, so we have a much more limited view of the trend.

    • Mean $\bar{X}$. It has influence only on the precision of $\hat\beta_0$: the larger $\bar{X}^2$ is, the less precise $\hat\beta_0$ is.

Figure 2.13: Illustration of the randomness of the fitted coefficients $(\hat\beta_0,\hat\beta_1)$ and the influence of $n$, $\sigma^2$, and $s_x^2$. The predictors' sample $x_1,\ldots,x_n$ is fixed and new responses $Y_1,\ldots,Y_n$ are generated each time from a linear model $Y=\beta_0+\beta_1 X+\varepsilon$. Application available here.
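A short simulation along the lines of Figure 2.13 makes these insights concrete: fixing the predictors' sample and regenerating the responses many times, the empirical standard deviations of $(\hat\beta_0,\hat\beta_1)$ should approach the theoretical standard errors in (2.14). This is only a sketch; the values of $n$, $\sigma$, and the coefficients below are arbitrary choices.

# Randomness of the fitted coefficients with fixed predictors
set.seed(1234)
n <- 50; beta0 <- 0.5; beta1 <- -1; sigma <- 2
x <- rnorm(n)  # Fixed predictors' sample, as in Figure 2.13
betaHats <- replicate(1e3, {
  y <- beta0 + beta1 * x + rnorm(n, sd = sigma)
  coef(lm(y ~ x))
})
apply(betaHats, 1, sd)  # Empirical standard deviations of the estimates
sx2 <- var(x) * (n - 1) / n  # Sample variance with 1/n factor
sqrt(c(sigma^2 / n * (1 + mean(x)^2 / sx2), sigma^2 / (n * sx2)))  # Theoretical SEs in (2.14)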

The insights about (2.11) are more convoluted. The following broad remarks, extensions of what happened when $p=1$, apply:

  • Bias. All the estimates are unbiased for any sample size $n$.

  • Variance. It depends on:

    • Sample size $n$. Hidden inside $\mathbf{X}'\mathbf{X}$. As $n$ grows, the precision of the estimators increases.
    • Error variance $\sigma^2$. The larger $\sigma^2$ is, the less precise $\hat{\boldsymbol{\beta}}$ is.
    • Predictor sparsity $(\mathbf{X}'\mathbf{X})^{-1}$. The more “dispersed”[26] the predictors are, the more precise $\hat{\boldsymbol{\beta}}$ is.

The problem with the result in (2.11) is that $\sigma^2$ is unknown in practice. Therefore, we need to estimate $\sigma^2$ in order to use a result similar to (2.11). We do so by computing a rescaled sample variance of the residuals $\hat\varepsilon_1,\ldots,\hat\varepsilon_n$:

\[
\hat\sigma^2:=\frac{1}{n-p-1}\sum_{i=1}^{n}\hat\varepsilon_i^2. \tag{2.15}
\]

Note the $n-p-1$ in the denominator. This factor represents the degrees of freedom: the number of data points minus the number of parameters already[27] fitted with the data ($p$ slopes plus $1$ intercept). For the interpretation of $\hat\sigma^2$, it is key to realize that the mean of the residuals $\hat\varepsilon_1,\ldots,\hat\varepsilon_n$ is zero, that is, $\bar{\hat\varepsilon}=0$. Therefore, $\hat\sigma^2$ is indeed a rescaled sample variance of the residuals that estimates the variance of $\varepsilon$.[28] It can be seen that $\hat\sigma^2$ is unbiased as an estimator of $\sigma^2$.
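This unbiasedness can be checked empirically with a small Monte Carlo experiment: simulate many samples from a known linear model, refit it, and average the estimates of $\sigma^2$. The sample size, coefficients, and number of replicates in the sketch below are arbitrary choices.

# Monte Carlo check of the unbiasedness of the estimator in (2.15)
set.seed(1234)
n <- 50; p <- 2; sigma2 <- 4
x1 <- rnorm(n); x2 <- rnorm(n)  # Fixed predictors across replicates
sigma2Hats <- replicate(1e3, {
  y <- 1 + 0.5 * x1 - x2 + rnorm(n, sd = sqrt(sigma2))
  sum(lm(y ~ x1 + x2)$residuals^2) / (n - p - 1)
})
mean(sigma2Hats)  # Should be close to sigma2 = 4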

If we use the estimate $\hat\sigma^2$ instead of $\sigma^2$, we get more useful[29] distributions than (2.12):

\[
\frac{\hat\beta_j-\beta_j}{\hat{\mathrm{SE}}(\hat\beta_j)} \sim t_{n-p-1}, \qquad \hat{\mathrm{SE}}(\hat\beta_j)^2:=\hat\sigma^2 v_j, \tag{2.16}
\]

where $t_{n-p-1}$ represents the Student's $t$ distribution with $n-p-1$ degrees of freedom.

The LHS of (2.16) is the t-statistic for $\beta_j$, $j=0,\ldots,p$. We will employ these statistics for building confidence intervals and hypothesis tests in what follows.
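The t-statistics reported by summary can be recomputed by hand from the coefficient estimates and their estimated standard errors. The sketch below simply anticipates the wine model fitted in Section 2.4.4.

# t-statistics by hand versus the ones reported by summary()
fit <- lm(Price ~ ., data = wine)
betaHat <- coef(fit)
seHat <- sqrt(diag(vcov(fit)))  # vcov(fit) is the estimated sigma^2 * (X'X)^{-1}
cbind(betaHat / seHat, summary(fit)$coefficients[, "t value"])  # Both columns coincide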

2.4.2 Confidence intervals for the coefficients

Thanks to (2.16), we can build the $100(1-\alpha)\%$ Confidence Intervals (CI) for the coefficient $\beta_j$, $j=0,\ldots,p$:

\[
\left(\hat\beta_j \pm \hat{\mathrm{SE}}(\hat\beta_j)\, t_{n-p-1;\alpha/2}\right), \tag{2.17}
\]

where $t_{n-p-1;\alpha/2}$ is the $\alpha/2$-upper quantile of the $t_{n-p-1}$ distribution. Usually, $\alpha=0.10, 0.05, 0.01$ are considered.
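The CI in (2.17) can be computed by hand and compared with the output of confint, which is employed later in the case study. The model below (modWine3 of Section 2.4.4) is just an example choice.

# CIs in (2.17) by hand versus confint()
fit <- lm(Price ~ Age + WinterRain, data = wine)
alpha <- 0.05
betaHat <- coef(fit)
seHat <- sqrt(diag(vcov(fit)))
tQuant <- qt(1 - alpha / 2, df = fit$df.residual)  # Upper alpha/2 quantile of t_{n-p-1}
cbind(betaHat - seHat * tQuant, betaHat + seHat * tQuant)
confint(fit, level = 1 - alpha)  # Same intervals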

Figure 2.14: Illustration of the randomness of the CI for $\beta_0$ at $100(1-\alpha)\%$ confidence. The plot shows 100 random CIs for $\beta_0$, computed from 100 random datasets generated by the same simple linear model, with intercept $\beta_0$. The illustration for $\beta_1$ is completely analogous. Application available here.

This random CI contains the unknown coefficient $\beta_j$ “with a probability of $1-\alpha$”. The previous quoted statement has to be understood as follows. Suppose you have 100 samples generated according to a linear model. If you compute the CI for a coefficient in each of them, then in approximately $100(1-\alpha)$ of the samples the true coefficient would actually be inside the random CI. Note also that the CI is symmetric around $\hat\beta_j$. This is illustrated in Figure 2.14.
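A coverage simulation in the spirit of Figure 2.14 makes this interpretation tangible: generate many samples from a known linear model, compute the 95% CI for $\beta_1$ in each, and count how often it contains the true value. The model coefficients, sample size, and number of replicates below are arbitrary choices.

# Empirical coverage of the 95% CI for beta_1
set.seed(1234)
beta0 <- 0.5; beta1 <- -1; n <- 50; M <- 1000
x <- rnorm(n)  # Fixed predictors' sample
covers <- replicate(M, {
  y <- beta0 + beta1 * x + rnorm(n)
  ci <- confint(lm(y ~ x), level = 0.95)["x", ]
  ci[1] < beta1 & beta1 < ci[2]
})
mean(covers)  # Should be close to 0.95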

2.4.3 Testing on the coefficients

The distributions in (2.16) also allow us to conduct formal hypothesis tests on the coefficients $\beta_j$, $j=0,\ldots,p$. For example, the test for significance[30] is especially important, that is, the test of the hypothesis

\[
H_0:\beta_j=0
\]

for $j=0,\ldots,p$. The test of $H_0:\beta_j=0$ with $1\leq j\leq p$ is especially interesting, since it allows us to answer whether the variable $X_j$ has a significant linear effect on $Y$. The statistic used for testing for significance is the t-statistic

\[
\frac{\hat\beta_j-0}{\hat{\mathrm{SE}}(\hat\beta_j)},
\]

which is distributed as a $t_{n-p-1}$ under (the veracity of) the null hypothesis.[31]

The null hypothesis $H_0$ is tested against the alternative hypothesis, $H_1$. If $H_0$ is rejected, it is rejected in favor of $H_1$. The alternative hypothesis can be two-sided (we will focus mostly on these alternatives), such as

\[
H_0:\beta_j=0 \quad \text{vs.} \quad H_1:\beta_j\neq 0,
\]

or one-sided, such as

\[
H_0:\beta_j=0 \quad \text{vs.} \quad H_1:\beta_j<0 \;(\text{or } \beta_j>0).
\]

The test based on the t-statistic is referred to as the t-test. It rejects $H_0:\beta_j=0$ (against $H_1:\beta_j\neq 0$) at significance level $\alpha$ for large absolute values of the t-statistic, precisely for those above the $\alpha/2$-upper quantile of the $t_{n-p-1}$ distribution. That is, it rejects $H_0$ at level $\alpha$ if $\frac{|\hat\beta_j|}{\hat{\mathrm{SE}}(\hat\beta_j)}>t_{n-p-1;\alpha/2}$.[32] For the one-sided tests, it rejects $H_0$ against $H_1:\beta_j<0$ or $H_1:\beta_j>0$ if $\frac{\hat\beta_j}{\hat{\mathrm{SE}}(\hat\beta_j)}<-t_{n-p-1;\alpha}$ or $\frac{\hat\beta_j}{\hat{\mathrm{SE}}(\hat\beta_j)}>t_{n-p-1;\alpha}$, respectively.
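The p-values of the two-sided t-tests reported by summary can also be recovered by hand from the t-statistics and the $t_{n-p-1}$ distribution. Again, the wine model of Section 2.4.4 is only an example choice.

# Two-sided p-values by hand versus the ones reported by summary()
fit <- lm(Price ~ ., data = wine)
tStats <- summary(fit)$coefficients[, "t value"]
pVals <- 2 * pt(abs(tStats), df = fit$df.residual, lower.tail = FALSE)
cbind(pVals, summary(fit)$coefficients[, "Pr(>|t|)"])  # Both columns coincide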

Remember the following insights about hypothesis testing.

The analogy between conducting a hypothesis test and a trial can be seen in Appendix A.1.

In a hypothesis test, the p-value measures the degree of veracity of $H_0$ according to the data. The rule of thumb is the following:

Is the p-value lower than $\alpha$?

  • Yes $\rightarrow$ reject $H_0$.
  • No $\rightarrow$ do not reject $H_0$.

The connection between the t-test for $H_0:\beta_j=0$ and the CI for $\beta_j$, both at level $\alpha$, is the following:

Is $0$ inside the CI for $\beta_j$?

  • Yes $\rightarrow$ do not reject $H_0$.
  • No $\rightarrow$ reject $H_0$.

The one-sided test $H_0:\beta_j=0$ vs. $H_1:\beta_j<0$ (respectively, $H_1:\beta_j>0$) can also be done by means of the CI for $\beta_j$. If $H_0$ is rejected, we can conclude that $\hat\beta_j$ is significantly negative (positive) and that, for the considered regression model, $X_j$ has a significant negative (positive) effect on $Y$. The rule of thumb is the following:

Is the CI for $\beta_j$ below (above) $0$ at level $\alpha$?

  • Yes $\rightarrow$ reject $H_0$ at level $\alpha$. Conclude that $X_j$ has a significant negative (positive) effect on $Y$ at level $\alpha$.
  • No $\rightarrow$ the criterion is not conclusive.

2.4.4 Case study application

Let’s analyze the multiple linear model we have considered for the wine dataset, now that we know how to make inference on the model parameters. The relevant information is obtained with the summary of the model:

# Fit
modWine1 <- lm(Price ~ ., data = wine)

# Summary
sumModWine1 <- summary(modWine1)
sumModWine1
## 
## Call:
## lm(formula = Price ~ ., data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46541 -0.24133  0.00413  0.18974  0.52495 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.343e+00  7.697e+00  -0.304  0.76384    
## WinterRain   1.153e-03  4.991e-04   2.311  0.03109 *  
## AGST         6.144e-01  9.799e-02   6.270 3.22e-06 ***
## HarvestRain -3.837e-03  8.366e-04  -4.587  0.00016 ***
## Age          1.377e-02  5.821e-02   0.237  0.81531    
## FrancePop   -2.213e-05  1.268e-04  -0.175  0.86313    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.293 on 21 degrees of freedom
## Multiple R-squared:  0.8278, Adjusted R-squared:  0.7868 
## F-statistic: 20.19 on 5 and 21 DF,  p-value: 2.232e-07

# Contains the estimation of sigma ("Residual standard error")
sumModWine1$sigma
## [1] 0.2930287

# Which is the same as
sqrt(sum(modWine1$residuals^2) / modWine1$df.residual)
## [1] 0.2930287

The Coefficients block of the summary output contains the following elements regarding the significance of each coefficient $\beta_j$, that is, the test of $H_0:\beta_j=0$ vs. $H_1:\beta_j\neq 0$:

  • Estimate: least squares estimate $\hat\beta_j$.
  • Std. Error: estimated standard error $\hat{\mathrm{SE}}(\hat\beta_j)$.
  • t value: t-statistic $\frac{\hat\beta_j}{\hat{\mathrm{SE}}(\hat\beta_j)}$.
  • Pr(>|t|): p-value of the t-test.
  • Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1: codes indicating the size of the p-value. The more asterisks, the more evidence supporting that $H_0$ does not hold.[33]

Note that a high proportion of the predictors are not significant in modWine1: FrancePop and Age are not significant (and neither is the intercept). This is an indication of an excess of predictors adding little information to the response. One explanation is the almost perfect correlation between FrancePop and Age shown before: one of them is not adding any extra information to explain Price. This complicates the model unnecessarily and, more importantly, has the undesirable effect of making the coefficient estimates less precise. We opt to remove the predictor FrancePop from the model since it is exogenous to the wine context.[34] A data-driven justification for removing this variable is that it is the least significant in modWine1.
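The almost perfect linear relation between Age and FrancePop can be double-checked directly; the correlation should be close to one in absolute value (no exact value is claimed here).

# Near-perfect linear relation between Age and FrancePop
cor(wine$Age, wine$FrancePop)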

Then, the model without FrancePop[35] is:

modWine2 <- lm(Price ~ . - FrancePop, data = wine)
summary(modWine2)
## 
## Call:
## lm(formula = Price ~ . - FrancePop, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46024 -0.23862  0.01347  0.18601  0.53443 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.6515703  1.6880876  -2.163  0.04167 *  
## WinterRain   0.0011667  0.0004820   2.420  0.02421 *  
## AGST         0.6163916  0.0951747   6.476 1.63e-06 ***
## HarvestRain -0.0038606  0.0008075  -4.781 8.97e-05 ***
## Age          0.0238480  0.0071667   3.328  0.00305 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2865 on 22 degrees of freedom
## Multiple R-squared:  0.8275, Adjusted R-squared:  0.7962 
## F-statistic: 26.39 on 4 and 22 DF,  p-value: 4.057e-08

All the coefficients are significant at level $\alpha=0.05$. Therefore, there is no clearly redundant information. In addition, the $R^2$ is very similar to that of the full model, but the 'Adjusted R-squared', a weighting of the $R^2$ to account for the number of predictors used by the model, is slightly larger. As we will see in Section 2.7.2, this means that, relative to the number of predictors used, modWine2 explains more variability of Price than modWine1.
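The 'Adjusted R-squared' reported by summary can be recovered from the $R^2$, the sample size, and the number of predictors through the usual formula $1-(1-R^2)\frac{n-1}{n-p-1}$ (Section 2.7.2 discusses it in detail). A quick check:

# Adjusted R^2 of modWine2 recomputed from R^2, n, and p
r2 <- summary(modWine2)$r.squared
n <- nrow(wine); p <- 4  # modWine2 employs 4 predictors
1 - (1 - r2) * (n - 1) / (n - p - 1)
summary(modWine2)$adj.r.squared  # Same value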

A handy way of comparing the coefficients of both models is car::compareCoefs:

car::compareCoefs(modWine1, modWine2)
## Calls:
## 1: lm(formula = Price ~ ., data = wine)
## 2: lm(formula = Price ~ . - FrancePop, data = wine)
## 
##               Model 1   Model 2
## (Intercept)     -2.34     -3.65
## SE               7.70      1.69
##                                
## WinterRain   0.001153  0.001167
## SE           0.000499  0.000482
##                                
## AGST           0.6144    0.6164
## SE             0.0980    0.0952
##                                
## HarvestRain -0.003837 -0.003861
## SE           0.000837  0.000808
##                                
## Age           0.01377   0.02385
## SE            0.05821   0.00717
##                                
## FrancePop   -2.21e-05          
## SE           1.27e-04          
## 

Note how the coefficients of modWine2 have smaller standard errors than those of modWine1.

The individual CIs for the unknown $\beta_j$'s can be obtained by applying the confint function to an lm object. Let's compute the CIs for the model coefficients of modWine1, modWine2, and a new model modWine3:

# Fit a new model
modWine3 <- lm(Price ~ Age + WinterRain, data = wine)
summary(modWine3)
## 
## Call:
## lm(formula = Price ~ Age + WinterRain, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88964 -0.51421 -0.00066  0.43103  1.06897 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.9830427  0.5993667   9.982 5.09e-10 ***
## Age         0.0360559  0.0137377   2.625   0.0149 *  
## WinterRain  0.0007813  0.0008780   0.890   0.3824    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5769 on 24 degrees of freedom
## Multiple R-squared:  0.2371, Adjusted R-squared:  0.1736 
## F-statistic:  3.73 on 2 and 24 DF,  p-value: 0.03884

# Confidence intervals at 95%
# CI: (lwr, upr)
confint(modWine3)
##                    2.5 %      97.5 %
## (Intercept)  4.746010626 7.220074676
## Age          0.007702664 0.064409106
## WinterRain  -0.001030725 0.002593278

# Confidence intervals at other levels
confint(modWine3, level = 0.90)
##                       5 %        95 %
## (Intercept)  4.9575969417 7.008488360
## Age          0.0125522989 0.059559471
## WinterRain  -0.0007207941 0.002283347
confint(modWine3, level = 0.99)
##                    0.5 %      99.5 %
## (Intercept)  4.306650310 7.659434991
## Age         -0.002367633 0.074479403
## WinterRain  -0.001674299 0.003236852

# Compare with previous models
confint(modWine1)
##                     2.5 %        97.5 %
## (Intercept) -1.834844e+01 13.6632391095
## WinterRain   1.153872e-04  0.0021910509
## AGST         4.106337e-01  0.8182146540
## HarvestRain -5.577203e-03 -0.0020974232
## Age         -1.072931e-01  0.1348317795
## FrancePop   -2.858849e-04  0.0002416171
confint(modWine2)
##                     2.5 %       97.5 %
## (Intercept) -7.1524497573 -0.150690903
## WinterRain   0.0001670449  0.002166393
## AGST         0.4190113907  0.813771726
## HarvestRain -0.0055353098 -0.002185890
## Age          0.0089852800  0.038710748
confint(modWine3)
##                    2.5 %      97.5 %
## (Intercept)  4.746010626 7.220074676
## Age          0.007702664 0.064409106
## WinterRain  -0.001030725 0.002593278

In modWine3, the 95% CI for $\beta_0$ is $(4.7460, 7.2201)$, for $\beta_1$ is $(0.0077, 0.0644)$, and for $\beta_2$ is $(-0.0010, 0.0026)$. Therefore, we can say with 95% confidence that the coefficient of WinterRain is non-significant ($0$ is inside the CI). But, inspecting the CI of the WinterRain coefficient in modWine2, we can see that it is significant for that model! How is this possible? The answer is that the presence of extra predictors affects the coefficient estimate, as we saw in Figure 2.7. Therefore, the precise statement to make is:

In model Price ~ Age + WinterRain, with $\alpha=0.05$, the coefficient of WinterRain is non-significant.

Note that this does not mean that the coefficient will always be non-significant: in Price ~ Age + AGST + HarvestRain + WinterRain it is significant.

Compute and interpret the CIs for the coefficients, at levels $\alpha=0.10, 0.05, 0.01$, for the following regressions:

  1. Price ~ WinterRain + HarvestRain + AGST (wine).
  2. AGST ~ Year + FrancePop (wine).

For the assumptions dataset, do the following:

  1. Regression y7 ~ x7. Check that:
    • The intercept is not significant for the regression at any reasonable level $\alpha$.
    • The slope is significant for any $\alpha\geq 10^{-7}$.
  2. Regression y6 ~ x6. Assume the linear model assumptions are verified.
    • Check that $\hat\beta_0$ is significantly different from zero at any level $\alpha$.
    • For which $\alpha=0.10, 0.05, 0.01$ is $\hat\beta_1$ significantly different from zero?

In certain applications, it is useful to center the predictors $X_1,\ldots,X_p$ prior to fitting the model, in such a way that the slope coefficients $(\beta_1,\ldots,\beta_p)$ measure the effects of deviations of the predictors from their means. Theoretically, this amounts to considering the linear model

\[
Y=\beta_0+\beta_1(X_1-\mathbb{E}[X_1])+\cdots+\beta_p(X_p-\mathbb{E}[X_p])+\varepsilon.
\]

In the sample case, we proceed by replacing $X_{ij}$ by $X_{ij}-\bar{X}_j$, which can be easily done with the scale function (see below). If, in addition, the response is also centered, then $\beta_0=0$ and $\hat\beta_0=0$. This centering of the data has no influence on the significance of the predictors (but it does have influence on the significance of $\hat\beta_0$), as it is just a linear transformation of them.

# By default, scale centers (subtracts the mean) and scales (divides by the
# standard deviation) the columns of a matrix
wineCen <- data.frame(scale(wine, center = TRUE, scale = FALSE))

# Regression with centered response and predictors
modWine3Cen <- lm(Price ~ Age + WinterRain, data = wineCen)

# Summary
summary(modWine3Cen)
## 
## Call:
## lm(formula = Price ~ Age + WinterRain, data = wineCen)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88964 -0.51421 -0.00066  0.43103  1.06897 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 5.964e-16  1.110e-01   0.000   1.0000  
## Age         3.606e-02  1.374e-02   2.625   0.0149 *
## WinterRain  7.813e-04  8.780e-04   0.890   0.3824  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5769 on 24 degrees of freedom
## Multiple R-squared:  0.2371, Adjusted R-squared:  0.1736 
## F-statistic:  3.73 on 2 and 24 DF,  p-value: 0.03884
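If instead only the predictors are centered (and not the response), the slopes are unchanged and the intercept becomes the sample mean of the response. A minimal sketch of this variant, using the same model:

# Centering only the predictors: slopes unchanged, intercept = mean response
modWine3CenX <- lm(Price ~ scale(Age, scale = FALSE) +
                     scale(WinterRain, scale = FALSE), data = wine)
coef(modWine3CenX)
mean(wine$Price)  # Coincides with the intercept above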

  25. This is for theoretical and modeling convenience. With this assumption, we just model the randomness of $Y$ given the predictors. If the randomness of $Y$ and the randomness of $X_1,\ldots,X_n$ were to be modeled, we would require a significantly more complex model.

  26. Understood as small $|(\mathbf{X}'\mathbf{X})^{-1}|$.

  27. Prior to undertaking the estimation of $\sigma$, we have used the sample to estimate $\boldsymbol{\beta}$. The situation is thus analogous to the distinction between the sample variance $s_x^2=\frac{1}{n}\sum_{i=1}^n(X_i-\bar{X})^2$ and the sample quasi-variance $\hat{s}_x^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2$, both computed from a sample $X_1,\ldots,X_n$. When estimating $\mathrm{Var}[X]$, both previously estimate $\mathbb{E}[X]$ through $\bar{X}$. The fact that $\hat{s}_x^2$ accounts for that prior estimation through the degrees of freedom $n-1$ makes that estimator unbiased for $\mathrm{Var}[X]$ ($s_x^2$ is not).

  28. Recall that the sample variance of $\hat\varepsilon_1,\ldots,\hat\varepsilon_n$ is $\frac{1}{n}\sum_{i=1}^n(\hat\varepsilon_i-\bar{\hat\varepsilon})^2$.

  29. In the sense of practically realistic.

  30. Shortcut for “significantly different from zero”.

  31. This is denoted as $\frac{\hat\beta_j-0}{\hat{\mathrm{SE}}(\hat\beta_j)}\stackrel{H_0}{\sim} t_{n-p-1}$.

  32. In R, $t_{n-p-1;\alpha/2}$ can be computed as qt(p = 1 - alpha / 2, df = n - p - 1) or qt(p = alpha / 2, df = n - p - 1, lower.tail = FALSE).

  33. For example, '**' indicates that the p-value lies between 0.001 and 0.01.

  34. This is a context-guided decision, not a data-driven one.

  35. Notice the use of - for excluding a particular predictor.