3.4 Inference for model parameters

The assumptions introduced in the previous section allow us to specify the distribution of the random vector \(\hat{\boldsymbol{\beta}}\). The distribution is derived conditionally on the sample predictors \(\mathbf{X}_1,\ldots,\mathbf{X}_n\). In other words, we assume that the randomness of \(\mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}\) comes only from the error terms and not from the predictors. To denote this, we write the sample predictors in lowercase: \(\mathbf{x}_1,\ldots,\mathbf{x}_n\).

3.4.1 Distributions of the fitted coefficients

The distribution of \(\hat{\boldsymbol{\beta}}\) is: \[\begin{align} \hat{\boldsymbol{\beta}}\sim\mathcal{N}_{k+1}\left(\boldsymbol{\beta},\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\right) \tag{3.6} \end{align}\]

where \(\mathcal{N}_{m}\) is the \(m\)-dimensional normal, that is, the extension of the usual normal distribution to deal with \(m\) random variables (see the footnote at the end of this section). The interpretation of (3.6) is not as easy as in the simple linear case. Here are some broad remarks:

  • Bias. The estimates are unbiased.
  • Variance. It depends on:

    • Sample size \(n\). Hidden inside \(\mathbf{X}^T\mathbf{X}\). As \(n\) grows, the precision of the estimators increases.
    • Error variance \(\sigma^2\). The larger \(\sigma^2\) is, the less precise \(\hat{\boldsymbol{\beta}}\) is.
    • Predictor sparsity \((\mathbf{X}^T\mathbf{X})^{-1}\). The sparser (more spread out) the predictors are (small \(|(\mathbf{X}^T\mathbf{X})^{-1}|\)), the more precise \(\hat{\boldsymbol{\beta}}\) is.
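The remarks above can be checked empirically. The following is a minimal simulation sketch, not part of the wine example: the sample size, the coefficients in beta, the value of sigma and the design matrix X are all made-up assumptions. The predictors are held fixed and only the errors are redrawn, mirroring the conditioning on the sample predictors.

```r
# Simulated check of (3.6); every number below is a hypothetical choice
set.seed(42)
n <- 50
beta <- c(0.5, 1, -2)   # assumed (beta_0, beta_1, beta_2)
sigma <- 1.5            # assumed error standard deviation

# Fixed design: the predictors are generated once and then held constant
X <- cbind(1, rnorm(n), runif(n))

# Theoretical covariance matrix of beta_hat according to (3.6)
Sigma_beta <- sigma^2 * solve(crossprod(X))

# Refit the model M times, redrawing only the errors
M <- 5000
beta_hat <- t(replicate(M, {
  y <- drop(X %*% beta) + rnorm(n, sd = sigma)
  lm.fit(x = X, y = y)$coefficients
}))

# Empirical mean and covariance of the M estimates
colMeans(beta_hat)  # should be close to beta (unbiasedness)
cov(beta_hat)       # should be close to Sigma_beta
Sigma_beta
```

Increasing n or decreasing sigma in this sketch should shrink the entries of Sigma_beta, in line with the remarks on sample size and error variance.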
The problem with (3.6) is that \(\sigma^2\) is unknown in practice, so we need to estimate \(\sigma^2\) from the data. We do so by computing a rescaled sample variance of the residuals \(\hat\varepsilon_1,\ldots,\hat\varepsilon_n\): \[\begin{align*} \hat\sigma^2=\frac{\sum_{i=1}^n\hat\varepsilon_i^2}{n-k-1}. \end{align*}\]

Note the \(n-k-1\) in the denominator. Here \(n-k-1\) is the number of degrees of freedom: the number of data points minus the number of already fitted parameters (\(k\) slopes and \(1\) intercept). As in simple linear regression, the mean of the residuals \(\hat\varepsilon_1,\ldots,\hat\varepsilon_n\) is zero, because the model includes an intercept.
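As a quick illustration of the estimator \(\hat\sigma^2\), the sketch below computes it by hand on simulated data and compares it with the residual standard error reported by R. The data, the coefficients and the object name mod_sim are assumptions made only for this example.

```r
# Simulated data with k = 2 predictors (all values are hypothetical)
set.seed(1)
n <- 40
k <- 2
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - x2 + rnorm(n, sd = 0.5)
mod_sim <- lm(y ~ x1 + x2)

# Rescaled sample variance of the residuals
res <- residuals(mod_sim)
sigma2_hat <- sum(res^2) / (n - k - 1)

sigma2_hat
summary(mod_sim)$sigma^2  # squared residual standard error, same value
df.residual(mod_sim)      # degrees of freedom, n - k - 1
mean(res)                 # zero up to rounding error
```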

If we use the estimate \(\hat\sigma^2\) instead of \(\sigma^2\), we get more useful distributions, this time for the individual \(\beta_j\)’s: \[\begin{align} \frac{\hat\beta_j-\beta_j}{\hat{\mathrm{SE}}(\hat\beta_j)}\sim t_{n-k-1},\quad\hat{\mathrm{SE}}(\hat\beta_j)^2=\hat\sigma^2v_j^2\tag{3.7} \end{align}\]

where \(t_{n-k-1}\) represents the Student’s \(t\) distribution with \(n-k-1\) degrees of freedom and \[ v_j^2\text{ is the }j\text{-th element of the diagonal of }(\mathbf{X}^T\mathbf{X})^{-1}, \] with the diagonal indexed from \(j=0\) to \(k\). The LHS of (3.7) is the \(t\)-statistic for \(\beta_j\), \(j=0,\ldots,k\). These statistics are employed for building confidence intervals and hypothesis tests.
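The standard errors in (3.7) can be reproduced from the diagonal of \((\mathbf{X}^T\mathbf{X})^{-1}\). The sketch below does so on simulated data. The data, the vector beta_true and the name mod_sim are assumptions for illustration only, and the true coefficients are available here solely because the data are simulated.

```r
# Simulated data (hypothetical coefficients, known only because we simulate)
set.seed(1)
n <- 40
x1 <- rnorm(n)
x2 <- rnorm(n)
beta_true <- c(1, 2, -1)
y <- beta_true[1] + beta_true[2] * x1 + beta_true[3] * x2 + rnorm(n, sd = 0.5)
mod_sim <- lm(y ~ x1 + x2)

# Diagonal of (X'X)^{-1}: one entry per coefficient, intercept first
X <- model.matrix(mod_sim)
v2 <- diag(solve(crossprod(X)))

# Estimated standard errors: sqrt(sigma2_hat * v_j^2)
sigma2_hat <- sum(residuals(mod_sim)^2) / df.residual(mod_sim)
se_hat <- sqrt(sigma2_hat * v2)

# t-statistics of (3.7), centred at the true coefficients
t_stat <- (coef(mod_sim) - beta_true) / se_hat

cbind(se_hat, t_stat)
summary(mod_sim)$coefficients[, "Std. Error"]  # matches se_hat
```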

3.4.2 Confidence intervals for the coefficients

Thanks to (3.7), we can construct the \(100(1-\alpha)\%\) CI for the coefficient \(\beta_j\), \(j=0,\ldots,k\): \[\begin{align} \left(\hat\beta_j\pm\hat{\mathrm{SE}}(\hat\beta_j)t_{n-k-1;\alpha/2}\right)\tag{3.8} \end{align}\]

where \(t_{n-k-1;\alpha/2}\) is the upper \(\alpha/2\)-quantile of the \(t_{n-k-1}\) distribution, that is, the value exceeded with probability \(\alpha/2\). Note that with \(k=1\) we have the same CI as in (2.8).

Let’s see how we can compute the CIs. We return to the wine dataset, so in case you do not have it loaded, you can download it here as an .RData file. We analyse the CI for the coefficients of Price ~ Age + WinterRain.

# Fit model
mod <- lm(Price ~ Age + WinterRain, data = wine)

# Confidence intervals at 95%
confint(mod)
##                    2.5 %      97.5 %
## (Intercept)  4.746010626 7.220074676
## Age          0.007702664 0.064409106
## WinterRain  -0.001030725 0.002593278

# Confidence intervals at other levels
confint(mod, level = 0.90)
##                       5 %        95 %
## (Intercept)  4.9575969417 7.008488360
## Age          0.0125522989 0.059559471
## WinterRain  -0.0007207941 0.002283347
confint(mod, level = 0.99)
##                    0.5 %      99.5 %
## (Intercept)  4.306650310 7.659434991
## Age         -0.002367633 0.074479403
## WinterRain  -0.001674299 0.003236852

In this example, the 95% confidence interval for \(\beta_0\) is \((4.7460, 7.2201)\), for \(\beta_1\) it is \((0.0077, 0.0644)\) and for \(\beta_2\) it is \((-0.0010, 0.0026)\). Since the CI for \(\beta_2\) contains zero, the coefficient of WinterRain is not significant at level \(\alpha=0.05\). But in Section 3.1.1 we saw that it was significant in the model Price ~ Age + AGST + HarvestRain + WinterRain! How is this possible? The answer is that the presence of extra predictors affects the coefficient estimate, as we saw in Figure 3.6. Therefore, the precise statement to make is: in the model Price ~ Age + WinterRain, with \(\alpha=0.05\), the coefficient of WinterRain is not significant. Note that this does not mean that it will always be non-significant: in Price ~ Age + AGST + HarvestRain + WinterRain it is significant.
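The intervals returned by confint follow directly from (3.8). As a hedged sketch on simulated data (the data and the name mod_sim are assumptions, not the wine dataset), they can be rebuilt by hand from the estimates, the standard errors and a \(t\) quantile:

```r
# Simulated data (hypothetical coefficients)
set.seed(1)
n <- 40
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - x2 + rnorm(n, sd = 0.5)
mod_sim <- lm(y ~ x1 + x2)

alpha <- 0.05
est <- coef(mod_sim)
se_hat <- sqrt(diag(vcov(mod_sim)))  # estimated standard errors

# Upper alpha/2 quantile of the t distribution with n - k - 1 df
t_crit <- qt(1 - alpha / 2, df = df.residual(mod_sim))

# CI (3.8): estimate plus/minus standard error times the quantile
ci <- cbind(lower = est - t_crit * se_hat, upper = est + t_crit * se_hat)
ci
confint(mod_sim, level = 1 - alpha)  # same limits
```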

Compute and interpret the CIs for the coefficients, at levels α = 0.10, 0.05, 0.01, for the following regressions:

  • medv ~ . - lstat - chas - zn - crim (Boston)
  • nox ~ chas + zn + indus + lstat + dis + rad (Boston)
  • Price ~ WinterRain + HarvestRain + AGST (wine)
  • AGST ~ Year + FrancePop (wine)

3.4.3 Testing on the coefficients

The distributions in (3.7) also allow us to conduct formal hypothesis tests on the coefficients \(\beta_j\), \(j=0,\ldots,k\). For example, the test for significance is especially important: \[\begin{align*} H_0:\beta_j=0 \end{align*}\] for \(j=0,\ldots,k\). The test of \(H_0:\beta_j=0\) with \(1\leq j\leq k\) is especially interesting, since it allows us to answer whether the variable \(X_j\) has a significant linear effect on \(Y\). The statistic used for testing for significance is the \(t\)-statistic \[\begin{align*} \frac{\hat\beta_j-0}{\hat{\mathrm{SE}}(\hat\beta_j)}, \end{align*}\]

which is distributed as a \(t_{n-k-1}\) under the null hypothesis. \(H_0\) is tested against the bilateral (two-sided) alternative hypothesis \(H_1:\beta_j\neq 0\).

Remember two important insights regarding hypothesis testing.

In a hypothesis test, the p-value measures how compatible the data are with H0: the smaller the p-value, the stronger the evidence against H0. The rule of thumb is the following:

Is the p-value lower than α?

  • Yes: reject H0.
  • No: do not reject H0.

The connection between the t-test for H0 : βj = 0 at significance level α and the 100(1 − α)% CI for βj is the following.

Is 0 inside the CI for βj?

  • Yes: do not reject H0.
  • No: reject H0.
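The equivalence between both rules of thumb can be verified numerically. The sketch below uses simulated data in which one predictor, x3, has no effect by construction. The data and the name mod_sim are assumptions for illustration.

```r
# Simulated data: x3 does not enter the true model (hypothetical setup)
set.seed(123)
n <- 60
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y <- 1 + 0.8 * x1 + 0.1 * x2 + rnorm(n)
mod_sim <- lm(y ~ x1 + x2 + x3)

# p-values of the significance tests H0: beta_j = 0
p_val <- summary(mod_sim)$coefficients[, "Pr(>|t|)"]

# For each alpha, compare "p-value < alpha" with "0 outside the CI"
for (alpha in c(0.10, 0.05, 0.01)) {
  ci <- confint(mod_sim, level = 1 - alpha)
  zero_inside <- ci[, 1] <= 0 & ci[, 2] >= 0
  reject <- p_val < alpha
  print(data.frame(alpha = alpha, p_val = p_val,
                   zero_inside = zero_inside, reject = reject))
}
```

In every row, reject should be TRUE exactly when zero_inside is FALSE, since both rules are based on the same \(t\)-statistic.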

The tests for significance are built into the summary function, as we saw in Section 3. For mod, the regression of Price ~ Age + WinterRain, we have:

summary(mod)
## Call:
## lm(formula = Price ~ Age + WinterRain, data = wine)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88964 -0.51421 -0.00066  0.43103  1.06897 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.9830427  0.5993667   9.982 5.09e-10 ***
## Age         0.0360559  0.0137377   2.625   0.0149 *  
## WinterRain  0.0007813  0.0008780   0.890   0.3824    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.5769 on 24 degrees of freedom
## Multiple R-squared:  0.2371, Adjusted R-squared:  0.1736 
## F-statistic:  3.73 on 2 and 24 DF,  p-value: 0.03884
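The t value and Pr(>|t|) columns of a summary output can be reproduced by hand. A minimal sketch on simulated data follows (the data and the name mod_sim are assumptions, so the numbers differ from the wine output above):

```r
# Simulated data (hypothetical coefficients)
set.seed(1)
n <- 40
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - x2 + rnorm(n, sd = 0.5)
mod_sim <- lm(y ~ x1 + x2)

coefs <- summary(mod_sim)$coefficients

# t-statistic for H0: beta_j = 0
t_stat <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# Two-sided p-value: probability, under H0, of a t-statistic at least
# as extreme (in absolute value) as the observed one
p_val <- 2 * pt(abs(t_stat), df = df.residual(mod_sim), lower.tail = FALSE)

cbind(t_stat, p_val)
coefs[, c("t value", "Pr(>|t|)")]  # same values
```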

The unilateral test H0 : βj ≥ 0 (respectively, H0 : βj ≤ 0) vs H1 : βj < 0 (H1 : βj > 0) can be carried out by means of the CI for βj. If H0 is rejected, we can conclude that βj is significantly negative (positive) and that, for the considered regression model, Xj has a significant negative (positive) effect on Y. We have been doing these tests using the following rule of thumb:

Is the CI for βj below (above) 0 at level α?

  • Yes: reject H0 at level α. Conclude Xj has a significant negative (positive) effect on Y at level α.
  • No: the criterion is not conclusive.
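Unilateral p-values can also be computed directly from the \(t\)-statistic. The sketch below, on simulated data (the data and the name mod_sim are assumptions), computes them and compares them with the CI-based rule of thumb:

```r
# Simulated data with a positive slope (hypothetical coefficients)
set.seed(7)
n <- 50
x1 <- rnorm(n)
y <- 2 + 0.6 * x1 + rnorm(n)
mod_sim <- lm(y ~ x1)

t_stat <- summary(mod_sim)$coefficients[, "t value"]
df_res <- df.residual(mod_sim)

# H0: beta_j <= 0 vs H1: beta_j > 0 (large positive t is evidence against H0)
p_greater <- pt(t_stat, df = df_res, lower.tail = FALSE)

# H0: beta_j >= 0 vs H1: beta_j < 0 (large negative t is evidence against H0)
p_less <- pt(t_stat, df = df_res, lower.tail = TRUE)

# Rule of thumb: is the 100(1 - alpha)% CI entirely above 0?
alpha <- 0.05
ci <- confint(mod_sim, level = 1 - alpha)
ci_above_zero <- ci[, 1] > 0

cbind(p_greater, p_less, ci_above_zero)
```

The CI lies entirely above 0 exactly when p_greater is below α/2. Rejecting with the rule of thumb is therefore valid at level α, but the rule is stricter than the unilateral test at level α, which is why a "No" answer is not conclusive.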

  1. With \(m=1\), the density of a \(\mathcal{N}_{m}\) corresponds to a bell-shaped curve. With \(m=2\), the density is a surface similar to a bell.