3.4 Inference for model parameters
The assumptions introduced in the previous section allow us to specify the distribution of the random vector \(\hat{\boldsymbol{\beta}}\). The distribution is derived conditionally on the sample predictors \(\mathbf{X}_1,\ldots,\mathbf{X}_n\). In other words, we assume that the randomness of \(\mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol\varepsilon\) comes only from the error terms and not from the predictors. To emphasize this, we employ lowercase for the sample predictors \(\mathbf{x}_1,\ldots,\mathbf{x}_n\).
3.4.1 Distributions of the fitted coefficients
The distribution of \(\hat{\boldsymbol{\beta}}\) is: \[\begin{align} \hat{\boldsymbol{\beta}}\sim\mathcal{N}_{k+1}\left(\boldsymbol\beta,\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\right) \tag{3.6} \end{align}\]where \(\mathcal{N}_{m}\) is the \(m\)-dimensional Normal, that is, the extension of the usual Normal distribution to deal with \(m\) random variables^{25}. The interpretation of (3.6) is not as easy as in the simple linear case. Here are some broad remarks:
- Bias. The estimates are unbiased.
- Variance. Depending on:
  - Sample size \(n\). Hidden inside \(\mathbf{X}^T\mathbf{X}\). As \(n\) grows, the precision of the estimators increases.
  - Error variance \(\sigma^2\). The larger \(\sigma^2\) is, the less precise \(\hat{\boldsymbol{\beta}}\) is.
  - Predictor sparsity \((\mathbf{X}^T\mathbf{X})^{-1}\). The more sparse the predictor is (small \((\mathbf{X}^T\mathbf{X})^{-1}\)), the more precise \(\hat{\boldsymbol{\beta}}\) is.
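These remarks about (3.6) can be made tangible with a quick Monte Carlo sketch. The following code (a hypothetical fixed design with simulated errors, not the wine data) checks that the sample mean and covariance of the \(\hat{\boldsymbol{\beta}}\)'s approximate \(\boldsymbol\beta\) and \(\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\):

```r
# Monte Carlo sketch of (3.6), assuming a hypothetical fixed design
set.seed(123)
n <- 30
X <- cbind(1, rnorm(n), rnorm(n))   # fixed design: intercept + k = 2 predictors
beta <- c(1, 2, -1)                 # true coefficients
sigma <- 0.5                        # true error standard deviation
M <- 5000                           # Monte Carlo replicates

betaHats <- t(replicate(M, {
  y <- X %*% beta + rnorm(n, sd = sigma)       # Y = X beta + eps
  drop(solve(crossprod(X), crossprod(X, y)))   # least squares estimate
}))

colMeans(betaHats)             # approximately beta: the estimates are unbiased
cov(betaHats)                  # approximately sigma^2 * (X'X)^{-1}
sigma^2 * solve(crossprod(X))
```

Keeping \(\mathbf{X}\) fixed across the replicates mirrors the conditional-on-the-predictors viewpoint adopted in this section.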
The estimate of \(\sigma^2\) is \[\begin{align*} \hat\sigma^2=\frac{\sum_{i=1}^n\hat\varepsilon_i^2}{n-k-1}. \end{align*}\]Note the \(n-k-1\) in the denominator. Now \(n-k-1\) are the degrees of freedom, the number of data points minus the number of already fitted parameters (\(k\) slopes and \(1\) intercept). As in simple linear regression, the mean of the fitted residuals \(\hat\varepsilon_1,\ldots,\hat\varepsilon_n\) is zero.
If we use the estimate \(\hat\sigma^2\) instead of \(\sigma^2\), we get more useful distributions, this time for the individual \(\beta_j\)’s: \[\begin{align} \frac{\hat\beta_j-\beta_j}{\hat{\mathrm{SE}}(\hat\beta_j)}\sim t_{n-k-1},\quad\hat{\mathrm{SE}}(\hat\beta_j)^2=\hat\sigma^2v_j\tag{3.7} \end{align}\]where \(t_{n-k-1}\) represents the Student’s \(t\) distribution with \(n-k-1\) degrees of freedom and \[ v_j\text{ is the $j$-th element of the diagonal of }(\mathbf{X}^T\mathbf{X})^{-1}. \] The LHS of (3.7) is the \(t\)-statistic for \(\beta_j\), \(j=0,\ldots,k\). They are employed for building confidence intervals and hypothesis tests.
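The ingredients of (3.7) can be computed by hand from any fitted model. A sketch on simulated data (hypothetical variable names; any lm() fit works the same way):

```r
# Computing SE-hat(beta_j) from (3.7) by hand on simulated data
set.seed(42)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2 * x1 - x2 + rnorm(n)
mod <- lm(y ~ x1 + x2)

X <- model.matrix(mod)                            # design matrix, intercept included
k <- ncol(X) - 1                                  # number of slopes
sigma2Hat <- sum(residuals(mod)^2) / (n - k - 1)  # sigma-hat^2
v <- diag(solve(t(X) %*% X))                      # diagonal of (X'X)^{-1}
seHand <- sqrt(sigma2Hat * v)

seHand                                     # matches the "Std. Error" column
summary(mod)$coefficients[, "Std. Error"]
```

The agreement is exact, since lm() computes the standard errors by this same formula.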
3.4.2 Confidence intervals for the coefficients
Thanks to (3.7), we can have the \(100(1-\alpha)\%\) CI for the coefficient \(\beta_j\), \(j=0,\ldots,k\): \[\begin{align} \left(\hat\beta_j\pm\hat{\mathrm{SE}}(\hat\beta_j)t_{n-k-1;\alpha/2}\right)\tag{3.8} \end{align}\]where \(t_{n-k-1;\alpha/2}\) is the \(\alpha/2\)-upper quantile of the \(t_{n-k-1}\). Note that with \(k=1\) we have the same CI as in (2.8).
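Before turning to R's built-in functions, (3.8) can be assembled by hand with qt() and checked against confint(). A sketch on simulated data (hypothetical names, not the wine dataset):

```r
# Building the CI (3.8) by hand and checking it against confint()
set.seed(1)
n <- 50
x <- rnorm(n)
y <- -0.5 + x + rnorm(n)
mod <- lm(y ~ x)

est <- coef(mod)
se <- summary(mod)$coefficients[, "Std. Error"]
alpha <- 0.05
tQuant <- qt(1 - alpha / 2, df = df.residual(mod))  # t_{n-k-1; alpha/2}

cbind(est - se * tQuant, est + se * tQuant)  # (3.8)
confint(mod, level = 1 - alpha)              # same intervals
```

Note that df.residual(mod) returns exactly the \(n-k-1\) degrees of freedom discussed above.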
Let’s see how we can compute the CIs. We return to the wine dataset, so in case you do not have it loaded, you can download it here as an .RData file. We analyse the CI for the coefficients of Price ~ Age + WinterRain.
# Fit model
mod <- lm(Price ~ Age + WinterRain, data = wine)
# Confidence intervals at 95%
confint(mod)
##                    2.5 %      97.5 %
## (Intercept)  4.746010626 7.220074676
## Age          0.007702664 0.064409106
## WinterRain  -0.001030725 0.002593278
# Confidence intervals at other levels
confint(mod, level = 0.90)
##                      5 %        95 %
## (Intercept)  4.9575969417 7.008488360
## Age          0.0125522989 0.059559471
## WinterRain  -0.0007207941 0.002283347
confint(mod, level = 0.99)
##                    0.5 %      99.5 %
## (Intercept)  4.306650310 7.659434991
## Age         -0.002367633 0.074479403
## WinterRain  -0.001674299 0.003236852
In this example, the 95% confidence interval for \(\beta_0\) is \((4.7460, 7.2201)\), for \(\beta_1\) is \((0.0077, 0.0644)\), and for \(\beta_2\) is \((-0.0010, 0.0026)\). Since the interval for \(\beta_2\) contains zero, we can say with a 95% confidence that the coefficient of WinterRain is not significant. But in Section 3.1.1 we saw that it was significant in the model Price ~ Age + AGST + HarvestRain + WinterRain! How is this possible? The answer is that the presence of extra predictors affects the coefficient estimate, as we saw in Figure 3.6. Therefore, the precise statement to make is: in the model Price ~ Age + WinterRain, with \(\alpha=0.05\), the coefficient of WinterRain is not significant. Note that this does not mean that it will always be non-significant: in Price ~ Age + AGST + HarvestRain + WinterRain it is significant.
Compute and interpret the CIs for the coefficients, at levels \(\alpha = 0.10, 0.05, 0.01\), for the following regressions:

- medv ~ . - lstat - chas - zn - crim (Boston)
- nox ~ chas + zn + indus + lstat + dis + rad (Boston)
- Price ~ WinterRain + HarvestRain + AGST (wine)
- AGST ~ Year + FrancePop (wine)
3.4.3 Testing on the coefficients
The distributions in (3.7) also allow us to conduct formal hypothesis tests on the coefficients \(\beta_j\), \(j=0,\ldots,k\). For example, the test for significance is especially important: \[\begin{align*} H_0:\beta_j=0 \end{align*}\] for \(j=0,\ldots,k\). The test of \(H_0:\beta_j=0\) with \(1\leq j\leq k\) is especially interesting, since it allows us to answer whether the variable \(X_j\) has a significant linear effect on \(Y\). The statistic used for testing for significance is the \(t\)-statistic \[\begin{align*} \frac{\hat\beta_j-0}{\hat{\mathrm{SE}}(\hat\beta_j)}, \end{align*}\]which is distributed as a \(t_{n-k-1}\) under the (veracity of) the null hypothesis. \(H_0\) is tested against the bilateral alternative hypothesis \(H_1:\beta_j\neq 0\).
Remember two important insights regarding hypothesis testing.
In a hypothesis test, the p-value measures the degree of veracity of H_{0} according to the data. The rule of thumb is the following:

Is the p-value lower than α?

- Yes → reject H_{0}.
- No → do not reject H_{0}.
The connection between a t-test for H_{0} : β_{j} = 0 and the CI for β_{j}, both at level α, is the following:

Is 0 inside the CI for β_{j}?

- Yes ↔ do not reject H_{0}.
- No ↔ reject H_{0}.
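This duality can be verified numerically. A sketch on simulated data (hypothetical names; the equivalence holds exactly for any lm() fit and any α):

```r
# Checking the CI / t-test duality on a simulated fit
set.seed(7)
x <- rnorm(60)
y <- 1 + 0.1 * x + rnorm(60)   # weak effect of x
mod <- lm(y ~ x)

alpha <- 0.05
pVal <- summary(mod)$coefficients["x", "Pr(>|t|)"]
ci <- confint(mod, level = 1 - alpha)["x", ]
zeroInside <- ci[1] < 0 && 0 < ci[2]

# p-value >= alpha exactly when 0 lies inside the CI
c(pValue = pVal, zeroInside = zeroInside)
```

The equivalence is exact because both the CI and the p-value are built from the same t-statistic and the same \(t_{n-k-1}\) quantiles.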
The tests for significance are built into the summary function, as we saw in Section ??. For mod, the regression of Price ~ Age + WinterRain, we have:
summary(mod)
##
## Call:
## lm(formula = Price ~ Age + WinterRain, data = wine)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.88964 -0.51421  0.00066  0.43103  1.06897
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  5.9830427  0.5993667   9.982 5.09e-10 ***
## Age          0.0360559  0.0137377   2.625   0.0149 *
## WinterRain   0.0007813  0.0008780   0.890   0.3824
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5769 on 24 degrees of freedom
## Multiple R-squared:  0.2371, Adjusted R-squared:  0.1736
## F-statistic:  3.73 on 2 and 24 DF,  p-value: 0.03884
The unilateral test H_{0} : β_{j} ≥ 0 (respectively, H_{0} : β_{j} ≤ 0) vs H_{1} : β_{j} < 0 (H_{1} : β_{j} > 0) can be done by means of the CI for β_{j}. If H_{0} is rejected, it allows us to conclude that \(\hat\beta_j\) is significantly negative (positive) and that, for the considered regression model, X_{j} has a significant negative (positive) effect on Y. We have been doing these tests using the following rule of thumb:
Is the CI for β_{j} below (above) 0 at level α?

- Yes → reject H_{0} at level α. Conclude X_{j} has a significant negative (positive) effect on Y at level α.
- No → the criterion is not conclusive.
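The unilateral p-values can also be computed directly from the t-statistic that summary() reports. A sketch on simulated data (hypothetical variable names):

```r
# One-sided p-values from the t-statistic of summary()
set.seed(3)
x <- rnorm(40)
y <- 2 + 0.5 * x + rnorm(40)
mod <- lm(y ~ x)

tStat <- summary(mod)$coefficients["x", "t value"]
dof <- df.residual(mod)   # n - k - 1

# H0: beta_1 <= 0 vs H1: beta_1 > 0 -> upper tail of t_{n-k-1}
pUpper <- pt(tStat, df = dof, lower.tail = FALSE)
# H0: beta_1 >= 0 vs H1: beta_1 < 0 -> lower tail
pLower <- pt(tStat, df = dof)

# The bilateral p-value reported by summary() doubles the smaller tail
pTwoSided <- 2 * min(pUpper, pLower)
```

This makes explicit that the unilateral tests use the same statistic as the bilateral one, only with a single tail of the \(t_{n-k-1}\).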
With \(m=1\), the density of a \(\mathcal{N}_{m}\) corresponds to a bell-shaped curve. With \(m=2\), the density is a surface similar to a bell.↩