3.4 Inference for model parameters
The assumptions introduced in the previous section allow us to specify the distribution of the random vector \hat{\boldsymbol{\beta}}. The distribution is derived conditionally on the sample predictors \mathbf{X}_1,\ldots,\mathbf{X}_n. In other words, we assume that the randomness of \mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon} comes only from the error terms and not from the predictors. To denote this, we employ lowercase for the sample predictors \mathbf{x}_1,\ldots,\mathbf{x}_n.
3.4.1 Distributions of the fitted coefficients
The distribution of \hat{\boldsymbol{\beta}} is: \begin{align} \hat{\boldsymbol{\beta}}\sim\mathcal{N}_{k+1}\left(\boldsymbol{\beta},\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\right) \tag{3.6} \end{align} where \mathcal{N}_{m} is the m-dimensional normal, that is, the extension of the usual normal distribution to deal with m random variables24. The interpretation of (3.6) is not as easy as in the simple linear case. Here are some broad remarks:
Bias. The estimates are unbiased.
Variance. It depends on:
- Sample size n. Hidden inside \mathbf{X}^T\mathbf{X}. As n grows, the precision of the estimators increases.
- Error variance \sigma^2. The larger \sigma^2 is, the less precise \hat{\boldsymbol{\beta}} is.
- Predictor dispersion, through (\mathbf{X}^T\mathbf{X})^{-1}. The more spread out the predictors are (small |(\mathbf{X}^T\mathbf{X})^{-1}|), the more precise \hat{\boldsymbol{\beta}} is.
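A small simulation can illustrate (3.6). The sketch below is ours and the choices of \boldsymbol{\beta}, \sigma, and predictors are arbitrary, made only for illustration: the empirical mean and covariance of the coefficients fitted over many samples (all generated with the same predictors) should approximate \boldsymbol{\beta} and \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}.

# Illustration of (3.6) by simulation -- arbitrary beta, sigma, and predictors
set.seed(1234)
n <- 50
beta <- c(1, -0.5, 2) # beta_0, beta_1, beta_2
sigma <- 1.5
x1 <- rnorm(n)
x2 <- runif(n)
X <- cbind(1, x1, x2) # design matrix, with a column of ones for the intercept

# Refit the model on M samples generated with the same predictors
M <- 1000
betaHat <- t(replicate(M, {
  y <- drop(X %*% beta) + rnorm(n, sd = sigma)
  coef(lm(y ~ x1 + x2))
}))

colMeans(betaHat) # approximately beta (unbiasedness)
cov(betaHat) # approximately sigma^2 * (X'X)^{-1}
sigma^2 * solve(crossprod(X))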
The problem with (3.6) is that \sigma^2 is unknown in practice, so we need to estimate it from the data. We do so by computing a rescaled sample variance of the residuals \hat\varepsilon_1,\ldots,\hat\varepsilon_n: \begin{align*} \hat\sigma^2=\frac{\sum_{i=1}^n\hat\varepsilon_i^2}{n-k-1}. \end{align*} Note the n-k-1 in the denominator: these are the degrees of freedom, the number of data points minus the number of already-fitted parameters (k slopes and 1 intercept). As in simple linear regression, the mean of the residuals \hat\varepsilon_1,\ldots,\hat\varepsilon_n is zero.
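In R, \hat\sigma^2 can be computed directly from the residuals of a fitted model. A minimal sketch, using the regression Price ~ Age + WinterRain on the wine dataset analyzed below (the auxiliary object name modAux is ours); the result coincides with the squared “Residual standard error” reported by summary.

# hat sigma^2 from the residuals (illustration with the model used below)
modAux <- lm(Price ~ Age + WinterRain, data = wine)
n <- nobs(modAux) # number of observations
k <- length(coef(modAux)) - 1 # number of slopes
sum(residuals(modAux)^2) / (n - k - 1) # hat sigma^2
summary(modAux)$sigma^2 # the same quantity, as reported by summary()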
If we use the estimate \hat\sigma^2 instead of \sigma^2, we get more useful distributions, this time for the individual \beta_j’s: \begin{align} \frac{\hat\beta_j-\beta_j}{\hat{\mathrm{SE}}(\hat\beta_j)}\sim t_{n-k-1},\quad\hat{\mathrm{SE}}(\hat\beta_j)^2=\hat\sigma^2v_j^2\tag{3.7} \end{align} where t_{n-k-1} represents the Student’s t distribution with n-k-1 degrees of freedom and v_j^2\text{ is the }j\text{-th element of the diagonal of }(\mathbf{X}^T\mathbf{X})^{-1}. The LHS of (3.7) is the t-statistic for \beta_j, j=0,\ldots,k. These statistics are employed for building confidence intervals and hypothesis tests.
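The standard errors \hat{\mathrm{SE}}(\hat\beta_j) can be computed by hand from the design matrix. A minimal sketch, continuing with the auxiliary fit modAux from above; the result matches the “Std. Error” column of summary.

# Manual standard errors: hat SE(hat beta_j)^2 = hat sigma^2 * v_j^2
X <- model.matrix(modAux) # design matrix, including the intercept column
sigma2Hat <- sum(residuals(modAux)^2) / df.residual(modAux) # hat sigma^2
v2 <- diag(solve(crossprod(X))) # v_j^2, diagonal of (X'X)^{-1}
sqrt(sigma2Hat * v2) # hat SE(hat beta_j)
summary(modAux)$coefficients[, "Std. Error"] # same, from summary()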
3.4.2 Confidence intervals for the coefficients
Thanks to (3.7), we can obtain the 100(1-\alpha)\% CI for the coefficient \beta_j, j=0,\ldots,k: \begin{align} \left(\hat\beta_j\pm\hat{\mathrm{SE}}(\hat\beta_j)t_{n-k-1;\alpha/2}\right)\tag{3.8} \end{align} where t_{n-k-1;\alpha/2} is the \alpha/2-upper quantile of the t_{n-k-1} distribution. Note that with k=1 we have the same CI as in (2.8).
Let’s see how we can compute the CIs. We return to the wine dataset, so in case you do not have it loaded, you can download it here as an .RData file. We analyze the CIs for the coefficients of Price ~ Age + WinterRain.
# Fit model
mod <- lm(Price ~ Age + WinterRain, data = wine)
# Confidence intervals at 95%
confint(mod)
## 2.5 % 97.5 %
## (Intercept) 4.746010626 7.220074676
## Age 0.007702664 0.064409106
## WinterRain -0.001030725 0.002593278
# Confidence intervals at other levels
confint(mod, level = 0.90)
## 5 % 95 %
## (Intercept) 4.9575969417 7.008488360
## Age 0.0125522989 0.059559471
## WinterRain -0.0007207941 0.002283347
confint(mod, level = 0.99)
## 0.5 % 99.5 %
## (Intercept) 4.306650310 7.659434991
## Age -0.002367633 0.074479403
## WinterRain -0.001674299 0.003236852
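These intervals follow exactly (3.8). As a check, a minimal sketch that reproduces the 95% CIs by hand from the coefficient table:

# Reproducing (3.8) by hand for the 95% CIs
alpha <- 0.05
betaHat <- coef(mod)
se <- summary(mod)$coefficients[, "Std. Error"]
tQuant <- qt(1 - alpha / 2, df = df.residual(mod)) # t_{n-k-1; alpha/2}
cbind(betaHat - tQuant * se, betaHat + tQuant * se) # same as confint(mod)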
In this example, the 95% confidence interval for \beta_0 is (4.7460, 7.2201), for \beta_1 is (0.0077, 0.0644), and for \beta_2 is (-0.0010, 0.0026). Therefore, we can say with 95% confidence that the coefficient of WinterRain is not significant. But in Section 3.1.1 we saw that it was significant in the model Price ~ Age + AGST + HarvestRain + WinterRain! How is this possible? The answer is that the presence of extra predictors affects the coefficient estimate, as we saw in Figure 3.6. Therefore, the precise statement to make is: in the model Price ~ Age + WinterRain, with \alpha=0.05, the coefficient of WinterRain is not significant. Note that this does not mean that it will always be non-significant: in Price ~ Age + AGST + HarvestRain + WinterRain it is significant.
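A quick way of double-checking this is to inspect the CI for WinterRain in the larger model (the object name modBig is ours, just for illustration):

# CI for WinterRain in the larger model of Section 3.1.1
modBig <- lm(Price ~ Age + AGST + HarvestRain + WinterRain, data = wine)
confint(modBig)["WinterRain", ] # excludes zero, consistently with Section 3.1.1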
Compute and interpret the CIs for the coefficients, at levels \alpha=0.10,0.05,0.01, for the following regressions:
- medv ~ . - lstat - chas - zn - crim (Boston)
- nox ~ chas + zn + indus + lstat + dis + rad (Boston)
- Price ~ WinterRain + HarvestRain + AGST (wine)
- AGST ~ Year + FrancePop (wine)
3.4.3 Testing on the coefficients
The distributions in (3.7) also allow us to conduct a formal hypothesis test on the coefficients \beta_j, j=0,\ldots,k. For example, the test for significance is especially important: \begin{align*} H_0:\beta_j=0 \end{align*} for j=0,\ldots,k. The test of H_0:\beta_j=0 with 1\leq j\leq k is especially interesting, since it allows us to answer whether the variable X_j has a significant linear effect on Y. The statistic used for testing for significance is the t-statistic \begin{align*} \frac{\hat\beta_j-0}{\hat{\mathrm{SE}}(\hat\beta_j)}, \end{align*} which is distributed as a t_{n-k-1} under the null hypothesis. H_0 is tested against the bilateral alternative hypothesis H_1:\beta_j\neq 0.
Remember two important insights regarding hypothesis testing.
In a hypothesis test, the p-value measures how compatible the data are with H_0: the smaller the p-value, the stronger the evidence against H_0. The rule of thumb is the following:
Is the p-value lower than \alpha?
- Yes \rightarrow reject H_0.
- No \rightarrow do not reject H_0.
The connection between the t-test for H_0:\beta_j=0 and the CI for \beta_j, both at level \alpha, is the following.
Is 0 inside the CI for \beta_j?
- Yes \leftrightarrow do not reject H_0.
- No \leftrightarrow reject H_0.
The tests for significance are built into the summary function, as we saw in Section 3. For mod, the regression of Price ~ Age + WinterRain, we have:
summary(mod)
##
## Call:
## lm(formula = Price ~ Age + WinterRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88964 -0.51421 -0.00066 0.43103 1.06897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.9830427 0.5993667 9.982 5.09e-10 ***
## Age 0.0360559 0.0137377 2.625 0.0149 *
## WinterRain 0.0007813 0.0008780 0.890 0.3824
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5769 on 24 degrees of freedom
## Multiple R-squared: 0.2371, Adjusted R-squared: 0.1736
## F-statistic: 3.73 on 2 and 24 DF, p-value: 0.03884
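The “t value” and “Pr(>|t|)” columns are precisely the t-statistic of (3.7) evaluated at \beta_j=0 and its two-sided p-value. A minimal sketch reproducing them from the coefficient table:

# Reproducing the t-statistics and two-sided p-values of summary(mod)
coefs <- summary(mod)$coefficients
tStat <- coefs[, "Estimate"] / coefs[, "Std. Error"] # t-statistics
2 * pt(abs(tStat), df = df.residual(mod), lower.tail = FALSE) # two-sided p-values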
The unilateral test H_0:\beta_j\geq 0 (respectively, H_0:\beta_j\leq 0) vs H_1:\beta_j<0 (H_1:\beta_j>0) can be done by means of the CI for \beta_j. If H_0 is rejected, we can conclude that \hat\beta_j is significantly negative (positive) and that, for the considered regression model, X_j has a significant negative (positive) effect on Y. We have been doing this using the following rule of thumb:
Is the CI for \beta_j below (above) 0 at level \alpha?
- Yes \rightarrow reject H_0 at level \alpha. Conclude X_j has a significant negative (positive) effect on Y at level \alpha.
- No \rightarrow the criterion is not conclusive.
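Alternatively to the CI rule of thumb, the one-sided p-values can be obtained directly from the t-statistic t: the p-value for H_1:\beta_j<0 is P(t_{n-k-1}\leq t) and the one for H_1:\beta_j>0 is P(t_{n-k-1}\geq t). A minimal sketch (our own illustration, for mod):

# One-sided p-values from the t-statistics of mod (illustration)
coefs <- summary(mod)$coefficients
tStat <- coefs[, "Estimate"] / coefs[, "Std. Error"]
pt(tStat, df = df.residual(mod)) # p-values for H1: beta_j < 0
pt(tStat, df = df.residual(mod), lower.tail = FALSE) # p-values for H1: beta_j > 0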
With m=1, the density of a \mathcal{N}_{m} corresponds to the usual bell-shaped curve. With m=2, the density is a bell-shaped surface.