2.4 Inference for model parameters
The assumptions introduced in the previous section allow us to specify the distribution of the random vector $\hat{\boldsymbol{\beta}}$. The distribution is derived conditionally on the predictors' sample $\mathbf{X}_1, \ldots, \mathbf{X}_n$. In other words, we assume that the randomness of $Y$ comes only from the error terms and not from the predictors.25 To denote this, we employ lowercase for the predictors' sample, $\mathbf{x}_1, \ldots, \mathbf{x}_n$.
2.4.1 Distributions of the fitted coefficients
The distribution of $\hat{\boldsymbol{\beta}}$ is:

$$\hat{\boldsymbol{\beta}} \sim \mathcal{N}_{p+1}\left(\boldsymbol{\beta}, \sigma^2 (\mathbf{X}'\mathbf{X})^{-1}\right). \tag{2.11}$$
This result can be obtained from the form of $\hat{\boldsymbol{\beta}}$ given in (2.7), the sample version of the model assumptions given in (2.10), and the linear transformation property of a normal given in (1.4). Equation (2.11) implies that the marginal distribution of $\hat\beta_j$ is

$$\hat\beta_j \sim \mathcal{N}\left(\beta_j, \mathrm{SE}(\hat\beta_j)^2\right), \quad j = 0, \ldots, p, \tag{2.12}$$

where $\mathrm{SE}(\hat\beta_j) = \sigma\sqrt{v_j}$ is the standard error of $\hat\beta_j$, and $v_j$ is the $j$-th element of the diagonal of $(\mathbf{X}'\mathbf{X})^{-1}$.
Recall that an equivalent form for (2.12) is (why?)

$$\frac{\hat\beta_j - \beta_j}{\mathrm{SE}(\hat\beta_j)} \sim \mathcal{N}(0, 1).$$
The interpretation of (2.12) is simpler in the case with $p = 1$, where

$$\hat\beta_0 \sim \mathcal{N}\left(\beta_0, \mathrm{SE}(\hat\beta_0)^2\right), \quad \hat\beta_1 \sim \mathcal{N}\left(\beta_1, \mathrm{SE}(\hat\beta_1)^2\right), \tag{2.13}$$

with

$$\mathrm{SE}(\hat\beta_0)^2 = \frac{\sigma^2}{n}\left[1 + \frac{\bar{X}^2}{s_x^2}\right], \quad \mathrm{SE}(\hat\beta_1)^2 = \frac{\sigma^2}{n s_x^2}, \tag{2.14}$$

where $\bar{X}$ and $s_x^2$ denote the sample mean and sample variance of $X_1, \ldots, X_n$, respectively.
Some insights on (2.13) and (2.14), illustrated interactively in Figure 2.13, are the following:
- Bias. Both estimates are unbiased. That means that their expectations are the true coefficients, $\mathbb{E}[\hat\beta_0] = \beta_0$ and $\mathbb{E}[\hat\beta_1] = \beta_1$, for any sample size $n$.
- Variance. The variances $\mathrm{SE}(\hat\beta_0)^2$ and $\mathrm{SE}(\hat\beta_1)^2$ have interesting interpretations in terms of their components:
  - Sample size $n$. As the sample size grows, the precision of the estimators increases, since both variances decrease.
  - Error variance $\sigma^2$. The more disperse the error is, the less precise the estimates are, since more vertical variability is present.
  - Predictor variance $s_x^2$. If the predictor is spread out (large $s_x^2$), then it is easier to fit a regression line: we have information about the data trend over a long interval. If $s_x^2$ is small, then all the data is concentrated on a narrow vertical band, so we have a much more limited view of the trend.
  - Mean $\bar{X}$. It has influence only on the precision of $\hat\beta_0$. The larger $\bar{X}^2$ is, the less precise $\hat\beta_0$ is.
Figure 2.13: Illustration of the randomness of the fitted coefficients $(\hat\beta_0, \hat\beta_1)$ and the influence of $n$, $\sigma^2$, $s_x^2$, and $\bar{X}$ on it. The predictors' sample is fixed and new responses are generated each time from a linear model. Application available here.
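These insights can also be checked numerically without the interactive application. The following is a minimal simulation sketch in the spirit of Figure 2.13 (the true coefficients, $\sigma$, $n$, and the number of replicates are arbitrary choices): the predictors' sample is kept fixed, new responses are generated repeatedly, and the empirical means and variances of the fitted coefficients are compared with (2.13) and (2.14).

# Sketch: fixed predictors' sample, new responses each time
# (beta0 = 0.5, beta1 = 2, sigma = 1, n = 50, M = 1000 are arbitrary choices)
set.seed(12345)
n <- 50
x <- rnorm(n, mean = 1, sd = 2) # Fixed predictors' sample
beta0 <- 0.5; beta1 <- 2; sigma <- 1
M <- 1000
coefs <- t(replicate(M, {
  y <- beta0 + beta1 * x + rnorm(n, sd = sigma) # New responses
  coef(lm(y ~ x))
}))
colMeans(coefs) # Close to (beta0, beta1): unbiasedness
apply(coefs, 2, var) # Empirical variances of the fitted coefficients
Sxx <- sum((x - mean(x))^2)
c(sigma^2 * (1 / n + mean(x)^2 / Sxx), sigma^2 / Sxx) # Theoretical variances, (2.14)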
The insights about (2.11) are more convoluted. The following broad remarks, extensions of what happened when $p = 1$, apply:
- Bias. All the estimates are unbiased for any sample size $n$.
- Variance. It depends on:
  - Sample size $n$. Hidden inside $(\mathbf{X}'\mathbf{X})^{-1}$. As $n$ grows, the precision of the estimators increases.
  - Error variance $\sigma^2$. The larger $\sigma^2$ is, the less precise $\hat{\boldsymbol{\beta}}$ is.
  - Predictor sparsity $(\mathbf{X}'\mathbf{X})^{-1}$. The more "disperse"26 the predictors are, the more precise $\hat{\boldsymbol{\beta}}$ is.
The problem with the result in (2.11) is that $\sigma^2$ is unknown in practice. Therefore, we need to estimate $\sigma^2$ in order to use a result similar to (2.11). We do so by computing a rescaled sample variance of the residuals $\hat\varepsilon_1, \ldots, \hat\varepsilon_n$:

$$\hat\sigma^2 := \frac{\sum_{i=1}^n \hat\varepsilon_i^2}{n - p - 1}. \tag{2.15}$$

Note the $n - p - 1$ in the denominator. This factor represents the degrees of freedom: the number of data points minus the number of parameters already27 fitted with the data ($p$ slopes plus $1$ intercept). For the interpretation of $\hat\sigma^2$, it is key to realize that the mean of the residuals is zero, that is, $\frac{1}{n}\sum_{i=1}^n \hat\varepsilon_i = 0$. Therefore, $\hat\sigma^2$ is indeed a rescaled sample variance of the residuals which estimates the variance of $\varepsilon$.28 It can be seen that $\hat\sigma^2$ is unbiased as an estimator of $\sigma^2$.
If we use the estimate $\hat\sigma^2$ instead of $\sigma^2$, we get more useful29 distributions than (2.12):

$$\frac{\hat\beta_j - \beta_j}{\hat{\mathrm{SE}}(\hat\beta_j)} \sim t_{n-p-1}, \quad \hat{\mathrm{SE}}(\hat\beta_j)^2 = \hat\sigma^2 v_j, \quad j = 0, \ldots, p, \tag{2.16}$$

where $t_{n-p-1}$ represents the Student's $t$ distribution with $n - p - 1$ degrees of freedom.

The LHS of (2.16) is the $t$-statistic for $\beta_j$, $j = 0, \ldots, p$. We will employ them for building confidence intervals and hypothesis tests in what follows.
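As a sanity check of (2.15) and (2.16), the standard errors and $t$-statistics reported by R can be reproduced by hand from the design matrix. The following is a minimal sketch on simulated data (the coefficients and sample size are arbitrary choices); the same computation applies to any lm fit.

# Sketch: reproduce the standard errors and t-statistics of summary() by hand
set.seed(12345)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 0.5 * x1 - 0.25 * x2 + rnorm(n, sd = 0.5)
fit <- lm(y ~ x1 + x2)
X <- model.matrix(fit) # Design matrix, including the column of ones
p <- ncol(X) - 1
sigma2Hat <- sum(residuals(fit)^2) / (n - p - 1) # Estimate of sigma^2, (2.15)
v <- diag(solve(crossprod(X))) # Diagonal of (X'X)^{-1}
seHat <- sqrt(sigma2Hat * v) # Estimated standard errors
cbind(seHat, tStat = coef(fit) / seHat)
summary(fit)$coefficients[, c("Std. Error", "t value")] # Same values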
2.4.2 Confidence intervals for the coefficients
Thanks to (2.16), we can obtain the $100(1 - \alpha)\%$ Confidence Intervals (CI) for the coefficient $\beta_j$, $j = 0, \ldots, p$:

$$\left(\hat\beta_j \pm \hat{\mathrm{SE}}(\hat\beta_j)\, t_{n-p-1;\alpha/2}\right),$$

where $t_{n-p-1;\alpha/2}$ is the $\alpha/2$-upper quantile of the $t_{n-p-1}$ distribution. Usually, $\alpha = 0.10, 0.05, 0.01$ are considered.
Figure 2.14: Illustration of the randomness of the CI for $\beta_0$ at $95\%$ confidence. The plot shows 100 random CIs for $\beta_0$, computed from 100 random datasets generated by the same simple linear model with a fixed intercept $\beta_0$. The illustration for $\beta_1$ is completely analogous. Application available here.

This random CI contains the unknown coefficient $\beta_j$ "with a probability of $1 - \alpha$". The previous quoted statement has to be understood as follows. Suppose you have 100 samples generated according to a linear model. If you compute the CI for a coefficient in each of them, then in approximately $100(1 - \alpha)$ of the samples the true coefficient would actually be inside the random CI. Note also that the CI is symmetric around $\hat\beta_j$. This is illustrated in Figure 2.14.
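This interpretation can be checked numerically. The following sketch (with arbitrary choices of the true coefficients, $\sigma$, $n$, and the number of replicates) computes the proportion of $95\%$ CIs for $\beta_1$ that contain the true value, which should be close to $0.95$.

# Sketch: empirical coverage of the 95% CI for beta1
set.seed(12345)
n <- 30; beta0 <- 0.5; beta1 <- 2; sigma <- 1
x <- rnorm(n) # Fixed predictors' sample
M <- 1000
covers <- replicate(M, {
  y <- beta0 + beta1 * x + rnorm(n, sd = sigma)
  ci <- confint(lm(y ~ x), level = 0.95)["x", ]
  (ci[1] < beta1) & (beta1 < ci[2])
})
mean(covers) # Proportion of CIs containing beta1, close to 0.95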
2.4.3 Testing on the coefficients
The distributions in (2.16) also allow us to conduct formal hypothesis tests on the coefficients $\beta_j$, $j = 0, \ldots, p$. For example, the test for significance30 is especially important, that is, the test of the hypotheses

$$H_0: \beta_j = 0$$

for $j = 0, \ldots, p$. The test of $H_0: \beta_j = 0$ with $j \neq 0$ is especially interesting, since it allows us to answer whether the variable $X_j$ has a significant linear effect on $Y$. The statistic used for testing for significance is the $t$-statistic

$$\frac{\hat\beta_j - 0}{\hat{\mathrm{SE}}(\hat\beta_j)},$$

which is distributed as a $t_{n-p-1}$ under the (veracity of the) null hypothesis.31

The null hypothesis $H_0$ is tested against the alternative hypothesis, $H_1$. If $H_0$ is rejected, it is rejected in favor of $H_1$. The alternative hypothesis can be two-sided (we will focus mostly on these alternatives), such as

$$H_0: \beta_j = 0 \quad \text{vs.} \quad H_1: \beta_j \neq 0,$$

or one-sided, such as

$$H_0: \beta_j = 0 \quad \text{vs.} \quad H_1: \beta_j < 0 \ (\text{or } H_1: \beta_j > 0).$$

The test based on the $t$-statistic is referred to as the $t$-test. It rejects $H_0: \beta_j = 0$ (against $H_1: \beta_j \neq 0$) at significance level $\alpha$ for large absolute values of the $t$-statistic, precisely for those above the $\alpha/2$-upper quantile of the $t_{n-p-1}$ distribution. That is, it rejects $H_0$ at level $\alpha$ if $\left|\frac{\hat\beta_j}{\hat{\mathrm{SE}}(\hat\beta_j)}\right| > t_{n-p-1;\alpha/2}$.32 For the one-sided tests, it rejects $H_0$ against $H_1: \beta_j < 0$ or $H_1: \beta_j > 0$ if $\frac{\hat\beta_j}{\hat{\mathrm{SE}}(\hat\beta_j)} < -t_{n-p-1;\alpha}$ or $\frac{\hat\beta_j}{\hat{\mathrm{SE}}(\hat\beta_j)} > t_{n-p-1;\alpha}$, respectively.
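Equivalently, the two-sided $p$-value is the probability that a $t_{n-p-1}$ exceeds the observed $t$-statistic in absolute value, and it can be recomputed from the output of lm with pt. The sketch below uses simulated data (arbitrary choices); the computation is identical for any fit.

# Sketch: two-sided p-values recomputed from the t-statistics
set.seed(12345)
n <- 50
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)
fit <- lm(y ~ x)
p <- 1 # Number of predictors
tStats <- summary(fit)$coefficients[, "t value"]
2 * pt(abs(tStats), df = n - p - 1, lower.tail = FALSE) # Two-sided p-values
summary(fit)$coefficients[, "Pr(>|t|)"] # Same values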
Remember the following insights about hypothesis testing.
In a hypothesis test, the $p$-value measures the degree of veracity of $H_0$ according to the data. The rule of thumb is the following:

Is the $p$-value lower than $\alpha$?

- Yes → reject $H_0$.
- No → do not reject $H_0$.
The connection between the $t$-test for $H_0: \beta_j = 0$ and the CI for $\beta_j$, both at level $\alpha$, is the following (see the numerical check after these rules of thumb):

Is $0$ inside the CI for $\beta_j$?

- Yes → do not reject $H_0$.
- No → reject $H_0$.
The one-sided test $H_0: \beta_j = 0$ vs. $H_1: \beta_j < 0$ (respectively, $H_1: \beta_j > 0$) can be done by means of the CI for $\beta_j$. If $H_0$ is rejected, it allows us to conclude that $\hat\beta_j$ is significantly negative (positive) and that, for the considered regression model, $X_j$ has a significant negative (positive) effect on $Y$. The rule of thumb is the following:

Is the CI for $\beta_j$, at level $2\alpha$, below (above) $0$?

- Yes → reject $H_0$ at level $\alpha$. Conclude $X_j$ has a significant negative (positive) effect on $Y$ at level $\alpha$.
- No → the criterion is not conclusive.
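The equivalence between the two-sided $t$-test and the CI can be verified numerically. The following sketch uses simulated data with arbitrary choices (the weak slope makes it likely that both a rejection and a non-rejection appear) and checks that rejecting $H_0: \beta_j = 0$ at level $\alpha$ happens exactly when $0$ falls outside the $100(1-\alpha)\%$ CI.

# Sketch: numerical check of the CI / t-test equivalence
set.seed(12345)
n <- 50
x <- rnorm(n)
y <- 1 + 0.1 * x + rnorm(n) # Weak effect of x
fit <- lm(y ~ x)
alpha <- 0.05
pValues <- summary(fit)$coefficients[, "Pr(>|t|)"]
ci <- confint(fit, level = 1 - alpha)
zeroInside <- (ci[, 1] < 0) & (0 < ci[, 2])
cbind(reject = pValues < alpha, zeroInside = zeroInside) # Columns always opposite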
2.4.4 Case study application
Let's analyze the multiple linear model we have considered for the `wine` dataset, now that we know how to make inference on the model parameters. The relevant information is obtained with the `summary` of the model:
# Fit
modWine1 <- lm(Price ~ ., data = wine)
# Summary
sumModWine1 <- summary(modWine1)
sumModWine1
##
## Call:
## lm(formula = Price ~ ., data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46541 -0.24133 0.00413 0.18974 0.52495
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.343e+00 7.697e+00 -0.304 0.76384
## WinterRain 1.153e-03 4.991e-04 2.311 0.03109 *
## AGST 6.144e-01 9.799e-02 6.270 3.22e-06 ***
## HarvestRain -3.837e-03 8.366e-04 -4.587 0.00016 ***
## Age 1.377e-02 5.821e-02 0.237 0.81531
## FrancePop -2.213e-05 1.268e-04 -0.175 0.86313
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.293 on 21 degrees of freedom
## Multiple R-squared: 0.8278, Adjusted R-squared: 0.7868
## F-statistic: 20.19 on 5 and 21 DF, p-value: 2.232e-07
# Contains the estimation of sigma ("Residual standard error")
sumModWine1$sigma
## [1] 0.2930287
# Which is the same as
sqrt(sum(modWine1$residuals^2) / modWine1$df.residual)
## [1] 0.2930287
The `Coefficients` block of the `summary` output contains the following elements regarding the significance of each coefficient $\beta_j$, this is, of the test $H_0: \beta_j = 0$ vs. $H_1: \beta_j \neq 0$:

- `Estimate`: least squares estimate $\hat\beta_j$.
- `Std. Error`: estimated standard error $\hat{\mathrm{SE}}(\hat\beta_j)$.
- `t value`: $t$-statistic $\hat\beta_j / \hat{\mathrm{SE}}(\hat\beta_j)$.
- `Pr(>|t|)`: $p$-value of the $t$-test.
- `Signif. codes`: codes indicating the size of the $p$-value. The more asterisks, the more evidence supporting that $H_0$ does not hold.33
Note that a high proportion of predictors are not significant in `modWine1`: `FrancePop` and `Age` are not significant (and neither is the intercept). This is an indication of an excess of predictors that add little information to the response. One explanation is the almost perfect correlation between `FrancePop` and `Age` shown before: one of them is not adding any extra information to explain `Price`. This complicates the model unnecessarily and, more importantly, has the undesirable effect of making the coefficient estimates less precise. We opt to remove the predictor `FrancePop` from the model since it is exogenous to the wine context.34 A data-driven justification for the removal of this variable is that it is the least significant in `modWine1`.

Then, the model without `FrancePop`35 is:
modWine2 <- lm(Price ~ . - FrancePop, data = wine)
summary(modWine2)
##
## Call:
## lm(formula = Price ~ . - FrancePop, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46024 -0.23862 0.01347 0.18601 0.53443
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.6515703 1.6880876 -2.163 0.04167 *
## WinterRain 0.0011667 0.0004820 2.420 0.02421 *
## AGST 0.6163916 0.0951747 6.476 1.63e-06 ***
## HarvestRain -0.0038606 0.0008075 -4.781 8.97e-05 ***
## Age 0.0238480 0.0071667 3.328 0.00305 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2865 on 22 degrees of freedom
## Multiple R-squared: 0.8275, Adjusted R-squared: 0.7962
## F-statistic: 26.39 on 4 and 22 DF, p-value: 4.057e-08
All the coefficients are significant at level $\alpha = 0.05$. Therefore, there is no clear redundant information. In addition, the $R^2$ is very similar to that of the full model, but the `Adjusted R-squared`, a weighting of the $R^2$ to account for the number of predictors used by the model, is slightly larger. As we will see in Section 2.7.2, this means that, compared to the number of predictors used, `modWine2` explains more variability of `Price` than `modWine1`.

A handy way of comparing the coefficients of both models is `car::compareCoefs`:
car::compareCoefs(modWine1, modWine2)
## Calls:
## 1: lm(formula = Price ~ ., data = wine)
## 2: lm(formula = Price ~ . - FrancePop, data = wine)
##
## Model 1 Model 2
## (Intercept) -2.34 -3.65
## SE 7.70 1.69
##
## WinterRain 0.001153 0.001167
## SE 0.000499 0.000482
##
## AGST 0.6144 0.6164
## SE 0.0980 0.0952
##
## HarvestRain -0.003837 -0.003861
## SE 0.000837 0.000808
##
## Age 0.01377 0.02385
## SE 0.05821 0.00717
##
## FrancePop -2.21e-05
## SE 1.27e-04
##
Note how the coefficients of `modWine2` have smaller standard errors than those of `modWine1`.

The individual CIs for the unknown $\beta_j$'s can be obtained by applying the `confint` function to an `lm` object. Let's compute the CIs for the model coefficients of `modWine1`, `modWine2`, and a new model `modWine3`:
# Fit a new model
modWine3 <- lm(Price ~ Age + WinterRain, data = wine)
summary(modWine3)
##
## Call:
## lm(formula = Price ~ Age + WinterRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88964 -0.51421 -0.00066 0.43103 1.06897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.9830427 0.5993667 9.982 5.09e-10 ***
## Age 0.0360559 0.0137377 2.625 0.0149 *
## WinterRain 0.0007813 0.0008780 0.890 0.3824
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5769 on 24 degrees of freedom
## Multiple R-squared: 0.2371, Adjusted R-squared: 0.1736
## F-statistic: 3.73 on 2 and 24 DF, p-value: 0.03884
# Confidence intervals at 95%
# CI: (lwr, upr)
confint(modWine3)
## 2.5 % 97.5 %
## (Intercept) 4.746010626 7.220074676
## Age 0.007702664 0.064409106
## WinterRain -0.001030725 0.002593278
# Confidence intervals at other levels
confint(modWine3, level = 0.90)
## 5 % 95 %
## (Intercept) 4.9575969417 7.008488360
## Age 0.0125522989 0.059559471
## WinterRain -0.0007207941 0.002283347
confint(modWine3, level = 0.99)
## 0.5 % 99.5 %
## (Intercept) 4.306650310 7.659434991
## Age -0.002367633 0.074479403
## WinterRain -0.001674299 0.003236852
# Compare with previous models
confint(modWine1)
## 2.5 % 97.5 %
## (Intercept) -1.834844e+01 13.6632391095
## WinterRain 1.153872e-04 0.0021910509
## AGST 4.106337e-01 0.8182146540
## HarvestRain -5.577203e-03 -0.0020974232
## Age -1.072931e-01 0.1348317795
## FrancePop -2.858849e-04 0.0002416171
confint(modWine2)
## 2.5 % 97.5 %
## (Intercept) -7.1524497573 -0.150690903
## WinterRain 0.0001670449 0.002166393
## AGST 0.4190113907 0.813771726
## HarvestRain -0.0055353098 -0.002185890
## Age 0.0089852800 0.038710748
confint(modWine3)
## 2.5 % 97.5 %
## (Intercept) 4.746010626 7.220074676
## Age 0.007702664 0.064409106
## WinterRain -0.001030725 0.002593278
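As a check, the 95% CIs given by confint for modWine3 can be reproduced manually from the CI formula of Section 2.4.2, using the estimates, their standard errors, and the $t_{n-p-1;\alpha/2}$ quantile. This is just a sketch to connect the output with the formula; confint is the preferred way in practice.

# Sketch: manual reproduction of confint(modWine3) at 95%
sumMod3 <- summary(modWine3)
est <- sumMod3$coefficients[, "Estimate"]
se <- sumMod3$coefficients[, "Std. Error"]
alpha <- 0.05
tQuant <- qt(1 - alpha / 2, df = modWine3$df.residual) # t_{n-p-1;alpha/2}
cbind("2.5 %" = est - tQuant * se, "97.5 %" = est + tQuant * se)
confint(modWine3) # Same values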
In `modWine3`, the 95% CI for $\beta_0$ is $(4.7460, 7.2201)$, for $\beta_1$ (`Age`) it is $(0.0077, 0.0644)$, and for $\beta_2$ (`WinterRain`) it is $(-0.0010, 0.0026)$. Therefore, we can say with 95% confidence that the coefficient of `WinterRain` is non-significant ($0$ is inside the CI). But, inspecting the CI of `WinterRain` in `modWine2`, we can see that it is significant for that model! How is this possible? The answer is that the presence of extra predictors affects the coefficient estimate, as we saw in Figure 2.7. Therefore, the precise statement to make is:

In the model `Price ~ Age + WinterRain`, with $\alpha = 0.05$, the coefficient of `WinterRain` is non-significant.
Note that this does not mean that the coefficient will always be non-significant: in `Price ~ Age + AGST + HarvestRain + WinterRain` it is significant.
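This model dependence can be checked directly by extracting the $p$-value of the $t$-test for `WinterRain` in each of the two fitted models:

# p-values of the t-test for WinterRain in two different models
summary(modWine3)$coefficients["WinterRain", "Pr(>|t|)"] # Non-significant
summary(modWine2)$coefficients["WinterRain", "Pr(>|t|)"] # Significant at 5%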
Compute and interpret the CIs for the coefficients, at levels $\alpha = 0.10, 0.05, 0.01$, for the following regressions:

- `Price ~ WinterRain + HarvestRain + AGST` (`wine`).
- `AGST ~ Year + FrancePop` (`wine`).
For the `assumptions` dataset, do the following:

- Regression `y7 ~ x7`. Check that:
  - The intercept is not significant for the regression at any reasonable level $\alpha$.
  - The slope is significant for any reasonable level $\alpha$.
- Regression `y6 ~ x6`. Assume the linear model assumptions are verified.
  - Check that the slope is significantly different from zero at any level $\alpha$.
  - For which levels $\alpha$ is the intercept significantly different from zero?
In certain applications, it is useful to center the predictors prior to fitting the model, in such a way that the slope coefficients measure the effects of deviations of the predictors from their means. Theoretically, this amounts to considering the linear model

$$Y = \beta_0 + \beta_1 (X_1 - \mathbb{E}[X_1]) + \cdots + \beta_p (X_p - \mathbb{E}[X_p]) + \varepsilon.$$

In the sample case, we proceed by replacing the predictors' observations $X_{ij}$ by $X_{ij} - \bar{X}_j$, which can be easily done by the `scale` function (see below). If, in addition, the response is also centered, then $Y_i$ is replaced by $Y_i - \bar{Y}$ and $\hat\beta_0 = 0$. This centering of the data has no influence on the significance of the predictors (but it has influence on the significance of $\hat\beta_0$), as it is just a linear transformation of them.
# By default, scale centers (subtracts the mean) and scales (divides by the
# standard deviation) the columns of a matrix
wineCen <- data.frame(scale(wine, center = TRUE, scale = FALSE))
# Regression with centered response and predictors
modWine3Cen <- lm(Price ~ Age + WinterRain, data = wineCen)
# Summary
summary(modWine3Cen)
##
## Call:
## lm(formula = Price ~ Age + WinterRain, data = wineCen)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88964 -0.51421 -0.00066 0.43103 1.06897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.964e-16 1.110e-01 0.000 1.0000
## Age 3.606e-02 1.374e-02 2.625 0.0149 *
## WinterRain 7.813e-04 8.780e-04 0.890 0.3824
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5769 on 24 degrees of freedom
## Multiple R-squared: 0.2371, Adjusted R-squared: 0.1736
## F-statistic: 3.73 on 2 and 24 DF, p-value: 0.03884
This is for theoretical and modeling convenience. With this assumption, we just model the randomness of $Y$ given the predictors. If the randomness of $Y$ and the randomness of the predictors were to be modeled, we would require a significantly more complex model.↩︎
Understood as small entries in $(\mathbf{X}'\mathbf{X})^{-1}$.↩︎
Prior to undertaking the estimation of $\sigma^2$, we have used the sample to estimate $\boldsymbol{\beta}$. The situation is thus analogous to the discussion between the sample variance $s^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2$ and the sample quasi-variance $\hat{s}^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2$ that are computed from a sample $X_1, \ldots, X_n$. When estimating the variance, both estimate previously $\mu$ through $\bar{X}$. The fact that $\hat{s}^2$ accounts for that prior estimation through the $n - 1$ degrees of freedom makes that estimator unbiased for $\sigma^2$ ($s^2$ is not).↩︎
Recall that the sample variance of $\hat\varepsilon_1, \ldots, \hat\varepsilon_n$ is $\frac{1}{n}\sum_{i=1}^n \hat\varepsilon_i^2$, since the residuals have zero mean.↩︎
In the sense of practically realistic.↩︎
Shortcut for significantly different from zero.↩︎
This is denoted as $\frac{\hat\beta_j}{\hat{\mathrm{SE}}(\hat\beta_j)} \stackrel{H_0}{\sim} t_{n-p-1}$.↩︎
In R, $t_{n-p-1;\alpha/2}$ can be computed as `qt(p = 1 - alpha / 2, df = n - p - 1)` or `qt(p = alpha / 2, df = n - p - 1, lower.tail = FALSE)`.↩︎
For example, `'**'` indicates that the $p$-value lies within $0.001$ and $0.01$.↩︎
This is a context-guided decision, not data-driven.↩︎
Notice the use of `-` for excluding a particular predictor.↩︎