2.5 Inference for the model coefficients

The assumptions introduced in the previous section allow us to specify the distributions of the random variables $\hat\beta_0$ and $\hat\beta_1$. As we will see, this is a key point for making inference on $\beta_0$ and $\beta_1$.

The distributions are derived conditionally on the sample predictors $X_1, \ldots, X_n$. In other words, we assume that the randomness of $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, $i = 1, \ldots, n$, comes only from the error terms and not from the predictors. To denote this, we employ lowercase for the sample predictors $x_1, \ldots, x_n$.

2.5.1 Distributions of the fitted coefficients

The distributions of $\hat\beta_0$ and $\hat\beta_1$ are
$$\hat\beta_0 \sim \mathcal{N}\left(\beta_0, \mathrm{SE}(\hat\beta_0)^2\right), \quad \hat\beta_1 \sim \mathcal{N}\left(\beta_1, \mathrm{SE}(\hat\beta_1)^2\right), \tag{2.4}$$
where
$$\mathrm{SE}(\hat\beta_0)^2 = \frac{\sigma^2}{n}\left[1 + \frac{\bar X^2}{s_x^2}\right], \quad \mathrm{SE}(\hat\beta_1)^2 = \frac{\sigma^2}{n s_x^2}. \tag{2.5}$$
Recall that an equivalent form for (2.4) is (why?)
$$\frac{\hat\beta_0 - \beta_0}{\mathrm{SE}(\hat\beta_0)} \sim \mathcal{N}(0, 1), \quad \frac{\hat\beta_1 - \beta_1}{\mathrm{SE}(\hat\beta_1)} \sim \mathcal{N}(0, 1).$$

Some important remarks on (2.4) and (2.5) are:

  • Bias. Both estimators are unbiased: their expectations are the true coefficients $\beta_0$ and $\beta_1$.

  • Variance. The variances $\mathrm{SE}(\hat\beta_0)^2$ and $\mathrm{SE}(\hat\beta_1)^2$ have an interesting interpretation in terms of their components:

    • Sample size $n$. As the sample size grows, the precision of the estimators increases, since both variances decrease.

    • Error variance $\sigma^2$. The more dispersed the errors are, the less precise the estimates are, since more vertical variability is present.

    • Predictor variance $s_x^2$. If the predictor is spread out (large $s_x^2$), then it is easier to fit a regression line: we have information about the data trend over a long interval. If $s_x^2$ is small, then all the data are concentrated in a narrow vertical band, so we have a much more limited view of the trend.

    • Mean $\bar X$. It influences only the precision of $\hat\beta_0$: the larger $\bar X^2$ is (that is, the further $\bar X$ is from zero), the less precise $\hat\beta_0$ is.

Figure 2.20: Illustration of the randomness of the fitted coefficients $(\hat\beta_0, \hat\beta_1)$ and the influence of $n$, $\sigma^2$ and $s_x^2$. The sample predictors $x_1, \ldots, x_n$ are fixed and new responses $Y_1, \ldots, Y_n$ are generated each time from a linear model $Y = \beta_0 + \beta_1 X + \varepsilon$. Application also available here.
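
The kind of experiment behind Figure 2.20 can be replicated with a short simulation. The sketch below uses illustrative values of $\beta_0$, $\beta_1$, $n$ and $\sigma$ (they are not taken from any dataset, and the object names are just for this sketch): the predictors are kept fixed, new responses are generated repeatedly, and the variability of the slope estimates is compared with $\mathrm{SE}(\hat\beta_1)$.

set.seed(123)
n <- 50; beta0 <- 0.5; beta1 <- -1; sigma <- 1
x <- rnorm(n) # sample predictors, kept fixed across replicates
betaHat1 <- replicate(1e3, {
  y <- beta0 + beta1 * x + rnorm(n, sd = sigma) # new responses each time
  coef(lm(y ~ x))[2] # fitted slope
})
sd(betaHat1) # sampling standard deviation of the slope estimates
sigma / sqrt(sum((x - mean(x))^2)) # SE(hat beta_1), using n * s_x^2 = sum((x - mean(x))^2)

Both quantities should be close, and both shrink if $n$ or the spread of the predictors increases, in agreement with the remarks above.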

The problem with (2.4) and (2.5) is that $\sigma^2$ is unknown in practice, so we need to estimate it from the data. We do so by computing the sample variance of the residuals $\hat\varepsilon_1, \ldots, \hat\varepsilon_n$. First note that the residuals have zero mean. This can be easily seen by replacing $\hat\beta_0 = \bar Y - \hat\beta_1 \bar X$:
$$\bar{\hat\varepsilon} = \frac{1}{n}\sum_{i=1}^n \left(Y_i - \hat\beta_0 - \hat\beta_1 X_i\right) = \frac{1}{n}\sum_{i=1}^n \left(Y_i - \bar Y + \hat\beta_1 \bar X - \hat\beta_1 X_i\right) = 0.$$
Due to this, we can estimate $\sigma^2$ by computing a rescaled sample variance of the residuals:
$$\hat\sigma^2 = \frac{\sum_{i=1}^n \hat\varepsilon_i^2}{n - 2}.$$
Note the $n - 2$ in the denominator, instead of $n$! $n - 2$ is the number of degrees of freedom: the number of data points minus the number of already fitted parameters. The interpretation is that “we have consumed 2 degrees of freedom of the sample on fitting $\hat\beta_0$ and $\hat\beta_1$.”
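
For instance, for the y1 ~ x1 regression of the assumptions dataset (assuming the dataset is loaded; the object names mod and n are just for this sketch), $\hat\sigma^2$ can be computed directly from the residuals and coincides with the squared “Residual standard error” reported by summary:

mod <- lm(y1 ~ x1, data = assumptions)
n <- nrow(assumptions)
sum(residuals(mod)^2) / (n - 2) # hat sigma^2: rescaled sample variance of the residuals
summary(mod)$sigma^2 # the same quantity, as computed by summary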

If we use the estimate $\hat\sigma^2$ instead of $\sigma^2$, we get different – and more useful – distributions for $\hat\beta_0$ and $\hat\beta_1$:
$$\frac{\hat\beta_0 - \beta_0}{\widehat{\mathrm{SE}}(\hat\beta_0)} \sim t_{n-2}, \quad \frac{\hat\beta_1 - \beta_1}{\widehat{\mathrm{SE}}(\hat\beta_1)} \sim t_{n-2}, \tag{2.6}$$
where $t_{n-2}$ represents the Student’s $t$ distribution with $n - 2$ degrees of freedom and
$$\widehat{\mathrm{SE}}(\hat\beta_0)^2 = \frac{\hat\sigma^2}{n}\left[1 + \frac{\bar X^2}{s_x^2}\right], \quad \widehat{\mathrm{SE}}(\hat\beta_1)^2 = \frac{\hat\sigma^2}{n s_x^2} \tag{2.7}$$
are the estimates of $\mathrm{SE}(\hat\beta_0)^2$ and $\mathrm{SE}(\hat\beta_1)^2$. The LHS of (2.6) is called the $t$-statistic because of its distribution. The interpretation of (2.7) is analogous to the one of (2.5).
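
As a quick check of (2.7), the estimated standard errors can be computed by hand and compared with the Std. Error column returned by summary. This is a sketch for the y1 ~ x1 regression (object names are illustrative), recalling that $n s_x^2 = \sum_{i=1}^n (x_i - \bar x)^2$:

mod <- lm(y1 ~ x1, data = assumptions)
x <- assumptions$x1
n <- nrow(assumptions)
hatSigma2 <- summary(mod)$sigma^2
sqrt(hatSigma2 * (1 / n + mean(x)^2 / sum((x - mean(x))^2))) # SE-hat of the intercept
sqrt(hatSigma2 / sum((x - mean(x))^2)) # SE-hat of the slope
summary(mod)$coefficients[, "Std. Error"] # as reported by summary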

2.5.2 Confidence intervals for the coefficients

Due to (2.6) and (2.7), we can obtain the $100(1 - \alpha)\%$ Confidence Intervals (CI) for the coefficients:
$$\left(\hat\beta_j \pm \widehat{\mathrm{SE}}(\hat\beta_j)\, t_{n-2;\alpha/2}\right), \quad j = 0, 1,$$
where $t_{n-2;\alpha/2}$ is the $\alpha/2$-upper quantile of the $t_{n-2}$ distribution (see Figure 2.21). Usually, $\alpha = 0.10, 0.05, 0.01$ are considered.


Figure 2.21: The Student’s $t$ distribution for the $t$-statistics associated with the null intercept and slope, for the y1 ~ x1 regression of the assumptions dataset.

Do you need to remember the above equations? No, although you need to fully understand them. R + R Commander will compute everything for you through the functions lm, summary and confint.

This random CI contains the unknown coefficient $\beta_j$ with a probability of $1 - \alpha$. Note also that the CI is symmetric around $\hat\beta_j$. A simple way of understanding this concept is as follows. Suppose you have 100 samples generated according to a linear model. If you compute the CI for a coefficient, then in approximately $100(1 - \alpha)$ of the samples the true coefficient would actually be inside the random CI. This is illustrated in Figure 2.22.

Figure 2.22: Illustration of the randomness of the CI for $\beta_0$ at $100(1 - \alpha)\%$ confidence. The plot shows 100 random CIs for $\beta_0$, computed from 100 random datasets generated by the same linear model, with intercept $\beta_0$. The illustration for $\beta_1$ is completely analogous. Application also available here.
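
A small simulation in the spirit of Figure 2.22 can make this precise. In the sketch below the coefficients and sample size are illustrative choices (not from any dataset): 100 datasets are generated from the same linear model, the 95% CI for $\beta_0$ is computed in each, and the proportion of CIs containing the true $\beta_0$ is counted.

set.seed(123)
n <- 50; beta0 <- 0.5; beta1 <- -1
x <- rnorm(n)
covers <- replicate(100, {
  y <- beta0 + beta1 * x + rnorm(n) # a new dataset from the same model
  ci <- confint(lm(y ~ x))["(Intercept)", ] # 95% CI for beta_0
  ci[1] < beta0 & beta0 < ci[2] # does the CI contain the true beta_0?
})
mean(covers) # proportion of covering CIs, expected to be close to 0.95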

Let’s see how we can compute the CIs for $\beta_0$ and $\beta_1$ in practice. We do it for the first regression of the assumptions dataset. Assuming you have loaded the dataset, in R we can simply type:

mod1 <- lm(y1 ~ x1, data = assumptions)
confint(mod1)
##                  2.5 %     97.5 %
## (Intercept) -0.2256901  0.1572392
## x1          -0.5587490 -0.2881032

In this example, the 95% confidence interval for $\beta_0$ is $(-0.2257, 0.1572)$ and the one for $\beta_1$ is $(-0.5587, -0.2881)$. Therefore, we can say with a 95% confidence that x1 has a negative effect on y1. If the CI for $\beta_1$ had been $(-0.2256901, 0.1572392)$, we could not have arrived at the same conclusion, since that CI contains both positive and negative numbers.
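
These intervals can also be reproduced by hand from the CI formula above, using the estimates and standard errors stored in summary(mod1) and the $t_{n-2;\alpha/2}$ quantile (a sketch; the object names est, se and tQuant are just for illustration):

est <- summary(mod1)$coefficients[, "Estimate"]
se <- summary(mod1)$coefficients[, "Std. Error"]
tQuant <- qt(0.05 / 2, df = df.residual(mod1), lower.tail = FALSE) # t_{n-2; alpha/2} for alpha = 0.05
cbind(est - se * tQuant, est + se * tQuant) # matches confint(mod1)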

By default, the confidence intervals are computed for $\alpha = 0.05$. You can change this with the level argument, for example:

confint(mod1, level = 0.90) # alpha = 0.10
##                    5 %       95 %
## (Intercept) -0.1946762  0.1262254
## x1          -0.5368291 -0.3100231
confint(mod1, level = 0.95) # alpha = 0.05
##                  2.5 %     97.5 %
## (Intercept) -0.2256901  0.1572392
## x1          -0.5587490 -0.2881032
confint(mod1, level = 0.99) # alpha = 0.01
##                  0.5 %     99.5 %
## (Intercept) -0.2867475  0.2182967
## x1          -0.6019030 -0.2449492

Note that the larger the confidence of the interval, the wider – thus less useful – it is. For example, the interval $(-\infty, \infty)$ contains any coefficient with a 100% confidence, but is completely useless.
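
The widening can be quantified directly from the outputs above, for example for the slope (a quick check; the loop over levels is just illustrative):

sapply(c(0.90, 0.95, 0.99), function(l) diff(confint(mod1, level = l)["x1", ])) # CI widths grow with the confidence level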

If you want to compute the CIs with the help of R Commander (assuming the dataset has been loaded and is the active one), then do the following:

  1. Fit the linear model ('Statistics' -> 'Fit models' -> 'Linear regression...').
  2. Go to 'Models' -> 'Confidence intervals...' and then input the 'Confidence Level'.

Compute the CIs (95%) for the coefficients of the regressions:

  • y2 ~ x2
  • y6 ~ x6
  • y7 ~ x7

Do you think all of them are meaningful? Which ones are and why? (Recall: inference on the model makes sense if the assumptions of the model are verified.)

Compute the CIs for the coefficients of the following regressions:

  • MathMean ~ ScienceMean (pisa)
  • MathMean ~ ReadingMean (pisa)
  • Seats2013.2023 ~ Population2010 (US)
  • CamCom2011 ~ Population2010 (EU)

For the above regressions, can we conclude with a 95% confidence that the effect of the predictor on the response is positive?

A CI for $\sigma^2$ can also be computed, but it is less important in practice. The formula is
$$\left(\frac{(n - 2)\,\hat\sigma^2}{\chi^2_{n-2;\alpha/2}}, \frac{(n - 2)\,\hat\sigma^2}{\chi^2_{n-2;1-\alpha/2}}\right),$$
where $\chi^2_{n-2;q}$ is the $q$-upper quantile of the $\chi^2$ distribution with $n - 2$ degrees of freedom, $\chi^2_{n-2}$. Note that the CI is not symmetric around $\hat\sigma^2$.

Compute the CI for $\sigma^2$ for the regression of MathMean on logGDPp in the pisa dataset. Do it for $\alpha = 0.10, 0.05, 0.01$.

  • To compute $\chi^2_{n-2;q}$, you can do:

    • In R, qchisq(p = q, df = n - 2, lower.tail = FALSE).
    • In R Commander, go to ‘Distributions’ -> ‘Continuous distributions’ -> ‘Chi-squared distribution’ -> ‘Chi-squared quantiles’ and then select ‘Upper tail’. Input q as the ‘Probabilities’ and n - 2 as the ‘Degrees of freedom’.
  • To compute $\hat\sigma^2$, use summary(lm(MathMean ~ logGDPp, data = pisa))$sigma^2. Remember that there are 65 countries in the study.

Answers: c(1720.669, 3104.512), c(1635.441, 3306.257) and c(1484.639, 3752.946).
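
One possible way of obtaining these answers in R, combining the two hints above (assuming the pisa dataset is loaded; the object names are illustrative):

n <- 65
hatSigma2 <- summary(lm(MathMean ~ logGDPp, data = pisa))$sigma^2
alpha <- 0.05 # repeat with 0.10 and 0.01
c((n - 2) * hatSigma2 / qchisq(p = alpha / 2, df = n - 2, lower.tail = FALSE),
  (n - 2) * hatSigma2 / qchisq(p = 1 - alpha / 2, df = n - 2, lower.tail = FALSE))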

2.5.3 Testing on the coefficients

The distributions in (2.6) also allow us to conduct a formal hypothesis test on the coefficients $\beta_j$, $j = 0, 1$. For example, the test for significance (shorthand for significantly different from zero) is especially important, that is, the test of the hypotheses $H_0: \beta_j = 0$ for $j = 0, 1$. The test of $H_0: \beta_1 = 0$ is especially interesting, since it allows us to answer whether the variable $X$ has a significant linear effect on $Y$. The statistic used for testing for significance is the $t$-statistic
$$\frac{\hat\beta_j - 0}{\widehat{\mathrm{SE}}(\hat\beta_j)},$$
which is distributed as a $t_{n-2}$ under (the veracity of) the null hypothesis.

Remember the analogy between hypothesis testing and a trial, as given in the table below.

| Hypothesis testing | Trial |
|---|---|
| Null hypothesis $H_0$ | Accused of committing a crime. It has the “presumption of innocence”: it is not guilty until there is enough evidence supporting its guilt |
| Sample $X_1, \ldots, X_n$ | Collection of small pieces of evidence supporting innocence and guilt |
| Statistic $T_n$ | Summary of the evidence presented by the prosecutor and the defense lawyer |
| Distribution of $T_n$ under $H_0$ | The judge conducting the trial. Evaluates the evidence presented by both sides and presents a verdict for $H_0$ |
| Significance level $\alpha$ | $1 - \alpha$ is the strength of evidence required by the judge for condemning $H_0$. The judge allows evidence that on average condemns $100\alpha\%$ of the innocents! $\alpha = 0.05$ is considered a reasonable level |
| $p$-value | Decision of the judge. If $p\text{-value} < \alpha$, $H_0$ is declared guilty. Otherwise, it is declared not guilty |
| $H_0$ is rejected | $H_0$ is declared guilty: there is strong evidence supporting its guilt |
| $H_0$ is not rejected | $H_0$ is declared not guilty: either it is innocent or there is not enough evidence supporting its guilt |

More formally, the p-value is defined as:

The $p$-value is the probability of obtaining a statistic more unfavourable to $H_0$ than the observed one, assuming that $H_0$ is true.

Therefore, if the $p$-value is small (smaller than the chosen level $\alpha$), it is unlikely that the evidence against $H_0$ is due to randomness. As a consequence, $H_0$ is rejected. If the $p$-value is large (larger than $\alpha$), then it is more plausible that the evidence against $H_0$ is merely due to the randomness of the data. In this case, we do not reject $H_0$.

The null hypothesis $H_0$ is tested against the alternative hypothesis, $H_1$. If $H_0$ is rejected, it is rejected in favor of $H_1$. The alternative hypothesis can be bilateral, such as
$$H_0: \beta_j = 0 \quad \text{vs} \quad H_1: \beta_j \neq 0,$$
or unilateral, such as
$$H_0: \beta_j \geq (\leq)\, 0 \quad \text{vs} \quad H_1: \beta_j < (>)\, 0.$$
For the moment, we will focus only on the bilateral case.

The connection between the $t$-test for $H_0: \beta_j = 0$ and the CI for $\beta_j$, both at level $\alpha$, is the following.

Is $0$ inside the CI for $\beta_j$?

  • Yes ⇒ do not reject $H_0$.
  • No ⇒ reject $H_0$.
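
For mod1, this connection can be verified directly at $\alpha = 0.05$: $0$ lies inside the CI for the intercept and outside the CI for the slope, in agreement with the $p$-values returned by summary (discussed next):

confint(mod1) # 0 inside the CI for the intercept, outside the CI for the slope
summary(mod1)$coefficients[, "Pr(>|t|)"] < 0.05 # TRUE = reject H0: beta_j = 0 (slope only)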

The tests for significance are built into the summary function, as we glimpsed in Section 2.1.2. For mod1, we have:

summary(mod1)
## 
## Call:
## lm(formula = y1 ~ x1, data = assumptions)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.13678 -0.62218 -0.07824  0.54671  2.63056 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.03423    0.09709  -0.353    0.725    
## x1          -0.42343    0.06862  -6.170 3.77e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9559 on 198 degrees of freedom
## Multiple R-squared:  0.1613, Adjusted R-squared:  0.157 
## F-statistic: 38.07 on 1 and 198 DF,  p-value: 3.772e-09

The Coefficients block of the output of summary contains the following elements regarding the test of $H_0: \beta_j = 0$ vs $H_1: \beta_j \neq 0$:

  • Estimate: least squares estimate $\hat\beta_j$.
  • Std. Error: estimated standard error $\widehat{\mathrm{SE}}(\hat\beta_j)$.
  • t value: $t$-statistic $\frac{\hat\beta_j}{\widehat{\mathrm{SE}}(\hat\beta_j)}$.
  • Pr(>|t|): $p$-value of the $t$-test.
  • Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1: codes indicating the size of the $p$-value. The more stars, the more evidence supporting that $H_0$ does not hold.

In the above output for summary(mod1), $H_0: \beta_0 = 0$ is not rejected at any reasonable level $\alpha$ (that is, $0.10$, $0.05$ and $0.01$). Hence $\hat\beta_0$ is not significantly different from zero and $\beta_0$ is not significant for the regression. On the other hand, $H_0: \beta_1 = 0$ is rejected at any level $\alpha$ larger than the $p$-value, 3.77e-09. Therefore, $\beta_1$ is significant for the regression (and $\hat\beta_1$ is significantly different from zero).
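
As a check of where the t value and Pr(>|t|) columns come from, they can be recomputed from the estimates and standard errors using the $t_{n-2}$ distribution (a sketch; here $n - 2 = 198$, and the object names are illustrative):

coefs <- summary(mod1)$coefficients
tStats <- coefs[, "Estimate"] / coefs[, "Std. Error"] # matches the "t value" column
2 * pt(abs(tStats), df = df.residual(mod1), lower.tail = FALSE) # matches "Pr(>|t|)"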

For the assumptions dataset, do the following:

  • Regression y7 ~ x7. Check that:
    • The intercept is not significant for the regression at any reasonable level $\alpha$.
    • The slope is significant for any $\alpha \geq 10^{-7}$.
  • Regression y6 ~ x6. Assume the linear model assumptions are verified.
    • Check that $\hat\beta_0$ is significantly different from zero at any level $\alpha$.
    • For which $\alpha = 0.10, 0.05, 0.01$ is $\hat\beta_1$ significantly different from zero?

Re-analyze the significance of the coefficients in Seats2013.2023 ~ Population2010 and Seats2011 ~ Population2010 for the US and EU datasets, respectively.


  1. The Student’s $t$ distribution has heavier tails than the normal, which means that large observations in absolute value are more likely. $t_n$ converges to a $\mathcal{N}(0, 1)$ when $n$ is large. For example, for $n$ larger than 30, the normal is a good approximation.

  2. $\chi^2_n$ is the distribution of the sum of the squares of $n$ independent $\mathcal{N}(0, 1)$ random variables.