2.4 Inference for model parameters

The assumptions introduced in the previous section allow us to specify the distribution of the random vector \(\hat{\boldsymbol{\beta}}.\) The distribution is derived conditionally on the predictors’ sample \(\mathbf{X}_1,\ldots,\mathbf{X}_n.\) In other words, we assume that the randomness of \(\mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}\) comes only from the error terms and not from the predictors24. To denote this, we employ lowercase for the predictors’ sample \(\mathbf{x}_1,\ldots,\mathbf{x}_n.\)

2.4.1 Distributions of the fitted coefficients

The distribution of \(\hat{\boldsymbol{\beta}}\) is:

\[\begin{align} \hat{\boldsymbol{\beta}}\sim\mathcal{N}_{p+1}\left(\boldsymbol{\beta},\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\right). \tag{2.11} \end{align}\]

This result can be obtained from the form of \(\hat{\boldsymbol{\beta}}\) given in (2.7), the sample version of the model assumptions given in (2.10), and the linear transformation property of a normal given in (1.4). Equation (2.11) implies that the marginal distribution of \(\hat\beta_j\) is

\[\begin{align} \hat{\beta}_j\sim\mathcal{N}\left(\beta_j,\mathrm{SE}(\hat\beta_j)^2\right),\tag{2.12} \end{align}\]

where \(\mathrm{SE}(\hat\beta_j)\) is the standard error, \(\mathrm{SE}(\hat\beta_j)^2:=\sigma^2v_j,\) and

\[\begin{align*} v_j\text{ is the }j\text{-th element of the diagonal of }(\mathbf{X}'\mathbf{X})^{-1}. \end{align*}\]

Recall that an equivalent form for (2.12) is (why?)

\[\begin{align*} \frac{\hat\beta_j-\beta_j}{\mathrm{SE}(\hat\beta_j)}\sim\mathcal{N}(0,1). \end{align*}\]

The interpretation of (2.12) is simpler in the case with \(p=1,\) where

\[\begin{align} \hat\beta_0\sim\mathcal{N}\left(\beta_0,\mathrm{SE}(\hat\beta_0)^2\right),\quad\hat\beta_1\sim\mathcal{N}\left(\beta_1,\mathrm{SE}(\hat\beta_1)^2\right),\tag{2.13} \end{align}\]

with

\[\begin{align} \mathrm{SE}(\hat\beta_0)^2=\frac{\sigma^2}{n}\left[1+\frac{\bar X^2}{s_x^2}\right],\quad \mathrm{SE}(\hat\beta_1)^2=\frac{\sigma^2}{ns_x^2}.\tag{2.14} \end{align}\]

Some insights on (2.13) and (2.14), illustrated interactively in Figure 2.13, are the following:

  • Bias. Both estimates are unbiased. That means that their expectations are the true coefficients for any sample size \(n.\)

  • Variance. The variances \(\mathrm{SE}(\hat\beta_0)^2\) and \(\mathrm{SE}(\hat\beta_1)^2\) have interesting interpretations in terms of their components:

    • Sample size \(n\). As the sample size grows, the precision of the estimators increases, since both variances decrease.

    • Error variance \(\sigma^2\). The more dispersed the errors are, the less precise the estimates are, since more vertical variability is present.

    • Predictor variance \(s_x^2\). If the predictor is spread out (large \(s_x^2\)), then it is easier to fit a regression line: we have information about the data trend over a long interval. If \(s_x^2\) is small, then all the data is concentrated on a narrow vertical band, so we have a much more limited view of the trend.

    • Mean \(\bar X\). It influences only the precision of \(\hat\beta_0.\) The further \(\bar X\) is from zero (the larger \(\bar X^2\) is), the less precise \(\hat\beta_0\) is.

Figure 2.13: Illustration of the randomness of the fitted coefficients \((\hat\beta_0,\hat\beta_1)\) and the influence of \(n,\) \(\sigma^2,\) and \(s_x^2.\) The predictors’ sample \(x_1,\ldots,x_n\) is fixed and new responses \(Y_1,\ldots,Y_n\) are generated each time from a linear model \(Y=\beta_0+\beta_1X+\varepsilon.\) Application available here.
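To complement Figure 2.13, the sampling distributions in (2.13) and the standard errors in (2.14) can be verified numerically with a small simulation that fixes the predictors’ sample and generates new responses each time. The particular values of \(n,\) \(\beta_0,\) \(\beta_1,\) and \(\sigma\) below are illustrative choices, not taken from the text.

# Simulation in the spirit of Figure 2.13 (illustrative settings)
set.seed(123456)
n <- 50; beta0 <- 0.5; beta1 <- -1; sigma <- 1
x <- rnorm(n) # Fixed predictors' sample
M <- 1e4 # Number of simulated samples
betaHat <- t(replicate(M, {
  y <- beta0 + beta1 * x + rnorm(n, sd = sigma) # New responses each time
  coef(lm(y ~ x))
}))

# Empirical standard deviations of the estimates...
apply(betaHat, 2, sd)

# ...should match the theoretical standard errors in (2.14)
sx2 <- mean((x - mean(x))^2) # Divided by n, as in (2.14)
c(sqrt(sigma^2 / n * (1 + mean(x)^2 / sx2)), sqrt(sigma^2 / (n * sx2)))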

The insights about (2.11) are more involved. The following broad remarks, which extend what happened when \(p=1,\) apply:

  • Bias. All the estimates are unbiased for any sample size \(n.\)

  • Variance. It depends on:

    • Sample size \(n\). Hidden inside \(\mathbf{X}'\mathbf{X}.\) As \(n\) grows, the precision of the estimators increases.
    • Error variance \(\sigma^2\). The larger \(\sigma^2\) is, the less precise \(\hat{\boldsymbol{\beta}}\) is.
    • Predictor sparsity \((\mathbf{X}'\mathbf{X})^{-1}\). The more “dispersed”25 the predictors are, the more precise \(\hat{\boldsymbol{\beta}}\) is.

The problem with the result in (2.11) is that \(\sigma^2\) is unknown in practice. Therefore, we need to estimate \(\sigma^2\) in order to use a result similar to (2.11). We do so by computing a rescaled sample variance of the residuals \(\hat\varepsilon_1,\ldots,\hat\varepsilon_n\):

\[\begin{align} \hat\sigma^2:=\frac{1}{n-p-1}\sum_{i=1}^n\hat\varepsilon_i^2.\tag{2.15} \end{align}\]

Note the \(n-p-1\) in the denominator. This factor represents the degrees of freedom: the number of data points minus the number of parameters already26 fitted with the data (\(p\) slopes plus \(1\) intercept). For the interpretation of \(\hat\sigma^2,\) it is key to realize that the mean of the residuals \(\hat\varepsilon_1,\ldots,\hat\varepsilon_n\) is zero, that is, \(\bar{\hat\varepsilon}=0.\) Therefore, \(\hat\sigma^2\) is indeed a rescaled sample variance of the residuals, which estimates the variance of \(\varepsilon\;\)27. It can be seen that \(\hat\sigma^2\) is an unbiased estimator of \(\sigma^2.\)
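A quick numerical check of (2.15) on simulated data follows; the model, coefficients, and sample size below are hypothetical choices for illustration only.

# Check of (2.15) on simulated data
set.seed(123456)
n <- 30; p <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 0.5 * x1 - x2 + rnorm(n, sd = 0.5)
fit <- lm(y ~ x1 + x2)

# The residuals have zero mean
mean(fit$residuals)

# hatSigma^2 as in (2.15)...
sum(fit$residuals^2) / (n - p - 1)

# ...which coincides with the squared "Residual standard error" reported by R
sigma(fit)^2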

If we use the estimate \(\hat\sigma^2\) instead of \(\sigma^2,\) we get more useful28 distributions than (2.12):

\[\begin{align} \frac{\hat\beta_j-\beta_j}{\hat{\mathrm{SE}}(\hat\beta_j)}\sim t_{n-p-1},\quad\hat{\mathrm{SE}}(\hat\beta_j)^2:=\hat\sigma^2v_j,\tag{2.16} \end{align}\]

where \(t_{n-p-1}\) represents the Student’s \(t\) distribution with \(n-p-1\) degrees of freedom.

The LHS of (2.16) is the \(t\)-statistic for \(\beta_j,\) \(j=0,\ldots,p.\) We will employ these statistics for building confidence intervals and conducting hypothesis tests in what follows.
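A Monte Carlo experiment gives an empirical check of (2.16) for a simple linear model. The settings below are illustrative and not taken from the text.

# Monte Carlo check of (2.16): the t-statistic follows a t_{n-p-1}
set.seed(123456)
n <- 20; beta0 <- 0.5; beta1 <- -1; sigma <- 1
x <- rnorm(n) # Fixed predictors' sample
M <- 1e4
tStats <- replicate(M, {
  y <- beta0 + beta1 * x + rnorm(n, sd = sigma)
  sumFit <- summary(lm(y ~ x))
  (sumFit$coefficients["x", "Estimate"] - beta1) /
    sumFit$coefficients["x", "Std. Error"]
})

# Empirical quantiles of the t-statistics vs. quantiles of t_{n-p-1} (p = 1)
probs <- c(0.05, 0.25, 0.50, 0.75, 0.95)
rbind(empirical = quantile(tStats, probs = probs),
      theoretical = qt(probs, df = n - 2))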

2.4.2 Confidence intervals for the coefficients

Thanks to (2.16), we can obtain the \(100(1-\alpha)\%\) Confidence Interval (CI) for the coefficient \(\beta_j,\) \(j=0,\ldots,p\):

\[\begin{align} \left(\hat\beta_j\pm\hat{\mathrm{SE}}(\hat\beta_j)t_{n-p-1;\alpha/2}\right)\tag{2.17} \end{align}\]

where \(t_{n-p-1;\alpha/2}\) is the \(\alpha/2\)-upper quantile of the \(t_{n-p-1}\) distribution. Usually, \(\alpha=0.10,0.05,0.01\) are considered.
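The CI in (2.17) can be computed by hand from any fitted lm object and compared with R’s confint function. The sketch below uses hypothetical simulated data.

# Manual computation of (2.17) on a hypothetical fit
set.seed(123456)
x <- rnorm(30)
y <- 1 + 2 * x + rnorm(30)
fit <- lm(y ~ x)

alpha <- 0.05
betaHat <- coef(fit) # hatBeta_j
seHat <- sqrt(diag(vcov(fit))) # hatSE(hatBeta_j)
tAlpha2 <- qt(1 - alpha / 2, df = fit$df.residual) # t_{n-p-1;alpha/2}
cbind(lwr = betaHat - tAlpha2 * seHat, upr = betaHat + tAlpha2 * seHat)

# Same result with confint
confint(fit, level = 1 - alpha)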

Figure 2.14: Illustration of the randomness of the CI for \(\beta_0\) at \(100(1-\alpha)\%\) confidence. The plot shows 100 random CIs for \(\beta_0,\) computed from 100 random datasets generated by the same simple linear model, with intercept \(\beta_0.\) The illustration for \(\beta_1\) is completely analogous. Application available here.

This random CI contains the unknown coefficient \(\beta_j\) “with a probability of \(1-\alpha\).” The previous quoted statement has to be understood as follows. Suppose you had 100 samples generated according to a linear model and you computed the CI for a coefficient from each of them. Then, in approximately \(100(1-\alpha)\) of the samples, the true coefficient would actually be inside the random CI. Note also that the CI is symmetric around \(\hat\beta_j.\) This is illustrated in Figure 2.14.

2.4.3 Testing on the coefficients

The distributions in (2.16) also allow us to conduct formal hypothesis tests on the coefficients \(\beta_j,\) \(j=0,\ldots,p.\) For example, the test for significance29 is especially important, that is, the test of the hypotheses

\[\begin{align*} H_0:\beta_j=0 \end{align*}\]

for \(j=0,\ldots,p.\) The test of \(H_0:\beta_j=0\) with \(1\leq j\leq p\) is especially interesting, since it allows us to answer whether the variable \(X_j\) has a significant linear effect on \(Y\). The statistic used for testing for significance is the \(t\)-statistic

\[\begin{align*} \frac{\hat\beta_j-0}{\hat{\mathrm{SE}}(\hat\beta_j)}, \end{align*}\]

which is distributed as a \(t_{n-p-1}\) under (the veracity of) the null hypothesis30.

The null hypothesis \(H_0\) is tested against the alternative hypothesis, \(H_1.\) If \(H_0\) is rejected, it is rejected in favor of \(H_1.\) The alternative hypothesis can be two-sided (we will focus mostly on these alternatives), such as

\[\begin{align*} H_0:\beta_j= 0\quad\text{vs.}\quad H_1:\beta_j\neq 0 \end{align*}\]

or one-sided, such as

\[\begin{align*} H_0:\beta_j=0 \quad\text{vs.}\quad H_1:\beta_j<(>)0. \end{align*}\]

The test based on the \(t\)-statistic is referred to as the \(t\)-test. It rejects \(H_0:\beta_j=0\) (against \(H_1:\beta_j\neq 0\)) at significance level \(\alpha\) for large absolute values of the \(t\)-statistic, precisely for those above the \(\alpha/2\)-upper quantile of the \(t_{n-p-1}\) distribution. That is, it rejects \(H_0\) at level \(\alpha\) if \(\frac{|\hat\beta_j|}{\hat{\mathrm{SE}}(\hat\beta_j)}>t_{n-p-1;\alpha/2}\;\)31. For the one-sided tests, it rejects \(H_0\) against \(H_1:\beta_j<0\) or \(H_1:\beta_j>0\) if \(\frac{\hat\beta_j}{\hat{\mathrm{SE}}(\hat\beta_j)}<-t_{n-p-1;\alpha}\) or \(\frac{\hat\beta_j}{\hat{\mathrm{SE}}(\hat\beta_j)}>t_{n-p-1;\alpha},\) respectively.
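These rejection rules translate directly into R. The sketch below uses a hypothetical fitted model and tests its slope coefficient; only base R functions (qt, pt) are involved.

# Decision rules for H0: beta_j = 0 on a hypothetical fit
set.seed(123456)
x <- rnorm(25)
y <- 1 - 0.5 * x + rnorm(25)
fit <- lm(y ~ x)
j <- 2 # Coefficient of x ((Intercept) is j = 1 in R's ordering)
alpha <- 0.05

tStat <- coef(fit)[j] / sqrt(diag(vcov(fit)))[j] # t-statistic
df <- fit$df.residual # n - p - 1

# Two-sided test: reject H0 if |t| > t_{n-p-1;alpha/2}
abs(tStat) > qt(1 - alpha / 2, df = df)
2 * pt(abs(tStat), df = df, lower.tail = FALSE) # Two-sided p-value

# One-sided tests
tStat < -qt(1 - alpha, df = df) # Reject in favor of H1: beta_j < 0
tStat > qt(1 - alpha, df = df) # Reject in favor of H1: beta_j > 0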

Remember the following insights about hypothesis testing.

The analogy between conducting a hypothesis test and a trial can be seen in Appendix A.1.

In a hypothesis test, the \(p\)-value measures the degree of veracity of \(H_0\) according to the data. The rule of thumb is the following:

Is the \(p\)-value lower than \(\alpha\)?

  • Yes \(\rightarrow\) reject \(H_0\).
  • No \(\rightarrow\) do not reject \(H_0\).

The connection of a \(t\)-test for \(H_0:\beta_j=0\) and the CI for \(\beta_j,\) both at level \(\alpha,\) is the following:

Is \(0\) inside the CI for \(\beta_j\)?

  • Yes \(\leftrightarrow\) do not reject \(H_0\).
  • No \(\leftrightarrow\) reject \(H_0\).

The one-sided test \(H_0:\beta_j=0\) vs. \(H_1:\beta_j<0\) (respectively, \(H_1:\beta_j>0\)) can be done by means of the CI for \(\beta_j.\) If \(H_0\) is rejected, we can conclude that \(\hat\beta_j\) is significantly negative (positive) and that, for the considered regression model, \(X_j\) has a significant negative (positive) effect on \(Y\). The rule of thumb is the following:

Is the CI for \(\beta_j\) below (above) \(0\) at level \(\alpha\)?

  • Yes \(\rightarrow\) reject \(H_0\) at level \(\alpha.\) Conclude \(X_j\) has a significant negative (positive) effect on \(Y\) at level \(\alpha.\)
  • No \(\rightarrow\) the criterion is not conclusive.

2.4.4 Case study application

Let’s analyze the multiple linear model we have considered for the wine dataset, now that we know how to make inference on the model parameters. The relevant information is obtained with the summary of the model:

# Fit
modWine1 <- lm(Price ~ ., data = wine)

# Summary
sumModWine1 <- summary(modWine1)
sumModWine1
## 
## Call:
## lm(formula = Price ~ ., data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46541 -0.24133  0.00413  0.18974  0.52495 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.343e+00  7.697e+00  -0.304  0.76384    
## WinterRain   1.153e-03  4.991e-04   2.311  0.03109 *  
## AGST         6.144e-01  9.799e-02   6.270 3.22e-06 ***
## HarvestRain -3.837e-03  8.366e-04  -4.587  0.00016 ***
## Age          1.377e-02  5.821e-02   0.237  0.81531    
## FrancePop   -2.213e-05  1.268e-04  -0.175  0.86313    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.293 on 21 degrees of freedom
## Multiple R-squared:  0.8278, Adjusted R-squared:  0.7868 
## F-statistic: 20.19 on 5 and 21 DF,  p-value: 2.232e-07

# Contains the estimation of sigma ("Residual standard error")
sumModWine1$sigma
## [1] 0.2930287

# Which is the same as
sqrt(sum(modWine1$residuals^2) / modWine1$df.residual)
## [1] 0.2930287
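The Std. Error column of the summary comes from the diagonal of \(\hat\sigma^2(\mathbf{X}'\mathbf{X})^{-1},\) the estimated version of the covariance matrix in (2.11), which R returns with vcov. The following check should reproduce, up to rounding, the standard errors displayed above.

# Estimated covariance matrix of the coefficients, hatSigma^2 * (X'X)^{-1}
vcov(modWine1)

# The square root of its diagonal gives the "Std. Error" column of the summary
sqrt(diag(vcov(modWine1)))

# Equivalently, computed from the design matrix
X <- model.matrix(modWine1)
sqrt(sumModWine1$sigma^2 * diag(solve(crossprod(X))))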

The Coefficients block of the summary output contains the following elements regarding the significance of each coefficient \(\beta_j,\) that is, the test \(H_0:\beta_j=0\) vs. \(H_1:\beta_j\neq0\) (see the check after this list):

  • Estimate: least squares estimate \(\hat\beta_j.\)
  • Std. Error: estimated standard error \(\hat{\mathrm{SE}}(\hat\beta_j).\)
  • t value: \(t\)-statistic \(\frac{\hat\beta_j}{\hat{\mathrm{SE}}(\hat\beta_j)}.\)
  • Pr(>|t|): \(p\)-value of the \(t\)-test.
  • Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1: codes indicating the size of the \(p\)-value. The more asterisks, the more evidence supporting that \(H_0\) does not hold32.
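For example, the row of AGST can be recovered from these ingredients; its output should match, up to rounding, the summary of modWine1 shown above.

# Reproduce the AGST row of the Coefficients block of modWine1
est <- coef(modWine1)["AGST"] # Estimate
se <- sqrt(diag(vcov(modWine1)))["AGST"] # Std. Error
tVal <- est / se # t value
pVal <- 2 * pt(abs(tVal), df = modWine1$df.residual,
               lower.tail = FALSE) # Pr(>|t|)
c(est, se, tVal, pVal)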

Note that a high proportion of predictors are not significant in modWine1: FrancePop and Age are not significant (and neither is the intercept). This is an indication of an excess of predictors adding little information to the response. One explanation is the almost perfect correlation between FrancePop and Age shown before: one of them is not adding any extra information to explain Price. This complicates the model unnecessarily and, more importantly, has the undesirable effect of making the coefficient estimates less precise. We opt to remove the predictor FrancePop from the model since it is exogenous to the wine context33. A data-driven justification of its removal is that it is the least significant predictor in modWine1.

Then, the model without FrancePop34 is:

modWine2 <- lm(Price ~ . - FrancePop, data = wine)
summary(modWine2)
## 
## Call:
## lm(formula = Price ~ . - FrancePop, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46024 -0.23862  0.01347  0.18601  0.53443 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.6515703  1.6880876  -2.163  0.04167 *  
## WinterRain   0.0011667  0.0004820   2.420  0.02421 *  
## AGST         0.6163916  0.0951747   6.476 1.63e-06 ***
## HarvestRain -0.0038606  0.0008075  -4.781 8.97e-05 ***
## Age          0.0238480  0.0071667   3.328  0.00305 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2865 on 22 degrees of freedom
## Multiple R-squared:  0.8275, Adjusted R-squared:  0.7962 
## F-statistic: 26.39 on 4 and 22 DF,  p-value: 4.057e-08

All the coefficients are significant at level \(\alpha=0.05.\) Therefore, there is no clearly redundant information. In addition, the \(R^2\) is very similar to that of the full model, but the 'Adjusted R-squared', a modification of the \(R^2\) that accounts for the number of predictors used by the model, is slightly larger. As we will see in Section 2.7.2, this means that, relative to the number of predictors used, modWine2 explains more variability of Price than modWine1.

A handy way of comparing the coefficients of both models is car::compareCoefs:

car::compareCoefs(modWine1, modWine2)
## Calls:
## 1: lm(formula = Price ~ ., data = wine)
## 2: lm(formula = Price ~ . - FrancePop, data = wine)
## 
##               Model 1   Model 2
## (Intercept)     -2.34     -3.65
## SE               7.70      1.69
##                                
## WinterRain   0.001153  0.001167
## SE           0.000499  0.000482
##                                
## AGST           0.6144    0.6164
## SE             0.0980    0.0952
##                                
## HarvestRain -0.003837 -0.003861
## SE           0.000837  0.000808
##                                
## Age           0.01377   0.02385
## SE            0.05821   0.00717
##                                
## FrancePop   -2.21e-05          
## SE           1.27e-04          
## 

Note how the coefficients of modWine2 have smaller standard errors than those of modWine1.

The individual CIs for the unknown \(\beta_j\)’s can be obtained by applying the confint function to an lm object. Let’s compute the CIs for the model coefficients of modWine1, modWine2, and a new model modWine3:

# Fit a new model
modWine3 <- lm(Price ~ Age + WinterRain, data = wine)
summary(modWine3)
## 
## Call:
## lm(formula = Price ~ Age + WinterRain, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88964 -0.51421 -0.00066  0.43103  1.06897 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.9830427  0.5993667   9.982 5.09e-10 ***
## Age         0.0360559  0.0137377   2.625   0.0149 *  
## WinterRain  0.0007813  0.0008780   0.890   0.3824    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5769 on 24 degrees of freedom
## Multiple R-squared:  0.2371, Adjusted R-squared:  0.1736 
## F-statistic:  3.73 on 2 and 24 DF,  p-value: 0.03884

# Confidence intervals at 95%
# CI: (lwr, upr)
confint(modWine3)
##                    2.5 %      97.5 %
## (Intercept)  4.746010626 7.220074676
## Age          0.007702664 0.064409106
## WinterRain  -0.001030725 0.002593278

# Confidence intervals at other levels
confint(modWine3, level = 0.90)
##                       5 %        95 %
## (Intercept)  4.9575969417 7.008488360
## Age          0.0125522989 0.059559471
## WinterRain  -0.0007207941 0.002283347
confint(modWine3, level = 0.99)
##                    0.5 %      99.5 %
## (Intercept)  4.306650310 7.659434991
## Age         -0.002367633 0.074479403
## WinterRain  -0.001674299 0.003236852

# Compare with previous models
confint(modWine1)
##                     2.5 %        97.5 %
## (Intercept) -1.834844e+01 13.6632391095
## WinterRain   1.153872e-04  0.0021910509
## AGST         4.106337e-01  0.8182146540
## HarvestRain -5.577203e-03 -0.0020974232
## Age         -1.072931e-01  0.1348317795
## FrancePop   -2.858849e-04  0.0002416171
confint(modWine2)
##                     2.5 %       97.5 %
## (Intercept) -7.1524497573 -0.150690903
## WinterRain   0.0001670449  0.002166393
## AGST         0.4190113907  0.813771726
## HarvestRain -0.0055353098 -0.002185890
## Age          0.0089852800  0.038710748
confint(modWine3)
##                    2.5 %      97.5 %
## (Intercept)  4.746010626 7.220074676
## Age          0.007702664 0.064409106
## WinterRain  -0.001030725 0.002593278

In modWine3, the 95% CI for \(\beta_0\) is \((4.7460, 7.2201),\) for \(\beta_1\) is \((0.0077, 0.0644),\) and for \(\beta_2\) is \((-0.0010, 0.0026).\) Therefore, we can say with 95% confidence that the coefficient of WinterRain is non-significant in modWine3 (0 is inside its CI). But, inspecting the CI of the WinterRain coefficient in modWine2, we can see that it is significant for that model! How is this possible? The answer is that the presence of extra predictors affects the coefficient estimate, as we saw in Figure 2.7. Therefore, the precise statement to make is:

In model Price ~ Age + WinterRain, with \(\alpha=0.05,\) the coefficient of WinterRain is non-significant.

Note that this does not mean that the coefficient of WinterRain will always be non-significant: in Price ~ Age + AGST + HarvestRain + WinterRain (that is, modWine2) it is significant.
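A simple way of checking these statements programmatically is to see whether 0 lies inside each of the CIs computed above.

# TRUE if 0 is inside the 95% CI, that is, non-significance at alpha = 0.05
ci3 <- confint(modWine3)
(ci3[, 1] < 0) & (0 < ci3[, 2]) # Only WinterRain's CI contains 0

ci2 <- confint(modWine2)
(ci2[, 1] < 0) & (0 < ci2[, 2]) # No CI contains 0: all significant in modWine2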

Compute and interpret the CIs for the coefficients, at levels \(\alpha=0.10,0.05,0.01,\) for the following regressions:

  1. Price ~ WinterRain + HarvestRain + AGST (wine).
  2. AGST ~ Year + FrancePop (wine).

For the assumptions dataset, do the following:

  1. Regression y7 ~ x7. Check that:
    • The intercept is not significant for the regression at any reasonable level \(\alpha.\)
    • The slope is significant for any \(\alpha \geq 10^{-7}.\)
  2. Regression y6 ~ x6. Assume the linear model assumptions are verified.
    • Check that \(\hat\beta_0\) is significantly different from zero at any level \(\alpha.\)
    • For which \(\alpha=0.10,0.05,0.01\) is \(\hat\beta_1\) significantly different from zero?

In certain applications, it is useful to center the predictors \(X_1,\ldots,X_p\) prior to fitting the model, so that the slope coefficients \((\beta_1,\ldots,\beta_p)\) measure the effects of deviations of the predictors from their means. Theoretically, this amounts to considering the linear model

\[\begin{align*} Y=\beta_0+\beta_1(X_1-\mathbb{E}[X_1])+\cdots+\beta_p(X_p-\mathbb{E}[X_p])+\varepsilon. \end{align*}\]

In the sample case, we proceed by replacing \(X_{ij}\) with \(X_{ij}-\bar{X}_j,\) which can be easily done with the scale function (see below). If, in addition, the response is also centered, then \(\beta_0=0\) and \(\hat\beta_0=0.\) This centering of the data has no influence on the significance of the predictors (but it does affect the significance of \(\hat\beta_0\)), as it is just a linear transformation of them.

# By default, scale centers (subtracts the mean) and scales (divides by the
# standard deviation) the columns of a matrix
wineCen <- data.frame(scale(wine, center = TRUE, scale = FALSE))

# Regression with centered response and predictors
modWine3Cen <- lm(Price ~ Age + WinterRain, data = wineCen)

# Summary
summary(modWine3Cen)
## 
## Call:
## lm(formula = Price ~ Age + WinterRain, data = wineCen)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88964 -0.51421 -0.00066  0.43103  1.06897 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 5.964e-16  1.110e-01   0.000   1.0000  
## Age         3.606e-02  1.374e-02   2.625   0.0149 *
## WinterRain  7.813e-04  8.780e-04   0.890   0.3824  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5769 on 24 degrees of freedom
## Multiple R-squared:  0.2371, Adjusted R-squared:  0.1736 
## F-statistic:  3.73 on 2 and 24 DF,  p-value: 0.03884
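Observe that the slope estimates and their standard errors in modWine3Cen coincide with those of modWine3; only the intercept and its standard error change, the former becoming zero because Price was also centered.

# Compare coefficients and standard errors of the uncentered and centered fits
cbind(original = coef(modWine3), centered = coef(modWine3Cen))
cbind(original = sqrt(diag(vcov(modWine3))),
      centered = sqrt(diag(vcov(modWine3Cen))))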

  24. This is for theoretical and modeling convenience. With this assumption, we just model the randomness of \(Y\) given the predictors. If the randomness of \(Y\) and the randomness of \(X_1,\ldots,X_n\) were both to be modeled, we would require a significantly more complex model.

  25. Understood as small \(|(\mathbf{X}'\mathbf{X})^{-1}|.\)

  26. Prior to undertaking the estimation of \(\sigma,\) we have used the sample to estimate \(\hat{\boldsymbol\beta}.\) The situation is thus analogous to the distinction between the sample variance \(s_x^2=\frac{1}{n}\sum_{i=1}^n\left(X_i-\bar{X}\right)^2\) and the sample quasi-variance \(\hat{s}_x^2=\frac{1}{n-1}\sum_{i=1}^n\left(X_i-\bar{X}\right)^2\) that are computed from a sample \(X_1,\ldots,X_n.\) When estimating \(\mathbb{V}\mathrm{ar}[X],\) both previously estimate \(\mathbb{E}[X]\) through \(\bar{X}.\) The fact that \(\hat{s}_x^2\) accounts for that prior estimation through the degrees of freedom \(n-1\) makes that estimator unbiased for \(\mathbb{V}\mathrm{ar}[X]\) (\(s_x^2\) is not).

  27. Recall that the sample variance of \(\hat\varepsilon_1,\ldots,\hat\varepsilon_n\) is \(\frac{1}{n}\sum_{i=1}^n\left(\hat\varepsilon_i-\bar{\hat\varepsilon}\right)^2.\)

  28. In the sense of practically realistic.

  29. Shortcut for significantly different from zero.

  30. This is denoted as \(\frac{\hat{\beta}_j-0}{\hat{\mathrm{SE}}(\hat\beta_j)}\stackrel{H_0}{\sim}t_{n-p-1}.\)

  31. In R, \(t_{n-p-1;\alpha/2}\) can be computed as qt(p = 1 - alpha / 2, df = n - p - 1) or qt(p = alpha / 2, df = n - p - 1, lower.tail = FALSE).

  32. For example, '**' indicates that the \(p\)-value lies between \(0.001\) and \(0.01.\)

  33. This is a context-guided decision, not data-driven.

  34. Notice the use of - for excluding a particular predictor.