4.12 CIs and PIs in multiple regression
4.12.1 CIs for \(\beta\)’s
Previously, we looked at hypothesis tests about the individual coefficients, \(\beta_j\). We could do \(t\) tests with the null hypothesis that a particular \(\beta_j\) was 0 (i.e., that the corresponding term in the model wasn’t useful).
But a simple reject/fail-to-reject decision isn’t very informative; even a p-value only reflects your confidence in “0 or not 0.” These things don’t tell you what you think the true \(\beta_j\) value might actually be. For that, we need confidence intervals!
Turns out, these work the same way as ever: \(\text{estimate} \pm CV \times SE(\text{estimate})\).
- The CV (critical value) is drawn from the appropriate distribution for the estimate; it’s the same as the distribution of the test statistic used in the corresponding hypothesis test. So for a coefficient in a regression with \(k\) predictors, it’d come from a \(t_{n-k-1}\) distribution.
- The standard error of the estimate is the same as we established for the hypothesis test. Take the variance-covariance matrix \(Var(\boldsymbol{b})\), pick the relevant element on the diagonal, sub in \(s^2\) for \(\sigma^2\), and then take the square root.
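Putting those pieces together, here’s a minimal sketch of the calculation in Python (using numpy and scipy; the data and the choice of coefficient \(j\) are made up purely for illustration):

```python
import numpy as np
from scipy import stats

# Made-up data: n observations, k = 2 predictors, just for illustration
rng = np.random.default_rng(1)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # design matrix: intercept column + predictors
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=1.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                       # coefficient estimates
resid = y - X @ b
s2 = resid @ resid / (n - k - 1)            # s^2, our estimate of sigma^2
se_b = np.sqrt(s2 * np.diag(XtX_inv))       # SE(b_j): sqrt of the j-th diagonal element of s^2 (X'X)^{-1}

cv = stats.t.ppf(0.975, df=n - k - 1)       # critical value from t_{n-k-1} for a 95% CI
j = 1                                       # coefficient of the first predictor
print(b[j] - cv * se_b[j], b[j] + cv * se_b[j])
```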
You can also do CIs about multiple elements of \(\boldsymbol{\beta}\): this is like getting a confidence region. Again, we won’t go into depth on this, but note that you need to do an adjustment because you’re doing simultaneous guessing in multiple dimensions.
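For a flavor of what such an adjustment looks like, here’s a tiny sketch (Python/scipy; the numbers are hypothetical) of a Bonferroni correction, which is one simple, conservative option and not necessarily how a full confidence region would be constructed:

```python
from scipy import stats

# Hypothetical numbers: m simultaneous intervals for a model with n = 30, k = 2
m, alpha, n, k = 3, 0.05, 30, 2
cv_single = stats.t.ppf(1 - alpha / 2, df=n - k - 1)            # critical value for one interval alone
cv_bonferroni = stats.t.ppf(1 - alpha / (2 * m), df=n - k - 1)  # each of the m intervals gets alpha/m, so it's wider
```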
4.12.2 CIs and PIs for points
Here’s the good news: conceptually speaking, this part works pretty much exactly the same way as in simple linear regression. The variance and standard error expressions involve some additional stuff, since your regression is using multiple predictors and thus more \(b\)’s. Fortunately, we can use matrices to avoid writing out really gnarly formulas.
Also as before, if you are going to predict an individual value, there’s additional uncertainty involved.
The variance of fitted values in the dataset is:
\[\begin{aligned} Var(\hat{\boldsymbol y}) &= Var({\boldsymbol X \boldsymbol b})\\ &= Var(\boldsymbol X(\boldsymbol X' \boldsymbol X)^{-1}\boldsymbol X' \boldsymbol y)\\ &= Var(\boldsymbol H \boldsymbol y)\\ &={\boldsymbol H}Var({\boldsymbol y})\boldsymbol H'\\ &=\sigma^2{\boldsymbol H}{\boldsymbol H}'\\ &=\sigma^2{\boldsymbol H} \end{aligned}\]
where \({\boldsymbol H} = \boldsymbol X(\boldsymbol X' \boldsymbol X)^{-1}\boldsymbol X'\). This is the hat matrix again! The last step works because \(\boldsymbol H\) is symmetric and idempotent, so \(\boldsymbol H \boldsymbol H' = \boldsymbol H \boldsymbol H = \boldsymbol H\).
In simple linear regression, we found an expression for the variance of a fitted value. We can write it like this:
\[Var(\hat{y}_{i}) = \sigma^2\left(\frac{1}{n} + \frac{ (x_{i} - \bar{x})^2}{S_{xx}}\right)\]
The diagonal elements of \({\boldsymbol H}\) are the equivalents of \(\frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}\) for each point – except they take all \(k\) of the predictor values for that point into account.
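To make that connection concrete, here’s a small numerical check (Python/numpy, with made-up simple-regression data) that the diagonal of \({\boldsymbol H}\) reproduces \(\frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}\) when there’s just one predictor:

```python
import numpy as np

# Made-up simple-regression data, just to check the claim numerically
rng = np.random.default_rng(2)
n = 20
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])            # design matrix for simple linear regression

H = X @ np.linalg.inv(X.T @ X) @ X.T            # the hat matrix
leverage_from_H = np.diag(H)

Sxx = np.sum((x - x.mean()) ** 2)
leverage_formula = 1 / n + (x - x.mean()) ** 2 / Sxx

print(np.allclose(leverage_from_H, leverage_formula))  # True: the diagonal matches 1/n + (x_i - xbar)^2 / Sxx
```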
These diagonal elements are the formal leverage values for each point. This gives us an efficient, quantitative way to describe the potential influence of points on the regression line.
Of course, as always, this may not match the actual influence of the point; that depends on both the leverage and the point’s \(y\) value. If the point’s \(y\) value is in line with what the model-without-the-point would predict, then including the point doesn’t change anything. The surest way to tell (for now) is to look at the model results with and without the point included.
4.12.3 Variance of predictions at a new point
Suppose we have a new point whose response we want to predict. We’ll refer to the set of predictor values as \(\boldsymbol{x}_{\nu}\) (get it, nu, ’cause it’s new). That’s now a whole vector of values:
\[{\boldsymbol x}_{\nu} = \left(\begin{array}{c}1\\ x_{\nu 1}\\ x_{\nu 2} \\ \vdots \\ x_{\nu k}\end{array}\right)\] Note that, as is the default for all vectors in this course, \(\boldsymbol x_{\nu}\) is a column vector. But the information it contains is equivalent to a row of the \(\boldsymbol X\) matrix – all the predictor values for a single observation.
4.12.3.1 Prediction of the mean response
So what do we know about the expected or average response \(\hat{y}\), otherwise known as \(\hat{\mu}_{\nu}\), for this \(\boldsymbol x_{\nu}\)? There’s some uncertainty (stemming from our uncertainty about the true coefficient values):
\[ \begin{aligned} Var(\hat{\mu}_{\nu}) &= Var(\boldsymbol x_{\nu}' \boldsymbol b)\\ &= Var(\boldsymbol x_{\nu}'(\boldsymbol X' \boldsymbol X)^{-1}\boldsymbol X' \boldsymbol y)\\ &=\boldsymbol x_{\nu}'(\boldsymbol X' \boldsymbol X)^{-1}\boldsymbol X'\,Var(\boldsymbol y)\, (\boldsymbol x_{\nu}'(\boldsymbol X' \boldsymbol X)^{-1}\boldsymbol X')'\\ &=\sigma^2 \boldsymbol x_{\nu}'(\boldsymbol X' \boldsymbol X)^{-1}\boldsymbol X' \boldsymbol X(\boldsymbol X' \boldsymbol X)^{-1}\boldsymbol x_{\nu}\\ &=\sigma^2 \boldsymbol x_{\nu}'(\boldsymbol X' \boldsymbol X)^{-1}\boldsymbol x_{\nu} \end{aligned} \]
This is what we have to use to generate the confidence interval for the mean response at this point.
There are some interesting parallels between this formula and the ones we’ve seen previously. For example, we know that \((\boldsymbol X' \boldsymbol X)^{-1}\) involves the spread of the predictor values (remember how \(S_{xx}\) kept popping up in there?) – while the \(\boldsymbol x_{\nu}\) brings in this particular point’s predictor values. Just as in the simple regression case, a point with unusual/extreme predictor values is harder to predict than one with \(x\) values close to the average, so \(Var(\hat{\mu}_{\nu})\) will be larger for the more extreme point.
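Here’s a sketch of that interval calculation (Python again; the dataset and the new point \(\boldsymbol x_{\nu}\) are hypothetical):

```python
import numpy as np
from scipy import stats

# Made-up data and a hypothetical new point, just to show the mechanics
rng = np.random.default_rng(3)
n, k = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([2.0, 1.0, 0.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
resid = y - X @ b
s2 = resid @ resid / (n - k - 1)                 # s^2, the estimate of sigma^2

x_new = np.array([1.0, 0.3, -1.2])               # (1, x_nu1, x_nu2): hypothetical new predictor values
mu_hat = x_new @ b                               # estimated mean response at x_nu
se_mu = np.sqrt(s2 * x_new @ XtX_inv @ x_new)    # sqrt of s^2 * x_nu'(X'X)^{-1} x_nu

cv = stats.t.ppf(0.975, df=n - k - 1)
print(mu_hat - cv * se_mu, mu_hat + cv * se_mu)  # 95% CI for the mean response
```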
4.12.3.2 Prediction of a single response
For a new individual, there’s an additional level of wrong. Not only is there uncertainty in where the average response is for this set of predictor values, there’s uncertainty about how far from that average this individual is going to be. How much uncertainty? Why, \(\sigma^2\), as usual! So the variance of the individual response value has an extra \(\sigma^2\) added:
\[\begin{aligned} Var(y_{\nu}) &= \sigma^2 \boldsymbol x_{\nu}'(\boldsymbol X' \boldsymbol X)^{-1}\boldsymbol x_{\nu} + \sigma^2\\ &= \sigma^2(1+ \boldsymbol x_{\nu}'(\boldsymbol X' \boldsymbol X)^{-1}\boldsymbol x_{\nu}) \end{aligned}\]
This also looks a little familiar, in a way. If you look back to the simple-regression case, that formula had a similar structure: take the uncertainty about the average response, \(Var(\hat{\mu}_{\nu})\), and then throw in an extra \(\sigma^2\).
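Continuing the sketch from the previous subsection (so it reuses `X`, `b`, `s2`, `XtX_inv`, and `x_new` from there), the prediction interval just puts the extra \(s^2\) inside the square root:

```python
# Continuing the sketch above: the extra s^2 goes inside the square root
se_pred = np.sqrt(s2 * (1 + x_new @ XtX_inv @ x_new))
cv = stats.t.ppf(0.975, df=n - k - 1)
print(x_new @ b - cv * se_pred, x_new @ b + cv * se_pred)  # 95% PI, wider than the CI for the mean
```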