4.7 The t test for regression, with details
So, let’s say we have a nice regression with one predictor. We know some things about the first and second moments of the estimators \(b_0\) and, more interestingly, \(b_1\). But we would like to describe these distributions more completely. Well, we can do that – as in, we can say exactly what they are – if we make some assumptions about the distribution of the errors.
Why is that important? Because, as you have already observed, the errors are the only “source of randomness” in the whole mess. If we just had \(y=\beta_0+\beta_1 x\), we wouldn’t need statistics to talk about the line. We’d need two observations and some algebra.
Because of the \(+ \varepsilon\) part, though, the \(y\) values have a random component. And that means that \(b_1\), which as we showed is a linear combination of y’s, is also random – in a way that depends very closely on the distribution of \(\varepsilon\).
Since this is not the time to rock the boat, let us continue to assume the errors are normal.
Now, a linear combination of normal RV’s is also normal (you’ll probably prove that in Probability or someplace). So the distribution of \(b_1\) is normal. So if we subtract the mean and divide by the standard deviation, we’ll get a standard normal, and we can do all the usual fun confidence interval stuff, right?
Well, no.
4.7.1 Student’s \(t\) distribution
Recall from your intro stats experience: If a random variable \(X \sim N(\mu_X,\sigma_X)\) (that's mean and standard deviation), then the transformed RV
\[ Z = \frac{X-\mu_X}{\sigma_X}\]
has a standard normal distribution, \(N(0,1)\). Subtracting and dividing by constants hasn’t changed the shape of \(X\)’s distribution – it’s still normal.
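(A quick made-up example: if \(X \sim N(100, 15)\) and we observe \(x = 130\), then \(z = (130-100)/15 = 2\), so that observation sits two standard deviations above the mean.)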
Well, what if we estimate \(\sigma_X\)? Will
\[ \frac{X-\mu_X}{s_X}\]
still be normal?
Alas no.
Estimating \(\sigma_X\) adds variability to the denominator, so the ratio winds up more spread out than a standard normal. In particular, \(s_X\) often underestimates \(\sigma_X\), which makes the ratio larger in magnitude than it “should be” (in either the positive or negative direction).
To see more details, let’s rewrite the statistic:
\[ \frac{X-\mu_X}{\sigma_X}\big/ \frac{s_X}{\sigma_X}\] The numerator (on the left) is standard normal, as we said above. Let’s look at the square of the denominator:
\[ \frac{s_X^2}{\sigma_X^2} = \frac{S_{xx}}{(n-1)\sigma_X^2}= \frac{\sum (x_i - \bar{x})^2}{(n-1)\sigma_X^2} \sim \frac{\chi^2_{n-1}}{n-1}\] Yeah, that’s a chi-squared distribution, divided by its degrees of freedom. Proving this is beyond the scope of this course, but if you are curious, there is a proof here: https://onlinecourses.science.psu.edu/stat414/node/174.
So, that z-score that we calculated with unknown variance is… a ratio of a standard normal RV and the square root of a chi-squared RV divided by its degrees of freedom. Which, since the numerator and denominator are independent here, turns out to be (as we will also not prove) a Student’s \(t\).
\[ \frac{X-\mu_X}{s_X}\sim z\big/ \sqrt{\frac{\chi^2_{n-1}}{n-1}} \sim t_{n-1}\]
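If you’d rather see this in action than take it on faith, here’s a minimal simulation sketch (Python, with made-up values of \(n\), \(\mu\), and \(\sigma\); nothing hinges on the specific numbers). We standardize the sample mean by its estimated standard error and check how often the result lands outside the usual normal cutoff: more often than 5%, and by just about the amount \(t_{n-1}\) predicts.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps = 5, 100_000      # small n makes the heavy tails obvious
mu, sigma = 10, 3         # made-up "true" values

samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)             # estimated sigma, different every time
stat = (xbar - mu) / (s / np.sqrt(n))       # "z-score" with an estimated sd

cutoff = stats.norm.ppf(0.975)              # the usual 1.96
print(np.mean(np.abs(stat) > cutoff))       # noticeably more than 0.05...
print(2 * stats.t.sf(cutoff, df=n - 1))     # ...and close to the t_{n-1} tail area
```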
4.7.2 Okay, so what about the slope?
Let’s apply this to the slope, \(b_1\): \[b_1 \sim N \left( \beta_1, \frac{\sigma}{\sqrt{S_{xx}}} \right)\]
We estimate the error variance \(\sigma^2\) with the mean squared error (MSE), dividing the SSE by \(n-2\):
\[MSE = \frac{\sum{e_i^2}}{n-2} = s^2\]
and so if we estimate \(\sigma\) with \(s = \sqrt{MSE}\), then
\[\frac{b_1 -\beta_1}{s/\sqrt{S_{xx}}} \sim t_{n-2}\]
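Again, no proof here, but a simulation can at least make the claim believable. Here’s a sketch (Python, with made-up values for the true \(\beta\)’s, \(\sigma\), and the \(x\)’s): simulate many datasets from a known line, compute \((b_1 - \beta_1)/(s/\sqrt{S_{xx}})\) each time, and compare the resulting quantiles to \(t_{n-2}\) (and to the normal, which doesn’t match).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
beta0, beta1, sigma = 1.0, 2.0, 4.0     # made-up "true" values
x = np.linspace(0, 10, 15)              # fixed x's, so n = 15
n = len(x)
Sxx = np.sum((x - x.mean()) ** 2)

reps = 50_000
t_stats = np.empty(reps)
for i in range(reps):
    y = beta0 + beta1 * x + rng.normal(0, sigma, n)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
    b0 = y.mean() - b1 * x.mean()
    s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))   # sqrt(MSE)
    t_stats[i] = (b1 - beta1) / (s / np.sqrt(Sxx))

print(np.quantile(t_stats, [0.025, 0.975]))   # simulated quantiles...
print(stats.t.ppf([0.025, 0.975], df=n - 2))  # ...match t_{n-2} (about +/- 2.16)...
print(stats.norm.ppf([0.025, 0.975]))         # ...not the normal (+/- 1.96)
```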
4.7.3 The t test in multiple regression
So, we have worked out the whole distribution of the \(b_1\) estimates. We can make confidence intervals with the critical values of a \(t\) distribution! And we can do a hypothesis test of the null hypothesis that \(\beta_1=0\), by comparing a sample statistic \[t_{n-2} = \frac{b_1}{s/\sqrt{S_{xx}}}\] to the \(t\) distribution and finding a p-value! It is a world of adventure out there.
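To make this concrete, here’s a small worked sketch in Python on made-up data (the specific numbers, and the use of statsmodels as a cross-check, are just for illustration; it’s the formulas that matter). We compute \(b_1\), \(s = \sqrt{MSE}\), and \(SE(b_1) = s/\sqrt{S_{xx}}\) by hand, then the t statistic, a two-sided p-value, and a 95% confidence interval using \(t_{n-2}\), and compare against what a canned regression routine reports for the slope.

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 30)                      # made-up data, n = 30
y = 1.0 + 2.0 * x + rng.normal(0, 4.0, 30)
n = len(x)

# Slope, sqrt(MSE), and standard error "by hand"
Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
se_b1 = s / np.sqrt(Sxx)

# Test of H0: beta_1 = 0, plus a 95% CI, using t_{n-2}
t_obs = b1 / se_b1
p_value = 2 * stats.t.sf(abs(t_obs), df=n - 2)
t_crit = stats.t.ppf(0.975, df=n - 2)
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
print(b1, se_b1, t_obs, p_value, ci)

# Cross-check against a canned routine (the slope is coefficient 1)
fit = sm.OLS(y, sm.add_constant(x)).fit()
print(fit.params[1], fit.bse[1], fit.tvalues[1], fit.pvalues[1])
print(fit.conf_int()[1])
```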
But we did all this with the single-predictor case. What if we have more stuff in the model?
Well, happily, it all works pretty much the same way. (In fact, the way the matrix-style equations are written doesn’t change at all! Only the dimensions of the matrices change.)
In the multiple regression context, we have some vector of coefficients \(\boldsymbol{\beta}\) of length \(k+1\). Suppose we want to do a hypothesis test about a particular element of \(\boldsymbol{\beta}\). For example: is \({\beta}_2\) equal to 0? To test this, we do the following (a code sketch after the list works through these steps on made-up data):
- Find the estimate \(b_2\) based on our sample data
- Find a test statistic from this estimate: \((b_2-\beta_2)/SE(b_2)\), which under the null hypothesis \(\beta_2=0\) is just \(b_2/SE(b_2)\)
- Since the variance of \(\boldsymbol{b}\) is \(\sigma^2(\boldsymbol{X}'\boldsymbol{X})^{-1}\), the variance of \(b_2\) is \(\sigma^2\) times the third diagonal element of \((\boldsymbol{X}'\boldsymbol{X})^{-1}\). (Remember the coefficients are indexed starting at 0, so \(b_2\) is the third one!)
- The square root of that number yields the standard deviation of \(b_2\). But!
- We still don’t know \(\sigma^2\) (probably we should get used to that). So, as before, we sub in \(s\). But! This should worry you…
- Find the distribution of the test statistic
- The estimate \(b_2\) itself is normally distributed, because all the randomness comes from the normal \(\boldsymbol{\varepsilon}\). But!
- Also as before, \(s\) is a flawed estimate of \(\sigma\), and so that “standardized” \(b_2\) estimate isn’t going to be normal. Instead (we won’t prove this), it’s \(t\)-distributed.
- As you might suspect, the degrees of freedom on that \(t\) are different now! Previously, we subtracted two from \(n\): one for each thing we estimated (intercept and slope). Now, we’re estimating \(k+1\) things, so the degrees of freedom will be \(n-k-1\).
- Compare the observed value of the test statistic to its null distribution. If the observed value would be very unlikely according to that distribution (that is, \(P(|t|\ge |t_{obs}|)<\alpha\)), reject the null. Of course, you should actually report the p-value itself, and for preference a confidence interval, so you don’t wind up in this territory: https://xkcd.com/1478/.
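Here’s the promised sketch of the whole procedure (Python, with made-up data; \(n\), \(k\), and the true coefficients are arbitrary, and I’ve set the true \(\beta_2\) to 0 so the null actually holds here). It builds the design matrix, gets all \(k+1\) estimates at once from \((\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{y}\), and tests \(\beta_2 = 0\) using \(n-k-1\) degrees of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, k = 40, 3                             # made-up sizes: n observations, k predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # first column = intercept
beta = np.array([1.0, 0.5, 0.0, -2.0])   # made-up "truth"; beta_2 really is 0
y = X @ beta + rng.normal(0, 2.0, n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                    # all k+1 estimates at once
resid = y - X @ b
s2 = np.sum(resid ** 2) / (n - k - 1)    # MSE, with n - k - 1 degrees of freedom

# Test H0: beta_2 = 0 (index 2, i.e. the third diagonal element of (X'X)^{-1})
se_b2 = np.sqrt(s2 * XtX_inv[2, 2])
t_obs = b[2] / se_b2
p_value = 2 * stats.t.sf(abs(t_obs), df=n - k - 1)
print(b[2], se_b2, t_obs, p_value)
```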
You can also do “compound” hypothesis tests about multiple elements of \(\boldsymbol{\beta}\), but we won’t go into depth here. Just remember that any time you do multiple hypothesis tests, you’d better do an adjustment. See the jelly bean xkcd comic: https://xkcd.com/882/.