4.7 The t test for regression, with details

So, let’s say we have a nice regression with one predictor. We know some things about the first and second moments of the estimators $b_0$ and, more interestingly, $b_1$. But we would like to describe these distributions more completely. Well, we can do that – as in, we can say exactly what they are – if we make some assumptions about the distribution of the errors.

Why is that important? Because, as you have already observed, the errors are the only “source of randomness” in the whole mess. If we just had $y=\beta_0+\beta_1 x$, we wouldn’t need statistics to talk about the line. We’d need two observations and some algebra.

Because of the $+\varepsilon$ part, though, the $y$ values have a random component. And that means that $b_1$, which as we showed is a linear combination of $y$’s, is also random – in a way that depends very closely on the distribution of $\varepsilon$.

Since this is not the time to rock the boat, let us continue to assume the errors are normal.

Now, a linear combination of normal RV’s is also normal (you’ll probably prove that in Probability or someplace). So the distribution of $b_1$ is normal. So if we subtract the mean and divide by the standard deviation, we’ll get a standard normal, and we can do all the usual fun confidence interval stuff, right?

Well, no.

4.7.1 Student’s t distribution

Recall from your intro stats experience: if a random variable $X \sim N(\mu_X, \sigma_X)$, then the transformed RV

$$Z = \frac{X - \mu_X}{\sigma_X}$$

has a standard normal distribution, $N(0,1)$. Subtracting and dividing by constants hasn’t changed the shape of $X$’s distribution – it’s still normal.
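For example, if $X \sim N(10, 2)$, then $Z = (X-10)/2 \sim N(0,1)$: the same bell curve, just recentered at 0 and rescaled to have standard deviation 1.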

Well, what if we estimate $\sigma_X$? Will

$$\frac{X - \mu_X}{s_X}$$

still be normal?

Alas no.

Estimating $\sigma_X$ adds variability to the denominator – so the ratio winds up being more spread out than a standard normal. In particular, $s_X$ often underestimates $\sigma_X$, which makes the ratio larger than it “should be” (in either the positive or negative direction).
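Here’s a minimal simulation sketch of that effect, using the familiar one-sample-mean case as a stand-in (so the “estimator” is $\bar{X}$ and its estimated standard deviation is $s/\sqrt{n}$; the sample size, mean, SD, and seed are all arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps = 5, 100_000
mu, sigma = 10, 2

samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)                 # estimated sigma

z_known = (xbar - mu) / (sigma / np.sqrt(n))    # true sigma in the denominator
t_est = (xbar - mu) / (s / np.sqrt(n))          # estimated sigma in the denominator

# For a standard normal, about 5% of draws land beyond +/- 1.96
print("beyond 1.96, known sigma:    ", np.mean(np.abs(z_known) > 1.96))
print("beyond 1.96, estimated sigma:", np.mean(np.abs(t_est) > 1.96))
```

With only $n = 5$ observations, the estimated-sigma version lands outside $\pm 1.96$ noticeably more often than 5% of the time – those are the heavy tails the t distribution is about to account for.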

To see what’s going on in more detail, let’s rewrite the statistic:

$$\frac{X - \mu_X}{\sigma_X} \bigg/ \frac{s_X}{\sigma_X}$$

The numerator (on the left) is normal, as we said above. Let’s look at the denominator:

$$\frac{s_X^2}{\sigma_X^2} = \frac{S_{xx}}{(n-1)\sigma_X^2} = \frac{\sum (x_i - \bar{x})^2}{(n-1)\sigma_X^2} \sim \frac{\chi^2_{n-1}}{n-1}$$

Yeah, that’s a (scaled) chi-squared distribution. Proving this is beyond the scope of this course, but if you are curious, there is a proof here: https://onlinecourses.science.psu.edu/stat414/node/174.
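If you want to convince yourself without the proof, here’s a quick simulation sketch checking that $s_X^2/\sigma_X^2$ behaves like $\chi^2_{n-1}/(n-1)$ (the sample size, SD, and seed are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps = 5, 100_000
sigma = 2

samples = rng.normal(0, sigma, size=(reps, n))
ratio = samples.var(axis=1, ddof=1) / sigma**2   # s^2 / sigma^2 for each sample

# Compare a few quantiles of the simulated ratio to chi^2_{n-1} / (n-1)
probs = [0.1, 0.5, 0.9]
print("simulated quantiles:", np.quantile(ratio, probs).round(3))
print("chi2_{n-1} / (n-1): ", (stats.chi2.ppf(probs, df=n - 1) / (n - 1)).round(3))
```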

So, that z-score that we calculated with unknown variance is… a ratio of a normal RV to the square root of a chi-squared RV divided by its degrees of freedom. Which turns out to be (as we will also not prove) a Student’s t:

$$\frac{X - \mu_X}{s_X} \sim \frac{z}{\sqrt{\chi^2_{n-1}/(n-1)}} \sim t_{n-1}$$
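To see how much this matters in practice, here’s a quick look at the two-sided 5% critical values (a sketch assuming scipy is available; the familiar normal cutoff is 1.96):

```python
from scipy import stats

# Two-sided 5% critical values: the t cutoff shrinks toward 1.96 as df grows
for df in (2, 5, 10, 30, 100):
    print(f"df = {df:3d}:  t critical value = {stats.t.ppf(0.975, df):.3f}")
print(f"normal:    z critical value = {stats.norm.ppf(0.975):.3f}")
```

With just a few degrees of freedom, the t cutoff is much bigger than 1.96 – that’s the price of estimating $\sigma_X$.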

4.7.2 Okay, so what about the slope?

Let’s apply this to the slope, $b_1$:

$$b_1 \sim N\left(\beta_1, \frac{\sigma}{\sqrt{S_{xx}}}\right)$$
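As a sanity check on that standard deviation, here’s a small simulation sketch (the true coefficients, $x$ values, error SD, and seed are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
beta0, beta1, sigma = 1.0, 2.0, 3.0
x = np.linspace(0, 10, 25)                 # fixed predictor values
Sxx = np.sum((x - x.mean())**2)

# Simulate many datasets from the same line and re-estimate the slope each time
reps = 20_000
b1_draws = np.empty(reps)
for i in range(reps):
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=x.size)
    b1_draws[i] = np.sum((x - x.mean()) * (y - y.mean())) / Sxx

print(f"simulated SD of b1: {b1_draws.std():.4f}")
print(f"sigma / sqrt(Sxx):  {sigma / np.sqrt(Sxx):.4f}")
```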

We estimate the error variance $\sigma^2$ with the mean squared error (MSE), dividing the SSE by $n-2$:

$$MSE = \frac{\sum e_i^2}{n-2} = s^2$$

and so if we estimate $\sigma$ with $s=\sqrt{MSE}$, then

$$\frac{b_1 - \beta_1}{s/\sqrt{S_{xx}}} \sim t_{n-2}$$
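Putting the pieces together, here’s a sketch of the whole calculation on a small invented data set (only the formulas come from above; the numbers are made up):

```python
import numpy as np
from scipy import stats

# Invented data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.5, 13.9, 16.2])
n = len(x)

# Least-squares estimates
Sxx = np.sum((x - x.mean())**2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()

# Residuals, MSE, and the standard error of b1
e = y - (b0 + b1 * x)
MSE = np.sum(e**2) / (n - 2)
se_b1 = np.sqrt(MSE) / np.sqrt(Sxx)

# t statistic for H0: beta1 = 0, with a two-sided p-value on n - 2 df
t_obs = b1 / se_b1
p_value = 2 * stats.t.sf(abs(t_obs), df=n - 2)
print(f"b1 = {b1:.3f}, SE = {se_b1:.3f}, t = {t_obs:.2f}, p = {p_value:.2g}")
```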

4.7.3 The t test in multiple regression

So, we have worked out the whole distribution of the $b_1$ estimates. We can make confidence intervals with the critical values of a t distribution! And we can do a hypothesis test of the null hypothesis that $\beta_1=0$, by comparing a sample statistic $t_{n-2} = \frac{b_1}{s/\sqrt{S_{xx}}}$ to the t distribution and finding a p-value! It is a world of adventure out there.
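For example, a 95% confidence interval for $\beta_1$ is $b_1 \pm t^*_{n-2}\, s/\sqrt{S_{xx}}$, where $t^*_{n-2}$ is the 97.5th percentile of the $t_{n-2}$ distribution. Here’s a sketch of that, cross-checked against fitted-model output (assuming the statsmodels package is available; same invented data as above, repeated so this snippet stands alone):

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.5, 13.9, 16.2])
n = len(x)

# By hand: b1 +/- t* * SE(b1)
Sxx = np.sum((x - x.mean())**2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - (b0 + b1 * x))**2) / (n - 2))
t_star = stats.t.ppf(0.975, df=n - 2)
print("by hand:    ", b1 - t_star * s / np.sqrt(Sxx), b1 + t_star * s / np.sqrt(Sxx))

# The same interval from a fitted OLS model
fit = sm.OLS(y, sm.add_constant(x)).fit()
print("statsmodels:", fit.conf_int()[1])   # second row corresponds to the slope
```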

But we did all this with the single-predictor case. What if we have more stuff in the model?

Well, happily, it all works pretty much the same way. (In fact, the way the matrix-style equations are written doesn’t change at all! Only the dimensions of the matrices change.)

In the multiple regression context, we have some vector of coefficients $\boldsymbol{\beta}$ of length $k+1$. Suppose we want to do a hypothesis test about a particular element of $\boldsymbol{\beta}$. For example: is $\beta_2$ equal to 0? To test this, we take these steps (a numerical sketch in code follows the list):

  • Find the estimate $b_2$ based on our sample data
  • Find a test statistic from this estimate: $(b_2-\beta_2)/SE(b_2)$
    • Since the variance of $\boldsymbol{b}$ is $\sigma^2(\boldsymbol{X}'\boldsymbol{X})^{-1}$, the variance of $b_2$ is $\sigma^2$ times the (3,3) element of $(\boldsymbol{X}'\boldsymbol{X})^{-1}$. (Remember the indexing starts at 0!)
    • The square root of that number yields the standard deviation of $b_2$. But!
    • We still don’t know $\sigma^2$ (probably we should get used to that). So, as before, we sub in $s$. But! This should worry you…
  • Find the distribution of the test statistic
    • The estimate $b_2$ itself is normally distributed, because all the randomness comes from the normal $\boldsymbol{\varepsilon}$. But!
    • Also as before, $s$ is a flawed estimate of $\sigma$, and so that “standardized” $b_2$ estimate isn’t going to be normal. Instead (we won’t prove this), it’s t-distributed.
    • As you might suspect, the degrees of freedom on that t are different now! Previously, we subtracted two from $n$: one for each thing we estimated (intercept and slope). Now, we’re estimating $k+1$ things, so the degrees of freedom will be $n-k-1$.
  • Compare the observed value of the test statistic to its null distribution. If the observed value would be very unlikely according to that distribution (that is, $P(|t|\ge |t_{obs}|)<\alpha$), reject the null. Of course, you should actually report the p-value itself, and for preference a confidence interval, so you don’t wind up in this territory: https://xkcd.com/1478/.
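Here’s a minimal sketch of those steps with invented data and $k = 2$ predictors (your regression software does all of this for you; the point is to see the matrix pieces in action):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Invented data: n observations, k = 2 predictors, plus a column of 1s for the intercept
n, k = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, 0.0])           # here beta_2 really is 0
y = X @ beta_true + rng.normal(0, 1.5, size=n)

# b = (X'X)^{-1} X'y
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y

# s^2 = SSE / (n - k - 1); SE(b_2) uses the diagonal element of (X'X)^{-1} for b_2
e = y - X @ b
s2 = e @ e / (n - k - 1)
se_b2 = np.sqrt(s2 * XtX_inv[2, 2])             # index 2 = third diagonal entry

# t statistic and two-sided p-value on n - k - 1 degrees of freedom
t_obs = b[2] / se_b2
p_value = 2 * stats.t.sf(abs(t_obs), df=n - k - 1)
print(f"b2 = {b[2]:.3f}, SE = {se_b2:.3f}, t = {t_obs:.2f}, p = {p_value:.3f}")
```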

You can also do “compound” hypothesis tests about multiple elements of $\boldsymbol{\beta}$, but we won’t go into depth here. Just remember that any time you do multiple hypothesis tests, you’d better do an adjustment. See the jelly bean xkcd comic: https://xkcd.com/882/.