2.16 Moments of coefficient estimates, MLR matrix edition
Let’s start with the expected value. If we consider our whole vector of coefficient estimates, \(\boldsymbol{b}\), what is the expected value of this vector?
Check yourself! Why is the expected value of the vector \(\boldsymbol{y}\) equal to \(\boldsymbol{X\beta}\)?
\[ \begin{aligned} E({\boldsymbol b}) &= E[(\boldsymbol X' \boldsymbol X)^{-1}\boldsymbol X' \boldsymbol y]\\ &= (\boldsymbol X' \boldsymbol X)^{-1}\boldsymbol X' E[{\boldsymbol y}]\\ &= (\boldsymbol X' \boldsymbol X)^{-1}\boldsymbol X' \boldsymbol X \boldsymbol \beta\\ &= \boldsymbol \beta \end{aligned} \]
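This unbiasedness is easy to check by simulation. Here's a minimal sketch with a made-up design matrix, \(\boldsymbol\beta\), and \(\sigma\) (nothing here comes from a real dataset): hold \(\boldsymbol X\) fixed, generate many response vectors, and average the estimates.

```r
# A minimal unbiasedness check: made-up X, beta, and sigma, purely for illustration.
set.seed(1)
n     <- 50
X     <- cbind(1, runif(n, 0, 10))   # design matrix: intercept column plus one predictor
beta  <- c(2, 0.5)                   # "true" coefficients
sigma <- 3                           # error standard deviation

b_hat <- replicate(5000, {
  y <- X %*% beta + rnorm(n, sd = sigma)     # y = X beta + epsilon
  drop(solve(t(X) %*% X) %*% t(X) %*% y)     # b = (X'X)^{-1} X' y
})

rowMeans(b_hat)   # should land very close to c(2, 0.5)
```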
How about variance? Well, since we have a whole vector of coefficient estimates, we need a variance-covariance matrix: \[ \begin{aligned} Var({\boldsymbol b}) &= Var[ (\boldsymbol X' \boldsymbol X)^{-1}\boldsymbol X' \boldsymbol y]\\ &= (\boldsymbol X' \boldsymbol X)^{-1}\boldsymbol X'Var[{\boldsymbol y}] ((\boldsymbol X' \boldsymbol X)^{-1}\boldsymbol X')'\\ &= (\boldsymbol X' \boldsymbol X)^{-1}\boldsymbol X'Var[{\boldsymbol y}] \boldsymbol X(\boldsymbol X' \boldsymbol X)^{-1}\\ &= (\boldsymbol X' \boldsymbol X)^{-1}\boldsymbol X'\sigma^2 \boldsymbol I \boldsymbol X(\boldsymbol X' \boldsymbol X)^{-1}\\ &= \sigma^2 (\boldsymbol X' \boldsymbol X)^{-1}\boldsymbol X' \boldsymbol X(\boldsymbol X' \boldsymbol X)^{-1}\\ &= \sigma^2(\boldsymbol X' \boldsymbol X)^{-1} \end{aligned} \]
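In R, this is the matrix that `vcov()` reports for a fitted `lm` object, with \(\sigma^2\) replaced by its usual estimate. Here's a quick sketch with simulated data (the variable names are just placeholders):

```r
# Compare sigma^2-hat * (X'X)^{-1} to vcov() from lm(), using simulated data.
set.seed(2)
n  <- 40
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 2)

fit <- lm(y ~ x1 + x2)
X   <- model.matrix(fit)

sigma2_hat <- sum(resid(fit)^2) / (n - ncol(X))   # estimate of sigma^2
sigma2_hat * solve(t(X) %*% X)                    # sigma^2-hat * (X'X)^{-1}
vcov(fit)                                         # matches, entry by entry
```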
Let’s check what happens with the simple linear regression case, with just an intercept and a single slope coefficient. Previously we found \((\boldsymbol{X}'\boldsymbol{X})^{-1}\) for this scenario, so let’s use it!
\[ \begin{aligned} (\boldsymbol X' \boldsymbol X)^{-1} &= \frac{1}{nS_{xx}}\left(\begin{array}{cc} \sum_{i=1}^nx_i^2&-\sum_{i=1}^nx_i\\ -\sum_{i=1}^n x_i&n \end{array}\right)\\ &= \frac{1}{S_{xx}}\left(\begin{array}{cc} n^{-1}\sum_{i=1}^nx_i^2&-\bar{x}\\ -\bar{x}&1 \end{array}\right)\\ &= \frac{1}{S_{xx}}\left(\begin{array}{cc} n^{-1}(\sum_{i=1}^nx_i^2 - n\bar{x}^2+ n\bar{x}^2)&-\bar{x}\\ -\bar{x}&1 \end{array}\right)\\ &= \frac{1}{S_{xx}}\left(\begin{array}{cc} n^{-1}S_{xx} + \bar{x}^2&-\bar{x}\\ -\bar{x}&1 \end{array}\right)\\ &= \left(\begin{array}{cc} \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}&\frac{-\bar{x}}{S_{xx}}\\ \frac{-\bar{x}}{S_{xx}}&\frac{1}{S_{xx}} \end{array}\right)\\ \end{aligned} \] So
\[ \begin{aligned} \sigma^2_{\varepsilon}(\boldsymbol X' \boldsymbol X)^{-1} &= \left(\begin{array}{cc} \sigma^2_{\varepsilon}(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}})&\frac{-\bar{x}\sigma^2_{\varepsilon}}{S_{xx}}\\ \frac{-\bar{x}\sigma^2_{\varepsilon}}{S_{xx}}&\frac{\sigma^2_{\varepsilon}}{S_{xx}} \end{array}\right)\\ \end{aligned} \]
The diagonal elements of the variance-covariance matrix are the variances of the individual vector components (so \(Var(b_0)\) and \(Var(b_1)\) here). If you’ve seen formulations for the variance or standard error of \(b_0\) or \(b_1\) before, these are equivalent – though you might not have used the sum-of-squares notation previously. Meanwhile, on the off-diagonal, we have the covariance of \(b_0\) and \(b_1\). Are they independent?
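If you'd like to see those entries line up numerically, here's a sketch with simulated data (again, made-up names and numbers) comparing the scalar formulas to `vcov()`:

```r
# Check the simple-regression entries of sigma^2 (X'X)^{-1} against the scalar formulas.
set.seed(3)
n <- 30
x <- runif(n, 0, 10)
y <- 4 + 1.5 * x + rnorm(n, sd = 2)

fit <- lm(y ~ x)
s2  <- sum(resid(fit)^2) / (n - 2)     # estimate of sigma^2_epsilon
Sxx <- sum((x - mean(x))^2)

c(var_b0   = s2 * (1 / n + mean(x)^2 / Sxx),   # Var(b0)
  var_b1   = s2 / Sxx,                         # Var(b1)
  cov_b0b1 = -mean(x) * s2 / Sxx)              # Cov(b0, b1)
vcov(fit)   # diagonal and off-diagonal entries should match
```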
2.16.1 Moments of regression estimators
2.16.1.1 Expected value of the slope
Remember that
\[ \begin{aligned} b_1 &= \frac{S_{xy}}{S_{xx}}\\ & = \frac{\sum{(x_i-\bar{x})y_i}}{S_{xx}}\\ &= \sum{k_i y_i} \end{aligned} \] where \(k_i = (x_i-\bar{x})/S_{xx}\).
You can use this handy expression to show that \(E(b_1) = \beta_1\).
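Here's a sketch of how that argument can go, using two handy facts about the \(k_i\)'s that follow directly from their definition: \(\sum k_i = 0\) and \(\sum k_i x_i = 1\).

\[ \begin{aligned} E(b_1) &= E\left(\sum k_i y_i\right) = \sum k_i E(y_i)\\ &= \sum k_i(\beta_0 + \beta_1 x_i)\\ &= \beta_0 \sum k_i + \beta_1 \sum k_i x_i\\ &= \beta_0(0) + \beta_1(1) = \beta_1 \end{aligned} \]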
2.16.1.2 Variance of the slope
First, let’s think about this: what does the variance of the slope seem to depend on?
- The spread of the \(x\) values
- The vertical spread of the points around the line (in other words, the size of the errors around the line)
- The number of points
Let’s put this into math. \(S_{xx}\) is equal to the (sample) variance of the \(x\)'s times \((n-1)\):
\[S_{xx} = (n-1)s_x^2\] With some fancy algebra, you can show that: \[Var(b_1) = \frac{\sigma^2}{S_{xx}} =\frac{\sigma^2}{s_x^2 (n-1)}\]
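Here's a sketch of that algebra, using \(b_1 = \sum k_i y_i\) from above, the independence of the \(y_i\)'s, and the fact that \(\sum k_i^2 = 1/S_{xx}\):

\[ \begin{aligned} Var(b_1) &= Var\left(\sum k_i y_i\right) = \sum k_i^2\, Var(y_i)\\ &= \sigma^2 \sum k_i^2 = \sigma^2 \sum \frac{(x_i-\bar{x})^2}{S_{xx}^2}\\ &= \sigma^2\frac{S_{xx}}{S_{xx}^2} = \frac{\sigma^2}{S_{xx}} \end{aligned} \]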
So the stability/variance of the slope depends only on:
- The spread of the \(x\) values, as represented by \(s_x\)
- The spread around the line, as represented by \(\sigma\) – which is the standard deviation of the errors, \(\varepsilon\)
- The number of points, as represented by \(n\)
2.16.1.3 Variance of the intercept (hooray, I guess)
How about \(b_0\)? What should its stability depend on?
It turns out (and you can show) that:
\[Var(b_0) = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right]\]
So you have the number of points and the spread of errors (again), and the relationship between the center of the \(x\) values, \(\bar{x}\), and their spread \(S_{xx}\).
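One way to show it (a sketch): write \(b_0 = \bar{y} - b_1\bar{x}\), and use the fact that \(\bar{y}\) and \(b_1\) are uncorrelated (which follows from \(\sum k_i = 0\)).

\[ \begin{aligned} Var(b_0) &= Var(\bar{y} - b_1\bar{x})\\ &= Var(\bar{y}) + \bar{x}^2\,Var(b_1)\\ &= \frac{\sigma^2}{n} + \bar{x}^2\,\frac{\sigma^2}{S_{xx}}\\ &= \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right] \end{aligned} \]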
Consider what would happen if you transformed \(X\) by shifting or scaling it. It’s important to note that such a transformation doesn’t change the fit itself: the fitted values and predictions stay exactly the same. You can’t make your predictions better just by changing units! But it can change which numbers look big in the output.
Check out the class code about global temperature over time (in which we transform the \(X\) values to be centered at 0), or any code where you shift the values of a predictor. Compare the “Std. Error” of the coefficients in these cases. For the slope, it doesn’t change. Sensible: none of the pieces we listed above are any different.
But the standard error of the intercept changes. Why? \(S_{xx}\) doesn’t change; it’s the same in both cases because \(S_{xx}\) is the sum of squared deviations of the \(x\)’s from their mean, so it measures only the spread of the \(x\) values, wherever they happen to be centered. But in the first case \(\bar{x}\) is huge compared to \(S_{xx}\), and in the second case it’s \(0\).
This makes sense, though – in the first case, the y-intercept, where \(x=0\), is far from the observed \(x\) values. If we’re even a little bit unsure about the slope of the line, we’re going to be pretty vague about \(\hat{y}\) when \(x\) is all the way out there. But in the second case, the y-intercept is right in the middle of the cloud of observations. Even if our line’s off by a little bit, the intercept is going to be pretty much the same.
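Here's a small sketch of that comparison with simulated data (not the actual class temperature data, just made-up numbers in the same spirit): fit the same model with the original predictor and with the mean-centered predictor, then compare the standard errors.

```r
# Shifting the predictor: the slope's SE is unchanged, but the intercept's SE
# shrinks dramatically when x-bar was far from 0. (Simulated data for illustration.)
set.seed(4)
year <- 1950:2009                                  # predictor values far from 0
temp <- 0.01 * (year - 1950) + rnorm(length(year), sd = 0.1)

fit_raw      <- lm(temp ~ year)
fit_centered <- lm(temp ~ I(year - mean(year)))

summary(fit_raw)$coefficients[, "Std. Error"]        # large intercept SE
summary(fit_centered)$coefficients[, "Std. Error"]   # same slope SE, much smaller intercept SE
```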