2.5 Least squares for simple linear regression, matrix edition

So we’ve managed to figure out the normal equations for the best-fit regression coefficients. Now our job is to solve a rather complicated system of equations. Let’s do the simple regression version (i.e., with one predictor), but with matrices, to illustrate how this works.

First, ask yourself: How many columns are there in \(\boldsymbol{X}\)? What is \(\boldsymbol X' \boldsymbol X\)? \[ \boldsymbol{X}'\boldsymbol{X} = \left(\begin{array}{cccc} 1&1&\ldots&1\\ x_1&x_2&\ldots&x_n \end{array}\right)\left(\begin{array}{cc} 1&x_1\\ 1&x_2\\ \vdots&\vdots\\ 1&x_n\\ \end{array}\right) = \left(\begin{array}{cc} \sum_{i=1}^n (1\cdot 1)&\sum_{i=1}^n(1\cdot x_i)\\ \sum_{i=1}^n (x_i\cdot 1)&\sum_{i=1}^n(x_i\cdot x_i) \end{array}\right) = \left(\begin{array}{cc} n&\sum_{i=1}^nx_i\\ \sum_{i=1}^n x_i&\sum_{i=1}^nx_i^2 \end{array}\right) \]
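
A quick numerical sanity check, if you like seeing the algebra confirmed: the sketch below (in Python with numpy, using a small set of made-up \(x\) values, so none of these numbers come from anywhere in particular) builds the design matrix and compares \(\boldsymbol X' \boldsymbol X\) to the formula we just derived.

```python
import numpy as np

# A small made-up predictor, just to check the algebra numerically.
x = np.array([1.0, 2.0, 4.0, 7.0, 9.0])
n = len(x)

# Design matrix: a column of 1s (for the intercept) next to the x values.
X = np.column_stack([np.ones(n), x])

print(X.T @ X)
# Should match [[n, sum(x)], [sum(x), sum(x^2)]]:
print(np.array([[n, x.sum()], [x.sum(), (x ** 2).sum()]]))
```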

Okay, what is \((\boldsymbol X' \boldsymbol X)^{-1}\)?

The esteemed reference “mathisfun.com” says: “To find the Inverse of a 2x2 Matrix: swap the positions of a and d, put negatives in front of b and c, and divide everything by the determinant (ad-bc).”

Let’s first label the four values.

\[ a = n,\quad b = \sum_{i=1}^nx_i,\quad c = \sum_{i=1}^nx_i,\quad d = \sum_{i=1}^nx_i^2 \]

The first step is to find the determinant.

\[ \begin{aligned} \det(\boldsymbol X' \boldsymbol X) &= ad-bc \\ &= n\sum_{i=1}^nx_i^2 - \left(\sum_{i=1}^n x_i\right)^2 \\ &= n \left(\sum_{i=1}^nx_i^2 - n^{-1}\left(\sum_{i=1}^n x_i\right)^2\right)\\ &= n \left(\sum_{i=1}^nx_i^2 - n \bar{x}^2\right)\\ &= n S_{xx} \end{aligned} \]

Hey look, the sum of squares for x! Knew that would come in handy.
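
If you want to double-check that identity numerically too, here’s a sketch in the same spirit (Python/numpy, same made-up \(x\) values; purely illustrative).

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0, 9.0])
n = len(x)
X = np.column_stack([np.ones(n), x])

Sxx = np.sum((x - x.mean()) ** 2)      # sum of squares for x
det = np.linalg.det(X.T @ X)

print(det, n * Sxx)                    # the two should agree, up to rounding
```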

Then the inverse is:

\[ \begin{aligned} (\boldsymbol X' \boldsymbol X)^{-1} &= \frac{1}{nS_{xx}}\left(\begin{array}{cc} \sum_{i=1}^nx_i^2&-\sum_{i=1}^nx_i\\ -\sum_{i=1}^n x_i&n \end{array}\right)\\ &= \frac{1}{S_{xx}}\left(\begin{array}{cc} n^{-1}\sum_{i=1}^nx_i^2&-\bar{x}\\ -\bar{x}&1 \end{array}\right)\\ &= \frac{1}{S_{xx}}\left(\begin{array}{cc} n^{-1}(\sum_{i=1}^nx_i^2 - n\bar{x}^2+ n\bar{x}^2)&-\bar{x}\\ -\bar{x}&1 \end{array}\right)\\ &= \frac{1}{S_{xx}}\left(\begin{array}{cc} n^{-1}S_{xx} + \bar{x}^2&-\bar{x}\\ -\bar{x}&1 \end{array}\right)\\ &= \left(\begin{array}{cc} \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}&\frac{-\bar{x}}{S_{xx}}\\ \frac{-\bar{x}}{S_{xx}}&\frac{1}{S_{xx}} \end{array}\right)\\ \end{aligned} \]
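
And one more check, comparing this closed form against a generic numerical inverse. Same caveats as before: a Python/numpy sketch with made-up \(x\) values.

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0, 9.0])
n, xbar = len(x), x.mean()
Sxx = np.sum((x - xbar) ** 2)
X = np.column_stack([np.ones(n), x])

# The closed form we just derived...
closed_form = np.array([[1 / n + xbar**2 / Sxx, -xbar / Sxx],
                        [-xbar / Sxx,            1 / Sxx]])

# ...versus a generic numerical inverse of X'X.
print(np.allclose(closed_form, np.linalg.inv(X.T @ X)))   # True
```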

So what happens if we carry the multiplication all the way through? \[ \boldsymbol{b} = (\boldsymbol X' \boldsymbol X)^{-1} \boldsymbol X' \boldsymbol y\]

I’m focusing on the bottom row because that element is the slope. The first (top) element corresponds to the intercept, \(b_0\); while \(b_0\) is mathematically important for specifying where the line is, it’s often not very meaningful in context, and frankly it’s just not as interesting. Take my word for it that we can in fact find a closed-form solution for the best-fit \(b_0\), and we can all move on with our lives.

Let’s focus on that bottom row. \[\begin{align*} \begin{pmatrix}b_0 \\ b_1\end{pmatrix} &= \left(\begin{array}{cc} \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}&\frac{-\bar{x}}{S_{xx}}\\ \frac{-\bar{x}}{S_{xx}}&\frac{1}{S_{xx}} \end{array}\right) \begin{pmatrix} 1 & 1 & \dots & 1 \\ x_1 & x_2 & \dots & x_n \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \\ &= \begin{pmatrix} \dots & \text{stuff} & \dots \\ \frac{-\bar{x}}{S_{xx}}\cdot 1 + \frac{1}{S_{xx}}\cdot x_1 & \frac{-\bar{x}}{S_{xx}}\cdot 1 + \frac{1}{S_{xx}}\cdot x_2 & \dots \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \\ &= \begin{pmatrix} \text{stuff} \\ \frac{x_1-\bar{x}}{S_{xx}} y_1 + \frac{x_2-\bar{x}}{S_{xx}} y_2 + \dots + \frac{x_n-\bar{x}}{S_{xx}} y_n \end{pmatrix} \\ &= \begin{pmatrix} \text{stuff} \\ \sum_{i=1}^n \left( \frac{x_i-\bar{x}}{S_{xx}} \right) y_i \end{pmatrix} \\ \end{align*}\]
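
To see this whole computation play out numerically, here’s a sketch (Python/numpy again, now with made-up \(y\) values as well) that solves the normal equations directly, computes the slope as that weighted sum, and compares both against numpy’s own straight-line fit.

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0, 9.0])
y = np.array([2.1, 2.9, 6.2, 7.8, 10.5])   # made-up responses
n, xbar = len(x), x.mean()
Sxx = np.sum((x - xbar) ** 2)
X = np.column_stack([np.ones(n), x])

# The full matrix solution b = (X'X)^{-1} X'y
b = np.linalg.solve(X.T @ X, X.T @ y)

# The slope as a weighted sum of the y values
weights = (x - xbar) / Sxx
b1_weighted = np.sum(weights * y)

print(b)                          # [b0, b1]
print(b1_weighted)                # matches b[1]
print(np.polyfit(x, y, deg=1))    # numpy's own fit: [slope, intercept]
```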

Whoa.

Look at that second element. That’s \(b_1\), expressed in a very interesting way: as a weighted sum of the \(y\) values. Specifically, each \(y_i\) is weighted by the discrepancy between that point’s \(x\) value and the average \(x\) value. This is going to work in a similar way when we have more predictors, that is, more rows in the \(\boldsymbol{b}\) vector and columns in \(\boldsymbol{X}\).

This is a hint as to why we always say high-leverage points are of concern. The slope is a weighted sum of the \(y\) values, with weights determined by how far each \(x\) is from the mean. So for points with extreme \(x\) values, their \(y\) values are going to have a major, perhaps disproportionate, effect on our estimate of \(b_1\). This is still true in multiple regression!
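
Here’s one way to watch that happen: a little sketch (Python again, made-up data; the `slope` helper just implements the weighted-sum formula from above) that nudges a single \(y\) value by the same amount, once at a point whose \(x\) is near the mean and once at a point with an extreme \(x\), then compares how much the slope moves.

```python
import numpy as np

def slope(x, y):
    """Least-squares slope as a weighted sum of y values: sum of (x_i - xbar)/Sxx * y_i."""
    w = (x - x.mean()) / np.sum((x - x.mean()) ** 2)
    return np.sum(w * y)

x = np.array([1.0, 2.0, 4.0, 7.0, 20.0])   # the last point has an extreme x value
y = np.array([2.0, 3.1, 5.2, 8.1, 21.0])
base = slope(x, y)

# Nudge one y value by 5: once at x = 7 (close to the mean of x),
# once at x = 20 (the high-leverage point).
y_typical = y.copy(); y_typical[3] += 5
y_extreme = y.copy(); y_extreme[4] += 5

print(slope(x, y_typical) - base)   # about 0.004: the slope barely moves
print(slope(x, y_extreme) - base)   # about 0.28: a much bigger shift, from a single point
```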