Chapter 15 Advanced OLS

We will prove the Gauss–Markov Theorem with matrix algebra and learn how to generate random numbers in R.

15.1 Derive OLS estimator (Matrix Form)

Suppose we have the linear statistical model \(y=XB+e\). Let y be an n x 1 vector of observations on the dependent variable

\[y = \begin{bmatrix}y_1\\y_2\\\vdots\\y_n\end{bmatrix}\].

Let X be an n x k matrix of observations on k - 1 independent variables, with a column of ones in the first column for the intercept. \[X = \begin{bmatrix}1 & X_{21} & X_{31}&\cdots&X_{k1}\\ 1 & X_{22} & X_{32}&\cdots&X_{k2}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & X_{2n} & X_{3n}&\cdots&X_{kn} \end{bmatrix}\]

Let \(\hat{\beta}\) be a k x 1 vector of estimators for B. \[\hat{\beta}=\begin{bmatrix}\hat{\beta_1}\\ \hat{\beta_2}\\ \vdots\\ \hat{\beta_k}\\ \end{bmatrix}\]

Let e be an n x 1 vector of errors.

\[e = \begin{bmatrix}e_1\\e_2\\\vdots\\e_n\end{bmatrix}\]

We want to find \(\hat{\beta}\) such that the sum of squared residuals, \(\sum{\hat{e}^2}\), is a minimum. The estimated equation is \[y = X\hat{\beta}+\hat{e}\]

The ordinary least squares estimator is the \(\hat{\beta}\) that minimizes \(\hat{e}^T\hat{e}\). Solving for \(\hat{e}\) yields \[\hat{e}=y-X\hat{\beta}\]

So,

\[\begin{aligned}\hat{e}^T\hat{e}&=(y-X\hat{\beta})^T(y-X\hat{\beta})\\ &= y^Ty-\hat{\beta}^TX^Ty-y^TX\hat{\beta} + \hat{\beta}^TX^TX\hat{\beta}\\ &=y^Ty-2\hat{\beta}^TX^Ty+\hat{\beta}^TX^TX\hat{\beta}\\ \end{aligned}\]

(The two middle terms can be combined because \(y^TX\hat{\beta}\) is a scalar and therefore equals its transpose, \(\hat{\beta}^TX^Ty\).)

Take the partial derivative of \(\hat{e}^T\hat{e}\) with respect to \(\hat{\beta}\) and set it equal to 0.

\[\begin{aligned} \frac{\partial\hat{e}^T\hat{e}}{\partial\hat{\beta}} &= -2X^Ty+2X^TX\hat{\beta}=0\\ -X^Ty + X^TX\hat{\beta} &= 0 \\ X^TX\hat{\beta} &= X^Ty \end{aligned}\]

Pre-multiply both sides by \((X^TX)^{-1}\) (which exists as long as X has full column rank):

\[\begin{aligned} (X^TX)^{-1}(X^TX)\hat{\beta} &= (X^TX)^{-1}X^Ty \\ I\hat{\beta} &= (X^TX)^{-1}X^Ty \\ \hat{\beta} &= (X^TX)^{-1}X^Ty \end{aligned}\]

15.1.1 Example

Suppose we have 14 observations on the dependent variable y:

\[y = \begin{bmatrix}199.9\\ 228\\ 235\\ 285\\ 239\\ 293\\ 285\\ 365\\ 295\\ 290\\ 385\\ 505\\ 425\\ 415\end{bmatrix}\]

We also have 14 observations on a single independent variable, which form the second column of X (the first column is a column of ones for the intercept):

\[X = \begin{bmatrix} 1 & 1065 \\ 1 & 1254 \\ 1 & 1300\\ 1 & 1577\\ 1 & 1600\\ 1 & 1750\\ 1 & 1800\\ 1 & 1870\\ 1 & 1935\\ 1 & 1948\\ 1 & 2254\\ 1 & 2600\\ 1 & 2800\\ 1 & 3000 \end{bmatrix}\]

Let’s find \(\hat{\beta} = \begin{bmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix}\) step by step using matrix operators in R. The matrix operators we need are in the table below.

Operator       What it does
%*%            matrix multiplication
t()            transposes a matrix
solve()        inverts a matrix
crossprod()    performs t(X) %*% X
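
As a quick illustration, here is how these operators behave on a small, arbitrary 2 x 2 matrix (the values are made up purely for demonstration).

A <- matrix(c(2, 1, 1, 3), nrow = 2)  # an arbitrary 2 x 2 matrix
t(A)             # the transpose of A
A %*% A          # matrix multiplication
solve(A)         # the inverse of A; solve(A) %*% A returns the identity matrix
crossprod(A)     # the same result as t(A) %*% A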

Let’s step through the calculations one at a time.

15.1.1.1 create X and y

# create the 14 x 1 column vector y
y <- c(199.9, 228, 235, 285, 239, 293, 285, 365, 295, 290, 385, 505, 425, 415)
# create the 14 x 2 matrix X
# cbind() combines vectors by columns into a matrix
X <- cbind(c(rep(1, 14)), # rep() repeats a value a given number of times
           c(1065, 1254, 1300, 1577, 1600, 1750, 1800, 1870, 1935, 1948, 2254, 2600, 2800, 3000))
y
 [1] 200 228 235 285 239 293 285 365 295 290 385 505 425 415
X
      [,1] [,2]
 [1,]    1 1065
 [2,]    1 1254
 [3,]    1 1300
 [4,]    1 1577
 [5,]    1 1600
 [6,]    1 1750
 [7,]    1 1800
 [8,]    1 1870
 [9,]    1 1935
[10,]    1 1948
[11,]    1 2254
[12,]    1 2600
[13,]    1 2800
[14,]    1 3000

15.1.1.2 create X transpose

X_T <- t(X)
X_T
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
[1,]    1    1    1    1    1    1    1    1    1     1     1     1     1     1
[2,] 1065 1254 1300 1577 1600 1750 1800 1870 1935  1948  2254  2600  2800  3000

15.1.1.3 create X transpose X

X_t_X <- X_T %*% X
X_t_X
      [,1]     [,2]
[1,]    14    26753
[2,] 26753 55462515
# alternatively we could call crossprod
X_T_X <- crossprod(X)
X_T_X
      [,1]     [,2]
[1,]    14    26753
[2,] 26753 55462515

15.1.1.4 invert X transpose X

X_T_X_inverse <- solve(X_T_X)
X_T_X_inverse
         [,1]        [,2]
[1,]  0.91293 -0.00044036
[2,] -0.00044  0.00000023
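
As a quick sanity check (not part of the original calculation), multiplying the inverse by the original matrix should return something numerically indistinguishable from the 2 x 2 identity matrix.

# the product should be the identity matrix, up to rounding error
round(X_T_X_inverse %*% X_T_X, digits = 6)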

15.1.1.5 X transpose X inverse X Transpose

X_T_X_inverse_X_T <- X_T_X_inverse %*% X_T
X_T_X_inverse_X_T
          [,1]      [,2]      [,3]      [,4]       [,5]       [,6]       [,7]
[1,]  0.443944  0.360715  0.340459  0.218478  0.2083499  0.1422955  0.1202774
[2,] -0.000195 -0.000151 -0.000141 -0.000077 -0.0000717 -0.0000371 -0.0000256
            [,8]       [,9]      [,10]      [,11]     [,12]     [,13]     [,14]
[1,]  0.08945199 0.06082841 0.05510370 -0.0796473 -0.232013 -0.320085 -0.408158
[2,] -0.00000943 0.00000555 0.00000854  0.0000791  0.000159  0.000205  0.000251

15.1.1.6 X transpose X inverse X Transpose y

beta <- X_T_X_inverse %*% X_T %*% y
beta
       [,1]
[1,] 52.351
[2,]  0.139

This is the vector of our estimates of B. So, the equation we have estimated is \(\hat{y} = 52.351 + 0.139X\).
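
As a check, the same estimates can be computed in a single line, or with R's built-in regression function lm(); both should reproduce the values in beta above.

# one-line version of beta = (X'X)^{-1} X'y
solve(crossprod(X), crossprod(X, y))
# R's built-in regression function gives the same intercept and slope
coef(lm(y ~ X[, 2]))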

15.2 Gauss–Markov Theorem

The Gauss–Markov theorem states that, among the class of linear unbiased estimators of B, the ordinary least squares estimator has the minimum variance. That is, the OLS estimator is BLUE: the Best Linear Unbiased Estimator. Below is the proof.

15.2.1 OLS estimator is linear

Since \(\hat{\beta} = (X^TX)^{-1}X^Ty\) and \((X^TX)^{-1}X^T\) is a matrix of fixed numbers (given X), \(\hat{\beta}\) is a linear function of y.

15.2.2 OLS estimator is unbiased

\(\hat{\beta}\) is an unbiased estimator of B if \(E(\hat{\beta})=B\)

\[E(\hat{\beta}) = E\left[(X^TX)^{-1}X^Ty\right]\] Substituting \(y=XB+e\) gives

\[\begin{aligned} E(\hat{\beta}) &= E\left[(X^TX)^{-1}X^T(XB+e)\right]\\ &=E\left[(X^TX)^{-1}(X^TX)B+(X^TX)^{-1}X^Te\right]\\ &=E\left[B+(X^TX)^{-1}X^Te\right]\\ &=E(B) + E\left[(X^TX)^{-1}X^Te\right]\\ \end{aligned}\]

Since B is a vector of fixed parameters, it is equal to its expected value, so

\[E(\hat{\beta}) = B + E\left[(X^TX)^{-1}X^Te\right]\]

For \(\hat{\beta}\) to be an unbiased estimator of B, \(E\left[(X^TX)^{-1}X^Te\right]\) must be \(0\). If X is a matrix of non-stochastic observations on the independent variables, then \[E\left[(X^TX)^{-1}X^Te\right] = (X^TX)^{-1}X^TE(e)\] Since \(E(e)=0\), \(\hat{\beta}\) is an unbiased estimator of B. If we assume that X is fixed in repeated samples, X is non-stochastic.

In the wild, X is rarely fixed in repeated samples, so X is stochastic. If X and e are correlated, then \(E\left[(X^TX)^{-1}X^Te\right]\ne0\) and \(\hat{\beta}\) is biased. This is the problem of endogeneity; the simulation sketch below illustrates both cases.
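
The data-generating process in this sketch (the sample size, the true parameters c(2, 3), and the degree of correlation between the regressor and e) is made up purely for illustration.

set.seed(123)
n <- 200
reps <- 2000
B_true <- c(2, 3)                   # true intercept and slope (chosen arbitrarily)
X_fix <- cbind(1, runif(n, 0, 10))  # a design matrix held fixed across samples

# Case 1: X is non-stochastic (fixed in repeated samples) and e has mean zero,
# so the estimates average out to B_true
beta_exog <- replicate(reps, {
  e <- rnorm(n, sd = 2)
  y_sim <- X_fix %*% B_true + e
  as.vector(solve(crossprod(X_fix), crossprod(X_fix, y_sim)))
})
rowMeans(beta_exog)                 # approximately 2 and 3

# Case 2: the regressor is correlated with e (endogeneity),
# so the slope estimates do not average out to 3
beta_endog <- replicate(reps, {
  e <- rnorm(n, sd = 2)
  x2 <- runif(n, 0, 10) + 0.5 * e   # regressor built from e, hence correlated with it
  X_sim <- cbind(1, x2)
  y_sim <- X_sim %*% B_true + e
  as.vector(solve(crossprod(X_sim), crossprod(X_sim, y_sim)))
})
rowMeans(beta_endog)                # the average slope is noticeably above 3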

15.2.3 Variance-Covariance is a minimum

Let’s find the “variance” of the OLS estimators,¹ \(\text{var-cov}(\hat{\beta})\).

\[\begin{aligned} \text{var-cov}(\hat{\beta}) &= E\left[(\hat{\beta}-B)(\hat{\beta}-B)^T\right]\\ \text{recall from above}\\ \hat{\beta} &= B + (X^TX)^{-1}X^Te\\ \text{so} \\ \hat{\beta} - B &= (X^TX)^{-1}X^Te\\ (\hat{\beta}-B)(\hat{\beta}-B)^T &= \left[(X^TX)^{-1}X^Te\right]\left[(X^TX)^{-1}X^Te\right]^T\\ &= (X^TX)^{-1}X^Tee^TX(X^TX)^{-1}\\ \text{thus} \\ \text{var-cov}(\hat{\beta}) &= E\left[ (X^TX)^{-1}X^Tee^TX(X^TX)^{-1}\right]\\ \text{if X is exogenous}\\ &= (X^TX)^{-1}X^TE(ee^T)X(X^TX)^{-1}\\ \text{assuming homoskedastic, uncorrelated errors, } E(ee^T) = \sigma^2I\\ &= (X^TX)^{-1}X^T \sigma^2 I X(X^TX)^{-1}\\ &= \sigma^2(X^TX)^{-1}X^T I X(X^TX)^{-1}\\ &= \sigma^2(X^TX)^{-1}X^T X(X^TX)^{-1}\\ \text{var-cov}(\hat{\beta})&=\sigma^2(X^TX)^{-1} \end{aligned}\]

To prove that this variance is the minimum among the class of linear unbiased estimators, we will show that any other unbiased linear estimator must have a larger variance. Let \(\tilde{\beta}\) be any other linear estimator of B, which can be written as \(\tilde{\beta} = \left[ (X^TX)^{-1}X^T+C \right]y\) where C is a matrix of constants. Substituting \(y = XB+e\) yields

\[ \begin{aligned} \tilde{\beta} &= \left[ (X^TX)^{-1}X^T+C \right]\left[ XB+e \right]\\ &= (X^TX)^{-1}X^TXB + CXB + (X^TX)^{-1}X^Te + Ce\\ &= B + CXB + (X^TX)^{-1}X^Te + Ce \end{aligned} \]

Taking expectations, \(E(\tilde{\beta}) = B + CXB\), so \(\tilde{\beta}\) is unbiased only if \(CX = 0\). Imposing \(CX=0\) gives \(\tilde{\beta} - B = \left[(X^TX)^{-1}X^T + C\right]e\), so

\[\begin{aligned} \text{var-cov}(\tilde{\beta}) &= E\left[(\tilde{\beta}-B)(\tilde{\beta}-B)^T\right]\\ &= \left[(X^TX)^{-1}X^T + C\right]E(ee^T)\left[(X^TX)^{-1}X^T + C\right]^T\\ &= \sigma^2\left[(X^TX)^{-1} + CC^T\right] \qquad \text{(using } CX=0 \text{ and hence } X^TC^T=0)\\ &= \text{var-cov}(\hat{\beta}) + \sigma^2CC^T \end{aligned}\]

Since \(CC^T\) is positive semidefinite, the variance of any other linear unbiased estimator is at least as large as that of OLS, with equality only when \(C=0\), in which case \(\tilde{\beta}=\hat{\beta}\). This completes the proof that OLS is BLUE.
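
Using the formula \(\text{var-cov}(\hat{\beta})=\sigma^2(X^TX)^{-1}\), we can estimate the variance-covariance matrix for the example in section 15.1.1. This is only a sketch: it reuses the X, y, and beta objects created above and replaces the unknown \(\sigma^2\) with the usual unbiased estimate \(s^2 = \hat{e}^T\hat{e}/(n-k)\).

# estimate sigma^2 with s^2 = e'e / (n - k)
e_hat <- y - X %*% beta
s2 <- as.numeric(crossprod(e_hat)) / (nrow(X) - ncol(X))
# plug s^2 into the var-cov formula
var_cov_beta <- s2 * solve(crossprod(X))
var_cov_beta
sqrt(diag(var_cov_beta))    # the standard errors of the estimates
vcov(lm(y ~ X[, 2]))        # lm() reports the same variance-covariance matrix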

15.3 Probability Distributions in R

Every distribution that R handles has four functions. There is a root name; for example, the root name for the normal distribution is norm. This root is prefixed by one of the letters

  • p for “probability”, the cumulative distribution function (c.d.f.)
  • q for “quantile”, the inverse c.d.f.
  • d for “density”, the density function (p.f. or p.d.f.)
  • r for “random”, a random variable having the specified distribution

For the normal distribution, these functions are pnorm, qnorm, dnorm, and rnorm.
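
For example (the particular values 0, 1.96, and 0.975 below are chosen only for illustration):

dnorm(0)       # density of the standard normal at 0, about 0.399
pnorm(1.96)    # c.d.f.: P(Z <= 1.96), about 0.975
qnorm(0.975)   # inverse c.d.f.: about 1.96
rnorm(3)       # three random draws from the standard normal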

For a continuous distribution (like the normal), the most useful functions for probability calculations are the “p” and “q” functions (c.d.f. and inverse c.d.f.), because the density (p.d.f.) calculated by the “d” function yields probabilities only after integration.

For a discrete distribution (like the binomial), the “d” function calculates the density (p. f.), which in this case is a probability \(f(x) = P(X = x)\) and hence is useful in calculating probabilities.
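
For example, with an (arbitrary) Binomial(10, 0.5) distribution:

dbinom(2, size = 10, prob = 0.5)   # P(X = 2), about 0.044
pbinom(2, size = 10, prob = 0.5)   # P(X <= 2), about 0.055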

R has functions to handle many probability distributions. The table below gives the names of the functions for a few of the distributions.

Distribution   Functions
Binomial       pbinom  qbinom  dbinom  rbinom
Chi-Square     pchisq  qchisq  dchisq  rchisq
F              pf      qf      df      rf
Normal         pnorm   qnorm   dnorm   rnorm
Student t      pt      qt      dt      rt
Uniform        punif   qunif   dunif   runif

You can see the specific arguments for each with args(pnorm), for example, or read the help page with ?pnorm.

15.3.1 Obtaining Critical Statistics

Make use of the “q” functions above to obtain critical values for hypothesis testing. For example, suppose we wanted to perform a two-tail t-test at the \(\alpha=5\%\) level of significance with \(df = 132\) degrees of freedom. We would call qt(p = .975, df = 132, lower.tail = TRUE). This returns \(t_{.025,132} = 1.978\), as shown below.
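
Here is that call, along with a chi-square critical value as a second (purely illustrative) example:

qt(p = .975, df = 132, lower.tail = TRUE)   # about 1.978
# the same idea works for other distributions,
# e.g. a 5% right-tail chi-square critical value with 10 degrees of freedom
qchisq(p = .95, df = 10)                    # about 18.31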

15.3.2 Generating Random Numbers

Suppose we’d like to generate a sample of \(n = 10\) random values of \(X\) such that \(X \sim N(12, 5)\), that is, with mean 12 and standard deviation 5. We would call rnorm(n = 10, mean = 12, sd = 5). One such call returned \(13.493, 14.434, 6.973, 9.344, 6.096, 6.82, 11.894, 10.542, 13.653, 9.515\); your values will differ because the draws are random.
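
If you need the same draws each time the code is run (for example, in a homework write-up), set the random seed first; the seed value below is arbitrary.

set.seed(101)                    # any integer works; it just fixes the sequence of draws
rnorm(n = 10, mean = 12, sd = 5)
# calling set.seed(101) again before rnorm() reproduces the same ten values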


  1. Recall that the variance of a random variable X is essentially the mean of its squared deviations from the mean: \(\text{Var}(X) = E\left[(X-\mu)^2\right]\).