Chapter 17 Least Squares Estimation for Linear Models

17.1 Introduction

In Section 16 we introduced linear models with particular emphasis on Normal linear models. We derived the least squares estimates of the model parameters for the straight line model:
$$y = \alpha + \beta x + \epsilon,$$

and showed that if $\epsilon \sim N(0, \sigma^2)$ then the least squares estimates coincide with the maximum likelihood estimates of the parameters. In this section we consider the mathematics behind least squares estimation for general linear models. This relies heavily on linear algebra (matrix manipulation) and we give a review of key linear algebra results in Section 17.2. The main message is that we can concisely express key quantities such as the least squares parameter estimates, $\hat{\beta}$, the fitted values, $\hat{y}$, and the residuals, $\epsilon$, as functions of matrices.

17.2 Linear algebra review

Rank of a matrix
Let $M$ be any $n \times m$ matrix. Then the rank of $M$ is the maximum number of linearly independent column vectors of $M$.

Transpose of a matrix
If $M = (m_{ij})$, then $M^T = (m_{ji})$ is said to be the transpose of the matrix $M$.

Properties of square matrices
Suppose $A$ is a square $n \times n$ matrix, then

  • $A$ is symmetric if and only if $A^T = A$;
  • $A^{-1}$ is the inverse of $A$ if and only if $AA^{-1} = A^{-1}A = I_n$;
  • The matrix $A$ is nonsingular if and only if $\text{rank}(A) = n$;
  • $A$ is orthogonal if and only if $A^{-1} = A^T$;
  • $A$ is idempotent if and only if $A^2 = AA = A$;
  • $A$ is positive definite if and only if $x^TAx > 0$ for all non-zero vectors $x$.

Note the following two important results:

  • $A$ has an inverse if and only if $A$ is nonsingular, that is, the rows and columns are linearly independent;
  • $A^TA$ is positive definite if $A$ has an inverse.

The following computational results are also useful:

  • Let $N$ be an $n \times p$ matrix and $P$ be a $p \times n$ matrix, then
    $$(NP)^T = P^TN^T;$$
  • Suppose $A$ and $B$ are two invertible $n \times n$ matrices, then
    $$(AB)^{-1} = B^{-1}A^{-1};$$
  • We can write the sum of squares $\sum_{i=1}^n x_i^2 = x^Tx$, where $x^T = [x_1, x_2, \ldots, x_n]$ is a $1 \times n$ row vector.

Given $n$-dimensional vectors $x$ and $y = y(x)$, we define
$$\frac{dy}{dx} = \begin{pmatrix} \dfrac{\partial y_1}{\partial x_1} & \cdots & \dfrac{\partial y_n}{\partial x_1} \\ \dfrac{\partial y_1}{\partial x_2} & \cdots & \dfrac{\partial y_n}{\partial x_2} \\ \vdots & & \vdots \\ \dfrac{\partial y_1}{\partial x_n} & \cdots & \dfrac{\partial y_n}{\partial x_n} \end{pmatrix}.$$

Then the following results hold in the calculus of matrices:

  • $\frac{d}{dx}(Ax) = A^T$, where $A$ is a matrix of constants;
  • $\frac{d}{dx}(x^TAx) = (A + A^T)x = 2Ax$ whenever $A$ is symmetric;
  • If $f(x)$ is a function of several variables, a necessary condition to maximise or minimise $f(x)$ is
    $$\frac{\partial f(x)}{\partial x} = 0.$$
  • Let $H = \frac{\partial^2 f(x)}{\partial x \partial x^T}$ be the Hessian of $f$, that is, the matrix of second derivatives. Then a maximum will occur if $H$ is negative definite, and a minimum will occur if $H$ is positive definite.

Let $A$ be a matrix of constants and $Y$ be a random vector, then we have the following expectation and variance results:

  • $E[AY] = AE[Y]$;
  • $\text{Var}(AY) = A\text{Var}(Y)A^T$.
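As a quick numerical illustration of the computational results above, here is a short R sketch (the matrices and vectors are arbitrary toy values, not taken from the notes) checking the transpose and inverse identities and the sum-of-squares expression.

```r
# Toy numerical checks of the matrix identities above
set.seed(1)

N <- matrix(rnorm(6), nrow = 2, ncol = 3)          # a 2 x 3 matrix
P <- matrix(rnorm(6), nrow = 3, ncol = 2)          # a 3 x 2 matrix
all.equal(t(N %*% P), t(P) %*% t(N))               # (NP)^T = P^T N^T

A <- matrix(c(2, 1, 1, 3), nrow = 2)
B <- matrix(c(1, 0, 2, 1), nrow = 2)
all.equal(solve(A %*% B), solve(B) %*% solve(A))   # (AB)^{-1} = B^{-1} A^{-1}

x <- c(1, 2, 3, 4)
sum(x^2) == drop(t(x) %*% x)                       # sum of squares as x^T x
```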

17.3 Deriving the least squares estimator

Recall that a linear model is given in matrix form by $Y = Z\beta + \epsilon$, where
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \quad Z = \begin{pmatrix} 1 & X_{11} & \cdots & X_{(p-1)1} \\ 1 & X_{12} & \cdots & X_{(p-1)2} \\ \vdots & \vdots & & \vdots \\ 1 & X_{1n} & \cdots & X_{(p-1)n} \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{pmatrix}, \quad \epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix},$$

where

  • $Y$ is an $n \times 1$ column vector of observations of the response variable;
  • $Z$ is the $n \times p$ design matrix whose first column is a column of $1$'s, if there is a constant in the model. The other columns are the observations on the explanatory variables $(X_1, X_2, \ldots, X_{p-1})$;
  • $\beta$ is a $p \times 1$ column vector of the unknown parameters;
  • $\epsilon$ is an $n \times 1$ column vector of the random error terms.

The general linear regression model assumes that $E[\epsilon] = 0$ and $\text{Var}(\epsilon) = \sigma^2 I_n$.

Our aim is to estimate the unknown vector of parameters, $\beta$.

Least squares estimate
The least squares estimate of $\beta$ is
$$\hat{\beta} = (Z^TZ)^{-1}Z^Ty.$$

The least squares estimator is the value of $\beta$ that minimises the model deviance $D$. Consider

$$D = \sum_{i=1}^n \left(y_i - (Z\beta)_i\right)^2 = (y - Z\beta)^T(y - Z\beta) = y^Ty - y^TZ\beta - \beta^TZ^Ty + \beta^TZ^TZ\beta = y^Ty - 2y^TZ\beta + \beta^TZ^TZ\beta.$$

Taking the derivative of $D$ with respect to $\beta$ and noticing that $Z^TZ$ is a symmetric matrix, we have that

$$\frac{\partial D}{\partial \beta} = -(2y^TZ)^T + 2Z^TZ\beta = -2Z^Ty + 2Z^TZ\beta.$$

Therefore the least squares estimator $\hat{\beta}$ of $\beta$ will satisfy $Z^TZ\hat{\beta} = Z^Ty$. This system of equations is known as the normal equations for the general linear regression model. To be able to isolate $\hat{\beta}$, it is necessary for $Z^TZ$ to be invertible. Therefore we need $Z$ to be of full rank, that is, $\text{rank}(Z) = p$. If $\text{rank}(Z) = p$, then

$$\hat{\beta} = (Z^TZ)^{-1}Z^Ty.$$

We know that $\hat{\beta}$ minimising $D$ is equivalent to the Hessian of $D$ being positive definite. Since $Z^TZ$ is a symmetric matrix, we have that

$$H = \frac{\partial^2 D}{\partial \beta^2} = (2Z^TZ)^T = 2Z^TZ.$$

If $Z$ has full rank, then $Z^TZ$ is positive definite, and hence $\hat{\beta}$ is the least squares estimator of $\beta$.
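The derivation above translates directly into a few lines of R. The sketch below uses simulated data (the sample size, covariates and true coefficients are arbitrary choices for illustration): it solves the normal equations $Z^TZ\hat{\beta} = Z^Ty$ and compares the result with the coefficients returned by lm().

```r
# Least squares via the normal equations, compared with lm() (simulated data)
set.seed(42)
n  <- 50
x1 <- runif(n)
x2 <- runif(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n, sd = 0.5)   # arbitrary true model

Z <- cbind(1, x1, x2)                        # n x p design matrix (p = 3)
beta_hat <- solve(t(Z) %*% Z, t(Z) %*% y)    # solves Z^T Z beta = Z^T y

cbind(normal_equations = drop(beta_hat),
      lm_fit           = coef(lm(y ~ x1 + x2)))   # the two columns agree
```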


Let $\hat{y} = Z\hat{\beta}$ be the $n \times 1$ vector of fitted values of $y$. Note that
$$\hat{y} = Z\hat{\beta} = Z(Z^TZ)^{-1}Z^Ty.$$

If we set $P = Z(Z^TZ)^{-1}Z^T$, then we can write $\hat{y} = Py$. The matrix $P$ is therefore often referred to as the hat matrix. Note that $P$ is symmetric and idempotent, because $P^T = P$ and $P^2 = P$.

The residuals, $\epsilon$, satisfy
$$\epsilon = y - \hat{y} = I_n y - Py = (I_n - P)y,$$

where $I_n$ is the $n \times n$ identity matrix.

Therefore the sum of the squared residuals is given by
$$\sum_{i=1}^n \epsilon_i^2 = (y - Z\hat{\beta})^T(y - Z\hat{\beta}) = \left((I_n - P)y\right)^T(I_n - P)y = y^T(I_n - P)^T(I_n - P)y.$$

The quantity
$$s^2 = \frac{1}{n - p}\sum_{i=1}^n \epsilon_i^2 = \frac{1}{n - p}(y - Z\hat{\beta})^T(y - Z\hat{\beta})$$

is an unbiased estimator of $\sigma^2$.

Note that to obtain an unbiased estimator of $\sigma^2$, we divide the sum of the squared residuals by $n - p$, that is, the number of observations ($n$) minus the number of parameters ($p$) estimated in $\beta$. This is in line with the divisor $n - 1$ used in estimating the variance of a random variable $X$ from data $x_1, x_2, \ldots, x_n$, where $\mu = E[X]$ (one parameter) is estimated by $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$.
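The quantities introduced above (the hat matrix, fitted values, residuals and $s^2$) can all be checked numerically. The following R sketch again uses simulated data with arbitrarily chosen true coefficients, and compares the "by hand" estimate of $\sigma^2$ with the one reported by lm().

```r
# Hat matrix, residuals and the unbiased estimator of sigma^2 (simulated data)
set.seed(1)
n <- 30
p <- 2
x <- runif(n, 0, 10)
y <- 1 + 0.5 * x + rnorm(n)                  # arbitrary true model

Z <- cbind(1, x)
P <- Z %*% solve(t(Z) %*% Z) %*% t(Z)        # hat matrix
all.equal(P, t(P))                           # P is symmetric
all.equal(P %*% P, P)                        # P is idempotent

resid <- drop((diag(n) - P) %*% y)           # residuals (I_n - P) y
s2    <- sum(resid^2) / (n - p)              # unbiased estimator of sigma^2

fit <- lm(y ~ x)
c(s2 = s2, lm_sigma2 = summary(fit)$sigma^2) # the two values agree
```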

17.4 Examples


Suppose we have two observations such that

$$y_1 = \theta + \epsilon_1, \qquad y_2 = 2\theta + \epsilon_2.$$

Calculate the least squares estimator of θ.

Writing the given linear model in a matrix format, one obtains

$$Z = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \qquad \text{and} \qquad y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}.$$

Then $(Z^TZ)^{-1} = \frac{1}{5}$ and, by applying Theorem 17.3.1 (Least squares estimate),

$$\hat{\theta} = (Z^TZ)^{-1}Z^Ty = \frac{1}{5}(y_1 + 2y_2).$$
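A two-line check of this example in R, with some made-up values for $y_1$ and $y_2$ (the numbers are purely illustrative):

```r
# Example check: least squares estimate of theta with hypothetical data
Z <- matrix(c(1, 2), ncol = 1)
y <- c(3.1, 5.9)                               # made-up observations y1, y2

theta_hat <- solve(t(Z) %*% Z) %*% t(Z) %*% y  # (Z^T Z)^{-1} Z^T y
c(matrix_form = drop(theta_hat),
  by_hand     = (y[1] + 2 * y[2]) / 5)         # (1/5)(y1 + 2 y2)
```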


Simple Linear Regression
Consider the simple regression model, $y_i = a + bx_i + \epsilon_i$, for $i \in \{1, \ldots, n\}$. Then in matrix terms $Y = Z\beta + \epsilon$, where
$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad Z = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \quad \beta = \begin{pmatrix} a \\ b \end{pmatrix}, \quad \epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}.$$

Calculate the least squares estimator of β.

The least squares estimator of $\beta$ is given by,

$$\hat{\beta} = (Z^TZ)^{-1}Z^Ty,$$

where,

$$Z^TZ = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{pmatrix}\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} = \begin{pmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{pmatrix}, \qquad Z^Ty = \begin{pmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n x_iy_i \end{pmatrix}.$$
Therefore, the inverse of $Z^TZ$ is
$$(Z^TZ)^{-1} = \frac{1}{\sum_{i=1}^n (x_i - \bar{x})^2}\begin{pmatrix} \frac{1}{n}\sum_{i=1}^n x_i^2 & -\bar{x} \\ -\bar{x} & 1 \end{pmatrix},$$
and so
$$\begin{aligned}
\hat{\beta} = (Z^TZ)^{-1}Z^Ty &= \frac{1}{\sum_{i=1}^n (x_i - \bar{x})^2}\begin{pmatrix} \frac{1}{n}\sum_{i=1}^n x_i^2 & -\bar{x} \\ -\bar{x} & 1 \end{pmatrix}\begin{pmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n x_iy_i \end{pmatrix} \\
&= \frac{1}{\sum_{i=1}^n (x_i - \bar{x})^2}\begin{pmatrix} \frac{1}{n}\sum_{i=1}^n x_i^2 \sum_{i=1}^n y_i - \bar{x}\sum_{i=1}^n x_iy_i \\ -\bar{x}\sum_{i=1}^n y_i + \sum_{i=1}^n x_iy_i \end{pmatrix} \\
&= \frac{1}{\sum_{i=1}^n (x_i - \bar{x})^2}\begin{pmatrix} \bar{y}\sum_{i=1}^n x_i^2 - \bar{x}\sum_{i=1}^n x_iy_i \\ \sum_{i=1}^n y_i(x_i - \bar{x}) \end{pmatrix} \\
&= \frac{1}{\sum_{i=1}^n (x_i - \bar{x})^2}\begin{pmatrix} \bar{y}\sum_{i=1}^n (x_i - \bar{x})^2 - \bar{x}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) \\ \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) \end{pmatrix} \\
&= \begin{pmatrix} \bar{y} - \bar{x}\dfrac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \\[2ex] \dfrac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \end{pmatrix}.
\end{aligned}$$

So,

$$\begin{pmatrix} \hat{a} \\ \hat{b} \end{pmatrix} = \begin{pmatrix} \bar{y} - \hat{b}\bar{x} \\ s_{xy}/s_x^2 \end{pmatrix}.$$

The least squares estimates agree with the estimates we obtained in Section 16.6.
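As a sanity check, the closed-form expressions $\hat{b} = s_{xy}/s_x^2$ and $\hat{a} = \bar{y} - \hat{b}\bar{x}$ can be compared with lm() on simulated data (the true values $a = 1.5$, $b = 0.8$ below are arbitrary choices):

```r
# Simple linear regression: closed-form estimates versus lm() (simulated data)
set.seed(7)
n <- 25
x <- runif(n, 0, 10)
y <- 1.5 + 0.8 * x + rnorm(n)

b_hat <- cov(x, y) / var(x)            # s_xy / s_x^2
a_hat <- mean(y) - b_hat * mean(x)     # ybar - b_hat * xbar

rbind(closed_form = c(a_hat, b_hat),
      lm_fit      = coef(lm(y ~ x)))   # the two rows agree
```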

17.5 Properties of the estimator of β

In this section we give a collection of results about the properties of the estimator of $\beta$. The properties are given with proofs. It is not important to know the proofs, but it is important to know what the key properties are and to understand why they matter.

Unbiasedness of LSE
$\hat{\beta}$ is an unbiased estimator of $\beta$.

Since $(Z^TZ)^{-1}Z^T$ is a constant,
$$E[\hat{\beta}] = E[(Z^TZ)^{-1}Z^Ty] = (Z^TZ)^{-1}Z^TE[y].$$
Substituting in $Z\beta + \epsilon$ for $y$,
$$E[\hat{\beta}] = (Z^TZ)^{-1}Z^TE[Z\beta + \epsilon] = (Z^TZ)^{-1}Z^T(Z\beta + E[\epsilon]) = (Z^TZ)^{-1}Z^T(Z\beta + 0) = I_p\beta = \beta.$$


Variance of LSE
The variance of $\hat{\beta}$ is given by
$$\text{Var}(\hat{\beta}) = \sigma^2(Z^TZ)^{-1}.$$
Since for a constant matrix $A$ we have that $\text{Var}(AY) = A\text{Var}(Y)A^T$, it follows that
$$\text{Var}(\hat{\beta}) = \text{Var}\left((Z^TZ)^{-1}Z^Ty\right) = (Z^TZ)^{-1}Z^T\text{Var}(y)\left((Z^TZ)^{-1}Z^T\right)^T.$$
Substituting in $Z\beta + \epsilon$ for $y$, and noting that $Z\beta$ is a constant, we have that
$$\text{Var}(\hat{\beta}) = (Z^TZ)^{-1}Z^T\text{Var}(Z\beta + \epsilon)Z(Z^TZ)^{-1} = (Z^TZ)^{-1}Z^T\text{Var}(\epsilon)Z(Z^TZ)^{-1} = (Z^TZ)^{-1}Z^T\sigma^2I_nZ(Z^TZ)^{-1} = \sigma^2(Z^TZ)^{-1}Z^TZ(Z^TZ)^{-1} = \sigma^2(Z^TZ)^{-1}.$$


Note that $\text{Var}(\hat{\beta})$ is the $p \times p$ variance-covariance matrix of the vector $\hat{\beta}$. Specifically, the $i$th diagonal entry is $\text{Var}(\hat{\beta}_i)$ and the $(i,j)$th off-diagonal entry is $\text{Cov}(\hat{\beta}_i, \hat{\beta}_j)$.
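In R, the estimated version of this matrix (with $\sigma^2$ replaced by the unbiased estimate $s^2$) is what vcov() returns for a fitted lm object. A minimal sketch with simulated data (the true coefficients are arbitrary choices):

```r
# Variance-covariance matrix of beta_hat: s^2 (Z^T Z)^{-1} versus vcov()
set.seed(11)
n <- 40
x <- runif(n, 0, 5)
y <- 2 - 1 * x + rnorm(n, sd = 2)

Z   <- cbind(1, x)
fit <- lm(y ~ x)
s2  <- summary(fit)$sigma^2              # unbiased estimate of sigma^2

V_manual <- s2 * solve(t(Z) %*% Z)       # s^2 (Z^T Z)^{-1}
all.equal(unname(V_manual), unname(vcov(fit)))   # TRUE up to rounding
```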

Uncertainty in simple linear regression
Consider the straight line model:
$$y_i = \alpha + \beta x_i + \epsilon_i, \qquad i = 1, 2, \ldots, n,$$

where $\epsilon_i \sim N(0, \sigma^2)$.

Watch Video 25 for a run through of uncertainty in the estimates of the parameters of a simple linear regression model. A summary of the results is presented after the video.

Video 25: Uncertainty in simple linear regression

Then
$$Z = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \qquad \beta = \begin{pmatrix} \alpha \\ \beta \end{pmatrix}.$$
We have shown in Example 17.4.2 (Simple Linear Regression) that
$$(Z^TZ)^{-1} = \frac{1}{\sum_{i=1}^n (x_i - \bar{x})^2}\begin{pmatrix} \frac{1}{n}\sum_{i=1}^n x_i^2 & -\bar{x} \\ -\bar{x} & 1 \end{pmatrix}.$$
Therefore,
$$\text{Var}(\hat{\beta}) = \sigma^2(Z^TZ)^{-1} = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\begin{pmatrix} \frac{1}{n}\sum_{i=1}^n x_i^2 & -\bar{x} \\ -\bar{x} & 1 \end{pmatrix},$$
and so,
$$\text{Var}(\hat{\alpha}) = \frac{\sigma^2\sum_{i=1}^n x_i^2}{n\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad \text{Var}(\hat{\beta}) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad \text{Cov}(\hat{\alpha}, \hat{\beta}) = \frac{-\sigma^2\bar{x}}{\sum_{i=1}^n (x_i - \bar{x})^2}.$$

The variance of $\hat{\beta}$ does not depend on the values of $\alpha$ and $\beta$, but only on $\sigma^2$ (the variance of $\epsilon$) and the design matrix $Z$. This tells us that if we have some input into choosing the $x_i$ (constructing the design matrix), then we can construct the design matrix to reduce the variance of the estimator. In particular, the larger $\sum_{i=1}^n (x_i - \bar{x})^2$ (the variability in the $x_i$'s), the smaller the variance of the estimates. Note that there will often be scientific and practical reasons for choosing the $x_i$ within a given range.

Observe that the covariance between $\hat{\alpha}$ and $\hat{\beta}$ has the opposite sign to $\bar{x}$ and becomes larger in magnitude as $|\bar{x}|$ increases. The correlation between $\hat{\alpha}$ and $\hat{\beta}$ is
$$\text{Cor}(\hat{\alpha}, \hat{\beta}) = \frac{\text{Cov}(\hat{\alpha}, \hat{\beta})}{\sqrt{\text{Var}(\hat{\alpha})\text{Var}(\hat{\beta})}} = \frac{-\sqrt{n}\,\bar{x}}{\sqrt{\sum_{i=1}^n x_i^2}}.$$

The correlation between the estimates is larger, in absolute value, the larger $n\bar{x}^2$ is relative to $\sum_{i=1}^n x_i^2$.

To illustrate the variability in $\hat{\beta}$ we use an example. Suppose that we have ten observations from the model:
$$y = 2 + 0.6x + \epsilon,$$

where $\epsilon \sim N(0, 1)$ and, for $i = 1, 2, \ldots, 10$, $x_i = i$.

Then
$$\text{Var}(\hat{\beta}) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\begin{pmatrix} \frac{1}{n}\sum_{i=1}^n x_i^2 & -\bar{x} \\ -\bar{x} & 1 \end{pmatrix} = \frac{1}{82.5}\begin{pmatrix} 38.5 & -5.5 \\ -5.5 & 1 \end{pmatrix} = \begin{pmatrix} 0.46667 & -0.06667 \\ -0.06667 & 0.01212 \end{pmatrix}.$$

We simulated 100 sets of data from the model and for each set of data calculated $\hat{\alpha}$ and $\hat{\beta}$. In Figure 17.1, the estimates of $\hat{\beta}$ are plotted against $\hat{\alpha}$, along with the true parameter values $\beta = 0.6$ and $\alpha = 2$. The estimates show negative correlation.


Figure 17.1: Plot of estimates of straight line model parameters with true parameter values denoted by a red dot

In Figure 17.2, the fitted line $\hat{\alpha} + \hat{\beta}x$ is plotted for each simulated data set along with the true line $2 + 0.6x$. Observe that the lines with the highest intercepts tend to have the smallest slopes and vice versa. Also note that there is more variability in the estimated lines at the end points ($x = 1$ and $x = 10$) than in the middle of the range (close to $x = 5.5$).


Figure 17.2: Estimated lines from 100 simulations with true line in red
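A simulation along these lines is easy to reproduce in R. The sketch below will not recreate Figures 17.1 and 17.2 exactly, since the random seed used for the notes is unknown, but the empirical means, covariance and correlation of the estimates should be close to the theoretical values given above.

```r
# 100 simulated data sets from y = 2 + 0.6 x + eps, eps ~ N(0, 1), x = 1, ..., 10
set.seed(123)                      # arbitrary seed; the notes' seed is unknown
x   <- 1:10
est <- t(replicate(100, coef(lm(2 + 0.6 * x + rnorm(10) ~ x))))

colMeans(est)      # close to the true values alpha = 2, beta = 0.6
cov(est)           # compare with the theoretical matrix above
cor(est)[1, 2]     # close to -sqrt(n) * xbar / sqrt(sum(x^2)) = -0.886

# Scatter plot of the estimates, as in Figure 17.1
plot(est, xlab = expression(hat(alpha)), ylab = expression(hat(beta)))
points(2, 0.6, col = "red", pch = 19)   # true parameter values
```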

Distribution of LSE
If additionally $\epsilon \sim N_n(0, \sigma^2I_n)$, then $\hat{\beta} \sim N_p(\beta, \sigma^2(Z^TZ)^{-1})$.

Note,

$$\hat{\beta} = (Z^TZ)^{-1}Z^Ty = (Z^TZ)^{-1}Z^T(Z\beta + \epsilon) = (Z^TZ)^{-1}Z^TZ\beta + (Z^TZ)^{-1}Z^T\epsilon = \beta + (Z^TZ)^{-1}Z^T\epsilon.$$

Hence $\hat{\beta}$ is a linear function of a normally distributed random variable. Using the identities $E[Ax + b] = AE[x] + b$ and $\text{Var}(Ax + b) = A\text{Var}(x)A^T$, it follows that $\hat{\beta}$ has a normal distribution with mean and variance as required.


Note that since $\hat{\beta} \sim N_p(\beta, \sigma^2(Z^TZ)^{-1})$, each of the individual parameters has the distribution
$$\hat{\beta}_i \sim N\left(\beta_i, \sigma^2\left((Z^TZ)^{-1}\right)_{ii}\right).$$

However, the individual $\hat{\beta}_i$ are not independent, as we saw in Example 17.5.3 (Uncertainty in simple linear regression).

17.6 Gauss-Markov Theorem

The Gauss-Markov Theorem shows that a good choice of estimator for $a^T\beta$, a linear combination of the parameters, is $a^T\hat{\beta}$.

Gauss-Markov Theorem
If $\hat{\beta}$ is the least squares estimator of $\beta$, then $a^T\hat{\beta}$ is the unique linear unbiased estimator of $a^T\beta$ with minimum variance.

The details of the proof of Theorem 17.6.1 (Gauss-Markov Theorem) are provided but can be omitted.

Proof of Gauss-Markov Theorem.

Consider $\hat{\beta} = (Z^TZ)^{-1}Z^Ty$, the LSE of $\beta$. Hence,

$$a^T\hat{\beta} = a^T(Z^TZ)^{-1}Z^Ty = Cy,$$

where $C = a^T(Z^TZ)^{-1}Z^T$. It follows that $a^T\hat{\beta}$ is a linear function of $y$.

Note that $a^T\hat{\beta}$ is an unbiased estimator of $a^T\beta$ because,
$$E[a^T\hat{\beta}] = E[Cy] = CE[Z\beta + \epsilon] = CZ\beta + CE[\epsilon] = a^T(Z^TZ)^{-1}Z^TZ\beta + 0 = a^T\beta.$$
Suppose there exists another linear unbiased estimator of $a^T\beta$, denoted $b^Ty$. By definition,
$$E[b^Ty] = a^T\beta.$$
However, we can also calculate
$$E[b^Ty] = b^TE[Z\beta + \epsilon] = b^TZ\beta.$$
It follows that
$$b^TZ\beta = a^T\beta \quad \text{for all } \beta,$$
so
$$a^T = b^TZ.$$
Now
$$\text{Var}(b^Ty) = b^T\text{Var}(Z\beta + \epsilon)b = b^T\text{Var}(\epsilon)b = b^T\sigma^2I_nb = \sigma^2b^Tb,$$
and
$$\text{Var}(a^T\hat{\beta}) = \text{Var}(Cy) = C\text{Var}(Z\beta + \epsilon)C^T = C\text{Var}(\epsilon)C^T = C\sigma^2I_nC^T = \sigma^2CC^T = \sigma^2\left(a^T(Z^TZ)^{-1}Z^T\right)\left(a^T(Z^TZ)^{-1}Z^T\right)^T = \sigma^2a^T(Z^TZ)^{-1}Z^TZ(Z^TZ)^{-1}a = \sigma^2a^T(Z^TZ)^{-1}a.$$

Substituting $a^T = b^TZ$, we can rewrite

$$\text{Var}(a^T\hat{\beta}) = \sigma^2b^TZ(Z^TZ)^{-1}(b^TZ)^T = \sigma^2b^TZ(Z^TZ)^{-1}Z^Tb = \sigma^2b^TPb.$$
Comparing $\text{Var}(b^Ty)$ and $\text{Var}(a^T\hat{\beta})$, we get
$$\text{Var}(b^Ty) - \text{Var}(a^T\hat{\beta}) = \sigma^2b^Tb - \sigma^2b^TPb = \sigma^2b^T(I_n - P)b = \sigma^2b^T(I_n - P)^2b = \sigma^2b^T(I_n - P)^T(I_n - P)b = \sigma^2D^TD,$$
where $D = (I_n - P)b$. Therefore,
$$\text{Var}(b^Ty) - \text{Var}(a^T\hat{\beta}) = \sigma^2D^TD \geq 0,$$

so $a^T\hat{\beta}$ has the smallest variance among all linear unbiased estimators of $a^T\beta$.

Finally, suppose that $b^Ty$ is another linear unbiased estimator such that $\text{Var}(b^Ty) = \text{Var}(a^T\hat{\beta})$. Then
$$\text{Var}(b^Ty) - \text{Var}(a^T\hat{\beta}) = \sigma^2D^TD = 0 \implies D = 0.$$
Since $D = (I_n - P)b = 0$, it follows that
$$b = Pb = Z(Z^TZ)^{-1}Z^Tb = Z(Z^TZ)^{-1}a \implies b^T = a^T(Z^TZ)^{-1}Z^T,$$
so
$$b^Ty = a^T(Z^TZ)^{-1}Z^Ty = a^T\hat{\beta}.$$

Therefore $a^T\hat{\beta}$ is the unique linear unbiased estimator of $a^T\beta$ with minimum variance.


Best linear unbiased estimator (BLUE)
If $a^T = (0, 0, \ldots, 1, 0, \ldots, 0)$, where the 1 is in the $i$th position, then $\hat{\beta}_i$ is the best linear unbiased estimator, shorthand BLUE, of $\beta_i$.
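To see the Gauss-Markov Theorem in action, we can compare the least squares slope with another linear unbiased estimator of $\beta$. The "endpoint" estimator $(y_n - y_1)/(x_n - x_1)$ used below is not from the notes; it is simply a convenient linear unbiased competitor, and a simulation (using the straight line model from Section 17.5) shows that it has a larger variance than the least squares estimator.

```r
# Gauss-Markov illustration: LS slope versus the endpoint slope estimator
set.seed(99)
x <- 1:10
sims <- replicate(10000, {
  y <- 2 + 0.6 * x + rnorm(10)
  c(least_squares = unname(coef(lm(y ~ x))[2]),
    endpoints     = (y[10] - y[1]) / (x[10] - x[1]))
})

rowMeans(sims)        # both close to 0.6, so both are unbiased
apply(sims, 1, var)   # LS variance ~ 1/82.5 = 0.0121 < 2/81 = 0.0247
```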

The following R Shiny app generates data and fits a regression line, $y = \alpha + \beta x$. It allows you to vary the coefficients and how the covariates $x$ are generated. Predicted values can also be plotted with confidence intervals; see Section 18, Interval Estimation, for an introduction to confidence intervals and Session 12: Linear Models II for a discussion of confidence intervals for predicted values.

R Shiny app: Linear Model

Task: Session 9

Attempt the R Markdown file for Session 9:
Session 9: Linear Models I

Student Exercises

Attempt the exercises below.


Suppose that a model states that
$$E[Y_1] = \theta, \qquad E[Y_2] = 2\theta - \phi, \qquad E[Y_3] = \theta + 2\phi.$$

Find the least squares estimates of θ and ϕ.

Solution to Exercise 17.1.
Using first principles, minimise
$$D = \sum_{i=1}^3 (y_i - E[y_i])^2 = (y_1 - \theta)^2 + (y_2 - 2\theta + \phi)^2 + (y_3 - \theta - 2\phi)^2$$
by finding solutions to the normal equations,
$$\frac{\partial D}{\partial \theta} = 0 = -2(y_1 - \theta) - 4(y_2 - 2\theta + \phi) - 2(y_3 - \theta - 2\phi), \qquad \frac{\partial D}{\partial \phi} = 0 = 2(y_2 - 2\theta + \phi) - 4(y_3 - \theta - 2\phi).$$
Solving for $\theta, \phi$ gives
$$\hat{\theta} = \frac{1}{6}(y_1 + 2y_2 + y_3), \qquad \hat{\phi} = \frac{1}{5}(2y_3 - y_2).$$
Note that, more generally, if we have a linear model with $p$ parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_p)$ and $n$ observations $y_1, y_2, \ldots, y_n$ with $E[y_i] = \sum_{j=1}^p a_{ij}\theta_j$, and the columns of the design matrix are orthogonal, that is $\sum_{i=1}^n a_{ij}a_{ik} = 0$ for $j \neq k$ (as is the case in this exercise), then the least squares estimate of $\theta_j$ ($j = 1, 2, \ldots, p$) is
$$\hat{\theta}_j = \frac{1}{\sum_{i=1}^n a_{ij}^2}\sum_{i=1}^n a_{ij}y_i.$$
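A quick numerical check of this exercise in R, with made-up values for $y_1, y_2, y_3$ (any values will do):

```r
# Exercise 17.1: least squares estimates of theta and phi (hypothetical data)
A <- rbind(c(1,  0),
           c(2, -1),
           c(1,  2))                      # E[Y] = A %*% c(theta, phi)
y <- c(1.2, 1.9, 3.4)                     # made-up observations

solve(t(A) %*% A, t(A) %*% y)             # least squares estimates
c(theta = (y[1] + 2 * y[2] + y[3]) / 6,   # matches (y1 + 2 y2 + y3)/6
  phi   = (2 * y[3] - y[2]) / 5)          # matches (2 y3 - y2)/5
```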



The marks of 8 students are presented below. For student $i$, $x_i$ denotes their mark in a mid-term test and $y_i$ denotes their mark in the final exam.

Mid-term test marks:
$$x = (x_1, x_2, \ldots, x_8) = (75, 68, 60, 58, 70, 67, 64, 65).$$
Final exam marks:
$$y = (y_1, y_2, \ldots, y_8) = (62, 54, 55, 43, 59, 59, 56, 50).$$
Note that:
$$\bar{x} = \frac{1}{8}\sum_{i=1}^8 x_i = 65.875, \qquad \bar{y} = \frac{1}{8}\sum_{i=1}^8 y_i = 54.75, \qquad s_{xy} = \frac{1}{8-1}\left\{\sum_{i=1}^8 x_iy_i - n\bar{x}\bar{y}\right\} = 25.679, \qquad s_x^2 = \frac{1}{8-1}\left\{\sum_{i=1}^8 x_i^2 - n\bar{x}^2\right\} = 29.554, \qquad s_y^2 = \frac{1}{8-1}\left\{\sum_{i=1}^8 y_i^2 - n\bar{y}^2\right\} = 35.929.$$
  1. Calculate the correlation between the mid-term test marks and the final exam marks.
  2. Fit a straight line linear model $y = \alpha + \beta x + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$, with the final exam mark, $y$, as the response and the mid-term test mark, $x$, as the predictor variable. Include estimation of $\sigma^2$ in the model fit.
  3. Find the expected final exam mark of a student who scores 79 in the mid-term test.
Solution to Exercise 17.2.
  1. The correlation between $x$ and $y$ is given by
    $$\frac{s_{xy}}{s_x \times s_y} = \frac{25.679}{\sqrt{29.554 \times 35.929}} = 0.788.$$
  2. The coefficients of the linear regression model are:
    $$\hat{\beta} = \frac{s_{xy}}{s_x^2} = \frac{25.679}{29.554} = 0.869, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} = 54.75 - 0.869 \times 65.875 = -2.488.$$
Since we have $p = 2$ parameters ($\alpha$ and $\beta$), an unbiased estimate of $\sigma^2$ is:
$$s^2 = \frac{1}{n - p}\sum_{i=1}^n \left(y_i - [\hat{\alpha} + \hat{\beta}x_i]\right)^2 = 15.886.$$
Thus the fitted model is:
$$y = -2.488 + 0.869x + \epsilon, \qquad \epsilon \sim N(0, 15.886).$$
  3. The expected final exam mark of a student who scores 79 in the mid-term test is:
    $$y = -2.488 + 0.869 \times 79 = 66.163.$$
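The whole exercise can be verified in R with lm(), using the marks given in the question:

```r
# Exercise 17.2 checked with lm()
x <- c(75, 68, 60, 58, 70, 67, 64, 65)     # mid-term test marks
y <- c(62, 54, 55, 43, 59, 59, 56, 50)     # final exam marks

cor(x, y)                                  # about 0.788
fit <- lm(y ~ x)
coef(fit)                                  # intercept about -2.488, slope about 0.869
summary(fit)$sigma^2                       # about 15.886
predict(fit, newdata = data.frame(x = 79)) # about 66.2
```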