Chapter 17 Least Squares Estimation for Linear Models
17.1 Introduction
In Section 16 we introduced linear models with particular emphasis on Normal linear models. We derived the least squares estimates of the model parameters for the straight line model $y = \alpha + \beta x + \epsilon$, and showed that if $\epsilon \sim N(0,\sigma^2)$ then the least squares estimates coincide with the maximum likelihood estimates of the parameters. In this section we consider the mathematics behind least squares estimation for general linear models. This relies heavily on linear algebra (matrix manipulation) and we give a review of key linear algebra results in Section 17.2. The main message is that we can concisely express key quantities such as the least squares parameter estimates, $\hat{\beta}$, the fitted values, $\hat{y}$, and the residuals, $\hat{\epsilon}$, as functions of matrices.
17.2 Linear algebra review
Rank of a matrix
Let $M$ be any $n \times m$ matrix. Then the rank of $M$ is the maximum number of linearly independent column vectors of $M$.
Transpose of a matrix
If $M = (m_{ij})$, then $M^T = (m_{ji})$ is said to be the transpose of the matrix $M$.
Properties of square matrices
Suppose $A$ is a square $n \times n$ matrix, then
- $A$ is symmetric if and only if $A^T = A$;
- $A^{-1}$ is the inverse of $A$ if and only if $AA^{-1} = A^{-1}A = I_n$;
- The matrix $A$ is nonsingular if and only if $\operatorname{rank}(A) = n$;
- $A$ is orthogonal if and only if $A^{-1} = A^T$;
- $A$ is idempotent if and only if $A^2 = AA = A$;
- $A$ is positive definite if and only if $x^T A x > 0$ for all non-zero vectors $x$.
Note the following two important results:
- $A$ has an inverse if and only if $A$ is nonsingular, that is, the rows and columns are linearly independent;
- $A^T A$ is positive definite if $A$ has an inverse.
The following computational results are also useful:
- Let $N$ be an $n \times p$ matrix and $P$ be a $p \times n$ matrix, then $(NP)^T = P^T N^T$;
- Suppose $A$ and $B$ are two invertible $n \times n$ matrices, then $(AB)^{-1} = B^{-1} A^{-1}$;
- We can write the sum of squares $\sum_{i=1}^n x_i^2 = x^T x$, where $x^T = [x_1, x_2, \ldots, x_n]$ is a $1 \times n$ row vector.
Then the following results hold in the calculus of matrices:
- $\frac{d}{dx}(Ax) = A^T$, where $A$ is a matrix of constants;
- $\frac{d}{dx}(x^T A x) = (A + A^T)x = 2Ax$ whenever $A$ is symmetric;
- If $f(x)$ is a function of several variables, the necessary condition to maximise or minimise $f(x)$ is $\frac{\partial f(x)}{\partial x} = 0$;
- Let $H = \frac{\partial^2 f(x)}{\partial x \, \partial x^T}$ be the Hessian of $f$, that is, the matrix of second derivatives. Then a maximum will occur if $H$ is negative definite, and a minimum will occur if $H$ is positive definite.
Let $A$ be a matrix of constants and $Y$ be a random vector, then we have the following expectation and variance results:
- $E[AY] = A\,E[Y]$;
- $\operatorname{Var}(AY) = A \operatorname{Var}(Y) A^T$.
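As a quick numerical check of the matrix results above, the following R sketch (using small arbitrary matrices of my own choosing, not taken from the text) verifies the transpose, inverse and sum-of-squares identities.

```r
# Numerical check of the matrix identities above (arbitrary example values).
N <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)   # a 2 x 3 matrix
P <- matrix(c(1, 0, 2, 1, 1, 3), nrow = 3)   # a 3 x 2 matrix
all.equal(t(N %*% P), t(P) %*% t(N))         # (NP)^T = P^T N^T

A <- matrix(c(2, 1, 1, 3), nrow = 2)         # two invertible 2 x 2 matrices
B <- matrix(c(1, 0, 1, 2), nrow = 2)
all.equal(solve(A %*% B), solve(B) %*% solve(A))  # (AB)^{-1} = B^{-1} A^{-1}

x <- c(1, 4, 2)
all.equal(sum(x^2), drop(t(x) %*% x))        # sum of squares as x^T x
```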
17.3 Deriving the least squares estimator
Recall that a linear model is given in matrix form by $Y = Z\beta + \epsilon$, where
- $Y$ is an $n \times 1$ column vector of observations of the response variable;
- $Z$ is the $n \times p$ design matrix whose first column is a column of $1$'s, if there is a constant in the model. The other columns are the observations on the explanatory variables $(X_1, X_2, \ldots, X_{p-1})$;
- $\beta$ is a $p \times 1$ column vector of the unknown parameters;
- $\epsilon$ is an $n \times 1$ column vector of the random error terms.
The general linear regression model assumes that $E[\epsilon] = 0$ and $\operatorname{Var}(\epsilon) = \sigma^2 I_n$.
Our aim is to estimate the unknown vector of parameters, $\beta$.
The least squares estimate of $\beta$ is
$$\hat{\beta} = (Z^T Z)^{-1} Z^T y,$$
provided $Z^T Z$ is invertible.
The least squares estimator is the value of $\beta$ that minimises the model deviance $D$. Consider
$$D = \sum_{i=1}^n \epsilon_i^2 = \epsilon^T \epsilon = (y - Z\beta)^T (y - Z\beta) = y^T y - 2\beta^T Z^T y + \beta^T Z^T Z \beta.$$
Taking the derivative of $D$ with respect to $\beta$, and noting that $Z^T Z$ is a symmetric matrix, we have that
$$\frac{\partial D}{\partial \beta} = -2 Z^T y + 2 Z^T Z \beta.$$
Therefore the least squares estimator $\hat{\beta}$ of $\beta$ will satisfy $Z^T Z \hat{\beta} = Z^T y$. This system of equations is known as the normal equations for the general linear regression model. To be able to isolate $\hat{\beta}$ it is necessary for $Z^T Z$ to be invertible. Therefore we need $Z$ to be of full rank, that is, $\operatorname{rank}(Z) = p$. If $\operatorname{rank}(Z) = p$, then
$$\hat{\beta} = (Z^T Z)^{-1} Z^T y.$$
We know that $\hat{\beta}$ minimising $D$ is equivalent to the Hessian of $D$ being positive definite. If $Z$ has full rank, then since $Z^T Z$ is a symmetric matrix, we have that
$$H = \frac{\partial^2 D}{\partial \beta \, \partial \beta^T} = 2 Z^T Z.$$
We know $Z^T Z$ is positive definite and hence $\hat{\beta}$ is the least squares estimator of $\beta$.
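To make the formula concrete, here is a minimal R sketch (with simulated data and variable names of my own) that computes $\hat{\beta} = (Z^T Z)^{-1} Z^T y$ directly from the matrices and compares it with the coefficients returned by lm().

```r
# Minimal sketch: least squares estimates via the matrix formula (simulated data).
set.seed(1)
n  <- 50
x1 <- runif(n)
x2 <- runif(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.3)   # true beta = (1, 2, -0.5)

Z <- cbind(1, x1, x2)                      # design matrix with a column of 1's
beta_hat <- solve(t(Z) %*% Z, t(Z) %*% y)  # (Z^T Z)^{-1} Z^T y
drop(beta_hat)
coef(lm(y ~ x1 + x2))                      # should agree with beta_hat
```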
Let $\hat{y} = Z\hat{\beta}$ be the $n \times 1$ vector of fitted values of $y$. Note that
$$\hat{y} = Z\hat{\beta} = Z (Z^T Z)^{-1} Z^T y.$$
If we set $P = Z (Z^T Z)^{-1} Z^T$, then we can write $\hat{y} = Py$. The matrix $P$ is therefore often referred to as the hat matrix. Note that $P$ is symmetric and idempotent, since $P^T = P$ and $P^2 = P$.
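The following short R sketch (again with simulated data of my own) checks numerically that the hat matrix is symmetric and idempotent, and that $Py$ reproduces the fitted values from lm().

```r
# Sketch: properties of the hat matrix P = Z (Z^T Z)^{-1} Z^T (simulated data).
set.seed(2)
n <- 20
x <- runif(n)
y <- 2 + 0.6 * x + rnorm(n)
Z <- cbind(1, x)

P <- Z %*% solve(t(Z) %*% Z) %*% t(Z)
all.equal(P, t(P))                                    # symmetric: P^T = P
all.equal(P, P %*% P)                                 # idempotent: P^2 = P
all.equal(drop(P %*% y), unname(fitted(lm(y ~ x))))   # P y gives the fitted values
```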
The residuals, $\hat{\epsilon}$, satisfy
$$\hat{\epsilon} = y - \hat{y} = y - Py = (I_n - P) y,$$
where $I_n$ is the $n \times n$ identity matrix.
Therefore the sum of the squares of the residuals is given by
$$\hat{\epsilon}^T \hat{\epsilon} = \left[(I_n - P)y\right]^T (I_n - P) y = y^T (I_n - P) y,$$
since $I_n - P$ is symmetric and idempotent. The quantity
$$\hat{\sigma}^2 = \frac{\hat{\epsilon}^T \hat{\epsilon}}{n - p}$$
is an unbiased estimator of $\sigma^2$.
Note that to obtain an unbiased estimator of $\sigma^2$, we divide the sum of the squares of the residuals by $n - p$, that is, the number of observations ($n$) minus the number of parameters ($p$) estimated in $\beta$. This is in line with the divisor $n - 1$ used when estimating the variance of a random variable $X$ from data $x_1, x_2, \ldots, x_n$, where $\mu = E[X]$ (one parameter) is estimated by $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$.
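The divisor $n - p$ is exactly what R uses for the residual standard error. The sketch below (simulated data, my own variable names) computes the residual sum of squares divided by $n - p$ from the matrix formulas and compares it with `summary(fit)$sigma^2`.

```r
# Sketch: unbiased estimate of sigma^2 as RSS / (n - p) (simulated data).
set.seed(3)
n <- 40
x <- runif(n)
y <- 1 + 3 * x + rnorm(n, sd = 2)
Z <- cbind(1, x)
p <- ncol(Z)

beta_hat   <- solve(t(Z) %*% Z, t(Z) %*% y)
resids     <- y - drop(Z %*% beta_hat)
sigma2_hat <- sum(resids^2) / (n - p)

fit <- lm(y ~ x)
c(matrix_formula = sigma2_hat, summary_lm = summary(fit)$sigma^2)   # should agree
```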
17.4 Examples
Suppose we have two observations such that
Calculate the least squares estimator of θ.
Writing the given linear model in a matrix format, one obtains
Then $(Z^T Z)^{-1} = \frac{1}{5}$ and, by applying Theorem 17.3.1 (Least squares estimate):
Consider the simple regression model, $y_i = a + b x_i + \epsilon_i$, for $i \in \{1, \ldots, n\}$. Then in matrix terms $Y = Z\beta + \epsilon$, where
$$Z = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \qquad \beta = \begin{pmatrix} a \\ b \end{pmatrix}.$$
Calculate the least squares estimator of β.
The least squares estimators of $\beta$ will be given by
$$\hat{\beta} = (Z^T Z)^{-1} Z^T y,$$
where
$$Z^T Z = \begin{pmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{pmatrix}, \qquad Z^T y = \begin{pmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n x_i y_i \end{pmatrix}.$$
So
$$(Z^T Z)^{-1} = \frac{1}{n \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2} \begin{pmatrix} \sum_{i=1}^n x_i^2 & -\sum_{i=1}^n x_i \\ -\sum_{i=1}^n x_i & n \end{pmatrix},$$
and hence
$$\hat{b} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{s_{xy}}{s_x^2}, \qquad \hat{a} = \bar{y} - \hat{b} \bar{x}.$$
The least squares estimates agree with the estimates we obtained in Section 16.6.
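As a check (a sketch with simulated data; the data and names are mine), the summary-statistic formulas above, the matrix formula and lm() all give the same estimates.

```r
# Sketch: simple regression estimates via summary statistics, matrix algebra and lm().
set.seed(4)
n <- 30
x <- runif(n, 0, 10)
y <- 2 + 0.6 * x + rnorm(n)

b_hat <- cov(x, y) / var(x)          # s_xy / s_x^2
a_hat <- mean(y) - b_hat * mean(x)   # ybar - b_hat * xbar

Z <- cbind(1, x)
beta_matrix <- unname(drop(solve(t(Z) %*% Z, t(Z) %*% y)))

rbind(summary_stats  = c(a_hat, b_hat),
      matrix_formula = beta_matrix,
      lm_fit         = unname(coef(lm(y ~ x))))   # all three rows should agree
```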
17.5 Properties of the estimator of β
In this section we give a collection of results about the properties of the estimator of $\beta$. The properties are given with proofs. It is not important to know the proofs, but it is important to know what the key properties are and to have an understanding of why they are important.
Unbiasedness of LSE
$\hat{\beta}$ is an unbiased estimator of $\beta$, since
$$E[\hat{\beta}] = E\left[(Z^T Z)^{-1} Z^T Y\right] = (Z^T Z)^{-1} Z^T E[Y] = (Z^T Z)^{-1} Z^T Z \beta = \beta.$$
The variance of $\hat{\beta}$ is given by
$$\operatorname{Var}(\hat{\beta}) = \operatorname{Var}\left((Z^T Z)^{-1} Z^T Y\right) = (Z^T Z)^{-1} Z^T \operatorname{Var}(Y) Z (Z^T Z)^{-1} = \sigma^2 (Z^T Z)^{-1}.$$
Note that $\operatorname{Var}(\hat{\beta})$ is the $p \times p$ variance-covariance matrix of the vector $\hat{\beta}$. Specifically, the $i$th diagonal entry is $\operatorname{Var}(\hat{\beta}_i)$ and the $(i,j)$th entry is $\operatorname{Cov}(\hat{\beta}_i, \hat{\beta}_j)$.
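The estimated version of this matrix, $\hat{\sigma}^2 (Z^T Z)^{-1}$, is what R reports via vcov(). The sketch below (simulated data, my own choices) builds it by hand and compares it with vcov() from a fitted lm object.

```r
# Sketch: estimated variance-covariance matrix of beta_hat (simulated data).
set.seed(5)
n <- 50
x <- runif(n, 0, 10)
y <- 2 + 0.6 * x + rnorm(n)
Z <- cbind(1, x)
p <- ncol(Z)

beta_hat   <- solve(t(Z) %*% Z, t(Z) %*% y)
sigma2_hat <- sum((y - drop(Z %*% beta_hat))^2) / (n - p)
V_hat      <- sigma2_hat * solve(t(Z) %*% Z)   # sigma_hat^2 (Z^T Z)^{-1}

V_hat
vcov(lm(y ~ x))   # should match: diagonal = variances, off-diagonal = covariances
```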
Consider the straight line model:
$$y_i = \alpha + \beta x_i + \epsilon_i, \qquad i = 1, 2, \ldots, n,$$
where $\epsilon_i \sim N(0, \sigma^2)$.
Watch Video 25 for a run-through of uncertainty in the estimates of the parameters of a simple linear regression model. A summary of the results is presented after the video.
Video 25: Uncertainty in simple linear regression
For the straight line model the least squares estimators satisfy
$$\operatorname{Var}(\hat{\alpha}) = \frac{\sigma^2 \sum_{i=1}^n x_i^2}{n \sum_{i=1}^n (x_i - \bar{x})^2}, \qquad \operatorname{Var}(\hat{\beta}) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad \operatorname{Cov}(\hat{\alpha}, \hat{\beta}) = \frac{-\sigma^2 \bar{x}}{\sum_{i=1}^n (x_i - \bar{x})^2}.$$
The variance of $\hat{\beta}$ does not depend on the values of $\alpha$ and $\beta$, but on $\sigma^2$ (the variance of $\epsilon$) and the design matrix $Z$. This tells us that if we have input in choosing the $x_i$ (constructing the design matrix), then we can construct the design matrix to reduce the variance of the estimator. In particular, the larger $\sum_{i=1}^n (x_i - \bar{x})^2$ (the variability in the $x_i$s), the smaller the variance of the estimates. Note that there will often be scientific and practical reasons for choosing the $x_i$ within a given range.
Observe that the covariance between $\hat{\alpha}$ and $\hat{\beta}$ has the opposite sign to $\bar{x}$ and becomes larger in magnitude as $|\bar{x}|$ increases. The correlation between $\hat{\alpha}$ and $\hat{\beta}$ is
$$\operatorname{Corr}(\hat{\alpha}, \hat{\beta}) = \frac{\operatorname{Cov}(\hat{\alpha}, \hat{\beta})}{\sqrt{\operatorname{Var}(\hat{\alpha}) \operatorname{Var}(\hat{\beta})}} = \frac{-\bar{x}}{\sqrt{\frac{1}{n} \sum_{i=1}^n x_i^2}}.$$
The correlation in the estimates is larger, in absolute value, the larger $\bar{x}^2$ is relative to $\frac{1}{n} \sum_{i=1}^n x_i^2$.
To illustrate the variability in $\hat{\beta}$ we use an example. Suppose that we have ten observations from the model:
$$y_i = 2 + 0.6 x_i + \epsilon_i,$$
where $\epsilon_i \sim N(0, 1)$ and, for $i = 1, 2, \ldots, 10$, $x_i = i$.
Then
$$\operatorname{Var}\begin{pmatrix} \hat{\alpha} \\ \hat{\beta} \end{pmatrix} = \sigma^2 (Z^T Z)^{-1} = \frac{1}{825}\begin{pmatrix} 385 & -55 \\ -55 & 10 \end{pmatrix} \approx \begin{pmatrix} 0.467 & -0.067 \\ -0.067 & 0.012 \end{pmatrix}.$$
We simulated 100 sets of data from the model and for each set of data calculated $\hat{\alpha}$ and $\hat{\beta}$. In Figure 17.1, the estimates of $\hat{\beta}$ are plotted against $\hat{\alpha}$, along with the true parameter values $\beta = 0.6$ and $\alpha = 2$. The estimates show negative correlation.

Figure 17.1: Plot of estimates of straight line model parameters with true parameter values denoted by a red dot
In Figure 17.2, the fitted line $\hat{\alpha} + \hat{\beta}x$ is plotted for each simulated data set along with the true line $2 + 0.6x$. Observe that the lines with the highest intercepts tend to have the smallest slopes, and vice versa. Also note that there is more variability in the estimated lines at the end points ($x = 1$ and $x = 10$) than in the middle of the range (close to $x = 5.5$).

Figure 17.2: Estimated lines from 100 simulations with true line in red
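A simulation of this kind can be reproduced with a few lines of R. The sketch below (my own code and plotting choices, not the original) simulates 100 data sets from $y_i = 2 + 0.6 x_i + \epsilon_i$ with $x_i = i$ and plots the estimates and the fitted lines.

```r
# Sketch: repeat the simulation behind Figures 17.1 and 17.2.
set.seed(6)
x    <- 1:10
nsim <- 100
est  <- t(replicate(nsim, {
  y <- 2 + 0.6 * x + rnorm(length(x))   # sigma^2 = 1
  coef(lm(y ~ x))                       # (alpha_hat, beta_hat)
}))

# Figure 17.1 analogue: estimates of beta against alpha, true values in red.
plot(est[, 1], est[, 2], xlab = "alpha_hat", ylab = "beta_hat")
points(2, 0.6, col = "red", pch = 19)
cor(est[, 1], est[, 2])   # close to the theoretical value of about -0.89

# Figure 17.2 analogue: fitted lines with the true line in red.
plot(NULL, xlim = c(1, 10), ylim = c(0, 12), xlab = "x", ylab = "y")
for (i in 1:nsim) abline(est[i, 1], est[i, 2], col = "grey")
abline(2, 0.6, col = "red", lwd = 2)
```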
Distribution of LSE
If additionally $\epsilon \sim N_n(0, \sigma^2 I_n)$, then
$$\hat{\beta} \sim N_p\left(\beta, \sigma^2 (Z^T Z)^{-1}\right).$$
Note,
$$\hat{\beta} = (Z^T Z)^{-1} Z^T Y = (Z^T Z)^{-1} Z^T (Z\beta + \epsilon) = \beta + (Z^T Z)^{-1} Z^T \epsilon.$$
Hence $\hat{\beta}$ is a linear function of a normally distributed random variable. Using the identities $E[Ax + b] = A E[x] + b$ and $\operatorname{Var}(Ax + b) = A \operatorname{Var}(x) A^T$, it follows that $\hat{\beta}$ has a normal distribution with the mean and variance as required.
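As an informal check of this result (a sketch of my own, not part of the notes), one can simulate repeatedly from the model used above and compare the sampling distribution of the slope estimator with the stated normal distribution.

```r
# Sketch: Monte Carlo check that beta_hat ~ N(beta, sigma^2 (Z^T Z)^{-1}).
set.seed(7)
x <- 1:10
Z <- cbind(1, x)
V <- solve(t(Z) %*% Z)   # sigma^2 = 1, so Var(beta_hat) = (Z^T Z)^{-1}

slope_hat <- replicate(5000, {
  y <- 2 + 0.6 * x + rnorm(length(x))
  coef(lm(y ~ x))[2]
})

c(mean = mean(slope_hat), var = var(slope_hat), theory_var = V[2, 2])
qqnorm(slope_hat); qqline(slope_hat)   # points should lie close to the line
```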
However, the individual $\hat{\beta}_i$ are not independent, as we saw in Example 17.5.3 (Uncertainty in simple linear regression).
17.6 Gauss-Markov Theorem
The Gauss-Markov Theorem shows that a good choice of estimator for $a^T \beta$, a linear combination of the parameters, is $a^T \hat{\beta}$.
Gauss-Markov Theorem
If $\hat{\beta}$ is the least squares estimator of $\beta$, then $a^T \hat{\beta}$ is the unique linear unbiased estimator of $a^T \beta$ with minimum variance.
The details of the proof of Theorem 17.6.1 (Gauss-Markov Theorem) are provided but can be omitted.
Proof of Gauss-Markov Theorem.
Consider $\hat{\beta} = (Z^T Z)^{-1} Z^T y$, the LSE of $\beta$. Hence,
$$a^T \hat{\beta} = a^T (Z^T Z)^{-1} Z^T y = Cy,$$
where $C = a^T (Z^T Z)^{-1} Z^T$. It follows that $a^T \hat{\beta}$ is a linear function of $y$.
Note that $a^T \hat{\beta}$ is an unbiased estimator of $a^T \beta$ because
$$E[a^T \hat{\beta}] = a^T E[\hat{\beta}] = a^T \beta.$$
Now let $b^T y$ be any linear unbiased estimator of $a^T \beta$. Unbiasedness requires $E[b^T y] = b^T Z \beta = a^T \beta$ for all $\beta$, and hence $b^T Z = a^T$. The variances of the two estimators are
$$\operatorname{Var}(b^T y) = \sigma^2 b^T b \qquad \text{and} \qquad \operatorname{Var}(a^T \hat{\beta}) = \sigma^2 a^T (Z^T Z)^{-1} a.$$
Substituting $a^T = b^T Z$, we can rewrite
$$\operatorname{Var}(a^T \hat{\beta}) = \sigma^2 b^T Z (Z^T Z)^{-1} Z^T b = \sigma^2 b^T P b,$$
where $P$ is the hat matrix. Since $I_n - P$ is symmetric and idempotent,
$$\operatorname{Var}(b^T y) - \operatorname{Var}(a^T \hat{\beta}) = \sigma^2 b^T (I_n - P) b = \sigma^2 \left[(I_n - P) b\right]^T (I_n - P) b \geq 0,$$
so $a^T \hat{\beta}$ has the smallest variance among linear unbiased estimators.
Finally, suppose that $b^T y$ is another linear unbiased estimator such that $\operatorname{Var}(b^T y) = \operatorname{Var}(a^T \hat{\beta})$. Then $\sigma^2 b^T (I_n - P) b = 0$, so $(I_n - P) b = 0$, that is $b = Pb$, and hence
$$b^T y = b^T P y = b^T Z (Z^T Z)^{-1} Z^T y = a^T (Z^T Z)^{-1} Z^T y = a^T \hat{\beta}.$$
Therefore $a^T \hat{\beta}$ is the unique linear unbiased estimator of $a^T \beta$ with minimum variance.
Best linear unbiased estimator (BLUE)
If $a^T = (0, 0, \ldots, 1, 0, \ldots, 0)$, where the $1$ is in the $i$th position, then $\hat{\beta}_i$ is the best linear unbiased estimator, shorthand BLUE, of $\beta_i$.
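The Gauss-Markov result can be illustrated numerically. The sketch below (my own example) compares the least squares slope with an alternative linear unbiased estimator, the two-point slope through the first and last observations; both are unbiased, but the LSE has the smaller variance.

```r
# Sketch: comparing two linear unbiased estimators of the slope.
set.seed(8)
x    <- 1:10
nsim <- 10000
res  <- replicate(nsim, {
  y <- 2 + 0.6 * x + rnorm(length(x))
  c(lse      = unname(coef(lm(y ~ x))[2]),        # least squares slope
    twopoint = (y[10] - y[1]) / (x[10] - x[1]))   # alternative linear unbiased estimator
})

rowMeans(res)        # both means should be close to the true slope, 0.6
apply(res, 1, var)   # the LSE variance (about 0.012) is smaller than the two-point variance (about 0.025)
```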
The following R Shiny app generates data and fits a regression line, $y = \alpha + \beta x$. It allows you to explore the variability in the coefficients and how the covariates $x$ are generated. Predicted values can also be plotted with confidence intervals; see Section 18, Interval Estimation, for an introduction to confidence intervals and Session 12: Linear Models II for a discussion of confidence intervals for predicted values.
R Shiny app: Linear Model
Task: Session 9
Attempt the R Markdown file for Session 9:
Session 9: Linear Models I
Student Exercises
Attempt the exercises below.
Suppose that a model states that
Find the least squares estimates of θ and ϕ.
Solution to Exercise 17.1.
The marks of 8 students are presented below. For student $i$, $x_i$ denotes their mark in a mid-term test and $y_i$ denotes their mark in the final exam.
Mid-term test marks:
- Calculate the correlation between the mid-term test marks and the final exam marks.
- Fit a straight line linear model $y = \alpha + \beta x + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$, with the final exam mark, $y$, as the response and the mid-term test mark, $x$, as the predictor variable. Include estimation of $\sigma^2$ in the model fit.
- Find the expected final exam mark of a student who scores 79 in the mid-term test.
Solution to Exercise 17.2.
- The correlation between $x$ and $y$ is given by
$$\frac{s_{xy}}{s_x \times s_y} = \frac{25.679}{\sqrt{29.554} \times \sqrt{35.929}} = 0.788.$$
- The coefficients of the linear regression model are:
$$\hat{\beta} = \frac{s_{xy}}{s_x^2} = \frac{25.679}{29.554} = 0.869, \qquad \hat{\alpha} = \bar{y} - \hat{\beta} \bar{x} = 54.75 - 0.869 \times 65.875 = -2.488.$$
- The expected final exam mark of a student who scores 79 in the mid-term test is:
$$y^* = -2.488 + 0.869 \times 79 = 66.163.$$
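These calculations can be checked in R directly from the quoted summary statistics (the raw marks are not reproduced here); this assumes $s_{xy}$, $s_x^2$ and $s_y^2$ are the sample covariance and variances given in the solution.

```r
# Sketch: verify the Exercise 17.2 calculations from the summary statistics.
sxy  <- 25.679; sx2 <- 29.554; sy2 <- 35.929
xbar <- 65.875; ybar <- 54.75

r         <- sxy / (sqrt(sx2) * sqrt(sy2))   # correlation, about 0.788
beta_hat  <- sxy / sx2                       # about 0.869
alpha_hat <- ybar - beta_hat * xbar          # about -2.488
y_star    <- alpha_hat + beta_hat * 79       # about 66.15 (66.163 above uses the rounded 0.869)

round(c(correlation = r, beta_hat = beta_hat, alpha_hat = alpha_hat, prediction = y_star), 3)
```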