Chapter 16 Introduction to Linear Models

16.1 Introduction

In this section we present an introduction to linear models. Linear models are the most common type of statistical model and form a wider class of models than is perhaps apparent at first. In Section 16.2, we introduce the concept of constructing a statistical model. In Section 16.3 we identify the key (linear) elements of a linear model and in Section 16.4 introduce the Normal (Gaussian) linear model. In Section 16.5 we define residuals, the differences between the observations and what we expect to observe given the model. In Section 16.6 we consider least squares estimation for the parameters of linear models and show that, for the Normal (Gaussian) linear model, this is equivalent to maximum likelihood estimation. In Section 16.7 we study two examples and consider how the residuals of the model can help us assess the appropriateness of the model. In Section 16.8 we briefly consider using the linear model for prediction, which is one of the main purposes for constructing linear models. Finally, in Section 16.9 we introduce the concept of nested models, where a simpler model is nested within (is a special case of) a more complex model. Whether to choose a simple or a more complex model is a challenging statistical question, and in this section we begin to study some of the considerations needed in making our choice.

16.2 Statistical models

One of the tasks of a statistician is the analysis of data. Statistical analysis usually involves one or more of the following:

  • Summarising data;
  • Estimation;
  • Inference;
  • Prediction.

In general, we statistically model the relationship between two or more random variables by considering models of the form $Y = f(X, \beta) + \epsilon$, where:

  • $Y$ is the response variable;
  • $f$ is some mathematical function;
  • $X$ is some matrix of predictor (input) variables;
  • $\beta$ are the model parameters;
  • $\epsilon$ is the random error term (residual).

If we assume that $E[\epsilon] = 0$, then $E[Y] = f(X, \beta)$ if $X$ is assumed to be non-random. Otherwise $E[Y \mid X] = f(X, \beta)$.

Consider the following examples where such modelling theory could be applied.

Cars on an intersection

We observe the number of cars passing an intersection over a one minute interval. We want to estimate the average rate at which cars pass this intersection.

If $Y$ is the number of cars passing the intersection over a one minute interval, then $Y$ is likely to have a Poisson distribution with mean $\lambda$. We want to estimate $\lambda$, the model parameter.
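As an illustration of how $\lambda$ might be estimated in practice, here is a minimal R sketch; the counts below are simulated rather than real observations, and the choice of rate 12 is arbitrary.

```r
# Simulate 60 one-minute counts; in practice y would be the observed counts
set.seed(1)
y <- rpois(60, lambda = 12)

lambda_hat <- mean(y)   # the sample mean is the maximum likelihood estimate of lambda
lambda_hat
```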


Industrial productivity

In economics, the production of an industry, $Y$, is modelled as a function of the amount of labour available, $L$, and the capital input, $K$. In particular, the Cobb-Douglas production function is given by $Y = C_0 L^{\alpha} K^{\beta}$.

Furthermore, if $\alpha + \beta = 1$, then an industry is said to operate under constant returns to scale, that is, if capital and labour increase by a factor of $t$, then production also increases by a factor of $t$.

As a consultant to an economic researcher, you collect production, labour and capital data for a specific industry and want to estimate the functional relationship and test whether $\alpha + \beta = 1$ in this industry. The problem will have the following components (see the R sketch after the list):

  • A theoretical model: $Y = C_0 L^{\alpha} K^{\beta}$;
  • A statistical model: $\log Y = C + \alpha \log L + \beta \log K + \epsilon$, where $C = \log C_0$;
  • Estimation of $C$, $\alpha$ and $\beta$;
  • A test: does $\alpha + \beta = 1$?
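A minimal R sketch of how such an analysis might proceed, assuming a data frame `industry` with columns `Y` (production), `L` (labour) and `K` (capital); these names are illustrative rather than taken from any real data set.

```r
# Hypothetical data frame 'industry' with columns Y, L and K
fit  <- lm(log(Y) ~ log(L) + log(K), data = industry)
coef(fit)   # estimates of C (= log C0), alpha and beta

# Constant returns to scale imposes beta = 1 - alpha; refit under this
# constraint (offset fixes the coefficient of log(K) at 1) and compare
# the constrained and unconstrained fits with an F-test.
fit0 <- lm(log(Y) ~ I(log(L) - log(K)) + offset(log(K)), data = industry)
anova(fit0, fit)
```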


Blood pressure

Suppose we are interested in studying what factors affect a person’s blood pressure. We have a proposed statistical model, where blood pressure depends on a large number of factors:

$$\text{Blood pressure} = f(\text{age}, \text{weight}, \text{gender}, \text{activity level}, \text{personality type}, \text{time of day}, \text{genetic predisposition}) + \epsilon.$$

We want to estimate the functional relationship ff and potentially test which of the factors has a significant influence on a person’s blood pressure.

16.3 The linear model

Suppose that $Y_1, Y_2, \ldots, Y_n$ are a collection of response variables, with $Y_i$ dependent on the predictor variables $X_i = (1, X_{1,i}, \ldots, X_{p,i})$. Assume that, for $i = 1, 2, \ldots, n$,
$$Y_i = f(X_i, \beta) + \epsilon_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_p X_{pi} + \epsilon_i.$$
A matrix representation of the model is $Y = Z\beta + \epsilon$, where
$$Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}, \quad Z = \begin{bmatrix} 1 & X_{11} & \cdots & X_{p1} \\ 1 & X_{12} & \cdots & X_{p2} \\ \vdots & \vdots & & \vdots \\ 1 & X_{1n} & \cdots & X_{pn} \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}, \quad \epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}.$$

Here $Z$ is called the design matrix and, in models with a constant term, includes a column of ones as well as the data matrix $X$ and possibly functions of $X$. If no constant or functions are included in the model, $Z$ and $X$ are equivalent.

It is common to represent linear models in matrix form. As we shall observe in Section 17, using matrices allows for concise representations of key quantities such as parameter estimates and residuals. Familiarity with core concepts from linear algebra is essential for understanding and manipulating linear models.
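In R, the design matrix $Z$ implied by a model formula can be inspected with `model.matrix()`; a small sketch with made-up predictor values:

```r
# Two illustrative predictors; the first column of Z is the column of ones
dat <- data.frame(x1 = c(1.2, 0.7, 3.1, 2.4),
                  x2 = c(10, 12, 9, 11))
Z <- model.matrix(~ x1 + x2, data = dat)
Z
```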

Linear Model

A model is considered to be linear if it is linear in its parameters.

Consider the following models, and why they are linear/non-linear:

  • $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \epsilon_i$ is linear;
  • $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}^2 + \epsilon_i$ is linear;
  • $\log Y = \log(C_0 L^{\alpha} K^{\beta}) + \epsilon$ is considered linear since we can transform the model into $\log Y = \log C_0 + \alpha \log L + \beta \log K + \epsilon$;
  • $Y = \frac{\beta_1}{\beta_1 - \beta_2}\left[e^{-\beta_2 X} - e^{-\beta_1 X}\right] + \epsilon$ is non-linear.

A linear model therefore has the following underlying assumptions:

  • The model form can be written $Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_p X_{pi} + \epsilon_i$ for all $i$;
  • $E[\epsilon_i] = 0$ for all $i$;
  • $\text{Var}(\epsilon_i) = \sigma^2$ for all $i$;
  • $\text{Cov}(\epsilon_i, \epsilon_j) = 0$ for all $i \neq j$.

16.4 The Normal (Gaussian) linear model

Normal Linear Model

A normal linear model assumes:

  • A model form given by $Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_p X_{pi} + \epsilon_i$ for all $i$;
  • $\epsilon_i \sim N(0, \sigma^2)$ are independent and identically distributed (i.i.d.).

There are two key implications of these assumptions:

  • $Y_i \sim N(\beta_0 + \beta_1 X_{1i} + \cdots + \beta_p X_{pi}, \sigma^2)$ for all $i = 1, \ldots, n$. Equivalently, writing this in matrix form, $Y \sim N_n(Z\beta, \sigma^2 I_n)$, where $N_n$ is the $n$-dimensional multivariate normal distribution.
  • Since $\text{Cov}(\epsilon_i, \epsilon_j) = 0$ for all $i \neq j$, then $\text{Cov}(Y_i, Y_j) = 0$ for all $i \neq j$, and $Y_1, Y_2, \ldots, Y_n$ are independent. (Remember that jointly Normal (Gaussian) random variables $X$ and $Y$ are independent if and only if they are uncorrelated, $\text{Cov}(X, Y) = 0$; see Section 15.2.)
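A short simulation from a normal linear model with a single predictor illustrates these assumptions; all parameter values below are arbitrary choices made for the illustration.

```r
set.seed(42)
n     <- 100
x     <- runif(n, 0, 10)
beta0 <- 2; beta1 <- 1.5; sigma <- 0.8     # arbitrary 'true' values

eps <- rnorm(n, mean = 0, sd = sigma)      # i.i.d. N(0, sigma^2) errors
y   <- beta0 + beta1 * x + eps             # Y_i ~ N(beta0 + beta1 * x_i, sigma^2)
```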

16.5 Residuals

We have the statistical model:
$$Y = f(X, \beta) + \epsilon,$$
which satisfies $E[\epsilon] = 0$, and hence, $E[Y] = f(X, \beta)$. The above equation can be rewritten as
$$\epsilon = Y - f(X, \beta).$$
Suppose that we have observed responses $y_1, y_2, \ldots, y_n$, where observation $i$ has predictors $x_i$. Then given parameters $\beta$, the expected value for the $i^{th}$ response is $f(x_i, \beta)$, and is often denoted $\hat{y}_i$. The residual $\epsilon_i$ is the difference between the observed response and its expected value:
$$\epsilon_i = y_i - \hat{y}_i = y_i - f(x_i, \beta).$$
Note that if $f(x_i, \beta) = \alpha + \beta x_i$, then
$$\epsilon_i = y_i - \{\alpha + \beta x_i\}.$$

An assumption of the Normal linear model is that $\epsilon_i \sim N(0, \sigma^2)$, independent of the value of $f(x_i, \beta) = \alpha + \beta x_i$. Therefore plotting residual values against expected values is one way of testing the appropriateness of the model.
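In R, once a model has been fitted with `lm()`, this plot takes a couple of lines; a sketch using the simulated `x` and `y` from the previous section:

```r
fit <- lm(y ~ x)
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)   # residuals should scatter evenly about zero
```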

16.6 Straight Line, Horizontal Line and Quadratic Models

In this section we look at three types of linear model that we might choose to fit to some data $X_i$ and $Y_i$. Making this choice appropriately is a key part of the statistical analysis process.

Straight line model

The straight line model is the simple linear regression model given by

$$y_i = \alpha + \beta x_i + \epsilon_i, \quad \text{for all } i = 1, \ldots, n.$$

Given a straight line model, we have $E[Y] = \alpha + \beta X$. The key part of using a straight line model is to estimate the values of $\alpha$ and $\beta$.


Let $\hat{\alpha}$ and $\hat{\beta}$ be the estimated values of $\alpha$ and $\beta$, respectively. We call $y_i$ the $i^{th}$ observed value and $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$ the $i^{th}$ fitted value (expected value).

Model deviance

$D = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is called the model deviance.

Since $\epsilon_i = y_i - \hat{y}_i$, the model deviance is the sum of the squares of the residuals, $D = \sum_{i=1}^{n} \epsilon_i^2$.

For the estimated line to fit the data well, one wants estimators of $\alpha$ and $\beta$ that minimise the model deviance. So, choose $\hat{\alpha}$ and $\hat{\beta}$ to minimise
$$D = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2.$$
To minimise $D$ we calculate the stationary points of $D$, that is, the values of $\alpha$ and $\beta$ for which the two first order partial derivatives vanish:
$$\frac{\partial D}{\partial \alpha} = -2 \sum_{i=1}^{n} \left(y_i - (\hat{\alpha} + \hat{\beta} x_i)\right) = 0, \qquad \frac{\partial D}{\partial \beta} = -2 \sum_{i=1}^{n} x_i \left(y_i - (\hat{\alpha} + \hat{\beta} x_i)\right) = 0.$$
Solving these equations simultaneously gives
$$\hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{(n-1) s_{xy}}{(n-1) s_x^2} = \frac{s_{xy}}{s_x^2}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}.$$

Note that $\hat{\alpha}$ and $\hat{\beta}$ are called the least squares estimators of $\alpha$ and $\beta$. We did not use the normality assumption in our derivation, so the least squares estimators do not depend on the choice of distribution for the error terms.
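These formulas are easy to check numerically; the sketch below assumes paired vectors `x` and `y` (for example the simulated data from Section 16.4) and compares the formulas with the estimates returned by `lm()`.

```r
beta_hat  <- cov(x, y) / var(x)            # s_xy / s_x^2
alpha_hat <- mean(y) - beta_hat * mean(x)  # ybar - beta_hat * xbar
c(alpha_hat, beta_hat)

coef(lm(y ~ x))   # should agree with the values above
```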

If we include the assumption of normality, then it can be shown that $\hat{\alpha}$ and $\hat{\beta}$ are also the MLEs of $\alpha$ and $\beta$, respectively.


Let $y_1, y_2, \ldots, y_n$ be observations from a normal linear model:
$$y = \alpha + \beta x + \epsilon,$$
where $\epsilon \sim N(0, \sigma^2)$ and $x_1, x_2, \ldots, x_n$ are the predictor variables. Then
$$\hat{\beta} = \frac{s_{xy}}{s_x^2} \quad \text{and} \quad \hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}$$
are the maximum likelihood estimates of $\beta$ and $\alpha$, respectively.

First, note that
$$y_i \sim N(\alpha + \beta x_i, \sigma^2).$$
Therefore the likelihood is given by
$$L(\alpha, \beta, \sigma^2) = \prod_{i=1}^{n} f(y_i \mid \alpha, \beta, x_i, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}\{y_i - (\alpha + \beta x_i)\}^2\right) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\{y_i - (\alpha + \beta x_i)\}^2\right).$$
The log-likelihood is
$$l(\alpha, \beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\{y_i - (\alpha + \beta x_i)\}^2.$$

In order to maximise the log-likelihood with respect to $\alpha$ and $\beta$, we note that:

  • $-\frac{n}{2}\log(2\pi\sigma^2)$ does not involve $\alpha$ or $\beta$, so can be treated as a constant;
  • Since $\sigma^2 > 0$, $-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\{y_i - (\alpha + \beta x_i)\}^2$ will reach its maximum when $\sum_{i=1}^{n}\{y_i - (\alpha + \beta x_i)\}^2$ is minimised.

Therefore the parameter values $\hat{\alpha}$ and $\hat{\beta}$ which maximise $l(\alpha, \beta, \sigma^2)$ are the parameter values which minimise $\sum_{i=1}^{n}\{y_i - (\alpha + \beta x_i)\}^2$, i.e. the least squares estimates.

Note that the maximum likelihood estimates $\hat{\alpha}$ and $\hat{\beta}$ do not depend on the value of $\sigma^2$.
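As a further check, the log-likelihood can be maximised numerically; the sketch below again assumes vectors `x` and `y`, and should return estimates of $\alpha$ and $\beta$ that agree with the least squares values regardless of the estimated $\sigma$.

```r
# Negative log-likelihood; theta = (alpha, beta, log sigma)
negll <- function(theta, x, y) {
  alpha <- theta[1]; beta <- theta[2]; sigma <- exp(theta[3])
  -sum(dnorm(y, mean = alpha + beta * x, sd = sigma, log = TRUE))
}

opt <- optim(c(mean(y), 0, log(sd(y))), negll, x = x, y = y)
opt$par[1:2]      # numerically close to the least squares estimates
coef(lm(y ~ x))
```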

Horizontal line model

The horizontal line model is the simple linear regression model given by

$$y_i = \mu + \epsilon_i, \quad \text{for } i = 1, \ldots, n.$$

Given a horizontal line model, we have $E[Y] = \mu$. Specifically, in this model we assume the predictor variable has no ability to explain the variance in the response variable.

To estimate $\mu$ by least squares we minimise $D = \sum_{i=1}^{n} (y_i - \mu)^2$. Setting $\frac{dD}{d\mu} = 0$ and solving, we get $\hat{\mu} = \bar{y}$.
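In R the horizontal line model is an intercept-only fit, and its single coefficient is exactly the sample mean; a one-line check for any response vector `y`:

```r
fit_h <- lm(y ~ 1)   # intercept-only (horizontal line) model
coef(fit_h)          # equals mean(y)
mean(y)
```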

Let $D_1$ be the deviance of the straight line model and $D_2$ be the deviance of the horizontal line model. The straight line model does a better job of explaining the variance in $Y$ if $D_1 \ll D_2$, in which case we say that the straight line model fits the data better.

Quadratic line model

The quadratic model is the simple linear regression model given by

$$y_i = \alpha + \beta x_i + \gamma x_i^2 + \epsilon_i, \quad \text{for } i = 1, \ldots, n.$$
To estimate $\alpha$, $\beta$ and $\gamma$ by least squares we minimise
$$D = \sum_{i=1}^{n} \left(y_i - (\alpha + \beta x_i + \gamma x_i^2)\right)^2.$$

If we let $D_3 = \sum_{i=1}^{n} \left(y_i - (\hat{\alpha} + \hat{\beta} x_i + \hat{\gamma} x_i^2)\right)^2$ be the model deviance, then the quadratic model fits the data better than the straight line model if $D_3 \ll D_1$.

16.7 Examples

Tyre experiment

A laboratory tests tyres for tread wear by conducting an experiment where tyres from a particular manufacturer are mounted on a car. The tyres are rotated from wheel to wheel every 1000 miles, and the groove depth is measured in mils (0.001 inches) initially and after every 4000 miles giving the following data (Tamhane and Dunlap, 2000):

Mileage (1000 miles) Groove Depth (mils)
0 394.23
4 329.50
8 291.00
12 255.17
16 229.33
20 204.83
24 179.00
28 163.83
32 150.33

Watch Video 24 for a walk-through of fitting linear models to the tyre experiment in R and a discussion of choosing the most appropriate linear model. The analysis is also presented below.

Video 24: Tyre Experiment

Firstly we have to determine which is the response variable and which is the predictor (or controlled) variable. Secondly, we have to hypothesise a functional relationship between the two variables, using either theoretical relationships or exploratory data analysis.

Let the response variable, $Y$, be groove depth and let the predictor variable, $X$, be mileage. Consider a plot of mileage vs. groove depth:


Note that as mileage increases, the groove depth decreases.

The fitted straight line model is
$$y = 360.599 - 7.279x + \epsilon,$$
and the deviance for the model is 2524.8.

A plot of the residuals for the straight line model shows a relationship with $x$ (mileage), possibly quadratic. This tells us that there is a relationship between mileage and groove depth which is not captured by the straight line model.

The fitted horizontal line model is
$$y = 244.136 + \epsilon.$$
The deviance for the model is 53388.7, which is much higher than the deviance for the straight line model, suggesting it is a considerably worse fit to the data. In this case the poor fit of the horizontal line model is obvious, but later we will explore ways of making the comparison between models rigorous.

A plot of the residuals for the horizontal line model shows a very clear downward trend with $x$ (mileage), confirming that the horizontal line model does not capture the relationship between mileage and groove depth.

The fitted quadratic model is
$$y = 386.199 - 12.765x + 0.171x^2 + \epsilon,$$
and the deviance for the model is 207.65. This is a substantial reduction on the deviance of the straight line model, suggesting that the quadratic model offers a substantial improvement.

A plot of the residuals for the quadratic model shows no obvious pattern. This suggests that the model could be appropriate in capturing the relationship between mileage and groove depth, with the differences between observed and expected values due to randomness.
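A sketch of the analysis in R, entering the data from the table above; the printed coefficients and deviances should match the values quoted (up to rounding), and the residual plots correspond to those discussed.

```r
mileage <- seq(0, 32, by = 4)
depth   <- c(394.23, 329.50, 291.00, 255.17, 229.33,
             204.83, 179.00, 163.83, 150.33)

fit1 <- lm(depth ~ mileage)                   # straight line model
fit2 <- lm(depth ~ 1)                         # horizontal line model
fit3 <- lm(depth ~ mileage + I(mileage^2))    # quadratic model

# Model deviances (sums of squared residuals)
sapply(list(straight = fit1, horizontal = fit2, quadratic = fit3),
       function(m) sum(resid(m)^2))

# Residuals against mileage for each model
par(mfrow = c(1, 3))
for (m in list(fit1, fit2, fit3)) {
  plot(mileage, resid(m), ylab = "Residuals")
  abline(h = 0, lty = 2)
}
```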

Drug effectiveness

Suppose we are interested in the effect a certain drug has on the weight of an organ. An experiment is designed in which rats are randomly assigned to treatment groups; each group contains $J$ rats and receives the drug at one of 7 distinct levels. Upon completion of the treatment, the organs are harvested from the rats and weighed.

Let $Y_{ij}$ be the weight of the organ (response variable) of the $j^{th}$ rat in the $i^{th}$ treatment group, where
$$Y_{ij} = \mu_i + \epsilon_{ij}, \quad \text{for all } i = 1, \ldots, 7 \text{ and } j = 1, \ldots, J.$$

The model is linear in the parameters $\mu_1, \mu_2, \ldots, \mu_7$. Specifically, our aim is to test whether $\mu_1 = \mu_2 = \cdots = \mu_7$. Estimating by least squares minimises
$$D = \sum_{i=1}^{7} \sum_{j=1}^{J} (y_{ij} - \mu_i)^2,$$

and the least squares estimators are given by
$$\hat{\mu}_i = \bar{y}_{i\cdot} = \frac{1}{J} \sum_{j=1}^{J} y_{ij}.$$

The estimate of the mean for level $i$ is the sample mean of the $J$ rats receiving treatment level $i$.
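A minimal R sketch with simulated organ weights; the group size, dose levels and weights below are invented purely for illustration.

```r
set.seed(7)
J      <- 10                                  # rats per treatment group (assumed)
dose   <- factor(rep(1:7, each = J))
weight <- rnorm(7 * J,
                mean = rep(c(5.0, 5.1, 5.3, 5.6, 5.8, 6.1, 6.2), each = J),
                sd = 0.3)                     # simulated data

fit <- lm(weight ~ dose - 1)                  # one mean per treatment level
coef(fit)                                     # the estimates mu_i hat
tapply(weight, dose, mean)                    # the same group sample means
```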

16.8 Prediction

A linear model identifies a relationship between the response variable $y$ and the predictor variable(s) $x$. We can then use the linear model to predict $y$ given predictor variables $x$.

Given the model
$$Y = f(x, \beta) + \epsilon,$$
where $E[\epsilon] = 0$, we have that our best estimate of $y$ given $x$ and model parameters $\hat{\beta}$ is
$$y = E[Y \mid x, \hat{\beta}] = f(x, \hat{\beta}) = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x_j.$$

We will discuss later the uncertainty in the prediction of $y$, which depends upon the variance of $\epsilon$, the uncertainty in the estimation of $\hat{\beta}$, and how far the new predictor values $x_j$ are from the means of the observed $x_j$.

For Example 16.7.1 (Tyre experiment), we can use the three models (straight line, horizontal line, quadratic) to predict the groove depth (mils) of the tyres after 18,000 miles. The predicted values are:

  • Straight line model: $y = 360.599 - 7.279(18) = 229.577$.
  • Horizontal line model: $y = 244.136$.
  • Quadratic model: $y = 386.199 - 12.765(18) + 0.171(18^2) = 211.833$.
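Using the fitted model objects `fit1`, `fit2` and `fit3` from the tyre example sketch above, the same predictions can be obtained with `predict()`:

```r
new <- data.frame(mileage = 18)
predict(fit1, newdata = new)   # straight line model:   approx. 229.6
predict(fit2, newdata = new)   # horizontal line model: approx. 244.1
predict(fit3, newdata = new)   # quadratic model:       approx. 211.8
```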

16.9 Nested Models

In Example 16.7.1 (Tyre experiment), we considered three models:

  • Straight line model: $y = \alpha + \beta x + \epsilon$.
  • Horizontal line model: $y = \mu + \epsilon$.
  • Quadratic model: $y = \alpha + \beta x + \gamma x^2 + \epsilon$.

The quadratic model is the most general of the three models. We note that if $\gamma = 0$, then the quadratic model reduces to the straight line model. Similarly, letting $\beta = \gamma = 0$ in the quadratic model and with $\mu = \alpha$, the quadratic model reduces to the horizontal line model. That is, the horizontal line and straight line models are special cases of the quadratic model, and we say that the horizontal line and straight line models are nested models in the quadratic model. Similarly the horizontal line model is a nested model in the straight line model.

Nested model

Model A is a nested model of Model B if the parameters of Model A are a subset of the parameters of Model B.


Let $D_A$ and $D_B$ denote the deviances of Models A and B, respectively. Then if Model A is a nested model of Model B,
$$D_B \leq D_A.$$

That is, the more complex model with additional parameters will fit the data at least as well.

Let $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_K)$ denote the $K$ parameters of Model A and let $\beta = (\alpha_1, \alpha_2, \ldots, \alpha_K, \delta_1, \ldots, \delta_M)$ denote the $K + M$ parameters of Model B.

If $\hat{\alpha} = (\hat{\alpha}_1, \hat{\alpha}_2, \ldots, \hat{\alpha}_K)$ minimises the deviance under Model A, then the parameter values $(\hat{\alpha}_1, \hat{\alpha}_2, \ldots, \hat{\alpha}_K, 0, \ldots, 0)$, i.e. setting each $\delta_i = 0$, achieve the deviance $D_A$ under Model B. Thus $D_B$ can be at most equal to $D_A$.


Since more complex models will better fit data (smaller deviance), why not choose the most complex model possible?

There are many reasons, including interpretability (can we interpret the relationship between $x_i$ and $y_i$? For example, the time taken to run 100 metres is unlikely to be a good predictor of performance on a maths test) and predictive power (given a new predictor observation $x$, can we predict $y$?).

Student Exercise

Attempt the exercise below.


Given the following theoretical relationships between $Y$ and $X$, with $\alpha$ and $\beta$ unknown, can you find ways of expressing the relationship in a linear manner?

  1. $Y = \alpha e^{\beta X}$,
  2. $Y = \alpha X^{\beta}$,
  3. $Y = \log(\alpha + e^{\beta X})$.
Solution to Exercise 16.1.
  1. Take logs, so $\log Y = \log\alpha + \beta X$. Hence, $Y' = \alpha' + \beta X$ where $Y' = \log Y$ and $\alpha' = \log\alpha$.
  2. Again, take logs, so $\log Y = \log\alpha + \beta\log X$. Hence, $Y' = \alpha' + \beta X'$ where $Y' = \log Y$, $\alpha' = \log\alpha$ and $X' = \log X$.
  3. This cannot be written as a linear model.