4 Multiple regression
4.1 Introductory examples
Setup: response variable \(y\), predictors \(x_1\), \(x_2\), …, \(x_k\).
4.1.1 Example 1: Fuel Use
Example from Section 2. Information was recorded on fuel usage and average temperature (\(^oF\)) over the course of one week for eight office complexes of similar size. Data are from Bowerman and Schafer (1990).
\(y\) = fuel use,
\(x_1\) = temperature,
\(x_2\) = chill index.
Data:
Temperature \(x_1\) | Fuel use \(y\) | Chill index \(x_2\) |
---|---|---|
28.0 | 12.4 | 18 |
28.0 | 11.7 | 14 |
32.5 | 12.4 | 24 |
39.0 | 10.8 | 22 |
45.9 | 9.4 | 8 |
57.8 | 9.5 | 16 |
58.1 | 8.0 | 1 |
62.5 | 7.5 | 0 |
We wish to use \(x_1\) and \(x_2\) together to predict \(y\); using both predictors may give more accurate predictions than using either \(x_1\) or \(x_2\) alone.
A multiple linear regression model is: fuel use \(\approx\) a linear function of temperature and chill index.
More precisely:
\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon.\]
As before, \(\epsilon\) is the unobserved error, \(\beta_0, \beta_1, \beta_2\) are the unknown parameters.
When \(\mathbb{E}[\epsilon ] = 0\) we have
\[\mathbb{E}[y] = \beta_0 + \beta_1 x_1 + \beta_2 x_2.\]
In SLR we can check model appropriateness by plotting \(y\) vs \(x\) and observing whether the points fall close to a line. Here we could construct a 3-D plot of \(y\) against \(x_1\) and \(x_2\); the points should fall close to a plane.
For a given set of values of \(x_1\) and \(x_2\), say \(x_1 = 45.9\) and \(x_2 = 8\), the model says that the mean fuel use is:
\[\mathbb{E}[y] = \beta_0 + \beta_1 \times 45.9 + \beta_2 \times 8.\]
If \(x_1 = x_2 = 0\) then \(\mathbb{E}[y] = \beta_0\), the model intercept.
To interpret \(\beta_1\) suppose \(x_1 = t\) and \(x_2 = c\). Then
\[\mathbb{E}[y]=\beta_0 + \beta_1 \times t + \beta_2 \times c.\]
Now suppose \(x_1\) increases by \(1\) and \(x_2\) stays fixed:
\[\mathbb{E}[y]=\beta_0 + \beta_1 \times (t + 1) + \beta_2 \times c.\]
Subtracting these, we find that \(\beta_1\) is the increase in \(\mathbb{E}[y]\) associated with a one-unit increase in \(x_1\) when \(x_2\) is held fixed.
That is, two weeks with the same chill index but with temperatures differing by \(1^o\)F would have mean fuel use differing by \(\beta_1\).
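As a minimal R sketch (the data-frame and variable names below are ours), this model can be fitted to the eight observations above with lm:
# Fuel-use data from the table above
fuel.data <- data.frame(
  temp  = c(28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5),
  fuel  = c(12.4, 11.7, 12.4, 10.8,  9.4,  9.5,  8.0,  7.5),
  chill = c(18, 14, 24, 22, 8, 16, 1, 0)
)
# Multiple regression of fuel use on temperature and chill index
fit.fuel <- lm(fuel ~ temp + chill, data = fuel.data)
summary(fit.fuel)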
4.1.2 Example 2: Categorical predictors
Suppose we wish to predict the fuel efficiency of different car types. Data are from Cryer and Miller (1991). We have data on:
\(y\) = gallons per mile (gpm),
\(x_1\) = car weight (w),
\(x_2\) = transmission type (ttype): 1 = automatic or 0 = manual.
We use the model
\[\mathbb{E}[y] = \beta_0 + \beta_1 x_1 + \beta_2 x_2.\]
\(\beta_0\) = the mean gpm for cars of weight \(w = 0\) with ttype = manual. \(\beta_1\) = the change in mean gpm associated with a one-unit increase in weight, for a fixed ttype. \(\beta_2\) = the change in mean gpm between an automatic and a manual car of the same weight.
The model says that:
\[\begin{align*} \mathbb{E}[y] & = \beta_0 + \beta_1 x_1 \quad \mbox{ for manual}\\ & = \beta_0 + \beta_2 + \beta_1 x_1 \quad \mbox{ for automatic.} \end{align*}\]Therefore we are fitting two lines with different intercepts but the same slope.
Under this model, a plot of gpm against weight should show the data falling around two parallel lines, one for each transmission type.
Suppose instead that the plot suggests two lines with different intercepts and different slopes. We then introduce a third predictor:
\[\mathbb{E}[y] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1x_2,\]
giving:
\[\begin{align*} \mathbb{E}[y] & = \beta_0 + \beta_1 x_1 \quad \mbox{ for manual}\\ & = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_1 \quad \mbox{ for automatic.} \end{align*}\]The term \(x_1x_2\) is called an interaction term.
Here:
\(\beta_2\) = difference in intercept
\(\beta_3\) = difference in slope.
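A sketch of how such models could be fitted in R, assuming a hypothetical data frame cars.data with columns gpm, w and ttype (these names are ours):
# Common slope, different intercepts
fit.lines <- lm(gpm ~ w + ttype, data = cars.data)
# Different intercepts and different slopes (adds the interaction term)
fit.inter <- lm(gpm ~ w + ttype + w:ttype, data = cars.data)
# equivalently: lm(gpm ~ w * ttype, data = cars.data)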
4.1.3 Example 3: Polynomials
We have one predictor \(x\) but the plot of \(y\) vs \(x\) exhibits a quadratic pattern.
Then we can fit a multiple regression model:
\[\mathbb{E}[y] = \beta_0 + \beta_1 x + \beta_2 x^2.\]
This is also called a quadratic regression model or, more generally, a polynomial regression model.
Higher order polynomial regression models can also be used if needed.
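For example, a quadratic model can be fitted in R either by constructing the squared term explicitly (as is done for the cigarette data below) or by using I() in the formula; x and y are assumed to be in the workspace:
x.sq <- x^2
fit.quad <- lm(y ~ x + x.sq)
# equivalently, without creating a new variable:
fit.quad2 <- lm(y ~ x + I(x^2))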
4.1.4 Example 4: Nonlinear relationships
Sometimes the relationship is nonlinear in the parameters. For example,
\[y = \alpha x_1 ^{\beta x_2} \epsilon.\]
Nonlinear models can sometimes be linearized, for example:
\[\log(y) = \log(\alpha) + \beta x_2 \log(x_1) + \log(\epsilon).\]
This is a simple linear regression of \(\log(y)\) on the single predictor \(x = x_2 \log(x_1)\).
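A minimal R sketch of fitting the linearised model, assuming y, x1 and x2 are available in the workspace:
x.lin <- x2 * log(x1)
fit.lin <- lm(log(y) ~ x.lin)
coef(fit.lin)   # intercept estimates log(alpha), slope estimates beta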
NOTE: the term linear refers to linearity in the regression parameters, not in the explanatory variables.
A general form for multiple linear regression model (with two explanatory variables):
\[y = \beta_0 f_0(x_1, x_2) + \beta_1 f_1(x_1, x_2) + \beta_2 f_2(x_1, x_2) + \dots\]
where \(f_j(x_1, x_2)\) are known functions of explanatory variables.
The extension to more than two explanatory variables is straightforward.
4.1.5 Cigarette Data continued
Data from 3.7.4. Consider a second predictor (weight):
(The cigarette data, with variables brand, carbon.monoxide, nicotine, tar and weight, are attached so that the variables can be referred to directly.)
Regression (nicotine only)
fit <- lm(carbon.monoxide ~ nicotine)
summary(fit)
##
## Call:
## lm(formula = carbon.monoxide ~ nicotine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3273 -1.2228 0.2304 1.2700 3.9357
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.6647 0.9936 1.675 0.107
## nicotine 12.3954 1.0542 11.759 3.31e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.828 on 23 degrees of freedom
## Multiple R-squared: 0.8574, Adjusted R-squared: 0.8512
## F-statistic: 138.3 on 1 and 23 DF, p-value: 3.312e-11
Regression (weight only)
fit2 <- lm(carbon.monoxide ~ weight)
summary(fit2)
##
## Call:
## lm(formula = carbon.monoxide ~ weight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.524 -2.533 0.622 2.842 7.268
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.795 9.722 -1.213 0.2373
## weight 25.068 9.980 2.512 0.0195 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.289 on 23 degrees of freedom
## Multiple R-squared: 0.2153, Adjusted R-squared: 0.1811
## F-statistic: 6.309 on 1 and 23 DF, p-value: 0.01948
Regression (both predictors)
fit3 <- lm(carbon.monoxide ~ weight + nicotine)
summary(fit3)
##
## Call:
## lm(formula = carbon.monoxide ~ weight + nicotine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3304 -1.2249 0.2314 1.2677 3.9371
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.61398 4.44663 0.363 0.720
## weight 0.05883 5.02395 0.012 0.991
## nicotine 12.38812 1.24473 9.952 1.32e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.87 on 22 degrees of freedom
## Multiple R-squared: 0.8574, Adjusted R-squared: 0.8444
## F-statistic: 66.13 on 2 and 22 DF, p-value: 4.966e-10
Regression (quadratic)
nicotine.sq <- nicotine^2
fit4 <- lm(carbon.monoxide ~ nicotine + nicotine.sq)
summary(fit4)
##
## Call:
## lm(formula = carbon.monoxide ~ nicotine + nicotine.sq)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9857 -1.1052 0.1834 0.8654 3.4145
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.784 1.453 -1.227 0.23264
## nicotine 20.111 2.775 7.248 2.92e-07 ***
## nicotine.sq -3.730 1.267 -2.945 0.00749 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.583 on 22 degrees of freedom
## Multiple R-squared: 0.8977, Adjusted R-squared: 0.8884
## F-statistic: 96.53 on 2 and 22 DF, p-value: 1.284e-11
4.2 Least squares estimation for multiple regression
Our model states that:
\[y = \beta_0 + \beta_1x_{1} + \beta_2x_{2} + ... + \beta_kx_k + \epsilon,\]
where \(k<n\).
For each observation we have:
\[y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + ... + \beta_kx_{ik} + \epsilon_i.\]
We can write this more compactly using matrix notation.
Let \(\mathbf{Y}\) be the response vector:
\[\mathbf{Y} =\begin{bmatrix} y_{1} \\ y_{2} \\ \vdots\\ y_{n} \end{bmatrix}\]
Let \(\mathbf{X}\) be the \(n \times p\) matrix, where \(p = k+1\):
\[\mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & x_{13} & \dots & x_{1k} \\ 1 & x_{21} & x_{22} & x_{23} & \dots & x_{2k} \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & x_{n3} & \dots & x_{nk} \end{bmatrix}\]
Let \(\boldsymbol{\beta}\) be the \(p\)-dim parameter vector:
\[\boldsymbol{\beta} =\begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \vdots\\ \beta_{k} \end{bmatrix}\]
Let \(\boldsymbol{\epsilon}\) be the \(n\)-dim error vector:
\[\boldsymbol{\epsilon} =\begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \vdots\\ \epsilon_{n} \end{bmatrix}\]
The model states that:
\[\mathbf{Y}=\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}.\]
The vector of fitted values is:
\[\hat{\mathbf{Y}}=\mathbf{X}\hat{\boldsymbol{\beta}}.\]
The corresponding residual values are:
\[\mathbf{e}=\mathbf{Y}-\hat{\mathbf{Y}}.\]
The OLS estimates minimise:
\[S(\boldsymbol{\beta}) = \sum_{i=1}^{n}(y_i-\beta_0-\beta_1x_{i1}- ... - \beta_kx_{ik})^2\]
over \(\boldsymbol{\beta}\).
Therefore the OLS estimates satisfy:
\[\frac{\partial S(\boldsymbol{\beta})}{\partial \beta_j} = 0, \quad \forall j,\]
where the partial derivatives, evaluated at \(\hat{\boldsymbol{\beta}}\), are
\[\frac{\partial S(\boldsymbol{\beta})}{\partial \beta_0} = -2 \sum_{i=1}^{n}(y_i-\beta_0-\beta_1x_{i1}- ... - \beta_kx_{ik})\]
\[\frac{\partial S(\boldsymbol{\beta})}{\partial \beta_j} = -2 \sum_{i=1}^{n} x_{ij}(y_i-\beta_0-\beta_1x_{i1}- ... - \beta_kx_{ik}), \quad \forall j = 1,...,k.\]
The OLS estimates of \(\boldsymbol{\beta}\) satisfy:
\[\sum_{i=1}^{n}(y_i-\hat{\beta}_0-\hat{\beta}_1x_{i1}- ... - \hat{\beta}_kx_{ik}) = 0\]
and
\[\sum_{i=1}^{n}(y_i-\hat{\beta}_0-\hat{\beta}_1x_{i1}- ... - \hat{\beta}_kx_{ik})x_{ij} = 0, \quad \forall j = 1,...,k.\]
These normal equations (see (3.1) and (3.2)) can be written as:
\[\sum_{i=1}^{n}e_i = 0\]
and
\[\sum_{i=1}^{n}x_{ij}e_i = 0, \quad \forall j = 1,...,k.\]
We can combine this into one matrix equation:
\[\mathbf{X}^T\mathbf{e}= \mathbf{0}\]
or equivalently:
\[\mathbf{X}^T(\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}})= \mathbf{0}\]
Therefore the OLS estimator \(\hat{\boldsymbol{\beta}}\) satisfies:
\[\begin{align} \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} &= \mathbf{X}^T\mathbf{Y}\\ \hat{\boldsymbol{\beta}} &= (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{Y}.\tag{4.1} \end{align}\]The matrix \(\mathbf{X}^T\mathbf{X}\) is non-singular (i.e. has an inverse) if and only if \(\mbox{rank}(\mathbf{X}) = p\), i.e. \(\mathbf{X}\) has full rank and the columns of \(\mathbf{X}\) are linearly independent.
4.2.1 Estimation of \(\sigma^2\) = Var\((\epsilon)\)
A point estimate of \(\sigma^2\) is the mean squared error:
\[\hat{\sigma}^2 = \mbox{MSE} = \frac{\mbox{SSE}}{n-p} = \frac{\sum_{i=1}^n e_i^2}{n-p}.\]
4.2.2 Estimation of Var\((\hat{\beta})\)
\[\mbox{Var}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^T\mathbf{X})^{-1} \sigma^2.\]
\[\widehat{\mbox{Var}(\hat{\boldsymbol{\beta}})} = (\mathbf{X}^T\mathbf{X})^{-1} \hat{\sigma}^2.\]
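A minimal sketch of these matrix computations in R, assuming X is the \(n \times p\) design matrix (including the column of ones) and Y is the response vector:
beta.hat <- solve(t(X) %*% X, t(X) %*% Y)      # solves the normal equations
e <- Y - X %*% beta.hat                        # residuals
n <- nrow(X); p <- ncol(X)
sigma2.hat <- sum(e^2) / (n - p)               # MSE, estimate of sigma^2
var.beta.hat <- sigma2.hat * solve(t(X) %*% X) # estimated Var(beta.hat)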
4.3 Prediction from multiple linear regression model
As we have seen already, to predict from a multiple regression model we use:
\[\hat{y}_i = \hat{\beta}_0+ \hat{\beta}_1x_{i1}+ \cdots+\hat{\beta}_kx_{ik}\]
or \[\hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\]
At a particular set of \(x_0\) values we predict the response \(y_0\) by:
\[\hat{y}_0 = \mathbf{x}_0^T\hat{\boldsymbol{\beta}}\]
where \(\mathbf{x}_0^T = ( 1, x_{01},x_{02},..., x_{0k})\).
We also use \(\hat{y}_0\) to estimate \(\mathbb{E}(y_0)\), the mean of \(y_0\) at a given set of \(x_0\) values.
The \(\mbox{S.E.}\) for the estimate of the mean \(\mathbb{E}(y_0)\) is:
\[\mbox{S.E.}_{\mbox{fit}} (\hat{y}_0)= \hat{\sigma}\sqrt{\mathbf{x}_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_0}.\]
A \(1-\alpha\) confidence interval for the expected response at \(\mathbf{x}_0\) is given by:
\[\hat{y}_0 \pm t_{n-p}(\alpha/2) \times \mbox{S.E.}_{\mbox{fit}} (\hat{y}_0).\]
The \(\mbox{S.E.}\) for the predicted \(y_0\):
\[\mbox{S.E.}_{\mbox{pred}}(\hat{y}_0) = \hat{\sigma}\sqrt{1+ \mathbf{x}_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_0}.\]
Note: \[\mbox{S.E.}_{\mbox{pred}}(\hat{y}_0)= \sqrt{\hat{\sigma}^2+\mbox{S.E.}_{\mbox{fit}}(\hat{y}_0)^2}\]
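In R, predict() computes both intervals from a fitted lm object; a sketch using the fuel-use fit from Section 4.1.1 (fit.fuel, a name we introduced there) and a hypothetical new observation:
new.x <- data.frame(temp = 45.9, chill = 8)
predict(fit.fuel, newdata = new.x, interval = "confidence")  # CI for E(y0), uses S.E._fit
predict(fit.fuel, newdata = new.x, interval = "prediction")  # PI for y0, uses S.E._pred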
4.4 Regression models in matrix notation: examples
4.4.1 Example 1: SLR
The \(\mathbf{X}\) matrix is:
\[\mathbf{X} = \begin{bmatrix} 1 & x_{1}\\ \vdots & \vdots\\ 1 &x_{n} \end{bmatrix}\]
To estimate the coefficients \(\hat{\boldsymbol{\beta}}\):
\[\begin{align*} \mathbf{X}^T\mathbf{X} &= \begin{bmatrix} n & \sum x_{i}\\ \sum x_{i}& \sum x_{i}^2 \end{bmatrix}\\ (\mathbf{X}^T\mathbf{X})^{-1} & = \frac{1}{n \sum x_{i}^2 - (\sum x_{i})^2}\begin{bmatrix} \sum x_{i}^2& -\sum x_{i}\\ -\sum x_{i} & n \end{bmatrix} \\ & = \frac{1}{n (\sum x_{i}^2 - n\bar{x}^2)}\begin{bmatrix} \sum x_{i}^2 & -n\bar{x} \\ -n\bar{x} & n \end{bmatrix} \\ & = \frac{1}{S_{xx}}\begin{bmatrix} \sum x_{i}^2/n & -\bar{x} \\ -\bar{x} & 1 \end{bmatrix} \\ \mathbf{X}^T\mathbf{Y} &= \begin{bmatrix} \sum y_{i} \\ \sum x_{i}y_{i} \end{bmatrix} = \begin{bmatrix} n\bar{y} \\ \sum x_{i}y_{i} \end{bmatrix}\\ \hat{\boldsymbol{\beta}} & = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} \\ & = \frac{1}{S_{xx}}\begin{bmatrix} \bar{y}\sum x_{i}^2 -\bar{x} \sum x_{i}y_i \\ -n \bar{x} \bar{y} + \sum x_{i}y_i \end{bmatrix} \end{align*}\]With some algebra, this gives:
\[\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}\] and
\[\hat{\beta}_0= \bar{y} - \hat{\beta}_1\bar{x}\]
as before, and
\[\begin{align*} \mbox{Var}(\hat{\boldsymbol{\beta}})& = (\mathbf{X}^T\mathbf{X})^{-1}\sigma^2\\ & = \frac{\sigma^2}{S_{xx}} \begin{bmatrix} \sum x_{i}^2/n& -\bar{x} \\ -\bar{x}& 1 \end{bmatrix} \end{align*}\]which gives
\[\mbox{Var}(\hat{\beta}_0) = \sigma^2\left(\frac{1}{n}+ \frac{\bar{x}^2}{S_{xx}}\right),\]
\[\mbox{Var}(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}},\]
\[\mbox{Cov}(\hat{\beta}_0, \hat{\beta}_1) = -\bar{x}\frac{\sigma^2}{S_{xx}}.\]
4.4.2 Example 2
Example from Stapleton (2009), Problem 3.1.1, pg 81.
A scale has 2 pans. The measurement given by the scale is the difference between the weight in pan 1 and pan 2, plus a random error \(\epsilon\).
Suppose that \(\mathbb{E}[\epsilon] = 0\) and \(\mbox{Var}(\epsilon) = \sigma^2\) and the \(\epsilon_i\) are independent.
Suppose also that two objects have weight \(\beta_1\) and \(\beta_2\) and that 4 measurements are taken:
- Pan 1: object 1, Pan 2: empty
- Pan 1: empty, Pan 2: object 2
- Pan 1: object 1, Pan 2: object 2
- Pan 1: object 1 and 2, Pan 2: empty
Let \(y_1\), \(y_2\), \(y_3\) and \(y_4\) be the four observations. Then:
\[\begin{align*} y_1 & = \beta_1 + \epsilon_1\\ y_2 & =- \beta_2 + \epsilon_2\\ y_3 & = \beta_1 - \beta_2 + \epsilon_3\\ y_4 & = \beta_1 + \beta_2 + \epsilon_4\\ \end{align*}\]\(\mathbf{X} = \begin{bmatrix} 1 &0 \\ 0 & -1 \\ 1 & -1 \\ 1 & 1 \end{bmatrix}\)
\(\mathbf{Y} = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix}\)
\(\boldsymbol{\beta} = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}\)
\(\boldsymbol{\epsilon} = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \end{bmatrix}\)
The model is:
\[\mathbf{Y} = \begin{bmatrix} 1 &0 \\ 0 & -1 \\ 1 & -1 \\ 1 & 1 \end{bmatrix} \times\begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} + \boldsymbol{\epsilon} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\]
The OLS estimates of \(\boldsymbol{\beta}\) are given by:
\[\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{Y}.\]
\[\begin{align*} \mathbf{X}^T\mathbf{X} & = \begin{bmatrix} 1 & 0 & 1 & 1\\ 0 & -1 & -1 & 1\\ \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & -1 \\ 1 & -1 \\ 1 & 1\\ \end{bmatrix}\\ & = \begin{bmatrix} 3 & 0 \\ 0 & 3 \\ \end{bmatrix}\\ & = 3\begin{bmatrix} 1 & 0 \\ 0 & 1 \\ \end{bmatrix}\\ \mathbf{X}^T\mathbf{Y} &= \begin{bmatrix} y_1 + y_3 + y_4\\ -y_2 - y_3 + y_4\\ \end{bmatrix}\\ \hat{\boldsymbol{\beta}} & = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{Y}\\ & = \frac{1}{3}\begin{bmatrix} 1 & 0 \\ 0 & 1 \\ \end{bmatrix} \begin{bmatrix} y_1 + y_3 + y_4\\ -y_2 - y_3 + y_4\\ \end{bmatrix}\\ & = \frac{1}{3}\begin{bmatrix} y_1 + y_3 + y_4\\ -y_2 - y_3 + y_4\\ \end{bmatrix}\\ \mbox{Var}(\hat{\boldsymbol{\beta}}) &= (\mathbf{X}^T\mathbf{X})^{-1} \sigma^2 = \frac{1}{3}\begin{bmatrix} 1 & 0 \\ 0 & 1 \\ \end{bmatrix} \sigma^2.\\ \end{align*}\]Can we improve the experiment so that the 4 measurements yield estimates of \(\boldsymbol{\beta}\) with smaller variance?
- Present design: \(\mbox{Var}(\hat{\beta}_i) = \frac{\sigma^2}{3}\).
- Let \(\mathbf{X} = \begin{bmatrix} 1 & -1 \\ 1 & -1 \\ 1 & 1 \\ 1 & 1 \end{bmatrix}\); then \(\mbox{Var}(\hat{\beta}_i) = \frac{\sigma^2}{4}\).
- Let \(\mathbf{X} = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix}\); then \(\mbox{Var}(\hat{\beta}_i) = \frac{\sigma^2}{2}\).
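A quick R check of \((\mathbf{X}^T\mathbf{X})^{-1}\) for the three designs above (a sketch; the object names are ours):
X1 <- matrix(c(1,0,  0,-1,  1,-1,  1,1), ncol = 2, byrow = TRUE)  # original design
X2 <- matrix(c(1,-1, 1,-1,  1,1,   1,1), ncol = 2, byrow = TRUE)  # alternative 1
X3 <- matrix(c(1,0,  1,0,   0,1,   0,1), ncol = 2, byrow = TRUE)  # alternative 2
lapply(list(X1, X2, X3), function(X) solve(t(X) %*% X))
# diagonal entries 1/3, 1/4 and 1/2, so Var(beta.hat_i) is sigma^2/3, sigma^2/4, sigma^2/2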
4.5 The formal multiple regression model and properties
4.5.1 Concepts: random vectors, covariance matrix, multivariate normal distribution (MVN)
Let \(Y_1,...,Y_n\) be r.v.s defined on a common probability space.
Then \(\mathbf{Y} = (Y_1, ..., Y_n)^T\) is a random vector.
Let \(\mu_i = \mathbb{E}[Y_i]\) and \(\boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_n \end{bmatrix}\).
Then \(\boldsymbol{\mu}\) is the mean vector of \(\mathbf{Y}\) and we write:
\[\mathbb{E}[\mathbf{Y}] = \boldsymbol{\mu}.\]
Let \(\sigma_{ij} = \mbox{Cov}(Y_i, Y_j)\). Then \(\boldsymbol{\Sigma} = [\sigma_{ij}]\), the matrix with \((i,j)\) entry \(\sigma_{ij}\), is the covariance matrix of \(\mathbf{Y}\).
We write:
\[\mbox{Var}(\mathbf{Y}) = \boldsymbol{\Sigma}.\]
Aside:
\(\mbox{Cov}(Y_i,Y_j) = \mathbb{E}[(Y_i-\mathbb{E}[Y_i])(Y_j-\mathbb{E}[Y_j])]\)
\(\mbox{Cov}(Y_i,Y_i) = \mbox{Var}(Y_i)\)
If \(Y_i,Y_j\) are independent then \(\mbox{Cov}(Y_i,Y_j) = 0\).
When \(Y_i, Y_j\) have a bivariate normal distribution and \(\mbox{Cov}(Y_i,Y_j) = 0\), then \(Y_i\) and \(Y_j\) are independent.
Fact: Suppose \(\mathbf{Y}\) has mean \(\boldsymbol{\mu}\) and variance \(\boldsymbol{\Sigma}\). Then for a vector of constants \(\mathbf{b}\) and matrix of constants \(\mathbf{C}\):
\[\mathbb{E}[\mathbf{C}\mathbf{Y} + \mathbf{b}] = \mathbf{C}\boldsymbol{\mu} + \mathbf{b}\]
and
\[\mbox{Var}( \mathbf{C}\mathbf{Y} + \mathbf{b} ) = \mathbf{C}\boldsymbol{\Sigma} \mathbf{C}^T.\]
Defn: A random \(n\) - dim vector \(\mathbf{Y}\) is said to have a MVN distribution if \(\mathbf{Y}\) can be written as \[\mathbf{Y} = \mathbf{A}\mathbf{Z} + \boldsymbol{\mu}\] where:
- \(Z_1, Z_2, ..., Z_p\) are iid \(N(0,1)\),
- \(\mathbf{A}\) is an \(n \times p\) matrix of constants, and
- \(\boldsymbol{\mu}\) is an \(n\)-dimensional vector of constants.
Notes:
Random vector \(\mathbf{Z}\) is multivariate normal with mean \(\mathbf{0}\) and covariance \(\mathbf{I}_p\) since \(Z_i\)s are independent so their covariances are 0. We write: \[\mathbf{Z} \sim N_p(\mathbf{0}, \mathbf{I}_p).\]
\(\mathbb{E}[\mathbf{Y}] = \mathbb{E}[\mathbf{A}\mathbf{Z} + \boldsymbol{\mu}] = \boldsymbol{\mu}\),
\(\mbox{Var}(\mathbf{Y}) = \mathbf{A}\mathbf{A}^T\),
\(\mathbf{Y} \sim N_n (\boldsymbol{\mu}, \mathbf{A}\mathbf{A}^T)\).
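A small simulation sketch of this definition (the matrix A, the mean mu and the sample size are arbitrary choices of ours): generate iid \(N(0,1)\) vectors \(\mathbf{Z}\), form \(\mathbf{Y} = \mathbf{A}\mathbf{Z} + \boldsymbol{\mu}\), and check that the sample covariance of \(\mathbf{Y}\) is close to \(\mathbf{A}\mathbf{A}^T\).
set.seed(1)
A  <- matrix(c(2, 0, 1, 1), nrow = 2)     # arbitrary 2 x 2 matrix of constants
mu <- c(1, -1)
Z  <- matrix(rnorm(2 * 10000), nrow = 2)  # each column is an iid N(0, I_2) vector
Y  <- A %*% Z + mu                        # columns of Y are draws from N_2(mu, A A^T)
cov(t(Y))                                 # close to A %*% t(A)
A %*% t(A)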
4.5.2 Multiple regression model
\[\mathbf{Y}=\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}.\]
\(\mathbf{Y}=\) \(n\) - dimensional response random vector.
\(\boldsymbol{\beta}=\) unknown \(p\) - dimensional parameter vector.
\(\mathbf{X}=\) an \(n \times p\) matrix of constants.
\(\boldsymbol{\epsilon}=\) \(n\) - dimensional error vector.
Assumptions:
Linearity: \(\mathbb{E}[\boldsymbol{\epsilon} ] =\mathbf{0}\), hence \(\mathbb{E}[\mathbf{Y}] = \mathbf{X}\boldsymbol{\beta}\).
Constant variance and zero covariances: \(\mbox{Var}(\boldsymbol{\epsilon}) = \sigma^2 I_n\), hence \(\mbox{Var}(\mathbf{Y}) = \sigma^2 I_n\).
MVN distribution: \(\boldsymbol{\epsilon} \sim N_n(\mathbf{0},\sigma^2 I_n )\), hence \(\mathbf{Y} \sim N_n(\mathbf{X}\boldsymbol{\beta},\sigma^2 I_n )\).
Notes: When the off diagonal entries of the covariance matrix of a MVN distribution are 0, the \(Y_1, ..., Y_n\) are independent.
Theorem 4.1 Let \(\hat{\boldsymbol{\beta}}\) be the OLS estimator of \(\boldsymbol{\beta}\). When the model assumptions hold:
\[\hat{\boldsymbol{\beta}} \sim N_p(\boldsymbol{\beta}, (\mathbf{X}^T\mathbf{X})^{-1}\sigma^2)\]Corollary: \(\hat{\beta}_j \sim N(\beta_j, c_{jj}\sigma^2)\), where \(c_{jj}\) is the \((j,j)\) entry of \((\mathbf{X}^T\mathbf{X})^{-1}\), for \(j = 0, ..., k\).
Theorem 4.2 Let \(\hat{\sigma}^2 = \frac{\mbox{SSE}}{n-p}\). When the model assumptions hold:
\[(n-p)\frac{\hat{\sigma}^2}{\sigma^2} \sim \chi ^2_{(n-p)}\]and
\[\mathbb{E}[\hat{\sigma}^2] =\sigma^2\]
Moreover, \(\hat{\sigma}^2\) is independent of \(\hat{\boldsymbol{\beta}}\).
Corollary:
\[\frac{\hat{\beta}_j - \beta_j }{\hat{\sigma} \sqrt{c_{jj}}} \sim t_{n-p},\] so we can carry out tests and construct confidence intervals for \(\beta_j\).
4.6 The hat matrix
The vector of fitted values:
\[\hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y} = \mathbf{H}\mathbf{Y}.\]
The hat matrix (also known as the projection matrix):
\[\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\]
has dimension \(n \times n\), is symmetric (\(\mathbf{H}^T = \mathbf{H}\)) and is idempotent (\(\mathbf{H}^2 = \mathbf{H}\mathbf{H} = \mathbf{H}\)).
We have
\[\hat{\mathbf{Y}}= \mathbf{H}\mathbf{Y}\]
\[\mathbf{e}= \mathbf{Y} - \hat{\mathbf{Y}} =\mathbf{Y} - \mathbf{H}\mathbf{Y} = (\mathbf{I} - \mathbf{H})\mathbf{Y}\]
\[\mbox{SSE} = \mathbf{e}^T\mathbf{e} = \mathbf{Y}^T (\mathbf{I} - \mathbf{H})\mathbf{Y}\]
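A sketch of these quantities in R, again assuming a design matrix X and response Y in the workspace:
H <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix
max(abs(H - t(H)))                      # ~0: symmetric (up to rounding)
max(abs(H %*% H - H))                   # ~0: idempotent (up to rounding)
Y.hat <- H %*% Y                        # fitted values
e <- (diag(nrow(X)) - H) %*% Y          # residuals
SSE <- as.numeric(t(Y) %*% (diag(nrow(X)) - H) %*% Y)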
4.6.1 The QR Decomposition of a matrix
We have seen that OLS estimates for \(\boldsymbol{\beta}\) can be found by using:
\[\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{Y}.\]
Inverting the \(\mathbf{X}^T\mathbf{X}\) matrix can introduce significant rounding errors into the calculations, so most software packages use the QR decomposition of the design matrix \(\mathbf{X}\) to compute the parameter estimates (see, for example, the documentation for the lm function in R).
How does this work?
We need to find an \(n \times p\) matrix \(\mathbf{Q}\) and a \(p \times p\) matrix \(\mathbf{R}\) such that:
\[\mathbf{X}=\mathbf{Q}\mathbf{R}\]
and
\(\mathbf{Q}\) has orthonormal columns, i.e. \(\mathbf{Q}^T\mathbf{Q} = \mathbf{I}_p\)
\(\mathbf{R}\) is an upper triangular matrix.
There are several methods for computing the \(\mathbf{Q}\mathbf{R}\) factorisation (we won’t study them, but high-quality code for the computation is available in the LAPACK library, http://www.netlib.org/lapack/lug/).
We can show that:
\[\begin{align*} \mathbf{X} &=\mathbf{Q}\mathbf{R} \\ \mathbf{X}^T\mathbf{X} &=(\mathbf{Q}\mathbf{R})^T(\mathbf{Q}\mathbf{R}) = \mathbf{R}^T\mathbf{R}\\ \end{align*}\]Then:
\[\begin{align*} (\mathbf{X}^T\mathbf{X})\hat{\boldsymbol{\beta}} & =\mathbf{X}^T\mathbf{Y}\\ (\mathbf{R}^T\mathbf{R})\hat{\boldsymbol{\beta}} & =\mathbf{R}^T\mathbf{Q}^T\mathbf{Y}\\ \mathbf{R}\hat{\boldsymbol{\beta}} & = \mathbf{Q}^T\mathbf{Y}\\ \end{align*}\]where the last line follows because \(\mathbf{R}^T\) is invertible when \(\mathbf{X}\) has full rank. Since \(\mathbf{R}\) is upper triangular, this final system is easily solved for \(\hat{\boldsymbol{\beta}}\) by back-substitution.
We can also show that the hat matrix becomes:
\[\mathbf{H} = \mathbf{Q}\mathbf{Q}^T\]
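A minimal sketch of the QR route in R (X and Y as before); the results should agree with the normal-equations solution:
qr.X <- qr(X)
Q <- qr.Q(qr.X)                         # n x p with orthonormal columns
R <- qr.R(qr.X)                         # p x p upper triangular
beta.hat <- backsolve(R, t(Q) %*% Y)    # solves R beta = Q^T Y
H <- Q %*% t(Q)                         # hat matrix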
4.7 ANOVA for multiple regression
Recap: the ANOVA decomposition \(\mbox{SST} = \mbox{SSR} + \mbox{SSE}\), where
\[\begin{align*} \mbox{SSR} & = \sum_{i=1}^n(\hat{y}_i - \bar{y})^2 = \sum_{i=1}^n \hat{y}_i ^2- n\bar{y}^2 \\ \mbox{SSE} & = \sum_{i=1}^n(y_i - \hat{y}_i)^2 = \sum_{i=1}^n e_i^2\\ \mbox{SST} & = \sum_{i=1}^n(y_i - \bar{y})^2 = \sum_{i=1}^n y_i ^2- n\bar{y}^2 \\ \end{align*}\]Proof: this follows from the decomposition of response = fit + residual.
\[\begin{align*} \mathbf{Y} & = \hat{\mathbf{Y}} + \mathbf{e}\\ \mathbf{Y}^T\mathbf{Y} & = (\hat{\mathbf{Y}} + \mathbf{e})^T (\hat{\mathbf{Y}} + \mathbf{e})\\ & = \hat{\mathbf{Y}}^T\hat{\mathbf{Y}} + \mathbf{e}^T\mathbf{e}+ 2\hat{\mathbf{Y}}^T\mathbf{e} \\ \end{align*}\]But \(\hat{\mathbf{Y}}^T = \hat{\boldsymbol{\beta}}^T\mathbf{X}^T\) and \(\mathbf{X}^T\mathbf{e} = 0\), from normal equations, so \(\hat{\mathbf{Y}}^T\mathbf{e} = 0\).
Alternatively: \(\hat{\mathbf{Y}} = \mathbf{H}\mathbf{Y}\), \(\mathbf{e} = (\mathbf{I} - \mathbf{H})\mathbf{Y}\), so \(\hat{\mathbf{Y}}^T\mathbf{e} = \mathbf{Y}^T\mathbf{H}(\mathbf{I} - \mathbf{H})\mathbf{Y} = 0\), since \(\mathbf{H}^2=\mathbf{H}\).
Therefore,
\[\mathbf{Y}^T\mathbf{Y} = \hat{\mathbf{Y}}^T\hat{\mathbf{Y}} + \mathbf{e}^T\mathbf{e}\]
\[\sum_{i=1}^n y_i^2 =\sum_{i=1}^n \hat{y}_i^2 + \sum_{i=1}^n e_i^2\]
and subtracting \(n\bar{y}^2\) from both sides completes the proof.
ANOVA table:
\[\begin{align*} \mbox{SSR} & = \sum_{i=1}^n(\hat{y}_i - \bar{y})^2, \;\;\;\; df =p-1 \\ \mbox{SSE} & = \sum_{i=1}^n(y_i - \hat{y}_i)^2, \;\;\;\; df = n-p \\ \mbox{SST} & = \sum_{i=1}^n(y_i - \bar{y})^2, \;\;\;\; df =n-1. \end{align*}\]SOURCE | df | SS | MS | F |
---|---|---|---|---|
Regression | p-1 | SSR | MSR = SSR/(p-1) | MSR/MSE |
Error | n-p | SSE | MSE = SSE/(n-p) | |
Total | n-1 | SST |
If \(\beta_1 = \beta_2 = ... = \beta_k = 0\) then \(\hat{\beta}_j \approx 0\) for \(j = 1,...,k\) and \(\hat{y}_i \approx \bar{y}\).
Then \(\mbox{SSE} \approx \mbox{SST}\) and \(\mbox{SSR} \approx 0\); small values of \(\mbox{SSR}\) relative to \(\mbox{SSE}\) are therefore consistent with \(\beta_1 = \beta_2 = ... = \beta_k = 0\).
\[\begin{align*} &H_0: \beta_1 = \beta_2 = ... = \beta_k = 0\\ &H_A: \mbox{ not all }\beta_j = 0 \mbox{ for }j = 1,...,k \end{align*}\]Under \(H_0\):
\[F= \frac{\mbox{SSR}/(p-1)}{\mbox{SSE}/(n-p)} \sim F_{(p-1, n-p)}\]
P-value is \(P( F_{(p-1, n-p)} \geq F_{obs})\), where \(F_{obs}\) is the observed \(F\)-value.
Coefficient of determination \(R^2 = \frac{\mbox{SSR}}{\mbox{SST}}\), \(0 \leq R^2 \leq 1\).
\(R^2\) is the proportion of variability in \(Y\) explained by regression on \(X_1,...,X_k\).
Adjusted \(R^2\) is a modified version of \(R^2\) that adjusts for the number of predictors in the model. R uses:
\[R^2_{Adj} = 1-(1- R^2)\frac{n-1}{n-p} = 1-(1- R^2)\frac{n-1}{n-k-1},\]
where, as before, \(p = k+1\) is the number of regression parameters.
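As a check, a sketch computing \(R^2\) and \(R^2_{Adj}\) directly from the ANOVA quantities for the two-predictor cigarette fit fit3 above:
e <- residuals(fit3)
y <- fitted(fit3) + e                        # recovers the response carbon.monoxide
SSE <- sum(e^2)
SST <- sum((y - mean(y))^2)
R2 <- 1 - SSE / SST                          # 0.8574
n <- length(y); p <- length(coef(fit3))
R2.adj <- 1 - (1 - R2) * (n - 1) / (n - p)   # 0.8444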
4.8 1-way ANOVA model
4.8.1 Example:
A study was carried out to examine the effects of caffeine. Thirty students were randomly assigned to one of:
- control, no caffeine
- low dose caffeine
- low dose caffeine plus sugar
The response \(y\) is an index measuring unrest 2 hrs later.
(Example from Draper and Smith (1966).)
Let \(y_{ij}\) be the response for the \(j^{th}\) person in the \(i^{th}\) group, \(j=1,...,10\), \(i=1,2,3\).
Let \(n_i\) be the number assigned to group \(i\).
Model:
- \(y_{ij} = \mu_{i} + \epsilon_{ij}\), \(\quad\) \(\epsilon_{ij}\) iid \(N(0, \sigma^2)\), where \(\mu_i\) is the population mean for those at dose \(i\).
Or equivalently:
- \(y_{ij} = \mu + \alpha_{i} + \epsilon_{ij}\), \(\quad\) \(\epsilon_{ij}\) iid \(N(0, \sigma^2)\), where \(\mu\) is the overall population mean and \(\alpha_i\) is the effect of receiving treatment \(i\).
OLS estimates for Model 1 are obtained by minimising \(S\):
\[\begin{align*} S(\mu_1, ..., \mu_g)& =\sum_{i=1}^g\sum_{j=1}^{n_i}\epsilon_{ij}^2 = \sum_{i=1}^g\sum_{j=1}^{n_i}(y_{ij}-\mu_i)^2,\\ \frac{\partial S(\mu_1, ..., \mu_g)}{\partial \mu_i}& = -2\sum_{j=1}^{n_i}(y_{ij}-\mu_i), \quad \forall i = 1,...,g. \\ \end{align*}\]Setting these partial derivatives equal to 0 and evaluating at \(\hat{\mu}_i\) gives:
\[\begin{align*} \sum_{j=1}^{n_i}(y_{ij}-\hat{\mu}_i) & =0.\\ \sum_{j=1}^{n_i}y_{ij}-n_i\hat{\mu}_i & =0.\\ \hat{\mu}_i =\sum_{j=1}^{n_i}y_{ij}/n_i & =\bar{y}_{i.}\\ \end{align*}\]NOTE: \(\bar{y}_{i.}\) is the average of responses at level \(i\) of \(X\).
Model 1 has \(g=3\) parameters, but Model 2 has 4 parameters and is over-parameterised (\(\mu_i = \mu + \alpha_{i}\)).
Usually a constraint such as \(\sum_i \alpha_i = 0\), or setting one of the \(\alpha_i\) to zero (e.g. \(\alpha_3 = 0\)), is imposed.
The hypothesis of interest in this model is:
\[\begin{align*} & H_0: \mu_1=\mu_2 = ...= \mu_g\\ & H_A: \mbox{not all } \mu_i \mbox{ are the same.}\\ \end{align*}\]Equivalently:
\[\begin{align*} & H_0: \alpha_i=0, \hspace{1cm}\forall i = 1,...,g\\ & H_A: \mbox{not all } \alpha_i = 0.\\ \end{align*}\]Calculations can be summarised in the ANOVA table:
SOURCE | df | SS | MS | F |
---|---|---|---|---|
Group | g-1 | SSG | MSG = SSG/(g-1) | MSG/MSE |
Error | n-g | SSE | MSE = SSE/(n-g) | |
Total | n-1 | SST |
where:
\(\mbox{SSG} = \sum_{i = 1}^gn_{i}(\bar{y}_{i.} - \bar{y}_{..})^{2}\)
\(\mbox{SSE} = \sum_{i=1}^g\sum_{j = 1}^{n_i}(y_{ij} - \bar{y}_{i.})^{2}\)
\(\mbox{SST} = \sum_{i=1}^g\sum_{j = 1}^{n_i}(y_{ij} - \bar{y}_{..})^{2}\)
Under \(H_{0}\): \[F_{obs} = \frac{\mbox{MSG}}{\mbox{MSE}} \sim F_{g-1, n-g}.\]
We reject \(H_{0}\) for large values of \(F_{obs}\).
4.9 One way ANOVA in regression notation
First we have to set up dummy variables:
\[X_i= \left\{ \begin{array}{ll} 1 & \quad\mbox{if the observation is in group }i\\ 0 & \quad\mbox{otherwise.}\end{array}\right.\]
Model: (effects model \(y_{ij} = \mu + \alpha_{i} + \epsilon_{ij}\))
\[Y = \mu + \alpha_1X_1 + \alpha_2X_2 + \alpha_3X_3 + \epsilon\]
\[\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\]
where \(\mathbf{Y}\) is a \(30 \times 1\) vector of responses and
\[\boldsymbol{\beta} = \begin{bmatrix} \mu \\ \alpha_{1} \\ \alpha_{2} \\ \alpha_{3} \end{bmatrix} \quad \quad \quad \mathbf{X} = \begin{bmatrix} 1 & 1 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & 0 & 1 \end{bmatrix}\]
Note that in the \(\mathbf{X}\) matrix the first column equals the sum of the second, third and fourth columns. Therefore \(\mathbf{X}\) is not of full rank, \(\mathbf{X}^T\mathbf{X}\) does not have an inverse, and there is no unique \(\hat{\boldsymbol{\beta}}\).
We could require \(\sum\alpha_i = 0\) and then the solution would be unique. Or, we could require that \(\alpha_3 = 0\) and drop the last column of \(\mathbf{X}\).
We could also derive a solution where the first column of \(\mathbf{X}\) is omitted. Then the model becomes:
\[Y = \mu_1X_1 + \mu_2X_2 + \mu_3X_3 + \epsilon\]
\[\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\]
where
\[\boldsymbol{\beta} =\begin{bmatrix} \mu_{1} \\ \mu_{2}\\ \mu_{3} \end{bmatrix}\quad \quad \quad \mathbf{X} = \begin{bmatrix} 1 & 0 & 0 \\ \vdots & \vdots & \vdots \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ \vdots & \vdots & \vdots \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ \vdots & \vdots & \vdots \\ 0 & 0 & 1 \end{bmatrix}\]
This is the means model \(y_{ij} = \mu_i + \epsilon_{ij}\).
\[\begin{align*} \hat{\boldsymbol{\beta}} & = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}\\ & = \begin{bmatrix} 10 & 0 & 0 \\ 0 & 10 & 0 \\ 0 & 0 & 10 \end{bmatrix}^{-1}\begin{bmatrix} \sum_{j=1}^{10} y_{1j} \\ \sum_{j=1}^{10} y_{2j}\\ \sum_{j=1}^{10} y_{3j} \end{bmatrix}\\ & = \frac{1}{10}\mathbf{I}_3\begin{bmatrix} y_{1.} \\ y_{2.}\\ y_{3.} \end{bmatrix}\\ & = \begin{bmatrix} \bar{y}_{1.} \\ \bar{y}_{2.}\\ \bar{y}_{3.} \end{bmatrix}\\ & = \begin{bmatrix} \hat{\mu}_{1} \\ \hat{\mu}_{2}\\ \hat{\mu}_{3} \end{bmatrix} \end{align*}\]The fitted values are then:
\[\hat{Y} = \hat{\mu}_1X_1 + \hat{\mu}_2X_2 + \hat{\mu}_3X_3\]
or
\[\hat{Y} = \hat{\mu}_i = \bar{y}_{i.}\]
if \(Y\) comes from group \(i\).
4.9.1 Fitting the model in MTB and R
One-way ANOVA: \(Y\) is the response and \(X\) is the group factor.
Using regression, make dummy variables \(X_1\), \(X_2\), \(X_3\) and:
- use predictors \(X_1\), \(X_2\), \(X_3\) with no intercept, or
- use predictors \(X_1\), \(X_2\) with intercept.
The second regression on dummy variables gives the same ANOVA table as the one way ANOVA table.
When the model does not include an intercept, the ANOVA table shows the uncorrected SS, i.e. \(\bar{y}\) is not subtracted:
SOURCE | df | SS |
---|---|---|
Regression | p | SSR |
Error | n-p | SSE |
Total | n | SST |
where \(\mbox{SSR} = \sum \hat{y}_i^2\), \(\mbox{SSE} = \sum(y_i - \hat{y}_i)^{2}\) and \(\mbox{SST} = \sum y_i^{2}\).
4.9.1.1 Example: Caffeine in MTB
In this example x is a factor, not a continuous variable; x1, x2 and x3 are the dummy variables that identify the groups in x.
One-way ANOVA: y versus x
Analysis of Variance
Source DF Adj SS Adj MS F-Value P-Value
x 2 61.40 30.700 6.18 0.006
Error 27 134.10 4.967
Total 29 195.50
Model Summary
S R-sq R-sq(adj) R-sq(pred)
2.22860 31.41% 26.33% 15.32%
Means
x N Mean StDev 95% CI
1 10 244.800 2.394 (243.354, 246.246)
2 10 246.400 2.066 (244.954, 247.846)
3 10 248.300 2.214 (246.854, 249.746)
Pooled StDev = 2.22860
Regression Analysis: y versus x1, x2, x3
The following terms cannot be estimated and were removed:
x3
Analysis of Variance
Source DF Seq SS Seq MS F-Value P-Value
Regression 2 61.40 30.700 6.18 0.006
x1 1 43.35 43.350 8.73 0.006
x2 1 18.05 18.050 3.63 0.067
Error 27 134.10 4.967
Total 29 195.50
Model Summary
S R-sq R-sq(adj) R-sq(pred)
2.22860 31.41% 26.33% 15.32%
Coefficients
Term Coef SE Coef T-Value P-Value
Constant 248.3 0.705 352.33 0.000
x1 -3.50 0.997 -3.51 0.002
x2 -1.90 0.997 -1.91 0.067
Regression Equation
y = 248.300 - 3.500 x1 - 1.900 x2
\(X_3\) is dropped. Compare this ANOVA table with the one-way ANOVA results. It is also possible to fit the model without explicitly specifying the dummy variables, by declaring \(x\) to be a categorical variable; the same model is fitted, but \(X_1\) is dropped instead.
Regression Analysis: y versus x1, x2, x3
Analysis of Variance
Source DF Seq SS Seq MS F-Value P-Value
Regression 3 1822929 607643 122344.22 0.000
x1 1 599270 599270 120658.47 0.000
x2 1 607130 607130 122240.86 0.000
x3 1 616529 616529 124133.34 0.000
Error 27 134 5
Total 30 1823063
Model Summary
S R-sq R-sq(adj) R-sq(pred)
2.22860 99.99% 99.99% 99.99%
Coefficients
Term Coef SE Coef T-Value P-Value VIF
x1 244.800 0.705 347.36 0.000 1.00
x2 246.400 0.705 349.63 0.000 1.00
x3 248.300 0.705 352.33 0.000 1.00
Regression Equation
y = 244.800 x1 + 246.400 x2 + 248.300 x3
This is the model fitted without an intercept, so the ANOVA table shows uncorrected sums of squares.
4.9.1.2 Example: Caffeine in R
#Tell R this is a categorical variable
coffee.data$x <- as.factor(coffee.data$x)
fit.oneway.anova <- aov(y~x, data = coffee.data)
summary(fit.oneway.anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## x 2 61.4 30.700 6.181 0.00616 **
## Residuals 27 134.1 4.967
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit.oneway.anova)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 2 61.4 30.7000 6.1812 0.006163 **
## Residuals 27 134.1 4.9667
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model.tables(fit.oneway.anova, "means")
## Tables of means
## Grand mean
##
## 246.5
##
## x
## x
## 1 2 3
## 244.8 246.4 248.3
# plot(fit.oneway.anova) #diagnostic plots
coefficients(fit.oneway.anova)
## (Intercept) x2 x3
## 244.8 1.6 3.5
#TukeyHSD(fit.oneway.anova, "x", ordered = TRUE)
#plot(TukeyHSD(fit.oneway.anova, "x"))
fit.coffee <- lm(y~x, data = coffee.data)
summary(fit.coffee)
##
## Call:
## lm(formula = y ~ x, data = coffee.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.400 -2.075 -0.300 1.675 3.700
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 244.8000 0.7047 347.359 < 2e-16 ***
## x2 1.6000 0.9967 1.605 0.12005
## x3 3.5000 0.9967 3.512 0.00158 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.229 on 27 degrees of freedom
## Multiple R-squared: 0.3141, Adjusted R-squared: 0.2633
## F-statistic: 6.181 on 2 and 27 DF, p-value: 0.006163
anova(fit.coffee)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 2 61.4 30.7000 6.1812 0.006163 **
## Residuals 27 134.1 4.9667
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# or using dummy variables
d1 <- as.numeric(coffee.data$x == 1)
d2 <- as.numeric(coffee.data$x == 2)
d3 <- as.numeric(coffee.data$x == 3)
fit.coffee.dummy <- lm(coffee.data$y ~ d1 +d2)
summary(fit.coffee.dummy)
##
## Call:
## lm(formula = coffee.data$y ~ d1 + d2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.400 -2.075 -0.300 1.675 3.700
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 248.3000 0.7047 352.326 < 2e-16 ***
## d1 -3.5000 0.9967 -3.512 0.00158 **
## d2 -1.9000 0.9967 -1.906 0.06730 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.229 on 27 degrees of freedom
## Multiple R-squared: 0.3141, Adjusted R-squared: 0.2633
## F-statistic: 6.181 on 2 and 27 DF, p-value: 0.006163
#no intercept
fit.coffee.dummy2 <- lm(coffee.data$y ~ d1 +d2 +d3 - 1)
summary(fit.coffee.dummy2)
##
## Call:
## lm(formula = coffee.data$y ~ d1 + d2 + d3 - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.400 -2.075 -0.300 1.675 3.700
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## d1 244.8000 0.7047 347.4 <2e-16 ***
## d2 246.4000 0.7047 349.6 <2e-16 ***
## d3 248.3000 0.7047 352.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.229 on 27 degrees of freedom
## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
## F-statistic: 1.223e+05 on 3 and 27 DF, p-value: < 2.2e-16
4.10 Confidence intervals and hypothesis tests for linear combinations of \(\boldsymbol{\beta}\)
From the theory of OLS:
\[\hat{\boldsymbol{\beta}} \sim N(\boldsymbol{\beta},(\mathbf{X}^T\mathbf{X})^{-1}\sigma^2)\]
and
\[\mathbf{c}^T\hat{\boldsymbol{\beta}} \sim N(\mathbf{c}^T\boldsymbol{\beta},\mathbf{c}^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{c}\sigma^2)\]
For the caffeine example (1 way ANOVA model): suppose we want to compare the treatment means with the control mean, that is, we want a CI for:
\[\frac{\mu_2+\mu_3}{2}-\mu_1\]
Let \(\mathbf{c}^T= (-1, 1/2, 1/2)\), \(\boldsymbol{\beta}^T = (\mu_1, \mu_2, \mu_3)\).
\[\mathbf{c}^T\boldsymbol{\beta} = -\mu_1+\mu_2/2+ \mu_3/2\]
is estimated by:
\[\mathbf{c}^T\hat{\boldsymbol{\beta}} = -\bar{y}_{1.}+\bar{y}_{2.}/2+ \bar{y}_{3.}/2\]
and the variance is:
\[\begin{bmatrix} -1 & 1/2 & 1/2 \end{bmatrix} \frac{1}{10}\mathbf{I}_3 \begin{bmatrix} -1 \\ 1/2 \\ 1/2\end{bmatrix}\sigma^2 = \frac{3}{20}\sigma^2\]
So the \(100(1- \alpha)\%\) CI is:
\[\mathbf{c}^T\hat{\boldsymbol{\beta}} \pm t_{27}(\alpha/2) \sqrt{\frac{3}{20}\hat{\sigma}^2}\]
The df is: \(n-g = 30-3 = 27\).
We could also test hypotheses, e.g. \(H_0: \mathbf{c}^T\boldsymbol{\beta} = 0\). Under \(H_0\), the test statistic is:
\[\frac{\mathbf{c}^T\hat{\boldsymbol{\beta}}}{\sqrt{\frac{3}{20}\hat{\sigma}^2}} \sim t_{27}.\]
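A sketch of this calculation in R, reusing the dummy-variable means-model fit from Section 4.9.1.2 (fit.coffee.dummy2):
c.vec <- c(-1, 1/2, 1/2)
est   <- sum(c.vec * coef(fit.coffee.dummy2))                     # c^T beta.hat
se    <- sqrt(as.numeric(t(c.vec) %*% vcov(fit.coffee.dummy2) %*% c.vec))
est + c(-1, 1) * qt(0.975, df = 27) * se                          # 95% CI
est / se                                                          # t statistic for H0: c^T beta = 0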
References
Bowerman, Bruce L., and Daniel Schafer. 1990. Linear Statistical Models. 2nd ed. Thomson Wadsworth.
Cryer, J.D., and R.B. Miller. 1991. Statistics for Business: Data Analysis and Modelling. Business Statistics. PWS-Kent.
Stapleton, James H. 2009. Linear Statistical Models. 2nd ed. Wiley Series in Probability and Statistics. Wiley.
Draper, Norman Richard, and Harry Smith. 1966. Applied Regression Analysis. Wiley Series in Probability and Mathematical Statistics. Wiley.