7 Special cases of multiple regression

7.1 Categorical and continuous predictors (binary categories)

7.1.0.1 Example (from Ramsey and Schafer (2002) pg 236, 245):

\(Y\): average number of flowers per plant (meadowfoam).
Light intensity: 150, 300, 450, 600, 750, 900 (\(\mu\) mol/\(m^2\)/sec)
Timing: Timing of onset of light treatment Early/Late. Coded 0/1.

Suppose data is in the table below (every 2nd row) and consider the following models:

Parallel lines model (model A):

\[\mathbb{E}(y)= \beta_0+\beta_1(timing)+ \beta_2 (light)\]

Separate lines model (model B):

\[\mathbb{E}(y)= \beta_0+\beta_1(timing)+ \beta_2 (light) + \beta_3 (timing \times light)\]

Give the design matrix and the parameter vector for both models
Test \(H_0: \beta_3 = 0\).

Flowers	Timing	Intensity
62.3	Early	150
77.4	Early	150
55.3	Early	300
54.2	Early	300
49.6	Early	450
61.9	Early	450

## The following objects are masked from flowers.data (pos = 9):
## 
##     Flowers, Intensity, Time, Timing

## The following objects are masked from flowers.data (pos = 16):
## 
##     Flowers, Intensity, Time, Timing

## The following objects are masked from flowers.data (pos = 23):
## 
##     Flowers, Intensity, Time, Timing

## The following objects are masked from flowers.data (pos = 30):
## 
##     Flowers, Intensity, Time, Timing

## The following objects are masked from flowers.data (pos = 37):
## 
##     Flowers, Intensity, Time, Timing

## The following objects are masked from flowers.data (pos = 44):
## 
##     Flowers, Intensity, Time, Timing

## The following objects are masked from flowers.data (pos = 51):
## 
##     Flowers, Intensity, Time, Timing

Parallel lines model:

\[\mathbb{E}(y)= \beta_0+\beta_1(timing)+ \beta_2 (light)\]

\[\mathbf{X} = \begin{bmatrix} 1 & 0 & 150 \\ 1 & 0 & 300\\ 1 & 0 & 450 \\ 1 & 0 & 600\\ 1 & 0 & 750\\ 1 & 0 & 900\\ 1 & 1 & 150 \\ 1 & 1 & 300\\ 1 & 1 & 450 \\ 1 & 1 & 600 \\ 1 & 1 & 750\\ 1 & 1 & 900 \\ \end{bmatrix}\]

\(\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2\\ \end{bmatrix}\)

Separate lines model:

\[\mathbb{E}(y)= \beta_0+\beta_1(timing)+ \beta_2 (light) + \beta_3 (timing \times light)\]

\[\mathbf{X} = \begin{bmatrix} 1 & 0 & 150 &0 \\ 1 & 0 & 300 &0 \\ 1 & 0 & 450 &0 \\ 1 & 0 & 600 &0 \\ 1 & 0 & 750 &0 \\ 1 & 0 & 900 &0 \\ 1 & 1 & 150 &150 \\ 1 & 1 & 300 &300 \\ 1 & 1 & 450 &450 \\ 1 & 1 & 600 &600 \\ 1 & 1 & 750 &750 \\ 1 & 1 & 900 &900 \\ \end{bmatrix}\]

\[\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2\\ \beta_3\\ \end{bmatrix}\]

To test \(H_0: \beta_3 = 0\), P-value = 0.910, so cannot reject \(H_0\) (See table of coefficients, output below).

Model A


  Regression Analysis: Flowers versus Time, Intensity
  
  Analysis of Variance
  
  Source         DF  Seq SS   Seq MS  F-Value  P-Value
  Regression      2  3466.7  1733.35    41.78    0.000
  Time            1   887.0   886.95    21.38    0.000
  Intensity       1  2579.8  2579.75    62.18    0.000
  Error          21   871.2    41.49
  Lack-of-Fit     9   215.3    23.92     0.44    0.889
  Pure Error     12   655.9    54.66
  Total          23  4337.9
  
  
  Model Summary
  
  S        R-sq       R-sq(adj)   R-sq(pred)
  6.44107  79.92%     78.00%      73.84%
  
  
  Coefficients
  
  Term           Coef  SE Coef  T-Value  P-Value   VIF
  Constant      71.31     3.27    21.78    0.000
  Time          12.16     2.63     4.62    0.000  1.00
  Intensity  -0.04047  0.00513    -7.89    0.000  1.00
  
  
  Regression Equation
  
  Flowers = 71.31 + 12.16 Time - 0.04047 Intensity

Model B

  Regression Analysis: Flowers versus Time, Intensity, TxI
  
  Analysis of Variance
  
  Source         DF   Seq SS   Seq MS  F-Value  P-Value
  Regression      3  3467.28  1155.76    26.55    0.000
  Time            1   886.95   886.95    20.37    0.000
  Intensity       1  2579.75  2579.75    59.26    0.000
  TxI             1     0.58     0.58     0.01    0.910
  Error          20   870.66    43.53
  Lack-of-Fit     8   214.73    26.84     0.49    0.841
  Pure Error     12   655.93    54.66
  Total          23  4337.94
  
  
  Model Summary
  
  S        R-sq       R-sq(adj)   R-sq(pred)
  6.59795  79.93%     76.92%      70.95%
  
  
  Coefficients
  
  Term           Coef  SE Coef  T-Value  P-Value   VIF
  Constant      71.62     4.34    16.49    0.000
  Time          11.52     6.14     1.88    0.075  5.20
  Intensity  -0.04108  0.00744    -5.52    0.000  2.00
  TxI          0.0012   0.0105     0.12    0.910  6.20
  
  
  Regression Equation
  
  Flowers = 71.62 + 11.52 Time - 0.04108 Intensity + 0.0012 TxI

Model A:

fit1 <- lm(Flowers ~ Intensity + Time)
summary(fit1)

## 
## Call:
## lm(formula = Flowers ~ Intensity + Time)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.652 -4.139 -1.558  5.632 12.165 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 71.305833   3.273772  21.781 6.77e-16 ***
## Intensity   -0.040471   0.005132  -7.886 1.04e-07 ***
## Time        12.158333   2.629557   4.624 0.000146 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.441 on 21 degrees of freedom
## Multiple R-squared:  0.7992, Adjusted R-squared:   0.78 
## F-statistic: 41.78 on 2 and 21 DF,  p-value: 4.786e-08

Model B:

fit2 <- lm(Flowers ~ Intensity * Time)
summary(fit2)

## 
## Call:
## lm(formula = Flowers ~ Intensity * Time)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.516 -4.276 -1.422  5.473 11.938 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    71.623333   4.343305  16.491 4.14e-13 ***
## Intensity      -0.041076   0.007435  -5.525 2.08e-05 ***
## Time           11.523333   6.142360   1.876   0.0753 .  
## Intensity:Time  0.001210   0.010515   0.115   0.9096    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.598 on 20 degrees of freedom
## Multiple R-squared:  0.7993, Adjusted R-squared:  0.7692 
## F-statistic: 26.55 on 3 and 20 DF,  p-value: 3.549e-07

model.matrix(fit2)

##    (Intercept) Intensity Time Intensity:Time
## 1            1       150    0              0
## 2            1       150    0              0
## 3            1       300    0              0
## 4            1       300    0              0
## 5            1       450    0              0
## 6            1       450    0              0
## 7            1       600    0              0
## 8            1       600    0              0
## 9            1       750    0              0
## 10           1       750    0              0
## 11           1       900    0              0
## 12           1       900    0              0
## 13           1       150    1            150
## 14           1       150    1            150
## 15           1       300    1            300
## 16           1       300    1            300
## 17           1       450    1            450
## 18           1       450    1            450
## 19           1       600    1            600
## 20           1       600    1            600
## 21           1       750    1            750
## 22           1       750    1            750
## 23           1       900    1            900
## 24           1       900    1            900
## attr(,"assign")
## [1] 0 1 2 3

7.2 Categorical and continuous predictors (more than two categories)

7.2.0.1 Example: (from Ramsey and Schafer (2002)):

\(Y\): Measure of energy
\(X_1\): Measure of weight
Group: Type of flyer (1,2,3). Z1, Z2, Z3 (dummy variables).

Parallel lines model (model A):

\[\mathbb{E}(y)= \beta_0+\beta_1 z_2+ \beta_2 z_3 + \beta_3 x_1\]

Separate lines model (model B): \[\mathbb{E}(y)= \beta_0+\beta_1 z_2+ \beta_2 z_3 + \beta_3 x_1 + \beta_4 x_1 z_2+ \beta_5 x_1 z_3\]

Hypothesis testing:

Test \(H_0: \beta_4 = \beta_5 = 0\) by comparing the two models using an F-test.
Test \(H_0: \beta_1 = \beta_2 = 0\) by comparing the parallel lines model to the model \(\mathbb{E}(y)= \beta_0+\beta_3 x_1\) using an F-test.
Give the design matrix and the parameter vector for both models.
Test \(H_0: \beta_4 = \beta_5 = 0\), i.e.

\(H_0:\) Model A is correct

\(H_A:\) Model B is preferable to Model A

\[\begin{align*} F & =\frac{(\mbox{SSE}(A)-\mbox{SSE}(B))/(k-q)}{\mbox{SSE}(B)/(n-p)}\\ & =\frac{(0.5533- 0.5049)/(5-3)}{0.5049/(20-6)}\\ & =\frac{0.0242}{0.0361}\\ & = 0.67.\\ \end{align*}\]

\(F_{(2,14)}(0.95) = 3.73 > 0.67\) so we cannot reject \(H_0\), model A is OK.

Test \(H_0: \beta_1 = \beta_2 = 0\), i.e. let model C = one group model:

\[\mathbb{E}(y)= \beta_0+ \beta_3 x_1 \]

\(H_0:\) Model C is correct

\(H_A:\) Model A is preferable to Model C

\[\begin{align*} F & =\frac{(\mbox{SSE}(C)-\mbox{SSE}(A))/(k-q)}{\mbox{SSE}(A)/(n-p)}\\ & =\frac{\mbox{SSR}(A|C)/(3-1)}{0.5533/(20-4)}\\ & =\frac{(0.0008+0.0288)/(3-1)}{0.5533/(20-4)}\\ & =\frac{0.0296/2}{0.0346}\\ & = 0.43\\ \end{align*}\]

We don’t need to see the fit for Model C, take Seq SS.

\(F_{(2,16)}(0.95) = 3.63 > 0.43\) so we cannot reject \(H_0\), model C is adequate.

OUTPUT: Model A




Regression Analysis: y versus x1, Z2, Z3

Analysis of Variance

Source      DF   Seq SS   Seq MS  F-Value  P-Value
Regression   3  29.4215   9.8072   283.59    0.000
x1           1  29.3919  29.3919   849.91    0.000
Z2           1   0.0288   0.0288     0.83    0.375
Z3           1   0.0008   0.0008     0.02    0.883
Error       16   0.5533   0.0346
Total       19  29.9748


Model Summary

S         R-sq       R-sq(adj)   R-sq(pred)
0.185963  98.15%     97.81%      97.30%


Coefficients

Term        Coef  SE Coef  T-Value  P-Value   VIF
Constant  -1.498    0.150    -9.99    0.000
x1        0.8150   0.0445    18.30    0.000  2.58
Z2        -0.079    0.203    -0.39    0.703  3.80
Z3         0.024    0.158     0.15    0.883  3.45


Regression Equation

y = -1.498 +0.8150x1 -0.079Z2 +0.024Z3

OUTPUT: Model B


Regression Analysis: y versus x1, Z2, Z3, Z2*x1, Z3*x1

Analysis of Variance

Source      DF   Seq SS   Seq MS  F-Value  P-Value
Regression   5  29.4699   5.8940   163.44    0.000
x1           1  29.3919  29.3919   815.04    0.000
Z2           1   0.0288   0.0288     0.80    0.387
Z3           1   0.0008   0.0008     0.02    0.886
Z2*x1        1   0.0452   0.0452     1.25    0.282
Z3*x1        1   0.0032   0.0032     0.09    0.770
Error       14   0.5049   0.0361
Total       19  29.9748


Model Summary

S         R-sq       R-sq(adj)     R-sq(pred)
0.189900  98.32%     97.71%      96.29%


Coefficients

Term        Coef  SE Coef  T-Value  P-Value     VIF
Constant  -1.471    0.248    -5.94    0.000
x1        0.8047   0.0867     9.28    0.000    9.37
Z2          1.27     1.29     0.99    0.341  146.62
Z3        -0.110    0.385    -0.29    0.779   19.70
Z2*x1     -0.215    0.224    -0.96    0.353  166.38
Z3*x1      0.031    0.103     0.30    0.770   41.94


Regression Equation

y = -1.471 +0.8047 x1 +1.27 Z2 -0.110 Z3-
- 0.215Z2 x1 +0.031Z3 x1

x1G <- flying.data$x1 * flying.data$G1
x2G <- flying.data$x1 * flying.data$G2
fitA <- lm(y ~ x1 + G1 + G2, data = flying.data)
summary(fitA)

## 
## Call:
## lm(formula = y ~ x1 + G1 + G2, data = flying.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.23224 -0.12199 -0.03637  0.12574  0.34457 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.49770    0.14987  -9.993 2.77e-08 ***
## x1           0.81496    0.04454  18.297 3.76e-12 ***
## G1          -0.07866    0.20268  -0.388    0.703    
## G2           0.02360    0.15760   0.150    0.883    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.186 on 16 degrees of freedom
## Multiple R-squared:  0.9815, Adjusted R-squared:  0.9781 
## F-statistic: 283.6 on 3 and 16 DF,  p-value: 4.464e-14

anova(fitA)

## Analysis of Variance Table
## 
## Response: y
##           Df  Sum Sq Mean Sq  F value    Pr(>F)    
## x1         1 29.3919 29.3919 849.9108 2.691e-15 ***
## G1         1  0.0288  0.0288   0.8327    0.3750    
## G2         1  0.0008  0.0008   0.0224    0.8828    
## Residuals 16  0.5533  0.0346                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

fitB <- lm(y ~ x1 + G1 + G2 + x1G + x2G, data = flying.data)
summary(fitB)

## 
## Call:
## lm(formula = y ~ x1 + G1 + G2 + x1G + x2G, data = flying.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.25152 -0.12643 -0.00954  0.08124  0.32840 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.47052    0.24767  -5.937 3.63e-05 ***
## x1           0.80466    0.08668   9.283 2.33e-07 ***
## G1           1.26807    1.28542   0.987    0.341    
## G2          -0.11032    0.38474  -0.287    0.779    
## x1G         -0.21487    0.22362  -0.961    0.353    
## x2G          0.03071    0.10283   0.299    0.770    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1899 on 14 degrees of freedom
## Multiple R-squared:  0.9832, Adjusted R-squared:  0.9771 
## F-statistic: 163.4 on 5 and 14 DF,  p-value: 6.696e-12

anova(fitB)

## Analysis of Variance Table
## 
## Response: y
##           Df  Sum Sq Mean Sq  F value    Pr(>F)    
## x1         1 29.3919 29.3919 815.0383 8.265e-14 ***
## G1         1  0.0288  0.0288   0.7986    0.3866    
## G2         1  0.0008  0.0008   0.0215    0.8855    
## x1G        1  0.0452  0.0452   1.2543    0.2816    
## x2G        1  0.0032  0.0032   0.0892    0.7696    
## Residuals 14  0.5049  0.0361                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(fitA, fitB)

## Analysis of Variance Table
## 
## Model 1: y ~ x1 + G1 + G2
## Model 2: y ~ x1 + G1 + G2 + x1G + x2G
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     16 0.55332                           
## 2     14 0.50487  2   0.04845 0.6718 0.5265

fitC <- lm(y ~ x1, data = flying.data)
#summary(fitC)
#anova(fitC)

7.3 Quadratic terms and interactions

Example from Ramsey and Schafer (2002) pg 252. The data on corn yields and rainfall are in `RainfallData.csv’, or library(Sleuth3) in ‘ex0915’. Variables:

Yield: corn yield (bushels/acre)
Rainfall: rainfall (inches/year)
Year: year.

Link: http://www.rpubs.com/kdomijan/332466

7.4 An example with two continuous and two categorical predictors

FEV data - for a full description see http://ww2.amstat.org/publications/jse/v13n2/datasets.kahn.html.

Response variable: fev (forced expiratory volume) measures respiratory function.

Predictors: age, height, gender and smoke.

The dataset is in library(covreg).

Link: http://www.rpubs.com/kdomijan/325930

References

Ramsey, Fred, and Daniel Schafer. 2002. The Statistical Sleuth: A Course in Methods of Data Analysis. 2nd ed. Duxbury Press.