Chapter 7 Multivariate OLS: Where the Action Is

7.1 Computing Corner

Packages needed for this chapter.
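A minimal sketch of the library calls for the packages used in this chapter, assuming each is installed:

```r
library(magrittr)  # pipes: %>% and the expose pipe %$%
library(broom)     # tidy() and glance() for model output
library(estimatr)  # lm_robust() for robust standard errors
library(car)       # vif() and linearHypothesis()
library(lm.beta)   # lm.beta() for standardized coefficients
```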

In this chapter you will learn the basics of estimating multivariate OLS models.

7.1.1 Multiple Regression

To estimate a multiple regression (a regression with more than one independent variable), use the same function lm but change the formula argument to include the additional variables. In a simple regression, the formula argument was of the form y ~ x. In a multiple regression, the formula argument takes the form y ~ x1 + x2. To include additional variables, extend the formula in the same manner: y ~ x1 + x2 + x3 + .... The remaining arguments are the same as in the simple regression. You can assign the results to an object just as with a simple regression. The output will again be a list of 12, but the components of the list will change to reflect the additional variable(s).
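For example, a minimal sketch assuming a hypothetical data frame df with columns y, x1, x2, and x3 (the names are only there to illustrate the syntax):

```r
# Fit a multiple regression of y on x1, x2, and x3
ols <- lm(y ~ x1 + x2 + x3, data = df)

# ols is the same kind of lm object (a list) described for simple regression
summary(ols)
```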

To make use of the results, you can use any of the functions described in Chapter 3 of this manual. You can also make use of any of the subsetting commands.

Estimate a regression with robust standard errors using lm_robust() from the estimatr package with the same style of formula argument.
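A corresponding sketch with robust standard errors, under the same hypothetical df:

```r
library(estimatr)

# Same formula, robust standard errors
ols_robust <- lm_robust(y ~ x1 + x2 + x3, data = df)
summary(ols_robust)
```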

7.1.2 Multicollinearity

You can estimate the VIFs directly with the vif() function from the car package. To estimate the VIFs, call ols %>% vif(), where ols is the object you created with the lm call.

7.1.3 Standardized Coefficients

Estimate standardized regression coefficients with lm.beta() from the lm.beta package: ols %>% lm.beta().

7.1.4 F tests

F tests in econometrics are generally about the joint significance of multiple variables. Suppose we estimate the regression on \(i=1,2,\ldots,n\) observations. \[y_i=\beta_0+\beta_1x_{1,i}+\beta_2x_{2,i}+\cdots+\beta_mx_{m,i}+\beta_{m+1}x_{m+1,i}+\cdots+\beta_kx_{k,i} + \epsilon_i\]

To test the joint significance of \(\beta_1,\ldots,\beta_m\) in the model we would use an F test to perform the following hypothesis test: \[H_0: \beta_1=\beta_2=\cdots=\beta_m=0\] \[H_1:\text{at least one }\beta_j\ne0\]

An F test essentially compares the residual sums of squares under the null and alternative hypotheses. If the difference is large enough relative to the unrestricted residual variance, we have evidence to reject the null hypothesis in favor of the alternative hypothesis. The mechanics of the test are as follows:

  1. Estimate the model without imposing the null hypothesis, that is, the model above; call it the unrestricted model. Retrieve its residual sum of squares, \(rss_u\). The residuals from the unrestricted model will have \(n-k-1\) degrees of freedom. The unrestricted model, U, is: \[\text{U: }y_i=\beta_0+\beta_1x_{1,i}+\beta_2x_{2,i}+\cdots+\beta_mx_{m,i}+\beta_{m+1}x_{m+1,i}+\cdots+\beta_kx_{k,i} + \epsilon_i\]

  2. Estimate the model that holds under the null hypothesis, that is, restrict the model so that the null hypothesis holds. That restricted model, R, is \[\text{R: }y_i=\beta_0+\beta_{m+1}x_{m+1,i}+\beta_{m+2}x_{m+2,i}+\cdots+\beta_kx_{k,i} + \eta_i\] Retrieve its residual sum of squares, \(rss_r\). The residuals from the restricted model will have \(n-(k-m)-1\) degrees of freedom.

  3. Calculate the difference in the residual sums of squares, \(rss_r - rss_u\), and divide by its degrees of freedom \(q = [n-(k-m)-1]-(n-k-1) = m\). So, q is the number of restrictions. A simple way to calculate the number of restrictions is to count the number of equal signs, \(=\), in the null hypothesis.

  4. Calculate \(rss_u/(n-k-1)\).

  5. Divide the result from 3 by the result from 4. This will give you an F statistic with \(q=m\) and \(n-k-1\) degrees of freedom.

\[F_c=\frac{\frac{rss_r-rss_u}{q}}{\frac{rss_u}{n-k-1}}\]
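A minimal sketch of these five steps in R, assuming a hypothetical data frame df with outcome y and regressors x1 through x4, and testing \(H_0:\beta_1=\beta_2=0\) (so \(m=2\) restrictions with \(k=4\) regressors):

```r
# Step 1: unrestricted model and its residual sum of squares
ols_u <- lm(y ~ x1 + x2 + x3 + x4, data = df)
rss_u <- sum(resid(ols_u)^2)

# Step 2: restricted model (x1 and x2 dropped under the null) and its RSS
ols_r <- lm(y ~ x3 + x4, data = df)
rss_r <- sum(resid(ols_r)^2)

# Steps 3-5: F statistic with q and n - k - 1 degrees of freedom
q   <- df.residual(ols_r) - df.residual(ols_u)  # number of restrictions
df2 <- df.residual(ols_u)                       # n - k - 1
F_c <- ((rss_r - rss_u) / q) / (rss_u / df2)
pf(F_c, q, df2, lower.tail = FALSE)             # p-value of the test
```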

The F-test (Wald test) can be used for any number of restrictions on the unrestricted model. For example, suppose we would like to know if a production function with a Cobb-Douglas form has constant returns to scale. The Cobb-Douglas function for output as a function of labor and capital takes the form \[q=al^\alpha k^\beta\epsilon\]. If constant returns to scale hold, \(\alpha+\beta=1\). So we test the following hypothesis: \[H_0:\alpha+\beta=1\] \[H_1:\alpha+\beta\ne1\]

To test this hypothesis, form the unrestricted and restricted forms of the model, estimate the models, retrieve the sums of squared residuals, and calculate the F statistic. In the form presented above, the Cobb-Douglas model is not linear in the parameters, so it can’t be estimated with OLS. We can make it linear in the parameters by taking the logarithm of both sides: \[\ln(q)=\ln(al^\alpha k^\beta\epsilon)\] \[\text{U: }\ln(q)=\gamma+\alpha \ln(l)+\beta\ln(k)+\eta\] where \(\gamma=\ln(a)\) and \(\eta=\ln(\epsilon)\).

Form the restricted model by imposing the null hypothesis on the parameters. From the null hypothesis, \(\beta=1-\alpha\). Substituting for \(\beta\) in the unrestricted model yields the restricted model. \[\text{R: }\ln(q)-\ln(k)=\gamma+\alpha[\ln(l)-\ln(k)]+\eta\]

Since \(q=1\), the F-stat is: \[F_c=\frac{rss_r-rss_u}{\frac{rss_u}{n-k-1}}\]

The degrees of freedom are \(q=1\) (the number of equal signs in the null hypothesis) and \(n-k-1\).

7.1.4.1 F-test for overall significance

Estimate the model \(y_i=\beta_0+\beta_1x_{1,i}+\beta_2x_{2,i}+\cdots+\beta_kx_{k,i}+\epsilon_i\). Test the hypothesis \[H_0: \beta_1=\beta_2=\cdots=\beta_k=0\] \[H_1:\text{at least one }\beta_j\ne0\]

If we reject the null hypothesis, we can say that we have explained some variation in \(y\) with variation in at least one of the \(x's\). In other words, we have a model that is significant. If we fail to reject the null hypothesis, our model has no explanatory power. There is no need to calculate the F-statistic to perform this test because it is reported as a matter of course by the base R call summary or by glance from the broom package. The degrees of freedom are \(q=k\) (the number of coefficients estimated minus 1) and \(n-k-1\).

summary will report the F-statistic, its degrees of freedom (numerator and denominator), and the p-value. glance reports the F as “statistic”, the p-value as “p.value”, \(k+1\) as “df”, and \(n-k-1\) as “df.residual”. Note that this test is also a test for the significance of \(R^2\).

7.1.4.2 F-test of linear restrictions

The tests we performed above are tests of linear restrictions on the parameters. These hypotheses can be tested directly using linearHypothesis from the car package. Performing a test of linear restrictions using linearHypothesis requires two arguments: model and hypothesis.matrix.

Let the unrestricted model be \[y=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3+\epsilon\] Estimate the model as ols_u <- df %$% lm(y ~ x1 + x2 + x3), where df is the data frame containing the data.

Let’s test the hypothesis \(\beta_2=\beta_3=0\) versus at least one of the \(\beta's\ne0\) using linearHypothesis(model = ols_u, hypothesis.matrix = c("x2 = 0", "x3 = 0")). The result will be an F-test on the restrictions. The F-statistic, its degrees of freedom, and p-value will be returned.

Let’s test a linear restriction on the Cobb-Douglas model above. Estimate the model as ols_u <- df %$% lm(log(q) ~ log(l) + log(k)). To test the hypothesis \(\alpha=\beta\), pipe ols_u into linearHypothesis with the argument c("log(l) = log(k)"): ols_u %>% linearHypothesis(c("log(l) = log(k)")). Again, the F-statistic, its degrees of freedom, and p-value will be returned.

7.1.5 Examples

The Motor Trend Car Road Test (mtcars) data set is part of the datasets package in base R. The data were extracted from the 1974 Motor Trend US magazine and comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models). See ?mtcars for more information on the data. data(mtcars) will load the data into your global environment as mtcars. We will perform each of the F-tests described above: overall significance, joint significance of a subset of variables, and a linear restriction on the coefficients.

7.1.5.1 Multiple Regression

Suppose we want to estimate the mpg as a function of the number of cylinders, the displacement, and the gross horsepower, then our (unrestricted) model is \[mpg=\beta_0+\beta_1cyl+\beta_2disp+\beta_3hp+\epsilon\].

Let’s estimate the unrestricted model using the expose pipe %$% both with and without robust errors.
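A minimal sketch of the two calls (the robust fit is written here with an explicit data argument, which is equivalent):

```r
library(magrittr)
library(broom)
library(estimatr)

# Classical standard errors, via the expose pipe
mtcars %$% lm(mpg ~ cyl + disp + hp) %>% tidy()

# Robust standard errors
lm_robust(mpg ~ cyl + disp + hp, data = mtcars) %>% tidy()
```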

# A tibble: 4 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  34.2       2.59       13.2  1.54e-13
2 cyl          -1.23      0.797      -1.54 1.35e- 1
3 disp         -0.0188    0.0104     -1.81 8.09e- 2
4 hp           -0.0147    0.0147     -1.00 3.25e- 1
         term estimate std.error statistic           p.value conf.low
1 (Intercept)  34.1849    2.4700     13.84 0.000000000000048  29.1253
2         cyl  -1.2274    0.5967     -2.06 0.049121075438813  -2.4498
3        disp  -0.0188    0.0083     -2.27 0.031138440490781  -0.0358
4          hp  -0.0147    0.0109     -1.34 0.190818678697032  -0.0371
  conf.high df outcome
1  39.24451 28     mpg
2  -0.00506 28     mpg
3  -0.00183 28     mpg
4   0.00775 28     mpg

7.1.5.2 Multicollinearity

Using the model above \[mpg=\beta_0+\beta_1cyl+\beta_2disp+\beta_3hp+\epsilon\].

We can calculate the VIFs as follows:
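A minimal sketch, assuming the classical lm fit from the multiple-regression example above:

```r
library(car)
library(magrittr)

ols <- mtcars %$% lm(mpg ~ cyl + disp + hp)
ols %>% vif()
```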

 cyl disp   hp 
6.73 5.52 3.35 
 cyl disp   hp 
3.67 2.90 2.71 

7.1.5.3 Standardized Regression Coefficients

Using the model \[mpg=\beta_0+\beta_1cyl+\beta_2disp+\beta_3hp+\epsilon\], estimate standardized regression coefficients as follows:
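A minimal sketch using lm.beta():

```r
library(lm.beta)
library(magrittr)

mtcars %$% lm(mpg ~ cyl + disp + hp) %>% lm.beta()
```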


Call:
lm(formula = mpg ~ cyl + disp + hp)

Standardized Coefficients::
(Intercept)         cyl        disp          hp 
      0.000      -0.364      -0.387      -0.167 

7.1.5.4 F-test for Overall Significance

Suppose we want to estimate the mpg as a function of the number of cylinders, the displacement, and the gross horsepower, then our (unrestricted) model is \[mpg=\beta_0+\beta_1cyl+\beta_2disp+\beta_3hp+\epsilon\].

Let’s estimate the unrestricted model using the expose pipe %$% (see the sketch below).

The test for overall significance is: \[H_0:\beta_1=\beta_2=\beta_3=0\] \[H_1: \text{at least one }\beta_j\ne0\]

Recall that the F-test is reported as a matter of course in summary from base R and glance from the broom package.
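A minimal sketch of the two calls:

```r
library(magrittr)
library(broom)

ols_u <- mtcars %$% lm(mpg ~ cyl + disp + hp)
ols_u %>% summary()  # base R: F-statistic reported on the last line
ols_u %>% glance()   # broom: F as "statistic", its p-value as "p.value"
```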


Call:
lm(formula = mpg ~ cyl + disp + hp)

Residuals:
   Min     1Q Median     3Q    Max 
-4.089 -2.085 -0.774  1.397  6.918 

Coefficients:
            Estimate Std. Error t value         Pr(>|t|)    
(Intercept)  34.1849     2.5908   13.19 0.00000000000015 ***
cyl          -1.2274     0.7973   -1.54            0.135    
disp         -0.0188     0.0104   -1.81            0.081 .  
hp           -0.0147     0.0147   -1.00            0.325    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.06 on 28 degrees of freedom
Multiple R-squared:  0.768, Adjusted R-squared:  0.743 
F-statistic: 30.9 on 3 and 28 DF,  p-value: 0.00000000505
# A tibble: 1 x 11
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
1     0.768         0.743  3.06      30.9 5.05e-9     4  -79.0  168.  175.
# ... with 2 more variables: deviance <dbl>, df.residual <int>

So we see that \(F=30.877\), \(q=3\), and \(df2=28\). The critical F with \(\alpha=.05\) is \(2.947\). Since the calculated F-stat is greater than the critical F-stat, we reject \(H_0\) in favor of \(H_1\). That is, the explanatory power of the model is statistically significant.

7.1.5.5 F-test of Joint Significance

Suppose we’d like to know whether adding the weight (wt), the number of gears (gear), and the number of carburetors (carb) together increases the explanatory power of the model at the \(\alpha=.05\) level of significance. Our unrestricted model becomes: \[mpg=\beta_0+\beta_1cyl+\beta_2disp+\beta_3hp+\beta_4wt+\beta_5gear+\beta_6carb+\eta\].

The null and alternative hypotheses are: \[H_0:\beta_4=\beta_5=\beta_6=0\] \[H_1:\text{at least one }\beta_j\ne0\]
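7.1.5.6 Perform the Test “Manually”

A minimal sketch of the manual calculation, assuming the mtcars data and the object names used above:

```r
library(magrittr)

# Unrestricted and restricted fits
ols_u <- mtcars %$% lm(mpg ~ cyl + disp + hp + wt + gear + carb)
ols_r <- mtcars %$% lm(mpg ~ cyl + disp + hp)

# Residual sums of squares
rss_u <- sum(resid(ols_u)^2)
rss_r <- sum(resid(ols_r)^2)

# F statistic with q = 3 and n - k - 1 degrees of freedom
q   <- df.residual(ols_r) - df.residual(ols_u)  # 3 restrictions
df2 <- df.residual(ols_u)                       # n - k - 1
F_c <- ((rss_r - rss_u) / q) / (rss_u / df2)
F_c
qf(.95, q, df2)                      # critical F at alpha = .05
pf(F_c, q, df2, lower.tail = FALSE)  # p-value
```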

7.1.5.7 Perform the test with linearHypothesis
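A minimal sketch of the call, assuming ols_u is the unrestricted fit above:

```r
library(car)
library(magrittr)

ols_u %>% linearHypothesis(c("wt = 0", "gear = 0", "carb = 0"))
```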

Linear hypothesis test

Hypothesis:
wt = 0
gear = 0
carb = 0

Model 1: restricted model
Model 2: mpg ~ cyl + disp + hp + wt + gear + carb

  Res.Df RSS Df Sum of Sq   F Pr(>F)   
1     28 261                           
2     25 166  3      95.5 4.8  0.009 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Of course, this gives the same result as the manual calculation.

7.1.5.8 Test of Linear Restrictions

Let the model be \[\ln(mpg)=\beta_0+\beta_1\ln(cyl)+\beta_2\ln(wt)+\epsilon\]. Suppose we’d like to test \[H_0:\beta_1+\beta_2=-1\] against \[H_1:\beta_1+\beta_2\ne-1\]

7.1.5.8.1 Perform the Test “Manually”

Form the restricted model under \(H_0\). If \(H_0\) holds, \(\beta_2=-1-\beta_1\). Substituting into the unrestricted model yields the restricted model: \[\text{R: }\ln(mpg)+\ln(wt)=\beta_0+\beta_1(\ln(cyl)-\ln(wt))+\eta\]
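A minimal sketch of the manual calculation, using I() to build the transformed variables in the formulas:

```r
library(magrittr)

# Unrestricted and restricted fits
ols_u <- mtcars %$% lm(log(mpg) ~ log(cyl) + log(wt))
ols_r <- mtcars %$% lm(I(log(mpg) + log(wt)) ~ I(log(cyl) - log(wt)))

rss_u <- sum(resid(ols_u)^2)
rss_r <- sum(resid(ols_r)^2)

q   <- 1                     # one restriction
df2 <- df.residual(ols_u)    # n - k - 1
F_c <- ((rss_r - rss_u) / q) / (rss_u / df2)
F_c
qf(.95, q, df2)                      # critical F at alpha = .05
pf(F_c, q, df2, lower.tail = FALSE)  # p-value
```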

[1] 1.29
[1] 4.18
[1] 0.266

Since 1.289 is less than 4.183, we fail to reject \(H_0\) and conclude that we have no evidence to suggest that \(\beta_1+\beta_2\ne-1\). We can also see that the p-value for our calculated F-statistic is 0.266. Since this is greater than \(\alpha=.05\), we fail to reject \(H_0\).

7.1.5.9 Perform the test with linearHypothesis
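A minimal sketch of the call, assuming ols_u is the unrestricted fit above:

```r
library(car)
library(magrittr)

ols_u %>% linearHypothesis("log(cyl) + log(wt) = -1")
```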

Linear hypothesis test

Hypothesis:
log(cyl)  + log(wt) = - 1

Model 1: restricted model
Model 2: log(mpg) ~ log(cyl) + log(wt)

  Res.Df   RSS Df Sum of Sq    F Pr(>F)
1     30 0.419                         
2     29 0.401  1    0.0178 1.29   0.27