6.6 Exercises
1. The Boston Housing Data Set
Throughout this section, you will work with Boston, the Boston Housing data set, which contains 506 observations on housing values in suburbs of Boston. Boston ships with the package MASS, which is already installed for the interactive R exercises below.
Instructions:
Load both the package and the data set.
Get an overview of the data using function(s) known from the previous chapters.
Estimate a simple linear regression model that explains the median house value of districts (medv) by the percent of households with low socioeconomic status, lstat, and a constant. Save the model to bh_mod.
Print a coefficient summary to the console that reports robust standard errors.
Hint:
You only need basic R functions here: library(), data(), lm() and coeftest().
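The following is a minimal sketch of one way to approach these steps, not the reference solution. It assumes that coeftest() comes from lmtest and that the robust covariance estimator is vcovHC() from sandwich (both are attached when AER is loaded); the choice type = "HC1" is an assumption made for illustration.

```r
library(MASS)      # provides the Boston data set
library(lmtest)    # coeftest(); assumed available (attached by AER)
library(sandwich)  # vcovHC(); assumed available (attached by AER)

data(Boston)

# Quick overview of the data
str(Boston)
head(Boston)

# Simple regression of medv on lstat and a constant
bh_mod <- lm(medv ~ lstat, data = Boston)

# Coefficient summary with heteroskedasticity-robust standard errors
# (HC1 is an assumed choice; other types are possible)
coeftest(bh_mod, vcov. = vcovHC(bh_mod, type = "HC1"))
```

Passing a covariance matrix through the vcov. argument only changes the reported standard errors, t-statistics and p-values; the coefficient estimates are the same as in summary(bh_mod).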
2. A Multiple Regression Model of Housing Prices I
Now, let us extend the approach from the previous exercise by adding regressors to the model and estimating it again.
As discussed in Chapter 6.3, adding regressors to the model improves the fit, so the $SSR$ decreases and the $R^2$ increases.
The packages AER and MASS have been loaded. The model object bh_mod is available in the environment.
Instructions:
Regress the median housing value in a district, medv, on the average age of the buildings, age, the per-capita crime rate, crim, the percentage of individuals with low socioeconomic status, lstat, and a constant. Put differently, estimate the model
$$medv_i = \beta_0 + \beta_1 age_i + \beta_2 crim_i + \beta_3 lstat_i + u_i.$$
Print a coefficient summary to the console that reports robust standard errors for the augmented model.
The $R^2$ of the simple regression model is stored in R2_res. Save the multiple regression model's $R^2$ to R2_unres and check whether the augmented model yields a higher $R^2$. Use < or > for the comparison.
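The comparison of the two $R^2$ values could be sketched as follows. This is an illustration under the assumption that both models are re-estimated from scratch; the object names follow the exercise text.

```r
library(MASS)  # provides the Boston data set
data(Boston)

# Restricted (simple) and unrestricted (augmented) models
bh_mod      <- lm(medv ~ lstat, data = Boston)
bh_mult_mod <- lm(medv ~ lstat + crim + age, data = Boston)

# R^2 of both models, taken from the summary objects
R2_res   <- summary(bh_mod)$r.squared
R2_unres <- summary(bh_mult_mod)$r.squared

# TRUE if the augmented model has the higher R^2
R2_unres > R2_res
```

Because the simple model is nested in the augmented one, OLS cannot produce a lower $R^2$ for the larger model, so the comparison illustrates a mechanical property rather than evidence of a better model.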
3. A Multiple Regression Model of Housing Prices II
The equation below describes the estimated model from Exercise 2 (heteroskedasticity-robust standard errors in parentheses).
This model is saved in bh_mult_mod which is available in the working environment.
Instructions:
As stressed in Chapter 6.3, it is not meaningful to use $R^2$ when comparing regression models with a different number of regressors. Instead, the adjusted $R^2$, $\bar{R}^2$, should be used. $\bar{R}^2$ adjusts for the fact that the $SSR$ decreases whenever a regressor is added to the model.
Use the model object to compute the correction factor $CF = \frac{n-1}{n-k-1}$, where $n$ is the number of observations and $k$ is the number of regressors, excluding the intercept. Save it to CF.
Use summary() to obtain $R^2$ and $\bar{R}^2$ for bh_mult_mod. It is sufficient if you print both values to the console.
Check that $\bar{R}^2 = 1 - (1 - R^2) \cdot CF$. Use the == operator.
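A hedged sketch of the correction-factor computation is given below. It assumes bh_mult_mod is re-estimated as in Exercise 2 and uses all.equal() as a tolerance-based alternative to the == operator requested in the exercise, since exact comparison of floating-point numbers can be fragile.

```r
library(MASS)  # provides the Boston data set
data(Boston)

bh_mult_mod <- lm(medv ~ lstat + crim + age, data = Boston)

# n: number of observations; k: number of regressors without the intercept
n <- nobs(bh_mult_mod)
k <- length(coef(bh_mult_mod)) - 1

# Correction factor (n - 1) / (n - k - 1)
CF <- (n - 1) / (n - k - 1)

# R^2 and adjusted R^2 as reported by summary()
R2     <- summary(bh_mult_mod)$r.squared
adj_R2 <- summary(bh_mult_mod)$adj.r.squared

# Check the identity adj_R2 = 1 - (1 - R2) * CF up to numerical tolerance
isTRUE(all.equal(1 - (1 - R2) * CF, adj_R2))
```

Since CF is larger than one whenever $k \geq 1$, the adjusted measure is always below the unadjusted $R^2$ for a model with regressors.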
4. A Fully-Fledged Model for Housing Values?
Have a look at the description of the variables contained in the Boston data set. Which variable would you expect to have the highest $p$-value in a multiple regression model which uses all remaining variables as regressors to explain medv?
Instructions:
Regress medv on all remaining variables that you find in the Boston data set.
Obtain a heteroskedasticity-robust summary of the coefficients.
Recall the $\bar{R}^2$ of the model in Exercise 3. What can you say about the $\bar{R}^2$ of the large regression model? Does this model improve on the previous one (no code submission needed)?
The packages AER and MASS as well as the data set Boston are loaded to the working environment.
Hints:
For brevity, use the regression formula medv ~. in your call of lm(). This is a shortcut that specifies a regression of medv on all the remaining variables in the data set supplied to the argument data.
Use summary() on both models to compare their $\bar{R}^2$ values.
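The hints above might translate into something like the following sketch. The use of vcovHC() with type = "HC1" is an assumption, as is re-estimating the model from Exercise 3 for the comparison.

```r
library(MASS)      # provides the Boston data set
library(lmtest)    # coeftest(); assumed available (attached by AER)
library(sandwich)  # vcovHC(); assumed available (attached by AER)
data(Boston)

# Model from Exercise 3 and the large model using all remaining variables
bh_mult_mod <- lm(medv ~ lstat + crim + age, data = Boston)
full_mod    <- lm(medv ~ ., data = Boston)

# Heteroskedasticity-robust coefficient summary (HC1 is an assumed choice)
robust_summary <- coeftest(full_mod, vcov. = vcovHC(full_mod, type = "HC1"))
robust_summary

# Robust p-values sorted from largest to smallest
sort(robust_summary[, "Pr(>|t|)"], decreasing = TRUE)

# Adjusted R^2 of both models for comparison
summary(bh_mult_mod)$adj.r.squared
summary(full_mod)$adj.r.squared
```

Sorting the robust p-values is only a convenience for answering the question about the least significant regressor; reading them off the printed summary works just as well.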
5. Model Selection
Maybe we can improve the model by dropping a variable?
In this exercise, you have to estimate several models, each time dropping one of the explanatory variables used in the large regression model of Exercise 4, and compare the $\bar{R}^2$ values.
The full regression model from the previous exercise, full_mod, is available in your environment.
Instructions:
You are completely free in solving this exercise. We recommend the following approach:
1. Start by estimating a model, mod_new, say, where, e.g., lstat is excluded from the explanatory variables. Next, access the $\bar{R}^2$ of this model.
2. Compare the $\bar{R}^2$ of this model to the $\bar{R}^2$ of the full model.
3. Repeat Steps 1 and 2 for all explanatory variables used in the full regression model. Save the model with the highest improvement in $\bar{R}^2$ to better_mod.
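The recommended procedure can be automated with a short loop. The sketch below is one possible implementation, not the reference solution; the helper names regressors, adj_R2_drop and best_drop are hypothetical and introduced only for illustration.

```r
library(MASS)  # provides the Boston data set
data(Boston)

full_mod   <- lm(medv ~ ., data = Boston)
regressors <- setdiff(names(Boston), "medv")

# Adjusted R^2 of each model that omits exactly one regressor
adj_R2_drop <- sapply(regressors, function(v) {
  f <- reformulate(setdiff(regressors, v), response = "medv")
  mod_new <- lm(f, data = Boston)
  summary(mod_new)$adj.r.squared
})

# Change in adjusted R^2 relative to the full model, largest first
sort(adj_R2_drop - summary(full_mod)$adj.r.squared, decreasing = TRUE)

# Re-estimate the model whose omitted regressor gives the highest adjusted R^2
best_drop  <- names(which.max(adj_R2_drop))
better_mod <- lm(reformulate(setdiff(regressors, best_drop), response = "medv"),
                 data = Boston)
```

Building each formula with reformulate() avoids typing thirteen separate lm() calls; doing the steps by hand, as the exercise suggests, leads to the same comparison.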