Data Analytics Module
Lecturer: Hans van der Zwan
Handout 05
Topic: multiple regression
Literature
Rumsey D. J. (2010). Statistical Essentials for Dummies. Hoboken: Wiley Publishing.
Ismay C. & Kim A. Y. (2019). ModernDive. Statistical Inference for Data Science. https://moderndive.com.
Recommended literature
Preparation class
See module description
A simple linear regression model is a model with a numeric Y-variable and one X-variable. Such a model can easily be extended to a multiple regression model, i.e. a model with more than one X-variable. Assessing such a model doesn’t differ from assessing a simple linear regression model.
The buyer of a new car has to pay a special tax. The amount of this special tax depends on several factors. The aim of this example is to find a model with which the amount of the special tax for a Toyota can be estimated, based on different characteristics of the car. For this purpose a random sample has been drawn from the Toyotas registered in the Netherlands, reference date 2019-06-12; see file toyota_sample.csv.
The sample contains 400 observations on 15 variables.
As a first analysis the correlation coefficients between some of the numeric variables have been calculated. MS Excel: Data/Data Analysis/Correlation.
Figure 1. Correlation matrix generated with MS Excel.
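For readers who prefer a scripting language to the Excel menu, a correlation matrix like the one in Figure 1 can be sketched in Python with NumPy. This is only an illustration: the data below are randomly generated stand-ins, not the values in toyota_sample.csv, and the coefficients used to generate them are arbitrary assumptions.

```python
import numpy as np

# Synthetic stand-in data; these are NOT the values from toyota_sample.csv.
rng = np.random.default_rng(seed=1)
n = 400
mass = rng.normal(1200, 200, n)                              # kg (assumed scale)
catalog_price = 15 * mass + rng.normal(0, 3000, n)           # euro (assumed relation)
special_tax = 0.15 * catalog_price + rng.normal(0, 1500, n)  # euro (assumed relation)

names = ["MASS", "CATALOG_PRICE", "SPECIAL_TAX"]
data = np.column_stack([mass, catalog_price, special_tax])

# rowvar=False: each column is a variable, each row an observation.
corr = np.corrcoef(data, rowvar=False)

# Print the matrix with one labelled row per variable.
for name, row in zip(names, corr):
    print(f"{name:>14}", " ".join(f"{r:6.3f}" for r in row))
```

The matrix is symmetric with ones on the diagonal, so only the values below (or above) the diagonal need to be read.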
As can be seen in Figure 1, the variable SPECIAL_TAX has the highest correlation with CATALOG_PRICE. That’s why the first regression model is a simple linear regression model with CATALOG_PRICE as explanatory variable.
Figure 2. Scatterplot, SPECIAL_TAX in euro against CATALOG_PRICE in euro.
A first model uses CATALOG_PRICE as explanatory variable.
Figure 3. MS Excel output simple linear regression model with SPECIAL_TAX as response variable and CATALOG_PRICE as explanatory variable. In the data file the catalog price is not available (NA) for three observations; these observations have to be removed from the data before the regression model can be generated with Excel.
In this model R\(^2\) equals 0.415, so 41.5% of the variation in the special tax values is explained by the model. Catalog price is a highly significant variable in this model, p < .001. The standard error of the estimate is 2839 euro, i.e. 104.9% of the average SPECIAL_TAX value. This means that, although catalog price is a significant explanatory variable, the model is not very useful as a predictive model.
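The quantities used in this assessment (R\(^2\), the standard error of the estimate, and that standard error as a fraction of the mean response) can be computed by hand. The sketch below uses synthetic data with three missing prices to mimic the NA handling; the numbers it produces are not those of Figure 3, and the generating coefficients are arbitrary assumptions.

```python
import numpy as np

# Synthetic stand-in data; NOT the values from toyota_sample.csv.
rng = np.random.default_rng(seed=2)
n = 400
catalog_price = rng.normal(25000, 6000, n)                         # euro (assumed)
special_tax = 0.2 * catalog_price - 2000 + rng.normal(0, 2500, n)  # euro (assumed)
catalog_price[:3] = np.nan  # mimic three unavailable catalog prices

# Remove observations with a missing explanatory value.
keep = ~np.isnan(catalog_price)
x, y = catalog_price[keep], special_tax[keep]

# Least-squares fit of y = b0 + b1 * x (polyfit returns the slope first).
b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)

sse = np.sum(residuals ** 2)          # unexplained variation
sst = np.sum((y - y.mean()) ** 2)     # total variation
r_squared = 1 - sse / sst

# Standard error of the estimate, with n - 2 degrees of freedom.
se_estimate = np.sqrt(sse / (len(y) - 2))
relative_se = se_estimate / y.mean()

print(f"R^2 = {r_squared:.3f}")
print(f"standard error of the estimate = {se_estimate:.0f}")
print(f"as a fraction of the mean response = {relative_se:.1%}")
```

A relative standard error close to or above 100% indicates that typical prediction errors are as large as the typical value being predicted.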
A second model uses MASS as explanatory variable.
Figure 4. MS Excel output simple linear regression model with SPECIAL_TAX as response variable and MASS as explanatory variable.
In this model R\(^2\) equals 0.424, so 42.4% of the variation in the special tax values is explained by the model. Mass is a highly significant variable in this model, p < .001. However, the standard error of the estimate is greater than in the first model, so overall this model is not as good as the first model.
In a third model two explanatory variables are used: MASS and CATALOG_PRICE. This model is sometimes written as SPECIAL_TAX ~ MASS + CATALOG_PRICE.
Figure 5. MS Excel output linear regression model with SPECIAL_TAX as response variable and CATALOG_PRICE and MASS as explanatory variables.
Although this model uses two explanatory variables which are both moderately correlated with the response variable SPECIAL_TAX, the model is not much better than the simple linear models, as can be seen by comparing the R\(^2\) values. The reason is that the two explanatory variables are highly correlated with each other, so the second variable adds little information that the first does not already provide. In general, it is preferable to use explanatory variables which are not correlated with each other.
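The effect of correlated explanatory variables can be illustrated with a small simulation. In the sketch below the data are synthetic (not the Toyota sample) and are constructed so that the two explanatory variables are strongly correlated; the helper function r_squared is introduced here only for illustration.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of a least-squares fit of y on the columns of X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return 1 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)

# Synthetic stand-in data; NOT the values from toyota_sample.csv.
rng = np.random.default_rng(seed=3)
n = 400
mass = rng.normal(1200, 200, n)
catalog_price = 20 * mass + rng.normal(0, 1000, n)         # strongly tied to mass
special_tax = 0.1 * catalog_price + rng.normal(0, 500, n)  # assumed relation

r2_price = r_squared(catalog_price, special_tax)
r2_mass = r_squared(mass, special_tax)
r2_both = r_squared(np.column_stack([mass, catalog_price]), special_tax)

print(f"correlation MASS vs CATALOG_PRICE: {np.corrcoef(mass, catalog_price)[0, 1]:.3f}")
print(f"R^2, CATALOG_PRICE only:          {r2_price:.3f}")
print(f"R^2, MASS only:                   {r2_mass:.3f}")
print(f"R^2, MASS + CATALOG_PRICE:        {r2_both:.3f}")
```

Adding a variable can never lower R\(^2\), but when the added variable is nearly a copy of one already in the model, the gain is negligible.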
The fourth model makes use of a dummy variable ELECTRIC (1 if FUEL_DESCRPTION = “ELECTRICITY”, 0 if FUEL_DESCRPTION <> “ELECTRICITY”): SPECIAL_TAX ~ CATALOG_PRICE + ELECTRIC.
Figure 6. MS Excel output linear regression model with SPECIAL_TAX as response variable and CATALOG_PRICE and ELECTRIC as explanatory variables. ELECTRIC is a dummy variable which takes on value 1 if the car is an electric car and 0 otherwise.
This model is a clear improvement on the simple regression model with CATALOG_PRICE as the only explanatory variable.
A special case of multiple regression analysis, used for longitudinal data, is panel data analysis.
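Dummy coding and the resulting regression can be sketched as follows. The data are synthetic, the fuel labels other than "ELECTRICITY" and the size of the effects are arbitrary assumptions, and the fitted coefficients say nothing about the actual Toyota sample.

```python
import numpy as np

# Synthetic stand-in data; NOT the values from toyota_sample.csv.
rng = np.random.default_rng(seed=4)
n = 400
catalog_price = rng.normal(25000, 6000, n)
fuel = rng.choice(["PETROL", "ELECTRICITY"], size=n, p=[0.9, 0.1])

# Dummy coding: 1 for electric cars, 0 otherwise.
electric = (fuel == "ELECTRICITY").astype(float)

# Assumed toy relation: electric cars pay 3000 euro less at the same price.
special_tax = (0.2 * catalog_price - 1500
               - 3000 * electric
               + rng.normal(0, 500, n))

# Fit SPECIAL_TAX ~ CATALOG_PRICE + ELECTRIC by least squares.
design = np.column_stack([np.ones(n), catalog_price, electric])
coef, *_ = np.linalg.lstsq(design, special_tax, rcond=None)
b0, b_price, b_electric = coef

print(f"intercept:     {b0:.0f}")
print(f"CATALOG_PRICE: {b_price:.3f}")
print(f"ELECTRIC:      {b_electric:.0f}")
```

The coefficient of the dummy variable is the estimated difference in special tax between an electric and a non-electric car with the same catalog price.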
See for instance this presentation.