Chapter 5 Traditional Approach

5.1 Introduction to Problem

I use a multiple linear regression model to predict house prices, based on a sample data set from the website Kaggle (2015). The data contains property sale prices for transactions between May 2014 and May 2015 in King County, USA, which includes Seattle.

I start by looking at any online literature on house price prediction. This “meta analysis” has two purposes:

  • It helps select variables to include in the model
  • It adds credibility to my findings if I incorporate and build on the empirical evidence of previous research studies

I then perform a formal model selection procedure. This involves:

  • Variable selection
  • Model fitting
  • Diagnostics
  • Selection between competing models

Having selected a final model, I then use the model to make predictions about property prices.

5.2 Meta Analysis/Literature Review

My data set only includes information on micro variables. These micro variables relate to the intrinsic features of a property and its immediate environment. By contrast, macro variables are excluded. These relate to the “external environment” of the property (see examples below).

By excluding macro variables, the parameter estimates of my model could be biased. For example, the data comes from the period from May 2014 to May 2015 when the external environment in the US was relatively stable. This means that the level of noise in the data could be artificially low and the explanatory power of micro variables overstated.

To mitigate this risk, I performed a quick online literature search and identified micro variables that previous research studies have found to be significant predictors of property price. I have used these variables as the starting point of the model selection process. This should lead to as much overlap as possible between my model and previous studies, and it stops me relying entirely on a statistical model selection procedure, which is vulnerable to bias.

5.2.0.1 Micro variable Examples

  • Property location
  • Building features
  • Interior layout
  • Quality of fixtures and fittings

5.2.0.2 Macro variable Examples

  • National and local government policies
  • The state of the mortgage market
  • Interest rates
  • Unemployment levels in the job market
  • Demographics

5.3 Model Selection

5.3.1 Correlation Matrix

The correlation plot below shows that no pair of the data-set variables identified from the meta analysis is highly correlated (correlation above 0.8 in absolute value). This means they can all be included in the model without causing multi-collinearity issues or non-convergence of the model fitting algorithm.

The full correlation plot for the data-set is shown below. The bottom row shows the correlations between the data-set variables and LogSalePrice. The five variables most strongly correlated with LogSalePrice are LivingSpace, Grade, AboveGroundFloorArea, NumberOfBathrooms and View, all of which are positively correlated with it.
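The pairwise screen described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names `pearson` and `high_correlation_pairs` are my own, and the toy values are invented for the example (they are not taken from the Kaggle data and were chosen so that exactly one pair is flagged).

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def high_correlation_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged

# Invented toy values: TotalArea tracks LivingSpace closely, Condition does not.
toy = {
    "LivingSpace": [1000, 2000, 3000, 4000, 5000, 6000],
    "TotalArea": [2000, 4000, 6000, 8000, 10000, 13000],
    "Condition": [3, 1, 3, 1, 3, 1],
}
flagged = high_correlation_pairs(toy)
print(flagged)  # only the LivingSpace / TotalArea pair exceeds the threshold
```

An empty result from a screen like this corresponds to the situation reported for the meta-analysis variables, where no pair exceeds the threshold.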

5.3.2 Model Fitting

Model 1 was fitted using only the variables identified in the meta-analysis. Models 2 to 5 were obtained using a step-wise AIC procedure to automatically select new models. Model 5 is where the procedure terminated.

Automated Model Selection

                        Model 1     Model 2     Model 3     Model 4     Model 5
(Intercept)             17.36***    17.14***    16.98***    -65.15***   -65.07***
                        (0.22)      (0.22)      (0.22)      (8.94)      (8.93)
ConstructionYear        -0.00***    -0.00***    -0.00***    -0.00***    -0.00***
                        (0.00)      (0.00)      (0.00)      (0.00)      (0.00)
LivingSpace             0.00***     0.00***     0.00***     0.00***     0.00***
                        (0.00)      (0.00)      (0.00)      (0.00)      (0.00)
NumberOfFloors          0.03***     0.03***     0.03***     0.03***     0.02***
                        (0.01)      (0.00)      (0.00)      (0.00)      (0.01)
SeattleFlag             0.22***     0.22***     0.21***     0.21***     0.22***
                        (0.01)      (0.01)      (0.01)      (0.01)      (0.01)
RenovationYear          0.00***     0.00***     0.00***     0.00***     0.00***
                        (0.00)      (0.00)      (0.00)      (0.00)      (0.00)
TotalArea               0.00***     0.00***     0.00***     0.00***     0.00***
                        (0.00)      (0.00)      (0.00)      (0.00)      (0.00)
NumberOfBedrooms        -0.02***    -0.02***    -0.01***    -0.01***    -0.01***
                        (0.00)      (0.00)      (0.00)      (0.00)      (0.00)
NumberOfBathrooms       0.07***     0.06***     0.06***     0.06***     0.06***
                        (0.00)      (0.00)      (0.00)      (0.00)      (0.00)
Condition               0.06***     0.06***     0.06***     0.06***     0.06***
                        (0.00)      (0.00)      (0.00)      (0.00)      (0.00)
Grade                   0.22***     0.22***     0.21***     0.21***     0.21***
                        (0.00)      (0.00)      (0.00)      (0.00)      (0.00)
WaterfrontView                      0.52***     0.39***     0.39***     0.39***
                                    (0.02)      (0.03)      (0.03)      (0.03)
View                                            0.04***     0.04***     0.04***
                                                (0.00)      (0.00)      (0.00)
SaleYear                                                    0.04***     0.04***
                                                            (0.00)      (0.00)
AboveGroundFloorArea                                                    0.00*
                                                                        (0.00)
\(R^{2}\)               0.66        0.67        0.67        0.67        0.67
Adj. \(R^{2}\)          0.66        0.67        0.67        0.67        0.67
Num. obs.               21436       21436       21436       21436       21436
RMSE                    0.31        0.30        0.30        0.30        0.30

Standard errors in parentheses. *** p < 0.001, ** p < 0.01, * p < 0.05
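To make the step-wise AIC procedure concrete, here is a minimal forward-selection sketch in plain Python. It is an illustration under stated assumptions, not the routine used to produce the table: it only ever adds variables, it uses the Gaussian AIC up to an additive constant, and the helper names (`ols_rss`, `aic`, `forward_stepwise`) and the toy columns are hypothetical. The toy data is built so that one candidate genuinely drives the response and the other is unrelated to it.

```python
from math import log

def ols_rss(columns, y):
    """Residual sum of squares of an OLS fit of y on an intercept plus columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gaussian elimination with partial pivoting.
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[pivot] = A[pivot], A[k]
        for r in range(k + 1, p):
            factor = A[r][k] / A[k][k]
            for c in range(k, p + 1):
                A[r][c] -= factor * A[k][c]
    # Back-substitution for the coefficients.
    beta = [0.0] * p
    for k in reversed(range(p)):
        tail = sum(A[k][c] * beta[c] for c in range(k + 1, p))
        beta[k] = (A[k][p] - tail) / A[k][k]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2
               for i in range(n))

def aic(rss, n, n_params):
    """Gaussian AIC up to an additive constant: n * log(RSS / n) + 2k."""
    return n * log(rss / n) + 2 * n_params

def forward_stepwise(start, candidates, data, y):
    """Greedily add the candidate that lowers AIC most; stop when none does."""
    n = len(y)
    selected = list(start)
    remaining = [c for c in candidates if c not in selected]
    best = aic(ols_rss([data[v] for v in selected], y), n, len(selected) + 1)
    while remaining:
        scores = []
        for cand in remaining:
            cols = [data[v] for v in selected + [cand]]
            scores.append((aic(ols_rss(cols, y), n, len(cols) + 1), cand))
        score, cand = min(scores)
        if score >= best:  # no candidate improves AIC: terminate
            break
        best = score
        selected.append(cand)
        remaining.remove(cand)
    return selected

# Invented toy data: the response depends on Grade and View, not on Unrelated.
n = 420
data = {
    "Grade": [i % 5 for i in range(n)],
    "View": [i % 4 for i in range(n)],
    "Unrelated": [i % 3 for i in range(n)],
}
noise = [((i % 7) - 3) * 0.05 for i in range(n)]
log_price = [10 + 0.2 * data["Grade"][i] + 0.3 * data["View"][i] + noise[i]
             for i in range(n)]

selected = forward_stepwise(["Grade"], ["View", "Unrelated"], data, log_price)
print(selected)
```

Starting from a base model and adding one variable per step mirrors how Models 2 to 5 each extend the previous model by one term. The procedure stops as soon as no remaining candidate lowers the AIC.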

The \(R^{2}\) value of Model 1 is 0.66. This seems relatively low, with the regression fit explaining only about 66% of the total variation in LogSalePrice about its mean.

The \(R^{2}\) value of Model 5 is 0.67. This also seems relatively low and is only a small improvement on Model 1. The improvement is, however, statistically significant: in a formal F-test we would reject, at the 0.1% level, the null hypothesis that the four additional variables in Model 5 have no explanatory power (see below).

Statistical Test Model 5 vs Model 1

    Model   Residual DoF   RSS       DoF difference   Sum of squares   Pr(>F)
1   Fit1    21425.00       2018.00
2   Fit5    21421.00       1952.00   4.00             65.65            0.00
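The F statistic behind this comparison can be recovered from the printed figures. The sketch below assumes the standard partial F-test for nested least-squares models; the function name is my own, and because the inputs are the rounded values from the table, the result is approximate.

```python
def partial_f_statistic(sum_of_sq, dof_diff, rss_large, dof_large):
    """F statistic for nested OLS models.

    sum_of_sq : reduction in RSS achieved by the larger model
    dof_diff  : number of extra parameters in the larger model
    rss_large : residual sum of squares of the larger model
    dof_large : residual degrees of freedom of the larger model
    """
    return (sum_of_sq / dof_diff) / (rss_large / dof_large)

# Rounded values as printed in the table above (Fit5 versus Fit1).
f_stat = partial_f_statistic(65.65, 4, 1952.00, 21421)
print(round(f_stat, 1))  # roughly 180
```

An F statistic of this size on 4 and 21421 degrees of freedom is consistent with the reported probability, which rounds to 0.00.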

I don’t think the improvement in fit in Model 5 is worth the four extra variables required, so I will use Model 1. For Model 1, it remains to be investigated why the coefficients for NumberOfBedrooms and ConstructionYear have the opposite sign to their correlations with LogSalePrice in the correlation matrix.
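One possible explanation for such a sign change, which I have not verified for this data, is that a regression coefficient measures the effect of a variable with the other predictors held fixed, whereas a correlation does not. The toy example below is entirely invented and its helper names are hypothetical. It shows a predictor that is positively correlated with the response yet has a negative coefficient once a correlated predictor is included.

```python
from math import sqrt

def centred(v):
    """Subtract the mean from every element."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def correlation(a, b):
    ca, cb = centred(a), centred(b)
    return dot(ca, cb) / sqrt(dot(ca, ca) * dot(cb, cb))

def two_predictor_slopes(x1, x2, y):
    """OLS slopes of y on x1 and x2 (with intercept) via the 2x2 normal equations."""
    c1, c2, cy = centred(x1), centred(x2), centred(y)
    s11, s22, s12 = dot(c1, c1), dot(c2, c2), dot(c1, c2)
    s1y, s2y = dot(c1, cy), dot(c2, cy)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2

# Invented data: bedrooms rise with living space, and for a fixed living
# space an extra bedroom lowers the (log) price.
living = [i % 5 for i in range(20)]
bedrooms = [i % 5 + i % 4 for i in range(20)]
log_price = [2.0 * l - 0.5 * b for l, b in zip(living, bedrooms)]

marginal_r = correlation(bedrooms, log_price)
b_living, b_bedrooms = two_predictor_slopes(living, bedrooms, log_price)
print(marginal_r > 0, b_bedrooms < 0)  # both True for this toy data
```

Whether this mechanism accounts for NumberOfBedrooms and ConstructionYear in Model 1 would need to be checked against the actual data.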

5.3.3 Diagnostics

The chart below presents the key visual checks on whether the assumptions of the model are met. There is nothing to suggest that the residuals have a trend. The variability of the residuals does not change with the fitted value, and the residuals appear to follow a normal distribution. Two observations, “6022” and “2061”, have high leverage (influence on their own fitted value) and a high standardised residual (a larger deviation from zero than expected). Neither has a Cook’s distance greater than 0.5, so neither has a material impact on the fitted coefficient values.

Figure 5.1: Visual Checks on Model 1
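The leverage and Cook’s distance checks discussed above can be illustrated on a small scale. The sketch below is restricted to a regression with a single predictor, where leverage has a simple closed form. The function name and the toy data are invented, and the last point is deliberately placed far from the others so that it breaches the 0.5 threshold used in the text.

```python
def simple_ols_diagnostics(x, y):
    """Leverage and Cook's distance for y = a + b*x fitted by least squares."""
    n, p = len(x), 2  # p counts the intercept and the slope
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - p)  # residual variance estimate
    # Leverage: diagonal of the hat matrix for a single-predictor model.
    leverage = [1 / n + (a - mx) ** 2 / sxx for a in x]
    # Cook's distance combines residual size and leverage.
    cooks = [e ** 2 / (p * s2) * h / (1 - h) ** 2
             for e, h in zip(residuals, leverage)]
    return leverage, cooks

# Invented toy data: the last observation is far from the rest in x and
# does not follow the trend of the first seven points.
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.1, 12.0]

leverage, cooks = simple_ols_diagnostics(x, y)
influential = [i for i, d in enumerate(cooks) if d > 0.5]
print(influential)  # indices of points whose Cook's distance exceeds 0.5
```

In this toy case only the final point is flagged. For Model 1, by contrast, no observation exceeds the threshold.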

References

Kaggle. 2015. “This Dataset Contains House Sale Prices for King County.” https://www.kaggle.com/harlfoxem/housesalesprediction.

Galati, Gabriele, Federica Teppa, and Rob JM Alessie. 2011. “Macro and Micro Drivers of House Price Dynamics: An Application to Dutch Data.”

Candas, Ezgi, Seda Bagdatli Kalkan, and Tahsin Yomralioglu. 2015. “Determining the Factors Affecting Housing Prices.”

RightMove. 2017. “Positive and Negative Impacts on House Prices.” http://www.rightmove.co.uk/what-affects-house-prices.html.