Section 31 Conclusion
There were two key ideas in this Dissertation: the application of Multivariate Analysis techniques to Business Analytics, and the empowerment of Analysts to build and publish their own Analytics Software Tools. To weave both themes together, I used an illustrative Business Problem to develop a tool that accurately models a Property’s Value.
As a foundation for what followed, I started with a Theory Section. This provided a summarised introduction to Multivariate Analysis. Concepts such as a Random Sample, a Linear Predictor Function and the Mean Squared Error criterion were introduced. The analytical properties of five distinct Multivariate Statistical Techniques were also presented.
Subsequently, this document was organised with a linear flow that mirrored the structure of an Analytics project:
I performed significant data analysis to explore and enrich the data and to remove outliers. This involved a wide range of plots to visualise the data as well as a multivariate clustering technique. Data enrichment was performed using online data providers such as Google and Zillow.
I applied statistical models to the data. I sought a lower dimensional description of the data using Factor Analysis; I used MANOVA to test whether differences between high and low value properties were statistically significant; I used correlation analysis to investigate the overlap between the original and enriched data sets; and I used a Classical Linear Regression model to model property prices.
I developed an interactive app. This contained the enriched data-set and allowed the user to select and fit linear models. The fitted model and actual values for each property were displayed on maps. On a separate tab the statistical properties of the fitted model were summarised.
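The modelling steps listed above can be sketched in a few lines of R. This is a minimal illustration, not the analysis code used in this Dissertation: it uses the built-in mtcars data as a stand-in for the property data, and the variable roles (mpg standing in for price, the high/low split at the median, the two blocks of variables) are assumptions made purely for illustration.

```r
# Stand-in data: the built-in mtcars data set, NOT the property data of the thesis.
# mpg plays the role of "price"; the other columns play the role of property features.
dat <- mtcars

# Classical linear regression of the response on two predictors.
fit <- lm(mpg ~ wt + hp, data = dat)

# In-sample mean squared error of the fitted model.
mse <- mean(residuals(fit)^2)

# MANOVA: do the high and low "value" groups differ jointly on several features?
# The split at the median is an assumption for illustration only.
dat$value_group <- factor(ifelse(dat$mpg > median(dat$mpg), "high", "low"))
man <- manova(cbind(wt, hp, disp) ~ value_group, data = dat)
print(summary(man, test = "Wilks"))

# Canonical correlation between two blocks of variables,
# standing in for the original and the enriched variable sets.
cc <- cancor(dat[, c("wt", "hp")], dat[, c("disp", "qsec")])
print(cc$cor)
```

All functions used here (lm, manova, cancor) ship with base R in the stats package, so the sketch needs no additional packages.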
When comparing fitted and original price data in the app, it is clear that there are some spatial patterns which have not been sufficiently modelled. A logical next step is to investigate these features with computationally intensive models. Hierarchical clustering methods or data mining are one possibility and are discussed in Chapters 12 and 13 of Johnson and Wichern (2014). Another possibility is to apply spatial models from epidemiology such as those discussed in Chapter 7 of Bivand et al. (2008).
The geospatial data in this project was appended onto the original through enrichment. Ideally, any further investigation would start by collecting geospatial data directly or with a data-set that already contains geospatial data. This would avoid the errors and approximations introduced by using open source data providers and ultimately improve the accuracy of a fitted model.
31.1 Reproducibility
Source Files for Reproducibility
This project has been written using the R package Bookdown. For reproducibility, all source files and data are available on my GitHub repository. To recreate this document, clone the repository and save it in the Thesis folder at the following location: C:\ThesisSource\bookdown-demo-master.
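Once the repository is in place, the rebuild step might look like the sketch below. It assumes the book's entry file is called index.Rmd (as in the bookdown-demo template) and that HTML output in the gitbook format is wanted; neither is confirmed here, so adjust both to match the repository.

```r
# Location quoted in the text above; change it if the repository was saved elsewhere.
book_dir <- "C:/ThesisSource/bookdown-demo-master"

# Assumption: the entry file is index.Rmd, as in the bookdown-demo template.
index_file <- file.path(book_dir, "index.Rmd")

if (requireNamespace("bookdown", quietly = TRUE) && file.exists(index_file)) {
  # Render from inside the book directory, then restore the working directory.
  old_wd <- setwd(book_dir)
  bookdown::render_book("index.Rmd", output_format = "bookdown::gitbook")
  setwd(old_wd)
} else {
  message("bookdown or ", index_file, " not found; nothing rendered")
}
```

The guard means the snippet does nothing harmful on a machine where the package or the folder is missing.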
Session info for Reproducibility
The R session info when compiling this book is shown below:
## R version 3.3.2 (2016-10-31)
## Platform: i386-w64-mingw32/i386 (32-bit)
## Running under: Windows 10 x64 (build 15063)
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252
## [2] LC_CTYPE=English_United Kingdom.1252
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.1252
##
## attached base packages:
## [1] parallel grid splines stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] webshot_0.4.1 robustHD_0.5.1 perry_0.2.0
## [4] robustbase_0.92-7 ggplot2_2.2.1 CCA_1.2
## [7] fields_9.0 maps_3.2.0 spam_2.1-1
## [10] dotCall64_0.9-04 fda_2.4.7 Matrix_1.2-7.1
## [13] texreg_1.36.23 dplyr_0.5.0 nFactors_2.3.3
## [16] lattice_0.20-34 boot_1.3-18 MASS_7.3-45
## [19] psych_1.7.8 xtable_1.8-2 knitr_1.15.1
## [22] rcompanion_1.10.1
##
## loaded via a namespace (and not attached):
## [1] stringi_1.1.2 evaluate_0.10 ordinal_2015.6-28
## [4] permute_0.9-4 expm_0.999-2 mgcv_1.8-15
## [7] survival_2.41-3 highr_0.6 nloptr_1.0.4
## [10] DBI_0.5-1 rstudioapi_0.6 jpeg_0.1-8
## [13] nlme_3.1-128 maxLik_1.3-4 MatrixModels_0.4-1
## [16] mvtnorm_1.0-6 miscTools_0.6-22 ade4_1.7-8
## [19] rprojroot_1.2 tools_3.3.2 manipulate_1.0.1
## [22] sandwich_2.4-0 pbkrtest_0.4-7 magrittr_1.5
## [25] DEoptimR_1.0-8 car_2.1-5 Rcpp_0.12.13
## [28] vegan_2.4-4 rmarkdown_1.3 assertthat_0.1
## [31] R6_2.2.0 nnet_7.3-12 munsell_0.4.3
## [34] mnormt_1.5-5 png_0.1-7 coin_1.2-1
## [37] digest_0.6.12 codetools_0.2-15 bookdown_0.3
## [40] corrplot_0.77 BSDA_1.2.0 lmtest_0.9-35
## [43] colorspace_1.3-2 stats4_3.3.2 WRS2_0.9-2
## [46] minqa_1.2.4 plyr_1.8.4 gtable_0.2.0
## [49] zoo_1.7-14 modeltools_0.2-21 nortest_1.0-4
## [52] quantreg_5.33 tibble_1.2 ucminf_1.1-4
## [55] cluster_2.0.5 backports_1.1.1 scales_0.4.1
## [58] hermite_1.1.1 RVAideMemoire_0.9-68 multcompView_0.1-7
## [61] stringr_1.1.0 EMT_1.1 lme4_1.1-14
## [64] SparseM_1.77 mc2d_0.1-18 class_7.3-14
## [67] htmltools_0.3.6 DescTools_0.99.22 yaml_2.1.14
## [70] lazyeval_0.2.0 TH.data_1.0-8 e1071_1.6-8
## [73] foreign_0.8-67 reshape_0.8.7 multcomp_1.4-7
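A listing of this form is what R's sessionInfo() function prints. A minimal sketch for producing the equivalent listing on another machine, so that package versions can be compared with those above, is given here; the output file name is an assumption for illustration.

```r
# Collect R version, platform, locale and package versions for the current session.
info <- sessionInfo()
print(info)

# Optionally save a copy for later comparison (file name is hypothetical).
out_file <- file.path(tempdir(), "session-info.txt")
writeLines(capture.output(print(info)), out_file)
```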
31.2 References
Bivand, Roger S, Edzer J Pebesma, and Virgilio Gomez-Rubio. 2008. Applied Spatial Data Analysis with R. Springer.
Candas, Ezgi, Seda Bagdatli Kalkan, and Tahsin Yomralioglu. 2015. “Determining the Factors Affecting Housing Prices.”
Galati, Gabriele, Federica Teppa, and Rob JM Alessie. 2011. “Macro and Micro Drivers of House Price Dynamics: An Application to Dutch Data.”
Gandrud, Christopher. 2013. Reproducible Research with R and R Studio. CRC Press.
HBR. 1997. “The Four Faces of Mass Customisation.” https://hbr.org/1997/01/the-four-faces-of-mass-customization.
Johnson, Richard Arnold, Dean W Wichern, and others. 2014. Applied Multivariate Statistical Analysis. Vol. 4. Prentice-Hall New Jersey.
Kaggle. 2015. “This Dataset Contains House Sale Prices for King County.” https://www.kaggle.com/harlfoxem/housesalesprediction.
kdnuggets. 2017. “Trends in Data Science.” http://www.kdnuggets.com/2016/07/4-trends-disrupting-data-science-market.html.
Lewis, Michael, and Nigel Slack. 2002. Operations Strategy. Prentice-Hall.
Marriott, F.H.C. 1974. The Interpretation of Multiple Observations. London Academic Press.
Pine, B Joseph. 1993. Mass Customization: The New Frontier in Business Competition. Harvard Business Press.
Rao, Calyampudi Radhakrishna. 1973. Linear Statistical Inference and Its Applications. Vol. 2. Wiley New York.
RightMove. 2017. “Positive and Negative Impacts on House Prices.” http://www.rightmove.co.uk/what-affects-house-prices.html.