Section 2 Introduction
There are two key ideas in this Dissertation: the application of Multivariate Analysis techniques to Business Analytics, and the empowerment of Analysts to build and publish their own Analytics Software Tools. To weave both themes together, I use an illustrative Business Problem.
My illustrative problem is to develop an Analytics Tool that accurately models a Property’s Value. Some property price modelling tools in commercial use perform poorly. In subsequent chapters I aim to demonstrate that a high-performance Analytical tool integrating Statistical and Software Elements can be built with modest effort.
Document Structure
This document is organised with a linear flow that mirrors the Structure of an Analytics Project. A Theory Section provides a summarised introduction to Multivariate Analysis, after which Data Analysis is performed. Subsequently, a model is selected to describe the features in the data. Finally, a software tool containing the selected model is published.
The Mathematical Theory section starts by establishing the key theoretical concepts required to understand Multivariate Models. The Key Elements Section defines a Random Sample and presents (without proof) the asymptotic properties of the (multivariate) sample mean and sample covariance matrix. The Linear Predictors Section defines a Linear Predictor Function and the Mean Squared Error criterion for measuring the accuracy of linear prediction functions.
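As a minimal sketch of these definitions, the following Python snippet computes the sample mean vector and sample covariance matrix of a synthetic random sample, builds a linear predictor from them, and evaluates it with the Mean Squared Error criterion. The data, the function names (`sample_moments`, `linear_predictor`, `mean_squared_error`) and the use of NumPy are assumptions made for illustration only; they are not taken from the later chapters.

```python
import numpy as np

def sample_moments(Z):
    """Return the sample mean vector and sample covariance matrix of Z (rows = observations)."""
    mean = Z.mean(axis=0)
    cov = np.cov(Z, rowvar=False)  # unbiased estimator (divides by n - 1)
    return mean, cov

def linear_predictor(X, y):
    """Coefficients of the linear predictor of y from X, built from sample moments."""
    p = X.shape[1]
    mean, cov = sample_moments(np.column_stack([X, y]))
    S_xx = cov[:p, :p]   # covariance of the predictors
    s_xy = cov[:p, p]    # covariance between predictors and response
    slope = np.linalg.solve(S_xx, s_xy)
    intercept = mean[p] - slope @ mean[:p]
    return intercept, slope

def mean_squared_error(y, y_hat):
    """Average squared prediction error."""
    return float(np.mean((y - y_hat) ** 2))

# Synthetic data (illustrative assumption, not the property data set).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 1.0 + X @ np.array([2.0, -0.5]) + rng.normal(scale=0.1, size=500)

intercept, slope = linear_predictor(X, y)
mse = mean_squared_error(y, intercept + X @ slope)
baseline_mse = mean_squared_error(y, np.full_like(y, y.mean()))
```

Comparing `mse` with `baseline_mse` (the error from predicting every response by the sample mean) gives a simple indication of how much the linear predictor improves on a constant prediction.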
Subsequently, I introduce the analytical properties of four distinct Multivariate Statistical Techniques, which will be applied in subsequent chapters. Each technique emphasises a different aspect of the sample data. The Classical Linear Regression approach focuses on the sample mean and the Principal Component Analysis approach emphasises the sample covariance matrix, while Factor Analysis and Canonical Correlation Analysis draw on a mixture of both.
I then introduce the concept of Mass Customisation and describe its application to Business Analytics. In general terms, Mass Customisation is the mass production of customised goods and services. It treats each customer as a separate market and uses advanced Technology to produce each individual order with a customised production set-up.
The data section is split into two parts.
I am interested in applying multivariate analysis to develop an Analytics tool which accurately models property prices. To do this, I sourced a property data set from the website Kaggle (2015). I call this the original data set.
I start by presenting the original data set, which contains sale prices and property details for 21,613 residential property transactions between May 2014 and May 2015 in King County, USA (which includes Seattle).
I perform extensive exploratory analysis, together with Multivariate Clustering to detect outliers.
To discover which variables not included in the data set are known to influence property prices, I conducted an online literature search. I detail the results in the Literature Search subsection. The search identified potential gaps in the Original Data, which I addressed through a significant data enrichment process. This process is documented in the Enrichment Process subsection.
Having explored and enriched the data set and removed its outliers, I apply statistical models to the data. This section presents applications of the Multivariate models introduced in the Statistical Modelling Theory Section:
- In the Profile Analysis Section, I use the MANOVA technique to check that the relationships in the Property Data set are consistent with common sense.
- In the Factor Analysis section, I apply the Orthogonal Factor Model to look for a simpler description of the property data set.
- In the Correlation Analysis section, I use Canonical Correlation Analysis to determine whether the enrichment process was worthwhile. Is there significant overlap between the information in the enriched and original data sets?
- In the Multivariate Regression Section, I investigate how accurately property prices can be modelled with a Classical Linear Regression model.
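The overlap question posed in the Correlation Analysis item can be illustrated with a small sketch. It computes sample canonical correlations between two blocks of variables using a QR decomposition followed by a singular value decomposition, which is one standard numerical route and not necessarily the procedure used in the Correlation Analysis section. The function name `canonical_correlations` and the synthetic "original", "redundant" and "independent" blocks are hypothetical and stand in for the real data sets.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Sample canonical correlations between the columns of X and Y.

    Assumes both centred matrices have full column rank.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the centred data.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # The singular values of Qx' Qy are the cosines of the principal angles
    # between the two column spaces, i.e. the sample canonical correlations.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)  # guard against rounding slightly above 1

# Synthetic stand-ins (illustrative assumption, not the property data).
rng = np.random.default_rng(1)
original = rng.normal(size=(5000, 3))
# Hypothetical enriched block that only recombines the original variables.
redundant = original @ np.array([[1.0, 0.5], [0.0, 2.0], [-1.0, 1.0]])
# Hypothetical enriched block generated independently of the original.
independent = rng.normal(size=(5000, 2))

rho_redundant = canonical_correlations(original, redundant)
rho_independent = canonical_correlations(original, independent)
```

In this sketch, canonical correlations close to 1 (as in `rho_redundant`) suggest that the second block adds little information beyond the first, while values close to 0 (as in `rho_independent`) suggest that the two blocks carry largely distinct information.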
This section builds the statistical model identified in the Model Selection chapter into an Interactive Analytics Tool. The tool is published on a standalone website and embedded into this document.
This section contains concluding remarks and recommendations for next steps.
References
Kaggle. 2015. “This Dataset Contains House Sale Prices for King County.” https://www.kaggle.com/harlfoxem/housesalesprediction.