Section 20 Data Processing
The original data is missing a number of fields that the literature search flagged as important, in particular Proximity to Amenities, Rental Value, and Large City vs Not Large. Some of these could be calculated directly from the original data. For example, since Seattle is the only large city in King County, “Large City vs Not Large” can be derived from the zipcode data field and a list of Seattle zipcodes.
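A minimal R sketch of this derivation is given here. The data frame `houses`, the column name `LargeCity` and the short zipcode list are illustrative assumptions; the full list of Seattle zipcodes used in the analysis is not reproduced in this section.

```r
# Illustrative subset of Seattle zipcodes (assumption: not the full list).
seattle_zips <- c(98101, 98102, 98103, 98115, 98117)

# Toy stand-in for the housing data; only the zipcode column is needed.
# 98004 (Bellevue) and 98052 (Redmond) lie in King County but outside Seattle.
houses <- data.frame(
  id      = 1:4,
  zipcode = c(98103, 98004, 98115, 98052)
)

# TRUE when the property's zipcode is in the Seattle list, FALSE otherwise.
houses$LargeCity <- houses$zipcode %in% seattle_zips

print(houses)
```

Because `%in%` is vectorised, the same line would apply unchanged to the full data set.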
Many of the missing data fields could not be calculated from the original data. I therefore wrote custom R functions to retrieve additional fields from online data providers. Making large volumes of API calls for free is a fiddly computational exercise: the code needs to handle error messages and parse results automatically. Furthermore, data providers typically limit free usage to 1000 API calls per day, and this limit had to be worked around to enrich my 21613 records.
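The two difficulties described here, failed calls and the daily quota, can be sketched in base R as follows. Both `safe_call` and `daily_batches` are hypothetical helper names introduced for illustration and are not the functions in the Code Section; `fetch` stands for any function that performs one API call.

```r
# Hypothetical helper: run fetch(record), retrying on error, and return NA
# instead of stopping the whole loop if every attempt fails.
safe_call <- function(fetch, record, max_tries = 3, wait_seconds = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fetch(record), error = function(e) NULL)
    if (!is.null(result)) return(result)
    if (attempt < max_tries) Sys.sleep(wait_seconds)  # pause before retrying
  }
  NA
}

# Hypothetical helper: split record indices into batches no larger than
# the daily quota, so that one batch can be processed per day (or per token).
daily_batches <- function(n_records, daily_limit) {
  idx <- seq_len(n_records)
  split(idx, ceiling(idx / daily_limit))
}

batches <- daily_batches(21613, 1000)
length(batches)        # 22 batches at 1000 calls per day
length(batches[[22]])  # 613 records in the final batch
```

At 1000 calls per day, a single pass over the 21613 records would therefore take 22 days with one token, which is why the limit mattered.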
The first step in using an API data-service is to obtain an Access Token from the service. This can only be done manually, by visiting the developer pages of the service’s website and supplying user details (e.g. email address). Access Tokens typically do not expire but do have usage limits of 1000 calls per day. By supplying credit card details to verify my identity, I was able to increase my daily limit on Google to 50000 calls.
Reverse Geocoding
Each data-service requires a particular set of input information to be supplied to the server in a specified format. Unfortunately, Zillow required the full property address as an input, and this was not contained in the original data. Therefore, before starting the data enrichment, I wrote the ReverseGeo function to reverse geocode my data (see Code Section). Using this function, I looped through the 21613 records and, by supplying latitude and longitude information, obtained the full address details. To check that this process completed without error, I randomly selected 10 properties and checked on Google Maps that the lat/long information and the full address identified the same building.
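The request and response handling for one record might look roughly like the sketch below. It is not the ReverseGeo function itself: the helper names are hypothetical, the endpoint and field names follow the Google Geocoding API, and the text does not state which provider ReverseGeo called, so the choice of Google is an assumption. The mock responses stand in for the live service.

```r
# Build a reverse geocoding request URL (Google Geocoding API format;
# the use of this provider is an assumption).
reverse_geo_url <- function(lat, long, key) {
  sprintf(
    "https://maps.googleapis.com/maps/api/geocode/json?latlng=%.6f,%.6f&key=%s",
    lat, long, key
  )
}

# Pull the first formatted address out of a parsed response, i.e. a nested
# list such as jsonlite::fromJSON(url, simplifyVector = FALSE) would return.
# Gives NA when the status is not "OK" or no address is present.
extract_address <- function(response) {
  if (!identical(response$status, "OK") || length(response$results) == 0) {
    return(NA_character_)
  }
  addr <- response$results[[1]]$formatted_address
  if (is.null(addr)) NA_character_ else addr
}

# Mock responses (made-up address, for illustration only).
ok_response <- list(
  status  = "OK",
  results = list(list(formatted_address = "123 Example St, Seattle, WA 98103, USA"))
)
empty_response <- list(status = "ZERO_RESULTS", results = list())

reverse_geo_url(47.5112, -122.257, "MY_KEY")
extract_address(ok_response)     # the mock address
extract_address(empty_response)  # NA
```

Returning NA rather than raising an error lets the loop over all records continue, with failed lookups flagged for a later retry.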
Radar Search
Having obtained the full address for each property, I used the Google Radar Search API to return places of interest near each property. Unfortunately, to obtain the 10 data fields below, I had to loop through the entire data set at least 10 times (once per data field). This exceeded my daily usage limit, and I needed to create multiple access tokens to circumvent the limit. Again, to check that this process completed without error, I randomly selected 10 properties and checked on Google Maps that their neighbourhood contained the places of interest identified by the Radar API.
Micro-variable | Source | Description |
---|---|---|
Restaurants250m | Google | Number of Restaurants within 250m |
Schools1000m | Google | Number of Schools within 1000m |
PoliceStation1000m | Google | Number of Police Stations within 1000m |
SuperMarketGrocery750m | Google | Number of Supermarkets or Grocery Stores within 750m |
Library750m | Google | Number of Libraries within 750m |
LiquourStore250m | Google | Number of Liquor Stores within 250m |
DoctorDentist250m | Google | Number of Doctors or Dentists within 250m |
DepartmentStoreShoppingMall750m | Google | Number of Department Stores or Shopping Malls within 750m |
BusTrainTransitStation100m | Google | Number of Bus, Train or Transit Stations within 100m |
BarNightclubMovie500m | Google | Number of Bars, Nightclubs or Movie Theatres within 500m |
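The search specification behind these fields, and the reason the daily limit was exceeded, can be sketched in R as below. The `radar_spec` table is a hypothetical layout: the place-type strings are assumed mappings onto Google place types (joined with `|` here purely for readability), while the radii are taken from the field names.

```r
# One row per micro-variable: assumed Google place types and search radius.
radar_spec <- data.frame(
  field = c("Restaurants250m", "Schools1000m", "PoliceStation1000m",
            "SuperMarketGrocery750m", "Library750m", "LiquourStore250m",
            "DoctorDentist250m", "DepartmentStoreShoppingMall750m",
            "BusTrainTransitStation100m", "BarNightclubMovie500m"),
  types = c("restaurant", "school", "police",
            "grocery_or_supermarket", "library", "liquor_store",
            "doctor|dentist", "department_store|shopping_mall",
            "bus_station|train_station|transit_station",
            "bar|night_club|movie_theater"),
  radius_m = c(250, 1000, 1000, 750, 750, 250, 250, 750, 100, 500),
  stringsAsFactors = FALSE
)

n_records   <- 21613
total_calls <- n_records * nrow(radar_spec)  # one call per record per field
days_needed <- ceiling(total_calls / 50000)  # with a single 50000-call token

total_calls  # 216130
days_needed  # 5
```

Even with the raised limit of 50000 calls per day, one pass per field amounts to 216130 calls, or about five days on a single token, before counting any retries.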
Zillow Search
Finally, I used the Zillow API to obtain information on the rental value of properties. The results of API calls to Zillow were returned in a hierarchical XML format and needed significant tidying before they could be analysed. Because the XML format of the result set occasionally changed, I wrote the tidying function shown in the Code Section to automate result retrieval. Using this function, I was able to add the following fields to each property in the data-set:
Micro-variable | Source | Description |
---|---|---|
RZEstimateAmount | Zillow | Monthly Rental Charge Estimate for the Property |
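One way to cope with a result format that occasionally changes is to try several candidate locations for the same value and fall back to NA when none is present. The sketch below illustrates that idea and is not the tidying function from the Code Section. The nested lists stand in for parsed XML (for example the output of `xml2::as_list()`), and the helper names and element paths are assumptions.

```r
# Walk a nested list along `path`; return NA if any level is missing.
get_path <- function(node, path) {
  for (name in path) {
    if (!is.list(node) || is.null(node[[name]])) return(NA)
    node <- node[[name]]
  }
  node
}

# Try each candidate path in turn and return the first value found.
first_available <- function(node, paths) {
  for (path in paths) {
    value <- get_path(node, path)
    if (!(length(value) == 1 && is.na(value))) return(value)
  }
  NA
}

# Assumed alternative locations of the rent estimate in the result set.
rent_paths <- list(
  c("result", "rentzestimate", "amount"),
  c("results", "result", "rentzestimate", "amount")
)

# Mock parsed results in two layouts, plus one with no rent estimate.
flat    <- list(result = list(rentzestimate = list(amount = "2100")))
nested  <- list(results = list(result = list(rentzestimate = list(amount = "1850"))))
no_rent <- list(result = list(zestimate = list(amount = "500000")))

as.numeric(first_available(flat, rent_paths))     # 2100
as.numeric(first_available(nested, rent_paths))   # 1850
as.numeric(first_available(no_rent, rent_paths))  # NA
```

Properties for which no candidate path matches end up with NA in RZEstimateAmount rather than interrupting the loop.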