Data Analytics Module
Lecturer: Hans van der Zwan
Lab 05
Topic: multiple regression


EXERCISE 5.1

(continuation of the homework assignment from handout 4)
The file 20191126_forsale_amsterdam.csv contains information about properties for sale in Amsterdam on November 26, 2019.

  1. Create a table with summary statistics for the house prices per PC3 district. A PC3 district groups all postcodes that share the same first three characters, so the PC3 districts in Amsterdam are PC100, PC101, …, PC110.
  2. Create a scatterplot with PRICE as Y-variable and AREA as X-variable.
  3. Generate a regression model with PRICE as response and AREA as explanatory variable.
  4. Assess the regression model from part 3.
  5. Create a scatterplot with PRICE as Y-variable and ROOMS as X-variable.
  6. Generate a regression model with PRICE as response variable and ROOMS as explanatory variable.
  7. Assess the regression model from part 6.
  8. Create a regression model with PRICE as response variable and AREA and ROOMS as explanatory variables.
  9. Assess the regression model from part 8 and compare it with the two other models.
  10. Add a dummy variable: 1 = located in the city centre, 0 = not located in the city centre. Add this variable to the regression model of part 8.
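
The steps above can be sketched in Python with pandas and NumPy. This is a minimal sketch under stated assumptions, not a model answer: the column name POSTCODE is hypothetical (only PRICE, AREA and ROOMS are named in the exercise), and treating PC3 district 101 as the city centre is an assumption that should be checked against the data. The helper names are illustrative only.

```python
import numpy as np
import pandas as pd


def add_pc3(df, postcode_col="POSTCODE"):
    """Add a PC3 column with the first three postcode characters.

    ASSUMPTION: the postcode column is called POSTCODE; adjust to the file.
    """
    out = df.copy()
    out["PC3"] = out[postcode_col].astype(str).str.strip().str[:3]
    return out


def price_summary_by_pc3(df):
    """Summary statistics of PRICE per PC3 district (part 1)."""
    stats = ["count", "mean", "median", "std", "min", "max"]
    return df.groupby("PC3")["PRICE"].agg(stats)


def fit_ols(df, response, predictors):
    """Ordinary least squares fit with an intercept.

    Returns the coefficients, R^2 and adjusted R^2.
    """
    y = df[response].to_numpy(dtype=float)
    columns = [np.ones(len(df))] + [df[p].to_numpy(dtype=float) for p in predictors]
    X = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    n, k = len(y), len(predictors)
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    coef = dict(zip(["Intercept", *predictors], beta))
    return {"coef": coef, "r2": r2, "adj_r2": adj_r2}


def compare_models(df, centre_pc3=("101",)):
    """Fit the models of parts 3, 6, 8 and 10 and tabulate their fit.

    ASSUMPTION: the city centre is PC3 district 101; this is not stated
    in the exercise and is used here only for illustration.
    """
    df = df.copy()
    df["CENTRE"] = df["PC3"].isin(centre_pc3).astype(int)  # dummy variable
    specs = {
        "AREA": ["AREA"],
        "ROOMS": ["ROOMS"],
        "AREA + ROOMS": ["AREA", "ROOMS"],
        "AREA + ROOMS + CENTRE": ["AREA", "ROOMS", "CENTRE"],
    }
    rows = {}
    for name, predictors in specs.items():
        fit = fit_ols(df, "PRICE", predictors)
        rows[name] = {"r2": fit["r2"], "adj_r2": fit["adj_r2"]}
    return pd.DataFrame.from_dict(rows, orient="index")
```

A possible use is `df = add_pc3(pd.read_csv("20191126_forsale_amsterdam.csv"))`, followed by `price_summary_by_pc3(df)` and `compare_models(df)`. For the scatterplots, `df.plot.scatter(x="AREA", y="PRICE")` works when matplotlib is installed. Adjusted R² is reported next to R² because R² never decreases when a predictor is added, so it cannot be used on its own to compare models with different numbers of explanatory variables.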


EXERCISE 5.2

Topic: determinants of healthcare costs in the Netherlands
Steps (methodology):

  • Collect historical data:
    • Healthcare costs from 2017 per municipality (vektis.nl)
    • Figures on (socio-economic) factors that are assumed to be related to healthcare costs
  • Generate a multiple regression model with healthcare costs in 2017 as response variable and the (socio-economic) factors as predictors

Open the file healthcare_nl.csv. This file contains information about a sample of Dutch municipalities in 2017.

Variables in this dataset:

  • MUNCODE; a unique identifier for Dutch municipalities
  • MUNICIPALITY; the name of the municipality
  • TOTAL_COSTS; total healthcare costs insured under the basic Dutch health insurance
  • INSURED_YEARS; total number of insured years
  • UNEMPLOYMENT_RATE
  • DISTANCE_HOSPITAL; average distance to hospital
  • HOSPITALS; average number of hospitals within a distance of 20 km

Develop a multiple regression model with HEALTHCARE_COSTS as response variable and assess the model.
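
One way to start is sketched below in Python. HEALTHCARE_COSTS does not appear in the variable list above, so the sketch assumes it means the costs per insured year, TOTAL_COSTS / INSURED_YEARS; this is an assumption, not something the handout states. The helper names are illustrative. The coefficient table reports standard errors and t-values, which is one common way to assess which predictors contribute to the model.

```python
import numpy as np
import pandas as pd

PREDICTORS = ["UNEMPLOYMENT_RATE", "DISTANCE_HOSPITAL", "HOSPITALS"]


def add_cost_per_insured_year(df):
    """Add HEALTHCARE_COSTS as costs per insured year.

    ASSUMPTION: HEALTHCARE_COSTS = TOTAL_COSTS / INSURED_YEARS. The handout
    does not define this variable, so verify the definition before use.
    """
    out = df.copy()
    out["HEALTHCARE_COSTS"] = out["TOTAL_COSTS"] / out["INSURED_YEARS"]
    return out


def ols_coefficient_table(df, response="HEALTHCARE_COSTS", predictors=PREDICTORS):
    """Least-squares coefficients with standard errors and t-values."""
    predictors = list(predictors)
    y = df[response].to_numpy(dtype=float)
    X = np.column_stack([np.ones(len(df)), df[predictors].to_numpy(dtype=float)])
    n, p = X.shape  # p counts the intercept as well
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = float(resid @ resid) / (n - p)  # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)    # covariance of the estimates
    std_err = np.sqrt(np.diag(cov))
    return pd.DataFrame(
        {"coef": beta, "std_err": std_err, "t": beta / std_err},
        index=["Intercept", *predictors],
    )
```

A possible use is `ols_coefficient_table(add_cost_per_insured_year(pd.read_csv("healthcare_nl.csv")))`. A t-value that is large in absolute terms suggests the predictor adds information given the others. The residuals should still be checked for patterns before the model is accepted.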