Data Analytics Module
Lecturer: Hans van der Zwan
Research wk 04
Topic: linear regression
Type: individual assignment
Dutch healthcare costs (3)
The objective of this research assignment is to identify the determinants of healthcare costs in the Netherlands, i.e. the factors that influence healthcare costs.
A large body of research on the determinants of healthcare costs already exists.
Download the file with Dutch reimbursed healthcare costs per municipality in 2018: vektis2018_extended.xlsx.
Draw a random sample of 100 observations from this data set.
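Drawing the sample with a fixed seed makes the analysis reproducible. The sketch below is one possible way to do this using only the Python standard library. The list `rows` is a hypothetical stand-in for the records loaded from the spreadsheet, and `MUNICIPALITY_ID`, the row count of 300 and the seed value are illustrative assumptions, not properties of the real file.

```python
import random


def draw_sample(rows, n=100, seed=2018):
    """Return n rows drawn without replacement.

    A fixed seed means the same rows are selected on every run,
    so the results in the report can be reproduced.
    """
    rng = random.Random(seed)
    return rng.sample(rows, n)


# Hypothetical stand-in for the rows read from the spreadsheet.
rows = [{"MUNICIPALITY_ID": i} for i in range(300)]

sample = draw_sample(rows)
print(len(sample))
```

If the data is held in a pandas DataFrame instead, `DataFrame.sample(n=100, random_state=...)` serves the same purpose.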
- Create a correlation matrix with the numerical variables in the data set. Which variable has the highest correlation with COSTS_PER_INSURED_YEAR?
- Comment on what can be seen in the correlation matrix.
- MODEL1: Generate a linear regression model with COSTS_PER_INSURED_YEAR as response variable (Y-variable) and AGE_AVERAGE as explanatory variable (X-variable).
- MODEL2: Generate a linear regression model with COSTS_PER_INSURED_YEAR as response variable (Y-variable) and AGE_MEDIAN as explanatory variable (X-variable).
- Compare MODEL1 with MODEL2: which model is more useful for explaining the variation in COSTS_PER_INSURED_YEAR across the municipalities?
- MODEL3: Generate a multiple linear regression model with COSTS_PER_INSURED_YEAR as response variable (Y-variable) and a couple of explanatory variables; use the correlation matrix for making a selection of features (explanatory variables) in the model.
Linear regression
Search for a dataset with a couple of variables.
The dataset must contain one variable which can be used as response variable and one or more explanatory variables.
Generate a regression model. Assess the model. Give an interpretation of the coefficient(s) of the explanatory variable(s) in the model.