9 Homework: Week 2

9.1 Instructions

This is homework assignment 2 of 2 for this module. In this assignment, you will apply the concepts learned through Chapter 7 using the MEPS data. There is a total of 42 possible points for this assignment (27 in Question 1, and 15 in Question 2).

Download the code skeleton for this assignment by clicking the red button below. The skeleton provides an outline for your submission, which will take the form of an R Markdown HTML document.

The homework questions are listed in the subsections below. There are 2 questions in this assignment, each with several parts. The questions are intended to be challenging but doable if you have spent the time to learn the material through Chapter 7. Where appropriate, we may provide hints and links to sections in the module that you can reference to help you with a question. All questions are intended to be answerable with the module material provided, but you are encouraged to use online resources and search engines to come up with solutions.

9.2 Download Code Skeleton

9.3 Homework Questions:

9.3.1 Question 1 (27 points)

This question has several parts that let you practice the linear regression concepts learned in Chapter 6. Specifically, we will fit a linear regression model to predict total adult medical expenditures using age, BMI, and perceived mental health status as predictors.

First, make sure you have your MEPS data loaded with the same alterations that we made at the beginning of Chapter 5. We’ll define a new subset of the data, called meps_sub, that we’ll use for Question 1 of this homework. Please copy and run this code so that everyone starts with the same dataset.

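library(dplyr)  # provides %>%, select(), and filter()

meps_sub = meps %>% 
  dplyr::select(TOTEXP18, AGE42X, ADBMI42, MNHLTH42) %>%  # keep only the model variables
  dplyr::filter(AGE42X >= 18,             # adults only
                ADBMI42 > 0,              # drop invalid BMI values
                MNHLTH42 != "INVALID")    # drop invalid mental health status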

In this code, we kept only the four variables that we’ll need for our model, restricted the data to adults (18+), and filtered out any invalid values for BMI and Mental Health Status.

9.3.1.1 Part 1 (5pts):

Create three plots that show the following relationships. Remember to reference Chapter 5 on data visualization for ideas.

  1. Total expenditures vs Age (1pt)
  2. Total expenditures vs BMI (1pt)
  3. Total expenditures vs Mental Health status (1pt)
  4. Then, create a correlation matrix for only the numerical variables in the meps_sub dataset (1pt)
  5. Finally, discuss what these plots and correlation matrix tell you about how your predictors (Age, BMI, Mental Health Status) might be related to Total Medical Expenditures (2pts).
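
As a hint for the mechanics only, here is a minimal sketch of the plot types and the numeric-only correlation matrix described above. It uses R's built-in mtcars data rather than meps_sub, and the names cars, numeric_cols, and cor_matrix are chosen here purely for illustration.

```r
# Illustration on the built-in mtcars data, not on meps_sub.
cars <- mtcars[, c("mpg", "wt", "hp")]
cars$gear <- factor(mtcars$gear)  # a categorical column, analogous to a status variable

# Numeric response vs numeric predictor: a scatter plot.
plot(cars$wt, cars$mpg, xlab = "Weight", ylab = "Miles per gallon")

# Numeric response vs categorical predictor: side-by-side box plots.
boxplot(mpg ~ gear, data = cars, xlab = "Gears", ylab = "Miles per gallon")

# Keep only the numeric columns, then compute pairwise Pearson correlations.
numeric_cols <- sapply(cars, is.numeric)
cor_matrix <- cor(cars[, numeric_cols])
print(round(cor_matrix, 2))
```

The same pattern should carry over to meps_sub once you substitute its column names; the ggplot2 approach from the data visualization chapter works equally well for the plots.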

9.3.1.2 Part 2 (12pts):

  1. Fit a linear regression model to predict Total Medical Expenditures as a function of Age, BMI, and Mental Health Status (1pt)
  2. Interpret the model’s estimated coefficients for each of the predictor variables in terms of their estimated impact on Total Medical Expenditures. Also interpret the intercept term. (3pts)
  3. Which predictors appear to be important in predicting Total Medical Expenditures? How did you determine whether a predictor was important? (3pts)
  4. For one of the predictors (of your choosing), interpret what the associated p-value means. (1pt)
  5. How well does the model fit the data? What statistic did you use to assess the model’s ability to explain variation in Total Medical Expenditures? (2pts)
  6. Interpret the p-value associated with the F-statistic of the model. Based on this p-value, do you think at least one of the predictors is likely to be important in predicting medical expenditures? (2pts)
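
The quantities these items ask about can all be read from the object returned by summary(). The sketch below shows where they live, again on the built-in mtcars data rather than meps_sub; the names fit, fit_summary, and f_p_value are illustrative assumptions, not names the assignment requires.

```r
# Illustration on the built-in mtcars data, not on meps_sub.
fit <- lm(mpg ~ wt + hp, data = mtcars)
fit_summary <- summary(fit)

# Coefficient table: estimate, standard error, t value, and p-value per term.
print(fit_summary$coefficients)

# R-squared: the share of variance in the response explained by the model.
print(fit_summary$r.squared)

# Overall F-test p-value, computed from the stored F-statistic
# and its numerator and denominator degrees of freedom.
f <- fit_summary$fstatistic
f_p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
print(unname(f_p_value))
```

Printing fit_summary directly shows the same numbers in a single report; extracting them individually is mainly useful when you want to quote them in your write-up.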

9.3.1.3 Part 3 (2pts):

  1. In any linear regression setting, it is important to understand the underlying assumptions that make a linear model appropriate. One important assumption is linearity: the expected response is a linear function of the predictors. Assess whether you think the model satisfies the linearity assumption. (2pts)
  2. Bonus - if you do not think the model satisfies the linearity assumption, what is something you would do to create a different model that might? (2 bonus pts)
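
One common way to examine linearity is a residuals-versus-fitted plot: if the linear form is adequate, the residuals should scatter around zero with no clear curve or trend. The sketch below is a minimal illustration on the built-in mtcars data, not on meps_sub, and the model it fits is an assumption made only for the example.

```r
# Illustration on the built-in mtcars data, not on meps_sub.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Residuals against fitted values; a systematic curve suggests non-linearity.
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)  # reference line at zero
```

Calling plot(fit, which = 1) produces a similar diagnostic with a smoothed trend line added.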

9.3.1.4 Part 4 (8pts):

  1. Split the meps data randomly into a training dataset and a testing (or validation) dataset using a seed value of 10. (1pt)
  2. Create a new linear regression model, called model_train, using the training dataset (1pt)
  3. Use the predict() function to calculate the predicted values when applying the model_train model to the test dataset. Use the head() function to print the first few values. (2pts)
  4. Calculate the Mean Squared Error of the model when applied to the test dataset. (2pts)
  5. Explain why calculating a test MSE is superior to the training MSE when evaluating the predictive performance of competing models. (2pts)
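
The split, predict, and test-MSE workflow above can be sketched as follows. This is an illustration under stated assumptions: it uses the built-in mtcars data, an arbitrary seed of 123, and a 70/30 split, none of which come from the assignment, so substitute the dataset, seed, and proportion you are asked to use.

```r
# Illustration on the built-in mtcars data, not on the MEPS data.
set.seed(123)  # arbitrary seed for this example; the assignment specifies its own

# Randomly assign 70% of rows to training (the proportion is an assumption).
n <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- mtcars[train_idx, ]
test <- mtcars[-train_idx, ]

# Fit on the training rows only.
fit_train <- lm(mpg ~ wt + hp, data = train)

# Predict on rows the model has not seen.
test_pred <- predict(fit_train, newdata = test)
print(head(test_pred))

# Test MSE: the mean of squared differences between observed and predicted values.
test_mse <- mean((test$mpg - test_pred)^2)
print(test_mse)
```

Setting the seed before calling sample() is what makes the split reproducible from one run to the next.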

9.3.2 Question 2 (15 points)

Question 2 will allow you to practice the logistic regression concepts learned in Chapter 7.

Imagine you are an actuary working for the government and have been assigned to create a model to prioritize adult individuals for outreach about receiving the flu vaccine. Further assume that the MEPS data is the dataset on which you will train your model. We want to accomplish two things:

  1. Understand better what factors are associated with adults receiving the flu vaccine
  2. Create a predictive model that will allow you to predict who is most likely to have received their flu vaccine already so you can move them to the bottom of the priority list.

Again, make sure you have your MEPS data loaded with the same alterations that we made at the beginning of Chapter 5. We’ll define a new subset of the data, called meps_logistic, that we’ll use for Question 2 of this homework. Please copy and run this code so that everyone starts with the same dataset.

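library(dplyr)  # provides %>%, select(), filter(), and mutate()

meps_logistic = meps %>% 
  dplyr::select(ADFLST42, AGE42X, SEX, INSCOV18, POVCAT18, RACETHX) %>% 
  dplyr::filter(ADFLST42 != "INVALID",   # drop invalid flu vaccination status
                AGE42X >= 18) %>%        # adults only
  dplyr::mutate(ADFLST42 = ifelse(ADFLST42 == "YES", 1, 0))  # 1 = vaccinated, 0 = not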

In this code, we kept only the flu vaccination status, age, sex, insurance coverage, poverty category, and race variables for our model. We also restricted the data to adults (18+) and filtered out any invalid values for flu vaccination status. Finally, we re-coded the flu vaccination status variable to equal 1 if a person has already received the vaccine, and 0 otherwise.

9.3.2.1 Part 1 (9pts):

  1. Fit a logistic model using the meps_logistic dataset to predict flu vaccination status as a function of age, sex, insurance coverage, poverty category, and race. Name this model flu_logistic_model (2pts).
  2. Call summary() on the model (1pt).
  3. Interpret the coefficient of the SEX variable and the likelihood of receiving a flu shot for males vs females. Is the difference between males and females significant? Explain why or why not (2pts).
  4. Interpret the two coefficients of the INSCOV18 variable. How does insurance coverage impact the likelihood of receiving a flu shot? (2pts)
  5. In general, does having more income appear to be associated with higher or lower likelihood of receiving a flu shot? (1pt)
  6. How does race appear to be associated with flu shot status? (1pt)
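
As a reminder of the mechanics, logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are often easier to interpret. The sketch below fits a logistic model to the built-in mtcars data (predicting the binary am column), not to meps_logistic; the names fit_logit and odds_ratios are illustrative assumptions.

```r
# Illustration on the built-in mtcars data, not on meps_logistic.
# am is a 0/1 outcome, so family = binomial requests a logistic regression.
fit_logit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
print(summary(fit_logit))

# Coefficients are changes in log-odds per one-unit increase in a predictor.
print(coef(fit_logit))

# Exponentiate to get odds ratios: values above 1 raise the odds of the
# outcome, values below 1 lower them.
odds_ratios <- exp(coef(fit_logit))
print(odds_ratios)
```

For a categorical predictor, R reports one coefficient per non-reference level, and each is interpreted relative to the reference level that does not appear in the summary table.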

9.3.2.2 Part 2 (6pts):

  1. Use the predict() function to calculate predicted probabilities of receiving a flu shot for each person in the training dataset (1pt).
  2. Print the first 10 probabilities (1pt).
  3. Pick a prediction threshold to classify individuals as “1” vs “0” (1pt).
  4. Calculate the prediction accuracy of your model with your chosen prediction threshold (1pt).
  5. Calculate the sensitivity of your model with your chosen prediction threshold (1pt).
  6. Interpret the sensitivity value calculated in the step above (1pt).
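
The steps above can be sketched as follows, again on the built-in mtcars data rather than meps_logistic. The 0.5 threshold is an arbitrary choice made for this illustration, not a recommendation, and all variable names are assumptions.

```r
# Illustration on the built-in mtcars data, not on meps_logistic.
fit_logit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# type = "response" returns probabilities rather than log-odds.
pred_prob <- predict(fit_logit, type = "response")
print(head(pred_prob, 10))

# Classify as 1 when the predicted probability meets the threshold.
threshold <- 0.5  # arbitrary for this example
pred_class <- ifelse(pred_prob >= threshold, 1, 0)
actual <- mtcars$am

# Confusion table of predicted vs actual classes.
print(table(Predicted = pred_class, Actual = actual))

# Accuracy: the share of all observations classified correctly.
accuracy <- mean(pred_class == actual)

# Sensitivity: the share of actual 1s that the model classifies as 1.
sensitivity <- sum(pred_class == 1 & actual == 1) / sum(actual == 1)
print(c(accuracy = accuracy, sensitivity = sensitivity))
```

Lowering the threshold classifies more observations as 1, which generally raises sensitivity at the cost of more false positives.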

9.3.2.3 Bonus Question (3pts)

Imagine you are developing a fictional model that uses evidence from a murder scene to predict whether a suspect is complicit in the murder. Would you prefer this model to have high sensitivity or high specificity? Explain your rationale.