11.5 Exercises
1. Titanic Survival Data
Chapter 11.2 presented three approaches to model the conditional expectation function of a binary dependent variable: the linear probability model as well as Probit and Logit regression.
The exercises in this chapter use data on the fate of the passengers of the ocean liner Titanic. We aim to explain survival, a binary variable, by socioeconomic variables using the above approaches.
In this exercise we start with the aggregated data set Titanic. It is part of the package datasets which is part of base R. The following quote from the description of the dataset motivates the attempt to predict the probability of survival:
The sinking of the Titanic is a famous event, and new books are still being published about it. Many well-known facts — from the proportions of first-class passengers to the ‘women and children first’ policy, and the fact that that policy was not entirely successful in saving the women and children in the third class — are reflected in the survival rates for various classes of passenger.
Instructions:
Assign the Titanic data to Titanic_1 and get an overview.
Visualize the conditional survival rates for travel class (Class), gender (Sex) and age (Age) using mosaicplot().
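A minimal sketch of these two steps, using only base R. The plot title is an arbitrary choice, and `color = TRUE` is just one way to make the tiles easier to tell apart.

```r
# Copy the built-in four-way contingency table (Class x Sex x Age x Survived)
Titanic_1 <- Titanic

# Overview: dimensions, category labels and the total number of persons
dim(Titanic_1)
dimnames(Titanic_1)
sum(Titanic_1)

# Mosaic plot: the area of each tile is proportional to the cell count
mosaicplot(Titanic_1,
           main = "Survival on the Titanic",
           color = TRUE)
```

In a mosaic plot, tiles that are larger for survivors in one category than in another point to a higher conditional survival rate in that category.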
2. Titanic Survival Data — Ctd.
The Titanic data set from Exercise 1 is not useful for regression analysis because it is highly aggregated. In this exercise you will work with titanic.csv which is available under the URL https://stanford.io/2O9RUCF.
The columns of titanic.csv contain the following variables:
Survived — The survived indicator
Pclass — passenger class
Name — passenger’s Name
Sex — passenger’s gender
Age — passenger’s age
Siblings — number of siblings aboard
Parents.Children.Aboard — number of parents and children aboard
Fare — the fare paid in British pounds
Instructions:
Import the data from titanic.csv using the function read.csv2(). Save it to Titanic_2.
Assign the following column names to Titanic_2:
Survived, Class, Name, Sex, Age, Siblings, Parents and Fare.
Get an overview of the data set. Drop the column Name.
Attach the packages corrplot and dplyr. Check whether there is multicollinearity in the data using corrplot().
Hints:
read.csv2() assumes that fields are separated by semicolons and that the decimal mark is a comma unless the arguments sep and dec are set otherwise. You should always check if the result is correct.
You may use select_if() from the dplyr package to select all numeric columns from the data set.
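The workflow can be sketched on a few made-up rows so that it runs without a download. The rows in `csv_text`, the header used there and the object names `toy`, `toy_num` and `cor_mat` are hypothetical and are not taken from titanic.csv. The sketch assumes a comma-separated file with decimal points, which is why the defaults of read.csv2() are overridden.

```r
# Made-up rows in a comma-separated layout (NOT taken from titanic.csv)
csv_text <- "Survived,Pclass,Name,Sex,Age,Siblings,Parents.Children.Aboard,Fare
0,3,Passenger A,male,22,1,0,7.25
1,1,Passenger B,female,38,1,0,71.28
1,3,Passenger C,female,26,0,0,7.93
0,2,Passenger D,male,35,0,1,26.00
1,2,Passenger E,female,4,1,2,23.00
0,1,Passenger F,male,54,0,1,51.86"

# read.csv2() defaults to sep = ";" and dec = ",", so both are set explicitly
toy <- read.csv2(text = csv_text, sep = ",", dec = ".")

# Assign the column names used in the exercises and inspect the result
colnames(toy) <- c("Survived", "Class", "Name", "Sex",
                   "Age", "Siblings", "Parents", "Fare")
str(toy)

# Drop the column Name
toy <- toy[, colnames(toy) != "Name"]

# Keep the numeric columns; base R counterpart of select_if(toy, is.numeric)
toy_num <- toy[, sapply(toy, is.numeric)]

# Pairwise correlations; with corrplot attached, corrplot(cor_mat) plots them
cor_mat <- cor(toy_num)
round(cor_mat, 2)
```

Calling str() right after the import is a quick way to spot a wrong separator: the result would then consist of a single character column instead of eight columns.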
3. Titanic Survival Data — Survival Rates
Contingency tables similar to those provided by the data set Titanic from Exercise 1 may shed some light on the distribution of survival conditional on possible determinants thereof, e.g., the passenger class. Contingency tables are easily created using the base R function table().
Instructions:
Generate a contingency table for Survived and Class using table(). Save the table to t_abs.
t_abs reports absolute frequencies. Transform t_abs into a table which reports relative frequencies (relative to the total number of observations). Save the result to t_rel.
Visualize the relative frequencies in t_rel using barplot(). Use different colors to make the survival and non-survival rates easier to distinguish (it does not matter which colors you use).
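The same steps can be sketched with the aggregated Titanic table from Exercise 1 as a stand-in for Titanic_2. This is an assumption made to keep the example self-contained: the stand-in also contains the crew, so the numbers differ from those obtained with titanic.csv. With individual-level data, `table(Titanic_2$Survived, Titanic_2$Class)` would take the place of margin.table().

```r
# Stand-in: collapse the built-in Titanic table to Survived x Class
t_abs <- margin.table(Titanic, c(4, 1))
t_abs

# Relative frequencies with respect to the total number of observations
t_rel <- prop.table(t_abs)
round(t_rel, 3)

# Stacked bars: one bar per class, one color per survival status
barplot(t_rel,
        col = c("darkred", "steelblue"),
        legend.text = rownames(t_rel),
        xlab = "Class",
        ylab = "Relative frequency")
```

Calling prop.table() without a margin divides by the grand total, which is what the instruction asks for. Passing `margin = 2` would instead give survival rates within each class.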
4. Titanic Survival Data — Conditional Distributions of Age
Contingency tables are useful for summarizing the distribution of categorical variables like Survived and Class in Exercise 3. They are, however, not useful when the variable of interest takes on many different integers (and they are even impossible to generate when the variable is continuous).
In this exercise you are asked to generate and visualize density estimates of the distribution of Age conditional on Survived to see whether there are indications of how age relates to the chance of survival (although the data set reports integers, we treat Age as a continuous variable here). For example, it is interesting to see if the ‘women and children first’ policy was effective.
The data set Titanic_2 from the previous exercises is available in your working environment.
Instructions:
Obtain kernel density estimates of the distributions of Age for both the survivors and the deceased.
Save the results to dens_age_surv (survived) and dens_age_died (died).
Plot both kernel density estimates (overlay them in a single plot!). Use different colors of your choice to make the estimates distinguishable.
Hints:
Kernel density estimates can be obtained using the function density().
Use plot() and lines() to plot the density estimates.
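The hints can be illustrated as follows. The data frame `sim` is simulated and only mimics the two relevant columns of Titanic_2, so the shape of the curves carries no information about the real passengers. The colors and the helper `y_max` are arbitrary choices.

```r
set.seed(1)

# Simulated stand-in for Titanic_2 (made-up values, for illustration only)
sim <- data.frame(
  Survived = rbinom(300, size = 1, prob = 0.4),
  Age      = round(runif(300, min = 1, max = 70))
)

# One kernel density estimate per group
dens_age_surv <- density(sim$Age[sim$Survived == 1])
dens_age_died <- density(sim$Age[sim$Survived == 0])

# Common upper limit of the y-axis so that neither curve is cut off
y_max <- max(dens_age_surv$y, dens_age_died$y)

# plot() draws the first estimate, lines() adds the second one
plot(dens_age_surv, col = "steelblue", lwd = 2, ylim = c(0, y_max),
     main = "Age by survival status (simulated data)", xlab = "Age")
lines(dens_age_died, col = "darkred", lwd = 2)
legend("topright", legend = c("survived", "died"),
       col = c("steelblue", "darkred"), lwd = 2)
```

If a column contains missing values, density() stops with an error unless `na.rm = TRUE` is set.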
5. Titanic Survival Data — A Linear Probability Model for Survival I
How do socio-economic characteristics of the passengers impact the probability of survival? In particular, are there systematic differences between the three passenger classes? Do the data reflect the ‘children and women first’ policy?
It is natural to start the analysis by estimating a simple linear probability model (LPM) like \[Survived_i = \beta_0 + \beta_1 Class2_i + \beta_2 Class3_i + u_i\] with dummy variables \(Class2_i\) and \(Class3_i\).
The data set Titanic_2 from the previous exercises is available in your working environment.
Instructions:
Attach the AER package.
Class is of type int (integer). Convert Class to a factor variable.
Estimate the linear probability model and save the result to surv_mod.
Obtain a robust summary of the model coefficients.
Use surv_mod to predict the probability of survival for the three passenger classes.
Hints:
Linear probability models can be estimated using lm().
Use predict() to obtain the predictions. Remember that a data.frame must be provided to the argument newdata.
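A sketch of the estimation, the robust standard errors and the prediction step. It relies on two assumptions. First, the individual-level data frame `ind` is rebuilt from the aggregated Titanic table, so it includes the crew as a fourth class and the estimates differ from those for titanic.csv. Second, the heteroskedasticity-robust (HC1) covariance matrix is computed by hand in base R; with AER attached, `coeftest(lpm, vcov. = vcovHC, type = "HC1")` is the usual shortcut.

```r
# Stand-in data: one row per person, rebuilt from the aggregated table
long <- as.data.frame(Titanic)
ind  <- long[rep(seq_len(nrow(long)), long$Freq),
             c("Class", "Sex", "Age", "Survived")]
ind$Survived <- as.numeric(ind$Survived == "Yes")

# Class is already a factor here; with integer codes one would use factor()
lpm <- lm(Survived ~ Class, data = ind)

# HC1 covariance matrix: n / (n - k) * (X'X)^-1 X' diag(e^2) X (X'X)^-1
X     <- model.matrix(lpm)
e     <- residuals(lpm)
n     <- nrow(X)
k     <- ncol(X)
bread <- solve(crossprod(X))
meat  <- crossprod(X * e)   # row i of X is scaled by residual e_i
vcov_hc1  <- n / (n - k) * bread %*% meat %*% bread
robust_se <- sqrt(diag(vcov_hc1))
cbind(estimate = coef(lpm), robust_se)

# predict() needs a data.frame whose column name matches the regressor
new_classes <- data.frame(
  Class = factor(levels(ind$Class), levels = levels(ind$Class))
)
pred <- predict(lpm, newdata = new_classes)
pred
```

Because the model contains only class dummies, the predictions coincide with the class-specific sample means of Survived.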
6. Titanic Survival Data — A Linear Probability Model for Survival II
Consider again the outcome from Exercise 5:
\[\widehat{Survived}_i = \underset{(0.03)}{0.63} - \underset{(0.05)}{0.16} Class2_i - \underset{(0.04)}{0.39} Class3_i \]
(The estimated coefficients in this model are related to the class specific sample means of Survived. You are asked to compute them below.)
The highly significant coefficients indicate that the probability of survival decreases with the passenger class, that is, passengers from a less luxurious class are less likely to survive.
This result could be affected by omitted variable bias arising from correlation of the passenger class with determinants of the probability of survival not included in the model. We therefore augment the model such that it includes all remaining variables as regressors.
The data set Titanic_2 as well as the model surv_mod from the previous exercises are available in your working environment. The AER package is attached.
Instructions:
Use the model object surv_mod to obtain the class specific estimates for the probability of survival. Store them in surv_prob_c1, surv_prob_c2 and surv_prob_c3.
Fit the augmented LPM and assign the result to the object LPM_mod.
Obtain a robust summary of the model coefficients.
Hint:
Remember that the formula a ~ . specifies a regression of a on all other variables in the data set provided as the argument data in lm().
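Both steps can be sketched as follows, again under the assumption that the data frame `ind`, rebuilt from the aggregated Titanic table, stands in for Titanic_2. The coefficient names Class2nd and Class3rd come from the factor levels of this stand-in and will look different for titanic.csv.

```r
# Stand-in data: one row per person, rebuilt from the aggregated table
long <- as.data.frame(Titanic)
ind  <- long[rep(seq_len(nrow(long)), long$Freq),
             c("Class", "Sex", "Age", "Survived")]
ind$Survived <- as.numeric(ind$Survived == "Yes")

# Simple LPM with class dummies only
surv_mod <- lm(Survived ~ Class, data = ind)
b <- coef(surv_mod)

# Intercept = first class; intercept + dummy coefficient = other classes
surv_prob_c1 <- unname(b["(Intercept)"])
surv_prob_c2 <- unname(b["(Intercept)"] + b["Class2nd"])
surv_prob_c3 <- unname(b["(Intercept)"] + b["Class3rd"])

# The same numbers are the class-specific sample means of Survived
tapply(ind$Survived, ind$Class, mean)

# Augmented LPM: the dot stands for all remaining columns of ind
LPM_mod <- lm(Survived ~ ., data = ind)
coef(LPM_mod)
```

The dot in the formula only works as intended if columns that should not enter the model, such as Name, have been dropped beforehand.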
7. Titanic Survival Data — Logistic Regression
Chapter 11.2 introduces logistic regression, also called Logit regression, which is more suitable than the LPM for modelling the conditional probability function of a dichotomous outcome variable. Logit regression uses a nonlinear link function that restricts the fitted values to lie between \(0\) and \(1\): in Logit regression, the log-odds of the outcome are modeled as a linear combination of the predictors, while the LPM assumes that the conditional probability function of the outcome is linear.
The data set Titanic_2 from Exercise 2 is available in your working environment. The package AER is attached.
Instructions:
Use glm() to estimate the model \[\begin{align*} \log\left(\frac{P(Survived_i = 1)}{1-P(Survived_i = 1)}\right) =& \, \beta_0 + \beta_1 Class2_i + \beta_2 Class3_i + \beta_3 Sex_i \\ +& \, \beta_4 Age_i + \beta_5 Siblings_i + \beta_6 Parents_i + \beta_7 Fare_i. \end{align*}\] Save the result to Logit_mod.
Obtain a robust summary of the model coefficients.
The data frame passengers contains data on three hypothetical male passengers that differ only in their passenger class (the other variables are set to the respective sample average). Use Logit_mod to predict the probability of survival for these passengers.
Hints:
Remember that the formula a ~ . specifies a regression of a on all other variables in the data set provided as the argument data in glm().
You need to specify the correct type of prediction in predict().
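A sketch of a Logit fit and of the role of the type argument in predict(). As an assumption, the data frame `ind` rebuilt from the aggregated Titanic table stands in for Titanic_2, and `new_pass` is a hypothetical counterpart of passengers with three adult men who differ only in class. With AER attached, a robust summary would be obtained with `coeftest(logit_fit, vcov. = vcovHC, type = "HC1")`.

```r
# Stand-in data: one row per person, rebuilt from the aggregated table
long <- as.data.frame(Titanic)
ind  <- long[rep(seq_len(nrow(long)), long$Freq),
             c("Class", "Sex", "Age", "Survived")]
ind$Survived <- as.numeric(ind$Survived == "Yes")

# Logit regression of Survived on all other columns of ind
logit_fit <- glm(Survived ~ .,
                 family = binomial(link = "logit"),
                 data = ind)
coef(logit_fit)

# Three hypothetical adult men who differ only in their class
new_pass <- data.frame(
  Class = factor(c("1st", "2nd", "3rd"), levels = levels(ind$Class)),
  Sex   = factor("Male",  levels = levels(ind$Sex)),
  Age   = factor("Adult", levels = levels(ind$Age))
)

# type = "link" returns log-odds, type = "response" returns probabilities
log_odds <- predict(logit_fit, newdata = new_pass, type = "link")
p_hat    <- predict(logit_fit, newdata = new_pass, type = "response")

# The logistic function maps the log-odds to the probabilities
all.equal(p_hat, plogis(log_odds))
p_hat
```

Without `type = "response"`, predict() returns values on the log-odds scale, which can lie outside the unit interval and are easily mistaken for probabilities.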
8. Titanic Survival Data — Probit Regression
Repeat Exercise 7 but this time estimate the Probit model \[\begin{align*} P(Survived_i = 1\vert Class2_i, Class3_i, \dots, Fare_i) =& \, \Phi (\beta_0 + \beta_1 Class2_i + \beta_2 Class3_i + \beta_3 Sex_i \\ +& \, \beta_4 Age_i + \beta_5 Siblings_i + \beta_6 Parents_i + \beta_7 Fare_i). \end{align*}\]
The data set Titanic_2 from the previous exercises as well as the Logit model Logit_mod are available in your working environment. The package AER is attached.
Instructions:
Use glm() to estimate the above Probit model. Save the result to Probit_mod.
Obtain a robust summary of the model coefficients.
The data frame passengers contains data on three hypothetical male passengers that differ only in their passenger class (the other variables are set to the respective sample average). Use Probit_mod to predict the probability of survival for these passengers.
Hints:
Remember that the formula a ~ . specifies a regression of a on all other variables in the data set provided as the argument data in glm().
You need to specify the correct type of prediction in predict().
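The Probit counterpart differs only in the link function. The sketch below makes the same assumptions as before: `ind` is rebuilt from the aggregated Titanic table as a stand-in for Titanic_2, and `new_pass` is a hypothetical replacement for passengers.

```r
# Stand-in data: one row per person, rebuilt from the aggregated table
long <- as.data.frame(Titanic)
ind  <- long[rep(seq_len(nrow(long)), long$Freq),
             c("Class", "Sex", "Age", "Survived")]
ind$Survived <- as.numeric(ind$Survived == "Yes")

# Probit regression: same formula as for Logit, different link
probit_fit <- glm(Survived ~ .,
                  family = binomial(link = "probit"),
                  data = ind)

# Three hypothetical adult men who differ only in their class
new_pass <- data.frame(
  Class = factor(c("1st", "2nd", "3rd"), levels = levels(ind$Class)),
  Sex   = factor("Male",  levels = levels(ind$Sex)),
  Age   = factor("Adult", levels = levels(ind$Age))
)

# Linear index z and implied probabilities Phi(z)
z        <- predict(probit_fit, newdata = new_pass, type = "link")
p_probit <- predict(probit_fit, newdata = new_pass, type = "response")

# The standard normal CDF maps the index to the probabilities
all.equal(p_probit, pnorm(z))
p_probit
```

Probit and Logit coefficients are not directly comparable because the two link functions are scaled differently. Comparing predicted probabilities for the same hypothetical passengers is more informative.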