3.2 Missing data options
When examining the data, you will often find variables with missing values. Standard regression algorithms require all values to be non-missing, so when the data contain missing outcome and/or predictor values, the default behavior of regression software is to remove every case that has any missing value. The result is called a complete case analysis. A more rigorous approach is multiple imputation, which randomly fills in the missing values. Complete case analysis is easier, but “easier” does not mean “better”. For the purpose of learning regression methods, however, it makes sense to start with the easier missing data method and to learn the more rigorous method later, after becoming familiar with regression.
3.2.1 Complete case analysis
A complete case analysis is one in which every case with missing data on any analysis variable is removed from the analysis. In R, complete case analysis is the default when using a regression function: R automatically removes cases that have a missing value for any variable in the model before fitting the regression model. However, if you want to first describe the data used in a complete case analysis, you need to explicitly remove the cases with missing values yourself in order to compute descriptive statistics that are based on the same sample as your regression analysis.
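The default behavior described above can be checked on a small example. The following is a minimal sketch using a made-up data frame (`toy`, with hypothetical variables `y`, `x1`, and `x2`), not the NHANES data used in this chapter. It compares the number of cases in the dataset with the number that `lm()` actually uses.

```r
# Hypothetical toy data (not the NHANES example): 8 cases, some values missing
toy <- data.frame(
  y  = c(10, 12, NA, 15, 11, 14, 13, 16),
  x1 = c( 1,  2,  3, NA,  5,  6,  7,  8),
  x2 = c( 0,  1,  0,  1, NA,  1,  0,  1)
)

# With R's default na.action setting (na.omit), lm() drops any case
# that has a missing value for a variable in the model
fit <- lm(y ~ x1 + x2, data = toy)

n_total    <- nrow(toy)                 # cases in the dataset
n_used     <- nobs(fit)                 # cases actually used to fit the model
n_complete <- sum(complete.cases(toy))  # cases with no missing values

c(total = n_total, used = n_used, complete = n_complete)
```

The number of cases used by the model matches the number of complete cases, not the number of rows in the dataset.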
If you instead use a dataset that contains missing values to compute descriptive statistics for each variable, the results for a specific variable will be based on all available cases with non-missing values for that variable regardless of whether any other variables have missing values for those cases. Thus, descriptive statistics for different variables might be based on different subsets of cases. To ensure that the descriptive statistics for all variables are based on the same sample as each other, and the same sample as will be used when fitting the regression model, remove cases that have missing data for any variable in the analysis.
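To see how available-case and complete-case descriptive statistics can differ, consider the following minimal sketch. The data frame `toy` and its values are made up for illustration and are not from the NHANES example.

```r
# Hypothetical toy data (values made up for illustration)
toy <- data.frame(
  sbp    = c(120,  NA, 135, 110, 142, NA),
  age    = c( 34,  51,  NA,  45,  62, 29),
  income = c(  1,   2,   2,  NA,   3,  1)
)

# Available cases: each variable uses its own set of non-missing values
n_available     <- colSums(!is.na(toy))
means_available <- colMeans(toy, na.rm = TRUE)

# Complete cases: every variable uses the same set of cases
toy_cc         <- toy[complete.cases(toy), ]
n_complete     <- nrow(toy_cc)
means_complete <- colMeans(toy_cc)

n_available      # sample size differs by variable
n_complete       # a single sample size shared by all variables
means_available
means_complete
```

In this sketch, the available-case sample size differs from one variable to the next, whereas the complete-case statistics are all computed from the same cases.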
Example 3.1 (continued): For the same data and variables summarized previously, the following code demonstrates two approaches to creating a complete case analysis dataset. Both keep the correct cases, but the first (shown in two equivalent ways) retains all the variables in the dataset, including those not used in the analysis, while the second retains only the analysis variables.
# Remove cases with missing data and keep all variables
SUB <- complete.cases(nhanes[, c("sbp", "RIDAGEYR", "RIAGENDR", "income")])
complete.dat <- subset(nhanes, subset = SUB)
# Number of cases and variables in the original dataset
dim(nhanes)
## [1] 1000   85
# Number of cases and variables in the complete case dataset
dim(complete.dat)
## [1] 855  85
# Alternative that does the same thing
complete.dat <- nhanes %>%
  drop_na(sbp, RIDAGEYR, RIAGENDR, income)
# Number of cases and variables in the complete case dataset
dim(complete.dat)
## [1] 855 85
# Remove cases with missing data and only keep variables of interest
complete.dat <- nhanes %>%
  select(sbp, RIDAGEYR, RIAGENDR, income) %>%
  drop_na()
# Number of cases and variables in the complete case dataset
dim(complete.dat)
## [1] 855 4
# Summarize the variables in the complete case dataset
summary(complete.dat)
##       sbp         RIDAGEYR      RIAGENDR                  income
##  Min.   : 89   Min.   :20.0   Male  :425   <$25,000           :148
##  1st Qu.:111   1st Qu.:32.0   Female:430   $25,000 to <$55,000:248
##  Median :121   Median :48.0                $55,000+           :459
##  Mean   :124   Mean   :48.1
##  3rd Qu.:134   3rd Qu.:62.0
##  Max.   :234   Max.   :80.0
3.2.2 Multiple imputation
Multiple imputation (MI) of missing data is a method of filling in missing values based on the observed associations between the analysis variables. The filling in (imputation) is done randomly and multiple times to avoid pretending the missing values are known with certainty. Following MI, fit the regression model multiple times, once using each imputed dataset, and combine the results using a set of pooling rules commonly called Rubin's rules (Rubin 1987).
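To give a sense of what combining the results involves, the following is a minimal sketch of the core pooling formulas for a single regression coefficient: the pooled estimate is the average of the per-dataset estimates, and its variance adds the average within-imputation variance to an inflated between-imputation variance. The function name `pool_rubin` and the input numbers are hypothetical and for illustration only. The sketch omits the degrees of freedom calculation needed for confidence intervals and tests, and in practice software carries out this step for you.

```r
# Combine estimates of one regression coefficient across m imputed datasets
# est: coefficient estimate from each imputed dataset
# se:  standard error of that estimate from each imputed dataset
pool_rubin <- function(est, se) {
  m     <- length(est)
  q_bar <- mean(est)                  # pooled estimate
  u_bar <- mean(se^2)                 # average within-imputation variance
  b     <- var(est)                   # between-imputation variance
  t_var <- u_bar + (1 + 1 / m) * b    # total variance
  c(estimate = q_bar, within = u_bar, between = b, se = sqrt(t_var))
}

# Made-up numbers for illustration (not results from the NHANES example)
est <- c(0.50, 0.54, 0.46, 0.52, 0.48)
se  <- c(0.10, 0.11, 0.10, 0.09, 0.10)

pooled <- pool_rubin(est, se)
pooled
```

The pooled standard error is larger than it would be if the within-imputation variance were used alone. The extra between-imputation term is how the uncertainty about the missing values enters the final result.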
In the example we have been using, the original dataset has 1000 cases, of which only 855 have complete data on all the variables of interest. A complete case analysis uses just those 855 cases. An MI analysis, in contrast, uses all 1000 cases. Because some values are missing, though, we do not actually have as much information as 1000 complete cases would provide. The mathematical rules used to combine the results appropriately account for the fact that we have more information than just the 855 complete cases, but less information than a dataset with 1000 complete cases. Multiple imputation will be discussed in more detail in Chapter 9.