7.5 Survival analysis dataset structure
The R functions we will use for survival analysis require a dataset with a specific structure. There must be a numeric event time variable and a binary event indicator variable, coded as numeric, with values of 1 for events, 0 for censored event times, and no other non-missing values. There can also be additional variables representing predictors. In the basic set-up, the predictors do not vary over time and so there is one row per individual. Later, we will discuss time-varying predictors (Section 7.14) which require a dataset with multiple rows per individual.
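As a minimal sketch of this structure, the toy data frame below has one row per individual, a numeric event time, a 0/1 event indicator, and one time-invariant predictor. The data frame `toy`, its column names, and its values are hypothetical and invented purely for illustration. The call to `Surv()` assumes the survival package is among the functions referred to above.

```r
library(survival)  # assumption: Surv() from the survival package

# Hypothetical toy data: one row per individual, values invented for illustration
toy <- data.frame(
  id    = 1:4,
  time  = c(5.2, 3.1, 8.0, 8.0),  # numeric event (or censoring) time
  event = c(1, 1, 0, 0),          # 1 = event observed, 0 = censored
  age   = c(61, 54, 47, 70)       # a time-invariant predictor
)

# Pair each time with its indicator; censored times print with a trailing "+"
toy_surv <- Surv(time = toy$time, event = toy$event)
toy_surv
```

Keeping the time and indicator as two plain numeric columns, rather than storing a combined object in the dataset, makes them easy to check with the base R functions used in the examples that follow.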
Example 7.1 (continued): The first five rows of the Natality teaching dataset, shown below, include the event time (gestage37), the indicator of preterm birth (preterm01), and a few time-invariant demographic variables and other risk factors – mother’s age (MAGER), mother’s race/Hispanic origin (MRACEHISP), previous preterm birth (RF_PPTERM), and previous Cesarean birth (RF_CESAR). Four of the five births have a gestational age censored at 37 weeks (preterm01 = 0), and one was preterm at gestational age 31 weeks (preterm01 = 1).
library(dplyr)  # provides %>% and select()
load("Data/natality2018_rmph.Rdata")
natality %>%
  select(gestage37, preterm01, MAGER, MRACEHISP, RF_PPTERM, RF_CESAR) %>%
  head(5)
## # A tibble: 5 × 6
##   gestage37 preterm01 MAGER MRACEHISP RF_PPTERM RF_CESAR
##       <dbl>     <dbl> <dbl> <fct>     <fct>     <fct>
## 1        37         0    35 Hispanic  No        Yes
## 2        31         1    28 NH White  Yes       No
## 3        37         0    22 NH Black  No        No
## 4        37         0    35 NH White  No        No
## 5        37         0    30 NH White  No        No
Verify the event time variable (gestage37) is numeric using is.numeric() and summarize the event times using summary().
## [1] TRUE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    17.0    37.0    37.0    36.4    37.0    37.0
Similarly, verify the event indicator (preterm01) is numeric and use table() to verify its only non-missing values are 0 and 1.
## [1] TRUE
##
##    0    1
## 1748  252
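The checks described above can be sketched on a small stand-in built from the five rows printed earlier, so the code runs without the data file. The name `natality5` and the helper `check_surv_vars()` are hypothetical (the helper is not from any package). With the full dataset loaded, the same calls would be applied to `natality$gestage37` and `natality$preterm01`.

```r
# First five rows of the Natality teaching dataset, as printed earlier
natality5 <- data.frame(
  gestage37 = c(37, 31, 37, 37, 37),
  preterm01 = c(0, 1, 0, 0, 0)
)

# Event time: must be numeric
is.numeric(natality5$gestage37)
summary(natality5$gestage37)

# Event indicator: must be numeric with only 0/1 among non-missing values
is.numeric(natality5$preterm01)
table(natality5$preterm01, useNA = "ifany")

# Hypothetical helper (not from any package): TRUE if both variables
# meet the requirements described in this section
check_surv_vars <- function(time, event) {
  is.numeric(time) &&
    is.numeric(event) &&
    all(event[!is.na(event)] %in% c(0, 1))
}

check_surv_vars(natality5$gestage37, natality5$preterm01)
```

Adding `useNA = "ifany"` to `table()` is a design choice that makes any missing indicator values visible in the same tabulation.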
In Example 7.1, the event indicator variable was already in the correct format. What if it is not?
Example 7.2: The Digitalis Investigation Group (DIG) teaching dataset (dig_rmph.rData, see Appendix A.6) contains data from a clinical trial investigating the safety and efficacy of Digoxin for treating congestive heart failure. One of the endpoints measured was toxicity (DIG). Examine if this event indicator variable is numeric with values 0 and 1 and, if it is not, then convert it to that form.
## [1] FALSE
##
## No Event First Event
## 6702 98
The variable is not numeric, but it does have just two values. Create a numeric event indicator variable that is 1 when the original variable is “First Event”, and use table() to check the derivation. The syntax dig$DIG == "First Event" creates a logical vector of TRUE and FALSE values, and as.numeric() converts that logical vector to numeric, converting TRUE to 1 and FALSE to 0.
dig$DIG_event <- as.numeric(dig$DIG == "First Event")
table(dig$DIG, dig$DIG_event, useNA = "ifany")
##
##                  0    1
##   No Event    6702    0
##   First Event    0   98
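Two details of this conversion are easy to miss, and the sketch below illustrates them on a hypothetical stand-in for the toxicity variable (`dig_toy`, a small factor invented for illustration, with one missing value added deliberately). First, a missing value in the original variable stays missing in the derived indicator. Second, calling as.numeric() directly on a factor returns its internal level codes (1 and 2 here), not 0 and 1, which is why the comparison to "First Event" is needed.

```r
# Hypothetical stand-in for dig$DIG: a two-level factor with one missing value
dig_toy <- factor(c("No Event", "First Event", NA, "No Event"),
                  levels = c("No Event", "First Event"))

# Comparison first, then conversion: TRUE -> 1, FALSE -> 0, NA stays NA
event01 <- as.numeric(dig_toy == "First Event")
event01              # 0 1 NA 0

# Pitfall: converting the factor directly yields level codes, not 0/1
as.numeric(dig_toy)  # 1 2 NA 1

# Cross-tabulate, including missing values, to check the derivation
table(dig_toy, event01, useNA = "ifany")
```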
The datasets used in this text all include an event time variable. However, in your future work you may encounter datasets for which you have to compute the event time. For example, you may be given the dates the individuals started being observed and dates that events occurred (or were censored). Computing the event times, the times between those dates, is facilitated in R by using date-formatted variables and functions specifically designed to count time units between date-formatted variables. See, for example, the chapter “Dates and Times” in R for Data Science (H. Wickham, Çetinkaya-Rundel, and Grolemund 2023).
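As a minimal base R sketch of that computation, the example below stores start and end dates with as.Date() and counts the days between them with difftime(). The data frame `records`, its column names, and its dates are hypothetical and invented for illustration. The choice of days as the time unit is also an assumption and should match the time scale of the analysis.

```r
# Hypothetical study records: names and dates invented for illustration only
records <- data.frame(
  id         = 1:3,
  start_date = as.Date(c("2020-01-01", "2020-01-15", "2020-02-01")),
  end_date   = as.Date(c("2020-01-31", "2020-03-15", "2020-02-29")),
  event      = c(1, 0, 1)  # 1 = event on end_date, 0 = censored on end_date
)

# Event time = days from start of observation to event or censoring,
# converted from a difftime object to a plain numeric vector
records$time_days <- as.numeric(
  difftime(records$end_date, records$start_date, units = "days")
)

records$time_days  # 30 60 28
```

Wrapping the result in as.numeric() matters because the difference between two dates is a difftime object, while the event time variable needs to be numeric.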