2.1 Estimating survival by means of the Kaplan Meier estimator
If there are no censored observations in a sample of dimension \(n\), the most natural estimator for survival is the empirical estimator, given by
\[ \hat S(t) = P(T \gt t) = \frac{1}{n} \sum_{i=1}^{n} I(t_i \gt t) \] that is, the proportion of observations with failure times greater than \(t\).
x <- c(1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23)
sum(x > 8)/length(x) # hat S(8)
## [1] 0.3809524
sum(x > 12)/length(x) # hat S(12)
## [1] 0.1904762
Another option for estimating survival could be to use the hazard:
\[ \hat S(t) = \prod_{k = 1}^{t-1} \bigg [ 1- \hat \lambda(k)\bigg] \quad {\text{where}} \quad \hat \lambda(t) = \frac{\sum_{i=1}^{n} I(Y_i = t)}{\sum_{i=1}^{n} I (Y_i \ge t)} \]
Note that \(\hat \lambda(t)\) is obtained as the number of individuals that die at time \(t\) divided by the number of individuals that survive to \(t\), the number of individuals at risk at time \(t\) (using the death as event).
However, alternative methods are necessary to incorporate censoring (censored times are different than event times).
# preprocesing data
head(loan)[, c(51, 65, 6, 7, 19, 18, 50)]
## LoanKey LoanOriginationDate LoanStatus
## 1 E33A3400205839220442E84 2007-09-12 00:00:00 Completed
## 2 9E3B37071505919926B1D82 2014-03-03 00:00:00 Current
## 3 6954337960046817851BCB2 2007-01-17 00:00:00 Completed
## 4 A0393664465886295619C51 2012-11-01 00:00:00 Current
## 5 A180369302188889200689E 2013-09-20 00:00:00 Current
## 6 C3D63702273952547E79520 2013-12-24 00:00:00 Current
## ClosedDate Occupation BorrowerState StatedMonthlyIncome
## 1 2009-08-14 00:00:00 Other CO 3083.333
## 2 Professional CO 6125.000
## 3 2009-12-17 00:00:00 Other GA 2083.333
## 4 Skilled Labor GA 2875.000
## 5 Executive MN 9583.333
## 6 Professional NM 8333.333
table(loan$LoanStatus)
##
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
# removing duplicates
loan_nd <- loan[unique(loan$LoanKey), ]
# removing LoanStatus no needed
sel_status <- loan_nd$LoanStatus %in% c("Completed", "Current",
"ChargedOff", "Defaulted",
"Cancelled")
loan_filtered <- loan_nd[sel_status, ]
# creating status variable for censoring
loan_filtered$status <- ifelse(
loan_filtered$LoanStatus == "Defaulted" |
loan_filtered$LoanStatus == "Chargedoff", 1, 0)
# adding the final date to "current" status
head(levels(loan_filtered$ClosedDate))
## [1] "" "2005-11-25 00:00:00" "2005-11-29 00:00:00"
## [4] "2005-11-30 00:00:00" "2005-12-08 00:00:00" "2005-12-28 00:00:00"
levels(loan_filtered$ClosedDate)[1] <- "2014-11-03 00:00:00"
# creating the time-to-event variable
loan_filtered$start <- as.Date(loan_filtered$LoanOriginationDate)
loan_filtered$end <- as.Date(loan_filtered$ClosedDate)
loan_filtered$time <- as.numeric(difftime(loan_filtered$end, loan_filtered$start, units = "days"))
# there is an error in the data (time to event less than 0)
loan_filtered <- loan_filtered[-loan_filtered$time < 0, ]
# just considering a year of loans creation
ii <- format(as.Date(loan_filtered$LoanOriginationDate),'%Y') %in% c("2006")
loan_filtered <- loan_filtered[ii, ]
dim(loan_filtered)
## [1] 4923 85
head(loan_filtered)[, c(51, 65, 6, 7, 19, 18, 50, 83, 84, 85)]
## LoanKey LoanOriginationDate LoanStatus
## 55706 569F3376160094112B0CCBC 2006-12-07 00:00:00 Completed
## 5258 E3C433749566192177F6A25 2006-11-21 00:00:00 Completed
## 64330 3E5A33783711441966A924A 2006-12-29 00:00:00 Completed
## 1485 AC4533744391314602B8E3A 2006-12-07 00:00:00 Completed
## 22540 08B63364821540522E94FD2 2006-06-13 00:00:00 Completed
## 50637 31C9337247671326054DF29 2006-11-03 00:00:00 Completed
## ClosedDate Occupation BorrowerState StatedMonthlyIncome
## 55706 2009-07-27 00:00:00 Professional 4534.250
## 5258 2008-07-03 00:00:00 Other 3833.333
## 64330 2009-12-29 00:00:00 Doctor 18083.333
## 1485 2008-11-21 00:00:00 Other IN 4576.000
## 22540 2007-09-05 00:00:00 3458.333
## 50637 2009-11-03 00:00:00 Professional IN 15666.667
## start end time
## 55706 2006-12-07 2009-07-27 963
## 5258 2006-11-21 2008-07-03 590
## 64330 2006-12-29 2009-12-29 1096
## 1485 2006-12-07 2008-11-21 715
## 22540 2006-06-13 2007-09-05 449
## 50637 2006-11-03 2009-11-03 1096
#------
# censoring status 0 = censored, 1 = no censored (default)
table(loan_filtered$status)
##
## 0 1
## 3560 1363
prop.table(table(loan_filtered$status))
##
## 0 1
## 0.7231363 0.2768637
# median time until default (taking into account just no cendored data)
median(loan_filtered$time[loan_filtered$status==1]) # I'm underestimating
## [1] 333
# median time until default (with all data)
mean(loan_filtered$time) # I'm underestimating the median survival too
## [1] 633.4331
# (in censored times, the real time is bigger)
Kaplan and Meier (1958) obtained a nonparametric estimate of the survival function, called product-limit, which is the generalization of the empirical estimator for censored data
\[ \hat S(t) = P(T \gt t) = \prod_{i:t_i \le t} \bigg[1-\frac{d_i}{n_i} \bigg] \] where \(t_1, t_2, \ldots,t_n\) are the observed event times, \(d_i\) is the number of events at time \(t_i\), and \(n_i\) is the number of individuals at risk at time \(t_i\) (i.e, the original sample minus all those who had the event before \(t_j\).)
Note that \(d_i/n_i\) is the proportion that failed at the event time \(t_i\) and \(1 - d_i/n_i\) is the proportion surviving the event time \(t_j\).
The Kaplan-Meier estimate is a step function with jumps at event times. The size of the steps depend on the number of events and the number of individuals at risk at the corresponding time. Note that if the last data is censored, the estimator will not reach the zero value.
Without censoring, the estimator is equivalent to the empirical survival function \(\hat S(t) = \frac{1}{n} \sum_{i=1}^{n} I(t_i \gt t)\) or to the one using the risk estimates \(\hat S(t) = \prod_{k = 1}^{t-1} \bigg [ 1- \hat \lambda(k)\bigg]\).
km <- survfit(Surv(time, status) ~ 1, data = loan_filtered)
km # we can see the correct estimated median
## Call: survfit(formula = Surv(time, status) ~ 1, data = loan_filtered)
##
## n events median 0.95LCL 0.95UCL
## 4923 1363 1189 1158 1217
print(km, print.rmean = TRUE)
## Call: survfit(formula = Surv(time, status) ~ 1, data = loan_filtered)
##
## n events *rmean *se(rmean) median 0.95LCL
## 4923.00 1363.00 926.53 6.85 1189.00 1158.00
## 0.95UCL
## 1217.00
## * restricted mean with upper limit = 1224
?Surv
of the survival package.
2.1.1 Other representation
Assume that \(\widetilde T_i = min (T_i, C_i)\) and \(\Delta_i = I (T_i \le C_i)\), we introduce a weighted average representation of the Kaplan-Meier estimator which will be used later to introduce estimators for the conditional survival function
\[\begin{equation*} \widehat S(y)=1-\sum_{i=1}^{n}W_{i}I(\widetilde T_{(i)}\leq y), \end{equation*}\]where \(\widetilde T_{\left( 1\right) }\leq ...\leq \widetilde T_{\left( n\right) }\) denotes the ordered \(\widetilde T\)-sample and
\[\begin{equation*} W_{i}=\frac{\Delta_{\left[ i\right] }}{n-i+1}\prod_{j=1}^{i-1}\left[ 1-\frac{% \Delta _{\left[ j\right] }}{n-j+1}\right] \end{equation*}\]is the Kaplan-Meier weight attached to \(\widetilde T_{\left( i\right) }\). In the expression of \(W_{i}\) notation \(\Delta_{\left[ i\right] }\) is used for the \(i\)-th concomitant value of the censoring indicator (that is, \(\Delta_{\left[ i \right] }=\Delta _{j}\) if \(\widetilde T_{\left( i\right) }=\widetilde T_{j}\)).
References
Kaplan, E.L., and P. Meier. 1958. “Nonparametric Estimation from Incomplete Observations.” Journal of the American Statistical Association 53: 457–81.