A short course on Survival Analysis applied to the Financial Industry

2.1 Estimating survival by means of the Kaplan Meier estimator

If there are no censored observations in a sample of dimension \(n\), the most natural estimator for survival is the empirical estimator, given by

\[ \hat S(t) = P(T \gt t) = \frac{1}{n} \sum_{i=1}^{n} I(t_i \gt t) \] that is, the proportion of observations with failure times greater than \(t\).

x <- c(1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23)
sum(x > 8)/length(x) # hat S(8) 
## [1] 0.3809524
sum(x > 12)/length(x) # hat S(12) 
## [1] 0.1904762

Another option for estimating survival could be to use the hazard:

\[ \hat S(t) = \prod_{k = 1}^{t-1} \bigg [ 1- \hat \lambda(k)\bigg] \quad {\text{where}} \quad \hat \lambda(t) = \frac{\sum_{i=1}^{n} I(Y_i = t)}{\sum_{i=1}^{n} I (Y_i \ge t)} \]

Note that \(\hat \lambda(t)\) is obtained as the number of individuals that die at time \(t\) divided by the number of individuals that survive to \(t\), the number of individuals at risk at time \(t\) (using the death as event).

However, alternative methods are necessary to incorporate censoring (censored times are different than event times).

#  preprocesing data

head(loan)[, c(51, 65, 6, 7, 19, 18, 50)]
##                   LoanKey LoanOriginationDate LoanStatus
## 1 E33A3400205839220442E84 2007-09-12 00:00:00  Completed
## 2 9E3B37071505919926B1D82 2014-03-03 00:00:00    Current
## 3 6954337960046817851BCB2 2007-01-17 00:00:00  Completed
## 4 A0393664465886295619C51 2012-11-01 00:00:00    Current
## 5 A180369302188889200689E 2013-09-20 00:00:00    Current
## 6 C3D63702273952547E79520 2013-12-24 00:00:00    Current
##            ClosedDate    Occupation BorrowerState StatedMonthlyIncome
## 1 2009-08-14 00:00:00         Other            CO            3083.333
## 2                      Professional            CO            6125.000
## 3 2009-12-17 00:00:00         Other            GA            2083.333
## 4                     Skilled Labor            GA            2875.000
## 5                         Executive            MN            9583.333
## 6                      Professional            NM            8333.333
table(loan$LoanStatus)
## 
##              Cancelled             Chargedoff              Completed 
##                      5                  11992                  38074 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                   5018                    205 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     16                    806                    265 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    363                    313                    304

# removing duplicates
loan_nd <- loan[unique(loan$LoanKey), ] 

# removing LoanStatus no needed 
sel_status  <- loan_nd$LoanStatus %in% c("Completed", "Current", 
                                          "ChargedOff", "Defaulted", 
                                          "Cancelled")
loan_filtered <- loan_nd[sel_status, ]

# creating status variable for censoring
loan_filtered$status <- ifelse(
  loan_filtered$LoanStatus == "Defaulted" |
    loan_filtered$LoanStatus == "Chargedoff",  1, 0)

# adding the final date to "current" status
head(levels(loan_filtered$ClosedDate))
## [1] ""                    "2005-11-25 00:00:00" "2005-11-29 00:00:00"
## [4] "2005-11-30 00:00:00" "2005-12-08 00:00:00" "2005-12-28 00:00:00"
levels(loan_filtered$ClosedDate)[1] <- "2014-11-03 00:00:00"

# creating the time-to-event variable
loan_filtered$start <- as.Date(loan_filtered$LoanOriginationDate)
loan_filtered$end <- as.Date(loan_filtered$ClosedDate)
loan_filtered$time <- as.numeric(difftime(loan_filtered$end, loan_filtered$start, units = "days"))

# there is an error in the data (time to event less than 0)
loan_filtered <- loan_filtered[-loan_filtered$time < 0, ]

# just considering a year of loans creation
ii <- format(as.Date(loan_filtered$LoanOriginationDate),'%Y') %in% c("2006")
loan_filtered <- loan_filtered[ii, ] 


dim(loan_filtered)
## [1] 4923   85
head(loan_filtered)[, c(51, 65, 6, 7, 19, 18, 50, 83, 84, 85)]
##                       LoanKey LoanOriginationDate LoanStatus
## 55706 569F3376160094112B0CCBC 2006-12-07 00:00:00  Completed
## 5258  E3C433749566192177F6A25 2006-11-21 00:00:00  Completed
## 64330 3E5A33783711441966A924A 2006-12-29 00:00:00  Completed
## 1485  AC4533744391314602B8E3A 2006-12-07 00:00:00  Completed
## 22540 08B63364821540522E94FD2 2006-06-13 00:00:00  Completed
## 50637 31C9337247671326054DF29 2006-11-03 00:00:00  Completed
##                ClosedDate   Occupation BorrowerState StatedMonthlyIncome
## 55706 2009-07-27 00:00:00 Professional                          4534.250
## 5258  2008-07-03 00:00:00        Other                          3833.333
## 64330 2009-12-29 00:00:00       Doctor                         18083.333
## 1485  2008-11-21 00:00:00        Other            IN            4576.000
## 22540 2007-09-05 00:00:00                                       3458.333
## 50637 2009-11-03 00:00:00 Professional            IN           15666.667
##            start        end time
## 55706 2006-12-07 2009-07-27  963
## 5258  2006-11-21 2008-07-03  590
## 64330 2006-12-29 2009-12-29 1096
## 1485  2006-12-07 2008-11-21  715
## 22540 2006-06-13 2007-09-05  449
## 50637 2006-11-03 2009-11-03 1096

#------



# censoring status 0 = censored, 1 = no censored (default)
table(loan_filtered$status)
## 
##    0    1 
## 3560 1363
prop.table(table(loan_filtered$status))
## 
##         0         1 
## 0.7231363 0.2768637


# median time until default (taking into account just no cendored data)
median(loan_filtered$time[loan_filtered$status==1])  # I'm underestimating
## [1] 333


# median time until default (with all data)
mean(loan_filtered$time)  # I'm underestimating the median survival too 
## [1] 633.4331
# (in censored times, the real time is bigger)

Kaplan and Meier (1958) obtained a nonparametric estimate of the survival function, called product-limit, which is the generalization of the empirical estimator for censored data

\[ \hat S(t) = P(T \gt t) = \prod_{i:t_i \le t} \bigg[1-\frac{d_i}{n_i} \bigg] \] where \(t_1, t_2, \ldots,t_n\) are the observed event times, \(d_i\) is the number of events at time \(t_i\), and \(n_i\) is the number of individuals at risk at time \(t_i\) (i.e, the original sample minus all those who had the event before \(t_j\).)

Note that \(d_i/n_i\) is the proportion that failed at the event time \(t_i\) and \(1 - d_i/n_i\) is the proportion surviving the event time \(t_j\).

The Kaplan-Meier estimate is a step function with jumps at event times. The size of the steps depend on the number of events and the number of individuals at risk at the corresponding time. Note that if the last data is censored, the estimator will not reach the zero value.

Without censoring, the estimator is equivalent to the empirical survival function \(\hat S(t) = \frac{1}{n} \sum_{i=1}^{n} I(t_i \gt t)\) or to the one using the risk estimates \(\hat S(t) = \prod_{k = 1}^{t-1} \bigg [ 1- \hat \lambda(k)\bigg]\).

km <- survfit(Surv(time, status) ~ 1, data = loan_filtered)
km  # we can see the correct estimated median
## Call: survfit(formula = Surv(time, status) ~ 1, data = loan_filtered)
## 
##       n  events  median 0.95LCL 0.95UCL 
##    4923    1363    1189    1158    1217
print(km, print.rmean = TRUE)
## Call: survfit(formula = Surv(time, status) ~ 1, data = loan_filtered)
## 
##          n     events     *rmean *se(rmean)     median    0.95LCL 
##    4923.00    1363.00     926.53       6.85    1189.00    1158.00 
##    0.95UCL 
##    1217.00 
##     * restricted mean with upper limit =  1224

Take a look at ?Surv of the survival package.

2.1.1 Other representation

Assume that \(\widetilde T_i = min (T_i, C_i)\) and \(\Delta_i = I (T_i \le C_i)\), we introduce a weighted average representation of the Kaplan-Meier estimator which will be used later to introduce estimators for the conditional survival function

\[\begin{equation*} \widehat S(y)=1-\sum_{i=1}^{n}W_{i}I(\widetilde T_{(i)}\leq y), \end{equation*}\]

where \(\widetilde T_{\left( 1\right) }\leq ...\leq \widetilde T_{\left( n\right) }\) denotes the ordered \(\widetilde T\)-sample and

\[\begin{equation*} W_{i}=\frac{\Delta_{\left[ i\right] }}{n-i+1}\prod_{j=1}^{i-1}\left[ 1-\frac{% \Delta _{\left[ j\right] }}{n-j+1}\right] \end{equation*}\]

is the Kaplan-Meier weight attached to \(\widetilde T_{\left( i\right) }\). In the expression of \(W_{i}\) notation \(\Delta_{\left[ i\right] }\) is used for the \(i\)-th concomitant value of the censoring indicator (that is, \(\Delta_{\left[ i \right] }=\Delta _{j}\) if \(\widetilde T_{\left( i\right) }=\widetilde T_{j}\)).

References

Kaplan, E.L., and P. Meier. 1958. “Nonparametric Estimation from Incomplete Observations.” Journal of the American Statistical Association 53: 457–81.