1.1 What is survival analysis?
Broadly, survival analysis is a collection of statistical procedures for analyzing data in which the outcome variable of interest is the time until an event occurs, often referred to as a failure time, survival time, or event time.
Survival time is a variable that measures the time from a particular starting point (e.g., the time at which treatment was initiated) to a particular endpoint of interest; hence the term time-to-event.
The problem of analyzing time-to-event data arises in a number of applied fields, such as:
- medicine, biology, public health (time to death)
- social sciences (time to complete a task)
- economics (time spent looking for employment)
- finance or credit scoring (time to default)
- engineering (time to failure of an electronic component)
1.1.1 Time, time origin, time scale, event
In survival analysis, three requirements must be met to define the failure time of an individual precisely: a time origin must be specified, a time scale for measuring time must be agreed upon, and the meaning of failure (the event) must be clear.
By time, we mean the years, months, weeks, or days from the beginning of an individual's follow-up until the event under study occurs; the scale must be stated explicitly.
By time origin, we mean the time of entry into the study.
By event, we mean, depending on the field: death, disease incidence, or recovery (e.g., return to work) in biomedical applications; default in credit scoring; renewal in insurance; a fault in engineering; and so on.
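The three ingredients above (time origin, time scale, event) can be sketched in R for a single hypothetical individual. The dates and the variable names below are made up for illustration and do not come from any dataset; the conversion to months uses an average month length, which is one convention among several.

```r
# Hypothetical individual: both dates are illustrative assumptions
origin <- as.Date("2012-01-15")   # time origin: entry into the study
event  <- as.Date("2013-06-30")   # date on which the event of interest occurs

# Time scale: days elapsed since the time origin
time_days <- as.numeric(difftime(event, origin, units = "days"))

# The same survival time on a coarser scale (average month length in days)
time_months <- time_days / 30.4375

time_days
time_months
```

Changing the time origin or the time scale changes the recorded survival time, which is why both must be fixed before any analysis.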
Generally, we will assume that only one event is of designated interest. When more than one event is considered (e.g., death from any of several causes), the statistical problem can be characterized as either a recurrent-events or a competing-risks problem. We will cover the case of recurrent events using the condSURV package in Chapter 5.
We now turn to an example with a real dataset: the Prosper Loan data provided by the Udacity Data Analyst Nanodegree (last updated 3/11/14), which is also available on Kaggle.
Prosper.com is a peer-to-peer lending marketplace. Borrowers make loan requests and investors contribute as little as $25 towards the loans of their choice. Historically, Prosper made its loan data public nightly; however, effective January 2015, information is made available 45 days after the end of the quarter.
The data can be downloaded from the URL used in the code below; a variable dictionary is also available.
# Prosper Loan data
web <- "https://s3.amazonaws.com/udacity-hosted-downloads/ud651/prosperLoanData.csv"
loan <- read.csv(web)
# First rows of the loan key, dates, status, and borrower covariates (selected by column position)
head(loan)[, c(51, 65, 6, 7, 19, 18, 50)]
##                   LoanKey LoanOriginationDate LoanStatus
## 1 E33A3400205839220442E84 2007-09-12 00:00:00  Completed
## 2 9E3B37071505919926B1D82 2014-03-03 00:00:00    Current
## 3 6954337960046817851BCB2 2007-01-17 00:00:00  Completed
## 4 A0393664465886295619C51 2012-11-01 00:00:00    Current
## 5 A180369302188889200689E 2013-09-20 00:00:00    Current
## 6 C3D63702273952547E79520 2013-12-24 00:00:00    Current
##            ClosedDate    Occupation BorrowerState StatedMonthlyIncome
## 1 2009-08-14 00:00:00         Other            CO            3083.333
## 2                      Professional            CO            6125.000
## 3 2009-12-17 00:00:00         Other            GA            2083.333
## 4                     Skilled Labor            GA            2875.000
## 5                         Executive            MN            9583.333
## 6                      Professional            NM            8333.333
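To connect this output with the definitions of time origin, time scale, and event, the sketch below re-enters the six rows shown above and derives a time variable from them. It takes the loan origination date as the time origin, days as the time scale, and loan closure as the endpoint. These choices, and the names `loan6` and `time_days`, are assumptions made for illustration rather than the definitions used later in the text. Loans that are still `Current` have no closing date, so their time is left as `NA` here.

```r
# The six loans printed above, re-entered by hand so the example is self-contained
loan6 <- data.frame(
  LoanOriginationDate = c("2007-09-12 00:00:00", "2014-03-03 00:00:00",
                          "2007-01-17 00:00:00", "2012-11-01 00:00:00",
                          "2013-09-20 00:00:00", "2013-12-24 00:00:00"),
  LoanStatus = c("Completed", "Current", "Completed",
                 "Current", "Current", "Current"),
  ClosedDate = c("2009-08-14 00:00:00", "", "2009-12-17 00:00:00", "", "", ""),
  stringsAsFactors = FALSE
)

# Keep only the date part (first 10 characters) and treat empty strings as missing
opened <- substr(loan6$LoanOriginationDate, 1, 10)
closed <- substr(loan6$ClosedDate, 1, 10)
closed[closed == ""] <- NA

# Time from origination (assumed time origin) to closure, in days
loan6$time_days <- as.numeric(as.Date(closed) - as.Date(opened))

loan6[, c("LoanStatus", "time_days")]
```

The same steps could be applied to the full `loan` data frame, since it has the same column names.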
1.1.2 Goals of the survival analysis
- Estimate time-to-event for a group of individuals, such as the time until default for a group of clients.
- Compare time-to-event between two or more groups, such as clients grouped by place of residence.
- Assess the relationship of covariates to time-to-event, such as occupation, state, income, etc.
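As a rough illustration, each of these three goals can be matched with a standard tool from the `survival` package: a Kaplan-Meier estimate, a log-rank test, and a Cox regression. The sketch below runs on simulated data. The data frame `toy`, its columns, and the group labels are hypothetical and are not taken from the Prosper data. For simplicity, every simulated individual experiences the event.

```r
library(survival)

# Simulated toy data (hypothetical; not the Prosper loans)
set.seed(1)
n <- 200
toy <- data.frame(
  group  = rep(c("A", "B"), each = n / 2),   # two groups to compare
  income = runif(n, 1, 10)                   # covariate, in arbitrary units
)
# Event times: group B is simulated with a higher event rate than group A
toy$time   <- rexp(n, rate = ifelse(toy$group == "A", 0.02, 0.04))
toy$status <- 1                              # 1 = event observed for everyone

# Goal 1: estimate time-to-event for the whole group (Kaplan-Meier)
km <- survfit(Surv(time, status) ~ 1, data = toy)

# Goal 2: compare time-to-event between groups (log-rank test)
lr <- survdiff(Surv(time, status) ~ group, data = toy)

# Goal 3: assess the relationship of covariates to time-to-event (Cox model)
cox <- coxph(Surv(time, status) ~ group + income, data = toy)

print(km)
print(lr)
summary(cox)
```

With real data, `time` and `status` would be built from the study's own time origin, time scale, and event definition.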