笔记 21 生存分析
21.1 Concepts
- examines and models the time it takes for events to occur
- the distribution of survival times
- Popular: Cox proportional-hazards regression model
21.2 Notation
- T as a random variable with cumulative distribution function \(P (t) = Pr(T ≤ t)\) and probability density function \(p(t) = \frac{dP (t)}{dt}\)
- survival function S(t) is the complement of the distribution function, \(S(t) = Pr(T > t) = 1 − P (t)\)
- hazard function \(log h(t) = ν + ρt\)
21.3 Cox proportional-hazards regression model
- \(log h_i(t)=α+_1x_{i1} +β_2x_{ik} +···+β_kx_{ik}\)
- Cox model
- \(log h_i(t)=α(t)+β_1x_{i1} +β_2x_{ik} +···+β_kx_{ik}\)
- the Cox model is a proportional-hazards model
- \(\frac{h_i(t)}{h_{i'}(t)} = \frac{e^{\eta_i}}{e^{\eta'}}\)
21.4 Case: Recidivism
- Target: recidivism of 432 male prisoners, who were observed for a year after being released from prison
- arrest means the male prisoners who rearrested
- 52 weeks
- factors: financial aid after release from prison, affected,release ages,race,work experience,marriage,parole,prior convictions, education
library(survival)
library(car)
# perform survival analysis
Rossi <- read.table('http://ftp.auckland.ac.nz/software/CRAN/doc/contrib/Fox-Companion/Rossi.txt', header=T)
Rossi[1:5, 1:10]
## week arrest fin age race wexp mar paro prio educ
## 1 20 1 0 27 1 0 0 1 3 3
## 2 17 1 0 18 1 0 0 1 8 4
## 3 25 1 0 19 0 1 0 1 13 3
## 4 52 0 1 23 1 1 1 1 1 5
## 5 52 0 0 19 0 1 0 1 3 3
mod.allison <- coxph(Surv(week, arrest) ~ fin + age + race + wexp + mar + paro + prio, data=Rossi)
summary(mod.allison)
## Call:
## coxph(formula = Surv(week, arrest) ~ fin + age + race + wexp +
## mar + paro + prio, data = Rossi)
##
## n= 432, number of events= 114
##
## coef exp(coef) se(coef) z Pr(>|z|)
## fin -0.3794 0.6843 0.1914 -1.98 0.0474 *
## age -0.0574 0.9442 0.0220 -2.61 0.0090 **
## race 0.3139 1.3688 0.3080 1.02 0.3081
## wexp -0.1498 0.8609 0.2122 -0.71 0.4803
## mar -0.4337 0.6481 0.3819 -1.14 0.2561
## paro -0.0849 0.9186 0.1958 -0.43 0.6646
## prio 0.0915 1.0958 0.0286 3.19 0.0014 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## fin 0.684 1.461 0.470 0.996
## age 0.944 1.059 0.904 0.986
## race 1.369 0.731 0.748 2.503
## wexp 0.861 1.162 0.568 1.305
## mar 0.648 1.543 0.307 1.370
## paro 0.919 1.089 0.626 1.348
## prio 1.096 0.913 1.036 1.159
##
## Concordance= 0.64 (se = 0.027 )
## Rsquare= 0.074 (max possible= 0.956 )
## Likelihood ratio test= 33.3 on 7 df, p=2.36e-05
## Wald test = 32.1 on 7 df, p=3.87e-05
## Score (logrank) test = 33.5 on 7 df, p=2.11e-05
# plot time vs survival prob
plot(survfit(mod.allison), ylim=c(.7, 1), xlab='Weeks', ylab='Proportion Not Rearrested')
21.4.1 result
- The covariates age and prio (prior convictions) have highly statistically significant coefficients, while the coefficient for fin (financial aid) is marginally significant
- holding the other covariates constant, an additional year of age reduces the weekly hazard of rearrest by a factor of \(e^b = 0.944\) on average – that is, by 5.6
- likelihood-ratio, Wald, and score chi-square statistics: null hypothesis all of the β’s are zero.
21.5 further
- assess the impact of financial aid on rearrest
- new data frame with two rows, one for each value of fin; the other covariates are fixed to their average values
attach(Rossi)
Rossi.fin <- data.frame(fin=c(0,1), age=rep(mean(age),2), race=rep(mean(race),2), wexp=rep(mean(wexp),2), mar=rep(mean(mar),2), paro=rep(mean(paro),2), prio=rep(mean(prio),2))
detach()
plot(survfit(mod.allison, newdata=Rossi.fin), conf.int=T, lty=c(1,2), ylim=c(.6, 1))
legend("bottomleft", legend=c('fin = 0', 'fin = 1'), lty=c(1,2))
- the higher estimated ‘survival’ of those receiving financial aid, but the two confidence envelopes overlap substantially, even after 52 weeks
21.6 Time-Dependent Covariates
- treat the employed variable as a tim-dependent covariates with 52 weeks’ record
sum(!is.na(Rossi[,11:62])) # record count
## [1] 19809
Rossi2 <- matrix(0, 19809, 14) # to hold new data set
colnames(Rossi2) <- c('start', 'stop', 'arresttime', names(Rossi)[1:10], 'employed')
row<-0
for (i in 1:nrow(Rossi)) {
for (j in 11:62) {
if (is.na(Rossi[i, j])) next
else {
row <- row + 1 # increment row counter
start <- j - 11 # start time (previous week)
stop <- start + 1 # stop time (current week)
arresttime <- if (stop == Rossi[i, 1] && Rossi[i, 2] ==1) 1 else 0
Rossi2[row,] <- c(start, stop, arresttime, unlist(Rossi[i, c(1:10, j)]))
}
}
}
Rossi2 <- as.data.frame(Rossi2)
remove(i, j, row, start, stop, arresttime)
modallison2 <- coxph(Surv(start, stop, arresttime) ~ fin + age + race + wexp + mar + paro + prio + employed, data=Rossi2)
summary(modallison2)
## Call:
## coxph(formula = Surv(start, stop, arresttime) ~ fin + age + race +
## wexp + mar + paro + prio + employed, data = Rossi2)
##
## n= 19809, number of events= 114
##
## coef exp(coef) se(coef) z Pr(>|z|)
## fin -0.3567 0.7000 0.1911 -1.87 0.0620 .
## age -0.0463 0.9547 0.0217 -2.13 0.0330 *
## race 0.3387 1.4031 0.3096 1.09 0.2740
## wexp -0.0256 0.9748 0.2114 -0.12 0.9038
## mar -0.2937 0.7455 0.3830 -0.77 0.4431
## paro -0.0642 0.9378 0.1947 -0.33 0.7416
## prio 0.0851 1.0889 0.0290 2.94 0.0033 **
## employed -1.3283 0.2649 0.2507 -5.30 1.2e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## fin 0.700 1.429 0.481 1.018
## age 0.955 1.047 0.915 0.996
## race 1.403 0.713 0.765 2.574
## wexp 0.975 1.026 0.644 1.475
## mar 0.745 1.341 0.352 1.579
## paro 0.938 1.066 0.640 1.374
## prio 1.089 0.918 1.029 1.152
## employed 0.265 3.775 0.162 0.433
##
## Concordance= 0.708 (se = 0.027 )
## Rsquare= 0.003 (max possible= 0.066 )
## Likelihood ratio test= 68.7 on 8 df, p=9.11e-12
## Wald test = 56.1 on 8 df, p=2.63e-09
## Score (logrank) test = 64.5 on 8 df, p=6.1e-11
21.7 Model Diagnostics
- Checking Proportional Hazards
modallison3 <- coxph(Surv(week, arrest) ~ fin + age + prio, data=Rossi)
modallison3
## Call:
## coxph(formula = Surv(week, arrest) ~ fin + age + prio, data = Rossi)
##
## coef exp(coef) se(coef) z p
## fin -0.3470 0.7068 0.1902 -1.82 0.06820
## age -0.0671 0.9351 0.0209 -3.22 0.00129
## prio 0.0969 1.1017 0.0273 3.56 0.00038
##
## Likelihood ratio test=29.1 on 3 df, p=2.19e-06
## n= 432, number of events= 114
cox.zph(modallison3)
## rho chisq p
## fin -0.00657 0.00507 0.9433
## age -0.20976 6.54147 0.0105
## prio -0.08004 0.77288 0.3793
## GLOBAL NA 7.13046 0.0679
par(mfrow=c(2,2))
plot(cox.zph(modallison3))
- there appears to be a trend in the plot for age, with the age effect declining with time
modallison4 <- coxph(Surv(start,stop,arresttime)~fin+age+age:stop:stop+prio, data = Rossi2)
modallison4
## Call:
## coxph(formula = Surv(start, stop, arresttime) ~ fin + age + age:stop:stop +
## prio, data = Rossi2)
##
## coef exp(coef) se(coef) z p
## fin -0.34856 0.70570 0.19023 -1.83 0.06690
## age 0.03228 1.03280 0.03943 0.82 0.41301
## prio 0.09818 1.10316 0.02726 3.60 0.00032
## age:stop -0.00383 0.99617 0.00147 -2.61 0.00899
##
## Likelihood ratio test=36 on 4 df, p=2.85e-07
## n= 19809, number of events= 114
the coefficient for the interaction is negative and highly statistically significant: The effect of age declines with time
use residual to find influential observations