7.10 Prediction
In linear regression, we estimated the mean outcome at \(X=x\), \(E(Y|X=x)\). In logistic regression, we estimated the probability of the outcome, \(P(Y=1|X=x)\). In Cox regression, we can estimate the survival probability at a specific time, \(S\left(t|X=x\right)\), as well as the hazard ratio for an individual relative to a reference individual, \(h\left(t|X=x\right)/h\left(t|X=x_\textrm{ref}\right)\).
Use predict()
to compute these estimates. Along with specifying values for the predictors in the model at which you want to compute the estimate, also specify \(t\), the time at which you want to compute the estimate. For predicted survival, the choice of \(t\) matters. However, because of the proportional hazards assumption, a predicted HR will not depend on time, although in the syntax below you still need to specify a time for the code to work (but the HR will be the same regardless of what time is chosen).
Unfortunately, predict()
only returns the standard error of the estimate when applied to a coxph
object, not a confidence interval. For the estimated survival probability, an alternative is to use survfit()
on the fitted Cox model, which does return a confidence interval.
Example 7.6 (continued): Using the Natality teaching dataset, estimate \(S(30)\), the probability that a preterm birth has not occurred as of 30 weeks, for a married, non-Hispanic Black, 28-year-old woman with a previous preterm birth. Also, estimate the hazard of preterm birth for this individual relative to a reference individual.
# Check the spelling of each level
# (results not shown)
levels(natality$DMAR)
levels(natality$MRACEHISP)
levels(natality$RF_PPTERM)
# NOTE: For the event indicator variable (preterm), you must specify
# a value or predict() will not work, but it does not matter which
# value you specify, you will get the same answer.
NEWDAT <- data.frame(preterm01 = 1,
gestage37 = 30,
DMAR = "Married",
MRACEHISP = "NH Black",
MAGER = 28,
RF_PPTERM = "Yes")
predict(cox.ex7.6, NEWDAT, type = "survival", se.fit = T)
## $fit
## [1] 0.9628
##
## $se.fit
## [1] 0.01235
We can get the same answer using a summary()
of survfit()
with the times
option. However, summary()
will also include a confidence interval (via the conf.int
option). With this method, there is no need to specify values for the event indicator and time inside of NEWDAT
, although leaving them in NEWDAT
will not change the answer.
unlist(summary(survfit(cox.ex7.6,
NEWDAT,
se.fit =T, conf.int = 0.95),
times=30)[c("surv", "std.err", "lower", "upper")])
## surv std.err lower upper
## 0.96282 0.01235 0.93892 0.98732
We conclude that the probability of no preterm birth through 30 weeks is 0.963 (95% CI = 0.939, 0.987).
Use predict()
with type = "risk"
to estimate an AHR comparing the hazard for this individual to that of a reference individual. Different reference
specifications lead to different results.
reference = "zero"
: The reference individual has continuous predictor values of 0, and categorical predictors that are each at their reference level. This is not a good choice if \(X = 0\) is not a plausible value for a continuous predictor \(X\).reference = "sample"
: The reference individual has continuous predictors that are each at their respective sample mean, and categorical predictors that are each at their reference level. This is equivalent to centering each continuous variable at its mean.reference = "strata"
(the default): This leads to the same answer asreference = "sample"
unless the model includes astrata()
term (see Section 7.16.4), in which case continuous variables will be set to their within-stratum means.
Our model contained a single continuous predictor, mother’s age, for which a reference value of 0 does not make sense. As we see below, the AHR comparing our individual to a reference individual with age 0 is very large, since the model is extrapolating and estimating a very small hazard for a newborn woman to have a preterm birth (a nonsense statement) and that small number is the denominator of the HR.
## $fit
## 1
## 12.15
##
## $se.fit
## 1
## 1.487
If a value of 0 does not make sense for any of the continuous predictors in your model, then use reference = "strata"
instead (equivalent to "sample"
if there are no strata variables, and more sensible if there are, since it makes sense to have the hazard for the individual in NEWDAT
be compared to an individual in their own strata).
## $fit
## 1
## 4.906
##
## $se.fit
## 1
## 0.6329
Because of the proportional hazards assumption, the hazard ratio is constant over time. You will get the same answer for any gestage37
value you enter into NEWDAT
, or even if you omit gestage37
altogether (the event indicator, preterm01
, can also be omitted).
NEWDAT <- data.frame(DMAR = "Married",
MRACEHISP = "NH Black",
MAGER = 28,
RF_PPTERM = "Yes")
predict(cox.ex7.6, NEWDAT, type = "risk", reference = "strata")
## 1
## 4.906
Conclusion: This individual has 4.91 times the hazard of preterm birth of an individual at the reference level of each categorical predictor and the mean of each continuous predictor.
To verify that the reference individual has reference level values and mean values for the predictors, the code below demonstrates that the predicted HR for an individual with these characteristics is 1.
NEWDAT <- data.frame(DMAR = levels(natality.complete$DMAR)[1],
MRACEHISP = levels(natality.complete$MRACEHISP)[1],
MAGER = mean(natality.complete$MAGER),
RF_PPTERM = levels(natality.complete$RF_PPTERM)[1])
predict(cox.ex7.6, NEWDAT, type = "risk", reference = "strata")
## 1
## 1
NOTE: reference = "strata"
is the default value and, for the reasons mentioned above, is a reasonable choice. Thus, you can typically omit the option and type, for example, predict(cox.ex7.6, NEWDAT, type = "risk")
.