Chapter 7 Case Study
library(tidyverse)
library(survival)
library(survminer)
library(gtsummary)
library(scales)
The guidelines for reporting the Kaplan-Meier test are from Laerd’s Kaplan-Meier using SPSS Statistics (Laerd 2015). The data is from survival::lung
.
<- survival::lung %>%
d_lung filter(!is.na(ph.ecog)) %>%
mutate(
ph.ecog = factor(
ph.ecog,levels = c(0, 1, 2, 3, 4),
labels = c("Asymptomatic", "Ambulatory", "In bed <50%", "In bed >50%", "Bedridden")),
ph.ecog = fct_drop(ph.ecog),
ph.ecog = fct_lump_n(ph.ecog, n = 2, other_level = "Bedridden"),
patient_id = row_number()
)
A study investigated differences in all-cause mortality between men and women diagnosed with advanced lung cancer. 227 participants aged 39 to 82 were monitored up to three years until time of death. The participants were segmented into three groups according to their ECOG performance score: asymptomatic, symptomatic but completely ambulatory, and bedridden at least part of the day. Participant age was captured as a controlling covariate.
<- d_lung %>%
(t1 mutate(status = factor(status, levels = c(1, 2), labels = c("censored", "died"))) %>%
tbl_summary(by = "ph.ecog", include = c(time, status, ph.ecog, age)) %>%
add_overall())
Characteristic | Overall, N = 2271 | Asymptomatic, N = 631 | Ambulatory, N = 1131 | Bedridden, N = 511 |
---|---|---|---|---|
time | 259 (168, 399) | 303 (224, 438) | 243 (177, 426) | 180 (100, 301) |
status | ||||
censored | 63 (28%) | 26 (41%) | 31 (27%) | 6 (12%) |
died | 164 (72%) | 37 (59%) | 82 (73%) | 45 (88%) |
age | 63 (56, 69) | 61 (56, 68) | 63 (55, 68) | 68 (60, 73) |
1 Median (IQR); n (%) |
7.1 Cox
The Cox proportional hazards model describes the effect of explanatory variables on survival and can include controlling variables. Add age
as a controlling variable.
<- coxph(Surv(time, status) ~ ph.ecog + age, data = d_lung)
cox_fit <- cox_fit %>% gtsummary::tbl_regression(exponentiate = TRUE)) (cox_tbl
Characteristic | HR1 | 95% CI1 | p-value |
---|---|---|---|
ph.ecog | |||
Asymptomatic | — | — | |
Ambulatory | 1.43 | 0.97, 2.11 | 0.072 |
Bedridden | 2.39 | 1.52, 3.74 | <0.001 |
age | 1.01 | 0.99, 1.03 | 0.2 |
1 HR = Hazard Ratio, CI = Confidence Interval |
Using a Cox proportional hazards regression analysis, we find the association between ECOG and survival time statistically significant with a Bedridden estimate of 2.39 (95% CI 1.52, 3.74; p<0.001) relative to Asymptomatic. The hazard ratio for Ambulatory was not statistically significant, 1.43 (95% CI 0.97, 2.11; p=0.072).
survfit(cox_fit, newdata = list(ph.ecog = levels(d_lung$ph.ecog),
age = rep(median(d_lung$age), 3)), data = d_lung) %>%
surv_summary() %>%
ggsurvplot_df(
fun = "pct",
# linetype = "strata", # Change line type by groups
# conf.int = TRUE,
surv.median.line = "hv", # median horizontal and vertical ref lines
ggtheme = theme_light(),
palette = c("goldenrod", "sienna", "tomato"),
title = "Fitted Cox model at Median Participant Age.",
legend.title = "",
legend.labs = levels(d_lung$ph.ecog)
)
7.2 Discrete Time
To fit a discrete time model, first create a discrete time data set, on record per relevant time period. You could do this by hand with expand.grid()
<- expand.grid(
cs_glm_dat time = seq(min(d_lung$time), max(d_lung$time), 1),
patient_id = unique(d_lung$patient_id)
%>%
) inner_join(d_lung, by = "patient_id") %>%
filter(time.x <= time.y) %>%
mutate(status = if_else(time.x < time.y, 0, status - 1)) %>%
rename(time = time.x)
But discSurv::dataLong
does this for you.
<- d_lung %>%
cs_glm_dat mutate(status = status - 1) %>%
::dataLong("time", "status", timeAsFactor = FALSE) discSurv
For example, patient 1 died at time 306, so they have 306 rows in the new data set, with y = 0
for all but the last row.
%>% filter(obj == 1) %>% tail() cs_glm_dat
## obj timeInt y inst time status age sex ph.ecog ph.karno pat.karno
## 1.300 1 301 0 3 306 1 74 1 Ambulatory 90 100
## 1.301 1 302 0 3 306 1 74 1 Ambulatory 90 100
## 1.302 1 303 0 3 306 1 74 1 Ambulatory 90 100
## 1.303 1 304 0 3 306 1 74 1 Ambulatory 90 100
## 1.304 1 305 0 3 306 1 74 1 Ambulatory 90 100
## 1.305 1 306 1 3 306 1 74 1 Ambulatory 90 100
## meal.cal wt.loss patient_id
## 1.300 1175 NA 1
## 1.301 1175 NA 1
## 1.302 1175 NA 1
## 1.303 1175 NA 1
## 1.304 1175 NA 1
## 1.305 1175 NA 1
glm()
fits either the log(-log(1-hazard))
or log(hazard / (1-hazard))
as a function of time
, controlling for ph.ecog
and age
. The two links are similar here6.
%>%
cs_glm_dat group_by(ph.ecog, timeInt) %>%
summarise(.groups = "drop", died = sum(y), at_risk = n()) %>%
mutate(
hazard = died / at_risk,
logit_hazard = log(hazard / (1 - hazard)),
cloglog_hazard = log(-log(1 - hazard))
%>%
) pivot_longer(cols = c(logit_hazard, cloglog_hazard)) %>%
filter(!hazard %in% c(0, 1)) %>%
ggplot(aes(x = timeInt, y = value, color = ph.ecog)) +
geom_point() +
geom_smooth(method = "lm", formula = "y ~ x") +
facet_wrap(facets = vars(name)) +
theme_light()
<- glm(
logit_fit ~ timeInt + ph.ecog + age,
y data = cs_glm_dat,
family = binomial(link = "logit")
)<- logit_fit %>% gtsummary::tbl_regression(exponentiate = TRUE)) (logit_tbl
Characteristic | OR1 | 95% CI1 | p-value |
---|---|---|---|
timeInt | 1.00 | 1.00, 1.00 | <0.001 |
ph.ecog | |||
Asymptomatic | — | — | |
Ambulatory | 1.42 | 0.97, 2.11 | 0.081 |
Bedridden | 2.38 | 1.52, 3.76 | <0.001 |
age | 1.01 | 0.99, 1.03 | 0.3 |
1 OR = Odds Ratio, CI = Confidence Interval |
Using a discrete time logistic regression analysis with logit link function, we find the association between ECOG and survival time statistically significant with a Bedridden estimate of 2.38 (95% CI 1.52, 3.76; p<0.001) relative to Asymptomatic. The hazard odds ratio for Ambulatory was not statistically significant, 1.42 (95% CI 0.97, 2.11; p=0.081).
<- glm(
cloglog_fit ~ timeInt + ph.ecog + age,
y data = cs_glm_dat,
family = binomial(link = "cloglog")
)<- cloglog_fit %>% gtsummary::tbl_regression(exponentiate = TRUE)) (cloglog_tbl
Characteristic | HR1 | 95% CI1 | p-value |
---|---|---|---|
timeInt | 1.00 | 1.00, 1.00 | <0.001 |
ph.ecog | |||
Asymptomatic | — | — | |
Ambulatory | 1.41 | 0.97, 2.11 | 0.081 |
Bedridden | 2.38 | 1.52, 3.75 | <0.001 |
age | 1.01 | 0.99, 1.03 | 0.3 |
1 HR = Hazard Ratio, CI = Confidence Interval |
Using a discrete time logistic regression analysis with complementary log-log link function, we find the association between ECOG and survival time statistically significant with a Bedridden estimate of 2.38 (95% CI 1.52, 3.75; p<0.001) relative to Asymptomatic. The hazard ratio for Ambulatory was not statistically significant, 1.41 (95% CI 0.97, 2.11; p=0.081).
7.3 Reporting
The reporting below uses the Cox model rather than the discrete time model.
Participants were randomly assigned to three different interventions in order to quit smoking: a hypnotherapy programme (n = 50), wearing nicotine patches (n = 50) and using e-cigarettes (n = 50). Kaplan-Meier survival analysis (Kaplan & Meier, 1958) was conducted to compare the three different interventions for their effectiveness in preventing smoking resumption. A similar percentage of censored cases was present in the hypnotherapy (12.0%), nicotine patch (14.0%) and e-cigarette (14.0%) intervention groups and the pattern of censoring was similar. Participants that underwent the hypnotherapy programme had a median time to smoking resumption of 69.0 (95% CI, 45.2 to 92.8) days. This was longer than the groups receiving nicotine patches or e-cigarettes, which had identical median times to smoking resumption of 9.0 (95% CI, 6.6 to 11.4) days and 9.0 (95% CI, 7.1 to 10.9) days, respectively. A log rank test was conducted to determine if there were differences in the survival distributions for the different types of intervention. The survival distributions for the three interventions were statistically significantly different, χ2(2) = 25.818, p < .0005. Pairwise log rank comparisons were conducted to determine which intervention groups had different survival distributions. A Bonferroni correction was made with statistical significance accepted at the p < .0167 level. There was a statistically significant difference in survival distributions for the hypnotherapy vs nicotine patch intervention, χ2(1) = 11.035, p = .001, and hypnotherapy vs e-cigarette intervention, χ2(1) = 29.003, p < .0005. However, the survival distributions for the e-cigarette and nicotine patch interventions were not statistically significantly different, χ2(1) = 1.541, p = .214.
References
Full discussion on Rens van de Schoot.↩︎