7.8 Fitting the Cox regression model
7.8.1 Unadjusted
To illustrate the syntax and interpretation of the output, we will fit two unadjusted Cox regression models as examples, one with a categorical predictor and one with a continuous predictor.
Categorical predictor
Example 7.4 (continued): Earlier we plotted the KM estimates of the survival functions for preterm birth for mothers of different race/ethnicities (Figure 7.10). Visually, there were clear disparities. Use Cox regression, via the coxph() function, to numerically assess the magnitude of racial disparities by estimating the hazard ratios comparing these groups. As before, use the Surv(TIME, EVENT) ~ X syntax for the model formula.
## Call:
## coxph(formula = Surv(gestage37, preterm01) ~ MRACEHISP, data = natality)
## 
##   n= 1991, number of events= 251 
##    (9 observations deleted due to missingness)
## 
##                     coef exp(coef) se(coef)    z   Pr(>|z|)    
## MRACEHISPNH Black 0.8483    2.3357   0.1617 5.25 0.00000015 ***
## MRACEHISPNH Other 0.0964    1.1012   0.2449 0.39     0.6940    
## MRACEHISPHispanic 0.4332    1.5422   0.1565 2.77     0.0056 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                   exp(coef) exp(-coef) lower .95 upper .95
## MRACEHISPNH Black      2.34      0.428     1.701      3.21
## MRACEHISPNH Other      1.10      0.908     0.681      1.78
## MRACEHISPHispanic      1.54      0.648     1.135      2.10
## 
## Concordance= 0.587  (se = 0.018 )
## Likelihood ratio test= 27.6  on 3 df,   p=0.000004
## Wald test            = 29.4  on 3 df,   p=0.000002
## Score (logrank) test = 30.7  on 3 df,   p=0.000001The output tells us that there were 1991 observations (9 were removed because they had a missing value for MRACEHISP) and among these there were 251 events (preterm births), with the remaining event times being censored. The coef column contains the estimates of the \(\beta\)s, and the exp(coef) column contains \(e^\beta\) for each term which is the HR comparing that level to the reference level (non-Hispanic White). For example, non-Hispanic Black mothers have 2.34 times the hazard of preterm birth of non-Hispanic White mothers. The exp(-coef) column displays the inverse of the HR, which reverses the direction of comparison. For example, the hazard for non-Hispanic White mothers is 0.43 times that of non-Hispanic Black mothers. The lower .95 and upper .95 columns display the 95% CI for the HR (exp(coef), not the inverse).
Use the following code to extract just the regression coefficient table.
##                      coef exp(coef) se(coef)      z     Pr(>|z|)
## MRACEHISPNH Black 0.84830     2.336   0.1617 5.2472 0.0000001544
## MRACEHISPNH Other 0.09637     1.101   0.2449 0.3934 0.6940061841
## MRACEHISPHispanic 0.43321     1.542   0.1565 2.7680 0.0056401474Use the following code to display the HRs and their 95% confidence intervals and p-values.
# HRs, 95% CIs, p-values
cbind("HR"      = exp(summary(cox.ex7.4)$coef[, "coef"]),
                  exp(confint(cox.ex7.4)),
      "p-value" = summary(cox.ex7.4)$coef[, "Pr(>|z|)"])##                      HR  2.5 % 97.5 %      p-value
## MRACEHISPNH Black 2.336 1.7014  3.206 0.0000001544
## MRACEHISPNH Other 1.101 0.6813  1.780 0.6940061841
## MRACEHISPHispanic 1.542 1.1348  2.096 0.0056401474Use car::Anova() to get multiple degree of freedom Wald tests for categorical variables with more than two levels.
## Analysis of Deviance Table (Type III tests)
## 
## Response: Surv(gestage37, preterm01)
##           Df Chisq Pr(>Chisq)    
## MRACEHISP  3  29.4  0.0000019 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Conclusion: We estimate from the Natality teaching dataset that there are significant racial disparities in the hazard of preterm birth (p <.001 from the Type III test). Non-Hispanic Black mothers have 2.34 times the hazard of preterm birth of non-Hispanic White mothers (HR = 2.34; 95% CI = 1.70, 3.21; p <.001). Hispanic mothers have 1.54 times the hazard of preterm birth of non-Hispanic White mothers (HR = 1.54; 95% CI = 1.13, 2.10; p = .006). Non-Hispanic Other race mothers had greater observed hazard than non-Hispanic White mothers, but this difference was small and not statistically significant (HR = 1.10; 95% CI = 0.68, 1.78; p = .694).
Continuous predictor
Example 7.5: Using the Framingham Heart Study teaching dataset, assess the association between time to angina pectoris (recurrent chest pain) and the age of the participant at enrollment (AGE). The time and event variables are TIMEAP and ANGINA, respectively. The syntax is the same as the categorical predictor setting, but the interpretation of the HR is slightly different. Rather than being a comparison between a level and the reference level, it is a comparison between individuals who differ by one unit (in this case, by 1 year of age).
## Call:
## coxph(formula = Surv(TIMEAP, ANGINA) ~ AGE, data = fram_time_invar)
## 
##   n= 4215, number of events= 558 
## 
##        coef exp(coef) se(coef)    z        Pr(>|z|)    
## AGE 0.03429   1.03488  0.00499 6.87 0.0000000000066 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##     exp(coef) exp(-coef) lower .95 upper .95
## AGE      1.03      0.966      1.02      1.05
## 
## Concordance= 0.595  (se = 0.012 )
## Likelihood ratio test= 46.6  on 1 df,   p=0.000000000009
## Wald test            = 47.1  on 1 df,   p=0.000000000007
## Score (logrank) test = 47.8  on 1 df,   p=0.000000000005# HR, 95% CI, p-value
cbind("HR"      = exp(summary(cox.ex7.5)$coef[, "coef"]),
                  exp(confint(cox.ex7.5)),
      "p-value" = summary(cox.ex7.5)$coef[, "Pr(>|z|)"])##        HR 2.5 % 97.5 %           p-value
## AGE 1.035 1.025  1.045 0.000000000006608Conclusion: Age at enrollment is significantly associated with time to angina pectoris (p <.001, from the regression coefficient table – since this is just testing one term we do not need car::Anova(), although it would have given the same answer). Individuals who were one year older at enrollment have 3.5% greater hazard of angina pectoris (HR = 1.035; 95% CI = 1.025, 1.045; p <.001).
HR associated with other than a 1-unit difference
Similar to the issue with ORs estimated from logistic regression discussed in Section 6.6.1, if you want to know the HR associated with more than a 1-unit difference, you cannot simply multiply the HR by the new amount of units. Instead, multiply the regression coefficient (coef) (and the 95% confidence interval) by the units before exponentiating. For example, individuals who were 10 years older at enrollment have 41% greater hazard of angina pectoris (not 10 \(\times\) the % associated with a 1-unit difference which would be 35%).
# Multiply by 10 inside of exp()
# Do NOT multiply the p-value
cbind("HR"      = exp(10*summary(cox.ex7.5)$coef[, "coef"]),
                  exp(10*confint(cox.ex7.5)),
      "p-value" = summary(cox.ex7.5)$coef[, "Pr(>|z|)"])##        HR 2.5 % 97.5 %           p-value
## AGE 1.409 1.278  1.554 0.0000000000066087.8.2 Adjusted
When adjusting for other predictors, refer to hazard ratios as adjusted hazard ratios (AHR) and, when interpreting them, specify what you have adjusted for.
Example 7.6: Using the Natality teaching dataset, estimate the association between time to preterm birth and previous preterm birth (RF_PPTERM), adjusted for mother’s age (MAGER), race/ethnicity (MRACEHISP), and marital status (DMAR). For comparison, fit the unadjusted model first. To make the two models comparable (based on the same individuals), use a complete case analysis.
# Complete-case dataset
natality.complete <- natality %>% 
  select(gestage37, preterm01, RF_PPTERM, MAGER, MRACEHISP, DMAR) %>% 
  drop_na()
# Unadjusted
cox.ex7.6.unadj <- coxph(Surv(gestage37, preterm01) ~ RF_PPTERM,
                         data = natality.complete)
# AHR, 95% CI, p-value
cbind("HR"      = exp(summary(cox.ex7.6.unadj)$coef[, "coef"]),
                  exp(confint(cox.ex7.6.unadj)),
      "p-value" = summary(cox.ex7.6.unadj)$coef[, "Pr(>|z|)"])##                 HR 2.5 % 97.5 %      p-value
## RF_PPTERMYes 3.285 2.074  5.204 0.0000004018# Adjusted
cox.ex7.6 <- coxph(Surv(gestage37, preterm01) ~ RF_PPTERM + MAGER + 
                     MRACEHISP + DMAR, data = natality.complete)
# AHRs, 95% CIs, p-values
cbind("AHR"      = exp(summary(cox.ex7.6)$coef[, "coef"]),
                   exp(confint(cox.ex7.6)),
      "p-value"  = summary(cox.ex7.6)$coef[, "Pr(>|z|)"])##                      AHR  2.5 % 97.5 %     p-value
## RF_PPTERMYes      2.8919 1.8188  4.598 0.000007191
## MAGER             1.0318 1.0077  1.056 0.009317466
## MRACEHISPNH Black 1.7485 1.2346  2.476 0.001649379
## MRACEHISPNH Other 0.9292 0.5207  1.658 0.803844986
## MRACEHISPHispanic 1.2972 0.9132  1.843 0.146213569
## DMARUnmarried     1.7950 1.3291  2.424 0.000135888## Analysis of Deviance Table (Type III tests)
## 
## Response: Surv(gestage37, preterm01)
##           Df Chisq Pr(>Chisq)    
## RF_PPTERM  1 20.14  0.0000072 ***
## MAGER      1  6.76    0.00932 ** 
## MRACEHISP  3 10.79    0.01292 *  
## DMAR       1 14.56    0.00014 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Before adjusting for confounding, those with a previous preterm birth have an estimated 3.29 times the hazard of experiencing a preterm birth as those without a previous preterm birth. After adjusting for confounding, the estimated association is still large, but somewhat attenuated. After adjusting for mother’s age, race/ethnicity, marital status, and education, those with a previous preterm birth have an estimated 2.89 times the hazard of experiencing a preterm birth (AHR = 2.89; 95% CI = 1.82, 4.60; p <.001) as those without a previous preterm birth.
NOTE: Parametric survival regression models
The KM method is a non-parametric estimator of the survival function. That implies that it makes no assumption about the shape of the function. As mentioned above, the Cox model is a semi-parametric method – it uses an unspecified baseline hazard function but includes regression parameters. There are various parametric survival regression models that do assume an underlying shape of the survival function, resulting from an assumption about the distribution of the survival times. Examples of assumed distributions include exponential, Weibull, Gompertz, and gamma. When the distributional assumption is met, these models can be more powerful than Cox regression. However, Cox regression does not depend on a distributional assumption and also easily incorporates time-varying predictors (Allison 2010) (see Section 7.14). In this text, we focus on the Cox proportional hazards regression model. See the flexsurvreg() function in the flexsurv package (Jackson 2024) for information on fitting parametric survival models in R.