3.4 Hypothesis testing
In order to test the significance of a variable or an interaction term in the model, we can use two procedures:
the Wald test (typically used with Maximum Likelihood estimates)
the Likelihood Ratio test (LRT), which uses the log likelihood to compare two nested models
The null hypothesis of the Wald test states that the coefficient \(\beta_j\) is equal to 0. The test statistic is
\[ Z = \frac{\hat \beta_j - 0}{\operatorname{SE}(\hat \beta_j)} \sim N(0,1) \]
summary(m1)$coef
## coef exp(coef) se(coef) z
## LoanOriginalAmount2 -0.1217675 0.8853542 0.06661063 -1.828049
## IsBorrowerHomeownerTrue -0.2481456 0.7802463 0.06231124 -3.982357
## IncomeVerifiableTrue 0.2926323 1.3399500 0.30286111 0.966226
## Pr(>|z|)
## LoanOriginalAmount2 6.754227e-02
## IsBorrowerHomeownerTrue 6.823526e-05
## IncomeVerifiableTrue 3.339311e-01
# by hand... for IncomeVerifiable
z <- summary(m1)$coef[3, 1]/summary(m1)$coef[3, 3]  # z = coef / se(coef)
pvalue <- 2 * pnorm(abs(z), lower.tail = FALSE)  # two-sided p-value
pvalue
## [1] 0.3339311
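The same calculation can be vectorised over all the coefficients at once. The following is a minimal sketch that reuses the summary matrix above (the resulting p-values reproduce the Pr(>|z|) column):
cf <- summary(m1)$coef
z_all <- cf[, "coef"] / cf[, "se(coef)"]  # Wald statistic for every coefficient
p_all <- 2 * pnorm(abs(z_all), lower.tail = FALSE)  # two-sided p-values
cbind(z = z_all, p.value = p_all)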
According to the p-value of the test, the null hypothesis is not rejected for the IncomeVerifiable
variable. Thus, this variable does not contribute significantly to the model and can be removed.
The other approach is the Likelihood Ratio test. In this case, we need to compare the log likelihood of the reduced model, which does not contain the variable we want to test, with the log likelihood of the full model, which does contain it. In general, the LRT statistic can be written as
\[ LRT = -2 \ln \frac{L_R}{L_F} = 2 \ln(L_F) - 2 \ln(L_R) \sim \chi^2_p \] where \(L_R\) denotes the likelihood of the reduced model with \(k\) parameters and \(L_F\) is the likelihood of the full model with \(k + p\) parameters. \(\chi^2_p\) is a Chi-square distribution with \(p\) degrees of freedom, where \(p\) denotes the number of predictors being assessed.
m_red <- coxph(Surv(time, status) ~ LoanOriginalAmount2 + IsBorrowerHomeowner,
data = loan_filtered)
anova(m_red, m1) # first the reduced model, then the full model
## Analysis of Deviance Table
## Cox model: response is Surv(time, status)
## Model 1: ~ LoanOriginalAmount2 + IsBorrowerHomeowner
## Model 2: ~ LoanOriginalAmount2 + IsBorrowerHomeowner + IncomeVerifiable
## loglik Chisq Df P(>|Chi|)
## 1 -10837
## 2 -10836 1.0297 1 0.3102
# by hand... for IncomeVerifiable variable
m1$loglik # the first is the log likelihood of a model that contains
## [1] -10848.75 -10836.52
# none of the predictors, so we need the second one
chi <- 2 * m1$loglik[2] - 2 * m_red$loglik[2]
pvalue <- 1 - pchisq(chi, df = 1) # df = 3 - 2
pvalue
## [1] 0.310227
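The same mechanics extend to testing a block of \(p > 1\) predictors at once. As a sketch (using the same data and response as above, and introducing m_small only for this illustration), we could drop IsBorrowerHomeowner and IncomeVerifiable simultaneously and refer the statistic to a Chi-square with 2 degrees of freedom:
m_small <- coxph(Surv(time, status) ~ LoanOriginalAmount2, data = loan_filtered)
chi2 <- 2 * (m1$loglik[2] - m_small$loglik[2])  # 2*logLik(full) - 2*logLik(reduced)
pchisq(chi2, df = 2, lower.tail = FALSE)  # p = 2 predictors removed
# equivalently: anova(m_small, m1)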
In this case, using \(\alpha = 0.05\) and testing the significance of the IncomeVerifiable
variable, the p-value (0.31) is greater than \(\alpha\), so we fail to reject the null hypothesis and, again, the variable can be removed from the model.
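As a quick cross-check of this decision, the two nested models can also be compared by AIC. This is just a sketch, relying on the fact that coxph fits provide a logLik method so that AIC() applies directly:
AIC(m_red, m1)  # with the log likelihoods shown above, m_red attains a slightly lower AIC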