3.4 Hypothesis testing

In order to test the significance of a variable or an interaction term in the model, we can use two procedures:

  • the Wald test (typically used with Maximum Likelihood estimates)

  • the Likelihood Ratio test (LRT), which compares the log likelihoods of two nested models

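Both procedures are illustrated below using the Cox model m1 fitted in the previous section. For reference, and judging from the anova() output later in this section, m1 corresponds to a fit of the form:

library(survival)
# full model with the three predictors, fitted on loan_filtered
m1 <- coxph(Surv(time, status) ~ LoanOriginalAmount2 + IsBorrowerHomeowner +
              IncomeVerifiable, data = loan_filtered)
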
The null hypothesis of the Wald test states that the coefficient \(\beta_j\) is equal to 0. The test statistic is

\[ Z = \frac{\hat \beta_j - 0}{\mathrm{SE}(\hat \beta_j)} \sim N(0,1) \]

In R, the Wald statistics and their p-values are reported in the coefficient table of the fitted model:

summary(m1)$coef
##                               coef exp(coef)   se(coef)         z
## LoanOriginalAmount2     -0.1217675 0.8853542 0.06661063 -1.828049
## IsBorrowerHomeownerTrue -0.2481456 0.7802463 0.06231124 -3.982357
## IncomeVerifiableTrue     0.2926323 1.3399500 0.30286111  0.966226
##                             Pr(>|z|)
## LoanOriginalAmount2     6.754227e-02
## IsBorrowerHomeownerTrue 6.823526e-05
## IncomeVerifiableTrue    3.339311e-01

# by hand... for IncomeVerifiable
z <- summary(m1)$coef[3, 1]/summary(m1)$coef[3, 3]  # coef / se(coef)
pvalue <- 2 * pnorm(abs(z), lower.tail = FALSE)  # two-sided p-value
pvalue
## [1] 0.3339311

According to the p-value of the test, we fail to reject the null hypothesis (for the IncomeVerifiable variable). Thus, there is no evidence that the model should include this variable.
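
Because the Wald statistic is asymptotically standard normal, the same machinery yields Wald-based confidence intervals. As a quick cross-check (a minimal sketch using the fitted model above), the interval for IncomeVerifiableTrue should contain 0 on the coefficient scale, and 1 on the hazard-ratio scale, in agreement with the test:

confint(m1)       # Wald-based 95% CIs for the coefficients
exp(confint(m1))  # the same intervals on the hazard-ratio scale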

The other approach is the Likelihood Ratio test. In this case, we compute the difference between the log likelihood of the reduced model, which does not contain the variable we want to test, and the log likelihood of the full model, which contains it. In general, the LRT statistic can be written in the form

\[ LRT = -2 \ln \frac{L_R}{L_F} = 2 \ln(L_F) - 2 \ln(L_R) \sim \chi^2_p \] where \(L_R\) denotes the likelihood of the reduced model with \(k\) parameters and \(L_F\) is the likelihood of the full model with \(k + p\) parameters. \(\chi^2_p\) is a Chi-square distribution with \(p\) degrees of freedom, where \(p\) denotes the number of parameters being tested.

In general, the Likelihood Ratio test and the Wald test may not give exactly the same answer. It has been shown that, of the two procedures, the LRT statistic has better statistical properties, so when in doubt, you should use the LRT.

To carry out the test in R, we fit the reduced model and pass both nested fits to anova():

m_red <- coxph(Surv(time, status) ~ LoanOriginalAmount2 + IsBorrowerHomeowner,
               data = loan_filtered)
anova(m_red, m1)  # first the reduced model, then the full one
## Analysis of Deviance Table
##  Cox model: response is  Surv(time, status)
##  Model 1: ~ LoanOriginalAmount2 + IsBorrowerHomeowner
##  Model 2: ~ LoanOriginalAmount2 + IsBorrowerHomeowner + IncomeVerifiable
##   loglik  Chisq Df P(>|Chi|)
## 1 -10837                    
## 2 -10836 1.0297  1    0.3102

# by hand... for the IncomeVerifiable variable
m1$loglik  # the first element is the log likelihood of a model that contains
           # none of the predictors, so we need the second one
## [1] -10848.75 -10836.52

chi <- 2 * m1$loglik[2] - 2 * m_red$loglik[2]  # LRT statistic
pvalue <- 1 - pchisq(chi, df = 1)  # df = 3 - 2, one parameter tested
pvalue
## [1] 0.310227
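
The same conclusion can also be read off a sequential analysis-of-deviance table: applied to a single coxph fit, anova() tests each term as it is added in formula order, so (as a quick sketch using the objects above) its last row reproduces the test for IncomeVerifiable:

anova(m1)  # sequential LRTs; the last row tests IncomeVerifiable
           # given LoanOriginalAmount2 and IsBorrowerHomeowner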

In this case, using an \(\alpha = 0.05\) and testing the significance of the IncomeVerifiable variable, we again fail to reject the null hypothesis (p-value \(= 0.31 > 0.05\)), so we remove the variable from the model.
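
After dropping IncomeVerifiable, the reduced fit m_red (already computed above) becomes the working model, and its coefficient table can be inspected as before:

summary(m_red)$coef  # Wald tests for the two remaining predictors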