12  Logistic regression model building

Learning objectives

  1. Revisit the fundamental difference between two aspects of modelling: prediction vs explanation

  2. Be familiar with the concept of training and validation datasets

  3. Discover the different ways to validate a prediction model, particularly in logistic regression

  4. Learn about optimism bias and how it can be removed

  5. Discover how calibration plots are built and why they are important

  6. Revisit the ways in which multiple logistic regression models are constructed with an emphasis on interpretation

Learning activities

This week’s learning activities include:

Learning Activity (learning objectives covered)
Lecture 1: 1, 2, 3, 4
Readings: 2, 3, 4, 5
Lecture 2: 5, 6
Investigation: 3, 4
Practice: 5

We have learnt tools to build a linear or logistic regression model. In this last week of RM1, we consider model building strategies more generally. As an introduction, we quote Hosmer and Lemeshow (2013), who write in their book on applied logistic regression that “successful modeling of a complex data set is part science, part statistical methods, and part experience and common sense”, and we note that this applies to any model. Model building has attracted attention and a good deal of controversy over the years, and the debate continues today. It is important for you to know the key ideas before you decide on your own strategy and formulate your personal view on this important topic. In this week’s materials we revisit and expand on what was discussed in the context of linear regression. The first key concept is the distinction between prediction and explanation/interpretation, which has important implications for how a model is built.

Result explaining the fundamental difference between explanatory and predictive modelling

In prediction, we are typically interested in predicting an outcome \(y\) with some function of \(x\), say \(f(x)\), where \(x\) is a vector of covariates. To clarify what we mean by predicting, let us say that we would like \(f(x)\) to be close to \(y\) in some sense. A common way to define close is to consider the (expected) squared error loss of estimating \(y\) using \(f(x)\), i.e. \(E\big[(y-f(x))^2\big]\). It then makes sense to minimise this quantity mathematically, yielding \(f(x)=E(y|x)\), the conditional expectation of \(y\) given \(x\) (called the regression function). When we have data \((x_i,y_i)\), \(i=1,\dots,n\), to work with, our goal is to find \(\hat f\) that is a good estimate of the regression function \(f\). For instance, \(\hat f\) can be the fitted model evaluated at a particular vector of covariates \(x\). The model is chosen to fit the data well. How good is that estimate \(\hat f\)? A good \(\hat f\) should have a low expected prediction error:

\[EPE(y,\hat f(x))=E\Big[(y-\hat f(x))^2\Big]\] This expectation is over \(x\), \(y\) and also the sample data used to build \(\hat f\). Next, the rules of statistical theory allow us to rewrite \(EPE\) as follows:


\[EPE=\hbox{var}(y) + \hbox{bias}^2(\hat f(x)) + \hbox{var}(\hat f(x))\] In other words, \(EPE\) can be expressed as the sum of 3 terms:

\[EPE=\hbox{irreducible error} + \hbox{bias}^2 + \hbox{variance}\]

where the first term is the irreducible error that remains even if the model is correctly specified and accurately estimated; the second term, linked to bias, is the result of misspecifying the statistical model \(f\); the third term, called the (estimation) variance, is the result of using a sample to estimate \(f\). The above decomposition reveals a source of the difference between explanatory and predictive modelling: in explanatory modelling the focus is on minimising bias to obtain the most accurate representation. In contrast, predictive modelling seeks to minimise the combination of bias and estimation variance, occasionally sacrificing theoretical accuracy for improved empirical precision. This is essentially a bias-variance tradeoff. A wrong model (in the sense of bias) can sometimes predict better than a more correct one! We have used the \(EPE\) here to quantify the closeness of \(\hat f\), but we could use other measures of prediction error. In binary logistic regression, we may use tools adapted to the nature of the data, but the fundamental distinction between prediction and explanation remains.
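To make the tradeoff concrete, here is a small simulation sketch (ours, not from the reading). It generates data from a quadratic regression function and compares, at a single covariate value, the three components of the \(EPE\) for a deliberately misspecified straight-line fit and for a very flexible degree-8 polynomial. The settings (sample size, error standard deviation, prediction point) are arbitrary choices for illustration.

set.seed(2024)
f <- function(x) 1 + 2 * x - 1.5 * x^2        # true regression function
n <- 30                                        # small sample size
nsim <- 2000                                   # number of simulated training datasets
x0 <- 0.5                                      # covariate value at which we predict

pred_lin <- pred_poly <- numeric(nsim)
for (s in 1:nsim) {
  x <- runif(n)
  y <- f(x) + rnorm(n, sd = 0.5)               # irreducible error has sd = 0.5
  pred_lin[s]  <- predict(lm(y ~ x), newdata = data.frame(x = x0))
  pred_poly[s] <- predict(lm(y ~ poly(x, 8)), newdata = data.frame(x = x0))
}

# EPE at x0 = irreducible error + bias^2 + variance of the prediction
epe <- function(pred) 0.5^2 + (mean(pred) - f(x0))^2 + var(pred)
c(linear = epe(pred_lin), poly8 = epe(pred_poly))

With these settings the biased straight-line fit typically has the smaller \(EPE\): its squared bias is more than offset by the much larger estimation variance of the flexible fit.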

How to build a prediction model

Reading: Steyerberg and Vergouwe (2014). Towards better clinical prediction models: seven steps for development and an ABCD for validation

DOI: 10.1093/eurheartj/ehu207

This reading lists seven steps to be undertaken when building a (risk) prediction model, a common objective in medical statistics. We will not discuss all the steps again but focus on steps 5 and 6, which are essential and have not been fully discussed so far.

Lecture 1

This lecture discusses the issue of optimism bias when evaluating the AUC, the notion of training and validation datasets, cross-validation, and how to produce an AUC free of optimism bias.

Lecture 1 in R

Lecture 1 in Stata

Optimism, training and validation datasets

First we need a good measure of prediction error for a binary endpoint. The ROC curve, or more exactly the AUC, introduced in week 11 can be used: the closer the AUC is to 1, the better the model discriminates between observed 0’s and 1’s. Another common measure is the Brier score, which directly measures the prediction error. Since the AUC is very popular we focus on it, but everything we say next also holds for the Brier score or other prediction error measures.
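Both measures are easy to compute from a vector of predicted probabilities. A minimal sketch on simulated data (ours, for illustration only), using the pROC package introduced in week 11:

library(pROC)
set.seed(1)

# toy data with one predictor, for illustration only
x <- rnorm(300)
y <- rbinom(300, 1, plogis(-1 + 0.8 * x))

fit <- glm(y ~ x, family = binomial)
p <- predict(fit, type = "response")   # in-sample predicted probabilities

auc(roc(y, p, quiet = TRUE))           # discrimination: AUC, the closer to 1 the better
mean((y - p)^2)                        # Brier score: mean squared prediction error, the closer to 0 the better

Both quantities are computed here on the same data used to fit the model, so they are naive (apparent) estimates; the optimism issue discussed next applies to either measure.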

The first point to consider is how we evaluate the AUC. A naive way to proceed is to build a model using steps 1-4 of the reading and then evaluate the AUC using the same data. The problem with this approach is that the data are used twice (once to build the model and a second time to evaluate the AUC). This results in an overestimation of the AUC (on average), a problem known as optimism bias, as briefly stated in week 11.

Over-optimism can be severe when large models are used relative to the amount of information in the data, since these tend to be overly tuned to idiosyncrasies of the observations and, as a result, will typically predict poorly on new, independent observations. This issue of overfitting arises equally with machine learning techniques such as neural networks and with complex statistical models.

A way to avoid optimism bias is to randomly split the dataset in two: the first dataset is used for training and is therefore called the training (or development) dataset; the second is used for validation, hence its name, the validation (or test) dataset. This is known as the split sample validation approach. It addresses the stability of the selection of predictors and the quality of predictions by using a random sample for model development and the remaining patients for validation. The logic is rather simple: 1) generate a random variable that splits the data in two (i.e. the equivalent of a coin toss deciding to which dataset each patient belongs) - call this indicator val; 2) fit the model on the development dataset (i.e. using only patients with val=0); 3) evaluate its performance (e.g. AUC) on the validation dataset (i.e. using only patients with val=1).

To illustrate, we again use the WCGS data and reproduce the analysis presented in Vittinghoff et al. (2012) p. 400 but delete the outlier in cholesterol (n=3141 observations). The endpoint is once again chd69 and the covariates retained for this exercise are age, chol, sbp, bmi and smoke. In Stata, the AUC in the validation dataset (n=1563) is 0.7195, 95% CI=(0.68 ; 0.76), a slightly lower value than in the development dataset (n=1578), AUC=0.7385, 95% CI=(0.69 ; 0.78), illustrating the over-optimism of the AUC computed in the development dataset. Slightly different values may be obtained in R, e.g. validation dataset AUC=0.713, 95% CI=(0.67 ; 0.76), mainly due to differences in the random number generation processes in the two programs.

R code and output

Show the code
wcgs <- read.csv("wcgs.csv")
wcgs<-data.frame(wcgs) 
wcgs1=cbind(wcgs$age,wcgs$chol,wcgs$sbp,wcgs$bmi,wcgs$smoke,wcgs$chd69)
colnames(wcgs1)=c("age", "chol", "sbp", "bmi", "smoke","chd69")
wcgs1=data.frame(wcgs1)
wcgs1=na.omit(wcgs1)
wcgs1<-wcgs1[wcgs1$chol < 645,]  
# remove outlier in cholesterol
n=dim(wcgs1)[1]
set.seed(1001)
wcgs1$val=rbinom(n,1,0.5) # val =0 for development and val=1 for validation
table(wcgs1$val)
## 
##    0    1 
## 1596 1545
# building model on development dataset
wcgs1.dev=wcgs1[wcgs1$val==0,]
fit.dev<-glm(chd69 ~ age+chol+sbp+bmi+smoke, family=binomial, data=wcgs1.dev)
summary(fit.dev)
## 
## Call:
## glm(formula = chd69 ~ age + chol + sbp + bmi + smoke, family = binomial, 
##     data = wcgs1.dev)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -13.016252   1.397805  -9.312  < 2e-16 ***
## age           0.063685   0.017046   3.736 0.000187 ***
## chol          0.013148   0.002172   6.054 1.42e-09 ***
## sbp           0.019793   0.005747   3.444 0.000573 ***
## bmi           0.060258   0.036968   1.630 0.103102    
## smoke         0.602919   0.203779   2.959 0.003090 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 881.60  on 1595  degrees of freedom
## Residual deviance: 786.35  on 1590  degrees of freedom
## AIC: 798.35
## 
## Number of Fisher Scoring iterations: 6
# evaluation on the validation dataset  
wcgs1.val=wcgs1[wcgs1$val==1,]
wcgs1.val$pred<- predict(fit.dev, wcgs1.val,type="response")  
# predicted probabilities for the validation dataset using the previous fit
require(pROC)
## Loading required package: pROC
## Warning: package 'pROC' was built under R version 4.4.1
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# ROC + AUC on the validation dataset (suffix .val)
out<- roc(wcgs1.val$chd69, wcgs1.val$pred,plot=TRUE,ci=TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

Show the code
out
## 
## Call:
## roc.default(response = wcgs1.val$chd69, predictor = wcgs1.val$pred,     ci = TRUE, plot = TRUE)
## 
## Data: wcgs1.val$pred in 1415 controls (wcgs1.val$chd69 0) < 130 cases (wcgs1.val$chd69 1).
## Area under the curve: 0.7127
## 95% CI: 0.6694-0.756 (DeLong)

Stata code and output

Show the code
use wcgs.dta
drop if chol>=645 
** missing chol and chol=645 deleted
set seed 1001
gen val = runiform()<.5   
** Derive a prediction model for chd69 using the development dataset
logistic chd69 age chol sbp bmi smoke if val==0
**Generate a new variable containing the predicted probabilities for all observations
predict fitted, pr
** AUC on the development data (training), n=1578; to compare to.
roctab chd69 fitted if  val==0
** AUC on the validation data  - n= 1563; the one we need!
roctab chd69 fitted if  val==1
roctab chd69 fitted if  val==1,  graph title("Validation")
## (13 observations deleted)
## 
## 
## 
## 
## Logistic regression                                     Number of obs =  1,578
##                                                         LR chi2(5)    =  86.80
##                                                         Prob > chi2   = 0.0000
## Log likelihood = -395.91089                             Pseudo R2     = 0.0988
## 
## ------------------------------------------------------------------------------
##        chd69 | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
## -------------+----------------------------------------------------------------
##          age |   1.045713   .0178468     2.62   0.009     1.011313    1.081284
##         chol |   1.013229   .0021723     6.13   0.000      1.00898    1.017496
##          sbp |   1.020727   .0062837     3.33   0.001     1.008485    1.033117
##          bmi |   1.046858   .0403416     1.19   0.235     .9707017    1.128988
##        smoke |   2.034663   .4094141     3.53   0.000     1.371558    3.018359
##        _cons |   6.91e-06   9.78e-06    -8.39   0.000     4.30e-07    .0001109
## ------------------------------------------------------------------------------
## Note: _cons estimates baseline odds.
## 
## 
## 
##                       ROC                     Asymptotic normal  
##            Obs       area     Std. err.      [95% conf. interval]
##      ------------------------------------------------------------
##          1,578     0.7385       0.0224        0.69452     0.78241
## 
## 
##                       ROC                     Asymptotic normal  
##            Obs       area     Std. err.      [95% conf. interval]
##      ------------------------------------------------------------
##          1,563     0.7195       0.0219        0.67657     0.76242

We leave it to you to change the seed or the proportion of patients in each dataset and check that you get slightly different results; a short sketch is given below. Although commonly used, the split sample validation approach is a suboptimal form of internal validation, partly because of this variability due to the random splitting of the dataset and the arbitrary proportions of the original data assigned to the development and test datasets.
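A quick way to see this variability is to repeat the split for several seeds and collect the validation AUCs. A minimal sketch, assuming the wcgs1 data frame created in the R code above (the seeds below are arbitrary):

library(pROC)

val_auc <- function(seed, data) {
  set.seed(seed)
  val <- rbinom(nrow(data), 1, 0.5)                      # random 50/50 split
  fit <- glm(chd69 ~ age + chol + sbp + bmi + smoke,
             family = binomial, data = data[val == 0, ])
  p <- predict(fit, newdata = data[val == 1, ], type = "response")
  as.numeric(auc(roc(data$chd69[val == 1], p, quiet = TRUE)))
}

sapply(c(1001, 2002, 3003, 4004, 5005), val_auc, data = wcgs1)
# the validation AUC varies noticeably from one random split to the next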

Different forms of validation

A more efficient way of obtaining optimism-corrected measures of performance is to use a process called \(h\)-fold cross-validation, where the whole dataset is split multiple times into development and test datasets so that all the data are used for both steps. We illustrate the procedure using \(h=10\), often considered the default, but other values can be used (typically between 10 and 20).

\(h\)-fold validation:
  1. Randomly divide the data into \(h=10\) mutually exclusive subsets of approximately the same size.

  2. In turn, set aside each subset of the data (approximately 10%) and use the remaining other subsets (approximately 90%) to fit (develop) the model.

  3. On each turn, use the parameter estimates of the developed model to get predicted probabilities and AUC (or any other measure of prediction error) in the 10% subset of observations that were set aside.

  4. Repeat this for all 10 subsets and compute a summary estimate of the ten AUCs (or any other measure of prediction error)

In addition, this 4-step procedure can be repeated \(k\) times and the average AUC computed, but for simplicity we will consider the case \(k=1\). The sketch below shows how the four steps translate into code.
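A minimal sketch of the four steps, assuming the wcgs1 data frame used above (the full implementation shown further below uses the cvAUC package, which also provides a confidence interval):

library(pROC)
set.seed(1001)

h <- 10
fold <- sample(rep(1:h, length.out = nrow(wcgs1)))        # step 1: random folds of roughly equal size
cv_auc <- numeric(h)

for (v in 1:h) {
  fit <- glm(chd69 ~ age + chol + sbp + bmi + smoke,      # step 2: fit on the other ~90%
             family = binomial, data = wcgs1[fold != v, ])
  p <- predict(fit, newdata = wcgs1[fold == v, ], type = "response")   # step 3: predict the ~10% set aside
  cv_auc[v] <- as.numeric(auc(roc(wcgs1$chd69[fold == v], p, quiet = TRUE)))
}

mean(cv_auc)   # step 4: summary over the 10 fold-specific AUCs

Averaging the ten fold-specific AUCs is one simple summary; the cvAUC package used below pools the cross-validated predictions, which is a closely related estimate. Repeating steps 1 to 4 with \(k\) different random fold assignments and averaging would give a repeated cross-validation estimate.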

Going back to the WCGS data and model used previously, the naive estimate of the AUC (based on 3141 observations) is 0.732, 95% CI=(0.702 ; 0.763). A 10-fold cross-validated AUC computed through steps 1) to 4) is 0.727, 95% CI=(0.697 ; 0.757) in R. Note that the 95% CI is narrower than the one obtained with the split sample approach since all the data have been used. The optimism-adjusted AUC of 0.727 from the cross-validation approach is a bit larger than the value from the split sample approach (0.713), suggesting that we may have drawn an unlucky split in the split sample approach. Nevertheless, the cross-validated AUC of 0.727 is less than the original 0.732, demonstrating again the correction for over-optimism.

R code and output

Show the code
require(cvAUC)
## Loading required package: cvAUC
## Warning: package 'cvAUC' was built under R version 4.4.1
## function required to run CV
cv_eval <- function(data, V=10){
  f.cvFolds <- function(Y, V){ #Create CV folds (stratify by outcome)
    Y0 <- split(sample(which(Y==0)), rep(1:V, length=length(which(Y==0))))
    Y1 <- split(sample(which(Y==1)), rep(1:V, length=length(which(Y==1))))
    folds <- vector("list", length=V)
    for (v in seq(V)) {folds[[v]] <- c(Y0[[v]], Y1[[v]])}
    return(folds)
  }
  f.doFit <- function(v, folds, data){ #Train/test glm for each fold
    fit <- glm(Y~., data=data[-folds[[v]],], family=binomial)
    pred <- predict(fit, newdata=data[folds[[v]],], type="response")
    return(pred)
  }
  folds <- f.cvFolds(Y=data$Y, V=V) #Create folds
  predictions <- unlist(sapply(seq(V), f.doFit, folds=folds, data=data)) #CV train/predict
  predictions[unlist(folds)] <- predictions #Re-order pred values
  # Get CV AUC and confidence interval
  out <- ci.cvAUC(predictions=predictions, labels=data$Y, folds=folds, confidence=0.95)
  return(out)
}

# this function requires a dataset with only the covariates 
# and the endpoint recoded Y as columns, NOTE that the outcome (last
# column) is now named Y

wcgs1=wcgs1[,1:6]
colnames(wcgs1)=c("age", "chol", "sbp", "bmi", "smoke","Y")
fit<-glm(Y ~ age+chol+sbp+bmi+smoke, family=binomial, data=wcgs1)
set.seed(1001)
out <- cv_eval(data=wcgs1, V=10)
out
## $cvAUC
## [1] 0.7269148
## 
## $se
## [1] 0.01518409
## 
## $ci
## [1] 0.6971545 0.7566751
## 
## $confidence
## [1] 0.95

As expected, the naive AUC estimate is slightly bigger than its cross-validated counterpart, illustrating the small optimism bias discussed above. The fact that the difference is small suggests that the model used to predict chd69 is not badly overfitted. Once again, the result depends on the seed and may differ (slightly) across packages. In this instance, we get a cross-validated AUC=0.727, 95% CI=(0.696 ; 0.758) in Stata with the same seed. The code below shows how to do this.

Stata code and output

Show the code
clear
use wcgs.dta
drop if chol>=645 
** missing chol and chol=645 deleted
set seed 1001
xtile group = uniform(), nq(10)
quietly gen cvfitted = .
forvalues i = 1/10 {

    ** Step 2: estimate model omitting each subset
    quietly xi: logistic chd69 age chol sbp bmi smoke if group~=`i'
    quietly predict cvfittedi, pr
    
    ** Step 3: save cross-validated statistic for each omitted subset
    quietly replace cvfitted = cvfittedi if group==`i'
    quietly drop cvfittedi
    }

** Step 4: calculate cross-validated area under ROC curve
roctab chd69 cvfitted
## (13 observations deleted)
## 
## 
## 
## 
## 
## 
##                       ROC                     Asymptotic normal  
##            Obs       area     Std. err.      [95% conf. interval]
##      ------------------------------------------------------------
##          3,141     0.7270       0.0157        0.69619     0.75779

Cross-validation is not the only technique that can be used to correct for optimism bias; the bootstrap is also a popular method. We do not describe in detail how it can be used in this context but provide an optional reading for interested students, and a brief sketch of the idea below.

link
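In brief, the bootstrap approach estimates the optimism directly: refit the model on bootstrap resamples, measure how much better each refit performs on its own resample than on the original data, and subtract the average of these differences from the apparent AUC. A minimal sketch of this idea (our own, using the wcgs1 data frame as prepared for the cross-validation code above, with the outcome named Y); the number of resamples B is an arbitrary choice and the loop takes a little while to run:

library(pROC)
set.seed(1001)

form <- Y ~ age + chol + sbp + bmi + smoke
auc_of <- function(fit, data) {
  as.numeric(auc(roc(data$Y, predict(fit, newdata = data, type = "response"),
                     quiet = TRUE)))
}

fit_orig <- glm(form, family = binomial, data = wcgs1)
apparent <- auc_of(fit_orig, wcgs1)              # naive (apparent) AUC

B <- 200                                         # number of bootstrap resamples
optimism <- replicate(B, {
  boot <- wcgs1[sample(nrow(wcgs1), replace = TRUE), ]
  fit_b <- glm(form, family = binomial, data = boot)
  auc_of(fit_b, boot) - auc_of(fit_b, wcgs1)     # apparent minus test performance of the refit
})

apparent - mean(optimism)                        # optimism-corrected AUC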

So far we have only been talking about internal validation, whether using the split sample approach or cross-validation; the word internal is used because we only use the dataset under investigation. Another form of validation, called external validation, exists and is generally considered superior because it provides a tougher test of a model’s ability to predict: data from a different source are used to compute the AUC. External validation addresses the transportability of a model rather than solely its reproducibility.
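In code, external validation looks like the validation step of the split sample approach, except that the model is fitted to all of the original data and the predictions are evaluated on data from another source. A minimal sketch, assuming a hypothetical file external_cohort.csv containing the same variables (with the outcome again named Y):

library(pROC)

# final model fitted to the full WCGS data (wcgs1 as prepared above, outcome Y)
fit_full <- glm(Y ~ age + chol + sbp + bmi + smoke, family = binomial, data = wcgs1)

# hypothetical external cohort with the same variables
ext <- read.csv("external_cohort.csv")
p_ext <- predict(fit_full, newdata = ext, type = "response")

roc(ext$Y, p_ext, ci = TRUE, quiet = TRUE)   # externally validated AUC with 95% CI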

Investigation

  1. The infarct data provides information on 200 female patients who experienced MI. The response is mi (0=no,1=yes), and the dataset also contains the variables age in years, smoke (0=no,1=yes) and oral for oral contraceptive use (0=no,1=yes). You can check that these covariates are all related to the MI outcome. We will assume that the age effect is rather linear - you can also check that it seems a reasonable assumption for these data. Fit a logistic regression for mi, give the naive AUC and its 95% CI, and draw the ROC curve. Does the model seem to predict well?

  2. We have used the same data to build and evaluate the model, so the AUC may be affected by optimism bias. Use the split sample validation approach to compute an optimism-adjusted AUC. Do you think this is a good approach to use here?

  3. Calculate a 10-fold cross-validated AUC and compare with the naive estimate. What do you conclude?

Of course, the concept of cross-validation illustrated here for logistic regression is general and can be used in linear regression as well, with an appropriate measure of prediction error such as the mean squared error; a sketch is given below.
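A minimal sketch for a continuous outcome, on simulated data (ours, for illustration only), where the cross-validated mean squared prediction error plays the role of the AUC:

set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 0.5 * dat$x1 - 0.3 * dat$x2 + rnorm(n)     # continuous outcome

h <- 10
fold <- sample(rep(1:h, length.out = n))                # random folds
mse <- numeric(h)
for (v in 1:h) {
  fit <- lm(y ~ x1 + x2, data = dat[fold != v, ])       # fit on the other folds
  pred <- predict(fit, newdata = dat[fold == v, ])
  mse[v] <- mean((dat$y[fold == v] - pred)^2)           # squared prediction error in the held-out fold
}
mean(mse)   # 10-fold cross-validated MSE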

Calibration

This lecture discusses how calibration plots can be obtained and why they are important in risk prediction modelling.

Lecture 2

Lecture 2 in R

Lecture 2 in Stata

Evaluation of model performance does not stop with the computation of an external or cross-validated AUC and the assessment of the model’s discriminative ability. A prediction model should also show good agreement between observed outcomes and predictions over the whole range of predicted values. In the WCGS example, it may well be that the model correctly predicts the outcome on average for patients at low risk of CHD but is not so accurate for high-risk patients. This matters because poorly calibrated risk estimates can lead to false expectations among patients and healthcare professionals. The term calibration describes the agreement between observed outcomes and their predicted counterparts. A calibration plot, displaying observed vs predicted probabilities with 95% CIs, is often used to get a visual impression of how close they are to the 45-degree line, closeness being taken as a demonstration of good calibration. Calibration plots can also be affected by overfitting and optimism bias so, again, plots based on external data are preferable. If no external data are available the split sample approach can be used. It may be possible to use 10-fold cross-validation, but this is more complicated since there is no straightforward way to average the 10 calibration plots that would arise (one for each 10% subset of data) in order to draw a single calibration plot.

To illustrate the concept of calibration, we return to the WCGS example with the split sample approach to validation. The following plot is obtained for the test dataset when the data are grouped into 10 categories of risk. We could choose more categories, and a larger number may make sense for large datasets, but keep in mind that the more categories we choose, the wider the confidence intervals. The short sketch below shows the mechanics of such a plot; the full code used for the figure follows.
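The mechanics are simple even though the plotting function used below is lengthy: group patients by deciles of predicted risk, then within each group compare the mean predicted probability with the observed event proportion and add an approximate binomial confidence interval. A minimal sketch, assuming the validation data wcgs1.val and its predicted probabilities pred from the split sample code above:

# decile groups of predicted risk
g <- cut(wcgs1.val$pred, quantile(wcgs1.val$pred, probs = seq(0, 1, 0.1)),
         include.lowest = TRUE)

pred_mean <- tapply(wcgs1.val$pred, g, mean)      # mean predicted risk per group
obs_prop  <- tapply(wcgs1.val$chd69, g, mean)     # observed event proportion per group
n_g       <- tapply(wcgs1.val$chd69, g, length)
se        <- sqrt(obs_prop * (1 - obs_prop) / n_g)

plot(pred_mean, obs_prop, xlim = c(0, 0.3), ylim = c(0, 0.3),
     xlab = "Predicted probability", ylab = "Observed frequency")
arrows(pred_mean, pmax(0, obs_prop - 1.96 * se), pred_mean, obs_prop + 1.96 * se,
       angle = 90, code = 3, length = 0.03)       # approximate 95% CIs
abline(0, 1, lty = 2)                             # 45-degree line of perfect calibration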

R code and output

Show the code
require(rms)
# val.prob(wcgs1.val$pred, wcgs1.val$chd69, m=150, cex=.5) # without CIs

# calibration function to get the 95% CI
val.prob.ci<-
  function(p, y, logit, group, weights = rep(1, length(y)), normwt = F, pl = T, 
           smooth = T, logistic.cal = F, xlab = "Predicted probability", ylab = 
             "Observed frequency", xlim = c(-0.02, 1),ylim = c(-0.15,1), m, g, cuts, emax.lim = c(0, 1), 
           legendloc =  c(0.55 , 0.27), statloc = c(0,.85),dostats=c(12,13,2,15,3),roundstats=2,
           riskdist = "predicted", cex = 0.75, mkh = 0.02, connect.group = 
             F, connect.smooth = T, g.group = 4, evaluate = 100, nmin = 0, d0lab="0", d1lab="1", cex.d01=0.7,
           dist.label=0.04, line.bins=-.05, dist.label2=.03, cutoff, cex.lab=1, las=1, length.seg=1)  {
    if(missing(p))
      p <- 1/(1 + exp( - logit))
    else logit <- log(p/(1 - p))
    if(length(p) != length(y))
      stop("lengths of p or logit and y do not agree")
    names(p) <- names(y) <- names(logit) <- NULL
    if(!missing(group)) {
      if(length(group) == 1 && is.logical(group) && group)
        group <- rep("", length(y))
      if(!is.factor(group))
        group <- if(is.logical(group) || is.character(group)) 
          as.factor(group) else cut2(group, g = 
                                       g.group)
      names(group) <- NULL
      nma <- !(is.na(p + y + weights) | is.na(group))
      ng <- length(levels(group))
    }
    else {
      nma <- !is.na(p + y + weights)
      ng <- 0
    }
    logit <- logit[nma]
    y <- y[nma]
    p <- p[nma]
    if(ng > 0) {
      group <- group[nma]
      weights <- weights[nma]
      return(val.probg(p, y, group, evaluate, weights, normwt, nmin)
      )
    }
    if(length(unique(p)) == 1) {
      #22Sep94
      P <- mean(y)
      Intc <- log(P/(1 - P))
      n <- length(y)
      D <- -1/n
      L01 <- -2 * sum(y * logit - log(1 + exp(logit)), na.rm = T)
      L.cal <- -2 * sum(y * Intc - log(1 + exp(Intc)), na.rm = T)
      U.chisq <- L01 - L.cal
      U.p <- 1 - pchisq(U.chisq, 1)
      U <- (U.chisq - 1)/n
      Q <- D - U
      
      stats <- c(0, 0.5, 0, D, 0, 1, U, U.chisq, U.p, Q, mean((y - p[
        1])^2), Intc, 0, rep(abs(p[1] - P), 2))
      names(stats) <- c("Dxy", "C (ROC)", "R2", "D", "D:Chi-sq", 
                        "D:p", "U", "U:Chi-sq", "U:p", "Q", "Brier", 
                        "Intercept", "Slope", "Emax", "Eavg")
      return(stats)
    }
    i <- !is.infinite(logit)
    nm <- sum(!i)
    if(nm > 0)
      warning(paste(nm, 
                    "observations deleted from logistic calibration due to probs. of 0 or 1"
      ))
    f <- lrm.fit(logit[i], y[i])
    f2<-    lrm.fit(offset=logit[i], y=y[i])
    stats <- f$stats
    n <- stats["Obs"]
    predprob <- seq(emax.lim[1], emax.lim[2], by = 0.0005)
    lt <- f$coef[1] + f$coef[2] * log(predprob/(1 - predprob))
    calp <- 1/(1 + exp( - lt))
    emax <- max(abs(predprob - calp))
    if (pl) {
      plot(0.5, 0.5, xlim = xlim, ylim = ylim, type = "n", xlab = xlab, 
           ylab = ylab, las=las)
      abline(0, 1, lty = 2)
      lt <- 2
      leg <- "Ideal"
      marks <- -1
      if (logistic.cal) {
        lt <- c(lt, 1)
        leg <- c(leg, "Logistic calibration")
        marks <- c(marks, -1)
      }
      if (smooth) {
        Sm <- lowess(p, y, iter = 0)
        if (connect.smooth) {
          lines(Sm, lty = 3)
          lt <- c(lt, 3)
          marks <- c(marks, -1)
        }
        else {
          points(Sm)
          lt <- c(lt, 0)
          marks <- c(marks, 1)
        }
        leg <- c(leg, "Nonparametric")
        cal.smooth <- approx(Sm, xout = p)$y
        eavg <- mean(abs(p - cal.smooth))
      }
      if(!missing(m) | !missing(g) | !missing(cuts)) {
        if(!missing(m))
          q <- cut2(p, m = m, levels.mean = T, digits = 7)
        else if(!missing(g))
          q <- cut2(p, g = g, levels.mean = T, digits = 7)
        else if(!missing(cuts))
          q <- cut2(p, cuts = cuts, levels.mean = T, digits = 7)
        means <- as.single(levels(q))
        prop <- tapply(y, q, function(x)mean(x, na.rm = T))
        points(means, prop, pch = 2, cex=cex)
        #18.11.02: CI triangles         
        ng  <-tapply(y, q, length)
        og  <-tapply(y, q, sum)
        ob  <-og/ng
        se.ob   <-sqrt(ob*(1-ob)/ng)
        g       <- length(as.single(levels(q)))
        
        for (i in 1:g) lines(c(means[i], means[i]), c(prop[i],min(1,prop[i]+1.96*se.ob[i])), type="l")
        for (i in 1:g) lines(c(means[i], means[i]), c(prop[i],max(0,prop[i]-1.96*se.ob[i])), type="l")
        
        if(connect.group) {
          lines(means, prop)
          lt <- c(lt, 1)
        }
        else lt <- c(lt, 0)
        leg <- c(leg, "Grouped patients")
        marks <- c(marks, 2)
      }
    }
    lr <- stats["Model L.R."]
    p.lr <- stats["P"]
    D <- (lr - 1)/n
    L01 <- -2 * sum(y * logit - logb(1 + exp(logit)), na.rm = TRUE)
    U.chisq <- L01 - f$deviance[2]
    p.U <- 1 - pchisq(U.chisq, 2)
    U <- (U.chisq - 2)/n
    Q <- D - U
    Dxy <- stats["Dxy"]
    C <- stats["C"]
    R2 <- stats["R2"]
    B <- sum((p - y)^2)/n
    # ES 15dec08 add Brier scaled
    Bmax  <- mean(y) * (1-mean(y))^2 + (1-mean(y)) * mean(y)^2
    Bscaled <- 1 - B/Bmax
    stats <- c(Dxy, C, R2, D, lr, p.lr, U, U.chisq, p.U, Q, B, 
               f2$coef[1], f$coef[2], emax, Bscaled)
    names(stats) <- c("Dxy", "C (ROC)", "R2", "D", "D:Chi-sq", 
                      "D:p", "U", "U:Chi-sq", "U:p", "Q", "Brier", "Intercept", 
                      "Slope", "Emax", "Brier scaled")
    if(smooth)
      stats <- c(stats, c(Eavg = eavg))
    
    # Cut off definition    
    if(!missing(cutoff)) {
      arrows(x0=cutoff,y0=.1,x1=cutoff,y1=-0.025,length=.15)
    }   
    if(pl) {
      logit <- seq(-7, 7, length = 200)
      prob <- 1/(1 + exp( - logit))
      pred.prob <- f$coef[1] + f$coef[2] * logit
      pred.prob <- 1/(1 + exp( - pred.prob))
      if(logistic.cal) lines(prob, pred.prob, lty = 1)  
      # pc <- rep(" ", length(lt))
      # pc[lt==0] <- "."
      lp <- legendloc
      if (!is.logical(lp)) {
        if (!is.list(lp)) 
          lp <- list(x = lp[1], y = lp[2])
        legend(lp, leg, lty = lt, pch = marks, cex = cex, bty = "n")
      }
      if(!is.logical(statloc)) {
        dostats <- dostats
        leg <- format(names(stats)[dostats])    #constant length
        leg <- paste(leg, ":", format(stats[dostats], digits=roundstats), sep = 
                       "")
        if(!is.list(statloc))
          statloc <- list(x = statloc[1], y = statloc[2])
        text(statloc, paste(format(names(stats[dostats])), 
                            collapse = "\n"), adj = 0, cex = cex)
        text(statloc$x + (xlim[2]-xlim[1])/3 , statloc$y, paste(
          format(round(stats[dostats], digits=roundstats)), collapse = 
            "\n"), adj = 1, cex = cex)  
      }
      if(is.character(riskdist)) {
        if(riskdist == "calibrated") {
          x <- f$coef[1] + f$coef[2] * log(p/(1 - p))
          x <- 1/(1 + exp( - x))
          x[p == 0] <- 0
          x[p == 1] <- 1
        }
        else x <- p
        bins <- seq(0, min(1,max(xlim)), length = 101) 
        x <- x[x >= 0 & x <= 1]
        #08.04.01,yvon: distribution of predicted prob according to outcome
        f0  <-table(cut(x[y==0],bins))
        f1  <-table(cut(x[y==1],bins))
        j0  <-f0 > 0
        j1  <-f1 > 0
        bins0 <-(bins[-101])[j0]
        bins1 <-(bins[-101])[j1]
        f0  <-f0[j0]
        f1  <-f1[j1]
        maxf <-max(f0,f1)
        f0  <-(0.1*f0)/maxf
        f1  <-(0.1*f1)/maxf
        segments(bins1,line.bins,bins1,length.seg*f1+line.bins)
        segments(bins0,line.bins,bins0,length.seg*-f0+line.bins)
        lines(c(min(bins0,bins1)-0.01,max(bins0,bins1)+0.01),c(line.bins,line.bins))
        text(max(bins0,bins1)+dist.label,line.bins+dist.label2,d1lab,cex=cex.d01)
        text(max(bins0,bins1)+dist.label,line.bins-dist.label2,d0lab,cex=cex.d01)
      }
    }
    stats
  }

# calibration plot - with 95% CIs - using the validation dataset
val.prob.ci(wcgs1.val$pred,wcgs1.val$chd69, pl=T,smooth=T,logistic.cal=F,
            g=10)

          Dxy       C (ROC)            R2             D      D:Chi-sq 
 4.455546e-01  7.227773e-01  1.027665e-01  4.398439e-02  6.943971e+01 
          D:p             U      U:Chi-sq           U:p             Q 
 1.110223e-16 -7.891226e-04  7.721253e-01  6.797279e-01  4.477351e-02 
        Brier     Intercept         Slope          Emax  Brier scaled 
 6.973287e-02 -3.298974e-02  9.088922e-01  6.825304e-02  4.213247e-02 
         Eavg 
 7.421886e-03 

Stata code and output - figure will be displayed when running the code

A note for Stata users: the calibration plot can be obtained using the command pmcalplot. To be able to use this command you need to proceed as follows: 1) install it by typing ssc install pmcalplot; 2) install the required package sed9_2.pkg by first typing findit sed9_2 and then clicking on the link found by Stata. Once these two steps are done, the command pmcalplot should work.

Show the code
use wcgs.dta
drop if chol > 645 
set seed 1001
gen val = runiform()<.5 
logistic chd69 age chol sbp bmi smoke if val==0
predict proba, pr

** Use pmcalplot to produce a calibration plot of the apparent performance of
** the model in the Training dataset

pmcalplot proba chd69 if val==0, ci

** Now produce the calibration plot for the models performance in the
** validation dataset = the one we need

pmcalplot proba chd69 if val==1, ci
** highest category poorly fitted

** Now produce the calibration plot for the models performance in the
** validation cohort using risk thresholds of <5%, 5-15%, 15-20% and >20%.
** This is useful if the model has been proposed for decision making at these
** thresholds.  Use the keep option to return the groupings to your dataset

pmcalplot proba chd69 if val==1, cut(.05 0.10 .15 .20) ci keep

## (12 observations deleted)
## 
## 
## 
## 
## Logistic regression                                     Number of obs =  1,579
##                                                         LR chi2(5)    = 100.20
##                                                         Prob > chi2   = 0.0000
## Log likelihood = -413.32347                             Pseudo R2     = 0.1081
## 
## ------------------------------------------------------------------------------
##        chd69 | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
## -------------+----------------------------------------------------------------
##          age |   1.065177   .0175377     3.83   0.000     1.031353    1.100111
##         chol |   1.012124   .0020439     5.97   0.000     1.008126    1.016138
##          sbp |   1.019312   .0064403     3.03   0.002     1.006767    1.032014
##          bmi |   1.047462   .0397636     1.22   0.222     .9723557     1.12837
##        smoke |   2.443136   .4891536     4.46   0.000     1.650152     3.61719
##        _cons |   4.26e-06   5.96e-06    -8.83   0.000     2.74e-07    .0000663
## ------------------------------------------------------------------------------
## Note: _cons estimates baseline odds.
## 
## 
## Binary option selected: Calibration plot for logistic prediction model displayi
## > ng...
## 
## Binary option selected: Calibration plot for logistic prediction model displayi
## > ng...
## 
## Binary option selected: Calibration plot for logistic prediction model displayi
## > ng...
## 
## WARNING: Do not use the cut-point option unless the model has prespecified clin
## > ically relevant cut-points.

The plot is reasonably close to the 45-degree line except for the highest risk category, where the CHD risk tends to be overestimated by the model predictions (although the 95% CI may just reach the 45-degree line). This plot provides an optimism-corrected internal assessment of calibration, since the development and validation data come from a split of the same original dataset. We would proceed in a similar way for external calibration were new data available, the difference being that we would use the whole WCGS dataset to fit the model and the new data to build the calibration plot. Constructing calibration plots is part of the validation process in general.

The next logical question is: what do we do if the calibration plot is poor? This can occur, particularly when external calibration is conducted, for example because the baseline risk is higher in the new sample or because of other differences between the two populations. This is a difficult problem with no unique fix. Options include: 1) recalibrate the intercept, i.e. recalibration in the large; 2) recalibrate the intercept and the beta coefficients; 3) change the model to find one that is well calibrated! More details are given in the Steyerberg and Vergouwe paper provided as a reading earlier, and a brief sketch of options 1) and 2) follows.
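Both recalibration options amount to refitting a simple logistic model to the new data with the original linear predictor as the only ingredient. A minimal sketch, assuming fit.dev from the split sample code above plays the role of the original model and new_data is a hypothetical external dataset containing chd69 and the same covariates:

# linear predictor of the original model evaluated on the new data (logit scale)
new_data$lp <- predict(fit.dev, newdata = new_data)

# 1) recalibration-in-the-large: re-estimate the intercept only,
#    keeping the original coefficients fixed through an offset
fit_int <- glm(chd69 ~ offset(lp), family = binomial, data = new_data)

# 2) logistic recalibration: re-estimate the intercept and a common slope
#    applied to the original linear predictor
fit_slope <- glm(chd69 ~ lp, family = binomial, data = new_data)
coef(fit_slope)   # a calibration slope well below 1 suggests the original model was overfitted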

Optional reading: Those of you who want to know more can read the following paper which contains particularly illustrative examples and plots: Van Calster et al. (2020), Calibration: the Achilles heel of predictive analytics, BMC Medicine.

Practice

  1. We continue with the WCGS data. In week 10, we came up with a more complex model for chd69 after deletion of the outlier (chol=645) - see Table 5.18 in Vittinghoff et al. (2012). Use this model to build a calibration plot based on the split sample approach. Provide an interpretation. We use the split sample approach for simplicity and we note that we don’t have external data to undertake external validation.

  2. Has the calibration plot improved compared with the simpler model? Keep the same seed here for a fair comparison.

  3. Repeat the procedure with 20 groups in the calibration plot and also using more patients in the development/training dataset (2/3 is also used as indicated in Vittinghoff et al. (2012) p. 399).

Recap on building a model focusing on explanation/interpretation

As a final note we remind you that prediction is one of two scenarios for analysing health data. The other scenario is to explain/interpret the role of specific variables in relationship to an outcome of interest. For explanation/interpretation, we further distinguish two cases: 1) evaluating a (exposure) variable of primary interest; 2) identifying what factors relate to the outcome.

Generally speaking, recommendations developed in week 8 apply but, as we’ve previously noted, there is no absolute consensus on the best way to build a model. A few points are worth adding. You may also want to read the beginning of Chapter 10 (pp. 395-402) of Vittinghoff et al. (2012).

Regarding 1), when the study is a randomised clinical trial and the focus is therefore on assessing the effect of treatment, it is uncommon to adjust for a baseline measurement of the outcome because a binary outcome is rarely measured at baseline. Whether we should adjust for anything in a randomised clinical trial, and if so what to adjust for, is controversial; our view is that any adjustment should be for a small number of factors pre-specified in a statistical analysis plan. Note that the OR is technically not the same once we adjust for covariates: it is a conditional rather than a marginal OR, although in most cases the difference is small. In the case of a non-randomised comparison between two groups, adjustment or more sophisticated methods (e.g. propensity scores) are usually essential.

Regarding 2), model building is complex and we do not generally recommend automated model selection techniques including the backward selection procedure that appears to be favoured in Vittinghoff et al. (2012). We nevertheless agree that if an automated procedure is used then it is preferable to use backward rather than forward selection. In a long commentary paper published in 2020 on the State of the art in selection of variables and functional forms in multivariable analysis, the STRATOS group, which includes eminent statisticians, concluded:

Our overview revealed that there is not yet enough evidence on which to base recommendations for the selection of variables and functional forms in multivariable analysis. Such evidence may come from comparisons between alternative methods.

They also listed seven important topics for the direction of further research. Elements of this commentary could be useful to you in the future but it might be better to read it after having completed RM1 and RM2.

Summary

The following are the key takeaway messages from this week:
  1. There is a fundamental difference between models built to predict and models used to explain.

  2. Evaluation of the performance of a (risk) prediction model requires several steps including the assessment of its discriminative ability and calibration.

  3. Optimism bias exists but can be removed using appropriate techniques, e.g. cross-validation.

  4. External validation provides a sterner test of a new prediction model than internal validation but is not always possible.

  5. Calibration plots assess whether the prediction model predicts well over the whole range of predicted values.

  6. The general ideas covered in earlier weeks of RM1 for building linear models with interpretation as the main focus extend to binary logistic regression.