Chapter 5 Extension to classification

Our high-dimensional considerations so far focused on the linear regression model. We now extend this to classification.

5.1 Classification and logistic regression

We start with standard logistic regression, where the response \(G\) takes values \(0\) and \(1\), and the aim is to do prediction based on covariates \(X=(X_1,\ldots,X_p)\). We model the probability of success \[p(x;\beta)=P(G=1|X=x;\beta)\] assuming a binomial distribution with logit link function

\[{\rm logit}(x;\beta)=\log \Big(\frac{p(x;\beta)}{1-p(x;\beta)}\Big)=x^T\beta.\]

In logistic regression we estimate the regression parameter \(\beta\) by maximizing the log-likelihood

\[\begin{align*} \ell(\beta|{\bf y,X})&=\sum_{i=1}^{n}(1-y_i)\log(1-p(x_i;\beta))+y_i\log p(x_i;\beta)\\ &=\sum_{i=1}^{n}y_i x_i^T\beta - \log(1+\exp(x_i^T\beta)). \end{align*}\]
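This log-likelihood can also be coded directly in R and maximized numerically, e.g. with optim. The following is only a sketch on simulated data (the function name negloglik and the simulated data are illustrative, not part of the text); glm gives the same estimates.

# negative log-likelihood of logistic regression, following the formula above:
# sum_i y_i x_i' beta - log(1 + exp(x_i' beta))
negloglik <- function(beta, X, y) {
  eta <- drop(X %*% beta)
  -sum(y * eta - log(1 + exp(eta)))
}

# illustration on simulated data
set.seed(1)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
beta.true <- c(-1, 2, -0.5)
y <- rbinom(n, size = 1, prob = plogis(drop(X %*% beta.true)))

opt <- optim(rep(0, p), negloglik, X = X, y = y, method = "BFGS")
cbind(optim = opt$par, glm = coef(glm(y ~ X - 1, family = "binomial")))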

Given an estimate \(\hat \beta\) (from training data), prediction for new input data \(X_{\rm new}\) can be obtained via the predicted probability of success \(p(X_{\rm new};\hat \beta)\), e.g. by assigning the class label with the larger predicted probability

\[\begin{align*} \hat{G}(X_{\rm new})=\left\{ \begin{array}{ll} 1, & \mbox{if $p(X_{\rm new};\hat\beta)>0.5$}.\\ 0, & \mbox{otherwise}. \end{array} \right. \end{align*}\]

There are different measures to judge the quality of the predictions. We focus on the misclassification error, which is simply the fraction of misclassified test samples. Another important measure in the context of classification is the receiver operating characteristic (ROC) curve.
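As a sketch, class prediction at cut-off 0.5, the misclassification error and an ROC curve (here via the pROC package, one option among several) could be computed as follows; fit, dat.test and y.test are placeholders for a fitted glm object and test data, not objects defined in this chapter.

# predicted probabilities, class labels at cut-off 0.5, misclassification error
prob <- predict(fit, newdata = dat.test, type = "response")
ghat <- ifelse(prob > 0.5, 1, 0)
mean(ghat != y.test)                 # misclassification error

# ROC curve, e.g. via the pROC package
library(pROC)
plot(roc(response = y.test, predictor = prob))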

We illustrate logistic regression using an example taken from the book by Hastie, Tibshirani, and Friedman (2001). The data shown in Figure 5.1 are a subset of the Coronary Risk-Factor Study (CORIS) baseline survey, carried out in three rural areas of the Western Cape, South Africa. The aim of the study was to establish the intensity of ischemic heart disease risk factors in that high-incidence region. The data represent white males between 15 and 64, and the response variable is the presence or absence of myocardial infarction (MI) at the time of the survey (the overall prevalence of MI was 5.1% in this region). There are 160 cases in our data set, and a sample of 302 controls. The variables are:

  • sbp: systolic blood pressure
  • tobacco: cumulative tobacco (kg)
  • ldl: low density lipoprotein cholesterol
  • adiposity
  • famhist: family history of heart disease (Present, Absent)
  • obesity
  • alcohol: current alcohol consumption
  • age: age at onset
  • chd: response, coronary heart disease.

Figure 5.1: Pairs plot of variables for the South African Heart Disease Data. Red points indicate cases and blue points indicate controls.

We first fit a logistic regression model using the function glm.

fit.logistic <- glm(chd~sbp+tobacco+ldl+famhist+obesity+alcohol+age,
                    data=dat,
                    family="binomial")
kable(broom::tidy(fit.logistic),digits=3)
term            estimate  std.error  statistic  p.value
(Intercept)       -4.130      0.964     -4.283    0.000
sbp                0.006      0.006      1.023    0.306
tobacco            0.080      0.026      3.034    0.002
ldl                0.185      0.057      3.219    0.001
famhistPresent     0.939      0.225      4.177    0.000
obesity           -0.035      0.029     -1.187    0.235
alcohol            0.001      0.004      0.136    0.892
age                0.043      0.010      4.181    0.000

There are some surprises in this table of coefficients, which must be interpreted with caution. Systolic blood pressure (sbp) is not significant! Nor is obesity, and its sign is negative. This confusion is a result of the correlations among the predictors. On their own, both sbp and obesity are significant, and with positive sign. However, in the presence of many other correlated variables they are no longer needed (and can even get a negative sign).
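This can be checked by fitting the two variables on their own (a quick sketch, using the same data frame dat as above):

# univariate fits: on their own, sbp and obesity get positive coefficients
summary(glm(chd ~ sbp, data = dat, family = "binomial"))$coefficients
summary(glm(chd ~ obesity, data = dat, family = "binomial"))$coefficients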

How does one interpret a coefficient of \(0.080\) (Std. Error = \(0.026\)) for tobacco, for example? Tobacco is measured in total lifetime usage in kilograms, with a median of \(1.0\)kg for the controls and \(4.1\)kg for the cases. Thus an increase of 1kg in lifetime tobacco usage increases the odds of coronary heart disease by a factor of \(\exp(0.080)=1.083\), i.e. by \(8.3\)%. Incorporating the standard error we get an approximate \(95\)% confidence interval of \(\exp(0.080 \pm 2 \times 0.026)=(1.03,1.14)\).
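In R, the odds ratio and the approximate confidence interval can be computed directly from the coefficient table of fit.logistic (a sketch):

# odds ratio for tobacco and approximate 95% confidence interval
b  <- coef(summary(fit.logistic))["tobacco", "Estimate"]
se <- coef(summary(fit.logistic))["tobacco", "Std. Error"]
exp(b)                   # odds ratio per additional kg of lifetime tobacco
exp(b + c(-2, 2) * se)   # approximate 95% confidence interval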

5.2 Subset selection and elastic net

As for linear regression, in the high-dimensional setting where \(n\) is small compared to \(p\), the maximum likelihood estimator leads to overfitting and a poor generalisation error. In the context of linear regression we considered constrained estimation, i.e. subset selection and elastic net regularization. It is easy to generalize these approaches to logistic regression. Subset selection is implemented in stepAIC and elastic net regularization in glmnet. In the latter case the algorithm minimizes the negative log-likelihood penalized with the elastic net term:

\[\begin{align*} \hat{\beta}_{\alpha,\lambda}&=\underset{\beta}{\rm argmin}\;-\frac{1}{n}\ell(\beta|{\bf y},{\bf X})+\lambda\big((1-\alpha)\|\beta\|_2^2/2+\alpha\|\beta\|_1\big). \end{align*}\]

With \(\alpha=0\) and \(\alpha=1\) we obtain the Ridge and the Lasso solution, respectively.
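In glmnet the mixing parameter is passed as alpha. A minimal sketch, assuming a (standardized) design matrix x and binary response y as constructed for the heart disease data further below:

library(glmnet)
fit.ridge <- glmnet(x, y, family = "binomial", alpha = 0)    # Ridge
fit.lasso <- glmnet(x, y, family = "binomial", alpha = 1)    # Lasso (the default)
fit.enet  <- glmnet(x, y, family = "binomial", alpha = 0.5)  # elastic net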

We turn back to the heart disease example and perform backward selection.

library(MASS) # provides stepAIC
fit.bw <- stepAIC(fit.logistic,direction = "backward",trace=FALSE)

The terms deleted in each step are provided in the next table.

kable(as.data.frame(fit.bw$anova),digits=3)
Step        Df  Deviance  Resid. Df  Resid. Dev      AIC
                                454     483.174  499.174
- alcohol    1     0.019        455     483.193  497.193
- sbp        1     1.104        456     484.297  496.297
- obesity    1     1.147        457     485.444  495.444

The regression coefficients of the final model are shown below.

kable(broom::tidy(fit.bw),digits=3)
term            estimate  std.error  statistic  p.value
(Intercept)       -4.204      0.498     -8.437    0.000
tobacco            0.081      0.026      3.163    0.002
ldl                0.168      0.054      3.093    0.002
famhistPresent     0.924      0.223      4.141    0.000
age                0.044      0.010      4.521    0.000

We continue with the Lasso approach and show the trace plot.

# standardized design matrix and binary response chd
x <- scale(data.matrix(dat[,-1]))
y <- dat$chd
fit.lasso <- glmnet(x=x,y=y,family="binomial")
plot(fit.lasso,xvar = "lambda",label=TRUE)

The first coefficient to be shrunk to zero is alcohol, followed by obesity and sbp. This is in line with the results from backward selection.
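The order in which the variables enter the model can also be read off the coefficient path directly. A sketch (the values in fit.lasso$lambda are stored in decreasing order):

# largest lambda at which each coefficient is nonzero; variables with a small
# value here are the first to be shrunk to zero as lambda increases
b <- as.matrix(fit.lasso$beta)                      # p x length(lambda)
lambda.entry <- apply(b != 0, 1, function(nz)
  if (any(nz)) fit.lasso$lambda[min(which(nz))] else NA)
sort(lambda.entry, decreasing = TRUE)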

5.3 Gene expression example

We now turn to a truly high-dimensional example. The data consist of expression levels recorded for \(3571\) genes in \(72\) patients with leukemia. The binary outcome encodes the disease subtype: acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML). The data are represented as a \(72\times 3571\) matrix \(\bf X\) of gene expression values, and a vector \(\bf y\) of \(72\) binary disease outcomes. We first create training and test data.

# set seed
set.seed(15)

# get leukemia data
library(varbvs) # varbvs package contains the leukemia data
data(leukemia)
x <- leukemia$x
y <- leukemia$y

# test/train 
ind_train <- sample(1:length(y),size=length(y)/2)
xtrain <- x[ind_train,]
ytrain <- y[ind_train]
xtest <- x[-ind_train,]
ytest <- y[-ind_train]

We now run the elastic net logistic regression approach.

# run glmnet
alpha  <- 0.95                # elastic net mixing parameter.
fit.glmnet <- glmnet(xtrain,ytrain,family = "binomial",alpha=alpha)

The following figure shows the trace plot.

plot(fit.glmnet,xvar="lambda",label=TRUE)

We run 10-fold cross-validation and show the misclassification error.

# run cv.glmnet
nfolds <- 10 # number of cross-validation folds.
cv.glmnet <- cv.glmnet(xtrain,ytrain,family = "binomial",type.measure = "class",
                       alpha = alpha,nfolds = nfolds)
plot(cv.glmnet)

We take lambda.1se as the optimal tuning parameter.

lambda.opt <- cv.glmnet$lambda.1se
lambda.opt

## [1] 0.006680534

We extract the coefficients and plot them as a barplot.

beta.glmnet <- coef(fit.glmnet, s = lambda.opt)
barplot(as.numeric(beta.glmnet)[-1],xlab="gene",ylab="beta glmnet")

Finally, we predict the disease outcome of the test samples using the fitted model and compare against the observed outcomes of the test samples.

pred <- c(predict(fit.glmnet,xtest,s = lambda.opt,type = "class"))
print(table(true = factor(ytest),pred = factor(pred)))
##     pred
## true  0  1
##    0 24  0
##    1  2 10

The misclassification error on the test data can be calculated as follows.

round(mean(pred!=ytest),3)
## [1] 0.056

5.4 Boosting and data mining

Classification is a frequent task in data mining, and a variety of methods has been developed for it. Popular approaches are Support Vector Machines, Random Forests, Neural Networks and Boosting. We now briefly describe AdaBoost, the most popular boosting algorithm. In the exercises we will apply it to an example and compare it to elastic net regression.

The key idea behind AdaBoost is to sequentially apply a weak classification algorithm (the "base learner") to repeatedly re-weighted versions of the training data. In this way a sequence of weak classifiers \(\{G_m\}_{m=1}^M\) is produced, and a powerful new classifier is finally obtained by combining them through a weighted "majority vote." More specifically, AdaBoost proceeds as follows:

  1. Initialize the weights \(w_i=1/n\), \(i=1,\ldots,n\)
  2. For \(m=1\) to \(M\):
    a. Fit a weak classifier \(G_m\) to the training data using weights \(w_i\).
    b. Compute the weighted error rate \[{\rm err}_m=\frac{\sum_{i=1}^n w_i I(y_i\neq G_m(x_i))}{\sum_{i=1}^n w_i}.\]
    c. Compute \(\alpha_m=\log((1-{\rm err}_m)/{\rm err}_m)\).
    d. Update \(w_i\leftarrow w_i \exp(\alpha_m I(y_i\neq G_m(x_i)))\), \(i=1,2,\ldots,n.\)
  3. Output the combined classifier \(G(X)={\rm sign}\left[\sum_{m=1}^M \alpha_m G_m(X)\right]\).

In step 2d the weights are updated. Interestingly, the observations that were misclassified by \(G_{m}\) receive a larger weight, which forces the next classifier \(G_{m+1}\) to focus on the misclassified observations. In step 3 the final classifier is assembled. The weights \(\alpha_m\) balance the influence of the individual classifiers, giving more weight to the more accurate classifiers in the sequence. More details are provided in the book by Hastie, Tibshirani, and Friedman (2001). A minimal R sketch of the algorithm is given below.
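The sketch uses classification stumps from the rpart package as base learner; the labels are assumed to be coded as -1/+1 and the function names are our own (dedicated boosting implementations exist in several R packages).

# minimal AdaBoost sketch with classification stumps (rpart) as base learner;
# labels y are assumed coded as -1/+1; no guard against err = 0 is included
library(rpart)

adaboost_fit <- function(x, y, M = 50) {
  n <- nrow(x)
  w <- rep(1/n, n)                               # 1. initialize weights
  dat <- data.frame(y = factor(y), x)
  trees <- vector("list", M)
  alpha <- numeric(M)
  for (m in 1:M) {
    # 2a. weak classifier: a stump fitted to the weighted data
    trees[[m]] <- rpart(y ~ ., data = dat, weights = w, method = "class",
                        control = rpart.control(maxdepth = 1))
    pred <- as.numeric(as.character(predict(trees[[m]], dat, type = "class")))
    err  <- sum(w * (pred != y)) / sum(w)         # 2b. weighted error rate
    alpha[m] <- log((1 - err) / err)              # 2c. classifier weight
    w <- w * exp(alpha[m] * (pred != y))          # 2d. re-weight observations
  }
  list(trees = trees, alpha = alpha)
}

adaboost_predict <- function(fit, x) {
  newdat <- data.frame(x)
  votes <- sapply(seq_along(fit$trees), function(m) {
    pred <- as.numeric(as.character(predict(fit$trees[[m]], newdat, type = "class")))
    fit$alpha[m] * pred
  })
  sign(rowSums(votes))                            # 3. weighted majority vote
}

For a binary outcome coded as 0/1, the labels can be recoded via 2*y-1 before calling adaboost_fit.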

AdaBoost with classification trees as the "base learner" is one of the best "off-the-shelf" classifiers in data mining. Data mining also deals with large data sets; however, typically not only the number of variables \(p\) is large but also the sample size \(n\). In the classical high-dimensional setting with \(p\gg n\) the elastic net is a very powerful method.

References

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2001. The Elements of Statistical Learning. Springer Series in Statistics. New York, NY, USA: Springer New York Inc.