# Chapter 5 Extension to classification

Our high-dimensional considerations so far focused on the linear regression model. We now extend this to classification.

## 5.1 Classification and logistic regression

We start with standard logistic regression where the response \(G\) takes values \(0\) and \(1\), and the aim is to predict the response based on covariates \(X=(X_1,\ldots,X_p)\). We model the probability of success \[p(x;\beta)=P(G=1|X=x;\beta)\] assuming a binomial distribution with logit link function

\[{\rm logit}(x;\beta)=\log \Big(\frac{p(x;\beta)}{1-p(x;\beta)}\Big)=x^T\beta,\quad\textrm{or equivalently}\quad p(x;\beta)=\frac{\exp(x^T\beta)}{1+\exp(x^T\beta)}.\]

In logistic regression we estimate the regression parameter \(\beta\) by maximizing the log-likelihood

\[\begin{align*} \ell(\beta|{\bf y,X})&=\sum_{i=1}^{n}(1-y_i)\log(1-p(x_i;\beta))+y_i\log p(x_i;\beta)\\ &=\sum_{i=1}^{n}y_i x_i^T\beta - \log(1+\exp(x_i^T\beta)). \end{align*}\]
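The simplified second line follows because \(y_i\in\{0,1\}\). As a sanity check, the maximization can be reproduced numerically; the following sketch (on simulated data, using `optim` rather than the iteratively reweighted least squares algorithm that `glm` employs) compares the two solutions:

```
# Sketch: maximize the logistic log-likelihood directly with optim()
set.seed(1)
n <- 200
x <- cbind(1, rnorm(n))                # design matrix: intercept + one covariate
beta.true <- c(-0.5, 1.2)
y <- rbinom(n, 1, plogis(x %*% beta.true))

negloglik <- function(beta) {
  eta <- x %*% beta
  -sum(y * eta - log(1 + exp(eta)))    # negative of the expression above
}
opt <- optim(c(0, 0), negloglik, method = "BFGS")

# both approaches give (essentially) the same estimate
cbind(optim = opt$par,
      glm   = coef(glm(y ~ x[, 2], family = "binomial")))
```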

Given an estimate \(\hat \beta\) (obtained from training data), a prediction for new input data \(X_{\rm new}\) can be obtained via the predicted probability of success \(p(X_{\rm new};\hat \beta)\), e.g. by assigning the class label with maximum probability

\[\begin{align*} \hat{G}(X_{\rm new})=\left\{ \begin{array}{ll} 1, & \mbox{if $p(X_{\rm new};\hat\beta)>0.5$}.\\ 0, & \mbox{otherwise}. \end{array} \right. \end{align*}\]

There are different measures to judge the quality of the predictions. We focus on the misclassification error which is simply the fraction of misclassified test samples. Another important measure used in the context of classification is the receiver operating characteristic (ROC).
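The threshold rule and the misclassification error can be illustrated in a few lines; this is a minimal sketch on simulated data (not the CORIS data discussed below):

```
# Sketch: 0.5-threshold classification rule and misclassification error
set.seed(2)
n <- 300
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, 1, plogis(-0.5 + d$x))
train <- 1:200

fit  <- glm(y ~ x, data = d[train, ], family = "binomial")
phat <- predict(fit, newdata = d[-train, ], type = "response")  # p(x_new; beta.hat)
ghat <- ifelse(phat > 0.5, 1, 0)                                # predicted class labels
err  <- mean(ghat != d$y[-train])                               # misclassification error
err
```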

We illustrate logistic regression using an example taken from the book by Hastie, Tibshirani, and Friedman (2001). The data shown in Figure 5.1 are a subset of the Coronary Risk-Factor Study (CORIS) baseline survey, carried out in three rural areas of the Western Cape, South Africa. The aim of the study was to establish the intensity of ischemic heart disease risk factors in that high-incidence region. The data represent white males between 15 and 64, and the response variable is the presence or absence of myocardial infarction (MI) at the time of the survey (the overall prevalence of MI was 5.1% in this region). There are 160 cases in our data set, and a sample of 302 controls. The variables are:

- sbp: systolic blood pressure
- tobacco: cumulative tobacco (kg)
- ldl: low density lipoprotein cholesterol
- adiposity
- famhist: family history of heart disease (Present, Absent)
- obesity
- alcohol: current alcohol consumption
- age: age at onset
- chd: response, coronary heart disease.

We first fit a logistic regression model using the function `glm`.

```
fit.logistic <- glm(chd~sbp+tobacco+ldl+famhist+obesity+alcohol+age,
                    data=dat,
                    family="binomial")
kable(broom::tidy(fit.logistic),digits=3)
```

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -4.130 | 0.964 | -4.283 | 0.000 |
| sbp | 0.006 | 0.006 | 1.023 | 0.306 |
| tobacco | 0.080 | 0.026 | 3.034 | 0.002 |
| ldl | 0.185 | 0.057 | 3.219 | 0.001 |
| famhistPresent | 0.939 | 0.225 | 4.177 | 0.000 |
| obesity | -0.035 | 0.029 | -1.187 | 0.235 |
| alcohol | 0.001 | 0.004 | 0.136 | 0.892 |
| age | 0.043 | 0.010 | 4.181 | 0.000 |

There are some surprises in this table of coefficients, which must be interpreted with caution. Systolic blood pressure (sbp) is not significant! Nor is obesity, and its sign is negative. This confusion is a result of the correlation between the set of predictors. On their own, both sbp and obesity are significant, and with positive sign. However, in the presence of many other correlated variables, they are no longer needed (and can even get a negative sign).

How does one interpret a coefficient of \(0.080\) (Std. Error = \(0.026\)) for tobacco, for example? Tobacco is measured in total lifetime usage in kilograms, with a median of \(1.0\)kg for the controls and \(4.1\)kg for the cases. Thus an increase of 1kg in lifetime tobacco usage is associated with an increase in the odds of coronary heart disease of \(\exp(0.080)=1.083\), or \(8.3\)%. Incorporating the standard error we get an approximate \(95\)% confidence interval of \(\exp(0.080 \pm 2\times 0.026)=(1.03,1.14)\).
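This arithmetic is easy to verify in R (coefficient and standard error taken from the table above):

```
# Odds ratio and approximate 95% confidence interval for tobacco
beta.hat <- 0.080
se       <- 0.026
exp(beta.hat)                   # odds ratio per kg, approx. 1.083 (i.e. +8.3%)
exp(beta.hat + c(-2, 2) * se)   # approximate 95% confidence interval
```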

## 5.2 Subset selection and elastic net

As for linear regression, in the high-dimensional setting where \(n\) is small compared to \(p\), the maximum likelihood estimator leads to overfitting and a poor generalisation error. In the context of linear regression we considered constrained estimation, i.e. subset selection and elastic net regularization. Both generalize readily to logistic regression: subset selection is implemented in `stepAIC` and elastic net regularization in `glmnet`. In the latter case the algorithm minimizes the negative log-likelihood penalized with the elastic net term:

\[\begin{align*} \hat{\beta}_{\alpha,\lambda}&=\mathop{\rm argmin}_{\beta}\;-\frac{1}{n}\ell(\beta|{\bf y},{\bf X})+\lambda\big((1-\alpha)\|\beta\|_2^2/2+\alpha\|\beta\|_1\big). \end{align*}\]

With \(\alpha=0\) and \(\alpha=1\) we obtain the Ridge and the Lasso solution, respectively.
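In `glmnet` the penalty mixing parameter is passed as `alpha`; a minimal sketch of the two extreme cases, on simulated data (the `glmnet` package must be installed):

```
# Ridge (alpha = 0) and Lasso (alpha = 1) as special cases of the elastic net
library(glmnet)
set.seed(3)
x <- matrix(rnorm(100 * 10), 100, 10)
y <- rbinom(100, 1, plogis(x[, 1] - x[, 2]))

fit.ridge <- glmnet(x, y, family = "binomial", alpha = 0)  # pure L2 penalty
fit.lasso <- glmnet(x, y, family = "binomial", alpha = 1)  # pure L1 penalty
```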

We turn back to the heart disease example and perform backward selection.

`fit.bw <- stepAIC(fit.logistic,direction = "backward",trace=FALSE)`

The terms deleted in each step are provided in the next table.

`kable(as.data.frame(fit.bw$anova),digits=3)`

| Step | Df | Deviance | Resid. Df | Resid. Dev | AIC |
|---|---|---|---|---|---|
|  | NA | NA | 454 | 483.174 | 499.174 |
| - alcohol | 1 | 0.019 | 455 | 483.193 | 497.193 |
| - sbp | 1 | 1.104 | 456 | 484.297 | 496.297 |
| - obesity | 1 | 1.147 | 457 | 485.444 | 495.444 |

The regression coefficients of the final model are shown below.

`kable(broom::tidy(fit.bw),digits=3)`

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -4.204 | 0.498 | -8.437 | 0.000 |
| tobacco | 0.081 | 0.026 | 3.163 | 0.002 |
| ldl | 0.168 | 0.054 | 3.093 | 0.002 |
| famhistPresent | 0.924 | 0.223 | 4.141 | 0.000 |
| age | 0.044 | 0.010 | 4.521 | 0.000 |

We continue with the Lasso approach and show the trace plot.

```
x <- scale(data.matrix(dat[,-1]))
y <- dat$chd
fit.lasso <- glmnet(x=x,y=y,family="binomial")
plot(fit.lasso,xvar = "lambda",label=TRUE)
```

The first coefficient shrunk to zero is alcohol, followed by obesity and sbp. This is in line with the results from backward selection.

## 5.3 Gene expression example

We now turn to a truly high-dimensional example. The data consist of expression levels recorded for \(3571\) genes in \(72\) patients with leukemia. The binary outcome encodes the disease subtype: acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML). The data are represented as a \(72\times 3571\) matrix \({\bf X}\) of gene expression values and a vector \({\bf y}\) of \(72\) binary disease outcomes. We first create training and test data.

```
# set seed
set.seed(15)

# get leukemia data
library(varbvs) # varbvs package contains the leukemia data
data(leukemia)
x <- leukemia$x
y <- leukemia$y

# test/train split
ind_train <- sample(1:length(y),size=length(y)/2)
xtrain <- x[ind_train,]
ytrain <- y[ind_train]
xtest <- x[-ind_train,]
ytest <- y[-ind_train]
```

We now run the elastic net logistic regression approach.

```
# run glmnet
alpha <- 0.95 # elastic net mixing parameter
fit.glmnet <- glmnet(xtrain,ytrain,family = "binomial",alpha=alpha)
```

The following Figure shows the trace plot.

`plot(fit.glmnet,xvar="lambda",label=TRUE)`

We run 10-fold cross-validation and show the misclassification error.

```
# run cv.glmnet
nfolds <- 10 # number of cross-validation folds
cv.glmnet <- cv.glmnet(xtrain,ytrain,family = "binomial",type.measure = "class",
                       alpha = alpha,nfolds = nfolds)
plot(cv.glmnet)
```

We take `lambda.1se` as the optimal tuning parameter, `lambda.opt <- cv.glmnet$lambda.1se`.

`## [1] 0.006680534`

We extract the coefficients and plot them as a barplot.

```
beta.glmnet <- coef(fit.glmnet, s = cv.glmnet$lambda.1se)
barplot(as.numeric(beta.glmnet)[-1],xlab="gene",ylab="beta glmnet")
```

Finally, we predict the disease outcome of the test samples using the fitted model and compare against the observed outcomes of the test samples.

```
pred <- c(predict(fit.glmnet,xtest,s = lambda.opt,type = "class"))
print(table(true = factor(ytest),pred = factor(pred)))
```

```
## pred
## true 0 1
## 0 24 0
## 1 2 10
```

The misclassification error on the test data can be calculated as follows.

`round(mean(pred!=ytest),3)`

`## [1] 0.056`

## 5.4 Boosting and data mining

Classification is a frequent task in data mining. There is a variety of methods developed for classification. Popular approaches are *Support Vector Machines*, *Random Forest*, *Neural Networks* and *Boosting*. We now briefly describe the *AdaBoost* method which is the most popular boosting algorithm. In the exercises we will then apply it to an example and also compare it to elastic net regression.

The key idea behind AdaBoost is to sequentially apply a weak classification algorithm (the “base learner”) to weighted data with repeatedly modified weights. In that way a sequence of weak classifiers \(\{G_m\}_{m=1}^M\) is produced, and finally a powerful new classifier is obtained by combining them through a “majority vote.” More specifically, AdaBoost proceeds as follows:

1. Initialize the weights \(w_i=1/n\), \(i=1,\ldots,n\).
2. For \(m=1\) to \(M\):
    - (a) Fit a weak classifier \(G_m\) to the training data using weights \(w_i\).
    - (b) Compute the weighted error rate \[{\rm err}_m=\frac{\sum_{i=1}^n w_i I(y_i\neq G_m(x_i))}{\sum_{i=1}^n w_i}.\]
    - (c) Compute \(\alpha_m=\log((1-{\rm err}_m)/{\rm err}_m)\).
    - (d) Update \(w_i\leftarrow w_i \exp(\alpha_m I(y_i\neq G_m(x_i)))\), \(i=1,2,\ldots,n\).
3. Output the combined classifier \(G(X)={\rm sign}\left[\sum_{m=1}^M \alpha_m G_m(X)\right]\).
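The steps above can be implemented from scratch in a few lines. The following sketch uses decision stumps as the base learner on simulated data, with class labels coded as \(-1/+1\) so that the sign of the weighted vote yields the class; it is a toy illustration, not a production implementation:

```
# Sketch: AdaBoost with decision stumps on simulated data (labels in {-1,+1})
set.seed(1)
n <- 200
x <- matrix(rnorm(n * 2), n, 2)
y <- ifelse(x[, 1] + x[, 2] > 0, 1, -1)           # toy decision boundary

stump <- function(x, y, w) {
  # exhaustive search over (variable, split, direction) minimizing weighted error
  best <- list(err = Inf)
  for (j in 1:ncol(x)) for (s in unique(x[, j])) for (d in c(1, -1)) {
    pred <- ifelse(d * (x[, j] - s) > 0, 1, -1)
    err <- sum(w * (pred != y)) / sum(w)
    if (err < best$err) best <- list(j = j, s = s, d = d, err = err)
  }
  best
}

adaboost <- function(x, y, M = 20) {
  w <- rep(1 / nrow(x), nrow(x))
  G <- vector("list", M); alpha <- numeric(M)
  for (m in 1:M) {
    G[[m]] <- stump(x, y, w)                                   # step 2(a)
    err <- G[[m]]$err                                          # step 2(b)
    alpha[m] <- log((1 - err) / err)                           # step 2(c)
    pred <- with(G[[m]], ifelse(d * (x[, j] - s) > 0, 1, -1))
    w <- w * exp(alpha[m] * (pred != y))                       # step 2(d)
  }
  list(G = G, alpha = alpha)
}

predict.ada <- function(fit, x) {
  # step 3: sign of the alpha-weighted vote
  f <- rowSums(sapply(seq_along(fit$alpha), function(m)
    fit$alpha[m] * with(fit$G[[m]], ifelse(d * (x[, j] - s) > 0, 1, -1))))
  sign(f)
}

fit <- adaboost(x, y, M = 20)
mean(predict.ada(fit, x) != y)                                 # training error
```

Note how a single stump cannot represent the diagonal boundary, but the weighted combination of stumps approximates it increasingly well as \(M\) grows.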

In step 2(d) the weights are updated. Interestingly, the observations that were misclassified by \(G_{m}(x)\) receive a larger weight, which forces the next classifier \(G_{m+1}(x)\) to focus on the misclassified observations. In step 3 the final predictor is assembled. The parameters \(\alpha_m\) weight the influence of the individual classifiers towards the more accurate classifiers in the sequence. More details are provided in the book by Hastie, Tibshirani, and Friedman (2001).

AdaBoost with classification trees as the “base learner” is one of the best “off-the-shelf” classifiers in data mining. Data mining also deals with large data sets; however, there typically not only the number of variables \(p\) is large but also the sample size \(n\). In the classical high-dimensional setting with \(p\gg n\) the elastic net is a very powerful method.

### References

Hastie, T., R. Tibshirani, and J. Friedman. 2001. *The Elements of Statistical Learning*. Springer Series in Statistics. New York, NY, USA: Springer New York Inc.