# Chapter 7 Appendix 1: Exercises

## 7.1 Closed form solution for Ridge regression

Show that the Ridge optimization problem has the closed form solution

\begin{align*} \hat{\beta}^{\rm Ridge}_{\lambda}&=(\bf X^T \bf X+\lambda \bf I)^{-1}\bf X^T \bf y. \end{align*}

Hint: calculate the gradient of the loss function $$\ell_{\rm Ridge}(\beta|\bf y,\bf X)=\rm{RSS}(\beta)+\lambda\|\beta\|_2^2$$, set equal to zero and solve for $$\beta$$.
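One way to carry out this calculation (a solution sketch): expanding $$\rm{RSS}(\beta)=(\bf y-\bf X\beta)^T(\bf y-\bf X\beta)$$ and differentiating with respect to $$\beta$$ gives

\begin{align*} \nabla_\beta\,\ell_{\rm Ridge}(\beta|\bf y,\bf X)&=-2\bf X^T(\bf y-\bf X\beta)+2\lambda\beta. \end{align*}

Setting the gradient to zero yields the normal equations $$(\bf X^T\bf X+\lambda\bf I)\beta=\bf X^T\bf y$$; since $$\bf X^T\bf X+\lambda\bf I$$ is positive definite for $$\lambda>0$$, it is invertible and the closed form follows.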

## 7.2 Ridge and Lasso for the orthonormal design

Calculate the Ridge and the Lasso solution for the special case of an orthonormal design matrix.
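As a solution sketch: with an orthonormal design ($$\bf X^T\bf X=\bf I$$) the OLS estimator is $$\hat{\beta}^{\rm OLS}=\bf X^T\bf y$$ and both penalized problems decouple coordinate-wise,

\begin{align*} \hat{\beta}^{\rm Ridge}_{\lambda,j}&=\frac{\hat{\beta}^{\rm OLS}_j}{1+\lambda},\\ \hat{\beta}^{\rm Lasso}_{\lambda,j}&={\rm sign}\big(\hat{\beta}^{\rm OLS}_j\big)\left(|\hat{\beta}^{\rm OLS}_j|-\frac{\lambda}{2}\right)_+, \end{align*}

i.e. Ridge shrinks all coefficients by the same proportional factor, whereas the Lasso soft-thresholds them. The threshold $$\lambda/2$$ corresponds to the penalty $$\lambda\|\beta\|_1$$ added to the RSS as in Exercise 7.1; other parametrizations of the loss shift this constant.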

## 7.3 Bayesian interpretation of Ridge regression

1. Write down the log-likelihood of the linear regression model. Note: $$Y_i=X_{i}^T\beta +\epsilon_i,$$ where $$\epsilon_1,\ldots,\epsilon_n$$ iid $$N(0,\sigma^2)$$ and $$\bf{X}$$ is a fixed $$n\times p$$ design matrix.
2. Find the expression for the maximum likelihood estimator.
3. Assuming a prior distribution $$\beta_1,\ldots,\beta_p$$ iid $$\sim N(0,\tau^2)$$, derive the posterior distribution of $$\beta$$ and show that the maximum a posteriori estimator (MAP) coincides with the Ridge estimator.
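A sketch of the key step: combining the Gaussian likelihood with the Gaussian prior, the negative log-posterior is, up to additive constants in $$\beta$$,

\begin{align*} -\log p(\beta|\bf y,\bf X)&=\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-X_i^T\beta)^2+\frac{1}{2\tau^2}\sum_{j=1}^p \beta_j^2+{\rm const}, \end{align*}

so maximizing the posterior is equivalent to minimizing $$\rm{RSS}(\beta)+\lambda\|\beta\|_2^2$$ with $$\lambda=\sigma^2/\tau^2$$: the MAP estimator coincides with the Ridge estimator.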

## 7.4 Prostate Cancer data

Explore the prostate cancer data set described in Hastie, Tibshirani, and Friedman (2001).

1. Read the prostate cancer data set.
2. Run OLS, Best Subset selection (use the leaps package), Ridge Regression and Lasso Regression.
3. Print the coefficients as a table.
4. Calculate the test error for each method and compare the results with Table 3.3 in Hastie, Tibshirani, and Friedman (2001).

The solution to this exercise follows. We read the data set and create training and test data.

dat <- read.csv("data/prostate.csv")
dat <- cbind(scale(dat[,-c(9:10)]),dat[,c(9,10)])
dtrain <- dat[dat$train,-10]
xtrain <- data.matrix(dtrain[,-9])
ytrain <- dtrain$lpsa
dtest <- dat[!dat$train,-10]
xtest <- data.matrix(dtest[,-9])
ytest <- dtest$lpsa

Perform OLS regression, Best Subset selection, Ridge regression and Lasso regression.

# OLS regression
fit.lm <- lm(lpsa~.,data=dtrain)
coef.lm <- coef(fit.lm)
terr.lm <- mean((predict(fit.lm,newdata = dtest)-dtest$lpsa)^2)

# best subset selection with leaps
library(leaps)
subset.leaps <- regsubsets(lpsa~.,data=dtrain,method="forward")
ss.summary <- summary(subset.leaps)
coef.ss.bic <- coef(subset.leaps, which.min(ss.summary$bic))
fit.ss.bic <- lm(lpsa~.,data=dtrain[,c("lpsa",names(coef.ss.bic)[-1])])
terr.ss.bic <- mean((predict(fit.ss.bic,newdata = dtest)-dtest$lpsa)^2)

# best subset selection with caret package (10-fold cv)
library(caret)
set.seed(1) # set seed for reproducibility
train.control <- trainControl(method = "cv", number = 10) # set up 10-fold cv
subset.caret <- train(lpsa ~., data = dtrain,
                      method = "leapBackward",
                      tuneGrid = data.frame(nvmax = 1:8),
                      trControl = train.control)
#subset.caret$results
#summary(subset.caret$finalModel)
coef.ss.cv <- coef(subset.caret$finalModel, subset.caret$bestTune$nvmax)
fit.ss.cv <- lm(lpsa~.,data=dtrain[,c("lpsa",names(coef.ss.cv)[-1])])
terr.ss.cv <- mean((predict(fit.ss.cv,newdata = dtest)-dtest$lpsa)^2)

# Ridge regression
set.seed(1)
library(glmnet)
fit.ridge <- glmnet(xtrain,ytrain,alpha=0)
fit.ridge.cv <- cv.glmnet(xtrain,ytrain,alpha=0)
coef.ridge <- coef(fit.ridge,s=fit.ridge.cv$lambda.min)
terr.ridge <- mean((as.vector(predict(fit.ridge,newx=xtest,s=fit.ridge.cv$lambda.1se))-ytest)^2)

# Lasso regression
set.seed(1)
fit.lasso <- glmnet(xtrain,ytrain,alpha=1)
fit.lasso.cv <- cv.glmnet(xtrain,ytrain,alpha=1)
coef.lasso <- coef(fit.lasso,s=fit.lasso.cv$lambda.min) # use the lasso CV fit, not the ridge one
terr.lasso <- mean((as.vector(predict(fit.lasso,newx=xtest,s=fit.lasso.cv$lambda.1se))-ytest)^2)

The next two figures show the BIC and the CV error for the Best Subset selection approach (the tuning parameter is the subset size).

plot(ss.summary$bic)

plot(subset.caret)

The next two figures show the CV Error for Ridge and Lasso regression.

par(mfrow=c(1,2))
plot(fit.ridge.cv)
plot(fit.lasso.cv)

Next we show the table of regression coefficients obtained using the different methods.

res <- data.frame(OLS=coef.lm,SSBIC=0,SSCV=0,RIDGE=0,LASSO=0)
res[names(coef.ss.bic),"SSBIC"] <- coef.ss.bic
res[names(coef.ss.cv),"SSCV"] <- coef.ss.cv
res[,"RIDGE"] <- as.numeric(coef.ridge)
res[,"LASSO"] <- as.numeric(coef.lasso)
kable(res,digits = 3)
OLS SSBIC SSCV RIDGE LASSO
(Intercept) 2.465 2.477 2.467 2.467 2.465
lcavol 0.680 0.740 0.676 0.581 0.547
lweight 0.263 0.316 0.265 0.258 0.211
age -0.141 0.000 -0.145 -0.110 0.000
lbph 0.210 0.000 0.210 0.200 0.116
svi 0.305 0.000 0.307 0.281 0.178
lcp -0.288 0.000 -0.287 -0.163 0.000
gleason -0.021 0.000 0.000 0.012 0.000
pgg45 0.267 0.000 0.252 0.200 0.070

Finally we show the test errors of the different methods.

res <- data.frame(OLS=terr.lm,
SSBIC=terr.ss.bic,SSCV=terr.ss.cv,
RIDGE=terr.ridge,LASSO=terr.lasso)
kable(res,digits = 3)
OLS SSBIC SSCV RIDGE LASSO
0.521 0.492 0.517 0.508 0.454

## 7.5 Elastic net mixing parameter and cross-validation

1. Obtain the Riboflavin data set from the hdi package, read the help documentation and check the data structure.
2. Run the Lasso and generate the trace plot.
3. Run the Elastic net with mixing parameters $$\alpha=0.25, 0.5, 0.75$$ and $$1$$ and compare the cross-validation curves. Hint: use the foldid argument in glmnet.
4. Show the selected genes for the best performing model.
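For reference, glmnet solves (Gaussian case shown; the binomial case replaces the squared error by the negative log-likelihood)

$$\min_{\beta_0,\beta}\ \frac{1}{2n}\sum_{i=1}^n \big(y_i-\beta_0-x_i^T\beta\big)^2+\lambda\left[\frac{1-\alpha}{2}\|\beta\|_2^2+\alpha\|\beta\|_1\right],$$

so $$\alpha=1$$ gives the Lasso, $$\alpha=0$$ gives Ridge, and intermediate values mix the two penalties. Fixing foldid ensures that all cross-validation runs use identical folds, which makes the CV curves directly comparable across values of $$\alpha$$.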

The solution to this exercise follows. We first load the data and check the data structure.

library(hdi)
## Loading required package: scalreg
library(glmnet)
str(riboflavin)
## 'data.frame':    71 obs. of  2 variables:
##  $ y: num  -6.64 -6.95 -7.93 -8.29 -7.31 ...
##  $ x: 'AsIs' num [1:71, 1:4088] 8.49 7.64 8.09 7.89 6.81 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:71] "b_Fbat107PT24.CEL" "b_Fbat107PT30.CEL" "b_Fbat107PT48.CEL" "b_Fbat107PT52.CEL" ...
##   .. ..$ : chr [1:4088] "AADK_at" "AAPA_at" "ABFA_at" "ABH_at" ...

Next we setup the design matrix and the response variable and we run the Lasso.

x <- riboflavin[,-1]
y <- riboflavin[,1]
fit <- glmnet(x = x, y = y)
plot(fit)

We run 10-fold cross-validation for the different mixing parameters and plot the error curves.

set.seed(1)
n.fold <- 10
foldid <- sample(1:n.fold, size = length(y), replace = TRUE)
cv1 <- cv.glmnet(x, y, foldid = foldid, alpha = 1)
cv2  <- cv.glmnet(x, y, foldid = foldid, alpha = 0.75)
cv3  <- cv.glmnet(x, y, foldid = foldid, alpha = 0.5)
cv4  <- cv.glmnet(x, y, foldid = foldid, alpha = 0.25)

t.lambdarange <- range(log(c(cv1$lambda, cv2$lambda,
cv3$lambda, cv4$lambda)))
t.crange <- range(c(cv1$cvm,cv2$cvm,cv3$cvm,cv4$cvm))
plot(log(cv1$lambda), cv1$cvm ,
pch = 19, col = "red",
xlab = "log(Lambda)",
ylab = cv1$name, type = "l",
xlim = t.lambdarange, ylim = t.crange)
lines(log(cv2$lambda), cv2$cvm, pch = 19, col = "grey")
lines(log(cv3$lambda), cv3$cvm, pch = 19, col = "blue")
lines(log(cv4$lambda), cv4$cvm, pch = 19, col = "green")

Finally, we print the gene names of the non-zero coefficients.

## Get selected genes
b <- as.matrix(coef(cv1))
rownames(b)[b!=0]
##  [1] "(Intercept)" "ARGF_at"     "DNAJ_at"     "GAPB_at"     "LYSC_at"
##  [6] "PKSA_at"     "SPOIISA_at"  "SPOVAA_at"   "XHLB_at"     "XKDS_at"
## [11] "XTRA_at"     "YBFI_at"     "YCDH_at"     "YCGO_at"     "YCKE_at"
## [16] "YCLB_at"     "YCLF_at"     "YDDH_at"     "YDDK_at"     "YEBC_at"
## [21] "YEZB_at"     "YFHE_r_at"   "YFIR_at"     "YHDS_r_at"   "YKBA_at"
## [26] "YOAB_at"     "YQJU_at"     "YRVJ_at"     "YTGB_at"     "YURQ_at"
## [31] "YXLD_at"     "YXLE_at"     "YYDA_at"
## By default, the selected variables are based on the largest value of
## lambda such that the cv-error is within 1 standard error of the minimum

## 7.6 Logistic regression and splines

We explore logistic regression based on the South African heart disease data. Proceed as follows:

1. Fit a univariate logistic regression model with age as the covariate. Calculate the odds-ratio and elaborate on the interpretation.
2. Predict the probability of heart disease at age 65.
3. Fit a logistic regression model including all covariates. Run stepwise backward selection. Which variables are excluded? What is the AIC value of the final model?
4. Fit a logistic regression model using four natural cubic spline bases for each term in the model. Run backward selection and summarise the final model. Plot the natural cubic spline functions for the age term (use termplot). What does the plot tell you?

The solution to this exercise follows. We load the data and fit a logistic regression with age as the covariate.
sahd <- readRDS(file="data/sahd.rds")
fit <- glm(chd~age, data=sahd, family=binomial)
summary(fit)
##
## Call:
## glm(formula = chd ~ age, family = binomial, data = sahd)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.4321  -0.9215  -0.5392   1.0952   2.2433
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.521710   0.416031  -8.465  < 2e-16 ***
## age          0.064108   0.008532   7.513 5.76e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 596.11  on 461  degrees of freedom
## Residual deviance: 525.56  on 460  degrees of freedom
## AIC: 529.56
##
## Number of Fisher Scoring iterations: 4

The odds ratio for age is given next.

exp(coef(fit)[2])
##      age
## 1.066208

This means that an increase of 1 year in age leads to a $$6.6\%$$ increase in the odds of having a heart disease. The estimated probability of having a heart disease at age 65 can be calculated using the predict function.

predict(fit,newdata=data.frame(age=65),type="response")
##         1
## 0.6559532

Alternatively, we can use the inverse logit formula.

lp <- coef(fit)[1]+coef(fit)[2]*65
exp(lp)/(exp(lp)+1)
## (Intercept)
##   0.6559532

We fit a logistic regression model including all covariates. Then we perform stepwise backward selection using stepAIC from the MASS package.

library(MASS)
fit.full <- glm(chd~sbp+tobacco+ldl+famhist+obesity+alcohol+age,
                data=sahd, family="binomial")
fit.bw <- stepAIC(fit.full,direction = "backward",trace=FALSE)

The terms removed in each step are provided in the next table.

kable(as.data.frame(fit.bw$anova),digits=3)
Step Df Deviance Resid. Df Resid. Dev AIC
NA NA 454 483.174 499.174
- alcohol 1 0.019 455 483.193 497.193
- sbp 1 1.104 456 484.297 496.297
- obesity 1 1.147 457 485.444 495.444

The variables alcohol, sbp and obesity are excluded from the model. The AIC values are provided in the table above. We can also re-calculate the AIC for the final model.

AIC(fit.bw)
## [1] 495.4439

We fit a logistic regression model using natural splines.

# Computes the logistic regression model using natural splines (note famhist is included as a factor):
form <-  "chd ~ ns(sbp,df=4) + ns(tobacco,df=4) + ns(ldl,df=4) + famhist + ns(obesity,df=4)+ ns(alcohol,df=4)  + ns(age,df=4)"
form <-  formula(form)
fit <-  glm( form, data=sahd, family=binomial )

# stepwise backward selection
fit.bw <- stepAIC(fit,direction="backward",trace = 0)
kable(as.data.frame(fit.bw$anova),digits=3)

Step Df Deviance Resid. Df Resid. Dev AIC
NA NA 436 457.632 509.632
- ns(alcohol, df = 4) 4 0.456 440 458.088 502.088

The summary of the final model is provided next.

kable(as.data.frame(drop1(fit.bw, test="Chisq")),digits=2)

Df Deviance AIC LRT Pr(>Chi)
NA 458.09 502.09 NA NA
ns(sbp, df = 4) 4 467.16 503.16 9.08 0.06
ns(tobacco, df = 4) 4 470.48 506.48 12.39 0.01
ns(ldl, df = 4) 4 472.39 508.39 14.31 0.01
famhist 1 479.44 521.44 21.36 0.00
ns(obesity, df = 4) 4 466.24 502.24 8.15 0.09
ns(age, df = 4) 4 481.86 517.86 23.77 0.00

We can plot the natural spline function for the age term as follows.

termplot(fit.bw,se=TRUE,rug=TRUE,term=6)

The plot shows how the log-odds change with age (keeping the other variables fixed). We observe a slight deviation from linearity, i.e. the log-odds increase more strongly for age <35 than for age >35.

## 7.7 Decision trees, Random Forest and AdaBoost

In this exercise we explore decision trees based on the South African heart disease data.

1. Load the South African heart disease data and grow a decision tree using rpart. Visualize the tree using rpart.plot. How many leaf nodes does the tree have?
2. Re-grow the tree but now relax the "default" control parameters (choose rpart.control(cp=0, minsplit=50)). How many leaves does the tree have now?
3. Plot the cross-validation error against the complexity parameter $$\alpha$$. What is the tree size of the optimal model?
4. Prune the tree using prune and by choosing cp (=$$\alpha$$) to achieve minimal cross-validation error.
5. Calculate the confusion matrix and the misclassification error.
6. Generate a bootstrap sample. Grow a tree and calculate the out-of-bag error.
7. Fit a random forest using randomForest. Plot the fitted object. What is this plot telling us? Calculate the variable importance. Which are the most important variables?
8. Run AdaBoost using gbm. What is the prediction for a patient with covariates sbp=100, tobacco=0, ldl=5, famhist="Present", obesity=25, alcohol=10 and age=50? Compute the variable importance.

The solution to this exercise follows. First we read the data and grow the tree.

# load packages
library(rpart)
library(rpart.plot)

# grow a classification tree
fit.tree <- rpart(chd~.,data=sahd,method="class")
rpart.plot(fit.tree,extra=1,under=TRUE,tweak=1.2,faclen=3)

We re-grow the tree using different control parameters.

# controlling the growth of the tree with rpart.control
# cp: improvement in each split needs to be > cp
# minsplit: minimal number of samples in a node in order to do a split
fit.tree2 <- rpart(chd~.,data=sahd,method="class",
                   control=rpart.control(cp=0,minsplit=10))
rpart.plot(fit.tree2,extra=1,under=TRUE,tweak=1.2,faclen=3)

We can get the tree size from the cptable (tree size = number of leaves = number of splits + 1).

fit.tree2$cptable[fit.tree2$cptable[,"CP"]==0,"nsplit"]+1 # number of leaves
## [1] 43

Next, we plot the cross-validation error against the complexity parameter $$\alpha$$.

plotcp(fit.tree2,cex.lab=1.5,cex.axis=1.2,cex=1.5)

We prune the tree and visualize the result.

# prune the tree
fit.prune <- prune(fit.tree2,
                   cp=fit.tree2$cptable[which.min(fit.tree2$cptable[,"xerror"]),"CP"])
rpart.plot(fit.prune,extra=1)

Finally, we compute the confusion matrix and the misclassification error.

# confusion matrix of actual and fitted class labels
table(Actual=sahd$chd,Fitted=predict(fit.prune,type="class"))
##       Fitted
## Actual   0   1
##      0 268  34
##      1  80  80
# misclassification error
mean(sahd$chd!=predict(fit.prune,type="class"))
## [1] 0.2467532

We sample with replacement.

set.seed(1)
inthebag <- sample(1:nrow(sahd),size=nrow(sahd),replace=TRUE)
outofbag <- setdiff(1:nrow(sahd),inthebag)

We fit a tree on the in-the-bag samples and calculate the misclassification error on the out-of-bag samples.

fit.in <- rpart(chd~.,data=sahd[inthebag,],method="class")
pred.oob <- predict(fit.in,newdata=sahd[outofbag,],type="class")
mean(sahd$chd[outofbag]!=pred.oob)
## [1] 0.3559322

We fit the random forest, plot the error as a function of the number of trees and plot the variable importance.

library(randomForest)
sahd$chd <- factor(sahd$chd)
fit.rf <- randomForest(chd~.,data=sahd,importance=TRUE)
plot(fit.rf)

varImpPlot(fit.rf,type=1)

We run AdaBoost using gbm and specifying distribution = "adaboost". The summary provides a measure of variable importance. Prediction can be made using predict.
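As background (per the gbm documentation), distribution = "adaboost" fits a boosted model $$f(x)$$ by minimizing the AdaBoost exponential loss for 0-1 outcomes,

$$L(y,f)=\exp\big(-(2y-1)\,f(x)\big),\qquad y\in\{0,1\},$$

which is why the outcome must be coded numerically rather than as a factor.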

library(gbm)
sahd$chd <- as.numeric(as.character(sahd$chd)) # for adaboost the outcome must be numeric (chd was converted to a factor above)
fit.boost <- gbm(chd~.,data=sahd,distribution = "adaboost")
summary(fit.boost)

##             var    rel.inf
## tobacco tobacco 96.4821033
## age         age  2.9567803
## ldl         ldl  0.5611164
## sbp         sbp  0.0000000
## famhist famhist  0.0000000
## obesity obesity  0.0000000
## alcohol alcohol  0.0000000
newd <- data.frame(sbp=100,tobacco=0,ldl=5,famhist=factor("Present"),obesity=25,alcohol=10,age=50)
predict(fit.boost,
newdata=newd,
type="response" )
## Using 100 trees...
## [1] 1

## 7.8 Phoneme Recognition

In this exercise we investigate prediction of phonemes based on digitized speech data.

1. Read the data set, subset the phonemes “aa” and “ao” and create training and test data.
dat <- read.csv("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/phoneme.data")
2. Plot the log-periodogram as a function of frequency for 5 examples each of the phonemes "aa" and "ao".

3. Fit a logistic regression model and evaluate the training and test misclassification errors.

4. Run Lasso regression and evaluate the training and test misclassification errors.

5. In the previous approaches we assumed the logit-link ${\rm logit}(x;\beta)=\sum_{j=1}^{256} X_j\beta_j.$ Next we assume that the coefficients are a smooth function of the frequency $$\beta(f)$$, i.e. $\beta(f)=\sum_{m=1}^{\nu} h_m(f)\theta_m,$ where $$h_m$$ are B-spline basis functions for a natural cubic spline with $$\nu=12$$ degrees of freedom (defined on the set of frequencies). Consider filtered inputs $$x^*={\bf H}^T x$$ and fit $$\theta$$ by logistic regression on the $$x^*$$. Evaluate the training and test misclassification errors.

6. Plot the coefficients of the different models.
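The filtering step in the list above rests on a simple identity: collecting the basis functions in the $$256\times 12$$ matrix $$\bf H$$ with entries $$H_{fm}=h_m(f)$$, the smoothness assumption reads $$\beta=\bf H\theta$$, and hence

$$x^T\beta=x^T{\bf H}\theta=({\bf H}^Tx)^T\theta=(x^*)^T\theta,$$

so an ordinary logistic regression of $$y$$ on the 12-dimensional filtered inputs $$x^*$$ estimates $$\theta$$ directly, and the implied coefficient curve is recovered as $$\hat\beta={\bf H}\hat\theta$$.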

The solution to this exercise follows. We prepare the data set.

library(splines)
dat2 <- dat[dat$g%in%c("aa","ao"),]
dtrain <- dat2[grepl("^train",dat2$speaker),-c(1,259)]
xtrain <- as.matrix(dtrain[,-257])
ytrain <- ifelse(dtrain$g=="ao",1,0)
dtest <- dat2[grepl("^test",dat2$speaker),-c(1,259)]
xtest <- as.matrix(dtest[,-257])
ytest <- ifelse(dtest$g=="ao",1,0)
dtrain$y <- ytrain
dtest$y <- ytest
dtrain <- dtrain[,-257]
dtest <- dtest[,-257]

We plot the log-periodograms.

id.ao <- sample(which(ytrain==1),5)
id.aa <- sample(which(ytrain==0),5)
plot(xtrain[id.ao[1],],type="l",
     xlab="Frequency",ylab="Log-periodogram")
for(i in 2:5){
  lines(xtrain[id.ao[i],])
}
for(i in 1:5){
  lines(xtrain[id.aa[i],],col="red")
}

We run logistic regression and calculate the train and test errors.

# logistic regression
fit <- glm(y~.,data=dtrain,family=binomial)
coef.glm <- coefficients(fit)
pred_train <- as.numeric((predict(fit,type="response")>0.5))
pred_test <- as.numeric((predict(fit,type="response",newdata=dtest)>0.5))
mean(pred_train!=ytrain)
## [1] 0.09311424
mean(pred_test!=ytest)
## [1] 0.2437358

We run Lasso regression and calculate the train and test errors.

# lasso regression
fit.glmnet <- glmnet(xtrain,ytrain,family = "binomial",alpha=1)
cv.glmnet <- cv.glmnet(xtrain,ytrain,family = "binomial",type.measure = "class",
                       alpha = 1,nfolds = 10)
coef.lasso <- as.numeric(coefficients(fit.glmnet,s = cv.glmnet$lambda.1se))[-1]
plot(cv.glmnet)

pred_train <- c(predict(fit.glmnet,xtrain,s = cv.glmnet$lambda.1se,type = "class"))
pred_test <- c(predict(fit.glmnet,xtest,s = cv.glmnet$lambda.1se,type = "class"))
mean(pred_train!=ytrain)
## [1] 0.170579
mean(pred_test!=ytest)
## [1] 0.2072893

We use the natural cubic spline basis with $$\nu=12$$ to express the coefficients as a smooth function of the frequencies. We calculate the train and test errors.

# coefficient smoothing
hmat <- ns(x=1:256,df=12)
xstar <- xtrain%*%hmat
fit.smooth <- glm(dtrain$y~.,data=data.frame(xstar),family="binomial")
coef.smooth <- as.numeric(hmat%*%coef(fit.smooth)[-1])
pred_train <- as.numeric((predict(fit.smooth,type="response")>0.5))
pred_test <- as.numeric((predict(fit.smooth,type="response",
                                 newdata=data.frame(xtest%*%hmat))>0.5))
mean(pred_train!=ytrain)
## [1] 0.1690141
mean(pred_test!=ytest)
## [1] 0.1867882

We plot the regression coefficients.

plot(coef.glm[-1], ylim=c(-0.4,+0.4), type="l",
     xlab="Frequency", ylab="Logistic Regression Coefficients")
lines(coef.lasso,col="green",lwd=2)
lines(coef.smooth,col="red",lwd=2)
abline(h=0)

## 7.9 Classification and the Sonar data set

In this exercise we explore the Sonar data set from the mlbench package. We use the caret package for classification.

1. Load the Sonar data set and read the help page.
2. Split the data into a training and test set.
3. Use logistic regression and apply forward selection. Show the estimated coefficients of the final model and calculate the prediction error.
4. Apply elastic-net regularized logistic regression. Show the trace plot and the cross-validation curve. Calculate the prediction error.
5. Run the AdaBoost method. Obtain the variable importance scores and calculate the prediction error. (Hint: use the function gbm with argument distribution set to "adaboost".)

The solution to this exercise follows. The goal is to train a model to discriminate between rock "R" and metal "M" based on sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock. The $$p=60$$ covariates represent energies within particular frequency bands. We first load the data and split into training and test data.

sonar <- readRDS(file="data/sonar.rds")
set.seed(107)
inTrain <- createDataPartition(
y = sonar$Class,
## the outcome data are needed
p = .5,
## The percentage of data in the
## training set
list = FALSE
)
dtrain <- sonar[ inTrain,]
xtrain <- as.matrix(dtrain[,-61])
ytrain <- dtrain$Class
dtest <- sonar[-inTrain,]
xtest <- as.matrix(dtest[,-61])
ytest <- dtest$Class

In order to compare the performance of the different methods we set up trainControl. We use 10-fold cross-validation.

tc <- trainControl(method = "cv", number = 10) # 10-fold CV, performance=accuracy

We first run logistic regression, once with only an intercept and once including all covariates. We note that glm reports convergence problems when using all covariates. This indicates that the model is too complex to fit.

fit.lo <-  glm(Class~1,family=binomial,data=dtrain)
fit.full <-  glm(Class~.,family=binomial,data=dtrain)
fit.full.caret <- train(Class ~.,
data = dtrain,
method = "glm",
family = "binomial",
trControl = tc)

Next we perform forward selection using stepAIC from the MASS package.

up <- paste("~", paste(colnames(dtrain), collapse=" + "))
fit.fw <- stepAIC(fit.lo,
scope=up,
direction = "forward",
steps=10,
trace = FALSE)
#summary(fit.fw)
kable(broom::tidy(fit.fw),digits=2)
term estimate std.error statistic p.value
(Intercept) 4.84 1.74 2.78 0.01
V11 -13.08 5.32 -2.46 0.01
V51 -106.83 46.93 -2.28 0.02
V36 12.56 3.55 3.54 0.00
V43 -15.67 4.67 -3.35 0.00
V4 -54.34 18.41 -2.95 0.00
V15 4.74 2.47 1.92 0.05
V3 30.03 14.72 2.04 0.04
V22 -3.45 1.60 -2.16 0.03
V9 -14.82 6.05 -2.45 0.01
V7 23.17 9.66 2.40 0.02

We can repeat the same analysis using the train function from the caret package. (Note that there is no tuning parameter involved for glmStepAIC.)

fit.fw.caret <- train(x=xtrain,
y=ytrain,
method = "glmStepAIC",
family = "binomial",
direction ="forward",
steps=10,
trControl = tc,
trace = FALSE
)
# accuracy
kable(fit.fw.caret$results,digits=2)

parameter Accuracy Kappa AccuracySD KappaSD
none 0.68 0.35 0.16 0.32

# summary of the model
kable(broom::tidy(fit.fw.caret$finalModel),digits=2)
term estimate std.error statistic p.value
(Intercept) 4.84 1.74 2.78 0.01
V11 -13.08 5.32 -2.46 0.01
V51 -106.83 46.93 -2.28 0.02
V36 12.56 3.55 3.54 0.00
V43 -15.67 4.67 -3.35 0.00
V4 -54.34 18.41 -2.95 0.00
V15 4.74 2.47 1.92 0.05
V3 30.03 14.72 2.04 0.04
V22 -3.45 1.60 -2.16 0.03
V9 -14.82 6.05 -2.45 0.01
V7 23.17 9.66 2.40 0.02

Next we employ L1-penalized logistic regression.

fit.l1 <-glmnet(x=xtrain,y=ytrain,alpha=1,family="binomial")
plot(fit.l1,xvar="lambda",label=TRUE)

cvfit <- cv.glmnet(x=xtrain, y=ytrain,
family = "binomial",
type.measure = "class",alpha=1)
#cvfit$lambda.min
plot(cvfit)

We repeat the same analysis using the caret package.

lambda.grid <- fit.l1$lambda
fit.l1.caret<-train(x=xtrain,
y=ytrain,
method = "glmnet",
family="binomial",
preProc = c("center","scale"),
tuneGrid = expand.grid(alpha = 1,lambda=lambda.grid),
trControl = tc
)

ggplot(fit.l1.caret)

fit.l1.caret
# best lambda
fit.l1.caret$bestTune$lambda
# model coefficients
coef(fit.l1.caret$finalModel,fit.l1.caret$bestTune$lambda)

Next we investigate AdaBoost. We can use the gbm function

dtrain$Class <- factor(ifelse(dtrain$Class=="M",0,1))
fit.boost <- gbm(Class~.,data=dtrain)

or the caret package. We first specify the tuning parameter grid.

gbmGrid <- expand.grid(
  n.trees = seq(1,1000,by=20),
  interaction.depth = c(1,3,6),
  shrinkage = 0.1,
  n.minobsinnode = 10
)

Next we train the model.

fit.boost.caret <- train(
  x=xtrain,
  y=ytrain,
  method = "gbm",
  trControl = tc,
  verbose = FALSE,
  tuneGrid = gbmGrid
)

Then we plot the cross-validation accuracy.

ggplot(fit.boost.caret)

Finally we can obtain the variable importance.

gbmImp <- varImp(fit.boost.caret, scale = FALSE)
plot(gbmImp,top=20)

Next we make predictions and compare the misclassification errors on the test data.

# Make predictions
confusionMatrix(data = fit.boost.caret %>% predict(xtest), ytest)
confusionMatrix(data = fit.l1.caret %>% predict(xtest), ytest)
confusionMatrix(data = fit.fw.caret %>% predict(xtest), ytest)
confusionMatrix(data = fit.full.caret %>% predict(xtest), ytest)

With the caret package we can also compare the 4 models using the following commands.

models <- list(full=fit.full.caret,
               fw=fit.fw.caret,
               l1=fit.l1.caret,
               boost=fit.boost.caret)
summary(resamples(models),metric = "Accuracy")

## 7.10 Survival analysis based on simulated data

In this exercise we simulate high-dimensional survival data and explore regularized Cox regression.

1. Simulate high-dimensional survival data from the Cox proportional hazards model with $$n=50$$ and $$p=100$$ covariates. Assume that only two covariates are active, i.e. have non-zero regression coefficients. Create training and test data.
2. Based on training data, fit a Cox regression model including the first 20 covariates. Visualize the hazard ratios using a forest plot.
3. Based on training data, run L1-penalized Cox regression using glmnet. Show the trace plot and the cross-validated C-index.
4. Based on training data, fit a Cox regression model including only the active covariates obtained from the previous glmnet fit.
5. Use the pec package to obtain the Brier score on training and test data for the two Cox regression models.

The solution to this exercise follows. We start by simulating training and test data.

## load packages
library(survival)
library(survminer)

## dimensions
p <- 100
n <- 50

## training data
# design matrix
t.dat <- data.frame(matrix(rnorm(n * p), nrow = n))
fm <- as.formula(paste0("~0+",paste(paste0("X",1:p),collapse="+")))
x_design <- model.matrix(fm,data=t.dat)
beta <- c(2,0,0,0,2,rep(0,p-5))
lh <- x_design%*%beta
# survival and censoring time
rate_base <- 1/5 # baseline hazard
rate_cens <- 1/6 # censoring hazard
hazard <- rate_base*exp(lh)
survtime <- rexp(n,rate=hazard)
censtime <- rexp(n,rate=rate_cens)
status <- as.numeric(survtime<=censtime)
time <- survtime*status+censtime*(1-status)
dtrain <- cbind(
  data.frame(survtime,time,status)%>%
    dplyr::mutate(status.cens=1-status),
  t.dat)

## test data (generated in the same way)
t.dat <- data.frame(matrix(rnorm(n * p), nrow = n))
x_design <- model.matrix(fm,data=t.dat)
lh <- x_design%*%beta
hazard <- rate_base*exp(lh)
survtime <- rexp(n,rate=hazard)
censtime <- rexp(n,rate=rate_cens)
status <- as.numeric(survtime<=censtime)
time <- survtime*status+censtime*(1-status)
dtest <- cbind(
  data.frame(survtime,time,status)%>%
    dplyr::mutate(status.cens=1-status),
  t.dat)

We perform Cox regression and create a forest plot.

fm1 <- as.formula(paste("Surv(time,status)~",paste(paste0("X",1:20),collapse="+")))
fit1 <- coxph(fm1,data=dtrain,x=TRUE)
survminer::ggforest(fit1)

We perform L1-regularized Cox regression and generate the trace plot.

xtrain <- as.matrix(dtrain[,-c(1:4)])
ytrain <- with(dtrain,Surv(time,status))
fit.coxnet <- glmnet(xtrain, ytrain, family = "cox",alpha=0.95)
plot(fit.coxnet,xvar="lambda",label=TRUE)

We perform cross-validation and plot the C-index.

cv.coxnet <- cv.glmnet(xtrain, ytrain,
                       family="cox",
                       type.measure="C",
                       alpha=0.95)
plot(cv.coxnet)

We perform Cox regression based on only the active covariates.

act <- which(as.numeric(coef(fit.coxnet,s=cv.coxnet$lambda.min))!=0)
fm2 <- as.formula(paste("Surv(time,status)~",paste(paste0("X",act),collapse="+")))
fit2 <- coxph(fm2,data=dtrain,x=TRUE)

Finally, we calculate the Brier score on training and test data.
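As a reminder, for a fixed time point $$t$$ the Brier score compares the predicted survival probability with the observed status,

$$\rm{BS}(t)=\frac{1}{n}\sum_{i=1}^n\big(I(T_i>t)-\hat S(t|x_i)\big)^2,$$

where in the presence of censoring pec reweights the terms by inverse-probability-of-censoring weights; lower curves indicate better calibrated predictions.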

library(pec)
fit.pec.train <- pec::pec(
object=list("mod1"=fit1,"mod2"=fit2),
data = dtrain,
formula = Surv(time, status) ~ 1,
splitMethod = "none")

fit.pec.test <- pec::pec(
object=list("mod1"=fit1,"mod2"=fit2),
data = dtest,
formula = Surv(time, status) ~ 1,
splitMethod = "none")

par(mfrow=c(1,2))
plot(fit.pec.train,main="train")
plot(fit.pec.test,main="test")

### References

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2001. The Elements of Statistical Learning. Springer Series in Statistics. New York, NY, USA: Springer New York Inc.

R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.