Chapter 7 Exercises

7.1 Closed form solution for Ridge regression

Show that the Ridge optimization problem has the closed form solution

\[\begin{align*} \hat{\beta}^{\rm Ridge}_{\lambda}&=(\bf X^T \bf X+\lambda \bf I)^{-1}\bf X^T \bf y. \end{align*}\]

Hint: calculate the gradient of the loss function \(\ell_{\rm Ridge}(\beta|{\bf y},{\bf X})={\rm RSS}(\beta)+\lambda\|\beta\|_2^2\), set it equal to zero and solve for \(\beta\).
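
A quick numerical check of the closed-form solution can be done in R (a minimal sketch with simulated data; the dimensions, the value of \(\lambda\) and the true coefficients are arbitrary choices):

# compare the closed-form Ridge solution with a numerical minimizer of the Ridge loss
set.seed(1)
n <- 50; p <- 5; lambda <- 2
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(1, -1, 0.5, 0, 0) + rnorm(n)
beta.closed <- drop(solve(t(X) %*% X + lambda * diag(p), t(X) %*% y))
ridge.loss <- function(b) sum((y - X %*% b)^2) + lambda * sum(b^2)
beta.optim <- optim(rep(0, p), ridge.loss, method = "BFGS")$par
cbind(closed.form = beta.closed, numerical = beta.optim)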

7.2 Ridge and Lasso for the orthonormal design

Calculate the Ridge and the Lasso solution for the special case of an orthonormal design matrix.
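
For reference, a sketch of the expected result under the penalty conventions used above, i.e. \(\ell_{\rm Ridge}(\beta)={\rm RSS}(\beta)+\lambda\|\beta\|_2^2\) and \(\ell_{\rm Lasso}(\beta)={\rm RSS}(\beta)+\lambda\|\beta\|_1\) (the thresholding constant changes if the penalty is scaled differently): with \({\bf X}^T{\bf X}={\bf I}\) and \(\hat{\beta}^{\rm OLS}={\bf X}^T{\bf y}\) one obtains componentwise

\[\begin{align*} \hat{\beta}^{\rm Ridge}_{\lambda,j}=\frac{\hat{\beta}^{\rm OLS}_j}{1+\lambda},\qquad \hat{\beta}^{\rm Lasso}_{\lambda,j}={\rm sign}\big(\hat{\beta}^{\rm OLS}_j\big)\Big(|\hat{\beta}^{\rm OLS}_j|-\frac{\lambda}{2}\Big)_{+}. \end{align*}\]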

7.3 Bayesian interpretation of Ridge regression

  1. Write down the log-likelihood of the linear regression model. Note: \(Y_i=X_{i}^T\beta +\epsilon_i,\) where \(\epsilon_1,\ldots,\epsilon_n\) iid \(N(0,\sigma^2)\) and \(\bf{X}\) is a fixed \(n\times p\) design matrix.
  2. Find the expression for the maximum likelihood estimator.
  3. Assuming a prior distribution \(\beta_1,\ldots,\beta_p\) iid \(\sim N(0,\tau^2)\), derive the posterior distribution of \(\beta\) and show that the maximum a posteriori estimator (MAP) coincides with the Ridge estimator.
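
A sketch of the key step for part 3 (assuming the likelihood and prior stated above): by Bayes' theorem the log-posterior is, up to additive constants,

\[\begin{align*} \log p(\beta|{\bf y},{\bf X})= -\frac{1}{2\sigma^2}\sum_{i=1}^n\big(y_i-X_i^T\beta\big)^2-\frac{1}{2\tau^2}\sum_{j=1}^p\beta_j^2+{\rm const}, \end{align*}\]

so maximizing the posterior is equivalent to minimizing \({\rm RSS}(\beta)+\lambda\|\beta\|_2^2\) with \(\lambda=\sigma^2/\tau^2\), i.e. the MAP estimator coincides with \(\hat{\beta}^{\rm Ridge}_{\lambda}\).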

7.4 Prostate Cancer data

Explore the prostate cancer data set described in the book Hastie, Tibshirani, and Friedman (2001).

  1. Download the prostate cancer data set from the webpage https://web.stanford.edu/~hastie/ElemStatLearn/datasets.
  2. Run OLS, Best Subset selection (use the leaps package), Ridge Regression and Lasso Regression.
  3. Print the coefficients as a table.
  4. Calculate the test error for each method and compare the results with Table 3.3 in Hastie, Tibshirani, and Friedman (2001).

The solution to this exercise follows. We read the data set and create training and test data.

dat <- read.csv("data/prostate.csv")
dat <- cbind(scale(dat[,-c(9:10)]),dat[,c(9,10)])
dtrain <- dat[dat$train,-10]
xtrain <- data.matrix(dtrain[,-9])
ytrain <- dtrain$lpsa
dtest <- dat[!dat$train,-10]
xtest <- data.matrix(dtest[,-9])
ytest <- dtest$lpsa

Perform OLS regression, Best Subset selection, Ridge regression and Lasso regression.

# OLS regression
fit.lm <- lm(lpsa~.,data=dtrain)
coef.lm <- coef(fit.lm)
terr.lm <- mean((predict(fit.lm,newdata = dtest)-dtest$lpsa)^2)

# best subset selection with leaps
library(leaps)
subset.leaps<-regsubsets(lpsa~.,data=dtrain,method="forward")
ss.summary <- summary(subset.leaps)
coef.ss.bic <- coef(subset.leaps, which.min(ss.summary$bic))
fit.ss.bic <- lm(lpsa~.,data=dtrain[,c("lpsa",names(coef.ss.bic)[-1])])
terr.ss.bic <- mean((predict(fit.ss.bic,newdata = dtest)-dtest$lpsa)^2)

# best subset selection with caret package (10-fold cv)
library(caret)
set.seed(1) # set seed for reproducibility
train.control <- trainControl(method = "cv", number = 10) # set up 10-fold CV
subset.caret <- train(lpsa ~., data = dtrain,
                      method = "leapBackward", 
                      tuneGrid = data.frame(nvmax = 1:8),
                      trControl = train.control
)
#subset.caret$results
#summary(subset.caret$finalModel)
coef.ss.cv <- coef(subset.caret$finalModel, subset.caret$bestTune$nvmax)
fit.ss.cv <- lm(lpsa~.,data=dtrain[,c("lpsa",names(coef.ss.cv)[-1])])
terr.ss.cv <- mean((predict(fit.ss.cv,newdata = dtest)-dtest$lpsa)^2)

# Ridge regression
set.seed(1)
library(glmnet)
fit.ridge <- glmnet(xtrain,ytrain,alpha=0)
fit.ridge.cv <- cv.glmnet(xtrain,ytrain,alpha=0)
coef.ridge <- coef(fit.ridge,s=fit.ridge.cv$lambda.min)
terr.ridge <- mean((as.vector(predict(fit.ridge,newx=xtest,s=fit.ridge.cv$lambda.1se))-ytest)^2)

# Lasso regression
set.seed(1)
fit.lasso <- glmnet(xtrain,ytrain,alpha=1)
fit.lasso.cv <- cv.glmnet(xtrain,ytrain,alpha=1)
coef.lasso <- coef(fit.lasso,s=fit.lasso.cv$lambda.1se)
terr.lasso <- mean((as.vector(predict(fit.lasso,newx=xtest,s=fit.lasso.cv$lambda.1se))-ytest)^2)

The next two figures show the BIC and the CV error for the Best Subset selection approach (the tuning parameter is the subset size).

plot(ss.summary$bic)

plot(subset.caret)

The next two figures show the CV Error for Ridge and Lasso regression.

par(mfrow=c(1,2))
plot(fit.ridge.cv)
plot(fit.lasso.cv)

Next we show the table of regression coefficients obtained using the different methods.

library(knitr)
res <- data.frame(OLS=coef.lm,SSBIC=0,SSCV=0,RIDGE=0,LASSO=0)
res[names(coef.ss.bic),"SSBIC"] <- coef.ss.bic
res[names(coef.ss.cv),"SSCV"] <- coef.ss.cv
res[,"RIDGE"] <- as.numeric(coef.ridge)
res[,"LASSO"] <- as.numeric(coef.lasso)
kable(res,digits = 3)
                OLS   SSBIC    SSCV   RIDGE   LASSO
(Intercept)   2.465   2.477   2.467   2.467   2.465
lcavol        0.680   0.740   0.676   0.581   0.547
lweight       0.263   0.316   0.265   0.258   0.211
age          -0.141   0.000  -0.145  -0.110   0.000
lbph          0.210   0.000   0.210   0.200   0.116
svi           0.305   0.000   0.307   0.281   0.178
lcp          -0.288   0.000  -0.287  -0.163   0.000
gleason      -0.021   0.000   0.000   0.012   0.000
pgg45         0.267   0.000   0.252   0.200   0.070

Finally we show the test errors of the different methods.

res <- data.frame(OLS=terr.lm,
                  SSBIC=terr.ss.bic,SSCV=terr.ss.cv,
                  RIDGE=terr.ridge,LASSO=terr.lasso)
kable(res,digits = 3)
OLS SSBIC SSCV RIDGE LASSO
0.521 0.492 0.517 0.508 0.454
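
Table 3.3 in Hastie, Tibshirani, and Friedman (2001) additionally reports a standard error for each test error. This can be obtained as follows (a minimal sketch for the OLS fit; the other methods are analogous):

# standard error of the test error for the OLS fit
sd((predict(fit.lm, newdata = dtest) - dtest$lpsa)^2) / sqrt(nrow(dtest))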

7.5 Elastic net mixing parameter and cross-validation

  1. Obtain the riboflavin data set from the hdi package, read the help documentation and check the data structure.
  2. Run the Lasso and generate the trace plot.
  3. Run the Elastic net with mixing parameters \(\alpha=0.25, 0.5, 0.75\) and \(1\) and compare the cross-validation curves. Hint: use the foldid argument of cv.glmnet.
  4. Show the selected genes for the best performing model.

The solution to this exercise follows. We first load the data and check the data structure.

library(hdi)
## Loading required package: scalreg
library(glmnet)
data(riboflavin)
str(riboflavin)
## 'data.frame':    71 obs. of  2 variables:
##  $ y: num  -6.64 -6.95 -7.93 -8.29 -7.31 ...
##  $ x: 'AsIs' num [1:71, 1:4088] 8.49 7.64 8.09 7.89 6.81 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:71] "b_Fbat107PT24.CEL" "b_Fbat107PT30.CEL" "b_Fbat107PT48.CEL" "b_Fbat107PT52.CEL" ...
##   .. ..$ : chr [1:4088] "AADK_at" "AAPA_at" "ABFA_at" "ABH_at" ...

Next we set up the design matrix and the response variable and run the Lasso.

x <- riboflavin[,-1]
y <- riboflavin[,1]
fit <- glmnet(x = x, y = y)
plot(fit)

We run 10-fold cross-validation for the different mixing parameters and plot the error curves.

set.seed(1)
n.fold <- 10
foldid <- sample(1:n.fold, size = length(y), replace = TRUE)
cv1 <- cv.glmnet(x, y, foldid = foldid, alpha = 1) 
cv2  <- cv.glmnet(x, y, foldid = foldid, alpha = 0.75)
cv3  <- cv.glmnet(x, y, foldid = foldid, alpha = 0.5)
cv4  <- cv.glmnet(x, y, foldid = foldid, alpha = 0.25)

t.lambdarange <- range(log(c(cv1$lambda,
                             cv2$lambda,
                             cv3$lambda,
                             cv4$lambda)))
t.crange <- range(c(cv1$cvm,cv2$cvm,cv3$cvm,cv4$cvm))
plot(log(cv1$lambda), cv1$cvm , 
     pch = 19, col = "red",
     xlab = "log(Lambda)",
     ylab=cv1$name,
     type="l",
     xlim=t.lambdarange,
     ylim=t.crange)
lines(log(cv2$lambda), cv2$cvm, pch = 19, col = "grey")
lines(log(cv3$lambda) , cv3$cvm , pch = 19, col = "blue")
lines(log(cv4$lambda) , cv4$cvm , pch = 19, col = "green")
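
To judge which mixing parameter performs best, one can for instance compare the smallest cross-validation error attained by each fit and add a legend to the figure above (a minimal sketch; the colours follow the plotting code above):

# smallest CV error per mixing parameter
cv.list <- list("alpha=1" = cv1, "alpha=0.75" = cv2, "alpha=0.5" = cv3, "alpha=0.25" = cv4)
sapply(cv.list, function(cv) min(cv$cvm))
legend("topleft", legend = names(cv.list),
       col = c("red", "grey", "blue", "green"), lty = 1, bty = "n")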

Finally, we print the gene names of the non-zero coefficients.

## Get selected genes
b <- as.matrix(coef(cv1))
rownames(b)[b!=0]
##  [1] "(Intercept)" "ARGF_at"     "DNAJ_at"     "GAPB_at"     "LYSC_at"    
##  [6] "PKSA_at"     "SPOIISA_at"  "SPOVAA_at"   "XHLB_at"     "XKDS_at"    
## [11] "XTRA_at"     "YBFI_at"     "YCDH_at"     "YCGO_at"     "YCKE_at"    
## [16] "YCLB_at"     "YCLF_at"     "YDDH_at"     "YDDK_at"     "YEBC_at"    
## [21] "YEZB_at"     "YFHE_r_at"   "YFIR_at"     "YHDS_r_at"   "YKBA_at"    
## [26] "YOAB_at"     "YQJU_at"     "YRVJ_at"     "YTGB_at"     "YURQ_at"    
## [31] "YXLD_at"     "YXLE_at"     "YYDA_at"
## By default, the selected variables are based on the largest value of
## lambda such that the cv-error is within 1 standard error of the minimum

7.6 Splines and South African heart disease data

In the main text we fitted a linear logistic regression model to the South African heart disease data. Here we explore nonlinearities in the predictors using natural cubic splines. Proceed as follows:

  1. Fit the model using four natural cubic spline bases for each term in the model.
  2. Run backward selection on the model.
  3. Summarise the final model.
  4. Plot the natural cubic spline functions for the different terms.

The solution to this exercise follows. First we read the data and fit the model.

library(splines)
dat <-  read.table("http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data",sep=",",head=T,row.names=1)

# Computes the logistic regression model using natural splines (note famhist is included as a factor): 
form <-  "chd ~ ns(sbp,df=4) + ns(tobacco,df=4) + ns(ldl,df=4) + famhist + ns(obesity,df=4) + ns(alcohol,df=4) + ns(age,df=4)"
form <-  formula(form)

fit <-  glm( form, data=dat, family=binomial )

Next we run Backward selection.

library(MASS)
fit.bw <- stepAIC(fit,direction="backward",trace = 0)

The outcome of Backward selection is shown next.

kable(as.data.frame(fit.bw$anova),digits=3)
Step                    Df  Deviance  Resid. Df  Resid. Dev      AIC
                        NA        NA        436     457.632  509.632
- ns(alcohol, df = 4)    4     0.456        440     458.088  502.088

The summary of the final model is provided next.

kable(as.data.frame(drop1(fit.bw, test="Chisq" )),digits=2)
                      Df  Deviance     AIC    LRT  Pr(>Chi)
<none>                NA    458.09  502.09     NA        NA
ns(sbp, df = 4)        4    467.16  503.16   9.08      0.06
ns(tobacco, df = 4)    4    470.48  506.48  12.39      0.01
ns(ldl, df = 4)        4    472.39  508.39  14.31      0.01
famhist                1    479.44  521.44  21.36      0.00
ns(obesity, df = 4)    4    466.24  502.24   8.15      0.09
ns(age, df = 4)        4    481.86  517.86  23.77      0.00

We can plot the natural spline function for the first term as follows.

termplot(fit.bw,se=TRUE,rug=TRUE,term=1)
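
The remaining terms of the final model can be displayed in the same way (a sketch; after backward selection the model contains six terms, one of which is the factor famhist):

# plot all terms of the final model
par(mfrow = c(2, 3))
for (i in 1:6) termplot(fit.bw, se = TRUE, rug = TRUE, term = i)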

7.7 Phoneme Recognition

In this exercise we investigate prediction of phonemes based on digitized speech data.

  1. Read the data set, subset the phonemes “aa” and “ao” and create training and test data.

dat <- read.csv("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/phoneme.data")

  2. Plot the log-periodogram as a function of frequency for 5 examples each of the phonemes “aa” and “ao”.

  3. Fit a logistic regression model and evaluate the training and test misclassification errors.

  4. Run Lasso regression and evaluate the training and test misclassification errors.

  5. In the previous approaches we assumed the logit link \[{\rm logit}(x;\beta)=\sum_{j=1}^{256} X_j\beta_j.\] Next we assume that the coefficients are a smooth function of the frequency \(\beta(f)\), i.e. \[\beta(f)=\sum_{m=1}^{\nu} h_m(f)\theta_m,\] where \(h_m\) are B-spline basis functions for a natural cubic spline with \(\nu=12\) degrees of freedom (defined on the set of frequencies). Consider the filtered inputs \(x^*={\bf H}^T x\) and fit \(\theta\) by logistic regression on the \(x^*\). Evaluate the training and test misclassification errors.

  6. Plot the coefficients of the different models.

The solution to this exercise follows. We prepare the data set.

library(splines)
load(file="data/phoneme.rda")
dat2 <- dat[dat$g%in%c("aa","ao"),]

dtrain <- dat2[grepl("^train",dat2$speaker),-c(1,259)]
xtrain <- as.matrix(dtrain[,-257])
ytrain <- ifelse(dtrain$g=="ao",1,0)
dtest <- dat2[grepl("^test",dat2$speaker),-c(1,259)]
xtest <- as.matrix(dtest[,-257])
ytest <- ifelse(dtest$g=="ao",1,0)

dtrain$y <- ytrain
dtest$y <- ytest
dtrain <- dtrain[,-257]
dtest <- dtest[,-257]

We plot the log-periodograms.

set.seed(1)
id.ao <- sample(which(ytrain==1),5)
id.aa <- sample(which(ytrain==0),5)
plot(xtrain[id.ao[1],],type="l",
     ylim=range(xtrain[c(id.ao,id.aa),]),
     xlab="Frequency",ylab="Log-periodogram")
for(i in 2:5){
  lines(xtrain[id.ao[i],])
}
for(i in 1:5){
  lines(xtrain[id.aa[i],],col="red")
}

We run logistic regression and calculate the train and test errors.

# logistic regression
fit <- glm(y~.,data=dtrain,family=binomial)
coef.glm <-  coefficients(fit)
pred_train <- as.numeric((predict(fit,type="response")>0.5))
pred_test <- as.numeric((predict(fit,type="response",newdata=dtest)>0.5))
mean(pred_train!=ytrain)
## [1] 0.09311424
mean(pred_test!=ytest)
## [1] 0.2437358

We run Lasso regression and calculate the train and test errors.

# lasso regression
fit.glmnet <-glmnet(xtrain,ytrain,family = "binomial",alpha=1)
cv.glmnet <- cv.glmnet(xtrain,ytrain,family = "binomial",type.measure = "class",
                       alpha = 1,nfolds = 10)
coef.lasso <- as.numeric(coefficients(fit.glmnet,s = cv.glmnet$lambda.1se))[-1]
plot(cv.glmnet)

pred_train <- c(predict(fit.glmnet,xtrain,s = cv.glmnet$lambda.1se,type = "class"))
pred_test <- c(predict(fit.glmnet,xtest,s = cv.glmnet$lambda.1se,type = "class"))
mean(pred_train!=ytrain)
## [1] 0.1690141
mean(pred_test!=ytest)
## [1] 0.2050114

We use the natural cubic spline basis with \(\nu=12\) to express the coefficients as a smooth function of the frequencies. We calculate the train and test errors.

# coefficient smoothing: hmat is the 256 x 12 natural cubic spline basis matrix
# evaluated at the frequencies; xstar = X %*% hmat are the filtered inputs x*
hmat <- ns(x=1:256,df=12)
xstar <- xtrain%*%hmat
fit.smooth <- glm(dtrain$y~.,data=data.frame(xstar),family="binomial")
coef.smooth <- as.numeric(hmat%*%coef(fit.smooth)[-1])
pred_train <- as.numeric((predict(fit.smooth,type="response")>0.5))
pred_test <- as.numeric((predict(fit.smooth,type="response",newdata=data.frame(xtest%*%hmat))>0.5))

mean(pred_train!=ytrain)
## [1] 0.1690141
mean(pred_test!=ytest)
## [1] 0.1867882

We plot the regression coefficients.

plot( coef.glm[-1], 
      ylim=c(-0.4,+0.4), 
      type="l", 
      xlab="Frequency", 
      ylab="Logistic Regression Coefficients" )
lines(coef.lasso,col="green",lwd=2)
lines(coef.smooth,col="red",lwd=2)
abline(h=0)
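
A legend makes the figure easier to read (a small addition; the colours correspond to the lines() calls above):

legend("topright", legend = c("logistic regression", "lasso", "natural cubic splines"),
       col = c("black", "green", "red"), lty = 1, lwd = c(1, 2, 2), bty = "n")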

7.8 Classification and the Sonar data set

In this exercise we explore the Sonar data set from the mlbench package. We use the caret package for classification.

  1. Load the Sonar data set and read the help page.

  2. Split the data into a training and test set.

  3. Use logistic regression and apply forward selection. Show the estimated coefficients of the final model and calculate the prediction error.

  4. Apply elastic-net regularized logistic regression. Show the trace plot and the cross-validation curve. Calculate the prediction error.

  5. Run the AdaBoost method. Obtain the variable importance scores and calculate the prediction error. (Hint: use the function gbm with argument distribution set to “adaboost”).

The solution to this exercise follows. We explore the Sonar data set from the mlbench package. The goal is to train a model to discriminate between rock “R” and metal “M” based on sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock. The \(p=60\) covariates represent energies within particular frequency bands. We first load the data and split into training and test data.

library(mlbench)
data(Sonar)

set.seed(107)
inTrain <- createDataPartition(
  y = Sonar$Class,
  ## the outcome data are needed
  p = .5,
  ## The percentage of data in the
  ## training set
  list = FALSE
)
dtrain <- Sonar[ inTrain,]
xtrain <- as.matrix(dtrain[,-61])
ytrain <- dtrain$Class
dtest  <- Sonar[-inTrain,]
xtest <- as.matrix(dtest[,-61])
ytest <- dtest$Class

In order to compare the performance of the different methods we set up trainControl. We use 10-fold cross-validation.

tc <- trainControl(method = "cv", number = 10) # 10-fold CV, performance=accuracy

We first run logistic regression, once with only an intercept and once including all covariates. We note that glm reports convergence problems when all covariates are used, which indicates that the model is too complex for the relatively small number of training observations.

fit.lo <-  glm(Class~1,family=binomial,data=dtrain)
fit.full <-  glm(Class~.,family=binomial,data=dtrain)
fit.full.caret <- train(Class ~., 
                        data = dtrain, 
                        method = "glm", 
                        family = "binomial",
                        trControl = tc)

Next we run forward selection with stepAIC from the MASS package.

# scope for forward selection: all covariates, excluding the response
up <- as.formula(paste("~", paste(setdiff(colnames(dtrain), "Class"), collapse=" + ")))
fit.fw <- stepAIC(fit.lo,
                  scope=up,
                  direction = "forward", 
                  steps=10,
                  trace = FALSE)
#summary(fit.fw)
kable(broom::tidy(fit.fw),digits=2)
term estimate std.error statistic p.value
(Intercept) 4.84 1.74 2.78 0.01
V11 -13.08 5.32 -2.46 0.01
V51 -106.83 46.93 -2.28 0.02
V36 12.56 3.55 3.54 0.00
V43 -15.67 4.67 -3.35 0.00
V4 -54.34 18.41 -2.95 0.00
V15 4.74 2.47 1.92 0.05
V3 30.03 14.72 2.04 0.04
V22 -3.45 1.60 -2.16 0.03
V9 -14.82 6.05 -2.45 0.01
V7 23.17 9.66 2.40 0.02

We can repeat the same analysis using the train function from the caret package. (Note that there is no tuning parameter involved for glmStepAIC.)

fit.fw.caret <- train(x=xtrain,
                      y=ytrain, 
                      method = "glmStepAIC", 
                      family = "binomial",
                      direction ="forward",
                      steps=10,
                      trControl = tc,
                      trace = FALSE
)
# accuracy
kable(fit.fw.caret$results,digits=2)
parameter Accuracy Kappa AccuracySD KappaSD
none 0.68 0.35 0.16 0.32
# summary of the model
kable(broom::tidy(fit.fw.caret$finalModel),digits=2)
term estimate std.error statistic p.value
(Intercept) 4.84 1.74 2.78 0.01
V11 -13.08 5.32 -2.46 0.01
V51 -106.83 46.93 -2.28 0.02
V36 12.56 3.55 3.54 0.00
V43 -15.67 4.67 -3.35 0.00
V4 -54.34 18.41 -2.95 0.00
V15 4.74 2.47 1.92 0.05
V3 30.03 14.72 2.04 0.04
V22 -3.45 1.60 -2.16 0.03
V9 -14.82 6.05 -2.45 0.01
V7 23.17 9.66 2.40 0.02

Next we employ L1-penalized logistic regression.

fit.l1 <-glmnet(x=xtrain,y=ytrain,alpha=1,family="binomial") 
plot(fit.l1,xvar="lambda",label=TRUE)

cvfit <- cv.glmnet(x=xtrain, y=ytrain, 
                   family = "binomial", 
                   type.measure = "class",alpha=1)
#cvfit$lambda.min
plot(cvfit)

We repeat the same analysis using the caret package.

lambda.grid <- fit.l1$lambda
fit.l1.caret<-train(x=xtrain,
                    y=ytrain,
                    method = "glmnet",
                    family="binomial",
                    preProc = c("center","scale"),
                    tuneGrid = expand.grid(alpha = 1,lambda=lambda.grid),
                    trControl = tc
) 

ggplot(fit.l1.caret)

fit.l1.caret
# best lambda
fit.l1.caret$bestTune$lambda
# model coefficients
coef(fit.l1.caret$finalModel,fit.l1.caret$bestTune$lambda)

Next we investigate AdaBoost. We can use the gbm function

library(gbm)
dtrain$Class <- ifelse(dtrain$Class=="M",0,1) # gbm's "adaboost" distribution expects a 0/1 response
fit.boost <- gbm(Class~.,data=dtrain,distribution="adaboost")

or the caret package. We first specify the tuning parameter grid.

gbmGrid <-  expand.grid(
  n.trees = seq(1,1000,by=20),
  interaction.depth = c(1,3,6), 
  shrinkage = 0.1,
  n.minobsinnode = 10
)

Next we train the model.

fit.boost.caret<-train(
  x=xtrain,
  y=ytrain,
  method = "gbm",
  trControl = tc,
  verbose = FALSE,
  tuneGrid = gbmGrid
) 

and plot the cross-validation accuracy.

ggplot(fit.boost.caret)

Finally we can obtain the variable importance.

gbmImp <- varImp(fit.boost.caret, scale = FALSE)
plot(gbmImp,top=20)

Next we make predictions and compare the misclassification errors on the test data.

# Make predictions
confusionMatrix(data = predict(fit.boost.caret, xtest), reference = ytest)
confusionMatrix(data = predict(fit.l1.caret, xtest), reference = ytest)
confusionMatrix(data = predict(fit.fw.caret, xtest), reference = ytest)
confusionMatrix(data = predict(fit.full.caret, xtest), reference = ytest)
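
The test misclassification errors can also be extracted directly (a minimal sketch; as above we pass the test covariates to predict(), which for a caret train object returns class labels by default):

fits <- list(full = fit.full.caret, fw = fit.fw.caret,
             l1 = fit.l1.caret, boost = fit.boost.caret)
sapply(fits, function(fit) mean(predict(fit, xtest) != ytest))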

With the caret package we can also compare the four models using the following commands.

models <- list(full=fit.full.caret,
               fw=fit.fw.caret,
               l1=fit.l1.caret,
               boost=fit.boost.caret)
summary(resamples(models),metric = "Accuracy")

7.9 Survival analysis based on simulated data

In this exercise we simulate high-dimensional survival data and explore regularized Cox regression.

  1. Simulate high-dimensional survival data from the Cox proportional hazards model with \(n=50\) and \(p=100\) covariates. Assume that only the first two covariates are active, i.e., have non-zero regression coefficients. Create training and test data.

  2. Based on the training data, fit a Cox regression model including the first 20 covariates. Visualize the hazard ratios using a forest plot.

  3. Based on the training data, run L1-penalized Cox regression using glmnet. Show the trace plot and the cross-validated C-index.

  4. Based on the training data, fit a Cox regression model including only the active covariates obtained from the previous glmnet fit.

  5. Use the pec package to obtain the Brier score on training and test data for the two Cox regression models.

The solution to this exercise follows. We start by simulating training and test data.

## load packages
library(survival)
library(survminer)

## dimensions
p <- 100
n <- 50

## training data

# design matrix
t.dat <- data.frame(matrix(rnorm(n * p), nrow = n))
fm <- as.formula(paste0("~0+",paste(paste0("X",1:p),collapse="+")))
x_design <- model.matrix(fm,data=t.dat)
beta <- c(2,2,rep(0,p-2)) # only the first two covariates are active
lh <- x_design%*%beta

# survival and censoring time
rate_base <- 1/5 # baseline hazard
rate_cens <- 1/6 # censoring hazard
hazard <- rate_base*exp(lh)
survtime <- rexp(n,rate=hazard)
censtime <- rexp(n,rate=rate_cens)
status <- as.numeric(survtime<=censtime)
time <- survtime*status+censtime*(1-status)
dtrain <- cbind(
  data.frame(survtime,time,status)%>%
    dplyr::mutate(status.cens=1-status),
  t.dat)

## test data

# design matrix
t.dat <- data.frame(matrix(rnorm(n * p), nrow = n))
fm <- as.formula(paste0("~0+",paste(paste0("X",1:p),collapse="+")))
x_design <- model.matrix(fm,data=t.dat)
beta <- c(2,2,rep(0,p-2)) # only the first two covariates are active
lh <- x_design%*%beta

# survival and censoring time
rate_base <- 1/5 # baseline hazard
rate_cens <- 1/6 # censoring hazard
hazard <- rate_base*exp(lh)
survtime <- rexp(n,rate=hazard)
censtime <- rexp(n,rate=rate_cens)
status <- as.numeric(survtime<=censtime)
time <- survtime*status+censtime*(1-status)
dtest <- cbind(
  data.frame(survtime,time,status)%>%
    dplyr::mutate(status.cens=1-status),
  t.dat)

We perform Cox regression on the first 20 covariates and create a forest plot.

fm1 <- as.formula(paste("Surv(time,status)~",paste(paste0("X",1:20),collapse="+")))
fit1 <- coxph(fm1,data=dtrain,x=TRUE)
survminer::ggforest(fit1)

We perform L1-regularized Cox regression (elastic net with \(\alpha=0.95\), i.e. close to the Lasso) and generate the trace plot.

xtrain <- as.matrix(dtrain[,-c(1:4)])
ytrain <- with(dtrain,Surv(time,status))
fit.coxnet <- glmnet(xtrain, ytrain, family = "cox",alpha=0.95)
plot(fit.coxnet,xvar="lambda",label=TRUE)

We perform cross-validation and plot the C-index.

cv.coxnet <- cv.glmnet(xtrain, ytrain,
                       family="cox",
                       type.measure="C",
                       alpha=0.95)
plot(cv.coxnet)

We perform Cox regression based only on the active covariates.

act <- which(as.numeric(coef(fit.coxnet,s=cv.coxnet$lambda.min))!=0)
fm2 <- as.formula(paste("Surv(time,status)~",paste(paste0("X",act),collapse="+")))
fit2 <- coxph(fm2,data=dtrain,x=TRUE)
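
As a sanity check we can compare the refitted log hazard ratios with the values used in the simulation (a sketch; beta is the coefficient vector defined above, so beta[act] contains the true values of the selected covariates):

cbind(estimate = coef(fit2), truth = beta[act])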

Finally, we calculate the Brier score on training and test data.

library(pec)
fit.pec.train <- pec::pec(
  object=list("mod1"=fit1,"mod2"=fit2), 
  data = dtrain, 
  formula = Surv(time, status) ~ 1, 
  splitMethod = "none")


fit.pec.test <- pec::pec(
  object=list("mod1"=fit1,"mod2"=fit2), 
  data = dtest, 
  formula = Surv(time, status) ~ 1, 
  splitMethod = "none")

par(mfrow=c(1,2))
plot(fit.pec.train,main="train")
plot(fit.pec.test,main="test")
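
The prediction error curves can be summarised in a single number by the integrated Brier score, for example with the crps() function from the pec package (a minimal sketch, assuming the pec objects computed above):

# integrated Brier score on training and test data
crps(fit.pec.train)
crps(fit.pec.test)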


References

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2001. The Elements of Statistical Learning. Springer Series in Statistics. New York, NY, USA: Springer New York Inc.