37 Day 36 (July 26)

37.1 Announcements

  • Today is our last day :(

37.2 Regression trees

  • Example: Change point analysis
    • Motivation: Annual mean temperature in Ann Arbor, Michigan
    library(faraway)
    plot(aatemp$year,aatemp$temp,xlab="Year",ylab="Annual mean temperature",pch=20,col=rgb(0.5,0.5,0.5,0.7))
    • Basis function \[f_1(x)=\begin{cases} 1 & \text{if}\;x<c\\ 0 & \text{otherwise} \end{cases}\] and \[f_2(x)=\begin{cases} 1 & \text{if}\;x\geq c\\ 0 & \text{otherwise} \end{cases}\]
    • In R
    # Change point basis function
    f1 <- function(x,c){ifelse(x<c,1,0)}
    f2 <- function(x,c){ifelse(x>=c,1,0)}
    
    # Fit linear model
    m1 <- lm(temp ~ -1 + f1(year,1930) + f2(year,1930),data=aatemp)
    summary(m1)
    ## 
    ## Call:
    ## lm(formula = temp ~ -1 + f1(year, 1930) + f2(year, 1930), data = aatemp)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -3.5467 -0.8962 -0.1157  1.1388  3.5843 
    ## 
    ## Coefficients:
    ##                Estimate Std. Error t value Pr(>|t|)    
    ## f1(year, 1930)  46.9567     0.1990   236.0   <2e-16 ***
    ## f2(year, 1930)  48.3057     0.1684   286.8   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 1.378 on 113 degrees of freedom
    ## Multiple R-squared:  0.9992, Adjusted R-squared:  0.9992 
    ## F-statistic: 6.899e+04 on 2 and 113 DF,  p-value: < 2.2e-16
    # Plot predictions
    plot(aatemp$year,aatemp$temp,xlab="Year",ylab="Annual mean temperature",pch=20,col=rgb(0.5,0.5,0.5,0.7))
    points(aatemp$year,predict(m1),type="l")
    • Connection to regression trees
      • For a single predictor, a regression tree can be thought of as a linear model that uses the “change point” basis functions, but it tries to estimate the location (or time) and number of the change points from the data (the estimated split values are shown below).
      • Example in R
    library(rpart)
    
    # Fit regression tree model
    m2 <- rpart(temp ~ year,data=aatemp)
    
    # Plot predictions
    E.y <- predict(m2,data.frame(year=seq(1850,2023,by=1)))
    plot(aatemp$year,aatemp$temp,xlab="Year",ylab="Annual mean temperature",pch=20,col=rgb(0.5,0.5,0.5,0.7),xlim=c(1850,2023))
    points(1850:2023,E.y,type="l")
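      • To see the estimated change points, print the fitted tree: the split values on year are the estimated change point locations
    # Print the fitted tree; the splits on year are the estimated change points
    print(m2)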

37.3 Random forest

  • Breiman (2001)

  • Example

    n.boot <- 1000 # Number of bootstrap samples
    x.new <- seq(1850,2023,by=1) # New values of x that I want predictions at
    E.y <- matrix(NA,length(x.new),n.boot) # Matrix to save predictions
    x <- aatemp$year
    y <- aatemp$temp
    
    for(i in 1:n.boot){
    
      # Resample data
      df.temp <- data.frame(y=y,x=x)[sample(1:length(y),replace=TRUE,size=51),]
    
      # Fit tree
      m1 <- rpart(y~x,data=df.temp)
    
      # Get and save predictions
      E.y[,i] <- predict(m1,data.frame(x=x.new))
    }
    
    plot(aatemp$year,aatemp$temp,xlab="Year",ylab="Annual mean temperature",pch=20,col=rgb(0.5,0.5,0.5,0.7),xlim=c(1850,2023))
    
    
    
    # Aggregate predictions
    E.E.y <- rowMeans(E.y)
    points(1850:2023,E.E.y,type="l")
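    • For comparison, a minimal sketch using the randomForest package (assuming it is installed); with a single predictor there is no subsampling of predictors at each split, so this is essentially bagged regression trees:
    library(randomForest)
    
    # Fit a forest of regression trees to the same data
    m.rf <- randomForest(temp ~ year,data=aatemp,ntree=1000)
    
    # Plot the aggregated predictions
    E.y.rf <- predict(m.rf,data.frame(year=1850:2023))
    plot(aatemp$year,aatemp$temp,xlab="Year",ylab="Annual mean temperature",pch=20,col=rgb(0.5,0.5,0.5,0.7),xlim=c(1850,2023))
    points(1850:2023,E.y.rf,type="l")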

  • Summary

    • Bootstrap aggregating (or bagging) can be used with any statistical model (see the sketch below)
    • Usually increases predictive accuracy, but reduces the ability to understand (and write out) the model(s) and to make inferences
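    • For example, a sketch of the same bootstrap loop with lm() in place of rpart(), reusing x, y, x.new, and n.boot from the example above:
    # Bagging a straight-line linear model; any fitting function can be swapped in
    E.y.lm <- matrix(NA,length(x.new),n.boot)
    for(i in 1:n.boot){
      df.temp <- data.frame(y=y,x=x)[sample(1:length(y),replace=TRUE,size=length(y)),]
      E.y.lm[,i] <- predict(lm(y~x,data=df.temp),data.frame(x=x.new))
    }
    plot(aatemp$year,aatemp$temp,xlab="Year",ylab="Annual mean temperature",pch=20,col=rgb(0.5,0.5,0.5,0.7),xlim=c(1850,2023))
    points(x.new,rowMeans(E.y.lm),type="l")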

37.4 Generalized additive model

  • Write out on white board (a sketch of the model for the example below is written out after the code)

  • Differences between interpretable and non-interpretable model/machine learning approaches

  • My personal use of GAMs

    • The functions lm() and glm(), and the packages lme4, nlme, etc.
    • Why I use the mgcv package
    • Wood (2017)
  • Example

    library(faraway)
    library(mgcv)
    
    m1 <- gam(temp ~ s(year,bs="tp"),family=Gamma(link="log"),data=aatemp)
    
    # Plot predictions
    pred <- predict(m1,data.frame(year=seq(1850,2023,by=1)),type="response",se.fit=TRUE)
    E.y <- pred$fit
    ui <- E.y + 1.96*pred$se.fit
    li <- E.y - 1.96*pred$se.fit
    plot(aatemp$year,aatemp$temp,xlab="Year",ylab="Annual mean temperature",pch=20,col=rgb(0.5,0.5,0.5,0.7),xlim=c(1850,2023))
    points(1850:2023,E.y,type="l",lwd=3)
    points(1850:2023,ui,type="l",lwd=2,lty=2)
    points(1850:2023,li,type="l",lwd=2,lty=2)
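    • An alternative (a sketch, not the approach used above) is to build the interval on the log link scale and then back-transform, which keeps the bounds positive:
    # Interval built on the link (log) scale, then back-transformed to the response scale
    pred.link <- predict(m1,data.frame(year=seq(1850,2023,by=1)),type="link",se.fit=TRUE)
    ui2 <- exp(pred.link$fit + 1.96*pred.link$se.fit)
    li2 <- exp(pred.link$fit - 1.96*pred.link$se.fit)
    • One way to write out the model fitted above (a sketch; the whiteboard version may differ): \[y_i \sim \text{Gamma}(\mu_i,\phi),\quad \log(\mu_i)=\beta_0+f(\text{year}_i),\] where \(f\) is a smooth function of year represented with a thin-plate regression spline basis and an estimated wiggliness penalty.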

37.5 Shrinkage methods

  • Ridge regression (Hoerl and Kennard 1970)

    • See pgs. 174-177 of Chapter 11 in Faraway (2014)
    • Minimize the loss function \[\underset{\boldsymbol{\beta}}{\operatorname{argmin}} (\mathbf{y}-\mathbf{X}\boldsymbol{\beta})^{'}(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})+\lambda\sum_{j=1}^{p}\beta_{j}^{2}\]
    • Properties of ridge regression
      • Connection to the linear model with random effects
      • \(\hat{\boldsymbol{\beta}}=\left(\mathbf{X}^{'}\mathbf{X}+\lambda\mathbf{I}\right)^{-1}\mathbf{X}^{'}\mathbf{y}\) (a numerical sketch of this estimator appears after the lm.ridge() example below)
      • \(\text{rank}(\mathbf{X}^{'}\mathbf{X}+\lambda\mathbf{I})\)
      • \(\hat{\boldsymbol{\beta}}\) when \(\lambda=0\)
      • \(\hat{\boldsymbol{\beta}}\) when \(\lambda=\infty\)
      • Recall \[\begin{equation} \begin{split} \hat{\text{E}(\mathbf{y})} & = \mathbf{X}\hat{\boldsymbol{\beta}} \\ & = \mathbf{X}\left(\mathbf{X}^{\prime}\mathbf{X}+\lambda\mathbf{I}\right)^{-1}\mathbf{X}^{\prime}\mathbf{y}\\ & = \mathbf{H}\mathbf{y}\\ \end{split} \end{equation}\]
      • Effective number of coefficients \(=\text{trace}(\mathbf{H})\)
      • If \(\lambda=0\), then \(\text{trace}(\mathbf{H}) = \text{?}\)
      • If \(\lambda=\infty\), then \(\text{trace}(\mathbf{H}) = \text{?}\)
    • Estimation requires selecting a value for \(\lambda\)
    • Example: Variable selection and dealing with collinearity
      • Linear model with \(\lambda=0\)
    library(httr)
    library(readxl)
    
    url <- "https://doi.org/10.1371/journal.pone.0148743.s005"
    GET(url, write_disk(path <- tempfile(fileext = ".xls")))
    df1 <- read_excel(path = path, sheet = "Data", col_names = TRUE, col_types = "numeric")
    df1 <- df1[complete.cases(df1),]
    m1 <- lm(CDEP ~ NCAT + WPOP + WKIL, data = df1)
    summary(m1)
    ## 
    ## Call:
    ## lm(formula = CDEP ~ NCAT + WPOP + WKIL, data = df1)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -18.483  -9.151  -1.477   3.985  78.775 
    ## 
    ## Coefficients:
    ##               Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept) 14.5176798  6.6712280   2.176   0.0339 *  
    ## NCAT        -0.0014333  0.0008412  -1.704   0.0941 .  
    ## WPOP         0.0108041  0.0159751   0.676   0.5017    
    ## WKIL         0.6864227  0.0985674   6.964 4.71e-09 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 14.95 on 54 degrees of freedom
    ## Multiple R-squared:  0.7784, Adjusted R-squared:  0.7661 
    ## F-statistic: 63.23 on 3 and 54 DF,  p-value: < 2.2e-16
    cor(df1[,c(7,5,4)])
    ##             NCAT       WPOP        WKIL
    ## NCAT  1.00000000 -0.3547717 -0.05005347
    ## WPOP -0.35477170  1.0000000  0.79070496
    ## WKIL -0.05005347  0.7907050  1.00000000
    • Linear model with standardized predictors (\(\lambda=0\)) and ridge regression with \(\lambda=10\)
    X <- scale(df1[,c(7,5,4)])
    y <- df1$CDEP
    
    m2 <- lm(y ~ X)
    summary(m2)
    ## 
    ## Call:
    ## lm(formula = y ~ X)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -18.483  -9.151  -1.477   3.985  78.775 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)   28.776      1.964  14.655  < 2e-16 ***
    ## XNCAT         -3.944      2.314  -1.704   0.0941 .  
    ## XWPOP          2.554      3.776   0.676   0.5017    
    ## XWKIL         24.615      3.535   6.964 4.71e-09 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 14.95 on 54 degrees of freedom
    ## Multiple R-squared:  0.7784, Adjusted R-squared:  0.7661 
    ## F-statistic: 63.23 on 3 and 54 DF,  p-value: < 2.2e-16
    library(MASS)
    m3 <- lm.ridge(y ~ X,lambda = 10)
    coef(m3)
    ##               XNCAT     XWPOP     XWKIL 
    ## 28.775862 -2.243853  7.196794 17.936405
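    • Connecting back to the closed-form expression for \(\hat{\boldsymbol{\beta}}\) and the effective number of coefficients, a sketch computing both directly (lm.ridge() centers \(\mathbf{y}\) and uses its own internal scaling of \(\mathbf{X}\), so the numbers will not match coef(m3) exactly):
    # Ridge estimator from the closed form (X'X + lambda*I)^(-1) X'y
    lambda <- 10
    Xs <- scale(df1[,c(7,5,4)])       # standardized predictors
    yc <- df1$CDEP - mean(df1$CDEP)   # centered response
    solve(t(Xs) %*% Xs + lambda*diag(ncol(Xs))) %*% t(Xs) %*% yc
    
    # Effective number of coefficients = trace of the hat matrix H
    H <- Xs %*% solve(t(Xs) %*% Xs + lambda*diag(ncol(Xs))) %*% t(Xs)
    sum(diag(H))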
    • Ridge regression using glmnet() and multiple values of \(\lambda\)
    library(glmnet)
    lambda <- seq(0,400,by=0.1)
    m4 <- glmnet(X,y,family=c("gaussian"),alpha=0,standardize=FALSE,lambda=lambda)
    plot(m4, xvar = "lambda", label = TRUE)

    plot(m4, xvar = "dev", label = TRUE)

    • Ridge regression using 10-fold cross-validation
    m5 <- cv.glmnet(X,y,family=c("gaussian"),alpha=0,standardize=FALSE,lambda=lambda,nfolds = 10)
    plot(m5)

    m5$lambda.min
    ## [1] 1.9
    m6 <- glmnet(X,y,family=c("gaussian"),alpha=0,standardize=FALSE,lambda=m5$lambda.min)
    coef(m6)
    ## 4 x 1 sparse Matrix of class "dgCMatrix"
    ##                    s0
    ## (Intercept) 28.775862
    ## NCAT        -2.950257
    ## WPOP         5.314891
    ## WKIL        21.147750
  • LASSO (Tibshirani 1996)

    • Minimize the loss function \[\underset{\boldsymbol{\beta}}{\operatorname{argmin}} (\mathbf{y}-\mathbf{X}\boldsymbol{\beta})^{'}(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})+\lambda\sum_{j=1}^{p}|\beta_{j}|\]
    • Properties of LASSO
      • Connection to the linear model with random effects
      • No closed-form solution for \(\hat{\boldsymbol{\beta}}\)
      • Shrinks and selects (Least Absolute Shrinkage and Selection Operator)
      • Good to use when collinearity is not a huge problem and when some of the \(\boldsymbol{\beta}\) are thought to be zero.
    • Example: LASSO using glmnet() and multiple values of \(\lambda\)
    X <- scale(df1[,c(7,5,4)])
    y <- df1$CDEP
    
    library(glmnet)
    lambda <- seq(0,400,by=0.1)
    m7 <- glmnet(X,y,family=c("gaussian"),alpha=1,standardize=FALSE,lambda=lambda)
    plot(m7, xvar = "lambda", label = TRUE)

    plot(m7, xvar = "dev", label = TRUE)

    • LASSO using 10-fold cross-validation

    m8 <- cv.glmnet(X,y,family=c("gaussian"),alpha=1,standardize=FALSE,lambda=lambda,nfolds = 10)
    plot(m8)

    m8$lambda.min
    ## [1] 0.2
    m9 <- glmnet(X,y,family=c("gaussian"),alpha=1,standardize=FALSE,lambda=m8$lambda.min)
    coef(m9)
    ## 4 x 1 sparse Matrix of class "dgCMatrix"
    ##                    s0
    ## (Intercept) 28.775862
    ## NCAT        -3.722689
    ## WPOP         2.634395
    ## WKIL        24.358870
    m10 <- glmnet(X,y,family=c("gaussian"),alpha=1,standardize=FALSE,lambda=m8$lambda.1se)
    coef(m10)
    ## 4 x 1 sparse Matrix of class "dgCMatrix"
    ##                     s0
    ## (Intercept) 28.7758621
    ## NCAT         .        
    ## WPOP         0.3655531
    ## WKIL        16.6725216
  • Elastic net (Zou and Hastie 2005)

    • Mix of the ridge and lasso penalties; in glmnet's parameterization the penalty is \(\lambda\sum_{j=1}^{p}\left(\frac{1-\alpha}{2}\beta_{j}^{2}+\alpha|\beta_{j}|\right)\)
    • Example: Elastic net using glmnet() and multiple values of \(\lambda\)
    lambda <- seq(0,400,by=0.1)
    m11 <- glmnet(X,y,family=c("gaussian"),alpha=0.5,standardize=FALSE,lambda=lambda)
    plot(m11, xvar = "lambda", label = TRUE)

    plot(m11, xvar = "dev", label = TRUE)

    • Elastic net using 10-fold cross-validation

    m12 <- cv.glmnet(X,y,family=c("gaussian"),alpha=0.5,standardize=FALSE,lambda=lambda,nfolds = 10)
    plot(m12)

    m12$lambda.min
    ## [1] 1.4
    m13 <- glmnet(X,y,family=c("gaussian"),alpha=0.5,standardize=FALSE,lambda=m12$lambda.min)
    coef(m13)
    ## 4 x 1 sparse Matrix of class "dgCMatrix"
    ##                    s0
    ## (Intercept) 28.775862
    ## NCAT        -2.431400
    ## WPOP         3.039959
    ## WKIL        22.881770
  • Summary

    • Non-Bayesian inference (e.g., confidence intervals) for these methods is still an open question
    • Methods that shrink coefficients deliberately introduce bias in order to reduce variance
      • Methods that do continuous shrinkage are preferred to binary shrinkage (e.g., variable selection with AIC).
      • Shrinking coefficients is one way to deal with collinearity.
    • Most continuous shrinkage methods are some generalization of the ridge or LASSO penalty.
    • Good references
      • Statistical Learning with Sparsity: The Lasso and Generalizations
      • An Introduction to Statistical Learning: with Applications in R
        • Ch. 6 (Linear Model Selection and Regularization)

Literature cited

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32.
Faraway, J. J. 2014. Linear Models with R. CRC Press.
Hoerl, Arthur E, and Robert W Kennard. 1970. “Ridge Regression: Biased Estimation for Nonorthogonal Problems.” Technometrics 12 (1): 55–67.
Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society. Series B (Methodological), 267–88.
Wood, Simon N. 2017. Generalized Additive Models: An Introduction with R. 2nd ed. CRC Press.
Zou, Hui, and Trevor Hastie. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2): 301–20.