29 Day 29 (July 17)

29.1 Announcements

Work days on tuesday and thursday

29.2 Collinearity

When two or more of the predictor variables (i.e., columns of $\mathbf{X}$ ) are correlated, the predictors are said to be collinear. Also called
- Collinearity
- Multicollinearity
- Correlation among predictors/covariates
- Partial parameter identifiability
Collinearity cause coefficient estimates (i.e., $\hat{\boldsymbol{\beta}}$ to change when a collinear predictor is removed (or added) to the model.
- Extremely common problem with observational data and data collected from poorly designed experiments.

Example

The data

library(httr)
library(readxl)

url <- "https://doi.org/10.1371/journal.pone.0148743.s005"    
GET(url, write_disk(path <- tempfile(fileext = ".xls")))

df1 <- read_excel(path=path,sheet = "Data", col_names=TRUE,col_types="numeric")
notes <- read_excel(path=path,sheet = "Notes", col_names=FALSE)

head(df1)

## # A tibble: 6 × 9
##   States  Year  CDEP  WKIL  WPOP  BRPA  NCAT  SDEP  NSHP
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1      1  1987     6     4    10     2  9736    10  1478
## 2      1  1988     0     0    14     1  9324     0  1554
## 3      1  1989     3     1    12     1  9190     0  1572
## 4      1  1990     5     1    33     3  8345     0  1666
## 5      1  1991     2     0    29     2  8733     2  1639
## 6      1  1992     1     0    41     4  9428     0  1608

notes

## # A tibble: 9 × 2
##   ...1               ...2                       
##   <chr>              <chr>                      
## 1 Year               <NA>                       
## 2 State              1=Montana,2=Wyoming,3=Idaho
## 3 Cattle depredated  CDEP                       
## 4 Sheep depredated   SDEP                       
## 5 Wolf population    WPOP                       
## 6 Wolves killed      WKIL                       
## 7 Breeding pairs     BRPA                       
## 8 Number of cattleA* NCAT                       
## 9 Number of SheepB*  NSHP

Fit a linear model

m1 <- lm(CDEP ~ NCAT + WPOP,data=df1)
summary(m1)

## 
## Call:
## lm(formula = CDEP ~ NCAT + WPOP, data = df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.289  -8.147  -3.475   2.678  92.257 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.8455597  8.1697216  -0.471     0.64    
## NCAT         0.0009372  0.0010363   0.904     0.37    
## WPOP         0.1031119  0.0119644   8.618 7.51e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.23 on 56 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.5855, Adjusted R-squared:  0.5707 
## F-statistic: 39.55 on 2 and 56 DF,  p-value: 1.951e-11

Killing wolves should reduce the number of cattle depredated, right? Let’s add the number of wolves killed to our model.

m2 <- lm(CDEP ~ NCAT + WPOP + WKIL,data=df1)
summary(m2)

## 
## Call:
## lm(formula = CDEP ~ NCAT + WPOP + WKIL, data = df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.519  -8.549  -1.588   4.049  79.041 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.901373   6.511267   2.135   0.0372 *  
## NCAT        -0.001385   0.000830  -1.669   0.1008    
## WPOP         0.011868   0.015724   0.755   0.4536    
## WKIL         0.683946   0.097769   6.996 3.84e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.85 on 55 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.7807, Adjusted R-squared:  0.7687 
## F-statistic: 65.25 on 3 and 55 DF,  p-value: < 2.2e-16

What happened?!

cor(df1[,c(4,5,7)],use="complete.obs")

##             WKIL       WPOP        NCAT
## WKIL  1.00000000  0.7933354 -0.04460099
## WPOP  0.79333543  1.0000000 -0.34440797
## NCAT -0.04460099 -0.3444080  1.00000000

Some analytical results
- Assume the linear model $\mathbf{y}=\beta_{1}\mathbf{x}+\beta_{2}\mathbf{z}+\boldsymbol{\varepsilon}$ where $\boldsymbol{\varepsilon}\sim\text{N}(\mathbf{0},\sigma^{2}\mathbf{I})$
- Assume that the predictors $\mathbf{x}$ and $\mathbf{z}$ are correlated (i.e., are not orthogonal $\mathbf{x}^{\prime}\mathbf{z}\neq 0$ and $\mathbf{z}^{\prime}\mathbf{x}\neq 0$ ).
- Then $\hat{\beta}_1 = (\mathbf{x}^{\prime}\mathbf{x}\times\mathbf{z}^{\prime}\mathbf{z} - \mathbf{x}^{\prime}\mathbf{z}\times\mathbf{z}^{\prime}\mathbf{x})^{-1}((\mathbf{z}^{\prime}\mathbf{z})\mathbf{x}^{\prime} - (\mathbf{x}^{\prime}\mathbf{z})\mathbf{z}^{\prime})\mathbf{y}$ and $\hat{\beta}_2 = (\mathbf{z}^{\prime}\mathbf{z}\times\mathbf{x}^{\prime}\mathbf{x} - \mathbf{z}^{\prime}\mathbf{x}\times\mathbf{x}^{\prime}\mathbf{z})^{-1}((\mathbf{x}^{\prime}\mathbf{x})\mathbf{z}^{\prime} - (\mathbf{z}^{\prime}\mathbf{x})\mathbf{x}^{\prime})\mathbf{y}.$ The variance is $\text{Var}(\hat{\beta}_{1})=\sigma^{2}\bigg{(}\frac{1}{1-R^{2}_{xz}}\bigg{)}\frac{1}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}$ $\text{Var}(\hat{\beta}_{2})=\sigma^{2}\bigg{(}\frac{1}{1-R^{2}_{xz}}\bigg{)}\frac{1}{\sum_{i=1}^{n}(z_{i}-\bar{z})^{2}}$ where $R^{2}_{xz}$ is the correlation between $\mathbf{x}$ and $\mathbf{z}$ (see pg. 107 in Faraway (2014)).
Take away message
- $\hat{\beta}_i$ is the unbiased and minimum variance estimator assuming all predictors for $\beta_i \neq 0$ are in the model.
- $\hat{\beta}_i$ depends on the other predictors in the model.
- Correlation among predictors increases the variance of $\hat{\beta}_i$ .
  - Known as the variance inflation factor
- Collinearity is ubiquitous in any model with correlated predictors
```
library(faraway)
vif(m2)    
```
```
##     NCAT     WPOP     WKIL 
## 1.350723 3.637257 3.212207
```

29.3 Model selection

Recall that all inference is conditional on the statistical model that we have chosen to use.
- Components of a statistical model
- Are the assumptions met?
- Are there better models we could have used?
- What should you do if you have two or more competing models?
Information criterion
- Motivation: Adjusted $R^2$
  - Recall $\hat{R}^{2}=1-\frac{(\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}})^{\prime}(\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta})}}{(\mathbf{y}-\mathbf{1}\hat{\beta}_{0})^{\prime}(\mathbf{y}-\mathbf{1}\hat{\beta}_{0})}$
  - As we add more predictors to $\mathbf{X}$ , $R^{2}$ monotonically increases when calculated using in-sample data.
  - Calculating $R^{2}$ using in-sample data results in a biased estimator of the $R^{2}$ calculated using “out-of-sample” (i.e., $\text{E}(\hat{R}_{\text{in-sample}}^{2})\neq R_{\text{out-of-sample}}^{2}$ )
  - Adjusted $R^{2}$ is a biased corrected version of $R^{2}$ . $\hat{R}^{2}_{\text{adjusted}}=1-\bigg{(}\frac{n-1}{n-p}\bigg{)}\frac{(\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}})^{\prime}(\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta})}}{(\mathbf{y}-\mathbf{1}\hat{\beta}_{0})^{\prime}(\mathbf{y}-\mathbf{1}\hat{\beta}_{0})}$
  - Many different estimators exist for adjusted $R^{2}$
    - Different assumptions lead to different estimators
- Gelman, Hwang, and Vehtari (2014) on estimating out-of-sample predictive accuracy using in-sample data
  - “We know of no approximation that works in general….Each of these methods has flaws, which tells us that any predictive accuracy measure that we compute will be only approximate”
  - “One difficulty is that all the proposed measures are attempting to perform what is, in general, an impossible task: to obtain an unbiased (or approximately unbiased) and accurate measure of out-of-sample prediction error that will be valid over a general class of models…”
Akaike’s Information Criterion
- The log-likelihood as a scoring rule $\mathcal{l}(\boldsymbol{\theta}|\mathbf{y})$
- The log-likelihood for the linear model $\mathcal{l}(\boldsymbol{\beta},\sigma^2|\mathbf{y})=-\frac{n}{2}\text{ log}(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_{i} - \mathbf{x}_{i}^{\prime}\boldsymbol{\beta})^2.$
  - Like $R^2$ , $\mathcal{l}(\boldsymbol{\beta},\sigma^2|\mathbf{y})$ can be used to measure predictive accuracy on in-sample or out of sample data.
  - Unlike $R^2$ , the log-likelihood is a general scoring rule. For example, let $\mathbf{y} \sim \text{Bernoulli}(p)$ then $\sum_{i=1}^{n}y_{i}\text{log}(p)+(1-y_{i})\text{log}({1-p}).$
  - The log-likelihood is a relative score and can be difficult to interpret.
- AIC uses the in-sample log-likelihood, but achieves an (asymptotically) unbiased estimate the log-likelihood on the out-of-sample data by adding a bias correction. The statistic is then multiplied by negative two $AIC=-2\mathcal{l}(\boldsymbol{\theta}|\mathbf{y}) + 2q$
  - What is a meaninful difference in AIC?
  - AIC is a derived quanitity. Just like $R^2$ we can and should get measures of uncertainty (e.g., 95% CI).

Example

Motivation: Annual mean temperature in Ann Arbor Michigan

library(faraway)
plot(aatemp$year,aatemp$temp,xlab="Year",ylab="Annual mean temperature",pch=20,col=rgb(0.5,0.5,0.5,0.7))

df.in_sample <- aatemp # Use all years for in-sample data

# Intercept only model
m1 <- lm(temp ~ 1,data=df.in_sample)

# Simple regression model
m2 <- lm(temp ~ year,data=df.in_sample)

par(mfrow=c(1,1))
# Plot predictions (in-sample)
plot(df.in_sample$year,df.in_sample$temp,main="In-sample",xlab="Year",ylab="Annual mean temperature",ylim=c(44,52),pch=20,col=rgb(0.5,0.5,0.5,0.7))
lines(df.in_sample$year,predict(m2))

# Broken stick basis function
f1 <- function(x,c){ifelse(x<c,c-x,0)}
f2 <- function(x,c){ifelse(x>=c,x-c,0)}
m3 <- lm(temp ~ f1(year,1930) + f2(year,1930),data=df.in_sample)

par(mfrow=c(1,1))
# Plot predictions (in-sample)
plot(df.in_sample$year,df.in_sample$temp,main="In-sample",xlab="Year",ylab="Annual mean temperature",ylim=c(44,52),pch=20,col=rgb(0.5,0.5,0.5,0.7))
lines(df.in_sample$year,predict(m3))

library(rpart)
# Fit regression tree model
m4 <- rpart(temp ~ year,data=df.in_sample)

par(mfrow=c(1,1))
# Plot predictions (in-sample)
plot(df.in_sample$year,df.in_sample$temp,main="In-sample",xlab="Year",ylab="Annual mean temperature",ylim=c(44,52),pch=20,col=rgb(0.5,0.5,0.5,0.7))
lines(df.in_sample$year,predict(m4))

#Adjusted R-squared
summary(m1)$adj.r.square

## [1] 0

summary(m2)$adj.r.square

## [1] 0.0772653

summary(m3)$adj.r.square

## [1] 0.07692598

# Not clear how to calculate adjusted R^2 for m4 (regression tree)

# AIC 
y <- df.in_sample$temp
y.hat_m1 <- predict(m1,newdata=df.in_sample)
y.hat_m2 <- predict(m2,newdata=df.in_sample)
y.hat_m3 <- predict(m3,newdata=df.in_sample)
y.hat_m4 <- predict(m4,newdata=df.in_sample)

sigma2.hat_m1 <- summary(m1)$sigma^2
sigma2.hat_m2 <- summary(m2)$sigma^2
sigma2.hat_m3 <- summary(m3)$sigma^2

# m1
#log-likelihood
ll <- sum(dnorm(y,y.hat_m1,sigma2.hat_m1^0.5,log=TRUE)) # same as logLik(m1)
q <- 1 + 1 # number of estimated betas + number estimated sigma2s
#AIC
-2*ll+2*q #Same as AIC(m1)

## [1] 426.6276

# m2
#log-likelihood
ll <- sum(dnorm(y,y.hat_m2,sigma2.hat_m2^0.5,log=TRUE)) # same as logLik(m2)
q <-  2 + 1 # number of estimated betas + number estimated sigma2s
#AIC
-2*ll+2*q #Same as AIC(m2)

## [1] 418.38

# m3
#log-likelihood
ll <- sum(dnorm(y,y.hat_m3,sigma2.hat_m3^0.5,log=TRUE)) # same as logLik(m3)
q <- 3 + 1 # number of estimated betas + number estimated sigma2s
#AIC
-2*ll+2*q #Same as AIC(m3)

## [1] 419.4223

# m4
n <- length(y)
q <- 7 + 6 + 1 # number of estimated "intercept" terms in each termimal node + number of estimated "breaks" + number estimated sigma2s
sigma2.hat_m4 <- t(y - y.hat_m4)%*%(y - y.hat_m4)/(n - q - 1)
# log-likelihood
ll <- sum(dnorm(y,y.hat_m4,sigma2.hat_m4^0.5,log=TRUE)) 
#AIC
-2*ll+2*q

## [1] 406.5715

Example: assessing uncertainity in model selection statistics

library(mosaic)

boot <- function(){
m1 <- lm(temp ~ 1,data=resample(df.in_sample))
m2 <- lm(temp ~ year,data=resample(df.in_sample))
m3 <- lm(temp ~ f1(year,1930) + f2(year,1930),data=resample(df.in_sample))
cbind(AIC(m1),AIC(m2),AIC(m3))}

bs.aic.samplles <- do(1000)*boot()

par(mfrow=c(3,1))
hist(bs.aic.samplles[,1],col="grey",main="m1",freq=FALSE,xlab="AIC score",xlim=range(bs.aic.samplles))
abline(v=AIC(m1),col="gold",lwd=3)
hist(bs.aic.samplles[,2],col="grey",main="m2",freq=FALSE,xlab="AIC score",xlim=range(bs.aic.samplles))
abline(v=AIC(m2),col="gold",lwd=2)
hist(bs.aic.samplles[,3],col="grey",main="m3",freq=FALSE,xlab="AIC score",xlim=range(bs.aic.samplles))
abline(v=AIC(m3),col="gold",lwd=3)

Summary
- Making statistical inference after model selection?
- Discipline specific views (e.g., AIC in wildlife ecology).
- There are many choices for model selection using in-sample data. Knowing the “best” model selection approach would require you to know a lot about all other procedures.
  - AIC (Akaike 1973)
  - BIC (Swartz 1978)
  - CAIC (Vaida & Blanchard 2005)
  - DIC (Spiegelhalter et al. 2002)
  - ERIC (Hui et al. 2015)
  - And the list goes on
- I would recommend using methods that you understand.
- In my experience it takes a larger data set to do estimation and model selection
  - If you have a large data set, split it and use out-of-sample data for estimation
  - If you don’t have a large data set, make sure you assess the uncertainty. You might not have enough data to do both estimation and model selection.

References

Faraway, J. J. 2014. Linear Models with r. CRC Press.

Gelman, Andrew, Jessica Hwang, and Aki Vehtari. 2014. “Understanding Predictive Information Criteria for Bayesian Models.” Statistics and Computing 24 (6): 997–1016.