32 Day 31 (July 19)

32.1 Announcements

  • Please email me and request a 20 min time period during the final week of class (July 22 - 26) to give your final presentation. When you email me please suggest three times/dates that work for you.

  • Read Ch. 10 in Linear Models with R

  • Up next

    • Random effects/mixed model
    • Generalized linear models
    • Machine learning (e.g., regression trees, random forest)

32.2 Model selection

  • Recall that all inference is conditional on the statistical model that we have chosen to use.
    • Components of a statistical model
    • Are the assumptions met?
    • Are there better models we could have used?
    • What should you do if you have two or more competing models?
  • Information criterion
    • Motivation: Adjusted \(R^2\)
      • Recall \[\hat{R}^{2}=1-\frac{(\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}})^{\prime}(\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}})}{(\mathbf{y}-\mathbf{1}\hat{\beta}_{0})^{\prime}(\mathbf{y}-\mathbf{1}\hat{\beta}_{0})}\]
      • As we add more predictors to \(\mathbf{X}\), \(R^{2}\) monotonically increases when calculated using in-sample data.
      • Calculating \(R^{2}\) using in-sample data results in a biased estimator of the \(R^{2}\) that would be obtained using out-of-sample data (i.e., \(\text{E}(\hat{R}_{\text{in-sample}}^{2})\neq R_{\text{out-of-sample}}^{2}\))
      • Adjusted \(R^{2}\) is a bias-corrected version of \(R^{2}\): \[\hat{R}^{2}_{\text{adjusted}}=1-\bigg(\frac{n-1}{n-p}\bigg)\frac{(\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}})^{\prime}(\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}})}{(\mathbf{y}-\mathbf{1}\hat{\beta}_{0})^{\prime}(\mathbf{y}-\mathbf{1}\hat{\beta}_{0})}\]
      • Many different estimators exist for adjusted \(R^{2}\)
        • Different assumptions lead to different estimators
    • Gelman, Hwang, and Vehtari (2014) on estimating out-of-sample predictive accuracy using in-sample data
      • “We know of no approximation that works in general….Each of these methods has flaws, which tells us that any predictive accuracy measure that we compute will be only approximate”
      • “One difficulty is that all the proposed measures are attempting to perform what is, in general, an impossible task: to obtain an unbiased (or approximately unbiased) and accurate measure of out-of-sample prediction error that will be valid over a general class of models…”
  • Akaike’s Information Criterion
    • The log-likelihood as a scoring rule \[\mathcal{l}(\boldsymbol{\theta}|\mathbf{y})\]
    • The log-likelihood for the linear model \[\mathcal{l}(\boldsymbol{\beta},\sigma^2|\mathbf{y})=-\frac{n}{2}\text{ log}(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_{i} - \mathbf{x}_{i}^{\prime}\boldsymbol{\beta})^2.\]
      • Like \(R^2\), \(\mathcal{l}(\boldsymbol{\beta},\sigma^2|\mathbf{y})\) can be used to measure predictive accuracy on in-sample or out-of-sample data.
      • Unlike \(R^2\), the log-likelihood is a general scoring rule. For example, if \(y_{i} \sim \text{Bernoulli}(p)\), then the log-likelihood is \[\sum_{i=1}^{n}y_{i}\text{log}(p)+(1-y_{i})\text{log}({1-p}).\]
      • The log-likelihood is a relative score and can be difficult to interpret.
    • AIC uses the in-sample log-likelihood, but achieves an (asymptotically) unbiased estimate of the expected out-of-sample log-likelihood by adding a bias correction; the statistic is then multiplied by negative two: \[AIC=-2\mathcal{l}(\boldsymbol{\theta}|\mathbf{y}) + 2q,\] where \(q\) is the number of estimated parameters.
      • What is a meaningful difference in AIC? (Revisited in the bootstrap example below.)
      • AIC is a derived quantity. Just like \(R^2\), we can and should get measures of uncertainty (e.g., a 95% CI).
  • Example
    • Motivation: Annual mean temperature in Ann Arbor, Michigan
library(faraway)
plot(aatemp$year,aatemp$temp,xlab="Year",ylab="Annual mean temperature",pch=20,col=rgb(0.5,0.5,0.5,0.7))

df.in_sample <- aatemp # Use all years for in-sample data

# Intercept only model
m1 <- lm(temp ~ 1,data=df.in_sample)

# Simple regression model
m2 <- lm(temp ~ year,data=df.in_sample)

par(mfrow=c(1,1))
# Plot predictions (in-sample)
plot(df.in_sample$year,df.in_sample$temp,main="In-sample",xlab="Year",ylab="Annual mean temperature",ylim=c(44,52),pch=20,col=rgb(0.5,0.5,0.5,0.7))
lines(df.in_sample$year,predict(m2))

# Broken stick basis function
f1 <- function(x,c){ifelse(x<c,c-x,0)}
f2 <- function(x,c){ifelse(x>=c,x-c,0)}
m3 <- lm(temp ~ f1(year,1930) + f2(year,1930),data=df.in_sample)

par(mfrow=c(1,1))
# Plot predictions (in-sample)
plot(df.in_sample$year,df.in_sample$temp,main="In-sample",xlab="Year",ylab="Annual mean temperature",ylim=c(44,52),pch=20,col=rgb(0.5,0.5,0.5,0.7))
lines(df.in_sample$year,predict(m3))

library(rpart)
# Fit regression tree model
m4 <- rpart(temp ~ year,data=df.in_sample)

par(mfrow=c(1,1))
# Plot predictions (in-sample)
plot(df.in_sample$year,df.in_sample$temp,main="In-sample",xlab="Year",ylab="Annual mean temperature",ylim=c(44,52),pch=20,col=rgb(0.5,0.5,0.5,0.7))
lines(df.in_sample$year,predict(m4))

#Adjusted R-squared
summary(m1)$adj.r.square
## [1] 0
summary(m2)$adj.r.square
## [1] 0.0772653
summary(m3)$adj.r.square
## [1] 0.07692598
# Not clear how to calculate adjusted R^2 for m4 (regression tree)

# AIC 
y <- df.in_sample$temp
y.hat_m1 <- predict(m1,newdata=df.in_sample)
y.hat_m2 <- predict(m2,newdata=df.in_sample)
y.hat_m3 <- predict(m3,newdata=df.in_sample)
y.hat_m4 <- predict(m4,newdata=df.in_sample)

sigma2.hat_m1 <- summary(m1)$sigma^2
sigma2.hat_m2 <- summary(m2)$sigma^2
sigma2.hat_m3 <- summary(m3)$sigma^2

# m1
#log-likelihood
ll <- sum(dnorm(y,y.hat_m1,sigma2.hat_m1^0.5,log=TRUE)) # same as logLik(m1)
q <- 1 + 1 # number of estimated betas + number estimated sigma2s
#AIC
-2*ll+2*q #Same as AIC(m1)
## [1] 426.6276
# m2
#log-likelihood
ll <- sum(dnorm(y,y.hat_m2,sigma2.hat_m2^0.5,log=TRUE)) # same as logLik(m2)
q <-  2 + 1 # number of estimated betas + number estimated sigma2s
#AIC
-2*ll+2*q #Same as AIC(m2)
## [1] 418.38
# m3
#log-likelihood
ll <- sum(dnorm(y,y.hat_m3,sigma2.hat_m3^0.5,log=TRUE)) # same as logLik(m3)
q <- 3 + 1 # number of estimated betas + number estimated sigma2s
#AIC
-2*ll+2*q #Same as AIC(m3)
## [1] 419.4223
# m4
n <- length(y)
q <- 7 + 6 + 1 # number of estimated "intercept" terms in each terminal node + number of estimated "breaks" + number estimated sigma2s
sigma2.hat_m4 <- t(y - y.hat_m4)%*%(y - y.hat_m4)/(n - q - 1)
# log-likelihood
ll <- sum(dnorm(y,y.hat_m4,sigma2.hat_m4^0.5,log=TRUE)) 
#AIC
-2*ll+2*q
## [1] 406.5715
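    • As a quick check, R’s built-in AIC() function should essentially reproduce the hand calculations for m1–m3; small differences can arise because summary()$sigma uses the \(n-p\) denominator while logLik() uses the maximum likelihood estimate of \(\sigma^2\). The regression tree m4 is omitted because rpart does not provide a logLik() method.

# Compare with R's built-in AIC() (m4 omitted; rpart does not provide logLik())
AIC(m1,m2,m3)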
  • Example: assessing uncertainty in model selection statistics

    library(mosaic)
    
    boot <- function(){
    m1 <- lm(temp ~ 1,data=resample(df.in_sample))
    m2 <- lm(temp ~ year,data=resample(df.in_sample))
    m3 <- lm(temp ~ f1(year,1930) + f2(year,1930),data=resample(df.in_sample))
    cbind(AIC(m1),AIC(m2),AIC(m3))}
    
    bs.aic.samples <- do(1000)*boot()
    
    par(mfrow=c(3,1))
    hist(bs.aic.samples[,1],col="grey",main="m1",freq=FALSE,xlab="AIC score",xlim=range(bs.aic.samples))
    abline(v=AIC(m1),col="gold",lwd=3)
    hist(bs.aic.samples[,2],col="grey",main="m2",freq=FALSE,xlab="AIC score",xlim=range(bs.aic.samples))
    abline(v=AIC(m2),col="gold",lwd=3)
    hist(bs.aic.samples[,3],col="grey",main="m3",freq=FALSE,xlab="AIC score",xlim=range(bs.aic.samples))
    abline(v=AIC(m3),col="gold",lwd=3)
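
    • One way to gauge what a “meaningful” difference in AIC is: examine the bootstrap distribution of pairwise AIC differences, e.g., how often m2 beats m1 across resamples (a short sketch reusing bs.aic.samples from above).

    # Bootstrap distribution of the difference in AIC between m1 and m2
    d.12 <- bs.aic.samples[,1] - bs.aic.samples[,2]
    
    par(mfrow=c(1,1))
    hist(d.12,col="grey",main="",xlab="AIC(m1) - AIC(m2)",freq=FALSE)
    abline(v=0,col="gold",lwd=3)
    
    # Proportion of resamples in which m2 has the lower (better) AIC
    mean(d.12 > 0)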

  • Summary

    • How should we make statistical inference after model selection?
    • Discipline-specific views (e.g., AIC in wildlife ecology).
    • There are many choices for model selection using in-sample data. Knowing the “best” model selection approach would require you to know a lot about all other procedures.
      • AIC (Akaike 1973)
      • BIC (Schwarz 1978)
      • CAIC (Vaida & Blanchard 2005)
      • DIC (Spiegelhalter et al. 2002)
      • ERIC (Hui et al. 2015)
      • And the list goes on
    • I would recommend using methods that you understand.
    • In my experience it takes a larger data set to do both estimation and model selection.
      • If you have a large data set, split it and use the held-out (out-of-sample) portion to assess predictive accuracy for model selection
      • If you don’t have a large data set, make sure you assess the uncertainty. You might not have enough data to do both estimation and model selection.

32.3 Random effects

  • Conditional, marginal, and joint distributions

    • Conditional distribution: “\(\mathbf{y}\) given \(\boldsymbol{\beta}\)” \[\mathbf{y}|\boldsymbol{\beta}\sim\text{N}(\mathbf{X}\boldsymbol{\beta},\sigma_{\varepsilon}^{2}\mathbf{I})\] This is also sometimes called the data model.
    • Marginal distribution of \(\boldsymbol{\beta}\) \[\boldsymbol{\beta}\sim\text{N}(\mathbf{0},\sigma_{\beta}^{2}\mathbf{I})\] This is also called the parameter model.
    • Joint distribution \([\mathbf{y}|\boldsymbol{\beta}][\boldsymbol{\beta}]\)
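    • One way to see this hierarchy concretely is to simulate from it: draw \(\boldsymbol{\beta}\) from the parameter model, then \(\mathbf{y}\) given \(\boldsymbol{\beta}\) from the data model. A minimal sketch with arbitrary values of \(n\), \(p\), \(\sigma^{2}_{\beta}\), and \(\sigma^{2}_{\varepsilon}\):

    # Simulate from the joint distribution [y|beta][beta]
    set.seed(2022)
    n <- 50; p <- 4
    sigma2.beta <- 1   # variance of the parameter model
    sigma2.eps <- 0.25 # variance of the data model
    X <- matrix(rnorm(n*p),n,p)
    
    beta <- rnorm(p,0,sqrt(sigma2.beta))     # parameter model: beta ~ N(0,sigma2.beta*I)
    y <- rnorm(n,X%*%beta,sqrt(sigma2.eps))  # data model: y|beta ~ N(X*beta,sigma2.eps*I)
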
  • Estimation for the linear model with random effects

    • By maximizing the joint likelihood
      • Write out the likelihood function \[\mathcal{L}(\boldsymbol{\beta},\sigma_{\varepsilon}^{2},\sigma_{\beta}^{2})=(2\pi)^{-\frac{n}{2}}|\sigma^{2}_{\varepsilon}\mathbf{I}|^{-\frac{1}{2}}\text{exp}\bigg(-\frac{1}{2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{\prime}(\sigma^{2}_{\varepsilon}\mathbf{I})^{-1}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\bigg)\times\\(2\pi)^{-\frac{p}{2}}|\sigma^{2}_{\beta}\mathbf{I}|^{-\frac{1}{2}}\text{exp}\bigg(-\frac{1}{2}\boldsymbol{\beta}^{\prime}(\sigma^{2}_{\beta}\mathbf{I})^{-1}\boldsymbol{\beta}\bigg)\]
      • Maximizing the log-likelihood function w.r.t. \(\boldsymbol{\beta}\) gives \[\hat{\boldsymbol{\beta}}=\big(\mathbf{X}^{\prime}\mathbf{X}+\frac{\sigma^{2}_{\varepsilon}}{\sigma^{2}_{\beta}}\mathbf{I}\big)^{-1}\mathbf{X}^{\prime}\mathbf{y}\]
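      • A minimal numerical sketch of this closed-form estimator, assuming \(\sigma^{2}_{\varepsilon}\) and \(\sigma^{2}_{\beta}\) are known and reusing X, y, and the simulated values from the sketch above; the ordinary least squares estimate is shown for comparison.

    # Closed-form estimator from maximizing the joint log-likelihood (variance components known)
    beta.hat.re <- solve(t(X)%*%X + (sigma2.eps/sigma2.beta)*diag(p))%*%t(X)%*%y
    
    # Ordinary least squares ("fixed effects") estimator for comparison
    beta.hat.ols <- solve(t(X)%*%X)%*%t(X)%*%y
    
    # The random effects estimates are pulled ("shrunk") toward zero relative to OLS
    cbind(truth=beta,ols=as.numeric(beta.hat.ols),random.effect=as.numeric(beta.hat.re))
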
  • Example eye temperature data set for Eurasian blue tit

    df.et <- read.table("https://www.dropbox.com/s/bo11n66cezdnbia/eyetemp.txt?dl=1")
    names(df.et) <- c("id","sex","date","time","event","airtemp","relhum","sun","bodycond","eyetemp")
    df.et$id <- as.factor(df.et$id)
    df.et$id.num <- as.numeric(df.et$id)
    
    ind.means <- aggregate(eyetemp ~ id.num,df.et, FUN=base::mean)
    plot(df.et$id.num,df.et$eyetemp,xlab="Individual",ylab="Eye temperature")
    points(ind.means,pch="x",col="gold",cex=1.5)

    # Standard "fixed effects" linear regression model
    m1 <- lm(eyetemp~as.factor(id.num)-1,data=df.et)
    summary(m1)
    ## 
    ## Call:
    ## lm(formula = eyetemp ~ as.factor(id.num) - 1, data = df.et)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -3.7378 -0.5750  0.1507  0.6622  3.0286 
    ## 
    ## Coefficients:
    ##                     Estimate Std. Error t value Pr(>|t|)    
    ## as.factor(id.num)1   29.3423     0.1972  148.77   <2e-16 ***
    ## as.factor(id.num)2   28.8735     0.1725  167.41   <2e-16 ***
    ## as.factor(id.num)3   27.5458     0.2053  134.19   <2e-16 ***
    ## as.factor(id.num)4   28.0156     0.1778  157.59   <2e-16 ***
    ## as.factor(id.num)5   30.8188     0.1211  254.56   <2e-16 ***
    ## as.factor(id.num)6   32.6486     0.1700  192.06   <2e-16 ***
    ## as.factor(id.num)7   29.6800     0.3180   93.33   <2e-16 ***
    ## as.factor(id.num)8   29.0000     0.2789  103.97   <2e-16 ***
    ## as.factor(id.num)9   29.4378     0.1499  196.36   <2e-16 ***
    ## as.factor(id.num)10  29.7667     0.1836  162.12   <2e-16 ***
    ## as.factor(id.num)11  28.4714     0.2195  129.74   <2e-16 ***
    ## as.factor(id.num)12  28.8793     0.1867  154.64   <2e-16 ***
    ## as.factor(id.num)13  30.1500     0.7111   42.40   <2e-16 ***
    ## as.factor(id.num)14  30.6500     0.7111   43.10   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 1.006 on 358 degrees of freedom
    ## Multiple R-squared:  0.9989, Adjusted R-squared:  0.9989 
    ## F-statistic: 2.31e+04 on 14 and 358 DF,  p-value: < 2.2e-16
    plot(df.et$id.num,df.et$eyetemp,xlab="Individual",ylab="Expected eye temperature",col="white")
    points(1:14,coef(m1),pch=20,col="black",cex=1)
    segments (1:14,confint(m1)[,1],1:14,confint(m1)[,2] ,lwd=2,lty=1)

    # Random effects regression model (using nlme package; difficult to get confidence interval)
    library(nlme)
    m2 <- lme(fixed = eyetemp ~ 1,random= list(id = pdIdent(~id-1)),data=df.et,method="ML")#summary(m2)
    #predict(m2,level=1,newdata=data.frame(id=unique(df.et$id)))
    
    # Random effects regression model (mgcv package)
    library(mgcv)
    m2 <- gam(eyetemp ~ 1+s(id,bs="re"),data=df.et,method="ML")
    
    plot(df.et$id.num,df.et$eyetemp,xlab="Individual",ylab="Expected eye temperature",col="white")
    points(1:14-0.15,coef(m1),pch=20,col="black",cex=1)
    segments (1:14-0.15,confint(m1)[,1],1:14-0.15,confint(m1)[,2] ,lwd=2,lty=1)
    pred <- predict(m2,newdata=data.frame(id=unique(df.et$id)),se.fit=TRUE)
    pred$lci <- pred$fit - 1.96*pred$se.fit
    pred$uci <- pred$fit + 1.96*pred$se.fit
    points(1:14+0.15,pred$fit,pch=20,col="green",cex=1)
    segments (1:14+0.15,pred$lci,1:14+0.15,pred$uci,lwd=2,lty=1,col="green")
    abline(a=coef(m2)[1],b=0,col="grey")
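
    • Putting the two fits side by side makes the shrinkage concrete: the mixed-model predictions for each individual are pulled from the fixed-effects means toward the overall intercept (reusing m1 and m2 from above).

    # Individual means from the fixed effects model (m1) vs. the random effects model (m2)
    fixed <- coef(m1)
    mixed <- as.numeric(predict(m2,newdata=data.frame(id=unique(df.et$id))))
    cbind(fixed,mixed,overall=coef(m2)[1])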

  • Summary

    • Take home message
      • Treating \(\boldsymbol{\beta}\) as a random effect “shrinks” the estimates of \(\boldsymbol{\beta}\) closer to zero.
    • Requires an extra assumption \(\boldsymbol{\beta}\sim\text{N}(\mathbf{0},\sigma_{\beta}^{2}\mathbf{I})\)
      • Penalty
      • Random effect
      • This is called a prior distribution in Bayesian statistics

32.4 Generalized linear models

  • The generalized linear model framework \[\mathbf{y} \sim[\mathbf{y}|\boldsymbol{\mu},\psi]\] \[g(\boldsymbol{\mu})=\mathbf{X}\boldsymbol{\beta}\]

    • \([\mathbf{y}|\boldsymbol{\mu},\psi]\) is a distribution from the exponential family (e.g., normal, Poisson, binomial) with expected value \(\boldsymbol{\mu}\) and dispersion parameter \(\psi\).
      • Example \([\mathbf{y}|\boldsymbol{\mu},\psi]=\text{N}\left(\boldsymbol{\mu},\psi\mathbf{I}\right)\)
      • Example \([\mathbf{y}|\boldsymbol{\mu},\psi]=\text{Poisson}\left(\boldsymbol{\mu}\right)\)
      • Example \([\mathbf{y}|\boldsymbol{\mu},\psi]=\text{gamma}\left(\boldsymbol{\mu},\psi\right)\)
  • Estimation

    • Use numerical maximization to estimate \(\boldsymbol{\beta}\)
    • \(\hat{\boldsymbol{\beta}}=(\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime}\mathbf{y}\) is NOT the MLE for the GLM
    • Inference is based on asymptotic properties of the MLE.
    • Example: Challenger data
      • The statistical model: \[\mathbf{y}\sim\text{binomial}(6,\mathbf{p})\] \[\text{logit}(\mathbf{p})=\mathbf{X}\boldsymbol{\beta}.\]
      • The PMF for the binomial distribution \[\binom{N}{y}p^{y}(1-p)^{N-y}\]
      • Write out the likelihood function \[\mathcal{L}(\boldsymbol{\beta})=\prod_{i=1}^n\binom{6}{y_i}p_i^{y_i}(1-p_i)^{6-y_i}\]
    • Maximize the log-likelihood function
    challenger <- read.csv("https://www.dropbox.com/s/ezxj8d48uh7lzhr/challenger.csv?dl=1")
    
    plot(challenger$Temp,challenger$O.ring,xlab="Temperature",ylab="Number of incidents",xlim=c(10,80),ylim=c(0,6))    

    y <- challenger$O.ring
    X <- model.matrix(~Temp,data= challenger)
    
    nll <- function(beta){
      p <- 1/(1+exp(-X%*%beta)) # Inverse of logit link function
      -sum(dbinom(y,6,p,log=TRUE)) # Sum of the log likelihood for each observation 
      }
    est <- optim(par=c(0,0),fn=nll,hessian=TRUE)
    
    # MLE of beta
    est$par
    ## [1]  6.751192 -0.139703
    # Var(beta.hat)
    diag(solve(est$hessian))
    ## [1] 8.830866682 0.002145791
    # Std.Error
    diag(solve(est$hessian))^0.5
    ## [1] 2.97167742 0.04632268
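    • Because inference relies on the asymptotic normality of the MLE, approximate 95% Wald confidence intervals follow directly from the estimates and Hessian obtained above:
    # Approximate 95% Wald confidence intervals: estimate +/- 1.96 * standard error
    se <- diag(solve(est$hessian))^0.5
    cbind(lower=est$par - 1.96*se,estimate=est$par,upper=est$par + 1.96*se)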
    • Using the glm() function
    m1 <- glm(cbind(challenger$O.ring,6-challenger$O.ring) ~ Temp, family = binomial(link = "logit"),data=challenger)
    summary(m1)
    ## 
    ## Call:
    ## glm(formula = cbind(challenger$O.ring, 6 - challenger$O.ring) ~ 
    ##     Temp, family = binomial(link = "logit"), data = challenger)
    ## 
    ## Coefficients:
    ##             Estimate Std. Error z value Pr(>|z|)   
    ## (Intercept)  6.75183    2.97989   2.266  0.02346 * 
    ## Temp        -0.13971    0.04647  -3.007  0.00264 **
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 28.761  on 22  degrees of freedom
    ## Residual deviance: 19.093  on 21  degrees of freedom
    ## AIC: 36.757
    ## 
    ## Number of Fisher Scoring iterations: 5
    • Prediction
      • We can get predictions using \(\mathbf{X}\hat{\boldsymbol{\beta}}\) or the inverse of the logit link function \(\hat{\boldsymbol{\mu}}=\frac{1}{1+e^{-\mathbf{X}\hat{\boldsymbol{\beta}}}}\)
      • Confidence intervals for \(\mathbf{X}\hat{\boldsymbol{\beta}}\) are straightforward
      • Confidence intervals for \(\hat{\boldsymbol{\mu}}=\frac{1}{1+e^{-\mathbf{X}\hat{\boldsymbol{\beta}}}}\) are not straightforward because of the nonlinear transformation of \(\hat{\boldsymbol{\beta}}\); one common workaround is sketched after the code block below
    beta.hat <- coef(m1)
    
    new.temp <- seq(0,80,by=0.01)
    X.new <- model.matrix(~new.temp)
    
    mu.hat <- (1/(1+exp(-X.new%*%beta.hat)))
    
    plot(challenger$Temp,challenger$O.ring,xlab="Temperature",ylab="Number of incidents",xlim=c(10,80),ylim=c(0,6))
    points(new.temp,6*mu.hat,typ="l",col="red")
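
    • One common workaround for the nonlinear transformation (referenced in the bullets above): compute a Wald interval on the link scale, where the normal approximation is more reasonable, and then push the endpoints through the inverse link. A sketch reusing m1 and new.temp from above:
    # Wald CI on the link (logit) scale, then back-transform with the inverse link
    pred.link <- predict(m1,newdata=data.frame(Temp=new.temp),type="link",se.fit=TRUE)
    lci <- plogis(pred.link$fit - 1.96*pred.link$se.fit)
    uci <- plogis(pred.link$fit + 1.96*pred.link$se.fit)
    
    plot(challenger$Temp,challenger$O.ring,xlab="Temperature",ylab="Number of incidents",xlim=c(10,80),ylim=c(0,6))
    points(new.temp,6*plogis(pred.link$fit),typ="l",col="red")
    points(new.temp,6*lci,typ="l",col="red",lty=2)
    points(new.temp,6*uci,typ="l",col="red",lty=2)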

    • Predictive distribution
    library(mosaic)
    
    predict.bs <- function(){
      m1 <- glm(cbind(O.ring,6-O.ring) ~ Temp, family = binomial(link = "logit"),data=resample(challenger))
      p <- predict(m1,newdata=data.frame(Temp=31),type="response")
      y <- rbinom(1,6,p)
      y
    }
    
    bootstrap <- do(1000)*predict.bs()
    
    par(mar = c(5, 4, 7, 2))
    hist(bootstrap[,1], col = "grey", xlab = "Number of O-ring failures", main = "Empirical distribution of O-ring failures at 31 degrees", freq = FALSE,right=FALSE,breaks=c(-0.5:6.5))

    # 95% equal-tailed CI based on percentiles of the empirical distribution
    quantile(bootstrap[,1], prob = c(0.025, 0.975))
    ##  2.5% 97.5% 
    ##     0     6
    # Probability of all 6 O-rings failing
    length(which(bootstrap[,1]==6))/dim(bootstrap)[1]
    ## [1] 0.573
  • Summary

    • The generalized linear model is useful when the distributional assumptions of the linear model are untenable.
      • Small counts
      • Binary data
    • Two cultures
      • All distributions are approximations of an unknown data-generating process, so why does it matter? Use the linear model because it is easier to understand.
      • Personal philosophy
        • Picking a distribution whose support matches that of the data gets closer to the etiology of the process that generated the data.
        • Easier for me to think scientifically about the process

32.5 Transformations of the response

  • Big picture idea

    • Select a function that transforms \(\mathbf{y}\) such that the assumptions of normality are approximately correct.
  • Two cultures

    • This is with respect to \(\mathbf{y}\) and not the predictors. Transforming the predictors is not at all controversial.
    • Trade-offs
      • Difficult to apply sometimes (e.g., \(\text{log}(\mathbf{y}+c)\) for count data)
      • Difficult to interpret on the transformed scale
    • If normality is an appropriate assumption, we can use the general linear model \(\textbf{y}\sim\text{N}(\mathbf{X}\boldsymbol{\beta},\mathbf{V})\)
  • Example

    • Number of whooping cranes each year
    url <- "https://www.dropbox.com/s/ihs3as87oaxvmhq/Butler%20et%20al.%20Table%201.csv?dl=1"
    df1 <- read.csv(url)
    plot(df1$Winter,df1$N,typ="b",xlab="Year",ylab="Population Size",ylim=c(0,300),lwd=1.5,pch="*")

    • Fit the model \(\mathbf{y}=\beta_{0}+\beta_{1}\mathbf{x}+\boldsymbol{\varepsilon}\) where \(\boldsymbol{\varepsilon}\sim\text{N}(\mathbf{0},\sigma^{2}\mathbf{I})\) and \(\mathbf{x}\) is the year.
    m1 <- lm(N~Winter,data=df1)
    plot(df1$Winter,df1$N,typ="b",xlab="Year",ylab="Population Size",ylim=c(0,300),lwd=1.5,pch="*")
    abline(m1,col="gold")

    • Check model assumptions
    e.hat <- residuals(m1)
    
    #Check normality
    hist(e.hat,col="grey",breaks=10,main="",xlab=expression(hat(epsilon)))

    shapiro.test(e.hat)
    ## 
    ##  Shapiro-Wilk normality test
    ## 
    ## data:  e.hat
    ## W = 0.96653, p-value = 0.09347
    #Check constant variance
    plot(df1$Winter,e.hat,xlab="Year",ylab=expression(hat(epsilon)))

    #Check for autocorrelation among residuals
    plot(e.hat[1:60],e.hat[2:61],xlab=expression(hat(epsilon)[t]),ylab=expression(hat(epsilon)[t+1]))

    cor(e.hat[1:60],e.hat[2:61])
    ## [1] 0.9273559
    library(lmtest)
    dwtest(m1)
    ## 
    ##  Durbin-Watson test
    ## 
    ## data:  m1
    ## DW = 0.13355, p-value < 2.2e-16
    ## alternative hypothesis: true autocorrelation is greater than 0
    • Log transformation (i.e., \(\text{log}(\mathbf{y})\)).
      • \(\text{log}(\mathbf{y})=\beta_{0}+\beta_{1}\mathbf{x}+\boldsymbol{\varepsilon}\) where \(\boldsymbol{\varepsilon}\sim\text{N}(\mathbf{0},\sigma^{2}\mathbf{I})\) and \(\mathbf{x}\) is the year.
      • What model is implied by this transformation?
        • If \(z\sim\text{N}(\mu,\sigma^{2})\), then \(\text{E}(e^{z})=e^{\mu+\frac{\sigma^{2}}{2}}\) and \(\text{Var}(e^{z})=(e^{\sigma^{2}}-1)e^{2\mu+\sigma^{2}}\) (a quick numerical check appears after the code block below)
      • Fit the model
    m2 <- lm(log(N)~Winter,data=df1)
    plot(df1$Winter,df1$N,typ="b",xlab="Year",ylab="Population Size",ylim=c(0,300),lwd=1.5,pch="*")
    abline(m1,col="gold")
    
    ev.y.hat <- exp(predict(m2)) # Possible because of invariance property of MLEs
    points(df1$Winter,ev.y.hat,typ="l",col="deepskyblue")
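
    • Before checking the model assumptions, a quick Monte Carlo check of the log-normal moment identity stated above, with arbitrary values of \(\mu\) and \(\sigma^{2}\):
    # Check E(e^z) = exp(mu + sigma^2/2) and Var(e^z) = (exp(sigma^2)-1)*exp(2*mu + sigma^2)
    set.seed(2022)
    mu <- 3; sigma2 <- 0.25
    z <- rnorm(1e5,mu,sqrt(sigma2))
    
    c(mean(exp(z)),exp(mu + sigma2/2))                   # Monte Carlo vs. analytical mean
    c(var(exp(z)),(exp(sigma2) - 1)*exp(2*mu + sigma2))  # Monte Carlo vs. analytical variance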

    • Check model assumptions

    e.hat <- residuals(m2)
    
    #Check normality
    hist(e.hat,col="grey",breaks=10,main="",xlab=expression(hat(epsilon)))

    shapiro.test(e.hat)
    ## 
    ##  Shapiro-Wilk normality test
    ## 
    ## data:  e.hat
    ## W = 0.97059, p-value = 0.1489
    #Check constant variance
    plot(df1$Winter,e.hat,xlab="Year",ylab=expression(hat(epsilon)))

    • Confidence intervals and prediction intervals

    plot(df1$Winter,df1$N,typ="b",xlab="Year",ylab="Population Size",ylim=c(0,300),lwd=1.5,pch="*")
    ev.y.hat <- exp(predict(m2)) # Possible because of invariance property of MLEs
    points(df1$Winter,ev.y.hat,typ="l",col="deepskyblue")

    # CI and PI for log(y)
    predict(m2,newdata=df1[1,],interval="confidence")
    ##        fit    lwr      upr
    ## 1 3.080718 3.0223 3.139136
    predict(m2,newdata=df1[1,],interval="prediction")
    ##        fit      lwr      upr
    ## 1 3.080718 2.842504 3.318933
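    • To report these intervals on the original population-size scale, exponentiate the endpoints; note that the back-transformed fit is the median of the implied log-normal distribution, not its mean.
    # Back-transform the CI and PI for log(y) to the scale of y
    exp(predict(m2,newdata=df1[1,],interval="confidence"))
    exp(predict(m2,newdata=df1[1,],interval="prediction"))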

Literature cited

Gelman, Andrew, Jessica Hwang, and Aki Vehtari. 2014. “Understanding Predictive Information Criteria for Bayesian Models.” Statistics and Computing 24 (6): 997–1016.