35 Day 34 (July 24)

35.1 Announcements

Teval
Final grades
More about challenger example here

35.2 Transformations of the response

Big picture idea
- Select a function that transforms $\mathbf{y}$ such that the assumptions of normality are approximately correct.
Two cultures
- This is with respect to $\mathbf{y}$ and not the predictors. Transforming the predictors is not at all controversial.
- Trade offs
- Difficult to apply sometimes (e.g., $\text{log(}\mathbf{y}+c)$ for count data)
- Difficult to interpret on the transformed scale
- If normality is an appropriate assumption, we can use the general linear model $\textbf{y}\sim\text{N}(\mathbf{X}\boldsymbol{\beta},\mathbf{V})$

Example

Number of whooping cranes each year

url <- "https://www.dropbox.com/s/ihs3as87oaxvmhq/Butler%20et%20al.%20Table%201.csv?dl=1"
df1 <- read.csv(url)
plot(df1$Winter,df1$N,typ="b",xlab="Year",ylab="Population Size",ylim=c(0,300),lwd=1.5,pch="*")

Fit the model $\mathbf{y}=\beta_{0}+\beta_{1}\mathbf{x}+\boldsymbol{\varepsilon}$ where $\boldsymbol{\varepsilon}\sim\text{N}(\mathbf{0},\sigma^{2}\mathbf{I})$ and $\mathbf{x}$ is the year.

m1 <- lm(N~Winter,data=df1)
plot(df1$Winter,df1$N,typ="b",xlab="Year",ylab="Population Size",ylim=c(0,300),lwd=1.5,pch="*")
abline(m1,col="gold")

Check model assumptions

e.hat <- residuals(m1)

#Check normality
hist(e.hat,col="grey",breaks=10,main="",xlab=expression(hat(epsilon)))

shapiro.test(e.hat)

## 
##  Shapiro-Wilk normality test
## 
## data:  e.hat
## W = 0.96653, p-value = 0.09347

#Check constant variance
plot(df1$Winter,e.hat,xlab="Year",ylab=expression(hat(epsilon)))

#Check for autocorrelation among residuals
plot(e.hat[1:60],e.hat[2:61],xlab=expression(hat(epsilon)[t]),ylab=expression(hat(epsilon)[t+1]))

cor(e.hat[1:60],e.hat[2:61])

## [1] 0.9273559

library(lmtest)
dwtest(m1)

## 
##  Durbin-Watson test
## 
## data:  m1
## DW = 0.13355, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0

Log transformation (i.e., $\text{log}(\mathbf{y})$ ).
- $\text{log}(\mathbf{y})=\beta_{0}+\beta_{1}\mathbf{x}+\boldsymbol{\varepsilon}$ where $\boldsymbol{\varepsilon}\sim\text{N}(\mathbf{0},\sigma^{2}\mathbf{I})$ and $\mathbf{x}$ is the year.
- What model is implied by this transformation?
  - If $z\sim\text{N}(\mu,\sigma^{2})$ , then $\text{E}(e^{z})=e^{\mu+\frac{\sigma^{2}}{2}}$ and $\text{Var}(e^{z})=(e^{\sigma^{2}}-1)e^{2\mu+\sigma^{2}}$
- Fit the model

m2 <- lm(log(N)~Winter,data=df1)
plot(df1$Winter,df1$N,typ="b",xlab="Year",ylab="Population Size",ylim=c(0,300),lwd=1.5,pch="*")
abline(m1,col="gold")

ev.y.hat <- exp(predict(m2)) # Possible because of invariance property of MLEs
points(df1$Winter,ev.y.hat,typ="l",col="deepskyblue")

- Check model assumptions

e.hat <- residuals(m2)

#Check normality
hist(e.hat,col="grey",breaks=10,main="",xlab=expression(hat(epsilon)))

shapiro.test(e.hat)

## 
##  Shapiro-Wilk normality test
## 
## data:  e.hat
## W = 0.97059, p-value = 0.1489

#Check constant variance
plot(df1$Winter,e.hat,xlab="Year",ylab=expression(hat(epsilon)))

- Confidence intervals and prediction intervals

plot(df1$Winter,df1$N,typ="b",xlab="Year",ylab="Population Size",ylim=c(0,300),lwd=1.5,pch="*")
ev.y.hat <- exp(predict(m2)) # Possible because of invariance property of MLEs
points(df1$Winter,ev.y.hat,typ="l",col="deepskyblue")

# CI and PI for log(y)
predict(m2,newdata=df1[1,],interval="confidence")

##        fit    lwr      upr
## 1 3.080718 3.0223 3.139136

predict(m2,newdata=df1[1,],interval="prediction")

##        fit      lwr      upr
## 1 3.080718 2.842504 3.318933