Chapter 4 Prediction, R-squared, and Modeling

rm(list=ls()) # Caution: this clears the Environment

A prediction is an estimate of the value of $$y$$ for a given value of $$x$$, based on a regression model of the form shown in Equation \ref{eq:regmod4}. Goodness-of-fit is a measure of how well an estimated regression line approximates the data in a given sample. One such measure is the correlation coefficient between the predicted values of $$y$$ for all $$x$$-s in the data file and the actual $$y$$-s. Goodness-of-fit, along with other diagnostic tests help determining the most suitable functional form of our regression equation, i.e., the most suitable mathematical relationship between $$y$$ and $$x$$.

$$$y_{i}=\beta_{1}+\beta_{2}x_{i}+e_{i} \label{eq:regmod4}$$$

4.1 Forecasting (Predicting a Particular Value)

Assuming that the expected values of the error term in Equation \ref{eq:regmod4} is zero, Equation \ref{eq:pred4} gives $$\hat{y}_{i}$$, the predicted value of the expectation of $$y_{i}$$ given $$x_{i}$$, where $$b_{1}$$ and $$b_{2}$$ are the (least squares) estimates of the regression parameters $$\beta_{1}$$ and $$\beta_{2}$$.

$$$\hat{y}_{i}=b_{1}+b_{2}x_{i} \label{eq:pred4}$$$

The predicted value $$\hat{y}_{i}$$ is a random variable, since it depends on the sample; therefore, we can calculate a confidence interval and test hypothesis about it, provided we can determine its distribution and variance. The prediction has a normal distribution, being a linear combination of two normally distributed random variables $$b_{1}$$ and $$b_{2}$$, and its variance is given by Equation \ref{eq:varf4}. Please note that the variance in Equation \ref{eq:varf4} is not the same as the one in Equation \ref{eq:varfood}; the former is the variance of the estimated expectation of $$y$$, while the latter is the variance of a particular occurrence of $$y$$. Let us call the latter the variance of the forecast error. Not surprisingly, the variance of the forecast error is greater than the variance of the predicted $$E(y|x)$$.

As before, since we need to use an estimated variance, we use a $$t$$-distribution instead of a normal one. Equation \ref{eq:varf4} applies to any given $$x$$, say $$x_{0}$$, not only to those $$x$$-s in the dataset.

$$$\widehat{var(f_{i})}=\hat{\sigma}\,^2\left[ 1+\frac{1}{N}+\frac{(x_{i}-\bar{x})^2}{\sum_{j=1}^{N}{(x_{j}-\bar{x})^2}} \right], \label{eq:varf4}$$$

which can be reduced to

$$$\widehat{var(f_{i})}=\hat{\sigma}\,^2+\frac{\hat{\sigma}\,^2}{N}+(x_{i}-\bar{x})^2\widehat{var(b_{2})} \label{eq:varf41}$$$

Let’s determine a standard error for the food equation for a household earning $2000 a week, i.e., at $$x=x_{0}=20$$, using Equation \ref{eq:varf41}; to do so, we need to retrieve $$var(b_{2})$$ and $$\hat{\sigma}$$, the standard error of regression from the regression output. library(PoEdata) data("food") alpha <- 0.05 x <- 20 xbar <- mean(food$income)
m1 <- lm(food_exp~income, data=food)
b1 <- coef(m1)[[1]]
b2 <- coef(m1)[[2]]
yhatx <- b1+b2*x
sm1 <- summary(m1)
df <- df.residual(m1)
tcr <- qt(1-alpha/2, df)
N <- nobs(m1)   #number of observations, N
N <- NROW(food) #just another way of finding N
varb2 <- vcov(m1)[2, 2]
sighat2 <- sm1$sigma^2 # estimated variance varf <- sighat2+sighat2/N+(x-xbar)^2*varb2 #forecast variance sef <- sqrt(varf) #standard error of forecast lb <- yhatx-tcr*sef ub <- yhatx+tcr*sef The result is the confidence interval for the forecast $$(104.13, 471.09)$$, which is, as expected, larger than the confidence interval of the estimated expected value of $$y$$ based on Equation \ref{eq:varfood}. Let us calculate confidence intervals of the forecast for all the observations in the sample and draw the upper and lower limits together with the regression line. Figure 4.1 shows the confidence interval band about the regression line. sef <- sqrt(sighat2+sighat2/N+(food$income-xbar)^2*varb2)
yhatv <- fitted.values(m1)
lbv <- yhatv-tcr*sef
ubv <- yhatv+tcr*sef
xincome <- food$income dplot <- data.frame(xincome, yhatv, lbv, ubv) dplotord <- dplot[order(xincome), ] xmax <- max(dplotord$xincome)
xmin <- min(dplotord$xincome) ymax <- max(dplotord$ubv)
ymin <- min(dplotord$lbv) plot(dplotord$xincome, dplotord$yhatv, xlim=c(xmin, xmax), ylim=c(ymin, ymax), xlab="income", ylab="food expenditure", type="l") lines(dplotord$ubv~dplotord$xincome, lty=2) lines(dplotord$lbv~dplotord$xincome, lty=2) A different way of finding point and interval estimates for the predicted $$E(y|x)$$ and forecasted $$y$$ (please see the distinction I mentioned above) is to use the predict() function in R. This function requires that the values of the independent variable where the prediction (or forecast) is intended have a data frame structure. The next example shows in parallel point and interval estimates of predicted and forecasted food expenditures for income is$2000. As I have pointed out before, the point estimate is the same for both prediction and forecast, but the interval estimates are very different.

incomex=data.frame(income=20)
predict(m1, newdata=incomex, interval="confidence",level=0.95)
fit lwr upr
287.609 258.907 316.311
predict(m1, newdata=incomex, interval="prediction",level=0.95)
fit lwr upr
287.609 104.132 471.085

Let us now use the predict() function to replicate Figure 4.1. The result is Figure 4.2, which shows, besides the interval estimation band, the points in the dataset. (I will create new values for income just for the purpose of plotting.)

xmin <- min(food$income) xmax <- max(food$income)
income <- seq(from=xmin, to=xmax)
ypredict <- predict(m1, newdata=data.frame(income),
interval="confidence")
yforecast <- predict(m1, newdata=data.frame(income),
interval="predict")
matplot(income, cbind(ypredict[,1], ypredict[,2], ypredict[,3],
yforecast[,2], yforecast[,3]),
type ="l", lty=c(1, 2, 2, 3, 3),
col=c("black", "red", "red", "blue", "blue"),
ylab="food expenditure", xlab="income")
points(food$income, food$food_exp)
legend("topleft",
legend=c("E[y|x]", "lwr_pred", "upr_pred",
"lwr_forcst","upr_forcst"),
lty=c(1, 2, 2, 3, 3),
col=c("black", "red", "red", "blue", "blue")
)

Figure 4.2 presents the predicted and forecasted bands on the same graph, to show that they have the same point estimates (the black, solid line) and that the forecasted band is much larger than the predicted one. Put another way, you may think about the distinction between the two types of intervals that we called prediction and forecast as follows: the prediction interval is not supposed to include, say, 95 percent of the points, but to include the regression line, $$E(y|x)$$, with a probability of 95 percent; the forecasted interval, on the other hand, should include any true point with a 95 percent probability.

4.2 Goodness-of-Fit

The total variation of $$y$$ about its sample mean, $$SST$$, can be decomposed in variation about the regression line, $$SSE$$, and variation of the regression line about the mean of $$y$$, $$SSR$$, as Equation \ref{eq:sst4} shows.

$$$SST=SSR+SSE \label{eq:sst4}$$$

The coefficient of determination, $$R^2$$, is defined as the proportion of the variance in $$y$$ that is explained by the regression, $$SSR$$, in the total variation in $$y$$, $$SST$$. Dividing both sides of the Equation \ref{eq:sst4} by $$SST$$ and re-arranging terms gives a formula to calculate $$R^2$$, as shown in Equation \ref{eq:rsqformula4}.

$$$R^2=\frac{SSR}{SST}=1-\frac{SSE}{SST} \label{eq:rsqformula4}$$$

$$R^2$$ takes values between $$0$$ and $$1$$, with higher values showing a closer fit of the regression line to the data. In $$R$$, the value of $$R^2$$ can be retrieved from the summary of the regression model under the name r.squared; for instance, in our food example, $$R^2=0.385$$. $$R^2$$ is also printed as part of the summary of a regression model, as the following code sequence shows. (The parentheses around a command tells $$R$$ to print the result.)

4.4 Residuals and Diagnostics

Regression results are reliable only to the extent to which the underlying assumptions are met. Plotting the residuals and calculating certain test statistics help deciding whether assumptions such as homoskedasticity, serial correlation, and normality of the errors are not violated. In R, the residuals are stored in the vector residuals of the regression output.

ehat <- mod2$residuals plot(food$income, ehat, xlab="income", ylab="residuals")

Figure 4.4 shows the residuals of the of the linear-log equation of the food expenditure example. One can notice that the spread of the residuals seems to be higher at higher incomes, which may indicate that the heteroskedasticity assumption is violated.

Let us draw a residual plot generated with a simulated model that satisfies the regression assumptions. The data generating process is given by Equation \ref{eq:residsim4}, where $$x$$ is a number between $$0$$ and $$10$$, randomly drawn from a uniform distribution, and the error term is randomly drawn from a standard normal distribution. Figure 4.5 illustrates this simulated example.

$$$y_{i}=1+x_{i}+e_{i}, \;\;\; i=1,...,N \label{eq:residsim4}$$$
set.seed(12345)   #sets the seed for the random number generator
x <- runif(300, 0, 10)
e <- rnorm(300, 0, 1)
y <- 1+x+e
mod3 <- lm(y~x)
ehat <- resid(mod3)
plot(x,ehat, xlab="x", ylab="residuals")

The next example illustrates how the residuals look like when a linear functional form is used when the true relationship is, in fact, quadratic. The data generating equation is given in Equation \ref{eq:quadsim4}, where $$x$$ is the same uniformly distributed between $$-2.5$$ and $$2.5$$), and $$e \sim N(0,4)$$. Figure 4.6 shows the residuals from estimating an incorrectly specified, linear econometric model when the correct specification should be quadratic.

$$$y_{i}=15-4x_{i}^2+e_{i}, \;\;\; i=1,...,N \label{eq:quadsim4}$$$
set.seed(12345)
x <- runif(1000, -2.5, 2.5)
e <- rnorm(1000, 0, 4)
y <- 15-4*x^2+e
mod3 <- lm(y~x)
ehat <- resid(mod3)
ymi <- min(ehat)
yma <- max(ehat)
plot(x, ehat, ylim=c(ymi, yma),
xlab="x", ylab="residuals",col="grey")

Another assumption that we would like to test is the normality of the residuals, which assures reliable hypothesis testing and confidence intervals even in small samples. This assumption can be assessed by inspecting a histogram of the residuals, as well as performing a Jarque-Bera test, for which the null hypothesis is “Series is normally distributed”. Thus, a small $$p$$-value rejects the null hypothesis, which means the series fails the normality test. The Jarque-Bera test requires installing and loading the package tseries in R. Figure 4.7 shows a histogram and a superimposed normal distribution for the linear food expenditure model.

library(tseries)
mod1 <- lm(food_exp~income, data=food)
ehat <- resid(m1)
ebar <- mean(ehat)
sde <- sd(ehat)
hist(ehat, col="grey", freq=FALSE, main="",
ylab="density", xlab="ehat")
ylab="density", xlab="ehat")
jarque.bera.test(ehat) #(in package 'tseries')
##
##  Jarque Bera Test
##
## data:  ehat
## X-squared = 0.06334, df = 2, p-value = 0.969

While the histogram in Figure 4.7 may not strongly support one conclusion or another about the normlity of ehat, the Jarque-Bera test is unambiguous: there is no evidence against the normality hypothesis.

4.5 Polynomial Models

Regression models may include quadratic or cubic terms to better describe the nature of the dadta. The following is an example of quadratic and cubic model for the wa_wheat dataset, which gives annual wheat yield in tonnes per hectare in Greenough Shire in Western Australia over a period of 48 years. The linear model is given in Equation \ref{eq:wheateq4}, where the subscript $$t$$ indicates the observation period. $$$yield_{t}=\beta_{1}+\beta_{2}time_{t}+e_{t} \label{eq:wheateq4}$$$
library(PoEdata)
data("wa_wheat")
mod1 <- lm(greenough~time, data=wa_wheat)
ehat <- resid(mod1)
plot(wa_wheat$time, ehat, xlab="time", ylab="residuals") Figure 4.8 shows a pattern in the residuals generated by the linear model, which may inspire us to think of a more appropriate functional form, such as the one in Equation \ref{eq:cubewheat4}. $$$yield_{t}=\beta_{1}+\beta_{2}time_{t}^3+e_{t} \label{eq:cubewheat4}$$$ Please note in the following code sequence the use of the function I(), which is needed in R when an independent variable is transformed by mathematical operators. You do not need the operator I() when an independent variable is transformed through a function such as $$log(x)$$. In our example, the transformation requiring the use of I() is raising time to the power of $$3$$. Of course, you can create a new variable, x3=x^3 if you wish to avoid the use of I() in a regression equation. mod2 <- lm(wa_wheat$greenough~I(time^3), data=wa_wheat)
ehat <- resid(mod2)
plot(wa_wheat$time, ehat, xlab="time", ylab="residuals") Figure 4.9 displays a much better image of the residuals than Figure 4.8, since the residuals are more evenly spread about the zero line. 4.6 Log-Linear Models Transforming the dependent variable with the $$log()$$ function is useful when the variable has a skewed distribution, which is in general the case with amounts that cannot be negative. The $$log()$$ transformation often makes the distribution closer to normal. The general log-linear model is given in Equation \ref{eq:loglin4}. $$$log(y_{i}) = \beta_{1}+ \beta_{2}x_{i}+e_{i} \label{eq:loglin4}$$$ The following formulas are easily derived from the log-linear Equation \ref{eq:loglin4}. The semi-elasticity has here a different interpretation than the one in the linear-log model: here, an increase in $$x$$ by one unit (of $$x$$) produces a change of $$100b_{2}$$ percent in $$y$$. For small changes in $$x$$, the amount $$100b_{2}$$ in the log-linear model can also be interpreted as the growth rate in $$y$$ (corresponding to a unit increase in $$x$$). For instance, if $$x$$ is time, then $$100b_{2}$$ is the growth rate in $$y$$ per unit of time. • Prediction: $$\hat{y}_{n}=exp(b_{1}+b_{2}x)$$, or $$\hat{y}_{c}=exp(b_{1}+b_{2}x+\frac{\hat{\sigma}^2}{2})$$, with the “natural” predictor $$\hat{y}_{n}$$ to be used in small samples and the “corrected” predictor, $$\hat{y}_{c}$$, in large samples • Marginal effect (slope): $$\frac{dy}{dx}=b_{2}y$$ • Semi-elasticity: $$\%\Delta y = 100b_{2}\Delta x$$ Let us do these calculations first for the yield equation using the wa_wheat dataset. mod4 <- lm(log(greenough)~time, data=wa_wheat) smod4 <- summary(mod4) tbl <- data.frame(xtable(smod4)) kable(tbl, caption="Log-linear model for the *yield* equation") Table 4.3: Log-linear model for the yield equation Estimate Std..Error t.value Pr…t.. (Intercept) -0.343366 0.058404 -5.87914 0 time 0.017844 0.002075 8.59911 0 Table 4.3 gives $$b_{2}=0.017844$$, which indicates that the rate of growth in wheat production has increased at an average rate of approximately $$1.78$$ percent per year. The wage log-linear equation provides another example of calculating a growth rate, but this time the independent variable is not $$time$$, but $$education$$. The predictions and the slope are calculated for $$educ=12$$ years. data("cps4_small", package="PoEdata") xeduc <- 12 mod5 <- lm(log(wage)~educ, data=cps4_small) smod5 <- summary(mod5) tabl <- data.frame(xtable(smod5)) kable(tabl, caption="Log-linear 'wage' regression output") Table 4.4: Log-linear ‘wage’ regression output Estimate Std..Error t.value Pr…t.. (Intercept) 1.609444 0.086423 18.6229 0 educ 0.090408 0.006146 14.7110 0 b1 <- coef(smod5)[[1]] b2 <- coef(smod5)[[2]] sighat2 <- smod5$sigma^2
g <- 100*b2               #growth rate
yhatn <- exp(b1+b2*xeduc) #"natural" predictiction
yhatc <- exp(b1+b2*xeduc+sighat2/2) #corrected prediction
DyDx <- b2*yhatn          #marginal effect

Here are the results of these calculations: “natural” prediction $$\hat{y}_{n}=14.796$$; corrected prediction, $$\hat{y}_{c}=16.996$$; growth rate $$g=9.041$$; and marginal effect $$\frac{dy}{dx}=1.34$$. The growth rate indicates that an increase in education by one unit (see the data description using ?cps4_small) increases hourly wage by $$9.041$$ percent.

Figure 4.10 presents the “natural” and the “corrected” regression lines for the wage equation, together with the actual data points.

education=seq(0,22,2)
yn <- exp(b1+b2*education)
yc <- exp(b1+b2*education+sighat2/2)
plot(cps4_small$educ, cps4_small$wage,
xlab="education", ylab="wage", col="grey")
lines(yn~education, lty=2, col="black")
lines(yc~education, lty=1, col="blue")
legend("topleft", legend=c("yc","yn"),
lty=c(1,2), col=c("blue","black"))

The regular $$R^2$$ cannot be used to compare two regression models having different dependent variables such as a linear-log and a log-linear models; when such a comparison is needed, one can use the general $$R^2$$, which is $$R_{g}^2=\lbrack corr(y, \hat{y} \rbrack ^2$$. Let us calculate the generalized $$R^2$$ for the quadratic and the log-linear wage models.

mod4 <- lm(wage~I(educ^2), data=cps4_small)
yhat4 <- predict(mod4)
mod5 <- lm(log(wage)~educ, data=cps4_small)
smod5 <- summary(mod5)
b1 <- coef(smod5)[[1]]
b2 <- coef(smod5)[[2]]
sighat2 <- smod5$sigma^2 yhat5 <- exp(b1+b2*cps4_small$educ+sighat2/2)
rg4 <- cor(cps4_small$wage, yhat4)^2 rg5 <- cor(cps4_small$wage,yhat5)^2

The quadratic model yields $$R_{g}^2=0.188$$, and the log-linear model yields $$R_{g}^2=0.186$$; since the former is higher, we conclude that the quadratic model is a better fit to the data than the log-linear one. (However, other tests of how the two models meet the assumptions of linear refgression may reach a different conclusion; $$R^2$$ is only one of the model selection criteria.)

To determne a forecast interval estimate in the log-linear model, we first construct the interval in logs using the natural predictor $$\hat{y}_{n}$$, then take antilogs of the interval limits. The forecasting error is the same as before, given in Equation \ref{eq:varf41}. The following calculations use an education level equal to 12 and $$\alpha=0.05$$.

# The *wage* log-linear model
# Prediction interval for educ = 12
alpha <- 0.05
xeduc <- 12
xedbar <- mean(cps4_small$educ) mod5 <- lm(log(wage)~educ, data=cps4_small) b1 <- coef(mod5)[[1]] b2 <- coef(mod5)[[2]] df5 <- mod5$df.residual
N <- nobs(mod5)
tcr <- qt(1-alpha/2, df=df5)
smod5 <- summary(mod5)
varb2 <- vcov(mod5)[2,2]
sighat2 <- smod5$sigma^2 varf <- sighat2+sighat2/N+(xeduc-xedbar)^2*varb2 sef <- sqrt(varf) lnyhat <- b1+b2*xeduc lowb <- exp(lnyhat-tcr*sef) upb <- exp(lnyhat+tcr*sef) The result is the confidence interval $$(5.26, \; 41.62)$$. Figure 4.11 shows a $$95\%$$ confidence band for the log-linear wage model. # Drawing a confidence band for the log-linear # *wage* equation xmin <- min(cps4_small$educ)
xmax <- max(cps4_small$educ)+2 education <- seq(xmin, xmax, 2) lnyhat <- b1+b2*education yhat <- exp(lnyhat) varf <- sighat2+sighat2/N+(education-xedbar)^2*varb2 sef <- sqrt(varf) lowb <- exp(lnyhat-tcr*sef) upb <- exp(lnyhat+tcr*sef) plot(cps4_small$educ, cps4_small$wage, col="grey", xlab="education", ylab="wage", ylim=c(0,100)) lines(yhat~education, lty=1, col="black") lines(lowb~education, lty=2, col="blue") lines(upb~education, lty=2, col="blue") legend("topleft", legend=c("yhat", "lowb", "upb"), lty=c(1, 2, 2), col=c("black", "blue", "blue")) 4.7 The Log-Log Model The log-log model has the desirable property that the coefficient of the independent variable is equal to the (constant) elasticity of $$y$$ with respect to $$x$$. Therefore, this model is often used to estimate supply and demand equations. Its standard form is given in Equation \ref{eq:logloggen4}, where $$y$$, $$x$$, and $$e$$ are $$N \times 1$$ vectors. $$$log(y)=\beta_{1}+\beta_{2}log(x)+e \label{eq:logloggen4}$$$ # Calculating log-log demand for chicken data("newbroiler", package="PoEdata") mod6 <- lm(log(q)~log(p), data=newbroiler) b1 <- coef(mod6)[[1]] b2 <- coef(mod6)[[2]] smod6 <- summary(mod6) tbl <- data.frame(xtable(smod6)) kable(tbl, caption="The log-log poultry regression equation") Table 4.5: The log-log poultry regression equation Estimate Std..Error t.value Pr…t.. (Intercept) 3.71694 0.022359 166.2362 0 log(p) -1.12136 0.048756 -22.9992 0 Table 4.5 gives the log-log regression output. The coefficient on $$p$$ indicates that an increase in price by $$1\%$$ changes the quantity demanded by $$-1.121\%$$. # Drawing the fitted values of the log-log equation ngrid <- 20 # number of drawing points xmin <- min(newbroiler$p)
xmax <- max(newbroiler$p) step <- (xmax-xmin)/ngrid # grid dimension xp <- seq(xmin, xmax, step) sighat2 <- smod6$sigma^2
yhatc <- exp(b1+b2*log(newbroiler$p)+sighat2/2) yc <- exp(b1+b2*log(xp)+sighat2/2) #corrected q plot(newbroiler$p, newbroiler$q, ylim=c(10,60), xlab="price", ylab="quantity") lines(yc~xp, lty=1, col="black") # The generalized R-squared: rgsq <- cor(newbroiler$q, yhatc)^2

The generalized $$R^2$$, wich uses the corrected fitted values, is equal to $$0.8818$$.