13 Regression Prediction
Let’s do one more example of an interaction using the ACS data:
load("Data/ACSCountyData.Rdata")
How does the effect of the percent who commute by transit on commute time change as population density increases?
What do we mean by this question? First, we think there is some sort of unconditional relationship between the percent of people who commute by transit and commuting time. Given that, in the US at least, transit commuting is a real second-class option, my expectation is that the more people commute by transit in a location, the longer average commute times will be.
We can confirm that using a bivariate regression:
library(stargazer)
#>
#> Please cite as:
#> Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
#> R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
m <- lm(average.commute.time ~ percent.transit.commute, data=acs)
stargazer(m, type="text")
#>
#> ===================================================
#> Dependent variable:
#> ---------------------------
#> average.commute.time
#> ---------------------------------------------------
#> percent.transit.commute 0.353***
#> (0.032)
#>
#> Constant 23.225***
#> (0.105)
#>
#> ---------------------------------------------------
#> Observations 3,140
#> R2 0.037
#> Adjusted R2 0.037
#> Residual Std. Error 5.630 (df = 3138)
#> F Statistic 121.671*** (df = 1; 3138)
#> ===================================================
#> Note: *p<0.1; **p<0.05; ***p<0.01
Indeed, for every additional percentage point of people commuting by transit, average commute times increase by 0.353.
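For example, we can plug a couple of values into the fitted line to get the predicted average commute time in counties with 0% and 10% transit commuters, using nothing beyond the two coefficients reported above:
#Predicted commute times at 0% and 10% transit commuting
coef(m)["(Intercept)"] + coef(m)["percent.transit.commute"]*c(0,10)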
An objection to that regression might be that rural places have both low transit commuting and low commute times (if people work on their farms, for example). So we may want to take into account the independent effect of population density on both the percent who commute by transit and average commute times:
m2 <- lm(average.commute.time ~ percent.transit.commute + population.density, data=acs)
stargazer(m,m2, type="text")
#>
#> ==========================================================================
#> Dependent variable:
#> --------------------------------------------------
#> average.commute.time
#> (1) (2)
#> --------------------------------------------------------------------------
#> percent.transit.commute 0.353*** 0.391***
#> (0.032) (0.056)
#>
#> population.density -0.0001
#> (0.0001)
#>
#> Constant 23.225*** 23.209***
#> (0.105) (0.107)
#>
#> --------------------------------------------------------------------------
#> Observations 3,140 3,140
#> R2 0.037 0.038
#> Adjusted R2 0.037 0.037
#> Residual Std. Error 5.630 (df = 3138) 5.631 (df = 3137)
#> F Statistic 121.671*** (df = 1; 3138) 61.166*** (df = 2; 3137)
#> ==========================================================================
#> Note: *p<0.1; **p<0.05; ***p<0.01
Actually this relationship gets stronger once we take into account the independent effect of population density.
But that’s not the question we asked: we want to know whether the effect of the percent of transit commuters changes as population density increases. To do that we interact the two variables by multiplying them together:
m3 <- lm(average.commute.time ~ percent.transit.commute*population.density, data=acs)
stargazer(m,m2,m3, type="text")
#>
#> ======================================================================================================================
#> Dependent variable:
#> ---------------------------------------------------------------------------
#> average.commute.time
#> (1) (2) (3)
#> ----------------------------------------------------------------------------------------------------------------------
#> percent.transit.commute 0.353*** 0.391*** 0.272***
#> (0.032) (0.056) (0.061)
#>
#> population.density -0.0001 0.001***
#> (0.0001) (0.0002)
#>
#> percent.transit.commute:population.density -0.00002***
#> (0.00000)
#>
#> Constant 23.225*** 23.209*** 23.110***
#> (0.105) (0.107) (0.109)
#>
#> ----------------------------------------------------------------------------------------------------------------------
#> Observations 3,140 3,140 3,140
#> R2 0.037 0.038 0.045
#> Adjusted R2 0.037 0.037 0.044
#> Residual Std. Error 5.630 (df = 3138) 5.631 (df = 3137) 5.610 (df = 3136)
#> F Statistic 121.671*** (df = 1; 3138) 61.166*** (df = 2; 3137) 49.045*** (df = 3; 3136)
#> ======================================================================================================================
#> Note: *p<0.1; **p<0.05; ***p<0.01
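A quick note on R syntax: in a model formula the * operator expands to the two main effects plus their product, so the model above could equivalently be written out with the : interaction term. As a sketch (m3.alt is just an illustrative name):
#Equivalent specification: main effects plus the interaction term
m3.alt <- lm(average.commute.time ~ percent.transit.commute + population.density +
percent.transit.commute:population.density, data=acs)
all.equal(coef(m3), coef(m3.alt))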
OK, we have lots of stars here, so let’s break down what it is that we are seeing.
The intercept here is 23.110. I have said many times that the intercept is the average value of the outcome when all variables in the regression are equal to 0, so this is the average commute time in a place with 0 transit commuters and 0 population density. So, as is often the case, this is a nonsense number.
To get the effect of the percent commuting by transit we can deploy a derivative:
\hat{commute} = \alpha + \beta_1 Transit + \beta_2 Dens + \beta_3 Transit*Dens\\
\frac{\partial\hat{commute}}{\partial Transit} = \beta_1 + \beta_3*Dens
As we wanted, the effect of the percent who commute by transit changes across levels of population density. If this equation describes the effect of % transit, under what conditions is the effect just \beta_1? Only when population density is equal to 0. So in a county with a population density of 0, an additional percentage point of transit commuters increases average commute times by 0.272. Is this an important number? No! You have to know to ignore this number.
To determine the effect of percent transit commute, we have to evaluate it at many values of population density. Again, we have to be smart here. Try to think: what values does density take on?
summary(acs$population.density)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.04 16.69 44.72 270.68 116.71 72052.96
OK! Let’s just do the math:
population.density <- seq(0,73000,1)
marginal.effect <- coef(m3)["percent.transit.commute"] + population.density*coef(m3)["percent.transit.commute:population.density"]
plot(population.density, marginal.effect, type="l", xlab="Population Density", ylab="Effect of % Transit Commute on Commute Time")
abline(h=0, lty=2)
rug(acs$population.density)

We see that this line slopes downward: the effect of the percent commuting by transit decreases as population density increases. In other words, as a place gets more dense, having more people commute by transit goes from lengthening average commute times to shortening them.
I have added the rug plot here to show where counties’ population densities actually fall, and that rug plot somewhat tempers this finding. The truth is that the majority of counties have very small population densities, so the negative part of this relationship does not apply to most places.
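To put a rough number on where the effect flips sign, we can solve \beta_1 + \beta_3*Dens = 0 for density and compare the result to the distribution of county densities summarized above:
#Density at which the estimated effect of % transit commuting crosses zero
-coef(m3)["percent.transit.commute"]/coef(m3)["percent.transit.commute:population.density"]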
This graph is not particularly helpful without some sense of the error. Can we conclude with any certainty that the effect of the percent commuting by transit on commute times is truly negative at any point?
What is the variance of this marginal effect? In other words what is:
V(\beta_1 + \beta_3*Dens)=?
Is it just:
V(\beta_1 + \beta_3*Dens) = V(\beta_1) + V(\beta_3*Dens)?
No! When we are adding variance we critically need to know if there is any covariance between the two things. That is because:
V(X+Y) = V(X) + V(Y) + 2*COV(X,Y)
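If you want to convince yourself of this identity, a quick simulation with made-up variables shows the two sides matching:
set.seed(1)
x <- rnorm(10000)
y <- x + rnorm(10000) #y is built from x, so the two co-vary
var(x + y)
var(x) + var(y) + 2*cov(x, y) #matches var(x + y)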
So we could only ignore the covariance term if the covariance between these two coefficients were 0. Do the coefficients co-vary? Let’s bootstrap to see whether \beta_1 and \beta_3 covary:
small <- acs[c("percent.transit.commute","population.density","average.commute.time")]
b1 <- rep(NA, 1000)
b3 <- rep(NA, 1000)
for(i in 1:1000){
  #Resample the rows with replacement and re-fit the interaction model
  bs.data <- small[sample(1:nrow(small), nrow(small), replace=T),]
  m.bs <- lm(average.commute.time ~ percent.transit.commute*population.density, data=bs.data)
  #Save the two coefficients from this bootstrap sample
  b1[i] <- coef(m.bs)["percent.transit.commute"]
  b3[i] <- coef(m.bs)["percent.transit.commute:population.density"]
}
cor(b1,b3)
#> [1] 0.1535127
Yes! There is a non-zero correlation between these two things.
As such the variance of the marginal effect is:
V(\beta_1 + \beta_3*Dens) = V(\beta_1) + V(\beta_3*Dens) + 2*Cov(\beta_1, \beta_3*Dens)\\
= V(\beta_1) + Dens^2*V(\beta_3) + 2*Dens*Cov(\beta_1, \beta_3)
Notice how population density is a variable in this equation. The standard error of the marginal effect changes across levels of density. We will see this visually in a second.
How can we get these variances and co-variances?
vcov(m3)
#> (Intercept)
#> (Intercept) 1.177826e-02
#> percent.transit.commute -1.378290e-03
#> population.density -2.825708e-06
#> percent.transit.commute:population.density 7.390022e-08
#> percent.transit.commute
#> (Intercept) -1.378290e-03
#> percent.transit.commute 3.752489e-03
#> population.density -1.011826e-05
#> percent.transit.commute:population.density 8.850673e-08
#> population.density
#> (Intercept) -2.825708e-06
#> percent.transit.commute -1.011826e-05
#> population.density 6.223717e-08
#> percent.transit.commute:population.density -8.328641e-10
#> percent.transit.commute:population.density
#> (Intercept) 7.390022e-08
#> percent.transit.commute 8.850673e-08
#> population.density -8.328641e-10
#> percent.transit.commute:population.density 1.316543e-11
The “variance-covariance” matrix is generated by the regression and tells us the variances (the diagonal entries) and the covariances (the off-diagonal entries) of all of the coefficients, including the intercept.
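As a sanity check, the square roots of the diagonal entries are exactly the coefficient standard errors reported in the regression table above:
#Standard errors of the coefficients, straight from the variance-covariance matrix
sqrt(diag(vcov(m3)))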
So to get the standard error of the marginal effect at all the levels we calculated above:
var.margin <- vcov(m3)["percent.transit.commute","percent.transit.commute"] +
population.density^2*vcov(m3)["percent.transit.commute:population.density","percent.transit.commute:population.density"] +
population.density*2*vcov(m3)["percent.transit.commute:population.density","percent.transit.commute"]
This is variance, so to get the standard errors we take the square root:
se.margin <- sqrt(var.margin)
Each entry here is the standard deviation of the sampling distribution at that point. If each of those sampling distributions is normally distributed, then the 95% confidence interval will run 1.96 standard errors above and below our estimate:
upper.bounds <- marginal.effect +se.margin*1.96
lower.bounds <- marginal.effect - se.margin*1.96
Now we can re-draw the graph of the marginal effect with these bounds added:
population.density <- seq(0,73000,1)
marginal.effect <- coef(m3)["percent.transit.commute"] + population.density*coef(m3)["percent.transit.commute:population.density"]
plot(population.density, marginal.effect, type="l", xlab="Population Density", ylab="Effect of % Transit Commute on Commute Time")
abline(h=0, lty=2)
points(population.density, upper.bounds, type="l", lty=2)
points(population.density, lower.bounds, type="l", lty=2)
rug(acs$population.density)

This should be the same as what we get in the margins package:
library(margins)
#> Warning: package 'margins' was built under R version 4.1.3
cplot(m3, x="population.density",dx="percent.transit.commute", what="effect")
abline(h=0)

13.1 Predicting with regression
The same ability to draw a line through a graph of data also allows us to use regression to make predictions about out-of-sample information.
library(lubridate)
#> Warning: package 'lubridate' was built under R version
#> 4.1.3
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
mwr <- read.csv("Data/MarathonWR.csv")
#Some cleaning because I copied data from wikipedia
names(mwr) <- c("time","date")
mwr <- mwr[-5,]
mwr$seconds <- seconds(hms(mwr$time))
mwr$date <- dmy(mwr$date)
mwr$date[1:30] <- mwr$date[1:30]-years(100)
mwr$year <- year(mwr$date)
mwr$seconds <- as.numeric(mwr$seconds)
mwr$year <- as.numeric(mwr$year)
#How many seconds is 2 hours?
2*3600
#> [1] 7200
#Plotting this relationship:
library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.1.3
p <- ggplot(mwr, aes(x=year, y=seconds)) + geom_point() +
labs(x="Date", y = "Time") +
geom_smooth(method = lm, formula = y ~ poly(x, 1), se = FALSE)
p

m <- lm(seconds ~ year, data=mwr)
summary(m)
#>
#> Call:
#> lm(formula = seconds ~ year, data = mwr)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -410.6 -188.4 -100.2 166.1 944.4
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 54898.384 2406.743 22.81 <2e-16 ***
#> year -23.755 1.227 -19.36 <2e-16 ***
#> ---
#> Signif. codes:
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 291.5 on 48 degrees of freedom
#> Multiple R-squared: 0.8865, Adjusted R-squared: 0.8842
#> F-statistic: 375 on 1 and 48 DF, p-value: < 2.2e-16
Now, ignoring that this is not a very good fit, we can use the output of this regression to predict the marathon world record in future years.
future.years <- seq(2023,2050,1)
prediction <- coef(m)["(Intercept)"] + coef(m)["year"]*future.years
cbind(future.years, prediction/3600)
#> future.years
#> [1,] 2023 1.900505
#> [2,] 2024 1.893906
#> [3,] 2025 1.887308
#> [4,] 2026 1.880709
#> [5,] 2027 1.874110
#> [6,] 2028 1.867512
#> [7,] 2029 1.860913
#> [8,] 2030 1.854314
#> [9,] 2031 1.847716
#> [10,] 2032 1.841117
#> [11,] 2033 1.834518
#> [12,] 2034 1.827920
#> [13,] 2035 1.821321
#> [14,] 2036 1.814722
#> [15,] 2037 1.808124
#> [16,] 2038 1.801525
#> [17,] 2039 1.794927
#> [18,] 2040 1.788328
#> [19,] 2041 1.781729
#> [20,] 2042 1.775131
#> [21,] 2043 1.768532
#> [22,] 2044 1.761933
#> [23,] 2045 1.755335
#> [24,] 2046 1.748736
#> [25,] 2047 1.742137
#> [26,] 2048 1.735539
#> [27,] 2049 1.728940
#> [28,] 2050 1.722342
What is the confidence interval around each of these predictions?
We can use the same insight as above:
V(\alpha + \beta*years)=V(\alpha) + years^2 *V(\beta) + 2*years*Cov(\alpha,\beta)
var <- vcov(m)["(Intercept)","(Intercept)"] + future.years^2 * vcov(m)["year","year"] + 2*future.years * vcov(m)["year","(Intercept)"]
se.pred <- sqrt(var)
upper <- (prediction + se.pred*1.96)/3600
lower <- (prediction - se.pred*1.96)/3600
plot(future.years, prediction/3600, type="l", ylim=range(lower, upper))
points(future.years, lower, type="l", lty=2)
points(future.years, upper, type="l", lty=2)

Thankfully, we can simplify this procedure greatly using the predict() command.
In its basic form, predict() will spit out the value of the line for each of your data points:
mwr$prediction <- predict(m)
head(mwr)
#> time date seconds year prediction
#> 1 2:55:18 1908-07-24 10518 1908 9573.654
#> 2 2:52:45 1909-01-01 10365 1909 9549.899
#> 3 2:46:53 1909-02-12 10013 1909 9549.899
#> 4 2:46:05 1909-05-08 9965 1909 9549.899
#> 6 2:40:34 1909-08-31 9634 1909 9549.899
#> 7 2:38:16 1913-05-12 9496 1913 9454.878
But, helpfully, we can also feed the predict function new data with new values for the x variable:
prediction2 <- predict(m, newdata=data.frame(year=future.years))
cbind(prediction, prediction2)
#> prediction prediction2
#> 1 6841.817 6841.817
#> 2 6818.062 6818.062
#> 3 6794.307 6794.307
#> 4 6770.552 6770.552
#> 5 6746.797 6746.797
#> 6 6723.042 6723.042
#> 7 6699.287 6699.287
#> 8 6675.532 6675.532
#> 9 6651.776 6651.776
#> 10 6628.021 6628.021
#> 11 6604.266 6604.266
#> 12 6580.511 6580.511
#> 13 6556.756 6556.756
#> 14 6533.001 6533.001
#> 15 6509.246 6509.246
#> 16 6485.491 6485.491
#> 17 6461.736 6461.736
#> 18 6437.981 6437.981
#> 19 6414.225 6414.225
#> 20 6390.470 6390.470
#> 21 6366.715 6366.715
#> 22 6342.960 6342.960
#> 23 6319.205 6319.205
#> 24 6295.450 6295.450
#> 25 6271.695 6271.695
#> 26 6247.940 6247.940
#> 27 6224.185 6224.185
#> 28 6200.430 6200.430
What’s more, the predict function will also determine the confidence interval for you:
prediction2 <- predict(m, newdata=data.frame(year=future.years), interval="confidence")
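The result is a matrix with the fitted value and the lower and upper confidence bounds for each new observation:
head(prediction2) #columns: fit, lwr, upr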
All of this is especially helpful when calculating the predicted values by hand would be difficult.
Above we can see that the linear regression is not a very good fit for these data. Instead, I’m going to propose that a 3rd-order polynomial is a better fit for these data. That would look like this:
#Plotting this relationship:
library(ggplot2)
p <- ggplot(mwr, aes(x=year, y=seconds)) + geom_point() +
labs(x="Date", y = "Time") +
geom_smooth(method = lm, formula = y ~ poly(x, 3), se = FALSE)
p

m2 <- lm(seconds ~ poly(year, 3), data=mwr)
summary(m2)
#>
#> Call:
#> lm(formula = seconds ~ poly(year, 3), data = mwr)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -357.07 -41.24 18.96 48.94 526.26
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 8298.48 22.67 366.133 < 2e-16 ***
#> poly(year, 3)1 -5644.47 160.27 -35.219 < 2e-16 ***
#> poly(year, 3)2 1678.50 160.27 10.473 9.16e-14 ***
#> poly(year, 3)3 -281.26 160.27 -1.755 0.0859 .
#> ---
#> Signif. codes:
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 160.3 on 46 degrees of freedom
#> Multiple R-squared: 0.9671, Adjusted R-squared: 0.965
#> F-statistic: 451.1 on 3 and 46 DF, p-value: < 2.2e-16
It’s not too bad to use the math to predict the fitted values for all of the future years, but calculating the standard error for the confidence band at each of those fitted values would be extremely difficult.
Thankfully, we can just pipe in the model to the predict function and generate predictions and a confidence interval around those predictions:
prediction.poly <- predict(m2, newdata=data.frame(year=future.years), interval="confidence")
plot(future.years, prediction.poly[,1], type="l", ylim=c(6000,8000))
points(future.years, prediction.poly[,2], type="l")
points(future.years, prediction.poly[,3], type="l")
abline(h=7200, lty=3)

Displayed with all the data:
prediction.poly <- predict(m2, newdata=data.frame(year=future.years), interval="confidence")
plot(future.years, prediction.poly[,1], type="l", ylim=c(6000,11000), xlim=c(1900,2050))
points(future.years, prediction.poly[,2], type="l")
points(future.years, prediction.poly[,3], type="l")
points(mwr$year, mwr$seconds, pch=16)
abline(h=7200, lty=3)

13.2 Out of Sample Prediction
Let’s go through another example of making out of sample predictions.
In order to conduct a pre-election survey, a necessary step is to predict who is likely to vote and who is not. This is critical because a pre-election poll is not meant to capture the opinions of all Americans, but rather the opinions of voters specifically. Making a judgment about who those voters are is a key step.
We are going to look at building a likely voter model using CCES data. One of the nice things about the CCES is that, post election, they use the voter file to determine who actually voted out of the people that they surveyed. This is great information because people lie like crazy when we ask them if they voted or not.
load("Data/CCES.Rdata")
cces <- cces.class
head(cces)
#> # A tibble: 6 x 9
#> year vv.turnout educ white female agegrp vote.~1 pid3
#> <int> <int> <chr> <chr> <dbl> <fct> <chr> <fct>
#> 1 2016 1 HS or ~ white 1 40-49 Maybe 3
#> 2 2016 1 HS or ~ white 1 18-29 Yes 3
#> 3 2016 0 HS or ~ nonw~ 1 50-64 Yes 1
#> 4 2016 1 HS or ~ nonw~ 1 18-29 Yes 3
#> 5 2016 1 college white 1 30-39 Yes 1
#> 6 2016 0 HS or ~ nonw~ 1 50-64 Maybe 1
#> # ... with 1 more variable: prim.turnout <dbl>, and
#> # abbreviated variable name 1: vote.intent
table(cces$year)
#>
#> 2016 2018 2020
#> 64600 60000 61000
The models we want to build here predict vv.turnout, whether someone voted (1) or not (0). So, for example, taking the 2016 respondents, we can determine how an individual’s vote intention predicts whether they actually voted.
table(cces$vote.intent)
#>
#> Already voted Maybe No Yes
#> 16563 23491 15139 130199
cces$vote.intent <- factor(cces$vote.intent)
cces$vote.intent <- relevel(cces$vote.intent, ref="No")
cces.16 <- cces[cces$year==2016,]
m <- lm(vv.turnout ~ vote.intent, data=cces.16)
summary(m)
#>
#> Call:
#> lm(formula = vv.turnout ~ vote.intent, data = cces.16)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.6969 -0.6499 0.3502 0.3502 0.9261
#>
#> Coefficients:
#> Estimate Std. Error t value
#> (Intercept) 0.073882 0.006337 11.66
#> vote.intentAlready voted 0.623028 0.013332 46.73
#> vote.intentMaybe 0.174263 0.008153 21.37
#> vote.intentYes 0.575972 0.006660 86.48
#> Pr(>|t|)
#> (Intercept) <2e-16 ***
#> vote.intentAlready voted <2e-16 ***
#> vote.intentMaybe <2e-16 ***
#> vote.intentYes <2e-16 ***
#> ---
#> Signif. codes:
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.4575 on 64484 degrees of freedom
#> (112 observations deleted due to missingness)
#> Multiple R-squared: 0.1528, Adjusted R-squared: 0.1528
#> F-statistic: 3877 on 3 and 64484 DF, p-value: < 2.2e-16
Because the intercept is the probability of voting when all other variables are 0, this is the probability of voting for someone who said they would not vote in the upcoming election (about 7.3%). Each of the other coefficients indicates the change in the probability of voting when moving from being an expressed non-voter to that category. We can do the math ourselves or use predict:
predict(m, newdata= data.frame(vote.intent=c("No", "Maybe", "Yes", "Already voted")))
#> 1 2 3 4
#> 0.07388217 0.24814489 0.64985443 0.69690993
So for the 2016 data only 70% of those who said they already voted actually voted. The lies!
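Because vote intention is the only predictor here, these fitted values are just the group means of vv.turnout, which we can confirm directly:
#Proportion verified to have voted within each vote-intention group
tapply(cces.16$vv.turnout, cces.16$vote.intent, mean, na.rm=TRUE)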
Now pretend that it’s fall of 2020 and we have collected the data but the election hasn’t happened yet. How can we use the earlier data to predict, for the 2020 respondents, whether they are going to vote or not? (In reality we do have this information and can verify how our predictions perform, but it would not have been available to us at the time.)
Let’s use the 2016 and 2018 data to build a couple of models that predict whether someone voted or not. I’m going to use a handful of models, from simple to complex:
cces.training <- cces[cces$year %in% c(2016,2018),]
cces.test <- cces[cces$year==2020,]
m1 <- lm(vv.turnout ~ vote.intent, data=cces.training)
m2 <- lm(vv.turnout ~ vote.intent +prim.turnout + educ:white, data=cces.training)
m3 <- lm(vv.turnout ~ female + vote.intent +prim.turnout + educ:white:agegrp + pid3, data=cces.training)
These are 3 models with increasingly complicated equations. We haven’t seen regressions with this many variables but, rest assured, everything we’ve talked about so far applies. These are (mostly) just a series of indicator variables and interactions.
The important thing here is that because the 2020 data has the exact same variables in it, we can use the coefficients in these models to form predictions in the 2020 data.
Consider the first model:
summary(m1)
#>
#> Call:
#> lm(formula = vv.turnout ~ vote.intent, data = cces.training)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.7710 -0.6675 0.3325 0.3325 0.9336
#>
#> Coefficients:
#> Estimate Std. Error t value
#> (Intercept) 0.066432 0.004318 15.39
#> vote.intentAlready voted 0.704534 0.007914 89.02
#> vote.intentMaybe 0.175119 0.005477 31.97
#> vote.intentYes 0.601085 0.004565 131.67
#> Pr(>|t|)
#> (Intercept) <2e-16 ***
#> vote.intentAlready voted <2e-16 ***
#> vote.intentMaybe <2e-16 ***
#> vote.intentYes <2e-16 ***
#> ---
#> Signif. codes:
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.4482 on 124444 degrees of freedom
#> (152 observations deleted due to missingness)
#> Multiple R-squared: 0.1851, Adjusted R-squared: 0.1851
#> F-statistic: 9423 on 3 and 124444 DF, p-value: < 2.2e-16
predict(m1, newdata= data.frame(vote.intent=c("No", "Maybe", "Yes", "Already voted")))
#> 1 2 3 4
#> 0.06643162 0.24155081 0.66751633 0.77096562
The people in 2020 also have a variable named vote.intent with the same response labels. As such we can do:
cces.test$predict1 <- predict(m1, newdata=cces.test)
head(cces.test)
#> # A tibble: 6 x 10
#> year vv.turnout educ white female agegrp vote.~1 pid3
#> <int> <int> <chr> <chr> <dbl> <fct> <fct> <fct>
#> 1 2020 1 some c~ white 0 50-64 Alread~ 2
#> 2 2020 0 postgr~ white 1 65-74 No 1
#> 3 2020 0 college white 1 65-74 Yes 3
#> 4 2020 1 college white 1 50-64 Yes 1
#> 5 2020 0 college white 0 50-64 Yes 3
#> 6 2020 1 some c~ white 0 50-64 Yes 2
#> # ... with 2 more variables: prim.turnout <dbl>,
#> # predict1 <dbl>, and abbreviated variable name
#> # 1: vote.intent
table(cces.test$predict1)
#>
#> 0.066431619966184 0.241550808184305 0.667516327713994
#> 4361 5797 38790
#> 0.770965622946551
#> 11996
The predict1 column takes on 4 possible values based on the individual’s vote intention.
We can classify each person as a likely voter or not depending on whether their predicted probability of voting is above or below 50%. From that we can determine how good a job this model does at predicting 2020 voters by looking at the proportion of respondents who are incorrectly classified: how many are predicted to vote but don’t, and how many are predicted not to vote but do?
#Classification Error
mean((cces.test$predict1>.5 & cces.test$vv.turnout==0) | (cces.test$predict1<.5 & cces.test$vv.turnout==1),na.rm=T)
#> [1] 0.2363153
The two other models both have a substantially higher R^2, meaning that they fit the 2016/2018 data much better. The question is whether this better fit also reduces the classification error in 2020. Notice that even though these models have lots of complicated interactions and many variables, the code to actually form predictions is really easy! Wee! Regression is fun!
cces.test$predict2 <- predict(m2, newdata=cces.test)
#> Warning in predict.lm(m2, newdata = cces.test): prediction
#> from a rank-deficient fit may be misleading
cces.test$predict3 <- predict(m3, newdata=cces.test)
#> Warning in predict.lm(m3, newdata = cces.test): prediction
#> from a rank-deficient fit may be misleading
#Classification Error Model 2
mean((cces.test$predict2>.5 & cces.test$vv.turnout==0) | (cces.test$predict2<.5 & cces.test$vv.turnout==1),na.rm=T)
#> [1] 0.2115221
#Classification Error Model 3
mean((cces.test$predict3>.5 & cces.test$vv.turnout==0) | (cces.test$predict3<.5 & cces.test$vv.turnout==1),na.rm=T)
#> [1] 0.1940142
Yes, in this case we do make somewhat fewer classification errors when we fit the training data better. But let’s go crazy and interact everything with everything:
#Stupid model
m4 <- lm(vv.turnout ~ pid3:vote.intent:prim.turnout + female:educ:white:agegrp, data=cces.training)
cces.test$predict4<- predict(m4, newdata=cces.test)
#Classification Error Model 4
mean((cces.test$predict4>.5 & cces.test$vv.turnout==0) | (cces.test$predict4<.5 & cces.test$vv.turnout==1),na.rm=T)
#> [1] 0.2268804
It went back up! This is the danger of over-fitting: piling on interactions lets a model chase quirks of the 2016/2018 data, but that does not guarantee, and here actually hurts, its ability to predict the new 2020 data. A better in-sample fit is not the same thing as better out-of-sample predictions.
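One way to see this directly (a sketch, using a small helper function defined here) is to compare each model’s classification error on the data it was fit to against the 2020 error we just calculated; typically a more complicated model looks better on the data it was fit to even when it predicts new data worse:
#Helper: classification error for a model on a given dataset
class.error <- function(model, data){
  p <- predict(model, newdata=data)
  mean((p > .5 & data$vv.turnout==0) | (p < .5 & data$vv.turnout==1), na.rm=T)
}
#Error on the training years (2016/2018) versus the held-out 2020 data
sapply(list(m1=m1, m2=m2, m3=m3, m4=m4), class.error, data=cces.training)
sapply(list(m1=m1, m2=m2, m3=m3, m4=m4), class.error, data=cces.test)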
13.3 Summary
Where we started:
At some point in learning about statistics or data science you may have heard instructors talk about samples and the population. The purpose of this class is to think a lot about the relationship between these two things and to focus our minds on the idea that this interplay is what the whole enterprise is all about.
What defines statistics is making conclusions about populations from a sample of data. (We infer things about populations from samples, hence the title of the class.)
“Population” is a pretty vague term however, and it encompasses a really wide range of things.
Sometimes what we mean is very obvious: In politics we often take random samples of a few thousand people to help us to understand the opinions and attitudes of the entire United States. In this case it’s quite clear that the sample is the random sample of people while the population is the adult public of the United States.
But lots of other things in our day-to-day lives are “samples” from some population; we often just don’t consider them as such.
For example: we might be interested in learning about the revenue potential of a small business, and have a few months of their finances as data. The goal is to use this sample of data (which may or may not be perfectly representative!) to better understand what the true earning potential of that business is. In this case the “population” is simply the true earning potential of the business.
If you are sports inclined we can think of the performance of a baseball hitter over a certain stretch of time as a “Sample” from the population that is their true hitting potential.
If it’s helpful (it is for me sometimes) we can re-frame the “population” as the “data generating process”. In the baseball example there is an underlying truth (the hitter’s true skill) which generates data (if the player hits the ball or not) when they step up to the plate.
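As a tiny illustration with made-up numbers: suppose a hitter’s true skill is a .300 batting average. The data that skill generates over stretches of 100 at-bats will scatter around that underlying truth:
#Simulate 1,000 stretches of 100 at-bats for a hitter with true skill .300
set.seed(42)
observed.average <- rbinom(1000, size=100, prob=.300)/100
summary(observed.average) #the samples bounce around the true skill of .300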
In statistics, we look to make probabilistic inferences from samples about unknowable truths about the world. We will never precisely know the earning potential of a business, or how many Americans support a presidential candidate, or what level of hitting skill a baseball player has. We can, however, form estimates from a sample of data and estimate our level of uncertainty about the degree to which our answer represents the truth.
Where do you go from here? There are, broadly, three directions you can take from this material:
Pure statistics: there is a lot more in terms of probability, probability distributions, getting standard errors right etc. We have scratched the surface.
Causal inference: very exciting new field.
Prediction: Machine learning baby!