15 Final Review

15.1 The Big Picture

The fundamental task of statistics is not to describe the dataset in front of us but instead to make inferences about the population from which the sample is drawn.

The parts of this course where we estimated something from our dataset were relatively straightforward. How do you get a mean? How do you measure covariance? How do you turn that covariance into a correlation or a regression coefficient? The math for all of those things is pretty straightforward, and when it’s not (we didn’t do the matrix algebra for multiple regression) the interpretation still was.

The more conceptually difficult part of this course (and the part that I want you to think about when you think about this course) was conceptualizing each and every thing that we estimate in a data set as an infinite number of possible estimates we could get from an infinite number of data sets that could exist. Critically (and magically!) the distribution of all of those possible estimates from all those possible datasets is predictable.

Everything we estimate in one data set has a sampling distribution and a standard error, which allow us to make inferences about the population. These inferences are what statistics is about.

Consider an example I used in the first class. Here is the distribution of the feeling thermometer question on Donald Trump from the 2020 ANES:

library(rio)
anes <- import(file="https://raw.githubusercontent.com/marctrussler/IIS-Data/main/ANES2020Clean.csv")

plot(density(anes$therm.trump,na.rm=T), xlim=c(0,100), main="Distribution of Trump Thermometer Ratings")
abline(v=mean(anes$therm.trump,na.rm=T), lty=2)
legend("topright", c(paste("Mean = ",round(mean(anes$therm.trump,na.rm=T),2 )),
                     paste("SD = ",round(sd(anes$therm.trump,na.rm=T),2 )) ), lty=2, col=c("Black", "White"))

Here’s what I said at the time:

We can also summarize the distribution of this variable in the sample with summary statistics. The mean – or simple average – of this variable, for example, is 40.44. The standard deviation – which can be thought of as how much each individual deviates from the mean, on average – is 40.31.

The problem is, that’s not at all what we care about. What we care about, in the example of a Trump feeling thermometer, is how all voters in the population feel about Trump, not just the ones in this sample. What is the true average feeling towards Trump amongst all voters? How spread out is that opinion? Do the opinions of Democrats and Republicans differ? None of these questions is immediately or obviously answered by the data in front of us.

It is tempting to say: well if this is a sample of data from the population isn’t the answer the same in the population? Isn’t it good enough to get a sample, report what’s in that sample, and assume it’s the same as the population?

To some degree our answer to this is actually, yes! But let’s think for a little bit: if we took another sample of 8000 Americans would this plot look exactly the same? How much could we expect it to vary? It obviously will not be exactly the same…. so how do we know that this distribution is the one that is representative of the population?

We now have a lot of insight into all of these questions. After studying for the final, you should have a good sense of why!

15.2 Standard Error of the Mean

We care about three distributions: a population, a sample, and the sampling distribution of an estimate we form using our sample.

This is true no matter what we are estimating but it is simplest to think about in terms of the mean.

First is the population that we are sampling from. Something like this:

eval <- seq(-5,60,.001)
plot(eval,dchisq(eval,10,ncp=12), type="l", main="Population",
     xlab="Values",ylab="Density", ylim=c(0,0.06), xlim=c(0,65))

We can think of the population as the “Data Generating Process”. It can be anything.

The second distribution we care about is the sample we get from this population:

set.seed(19104)
x <- rchisq(50,10,ncp=12)
head(x)
#> [1] 17.17206 15.50880 29.22219 18.31710 31.47061 20.98213
plot(density(x), main="Sample", xlab="Value", ylim=c(0,0.06), xlim=c(0,65))

We can think of each sample as being a rough approximation of the population it was generated from. The larger the sample size, the more the sample will look like the population. (However, critically, this is NOT what the Central Limit Theorem is.)

Here are the two distributions side-by-side:

plot(density(x), main="Sample", xlab="Value", ylim=c(0,0.06), xlim=c(0,65))
points(eval,dchisq(eval,10,ncp=12),col="firebrick", type="l")

Every time we take a sample we will get a distribution of data that will be a rough approximation of the population.

The key thing we learned before the midterm is, while it’s somewhat futile to ask “How much will my sample distribution look like the population for a sample size of \(n\)?”, we can predict how much the sample mean will look like the true population expected value.

Because of the central limit theorem, we know that the sample mean is itself a random variable, with this distribution \(\bar{X_n} \sim N(\mu=E[X], \sigma^2 = V[X]/n)\).

Specifically, each sample mean for an n of 50 will be drawn from this normal distribution, based on my knowledge that the \(E[X]=22\) and \(V[X]=68\).

mean.eval <- seq(15,29,.001)
plot(mean.eval, dnorm(mean.eval, mean=22, sd=sqrt(68/50)), type="l",
     main="Sampling Distribution of the Sample Mean", 
     xlab="Value of Sample Mean",
     ylab="Density")

We can prove this to ourselves by repeatedly sampling and taking a bunch of sample means and plotting their density compared to this sampling distribution.

sample.means <- rep(NA, 10000)

for(i in 1:10000){
  sample.means[i] <- mean(rchisq(50,10,ncp=12))
}

plot(mean.eval, dnorm(mean.eval, mean=22, sd=sqrt(68/50)), type="l",
     main="Sampling Distribution of the Sample Mean", 
     xlab="Value of Sample Mean",
     ylab="Density")
points(density(sample.means), type="l", col="darkblue")

They are the same distribution. Because this sampling distribution is normally distributed, it is easy to determine the probability of the sample mean falling in certain ranges.

15.2.1 But we don’t know the population variance

Our definition of the sampling distribution made use of \(V[X]\), the population-level variance. But we don’t have that!

Instead, we make use of the sample standard deviation as a stand in for \(\sqrt{V[X]}\).

When we are using the sample standard deviation as a stand-in for the population standard deviation in the calculation of the standard error, however, the sampling distribution of the sample mean is t distributed with \(n-1\) degrees of freedom. This really only matters for sample sizes under 30, but still, it is theoretically important that the sample mean is \(t\) distributed, not standard normal.

\[ \frac{\bar{X_n}-E[X]}{s/\sqrt{n}} \sim t_{n-1} \]
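To get a sense of how small this correction usually is, here is a quick comparison of critical values (a minimal sketch; the sample sizes are arbitrary):

#95% critical values from the t distribution at several sample sizes
qt(.975, df=c(9, 29, 99, 999))
#Compared to the standard normal critical value of about 1.96
qnorm(.975)
#The t critical values shrink toward 1.96 as the degrees of freedom grow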

15.2.2 Confidence Intervals

We define a confidence level as \((1-\alpha)*100\), such that a 70% confidence interval implies an \(\alpha\) of .3.

For a given \(\alpha\) a confidence interval around any given sample mean is:

\[CI(\alpha) = [\bar{X_n}-t_{\alpha/2}*SE,\bar{X_n}+t_{\alpha/2}*SE]\]

If we had \(\bar{x}=3\), \(s=2\), \(n=100\) and wanted to find a 95% confidence interval, the steps are:

Derive the standard error:

se <- 2/sqrt(100)

Determine how many standard errors leave \(\alpha/2=.05/2=.025\) probability mass in each tail of the \(t\) distribution with 99 degrees of freedom:

t <- qt(.975, df=99)

Now plug these values into the above equation:

c(3-t*se, 3+t*se)
#> [1] 2.603157 3.396843

The interpretation of a confidence interval comes from the fact that if we repeatedly re-sample and form confidence intervals, 95% of them will contain the true population parameter.

The fully correct interpretation is: 95% of similarly constructed confidence intervals will contain the true population parameter.

The technically incorrect, but good enough for me, interpretation is: there is a 95% chance this confidence interval contains the true population parameter.
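To convince yourself of that first interpretation, here is a small simulation in the style of the ones above (a sketch; the population values are made up, with a true mean of 0):

set.seed(19104)
contains.truth <- rep(NA, 10000)

for(i in 1:10000){
  #Draw a sample from a population where the true mean is 0
  samp <- rnorm(100, mean=0, sd=2)
  #Build this sample's 95% confidence interval
  se <- sd(samp)/sqrt(100)
  t <- qt(.975, df=99)
  #Record whether the interval contains the truth
  contains.truth[i] <- (mean(samp) - t*se) < 0 & (mean(samp) + t*se) > 0
}
#This proportion should be very close to .95
mean(contains.truth)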

15.3 Hypothesis Tests

The logic of hypothesis testing comes from the fact that we don’t know where the sampling distribution of an estimate is centered, but we feel that we have a good sense of its shape (via the CLT).

So we have this normal distribution and need to put it somewhere. All we are doing with a hypothesis test is positing that the sampling distribution is centered on the null hypothesis and determining how likely it is to get the result we got in our sample under that assumption.

The five steps to a hypothesis test are:

  1. Specify a null hypothesis and an alternative hypothesis.
  2. Choose a test statistic and the level of test (\(\alpha\)).
  3. Derive the sampling distribution and center on the null hypothesis.
  4. Compute the p-value (one sided or two sided).
  5. Reject the null hypothesis if the p-value is less than or equal to \(\alpha\). Otherwise retain the null hypothesis.

If we had this sample:

#We don't care about the population it comes from
set.seed(19103)
x <- rnorm(1000, 0.2, 3)

mean(x)
#> [1] 0.1536556
sd(x)
#> [1] 2.968243

And wanted to test the null hypothesis that \(\mu=0\) and the alternative hypothesis that \(\mu \neq 0\) we could approximate the sampling distribution under the null hypothesis as:

se <- sd(x)/sqrt(1000)
eval <- seq(-.5,.5, .001)

plot(eval, dnorm(eval, mean=0, sd=se), type="l", main="Sampling Distribution Under the Null")
abline(v=mean(x))

“Approximate” because I am using a normal distribution to visualize this even though I calculated the standard error using the sample standard deviation, which means the sampling distribution should really be t-distributed.

We are going to set an \(\alpha=.05\), which means we want there to be a 5% or lower probability of a false positive. In terms of our figure that means we want to put 5% of the probability mass in the tails and 95% of the probability mass in the center.

plot(eval, dnorm(eval, mean=0, sd=se), type="l", main="Sampling Distribution Under the Null")
abline(v=qnorm(c(.025,.975), mean=0, sd=se), lty=2, col="firebrick")
abline(v=mean(x))

If this is the true sampling distribution, there is a 5% chance of producing means to the left and right of those two red lines (in the tails). If our mean is in that range, there is less than a 5% chance of getting a result that extreme simply by chance. This is why we reject the null hypothesis if our mean is in that range. In this case our mean is not in that range: there is a higher than 5% chance of producing a mean at least that extreme under this distribution. So we fail to reject the null hypothesis.

To calculate a p-value for a two-tailed test we want to add up the probability mass in the tail that is to the right of our sample mean as well as the probability mass to the left of our mean once it is flipped to the other side of the null. Because the normal distribution is symmetrical we can just find one of these values and multiply by 2.

pnorm(mean(x), 0, se, lower.tail=F)*2
#> [1] 0.1016303

The p-value is around 10%. Greater than our 5% threshold, as expected.

Now we know that this is technically t-distributed. So to skip all the above visualizations I can just evaluate: how many standard errors away from the null is my mean, and what does that mean in the t distribution?

t <- (0-mean(x))/se
pt(t, df=999)*2
#> [1] 0.1019453

Those last two steps are all we really need to do for a hypothesis test. The rest of the above is just to help us visualize. You should really understand those last two steps.

The other part of hypothesis testing that is critical to understand is: all of the things we have estimated in this class have a normal or t distribution. Once you have a standard error (and a degrees of freedom if t-distributed) you can do the hypothesis test. It doesn’t matter if that’s a mean, or a difference in means, or a regression coefficient, or a marginal effect….
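If it helps, that whole recipe can be written down once as a small helper function (purely illustrative; the function name and arguments are mine, not anything built into R):

#Generic two-sided test: works for a mean, a difference in means,
#a regression coefficient, and so on
two.sided.p <- function(estimate, se, null, df){
  t <- (estimate - null)/se
  2*pt(-abs(t), df=df)
}

#Reproduces the p-value of about 0.10 from the example above
two.sided.p(estimate=mean(x), se=se, null=0, df=999)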

15.3.1 Power

The level of our statistical test, \(\alpha\), is the probability of a false positive. If we think about a drug trial, this is the accepted risk of saying a drug works when it does not work. We want this to be very low because science is inherently conservative. Helpfully, we get to just set this level as part of our hypothesis test.

More tricky is \(\beta\), the probability of a false negative. Again thinking about a drug trial, this is the probability of saying a drug doesn’t work when it actually does. We don’t get to set \(\beta\). The things that determine it are (1) the width of our sampling distribution and (2) the difference between the truth and the null hypothesis.

The two things you have to think about when calculating power are:

  1. Given my standard error and null hypothesis, what is the range of values where I will fail to reject the null hypothesis?

  2. Given my standard error and assumption (or knowledge) of the truth of the population parameter, what range of estimates will be produced?

Then you need to determine how many of the estimates produced by (2) will end up in the range defined by (1).

So, taking the variable x above, we had:

set.seed(19103)
x <- rnorm(1000, 0.2, 3)

mean(x)
#> [1] 0.1536556
sd(x)
#> [1] 2.968243

In this case I know that the true population mean is 0.2 and the true population standard deviation is 3 (or \(V[X]=9\)). That means the standard error of the sample mean will be \(\sqrt{9/1000} \approx .095\).

So we have a standard error, we have a truth (0.2), and let’s assume that our null hypothesis is still \(\mu=0\).

We don’t have to visualize this, though I have been doing so to help you see what I’m talking about. But to just quickly do the math:

Here is the range of values where we will fail to reject the null hypothesis:

null.low<- qnorm(.025, mean=0, sd=.095)
null.high<- qnorm(.975, mean=0, sd=.095)

Now determine the probability of drawing means in that zone, given the truth:

pnorm(null.high, mean=0.2, sd=.095) - pnorm(null.low, mean=0.2, sd=.095)
#> [1] 0.4422133

There is about a 44% chance of drawing a mean that will fall into the zone where we will mistakenly fail to reject the null hypothesis. This is a high probability of a false negative and therefore a low amount of power.

In the case where we know the data generating process (i.e. I can see the rnorm command that is generating the data), we can simulate to get the same answer:

fail.t.test <- rep(NA, 1000)

for(i in 1:1000){
  #Draw a sample from the same population
  samp <- rnorm(1000, 0.2, 3)
  #Run a t-test with the stated null hypothesis
  test <- t.test(samp, mu=0)
  #Record whether we fail to reject the null
  fail.t.test[i] <- test$p.value>.05
}
mean(fail.t.test)
#> [1] 0.45
#Same answer as math

15.4 Bootstrap

There are four ways to calculate a sampling distribution (and therefore, a standard error):

  1. You have information about the population that allows you to calculate a standard error.

For example, in situations where I told you the population-level \(V[X]\).

  2. You know the population/DGP and can repeatedly sample and generate estimates. The result will be a sampling distribution.

This is the literal “sampling distribution” as it is a distribution of a bunch of estimates from a bunch of samples. The standard deviation of this distribution is the standard error.

  3. You have information about the sample and an equation that produces the standard error.

This is the “standard” situation that allows us to calculate the standard error of the mean for a single sample, but it is also how we get the standard error for a difference in means, for a regression coefficient, and for a marginal effect. We know how to figure these things out using math.

  4. Bootstrap

The fourth method is a rough approximation of the second method. Instead of repeatedly re-sampling from a population, we use the fact that our sample is iid to produce the insight that “random samples of our sample are new samples”.

We repeatedly re-sample rows in our data-set, calculate the thing we are estimating in each new “sample”, and produce a sampling distribution. The standard deviation of this sampling distribution will be our estimate of the standard error.

This method is a “last resort”. We would never use it to estimate the standard error of the mean (we have an equation for that!). But we will use it for things where we don’t have an equation for the standard error, like a correlation coefficient, the difference between two correlation coefficients, or the difference between two regression coefficients that come from two different models.
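Here is a minimal sketch of that procedure for a correlation coefficient, using made-up data (the variable names and the choice of 1000 re-samples are mine):

set.seed(19104)
#Two made-up variables with some relationship
fake <- data.frame(a = rnorm(500))
fake$b <- 2*fake$a + rnorm(500, sd=3)

boot.cor <- rep(NA, 1000)
for(i in 1:1000){
  #A "new" sample is a sample of our rows, taken with replacement
  rows <- sample(1:nrow(fake), nrow(fake), replace=TRUE)
  boot.cor[i] <- cor(fake$a[rows], fake$b[rows])
}

#The estimate, and the bootstrapped standard error
cor(fake$a, fake$b)
sd(boot.cor)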

15.5 Covariance and Correlation

15.5.1 Difference in means

To extend our knowledge of sampling distributions and hypothesis testing we started to consider the relationships between variables, starting with a situation where we calculated the mean of a variable for two different groups. (We can think about this as the relationship between the variable \(x\) and a group indicator with exactly two values).

We wanted to think about hypotheses like these:

\[ H_o:\mu_c - \mu_t = 0\\ H_a: \mu_c - \mu_t \neq 0 \]

So if we have two means the hypothesis we often want to answer is whether they are equal, or not.

What is critical (in line with our discussion above of sampling distributions and hypothesis tests) is once you have the standard error everything is exactly the same.

Here the standard error is:

\[ se = \sqrt{\frac{\sigma_0^2}{n_0} + \frac{\sigma_1^2}{n_1} } \]

If we have that and an estimate (say a difference in means of 3, or 0.2, it doesn’t really matter), then our hypothesis test (or a power test) continues exactly as it did for a single mean.

I’m not going to make you calculate the degrees of freedom for a difference in means on the exam :)

We can easily perform a difference in means test using the t.test function:

set.seed(19104)
y0 <- rnorm(50,23, 13)
y1 <-  rnorm(50, 28, 6)
y <- c(y0,y1)
x <- c(rep(0,50), rep(1,50))
dat <- cbind.data.frame(x,y)
head(dat)
#>   x        y
#> 1 0 17.85445
#> 2 0 25.55715
#> 3 0 29.99189
#> 4 0 28.97014
#> 5 0 22.22095
#> 6 0 10.75801

#T.test
t.test(dat$y ~ dat$x)
#> 
#>  Welch Two Sample t-test
#> 
#> data:  dat$y by dat$x
#> t = -3.1014, df = 77.848, p-value = 0.002683
#> alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
#> 95 percent confidence interval:
#>  -9.012638 -1.965384
#> sample estimates:
#> mean in group 0 mean in group 1 
#>        22.44682        27.93583
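As a check, we can build the same test statistic by hand from the standard error formula above (a quick sketch using the y0 and y1 vectors just created; se.diff is just a name I made up):

#Standard error of the difference in means
se.diff <- sqrt(var(y0)/50 + var(y1)/50)
#t statistic against the null of no difference
(mean(y0) - mean(y1))/se.diff
#Should reproduce the t of about -3.1 reported by t.test()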

Note that we get the same result via regression:

summary(lm(dat$y ~ dat$x))
#> 
#> Call:
#> lm(formula = dat$y ~ dat$x)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -27.3657  -5.6623   0.3217   5.3519  25.0263 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   22.447      1.251  17.936  < 2e-16 ***
#> dat$x          5.489      1.770   3.101  0.00252 ** 
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 8.849 on 98 degrees of freedom
#> Multiple R-squared:  0.08938,    Adjusted R-squared:  0.08008 
#> F-statistic: 9.619 on 1 and 98 DF,  p-value: 0.002516

15.5.2 Covariation and Correlation

As a bridge to get to regression we thought about the covariation between two variables and the (related) concept of correlation. We need these two things to describe the inter-relationship between two continuous variables.

Covariance is defined as:

\[ \hat{\sigma_{x,y}} = \frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})(y_i - \bar{y}) \] We multiply the deviations of x from its mean by the deviations of y from its mean, and add them up.

Covariation is a critical building block, but its major problem is that it doesn’t have clear units. If you multiply x and y by 100 the covariance will get much larger, even though it is describing the same relationship.

As such we quickly pivot to correlation, which is just covariation standardized by the variation in the two variables:

The idea behind correlation is simply to standardize a covariance by dividing by the standard deviations of the variables that we are measuring. Specifically, the correlation equals:

\[ \hat{\rho}_{x,y} = \frac{\hat{\sigma}_{x,y}}{\sqrt{var(x)*var(y)}} \]

Correlation ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation).
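A quick sketch with made-up data shows that the two are just re-scalings of one another:

set.seed(19104)
a <- rnorm(200)
b <- 0.5*a + rnorm(200)

#The correlation is the covariance standardized by the two standard deviations
cov(a, b)/(sd(a)*sd(b))
cor(a, b)
#These two lines print the same number

#Re-scaling the variables changes the covariance but not the correlation
cov(a*100, b*100)
cor(a*100, b*100)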

15.6 Bivariate Regression

The need to expand on correlation came from two drawbacks. First, the “unitless” character of correlation makes it hard to form real, human, sentences about the size of effects. Second, our ability to consider third (or fourth, or fifth) variables is extremely blunt (i.e. we can maybe look at the correlation in a few subgroups of a third variable, but not much beyond that).

Regression summarizes the relationship between two variables by drawing a line in a two-dimensional space. Specifically, we estimate the following equation:

\[ \hat{y} = \hat{\alpha} + \hat{\beta}x_i \]

Where \(\hat{\alpha}\) is the y-intercept, \(\hat{\beta}\) is the slope, and \(x_i\) is each data point’s x value.

For any given value of x, bivariate regression returns the predicted value of y.

The two coefficients, \(\hat{\alpha}\) and \(\hat{\beta}\), are chosen such that they minimize the sum of the squared residuals, where the residuals are:

\[ \hat{u_i} = y_i - \hat{y_i} \]

The regression line is the line that minimizes the sum of the squared vertical distances to the data points.

Going through the math to determine what minimizes the sum of the squared residuals, we found that:

\[ \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} \]

And

\[ \hat{\beta} = \frac{\sum_{i=1}^n (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^n(x_i -\bar{x})^2} \]

For this second equation, we also showed that it simplifies to

\[ \hat{\beta} = \frac{\hat{\sigma}_{x,y}}{var(x)} \]

The covariance of x and y divided by the variance of x. Correlation and regression are nothing more than re-scalings of covariation.

We further discussed how to interpret a regression. Consider this regression of the BLM feeling thermometer on age:

library(rio)
anes <- import("https://github.com/marctrussler/IIS-Data/raw/main/ANESFinalProjectData.csv")

anes$age <- anes$V201507x
anes$age[anes$age<0] <- NA

anes$blm.therm <- anes$V202174
anes$blm.therm[anes$blm.therm<0 | anes$blm.therm>100] <- NA

m <- lm(blm.therm ~ age, data=anes)
summary(m)
#> 
#> Call:
#> lm(formula = blm.therm ~ age, data = anes)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -62.690 -33.381   4.049  32.202  54.049 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 67.54966    1.32870   50.84   <2e-16 ***
#> age         -0.26998    0.02437  -11.08   <2e-16 ***
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 35.12 on 7062 degrees of freedom
#>   (1216 observations deleted due to missingness)
#> Multiple R-squared:  0.01709,    Adjusted R-squared:  0.01695 
#> F-statistic: 122.8 on 1 and 7062 DF,  p-value: < 2.2e-16
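Before interpreting these numbers, we can check the covariance-over-variance formula from above directly (a sketch restricted to rows where both variables are observed, since lm() drops the missing cases):

#Rows where both age and blm.therm are non-missing
cc <- complete.cases(anes$age, anes$blm.therm)
cov(anes$age[cc], anes$blm.therm[cc]) / var(anes$age[cc])
#Should reproduce the age coefficient of about -0.27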

The interpretation of the intercept is always the average value of y when all variables are equal to zero. We can see that this is just mathematically true from the regression equation:

\[ \hat{y} = \hat{\alpha} + \hat{\beta}x_i \\ \hat{y} = \hat{\alpha} + \hat{\beta}*0 \\ \hat{y} = \hat{\alpha}\\ \]

That may, or may not, be a helpful value. In this case, this is the predicted average blm.therm score for when age is 0. That’s clearly a nonsense number.

The interpretation of \(\beta\) is always how a 1-unit change in the independent variable leads to a \(\beta\) change in the dependent variable. By 1-unit we mean 1 unit in our number system. This is true in EVERY REGRESSION. In this case, 1 unit is a year of aging. So for every additional year a person gets older we expect their blm.therm to drop by around .27.
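We can also use these two coefficients to generate a predicted value straight from the regression equation (a quick sketch; the choice of age 40 is arbitrary):

#Predicted BLM thermometer rating for a 40 year old
coef(m)["(Intercept)"] + coef(m)["age"]*40
#Roughly 67.55 - 0.27*40, or about 57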

The interpretation of regression coefficients therefore completely depends on the scale of the variable. If we instead had age in months, the underlying relationship would be no different, but the coefficient would change its value:

anes$age.months <- anes$age*12
m2 <- lm(blm.therm ~ age.months, data=anes)
summary(m2)
#> 
#> Call:
#> lm(formula = blm.therm ~ age.months, data = anes)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -62.690 -33.381   4.049  32.202  54.049 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 67.549664   1.328699   50.84   <2e-16 ***
#> age.months  -0.022498   0.002031  -11.08   <2e-16 ***
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 35.12 on 7062 degrees of freedom
#>   (1216 observations deleted due to missingness)
#> Multiple R-squared:  0.01709,    Adjusted R-squared:  0.01695 
#> F-statistic: 122.8 on 1 and 7062 DF,  p-value: < 2.2e-16

“One unit” is no longer one year, but is now one month. As such the coefficient is much smaller. An individual getting one month older leads to a .022 decline in their blm thermometer score.

If one unit is one month then 12 units is one year. As such we can see:

coef(m2)["age.months"]*12
#> age.months 
#> -0.2699798

Multiplying this coefficient by 12 returns the original coefficient: the effect of one year of aging.

15.6.1 The standard error of a regression coefficient

The standard error of a regression coefficient is given by:

\[ SE_{\beta} = \sqrt{\frac{ \frac{1}{n} \sum_{i=1}^n u_i^2 }{\sum_{i=1}^n (x_i-\bar{x})^2} } \]

The average squared residual divided by the sum of the squared deviations of the x’s from their mean. The counter-intuitive part of this is that the standard error of a regression coefficient gets smaller when x has more variation, which is backwards from a lot of other things.
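A minimal sketch of this calculation for the blm.therm ~ age regression above (this uses \(1/n\) where summary() uses the residual degrees of freedom, so the match is approximate but very close):

#Residuals and the x values actually used in the model
u <- residuals(m)
age.used <- m$model$age
sqrt(mean(u^2) / sum((age.used - mean(age.used))^2))
#Should be very close to the 0.024 reported by summary(m)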

As discussed throughout this review, once you have a regression coefficient (an estimate) and a standard error, everything about a hypothesis test is the same.

15.7 Multivariate Regression

When we just look at a bivariate regression we might not be uncovering the true effect of x on y.

In particular, we are concerned with omitted variables that are a cause of both x and y.

We can “control” for these additional variables, z, by putting them in our regression.

When z has only a couple of categories, one way that we can conceptualize regression is that we are adding separate intercepts for each category.

For example, we looked at how the relationship between the student-to-teacher ratio and test scores in California schools might be affected by whether a school is a high.esl or low.esl school.


## ---------------------------------------------------------
library(AER)
#> Warning: package 'AER' was built under R version 4.1.2
#> Loading required package: car
#> Warning: package 'car' was built under R version 4.1.2
#> Loading required package: carData
#> Warning: package 'carData' was built under R version 4.1.2
#> Loading required package: lmtest
#> Warning: package 'lmtest' was built under R version 4.1.2
#> Loading required package: zoo
#> 
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#> 
#>     as.Date, as.Date.numeric
#> Loading required package: sandwich
#> Loading required package: survival
data("CASchools")
CASchools$STR <- CASchools$students/CASchools$teachers
CASchools$score <- (CASchools$read + CASchools$math)/2
dat <- CASchools

## ---------------------------------------------------------
dat$high.esl <-NA 
dat$high.esl[dat$english>=median(dat$english)]<- 1
dat$high.esl[dat$english<median(dat$english)]<- 0

## ---------------------------------------------------------
plot(dat$STR, dat$score, xlab="Student:Teacher Ratio", ylab="Test Score", pch=16, type="n")
points(dat$STR[dat$high.esl==0], dat$score[dat$high.esl==0], pch=16, col="darkblue")
points(dat$STR[dat$high.esl==1], dat$score[dat$high.esl==1], pch=16, col="firebrick")
abline(a=691.32, b=-1.3963, col="darkblue", lwd=3)
abline(a=691.32-19.49, b=-1.3963, col="firebrick", lwd=3)
legend("topright", c("Low ESL", "High ESL"), pch=c(16,16), col=c("darkblue", "firebrick"))

In particular, we saw that some of the negative relationship between classroom size and test scores could be explained by this third variable.

m <- lm(score ~ STR + high.esl, data=dat)
summary(m)
#> 
#> Call:
#> lm(formula = score ~ STR + high.esl, data = dat)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -37.857 -10.853  -0.626  10.198  43.834 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 691.3267     8.1315  85.019  < 2e-16 ***
#> STR          -1.3963     0.4171  -3.348 0.000889 ***
#> high.esl    -19.4911     1.5763 -12.365  < 2e-16 ***
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 15.91 on 417 degrees of freedom
#> Multiple R-squared:  0.3058, Adjusted R-squared:  0.3025 
#> F-statistic: 91.84 on 2 and 417 DF,  p-value: < 2.2e-16

Thinking about how to interpret regression with more than one variable, everything stays exactly the same. The intercept is still the average value of y when all variables are equal to 0. Each coefficient tells us how a one-unit shift in that variable will affect the dependent variable. The difference now is that each impact is estimated holding the other variable constant.

When ESL is split into a binary variable it is easy to visualize what “holding constant” means: we are looking at the relationship of classroom size and test scores within categories of ESL. It’s harder to visualize this when we let ESL be a continuous variable with many levels, but the same thing is happening. We are considering the relationship between classroom size and test scores while holding constant the percentage of kids in the classroom who speak English as a second language:

## ---------------------------------------------------------
m <- lm(score ~ STR + english, data=dat)
summary(m)
#> 
#> Call:
#> lm(formula = score ~ STR + english, data = dat)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -48.845 -10.240  -0.308   9.815  43.461 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 686.03224    7.41131  92.566  < 2e-16 ***
#> STR          -1.10130    0.38028  -2.896  0.00398 ** 
#> english      -0.64978    0.03934 -16.516  < 2e-16 ***
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 14.46 on 417 degrees of freedom
#> Multiple R-squared:  0.4264, Adjusted R-squared:  0.4237 
#> F-statistic:   155 on 2 and 417 DF,  p-value: < 2.2e-16

And we can interpret the variables in the same way. (With the caveat that we have to think about what scale english is on).

Remember that, as with all addition, the order of the variables doesn’t matter:

m <- lm(score ~ english + STR, data=dat)
summary(m)
#> 
#> Call:
#> lm(formula = score ~ english + STR, data = dat)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -48.845 -10.240  -0.308   9.815  43.461 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 686.03224    7.41131  92.566  < 2e-16 ***
#> english      -0.64978    0.03934 -16.516  < 2e-16 ***
#> STR          -1.10130    0.38028  -2.896  0.00398 ** 
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 14.46 on 417 degrees of freedom
#> Multiple R-squared:  0.4264, Adjusted R-squared:  0.4237 
#> F-statistic:   155 on 2 and 417 DF,  p-value: < 2.2e-16

Along the same lines, while we set this up as “Lets look at the relationship between STR and test scores while holding constant ESL”, the regression is treating STR and ESL in the same way here. We are equally “looking at the relationship between ESL and test scores while holding constant STR”.

The logic of regression scales up to many variables and the interpretation remains the same.

## ---------------------------------------------------------
m <- lm(score ~ STR + english + lunch + income, data=dat)
summary(m)
#> 
#> Call:
#> lm(formula = score ~ STR + english + lunch + income, data = dat)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -30.7085  -4.9977  -0.2014   4.8411  28.7000 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 675.60821    5.30886 127.261  < 2e-16 ***
#> STR          -0.56039    0.22861  -2.451   0.0146 *  
#> english      -0.19433    0.03138  -6.193 1.42e-09 ***
#> lunch        -0.39637    0.02741 -14.461  < 2e-16 ***
#> income        0.67498    0.08333   8.100 6.19e-15 ***
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 8.448 on 415 degrees of freedom
#> Multiple R-squared:  0.8053, Adjusted R-squared:  0.8034 
#> F-statistic: 429.1 on 4 and 415 DF,  p-value: < 2.2e-16

The intercept is the average value of y when all variables are 0. Each coefficient is the effect of a 1-unit shift of that variable on the outcome.

15.8 Regression Assumptions

There are 5 regression assumptions needed for OLS to return the true effect of \(X\) on \(Y\), and for the hypothesis tests to be correct.

15.8.1 1. X has full rank

This is a matrix algebra thing which says that no variable can be a linear combination of other variables. In practice this means that you cannot have an \(x\) variable with no variation, and you can’t include an exhaustive set of indicator variables.

This second piece is more practically important. We saw that if we want to run a regression on a categorical variable, we make indicators for each category and leave one of them out as the reference category.

For example if I want to find the effect of census region on median income I would make indicator variables for the four regions:

acs <- import("https://github.com/marctrussler/IIS-Data/raw/main/ACSCountyData.csv")

acs$midwest[acs$census.region=="midwest"]<- 1
acs$midwest[acs$census.region!="midwest"]<- 0
acs$northeast[acs$census.region=="northeast"]<- 1
acs$northeast[acs$census.region!="northeast"]<- 0
acs$south[acs$census.region=="south"]<- 1
acs$south[acs$census.region!="south"]<- 0
acs$west[acs$census.region=="west"]<- 1
acs$west[acs$census.region!="west"]<- 0

If we try to include all four indicator variables in the regression we will get one NA:

m <- lm(median.income ~ midwest + northeast + south + west, data=acs)
summary(m)
#> 
#> Call:
#> lm(formula = median.income ~ midwest + northeast + south + west, 
#>     data = acs)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -31506  -8314  -2055   5236  88945 
#> 
#> Coefficients: (1 not defined because of singularities)
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  55590.7      614.4  90.481  < 2e-16 ***
#> midwest      -2104.2      733.1  -2.870  0.00413 ** 
#> northeast     6403.3     1074.7   5.958 2.84e-09 ***
#> south        -8268.1      704.4 -11.738  < 2e-16 ***
#> west              NA         NA      NA       NA    
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 12990 on 3137 degrees of freedom
#>   (1 observation deleted due to missingness)
#> Multiple R-squared:  0.1023, Adjusted R-squared:  0.1015 
#> F-statistic: 119.2 on 3 and 3137 DF,  p-value: < 2.2e-16

R has made west NA here. To think about why this is happening (in a way that avoids matrix algebra) we can think about our definition of the intercept: the average value of the outcome variable when all variables are equal to zero. In the case of having exhaustive categories there are no cases where all the variables equal 0. Counties have to be 1 on at least one of the categories.

The way around this is to simply omit one of the indicator variables, which we call the “reference” category. The choice of what to leave out doesn’t matter for the math, but different choices will change how we interpret the coefficients.

For the above regression the intercept is the average value of median income for counties in the west (when midwest, northeast, and south are all 0). Each coefficient is the change in the average value of median income moving from the west to counties in that region. So median income is 2104 dollars lower in the midwest compared to the west.
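A quick way to see this concretely is to compare the coefficients to the raw group means (a sketch using tapply()):

#Average median income by census region
tapply(acs$median.income, acs$census.region, mean, na.rm=TRUE)
#The west mean matches the intercept, and adding each coefficient
#to the intercept reproduces the mean for that region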

Note that the shortcut way to include indicator variables is to make our variable into a factor variable:

acs$census.region <- as.factor(acs$census.region)
acs$census.region <- relevel(acs$census.region, ref="midwest")

m <- lm(median.income ~ census.region, data=acs)
summary(m)
#> 
#> Call:
#> lm(formula = median.income ~ census.region, data = acs)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -31506  -8314  -2055   5236  88945 
#> 
#> Coefficients:
#>                        Estimate Std. Error t value Pr(>|t|)
#> (Intercept)             53486.5      399.9 133.743  < 2e-16
#> census.regionnortheast   8507.4      968.2   8.786  < 2e-16
#> census.regionsouth      -6163.9      527.8 -11.678  < 2e-16
#> census.regionwest        2104.2      733.1   2.870  0.00413
#>                           
#> (Intercept)            ***
#> census.regionnortheast ***
#> census.regionsouth     ***
#> census.regionwest      ** 
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 12990 on 3137 degrees of freedom
#>   (1 observation deleted due to missingness)
#> Multiple R-squared:  0.1023, Adjusted R-squared:  0.1015 
#> F-statistic: 119.2 on 3 and 3137 DF,  p-value: < 2.2e-16

15.8.2 2. Correct Functional Form

This assumption is that the model we specify for our data is the correct one. If we think about the population as a “Data Generating Process” what we want to do is to model the data in a way that matches the data generating process.

The three things we should worry about here are: polynomials, clusters, and interactions.

For polynomials, we want to consider whether the effect of a variable is linear or not. In many cases the effect of a variable depends on its own level. We saw the example of looking at the effect of household income on test scores. Here is the linear relationship:


#Creating Variables
CASchools$STR <- CASchools$students/CASchools$teachers
summary(CASchools$STR)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   14.00   18.58   19.72   19.64   20.87   25.80

CASchools$score <- (CASchools$read + CASchools$math)/2
summary(CASchools$score)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   605.5   640.0   654.5   654.2   666.7   706.8

#Changing names
dat <- CASchools
library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.1.3
p <- ggplot(dat, aes(x=income, y=score)) + geom_point() +
  labs(x="Average Household Income", y = "School Test Scores") +
  geom_smooth(method = lm, formula = y ~ poly(x, 1), se = FALSE)
p
summary(lm(score ~ income, data=dat))
#> 
#> Call:
#> lm(formula = score ~ income, data = dat)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -39.574  -8.803   0.603   9.032  32.530 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 625.3836     1.5324  408.11   <2e-16 ***
#> income        1.8785     0.0905   20.76   <2e-16 ***
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 13.39 on 418 degrees of freedom
#> Multiple R-squared:  0.5076, Adjusted R-squared:  0.5064 
#> F-statistic: 430.8 on 1 and 418 DF,  p-value: < 2.2e-16

Just looking at this visually, does this line well represent this relationship? Not really, no.

The way we deal with this is adding polynomial terms to our regression. The easy way to think about polynomials is: they let the line curve.

To do this we modify the simple bivariate regression by adding an additional variable, income squared:

\[ y = a + bx + e\\ y = a + b_1 x + b_2 x^2 + e \]

To implement this we can use the poly() function:

#USE THE RAW=T OPTION WHEN YOU USE THE POLY FUNCTION:
#OTHERWISE R USES ORTHOGONAL POLYNOMIALS AND THE COEFFICIENTS
#WILL NOT MATCH ADDING INCOME AND INCOME^2 DIRECTLY

m <- lm(score ~ poly(income, 2, raw=T), data=dat )
summary(m)
#> 
#> Call:
#> lm(formula = score ~ poly(income, 2, raw = T), data = dat)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -44.416  -9.048   0.440   8.347  31.639 
#> 
#> Coefficients:
#>                            Estimate Std. Error t value
#> (Intercept)               607.30174    3.04622 199.362
#> poly(income, 2, raw = T)1   3.85099    0.30426  12.657
#> poly(income, 2, raw = T)2  -0.04231    0.00626  -6.758
#>                           Pr(>|t|)    
#> (Intercept)                < 2e-16 ***
#> poly(income, 2, raw = T)1  < 2e-16 ***
#> poly(income, 2, raw = T)2 4.71e-11 ***
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 12.72 on 417 degrees of freedom
#> Multiple R-squared:  0.5562, Adjusted R-squared:  0.554 
#> F-statistic: 261.3 on 2 and 417 DF,  p-value: < 2.2e-16

#This is equivalent to: 

dat$income2  <- dat$income^2

m <- lm(score ~ income + income2, data=dat )
summary(m)
#> 
#> Call:
#> lm(formula = score ~ income + income2, data = dat)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -44.416  -9.048   0.440   8.347  31.639 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 607.30174    3.04622 199.362  < 2e-16 ***
#> income        3.85099    0.30426  12.657  < 2e-16 ***
#> income2      -0.04231    0.00626  -6.758 4.71e-11 ***
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 12.72 on 417 degrees of freedom
#> Multiple R-squared:  0.5562, Adjusted R-squared:  0.554 
#> F-statistic: 261.3 on 2 and 417 DF,  p-value: < 2.2e-16

#Plotting this relationship:
library(ggplot2)
p <- ggplot(dat, aes(x=income, y=score)) + geom_point() +
  labs(x="Average Household Income", y = "School Test Scores") +
  geom_smooth(method = lm, formula = y ~ poly(x, 2), se = FALSE)
p

To get the effect of this variable we can take the first derivative using the power rule. A refresher on how the power rule works:

\[ \frac{\partial y}{\partial x}(X^k) =k*X^{k-1} \\ \frac{\partial y}{\partial x}(aX^k) =k*aX^{k-1} \\ \frac{\partial y}{\partial x}(a) =0 \\ \frac{\partial y}{\partial x}(Z) =0 \\ \]

Applying this to when we have a polynomial term:

\[ \hat{y} = \hat{\alpha} + \hat{\beta_1}*x + \hat{\beta_2}*x^2 \\ \frac{\partial y}{\partial x} = \hat{\beta}_1 + 2*\hat{\beta_2}*x \]

So here, if we wanted to know the effect of income when income is at 20,30,50:

coef(m)
#>  (Intercept)       income      income2 
#> 607.30174351   3.85099392  -0.04230845

coef(m)[2] + 2*coef(m)[3]*20
#>   income 
#> 2.158656

#2.15
#and at 30
coef(m)[2] + 2*coef(m)[3]*30
#>   income 
#> 1.312487

#and at 50
coef(m)[2] + 2*coef(m)[3]*50
#>     income 
#> -0.3798508

The second violation of the functional form assumption is when our data comes from clusters but we are modeling it as if it is just a big random sample. In that case we want to explicitly model the clustered structure of the data. The canonical example is students in a school: the relationship between a student’s home income and test scores will be driven, partly, by which classroom they are in. So to not violate this assumption we want to use a method that takes these clusters into account.

The example we saw in class was using the ACS data. Here the data comes from states. As such, the relationship between (for example) the percent with less than a high school education and the unemployment rate is determined, partly, by the state a county is in. Another way to think about this is that the relationship within states is what we want to estimate, because looking at this relationship between states introduces lots of bias. If we estimate between states, part of our coefficient will be determined by state-level factors that lead to both a higher percent of people with less than a high school education and higher unemployment.

To deal with this we can run a fixed-effects regression which gives an intercept to each state. See below for how we additionally have to adjust our standard errors for this clustered data.

names(acs)
#>  [1] "V1"                      "county.fips"            
#>  [3] "county.name"             "state.full"             
#>  [5] "state.abbr"              "state.alpha"            
#>  [7] "state.icpsr"             "census.region"          
#>  [9] "population"              "population.density"     
#> [11] "percent.senior"          "percent.white"          
#> [13] "percent.black"           "percent.asian"          
#> [15] "percent.amerindian"      "percent.less.hs"        
#> [17] "percent.college"         "unemployment.rate"      
#> [19] "median.income"           "gini"                   
#> [21] "median.rent"             "percent.child.poverty"  
#> [23] "percent.adult.poverty"   "percent.car.commute"    
#> [25] "percent.transit.commute" "percent.bicycle.commute"
#> [27] "percent.walk.commute"    "average.commute.time"   
#> [29] "percent.no.insurance"    "midwest"                
#> [31] "northeast"               "south"                  
#> [33] "west"

#Standard
m <- lm(unemployment.rate ~ percent.less.hs, data=acs)
summary(m)
#> 
#> Call:
#> lm(formula = unemployment.rate ~ percent.less.hs, data = acs)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -16.3352  -1.5111  -0.1961   1.2056  21.3602 
#> 
#> Coefficients:
#>                 Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)     3.098618   0.106664   29.05   <2e-16 ***
#> percent.less.hs 0.199515   0.007192   27.74   <2e-16 ***
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.555 on 3139 degrees of freedom
#>   (1 observation deleted due to missingness)
#> Multiple R-squared:  0.1969, Adjusted R-squared:  0.1966 
#> F-statistic: 769.6 on 1 and 3139 DF,  p-value: < 2.2e-16

#Fixed Effects
library(lfe)
#> Loading required package: Matrix
#> Warning: package 'Matrix' was built under R version 4.1.3
#> 
#> Attaching package: 'lfe'
#> The following object is masked from 'package:lmtest':
#> 
#>     waldtest
m.fe <- felm(unemployment.rate ~ percent.less.hs | state.abbr, data=acs)
summary(m.fe)
#> 
#> Call:
#>    felm(formula = unemployment.rate ~ percent.less.hs | state.abbr,      data = acs) 
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -13.1414  -1.3097  -0.1830   0.9995  20.7760 
#> 
#> Coefficients:
#>                 Estimate Std. Error t value Pr(>|t|)    
#> percent.less.hs 0.160505   0.008225   19.52   <2e-16 ***
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.301 on 3089 degrees of freedom
#>   (1 observation deleted due to missingness)
#> Multiple R-squared(full model): 0.3591   Adjusted R-squared: 0.3486 
#> Multiple R-squared(proj model): 0.1098   Adjusted R-squared: 0.09506 
#> F-statistic(full model):33.94 on 51 and 3089 DF, p-value: < 2.2e-16 
#> F-statistic(proj model): 380.8 on 1 and 3089 DF, p-value: < 2.2e-16

#Relationship somewhat attenuated.

The third violation of the functional form assumption is when we have to include an interaction term, which is dealt with in its own section, below.

15.8.3 3. No heteroskedasticity

Assumptions 3 & 4 deal with having incorrect standard errors.

Assumption 3 says that, conditional on x, the expected squared error must be equal to \(\sigma^2\). Less important here is what \(\sigma^2\) is; more important is that this says the error variance must be constant across levels of X.

In short, this assumption is violated if there is more variation in the dependent variable at some values of our independent variable than at others. A pattern that may look like this:

#Heteroskedastic Errors
x <- runif(100, min=0, max=100)
y <- 2 + 3*x + rnorm(100, mean=0, sd=x*2)

plot(x,y)
abline(lm(y~x), col="firebrick" ,lwd=2)

Here was a real-world example of heteroskedasticity:

dat$esl <- dat$english
plot( dat$esl, dat$math)
abline(lm(dat$math ~ dat$esl), col="firebrick", lwd=2)

In general there seems to be a lot more variation in math scores when esl is lower, and less when esl is higher.

There is a specific test for heteroskedasticity beyond what we can see with our eyes:

library(lmtest)
m <- lm(math ~ esl, data=dat)
bptest(m)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  m
#> BP = 12.084, df = 1, p-value = 0.0005085

This test will be statistically significant if heteroskedasticity is present. Here we reject the null hypothesis of homoskedasticity.

How can we fix this?

Well, one way to fix heteroskedasticity is to better model the reasons for higher variance in some cases.

We think that there is more unmodeled variance at the low end of esl. In other words, schools without a lot of English-as-a-second-language children come from all different types of places. If we add variables that better explain math scores among that group, heteroskedasticity will be reduced.

So, for example, if we add to our regression other variables which help us determine what sort of schools these are…

m <- lm(math ~ esl + lunch + calworks + income, data=dat)
summary(m)
#> 
#> Call:
#> lm(formula = math ~ esl + lunch + calworks + income, data = dat)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -30.8808  -6.6403  -0.0207   5.8995  30.0311 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 660.25387    2.47171 267.124  < 2e-16 ***
#> esl          -0.14893    0.03831  -3.887 0.000118 ***
#> lunch        -0.32979    0.04215  -7.825 4.26e-14 ***
#> calworks     -0.11171    0.06657  -1.678 0.094090 .  
#> income        0.76129    0.09542   7.979 1.46e-14 ***
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 9.928 on 415 degrees of freedom
#> Multiple R-squared:  0.7224, Adjusted R-squared:  0.7198 
#> F-statistic:   270 on 4 and 415 DF,  p-value: < 2.2e-16
bptest(m)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  m
#> BP = 7.4826, df = 4, p-value = 0.1125

We no longer have a heteroskedastic regression.

The other way to deal with heteroskedasticity is to employ Huber-White (or simply “robust”) standard errors.

library(sandwich)
library(lmtest)
m <- lm(math ~ esl, data=dat)
robust.vcov <- vcovHC(m, type="HC1")
coeftest(m, vcov=robust.vcov)
#> 
#> t test of coefficients:
#> 
#>               Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept) 662.539315   1.046782 632.929 < 2.2e-16 ***
#> esl          -0.583245   0.033671 -17.322 < 2.2e-16 ***
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

15.8.4 4. No autocorrelation

Assumption 4 talks about “auto-correlation”. This has the same problem of breaking the assumption that the errors of one data point are independent from others, but here it is because the points are related to one another in some un-modeled way. This is almost always due to time or to clusters.

Here is quarterly American GDP since the year 2000:

gdp <- import("https://github.com/marctrussler/IIS-Data/raw/main/GDP.csv")
gdp <- gdp[213:307,]

plot(gdp$DATE, gdp$GDP)
abline(lm(gdp$GDP ~ gdp$DATE), col="firebrick", lwd=2)

Now, ignoring the fact that this red line doesn’t do a great job of fitting the data, the problem here is that the errors being made from point to point are not independent of one another.

Let’s look at the relationship between the date and the residuals from that regression:

gdp$residuals <- residuals(lm(gdp$GDP ~ gdp$DATE))
plot(gdp$DATE, gdp$residuals)
abline(h=0)

For comparison, here is what this same plot looks like when things are fine (using simulated data like we created above, but with constant error variance):

x <- runif(100, min=0, max=100)
y <- 2 + 3*x + rnorm(100, mean=0, sd=15)
resid <- residuals(lm(y~x))
plot(x,resid)
abline(h=0)

Here the error on one point doesn’t really tell you anything about its neighbors.

How do we deal with time-based autocorrelation? Run! It’s very hard to deal with and is the focus of full courses. For right now it’s enough to know that this sort of pattern runs counter to the assumption of OLS.

For cluster based-autocorrelation we want to use cluster robust standard errors.

Above, we used the sandwich package to generate heteroskedasticity-robust standard errors. We can use the same package to generate clustered standard errors. In the vcovCL() function we can specify what variable our data are clustered on. Here it is state, so I’ve put that variable into the command. I’ve also included type="HC1". This is actually the default so I didn’t have to put it in, but I wanted to point out that clustered standard errors are heteroskedasticity robust as well! A bonus!

m <- lm(median.income ~ percent.college, data=acs)
summary(m)
#> 
#> Call:
#> lm(formula = median.income ~ percent.college, data = acs)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -37849  -6044   -623   5696  52223 
#> 
#> Coefficients:
#>                 Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)     29651.55     435.98   68.01   <2e-16 ***
#> percent.college  1016.55      18.52   54.90   <2e-16 ***
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 9789 on 3139 degrees of freedom
#>   (1 observation deleted due to missingness)
#> Multiple R-squared:  0.4899, Adjusted R-squared:  0.4897 
#> F-statistic:  3014 on 1 and 3139 DF,  p-value: < 2.2e-16
library(sandwich)
cluster.vcov <- vcovCL(m, cluster= ~ state.abbr, type="HC1")
cluster.vcov2 <- vcovCL(m, cluster= ~ state.abbr)

#Compare to original var-covar matrix
vcov(m)
#>                 (Intercept) percent.college
#> (Intercept)      190078.663      -7396.2115
#> percent.college   -7396.212        342.8198
#New hypothesis test
coeftest(m, vcov = cluster.vcov )
#> 
#> t test of coefficients:
#> 
#>                  Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept)     29651.551   1337.982  22.161 < 2.2e-16 ***
#> percent.college  1016.547     55.331  18.372 < 2.2e-16 ***
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Finally, tying this in with GM2, if we choose to explicitly model the clusters in our dataset, we also have to use cluster-robust standard errors. The felm command will do this for you:

#Normal SE
m.fe <- felm(unemployment.rate ~ percent.less.hs | state.abbr, data=acs)
summary(m.fe)
#> 
#> Call:
#>    felm(formula = unemployment.rate ~ percent.less.hs | state.abbr,      data = acs) 
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -13.1414  -1.3097  -0.1830   0.9995  20.7760 
#> 
#> Coefficients:
#>                 Estimate Std. Error t value Pr(>|t|)    
#> percent.less.hs 0.160505   0.008225   19.52   <2e-16 ***
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.301 on 3089 degrees of freedom
#>   (1 observation deleted due to missingness)
#> Multiple R-squared(full model): 0.3591   Adjusted R-squared: 0.3486 
#> Multiple R-squared(proj model): 0.1098   Adjusted R-squared: 0.09506 
#> F-statistic(full model):33.94 on 51 and 3089 DF, p-value: < 2.2e-16 
#> F-statistic(proj model): 380.8 on 1 and 3089 DF, p-value: < 2.2e-16
#Cluster-Robust SE
m.fe.clu <- felm(unemployment.rate ~ percent.less.hs | state.abbr| 0 | state.abbr, data=acs)
summary(m.fe.clu)
#> 
#> Call:
#>    felm(formula = unemployment.rate ~ percent.less.hs | state.abbr |      0 | state.abbr, data = acs) 
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -13.1414  -1.3097  -0.1830   0.9995  20.7760 
#> 
#> Coefficients:
#>                 Estimate Cluster s.e. t value Pr(>|t|)    
#> percent.less.hs   0.1605       0.0289   5.553 1.07e-06 ***
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.301 on 3089 degrees of freedom
#>   (1 observation deleted due to missingness)
#> Multiple R-squared(full model): 0.3591   Adjusted R-squared: 0.3486 
#> Multiple R-squared(proj model): 0.1098   Adjusted R-squared: 0.09506 
#> F-statistic(full model, *iid*):33.94 on 51 and 3089 DF, p-value: < 2.2e-16 
#> F-statistic(proj model): 30.84 on 1 and 50 DF, p-value: 1.073e-06

15.8.5 5. Causality

The last Gauss-Markov assumption is that there are no omitted variables which impact both x and y. This is often thought of as the “bonus” assumption as it is not strictly necessary to have an unbiased estimator. We didn’t talk much about causality in this class beyond omitted variables. It’s a really big subject!

15.9 Regression Interaction and Prediction

15.9.1 Interaction

We can think of interaction as another potential violation of GM2. In this case the true data generating process is one where the effect of one variable depends on another. Indeed, if it’s helpful we can think of polynomial regression as a subset of interactive regression where the effect of a variable depends on its own level.

The most helpful way to think about interaction is via the regression equation. An interactive regression looks like this:

\[ \hat{Y} = \hat{\alpha} + \hat{\beta_1}X + \hat{\beta_2}Z + \hat{\beta_3}X*Z \]

We can use the power rule to take the first derivative of this equation with respect to X:

\[ \frac{\partial \hat{Y}}{\partial X} = \hat{\beta_1} + \hat{\beta_3}Z \]

This equation gives us the effect of X on Y. Two things follow from it: (1) \(\beta_1\) represents the effect of X on Y when Z is zero; and (2) \(\beta_3\) represents how a 1-unit change in Z alters the relationship between X and Y.
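To make that concrete with some purely hypothetical numbers: if \(\hat{\beta_1} = 2\) and \(\hat{\beta_3} = -0.5\), then

\[ \frac{\partial \hat{Y}}{\partial X} = 2 - 0.5Z, \]

so when Z is 0 a one-unit increase in X increases \(\hat{Y}\) by 2, and when Z is 4 the effect of X is \(2 - 0.5(4) = 0\).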

Critically, interaction is completely symmetrical such that:

\[ \frac{\partial \hat{Y}}{\partial Z} = \hat{\beta_2} + \hat{\beta_3}X \]

To give an example, we can think about how the relationship between median income and income inequality is altered by the percent of a county that is white:

library(rio)
library(stargazer)
#> 
#> Please cite as:
#>  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
#>  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
acs <- import("https://github.com/marctrussler/IIS-Data/raw/main/ACSCountyData.csv")

acs$median.income <- acs$median.income/1000
acs$gini <- acs$gini*100

m4 <- lm(gini ~ median.income*percent.white, data=acs)
summary(m4)
#> 
#> Call:
#> lm(formula = gini ~ median.income * percent.white, data = acs)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -14.7470  -1.9791  -0.2637   1.8651  17.0992 
#> 
#> Coefficients:
#>                               Estimate Std. Error t value
#> (Intercept)                 52.5755392  0.8567078  61.369
#> median.income               -0.0365980  0.0178957  -2.045
#> percent.white               -0.0386569  0.0109061  -3.545
#> median.income:percent.white -0.0006808  0.0002248  -3.029
#>                             Pr(>|t|)    
#> (Intercept)                  < 2e-16 ***
#> median.income               0.040932 *  
#> percent.white               0.000399 ***
#> median.income:percent.white 0.002476 ** 
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.166 on 3137 degrees of freedom
#>   (1 observation deleted due to missingness)
#> Multiple R-squared:  0.249,  Adjusted R-squared:  0.2483 
#> F-statistic: 346.7 on 3 and 3137 DF,  p-value: < 2.2e-16

We can use the full regression equation and put in numbers for median income and percent white to see what the relationship looks like at different levels:

median.income <- seq(0,150,.01)

pred0 <- coef(m4)["(Intercept)"] +
  coef(m4)["median.income"]*median.income +
  coef(m4)["percent.white"]*0 +
  coef(m4)["median.income:percent.white"]*0*median.income
pred25 <- coef(m4)["(Intercept)"] +
  coef(m4)["median.income"]*median.income +
  coef(m4)["percent.white"]*25 +
  coef(m4)["median.income:percent.white"]*25*median.income
pred50 <- coef(m4)["(Intercept)"] +
  coef(m4)["median.income"]*median.income +
  coef(m4)["percent.white"]*50 +
  coef(m4)["median.income:percent.white"]*50*median.income
pred75 <- coef(m4)["(Intercept)"] +
  coef(m4)["median.income"]*median.income +
  coef(m4)["percent.white"]*75 +
  coef(m4)["median.income:percent.white"]*75*median.income
pred100 <- coef(m4)["(Intercept)"] +
  coef(m4)["median.income"]*median.income +
  coef(m4)["percent.white"]*100 +
  coef(m4)["median.income:percent.white"]*100*median.income

library(RColorBrewer)
#> Warning: package 'RColorBrewer' was built under R version
#> 4.1.3

cols <- brewer.pal(7, "Set2")

plot(acs$median.income, acs$gini, col="gray80")
points(median.income, pred0,lwd=2, type="l", col=cols[1])
points(median.income, pred25,lwd=2, type="l", col=cols[2])
points(median.income, pred50,lwd=2, type="l",  col=cols[3])
points(median.income, pred75,lwd=2, type="l",  col=cols[4])
points(median.income, pred100,lwd=2, type="l",  col=cols[5])
legend("topleft", c("0% White",
                    "25% White",
                    "50% White",
                    "75% White",
                    "100% White"),
       lty=rep(1,6), col=cols)

Alternatively we can just look at the marginal effect of one of the variables across the levels of the other:

#Manually
percent.white <- seq(0,100,.01)

margin <- coef(m4)["median.income"] + coef(m4)["median.income:percent.white"]*percent.white

plot(percent.white, margin, type="l", ylim=c(-0.15,0))
abline(h=0, lty=2)
rug(acs$percent.white)

#Via the Margins Package
library(margins)
#> Warning: package 'margins' was built under R version 4.1.3
cplot(m4, x="median.income",dx="percent.white", what="effect")
cplot(m4, dx="median.income",x="percent.white", what="effect")

15.9.2 Prediction

Prediction with regression is relatively straightforward: once we have estimated the alpha and the betas, we can plug whatever values we want into the regression equation and it will spit out a predicted value of Y.
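For instance, here is a minimal sketch reusing the interactive model m4 from above (the values of $60,000 for median income and 80% white are just hypothetical inputs): plugging the numbers into the regression equation by hand gives the same answer as R’s predict() function.

#Prediction "by hand" using the estimated coefficients from m4
coef(m4)["(Intercept)"] +
  coef(m4)["median.income"]*60 +
  coef(m4)["percent.white"]*80 +
  coef(m4)["median.income:percent.white"]*60*80

#The predict() function does the same calculation
predict(m4, newdata=data.frame(median.income=60, percent.white=80))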

We looked at this through the lens of making likely voter predictions.

We have previous year data where we know who voted and who didn’t, so we can build a model of voting:

library(rio)

cces <- import("https://github.com/marctrussler/IIS-Data/raw/main/CCES.csv")

table(cces$vote.intent)
#> 
#> Already voted         Maybe            No           Yes 
#>         16563         23491         15139        130199
cces$vote.intent <- factor(cces$vote.intent)
cces$vote.intent <- relevel(cces$vote.intent, ref="No")

cces.training <- cces[cces$year %in% c(2016,2018),]
cces.test <- cces[cces$year==2020,]

m2 <- lm(vv.turnout ~  vote.intent +prim.turnout + educ:white, data=cces.training)
summary(m2)
#> 
#> Call:
#> lm(formula = vv.turnout ~ vote.intent + prim.turnout + educ:white, 
#>     data = cces.training)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.04559 -0.46192  0.00272  0.43795  1.01201 
#> 
#> Coefficients: (1 not defined because of singularities)
#>                                 Estimate Std. Error t value
#> (Intercept)                     0.110132   0.004586  24.014
#> vote.intentAlready voted        0.500227   0.007369  67.882
#> vote.intentMaybe                0.163265   0.004999  32.657
#> vote.intentYes                  0.473932   0.004296 110.316
#> prim.turnout                    0.435233   0.002883 150.948
#> educcollege:whitenonwhite      -0.116630   0.005081 -22.953
#> educHS or less:whitenonwhite   -0.122144   0.005088 -24.009
#> educpostgrad:whitenonwhite     -0.087024   0.007057 -12.332
#> educsome college:whitenonwhite -0.080891   0.004382 -18.459
#> educcollege:whitewhite         -0.022178   0.003726  -5.953
#> educHS or less:whitewhite      -0.033830   0.003439  -9.838
#> educpostgrad:whitewhite        -0.022014   0.004334  -5.079
#> educsome college:whitewhite           NA         NA      NA
#>                                Pr(>|t|)    
#> (Intercept)                     < 2e-16 ***
#> vote.intentAlready voted        < 2e-16 ***
#> vote.intentMaybe                < 2e-16 ***
#> vote.intentYes                  < 2e-16 ***
#> prim.turnout                    < 2e-16 ***
#> educcollege:whitenonwhite       < 2e-16 ***
#> educHS or less:whitenonwhite    < 2e-16 ***
#> educpostgrad:whitenonwhite      < 2e-16 ***
#> educsome college:whitenonwhite  < 2e-16 ***
#> educcollege:whitewhite         2.64e-09 ***
#> educHS or less:whitewhite       < 2e-16 ***
#> educpostgrad:whitewhite        3.79e-07 ***
#> educsome college:whitewhite          NA    
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.4087 on 124436 degrees of freedom
#>   (152 observations deleted due to missingness)
#> Multiple R-squared:  0.3225, Adjusted R-squared:  0.3225 
#> F-statistic:  5386 on 11 and 124436 DF,  p-value: < 2.2e-16

If we have a dataset (here data from the same survey from a more recent year) that has the same variables in it, we can use the estimated coefficients from our “training” model to predict, for each person in the “test” data, their probability of voting:

cces.test$predict2 <- predict(m2, newdata=cces.test)
#> Warning in predict.lm(m2, newdata = cces.test): prediction
#> from a rank-deficient fit may be misleading
head(cces.test)
#>            V1 year vv.turnout         educ white female
#> 124601 124601 2020          1 some college white      0
#> 124602 124602 2020          0     postgrad white      1
#> 124603 124603 2020          0      college white      1
#> 124604 124604 2020          1      college white      1
#> 124605 124605 2020          0      college white      0
#> 124606 124606 2020          1 some college white      0
#>        agegrp   vote.intent pid3 prim.turnout   predict2
#> 124601  50-64 Already voted    2            1 1.04559165
#> 124602  65-74            No    1            0 0.08811756
#> 124603  65-74           Yes    3            0 0.56188595
#> 124604  50-64           Yes    1            1 0.99711855
#> 124605  50-64           Yes    3            0 0.56188595
#> 124606  50-64           Yes    2            0 0.58406435
table(cces.test$predict2)
#> 
#>  -0.0120123927780443 -0.00649795413115195 
#>                  762                  195 
#>   0.0231077649957248   0.0292410289131809 
#>                  128                  457 
#>   0.0763023968785974   0.0879535214186681 
#>                 1825                  174 
#>   0.0881175633502864    0.110131921968922 
#>                   70                  660 
#>    0.151252369038693    0.156766807685586 
#>                 1034                  257 
#>    0.186372526812462    0.192505790729918 
#>                  106                  693 
#>    0.239567158695335    0.251218283235406 
#>                 1897                  363 
#>    0.251382325167024     0.27339668378566 
#>                  129                  991 
#>    0.423220206162353    0.428734644809246 
#>                   17                    1 
#>    0.461920033301611    0.464473627853578 
#>                 1777                   10 
#>    0.467434471948504     0.48821473327331 
#>                 1455                  386 
#>    0.493729171920202     0.49704019107538 
#>                  335                  593 
#>    0.503173454992836    0.511534995818995 
#>                 2362                   32 
#>    0.523186120359066    0.523334891047079 
#>                   10                  183 
#>    0.523350162290684    0.529468154964535 
#>                    2                  599 
#>     0.54536452090932    0.550234822958253 
#>                   18                 4642 
#>    0.561885947498324    0.562049989429942 
#>                 3316                 1635 
#>    0.576529522929951    0.584064348048578 
#>                  996                 4639 
#>    0.586484967979091    0.588180647470022 
#>                   27                  769 
#>     0.58834468940164    0.591999406625983 
#>                  480                   20 
#>    0.610359048020276     0.62160512575286 
#>                 1033                    5 
#>    0.627738389670316    0.674799757635732 
#>                   48                   86 
#>    0.686450882175803    0.686614924107421 
#>                   59                   17 
#>    0.708629282726057    0.897152632242009 
#>                   65                  554 
#>    0.902667070888901    0.923447332213707 
#>                 1121                  239 
#>      0.9289617708606    0.932272790015778 
#>                  430                  674 
#>    0.938406053933234    0.958567489987476 
#>                 1487                  329 
#>    0.964700753904932     0.98546742189865 
#>                  578                 3216 
#>    0.997118546438721    0.997282588370339 
#>                 4019                 2739 
#>     1.01176212187035     1.01929694698898 
#>                 1087                 4561 
#>     1.02341324641042     1.02357728834204 
#>                 1617                 1283 
#>     1.04559164696067 
#>                 1652

We can classify each person as a likely voter or not depending on whether their predicted probability of voting is above or below 50%. From that we can determine how good a job this model does of predicting 2020 voters by looking at the proportion of people who are incorrectly classified, i.e. how many are predicted to vote but don’t, and how many are predicted not to vote but do?

#Classification Error
mean((cces.test$predict2>.5 & cces.test$vv.turnout==0) | (cces.test$predict2<.5 & cces.test$vv.turnout==1),na.rm=T)
#> [1] 0.2115221
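To see the two kinds of errors separately, a quick sketch (not something we ran in class) is to cross-tabulate the classification against actual turnout:

#Predicted likely voters (probability above .5) against actual turnout
table(predicted.voter = cces.test$predict2 > .5,
      actual.voter = cces.test$vv.turnout)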