Chapter 9 Using Fixed Effects Models to Fight Endogeneity in Panel Data and Difference–in–Difference Models

In this chapter we will learn to deal with panel data in R. Panel data are data that include observations in and through time. Panel data combine aspects of cross–sectional data with time–series data. The libraries necessary for this chapter are:

library(tidyverse)
library(magrittr)
library(broom)
library(estimatr)
library(carData)

9.1 Simpson’s Paradox

Simpson’s paradox - Simpson (1951) is phenomenon where an apparent relationship between two variables reverses itself when the data are dis-aggregated. For example, let’s look at the admissions rate for men and women in the University of California at Berkeley admissions data.

UCBAdmissions is a cross-tabulation of 4526 applicants by 3 variables: Admit, Gender, and Dept, the number of observations for each is n stored as 3-dimensional array.

UCBAdmissions
, , Dept = A

          Gender
Admit      Male Female
  Admitted  512     89
  Rejected  313     19

, , Dept = B

          Gender
Admit      Male Female
  Admitted  353     17
  Rejected  207      8

, , Dept = C

          Gender
Admit      Male Female
  Admitted  120    202
  Rejected  205    391

, , Dept = D

          Gender
Admit      Male Female
  Admitted  138    131
  Rejected  279    244

, , Dept = E

          Gender
Admit      Male Female
  Admitted   53     94
  Rejected  138    299

, , Dept = F

          Gender
Admit      Male Female
  Admitted   22     24
  Rejected  351    317

To calculate admission rates, we need to create a new variable, apps, that is the sum of admitted and rejected apps for both men and women.

UCBAdmissions %>% 
  as_tibble() %>% # convert the table to a data frame
  group_by(Dept, Gender) %>%  # allows us to sum admitted and rejected by department
  mutate(apps = sum(n)) %>% # create number of applicants by department
  ungroup() %>% # return the full data frame
  filter(Admit == "Admitted") %>% # select only those applicants admitted
  group_by(Gender) %>% # allows us to calculate acceptance rates by gender
  summarize(rate = sum(n)/sum(apps))
# A tibble: 2 x 2
  Gender  rate
  <chr>  <dbl>
1 Female 0.304
2 Male   0.445

Males are accepted at rate of 44.5% while females are accepted at lower rate of 30.4%.

UCBAdmissions %>% 
  as_tibble() %>% 
  group_by(Dept, Gender) %>%  
  mutate(apps = sum(n)) %>% 
  ungroup() %>% 
  filter(Admit == "Admitted") %>% 
  group_by(Dept, Gender) %>% 
  summarize(n/apps)
# A tibble: 12 x 3
# Groups:   Dept [6]
   Dept  Gender `n/apps`
   <chr> <chr>     <dbl>
 1 A     Female   0.824 
 2 A     Male     0.621 
 3 B     Female   0.68  
 4 B     Male     0.630 
 5 C     Female   0.341 
 6 C     Male     0.369 
 7 D     Female   0.349 
 8 D     Male     0.331 
 9 E     Female   0.239 
10 E     Male     0.277 
11 F     Female   0.0704
12 F     Male     0.0590

We now see that females are admitted at higher rates to four of the six departments.

9.2 Figures 8.1-8.3

We see a similar effect in Figures 8.1-8.3 in the text. We can reproduce those graphs with the code below. The crime data set contains observations on 19 variables from 58 cities over the period 1972 to 1993. First choose observations for only the California cities of Fresno, Los Angeles, Oakland, Sacramento, and San Francisco. Next convert the robbery and police to numbers per 1000 persons. The data frame crime contains the data.# the %in% operator means match the elements in one vector with elements in another.

crime %>% 
  select(cityname, policesworn, robbery, popcity) %>% # choose relevant variables
  filter(cityname %in% c("fresno", "losangel", "oakland", "sacramen", "sanfran")) %>% # choose relevant cities
  mutate(robbery=robbery/popcity*1000, policesworn = policesworn/popcity*1000) %>% # convert to per 1000
  ggplot(aes(x = policesworn, y = robbery)) +
  geom_point(na.rm = T) + 
  geom_smooth(method = lm, na.rm = T, se = F) + 
  xlab("Police per 1000 People") + 
  ylab("Robberies per 1000 People") +
  labs(caption = "Figure 8.1: Robberies and Police for Large Cities in California") + # create caption
  theme(plot.caption = element_text(hjust = 0)) # left justify the caption

crime %>% 
  select(cityname, policesworn, robbery, popcity) %>% 
  filter(cityname %in% c("fresno", "losangel", "oakland", "sacramen", "sanfran")) %>% 
  mutate(robbery=robbery/popcity*1000, policesworn = policesworn/popcity*1000) %>% 
  ggplot(aes(x = policesworn, y = robbery, color = cityname)) +
  geom_point(na.rm = T) + 
  xlab("Police per 1000 People") + 
  ylab("Robberies per 1000 People") +
  labs(caption = "Figure 8.2: Robberies and Police for Specified Cities in California") + 
  theme(plot.caption = element_text(hjust = 0), legend.position = "none") + # remove legend
  # place city names with corresponding colors.
  annotate(geom = "text", x = 1.6, y = 10, label = "Oakland", col = "#00BF7D") + 
  annotate(geom = "text", x = 2, y = 5, label = "Sacramento", col = "#00B0F6") + 
  annotate(geom = "text", x = 2.58, y = 4.5, label = "Los Angeles", col = "#A3A500") + 
  annotate(geom = "text", x = 2.7, y = 7.8, label = "San Francisco", col = "#E76BF3") + 
  annotate(geom = "text", x = 1.25, y = 3.5, label = "Fresno", col = "#F8766D")

crime %>% 
  select(cityname, policesworn, robbery, popcity) %>% 
  filter(cityname %in% c("fresno", "losangel", "oakland", "sacramen", "sanfran")) %>% 
  mutate(robbery=robbery/popcity*1000, policesworn = policesworn/popcity*1000) %>% 
  ggplot(aes(x = policesworn, y = robbery, color = cityname)) +
  geom_point(na.rm = T) + 
  xlab("Police per 1000 People") + 
  ylab("Robberies per 1000 People") +
  labs(caption = "Figure 8.3: Robberies and Police for Specified Cities in California with City-Specific Regression Lines") + 
  theme(plot.caption = element_text(hjust = 0), legend.position = "none") + 
  annotate(geom = "text", x = 1.6, y = 10, label = "Oakland", col = "#00BF7D") + 
  annotate(geom = "text", x = 2, y = 5, label = "Sacramento", col = "#00B0F6") + 
  annotate(geom = "text", x = 2.58, y = 4.5, label = "Los Angeles", col = "#A3A500") + 
  annotate(geom = "text", x = 2.7, y = 7.8, label = "San Francisco", col = "#E76BF3") + 
  annotate(geom = "text", x = 1.25, y = 3.5, label = "Fresno", col = "#F8766D") +
  geom_smooth(method = "lm", se = F) # add regression lines. the addition of the color aesthetic will cause geom_smooth to add regression lines for each "color"

9.3 One-Way Fixed Effects Models

9.3.1 LSDV Approach

The least squares dummy variable approach allows us to account for the fixed effects by including a dummy variable for each unit. First, let’s calculate the pooled model.

crime %>% 
  select(cityname, policesworn, robbery, popcity) %>% 
  filter(cityname %in% c("fresno", "losangel", "oakland", "sacramen", "sanfran")) %>% 
  mutate(robbery=robbery/popcity*1000, policesworn = policesworn/popcity*1000) %$%
  lm(robbery ~ policesworn) %>% 
  tidy()
# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     3.48     1.05       3.31 0.00129 
2 policesworn     1.76     0.509      3.46 0.000771

We can see that the coefficient on the police variable is positive and significantly different than zero.

To apply LSDV approach in R, we add cityname as an explanatory variable. Since cityname is a character vector, R will treat it as a factor.

crime %>% 
  select(cityname, policesworn, robbery, popcity) %>% 
  filter(cityname %in% c("fresno", "losangel", "oakland", "sacramen", "sanfran")) %>% 
  mutate(robbery=robbery/popcity*1000, policesworn = policesworn/popcity*1000) %$%
  lm(robbery ~ policesworn + cityname) 

Call:
lm(formula = robbery ~ policesworn + cityname)

Coefficients:
     (Intercept)       policesworn  citynamelosangel   citynameoakland  
           10.93             -4.16              6.60              5.96  
citynamesacramen   citynamesanfran  
            1.63              8.32  

We can confirm that below.

crime %>% 
  select(cityname, policesworn, robbery, popcity) %>% 
  filter(cityname %in% c("fresno", "losangel", "oakland", "sacramen", "sanfran")) %>% 
  mutate(robbery=robbery/popcity*1000, 
         policesworn = policesworn/popcity*1000,
         cityname = as_factor(cityname)) %$% # coerce cityname to a factor
  lm(robbery ~ policesworn + cityname) 

Call:
lm(formula = robbery ~ policesworn + cityname)

Coefficients:
     (Intercept)       policesworn  citynamelosangel   citynameoakland  
           10.93             -4.16              6.60              5.96  
citynamesacramen   citynamesanfran  
            1.63              8.32  

The equation for each city is: \[\text{Fresno: }Robbery = 8.79-2.75Police\] \[\text{Los Angeles: }Robbery = 17.53-2.75Police\] \[\text{Oakland: }Robbery = 16.89-2.75Police\] \[\text{Sacramento: }Robbery = 12.56-2.75Police\] \[\text{San Francisco: }Robbery = 19.25-2.75Police\] We see the effect of Simpson’s Paradox in the slope variable here. The slope variable is now negative and significant. It should be noted that these results are not consistent with Figure 8.3. Here we have only one slope coefficient with five different intercepts; Figure 8.3 shows five different slope coefficients along with five different intercepts. We can return results consistent with Figure 8.3 as below. We can show the equation for each of the five cities by adding the coefficient on the dummy variable to the intercept with the base case being Fresno23

crime %>% 
  select(cityname, policesworn, robbery, popcity) %>% 
  filter(cityname %in% c("fresno", "losangel", "oakland", "sacramen", "sanfran")) %>% 
  mutate(robbery=robbery/popcity*1000, 
         policesworn = policesworn/popcity*1000,
         cityname = as_factor(cityname)) %$% # coerce cityname to a factor
  lm(robbery ~ policesworn * cityname) 

Call:
lm(formula = robbery ~ policesworn * cityname)

Coefficients:
                 (Intercept)                   policesworn  
                       8.786                        -2.754  
            citynamelosangel               citynameoakland  
                      19.679                         3.720  
            citynamesacramen               citynamesanfran  
                       6.522                         4.442  
policesworn:citynamelosangel   policesworn:citynameoakland  
                      -6.038                         0.992  
policesworn:citynamesacramen   policesworn:citynamesanfran  
                      -2.940                         0.923  

Now the equation for each city requires that we add the slope dummy coefficient to the intercept coefficient and the interaction coefficient to the coefficient on policesworn. So the equation for each city is:24

\[\text{Fresno: }Robbery = 8.79-2.75Police\] \[\text{Los Angeles: }Robbery=28.46-8.79Police\] \[\text{Oakland: }Robbery=12.51-1.76Police\] \[\text{Sacramento: }Robbery=15.31-5.69Police\] \[\text{San Francisco: }Robbery=13.23-1.83Police\]

The above equation are consistent with the regression lines in Figure 8.3.

9.3.2 F-test for significance of fixed effects.

The unrestricted model is given by:\[Y_{it}=\beta_0+\beta_1X_{1it}+\beta_2D_{1i}+\beta_3D_{2i}+\cdots+\beta_PD_{P-1,i}+\nu_{it}\] To test for the significance of fixed effects we test the following hypothesis: \[H_0:\beta_2=\beta_3=\cdots=\beta_P\] \[H_1: \text{@ least one }\beta\ne0\] As in Chapter 5, we will make use of linearHypothesis from the car package.

library(car)
crime %>% 
  select(cityname, policesworn, robbery, popcity) %>% 
  filter(cityname %in% c("fresno", "losangel", "oakland", "sacramen", "sanfran")) %>% 
  mutate(robbery=robbery/popcity*1000, 
         policesworn = policesworn/popcity*1000,
         cityname = as_factor(cityname)) %$% # coerce cityname to a factor
  lm(robbery ~ policesworn + cityname) %>% 
  linearHypothesis(c("citynamelosangel = 0", # use the variable names from the lm object
                     "citynameoakland = 0", 
                     "citynamesacramen = 0", 
                     "citynamesanfran = 0" ))
Linear hypothesis test

Hypothesis:
citynamelosangel = 0
citynameoakland = 0
citynamesacramen = 0
citynamesanfran = 0

Model 1: restricted model
Model 2: robbery ~ policesworn + cityname

  Res.Df RSS Df Sum of Sq    F              Pr(>F)    
1    108 571                                          
2    104 194  4       377 50.6 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the reported F-stat is 50.626 with a p-value of 0, we will reject the null hypothesis of no fixed effects in favor of the alternative suggesting that fixed effects exist.

9.4 De-Meaned approach

9.4.1 Manually De-Mean

We can estimate the fixed-model with a de-meaned approach with the model: \[Y_{it}-\bar Y_{i.}=\beta_1(X_{it}-\bar X_{i.})\]

scale will de-mean the data with the argument scale = F. Learn more about scale by calling ?scale. Do de-mean the data by city will use group_by in our pipeline to group the data by city, then we will mutate the crime and police variables with scale to de-mean them. We should end up the same estimate of the slope coefficient from the LSDV approach.

crime %>% 
  select(cityname, policesworn, robbery, popcity) %>% 
  filter(cityname %in% c("fresno", "losangel", "oakland", "sacramen", "sanfran")) %>% 
  mutate(robbery=robbery/popcity*1000, 
         policesworn = policesworn/popcity*1000,
         cityname = as_factor(cityname)) %>%
  group_by(cityname) %>% 
  mutate(robbery = scale(robbery, scale = F),
         policesworn = scale(policesworn, scale = F)) %$%
  lm(robbery ~ policesworn) 

Call:
lm(formula = robbery ~ policesworn)

Coefficients:
          (Intercept)            policesworn  
 0.000000000000000609  -4.160101927269632682  

The slope coefficient is the same as the slope coefficient estimated by LSDV.

9.4.2 Using the plm package

We can estimate the fixed effects model with plm from the plm package. The plm package was created to make the estimation of linear panel models straightforward. To learn more read the vignette(plmPackage). To estimate the one-way fixed effects model with plm, we need four argumentsformula, data, index, and model. The formula and data arguments are the same as those in the lm call. index is a vector of the units and the type of variation is invoked with model. We estimate the model below:

library(plm)  
crime %>% 
  select(cityname, policesworn, robbery, popcity) %>% 
  filter(cityname %in% c("fresno", "losangel", "oakland", "sacramen", "sanfran")) %>% 
  mutate(robbery=robbery/popcity*1000, 
         policesworn = policesworn/popcity*1000,
         cityname = as_factor(cityname)) -> # the %$% pipe does not function with plm
  cali # the modified data are assigned to the object cali
  plm(robbery ~ policesworn, data = cali, index = "cityname", model = "within")

Model Formula: robbery ~ policesworn

Coefficients:
policesworn 
      -4.16 

Again, we get the same estimate of the slope coefficient.

9.5 Two-Way Fixed Effects Models

The two-way fixed effects model is given by:\[Y_{it}=\beta_0+\beta_1X_{1it}+\alpha_i+\tau_t+\nu_{it}\]

So we need to incorporate time into the one-way fixed effects model. This can be accomplished in one of two ways. Time can be treated as a factor (dummy variable) or set the effect in plm to "twoways". The results will be the same.

9.5.1 Time as a factor

crime %>% 
  select(cityname, policesworn, robbery, popcity, year) %>% 
  filter(cityname %in% c("fresno", "losangel", "oakland", "sacramen", "sanfran")) %>% 
  mutate(robbery=robbery/popcity*1000, 
         policesworn = policesworn/popcity*1000,
         cityname = as_factor(cityname)) -> # the %$% pipe does not function with plm
  cali # the modified data are assigned to the object cali
  plm(robbery ~ policesworn + factor(year), data = cali, index = "cityname", model = "within")

Model Formula: robbery ~ policesworn + factor(year)

Coefficients:
   policesworn factor(year)72 factor(year)73 factor(year)74 factor(year)75 
        -1.945         -0.493         -0.220         -0.158          0.647 
factor(year)76 factor(year)77 factor(year)78 factor(year)79 factor(year)80 
         0.872          0.789          1.476          1.545          2.831 
factor(year)81 factor(year)82 factor(year)83 factor(year)84 factor(year)85 
         2.509          1.940          1.151          0.782          1.005 
factor(year)86 factor(year)87 factor(year)88 factor(year)89 factor(year)90 
         1.361          0.109          0.140          0.574          1.432 
factor(year)91 factor(year)92 
         2.394          3.373 

9.5.2 effect = “twoways”

crime %>% 
  select(cityname, policesworn, robbery, popcity, year) %>% 
  filter(cityname %in% c("fresno", "losangel", "oakland", "sacramen", "sanfran")) %>% 
  mutate(robbery=robbery/popcity*1000, 
         policesworn = policesworn/popcity*1000,
         cityname = as_factor(cityname)) -> # the %$% pipe does not function with plm
  cali # the modified data are assigned to the object cali
  plm(robbery ~ policesworn, data = cali, index = "cityname", model = "within", effect = "twoways")

Model Formula: robbery ~ policesworn

Coefficients:
policesworn 
      -1.94 

As expected, the coefficient on the police variable is the same in each case.

9.6 Difference-in-Difference Models

In 1992 New Jersey raised it’s minimum wage from $4.25 to $5.05 while neighboring Pennsylvania did not. We can use a difference-in-difference model to investigate the effect of the treatment (increase in minimum wage) on the effect full time employment. The PoEdata25 package contains a data set named njmin3 that has 820 observations on 14 variables, call ?njmin for more information.

Estimate the basic model \[fte_{it}=\beta_0+\beta_1nj_i+\beta_2d_i+\beta_3(nj_i\times d_i) + \epsilon_{it}\] where \(fte_i\) is full-time equivalent employees, \(nj_i\) is the treatment26, and \(d_i\) is the after dummy27. Since \(\beta_3\) is the difference in differences of treated and control states, test the hypothesis: \[H_0:\beta_3=0\]\[H_1:\beta_3\ne0\]

# Call the following only once.
# install.packages("devtools") # required to insall GitHub packages do this only once
# devtools::install_git("https://github.com/ccolonescu/PoEdata") # install the package from GitHub 
library(PoEdata)
data(njmin3)
njmin3 %$%
  lm(fte ~ nj*d) %>% 
  summary()

Call:
lm(formula = fte ~ nj * d)

Residuals:
   Min     1Q Median     3Q    Max 
-21.17  -6.44  -1.03   4.47  64.56 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)    23.33       1.07   21.77 <0.0000000000000002 ***
nj             -2.89       1.19   -2.42               0.016 *  
d              -2.17       1.52   -1.43               0.154    
nj:d            2.75       1.69    1.63               0.103    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.41 on 790 degrees of freedom
  (26 observations deleted due to missingness)
Multiple R-squared:  0.0074,	Adjusted R-squared:  0.00363 
F-statistic: 1.96 on 3 and 790 DF,  p-value: 0.118

At the \(\alpha=.05\) level of significance the t-statistic with 790 degrees of freedom is \(\pm\) \(1.963\). The calculated t-statistic is 1.631 so we fail to reject the null hypothesis and conclude that there is no evidence to suggest that the change in the minimum wage changed full-time employment.

We control for other variables below

njmin3 %$%
  lm(fte ~ nj*d + co_owned) %>% 
  summary()

Call:
lm(formula = fte ~ nj * d + co_owned)

Residuals:
   Min     1Q Median     3Q    Max 
-22.06  -6.34  -1.06   4.56  66.29 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept)   24.293      1.093   22.24 < 0.0000000000000002 ***
nj            -2.939      1.184   -2.48              0.01323 *  
d             -2.234      1.503   -1.49              0.13759    
co_owned      -2.645      0.696   -3.80              0.00015 ***
nj:d           2.820      1.674    1.68              0.09255 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.33 on 789 degrees of freedom
  (26 observations deleted due to missingness)
Multiple R-squared:  0.0253,	Adjusted R-squared:  0.0203 
F-statistic: 5.11 on 4 and 789 DF,  p-value: 0.000454

  1. The base case can be changed from the default with appropriate arguments see the forcats package for more.↩︎

  2. Please note that not all of the coefficients are significant at the 5% level. This is ignored in the equations derived for expository purposes. In fact, we can see that the slope coefficients for Oakland, Sacramento, and San Francisco are not significantly different from the slope coefficient for Fresno, since each of those interaction effects are not significant.↩︎

  3. The PoEdata package is not housed at CRAN, instead it is house at GitHub, so installing it requires an extra step.↩︎

  4. \(nj_i\) takes the value 1 for New Jersey where the minimum wage was increased and the value 0 for Pennsylvania where the minimum wage was not changed↩︎

  5. \(d_1\) takes the value 1 after the minimum wage is changed and the value 0 before the change.↩︎