6 Regression

This section covers regression analysis in the fixest package, written by Laurent Berge. While R has built-in regression functions (lm()), it also has many packages for regression analysis. Fixest is my favorite of these package for three main reasons.

Fixest is flexible. With a single function (feols), you can implement multivariate regression, fixed effects regressions, and even instrumental variables. It also has functionality for all of the common standard errors. You’ll rarely need any other regression package
Fixest is fast. This mainly matters when you’re amalyzing large datasets or implementing fixed effects regressions.¹⁹
Fixest is pretty. The etable() function is quick and easy way to make attractive, well-formatted output tables that summarize your results.

This section gives only a sparse introduction to fixest. For details, see the excellent documentation.

6.1 Fixest syntax

For our guiding examples, let’s use the bechdel dataset from fivethirtyeight. This dataset contains several variables about movie revenues and whether the movies pass the Bechdel test.²⁰

library(fixest)
library(dplyr)
library(fivethirtyeight)

data('bechdel')

### Data processing
bechdel = bechdel %>% 
  select(year, title, budget_2013, domgross_2013, clean_test) %>%
  mutate(clean_test = as.character(clean_test)) # get rid of factor labels

bechdel$random_number = runif(nrow(bechdel)) # random number

bechdel = na.omit(bechdel) #(drops all observations with NA values)

print(head(bechdel))

## # A tibble: 6 × 6
##    year title            budget_2013 domgross_2013 clean_test random_number
##   <int> <chr>                  <int>         <dbl> <chr>              <dbl>
## 1  2013 21 & Over           13000000      25682380 notalk            0.0616
## 2  2012 Dredd 3D            45658735      13611086 ok                0.170 
## 3  2013 12 Years a Slave    20000000      53107035 notalk            0.322 
## 4  2013 2 Guns              61000000      75612460 notalk            0.327 
## 5  2013 42                  40000000      95020213 men               0.600 
## 6  2013 47 Ronin           225000000      38362475 men               0.841

Let’s implement a simple regression with domgross as the dependent variable and budget as the independent variable.

The main function in fixest is feols(). feols() accepts two key arguments and several options.

The first key argument is data, which is the dataset of interest. The second key argument is fml, which is the regression formula. When you write the fml argument, you use the syntax y ~ x. The dependent variable is to the left of the ~ symbol, the independent variables are to the right of the ~ symbol. All of the variables in your formula must be variables in the main dataset.

results = feols(
  data = bechdel,
  fml = domgross_2013 ~ budget_2013
)
summary(results)

## OLS estimation, Dep. Var.: domgross_2013
## Observations: 1,776 
## Standard-errors: IID 
##                 Estimate   Std. Error  t value  Pr(>|t|)    
## (Intercept) 3.615293e+07 3.781955e+06  9.55932 < 2.2e-16 ***
## budget_2013 1.056040e+00 4.822900e-02 21.89629 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 111,733,102.8   Adj. R2: 0.212318

I use the summary() function to take an initial glance at the results. You can see the main regression statistics: coefficient estimates, standard errors, p-values, and an R-squred.

To implement a regression with multiple independent variables, simply add them (with +) to the right hand side of the ~ symbol in your fml argument. It is just like writing an equation!

results = feols(
  data = bechdel,
  fml = domgross_2013 ~ budget_2013 + random_number
)
summary(results)

## OLS estimation, Dep. Var.: domgross_2013
## Observations: 1,776 
## Standard-errors: IID 
##                   Estimate   Std. Error   t value   Pr(>|t|)    
## (Intercept)   3.505952e+07 5.987777e+06  5.855181 5.6653e-09 ***
## budget_2013   1.056380e+00 4.826400e-02 21.887515  < 2.2e-16 ***
## random_number 2.122016e+06 9.007658e+06  0.235579 8.1379e-01    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 111,731,354.2   Adj. R2: 0.211898

Often, you want to understand how a categorical variable is related to your dependent variable. In this case, you might want to understand how revenues correlate with the code for whether the film passed the bechdel test.

print(unique(bechdel$clean_test)) # prints unique values of clean_test variable

## [1] "notalk"  "ok"      "men"     "nowomen" "dubious"

In your econometrics class, you hopefully learned that you should create a set of dummy variables, one for each value of the categorical except for a left-out group, to conduct this analysis. Luckily for us, fixest automaticaly implements this sort of analysis if you include a character variable or a factor on the right hand side of a regression.

results = feols(
  data = bechdel,
  fml = domgross_2013 ~ budget_2013 + clean_test
)
summary(results)

## OLS estimation, Dep. Var.: domgross_2013
## Observations: 1,776 
## Standard-errors: IID 
##                        Estimate   Std. Error   t value   Pr(>|t|)    
## (Intercept)        4.234370e+07 9.862962e+06  4.293203 1.8565e-05 ***
## budget_2013        1.039180e+00 4.887000e-02 21.264264  < 2.2e-16 ***
## clean_testmen     -3.143348e+06 1.238325e+07 -0.253839 7.9965e-01    
## clean_testnotalk   2.801307e+06 1.063853e+07  0.263317 7.9234e-01    
## clean_testnowomen -8.766933e+06 1.338242e+07 -0.655108 5.1248e-01    
## clean_testok      -1.125089e+07 1.023483e+07 -1.099275 2.7180e-01    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 111,565,230.5   Adj. R2: 0.212908

clean_testmen is a dummy variable taking value 1 if clean_test == 'men' and 0 otherwise. The other clean_test variables are defined accordingly. For interpretation, always make sure you know the identity of the leave-out group. In this case, it is the clean_test == 'dubious' group.

To alter the leave-out group, we can make clean_test into a factor, and list the desired leave-out group first. In this case, we specify 'nowomen' as the leave-out group.

bechdel = bechdel %>% 
  mutate(clean_test =
           factor(clean_test, 
                  labels = c('nowomen', 'notalk', 'men', 'dubious', 'ok'))
         )

results = feols(
  data = bechdel,
  fml = domgross_2013 ~ budget_2013 + clean_test
)
summary(results)

## OLS estimation, Dep. Var.: domgross_2013
## Observations: 1,776 
## Standard-errors: IID 
##                        Estimate   Std. Error   t value   Pr(>|t|)    
## (Intercept)        4.234370e+07 9.862962e+06  4.293203 1.8565e-05 ***
## budget_2013        1.039180e+00 4.887000e-02 21.264264  < 2.2e-16 ***
## clean_testnotalk  -3.143348e+06 1.238325e+07 -0.253839 7.9965e-01    
## clean_testmen      2.801307e+06 1.063853e+07  0.263317 7.9234e-01    
## clean_testdubious -8.766933e+06 1.338242e+07 -0.655108 5.1248e-01    
## clean_testok      -1.125089e+07 1.023483e+07 -1.099275 2.7180e-01    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 111,565,230.5   Adj. R2: 0.212908

6.1.1 Accessing estimates

Sometimes, we want to do further computations with our model estimates. For example, to access coefficients from feols() result, use the following syntax: result$coefficients['variable_name']. Here is an example.

key_estimate = results$coefficients['budget_2013']
print(key_estimate)

## budget_2013 
##    1.039177

We created a variable key_estimate that contains a single coefficient estimate. You can modify it or use it however you like.

6.1.2 Fixed effects

For this section, I follow the example in Econometrics in R chapter 10. We use the dataset Fatalities from the AER package. That dataset contains traffic fatalities per year for each state, as well as several state-year level covariates. We will study the association between beer taxes and traffic fatalities.

library(fixest)
library(AER)
data(Fatalities)

Fatalities = Fatalities %>%
  mutate(
    fatal_rate_per_10K = fatal/pop*10000,
    year = as.numeric(year)
  ) %>%
  select(state, year, fatal_rate_per_10K, mormon, baptist, 
         youngdrivers, miles, beertax, pop)

head(Fatalities)

##   state year fatal_rate_per_10K  mormon baptist youngdrivers    miles  beertax     pop
## 1    al    1            2.12836 0.32829 30.3557     0.211572 7233.887 1.539379 3942002
## 2    al    2            2.34848 0.34341 30.3336     0.210768 7836.348 1.788991 3960008
## 3    al    3            2.33643 0.35924 30.3115     0.211484 8262.990 1.714286 3988992
## 4    al    4            2.19348 0.37579 30.2895     0.211140 8726.917 1.652542 4021008
## 5    al    5            2.66914 0.39311 30.2674     0.213400 8952.854 1.609907 4049994
## 6    al    6            2.71859 0.41123 30.2453     0.215527 9166.302 1.560000 4082999

To include fixed effects, we only need to alter the fml argument to feols(). Any variables that we include after the | “vertical pipe” character are interpreted by R as fixed effects. See examples below.

out_year_as_variable = feols(
  data = Fatalities, 
  fml = fatal_rate_per_10K ~ beertax + year
  )

out_fe = feols(
  data = Fatalities, 
  fml = fatal_rate_per_10K ~ beertax | year
  )

print(out_year_as_variable)

## OLS estimation, Dep. Var.: fatal_rate_per_10K
## Observations: 336 
## Standard-errors: IID 
##             Estimate Std. Error   t value   Pr(>|t|)    
## (Intercept) 1.826302   0.074682 24.454353  < 2.2e-16 ***
## beertax     0.365631   0.062287  5.870088 1.0532e-08 ***
## year        0.006620   0.014860  0.445503 6.5625e-01    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.541954   Adj. R2: 0.088461

print(out_fe)

## OLS estimation, Dep. Var.: fatal_rate_per_10K
## Observations: 336 
## Fixed-effects: year: 7
## Standard-errors: Clustered (year) 
##         Estimate Std. Error t value   Pr(>|t|)    
## beertax 0.366336   0.048247 7.59295 0.00027154 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.540533     Adj. R2: 0.079412
##                  Within R2: 0.094538

Note the difference between the first table and the second. In the first regression, we are including year as a continuous numeric variable. R is trying to find the coefficients that best fit a model where the fatality rate depends linearly on both the beer tax and the year variable. In the second regression, we are including year as a fixed effect. R is making a set of dummy variables, one for each year except the leave-out group, and regressing the dependent variable on those. These are substantively different analyses and they result in different coefficient estimates on beertax.

As a rule of thumb, when you are working with panel data (i.e, when there are multpile units that appear in every year), you typically want to be implementing the second, fixed effects regression, rather than the first regression that only adjusts for a linear time trend.

You can also include multiple fixed effects. Add them to the formula on the right side of the |, just like you would for another independent variable.

out_twofe = feols(
  data = Fatalities, 
  fml = fatal_rate_per_10K ~ beertax | year + state
  )
print(out_twofe)

## OLS estimation, Dep. Var.: fatal_rate_per_10K
## Observations: 336 
## Fixed-effects: year: 7,  state: 48
## Standard-errors: Clustered (year) 
##         Estimate Std. Error  t value  Pr(>|t|)    
## beertax -0.63998   0.163126 -3.92322 0.0077731 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.171819     Adj. R2: 0.891425
##                  Within R2: 0.036065

Sometimes, you will want to use more exotic fixed effects, such as state times year effects.²¹ In these cases, visit the documentation, and control-F “interaction” for a comprehensive guide.

6.1.3 Alternative standard errors

For robust standard errors (as required throughout E3), include the option vcov = 'hetero'.

out_univariate_robust = feols(
  data = Fatalities, 
  fml = fatal_rate_per_10K ~ beertax,
  vcov = 'hetero'
  )

To specify cluster-robust standard errors, use the cluster = option.

out_univariate_cluster = feols(
  data = Fatalities, 
  fml = fatal_rate_per_10K ~ beertax,
  cluster = ~state
  )

For AR(1) standard errors (as required for the Gas Tax problem set in E3), you need to first tell R which dimension of your data is the “time” variable. You can do this with the panel() function, as below. Then, include the option vcov = NW(1) in your call to feols().²²

Fatalities = panel(Fatalities, panel.id = ~state + year)

out_univariate_AR = feols(
  data = Fatalities, 
  fml = fatal_rate_per_10K ~ beertax,
  vcov = NW(1)
  )

6.1.4 Weights

For weighted regression, include the option weights. Note that you again use ~variable syntax for this input.

out_univariate_robust_weighted = feols(
  data = Fatalities, 
  fml = fatal_rate_per_10K ~ beertax,
  vcov = 'hetero',
  weights = ~pop
  )

6.2 Making tables with `etable`

The summary() outputs are quick to glance at, but for writeups it’s good to have tables that are easier for others to read. Enter etable(), which is the simplest flexible table formatting function I’ve found. It only works with fixest regressions.

result_list = list(out_univariate_robust, out_univariate_robust_weighted, 
                   out_univariate_AR, out_fe)

setwd('./_book') # special line for image to render on website... will not matter for you.
etable(
  result_list,
  style.tex = style.tex('aer'), 
  markdown = TRUE # special line for image to render on website
  )

The first argument to etable is always a fixest result or a list of fixest results. This first argument is the only necessary input – all the other arguments specify table style.

6.2.1 Formatting

There are lots of ways to customize your tables – for example, you can see above that I’m asking the etable to follow the style guide of the American Economic Review. The fixest documentation contains a whole page about etable.

I will flag one important option: specifying your row names. You can see above that raw variable names appear in the table. That makes sense – R doesn’t know what else to call the variables! But when you present tables, you should make them easier to read. etable lets you change labels by setting a named character vector as a dictionary. You can make a dictionary like this.

dict_for_table = c(
  'beertax' = 'Beer Tax ($)',
  'fatal_rate_per_10K' = 'Fatality rate (deaths per 10K)'
)

The variable names are on the left side, and your labels are on the right side.

Then, when you run etable, you can pass your dictionary into the dict argument.

setwd('./_book') # special line for image to render on website... will not matter for you.
etable(
  result_list,
  style.tex = style.tex('aer'), 
  dict = dict_for_table,
  markdown = TRUE # special line for image to render on website
  )

6.3 Instrumental variables

To estimate an IV regression, you again just have to change the fml input. fixest uses the following syntax: y ~ x_exog | FEs | x_endog ~ z, where…

y is the dependent variable
x_exog are the exogenous variables (i.e, the variables you’re NOT instrumenting for)
FEs are the fixed effects
x_endog are the endogenous variables (i.e, the variables you ARE instrumenting for)
z are your instruments.

Essentially, everything is the same as before, except to do an IV you add another | and specify the first stage relationship to the right of that |. I give an example below, creating a random number to be the (fake) instrument.

setwd('./_book') # special line for image to render on website

Fatalities$random_number = runif(nrow(Fatalities))

res_IV = feols(
  data = Fatalities, 
  fml = fatal_rate_per_10K ~ 1|state+year|beertax~ random_number,
  )

etable(
  res_IV,
  style.tex = style.tex('aer'), 
  markdown = TRUE # special option for image to render on website
  )

Note that because I did not have any exogenous RHS variables, I simply included the constant “1” as x_exog. This is a common trick – “1” denotes, “There are no variables in this part of the formula.” Similarly, if you want to do an IV analysis but you do not have any fixed effects, you can include a “1” after the first | divider.

Here is a shortcut to have etable show both the first and second stage results simultaneously.

setwd('./_book') # special option for image to render on website

etable(
  res_IV, stage = 1:2,
  style.tex = style.tex('aer'), 
  markdown = TRUE # special line for image to render on website
  )

There are several other shortcuts in fixest that you will discover as you become more familiar with it. You will rarely need another regression package.

6.4 Event studies

Event study plots are an extremely common visualization of results difference-in-difference style analyses. For details on event studies, see the relevant section of the Cunningham book, the discussion of dynamic difference in differences here, or this NBER working paper.

feols includes built-in tools that allow you to make attractive event study plots very quickly. For demonstration, feols includes a synthetic dataset called base_did with the following variables.

y an outcome variable (like “log house price” for an analysis of the treatment effects of Superfund cleanups on local house prices as in Greenstone and Gallagher (2008))
id is a unit id (a census tract id in the example)
period is a time id (a year in the example)
treat is an indicator for whether the unit got treated (an indicator for whether the tract got a cleaned in the example). In the example, the treatment takes place after period 5.

Many event study analyses include two plots. First, there’s a plot of the raw means over time for both groups.

library(ggplot2)
base_did = base_did %>% 
  mutate(f.treat = factor(treat, labels = c('Treated', 'Control')))
ggplot(data = base_did, 
       aes(x = period, y = y,  group = f.treat, color = f.treat)) + 
  stat_summary_bin() +
  labs(x = 'Period', color = 'Group') +
  geom_vline(xintercept = 5.5, color = 'black', linetype = 'dashed')

## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`

Next, the analyst plots the difference between the treated and control units over time, after controlling for unit fixed effects, period fixed effects, and potentially other within-unit variables that change over time. Typically these differences are comptuted with a regression:

est_estud = feols(y ~ i(period, treat, ref = 5) | id+period, base_did)

This line says: “Regress y on unit fixed effects (id), time fixed effects (period), and then a set of period $\times$ treatment indicators.” The function i(), which is part of the fixest package, creates these period $\times$ treatment indicators. Specifically, it creates variables that are equal to 1 if and only if the observation is in a given period and the treat variable is 1. You must always specify one period to be the left-out or “reference” group. Traditionally, this is the period immediately before treatment, so here I specify it with ref = 5.

Here’s what the regression output looks like in a table.

setwd('./_book') # special line for image to render on website... will not matter for you.
etable(
  est_estud,
  style.tex = style.tex('aer'), 
  dict = dict_for_table,
  markdown = TRUE # special line for image to render on website
  )

Note that there is no “treat $\times$ period 5” variable, because we specified it as the left-out group.

To visualize the results, we use the wonderful ggiplot package.

require(ggiplot)
ggiplot(
  est_estud, main = 'Event Study Example') +
  labs(y = 'Effect (point estimate and 95% confidence interval', x = 'Period')+
  theme_bw()

You can customize the look of your ggiplot just like it is a ggplot (by adding more layers). For more options, check out the vignettes on the website.

For E3 students, fixest will deliver time savings on the climate mortality problem set.↩︎
The relevant fivethirtyeight article explains more.↩︎
This comes up in the Climate Mortality PSet in E3, for example.↩︎
The NW stands for “Newey-West,” which is a way to compute AR standard errors.↩︎