Chapter 11 Panel Regression

Panel Regression is a technique used for data that has both a cross-sectional and a time series component. In other words, data where each individual/country/company/etc. in the data set is observed at multiple points in time. While there are panel versions of most of the models we have encountered thus far (e.g. logit/probit, Tobit, negbin, …), we will focus only on the linear model.

As usual, we begin with loading in some data to analyze. Additionally, this chapter will make use of the plm package, so install it if you have not done so already.

data(USSeatBelts)
data(Fatalities)
data(Municipalities)
data(USAirlines)
data(driving)
data(airfare)
data(murder)
data(NaturalGas)
data(HealthInsurance)
data(wagepan)
data(CigarettesSW)

11.1 Panel Data

With the exception of the chapter on Time Series, Chapter 10, nearly every data set we have dealt with has cross-sectional; each subject is observed once, and typically all subjects are observed at the same time. By contrast, panel data (sometimes referred to as longitudinal data) includes multiple observations of the same set of subjects, made at different points in time. A panel is said to be be balanced if you have an observation for every cross-sectional group for every time period in the data. Otherwise, the data is said to be unbalanced.

To see what a panel data set looks like, let’s look at an example of a panel data set, the airfare data from the wooldridge package.

This is a pretty huge data set, with 4596 observations over 14 variables. If we examine the description of the data using ?airfare, we can see the features of this data set that make it panel data. First, the variable $year$ indicates that the data is collected in each of 4 years; 1997, 1998, 1999, and 2000. Second, the data has an individual identifier, in this case $id$ . If we take a deeper look at the data and see that we have an observation for each $year \times id$ combination, then we have a panel data set. To that end, let’s take a quick look at the variables we will be looking at from the first 10 lines of data:

airfare %>% 
    dplyr::select(year, id, fare, dist, passen, concen) %>% 
    slice(1:10)

##    year id fare dist passen concen
## 1  1997  1  106  528    152 0.8386
## 2  1998  1  106  528    265 0.8133
## 3  1999  1  113  528    336 0.8262
## 4  2000  1  123  528    298 0.8612
## 5  1997  2  104  861    282 0.5798
## 6  1998  2  105  861    178 0.5817
## 7  1999  2  115  861    204 0.7319
## 8  2000  2  129  861    190 0.5386
## 9  1997  3  207  852    241 0.8180
## 10 1998  3  188  852    253 0.8172

To see that this is panel data, look at the first two columns. Observations 1-4 are for $id = 1$ for each of the 4 possible values of $year$ , observations 5-8 are the same for $id = 2$ , and so forth. We also have 4 other variables we will look at:

$fare$ - the average ticket price
$dist$ - the distance of each air route
$passen$ - the average daily passengers on each route
$concen$ - a measure of the market share of the biggest carrier

To begin using this data, we need to first transform it into a panel data set. This means explicitly telling R that $id$ and $year$ are the variables that define the cross-sectional and time series dimensions of our data, respectively. As we can see, R currently views them as numeric.

summary(airfare)

##       year            id            dist            passen      
##  Min.   :1997   Min.   :   1   Min.   :  95.0   Min.   :   2.0  
##  1st Qu.:1998   1st Qu.: 288   1st Qu.: 505.0   1st Qu.: 215.0  
##  Median :1998   Median : 575   Median : 861.0   Median : 357.0  
##  Mean   :1998   Mean   : 575   Mean   : 989.7   Mean   : 636.8  
##  3rd Qu.:1999   3rd Qu.: 862   3rd Qu.:1304.0   3rd Qu.: 717.0  
##  Max.   :2000   Max.   :1149   Max.   :2724.0   Max.   :8497.0  
##       fare          bmktshr           ldist            y98      
##  Min.   : 37.0   Min.   :0.1605   Min.   :4.554   Min.   :0.00  
##  1st Qu.:123.0   1st Qu.:0.4650   1st Qu.:6.225   1st Qu.:0.00  
##  Median :168.0   Median :0.6039   Median :6.758   Median :0.00  
##  Mean   :178.8   Mean   :0.6101   Mean   :6.696   Mean   :0.25  
##  3rd Qu.:225.0   3rd Qu.:0.7531   3rd Qu.:7.173   3rd Qu.:0.25  
##  Max.   :522.0   Max.   :1.0000   Max.   :7.910   Max.   :1.00  
##       y99            y00           lfare          ldistsq     
##  Min.   :0.00   Min.   :0.00   Min.   :3.611   Min.   :20.74  
##  1st Qu.:0.00   1st Qu.:0.00   1st Qu.:4.812   1st Qu.:38.75  
##  Median :0.00   Median :0.00   Median :5.124   Median :45.67  
##  Mean   :0.25   Mean   :0.25   Mean   :5.096   Mean   :45.28  
##  3rd Qu.:0.25   3rd Qu.:0.25   3rd Qu.:5.416   3rd Qu.:51.45  
##  Max.   :1.00   Max.   :1.00   Max.   :6.258   Max.   :62.57  
##      concen          lpassen      
##  Min.   :0.1605   Min.   :0.6931  
##  1st Qu.:0.4650   1st Qu.:5.3706  
##  Median :0.6039   Median :5.8777  
##  Mean   :0.6101   Mean   :6.0170  
##  3rd Qu.:0.7531   3rd Qu.:6.5751  
##  Max.   :1.0000   Max.   :9.0475

We can use the pdata.frame() command in the plm package to set $id$ and $year$ as our index variables.

airfarepanel <- pdata.frame(airfare, index = c("id","year"))

Now, when we look at how the $id$ and $year$ variables are stored in the airfare and airfarepanel objects, we see that R views them quite differently; now, it is treating them as factors in the airfarepanel object.

airfare %>% dplyr::select(id, year) %>% 
    summary(.)

##        id            year     
##  Min.   :   1   Min.   :1997  
##  1st Qu.: 288   1st Qu.:1998  
##  Median : 575   Median :1998  
##  Mean   : 575   Mean   :1998  
##  3rd Qu.: 862   3rd Qu.:1999  
##  Max.   :1149   Max.   :2000

airfarepanel %>% dplyr::select(id, year) %>% 
    summary(.)

##        id         year     
##  1      :   4   1997:1149  
##  2      :   4   1998:1149  
##  3      :   4   1999:1149  
##  4      :   4   2000:1149  
##  5      :   4              
##  6      :   4              
##  (Other):4572

Before we estimate any models with our panel data, let us first estimate an OLS model with lm() on the airfare data. The model we are estimating is:

$\begin{equation} fare_{i} = \alpha +\beta_1 distance_{i} + \beta_2 concentration_i +\beta_3 passengers_i \end{equation}$

reg1a <- lm(fare ~ dist + concen + passen, data = airfare)
stargazer(reg1a, type = "text")

## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                fare            
## -----------------------------------------------
## dist                         0.087***          
##                               (0.002)          
##                                                
## concen                       69.431***         
##                               (5.123)          
##                                                
## passen                       -0.005***         
##                               (0.001)          
##                                                
## Constant                     52.995***         
##                               (4.488)          
##                                                
## -----------------------------------------------
## Observations                   4,596           
## R2                             0.418           
## Adjusted R2                    0.418           
## Residual Std. Error     57.130 (df = 4592)     
## F Statistic         1,100.722*** (df = 3; 4592)
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

Interpreting this regression, we see that distance and market share are positively correlated with fare and the number of passengers is negatively correlated with fare.

11.2 Pooled OLS

We can estimate the same regression using our panel data set using the plm() (Panel Linear Model ) command. Because we already specified that the airfarepanel object is panel data, the plm() command only requires one more argument than the lm() command. This is an argument you should be used to by now, the model = argument. To estimate the same regression as a basic OLS model, we use the model = "pooling" option to estimate a pooled regression.

reg1b <- plm(fare ~ dist + concen + passen , data = airfarepanel, model = "pooling")
stargazer(reg1a, reg1b, column.labels = c("OLS", "Pooled"), type = "text")

## 
## ==========================================================
##                                  Dependent variable:      
##                            -------------------------------
##                                         fare              
##                                   OLS            panel    
##                                                  linear   
##                                   OLS            Pooled   
##                                   (1)             (2)     
## ----------------------------------------------------------
## dist                            0.087***        0.087***  
##                                 (0.002)         (0.002)   
##                                                           
## concen                         69.431***       69.431***  
##                                 (5.123)         (5.123)   
##                                                           
## passen                         -0.005***       -0.005***  
##                                 (0.001)         (0.001)   
##                                                           
## Constant                       52.995***       52.995***  
##                                 (4.488)         (4.488)   
##                                                           
## ----------------------------------------------------------
## Observations                     4,596           4,596    
## R2                               0.418           0.418    
## Adjusted R2                      0.418           0.418    
## Residual Std. Error        57.130 (df = 4592)             
## F Statistic (df = 3; 4592)    1,100.722***    1,100.722***
## ==========================================================
## Note:                          *p<0.1; **p<0.05; ***p<0.01

The results are identical! A pooled regression, therefore is just a fancy name for a regression that uses panel data, but doesn’t actually take into consideration the fact that we have the same subjects being observed over multiple time periods.

11.3 Between estimator

One way to take into consideration the fact that we have multiple observations of each subject is to “collapse” the data into group averages and estimate a linear model using that data.

In other words, consider the four observations associated with route 1:

airfare %>% 
  filter(id == 1)

##   year id dist passen fare bmktshr    ldist y98 y99 y00    lfare  ldistsq
## 1 1997  1  528    152  106  0.8386 6.269096   0   0   0 4.663439 39.30157
## 2 1998  1  528    265  106  0.8133 6.269096   1   0   0 4.663439 39.30157
## 3 1999  1  528    336  113  0.8262 6.269096   0   1   0 4.727388 39.30157
## 4 2000  1  528    298  123  0.8612 6.269096   0   0   1 4.812184 39.30157
##   concen  lpassen
## 1 0.8386 5.023880
## 2 0.8133 5.579730
## 3 0.8262 5.817111
## 4 0.8612 5.697093

What if we replace this with a single observation that is the average for this route over the 4 years:

airfare %>% 
  filter(id == 1) %>% 
  summarize_all(mean)

##     year id dist passen fare  bmktshr    ldist  y98  y99  y00    lfare
## 1 1998.5  1  528 262.75  112 0.834825 6.269096 0.25 0.25 0.25 4.716613
##    ldistsq   concen  lpassen
## 1 39.30157 0.834825 5.529454

If we perform this task for every value of $id$ and run a regression on this new dataset, we would wind up with what is called the between estimator. The between estimator is not commonly used in economics, but can be estimated using the plm() command as below:

reg1c <- plm(fare ~ dist + concen + passen , data = airfarepanel, model = "between") 
stargazer(reg1b, reg1c, column.labels = c("Pooled", "Between"), type = "text")

## 
## ==================================================================
##                               Dependent variable:                 
##              -----------------------------------------------------
##                                      fare                         
##                        Pooled                     Between         
##                          (1)                        (2)           
## ------------------------------------------------------------------
## dist                  0.087***                   0.089***         
##                        (0.002)                    (0.003)         
##                                                                   
## concen                69.431***                  77.244***        
##                        (5.123)                   (10.305)         
##                                                                   
## passen                -0.005***                   -0.004*         
##                        (0.001)                    (0.002)         
##                                                                   
## Constant              52.995***                  46.183***        
##                        (4.488)                    (8.950)         
##                                                                   
## ------------------------------------------------------------------
## Observations            4,596                      1,149          
## R2                      0.418                      0.444          
## Adjusted R2             0.418                      0.443          
## F Statistic  1,100.722*** (df = 3; 4592) 305.152*** (df = 3; 1145)
## ==================================================================
## Note:                                  *p<0.1; **p<0.05; ***p<0.01

To verify that the results are the same as collapsing the data manually, we can use group_by() and summarize_all() in dplyr to get to the same place:

reg1ca <- airfare %>% 
  group_by(id) %>% 
  summarize_all(mean) %>% 
  lm(fare ~ dist + concen + passen, data = .)

stargazer(reg1c, reg1ca, column.labels = c("Between (plm)", "Between (manual)"), type = "text")

## 
## ===========================================================
##                                  Dependent variable:       
##                            --------------------------------
##                                          fare              
##                                panel            OLS        
##                               linear                       
##                            Between (plm)  Between (manual) 
##                                 (1)             (2)        
## -----------------------------------------------------------
## dist                         0.089***         0.089***     
##                               (0.003)         (0.003)      
##                                                            
## concen                       77.244***       77.244***     
##                              (10.305)         (10.305)     
##                                                            
## passen                        -0.004*         -0.004*      
##                               (0.002)         (0.002)      
##                                                            
## Constant                     46.183***       46.183***     
##                               (8.950)         (8.950)      
##                                                            
## -----------------------------------------------------------
## Observations                   1,149           1,149       
## R2                             0.444           0.444       
## Adjusted R2                    0.443           0.443       
## Residual Std. Error                      54.328 (df = 1145)
## F Statistic (df = 3; 1145)  305.152***       305.152***    
## ===========================================================
## Note:                           *p<0.1; **p<0.05; ***p<0.01

11.4 Fixed effects

The most “popular” panel model in economics is the Fixed Effects, or Within, estimator. The equation for this model looks like:

$\begin{equation} fare_{i} = \alpha + \upsilon_j+\beta_1 distance_{i} + \beta_2 concentration_i +\beta_3 passengers_i \end{equation}$

In this equation, $\upsilon_j$ is a separate constant term for each of the $j$ cross-sectional groups in your data; here, therefore, it says that we should estimate a different constant term for each air route.

There are two ways to conceptualize the Fixed Effects model:

The fixed effects model simply adds a whole lot of dummy variables, one for each of your cross-sectional groups. For this data, as there 1149 different routes, there would be 1149 dummies added to the regression.
The Fixed effects model calculates the within-group mean for each variable, calculates the difference between each observation and its within-group mean, and runs the regression using these differences. This is where it gets its alternate name of the Within model from.

It turns out that both of these ways to conceptualize the fixed effects model are the same, even though they may not seem like it. The fixed effects estimator can be obtained using the model = "within" option in the plm() command:

reg1d <- plm(fare ~ dist + concen + passen , data = airfare, index = c("id","year"), model = "within") 
stargazer(reg1b, reg1d, column.labels = c("Pooled", "Fixed"), type = "text")

## 
## ==================================================================
##                               Dependent variable:                 
##              -----------------------------------------------------
##                                      fare                         
##                        Pooled                      Fixed          
##                          (1)                        (2)           
## ------------------------------------------------------------------
## dist                  0.087***                                    
##                        (0.002)                                    
##                                                                   
## concen                69.431***                   -2.752          
##                        (5.123)                    (5.375)         
##                                                                   
## passen                -0.005***                  -0.051***        
##                        (0.001)                    (0.003)         
##                                                                   
## Constant              52.995***                                   
##                        (4.488)                                    
##                                                                   
## ------------------------------------------------------------------
## Observations            4,596                      4,596          
## R2                      0.418                      0.092          
## Adjusted R2             0.418                     -0.212          
## F Statistic  1,100.722*** (df = 3; 4592) 173.880*** (df = 2; 3445)
## ==================================================================
## Note:                                  *p<0.1; **p<0.05; ***p<0.01

While the between and pooled estimators looked very similar, the fixed effect model looks very different. The first thing we might note is that there is no estimate for the $dist$ variable. Recall the second conceptualization of the fixed effects method above, that it is calculating within-group variation and using that to estimate the model. However, $dist$ is time invariant within each route – the distances of the routes do not change from year to year. Since they do not change, the $dist$ variable is basically a column full of zeroes, and that can’t be included in a regression. One of the drawbacks of a fixed effect regression is that it is not very useful if one of your variables of interest is time invariant.

The other thing we might note is that there is constant term, why not? Recall the first conceptualization of the fixed effect model above, that it is the same thing as running a regression with a whole slew of dummy variables. Typically, in such cases, we would omit one dummy variable, because having a full set of dummies would be collinear with the constant term. Instead, we could simply omit the constant term, which is what is going on here.

We can also demonstrate the equivalence of the fixed effects model with a dummy variable model pretty easily. Let’s estimate the same model using lm() and adding factor(id) as one of our variables – factor(id) tells R to treat $id$ as a factor variable so it will make 1149 dummies and estimate that model. This is far more computationally intensive than the plm() command, and displaying result with stargazer or summary() will probably be thousands of lines long, so let’s just look at the coefficients from the OLS regression and compare it to the fixed effects model above.

reg1e <- lm(fare ~ dist + concen + passen +  factor(id), data = airfare)
reg1e$coefficients[1:4]

## (Intercept)        dist      concen      passen 
## 55.35835180  0.13707225 -2.75163793 -0.05113363

When compared to the fixed effects estimates, we can see that the estimated coefficients for $concen$ and $passen$ are identical between the two models.

11.5 Random effects

Next, we will look at the Random Effects estimator. The model being estimated looks like:

$\begin{equation} fare_{i} = \alpha +\beta_1 distance_{i} + \beta_2 concentration_i +\beta_3 passengers_i +\epsilon_i + \omega_j \end{equation}$

In this model, $\omega_j$ is a separate error term for each of the $j$ cross-sectional groups in your data; here, therefore, it is a different error term for each air route. If you are thinking this looks very similar to the fixed effects model, you are right. What differentiates this from the fixed effects model is that $\omega_j$ is assumed to be normally distributed; we can test this assumption using the Hausman test, which we will see a bit later.

reg1f <- plm(fare ~ dist + concen + passen, data = airfarepanel, model = c("random"))
stargazer(reg1b, reg1d, reg1f, column.labels = c("Pooled", "Fixed", "Random"), type = "text")

## 
## ===============================================================================
##                                     Dependent variable:                        
##              ------------------------------------------------------------------
##                                             fare                               
##                        Pooled                      Fixed              Random   
##                          (1)                        (2)                (3)     
## -------------------------------------------------------------------------------
## dist                  0.087***                                       0.077***  
##                        (0.002)                                       (0.003)   
##                                                                                
## concen                69.431***                   -2.752            17.642***  
##                        (5.123)                    (5.375)            (4.868)   
##                                                                                
## passen                -0.005***                  -0.051***          -0.021***  
##                        (0.001)                    (0.003)            (0.002)   
##                                                                                
## Constant              52.995***                                     105.629*** 
##                        (4.488)                                       (5.193)   
##                                                                                
## -------------------------------------------------------------------------------
## Observations            4,596                      4,596              4,596    
## R2                      0.418                      0.092              0.179    
## Adjusted R2             0.418                     -0.212              0.179    
## F Statistic  1,100.722*** (df = 3; 4592) 173.880*** (df = 2; 3445) 1,002.162***
## ===============================================================================
## Note:                                               *p<0.1; **p<0.05; ***p<0.01

11.6 Model testing

The typical “workflow” in econometrics is to estimate the pooled, random effects, and fixed Effects models and then do a couple of tests to determine which model is the best.

First, we test whether or not the random effects model is better than the pooled model using the Lagrange Multiplier Test and the plmtest() command.

plmtest(reg1b)

## 
##  Lagrange Multiplier Test - (Honda) for balanced panels
## 
## data:  fare ~ dist + concen + passen
## normal = 72.238, p-value < 2.2e-16
## alternative hypothesis: significant effects

Recall that reg1b was the regression object from the pooled model. The null hypothesis is that the pooled model is the best model; because the p-value is less than .05, we conclude that the random effects model is preferred to the pooled model.

Next, we test whether or not the fixed effects model is preferred to the random effects model using the Hausman Test via the phtest() command:

phtest(reg1d, reg1f)

## 
##  Hausman Test
## 
## data:  fare ~ dist + concen + passen
## chisq = 227.11, df = 2, p-value < 2.2e-16
## alternative hypothesis: one model is inconsistent

In this test, the null hypothesis is that the random effects model is preferred to the fixed effects model. We see that our p-value is less than .05, so we conclude that the fixed effects model is actually preferred to the random effects model.

11.7 First difference modeling

One final type of panel model to look at is the First Difference model. The concept is related to the idea of differencing that we discussed in Chapter (timeseries). This model first-differences the variables within each group and estimates the regression using the resulting values. Because we have 4 time periods, we can calculate 3 first differences for each group:

1997 to 1998
1998 to 1999
1999 to 2000

Thus, the model we are estimating is:

$\begin{equation} \Delta fare_{i} = \alpha +\beta_1 \Delta distance_{i} + \beta_2 \Delta concentration_i +\beta_3 \Delta passengers_i +\epsilon_i \end{equation}$

Where we take the Greek letter Delta ( $\Delta$ ) to mean “change in”. So we are asking whether or not we can explain the year-to-year change in fare by the year-to-year change in distance (which will of course zero out, as it did with the fixed effects estimator), the year-to-year change in market concentration, and the year-to-year change in passengers. This model can be estimated using the model = "fd" option in the plm() command:

reg1g <- plm(fare ~ dist + concen + passen, data = airfarepanel, model = "fd")
stargazer(reg1b, reg1d, reg1f, reg1g, column.labels = c("Pooled", "Fixed", "Random", "First Diff."), type = "text")

## 
## =========================================================================================================
##                                                  Dependent variable:                                     
##              --------------------------------------------------------------------------------------------
##                                                          fare                                            
##                        Pooled                      Fixed              Random           First Diff.       
##                          (1)                        (2)                (3)                 (4)           
## ---------------------------------------------------------------------------------------------------------
## dist                  0.087***                                       0.077***                            
##                        (0.002)                                       (0.003)                             
##                                                                                                          
## concen                69.431***                   -2.752            17.642***             7.492          
##                        (5.123)                    (5.375)            (4.868)             (4.743)         
##                                                                                                          
## passen                -0.005***                  -0.051***          -0.021***           -0.086***        
##                        (0.001)                    (0.003)            (0.002)             (0.003)         
##                                                                                                          
## Constant              52.995***                                     105.629***          6.783***         
##                        (4.488)                                       (5.193)             (0.362)         
##                                                                                                          
## ---------------------------------------------------------------------------------------------------------
## Observations            4,596                      4,596              4,596               3,447          
## R2                      0.418                      0.092              0.179               0.201          
## Adjusted R2             0.418                     -0.212              0.179               0.200          
## F Statistic  1,100.722*** (df = 3; 4592) 173.880*** (df = 2; 3445) 1,002.162*** 432.210*** (df = 2; 3444)
## =========================================================================================================
## Note:                                                                         *p<0.1; **p<0.05; ***p<0.01

As expected, the $dist$ variable had to be dropped due to the fact that it is time invariant within group. It is also noteworthy that the number of observations fell from 4596 to 3447 – this is a natural consequence of the process of first-differencing the data; if there are 4 time periods, there can only be 3 first-differences. Because of the way the data ha been transformed, the coefficients need to be interpreted with respect to the fact that the regression was estimated on first-differenced data.

11.8 Further Examples

Let’s analyze another panel data set. If you are pretty sure you aren’t interested in a first-differenced model, the basic workflow is fairly simple:

Identify cross-sectional and time series variables in the data.
Estimate pooled, random, and fixed effects, and interpret the results.
Perform tests to see which model is preferred model

Let’s begin with the AER:USSeatBelts data. First, we will create the new object called seatbeltpanel that specifies our index() for the data.

seatbeltpanel <- pdata.frame(USSeatBelts, index = c("state","year"))

Next, we estimate the 3 basic panel models:

reg2a <- plm(fatalities ~ seatbelt + speed65 + speed70 + drinkage + alcohol + income + age + enforce, data = seatbeltpanel, model = "pooling")
reg2b <- plm(fatalities ~ seatbelt + speed65 + speed70 + drinkage + alcohol + income + age + enforce, data = seatbeltpanel, model = "random")
reg2c <- plm(fatalities ~ seatbelt + speed65 + speed70 + drinkage + alcohol + income + age + enforce, data = seatbeltpanel, model = "within")
stargazer(reg2a, reg2b, reg2c, type = "text", column.labels = c("Pooled","Random Eff.", "Fixed Eff"))

## 
## ==============================================================================
##                                       Dependent variable:                     
##                  -------------------------------------------------------------
##                                           fatalities                          
##                          Pooled          Random Eff.         Fixed Eff        
##                            (1)               (2)                (3)           
## ------------------------------------------------------------------------------
## seatbelt                 -0.001           -0.007***          -0.008***        
##                          (0.002)           (0.001)            (0.001)         
##                                                                               
## speed65yes               0.0002             -0.001            -0.001*         
##                         (0.0005)           (0.0003)           (0.0004)        
##                                                                               
## speed70yes              0.002***           0.001***           0.001***        
##                          (0.001)           (0.0003)           (0.0003)        
##                                                                               
## drinkageyes              -0.001            0.00003             0.0001         
##                          (0.001)           (0.001)            (0.001)         
##                                                                               
## alcoholyes              -0.002***         -0.001***          -0.001***        
##                         (0.0005)           (0.0004)           (0.0004)        
##                                                                               
## income                 -0.00000***       -0.00000***        -0.00000***       
##                         (0.00000)         (0.00000)          (0.00000)        
##                                                                               
## age                      -0.0001           -0.00004            0.0003         
##                         (0.0001)           (0.0002)           (0.0004)        
##                                                                               
## enforceprimary           0.002**           0.002**            0.002**         
##                          (0.001)           (0.001)            (0.001)         
##                                                                               
## enforcesecondary         0.001*             0.0003             0.0002         
##                          (0.001)           (0.0004)           (0.0004)        
##                                                                               
## Constant                0.038***           0.035***                           
##                          (0.004)           (0.008)                            
##                                                                               
## ------------------------------------------------------------------------------
## Observations               556               556                556           
## R2                        0.515             0.656              0.672          
## Adjusted R2               0.507             0.650              0.633          
## F Statistic      64.530*** (df = 9; 546) 1,020.572*** 113.010*** (df = 9; 496)
## ==============================================================================
## Note:                                              *p<0.1; **p<0.05; ***p<0.01

All told, these are extremely consistent results. They suggest that traffic fatalities are higher with higher speed limits, but are lower in states with more strict alcohol laws, higher rates of seat belt usage, and higher incomes.

Finally, we estimate the Lagarange Multiplier and Hausman tests.

plmtest(reg2a)

## 
##  Lagrange Multiplier Test - (Honda) for unbalanced panels
## 
## data:  fatalities ~ seatbelt + speed65 + speed70 + drinkage + alcohol +  ...
## normal = 33.362, p-value < 2.2e-16
## alternative hypothesis: significant effects

The Lagrange Multiplier test indicates that the random effects model is preferred to the pooled model.

phtest(reg2b, reg2c)

## 
##  Hausman Test
## 
## data:  fatalities ~ seatbelt + speed65 + speed70 + drinkage + alcohol +  ...
## chisq = 15.719, df = 9, p-value = 0.07298
## alternative hypothesis: one model is inconsistent

The Hausman test has a p-value of 0.07, so it’s right on the edge of being significant. If I were writing a paper and saw this, I’d just report both and argue that the result is not sensitive to the choice of random vs fixed effects!

Finally, let’s take a look at the CigarettesSW data from the AER package. As usual, we start with creating a panel data set; the cross sectional and time series variables are $state$ and $year$ , respectively. Additionally, we can do a bit of data manipulation to convert the $income$ variable into per capita terms and convert the variables measured in dollar terms from nominal to real by dividing them by $cpi$ :

cigspanel <- pdata.frame(CigarettesSW, index = c("state","year"))
cigspanel$income <-cigspanel$income/cigspanel$population
cigspanel$income <- cigspanel$income/cigspanel$cpi
cigspanel$tax <- cigspanel$tax/cigspanel$cpi
cigspanel$price <- cigspanel$price/cigspanel$cpi

Since there are two years, we may be interested to look at regressions for each year individually; let’s use dplyr and the filter() command to run OLS regressions on the two subsets.

regcig1 <-cigspanel %>% 
  filter(year == 1985) %>% 
  lm(packs ~ income + price + tax, data = .)
regcig2 <-cigspanel %>% 
  filter(year == 1995) %>% 
  lm(packs ~ income + price + tax, data = .)
stargazer(regcig1, regcig2, type = "text", column.labels = c("1985","1995"))

## 
## ==========================================================
##                                   Dependent variable:     
##                               ----------------------------
##                                          packs            
##                                    1985          1995     
##                                    (1)            (2)     
## ----------------------------------------------------------
## income                            2.649          1.903    
##                                  (1.614)        (1.566)   
##                                                           
## price                           -1.584***      -1.210**   
##                                  (0.459)        (0.553)   
##                                                           
## tax                               0.306          0.208    
##                                  (0.722)        (0.822)   
##                                                           
## Constant                        231.498***    206.421***  
##                                  (35.056)      (39.127)   
##                                                           
## ----------------------------------------------------------
## Observations                        48            48      
## R2                                0.294          0.416    
## Adjusted R2                       0.245          0.376    
## Residual Std. Error (df = 44)     18.423        18.785    
## F Statistic (df = 3; 44)         6.096***      10.443***  
## ==========================================================
## Note:                          *p<0.1; **p<0.05; ***p<0.01

Next, we can estimate first difference, between, and pooled models:

regcig3 <- plm(packs ~ income + price + tax, data =cigspanel, model = "fd")
regcig4 <- plm(packs ~ income + price + tax, data =cigspanel, model = "between")
regcig5 <- plm(packs ~ income + price + tax, data =cigspanel, model = "pooling")
stargazer(regcig3, regcig4, regcig5, 
          type = "text", 
          column.labels = c("First-Diff","Between","Pooled"))

## 
## ================================================================================
##                                      Dependent variable:                        
##              -------------------------------------------------------------------
##                                             packs                               
##                    First-Diff              Between                Pooled        
##                       (1)                    (2)                   (3)          
## --------------------------------------------------------------------------------
## income               -0.987                2.711*                2.273**        
##                     (1.861)                (1.596)               (1.091)        
##                                                                                 
## price              -0.627***              -1.637***             -1.342***       
##                     (0.222)                (0.538)               (0.212)        
##                                                                                 
## tax                  -0.263                 0.553                 0.313         
##                     (0.348)                (0.813)               (0.399)        
##                                                                                 
## Constant             -7.376              231.138***             212.963***      
##                     (4.705)               (37.099)               (14.509)       
##                                                                                 
## --------------------------------------------------------------------------------
## Observations           48                    48                     96          
## R2                   0.562                  0.361                 0.517         
## Adjusted R2          0.532                  0.318                 0.502         
## F Statistic  18.801*** (df = 3; 44) 8.294*** (df = 3; 44) 32.887*** (df = 3; 92)
## ================================================================================
## Note:                                                *p<0.1; **p<0.05; ***p<0.01

Note that the first-differenced model and between models only have 48 observations; this is because the data is transformed in such a way as to lose observations for each cross-sectional unit. The first-difference model looks at the difference in each state between 1995 and 1985, whereas the between model looks at each state’s average over the two years.

This is essentially a demand curve estimation; one might wonder why there be such a big difference between the estimated coefficient on $price$ between the first-difference model and the others. The first difference model suggests that, for every $1 increase in price within any state, the consumption of packs consumed fell by 0.63. The pooled model suggests that the effect is much bigger, and an additional dollar in price should reduce consumption by more than double that amount (-1.342)!

Which of these is more likely to be correct? Probably the first-differenced model; there are likely to be some state specific factors that influence smoking behavior, and first differencing is a way to take those into account. For example, of the top 5 states for cigarette consumption in 1985, 3 are still in the top 5 10 years later:

CigarettesSW %>% 
  select(year, state, packs) %>% 
  filter(year == 1985) %>% 
  slice_max(packs,n=5)

##    year state    packs
## 28 1985    NH 197.9940
## 15 1985    KY 186.0352
## 25 1985    NC 155.2838
## 44 1985    VT 145.2830
## 7  1985    DE 143.8511

CigarettesSW %>% 
  select(year, state, packs) %>% 
  filter(year == 1995) %>% 
  slice_max(packs,n=5)

##    year state    packs
## 63 1995    KY 172.6478
## 76 1995    NH 156.3367
## 61 1995    IN 134.2583
## 55 1995    DE 124.4666
## 70 1995    MO 122.4503

If you look at the bottom 5 states, 4 of them are still in the bottom 5 a decade later:

CigarettesSW %>% 
  select(year, state, packs) %>% 
  filter(year == 1985) %>% 
  slice_min(packs,n=5)

##    year state     packs
## 42 1985    UT  68.04626
## 30 1985    NM  88.74218
## 45 1985    WA  96.22813
## 4  1985    CA 100.36304
## 11 1985    ID 103.01811

CigarettesSW %>% 
  select(year, state, packs) %>% 
  filter(year == 1995) %>% 
  slice_min(packs,n=5)

##    year state    packs
## 90 1995    UT 49.27220
## 52 1995    CA 56.85931
## 78 1995    NM 64.66887
## 93 1995    WA 65.53092
## 80 1995    NY 70.81732

In other words, there is something persistent about smoking within each state, so including some way of keeping state specific details in the regression should improve the results.

Moving on, we should estimate the random effects and fixed effects models as well:

regcig6 <- plm(packs ~ income + price + tax, data =cigspanel, model = "within")
regcig7 <- plm(packs ~ income + price + tax, data =cigspanel, model = "random")
regcig5 <- plm(packs ~ income + price + tax, data =cigspanel, model = "pooling")
stargazer(regcig6, regcig7, regcig5, 
          type = "text", 
          column.labels = c("Fixed Eff.","Random Eff.","Pooled"))

## 
## =======================================================================
##                                 Dependent variable:                    
##              ----------------------------------------------------------
##                                        packs                           
##                    Fixed Eff.        Random Eff.         Pooled        
##                        (1)               (2)              (3)          
## -----------------------------------------------------------------------
## income               -2.711*           -0.164           2.273**        
##                      (1.525)           (1.073)          (1.091)        
##                                                                        
## price               -0.845***         -1.101***        -1.342***       
##                      (0.175)           (0.132)          (0.212)        
##                                                                        
## tax                  -0.006             0.286            0.313         
##                      (0.311)           (0.256)          (0.399)        
##                                                                        
## Constant                             221.417***        212.963***      
##                                        (9.096)          (14.509)       
##                                                                        
## -----------------------------------------------------------------------
## Observations           96                96                96          
## R2                    0.918             0.845            0.517         
## Adjusted R2           0.827             0.840            0.502         
## F Statistic  168.271*** (df = 3; 45) 503.402***  32.887*** (df = 3; 92)
## =======================================================================
## Note:                                       *p<0.1; **p<0.05; ***p<0.01

As with the first-differenced model, both the fixed effects and random effects models allow for state specific effects to exist in the model. To determine which of these models is preferred, we can hit these with some more tests.

plmtest(regcig5)

## 
##  Lagrange Multiplier Test - (Honda) for balanced panels
## 
## data:  packs ~ income + price + tax
## normal = 5.9797, p-value = 1.118e-09
## alternative hypothesis: significant effects

The Lagrange Multiplier test indicates that the random effects model is preferred to the pooled model.

phtest(regcig6, regcig7)

## 
##  Hausman Test
## 
## data:  packs ~ income + price + tax
## chisq = 6.1643, df = 3, p-value = 0.1039
## alternative hypothesis: one model is inconsistent

The Hausman test is insignificant, indicating that the random effects model is preferred.

11.9 Wrapping Up

Panel methods, particularly Fixed Effects models, are incredibly important in economics because they give the ability to control for individual specific unobservable characteristics.

11.10 End of Chapter Exercises

Panel Data: For each of the following, examine the data (and the help file for the data) to identify the cross-sectional and time components of the panel data. Use the plm function to build a good multivariate model. Think carefully about the variables you have to choose among to explain your dependent variable with. Estimate between and first-differenced models. Estimate the pooled, random effects, and fixed effects model, and execute and interpret the appropriate tests to identify which of these models are best. Finally, interpret your regression results.

AER:Fatalities - This is a similar sort of dataset to the USSeatbelts data above.
AER:Municipalities This is an interesting Swedish dataset of city taxes and spending.
AER:USAirlines - Small dataset that looks at cost of production.
wooldridge:driving - This is another driving fatality dataset.
wooldridge:crime4 - Crime data is often a good place to look at first-difference models.
wooldridge:wagepan - Looking at wages in a panel often provides very different conclusions than just in a cross sectional dataset.
AER:NaturalGas - This could be a good dataset for estimating a demand function.