Chapter 11 Panel Regression
Panel Regression is a technique used for data that has both a cross-sectional and a time series component. In other words, data where each individual/country/company/etc. in the data set is observed at multiple points in time. While there are panel versions of most of the models we have encountered thus far (e.g. logit/probit, Tobit, negbin, …), we will focus only on the linear model.
As usual, we begin with loading in some data to analyze. Additionally, this chapter will make use of the plm
package, so install it if you have not done so already.
data(USSeatBelts)
data(Fatalities)
data(Municipalities)
data(USAirlines)
data(driving)
data(airfare)
data(murder)
data(NaturalGas)
data(HealthInsurance)
data(wagepan)
data(CigarettesSW)
11.1 Panel Data
With the exception of the chapter on Time Series, Chapter 10, nearly every data set we have dealt with has cross-sectional; each subject is observed once, and typically all subjects are observed at the same time. By contrast, panel data (sometimes referred to as longitudinal data) includes multiple observations of the same set of subjects, made at different points in time. A panel is said to be be balanced if you have an observation for every cross-sectional group for every time period in the data. Otherwise, the data is said to be unbalanced.
To see what a panel data set looks like, let’s look at an example of a panel data set, the airfare
data from the wooldridge
package.
This is a pretty huge data set, with 4596 observations over 14 variables. If we examine the description of the data using ?airfare
, we can see the features of this data set that make it panel data. First, the variable \(year\) indicates that the data is collected in each of 4 years; 1997, 1998, 1999, and 2000. Second, the data has an individual identifier, in this case \(id\). If we take a deeper look at the data and see that we have an observation for each \(year \times id\) combination, then we have a panel data set. To that end, let’s take a quick look at the variables we will be looking at from the first 10 lines of data:
%>%
airfare ::select(year, id, fare, dist, passen, concen) %>%
dplyrslice(1:10)
## year id fare dist passen concen
## 1 1997 1 106 528 152 0.8386
## 2 1998 1 106 528 265 0.8133
## 3 1999 1 113 528 336 0.8262
## 4 2000 1 123 528 298 0.8612
## 5 1997 2 104 861 282 0.5798
## 6 1998 2 105 861 178 0.5817
## 7 1999 2 115 861 204 0.7319
## 8 2000 2 129 861 190 0.5386
## 9 1997 3 207 852 241 0.8180
## 10 1998 3 188 852 253 0.8172
To see that this is panel data, look at the first two columns. Observations 1-4 are for \(id = 1\) for each of the 4 possible values of \(year\), observations 5-8 are the same for \(id = 2\), and so forth. We also have 4 other variables we will look at:
- \(fare\) - the average ticket price
- \(dist\) - the distance of each air route
- \(passen\) - the average daily passengers on each route
- \(concen\) - a measure of the market share of the biggest carrier
To begin using this data, we need to first transform it into a panel data set. This means explicitly telling R that \(id\) and \(year\) are the variables that define the cross-sectional and time series dimensions of our data, respectively. As we can see, R currently views them as numeric.
summary(airfare)
## year id dist passen
## Min. :1997 Min. : 1 Min. : 95.0 Min. : 2.0
## 1st Qu.:1998 1st Qu.: 288 1st Qu.: 505.0 1st Qu.: 215.0
## Median :1998 Median : 575 Median : 861.0 Median : 357.0
## Mean :1998 Mean : 575 Mean : 989.7 Mean : 636.8
## 3rd Qu.:1999 3rd Qu.: 862 3rd Qu.:1304.0 3rd Qu.: 717.0
## Max. :2000 Max. :1149 Max. :2724.0 Max. :8497.0
## fare bmktshr ldist y98
## Min. : 37.0 Min. :0.1605 Min. :4.554 Min. :0.00
## 1st Qu.:123.0 1st Qu.:0.4650 1st Qu.:6.225 1st Qu.:0.00
## Median :168.0 Median :0.6039 Median :6.758 Median :0.00
## Mean :178.8 Mean :0.6101 Mean :6.696 Mean :0.25
## 3rd Qu.:225.0 3rd Qu.:0.7531 3rd Qu.:7.173 3rd Qu.:0.25
## Max. :522.0 Max. :1.0000 Max. :7.910 Max. :1.00
## y99 y00 lfare ldistsq
## Min. :0.00 Min. :0.00 Min. :3.611 Min. :20.74
## 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:4.812 1st Qu.:38.75
## Median :0.00 Median :0.00 Median :5.124 Median :45.67
## Mean :0.25 Mean :0.25 Mean :5.096 Mean :45.28
## 3rd Qu.:0.25 3rd Qu.:0.25 3rd Qu.:5.416 3rd Qu.:51.45
## Max. :1.00 Max. :1.00 Max. :6.258 Max. :62.57
## concen lpassen
## Min. :0.1605 Min. :0.6931
## 1st Qu.:0.4650 1st Qu.:5.3706
## Median :0.6039 Median :5.8777
## Mean :0.6101 Mean :6.0170
## 3rd Qu.:0.7531 3rd Qu.:6.5751
## Max. :1.0000 Max. :9.0475
We can use the pdata.frame()
command in the plm
package to set \(id\) and \(year\) as our index
variables.
<- pdata.frame(airfare, index = c("id","year")) airfarepanel
Now, when we look at how the \(id\) and \(year\) variables are stored in the airfare
and airfarepanel
objects, we see that R views them quite differently; now, it is treating them as factors in the airfarepanel
object.
%>% dplyr::select(id, year) %>%
airfare summary(.)
## id year
## Min. : 1 Min. :1997
## 1st Qu.: 288 1st Qu.:1998
## Median : 575 Median :1998
## Mean : 575 Mean :1998
## 3rd Qu.: 862 3rd Qu.:1999
## Max. :1149 Max. :2000
%>% dplyr::select(id, year) %>%
airfarepanel summary(.)
## id year
## 1 : 4 1997:1149
## 2 : 4 1998:1149
## 3 : 4 1999:1149
## 4 : 4 2000:1149
## 5 : 4
## 6 : 4
## (Other):4572
Before we estimate any models with our panel data, let us first estimate an OLS model with lm()
on the airfare
data. The model we are estimating is:
\[\begin{equation} fare_{i} = \alpha +\beta_1 distance_{i} + \beta_2 concentration_i +\beta_3 passengers_i \end{equation}\]
<- lm(fare ~ dist + concen + passen, data = airfare)
reg1a stargazer(reg1a, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## fare
## -----------------------------------------------
## dist 0.087***
## (0.002)
##
## concen 69.431***
## (5.123)
##
## passen -0.005***
## (0.001)
##
## Constant 52.995***
## (4.488)
##
## -----------------------------------------------
## Observations 4,596
## R2 0.418
## Adjusted R2 0.418
## Residual Std. Error 57.130 (df = 4592)
## F Statistic 1,100.722*** (df = 3; 4592)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Interpreting this regression, we see that distance and market share are positively correlated with fare and the number of passengers is negatively correlated with fare.
11.2 Pooled OLS
We can estimate the same regression using our panel data set using the plm()
(Panel Linear Model ) command. Because we already specified that the airfarepanel
object is panel data, the plm()
command only requires one more argument than the lm()
command. This is an argument you should be used to by now, the model =
argument. To estimate the same regression as a basic OLS model, we use the model = "pooling"
option to estimate a pooled regression.
<- plm(fare ~ dist + concen + passen , data = airfarepanel, model = "pooling")
reg1b stargazer(reg1a, reg1b, column.labels = c("OLS", "Pooled"), type = "text")
##
## ==========================================================
## Dependent variable:
## -------------------------------
## fare
## OLS panel
## linear
## OLS Pooled
## (1) (2)
## ----------------------------------------------------------
## dist 0.087*** 0.087***
## (0.002) (0.002)
##
## concen 69.431*** 69.431***
## (5.123) (5.123)
##
## passen -0.005*** -0.005***
## (0.001) (0.001)
##
## Constant 52.995*** 52.995***
## (4.488) (4.488)
##
## ----------------------------------------------------------
## Observations 4,596 4,596
## R2 0.418 0.418
## Adjusted R2 0.418 0.418
## Residual Std. Error 57.130 (df = 4592)
## F Statistic (df = 3; 4592) 1,100.722*** 1,100.722***
## ==========================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The results are identical! A pooled regression, therefore is just a fancy name for a regression that uses panel data, but doesn’t actually take into consideration the fact that we have the same subjects being observed over multiple time periods.
11.3 Between estimator
One way to take into consideration the fact that we have multiple observations of each subject is to “collapse” the data into group averages and estimate a linear model using that data.
In other words, consider the four observations associated with route 1:
%>%
airfare filter(id == 1)
## year id dist passen fare bmktshr ldist y98 y99 y00 lfare ldistsq
## 1 1997 1 528 152 106 0.8386 6.269096 0 0 0 4.663439 39.30157
## 2 1998 1 528 265 106 0.8133 6.269096 1 0 0 4.663439 39.30157
## 3 1999 1 528 336 113 0.8262 6.269096 0 1 0 4.727388 39.30157
## 4 2000 1 528 298 123 0.8612 6.269096 0 0 1 4.812184 39.30157
## concen lpassen
## 1 0.8386 5.023880
## 2 0.8133 5.579730
## 3 0.8262 5.817111
## 4 0.8612 5.697093
What if we replace this with a single observation that is the average for this route over the 4 years:
%>%
airfare filter(id == 1) %>%
summarize_all(mean)
## year id dist passen fare bmktshr ldist y98 y99 y00 lfare
## 1 1998.5 1 528 262.75 112 0.834825 6.269096 0.25 0.25 0.25 4.716613
## ldistsq concen lpassen
## 1 39.30157 0.834825 5.529454
If we perform this task for every value of \(id\) and run a regression on this new dataset, we would wind up with what is called the between estimator. The between estimator is not commonly used in economics, but can be estimated using the plm()
command as below:
<- plm(fare ~ dist + concen + passen , data = airfarepanel, model = "between")
reg1c stargazer(reg1b, reg1c, column.labels = c("Pooled", "Between"), type = "text")
##
## ==================================================================
## Dependent variable:
## -----------------------------------------------------
## fare
## Pooled Between
## (1) (2)
## ------------------------------------------------------------------
## dist 0.087*** 0.089***
## (0.002) (0.003)
##
## concen 69.431*** 77.244***
## (5.123) (10.305)
##
## passen -0.005*** -0.004*
## (0.001) (0.002)
##
## Constant 52.995*** 46.183***
## (4.488) (8.950)
##
## ------------------------------------------------------------------
## Observations 4,596 1,149
## R2 0.418 0.444
## Adjusted R2 0.418 0.443
## F Statistic 1,100.722*** (df = 3; 4592) 305.152*** (df = 3; 1145)
## ==================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
To verify that the results are the same as collapsing the data manually, we can use group_by()
and summarize_all()
in dplyr
to get to the same place:
<- airfare %>%
reg1ca group_by(id) %>%
summarize_all(mean) %>%
lm(fare ~ dist + concen + passen, data = .)
stargazer(reg1c, reg1ca, column.labels = c("Between (plm)", "Between (manual)"), type = "text")
##
## ===========================================================
## Dependent variable:
## --------------------------------
## fare
## panel OLS
## linear
## Between (plm) Between (manual)
## (1) (2)
## -----------------------------------------------------------
## dist 0.089*** 0.089***
## (0.003) (0.003)
##
## concen 77.244*** 77.244***
## (10.305) (10.305)
##
## passen -0.004* -0.004*
## (0.002) (0.002)
##
## Constant 46.183*** 46.183***
## (8.950) (8.950)
##
## -----------------------------------------------------------
## Observations 1,149 1,149
## R2 0.444 0.444
## Adjusted R2 0.443 0.443
## Residual Std. Error 54.328 (df = 1145)
## F Statistic (df = 3; 1145) 305.152*** 305.152***
## ===========================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
11.4 Fixed effects
The most “popular” panel model in economics is the Fixed Effects, or Within, estimator. The equation for this model looks like:
\[\begin{equation} fare_{i} = \alpha + \upsilon_j+\beta_1 distance_{i} + \beta_2 concentration_i +\beta_3 passengers_i \end{equation}\]
In this equation, \(\upsilon_j\) is a separate constant term for each of the \(j\) cross-sectional groups in your data; here, therefore, it says that we should estimate a different constant term for each air route.
There are two ways to conceptualize the Fixed Effects model:
The fixed effects model simply adds a whole lot of dummy variables, one for each of your cross-sectional groups. For this data, as there 1149 different routes, there would be 1149 dummies added to the regression.
The Fixed effects model calculates the within-group mean for each variable, calculates the difference between each observation and its within-group mean, and runs the regression using these differences. This is where it gets its alternate name of the Within model from.
It turns out that both of these ways to conceptualize the fixed effects model are the same, even though they may not seem like it. The fixed effects estimator can be obtained using the model = "within"
option in the plm()
command:
<- plm(fare ~ dist + concen + passen , data = airfare, index = c("id","year"), model = "within")
reg1d stargazer(reg1b, reg1d, column.labels = c("Pooled", "Fixed"), type = "text")
##
## ==================================================================
## Dependent variable:
## -----------------------------------------------------
## fare
## Pooled Fixed
## (1) (2)
## ------------------------------------------------------------------
## dist 0.087***
## (0.002)
##
## concen 69.431*** -2.752
## (5.123) (5.375)
##
## passen -0.005*** -0.051***
## (0.001) (0.003)
##
## Constant 52.995***
## (4.488)
##
## ------------------------------------------------------------------
## Observations 4,596 4,596
## R2 0.418 0.092
## Adjusted R2 0.418 -0.212
## F Statistic 1,100.722*** (df = 3; 4592) 173.880*** (df = 2; 3445)
## ==================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
While the between and pooled estimators looked very similar, the fixed effect model looks very different. The first thing we might note is that there is no estimate for the \(dist\) variable. Recall the second conceptualization of the fixed effects method above, that it is calculating within-group variation and using that to estimate the model. However, \(dist\) is time invariant within each route – the distances of the routes do not change from year to year. Since they do not change, the \(dist\) variable is basically a column full of zeroes, and that can’t be included in a regression. One of the drawbacks of a fixed effect regression is that it is not very useful if one of your variables of interest is time invariant.
The other thing we might note is that there is constant term, why not? Recall the first conceptualization of the fixed effect model above, that it is the same thing as running a regression with a whole slew of dummy variables. Typically, in such cases, we would omit one dummy variable, because having a full set of dummies would be collinear with the constant term. Instead, we could simply omit the constant term, which is what is going on here.
We can also demonstrate the equivalence of the fixed effects model with a dummy variable model pretty easily. Let’s estimate the same model using lm()
and adding factor(id)
as one of our variables – factor(id)
tells R to treat \(id\) as a factor variable so it will make 1149 dummies and estimate that model. This is far more computationally intensive than the plm()
command, and displaying result with stargazer
or summary()
will probably be thousands of lines long, so let’s just look at the coefficients from the OLS regression and compare it to the fixed effects model above.
<- lm(fare ~ dist + concen + passen + factor(id), data = airfare)
reg1e $coefficients[1:4] reg1e
## (Intercept) dist concen passen
## 55.35835180 0.13707225 -2.75163793 -0.05113363
When compared to the fixed effects estimates, we can see that the estimated coefficients for \(concen\) and \(passen\) are identical between the two models.
11.5 Random effects
Next, we will look at the Random Effects estimator. The model being estimated looks like:
\[\begin{equation} fare_{i} = \alpha +\beta_1 distance_{i} + \beta_2 concentration_i +\beta_3 passengers_i +\epsilon_i + \omega_j \end{equation}\]
In this model, \(\omega_j\) is a separate error term for each of the \(j\) cross-sectional groups in your data; here, therefore, it is a different error term for each air route. If you are thinking this looks very similar to the fixed effects model, you are right. What differentiates this from the fixed effects model is that \(\omega_j\) is assumed to be normally distributed; we can test this assumption using the Hausman test, which we will see a bit later.
<- plm(fare ~ dist + concen + passen, data = airfarepanel, model = c("random"))
reg1f stargazer(reg1b, reg1d, reg1f, column.labels = c("Pooled", "Fixed", "Random"), type = "text")
##
## ===============================================================================
## Dependent variable:
## ------------------------------------------------------------------
## fare
## Pooled Fixed Random
## (1) (2) (3)
## -------------------------------------------------------------------------------
## dist 0.087*** 0.077***
## (0.002) (0.003)
##
## concen 69.431*** -2.752 17.642***
## (5.123) (5.375) (4.868)
##
## passen -0.005*** -0.051*** -0.021***
## (0.001) (0.003) (0.002)
##
## Constant 52.995*** 105.629***
## (4.488) (5.193)
##
## -------------------------------------------------------------------------------
## Observations 4,596 4,596 4,596
## R2 0.418 0.092 0.179
## Adjusted R2 0.418 -0.212 0.179
## F Statistic 1,100.722*** (df = 3; 4592) 173.880*** (df = 2; 3445) 1,002.162***
## ===============================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
11.6 Model testing
The typical “workflow” in econometrics is to estimate the pooled, random effects, and fixed Effects models and then do a couple of tests to determine which model is the best.
First, we test whether or not the random effects model is better than the pooled model using the Lagrange Multiplier Test and the plmtest()
command.
plmtest(reg1b)
##
## Lagrange Multiplier Test - (Honda) for balanced panels
##
## data: fare ~ dist + concen + passen
## normal = 72.238, p-value < 2.2e-16
## alternative hypothesis: significant effects
Recall that reg1b
was the regression object from the pooled model. The null hypothesis is that the pooled model is the best model; because the p-value is less than .05, we conclude that the random effects model is preferred to the pooled model.
Next, we test whether or not the fixed effects model is preferred to the random effects model using the Hausman Test via the phtest()
command:
phtest(reg1d, reg1f)
##
## Hausman Test
##
## data: fare ~ dist + concen + passen
## chisq = 227.11, df = 2, p-value < 2.2e-16
## alternative hypothesis: one model is inconsistent
In this test, the null hypothesis is that the random effects model is preferred to the fixed effects model. We see that our p-value is less than .05, so we conclude that the fixed effects model is actually preferred to the random effects model.
11.7 First difference modeling
One final type of panel model to look at is the First Difference model. The concept is related to the idea of differencing that we discussed in Chapter (timeseries). This model first-differences the variables within each group and estimates the regression using the resulting values. Because we have 4 time periods, we can calculate 3 first differences for each group:
- 1997 to 1998
- 1998 to 1999
- 1999 to 2000
Thus, the model we are estimating is:
\[\begin{equation} \Delta fare_{i} = \alpha +\beta_1 \Delta distance_{i} + \beta_2 \Delta concentration_i +\beta_3 \Delta passengers_i +\epsilon_i \end{equation}\]
Where we take the Greek letter Delta (\(\Delta\)) to mean “change in”. So we are asking whether or not we can explain the year-to-year change in fare by the year-to-year change in distance (which will of course zero out, as it did with the fixed effects estimator), the year-to-year change in market concentration, and the year-to-year change in passengers. This model can be estimated using the model = "fd"
option in the plm()
command:
<- plm(fare ~ dist + concen + passen, data = airfarepanel, model = "fd")
reg1g stargazer(reg1b, reg1d, reg1f, reg1g, column.labels = c("Pooled", "Fixed", "Random", "First Diff."), type = "text")
##
## =========================================================================================================
## Dependent variable:
## --------------------------------------------------------------------------------------------
## fare
## Pooled Fixed Random First Diff.
## (1) (2) (3) (4)
## ---------------------------------------------------------------------------------------------------------
## dist 0.087*** 0.077***
## (0.002) (0.003)
##
## concen 69.431*** -2.752 17.642*** 7.492
## (5.123) (5.375) (4.868) (4.743)
##
## passen -0.005*** -0.051*** -0.021*** -0.086***
## (0.001) (0.003) (0.002) (0.003)
##
## Constant 52.995*** 105.629*** 6.783***
## (4.488) (5.193) (0.362)
##
## ---------------------------------------------------------------------------------------------------------
## Observations 4,596 4,596 4,596 3,447
## R2 0.418 0.092 0.179 0.201
## Adjusted R2 0.418 -0.212 0.179 0.200
## F Statistic 1,100.722*** (df = 3; 4592) 173.880*** (df = 2; 3445) 1,002.162*** 432.210*** (df = 2; 3444)
## =========================================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
As expected, the \(dist\) variable had to be dropped due to the fact that it is time invariant within group. It is also noteworthy that the number of observations fell from 4596 to 3447 – this is a natural consequence of the process of first-differencing the data; if there are 4 time periods, there can only be 3 first-differences. Because of the way the data ha been transformed, the coefficients need to be interpreted with respect to the fact that the regression was estimated on first-differenced data.
11.8 Further Examples
Let’s analyze another panel data set. If you are pretty sure you aren’t interested in a first-differenced model, the basic workflow is fairly simple:
- Identify cross-sectional and time series variables in the data.
- Estimate pooled, random, and fixed effects, and interpret the results.
- Perform tests to see which model is preferred model
Let’s begin with the AER:USSeatBelts
data. First, we will create the new object called seatbeltpanel
that specifies our index()
for the data.
<- pdata.frame(USSeatBelts, index = c("state","year")) seatbeltpanel
Next, we estimate the 3 basic panel models:
<- plm(fatalities ~ seatbelt + speed65 + speed70 + drinkage + alcohol + income + age + enforce, data = seatbeltpanel, model = "pooling")
reg2a <- plm(fatalities ~ seatbelt + speed65 + speed70 + drinkage + alcohol + income + age + enforce, data = seatbeltpanel, model = "random")
reg2b <- plm(fatalities ~ seatbelt + speed65 + speed70 + drinkage + alcohol + income + age + enforce, data = seatbeltpanel, model = "within")
reg2c stargazer(reg2a, reg2b, reg2c, type = "text", column.labels = c("Pooled","Random Eff.", "Fixed Eff"))
##
## ==============================================================================
## Dependent variable:
## -------------------------------------------------------------
## fatalities
## Pooled Random Eff. Fixed Eff
## (1) (2) (3)
## ------------------------------------------------------------------------------
## seatbelt -0.001 -0.007*** -0.008***
## (0.002) (0.001) (0.001)
##
## speed65yes 0.0002 -0.001 -0.001*
## (0.0005) (0.0003) (0.0004)
##
## speed70yes 0.002*** 0.001*** 0.001***
## (0.001) (0.0003) (0.0003)
##
## drinkageyes -0.001 0.00003 0.0001
## (0.001) (0.001) (0.001)
##
## alcoholyes -0.002*** -0.001*** -0.001***
## (0.0005) (0.0004) (0.0004)
##
## income -0.00000*** -0.00000*** -0.00000***
## (0.00000) (0.00000) (0.00000)
##
## age -0.0001 -0.00004 0.0003
## (0.0001) (0.0002) (0.0004)
##
## enforceprimary 0.002** 0.002** 0.002**
## (0.001) (0.001) (0.001)
##
## enforcesecondary 0.001* 0.0003 0.0002
## (0.001) (0.0004) (0.0004)
##
## Constant 0.038*** 0.035***
## (0.004) (0.008)
##
## ------------------------------------------------------------------------------
## Observations 556 556 556
## R2 0.515 0.656 0.672
## Adjusted R2 0.507 0.650 0.633
## F Statistic 64.530*** (df = 9; 546) 1,020.572*** 113.010*** (df = 9; 496)
## ==============================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
All told, these are extremely consistent results. They suggest that traffic fatalities are higher with higher speed limits, but are lower in states with more strict alcohol laws, higher rates of seat belt usage, and higher incomes.
Finally, we estimate the Lagarange Multiplier and Hausman tests.
plmtest(reg2a)
##
## Lagrange Multiplier Test - (Honda) for unbalanced panels
##
## data: fatalities ~ seatbelt + speed65 + speed70 + drinkage + alcohol + ...
## normal = 33.362, p-value < 2.2e-16
## alternative hypothesis: significant effects
The Lagrange Multiplier test indicates that the random effects model is preferred to the pooled model.
phtest(reg2b, reg2c)
##
## Hausman Test
##
## data: fatalities ~ seatbelt + speed65 + speed70 + drinkage + alcohol + ...
## chisq = 15.719, df = 9, p-value = 0.07298
## alternative hypothesis: one model is inconsistent
The Hausman test has a p-value of 0.07, so it’s right on the edge of being significant. If I were writing a paper and saw this, I’d just report both and argue that the result is not sensitive to the choice of random vs fixed effects!
Finally, let’s take a look at the CigarettesSW
data from the AER
package. As usual, we start with creating a panel data set; the cross sectional and time series variables are \(state\) and \(year\), respectively. Additionally, we can do a bit of data manipulation to convert the \(income\) variable into per capita terms and convert the variables measured in dollar terms from nominal to real by dividing them by \(cpi\):
<- pdata.frame(CigarettesSW, index = c("state","year"))
cigspanel $income <-cigspanel$income/cigspanel$population
cigspanel$income <- cigspanel$income/cigspanel$cpi
cigspanel$tax <- cigspanel$tax/cigspanel$cpi
cigspanel$price <- cigspanel$price/cigspanel$cpi cigspanel
Since there are two years, we may be interested to look at regressions for each year individually; let’s use dplyr
and the filter()
command to run OLS regressions on the two subsets.
<-cigspanel %>%
regcig1 filter(year == 1985) %>%
lm(packs ~ income + price + tax, data = .)
<-cigspanel %>%
regcig2 filter(year == 1995) %>%
lm(packs ~ income + price + tax, data = .)
stargazer(regcig1, regcig2, type = "text", column.labels = c("1985","1995"))
##
## ==========================================================
## Dependent variable:
## ----------------------------
## packs
## 1985 1995
## (1) (2)
## ----------------------------------------------------------
## income 2.649 1.903
## (1.614) (1.566)
##
## price -1.584*** -1.210**
## (0.459) (0.553)
##
## tax 0.306 0.208
## (0.722) (0.822)
##
## Constant 231.498*** 206.421***
## (35.056) (39.127)
##
## ----------------------------------------------------------
## Observations 48 48
## R2 0.294 0.416
## Adjusted R2 0.245 0.376
## Residual Std. Error (df = 44) 18.423 18.785
## F Statistic (df = 3; 44) 6.096*** 10.443***
## ==========================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Next, we can estimate first difference, between, and pooled models:
<- plm(packs ~ income + price + tax, data =cigspanel, model = "fd")
regcig3 <- plm(packs ~ income + price + tax, data =cigspanel, model = "between")
regcig4 <- plm(packs ~ income + price + tax, data =cigspanel, model = "pooling")
regcig5 stargazer(regcig3, regcig4, regcig5,
type = "text",
column.labels = c("First-Diff","Between","Pooled"))
##
## ================================================================================
## Dependent variable:
## -------------------------------------------------------------------
## packs
## First-Diff Between Pooled
## (1) (2) (3)
## --------------------------------------------------------------------------------
## income -0.987 2.711* 2.273**
## (1.861) (1.596) (1.091)
##
## price -0.627*** -1.637*** -1.342***
## (0.222) (0.538) (0.212)
##
## tax -0.263 0.553 0.313
## (0.348) (0.813) (0.399)
##
## Constant -7.376 231.138*** 212.963***
## (4.705) (37.099) (14.509)
##
## --------------------------------------------------------------------------------
## Observations 48 48 96
## R2 0.562 0.361 0.517
## Adjusted R2 0.532 0.318 0.502
## F Statistic 18.801*** (df = 3; 44) 8.294*** (df = 3; 44) 32.887*** (df = 3; 92)
## ================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Note that the first-differenced model and between models only have 48 observations; this is because the data is transformed in such a way as to lose observations for each cross-sectional unit. The first-difference model looks at the difference in each state between 1995 and 1985, whereas the between model looks at each state’s average over the two years.
This is essentially a demand curve estimation; one might wonder why there be such a big difference between the estimated coefficient on \(price\) between the first-difference model and the others. The first difference model suggests that, for every $1 increase in price within any state, the consumption of packs consumed fell by 0.63. The pooled model suggests that the effect is much bigger, and an additional dollar in price should reduce consumption by more than double that amount (-1.342)!
Which of these is more likely to be correct? Probably the first-differenced model; there are likely to be some state specific factors that influence smoking behavior, and first differencing is a way to take those into account. For example, of the top 5 states for cigarette consumption in 1985, 3 are still in the top 5 10 years later:
%>%
CigarettesSW select(year, state, packs) %>%
filter(year == 1985) %>%
slice_max(packs,n=5)
## year state packs
## 28 1985 NH 197.9940
## 15 1985 KY 186.0352
## 25 1985 NC 155.2838
## 44 1985 VT 145.2830
## 7 1985 DE 143.8511
%>%
CigarettesSW select(year, state, packs) %>%
filter(year == 1995) %>%
slice_max(packs,n=5)
## year state packs
## 63 1995 KY 172.6478
## 76 1995 NH 156.3367
## 61 1995 IN 134.2583
## 55 1995 DE 124.4666
## 70 1995 MO 122.4503
If you look at the bottom 5 states, 4 of them are still in the bottom 5 a decade later:
%>%
CigarettesSW select(year, state, packs) %>%
filter(year == 1985) %>%
slice_min(packs,n=5)
## year state packs
## 42 1985 UT 68.04626
## 30 1985 NM 88.74218
## 45 1985 WA 96.22813
## 4 1985 CA 100.36304
## 11 1985 ID 103.01811
%>%
CigarettesSW select(year, state, packs) %>%
filter(year == 1995) %>%
slice_min(packs,n=5)
## year state packs
## 90 1995 UT 49.27220
## 52 1995 CA 56.85931
## 78 1995 NM 64.66887
## 93 1995 WA 65.53092
## 80 1995 NY 70.81732
In other words, there is something persistent about smoking within each state, so including some way of keeping state specific details in the regression should improve the results.
Moving on, we should estimate the random effects and fixed effects models as well:
<- plm(packs ~ income + price + tax, data =cigspanel, model = "within")
regcig6 <- plm(packs ~ income + price + tax, data =cigspanel, model = "random")
regcig7 <- plm(packs ~ income + price + tax, data =cigspanel, model = "pooling")
regcig5 stargazer(regcig6, regcig7, regcig5,
type = "text",
column.labels = c("Fixed Eff.","Random Eff.","Pooled"))
##
## =======================================================================
## Dependent variable:
## ----------------------------------------------------------
## packs
## Fixed Eff. Random Eff. Pooled
## (1) (2) (3)
## -----------------------------------------------------------------------
## income -2.711* -0.164 2.273**
## (1.525) (1.073) (1.091)
##
## price -0.845*** -1.101*** -1.342***
## (0.175) (0.132) (0.212)
##
## tax -0.006 0.286 0.313
## (0.311) (0.256) (0.399)
##
## Constant 221.417*** 212.963***
## (9.096) (14.509)
##
## -----------------------------------------------------------------------
## Observations 96 96 96
## R2 0.918 0.845 0.517
## Adjusted R2 0.827 0.840 0.502
## F Statistic 168.271*** (df = 3; 45) 503.402*** 32.887*** (df = 3; 92)
## =======================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
As with the first-differenced model, both the fixed effects and random effects models allow for state specific effects to exist in the model. To determine which of these models is preferred, we can hit these with some more tests.
plmtest(regcig5)
##
## Lagrange Multiplier Test - (Honda) for balanced panels
##
## data: packs ~ income + price + tax
## normal = 5.9797, p-value = 1.118e-09
## alternative hypothesis: significant effects
The Lagrange Multiplier test indicates that the random effects model is preferred to the pooled model.
phtest(regcig6, regcig7)
##
## Hausman Test
##
## data: packs ~ income + price + tax
## chisq = 6.1643, df = 3, p-value = 0.1039
## alternative hypothesis: one model is inconsistent
The Hausman test is insignificant, indicating that the random effects model is preferred.
11.9 Wrapping Up
Panel methods, particularly Fixed Effects models, are incredibly important in economics because they give the ability to control for individual specific unobservable characteristics.
11.10 End of Chapter Exercises
Panel Data: For each of the following, examine the data (and the help file for the data) to identify the cross-sectional and time components of the panel data. Use the plm
function to build a good multivariate model. Think carefully about the variables you have to choose among to explain your dependent variable with. Estimate between and first-differenced models. Estimate the pooled, random effects, and fixed effects model, and execute and interpret the appropriate tests to identify which of these models are best. Finally, interpret your regression results.
AER:Fatalities
- This is a similar sort of dataset to theUSSeatbelts
data above.AER:Municipalities
This is an interesting Swedish dataset of city taxes and spending.AER:USAirlines
- Small dataset that looks at cost of production.wooldridge:driving
- This is another driving fatality dataset.wooldridge:crime4
- Crime data is often a good place to look at first-difference models.wooldridge:wagepan
- Looking at wages in a panel often provides very different conclusions than just in a cross sectional dataset.AER:NaturalGas
- This could be a good dataset for estimating a demand function.