Chapter 16 Advanced Panel Data
In this chapter we will learn techniques in R for panel data where there might be serially correlated errors, temporal dependence with a lagged dependent variable, and random effects models.
16.1 The Data
We will make use of the Cigar
dataset from the plm
package for this chapter. Cigar
is a panel of 46 U.S. states over the period 1963-1992. The variables are:
- state - State number
- year
- price - the price per pack of cigarettes in cents
- pop - state population in thousands
- pop16 - state population over the age of 16 in thousands
- cpi - consumer price index (1983=100)
- ndi - per capita disposable income in dollars
- sales - cigarette sales per capita in packs
- pimin - minimum price in adjoining states per pack of cigarettes in cents
The plm
packages offers many functions to simplify the handling of advanced panel data.
16.2 Variation within Units
dplyr
verbs make checking for variation within units across multiple variables relatively simple. First we use group-by
so that any functions will be applied to each state individually. summarize_all
will apply a function to each variable.
library(tidyverse)
library(broom)
library(magrittr)
# Check for variaton by state.
Cigar %>%
group_by(state) %>% # ensures that subsequent functions will be performed by state
select(price, pop, pop16, cpi, ndi, sales, pimin) %>%
summarise_all(sd) # sd is standard deviation
# A tibble: 46 x 8
state price pop pop16 cpi ndi sales pimin
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 39.6 269. 319. 37.1 3944. 10.7 38.0
2 3 40.7 748. 622. 37.1 4364. 13.3 40.8
3 4 41.5 189. 198. 37.1 3814. 10.8 37.6
4 5 48.3 3995. 3531. 37.1 5264. 20.6 40.9
5 7 48.0 139. 218. 37.1 6624. 18.2 43.7
6 8 40.6 58.1 66.6 37.1 4921. 13.8 38.3
7 9 45.4 79.5 34.7 37.1 6223. 59.3 38.2
8 10 45.4 2562. 2229. 37.1 4889. 10.3 37.6
9 11 37.6 783. 734. 37.1 4478. 10.0 35.9
10 13 41.6 136. 113. 37.1 3900. 14.3 37.5
# ... with 36 more rows
# Check for variation by year.
Cigar %>%
group_by(year) %>%
select(-year, -state) %>% # the "-" indicates variables to be removed
summarise_all(sd)
# A tibble: 30 x 8
year price pop pop16 cpi ndi sales pimin
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 63 1.86 4116. 2855. 0 387. 33.1 1.21
2 64 1.89 4185. 2903. 0 413. 32.8 1.41
3 65 1.97 4253. 2953. 0 417. 33.1 1.68
4 66 2.43 4285. 2982. 0 438. 41.6 2.25
5 67 2.32 4329. 3024. 0 460. 37.9 2.10
6 68 2.39 4372. 3069. 0 486. 33.4 2.23
7 69 3.13 4420. 3139. 0 504. 32.0 2.28
8 70 3.65 4463. 3198. 0 537. 32.0 3.24
9 71 3.71 4513. 3260. 0 559. 33.5 3.27
10 72 4.45 4536. 3306. 0 555. 36.2 4.07
# ... with 20 more rows
CPI is the only variable with a standard deviation of 0 for all units. As would be expected, CPI should not vary within year.
pvar
from the plm
package will perform the task of checking for variation.
no time variation: state
no individual variation: year cpi
16.3 Two-Way Fixed Effects Model
Let’s estimate cigarette demand as: \[sales_{it}=\beta_0+\beta_1price_{it}+\beta_2pop16_{it}+\beta_3ndi_{it}+\alpha_i+\tau_t+\nu_{it}\]
We would expect \(\beta_1<0\), \(\beta_2>0\), and \(\beta_3<0\) if cigarettes are an inferior good42.
cigar_plm <- plm(sales ~ price + pop16 + ndi,
data = Cigar, # recall plm does not play nice with the expose pipe, %$%
index = c("state", "year"),
model = "within",
effect = "twoways")
cigar_plm %>%
tidy()
# A tibble: 3 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 price -0.841 0.0750 -11.2 6.35e-28
2 pop16 0.00114 0.000547 2.07 3.83e- 2
3 ndi -0.00557 0.000445 -12.5 5.72e-34
Each of the coefficients has the expected sign and is significant at the 5% level.
16.4 Testing for autocorrelation
Testing for autocorrelation is done by testing the following hypothesis: \[H_0:\rho=0\] \[H_1:\rho\ne0\]
Rows: 1,380
Columns: 9
$ state <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ year <int> 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 7...
$ price <dbl> 28.6, 29.8, 29.8, 31.5, 31.6, 35.6, 36.6, 39.6, 42.7, 42.3, 4...
$ pop <dbl> 3383, 3431, 3486, 3524, 3533, 3522, 3531, 3444, 3481, 3511, 3...
$ pop16 <dbl> 2236, 2277, 2328, 2370, 2394, 2405, 2412, 2395, 2444, 2485, 2...
$ cpi <dbl> 30.6, 31.0, 31.5, 32.4, 33.4, 34.8, 36.7, 38.8, 40.5, 41.8, 4...
$ ndi <dbl> 1558, 1684, 1810, 1915, 2024, 2202, 2377, 2591, 2785, 3035, 3...
$ sales <dbl> 93.9, 95.4, 98.5, 96.4, 95.5, 88.4, 90.1, 89.8, 95.4, 101.1, ...
$ pimin <dbl> 26.1, 27.5, 28.9, 29.5, 29.6, 32.0, 32.8, 34.3, 35.8, 37.4, 3...
Our data are organized by unit by year, so we can estimate \(\hat\rho\) directly. First, obtain the residuals, e, from the estimated equation. Estimate the equation \(e=\rho e_{i,t-1}+\eta_{it}\).
# Obtain the residuals
Cigar$e <- cigar_plm$residuals
# test of rho hat
aux_1 <-
Cigar %$%
lm(e ~ -1 + lag(e)) # -1 removes the constant.
aux_1 %>%
tidy()
# A tibble: 1 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 lag(e) 0.888 0.0126 70.3 0
We can reject the null hypothesis at the 1% level.
We can also check for autocorrelation with the LM test by estimating the model \[\hat\epsilon_{it}=\rho\hat\epsilon_{i,t-1}+\gamma_1price_{it}+\gamma_2pop16_{it}+\gamma_3ndi_{it}+\eta_{it}\] where \(nR^2\sim\chi^2_{df=1}\).
aux_2 <-
plm(e ~ lag(e) + price + pop16 + ndi,
data = Cigar,
index = c("state", "year"),
model = "within",
effect = "twoways")
nR2 <-
aux_2 %>%
r.squared *
aux_2$df.residual
nR2 %>%
pchisq(1, lower.tail = F)
[1] 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000103
Again, we can reject the null hypothesis of no autocorrelation.
pwartest
from the lpm
package allows us to test for autocorrelation (?pwartest
for relevant arguments).
Wooldridge's test for serial correlation in FE panels
data: cigar_plm
F = 10212, df1 = 1, df2 = 1332, p-value <0.0000000000000002
alternative hypothesis: serial correlation
We reject the null hpothesis of no autocorrelation.
16.5 Estimating \(\hat\rho\)
To correct for autocorrelation we need an estimate of \(\hat\rho\). We can estimate \(\hat\rho\) using either auxiliary regression from above.
# A tibble: 1 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 lag(e) 0.888 0.0126 70.3 0
Our estimate of \(\hat\rho\) is 0.888
is 0.888.
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 lag(e) 0.894 0.0127 70.2 0
2 price 0.156 0.0339 4.59 0.00000482
3 pop16 -0.000203 0.000254 -0.802 0.423
4 ndi 0.000517 0.000201 2.57 0.0102
Our estimate of \(\hat\rho\) is 0.894.
16.6 Estimate a \(\rho\)-Transformed Model
We can manually transform the data and compare the transformed model to the non-transformed model.
rho_hat <- aux_2$coefficients[1] # set rho_hat to the coef of lagged e in aux_2
plm(I(sales - rho_hat*lag(sales)) ~
I(price - rho_hat*lag(price)) +
I(pop - rho_hat*lag(pop16)) +
I(ndi - rho_hat*lag(ndi)),
data = Cigar,
index = c("state", "year"),
model = "within",
effect = "twoways") %>%
summary()
Twoways effects Within Model
Call:
plm(formula = I(sales - rho_hat * lag(sales)) ~ I(price - rho_hat *
lag(price)) + I(pop - rho_hat * lag(pop16)) + I(ndi - rho_hat *
lag(ndi)), data = Cigar, effect = "twoways", model = "within",
index = c("state", "year"))
Balanced Panel: n = 46, T = 29, N = 1334
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-27.611 -1.992 0.208 1.956 61.268
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
I(price - rho_hat * lag(price)) -0.329602 0.047324 -6.96 0.0000000000053
I(pop - rho_hat * lag(pop16)) -0.000730 0.000594 -1.23 0.22
I(ndi - rho_hat * lag(ndi)) 0.000988 0.000682 1.45 0.15
I(price - rho_hat * lag(price)) ***
I(pop - rho_hat * lag(pop16))
I(ndi - rho_hat * lag(ndi))
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 35700
Residual Sum of Squares: 34300
R-Squared: 0.0414
Adj. R-Squared: -0.0166
F-statistic: 18.0908 on 3 and 1257 DF, p-value: 0.000000000017
# A tibble: 3 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 price -0.841 0.0750 -11.2 6.35e-28
2 pop16 0.00114 0.000547 2.07 3.83e- 2
3 ndi -0.00557 0.000445 -12.5 5.72e-34
Now only \(\hat\beta_1\) is significantly different than zero.
We can use the panelAR
package to directly estimate a corrected model43. ?panelAR
for arguments necessary to estimate the corrected model.
library(panelAR)
panelAR(sales ~ price + pop +ndi,
data = Cigar,
panelVar = "state",
timeVar = "year",
autoCorr = "ar1",
panelCorrMethod = "pcse") %>%
summary()
Panel Regression with AR(1) Prais-Winsten correction and panel-corrected standard errors
Balanced Panel Design:
Total obs.: 1380 Avg obs. per panel 30
Number of panels: 46 Max obs. per panel 30
Number of times: 30 Min obs. per panel 30
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 137.461557 5.974766 23.01 < 0.0000000000000002 ***
price -0.386976 0.060172 -6.43 0.00000000017 ***
pop -0.000446 0.000338 -1.32 0.1872
ndi 0.001879 0.000719 2.61 0.0091 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-squared: 0.5899
Wald statistic: 57.2869, Pr(>Chisq(3)): 0
16.7 Lagged Dependent Variable Panel Data Model
Let’s estimate the lagged-depdendent variable model \[sales_{it} = \gamma sales_{i,t-1}+\beta_0+\beta_1price_{it}+\beta_2pop16_{it}+\beta_3ndi_{it}+\epsilon_{it}\]
cigar_lag_plm <- plm(sales ~ lag(sales) + price + pop16 + ndi,
data = Cigar, # recall plm does not play nice with the expose pipe, %$%
index = c("state", "year"),
model = "within",
effect = "twoways")
cigar_lag_plm %>%
summary()
Twoways effects Within Model
Call:
plm(formula = sales ~ lag(sales) + price + pop16 + ndi, data = Cigar,
effect = "twoways", model = "within", index = c("state",
"year"))
Balanced Panel: n = 46, T = 29, N = 1334
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-29.476 -1.900 0.241 2.002 60.188
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
lag(sales) 0.8977592 0.0119755 74.97 < 0.0000000000000002 ***
price -0.1349955 0.0332910 -4.06 0.000053 ***
pop16 0.0000834 0.0002409 0.35 0.73
ndi -0.0002809 0.0002022 -1.39 0.16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 244000
Residual Sum of Squares: 35000
R-Squared: 0.857
Adj. R-Squared: 0.848
F-statistic: 1876.56 on 4 and 1256 DF, p-value: <0.0000000000000002
16.8 Random Effects Model
plm(sales ~ price + pop16 + ndi,
data = Cigar,
index = c("state", "year"),
model = "random",
effect = "twoways"
) %>%
summary()
Twoways effects Random Effect Model
(Swamy-Arora's transformation)
Call:
plm(formula = sales ~ price + pop16 + ndi, data = Cigar, effect = "twoways",
model = "random", index = c("state", "year"))
Balanced Panel: n = 46, T = 30, N = 1380
Effects:
var std.dev share
idiosyncratic 156.13 12.50 0.26
individual 438.16 20.93 0.73
time 6.70 2.59 0.01
theta: 0.892 (id) 0.42 (time) 0.419 (total)
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-57.175 -7.171 0.235 5.785 128.022
Coefficients:
Estimate Std. Error z-value Pr(>|z|)
(Intercept) 139.028985 3.995359 34.80 < 0.0000000000000002 ***
price -0.197951 0.044962 -4.40 0.000011 ***
pop16 0.000357 0.000526 0.68 0.50
ndi -0.000356 0.000404 -0.88 0.38
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 325000
Residual Sum of Squares: 280000
R-Squared: 0.138
Adj. R-Squared: 0.136
Chisq: 220.089 on 3 DF, p-value: <0.0000000000000002