# Chapter 17 Advanced Panel Data

In this chapter we will learn techinques in R for panel data where there might be serially correlated errors, temporal dependence with a lagged dependent variable, and random effects models.

## 17.1 The Data

We will make use of the Cigar dataset from the plm package for this chapter. Cigar is a panel of 46 U.S. states over the period 1963-1992. The variables are:

• state - State number
• year
• price - the price per pack of cigarettes in cents
• pop - state population in thousands
• pop16 - state population over the age of 16 in thousands
• cpi - consumer price index (1983=100)
• ndi - per capita disposable income in dollars
• sales - cigarette sales per capita in packs
• pimin - minimum price in adjoining states per pack of cigarettes in cents
library(plm)
data("Cigar")

The plm packages offers many functions to simplify the handling of advanced panel data.

## 17.2 Variation within Units

dplyr verbs make checking for variation within units across multiple variables relatively simple. First we use group-by so that any functions will be applied to each state individually. summarize_all will apply a function to each variable.

library(tidyverse)
library(broom)
library(magrittr)
# Check for variaton by state.
Cigar %>%
group_by(state) %>% # ensures that subsequent functions will be performed by state
select(price, pop, pop16, cpi, ndi, sales, pimin) %>%
summarise_all(sd) # sd is standard deviation 
# A tibble: 46 x 8
state price    pop  pop16   cpi   ndi sales pimin
<int> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1  39.6  269.   319.   37.1 3944.  10.7  38.0
2     3  40.7  748.   622.   37.1 4364.  13.3  40.8
3     4  41.5  189.   198.   37.1 3814.  10.8  37.6
4     5  48.3 3995.  3531.   37.1 5264.  20.6  40.9
5     7  48.0  139.   218.   37.1 6624.  18.2  43.7
6     8  40.6   58.1   66.6  37.1 4921.  13.8  38.3
7     9  45.4   79.5   34.7  37.1 6223.  59.3  38.2
8    10  45.4 2562.  2229.   37.1 4889.  10.3  37.6
9    11  37.6  783.   734.   37.1 4478.  10.0  35.9
10    13  41.6  136.   113.   37.1 3900.  14.3  37.5
# ... with 36 more rows
# Check for variation by year.
Cigar %>%
group_by(year) %>%
select(-year, -state) %>% # the "-" indicates variables to be removed
summarise_all(sd)
# A tibble: 30 x 8
year price   pop pop16   cpi   ndi sales pimin
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1    63  1.86 4116. 2855.     0  387.  33.1  1.21
2    64  1.89 4185. 2903.     0  413.  32.8  1.41
3    65  1.97 4253. 2953.     0  417.  33.1  1.68
4    66  2.43 4285. 2982.     0  438.  41.6  2.25
5    67  2.32 4329. 3024.     0  460.  37.9  2.10
6    68  2.39 4372. 3069.     0  486.  33.4  2.23
7    69  3.13 4420. 3139.     0  504.  32.0  2.28
8    70  3.65 4463. 3198.     0  537.  32.0  3.24
9    71  3.71 4513. 3260.     0  559.  33.5  3.27
10    72  4.45 4536. 3306.     0  555.  36.2  4.07
# ... with 20 more rows

CPI is the only variable with a standard deviation of 0 for all units. As would be expected, CPI should not vary within year.

pvar from the plm package will perform the task of checking for variation.

pvar(Cigar)
no time variation:       state
no individual variation: year cpi 

## 17.3 Two-Way Fixed Effects Model

Let’s estimate cigarette demand as: $sales_{it}=\beta_0+\beta_1price_{it}+\beta_2pop16_{it}+\beta_3ndi_{it}+\alpha_i+\tau_t+\nu_{it}$

We would expect $$\beta_1<0$$, $$\beta_2>0$$, and $$\beta_3<0$$ if cigarettes are an inferior good25.

cigar_plm <- plm(sales ~ price + pop16 + ndi,
data = Cigar, # recall plm does not play nice with the expose pipe, %$% index = c("state", "year"), model = "within", effect = "twoways") cigar_plm %>% tidy() # A tibble: 3 x 5 term estimate std.error statistic p.value <chr> <dbl> <dbl> <dbl> <dbl> 1 price -0.841 0.0750 -11.2 6.35e-28 2 pop16 0.00114 0.000547 2.07 3.83e- 2 3 ndi -0.00557 0.000445 -12.5 5.72e-34 Each of the coefficients has the expected sign and is significant at the 5% level. ## 17.4 Testing for autocorrelation Testing for autocorrelation is done by testing the following hypothesis: $H_0:\rho=0$ $H_1:\rho\ne0$ Cigar %>% glimpse() Observations: 1,380 Variables: 9$ state <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$year <int> 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, ...$ price <dbl> 28.6, 29.8, 29.8, 31.5, 31.6, 35.6, 36.6, 39.6, 42.7, 42...
$pop <dbl> 3383, 3431, 3486, 3524, 3533, 3522, 3531, 3444, 3481, 35...$ pop16 <dbl> 2236, 2277, 2328, 2370, 2394, 2405, 2412, 2395, 2444, 24...
$cpi <dbl> 30.6, 31.0, 31.5, 32.4, 33.4, 34.8, 36.7, 38.8, 40.5, 41...$ ndi   <dbl> 1558, 1684, 1810, 1915, 2024, 2202, 2377, 2591, 2785, 30...
$sales <dbl> 93.9, 95.4, 98.5, 96.4, 95.5, 88.4, 90.1, 89.8, 95.4, 10...$ pimin <dbl> 26.1, 27.5, 28.9, 29.5, 29.6, 32.0, 32.8, 34.3, 35.8, 37...

Our data are organized by unit by year, so we can estimate $$\hat\rho$$ directly. First, obtain the residuals, e, from the estimated equation. Estimate the equation $$e=\rho e_{i,t-1}+\eta_{it}$$.

# Obtain the residuals
Cigar$e <- cigar_plm$residuals
# test of rho hat
aux_1 <-
Cigar %$% lm(e ~ -1 + lag(e)) # -1 removes the constant. aux_1 %>% tidy() # A tibble: 1 x 5 term estimate std.error statistic p.value <chr> <dbl> <dbl> <dbl> <dbl> 1 lag(e) 0.888 0.0126 70.3 0 We can reject the null hypothesis at the 1% level. We can also check for autocorrelation with the LM test by estimating the model $\hat\epsilon_{it}=\rho\hat\epsilon_{i,t-1}+\gamma_1price_{it}+\gamma_2pop16_{it}+\gamma_3ndi_{it}+\eta_{it}$ where $$nR^2\sim\chi^2_{df=1}$$. aux_2 <- plm(e ~ lag(e) + price + pop16 + ndi, data = Cigar, index = c("state", "year"), model = "within", effect = "twoways") nR2 <- aux_2 %>% r.squared * aux_2$df.residual
nR2 %>%
pchisq(1, lower.tail = F)
 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000103

Again, we can reject the null hypothesis of no autocorrelation.

pwartest from the lpm package allows us to test for autocorrelation (?pwartest for relevant arguments).

pwartest(cigar_plm)

Wooldridge's test for serial correlation in FE panels

data:  cigar_plm
F = 10212, df1 = 1, df2 = 1332, p-value <0.0000000000000002
alternative hypothesis: serial correlation

We reject the null hpothesis of no autocorrelation.

## 17.5 Estimating $$\hat\rho$$

To correct for autocorrelation we need an estimate of $$\hat\rho$$. We can estimate $$\hat\rho$$ using either auxiliary regression from above.

aux_1 %>%
tidy()
# A tibble: 1 x 5
term   estimate std.error statistic p.value
<chr>     <dbl>     <dbl>     <dbl>   <dbl>
1 lag(e)    0.888    0.0126      70.3       0

Our estimate of $$\hat\rho$$ is 0.888 is 0.888.

aux_2 %>%
tidy()
# A tibble: 4 x 5
term    estimate std.error statistic    p.value
<chr>      <dbl>     <dbl>     <dbl>      <dbl>
1 lag(e)  0.894     0.0127      70.2   0
2 price   0.156     0.0339       4.59  0.00000482
3 pop16  -0.000203  0.000254    -0.802 0.423
4 ndi     0.000517  0.000201     2.57  0.0102    

Our estimate of $$\hat\rho$$ is 0.894.

## 17.6 Estimate a $$\rho$$-Transformed Model

We can manually transform the data and compare the transformed model to the non-transformed model.

rho_hat <- aux_2$coefficients # set rho_hat to the coef of lagged e in aux_2 plm(I(sales - rho_hat*lag(sales)) ~ I(price - rho_hat*lag(price)) + I(pop - rho_hat*lag(pop16)) + I(ndi - rho_hat*lag(ndi)), data = Cigar, index = c("state", "year"), model = "within", effect = "twoways") %>% summary() Twoways effects Within Model Call: plm(formula = I(sales - rho_hat * lag(sales)) ~ I(price - rho_hat * lag(price)) + I(pop - rho_hat * lag(pop16)) + I(ndi - rho_hat * lag(ndi)), data = Cigar, effect = "twoways", model = "within", index = c("state", "year")) Balanced Panel: n = 46, T = 29, N = 1334 Residuals: Min. 1st Qu. Median 3rd Qu. Max. -27.611 -1.992 0.208 1.956 61.268 Coefficients: Estimate Std. Error t-value I(price - rho_hat * lag(price)) -0.329602 0.047324 -6.96 I(pop - rho_hat * lag(pop16)) -0.000730 0.000594 -1.23 I(ndi - rho_hat * lag(ndi)) 0.000988 0.000682 1.45 Pr(>|t|) I(price - rho_hat * lag(price)) 0.0000000000053 *** I(pop - rho_hat * lag(pop16)) 0.22 I(ndi - rho_hat * lag(ndi)) 0.15 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Total Sum of Squares: 35700 Residual Sum of Squares: 34300 R-Squared: 0.0414 Adj. R-Squared: -0.0166 F-statistic: 18.0908 on 3 and 1257 DF, p-value: 0.000000000017 cigar_plm %>% tidy() # A tibble: 3 x 5 term estimate std.error statistic p.value <chr> <dbl> <dbl> <dbl> <dbl> 1 price -0.841 0.0750 -11.2 6.35e-28 2 pop16 0.00114 0.000547 2.07 3.83e- 2 3 ndi -0.00557 0.000445 -12.5 5.72e-34 Now only $$\hat\beta_1$$ is significantly different than zero. We can use the panelAR package to directly estimate a corrected model26. ?panelAR for arguments necessary to estimate the corrected model. library(panelAR) panelAR(sales ~ price + pop +ndi, data = Cigar, panelVar = "state", timeVar = "year", autoCorr = "ar1", panelCorrMethod = "pcse") %>% summary()  Panel Regression with AR(1) Prais-Winsten correction and panel-corrected standard errors Balanced Panel Design: Total obs.: 1380 Avg obs. per panel 30 Number of panels: 46 Max obs. per panel 30 Number of times: 30 Min obs. per panel 30 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 137.461557 5.974766 23.01 < 0.0000000000000002 *** price -0.386976 0.060172 -6.43 0.00000000017 *** pop -0.000446 0.000338 -1.32 0.1872 ndi 0.001879 0.000719 2.61 0.0091 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 R-squared: 0.5899 Wald statistic: 57.2869, Pr(>Chisq(3)): 0 ## 17.7 Lagged Dependent Variable Panel Data Model Let’s estimate the lagged-depdendent variable model $sales_{it} = \gamma sales_{i,t-1}+\beta_0+\beta_1price_{it}+\beta_2pop16_{it}+\beta_3ndi_{it}+\epsilon_{it}$ cigar_lag_plm <- plm(sales ~ lag(sales) + price + pop16 + ndi, data = Cigar, # recall plm does not play nice with the expose pipe, %$%
index = c("state", "year"),
model = "within",
effect = "twoways")
cigar_lag_plm %>%
summary()
Twoways effects Within Model

Call:
plm(formula = sales ~ lag(sales) + price + pop16 + ndi, data = Cigar,
effect = "twoways", model = "within", index = c("state",
"year"))

Balanced Panel: n = 46, T = 29, N = 1334

Residuals:
Min. 1st Qu.  Median 3rd Qu.    Max.
-29.476  -1.900   0.241   2.002  60.188

Coefficients:
Estimate Std. Error t-value             Pr(>|t|)
lag(sales)  0.8977592  0.0119755   74.97 < 0.0000000000000002 ***
price      -0.1349955  0.0332910   -4.06             0.000053 ***
pop16       0.0000834  0.0002409    0.35                 0.73
ndi        -0.0002809  0.0002022   -1.39                 0.16
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    244000
Residual Sum of Squares: 35000
R-Squared:      0.857
F-statistic: 1876.56 on 4 and 1256 DF, p-value: <0.0000000000000002

## 17.8 Random Effects Model

plm(sales ~ price + pop16 + ndi,
data = Cigar,
index = c("state", "year"),
model = "random",
effect = "twoways"
) %>%
summary()
Twoways effects Random Effect Model
(Swamy-Arora's transformation)

Call:
plm(formula = sales ~ price + pop16 + ndi, data = Cigar, effect = "twoways",
model = "random", index = c("state", "year"))

Balanced Panel: n = 46, T = 30, N = 1380

Effects:
var std.dev share
idiosyncratic 156.13   12.50  0.26
individual    438.16   20.93  0.73
time            6.70    2.59  0.01
theta: 0.892 (id) 0.42 (time) 0.419 (total)

Residuals:
Min. 1st Qu.  Median 3rd Qu.    Max.
-57.175  -7.171   0.235   5.785 128.022

Coefficients:
Estimate Std. Error z-value             Pr(>|z|)
(Intercept) 139.028985   3.995359   34.80 < 0.0000000000000002 ***
price        -0.197951   0.044962   -4.40             0.000011 ***
pop16         0.000357   0.000526    0.68                 0.50
ndi          -0.000356   0.000404   -0.88                 0.38
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    325000
Residual Sum of Squares: 280000
R-Squared:      0.138
Chisq: 220.089 on 3 DF, p-value: <0.0000000000000002
1. If cigarettes are a normal good we’d expect $$\beta_3>0$$
2. Note the slight differences, because panelAR also corrects for heteroscedasticity.