Chapter 16 Advanced Panel Data

In this chapter we will learn techniques in R for panel data where there might be serially correlated errors, temporal dependence with a lagged dependent variable, and random effects models.

16.1 The Data

We will make use of the Cigar dataset from the plm package for this chapter. Cigar is a panel of 46 U.S. states over the period 1963-1992. The variables are:

  • state - State number
  • year
  • price - the price per pack of cigarettes in cents
  • pop - state population in thousands
  • pop16 - state population over the age of 16 in thousands
  • cpi - consumer price index (1983=100)
  • ndi - per capita disposable income in dollars
  • sales - cigarette sales per capita in packs
  • pimin - minimum price in adjoining states per pack of cigarettes in cents
library(plm)
data("Cigar")

The plm packages offers many functions to simplify the handling of advanced panel data.

16.2 Variation within Units

dplyr verbs make checking for variation within units across multiple variables relatively simple. First we use group-by so that any functions will be applied to each state individually. summarize_all will apply a function to each variable.

library(tidyverse)
library(broom)
library(magrittr)
# Check for variaton by state.
Cigar %>% 
  group_by(state) %>% # ensures that subsequent functions will be performed by state
  select(price, pop, pop16, cpi, ndi, sales, pimin) %>% 
  summarise_all(sd) # sd is standard deviation 
# A tibble: 46 x 8
   state price    pop  pop16   cpi   ndi sales pimin
   <int> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
 1     1  39.6  269.   319.   37.1 3944.  10.7  38.0
 2     3  40.7  748.   622.   37.1 4364.  13.3  40.8
 3     4  41.5  189.   198.   37.1 3814.  10.8  37.6
 4     5  48.3 3995.  3531.   37.1 5264.  20.6  40.9
 5     7  48.0  139.   218.   37.1 6624.  18.2  43.7
 6     8  40.6   58.1   66.6  37.1 4921.  13.8  38.3
 7     9  45.4   79.5   34.7  37.1 6223.  59.3  38.2
 8    10  45.4 2562.  2229.   37.1 4889.  10.3  37.6
 9    11  37.6  783.   734.   37.1 4478.  10.0  35.9
10    13  41.6  136.   113.   37.1 3900.  14.3  37.5
# ... with 36 more rows
# Check for variation by year.
Cigar %>% 
  group_by(year) %>% 
  select(-year, -state) %>% # the "-" indicates variables to be removed
  summarise_all(sd)
# A tibble: 30 x 8
    year price   pop pop16   cpi   ndi sales pimin
   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1    63  1.86 4116. 2855.     0  387.  33.1  1.21
 2    64  1.89 4185. 2903.     0  413.  32.8  1.41
 3    65  1.97 4253. 2953.     0  417.  33.1  1.68
 4    66  2.43 4285. 2982.     0  438.  41.6  2.25
 5    67  2.32 4329. 3024.     0  460.  37.9  2.10
 6    68  2.39 4372. 3069.     0  486.  33.4  2.23
 7    69  3.13 4420. 3139.     0  504.  32.0  2.28
 8    70  3.65 4463. 3198.     0  537.  32.0  3.24
 9    71  3.71 4513. 3260.     0  559.  33.5  3.27
10    72  4.45 4536. 3306.     0  555.  36.2  4.07
# ... with 20 more rows

CPI is the only variable with a standard deviation of 0 for all units. As would be expected, CPI should not vary within year.

pvar from the plm package will perform the task of checking for variation.

pvar(Cigar)
no time variation:       state 
no individual variation: year cpi 

16.3 Two-Way Fixed Effects Model

Let’s estimate cigarette demand as: \[sales_{it}=\beta_0+\beta_1price_{it}+\beta_2pop16_{it}+\beta_3ndi_{it}+\alpha_i+\tau_t+\nu_{it}\]

We would expect \(\beta_1<0\), \(\beta_2>0\), and \(\beta_3<0\) if cigarettes are an inferior good42.

cigar_plm <- plm(sales ~ price + pop16 + ndi, 
                 data = Cigar, # recall plm does not play nice with the expose pipe, %$%
                 index = c("state", "year"), 
                 model = "within", 
                 effect = "twoways")
cigar_plm %>% 
  tidy()
# A tibble: 3 x 5
  term  estimate std.error statistic  p.value
  <chr>    <dbl>     <dbl>     <dbl>    <dbl>
1 price -0.841    0.0750      -11.2  6.35e-28
2 pop16  0.00114  0.000547      2.07 3.83e- 2
3 ndi   -0.00557  0.000445    -12.5  5.72e-34

Each of the coefficients has the expected sign and is significant at the 5% level.

16.4 Testing for autocorrelation

Testing for autocorrelation is done by testing the following hypothesis: \[H_0:\rho=0\] \[H_1:\rho\ne0\]

Cigar %>% 
  glimpse()
Rows: 1,380
Columns: 9
$ state <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ year  <int> 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 7...
$ price <dbl> 28.6, 29.8, 29.8, 31.5, 31.6, 35.6, 36.6, 39.6, 42.7, 42.3, 4...
$ pop   <dbl> 3383, 3431, 3486, 3524, 3533, 3522, 3531, 3444, 3481, 3511, 3...
$ pop16 <dbl> 2236, 2277, 2328, 2370, 2394, 2405, 2412, 2395, 2444, 2485, 2...
$ cpi   <dbl> 30.6, 31.0, 31.5, 32.4, 33.4, 34.8, 36.7, 38.8, 40.5, 41.8, 4...
$ ndi   <dbl> 1558, 1684, 1810, 1915, 2024, 2202, 2377, 2591, 2785, 3035, 3...
$ sales <dbl> 93.9, 95.4, 98.5, 96.4, 95.5, 88.4, 90.1, 89.8, 95.4, 101.1, ...
$ pimin <dbl> 26.1, 27.5, 28.9, 29.5, 29.6, 32.0, 32.8, 34.3, 35.8, 37.4, 3...

Our data are organized by unit by year, so we can estimate \(\hat\rho\) directly. First, obtain the residuals, e, from the estimated equation. Estimate the equation \(e=\rho e_{i,t-1}+\eta_{it}\).

# Obtain the residuals
Cigar$e <- cigar_plm$residuals
# test of rho hat
aux_1 <- 
  Cigar %$%
  lm(e ~ -1 + lag(e)) # -1 removes the constant.
aux_1 %>% 
  tidy()
# A tibble: 1 x 5
  term   estimate std.error statistic p.value
  <chr>     <dbl>     <dbl>     <dbl>   <dbl>
1 lag(e)    0.888    0.0126      70.3       0

We can reject the null hypothesis at the 1% level.

We can also check for autocorrelation with the LM test by estimating the model \[\hat\epsilon_{it}=\rho\hat\epsilon_{i,t-1}+\gamma_1price_{it}+\gamma_2pop16_{it}+\gamma_3ndi_{it}+\eta_{it}\] where \(nR^2\sim\chi^2_{df=1}\).

aux_2 <- 
plm(e ~ lag(e) + price + pop16 + ndi,  
    data = Cigar,
    index = c("state", "year"), 
    model = "within", 
    effect = "twoways") 
nR2 <- 
  aux_2 %>% 
  r.squared *
  aux_2$df.residual 
nR2 %>% 
  pchisq(1, lower.tail = F)
[1] 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000103

Again, we can reject the null hypothesis of no autocorrelation.

pwartest from the lpm package allows us to test for autocorrelation (?pwartest for relevant arguments).

pwartest(cigar_plm)

	Wooldridge's test for serial correlation in FE panels

data:  cigar_plm
F = 10212, df1 = 1, df2 = 1332, p-value <0.0000000000000002
alternative hypothesis: serial correlation

We reject the null hpothesis of no autocorrelation.

16.5 Estimating \(\hat\rho\)

To correct for autocorrelation we need an estimate of \(\hat\rho\). We can estimate \(\hat\rho\) using either auxiliary regression from above.

aux_1 %>%  
   tidy()
# A tibble: 1 x 5
  term   estimate std.error statistic p.value
  <chr>     <dbl>     <dbl>     <dbl>   <dbl>
1 lag(e)    0.888    0.0126      70.3       0

Our estimate of \(\hat\rho\) is 0.888 is 0.888.

aux_2 %>% 
  tidy()
# A tibble: 4 x 5
  term    estimate std.error statistic    p.value
  <chr>      <dbl>     <dbl>     <dbl>      <dbl>
1 lag(e)  0.894     0.0127      70.2   0         
2 price   0.156     0.0339       4.59  0.00000482
3 pop16  -0.000203  0.000254    -0.802 0.423     
4 ndi     0.000517  0.000201     2.57  0.0102    

Our estimate of \(\hat\rho\) is 0.894.

16.6 Estimate a \(\rho\)-Transformed Model

We can manually transform the data and compare the transformed model to the non-transformed model.

rho_hat <- aux_2$coefficients[1] # set rho_hat to the coef of lagged e in aux_2
plm(I(sales - rho_hat*lag(sales)) ~ 
      I(price - rho_hat*lag(price)) + 
      I(pop - rho_hat*lag(pop16)) + 
      I(ndi - rho_hat*lag(ndi)), 
    data = Cigar, 
    index = c("state", "year"), 
    model = "within", 
    effect = "twoways") %>% 
  summary()
Twoways effects Within Model

Call:
plm(formula = I(sales - rho_hat * lag(sales)) ~ I(price - rho_hat * 
    lag(price)) + I(pop - rho_hat * lag(pop16)) + I(ndi - rho_hat * 
    lag(ndi)), data = Cigar, effect = "twoways", model = "within", 
    index = c("state", "year"))

Balanced Panel: n = 46, T = 29, N = 1334

Residuals:
   Min. 1st Qu.  Median 3rd Qu.    Max. 
-27.611  -1.992   0.208   1.956  61.268 

Coefficients:
                                 Estimate Std. Error t-value        Pr(>|t|)
I(price - rho_hat * lag(price)) -0.329602   0.047324   -6.96 0.0000000000053
I(pop - rho_hat * lag(pop16))   -0.000730   0.000594   -1.23            0.22
I(ndi - rho_hat * lag(ndi))      0.000988   0.000682    1.45            0.15
                                   
I(price - rho_hat * lag(price)) ***
I(pop - rho_hat * lag(pop16))      
I(ndi - rho_hat * lag(ndi))        
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    35700
Residual Sum of Squares: 34300
R-Squared:      0.0414
Adj. R-Squared: -0.0166
F-statistic: 18.0908 on 3 and 1257 DF, p-value: 0.000000000017
cigar_plm %>% 
  tidy()
# A tibble: 3 x 5
  term  estimate std.error statistic  p.value
  <chr>    <dbl>     <dbl>     <dbl>    <dbl>
1 price -0.841    0.0750      -11.2  6.35e-28
2 pop16  0.00114  0.000547      2.07 3.83e- 2
3 ndi   -0.00557  0.000445    -12.5  5.72e-34

Now only \(\hat\beta_1\) is significantly different than zero.

We can use the panelAR package to directly estimate a corrected model43. ?panelAR for arguments necessary to estimate the corrected model.

library(panelAR)
panelAR(sales ~ price + pop +ndi, 
        data = Cigar, 
        panelVar = "state", 
        timeVar = "year",
        autoCorr = "ar1",
        panelCorrMethod = "pcse") %>% 
  summary()

Panel Regression with AR(1) Prais-Winsten correction and panel-corrected standard errors

Balanced Panel Design:                                             
 Total obs.:       1380 Avg obs. per panel 30
 Number of panels: 46   Max obs. per panel 30
 Number of times:  30   Min obs. per panel 30

Coefficients:
              Estimate Std. Error t value             Pr(>|t|)    
(Intercept) 137.461557   5.974766   23.01 < 0.0000000000000002 ***
price        -0.386976   0.060172   -6.43        0.00000000017 ***
pop          -0.000446   0.000338   -1.32               0.1872    
ndi           0.001879   0.000719    2.61               0.0091 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-squared: 0.5899
Wald statistic: 57.2869, Pr(>Chisq(3)): 0

16.7 Lagged Dependent Variable Panel Data Model

Let’s estimate the lagged-depdendent variable model \[sales_{it} = \gamma sales_{i,t-1}+\beta_0+\beta_1price_{it}+\beta_2pop16_{it}+\beta_3ndi_{it}+\epsilon_{it}\]

cigar_lag_plm <- plm(sales ~ lag(sales) + price + pop16 + ndi, 
                 data = Cigar, # recall plm does not play nice with the expose pipe, %$%
                 index = c("state", "year"), 
                 model = "within", 
                 effect = "twoways")
cigar_lag_plm %>% 
  summary()
Twoways effects Within Model

Call:
plm(formula = sales ~ lag(sales) + price + pop16 + ndi, data = Cigar, 
    effect = "twoways", model = "within", index = c("state", 
        "year"))

Balanced Panel: n = 46, T = 29, N = 1334

Residuals:
   Min. 1st Qu.  Median 3rd Qu.    Max. 
-29.476  -1.900   0.241   2.002  60.188 

Coefficients:
             Estimate Std. Error t-value             Pr(>|t|)    
lag(sales)  0.8977592  0.0119755   74.97 < 0.0000000000000002 ***
price      -0.1349955  0.0332910   -4.06             0.000053 ***
pop16       0.0000834  0.0002409    0.35                 0.73    
ndi        -0.0002809  0.0002022   -1.39                 0.16    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    244000
Residual Sum of Squares: 35000
R-Squared:      0.857
Adj. R-Squared: 0.848
F-statistic: 1876.56 on 4 and 1256 DF, p-value: <0.0000000000000002

16.8 Random Effects Model

plm(sales ~ price + pop16 + ndi,
    data = Cigar,
    index = c("state", "year"),
    model = "random",
    effect = "twoways" 
    ) %>% 
  summary()
Twoways effects Random Effect Model 
   (Swamy-Arora's transformation)

Call:
plm(formula = sales ~ price + pop16 + ndi, data = Cigar, effect = "twoways", 
    model = "random", index = c("state", "year"))

Balanced Panel: n = 46, T = 30, N = 1380

Effects:
                 var std.dev share
idiosyncratic 156.13   12.50  0.26
individual    438.16   20.93  0.73
time            6.70    2.59  0.01
theta: 0.892 (id) 0.42 (time) 0.419 (total)

Residuals:
   Min. 1st Qu.  Median 3rd Qu.    Max. 
-57.175  -7.171   0.235   5.785 128.022 

Coefficients:
              Estimate Std. Error z-value             Pr(>|z|)    
(Intercept) 139.028985   3.995359   34.80 < 0.0000000000000002 ***
price        -0.197951   0.044962   -4.40             0.000011 ***
pop16         0.000357   0.000526    0.68                 0.50    
ndi          -0.000356   0.000404   -0.88                 0.38    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    325000
Residual Sum of Squares: 280000
R-Squared:      0.138
Adj. R-Squared: 0.136
Chisq: 220.089 on 3 DF, p-value: <0.0000000000000002

  1. If cigarettes are a normal good we’d expect \(\beta_3>0\)↩︎

  2. Note the slight differences, because panelAR also corrects for heteroscedasticity.↩︎