Chapter 17 Advanced Panel Data

In this chapter we will learn techinques in R for panel data where there might be serially correlated errors, temporal dependence with a lagged dependent variable, and random effects models.

17.1 The Data

We will make use of the Cigar dataset from the plm package for this chapter. Cigar is a panel of 46 U.S. states over the period 1963-1992. The variables are:

  • state - State number
  • year
  • price - the price per pack of cigarettes in cents
  • pop - state population in thousands
  • pop16 - state population over the age of 16 in thousands
  • cpi - consumer price index (1983=100)
  • ndi - per capita disposable income in dollars
  • sales - cigarette sales per capita in packs
  • pimin - minimum price in adjoining states per pack of cigarettes in cents

The plm packages offers many functions to simplify the handling of advanced panel data.

17.2 Variation within Units

dplyr verbs make checking for variation within units across multiple variables relatively simple. First we use group-by so that any functions will be applied to each state individually. summarize_all will apply a function to each variable.

# A tibble: 46 x 8
   state price    pop  pop16   cpi   ndi sales pimin
   <int> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
 1     1  39.6  269.   319.   37.1 3944.  10.7  38.0
 2     3  40.7  748.   622.   37.1 4364.  13.3  40.8
 3     4  41.5  189.   198.   37.1 3814.  10.8  37.6
 4     5  48.3 3995.  3531.   37.1 5264.  20.6  40.9
 5     7  48.0  139.   218.   37.1 6624.  18.2  43.7
 6     8  40.6   58.1   66.6  37.1 4921.  13.8  38.3
 7     9  45.4   79.5   34.7  37.1 6223.  59.3  38.2
 8    10  45.4 2562.  2229.   37.1 4889.  10.3  37.6
 9    11  37.6  783.   734.   37.1 4478.  10.0  35.9
10    13  41.6  136.   113.   37.1 3900.  14.3  37.5
# ... with 36 more rows
# A tibble: 30 x 8
    year price   pop pop16   cpi   ndi sales pimin
   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1    63  1.86 4116. 2855.     0  387.  33.1  1.21
 2    64  1.89 4185. 2903.     0  413.  32.8  1.41
 3    65  1.97 4253. 2953.     0  417.  33.1  1.68
 4    66  2.43 4285. 2982.     0  438.  41.6  2.25
 5    67  2.32 4329. 3024.     0  460.  37.9  2.10
 6    68  2.39 4372. 3069.     0  486.  33.4  2.23
 7    69  3.13 4420. 3139.     0  504.  32.0  2.28
 8    70  3.65 4463. 3198.     0  537.  32.0  3.24
 9    71  3.71 4513. 3260.     0  559.  33.5  3.27
10    72  4.45 4536. 3306.     0  555.  36.2  4.07
# ... with 20 more rows

CPI is the only variable with a standard deviation of 0 for all units. As would be expected, CPI should not vary within year.

pvar from the plm package will perform the task of checking for variation.

no time variation:       state 
no individual variation: year cpi 

17.3 Two-Way Fixed Effects Model

Let’s estimate cigarette demand as: \[sales_{it}=\beta_0+\beta_1price_{it}+\beta_2pop16_{it}+\beta_3ndi_{it}+\alpha_i+\tau_t+\nu_{it}\]

We would expect \(\beta_1<0\), \(\beta_2>0\), and \(\beta_3<0\) if cigarettes are an inferior good25.

# A tibble: 3 x 5
  term  estimate std.error statistic  p.value
  <chr>    <dbl>     <dbl>     <dbl>    <dbl>
1 price -0.841    0.0750      -11.2  6.35e-28
2 pop16  0.00114  0.000547      2.07 3.83e- 2
3 ndi   -0.00557  0.000445    -12.5  5.72e-34

Each of the coefficients has the expected sign and is significant at the 5% level.

17.4 Testing for autocorrelation

Testing for autocorrelation is done by testing the following hypothesis: \[H_0:\rho=0\] \[H_1:\rho\ne0\]

Observations: 1,380
Variables: 9
$ state <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ year  <int> 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, ...
$ price <dbl> 28.6, 29.8, 29.8, 31.5, 31.6, 35.6, 36.6, 39.6, 42.7, 42...
$ pop   <dbl> 3383, 3431, 3486, 3524, 3533, 3522, 3531, 3444, 3481, 35...
$ pop16 <dbl> 2236, 2277, 2328, 2370, 2394, 2405, 2412, 2395, 2444, 24...
$ cpi   <dbl> 30.6, 31.0, 31.5, 32.4, 33.4, 34.8, 36.7, 38.8, 40.5, 41...
$ ndi   <dbl> 1558, 1684, 1810, 1915, 2024, 2202, 2377, 2591, 2785, 30...
$ sales <dbl> 93.9, 95.4, 98.5, 96.4, 95.5, 88.4, 90.1, 89.8, 95.4, 10...
$ pimin <dbl> 26.1, 27.5, 28.9, 29.5, 29.6, 32.0, 32.8, 34.3, 35.8, 37...

Our data are organized by unit by year, so we can estimate \(\hat\rho\) directly. First, obtain the residuals, e, from the estimated equation. Estimate the equation \(e=\rho e_{i,t-1}+\eta_{it}\).

# A tibble: 1 x 5
  term   estimate std.error statistic p.value
  <chr>     <dbl>     <dbl>     <dbl>   <dbl>
1 lag(e)    0.888    0.0126      70.3       0

We can reject the null hypothesis at the 1% level.

We can also check for autocorrelation with the LM test by estimating the model \[\hat\epsilon_{it}=\rho\hat\epsilon_{i,t-1}+\gamma_1price_{it}+\gamma_2pop16_{it}+\gamma_3ndi_{it}+\eta_{it}\] where \(nR^2\sim\chi^2_{df=1}\).

[1] 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000103

Again, we can reject the null hypothesis of no autocorrelation.

pwartest from the lpm package allows us to test for autocorrelation (?pwartest for relevant arguments).


    Wooldridge's test for serial correlation in FE panels

data:  cigar_plm
F = 10212, df1 = 1, df2 = 1332, p-value <0.0000000000000002
alternative hypothesis: serial correlation

We reject the null hpothesis of no autocorrelation.

17.5 Estimating \(\hat\rho\)

To correct for autocorrelation we need an estimate of \(\hat\rho\). We can estimate \(\hat\rho\) using either auxiliary regression from above.

# A tibble: 1 x 5
  term   estimate std.error statistic p.value
  <chr>     <dbl>     <dbl>     <dbl>   <dbl>
1 lag(e)    0.888    0.0126      70.3       0

Our estimate of \(\hat\rho\) is 0.888 is 0.888.

# A tibble: 4 x 5
  term    estimate std.error statistic    p.value
  <chr>      <dbl>     <dbl>     <dbl>      <dbl>
1 lag(e)  0.894     0.0127      70.2   0         
2 price   0.156     0.0339       4.59  0.00000482
3 pop16  -0.000203  0.000254    -0.802 0.423     
4 ndi     0.000517  0.000201     2.57  0.0102    

Our estimate of \(\hat\rho\) is 0.894.

17.6 Estimate a \(\rho\)-Transformed Model

We can manually transform the data and compare the transformed model to the non-transformed model.

Twoways effects Within Model

Call:
plm(formula = I(sales - rho_hat * lag(sales)) ~ I(price - rho_hat * 
    lag(price)) + I(pop - rho_hat * lag(pop16)) + I(ndi - rho_hat * 
    lag(ndi)), data = Cigar, effect = "twoways", model = "within", 
    index = c("state", "year"))

Balanced Panel: n = 46, T = 29, N = 1334

Residuals:
   Min. 1st Qu.  Median 3rd Qu.    Max. 
-27.611  -1.992   0.208   1.956  61.268 

Coefficients:
                                 Estimate Std. Error t-value
I(price - rho_hat * lag(price)) -0.329602   0.047324   -6.96
I(pop - rho_hat * lag(pop16))   -0.000730   0.000594   -1.23
I(ndi - rho_hat * lag(ndi))      0.000988   0.000682    1.45
                                       Pr(>|t|)    
I(price - rho_hat * lag(price)) 0.0000000000053 ***
I(pop - rho_hat * lag(pop16))              0.22    
I(ndi - rho_hat * lag(ndi))                0.15    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    35700
Residual Sum of Squares: 34300
R-Squared:      0.0414
Adj. R-Squared: -0.0166
F-statistic: 18.0908 on 3 and 1257 DF, p-value: 0.000000000017
# A tibble: 3 x 5
  term  estimate std.error statistic  p.value
  <chr>    <dbl>     <dbl>     <dbl>    <dbl>
1 price -0.841    0.0750      -11.2  6.35e-28
2 pop16  0.00114  0.000547      2.07 3.83e- 2
3 ndi   -0.00557  0.000445    -12.5  5.72e-34

Now only \(\hat\beta_1\) is significantly different than zero.

We can use the panelAR package to directly estimate a corrected model26. ?panelAR for arguments necessary to estimate the corrected model.


Panel Regression with AR(1) Prais-Winsten correction and panel-corrected standard errors

Balanced Panel Design:                                             
 Total obs.:       1380 Avg obs. per panel 30
 Number of panels: 46   Max obs. per panel 30
 Number of times:  30   Min obs. per panel 30

Coefficients:
              Estimate Std. Error t value             Pr(>|t|)    
(Intercept) 137.461557   5.974766   23.01 < 0.0000000000000002 ***
price        -0.386976   0.060172   -6.43        0.00000000017 ***
pop          -0.000446   0.000338   -1.32               0.1872    
ndi           0.001879   0.000719    2.61               0.0091 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-squared: 0.5899
Wald statistic: 57.2869, Pr(>Chisq(3)): 0

17.7 Lagged Dependent Variable Panel Data Model

Let’s estimate the lagged-depdendent variable model \[sales_{it} = \gamma sales_{i,t-1}+\beta_0+\beta_1price_{it}+\beta_2pop16_{it}+\beta_3ndi_{it}+\epsilon_{it}\]

Twoways effects Within Model

Call:
plm(formula = sales ~ lag(sales) + price + pop16 + ndi, data = Cigar, 
    effect = "twoways", model = "within", index = c("state", 
        "year"))

Balanced Panel: n = 46, T = 29, N = 1334

Residuals:
   Min. 1st Qu.  Median 3rd Qu.    Max. 
-29.476  -1.900   0.241   2.002  60.188 

Coefficients:
             Estimate Std. Error t-value             Pr(>|t|)    
lag(sales)  0.8977592  0.0119755   74.97 < 0.0000000000000002 ***
price      -0.1349955  0.0332910   -4.06             0.000053 ***
pop16       0.0000834  0.0002409    0.35                 0.73    
ndi        -0.0002809  0.0002022   -1.39                 0.16    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    244000
Residual Sum of Squares: 35000
R-Squared:      0.857
Adj. R-Squared: 0.848
F-statistic: 1876.56 on 4 and 1256 DF, p-value: <0.0000000000000002

17.8 Random Effects Model

Twoways effects Random Effect Model 
   (Swamy-Arora's transformation)

Call:
plm(formula = sales ~ price + pop16 + ndi, data = Cigar, effect = "twoways", 
    model = "random", index = c("state", "year"))

Balanced Panel: n = 46, T = 30, N = 1380

Effects:
                 var std.dev share
idiosyncratic 156.13   12.50  0.26
individual    438.16   20.93  0.73
time            6.70    2.59  0.01
theta: 0.892 (id) 0.42 (time) 0.419 (total)

Residuals:
   Min. 1st Qu.  Median 3rd Qu.    Max. 
-57.175  -7.171   0.235   5.785 128.022 

Coefficients:
              Estimate Std. Error z-value             Pr(>|z|)    
(Intercept) 139.028985   3.995359   34.80 < 0.0000000000000002 ***
price        -0.197951   0.044962   -4.40             0.000011 ***
pop16         0.000357   0.000526    0.68                 0.50    
ndi          -0.000356   0.000404   -0.88                 0.38    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    325000
Residual Sum of Squares: 280000
R-Squared:      0.138
Adj. R-Squared: 0.136
Chisq: 220.089 on 3 DF, p-value: <0.0000000000000002

  1. If cigarettes are a normal good we’d expect \(\beta_3>0\)

  2. Note the slight differences, because panelAR also corrects for heteroscedasticity.