## 18.4 Panel Data

Detailed notes in R can be found here.

Panel data follow each of \(N\) individuals over \(T\) time periods.

The panel structure is like having \(N\) samples of time-series data.

**Characteristics**

Information both across individuals and over time (cross-sectional and time-series)

N individuals and T time periods

Data can be either

- Balanced: all individuals are observed in all time periods
- Unbalanced: not all individuals are observed in all time periods.

Assume correlation (clustering) over time for a given individual, with independence over individuals.

**Types**

- Short panel: many individuals and few time periods.
- Long panel: many time periods and few individuals.
- Both: many time periods and many individuals.

**Time Trends and Time Effects**

- Nonlinear
- Seasonality
- Discontinuous shocks

**Regressors**

- Time-invariant regressors \(x_{it}=x_i\) for all t (e.g., gender, race, education) have zero within variation
- Individual-invariant regressors \(x_{it}=x_{t}\) for all i (e.g., time trend, economy trends) have zero between variation

**Variation for the dependent variable and regressors**

- Overall variation: variation over time and individuals.
- Between variation: variation between individuals
- Within variation: variation within individuals (over time).

Estimate | Formula |
---|---|
Individual mean | \(\bar{x}_i= \frac{1}{T} \sum_{t}x_{it}\) |
Overall mean | \(\bar{x}=\frac{1}{NT} \sum_{i} \sum_t x_{it}\) |
Overall variance | \(s_O^2 = \frac{1}{NT-1} \sum_i \sum_t (x_{it} - \bar{x})^2\) |
Between variance | \(s_B^2 = \frac{1}{N-1} \sum_i (\bar{x}_i -\bar{x})^2\) |
Within variance | \(s_W^2= \frac{1}{NT-1} \sum_i \sum_t (x_{it} - \bar{x}_i)^2 = \frac{1}{NT-1} \sum_i \sum_t (x_{it} - \bar{x}_i +\bar{x})^2\) |

**Note**: \(s_O^2 \approx s_B^2 + s_W^2\)
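As a quick numerical check of these definitions, here is a minimal base-R sketch on a toy balanced panel (all data and names are illustrative). It also verifies the exact sum-of-squares identity behind the approximation:

```
# Toy balanced panel: rows = individuals (N), columns = time periods (T)
set.seed(1)
N <- 4; T <- 5
x <- matrix(rnorm(N * T), nrow = N)

xbar_i <- rowMeans(x)   # individual means
xbar   <- mean(x)       # overall mean

# Subtracting the length-N vector xbar_i recycles down columns,
# i.e., row i loses its own mean xbar_i[i]
s2_O <- sum((x - xbar)^2)      / (N * T - 1)  # overall variance
s2_B <- sum((xbar_i - xbar)^2) / (N - 1)      # between variance
s2_W <- sum((x - xbar_i)^2)    / (N * T - 1)  # within variance

# Exact identity: SS_total = SS_within + T * SS_between-means,
# so with these denominators s2_O is only approximately s2_B + s2_W
c(overall = s2_O, between_plus_within = s2_B + s2_W)
```

The decomposition is exact in sums of squares; the approximation in \(s_O^2 \approx s_B^2 + s_W^2\) comes only from the differing degrees-of-freedom denominators.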

Since we have \(N\) observations for each time period \(t\), we can control for each time effect separately by including time dummies (time effects):

\[ y_{it}=\mathbf{x_{it}\beta} + d_1\delta_1+...+d_{T-1}\delta_{T-1} + \epsilon_{it} \]

**Note**: we cannot use this many time dummies in time-series data because there \(N\) is 1. Hence, there is no cross-sectional variation, and sometimes not enough observations relative to the number of variables to estimate the coefficients.
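To illustrate time dummies in a pooled regression, a minimal base-R sketch on simulated data (all names and parameter values are illustrative):

```
# Simulated panel: 6 individuals, 4 periods, with a period-specific shift in y
set.seed(2)
df <- data.frame(id = rep(1:6, each = 4), t = rep(1:4, times = 6))
df$x <- rnorm(nrow(df))
df$y <- 1 + 2 * df$x + 0.5 * df$t + rnorm(nrow(df), sd = 0.1)

# factor(t) creates T - 1 dummies; period 1 is absorbed into the intercept
fit <- lm(y ~ x + factor(t), data = df)
coef(fit)
```

The dummies absorb any common period effect (here a linear trend), leaving the slope on `x` close to its true value of 2.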

**Unobserved Effects Model** Similar to group clustering, assume that there is a random effect that captures differences across individuals but is constant in time.

\[ y_{it}=\mathbf{x_{it}\beta} + d_1\delta_1+...+d_{T-1}\delta_{T-1} + c_i + u_{it} \]

where

- \(c_i + u_{it} = \epsilon_{it}\)
- \(c_i\) unobserved individual heterogeneity (effect)
- \(u_{it}\) idiosyncratic shock
- \(\epsilon_{it}\) unobserved error term.

### 18.4.1 Pooled OLS Estimator

If \(c_i\) is uncorrelated with \(x_{it}\)

\[ E(\mathbf{x_{it}'}(c_i+u_{it})) = 0 \]

then A3a still holds, and pooled OLS is consistent.

If A4 does not hold, OLS is still consistent but not efficient, and we need cluster-robust standard errors.

For A3a to hold, it is sufficient to have both:

- **Exogeneity of \(u_{it}\)** (A3a, contemporaneous exogeneity, time-varying error): \(E(\mathbf{x_{it}'}u_{it})=0\)
- **Random effects assumption** (time-constant error): \(E(\mathbf{x_{it}'}c_{i})=0\)

Pooled OLS will give you consistent coefficient estimates under A1, A2, A3a (for both \(u_{it}\) and RE assumption), and A5 (randomly sampling across i).
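Conversely, when the RE assumption fails, pooled OLS suffers omitted-variable bias. A small base-R simulation (illustrative names and parameter values) makes this visible:

```
# c_i is built into x_it, violating the RE assumption E(x'c) = 0
set.seed(3)
N <- 200; T <- 4
c_i <- rep(rnorm(N), each = T)               # individual effect
x   <- rnorm(N * T) + c_i                    # regressor correlated with c_i
y   <- 1 * x + c_i + rnorm(N * T, sd = 0.5)  # true slope is 1

# Pooled OLS slope is biased upward: plim = 1 + Cov(x, c)/Var(x) = 1.5 here
coef(lm(y ~ x))["x"]
```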

### 18.4.2 Individual-specific effects model

- If we believe there is unobserved heterogeneity across individuals (e.g., unobserved ability of an individual affects \(y\)): when the individual-specific effects are correlated with the regressors, we use the Fixed Effects estimator; when they are uncorrelated, we use the Random Effects estimator.

#### 18.4.2.1 Random Effects Estimator

The Random Effects estimator is the feasible GLS estimator that assumes \(u_{it}\) is serially uncorrelated and homoskedastic.

#### 18.4.2.2 Fixed Effects Estimator

The Fixed Effects estimator, also known as the **Within Estimator**, uses within variation (over time).

If the **RE assumption** does not hold (\(E(\mathbf{x_{it}'}c_i) \neq 0\)), then A3a does not hold (\(E(\mathbf{x_{it}'}\epsilon_{it}) \neq 0\)). Hence, POLS and RE are inconsistent/biased (because of omitted variable bias).

To deal with this violation caused by \(c_i\), start from

\[ y_{it}= \mathbf{x_{it}\beta} + c_i + u_{it} \]

\[ \bar{y_i}=\bar{\mathbf{x_i}} \beta + c_i + \bar{u_i} \]

where the second equation is the time averaged equation

using **within transformation**, we have

\[ y_{it} - \bar{y}_i = (\mathbf{x}_{it} - \bar{\mathbf{x}}_i)\beta + u_{it} - \bar{u}_i \]

because \(c_i\) is time constant.

The Fixed Effects estimator uses POLS on the transformed equation

\[ y_{it} - \bar{y}_i = (\mathbf{x}_{it} - \bar{\mathbf{x}}_i)\beta + d_1\delta_1 + ... + d_{T-2}\delta_{T-2} + u_{it} - \bar{u}_i \]

We need A3 (strict exogeneity), \(E((\mathbf{x}_{it}-\bar{\mathbf{x}}_i)'(u_{it}-\bar{u}_i))=0\), for FE to be consistent.

Variables that are time constant will be absorbed into \(c_i\). Hence we cannot make inference on time constant independent variables.

- If you are interested in the effects of time-invariant variables, you could consider POLS or the **between estimator**.
- It is recommended that you still use cluster-robust standard errors.
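As a sketch of the **between estimator** (simulated data; all names are illustrative): it is simply OLS run on the individual time averages:

```
# Between estimator: OLS on individual means
set.seed(5)
N <- 50; T <- 5
id <- rep(1:N, each = T)
x  <- rnorm(N * T)
y  <- 2 + 3 * x + rnorm(N * T)   # true slope is 3

ybar_i <- tapply(y, id, mean)    # individual means of y
xbar_i <- tapply(x, id, mean)    # individual means of x
coef(lm(ybar_i ~ xbar_i))["xbar_i"]
```

Because it discards the within variation, the between estimator can identify effects only from cross-individual differences, which is why it tolerates time-invariant regressors.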

Equivalently to the within transformation, the fixed effects estimator can be obtained from the dummy-variable (LSDV) regression

\[ y_{it} = x_{it}\beta + d_1\delta_1 + ... + d_{T-2}\delta_{T-2} + c_1\gamma_1 + ... + c_{n-1}\gamma_{n-1} + u_{it} \]

where

\[\begin{equation} c_i = \begin{cases} 1 &\text{if observation is i} \\ 0 &\text{otherwise} \\ \end{cases} \end{equation}\]
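The equivalence between the within transformation and this dummy regression can be checked in a few lines of base R (simulated data; names are illustrative):

```
# FE via demeaning equals FE via individual dummies (LSDV)
set.seed(4)
N <- 8; T <- 6
id  <- rep(1:N, each = T)
c_i <- rep(rnorm(N, sd = 2), each = T)        # unobserved individual effect
x   <- rnorm(N * T) + 0.5 * c_i               # correlated with c_i
y   <- 1.5 * x + c_i + rnorm(N * T, sd = 0.1)

# Within transformation: ave() subtracts each individual's mean
beta_within <- coef(lm(I(y - ave(y, id)) ~ I(x - ave(x, id)) - 1))

# LSDV: one dummy per individual
beta_lsdv <- coef(lm(y ~ x + factor(id)))["x"]

all.equal(unname(beta_within), unname(beta_lsdv))
```

Both give the identical \(\hat{\beta}\) (by the Frisch-Waugh-Lovell theorem); only the degrees-of-freedom bookkeeping differs, which is why default standard errors need adjusting.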

- The default standard errors are incorrectly calculated; use cluster-robust standard errors.
- The FE (within) transformation controls for any difference across individuals, which is allowed to be correlated with the observables.

### 18.4.3 Tests for Assumptions

We typically don’t test heteroskedasticity because we will use robust covariance matrix estimation anyway.

Dataset

```
library("plm")
data("EmplUK", package="plm")
data("Produc", package="plm")
data("Grunfeld", package="plm")
data("Wages", package="plm")
```

#### 18.4.3.1 Poolability

also known as an F test of stability (or Chow test) for the coefficients

\(H_0\): All individuals have the same coefficients (i.e., equal coefficients for all individuals).

\(H_a\) Different individuals have different coefficients.

Notes:

- Under a within (i.e., fixed) model, different intercepts for each individual are assumed
- Under random model, same intercept is assumed

```
library(plm)
plm::pooltest(inv ~ value + capital, data = Grunfeld, model = "within")
```

```
##
## F statistic
##
## data: inv ~ value + capital
## F = 5.7805, df1 = 18, df2 = 170, p-value = 1.219e-10
## alternative hypothesis: unstability
```

Hence, we reject the null hypothesis that the coefficients are stable across individuals; a model that allows coefficients to vary across individuals (e.g., the variable coefficients model) should be considered instead.

#### 18.4.3.2 Individual and time effects

Use the Lagrange multiplier test to test for the presence of individual effects, time effects, or both.

Types:

- `honda`: (Honda 1985), the default
- `bp`: (T. S. Breusch and Pagan 1980), for unbalanced panels
- `kw`: (M. L. King and Wu 1997), for unbalanced panels and two-way effects
- `ghm`: (Gourieroux, Holly, and Monfort 1982), for two-way effects

`pFtest(inv~value+capital, data=Grunfeld, effect="twoways")`

```
##
## F test for twoways effects
##
## data: inv ~ value + capital
## F = 17.403, df1 = 28, df2 = 169, p-value < 2.2e-16
## alternative hypothesis: significant effects
```

`pFtest(inv~value+capital, data=Grunfeld, effect="individual")`

```
##
## F test for individual effects
##
## data: inv ~ value + capital
## F = 49.177, df1 = 9, df2 = 188, p-value < 2.2e-16
## alternative hypothesis: significant effects
```

`pFtest(inv~value+capital, data=Grunfeld, effect="time")`

```
##
## F test for time effects
##
## data: inv ~ value + capital
## F = 0.23451, df1 = 19, df2 = 178, p-value = 0.9997
## alternative hypothesis: significant effects
```

#### 18.4.3.3 Cross-sectional dependence/contemporaneous correlation

- Null hypothesis: residuals across entities are not correlated.

##### 18.4.3.3.1 Global cross-sectional dependence

`pcdtest(inv~value+capital, data=Grunfeld, model="within")`

```
##
## Pesaran CD test for cross-sectional dependence in panels
##
## data: inv ~ value + capital
## z = 4.6612, p-value = 3.144e-06
## alternative hypothesis: cross-sectional dependence
```

##### 18.4.3.3.2 Local cross-sectional dependence

Use the same command, but supply a proximity matrix to the `w` argument.

`pcdtest(inv~value+capital, data=Grunfeld, model="within")`

```
##
## Pesaran CD test for cross-sectional dependence in panels
##
## data: inv ~ value + capital
## z = 4.6612, p-value = 3.144e-06
## alternative hypothesis: cross-sectional dependence
```

#### 18.4.3.4 Serial Correlation

Null hypothesis: there is no serial correlation

Usually a concern in macro panels with long time series (large T), not in micro panels (small T and large N).

Serial correlation can arise from the individual effects (i.e., the time-invariant error component) or from the idiosyncratic error term (e.g., an AR(1) process). Typically, when we refer to serial correlation, we refer to the latter.

Tests can be:

- **marginal** test: targets only one of the two dependencies above (but can be biased towards rejection)
- **joint** test: targets both dependencies (but you do not know which one is causing the problem)
- **conditional** test: assumes you correctly specified one dependence structure and tests whether the other departure is present

##### 18.4.3.4.1 Unobserved effect test

semi-parametric test (the test statistic \(W \dot{\sim} N\) regardless of the distribution of the errors) with \(H_0: \sigma^2_\mu = 0\) (i.e., no unobserved effects in the residuals), favors pooled OLS.

- Under the null, covariance matrix of the residuals = its diagonal (off-diagonal = 0)

It is robust against both **unobserved effects** that are constant within every group and any kind of **serial correlation**.

`pwtest(log(gsp)~log(pcap)+log(pc)+log(emp)+unemp, data=Produc)`

```
##
## Wooldridge's test for unobserved individual effects
##
## data: formula
## z = 3.9383, p-value = 8.207e-05
## alternative hypothesis: unobserved effect
```

Here, we reject the null hypothesis of no unobserved effects in the residuals. Hence, we rule out pooled OLS.

##### 18.4.3.4.2 Locally robust tests for random effects and serial correlation

- A joint LM test for **random effects** and **serial correlation**, assuming normality and homoskedasticity of the idiosyncratic errors (Baltagi and Li 1991; Baltagi and Li 1995)

`pbsytest(log(gsp)~log(pcap)+log(pc)+log(emp)+unemp, data=Produc, test="j")`

```
##
## Baltagi and Li AR-RE joint test - balanced panel
##
## data: formula
## chisq = 4187.6, df = 2, p-value < 2.2e-16
## alternative hypothesis: AR(1) errors or random effects
```

Here, we reject the joint null hypothesis of no **serial correlation** and no **random effects**. But we still do not know whether the rejection is due to serial correlation, random effects, or both.

To identify which departure from the null is present, we can use Bera, Sosa-Escudero, and Yoon (2001)'s tests for first-order serial correlation or random effects (both under the normality and homoskedasticity assumptions on the error).

BSY for serial correlation

`pbsytest(log(gsp)~log(pcap)+log(pc)+log(emp)+unemp, data=Produc)`

```
##
## Bera, Sosa-Escudero and Yoon locally robust test - balanced panel
##
## data: formula
## chisq = 52.636, df = 1, p-value = 4.015e-13
## alternative hypothesis: AR(1) errors sub random effects
```

BSY for random effects

`pbsytest(log(gsp)~log(pcap)+log(pc)+log(emp)+unemp, data=Produc, test="re")`

```
##
## Bera, Sosa-Escudero and Yoon locally robust test (one-sided) -
## balanced panel
##
## data: formula
## z = 57.914, p-value < 2.2e-16
## alternative hypothesis: random effects sub AR(1) errors
```

Since BSY is only locally robust, if you “know” there is no serial correlation, then this LM-based test is superior:

`plmtest(inv ~ value + capital, data = Grunfeld, type = "honda")`

```
##
## Lagrange Multiplier Test - (Honda) for balanced panels
##
## data: inv ~ value + capital
## normal = 28.252, p-value < 2.2e-16
## alternative hypothesis: significant effects
```

On the other hand, if you know there are no random effects, test for serial correlation with the Breusch (1978)-Godfrey (1978) test:

`lmtest::bgtest()`

If you “know” there are random effects, use Baltagi and Li (1995)'s test for serial correlation in both AR(1) and MA(1) processes.

\(H_0\): Uncorrelated errors.

Note:

- one-sided only has power against positive serial correlation.
- applicable to only balanced panels.

```
pbltest(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
        data = Produc, alternative = "onesided")
```

```
##
## Baltagi and Li one-sided LM test
##
## data: log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp
## z = 21.69, p-value < 2.2e-16
## alternative hypothesis: AR(1)/MA(1) errors in RE panel model
```

General serial correlation tests

- applicable to random effects model, OLS, and FE (with large T, also known as long panel).
- can also test higher-order serial correlation

`plm::pbgtest(plm::plm(inv ~ value + capital, data = Grunfeld, model = "within"), order = 2)`

```
##
## Breusch-Godfrey/Wooldridge test for serial correlation in panel models
##
## data: inv ~ value + capital
## chisq = 42.587, df = 2, p-value = 5.655e-10
## alternative hypothesis: serial correlation in idiosyncratic errors
```

In the case of short panels (small T and large N), we can use

`pwartest(log(emp) ~ log(wage) + log(capital), data=EmplUK)`

```
##
## Wooldridge's test for serial correlation in FE panels
##
## data: plm.model
## F = 312.3, df1 = 1, df2 = 889, p-value < 2.2e-16
## alternative hypothesis: serial correlation
```

#### 18.4.3.5 Unit roots/stationarity

- Dickey-Fuller test for stochastic trends.
- Null hypothesis: the series is non-stationary (unit root)
- You would want your test statistic to be less than the critical value (p < .05), so that there is evidence of no unit root.

#### 18.4.3.6 Heteroskedasticity

Breusch-Pagan test

Null hypothesis: the data is homoskedastic

If there is evidence for heteroskedasticity, robust covariance matrix is advised.

To control for heteroskedasticity: Robust covariance matrix estimation (Sandwich estimator)

- “white1” - for general heteroskedasticity but no serial correlation (check serial correlation first). Recommended for random effects.
- “white2” - is “white1” restricted to a common variance within groups. Recommended for random effects.
- “arellano” - both heteroskedasticity and serial correlation. Recommended for fixed effects

### 18.4.4 Model Selection

#### 18.4.4.1 POLS vs. RE

RE (estimated by FGLS, which requires more assumptions) and POLS lie on a continuum; refer back to the section on FGLS.

**Breusch-Pagan LM** test

- Test for the random effect model based on the OLS residual
- Null hypothesis: the variance across entities is zero; in other words, there is no panel effect.
- If the test is significant, RE is preferable compared to POLS

#### 18.4.4.2 FE vs. RE

- RE does not require strict exogeneity for consistency (it allows a feedback effect between residuals and covariates).

Hypothesis | If true |
---|---|
\(H_0: Cov(c_i,\mathbf{x_{it}})=0\) | \(\hat{\beta}_{RE}\) is consistent and efficient, while \(\hat{\beta}_{FE}\) is consistent |
\(H_a: Cov(c_i,\mathbf{x_{it}}) \neq 0\) | \(\hat{\beta}_{RE}\) is inconsistent, while \(\hat{\beta}_{FE}\) is consistent |

**Hausman Test**

For the Hausman test to run, you need to assume that

- strict exogeneity holds
- A4 holds for \(u_{it}\)

Then,

- Hausman test statistic: \(H=(\hat{\beta}_{RE}-\hat{\beta}_{FE})'(V(\hat{\beta}_{FE})- V(\hat{\beta}_{RE}))^{-1}(\hat{\beta}_{RE}-\hat{\beta}_{FE}) \sim \chi_{n(X)}^2\), where \(n(X)\) is the number of parameters for the time-varying regressors.
- A low p-value means that we would reject the null hypothesis and prefer FE
- A high p-value means that we would not reject the null hypothesis and consider RE estimator.
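As a numeric sketch of the statistic itself (all numbers below are hypothetical, not from any real model), with the contrast \(q = \hat{\beta}_{RE}-\hat{\beta}_{FE}\):

```
# Hypothetical RE and FE estimates for two time-varying regressors
b_re <- c(0.50, 1.20)
b_fe <- c(0.55, 1.35)
V_re <- diag(c(0.010, 0.020))  # RE is efficient under H0: smaller variances
V_fe <- diag(c(0.015, 0.030))

q <- b_re - b_fe
H <- as.numeric(t(q) %*% solve(V_fe - V_re) %*% q)   # Hausman statistic
p_value <- pchisq(H, df = length(q), lower.tail = FALSE)
c(H = H, p = p_value)
```

Here \(H = 2.75\) with 2 degrees of freedom, so the null is not rejected and RE would be considered.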

```
gw <- plm(inv ~ value + capital, data = Grunfeld, model = "within")
gr <- plm(inv ~ value + capital, data = Grunfeld, model = "random")
phtest(gw, gr)
```

```
##
## Hausman Test
##
## data: inv ~ value + capital
## chisq = 2.3304, df = 2, p-value = 0.3119
## alternative hypothesis: one model is inconsistent
```

When the basic estimators' assumptions are violated, alternatives include: the instrumental variable estimator, the variable coefficients estimator, the generalized method of moments estimator, the general FGLS estimator, the mean groups estimator, CCEMG, and estimators for limited dependent variables.

### 18.4.5 Summary

All three estimators (POLS, RE, FE) require A1, A2, A5 (for individuals) to be consistent. Additionally,

POLS is consistent under A3a(for \(u_{it}\)): \(E(\mathbf{x}_{it}'u_{it})=0\), and RE Assumption \(E(\mathbf{x}_{it}'c_{i})=0\)

- If A4 does not hold, use cluster robust SE but POLS is not efficient

RE is consistent under A3a(for \(u_{it}\)): \(E(\mathbf{x}_{it}'u_{it})=0\), and RE Assumption \(E(\mathbf{x}_{it}'c_{i})=0\)

FE is consistent under A3: \(E((\mathbf{x}_{it}-\bar{\mathbf{x}}_{i})'(u_{it} -\bar{u}_{i}))=0\)

- Cannot estimate effects of time constant variables
- A4 generally does not hold for \(u_{it} -\bar{u}_{it}\) so cluster robust SE are needed

**Note**: A5 for individual (not for time dimension) implies that you have A5a for the entire data set.

Estimator / True Model | POLS | RE | FE |
---|---|---|---|
POLS | Consistent | Consistent | Inconsistent |
FE | Consistent | Consistent | Consistent |
RE | Consistent | Consistent | Inconsistent |

Based on table provided by Ani Katchova

### 18.4.6 Application

Recommended applications of `plm` can be found here and here, by Yves Croissant.

```
# install.packages("plm")
library(plm)
library(foreign)

Panel <- read.dta("http://dss.princeton.edu/training/Panel101.dta")
attach(Panel)
Y <- cbind(y)
X <- cbind(x1, x2, x3)

# Set data as panel data
pdata <- pdata.frame(Panel, index = c("country", "year"))

# Pooled OLS estimator
pooling <- plm(Y ~ X, data = pdata, model = "pooling")
summary(pooling)

# Between estimator
between <- plm(Y ~ X, data = pdata, model = "between")
summary(between)

# First differences estimator
firstdiff <- plm(Y ~ X, data = pdata, model = "fd")
summary(firstdiff)

# Fixed effects or within estimator
fixed <- plm(Y ~ X, data = pdata, model = "within")
summary(fixed)

# Random effects estimator
random <- plm(Y ~ X, data = pdata, model = "random")
summary(random)

# LM test for random effects versus OLS
# Fail to reject the null: use OLS; reject the null: use RE
plmtest(pooling, effect = "individual", type = "bp") # other types: "honda", "kw", "ghm"; other effects: "time", "twoways"

# B-P/LM and Pesaran CD (cross-sectional dependence) tests
pcdtest(fixed, test = "lm") # Breusch and Pagan's original LM statistic
pcdtest(fixed, test = "cd") # Pesaran's CD statistic

# Serial correlation
pbgtest(fixed)

# Stationarity
library(tseries)
adf.test(pdata$y, k = 2)

# F test for fixed effects versus OLS
pFtest(fixed, pooling)

# Hausman test for fixed versus random effects model
phtest(random, fixed)

# Breusch-Pagan heteroskedasticity test
library(lmtest)
bptest(y ~ x1 + factor(country), data = pdata)

# If there is evidence of heteroskedasticity
## For the RE model
coeftest(random)         # original coefficients
coeftest(random, vcovHC) # heteroskedasticity-consistent coefficients
t(sapply(c("HC0", "HC1", "HC2", "HC3", "HC4"),
         function(x) sqrt(diag(vcovHC(random, type = x))))) # HC standard errors
# HC0 - heteroskedasticity consistent (the default)
# HC1, HC2, HC3 - recommended for small samples; HC3 gives less weight to influential observations
# HC4 - small samples with influential observations
# HAC - heteroskedasticity and autocorrelation consistent

## For the FE model
coeftest(fixed)          # original coefficients
coeftest(fixed, vcovHC)  # heteroskedasticity-consistent coefficients
coeftest(fixed, vcovHC(fixed, method = "arellano")) # Arellano version (also robust to serial correlation)
t(sapply(c("HC0", "HC1", "HC2", "HC3", "HC4"),
         function(x) sqrt(diag(vcovHC(fixed, type = x))))) # HC standard errors
```

**Advanced**

Other methods to estimate the random model:

- `"swar"`: (Swamy and Arora 1972), the default
- `"walhus"`: (Wallace and Hussain 1969)
- `"amemiya"`: (Fuller and Battese 1974)
- `"nerlove"`: (Nerlove 1971)

Other effects:

- Individual effects: the *default*
- Time effects: `"time"`
- Individual and time effects: `"twoways"`

**Note**: no random two-ways effect model for `random.method = "nerlove"`

`amemiya <- plm(Y ~ X, data = pdata, model = "random", random.method = "amemiya", effect = "twoways")`

To call the estimation of the variance of the error components

`ercomp(Y~X, data=pdata, method = "amemiya", effect = "twoways")`

Check for unbalancedness; values closer to 1 indicate more balanced data (Ahrens and Pincus 1981).

`punbalancedness(random)`

**Instrumental variable**

- `"bvk"`: the default (Balestra and Varadharajan-Krishnakumar 1987)
- `"baltagi"`: (Baltagi 1981)
- `"am"`: (Amemiya and MaCurdy 1986)
- `"bms"`: (Trevor S. Breusch, Mizon, and Schmidt 1989)

`instr <- plm(Y ~ X | X_ins, data = pdata, random.method = "ht", model = "random", inst.method = "baltagi")`

### 18.4.7 Other Estimators

#### 18.4.7.1 Variable Coefficients Model

```
fixed_pvcm  <- pvcm(Y ~ X, data = pdata, model = "within")
random_pvcm <- pvcm(Y ~ X, data = pdata, model = "random")
```

More details can be found here

#### 18.4.7.2 Generalized Method of Moments Estimator

Typically used in dynamic models. The example below is from the `plm` package.

```
z2 <- pgmm(log(emp) ~ lag(log(emp), 1) + lag(log(wage), 0:1) +
             lag(log(capital), 0:1) | lag(log(emp), 2:99) +
             lag(log(wage), 2:99) + lag(log(capital), 2:99),
           data = EmplUK, effect = "twoways", model = "onestep",
           transformation = "ld")
summary(z2, robust = TRUE)
```

#### 18.4.7.3 General Feasible Generalized Least Squares Models

Assumes there is no cross-sectional correlation. Robust against intragroup heteroskedasticity and serial correlation. Suited when \(N\) is much larger than \(T\) (i.e., a short panel). However, it is inefficient under groupwise heteroskedasticity.

```
# Random effects
zz <- pggls(log(emp) ~ log(wage) + log(capital), data = EmplUK, model = "pooling")

# Fixed effects
zz <- pggls(log(emp) ~ log(wage) + log(capital), data = EmplUK, model = "within")
```