## 18.4 Panel Data

Detail notes in R can be found here

Panel data follow the same individuals over T time periods.

The structure is like having n samples of time series data.

Characteristics

• Information both across individuals and over time (cross-sectional and time-series)

• N individuals and T time periods

• Data can be either

• Balanced: all individuals are observed in all time periods
• Unbalanced: not all individuals are observed in all time periods.
• Assume correlation (clustering) over time for a given individual, with independence over individuals.

Types

• Short panel: many individuals and few time periods.
• Long panel: many time periods and few individuals
• Both: many time periods and many individuals

Time Trends and Time Effects

• Nonlinear
• Seasonality
• Discontinuous shocks

Regressors

• Time-invariant regressors $$x_{it}=x_i$$ for all t (e.g., gender, race, education) have zero within variation
• Individual-invariant regressors $$x_{it}=x_{t}$$ for all i (e.g., time trend, economy trends) have zero between variation

Variation for the dependent variable and regressors

• Overall variation: variation over time and individuals.
• Between variation: variation between individuals
• Within variation: variation within individuals (over time).

| Estimate | Formula |
|---|---|
| Individual mean | $$\bar{x_i}= \frac{1}{T} \sum_{t}x_{it}$$ |
| Overall mean | $$\bar{x}=\frac{1}{NT} \sum_{i} \sum_t x_{it}$$ |
| Overall variance | $$s_O^2 = \frac{1}{NT-1} \sum_i \sum_t (x_{it} - \bar{x})^2$$ |
| Between variance | $$s_B^2 = \frac{1}{N-1} \sum_i (\bar{x_i} -\bar{x})^2$$ |
| Within variance | $$s_W^2= \frac{1}{NT-1} \sum_i \sum_t (x_{it} - \bar{x_i})^2$$, which equals the variance of the recentered series $$x_{it} - \bar{x_i} +\bar{x}$$ about $$\bar{x}$$ |

Note: $$s_O^2 \approx s_B^2 + s_W^2$$
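The decomposition can be checked numerically. The following is a minimal sketch in base R on simulated data (the data-generating process and all variable names are illustrative assumptions, and the panel is assumed to be balanced). It computes the three variances and verifies the exact sum-of-squares identity that lies behind the approximation.

```r
# Simulated balanced panel: N individuals observed for T_len periods (hypothetical data)
set.seed(1)
N <- 50
T_len <- 6
id <- rep(seq_len(N), each = T_len)
x  <- rnorm(N)[id] + rnorm(N * T_len)     # individual component + time-varying noise

x_bar    <- mean(x)                        # overall mean
x_bar_i  <- as.numeric(tapply(x, id, mean))  # one mean per individual (length N)
x_bar_it <- ave(x, id)                     # individual mean repeated for each row

s2_O <- sum((x - x_bar)^2)       / (N * T_len - 1)   # overall variance
s2_B <- sum((x_bar_i - x_bar)^2) / (N - 1)           # between variance
s2_W <- sum((x - x_bar_it)^2)    / (N * T_len - 1)   # within variance

c(overall = s2_O, between = s2_B, within = s2_W)

# Exact identity for a balanced panel:
# total SS = within SS + T * between SS
lhs <- (N * T_len - 1) * s2_O
rhs <- (N * T_len - 1) * s2_W + T_len * (N - 1) * s2_B
all.equal(lhs, rhs)
```

Because $$T(N-1)/(NT-1)$$ is close to 1 for moderately large N, the identity reduces to the approximation stated in the note.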

Since we have n observations in each time period t, we can control for each period's effect separately by including time dummies (time effects)

$y_{it}=\mathbf{x_{it}\beta} + d_1\delta_1+...+d_{T-1}\delta_{T-1} + \epsilon_{it}$

Note: we cannot include this many time dummies with pure time series data, because there n is 1. Each period is observed only once, so a dummy for every period leaves no variation with which to estimate the other coefficients, and there may be too few observations relative to the number of parameters.
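As a rough illustration of the time-dummy specification, the sketch below simulates a small panel in base R and lets `factor(t)` generate the dummies. The period effects in `delta` and the other names are hypothetical values chosen for the example.

```r
set.seed(2)
N <- 200
T_len <- 4
d <- data.frame(id = rep(seq_len(N), each = T_len),
                t  = rep(seq_len(T_len), times = N))

delta <- c(0, 0.5, -1, 2)                 # hypothetical period effects (period 1 is the baseline)
d$x <- rnorm(N * T_len)
d$y <- 1 + 2 * d$x + delta[d$t] + rnorm(N * T_len)

# factor(t) expands into T - 1 dummies; the omitted period is absorbed by the intercept
fit <- lm(y ~ x + factor(t), data = d)
coef(fit)
```

Each dummy coefficient is estimated from the N observations available in that period, which is what makes this feasible in a panel.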

Unobserved Effects Model

Similar to group clustering, assume that there is a random effect that captures differences across individuals but is constant over time.

$y_{it}=\mathbf{x_{it}\beta} + d_1\delta_1+...+d_{T-1}\delta_{T-1} + c_i + u_{it}$

where

• $$c_i + u_{it} = \epsilon_{it}$$
• $$c_i$$ unobserved individual heterogeneity (effect)
• $$u_{it}$$ idiosyncratic shock
• $$\epsilon_{it}$$ unobserved error term.

### 18.4.1 Pooled OLS Estimator

If $$c_i$$ is uncorrelated with $$x_{it}$$

$E(\mathbf{x_{it}'}(c_i+u_{it})) = 0$

then A3a still holds, and pooled OLS is consistent.

If A4 does not hold, OLS is still consistent, but not efficient, and we need cluster robust SE.

For A3a to hold for the composite error, it is sufficient that

• Exogeneity holds for the time-varying error $$u_{it}$$ (A3a, contemporaneous exogeneity): $$E(\mathbf{x_{it}'}u_{it})=0$$
• The Random Effects assumption holds for the time-constant error: $$E(\mathbf{x_{it}'}c_{i})=0$$

Pooled OLS will give you consistent coefficient estimates under A1, A2, A3a (for both $$u_{it}$$ and RE assumption), and A5 (randomly sampling across i).
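A small simulation can make the role of the RE assumption concrete. The sketch below (base R, hypothetical data-generating process with true slope 2) compares pooled OLS when $$c_i$$ is unrelated to the regressor with pooled OLS when the regressor depends on $$c_i$$.

```r
set.seed(3)
N <- 1000
T_len <- 5
id  <- rep(seq_len(N), each = T_len)
c_i <- rnorm(N)[id]                # unobserved individual effect, constant over time
u   <- rnorm(N * T_len)            # idiosyncratic shock

# Case 1: RE assumption holds (x is unrelated to c_i)
x1 <- rnorm(N * T_len)
y1 <- 2 * x1 + c_i + u
b_ok <- coef(lm(y1 ~ x1))[["x1"]]

# Case 2: RE assumption fails (x depends on c_i)
x2 <- 0.5 * c_i + rnorm(N * T_len)
y2 <- 2 * x2 + c_i + u
b_bad <- coef(lm(y2 ~ x2))[["x2"]]

c(uncorrelated = b_ok, correlated = b_bad)
```

In the first case the estimate should be close to 2. In the second case the probability limit is $$2 + Cov(x,c)/Var(x) = 2 + 0.5/1.25 = 2.4$$, so the pooled OLS estimate is pushed upward regardless of sample size.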

### 18.4.2 Individual-specific effects model

• If we believe that there is unobserved heterogeneity across individuals (e.g., unobserved ability of an individual affects $$y$$), we use an individual-specific effects model. If the individual-specific effects are correlated with the regressors, we use the Fixed Effects Estimator; if they are not correlated, we use the Random Effects Estimator.

#### 18.4.2.1 Random Effects Estimator

Random Effects estimator is the Feasible GLS estimator that assumes $$u_{it}$$ is serially uncorrelated and homoskedastic

• Under A1, A2, A3a (for both $$u_{it}$$ and RE assumption) and A5 (randomly sampling across i), RE estimator is consistent.

• If A4 holds for $$u_{it}$$, RE is the most efficient estimator
• If A4 fails to hold (may be heteroskedasticity across i, and serial correlation over t), then RE is not the most efficient, but still more efficient than pooled OLS.
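One common way to see how RE sits between pooled OLS and FE is the quasi-demeaned form of the FGLS estimator. The expression below is a sketch for a balanced panel; the symbols $$\theta$$, $$\sigma_u^2$$ (variance of $$u_{it}$$) and $$\sigma_c^2$$ (variance of $$c_i$$) are notation introduced here for illustration and do not appear elsewhere in this section.

```latex
y_{it} - \theta \bar{y}_i
  = (\mathbf{x}_{it} - \theta \bar{\mathbf{x}}_i)\beta
  + (\epsilon_{it} - \theta \bar{\epsilon}_i),
\qquad
\theta = 1 - \sqrt{\frac{\sigma_u^2}{\sigma_u^2 + T\sigma_c^2}}
```

When $$\sigma_c^2 = 0$$, $$\theta = 0$$ and the regression is pooled OLS; as $$T\sigma_c^2$$ grows relative to $$\sigma_u^2$$, $$\theta$$ approaches 1 and the transformation approaches full demeaning. In practice $$\theta$$ is replaced by an estimate, which is what makes the estimator feasible GLS.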

#### 18.4.2.2 Fixed Effects Estimator

Also known as the Within Estimator, it uses within variation (over time).

If the RE assumption does not hold ($$E(\mathbf{x_{it}'}c_i) \neq 0$$), then A3a does not hold ($$E(\mathbf{x_{it}'}\epsilon_{it}) \neq 0$$). Hence, pooled OLS and RE are inconsistent/biased (because of omitted variable bias).

To deal with violation in $$c_i$$, we have

$y_{it}= \mathbf{x_{it}\beta} + c_i + u_{it}$

$\bar{y_i}=\bar{\mathbf{x_i}} \beta + c_i + \bar{u_i}$

where the second equation is the time averaged equation

Using the within transformation (subtracting the time-averaged equation from the original one), we have

$y_{it} - \bar{y_i} = (\mathbf{x_{it}} - \bar{\mathbf{x_i}})\beta + u_{it} - \bar{u_i}$

because $$c_i$$ is time constant.

The Fixed Effects estimator uses POLS on the transformed equation

$y_{it} - \bar{y_i} = (\mathbf{x_{it}} - \bar{\mathbf{x_i}})\beta + d_1\delta_1 + ... + d_{T-1}\delta_{T-1} + u_{it} - \bar{u_i}$

• we need A3 (strict exogeneity) ($$E((\mathbf{x_{it}-\bar{x_i}})'(u_{it}-\bar{u_i}))=0$$) for FE to be consistent.

• Variables that are time constant will be absorbed into $$c_i$$. Hence we cannot make inference on time constant independent variables.

• If you are interested in the effects of time-invariant variables, you could consider the pooled OLS or between estimator.
• Cluster robust standard errors are still recommended.

Equivalently, the fixed effects estimator can be obtained from a dummy variable regression:

$y_{it} = x_{it}\beta + d_1\delta_1 + ... + d_{T-1}\delta_{T-1} + c_1\gamma_1 + ... + c_{n-1}\gamma_{n-1} + u_{it}$

where

$\begin{equation} c_i = \begin{cases} 1 &\text{if observation is i} \\ 0 &\text{otherwise} \\ \end{cases} \end{equation}$

• The standard error is incorrectly calculated.
• The FE within transformation controls for any time-constant difference across individuals, which is allowed to be correlated with the observables.
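The equivalence between the within transformation and the dummy variable regression can be verified on simulated data. This is a minimal base R sketch with a hypothetical data-generating process (true slope 2, regressor correlated with $$c_i$$); time dummies are left out to keep it short.

```r
set.seed(4)
N <- 300
T_len <- 5
id  <- rep(seq_len(N), each = T_len)
c_i <- rnorm(N)[id]                       # unobserved effect
x   <- 0.5 * c_i + rnorm(N * T_len)       # regressor correlated with c_i
y   <- 2 * x + c_i + rnorm(N * T_len)

# Within transformation: subtract each individual's time average
y_w <- y - ave(y, id)
x_w <- x - ave(x, id)
b_within <- coef(lm(y_w ~ x_w - 1))[["x_w"]]

# Dummy variable regression: a separate intercept for every individual
b_lsdv <- coef(lm(y ~ x + factor(id)))[["x"]]

c(within = b_within, lsdv = b_lsdv)
```

The two slope estimates coincide up to numerical precision and stay near the true value even though $$c_i$$ is correlated with the regressor. Note that a plain `lm` on the demeaned data reports standard errors based on $$NT-1$$ residual degrees of freedom rather than $$NT-N-1$$, which is one reason to prefer a dedicated routine such as `plm(..., model = "within")` for inference.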

### 18.4.3 Tests for Assumptions

We typically don’t test heteroskedasticity because we will use robust covariance matrix estimation anyway.

Dataset

library("plm")
data("EmplUK", package="plm")
data("Produc", package="plm")
data("Grunfeld", package="plm")
data("Wages", package="plm")

#### 18.4.3.1 Poolability

also known as an F test of stability (or Chow test) for the coefficients

$$H_0$$: All individuals have the same coefficients (i.e., equal coefficients for all individuals).

$$H_a$$ Different individuals have different coefficients.

Notes:

• Under a within (i.e., fixed) model, different intercepts for each individual are assumed
• Under random model, same intercept is assumed
library(plm)
plm::pooltest(inv~value+capital, data=Grunfeld, model="within")
##
##  F statistic
##
## data:  inv ~ value + capital
## F = 5.7805, df1 = 18, df2 = 170, p-value = 1.219e-10
## alternative hypothesis: unstability

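Hence, we reject the null hypothesis that the coefficients are stable across individuals. In that case, consider a model that lets the coefficients vary, such as the random coefficients model (see the Variable Coefficients Model section).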

#### 18.4.3.2 Individual and time effects

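Use the Lagrange multiplier test (`plmtest`) to test for the presence of individual effects, time effects, or both (i.e., individual and time).

Types:

• honda: default
• bp: for unbalanced panels
• kw: for unbalanced panels and two-way effects
• ghm: for two-way effects

The output shown next comes from the closely related F tests (`pFtest`), which compare the within model with pooled OLS.
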
pFtest(inv~value+capital, data=Grunfeld, effect="twoways")
##
##  F test for twoways effects
##
## data:  inv ~ value + capital
## F = 17.403, df1 = 28, df2 = 169, p-value < 2.2e-16
## alternative hypothesis: significant effects
pFtest(inv~value+capital, data=Grunfeld, effect="individual")
##
##  F test for individual effects
##
## data:  inv ~ value + capital
## F = 49.177, df1 = 9, df2 = 188, p-value < 2.2e-16
## alternative hypothesis: significant effects
pFtest(inv~value+capital, data=Grunfeld, effect="time")
##
##  F test for time effects
##
## data:  inv ~ value + capital
## F = 0.23451, df1 = 19, df2 = 178, p-value = 0.9997
## alternative hypothesis: significant effects
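
For comparison, the LM versions of these tests can be requested through `plmtest` with the `effect` and `type` arguments. The calls below are a sketch on the same Grunfeld data; the object name `lm_twoways` is only illustrative.

```r
library(plm)
data("Grunfeld", package = "plm")

# LM tests for individual effects, time effects, and both
plmtest(inv ~ value + capital, data = Grunfeld, effect = "individual", type = "honda")
plmtest(inv ~ value + capital, data = Grunfeld, effect = "time",       type = "bp")

# "ghm" is only defined for two-way effects
lm_twoways <- plmtest(inv ~ value + capital, data = Grunfeld,
                      effect = "twoways", type = "ghm")
lm_twoways
```

As with the F tests, a small p-value is evidence that the corresponding effects are present.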

#### 18.4.3.3 Cross-sectional dependence/contemporaneous correlation

• Null hypothesis: residuals across entities are not correlated.
##### 18.4.3.3.1 Global cross-sectional dependence
pcdtest(inv~value+capital, data=Grunfeld, model="within")
##
##  Pesaran CD test for cross-sectional dependence in panels
##
## data:  inv ~ value + capital
## z = 4.6612, p-value = 3.144e-06
## alternative hypothesis: cross-sectional dependence
##### 18.4.3.3.2 Local cross-sectional dependence

Use the same command, but supply a proximity matrix to the `w` argument. (The call shown here omits `w`, so it reproduces the global test.)

pcdtest(inv~value+capital, data=Grunfeld, model="within")
##
##  Pesaran CD test for cross-sectional dependence in panels
##
## data:  inv ~ value + capital
## z = 4.6612, p-value = 3.144e-06
## alternative hypothesis: cross-sectional dependence
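
A local version needs an n x n proximity matrix passed as `w`. The sketch below builds a purely hypothetical neighbourhood structure in which firms with adjacent positions are treated as neighbours; in a real application `W` would come from geography or some other substantive notion of proximity.

```r
library(plm)
data("Grunfeld", package = "plm")

n <- length(unique(Grunfeld$firm))        # number of cross-sectional units

# Hypothetical proximity matrix: unit i is a neighbour of units i - 1 and i + 1
W <- matrix(0, n, n)
W[abs(row(W) - col(W)) == 1] <- 1

local_cd <- pcdtest(inv ~ value + capital, data = Grunfeld,
                    model = "within", w = W)
local_cd
```

Only pairs flagged as neighbours in `W` enter the statistic, so the result generally differs from the global test.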

#### 18.4.3.4 Serial Correlation

• Null hypothesis: there is no serial correlation

• usually seen in macro panels with long time series (large N and T), not seen in micro panels (small T and large N)

• Serial correlation can arise from individual effects (i.e., a time-invariant error component) or from the idiosyncratic error terms (e.g., in the case of an AR(1) process). Typically, when we refer to serial correlation, we mean the second one.

• Can be

• marginal test: only 1 of the two above dependence (but can be biased towards rejection)

• joint test: both dependencies (but don’t know which one is causing the problem)

• conditional test: assume you correctly specify one dependence structure, test whether the other departure is present.

##### 18.4.3.4.1 Unobserved effect test
• semi-parametric test (the test statistic $$W \dot{\sim} N$$ regardless of the distribution of the errors) with $$H_0: \sigma^2_\mu = 0$$ (i.e., no unobserved effects in the residuals), favors pooled OLS.

• Under the null, covariance matrix of the residuals = its diagonal (off-diagonal = 0)
• It has power against both unobserved effects that are constant within every group and any kind of serial correlation.

pwtest(log(gsp)~log(pcap)+log(pc)+log(emp)+unemp, data=Produc)
##
##  Wooldridge's test for unobserved individual effects
##
## data:  formula
## z = 3.9383, p-value = 8.207e-05
## alternative hypothesis: unobserved effect

Here, we reject the null hypothesis that there are no unobserved effects in the residuals. Hence, we rule out pooled OLS.

##### 18.4.3.4.2 Locally robust tests for random effects and serial correlation
• A joint LM test for random effects and serial correlation assuming normality and homoskedasticity of the idiosyncratic errors
pbsytest(log(gsp)~log(pcap)+log(pc)+log(emp)+unemp, data=Produc, test="j")
##
##  Baltagi and Li AR-RE joint test - balanced panel
##
## data:  formula
## chisq = 4187.6, df = 2, p-value < 2.2e-16
## alternative hypothesis: AR(1) errors or random effects

Here, we reject the joint null hypothesis of no serial correlation and no random effects. But we still do not know whether the rejection is due to serial correlation, random effects, or both.

To identify the source of the departure from the null, we can use Bera, Sosa-Escudero and Yoon's (BSY) test for first-order serial correlation or for random effects (both under normality and homoskedasticity of the errors).

BSY for serial correlation

pbsytest(log(gsp)~log(pcap)+log(pc)+log(emp)+unemp, data=Produc)
##
##  Bera, Sosa-Escudero and Yoon locally robust test - balanced panel
##
## data:  formula
## chisq = 52.636, df = 1, p-value = 4.015e-13
## alternative hypothesis: AR(1) errors sub random effects

BSY for random effects

pbsytest(log(gsp)~log(pcap)+log(pc)+log(emp)+unemp, data=Produc, test="re")
##
##  Bera, Sosa-Escudero and Yoon locally robust test (one-sided) -
##  balanced panel
##
## data:  formula
## z = 57.914, p-value < 2.2e-16
## alternative hypothesis: random effects sub AR(1) errors

Since BSY is only locally robust, if you “know” there is no serial correlation, then the following LM-based test is superior:

plmtest(inv ~ value + capital, data = Grunfeld, type = "honda")
##
##  Lagrange Multiplier Test - (Honda) for balanced panels
##
## data:  inv ~ value + capital
## normal = 28.252, p-value < 2.2e-16
## alternative hypothesis: significant effects

On the other hand, if you know there are no random effects, use the Breusch-Godfrey test to test for serial correlation

lmtest::bgtest()

If you “know” there are random effects, use Baltagi and Li's test for serial correlation under both AR(1) and MA(1) processes.

$$H_0$$: Uncorrelated errors.

Note:

• one-sided only has power against positive serial correlation.
• applicable to only balanced panels.
pbltest(log(gsp)~log(pcap)+log(pc)+log(emp)+unemp,
data=Produc, alternative="onesided")
##
##  Baltagi and Li one-sided LM test
##
## data:  log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp
## z = 21.69, p-value < 2.2e-16
## alternative hypothesis: AR(1)/MA(1) errors in RE panel model

General serial correlation tests

• applicable to random effects model, OLS, and FE (with large T, also known as long panel).
• can also test higher-order serial correlation
plm::pbgtest(plm::plm(inv~value+capital, data = Grunfeld, model = "within"), order = 2)
##
##  Breusch-Godfrey/Wooldridge test for serial correlation in panel models
##
## data:  inv ~ value + capital
## chisq = 42.587, df = 2, p-value = 5.655e-10
## alternative hypothesis: serial correlation in idiosyncratic errors

in the case of short panels (small T and large n), we can use

pwartest(log(emp) ~ log(wage) + log(capital), data=EmplUK)
##
##  Wooldridge's test for serial correlation in FE panels
##
## data:  plm.model
## F = 312.3, df1 = 1, df2 = 889, p-value < 2.2e-16
## alternative hypothesis: serial correlation

#### 18.4.3.5 Unit roots/stationarity

• Dickey-Fuller test for stochastic trends.
• Null hypothesis: the series is non-stationary (unit root)
• You would want the test statistic to be less than the critical value (p < .05), so that there is evidence against a unit root.

#### 18.4.3.6 Heteroskedasticity

• Breusch-Pagan test

• Null hypothesis: the data is homoskedastic

• If there is evidence for heteroskedasticity, robust covariance matrix is advised.

• To control for heteroskedasticity: Robust covariance matrix estimation (Sandwich estimator)

• “white1” - for general heteroskedasticity but no serial correlation (check serial correlation first). Recommended for random effects.
• “white2” - is “white1” restricted to a common variance within groups. Recommended for random effects.
• “arellano” - both heteroskedasticity and serial correlation. Recommended for fixed effects
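
As an illustration of how these options are passed, the sketch below fits a fixed effects model on the Grunfeld data and compares conventional standard errors with Arellano-type clustered ones. The choice of `type = "HC1"` and the object names are assumptions made for the example, not a recommendation from the text.

```r
library(plm)
library(lmtest)
data("Grunfeld", package = "plm")

fe <- plm(inv ~ value + capital, data = Grunfeld, model = "within")

# Robust covariance: heteroskedasticity and serial correlation, clustered by individual
V_arellano <- vcovHC(fe, method = "arellano", type = "HC1", cluster = "group")

# Coefficient table using the robust covariance matrix
coeftest(fe, vcov. = V_arellano)

# Side-by-side standard errors
se_compare <- rbind(conventional = sqrt(diag(vcov(fe))),
                    arellano     = sqrt(diag(V_arellano)))
se_compare
```

Replacing `method = "arellano"` with `"white1"` or `"white2"` gives the other two estimators in the list.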

### 18.4.4 Model Selection

#### 18.4.4.1 POLS vs. RE

RE (which uses FGLS and therefore relies on more assumptions) and POLS lie on a continuum; see the earlier section on FGLS.

Breusch-Pagan LM test

• Test for the random effect model based on the OLS residual
• Null hypothesis: the variance across entities is zero. In other words, there is no panel effect.
• If the test is significant, RE is preferable compared to POLS

#### 18.4.4.2 FE vs. RE

• RE does not require strict exogeneity for consistency (feedback effect between residual and covariates)

| Hypothesis | If true |
|---|---|
| $$H_0: Cov(c_i,\mathbf{x_{it}})=0$$ | $$\hat{\beta}_{RE}$$ is consistent and efficient, while $$\hat{\beta}_{FE}$$ is consistent |
| $$H_a: Cov(c_i,\mathbf{x_{it}}) \neq 0$$ | $$\hat{\beta}_{RE}$$ is inconsistent, while $$\hat{\beta}_{FE}$$ is consistent |

Hausman Test

For the Hausman test to run, you need to assume that

• strict exogeneity hold
• A4 to hold for $$u_{it}$$

Then,

• Hausman test statistic: $$H=(\hat{\beta}_{RE}-\hat{\beta}_{FE})'(V(\hat{\beta}_{FE})- V(\hat{\beta}_{RE}))^{-1}(\hat{\beta}_{RE}-\hat{\beta}_{FE}) \sim \chi_{n(X)}^2$$ where $$n(X)$$ is the number of parameters for the time-varying regressors.
• A low p-value means that we would reject the null hypothesis and prefer FE
• A high p-value means that we would not reject the null hypothesis and consider RE estimator.
gw <- plm(inv~value+capital, data=Grunfeld, model="within")
gr <- plm(inv~value+capital, data=Grunfeld, model="random")
phtest(gw, gr)
##
##  Hausman Test
##
## data:  inv ~ value + capital
## chisq = 2.3304, df = 2, p-value = 0.3119
## alternative hypothesis: one model is inconsistent
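
To see what the test computes, the quadratic form can be evaluated by hand from the two fitted models. This is a sketch that assumes the comparison is restricted to the coefficients common to both models (the time-varying regressors); the object names are illustrative.

```r
library(plm)
data("Grunfeld", package = "plm")

gw <- plm(inv ~ value + capital, data = Grunfeld, model = "within")
gr <- plm(inv ~ value + capital, data = Grunfeld, model = "random")

common <- names(coef(gw))                  # FE has no intercept, so these are the shared terms
b_diff <- coef(gr)[common] - coef(gw)[common]
V_diff <- vcov(gw)[common, common] - vcov(gr)[common, common]

# Quadratic form in the coefficient difference, weighted by the inverse variance difference
H     <- as.numeric(t(b_diff) %*% solve(V_diff) %*% b_diff)
p_val <- pchisq(H, df = length(common), lower.tail = FALSE)

c(statistic = H, p.value = p_val)
```

The hand-computed statistic should agree with the value reported by `phtest(gw, gr)`.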

Estimators to consider under different violations: basic estimator, instrumental variable estimator, variable coefficients estimator, generalized method of moments estimator, general FGLS estimator, mean groups estimator, CCEMG, and estimators for limited dependent variables.

### 18.4.5 Summary

• All three estimators (POLS, RE, FE) require A1, A2, A5 (for individuals) to be consistent. Additionally,

• POLS is consistent under A3a(for $$u_{it}$$): $$E(\mathbf{x}_{it}'u_{it})=0$$, and RE Assumption $$E(\mathbf{x}_{it}'c_{i})=0$$

• If A4 does not hold, use cluster robust SE but POLS is not efficient
• RE is consistent under A3a(for $$u_{it}$$): $$E(\mathbf{x}_{it}'u_{it})=0$$, and RE Assumption $$E(\mathbf{x}_{it}'c_{i})=0$$

• If A4 (for $$u_{it}$$) holds then usual SE are valid and RE is most efficient
• If A4 (for $$u_{it}$$) does not hold, use cluster robust SE, and RE is no longer most efficient (but still more efficient than POLS)
• FE is consistent under A3 $$E((\mathbf{x}_{it}-\bar{\mathbf{x}}_{i})'(u_{it} -\bar{u}_{i}))=0$$

• Cannot estimate effects of time constant variables
• A4 generally does not hold for $$u_{it} -\bar{u}_{i}$$ so cluster robust SE are needed

Note: A5 for individual (not for time dimension) implies that you have A5a for the entire data set.

| Estimator / True Model | POLS | RE | FE |
|---|---|---|---|
| POLS | Consistent | Consistent | Inconsistent |
| FE | Consistent | Consistent | Consistent |
| RE | Consistent | Consistent | Inconsistent |

Based on table provided by Ani Katchova

### 18.4.6 Application

Recommended application of plm can be found here and here by Yves Croissant

#install.packages("plm")
library("plm")

library(foreign)
Panel <- read.dta("http://dss.princeton.edu/training/Panel101.dta")

attach(Panel)
Y <- cbind(y)
X <- cbind(x1, x2, x3)

# Set data as panel data
pdata <- pdata.frame(Panel, index=c("country","year"))

# Pooled OLS estimator
pooling <- plm(Y ~ X, data=pdata, model= "pooling")
summary(pooling)

# Between estimator
between <- plm(Y ~ X, data=pdata, model= "between")
summary(between)

# First differences estimator
firstdiff <- plm(Y ~ X, data=pdata, model= "fd")
summary(firstdiff)

# Fixed effects or within estimator
fixed <- plm(Y ~ X, data=pdata, model= "within")
summary(fixed)

# Random effects estimator
random <- plm(Y ~ X, data=pdata, model= "random")
summary(random)

# LM test for random effects versus OLS
# Fail to reject the null: use OLS; reject the null: use RE
plmtest(pooling, effect = "individual", type = "bp") # other types: "honda", "kw", "ghm"; other effects: "time", "twoways"

# B-P/LM and Pesaran CD (cross-sectional dependence) test
pcdtest(fixed, test = c("lm")) # Breusch and Pagan's original LM statistic
pcdtest(fixed, test = c("cd")) # Pesaran's CD statistic

# Serial Correlation
pbgtest(fixed)

# stationary
library("tseries")
adf.test(pdata$y, k = 2)

# F test for fixed effects versus OLS
pFtest(fixed, pooling)

# Hausman test for fixed versus random effects model
phtest(random, fixed)

# Breusch-Pagan heteroskedasticity
library(lmtest)
bptest(y ~ x1 + factor(country), data = pdata)

# If there is presence of heteroskedasticity
## For RE model
coeftest(random) # original coefficients
coeftest(random, vcovHC) # Heteroskedasticity consistent coefficients

t(sapply(c("HC0", "HC1", "HC2", "HC3", "HC4"), function(x) sqrt(diag(vcovHC(random, type = x))))) #show HC SE of the coef
# HC0 - heteroskedasticity consistent. The default.
# HC1,HC2, HC3 – Recommended for small samples. HC3 gives less weight to influential observations.
# HC4 - small samples with influential observations
# HAC - heteroskedasticity and autocorrelation consistent

## For FE model
coeftest(fixed) # Original coefficients
coeftest(fixed, vcovHC) # Heteroskedasticity consistent coefficients
coeftest(fixed, vcovHC(fixed, method = "arellano")) # Heteroskedasticity consistent coefficients (Arellano)
t(sapply(c("HC0", "HC1", "HC2", "HC3", "HC4"), function(x) sqrt(diag(vcovHC(fixed, type = x))))) #show HC SE of the coef

Advanced

Other methods to estimate the random model (`random.method`):

• "swar": default
• "walhus"
• "amemiya"
• "nerlove"

Other effects:

• Individual effects: default
• Time effects: "time"
• Individual and time effects: "twoways"

Note: no random two-ways effect model for random.method = "nerlove"

amemiya <- plm(Y ~ X, data=pdata, model= "random",random.method = "amemiya",effect = "twoways")

To call the estimation of the variance of the error components

ercomp(Y~X, data=pdata, method = "amemiya", effect = "twoways")

Check for the unbalancedness. Closer to 1 indicates balanced data

punbalancedness(random)

Instrumental variable

Available instrumental variable methods (`inst.method`):

• "bvk": default
• "baltagi"
• "am"
• "bms"

instr <- plm(Y ~ X | X_ins, data = pdata, random.method = "ht", model = "random", inst.method = "baltagi")

### 18.4.7 Other Estimators

#### 18.4.7.1 Variable Coefficients Model

fixed_pvcm <- pvcm(Y~X, data=pdata, model="within")
random_pvcm <- pvcm(Y~X, data=pdata, model="random")

More details can be found here

#### 18.4.7.2 Generalized Method of Moments Estimator

Typically used in dynamic models. The example is from the plm package.

z2 <- pgmm(log(emp) ~ lag(log(emp), 1)+ lag(log(wage), 0:1) +
lag(log(capital), 0:1) | lag(log(emp), 2:99) +
lag(log(wage), 2:99) + lag(log(capital), 2:99),
data = EmplUK, effect = "twoways", model = "onestep",
transformation = "ld")
summary(z2, robust = TRUE)

#### 18.4.7.3 General Feasible Generalized Least Squares Models

Assumes there is no cross-sectional correlation. Robust against intragroup heteroskedasticity and serial correlation. Suited to cases where n is much larger than T (short panel). However, it is inefficient under groupwise heteroskedasticity.

# Random Effects
zz <- pggls(log(emp)~log(wage)+log(capital), data=EmplUK, model="pooling")

# Fixed
zz <- pggls(log(emp)~log(wage)+log(capital), data=EmplUK, model="within")