Data can be defined broadly as any set of values, facts, or statistics that can be used for reference, analysis, and drawing inferences. In research, data drives the process of understanding phenomena, testing hypotheses, and formulating evidence-based conclusions. Choosing the right type of data (and understanding its strengths and limitations) is critical for the validity and reliability of findings.
11.1 Data Types
11.1.1 Qualitative vs. Quantitative Data
A foundational way to categorize data is by whether it is qualitative (non-numerical) or quantitative (numerical). These distinctions often guide research designs, data collection methods, and analytical techniques.
| Aspect | Qualitative | Quantitative |
|---|---|---|
| Examples | In-depth interviews, focus groups, case studies, ethnographies, open-ended questions, field notes | Surveys with closed-ended questions, experiments, numerical observations, structured interviews |
| Nature | Text-based, often descriptive, subjective interpretations | Numeric, more standardized, objective measures |
| Outcome | Rich context, detailed understanding of phenomena | Measurable facts, generalizable findings (with appropriate sampling and design) |
11.1.1.1 Uses and Advantages of Qualitative Data
Deep Understanding: Captures context, motivations, and perceptions in depth.
Flexibility: Elicits new insights through open-ended inquiry.
Inductive Approaches: Often used to build new theories or conceptual frameworks.
11.1.1.2 Uses and Advantages of Quantitative Data
Measurement and Comparison: Facilitates measuring variables and comparing across groups or over time.
Generalizability: With proper sampling, findings can often be generalized to broader populations.
Hypothesis Testing: Permits the use of statistical methods to test specific predictions or relationships.
11.1.1.3 Limitations of Qualitative and Quantitative Data
Qualitative:
Findings may be difficult to generalize if samples are small or non-representative.
Analysis can be time-consuming due to coding and interpreting text.
Potential for researcher bias in interpretation.
Quantitative:
May oversimplify complex human behaviors or contextual factors by reducing them to numbers.
Validity depends heavily on how well constructs are operationalized.
Can miss underlying meanings or nuances not captured in numeric measures.
11.1.1.4 Levels of Measurement
Even within quantitative data, there are further distinctions based on the level of measurement. This classification is crucial for determining which statistical techniques are appropriate:
Nominal: Categorical data with no inherent order (e.g., gender, blood type, eye color).
Ordinal: Categorical data with a specific order or ranking but without consistent intervals between ranks (e.g., Likert scale responses: “strongly disagree,” “disagree,” “neutral,” “agree,” “strongly agree”).
Interval: Numeric data with equal intervals but no true zero (e.g., temperature in Celsius or Fahrenheit).
Ratio: Numeric data with equal intervals and a meaningful zero (e.g., height, weight, income).
The level of measurement affects which statistical tests (like t-tests, ANOVA, correlations, regressions) are valid and how you can interpret differences or ratios in the data.
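As a small illustration of how the four levels might be represented in R, the sketch below uses made-up values (all variable names and data are hypothetical). It shows which operation is meaningful at each level rather than prescribing a particular workflow.

```r
# Hypothetical example values, chosen only for illustration
blood_type <- factor(c("A", "O", "B", "O", "AB"))           # nominal
agreement  <- factor(
  c("agree", "neutral", "strongly agree", "disagree", "agree"),
  levels  = c("strongly disagree", "disagree", "neutral",
              "agree", "strongly agree"),
  ordered = TRUE
)                                                            # ordinal
temp_c    <- c(10, 20, 30)                                   # interval
weight_kg <- c(50, 75, 100)                                  # ratio

table(blood_type)             # nominal: only counts are meaningful
agreement[1] > agreement[2]   # ordinal: order comparisons are meaningful
diff(temp_c)                  # interval: differences are meaningful
weight_kg[3] / weight_kg[1]   # ratio: ratios are meaningful
```

Storing ordinal responses as an ordered factor keeps the ranking without implying equal spacing between categories.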
11.1.2 Other Ways to Classify Data
Beyond the qualitative versus quantitative distinction above and the observational structure discussed in Section 11.1.3, there are several other dimensions used to classify data:
11.1.2.1 Primary vs. Secondary Data
Primary Data: Collected directly by the researcher for a specific purpose (e.g., firsthand surveys, experiments, direct measurements).
Secondary Data: Originally gathered by someone else for a different purpose (e.g., government census data, administrative records, previously published datasets).
11.1.2.2 Structured, Semi-Structured, and Unstructured Data
Structured Data: Organized in a predefined manner, typically in rows and columns (e.g., spreadsheets, relational databases).
Semi-Structured Data: Contains organizational markers but not strictly tabular (e.g., JSON, XML logs, HTML).
Unstructured Data: Lacks a clear, consistent format (e.g., raw text, images, videos, audio files). It is often analyzed using natural language processing (NLP), image recognition, or other advanced techniques.
11.1.2.3 Big Data
Characterized by the “3 Vs”: Volume (large amounts), Variety (diverse forms), and Velocity (high-speed generation).
Requires specialized computational tools (e.g., Hadoop, Spark) and often cloud-based infrastructure for storage and processing.
Can be structured or unstructured (e.g., social media feeds, sensor data, clickstream data).
11.1.2.4 Internal vs. External Data (in Organizational Contexts)
Internal Data: Generated within an organization (e.g., sales records, HR data, production metrics).
External Data: Sourced from outside (e.g., macroeconomic indicators, market research reports, social media analytics).
11.1.2.5 Proprietary vs. Public Data
Proprietary Data: Owned by an organization or entity, not freely available for public use.
Public/Open Data: Freely accessible data provided by governments, NGOs, or other institutions (e.g., data.gov, World Bank Open Data).
11.1.3 Data by Observational Structure Over Time
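Another primary way to categorize data is by how observations are collected over time. This classification shapes research design, analytic methods, and the types of inferences we can make. The four major types, each covered in its own section below, are cross-sectional data, time series data, repeated cross-sectional data, and panel data.
Panel data, which follows the same units over time, supports stronger causal inference, controls for unobserved heterogeneity, and tracks individual trajectories. However, it is expensive, prone to attrition, and requires complex statistical methods.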
11.2 Cross-Sectional Data
Cross-sectional data consists of observations on multiple entities (e.g., individuals, firms, regions, or countries) at a single point in time or over a very short period, where time is not a primary dimension of variation.
Each observation represents a different entity, rather than the same entity tracked over time.
Unlike time series data, the order of observations does not carry temporal meaning.
Examples
Labor Economics: Wage and demographic data for 1,000 workers in 2024.
Marketing Analytics: Customer satisfaction ratings and purchasing behavior for 500 online shoppers surveyed in Q1 of a year.
Corporate Finance: Financial statements of 1,000 firms for the fiscal year 2023.
Key Characteristics
Observations are independent (in an ideal setting): Each unit is drawn from a population with no intrinsic dependence on others.
No natural ordering: Unlike time series data, the sequence of observations does not affect analysis.
Variation occurs across entities, not over time: Differences in observed outcomes arise from differences between individuals, firms, or regions.
Advantages
Straightforward Interpretation: Since time effects are not present, the focus remains on relationships between variables at a single point.
Easier to Collect and Analyze: Compared to time series or panel data, cross-sectional data is often simpler to collect and model.
Suitable for causal inference (if exogeneity conditions hold).
Challenges
Omitted Variable Bias: Unobserved confounders may drive both the dependent and independent variables.
Endogeneity: Reverse causality or measurement error can introduce bias.
Heteroskedasticity: Variance of errors may differ across entities, requiring robust standard errors.
A typical cross-sectional regression model:
$$y_i = \beta_0 + x_{i1}\beta_1 + x_{i2}\beta_2 + \dots + x_{i(k-1)}\beta_{k-1} + \epsilon_i$$
where:
$y_i$ is the outcome variable for entity $i$,
$x_{ij}$ are explanatory variables,
$\epsilon_i$ is an error term capturing unobserved factors.
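The regression above can be sketched on simulated data. The coefficients, sample size, and variable names below are assumptions made for illustration only.

```r
# Simulated cross-section: 500 entities observed once (illustrative values)
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)   # assumed true model

# OLS on the cross-section; row order carries no meaning
fit_cs <- lm(y ~ x1 + x2)
coef(fit_cs)
```

Because the simulated errors are homoskedastic and independent of the regressors, the estimates should land close to the assumed values. With real data, the challenges listed above (omitted variables, endogeneity, heteroskedasticity) would need to be checked.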
11.3 Time Series Data
Time series data consists of observations on the same variable(s) recorded over multiple time periods for a single entity (or aggregated entity). These data points are typically collected at consistent intervals—hourly, daily, monthly, quarterly, or annually—allowing for the analysis of trends, patterns, and forecasting.
Examples
Stock Market: Daily closing prices of a company’s stock over five years.
Economics: Monthly unemployment rates in a country over a decade.
Macroeconomics: Annual GDP of a country from 1960 to 2020.
Key Characteristics
The primary goal is to analyze trends, seasonality, cyclic patterns, and forecast future values.
Time series data requires specialized statistical methods, such as:
Autoregressive Integrated Moving Average (ARIMA)
Seasonal ARIMA (SARIMA)
Exponential Smoothing
Vector Autoregression (VAR)
Advantages
Captures temporal patterns such as trends, seasonal fluctuations, and economic cycles.
Essential for forecasting and policy-making, such as setting interest rates based on economic indicators.
Challenges
Autocorrelation: Observations close in time are often correlated.
Structural Breaks: Sudden changes due to policy shifts or economic crises can distort analysis.
Seasonality: Must be accounted for to avoid misleading conclusions.
A time series typically consists of four key components:
Trend: Long-term directional movement in the data over time.
Seasonality: Regular, periodic fluctuations (e.g., increased retail sales in December).
Cyclical Patterns: Long-term economic cycles that are irregular but recurrent.
Irregular (Random) Component: Unpredictable variations not explained by trend, seasonality, or cycles.
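A rough way to see these components is classical decomposition with `decompose()` from base R. The monthly series below is simulated under assumed trend and seasonal patterns. Note that classical decomposition does not separate cyclical movements from the trend; both end up in the trend component.

```r
# Simulated monthly series: linear trend + yearly seasonality + noise
set.seed(2)
months <- 1:120
y <- ts(0.05 * months + 2 * sin(2 * pi * months / 12) + rnorm(120, sd = 0.5),
        frequency = 12)

# Additive classical decomposition into trend, seasonal, and random parts
parts <- decompose(y)
round(parts$figure, 2)   # estimated seasonal effect for each month
```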
A general linear time series model can be expressed as:
$$y_t = \beta_0 + x_{t1}\beta_1 + x_{t2}\beta_2 + \dots + x_{t(k-1)}\beta_{k-1} + \epsilon_t$$
Some Common Model Types
Static Model
A simple time series regression:
$$y_t = \beta_0 + x_{t1}\beta_1 + x_{t2}\beta_2 + x_{t3}\beta_3 + \epsilon_t$$
Finite Distributed Lag Model
Captures the effect of past values of an explanatory variable:
$$y_t = \beta_0 + pe_t\delta_0 + pe_{t-1}\delta_1 + pe_{t-2}\delta_2 + \epsilon_t$$
Long-Run Propensity: Measures the cumulative effect of explanatory variables over time:
$$LRP = \delta_0 + \delta_1 + \delta_2$$
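A minimal sketch of estimating a two-lag distributed lag model and its long-run propensity follows. It assumes simulated data with lag coefficients 0.5, 0.3, and 0.1, so the true LRP is 0.9.

```r
set.seed(3)
n  <- 400
pe <- rnorm(n)

# embed() builds the lag matrix; columns are pe_t, pe_{t-1}, pe_{t-2}
lags <- embed(pe, 3)
y <- 1 + 0.5 * lags[, 1] + 0.3 * lags[, 2] + 0.1 * lags[, 3] +
  rnorm(nrow(lags), sd = 0.5)

fdl <- lm(y ~ lags)
lrp <- sum(coef(fdl)[-1])   # sum of the estimated lag coefficients
lrp
```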
Dynamic Model
A model incorporating lagged dependent variables:
$$GDP_t = \beta_0 + \beta_1 GDP_{t-1} + \epsilon_t$$
11.3.1 Statistical Properties of Time Series Models
For time series regression, standard OLS assumptions must be carefully examined. The following conditions affect estimation:
Finite Sample Properties
A1-A3: OLS remains unbiased.
A1-A4: Standard errors are consistent, and the Gauss-Markov Theorem holds (OLS is BLUE).
11.3.4 Violations of Exogeneity in Time Series Models
The exogeneity assumption (A3) plays a crucial role in ensuring unbiased and consistent estimation in time series models. However, in many cases, the assumption is violated due to the inherent nature of time-dependent processes.
In a standard regression framework, we assume:
$$E(\epsilon_t \mid x_1, x_2, \dots, x_T) = 0$$
which requires that the error term is uncorrelated with all past, present, and future values of the independent variables.
In finite distributed lag (FDL) models, failing to include the correct number of lags leads to omitted variable bias and correlation between regressors and errors.
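In a dynamic model with a lagged dependent variable, $y_t = \beta_0 + \beta_1 y_{t-1} + \epsilon_t$, strict exogeneity would require $\epsilon_t$ to be uncorrelated with the regressor in every period, including $y_t$ itself, which serves as the regressor in period $t+1$. However, since $y_t$ depends directly on $\epsilon_t$, we obtain:
$$Cov(y_t, \epsilon_t) \neq 0$$
Implication:
Strict exogeneity (A3) fails, as the future regressor $y_t$ is correlated with $\epsilon_t$.
OLS estimates are biased in finite samples. They remain consistent only if $\epsilon_t$ is uncorrelated with $y_{t-1}$, which fails when the errors are serially correlated.
When the errors are serially correlated, autoregressive (AR) models require alternative estimation techniques such as the Generalized Method of Moments or Maximum Likelihood Estimation.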
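The finite-sample bias that a lagged dependent variable induces can be illustrated with a small Monte Carlo experiment. The sketch assumes an AR(1) process with coefficient 0.9 and a short sample of 25 observations; both values are arbitrary choices for illustration.

```r
set.seed(4)
rho_true <- 0.9    # assumed AR(1) coefficient
n_obs    <- 25     # deliberately short sample
n_rep    <- 2000   # Monte Carlo replications

rho_hat <- replicate(n_rep, {
  y <- as.numeric(arima.sim(list(ar = rho_true), n = n_obs))
  # Regress y_t on y_{t-1} by OLS and keep the slope
  unname(coef(lm(y[-1] ~ y[-n_obs]))[2])
})

mean(rho_hat)   # average OLS estimate across replications
```

The average estimate falls below the assumed value of 0.9, which is the downward small-sample bias typical of autoregressive coefficients estimated by OLS.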
11.3.4.3 Dynamic Completeness and Omitted Lags
A finite distributed lag (FDL) model:
$$y_t = \beta_0 + x_t\delta_0 + x_{t-1}\delta_1 + \epsilon_t$$
assumes that the included lags fully capture the relationship between $y_t$ and past values of $x_t$. However, if we omit relevant lags, the exogeneity assumption (A3):
$$E(\epsilon_t \mid x_1, x_2, \dots, x_t, x_{t+1}, \dots, x_T) = 0$$
fails, as unmodeled lag effects create correlation between $x_{t-2}$ and $\epsilon_t$.
Implication:
The regression suffers from omitted variable bias, making OLS estimates unreliable.
Solution:
Include additional lags of $x_t$.
Use lag selection criteria (e.g., AIC, BIC) to determine the appropriate lag structure.
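One way to apply an information criterion is to fit distributed lag models of increasing order on a common sample and compare BIC values. The data below are simulated under the assumption that exactly two lags matter.

```r
set.seed(5)
n       <- 300
x       <- rnorm(n)
max_lag <- 4

# Columns of X: x_t, x_{t-1}, ..., x_{t-4}
X <- embed(x, max_lag + 1)
y <- 1 + 1.0 * X[, 1] + 0.8 * X[, 2] + 0.6 * X[, 3] +
  rnorm(nrow(X), sd = 0.5)   # assumed true model uses lags 0, 1, 2

# Fit models with 0..max_lag lags on the same rows so BIC is comparable
bic <- sapply(0:max_lag, function(q) {
  BIC(lm(y ~ X[, 1:(q + 1), drop = FALSE]))
})
names(bic) <- paste0("lags_", 0:max_lag)
bic
```

Models that omit a relevant lag show a much larger BIC. Adding lags beyond the second changes the fit very little while incurring the penalty for extra parameters.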
11.3.5 Consequences of Exogeneity Violations
If strict exogeneity (A3) fails, standard OLS assumptions no longer hold:
The derivation of asymptotic variance depends on A4 (Homoskedasticity).
However, in time series settings, we often encounter serial correlation:
$$Cov(\epsilon_t, \epsilon_s) \neq 0 \quad \text{for } |t - s| > 0$$
To ensure valid inference, standard errors must be corrected using methods such as Newey-West HAC estimators.
11.3.6 Highly Persistent Data
In time series analysis, a key assumption for OLS consistency is that the data-generating process satisfies weak dependence (A5a), meaning that observations are not too strongly correlated over time. However, when $y_t$ and $x_t$ are highly persistent, standard OLS assumptions break down.
If a time series is not weakly dependent, it means:
$y_t$ and $y_{t-h}$ remain strongly correlated even for large lags ($h \to \infty$).
A5a (Weak Dependence Assumption) fails, leading to:
OLS inconsistency.
No valid limiting distribution (asymptotic normality does not hold).
Example: A classic example of a highly persistent process is a random walk:
$$y_t = y_{t-1} + u_t$$
or with drift:
$$y_t = \alpha + y_{t-1} + u_t$$
where $u_t$ is a white noise error term.
$y_t$ does not revert to a mean; its variance grows without bound as $t \to \infty$.
Shocks accumulate, making standard regression analysis unreliable.
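The growing variance of a random walk can be checked by simulating many independent paths and computing the cross-path variance at each date. With unit-variance shocks, the variance at time $t$ should be roughly $t$. The number of paths and steps below are arbitrary.

```r
set.seed(6)
n_paths <- 2000
n_steps <- 200

# Each column is one random walk: the cumulative sum of white noise
shocks <- matrix(rnorm(n_paths * n_steps), nrow = n_steps)
paths  <- apply(shocks, 2, cumsum)

# Variance across paths at each point in time
var_t <- apply(paths, 1, var)
round(var_t[c(50, 100, 200)], 1)
```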
11.3.6.1 Solution: First Differencing
A common way to transform non-stationary series into stationary ones is through first differencing:
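$$\Delta y_t = y_t - y_{t-1} = u_t$$
If $u_t$ is a weakly dependent process (i.e., $I(0)$, stationary), then $y_t$ is said to be difference-stationary, or integrated of order 1, $I(1)$.
If both $y_t$ and $x_t$ follow a random walk ($I(1)$), we estimate the model in first differences:
$$\Delta y_t = \Delta x_t\beta + (\epsilon_t - \epsilon_{t-1}) = \Delta x_t\beta + \Delta\epsilon_t$$
Because the differenced series are weakly dependent, OLS estimation of this equation remains valid.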
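A short check of what differencing does to a simulated random walk: the first difference recovers the white noise shocks, and the strong autocorrelation in levels disappears. The series length is an arbitrary choice.

```r
set.seed(7)
u  <- rnorm(500)
y  <- cumsum(u)   # random walk built from the shocks u
dy <- diff(y)     # first difference

# Lag-1 sample autocorrelation in levels and in differences
acf_level <- acf(y,  plot = FALSE)$acf[2]
acf_diff  <- acf(dy, plot = FALSE)$acf[2]
c(level = acf_level, difference = acf_diff)
```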
11.3.7 Unit Root Testing
To formally determine whether a time series contains a unit root (i.e., is non-stationary), we test:
$$y_t = \alpha + \rho y_{t-1} + u_t$$
Hypothesis Testing
$H_0: \rho = 1$ (unit root, non-stationary)
The OLS estimator does not have the usual asymptotically normal distribution.
$H_a: \rho < 1$ (stationary process)
OLS is consistent and asymptotically normal.
Key Issues
The usual t-test is not valid because OLS under H0 does not have a standard distribution.
Including lags of $\Delta y_t$ ensures a better-specified model.
Ignoring Deterministic Time Trends
If a series exhibits a deterministic trend, failing to include it biases the unit root test.
Example: If $y_t$ grows steadily over time, a test without a trend component may falsely indicate a unit root.
Solution: Include a deterministic time trend ($t$) in the regression:
$$\Delta y_t = \alpha + \theta y_{t-1} + \delta t + v_t$$
Allows for quadratic relationships with time.
Changes the critical values, requiring an adjusted statistical test.
11.3.7.2 Augmented Dickey-Fuller Test
The ADF test generalizes the DF test by allowing for:
Lags of $\Delta y_t$ (to correct for serial correlation).
Time trends (to handle deterministic trends).
Regression Equation
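$$\Delta y_t = \alpha + \theta y_{t-1} + \delta t + \gamma_1\Delta y_{t-1} + \dots + \gamma_p\Delta y_{t-p} + v_t$$
where $\theta = \rho - 1$.
Hypotheses
$H_0: \theta = 0$ (Unit root: non-stationary)
$H_a: \theta < 0$ (Stationary)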
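The ADF regression can be written out by hand in base R to make its structure concrete. The helper `adf_tstat()` below is a hypothetical name, not a library function, and the two series are simulated. The resulting t-statistic must be compared with Dickey-Fuller critical values rather than the usual t table; for a regression with constant and trend, the large-sample 5% critical value is approximately -3.41.

```r
# Hand-rolled ADF regression with constant, trend, and p lagged differences
adf_tstat <- function(y, p = 1) {
  dy     <- diff(y)
  lagmat <- embed(dy, p + 1)            # col 1: dy_t; cols 2..: lagged dy
  dy_t   <- lagmat[, 1]
  dy_lag <- lagmat[, -1, drop = FALSE]
  y_lag  <- y[(p + 1):(length(y) - 1)]  # y_{t-1}, aligned with dy_t
  trend  <- seq_along(dy_t)
  fit    <- lm(dy_t ~ y_lag + trend + dy_lag)
  coef(summary(fit))["y_lag", "t value"]   # t-statistic on theta
}

set.seed(14)
y_stationary <- as.numeric(arima.sim(list(ar = 0.5), n = 500))
y_randomwalk <- cumsum(rnorm(500))

c(stationary  = adf_tstat(y_stationary),
  random_walk = adf_tstat(y_randomwalk))
```

For the stationary series the statistic lies far below the critical value, so the unit root null is rejected. For the random walk, rejection should occur only rarely.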
11.3.8 Newey-West Standard Errors
Newey-West standard errors, also known as Heteroskedasticity and Autocorrelation Consistent (HAC) estimators, provide valid inference when errors exhibit both heteroskedasticity (i.e., A4 Homoskedasticity assumption is violated) and serial correlation. These standard errors adjust for dependence in the error structure, ensuring that hypothesis tests remain valid.
Key Features
Accounts for autocorrelation: Handles time dependence in error terms.
Accounts for heteroskedasticity: Allows for non-constant variance across observations.
Ensures positive semi-definiteness: Downweights longer-lagged covariances to maintain mathematical validity.
In the Newey-West estimator:
$g$ is the chosen lag truncation parameter (bandwidth),
$e_t$ are the residuals from the OLS regression,
$x_t$ are the explanatory variables.
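For reference, the quantities $g$, $e_t$, and $x_t$ enter the estimator as sketched below. This is the standard textbook form with Bartlett weights, stated up to finite-sample scaling conventions; software implementations may apply additional degrees-of-freedom adjustments.

```latex
\widehat{Var}(\hat{\beta}) = (X'X)^{-1}\,\hat{S}\,(X'X)^{-1},
\qquad
\hat{S} = \sum_{t=1}^{T} e_t^2\, x_t x_t'
  + \sum_{h=1}^{g} \left(1 - \frac{h}{g+1}\right)
    \sum_{t=h+1}^{T} e_t e_{t-h} \left( x_t x_{t-h}' + x_{t-h} x_t' \right)
```

The weights $1 - h/(g+1)$ decline linearly with the lag $h$, which is what keeps $\hat{S}$ positive semi-definite.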
Choosing the Lag Length (g)
Selecting an appropriate lag truncation parameter (g) is crucial for balancing efficiency and bias. Common guidelines include:
Yearly data: g=1 or 2 usually suffices.
Quarterly data: g=4 or 8 accounts for seasonal dependencies.
Monthly data: g=12 or 14 captures typical cyclical effects.
Alternatively, data-driven methods can be used:
Newey-West Rule: $g = \lfloor 4(T/100)^{2/9} \rfloor$
Alternative Heuristic: $g = \lfloor T^{1/4} \rfloor$
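To get a feel for the two data-driven rules, the sketch below evaluates them at a few sample sizes. The function names `nw_rule()` and `root_rule()` are hypothetical helpers introduced here.

```r
# Lag truncation rules as functions of the sample size
nw_rule   <- function(n_obs) floor(4 * (n_obs / 100)^(2 / 9))
root_rule <- function(n_obs) floor(n_obs^(1 / 4))

sizes <- c(50, 100, 500, 1000)
data.frame(n_obs     = sizes,
           nw_rule   = nw_rule(sizes),
           root_rule = root_rule(sizes))
```

Both rules grow slowly with the sample size, so even a tenfold increase in observations adds only a few lags.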
```r
# Load necessary libraries
library(sandwich)
library(lmtest)

# Simulate data
set.seed(42)
T <- 100 # Sample size
time <- 1:T
x <- rnorm(T)
epsilon <- arima.sim(n = T, list(ar = 0.5)) # Autocorrelated errors
y <- 2 + 3 * x + epsilon # True model

# Estimate OLS model
model <- lm(y ~ x)

# Compute Newey-West standard errors
lag_length <- floor(4 * (T / 100)^(2 / 9)) # Newey-West rule
nw_se <- NeweyWest(model, lag = lag_length, prewhite = FALSE)

# Display robust standard errors
coeftest(model, vcov = nw_se)
#> 
#> t test of coefficients:
#> 
#>             Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept)  1.71372    0.13189  12.993 < 2.2e-16 ***
#> x            3.15831    0.13402  23.567 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
11.3.8.1 Testing for Serial Correlation
Serial correlation (also known as autocorrelation) occurs when error terms are correlated across time:
$$E(\epsilon_t \epsilon_{t-h}) \neq 0 \quad \text{for some } h \neq 0$$
Steps for Detecting Serial Correlation
Estimate an OLS regression:
Run the regression of $y_t$ on $x_t$ and obtain residuals $e_t$.
Test for autocorrelation in residuals:
Regress $e_t$ on $x_t$ and the lagged residual $e_{t-1}$:
$$e_t = \gamma_0 + x_t'\gamma + \rho e_{t-1} + v_t$$
Test whether $\rho$ is significantly different from zero.
Decision Rule:
If $\rho$ is statistically significant at the 5% level, reject the null hypothesis of no serial correlation.
Higher-Order Serial Correlation
To test for higher-order autocorrelation, extend the previous regression:
$$e_t = \gamma_0 + x_t'\gamma + \rho_1 e_{t-1} + \rho_2 e_{t-2} + \dots + \rho_p e_{t-p} + v_t$$
Jointly test $\rho_1 = \rho_2 = \dots = \rho_p = 0$ using an F-test.
If the null is rejected, serial correlation up to order $p$ is present.
Step 1: Estimate an OLS Regression and Obtain Residuals
```r
# Load necessary libraries
library(lmtest)
library(sandwich)

# Generate some example data
set.seed(123)
n <- 100
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n) # True model: y = 1 + 0.5*x + e

# Estimate the OLS regression
model <- lm(y ~ x)

# Obtain residuals
residuals <- resid(model)
```
Step 2: Test for Autocorrelation in Residuals
```r
# Create lagged residuals
lagged_residuals <- c(NA, residuals[-length(residuals)])

# Regress residuals on x and lagged residuals
autocorr_test_model <- lm(residuals ~ x + lagged_residuals)

# Summary of the regression
summary(autocorr_test_model)
#> 
#> Call:
#> lm(formula = residuals ~ x + lagged_residuals)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.94809 -0.72539 -0.08105  0.58503  3.12941 
#> 
#> Coefficients:
#>                   Estimate Std. Error t value Pr(>|t|)
#> (Intercept)       0.008175   0.098112   0.083    0.934
#> x                -0.002841   0.107167  -0.027    0.979
#> lagged_residuals -0.127605   0.101746  -1.254    0.213
#> 
#> Residual standard error: 0.9707 on 96 degrees of freedom
#>   (1 observation deleted due to missingness)
#> Multiple R-squared:  0.01614,    Adjusted R-squared:  -0.004354 
#> F-statistic: 0.7876 on 2 and 96 DF,  p-value: 0.4579

# Test if the coefficient of lagged_residuals is significant
rho <- coef(autocorr_test_model)["lagged_residuals"]
rho_p_value <- summary(autocorr_test_model)$coefficients["lagged_residuals", "Pr(>|t|)"]

# Decision Rule
if (rho_p_value < 0.05) {
  cat("Reject the null hypothesis: There is evidence of serial correlation.\n")
} else {
  cat("Fail to reject the null hypothesis: No evidence of serial correlation.\n")
}
#> Fail to reject the null hypothesis: No evidence of serial correlation.
```
Step 3: Testing for Higher-Order Serial Correlation
```r
# Number of lags to test
p <- 2 # Example: testing for 2nd order autocorrelation

# Create a matrix of lagged residuals
lagged_residuals_matrix <- sapply(1:p, function(i) c(rep(NA, i), residuals[1:(n - i)]))

# Regress residuals on x and lagged residuals
higher_order_autocorr_test_model <- lm(residuals ~ x + lagged_residuals_matrix)

# Summary of the regression
summary(higher_order_autocorr_test_model)
#> 
#> Call:
#> lm(formula = residuals ~ x + lagged_residuals_matrix)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1.9401 -0.7290 -0.1036  0.6359  3.0253 
#> 
#> Coefficients:
#>                           Estimate Std. Error t value Pr(>|t|)
#> (Intercept)               0.006263   0.099104   0.063    0.950
#> x                         0.010442   0.108370   0.096    0.923
#> lagged_residuals_matrix1 -0.140426   0.103419  -1.358    0.178
#> lagged_residuals_matrix2 -0.107385   0.103922  -1.033    0.304
#> 
#> Residual standard error: 0.975 on 94 degrees of freedom
#>   (2 observations deleted due to missingness)
#> Multiple R-squared:  0.02667,    Adjusted R-squared:  -0.004391 
#> F-statistic: 0.8587 on 3 and 94 DF,  p-value: 0.4655

# Joint F-test for the significance of lagged residuals
f_test <- car::linearHypothesis(
  higher_order_autocorr_test_model,
  paste0("lagged_residuals_matrix", 1:p, " = 0")
)

# Print the F-test results
print(f_test)
#> 
#> Linear hypothesis test:
#> lagged_residuals_matrix1 = 0
#> lagged_residuals_matrix2 = 0
#> 
#> Model 1: restricted model
#> Model 2: residuals ~ x + lagged_residuals_matrix
#> 
#>   Res.Df    RSS Df Sum of Sq      F Pr(>F)
#> 1     96 91.816                           
#> 2     94 89.368  2    2.4479 1.2874 0.2808

# Decision Rule
if (f_test$`Pr(>F)`[2] < 0.05) {
  cat("Reject the null hypothesis: There is evidence of higher-order serial correlation.\n")
} else {
  cat("Fail to reject the null hypothesis: No evidence of higher-order serial correlation.\n")
}
#> Fail to reject the null hypothesis: No evidence of higher-order serial correlation.
```
Corrections for Serial Correlation
If serial correlation is detected, the following adjustments should be made:
Include lags of the dependent variable, or use HAC estimators with higher lag orders.
11.4 Repeated Cross-Sectional Data
Repeated cross-sectional data consists of multiple independent cross-sections collected at different points in time. Unlike panel data, where the same individuals are tracked over time, repeated cross-sections draw a fresh sample in each wave.
This approach allows researchers to analyze aggregate trends over time, but it does not track individual-level changes.
Examples
General Social Survey (GSS) (U.S.) – Conducted every two years with a new sample of respondents.
Political Opinion Polls – Monthly voter surveys to track shifts in public sentiment.
National Health Surveys – Annual studies with fresh samples to monitor population-wide health trends.
Educational Surveys – Sampling different groups of students each year to assess learning outcomes.
11.4.1 Key Characteristics
Fresh Sample in Each Wave
Each survey represents an independent cross-section.
No respondent is tracked across waves.
Population-Level Trends Over Time
Researchers can study how the distribution of characteristics (e.g., income, attitudes, behaviors) changes over time.
However, individual trajectories cannot be observed.
Sample Design Consistency
To ensure comparability across waves, researchers must maintain consistent:
Sampling methods
Questionnaire design
Definitions of key variables
11.4.2 Statistical Modeling for Repeated Cross-Sections
Since repeated cross-sections do not track the same individuals, specific regression methods are used to analyze changes over time.
Pooled Cross-Sectional Regression (Time Fixed Effects)
Combines multiple survey waves into a single dataset while controlling for time effects:
$$y_i = x_i\beta + \delta_1 y_1 + \dots + \delta_T y_T + \epsilon_i$$
where:
$y_i$ is the outcome for individual $i$,
$x_i$ are explanatory variables,
$y_t$ are time period dummies,
$\delta_t$ captures the average change in outcomes across time periods.
Key Features:
Allows for different intercepts across time periods, capturing shifts in baseline outcomes.
Tracks overall population trends while holding the effect of $x_i$ constant across periods; the next specification relaxes this restriction.
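A pooled regression with time dummies might look like the following sketch. It uses three simulated survey waves with an assumed upward shift in the intercept; the wave labels and effect sizes are made up.

```r
set.seed(8)
n_per_wave <- 400
wave  <- rep(c(2020, 2021, 2022), each = n_per_wave)
x     <- rnorm(3 * n_per_wave)

# Assumed wave-specific intercept shifts relative to 2020
shift <- c("2020" = 0, "2021" = 0.5, "2022" = 1.0)
y <- 1 + 2 * x + unname(shift[as.character(wave)]) + rnorm(3 * n_per_wave)

# factor(wave) creates one dummy per wave, with the first wave as baseline
pooled <- lm(y ~ x + factor(wave))
round(coef(pooled), 2)
```

The coefficients on the wave dummies estimate the shift in the average outcome relative to the baseline wave, holding $x$ fixed.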
Allowing for Structural Change in Pooled Cross-Sections (Time-Dependent Effects)
To test whether relationships between variables change over time (structural breaks), interactions between time dummies and explanatory variables can be introduced:
$$y_i = x_i\beta + x_i y_1\gamma_1 + \dots + x_i y_T\gamma_T + \delta_1 y_1 + \dots + \delta_T y_T + \epsilon_i$$
Interacting $x_i$ with time period dummies allows for:
Different slopes for each time period.
Time-dependent effects of explanatory variables.
Practical Application:
If $x_i$ represents education level and $y_t$ represents survey year, an interaction term can test whether the effect of education on income has changed over time.
Structural break tests help determine whether such time-varying effects are statistically significant.
Useful for policy analysis, where a policy might impact certain subgroups differently across time.
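As a sketch of such a test, the example below simulates two waves in which the slope on education is assumed to rise from 1.0 to 1.5. It then compares a common-slope model with an interacted model using an F-test. All names and values are hypothetical.

```r
set.seed(9)
n_per_wave <- 500
wave <- factor(rep(c("w1", "w2"), each = n_per_wave))
educ <- rnorm(2 * n_per_wave)

# Assumed slopes: 1.0 in wave 1 and 1.5 in wave 2
slope  <- ifelse(wave == "w1", 1.0, 1.5)
income <- 2 + slope * educ + 0.3 * (wave == "w2") + rnorm(2 * n_per_wave)

restricted   <- lm(income ~ educ + wave)   # common slope
unrestricted <- lm(income ~ educ * wave)   # wave-specific slope

slope_test <- anova(restricted, unrestricted)   # F-test of equal slopes
slope_test
```

A small p-value in the comparison indicates that the slope differs between waves, which is the structural change the interaction term is designed to detect.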
Difference-in-Means Over Time
A simple approach to comparing aggregate trends:
$$\bar{y}_t - \bar{y}_{t-1}$$
Measures whether the average outcome has changed over time.
Common in policy evaluations (e.g., assessing the effect of minimum wage increases on average income).
Synthetic Cohort Analysis
Since repeated cross-sections do not track individuals, a synthetic cohort can be created by grouping observations based on shared characteristics:
Example: If education levels are collected over multiple waves, we can track average income changes within education groups to approximate trends.
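A minimal sketch of the grouping step, using a simulated survey with hypothetical column names: each wave draws different respondents, and group means by wave stand in for the trajectories of individuals who cannot be followed.

```r
set.seed(10)
survey <- data.frame(
  wave   = rep(c(2010, 2015, 2020), each = 300),
  educ   = sample(c("high school", "college"), 900, replace = TRUE),
  income = rlnorm(900, meanlog = 10, sdlog = 0.5)
)

# Average income per education group and wave: one row per synthetic cohort cell
cohort_means <- aggregate(income ~ educ + wave, data = survey, FUN = mean)
cohort_means
```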
11.4.3 Advantages of Repeated Cross-Sectional Data
| Advantage | Explanation |
|---|---|
| Tracks population trends | Useful for studying shifts in demographics, attitudes, and economic conditions over time. |
| Lower cost than panel data | Tracking individuals across multiple waves (as in panel studies) is expensive and prone to attrition. |
| No attrition bias | Unlike panel surveys, where respondents drop out over time, each wave draws a new representative sample. |
| Easier implementation | Organizations can design a single survey protocol and repeat it at set intervals without managing panel retention. |
11.4.4 Disadvantages of Repeated Cross-Sectional Data
| Disadvantage | Explanation |
|---|---|
| No individual-level transitions | Cannot track how specific individuals change over time (e.g., income mobility, changes in attitudes). |
| Limited causal inference | Since we observe different people in each wave, we cannot directly infer individual cause-and-effect relationships. |
| Comparability issues | Small differences in survey design (e.g., question wording or sampling frame) can make it difficult to compare across waves. |
To ensure valid comparisons across time:
Consistent Sampling: Each wave should use the same sampling frame and methodology.
Standardized Questions: Small variations in question wording can introduce inconsistencies.
Weighting Adjustments: If sampling strategies change, apply survey weights to maintain representativeness.
Accounting for Structural Changes: Economic, demographic, or social changes may impact comparability.
11.5 Panel Data
Panel data (also called longitudinal data) consists of observations of the same entities over multiple time periods. Unlike repeated cross-sections, where new samples are drawn in each wave, panel data tracks the same individuals, households, firms, or regions over time, enabling richer statistical analysis.
Panel data combines cross-sectional variation (differences across entities) and time-series variation (changes within entities over time).
Examples
Panel Study of Income Dynamics – Follows households annually, collecting data on income, employment, and expenditures.
Medical Longitudinal Studies – Tracks the same patients over months or years to study disease progression.
Firm-Level Financial Data – Follows a set of companies over multiple years through financial statements.
Student Achievement Studies – Follows the same students across different grade levels to assess academic progress.
Structure
N entities (individuals, firms, etc.) observed over T time periods.
The dataset can be:
Balanced Panel: All entities are observed in every time period.
Unbalanced Panel: Some entities have missing observations for certain periods.
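Whether a panel is balanced can be checked by counting observations per entity and period. The helper `is_balanced()` and the toy data below are hypothetical.

```r
# Toy panel: entity 3 is missing its 2021 observation
panel <- data.frame(
  id   = c(1, 1, 1, 2, 2, 2, 3, 3),
  year = c(2020, 2021, 2022, 2020, 2021, 2022, 2020, 2022)
)

# Balanced means every id-by-year cell contains exactly one observation
is_balanced <- function(d) {
  counts <- table(d$id, d$year)
  all(counts == 1)
}

is_balanced(panel)                    # full toy panel
is_balanced(panel[panel$id != 3, ])   # after dropping entity 3
```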
Types of Panels
Short Panel: Many individuals (N) but few time periods (T).
Long Panel: Many time periods (T) but few individuals (N).
Both Large: Large N and T (e.g., firm-level data over decades).
11.5.1 Advantages of Panel Data
| Advantage | Explanation |
|---|---|
| Captures individual trajectories | Allows for studying how individuals or firms evolve over time. |
| Supports causal interpretation | Difference-in-differences and FE models improve causal interpretation. |
| More efficient estimates | Exploits both cross-sectional and time-series variation. |
11.5.2 Disadvantages of Panel Data
| Disadvantage | Explanation |
|---|---|
| Higher cost and complexity | Tracking individuals over time is resource-intensive. |
| Attrition bias | If certain individuals drop out systematically, results may be biased. |
| Measurement errors | Errors accumulate over time, leading to potential biases. |
11.5.3 Sources of Variation in Panel Data
Since we observe both individuals and time periods, we distinguish three types of variation:
Overall variation: Differences across both time and individuals.
Between variation: Differences between individuals (cross-sectional variation).
Within variation: Differences within individuals (time variation).
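| Estimate | Formula |
|---|---|
| Individual mean | $\bar{x}_i = \frac{1}{T}\sum_t x_{it}$ |
| Overall mean | $\bar{x} = \frac{1}{NT}\sum_i\sum_t x_{it}$ |
| Overall variance | $s_O^2 = \frac{1}{NT-1}\sum_i\sum_t (x_{it} - \bar{x})^2$ |
| Between variance | $s_B^2 = \frac{1}{N-1}\sum_i (\bar{x}_i - \bar{x})^2$ |
| Within variance | $s_W^2 = \frac{1}{NT-1}\sum_i\sum_t (x_{it} - \bar{x}_i)^2$ |

Note: $s_O^2 \approx s_B^2 + s_W^2$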
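These quantities can be computed directly in base R. The sketch assumes a balanced simulated panel with 50 individuals and 6 periods, using `ave()` to repeat each individual's mean on every one of its rows.

```r
set.seed(11)
N <- 50
T_periods <- 6
id <- rep(1:N, each = T_periods)

# x has an individual-specific part plus a time-varying part
x <- rep(rnorm(N), each = T_periods) + rnorm(N * T_periods)

x_bar_i <- ave(x, id)   # individual means, repeated on each row
x_bar   <- mean(x)      # overall mean

s2_O <- sum((x - x_bar)^2) / (N * T_periods - 1)
s2_B <- sum((tapply(x, id, mean) - x_bar)^2) / (N - 1)
s2_W <- sum((x - x_bar_i)^2) / (N * T_periods - 1)

c(overall = s2_O, between = s2_B, within = s2_W)
```

The decomposition is exact in sums of squares (total equals within plus $T$ times between), which is why the variances satisfy it only approximately once each is divided by its own denominator.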
11.5.4 Pooled OLS Estimator
The Pooled Ordinary Least Squares estimator is the simplest way to estimate relationships in panel data. It treats panel data as a large cross-sectional dataset, ignoring individual-specific effects and time dependence.
The pooled OLS model is specified as:
$$y_{it} = x_{it}\beta + \epsilon_{it}$$
where:
$y_{it}$ is the dependent variable for individual $i$ at time $t$,
$x_{it}$ is a vector of explanatory variables,
$\beta$ is the vector of coefficients to be estimated,
$\epsilon_{it} = c_i + u_{it}$ is the composite error term.
$c_i$ is the unobserved individual heterogeneity.
$u_{it}$ is the idiosyncratic shock.
By treating all observations as independent, pooled OLS assumes no systematic differences across individuals beyond what is captured by $x_{it}$.
For pooled OLS to be consistent and unbiased, the following conditions must hold:
Comparing Pooled OLS with Alternative Panel Models
| Model | Assumption about $c_i$ | Uses Within Variation? | Uses Between Variation? | Best When |
|---|---|---|---|---|
| Pooled OLS | Assumes $c_i$ is uncorrelated with $x_{it}$ | ✅ Yes | ✅ Yes | No individual heterogeneity |
| Fixed Effects | Removes $c_i$ via demeaning | ✅ Yes | ❌ No | $c_i$ is correlated with $x_{it}$ |
| Random Effects | Assumes $c_i$ is uncorrelated with $x_{it}$ | ✅ Yes | ✅ Yes | $c_i$ is uncorrelated with $x_{it}$ |
When to Use Pooled OLS?
If individual heterogeneity is negligible
If panel is short (T is small) and cross-section is large (N is big)
If the random effects assumption holds ($E(x_{it}'c_i) = 0$)
If these conditions fail, Fixed Effects or Random Effects models should be used instead.
11.5.5 Individual-Specific Effects Model
In panel data, unobserved heterogeneity can arise when individual-specific factors ($c_i$) influence the dependent variable. These effects can be:
Correlated with regressors ($E(x_{it}'c_i) \neq 0$): Use the Fixed Effects estimator.
Uncorrelated with regressors ($E(x_{it}'c_i) = 0$): Use the Random Effects estimator.
The general model is:
$$y_{it} = x_{it}\beta + c_i + u_{it}$$
where:
$c_i$ is the individual-specific effect (time-invariant),
$u_{it}$ is the idiosyncratic error (time-variant).
Comparing Fixed Effects and Random Effects
| Model | Assumption on $c_i$ | Uses Within Variation? | Uses Between Variation? | Best When |
|---|---|---|---|---|
| Fixed Effects | $c_i$ is correlated with $x_{it}$ | ✅ Yes | ❌ No | Unobserved heterogeneity bias present |
| Random Effects | $c_i$ is uncorrelated with $x_{it}$ | ✅ Yes | ✅ Yes | No correlation with regressors |
11.5.6 Random Effects Estimator
The Random Effects (RE) estimator is a Feasible Generalized Least Squares method used in panel data analysis. It assumes that individual-specific effects ($c_i$) are uncorrelated with the explanatory variables ($x_{it}$), allowing for estimation using both within-group (time) variation and between-group (cross-sectional) variation.
The standard Random Effects model is:
$$y_{it} = x_{it}\beta + c_i + u_{it}$$
where:
$y_{it}$ is the dependent variable for entity $i$ at time $t$,
$x_{it}$ is a vector of explanatory variables,
$\beta$ represents the coefficients of interest,
$c_i$ is the unobserved individual-specific effect (time-invariant),
$u_{it}$ is the idiosyncratic error (time-varying).
In contrast to the Fixed Effects model, which eliminates $c_i$ by demeaning the data, the Random Effects model treats $c_i$ as a random variable and incorporates it into the error structure.
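One common way to express how $c_i$ enters the error structure is the quasi-demeaned form of the RE estimator, sketched below for a balanced panel. The symbols $\sigma_u^2$ and $\sigma_c^2$ (the variances of $u_{it}$ and $c_i$) and the weight $\theta$ are notation introduced here for illustration.

```latex
y_{it} - \theta\bar{y}_i
  = (x_{it} - \theta\bar{x}_i)\beta + (1-\theta)c_i + (u_{it} - \theta\bar{u}_i),
\qquad
\theta = 1 - \sqrt{\frac{\sigma_u^2}{\sigma_u^2 + T\sigma_c^2}}
```

When $\theta = 0$ the transformation reduces to pooled OLS, and as $\theta \to 1$ it approaches the within (Fixed Effects) transformation. In practice $\theta$ is replaced by an estimate, which is what makes the procedure feasible GLS.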
11.5.6.1 Key Assumptions for Random Effects
For the Random Effects estimator to be consistent, the following assumptions must hold:
Exacerbates Measurement Error: FE can worsen errors-in-variables bias.
11.5.7.1 Demean (Within) Transformation
To remove ci, we take the individual mean of the regression equation:
yit=xitβ+ci+uit
Averaging over time (T):
ˉyi=ˉxiβ+ci+ˉui
Subtracting the second equation from the first (i.e., within transformation):
(yit−ˉyi)=(xit−ˉxi)β+(uit−ˉui)
This transformation:
Eliminates ci, solving omitted variable bias.
Only uses within-individual variation.
The transformed regression is estimated via OLS:
$$y_{it} - \bar{y}_i = (x_{it} - \bar{x}_i)\beta + d_1\delta_1 + \dots + d_{T-1}\delta_{T-1} + (u_{it} - \bar{u}_i)$$
where
$d_t$ is a time dummy variable that equals 1 if the observation is in period $t$ and 0 otherwise, for $t = 1, \dots, T-1$ (one period is omitted to avoid perfect multicollinearity).
$\delta_t$ is the coefficient on the time dummy, capturing aggregate shocks that affect all individuals in period $t$.
Time-invariant variables are dropped (e.g., gender, ethnicity). If you’re interested in the effect of these time-invariant variables, consider using either OLS or the between estimator.
Cluster-Robust Standard Errors should be used.
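The within transformation can be illustrated with a minimal base-R sketch on simulated data. The panel dimensions, the true coefficient of 2, and all variable names are assumptions made for illustration, and time dummies are left out to keep the example short. Because the simulated regressor is correlated with ci, pooled OLS is biased while the demeaned regression recovers a value close to the true coefficient.

```r
set.seed(1)
n <- 200; T_len <- 5                        # hypothetical panel dimensions
id  <- rep(seq_len(n), each = T_len)        # individual identifier
c_i <- rnorm(n)                             # time-invariant individual effect
x   <- 0.8 * c_i[id] + rnorm(n * T_len)     # regressor correlated with c_i
y   <- 2 * x + c_i[id] + rnorm(n * T_len)   # true beta = 2 (assumed)

# Pooled OLS ignores c_i, so its slope absorbs part of the individual effect
beta_pooled <- unname(coef(lm(y ~ x))["x"])

# Within transformation: subtract each individual's time average
y_dm <- y - ave(y, id)
x_dm <- x - ave(x, id)
beta_within <- unname(coef(lm(y_dm ~ x_dm - 1))["x_dm"])

c(pooled = beta_pooled, within = beta_within)
```

The standard errors reported by `lm()` on demeaned data are not adjusted for the estimated individual means, so this sketch is meant only to show the point estimate.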
11.5.7.2 Dummy Variable Approach
The Dummy Variable Approach is an alternative way to estimate Fixed Effects in panel data. Instead of transforming the data by demeaning (Within Transformation), this method explicitly includes individual dummy variables to control for entity-specific heterogeneity.
The general FE model is:
yit=xitβ+ci+uit
where:
ci is the unobserved, time-invariant individual effect (e.g., ability, cultural preferences, managerial style).
uit is the idiosyncratic error term (fluctuates over time and across individuals).
To estimate this model using the Dummy Variable Approach, we include a separate dummy variable for each individual:
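$$y_{it} = x_{it}\beta + d_1\delta_1 + \dots + d_{T-1}\delta_{T-1} + c_1\gamma_1 + \dots + c_{n-1}\gamma_{n-1} + u_{it}$$
where:
$c_i$ is now an individual dummy variable whose coefficient $\gamma_i$ captures the individual effect (one individual is omitted as the reference category).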
dt are time dummies, capturing time-specific shocks.
δt are coefficients on time dummies, controlling for common time effects.
Interpretation of the Dummy Variables
The dummy variable $c_i$ takes a value of 1 for individual $i$ and 0 otherwise:
$$c_i = \begin{cases} 1 & \text{if the observation is for individual } i \\ 0 & \text{otherwise} \end{cases}$$
These individual dummy variables absorb all individual-specific variation, ensuring that only within-individual (over-time) variation remains.
Advantages of the Dummy Variable Approach
Easy to Interpret: Explicitly includes entity-specific effects, making it easier to understand how individual heterogeneity is modeled.
Equivalent to the Within (Demean) Transformation: Mathematically, this approach produces the same coefficient estimates as the Within Transformation.
Allows for Inclusion of Time Dummies: The model can easily incorporate time dummies (dt) to control for period-specific shocks.
Limitations of the Dummy Variable Approach
Computational Complexity with Large N
Adding N dummy variables significantly increases the number of parameters estimated.
If N is very large (e.g., 10,000 individuals), this approach can be computationally expensive.
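Individual Effect Estimates Are Unreliable in Short Panels
Each ci coefficient is estimated from only T observations, so with small T these estimates and their standard errors are imprecise and do not improve as N grows. The estimates of β are unaffected and match the within estimator.
This is why the Within Transformation (Demeaning Approach) is generally preferred when only β is of interest.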
Consumes Degrees of Freedom
Introducing N additional parameters reduces degrees of freedom, which can lead to overfitting.
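The claimed equivalence between the dummy variable approach and the within transformation can be checked numerically. The sketch below uses simulated data (all names and parameter values are illustrative assumptions) and compares the slope from `lm()` with `factor(id)` dummies to the slope from the demeaned regression. It also shows that the two standard errors differ only by a degrees-of-freedom factor, since the demeaned regression does not count the N estimated individual means.

```r
set.seed(2)
n <- 50; T_len <- 4                          # hypothetical panel dimensions
id  <- rep(seq_len(n), each = T_len)
c_i <- rnorm(n)
x   <- 0.5 * c_i[id] + rnorm(n * T_len)
y   <- 1.5 * x + c_i[id] + rnorm(n * T_len)  # true beta = 1.5 (assumed)

# Dummy variable (LSDV) regression: one dummy per individual via factor(id)
lsdv_fit <- lm(y ~ x + factor(id))

# Within regression on demeaned data
y_dm <- y - ave(y, id)
x_dm <- x - ave(x, id)
within_fit <- lm(y_dm ~ x_dm - 1)

beta_lsdv   <- unname(coef(lsdv_fit)["x"])
beta_within <- unname(coef(within_fit)["x_dm"])

# Standard errors: the naive within SE uses NT - 1 residual degrees of freedom,
# while LSDV uses NT - N - 1; rescaling reconciles the two
se_lsdv  <- summary(lsdv_fit)$coefficients["x", "Std. Error"]
se_naive <- summary(within_fit)$coefficients["x_dm", "Std. Error"]
se_adj   <- se_naive * sqrt((n * T_len - 1) / (n * T_len - n - 1))

all.equal(beta_lsdv, beta_within)
all.equal(se_lsdv, se_adj)
```

Dedicated panel routines apply this degrees-of-freedom correction automatically, which is one practical reason to prefer them over a hand-rolled demeaned regression.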
11.5.7.3 First-Difference Approach
An alternative way to eliminate individual-specific effects (ci) is to take first differences across time, rather than subtracting the individual mean.
The FE model:
$$y_{it} = x_{it}\beta + c_i + u_{it}$$
Since $c_i$ is constant over time, taking the first difference gives:
$$y_{it} - y_{i(t-1)} = (x_{it} - x_{i(t-1)})\beta + (u_{it} - u_{i(t-1)})$$
This transformation removes $c_i$ completely, leaving a model that can be estimated using Pooled OLS.
Advantages of the First-Difference Approach
Eliminates Individual Effects (ci)
Since ci is time-invariant, differencing removes it from the equation.
Works Well with Few Time Periods (T is Small)
If T is small, first-differencing is often preferred over the Within Transformation, as it does not require averaging over many periods.
Less Computationally Intensive
Unlike the Dummy Variable Approach, which requires estimating N additional parameters, the First-Difference Approach reduces the dimensionality of the problem.
Limitations of the First-Difference Approach
Cannot Handle Missing Observations Well
If data is missing in period t−1 for an individual, then the corresponding first-difference observation is lost.
This can significantly reduce sample size in unbalanced panels.
Reduces Number of Observations by One
Since first differences require yi(t−1) to exist, the model loses one time period (T−1 observations per individual instead of T).
Can Introduce Serial Correlation
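Differencing induces serial correlation in the transformed error: if uit is serially uncorrelated, the adjacent differences uit−ui(t−1) and ui(t−1)−ui(t−2) share the term ui(t−1) and are therefore negatively correlated (correlation −0.5 under homoskedasticity).
This means the standard OLS assumption of independent errors no longer holds, so standard errors should be clustered by individual or otherwise made robust to serial correlation.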
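A minimal base-R sketch of the first-difference estimator on simulated data follows. The data-generating values and variable names are assumptions for illustration, and the data are assumed to be sorted by individual and then by time. The last lines look at the correlation between adjacent differenced residuals, which should be clearly negative when the original errors are independent over time.

```r
set.seed(3)
n <- 300; T_len <- 4                        # hypothetical panel dimensions
id  <- rep(seq_len(n), each = T_len)        # sorted by individual, then time
c_i <- rnorm(n)
x   <- 0.8 * c_i[id] + rnorm(n * T_len)
y   <- 2 * x + c_i[id] + rnorm(n * T_len)   # true beta = 2 (assumed)

# First differences within each individual (T - 1 observations per individual)
dx <- unlist(tapply(x, id, diff))
dy <- unlist(tapply(y, id, diff))

# Pooled OLS on the differenced data; c_i has dropped out
fd_fit  <- lm(dy ~ dx - 1)
beta_fd <- unname(coef(fd_fit)["dx"])

# Correlation between adjacent differenced residuals within individuals
res    <- matrix(resid(fd_fit), nrow = n, byrow = TRUE)   # n x (T - 1)
rho_du <- cor(as.vector(res[, -ncol(res)]), as.vector(res[, -1]))

c(beta_fd = beta_fd, rho_du = rho_du)
```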
Does transferring resources to low-income families improve upward mobility for children?
What are the mechanisms of intergenerational mobility?
Mechanisms for Intergenerational Mobility
There are multiple pathways through which parental income influences child outcomes:
Genetics (Ability Endowment)
If mobility is purely genetic, policy cannot affect outcomes.
Environmental Indirect Effects
Family background, peer influences, school quality.
Environmental Direct Effects
Parental investments in education, health, social capital.
Financial Transfers
Direct monetary support, inheritance, wealth accumulation.
One way to measure the impact of income on human capital accumulation is:
$$\frac{\%\Delta \text{Human Capital}}{\%\Delta \text{Parental Income}}$$
where human capital includes education, skills, and job market outcomes.
Income is measured in different ways to capture its long-term effects:
Total household income
Wage income
Non-wage income
Annual vs. Permanent Income (important distinction for long-term analysis)
Key control variables must be exogenous to avoid bias. Bad Controls are those that are jointly determined with the dependent variable (e.g., mother’s labor force participation).
Exogenous controls:
Mother’s race
Birth location
Parental education
Household structure at age 14
The estimated model is:
$$Y_{ijt} = X_{jt}\beta_i + I_{jt}\alpha_i + \epsilon_{ijt}$$
where:
i = test (e.g., academic test score).
j = individual (child).
t = time.
Xjt = observable child characteristics.
Ijt = parental income.
ϵijt = error term.
Grandmother’s Fixed-Effects Model
Since a child (j) is nested within a mother (m), and a mother is nested within a grandmother (g), we estimate:
$$Y_{ijgmt} = X_{jt}\beta_i + I_{jt}\alpha_i + \gamma_g + u_{ijgmt}$$
where:
g = Grandmother, m = Mother, j = Child, t = Time.
γg captures both grandmother and mother fixed effects.
The nested structure controls for genetic and fixed family environment effects.
Cluster standard errors at the family level to account for correlation in errors across generations.
Pros of Grandmother FE Model
Controls for genetics + fixed family background
Allows estimation of income effects independent of family background
Cons
Might not fully control for unobserved heterogeneity
Measurement errors in income can exacerbate attenuation bias
11.5.7.6.2 Fixed Effects in Teacher Quality Studies – Babcock (2010)
The study investigates:
How teacher quality influences student performance.
Whether students adjust course selection behavior based on past grading experiences.
How to properly estimate teacher fixed effects while addressing selection bias and measurement error.
The initial model estimates student performance (Tijct) based on class expectations and student characteristics:
$$T_{ijct} = \alpha_0 + S_{jct}\alpha_1 + X_{ijct}\alpha_2 + u_{ijct}$$
where:
Tijct = Student test score.
Sjct = Class-level grading expectation (e.g., expected GPA in the course).
Xijct = Individual student characteristics.
i = Student, j = Instructor, c = Course, t = Time.
uijct = Idiosyncratic error term.
A key issue in this model is that grading expectations may not be randomly assigned. If students select into courses based on grading expectations, simultaneity bias can arise.
To control for instructor and course heterogeneity, the model introduces teacher-course fixed effects (μjc):
$$T_{ijct} = \beta_0 + S_{jct}\beta_1 + X_{ijct}\beta_2 + \mu_{jc} + \epsilon_{ijct}$$
where:
μjc is a unique fixed effect for each instructor-course combination.
This controls for instructor-specific grading policies and course difficulty.
It differs from a simple instructor effect (θj) and course effect (δc) because it captures interaction effects.
Implications of Instructor-Course Fixed Effects
Reduces Bias from Course Shopping
Students may select courses based on grading expectations.
Including μjc controls for the fact that some instructors systematically assign easier grades.
Shifts in Student Expectations
Even if course content remains constant, students adjust their expectations based on past grading experiences.
This influences their future course selection behavior.
Identification Strategy
A key challenge in estimating teacher effects is endogeneity from:
Simultaneity Bias
Grading expectations (Sjct) and student performance may be jointly determined.
If grading expectations are based on past student performance, OLS will be biased.
Unobserved Teacher Characteristics
Some teachers may have innate ability to motivate students, leading to higher student performance independent of observable teacher traits.
To address these concerns, the model first controls for observable teacher characteristics:
However, if teacher characteristics are correlated with unobserved ability, we replace them with teacher fixed effects:
$$Y_{ijt} = X_{it}\alpha + \Gamma_{it}\theta_j + u_{ijt}$$
where:
θj = Teacher Fixed Effect, capturing all time-invariant teacher characteristics.
Γit represents within-teacher variation.
To further analyze teacher impact, we express student test scores as:
$$Y_{ijt} = X_{it}\gamma + e_{ijt}$$
where:
$\gamma$ represents the between and within variation.
$e_{ijt}$ is the prediction error.
Decomposing the error term:
$$e_{ijt} = T_{it}\delta_j + \tilde{e}_{ijt}$$
where:
$\delta_j$ = Group-level teacher effect.
$\tilde{e}_{ijt}$ = Residual error.
To control for prior student performance, we introduce lagged test scores:
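$$Y_{ijkt} = Y_{ijk,t-1} + X_{it}\beta + T_{it}\tau_j + (W_i + P_k + \epsilon_{ijkt})$$
where:
$Y_{ijk,t-1}$ = Lagged student test score.
$\tau_j$ = Teacher Fixed Effect.
$W_i$ = Student Fixed Effect.
$P_k$ = School Fixed Effect.
$u_{ijkt} = W_i + P_k + \epsilon_{ijkt}$ is the composite error.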
A major issue is selection bias:
If students sort into better teachers, the teacher effect (τ) may be overestimated.
Bias in τ for teacher j is:
$$\frac{1}{N_j}\sum_{i=1}^{N_j}(W_i + P_k + \epsilon_{ijkt})$$
where $N_j$ is the number of students in class with teacher $j$.
Smaller class sizes → Higher bias in teacher effect estimates because $\frac{1}{N_j}\sum_{i=1}^{N_j}\epsilon_{ijkt} \neq 0$ will inflate the teacher fixed effect. If we use random teacher effects instead, $\tau$ will still contain bias, and we do not know the direction of the bias.
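The role of class size can be made explicit with a short derivation. As a simplifying assumption that the text does not state, suppose the idiosyncratic errors within teacher j's class are independent with common variance $\sigma_\epsilon^2$. The class average of the errors then has mean zero but a variance that shrinks with class size, so small classes leave more noise in the estimated teacher effect:

```latex
\operatorname{Var}\left(\frac{1}{N_j}\sum_{i=1}^{N_j}\epsilon_{ijkt}\right)
  = \frac{1}{N_j^2}\sum_{i=1}^{N_j}\operatorname{Var}(\epsilon_{ijkt})
  = \frac{\sigma_\epsilon^2}{N_j}
```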
If teachers switch schools, we can separately estimate:
Teacher Fixed Effects (τj)
School Fixed Effects (Pk)
The mobility web refers to the network of teacher transitions across schools, which helps in identifying both teacher and school fixed effects.
Thin mobility web: Few teachers switch schools, making it harder to separate teacher effects from school effects.
Thick mobility web: Many teachers switch schools, improving identification of teacher quality independent of school characteristics.
The panel data model capturing student performance over time is:
this means that the randomness in student assignments does not systematically bias teacher quality estimates.
The total observed variance in estimated teacher effects is:
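$$\operatorname{var}(\hat{\tau}) = \operatorname{var}(\tau) + \operatorname{var}(\lambda)$$
Rearranging:
$$\operatorname{var}(\tau) = \operatorname{var}(\hat{\tau}) - \operatorname{var}(\lambda)$$
Since we observe $\operatorname{var}(\hat{\tau})$, we need to estimate $\operatorname{var}(\lambda)$.
Measurement error variance ($\operatorname{var}(\lambda)$) can be approximated using the average squared standard error of teacher effects:
$$\operatorname{var}(\lambda) = \frac{1}{J}\sum_{j=1}^{J}\hat{\sigma}_j^2$$
where $\hat{\sigma}_j^2$ is the squared standard error of the estimated effect for teacher $j$ (which depends on sample size $N_j$).
The signal-to-noise ratio (or reliability) of teacher effect estimates is:
$$\text{Reliability} = \frac{\operatorname{var}(\tau)}{\operatorname{var}(\hat{\tau})}$$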
where:
Higher reliability indicates that most of the variation comes from true teacher effects (τ) rather than noise.
Lower reliability suggests that a large portion of variation is due to measurement error.
The proportion of error variance in estimated teacher effects is:
$$\text{Noise} = 1 - \frac{\operatorname{var}(\tau)}{\operatorname{var}(\hat{\tau})}$$
Even if true teacher quality depends on class size (Nj), our method for estimating λ remains unaffected.
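The variance decomposition above can be sketched in base R with simulated teacher effects. Everything in this example is an assumption for illustration: the number of teachers, the true spread of teacher effects, the class sizes, and a student-level noise standard deviation of 1 (so that each teacher's standard error is 1/sqrt(Nj)).

```r
set.seed(4)
J    <- 500                                  # hypothetical number of teachers
tau  <- rnorm(J, sd = 0.2)                   # true teacher effects, var = 0.04
N_j  <- sample(10:40, J, replace = TRUE)     # class size per teacher
se_j <- 1 / sqrt(N_j)                        # standard error of each estimate
tau_hat <- tau + rnorm(J, sd = se_j)         # estimated effect = truth + noise

var_tau_hat <- var(tau_hat)                  # observed variance of estimates
var_lambda  <- mean(se_j^2)                  # average squared standard error
var_tau     <- var_tau_hat - var_lambda      # implied variance of true effects

reliability <- var_tau / var_tau_hat         # signal share
noise       <- 1 - reliability               # error share

round(c(var_tau = var_tau, reliability = reliability, noise = noise), 3)
```

In applied work the squared standard errors would come from the estimated teacher fixed effects rather than from a known noise level.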
To check whether teacher effects are biased by sampling error, we regress estimated teacher effects (ˆτj) on teacher characteristics (Xj):
$$\hat{\tau}_j = \beta_0 + X_j\beta_1 + \epsilon_j$$
If teacher characteristics do not predict sampling error, then:
$$R^2 \approx 0$$
This would confirm that teacher characteristics are uncorrelated with measurement error, validating the identification strategy.
11.5.8 Tests for Assumptions in Panel Data Analysis
We typically don’t test heteroskedasticity explicitly because robust covariance matrix estimation is used. However, other key assumptions should be tested before choosing the appropriate panel model.
Tests whether coefficients are the same across individuals (also known as an F-test of stability or Chow test).
H0: All individuals have the same coefficients (i.e., equal coefficients for all individuals).
Ha: Different individuals have different coefficients.
Notes:
A fixed effects model assumes different intercepts per individual.
A random effects model assumes a common intercept.
library(plm)
data("Grunfeld", package = "plm")
plm::pooltest(inv ~ value + capital, data = Grunfeld, model = "within")
#>
#>  F statistic
#>
#> data:  inv ~ value + capital
#> F = 5.7805, df1 = 18, df2 = 170, p-value = 1.219e-10
#> alternative hypothesis: unstability
If the null is rejected, we should not use a pooled OLS model.
11.5.8.2 Testing for Individual and Time Effects
Checks for the presence of individual or time effects, or both.
Types of tests:
honda: Default test for individual effects (Honda 1985)
pFtest(inv ~ value + capital, data = Grunfeld, effect = "twoways")
#>
#>  F test for twoways effects
#>
#> data:  inv ~ value + capital
#> F = 17.403, df1 = 28, df2 = 169, p-value < 2.2e-16
#> alternative hypothesis: significant effects
pFtest(inv ~ value + capital, data = Grunfeld, effect = "individual")
#>
#>  F test for individual effects
#>
#> data:  inv ~ value + capital
#> F = 49.177, df1 = 9, df2 = 188, p-value < 2.2e-16
#> alternative hypothesis: significant effects
pFtest(inv ~ value + capital, data = Grunfeld, effect = "time")
#>
#>  F test for time effects
#>
#> data:  inv ~ value + capital
#> F = 0.23451, df1 = 19, df2 = 178, p-value = 0.9997
#> alternative hypothesis: significant effects
If the null hypothesis is rejected, a fixed effects model is more appropriate than pooled OLS.
Tests whether residuals across entities are correlated.
11.5.8.3.1 Global Cross-Sectional Dependence
pcdtest(inv ~ value + capital, data = Grunfeld, model = "within")
#>
#>  Pesaran CD test for cross-sectional dependence in panels
#>
#> data:  inv ~ value + capital
#> z = 4.6612, p-value = 3.144e-06
#> alternative hypothesis: cross-sectional dependence
11.5.8.3.2 Local Cross-Sectional Dependence
Uses a spatial weight matrix w.
# weight_matrix is assumed to be a user-supplied n x n proximity matrix (not defined here)
pcdtest(inv ~ value + capital, data = Grunfeld, model = "within", w = weight_matrix)
If the null is rejected, cross-sectional correlation exists and should be addressed.
11.5.8.4 Serial Correlation in Panel Data
Null hypothesis: There is no serial correlation.
Serial correlation is typically observed in macro panels with long time series (large N and T). It is less relevant in micro panels with short time series (small T and large N).
Idiosyncratic error terms: Often modeled as an autoregressive process (e.g., AR(1)).
Typically, “serial correlation” refers to the second type (idiosyncratic errors).
Types of Serial Correlation Tests
Marginal tests: Test for one type of dependence at a time but may be biased towards rejection.
Joint tests: Detect both sources of dependence but do not distinguish the source of the problem.
Conditional tests: Assume one dependence structure is correctly specified and test for additional departures.
11.5.8.4.1 Unobserved Effects Test
A semi-parametric test for unobserved effects; the test statistic W is asymptotically standard normal, $W \sim N(0,1)$, regardless of the error distribution.
Null hypothesis (H0): No unobserved effects ($\sigma^2_\mu = 0$), which supports using pooled OLS.
Under H0: The covariance matrix of residuals is diagonal (no off-diagonal correlations).
Power: The test has power against both unobserved individual effects and serial correlation, so a rejection does not by itself identify which of the two is present.
library(plm)
data("Produc", package = "plm")
# Wooldridge test for unobserved individual effects
pwtest(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp, data = Produc)
#>
#>  Wooldridge's test for unobserved individual effects
#>
#> data:  formula
#> z = 3.9383, p-value = 8.207e-05
#> alternative hypothesis: unobserved effect
Interpretation: If we reject H0, pooled OLS is inappropriate due to the presence of unobserved effects.
11.5.8.4.2 Locally Robust Tests for Serial Correlation and Random Effects
Joint LM Test for Random Effects and Serial Correlation
A Lagrange Multiplier test to jointly detect:
Random effects (panel-level variance components).
Serial correlation (time-series dependence).
Null Hypothesis: No random effects and no serial correlation, under the maintained assumptions of normality and homoskedasticity of the idiosyncratic errors (Baltagi and Li 1991, 1995).
# Baltagi and Li's joint test for serial correlation and random effects
pbsytest(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp, data = Produc, test = "j")
#>
#>  Baltagi and Li AR-RE joint test
#>
#> data:  formula
#> chisq = 4187.6, df = 2, p-value < 2.2e-16
#> alternative hypothesis: AR(1) errors or random effects
Interpretation: If we reject H0, either serial correlation, random effects, or both are present. But we don’t know the source of dependence.
To distinguish the source of dependence, we use one of the following locally robust tests, both of which assume normality and homoskedasticity (Bera, Sosa-Escudero, and Yoon 2001):
BSY Test for Serial Correlation
pbsytest(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp, data = Produc)
#>
#>  Bera, Sosa-Escudero and Yoon locally robust test
#>
#> data:  formula
#> chisq = 52.636, df = 1, p-value = 4.015e-13
#> alternative hypothesis: AR(1) errors sub random effects
BSY Test for Random Effects
pbsytest(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp, data = Produc, test = "re")
#>
#>  Bera, Sosa-Escudero and Yoon locally robust test (one-sided)
#>
#> data:  formula
#> z = 57.914, p-value < 2.2e-16
#> alternative hypothesis: random effects sub AR(1) errors
If serial correlation is “known” to be absent (based on the BSY test), the LM test for random effects is superior.
plmtest(inv ~ value + capital, data = Grunfeld, type = "honda")
#>
#>  Lagrange Multiplier Test - (Honda)
#>
#> data:  inv ~ value + capital
#> normal = 28.252, p-value < 2.2e-16
#> alternative hypothesis: significant effects
If random effects are absent (based on the BSY test), we use the Breusch-Godfrey serial correlation test (Breusch 1978; Godfrey 1978).
If Random Effects are Present: Use Baltagi and Li’s Test
Baltagi and Li’s test detects serial correlation in AR(1) and MA(1) processes under random effects.
Null hypothesis (H0): Uncorrelated errors.
Note:
The test has power only against positive serial correlation (one-sided).
It is applicable only to balanced panels.
pbltest(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp, data = Produc, alternative = "onesided")
#>
#>  Baltagi and Li one-sided LM test
#>
#> data:  log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp
#> z = 21.69, p-value < 2.2e-16
#> alternative hypothesis: AR(1)/MA(1) errors in RE panel model
11.5.8.4.3 General Serial Correlation Tests
Applicable to random effects, pooled OLS, and fixed effects models.
Can test for higher-order serial correlation.
# Breusch-Godfrey/Wooldridge test for higher-order serial correlation
plm::pbgtest(plm::plm(inv ~ value + capital, data = Grunfeld, model = "within"), order = 2)
#>
#>  Breusch-Godfrey/Wooldridge test for serial correlation in panel models
#>
#> data:  inv ~ value + capital
#> chisq = 42.587, df = 2, p-value = 5.655e-10
#> alternative hypothesis: serial correlation in idiosyncratic errors
For short panels (Small T, Large N), use Wooldridge’s test:
data("EmplUK", package = "plm")
pwartest(log(emp) ~ log(wage) + log(capital), data = EmplUK)
#>
#>  Wooldridge's test for serial correlation in FE panels
#>
#> data:  plm.model
#> F = 312.3, df1 = 1, df2 = 889, p-value < 2.2e-16
#> alternative hypothesis: serial correlation
11.5.8.5 Unit Roots and Stationarity in Panel Data
11.5.8.5.1 Dickey-Fuller Test for Stochastic Trends
Purpose: Tests for the presence of a unit root (non-stationarity) in a time series.
Null hypothesis (H0): The series is non-stationary (i.e., it has a unit root).
Alternative hypothesis (HA): The series is stationary (no unit root).
Decision Rule:
If the test statistic is less than the critical value (or p<0.05), reject H0, indicating stationarity.
If the test statistic is greater than the critical value (or p≥0.05), fail to reject H0, suggesting the presence of a unit root.
library(tseries)
# Example: Test for unit root in GDP data
adf.test(Produc$gsp, alternative = "stationary")
#>
#>  Augmented Dickey-Fuller Test
#>
#> data:  Produc$gsp
#> Dickey-Fuller = -6.5425, Lag order = 9, p-value = 0.01
#> alternative hypothesis: stationary
If we reject H0, the series is stationary and does not exhibit a stochastic trend.
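The logic of the decision rule can be seen by running a basic (non-augmented) Dickey-Fuller regression by hand in base R. The sketch below simulates a stationary AR(1) series (the coefficient 0.5 and the sample size are assumptions for illustration), regresses the first difference on the lagged level, and compares the t statistic on the lagged level with the asymptotic 5% Dickey-Fuller critical value of about -2.86 for a regression with a constant and no trend.

```r
set.seed(5)
n <- 500
# Simulated stationary AR(1) series (illustrative assumption)
y <- as.numeric(arima.sim(model = list(ar = 0.5), n = n))

dy    <- diff(y)        # y_t - y_{t-1}
y_lag <- y[-n]          # y_{t-1}

# Dickey-Fuller regression with a constant: dy_t = a + rho * y_{t-1} + e_t
df_fit  <- lm(dy ~ y_lag)
df_stat <- summary(df_fit)$coefficients["y_lag", "t value"]

# The statistic follows the Dickey-Fuller distribution, not the usual t
crit_5pct <- -2.86
reject_unit_root <- df_stat < crit_5pct
reject_unit_root
```

`adf.test()` augments this regression with lagged differences and a trend term, so its statistic will not match this simplified version exactly.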
11.5.8.5.2 Levin-Lin-Chu Unit Root Test
Purpose: Tests for the presence of a unit root in a panel dataset.
Null hypothesis (H0): The series has a unit root (non-stationary).
Alternative hypothesis (HA): The series is stationary.
Assumptions: Requires large N (cross-sections) and moderate T (time periods).
Decision Rule: If the test statistic is less than the critical value or p<0.05, reject H0 (evidence of stationarity).
library(plm)
data("Grunfeld", package = "plm")
# Levin-Lin-Chu (LLC) Unit Root Test
purtest(Grunfeld, test = "levinlin")
#>
#>  Levin-Lin-Chu Unit-Root Test (ex. var.: None)
#>
#> data:  Grunfeld
#> z = 0.39906, p-value = 0.6551
#> alternative hypothesis: stationarity
If we reject H0, the series is stationary.
11.5.8.6 Heteroskedasticity in Panel Data
11.5.8.6.1 Breusch-Pagan Test
Purpose: Detects heteroskedasticity in regression residuals.
Null hypothesis (H0): The data is homoskedastic (constant variance).
Alternative hypothesis (HA): The data exhibits heteroskedasticity (non-constant variance).
Decision Rule:
If the p-value is small (e.g., p<0.05), reject H0, suggesting heteroskedasticity.
If the p-value is large (p≥0.05), fail to reject H0, implying homoskedasticity.
library(lmtest)
# Fit a panel model (pooled OLS)
model <- lm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp, data = Produc)
# Breusch-Pagan Test for Heteroskedasticity
bptest(model)
#>
#>  studentized Breusch-Pagan test
#>
#> data:  model
#> BP = 80.033, df = 4, p-value < 2.2e-16
If heteroskedasticity is detected, inference should rely on a robust covariance matrix estimator. Which estimator applies depends on whether serial correlation is also an issue.
Choosing the Correct Robust Covariance Matrix Estimator
| Estimator | Corrects for Heteroskedasticity? | Corrects for Serial Correlation? | Recommended For |
|---|---|---|---|
| "white1" | ✅ Yes | ❌ No | Random Effects |
| "white2" | ✅ Yes (common variance within groups) | ❌ No | Random Effects |
| "arellano" | ✅ Yes | ✅ Yes | Fixed Effects |
library(plm)
library(lmtest)  # provides coeftest()
# Fit a fixed effects model
fe_model <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp, data = Produc, model = "within")
# Compute robust standard errors using Arellano's method
coeftest(fe_model, vcov = vcovHC(fe_model, method = "arellano"))
#>
#> t test of coefficients:
#>
#>             Estimate Std. Error t value  Pr(>|t|)
#> log(pcap) -0.0261497  0.0603262 -0.4335   0.66480
#> log(pc)    0.2920069  0.0617425  4.7294 2.681e-06 ***
#> log(emp)   0.7681595  0.0816652  9.4062 < 2.2e-16 ***
#> unemp     -0.0052977  0.0024958 -2.1226   0.03411 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Using a robust covariance matrix corrects for heteroskedasticity and/or serial correlation, ensuring valid inference.
11.5.9 Model Selection in Panel Data
Panel data models must be chosen based on the structure of the data and underlying assumptions. This section provides guidance on selecting between Pooled OLS, Random Effects, and Fixed Effects models.
11.5.9.1 Pooled OLS vs. Random Effects
The choice between POLS and RE depends on whether there are unobserved individual effects.
Breusch-Pagan Lagrange Multiplier Test
Purpose: Tests whether a random effects model is preferable to a pooled OLS model.
Null hypothesis (H0): Variance across entities is zero (i.e., no panel effect → POLS is preferred).
Alternative hypothesis (HA): There is significant panel-level variation → RE is preferable to POLS.
Decision Rule: If p<0.05, reject H0, indicating that RE is preferred.
library(plm)
# Breusch-Pagan LM Test
plmtest(plm(inv ~ value + capital, data = Grunfeld, model = "pooling"), type = "bp")
#>
#>  Lagrange Multiplier Test - (Breusch-Pagan)
#>
#> data:  inv ~ value + capital
#> chisq = 798.16, df = 1, p-value < 2.2e-16
#> alternative hypothesis: significant effects
If the test is significant, RE is more appropriate than POLS.
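To make the mechanics of the LM test concrete, the statistic can be computed by hand from pooled OLS residuals. The sketch below uses the standard balanced-panel form of the Breusch-Pagan LM statistic, which the text above does not spell out, on simulated data with a genuine individual effect. The data-generating values and names are assumptions for illustration.

```r
set.seed(6)
n <- 100; T_len <- 5                          # hypothetical balanced panel
id <- rep(seq_len(n), each = T_len)
x  <- rnorm(n * T_len)
y  <- 1 + 0.5 * x + rnorm(n)[id] + rnorm(n * T_len)   # includes an individual effect

# Pooled OLS residuals
e <- resid(lm(y ~ x))

# LM = nT / (2(T-1)) * [ sum_i (sum_t e_it)^2 / sum_i sum_t e_it^2 - 1 ]^2
sum_by_id <- tapply(e, id, sum)
lm_stat <- n * T_len / (2 * (T_len - 1)) *
  (sum(sum_by_id^2) / sum(e^2) - 1)^2

# Under H0 (no individual effects) the statistic is chi-squared with 1 df
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)
c(LM = lm_stat, p_value = p_value)
```

If the residuals were uncorrelated within individuals, the ratio inside the brackets would be close to 1 and the statistic close to zero. A large individual effect pushes the ratio well above 1.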
11.5.9.2 Fixed Effects vs. Random Effects
The choice between FE and RE depends on whether the individual-specific effects are correlated with the regressors.
Key Assumptions and Properties
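| Hypothesis | If True |
|---|---|
| $H_0: \operatorname{Cov}(c_i, x_{it}) = 0$ | $\hat{\beta}_{RE}$ is consistent and efficient, while $\hat{\beta}_{FE}$ is consistent |
| $H_a: \operatorname{Cov}(c_i, x_{it}) \neq 0$ | $\hat{\beta}_{RE}$ is inconsistent, while $\hat{\beta}_{FE}$ remains consistent |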
Hausman Test
Purpose: Determines whether FE or RE is appropriate.
For the Hausman test to work, you need to assume that
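Hausman test statistic: $$H = (\hat{\beta}_{RE} - \hat{\beta}_{FE})' \left[ V(\hat{\beta}_{FE}) - V(\hat{\beta}_{RE}) \right]^{-1} (\hat{\beta}_{RE} - \hat{\beta}_{FE}) \sim \chi^2_{n(X)}$$ where $n(X)$ is the number of parameters for the time-varying regressors.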
Null hypothesis (H0): RE estimator is consistent and efficient.
Alternative hypothesis (HA): RE estimator is inconsistent, meaning FE should be used.
Decision Rule:
If p<0.05: Reject H0, meaning FE is preferred.
If p≥0.05: Fail to reject H0, meaning RE can be used.
library(plm)
# Fit FE and RE models
fe_model <- plm(inv ~ value + capital, data = Grunfeld, model = "within")
re_model <- plm(inv ~ value + capital, data = Grunfeld, model = "random")
# Hausman test
phtest(fe_model, re_model)
#>
#>  Hausman Test
#>
#> data:  inv ~ value + capital
#> chisq = 2.3304, df = 2, p-value = 0.3119
#> alternative hypothesis: one model is inconsistent
If the null hypothesis is rejected, use FE. If not, RE is appropriate.
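As a cross-check on what `phtest()` reports, the Hausman statistic can be assembled by hand from the two fitted models. The sketch below assumes the usual form of the statistic: a quadratic form in the coefficient difference, weighted by the inverse of the difference between the FE and RE covariance matrices, restricted to the coefficients the two models share. The object names are illustrative.

```r
library(plm)
data("Grunfeld", package = "plm")

fe <- plm(inv ~ value + capital, data = Grunfeld, model = "within")
re <- plm(inv ~ value + capital, data = Grunfeld, model = "random")

# Compare only coefficients present in both models (FE has no intercept)
k      <- names(coef(fe))
b_diff <- coef(fe) - coef(re)[k]
v_diff <- vcov(fe) - vcov(re)[k, k]

# Quadratic form and chi-squared p-value with length(k) degrees of freedom
H       <- as.numeric(t(b_diff) %*% solve(v_diff) %*% b_diff)
p_value <- pchisq(H, df = length(k), lower.tail = FALSE)

c(H = H, p_value = p_value)
```

In finite samples the variance difference is not guaranteed to be positive definite, which is one reason to rely on the packaged test in practice.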
11.5.9.3 Summary of Model Assumptions and Consistency
Generalized Method of Moments Estimator: For dynamic panel models.
General Feasible GLS Estimator: Accounts for heteroskedasticity and serial correlation.
Mean Groups Estimator: Averages individual-specific estimates.
Common Correlated Effects Mean Group Estimator: Accounts for cross-sectional dependence.
Limited Dependent Variable Estimators: Used for binary or censored data.
11.5.11 Application
11.5.11.1 plm Package
The plm package in R is designed for panel data analysis, allowing users to estimate various models, including pooled OLS, fixed effects, random effects, and other specifications commonly used in econometrics.
Other test types: "honda", "kw", "ghm". Other effects: "time", "twoways".
Cross-Sectional Dependence Tests
Breusch-Pagan LM test for cross-sectional dependence
pcdtest(fixed, test = "lm")
#>
#>  Breusch-Pagan LM test for cross-sectional dependence in panels
#>
#> data:  log(gsp) ~ log(pcap) + log(emp) + unemp
#> chisq = 6490.4, df = 1128, p-value < 2.2e-16
#> alternative hypothesis: cross-sectional dependence
Pesaran’s CD statistic
pcdtest(fixed, test = "cd")
#>
#>  Pesaran CD test for cross-sectional dependence in panels
#>
#> data:  log(gsp) ~ log(pcap) + log(emp) + unemp
#> z = 37.13, p-value < 2.2e-16
#> alternative hypothesis: cross-sectional dependence
Serial Correlation Test (Panel Version of the Breusch-Godfrey Test)
Used to check for autocorrelation in panel data.
pbgtest(fixed)
#>
#>  Breusch-Godfrey/Wooldridge test for serial correlation in panel models
#>
#> data:  log(gsp) ~ log(pcap) + log(emp) + unemp
#> chisq = 476.92, df = 17, p-value < 2.2e-16
#> alternative hypothesis: serial correlation in idiosyncratic errors
Stationarity Test (Augmented Dickey-Fuller Test)
Checks whether a time series variable is stationary.
library(tseries)
adf.test(pdata$gsp, k = 2)
#>
#>  Augmented Dickey-Fuller Test
#>
#> data:  pdata$gsp
#> Dickey-Fuller = -5.9028, Lag order = 2, p-value = 0.01
#> alternative hypothesis: stationary
F-Test for Fixed Effects vs. Pooled OLS
Null Hypothesis: Pooled OLS is appropriate.
Alternative Hypothesis: Fixed effects model is preferred.
pFtest(fixed, pooling)
#>
#>  F test for individual effects
#>
#> data:  log(gsp) ~ log(pcap) + log(emp) + unemp
#> F = 149.58, df1 = 47, df2 = 765, p-value < 2.2e-16
#> alternative hypothesis: significant effects
Hausman Test for Fixed vs. Random Effects
Null Hypothesis: Random effects are appropriate.
Alternative Hypothesis: Fixed effects are preferred (RE assumptions are violated).
phtest(random, fixed)
#>
#>  Hausman Test
#>
#> data:  log(gsp) ~ log(pcap) + log(emp) + unemp
#> chisq = 84.924, df = 3, p-value < 2.2e-16
#> alternative hypothesis: one model is inconsistent
Heteroskedasticity and Robust Standard Errors
Breusch-Pagan Test for Heteroskedasticity
Tests whether heteroskedasticity is present in the panel dataset.
HC4: Useful for small samples with influential observations.
For Fixed Effects Model
# Original coefficients
coeftest(fixed)
#>
#> t test of coefficients:
#>
#>              Estimate  Std. Error t value Pr(>|t|)
#> log(pcap)  0.03488447  0.03092191  1.1281   0.2596
#> log(emp)   1.03017988  0.02161353 47.6636   <2e-16 ***
#> unemp     -0.00021084  0.00096121 -0.2194   0.8264
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Heteroskedasticity-consistent standard errors
coeftest(fixed, vcovHC)
#>
#> t test of coefficients:
#>
#>              Estimate  Std. Error t value Pr(>|t|)
#> log(pcap)  0.03488447  0.06661083  0.5237   0.6006
#> log(emp)   1.03017988  0.06413365 16.0630   <2e-16 ***
#> unemp     -0.00021084  0.00217453 -0.0970   0.9228
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Arellano method for robust errors
coeftest(fixed, vcovHC(fixed, method = "arellano"))
#>
#> t test of coefficients:
#>
#>              Estimate  Std. Error t value Pr(>|t|)
#> log(pcap)  0.03488447  0.06661083  0.5237   0.6006
#> log(emp)   1.03017988  0.06413365 16.0630   <2e-16 ***
#> unemp     -0.00021084  0.00217453 -0.0970   0.9228
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Different HC types
t(sapply(c("HC0", "HC1", "HC2", "HC3", "HC4"), function(x)
  sqrt(diag(vcovHC(fixed, type = x)))))
#>      log(pcap)   log(emp)       unemp
#> HC0 0.06661083 0.06413365 0.002174525
#> HC1 0.06673362 0.06425187 0.002178534
#> HC2 0.06689078 0.06441024 0.002182114
#> HC3 0.06717278 0.06468886 0.002189747
#> HC4 0.06742431 0.06496436 0.002193150
Summary of Model Selection
| Test | Null Hypothesis (H₀) | Decision Rule |
|---|---|---|
| LM Test | OLS is appropriate | Reject H₀ → Use RE |
| Hausman Test | Random effects preferred | Reject H₀ → Use FE |
| pFtest | OLS is appropriate | Reject H₀ → Use FE |
| Breusch-Pagan | No heteroskedasticity | Reject H₀ → Use robust SE |
Variance Components Structure
Beyond the standard random effects model, the plm package provides additional methods for estimating variance components models and instrumental variable techniques for dealing with endogeneity in panel data.
Different estimators for the variance components structure exist in the literature, and plm allows users to specify them through the random.method argument.
"nerlove": Nerlove estimator (Nerlove 1971) (Note: Not available for two-way random effects).
Effects in Panel Models:
Individual effects (default).
Time effects (effect = "time").
Two-way effects (effect = "twoways").
amemiya <- plm(log(gsp) ~ log(pcap) + log(emp) + unemp, data = pdata,
               model = "random", random.method = "amemiya", effect = "twoways")
summary(amemiya)
#> Twoways effects Random Effect Model
#>    (Amemiya's transformation)
#>
#> Call:
#> plm(formula = log(gsp) ~ log(pcap) + log(emp) + unemp, data = pdata,
#>     effect = "twoways", model = "random", random.method = "amemiya")
#>
#> Balanced Panel: n = 48, T = 17, N = 816
#>
#> Effects:
#>                    var  std.dev share
#> idiosyncratic 0.001228 0.035039 0.028
#> individual    0.041201 0.202981 0.941
#> time          0.001359 0.036859 0.031
#> theta: 0.9582 (id) 0.8641 (time) 0.8622 (total)
#>
#> Residuals:
#>        Min.     1st Qu.      Median     3rd Qu.        Max.
#> -0.13796209 -0.01951506 -0.00053384  0.01807398  0.20452581
#>
#> Coefficients:
#>               Estimate Std. Error z-value  Pr(>|z|)
#> (Intercept)  3.9581876  0.1767036 22.4001 < 2.2e-16 ***
#> log(pcap)    0.0378443  0.0253963  1.4902  0.136184
#> log(emp)     0.8891887  0.0227677 39.0548 < 2.2e-16 ***
#> unemp       -0.0031568  0.0011240 -2.8086  0.004976 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Total Sum of Squares:    5.3265
#> Residual Sum of Squares: 0.98398
#> R-Squared:      0.81527
#> Adj. R-Squared: 0.81458
#> Chisq: 3583.53 on 3 DF, p-value: < 2.22e-16
The ercomp() function retrieves estimates of the variance components in a random effects model. Below, we extract the variance decomposition using Amemiya’s method:
ercomp(
  log(gsp) ~ log(pcap) + log(emp) + unemp,
  data = pdata,
  method = "amemiya",
  effect = "twoways"
)
#>                    var  std.dev share
#> idiosyncratic 0.001228 0.035039 0.028
#> individual    0.041201 0.202981 0.941
#> time          0.001359 0.036859 0.031
#> theta: 0.9582 (id) 0.8641 (time) 0.8622 (total)
This output includes:
Variance of the individual effect.
Variance of the time effect (if applicable).
Variance of the idiosyncratic error.
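The share column in this output is each variance component divided by the sum of all components. A minimal sketch of that arithmetic follows, assuming pdata is the Produc dataset from plm indexed by state and year (the construction of pdata is not shown in this section, so that setup is an assumption), and assuming the ercomp object exposes its components as sigma2 and theta:

```r
library(plm)

# Assumption: pdata is the Produc panel indexed by state and year.
data("Produc", package = "plm")
pdata <- pdata.frame(Produc, index = c("state", "year"))

# Two-way variance components using Amemiya's method
ec <- ercomp(
  log(gsp) ~ log(pcap) + log(emp) + unemp,
  data = pdata,
  method = "amemiya",
  effect = "twoways"
)

# Variance components (idiosyncratic, individual, time);
# unlist() guards against the components being stored as a list
sigma2 <- unlist(ec$sigma2)

# Share of each component in the total error variance
shares <- sigma2 / sum(sigma2)
round(shares, 3)

# Quasi-demeaning parameters used in the random effects transformation
ec$theta
```

A large individual share, as in the output above, indicates that most of the unexplained variation comes from persistent differences between units rather than from time shocks or idiosyncratic noise.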
Checking Panel Data Balance
Panel datasets may be balanced (each individual has observations for all time periods) or unbalanced (some individuals are missing observations). The punbalancedness() function measures the degree of balance in the data, with values closer to 1 indicating a balanced panel (Ahrens and Pincus 1981).
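As a rough illustration, the sketch below contrasts a balanced and an unbalanced panel. It assumes the Produc and EmplUK datasets shipped with plm; the object names bal and unbal are only illustrative.

```r
library(plm)

data("Produc", package = "plm")  # balanced: 48 states x 17 years
data("EmplUK", package = "plm")  # unbalanced: 140 firms, 7 to 9 years each

p_produc <- pdata.frame(Produc, index = c("state", "year"))
p_empl   <- pdata.frame(EmplUK, index = c("firm", "year"))

# Quick logical check for balance
is.pbalanced(p_produc)
is.pbalanced(p_empl)

# Unbalancedness measures (gamma and nu); both equal 1 for a balanced panel
bal   <- punbalancedness(p_produc)
unbal <- punbalancedness(p_empl)
bal
unbal
```

Values well below 1 suggest checking whether the missing observations are plausibly random before relying on estimators that assume balance.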
Instrumental variables (IV) are used to address endogeneity, which arises when regressors are correlated with the error term. plm provides various IV estimation methods through the inst.method argument.
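A minimal sketch of IV estimation in a random effects model follows. It assumes the Crime dataset shipped with plm and uses a deliberately reduced specification chosen for illustration, not a replication of any published model. Instruments are given after the | in the formula, and inst.method selects the estimator ("bvk" is the default; "baltagi", "am", and "bms" are the alternatives).

```r
library(plm)

data("Crime", package = "plm")  # county-level panel, indexed by county and year

# Illustrative specification (an assumption, not from the text):
# log(prbarr) and log(polpc) are treated as endogenous and instrumented
# by log(taxpc) and log(mix); the remaining regressors instrument themselves.
iv_formula <- log(crmrte) ~ log(prbarr) + log(polpc) + log(prbconv) + log(density) |
  . - log(prbarr) - log(polpc) + log(taxpc) + log(mix)

# Balestra and Varadharajan-Krishnakumar estimator (default)
iv_bvk <- plm(iv_formula, data = Crime, model = "random", inst.method = "bvk")

# Baltagi's error-components 2SLS estimator
iv_baltagi <- plm(iv_formula, data = Crime, model = "random", inst.method = "baltagi")

# Compare coefficient estimates across the two methods
cbind(bvk = coef(iv_bvk), baltagi = coef(iv_baltagi))
```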
Beyond standard fixed effects and random effects models, the plm package provides additional estimation techniques tailored for heterogeneous coefficients, dynamic panel models, and feasible generalized least squares (FGLS) methods.
Variable Coefficients Model (pvcm)
The variable coefficients model (VCM) allows coefficients to vary across cross-sectional units, accounting for unobserved heterogeneity more flexibly.
Two Estimation Approaches:
Fixed effects (within): Assumes coefficients are constant over time but vary across individuals.
Random effects (random): Assumes coefficients are drawn from a random distribution.
fixed_pvcm <- pvcm(
  log(gsp) ~ log(pcap) + log(emp) + unemp,
  data = pdata,
  model = "within"
)
random_pvcm <- pvcm(
  log(gsp) ~ log(pcap) + log(emp) + unemp,
  data = pdata,
  model = "random"
)
summary(fixed_pvcm)
#> Oneway (individual) effect No-pooling model
#>
#> Call:
#> pvcm(formula = log(gsp) ~ log(pcap) + log(emp) + unemp, data = pdata,
#>     model = "within")
#>
#> Balanced Panel: n = 48, T = 17, N = 816
#>
#> Residuals:
#>         Min.      1st Qu.       Median      3rd Qu.         Max.
#> -0.075247625 -0.013247956  0.000666934  0.013852996  0.118966807
#>
#> Coefficients:
#>   (Intercept)        log(pcap)           log(emp)          unemp
#>  Min.   :-3.8868   Min.   :-1.11962   Min.   :0.3790   Min.   :-1.597e-02
#>  1st Qu.: 0.9917   1st Qu.:-0.38475   1st Qu.:0.8197   1st Qu.:-5.319e-03
#>  Median : 2.9848   Median :-0.03147   Median :1.1506   Median : 5.335e-05
#>  Mean   : 2.8079   Mean   :-0.06028   Mean   :1.1656   Mean   : 9.024e-04
#>  3rd Qu.: 4.3553   3rd Qu.: 0.25573   3rd Qu.:1.3779   3rd Qu.: 8.374e-03
#>  Max.   :12.8800   Max.   : 1.16922   Max.   :2.4276   Max.   : 2.507e-02
#>
#> Total Sum of Squares: 15729
#> Residual Sum of Squares: 0.40484
#> Multiple R-Squared: 0.99997
summary(random_pvcm)
#> Oneway (individual) effect Random coefficients model
#>
#> Call:
#> pvcm(formula = log(gsp) ~ log(pcap) + log(emp) + unemp, data = pdata,
#>     model = "random")
#>
#> Balanced Panel: n = 48, T = 17, N = 816
#>
#> Residuals:
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
#> -0.23364 -0.03401  0.05558  0.09811  0.19349  1.14326
#>
#> Estimated mean of the coefficients:
#>                Estimate  Std. Error z-value  Pr(>|z|)
#> (Intercept)  2.79030044  0.53104167  5.2544 1.485e-07 ***
#> log(pcap)   -0.04195768  0.08621579 -0.4867    0.6265
#> log(emp)     1.14988911  0.07225221 15.9149 < 2.2e-16 ***
#> unemp        0.00031135  0.00163864  0.1900    0.8493
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Estimated variance of the coefficients:
#>             (Intercept) log(pcap)   log(emp)       unemp
#> (Intercept)  11.2648882 -1.335932  0.2035824  0.00827707
#> log(pcap)    -1.3359322  0.287021 -0.1872915 -0.00345298
#> log(emp)      0.2035824 -0.187291  0.2134845  0.00336374
#> unemp         0.0082771 -0.003453  0.0033637  0.00009425
#>
#> Total Sum of Squares: 15729
#> Residual Sum of Squares: 40.789
#> Multiple R-Squared: 0.99741
#> Chisq: 739.334 on 3 DF, p-value: < 2.22e-16
#> Test for parameter homogeneity: Chisq = 21768.8 on 188 DF, p-value: < 2.22e-16
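The "No-pooling" label on the within model means each cross-sectional unit gets its own regression. The sketch below checks this for one unit by comparing the first row of the estimated coefficients with a separate OLS fit. It assumes pdata is the Produc dataset indexed by state and year, and that the coefficients element of the fitted object is a data frame with one row per unit, ordered by the state index.

```r
library(plm)

# Assumption: pdata is the Produc panel indexed by state and year.
data("Produc", package = "plm")
pdata <- pdata.frame(Produc, index = c("state", "year"))

fixed_pvcm <- pvcm(
  log(gsp) ~ log(pcap) + log(emp) + unemp,
  data = pdata,
  model = "within"
)

# One row of coefficients per cross-sectional unit
unit_coefs <- as.data.frame(fixed_pvcm$coefficients)
dim(unit_coefs)

# Separate OLS for the first state in the index
first_state <- levels(Produc$state)[1]
ols_first <- lm(
  log(gsp) ~ log(pcap) + log(emp) + unemp,
  data = Produc[Produc$state == first_state, ]
)

# The two sets of estimates should agree up to numerical precision
pvcm_first <- unname(unlist(unit_coefs[1, ]))
ols_coefs  <- unname(coef(ols_first))
all.equal(pvcm_first, ols_coefs, tolerance = 1e-6)
```

The wide spread of the unit-level coefficients in the summary above, together with the rejected homogeneity test in the random model, suggests that pooling the slopes across states is a strong restriction for these data.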
Generalized Method of Moments Estimator (pgmm)
The Generalized Method of Moments estimator is commonly used for dynamic panel models, especially when:
There is concern over endogeneity in lagged dependent variables.
Instrumental variables are used for estimation.
library(plm)
# Estimates a dynamic labor demand function using one-step GMM,
# applying lagged variables as instruments
z2 <- pgmm(
  log(emp) ~ lag(log(emp), 1) + lag(log(wage), 0:1) + lag(log(capital), 0:1) |
    lag(log(emp), 2:99) + lag(log(wage), 2:99) + lag(log(capital), 2:99),
  data = EmplUK,
  effect = "twoways",
  model = "onestep",
  transformation = "ld"
)
summary(z2, robust = TRUE)
#> Twoways effects One-step model System GMM
#>
#> Call:
#> pgmm(formula = log(emp) ~ lag(log(emp), 1) + lag(log(wage), 0:1) +
#>     lag(log(capital), 0:1) | lag(log(emp), 2:99) + lag(log(wage),
#>     2:99) + lag(log(capital), 2:99), data = EmplUK, effect = "twoways",
#>     model = "onestep", transformation = "ld")
#>
#> Unbalanced Panel: n = 140, T = 7-9, N = 1031
#>
#> Number of Observations Used: 1642
#> Residuals:
#>       Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
#> -0.7530341 -0.0369030  0.0000000  0.0002882  0.0466069  0.6001503
#>
#> Coefficients:
#>                          Estimate Std. Error z-value  Pr(>|z|)
#> lag(log(emp), 1)         0.935605   0.026295 35.5810 < 2.2e-16 ***
#> lag(log(wage), 0:1)0    -0.630976   0.118054 -5.3448 9.050e-08 ***
#> lag(log(wage), 0:1)1     0.482620   0.136887  3.5257 0.0004224 ***
#> lag(log(capital), 0:1)0  0.483930   0.053867  8.9838 < 2.2e-16 ***
#> lag(log(capital), 0:1)1 -0.424393   0.058479 -7.2572 3.952e-13 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Sargan test: chisq(100) = 118.763 (p-value = 0.097096)
#> Autocorrelation test (1): normal = -4.808434 (p-value = 1.5212e-06)
#> Autocorrelation test (2): normal = -0.2800133 (p-value = 0.77947)
#> Wald test for coefficients: chisq(5) = 11174.82 (p-value = < 2.22e-16)
#> Wald test for time dummies: chisq(7) = 14.71138 (p-value = 0.039882)
Explanation of Arguments:
log(emp) ~ lag(log(emp), 1) + lag(log(wage), 0:1) + lag(log(capital), 0:1)
→ Specifies the dynamic model, where log(emp) depends on its first lag and contemporaneous plus lagged values of log(wage) and log(capital).
| lag(log(emp), 2:99) + lag(log(wage), 2:99) + lag(log(capital), 2:99)
→ Specifies the GMM instruments for the endogenous regressors: lags 2 and deeper of each variable (2:99 requests every available lag from the second onward).
effect = "twoways"
→ Includes both individual and time effects.
model = "onestep"
→ Uses one-step GMM (alternative: "twosteps" for efficiency gain).
transformation = "ld"
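→ Uses both levels and differences, i.e., the system GMM estimator (the default, "d", uses differences only and gives the difference GMM estimator).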
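For comparison, the sketch below fits a two-step difference GMM model, the alternative to the one-step system GMM fit above. The specification follows the Arellano-Bond style example in the plm documentation and is used here only for illustration; the object name ab is an assumption.

```r
library(plm)

data("EmplUK", package = "plm")

# Two-step difference GMM: transformation defaults to "d" (differences only)
ab <- pgmm(
  log(emp) ~ lag(log(emp), 1:2) + lag(log(wage), 0:1) + log(capital) +
    lag(log(output), 0:1) | lag(log(emp), 2:99),
  data = EmplUK,
  effect = "twoways",
  model = "twosteps"
)
summary(ab, robust = FALSE)

# Sargan test of overidentifying restrictions
sargan_res <- sargan(ab)
sargan_res

# Test for second-order serial correlation in the differenced residuals
ar2_res <- mtest(ab, order = 2)
ar2_res
```

When reading the diagnostics, a Sargan p-value that is not small and no evidence of second-order autocorrelation are generally taken as support for the instrument set, while first-order autocorrelation in differenced residuals is expected by construction.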
General Feasible Generalized Least Squares Models (pggls)
The FGLS estimator (pggls) is robust against:
Intragroup heteroskedasticity.
Serial correlation (within groups).
However, it assumes no cross-sectional correlation and is most suitable when N (the number of cross-sectional units) is much larger than T (the number of time periods), i.e., short, wide panels.
General FGLS Model (model = "pooling"):
zz <- pggls(
  log(emp) ~ log(wage) + log(capital),
  data = EmplUK,
  model = "pooling"
)
summary(zz)
#> Oneway (individual) effect General FGLS model
#>
#> Call:
#> pggls(formula = log(emp) ~ log(wage) + log(capital), data = EmplUK,
#>     model = "pooling")
#>
#> Unbalanced Panel: n = 140, T = 7-9, N = 1031
#>
#> Residuals:
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
#> -1.80696 -0.36552  0.06181  0.03230  0.44279  1.58719
#>
#> Coefficients:
#>               Estimate Std. Error z-value  Pr(>|z|)
#> (Intercept)   2.023480   0.158468 12.7690 < 2.2e-16 ***
#> log(wage)    -0.232329   0.048001 -4.8401 1.298e-06 ***
#> log(capital)  0.610484   0.017434 35.0174 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Total Sum of Squares: 1853.6
#> Residual Sum of Squares: 402.55
#> Multiple R-squared: 0.78283
Fixed Effects FGLS Model:
zz <- pggls(
  log(emp) ~ log(wage) + log(capital),
  data = EmplUK,
  model = "within"
)
summary(zz)
#> Oneway (individual) effect Within FGLS model
#>
#> Call:
#> pggls(formula = log(emp) ~ log(wage) + log(capital), data = EmplUK,
#>     model = "within")
#>
#> Unbalanced Panel: n = 140, T = 7-9, N = 1031
#>
#> Residuals:
#>         Min.      1st Qu.       Median      3rd Qu.         Max.
#> -0.508362414 -0.074254395 -0.002442181  0.076139063  0.601442300
#>
#> Coefficients:
#>               Estimate Std. Error z-value  Pr(>|z|)
#> log(wage)    -0.617617   0.030794 -20.056 < 2.2e-16 ***
#> log(capital)  0.561049   0.017185  32.648 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Total Sum of Squares: 1853.6
#> Residual Sum of Squares: 17.368
#> Multiple R-squared: 0.99063
Key Considerations:
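Efficient when the error covariance structure is the same across groups.
Inefficient if there is group-wise heteroskedasticity, because the estimator assumes one common error covariance structure for all groups.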
Ideal for large-N, small-T panels.
Summary of Alternative Panel Data Estimators
| Estimator | Method | Application |
|---|---|---|
| Variable Coefficients (pvcm) | Fixed (within), Random (random) | Allows coefficients to vary across individuals. |
| GMM (pgmm) | One-step, Two-step | Used in dynamic models with endogeneity. |
| Feasible GLS (pggls) | Fixed (within), General (pooling) | Handles heteroskedasticity and serial correlation but assumes no cross-sectional correlation. |
11.5.11.2 fixest Package
The fixest package provides efficient and flexible methods for estimating fixed effects and generalized linear models in panel data. It is optimized for handling large datasets with high-dimensional fixed effects and allows for multiple model estimation, robust standard errors, and split-sample estimation.
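For split-sample estimation, the split argument fits the model separately for each level of a variable: split = ~Month, for example, estimates one model for each Month in the dataset.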
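A minimal sketch of split-sample estimation by Month, assuming the built-in airquality dataset and an illustrative regression of Ozone on Temp and Wind (the dataset and specification are assumptions made for this example):

```r
library(fixest)

# airquality ships with base R; Month takes the values 5 through 9.
# Rows with missing Ozone are dropped by feols with a note.
split_models <- feols(
  Ozone ~ Temp + Wind,
  data = airquality,
  split = ~Month
)

# Side-by-side table with one column per month
etable(split_models)

# Convert to a plain list to work with the individual fits
model_list <- as.list(split_models)
length(model_list)
```

Splitting the sample lets every coefficient differ by month, which is more flexible than adding Month as a fixed effect, where only the intercept shifts.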
Robust Standard Errors in fixest
fixest supports a variety of robust standard error estimators, including:
iid: errors are homoskedastic and independent and identically distributed
hetero: errors are heteroskedastic using White correction
cluster: errors are correlated within the cluster groups
newey_west: (Newey and West 1986) use for time series or panel data. Errors are heteroskedastic and serially correlated.
vcov = newey_west ~ id + period where id is the subject id and period is time period of the panel.
to specify lag period to consider vcov = newey_west(2) ~ id + period where we’re considering 2 lag periods.
driscoll_kraay: (Driscoll and Kraay 1998) use for panel data. Errors are cross-sectionally and serially correlated.
vcov = driscoll_kraay ~ period
conley: (Conley 1999) for cross-section data. Errors are spatially correlated
vcov = conley ~ latitude + longitude
to specify the distance cutoff, vcov = vcov_conley(lat = "lat", lon = "long", cutoff = 100, distance = "spherical"), which will use the conley() helper function.
hc: from the sandwich package
vcov = function(x) sandwich::vcovHC(x, type = "HC1")
To let R know which SE estimation you want to use, insert vcov = vcov_type ~ variables
A small-sample correction can be requested through the ssc argument, which corrects for bias when working with small samples.
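The vcov options listed above can be compared on a single fitted model. The sketch below assumes the base_did example panel shipped with fixest (variables y, x1, id, period) and uses the short aliases NW and DK for the Newey-West and Driscoll-Kraay estimators; the object names are illustrative.

```r
library(fixest)

data(base_did, package = "fixest")

# Two-way fixed effects model on the example panel
est <- feols(y ~ x1 | id + period, data = base_did)

# Standard error of x1 under different variance estimators
se_table <- rbind(
  iid            = se(est, vcov = "iid"),
  hetero         = se(est, vcov = "hetero"),
  cluster_id     = se(est, vcov = ~id),
  newey_west     = se(est, vcov = NW(2) ~ id + period),  # 2 lags
  driscoll_kraay = se(est, vcov = DK ~ period)
)
se_table

# The same vcov argument also works in summary() and etable()
summary(est, vcov = "hetero")
```

Because the variance estimator is applied after estimation, the model is fit once and only the standard errors are recomputed, which makes this kind of comparison cheap even on large datasets.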
11.6 Choosing the Right Type of Data
Selecting the appropriate data type depends on:
Research Questions: Do you need to understand changes over time at the individual level (panel) or just a snapshot comparison at one point (cross-sectional)?
Resources: Longitudinal or panel studies can be resource-intensive.
Time Constraints: If you need fast results, cross-sectional or repeated cross-sectional might be more practical.
Analytical Goals: Time-series forecasting, causal inference, or descriptive comparison each has different data requirements.
Availability: Sometimes only secondary or repeated cross-sectional data is available, which constrains the design.
11.7 Data Quality and Ethical Considerations
Regardless of data type, data quality is crucial. Poor data—be it incomplete, biased, or improperly measured—can lead to incorrect conclusions. Researchers should:
Ensure Validity and Reliability: Use well-designed instruments and consistent measurement techniques.
Address Missing Data: Apply appropriate imputation methods if feasible.
Manage Attrition (in Panel Data): Consider weighting or sensitivity analyses to deal with dropouts.
Check Representativeness: Especially in cross-sectional and repeated cross-sectional surveys, ensure sampling frames match the target population.
Protect Confidentiality and Privacy: Particularly in panel studies with repeated contact, store data securely and follow ethical guidelines.
Obtain Proper Consent: Inform participants about study details, usage of data, and rights to withdraw.