A Guide on Data Analysis
Preface
How to cite this book
More books
1 Introduction
1.1 General Recommendations
2 Prerequisites
2.1 Matrix Theory
2.1.1 Rank of a Matrix
2.1.2 Inverse of a Matrix
2.1.3 Definiteness of a Matrix
2.1.4 Matrix Calculus
2.1.5 Optimization in Scalar and Vector Spaces
2.1.6 Cholesky Decomposition
2.2 Probability Theory
2.2.1 Axioms and Theorems of Probability
2.2.2 Central Limit Theorem
2.2.3 Random Variable
2.2.4 Moment Generating Function
2.2.5 Moments
2.2.6 Skewness
2.2.7 Kurtosis
2.2.8 Distributions
2.3 General Math
2.3.1 Number Sets
2.3.2 Summation Notation and Series
2.3.3 Taylor Expansion
2.3.4 Law of Large Numbers
2.3.5 Convergence
2.3.6 Sufficient Statistics and Likelihood
2.3.7 Parameter Transformations
2.4 Data Import/Export
2.4.1 Medium size
2.4.2 Large size
2.5 Data Manipulation
I. BASIC
3 Descriptive Statistics
3.1 Numerical Measures
3.2 Graphical Measures
3.2.1 Shape
3.2.2 Scatterplot
3.3 Normality Assessment
3.3.1 Graphical Assessment
3.3.2 Summary Statistics
3.4 Bivariate Statistics
3.4.1 Two Continuous
3.4.2 Categorical and Continuous
3.4.3 Two Discrete
3.4.4 General Approach to Bivariate Statistics
4 Basic Statistical Inference
4.1 Hypothesis Testing Framework
4.1.1 Null and Alternative Hypotheses
4.1.2 Errors in Hypothesis Testing
4.1.3 The Role of Distributions in Hypothesis Testing
4.1.4 The Test Statistic
4.1.5 Critical Values and Rejection Regions
4.1.6 Visualizing Hypothesis Testing
4.2 Key Concepts and Definitions
4.2.1 Random Sample
4.2.2 Sample Statistics
4.2.3 Distribution of the Sample Mean
4.3 One-Sample Inference
4.3.1 For Single Mean
4.3.2 For Difference of Means, Independent Samples
4.3.3 For Difference of Means, Paired Samples
4.3.4 For Difference of Two Proportions
4.3.5 For Single Proportion
4.3.6 For Single Variance
4.3.7 Non-parametric Tests
4.4 Two-Sample Inference
4.4.1 For Means
4.4.2 For Variances
4.4.3 Power
4.4.4 Matched Pair Designs
4.4.5 Nonparametric Tests for Two Samples
4.5 Categorical Data Analysis
4.5.1 Association Tests
4.5.2 Ordinal Association
4.5.3 Ordinal Trend
4.6 Divergence Metrics and Tests for Comparing Distributions
4.6.1 Kolmogorov-Smirnov Test
4.6.2 Anderson-Darling Test
4.6.3 Chi-Square Goodness-of-Fit Test
4.6.4 Cramér-von Mises Test
4.6.5 Kullback-Leibler Divergence
4.6.6 Jensen-Shannon Divergence
4.6.7 Hellinger Distance
4.6.8 Bhattacharyya Distance
4.6.9 Wasserstein Distance
4.6.10 Energy Distance
4.6.11 Total Variation Distance
4.6.12 Summary
II. REGRESSION
5 Linear Regression
5.1 Ordinary Least Squares
5.1.1 Simple Regression (Basic) Model
5.1.2 Multiple Linear Regression
5.2 Generalized Least Squares
5.2.1 Infeasible Generalized Least Squares
5.2.2 Feasible Generalized Least Squares
5.2.3 Weighted Least Squares
5.3 Maximum Likelihood
5.3.1 Motivation for MLE
5.3.2 Key Quantities for Inference
5.3.3 Assumptions of MLE
5.3.4 Properties of MLE
5.3.5 Practical Considerations
5.3.6 Comparison of MLE and OLS
5.3.7 Applications of MLE
5.4 Penalized (Regularized) Estimators
5.4.1 Motivation for Penalized Estimators
5.4.2 Ridge Regression
5.4.3 Lasso Regression
5.4.4 Elastic Net
5.4.5 Tuning Parameter Selection
5.4.6 Properties of Penalized Estimators
5.5 Robust Estimators
5.5.1 Motivation for Robust Estimation
5.5.2 M-Estimators
5.5.3 R-Estimators
5.5.4 L-Estimators
5.5.5 Least Trimmed Squares (LTS)
5.5.6 S-Estimators
5.5.7 MM-Estimators
5.5.8 Practical Considerations
5.6 Partial Least Squares
5.6.1 Motivation for PLS
5.6.2 Steps to Construct PLS Components
5.6.3 Properties of PLS
5.6.4 Comparison with Related Methods
6 Non-Linear Regression
6.1 Inference
6.1.1 Linear Functions of the Parameters
6.1.2 Nonlinear Functions of Parameters
6.2 Non-linear Least Squares Estimation
6.2.1 Iterative Optimization
6.2.2 Derivative-Free
6.2.3 Stochastic Heuristic
6.2.4 Linearization
6.2.5 Hybrid
6.2.6 Comparison of Nonlinear Optimizers
6.3 Practical Considerations
6.3.1 Selecting Starting Values
6.3.2 Handling Constrained Parameters
6.3.3 Failure to Converge
6.3.4 Convergence to a Local Minimum
6.3.5 Model Adequacy and Estimation Considerations
6.4 Application
6.4.1 Nonlinear Estimation Using Gauss-Newton Algorithm
6.4.2 Logistic Growth Model
6.4.3 Nonlinear Plateau Model
7 Generalized Linear Models
7.1 Logistic Regression
7.1.1 Logistic Model
7.1.2 Likelihood Function
7.1.3 Fisher Information Matrix
7.1.4 Inference in Logistic Regression
7.1.5 Application: Logistic Regression
7.2 Probit Regression
7.2.1 Probit Model
7.2.2 Application: Probit Regression
7.3 Binomial Regression
7.3.1 Dataset Overview
7.3.2 Apply Logistic Model
7.3.3 Apply Probit Model
7.4 Poisson Regression
7.4.1 The Poisson Distribution
7.4.2 Poisson Model
7.4.3 Link Function Choices
7.4.4 Application: Poisson Regression
7.5 Negative Binomial Regression
7.5.1 Negative Binomial Distribution
7.5.2 Application: Negative Binomial Regression
7.5.3 Fitting a Zero-Inflated Negative Binomial Model
7.6 Quasi-Poisson Regression
7.6.1 Is Quasi-Poisson Regression a Generalized Linear Model?
7.6.2 Application: Quasi-Poisson Regression
7.7 Multinomial Logistic Regression
7.7.1 The Multinomial Distribution
7.7.2 Modeling Probabilities Using Log-Odds
7.7.3 Softmax Representation
7.7.4 Log-Odds Ratio Between Two Categories
7.7.5 Estimation
7.7.6 Interpretation of Coefficients
7.7.7 Application: Multinomial Logistic Regression
7.7.8 Application: Gamma Regression
7.8 Generalization of Generalized Linear Models
7.8.1 Exponential Family
7.8.2 Properties of GLM Exponential Families
7.8.3 Structure of a Generalized Linear Model
7.8.4 Components of a GLM
7.8.5 Canonical Link
7.8.6 Inverse Link Functions
7.8.7 Estimation of Parameters in GLMs
7.8.8 Inference
7.8.9 Deviance
7.8.10 Diagnostic Plots
7.8.11 Goodness of Fit
7.8.12 Over-Dispersion
8 Linear Mixed Models
8.1 Dependent Data
8.1.1 Motivation: A Repeated Measurements Example
8.1.2 Example: Linear Mixed Model for Repeated Measurements
8.1.3 Random-Intercepts Model
8.1.4 Covariance Models in Linear Mixed Models
8.1.5 Covariance Structures in Mixed Models
8.2 Estimation in Linear Mixed Models
8.2.1 Interpretation of the Mixed Model Equations
8.2.2 Derivation of the Mixed Model Equations
8.2.3 Bayesian Interpretation of Linear Mixed Models
8.2.4 Estimating the Variance-Covariance Matrix
8.3 Inference in Linear Mixed Models
8.3.1 Inference for Fixed Effects (β)
8.3.2 Inference for Variance Components (θ)
8.4 Information Criteria for Model Selection
8.4.1 Akaike Information Criterion
8.4.2 Corrected AIC
8.4.3 Bayesian Information Criterion
8.4.4 Practical Example with Linear Mixed Models
8.5 Split-Plot Designs
8.5.1 Example Setup
8.5.2 Statistical Model for Split-Plot Designs
8.5.3 Approaches to Analyzing Split-Plot Designs
8.5.4 Application: Split-Plot Design
8.6 Repeated Measures in Mixed Models
8.7 Unbalanced or Unequally Spaced Data
8.7.1 Variance-Covariance Structure: Power Model
8.8 Application: Mixed Models in Practice
8.8.1 Example 1: Pulp Brightness Analysis
8.8.2 Example 2: Penicillin Yield (GLMM with Blocking)
8.8.3 Example 3: Growth in Rats Over Time
8.8.4 Example 4: Tree Water Use (Agridat)
9 Nonlinear and Generalized Linear Mixed Models
9.1 Nonlinear Mixed Models
9.2 Generalized Linear Mixed Models
9.3 Relationship Between NLMMs and GLMMs
9.4 Marginal Properties of GLMMs
9.4.1 Marginal Mean of y_i
9.4.2 Marginal Variance of y_i
9.4.3 Marginal Covariance of y
9.5 Estimation in Nonlinear and Generalized Linear Mixed Models
9.5.1 Estimation by Numerical Integration
9.5.2 Estimation by Linearization
9.5.3 Estimation by Bayesian Hierarchical Models
9.5.4 Practical Implementation in R
9.6 Application: Nonlinear and Generalized Linear Mixed Models
9.6.1 Binomial Data: CBPP Dataset
9.6.2 Count Data: Owl Dataset
9.6.3 Binomial Example: Gotway Hessian Fly Data
9.6.4 Nonlinear Mixed Model: Yellow Poplar Data
9.7 Summary
10 Nonparametric Regression
10.1 Why Nonparametric?
10.1.1 Flexibility
10.1.2 Fewer Assumptions
10.1.3 Interpretability
10.1.4 Practical Considerations
10.1.5 Balancing Parametric and Nonparametric Approaches
10.2 Basic Concepts in Nonparametric Estimation
10.2.1 Bias-Variance Trade-Off
10.2.2 Kernel Smoothing and Local Averages
10.3 Kernel Regression
10.3.1 Basic Setup
10.3.2 Nadaraya-Watson Kernel Estimator
10.3.3 Priestley–Chao Kernel Estimator
10.3.4 Gasser–Müller Kernel Estimator
10.3.5 Comparison of Kernel-Based Estimators
10.3.6 Bandwidth Selection
10.3.7 Asymptotic Properties
10.3.8 Derivation of the Nadaraya-Watson Estimator
10.4 Local Polynomial Regression
10.4.1 Local Polynomial Fitting
10.4.2 Mathematical Form of the Solution
10.4.3 Bias, Variance, and Asymptotics
10.4.4 Special Case: Local Linear Regression
10.4.5 Bandwidth Selection
10.4.6 Asymptotic Properties Summary
10.5 Smoothing Splines
10.5.1 Properties and Form of the Smoothing Spline
10.5.2 Choice of λ
10.5.3 Connection to Reproducing Kernel Hilbert Spaces
10.6 Confidence Intervals in Nonparametric Regression
10.6.1 Asymptotic Normality
10.6.2 Bootstrap Methods
10.6.3 Practical Considerations
10.7 Generalized Additive Models
10.7.1 Estimation via Penalized Likelihood
10.7.2 Interpretation of GAMs
10.7.3 Model Selection and Smoothing Parameter Estimation
10.7.4 Extensions of GAMs
10.8 Regression Trees and Random Forests
10.8.1 Regression Trees
10.8.2 Random Forests
10.8.3 Theoretical Insights
10.8.4 Feature Importance in Random Forests
10.8.5 Advantages and Limitations of Tree-Based Methods
10.9 Wavelet Regression
10.9.1 Wavelet Series Expansion
10.9.2 Wavelet Regression Model
10.9.3 Wavelet Shrinkage and Thresholding
10.10 Multivariate Nonparametric Regression
10.10.1 The Curse of Dimensionality
10.10.2 Multivariate Kernel Regression
10.10.3 Multivariate Splines
10.10.4 Additive Models (GAMs)
10.10.5 Radial Basis Functions
10.11 Conclusion: The Evolving Landscape of Regression Analysis
10.11.1 Key Takeaways
10.11.2 The Art and Science of Regression
10.11.3 Looking Forward
10.11.4 Final Thoughts
III. RAMIFICATIONS
11 Data
11.1 Data Types
11.1.1 Qualitative vs. Quantitative Data
11.1.2 Other Ways to Classify Data
11.1.3 Data by Observational Structure Over Time
11.2 Cross-Sectional Data
11.3 Time Series Data
11.3.1 Statistical Properties of Time Series Models
11.3.2 Common Time Series Processes
11.3.3 Deterministic Time Trends
11.3.4 Violations of Exogeneity in Time Series Models
11.3.5 Consequences of Exogeneity Violations
11.3.6 Highly Persistent Data
11.3.7 Unit Root Testing
11.3.8 Newey-West Standard Errors
11.4 Repeated Cross-Sectional Data
11.4.1 Key Characteristics
11.4.2 Statistical Modeling for Repeated Cross-Sections
11.4.3 Advantages of Repeated Cross-Sectional Data
11.4.4 Disadvantages of Repeated Cross-Sectional Data
11.5 Panel Data
11.5.1 Advantages of Panel Data
11.5.2 Disadvantages of Panel Data
11.5.3 Sources of Variation in Panel Data
11.5.4 Pooled OLS Estimator
11.5.5 Individual-Specific Effects Model
11.5.6 Random Effects Estimator
11.5.7 Fixed Effects Estimator
11.5.8 Tests for Assumptions in Panel Data Analysis
11.5.9 Model Selection in Panel Data
11.5.10 Alternative Estimators
11.5.11 Application
11.6 Choosing the Right Type of Data
11.7 Data Quality and Ethical Considerations
12 Variable Transformation
12.1 Continuous Variables
12.1.1 Standardization (Z-score Normalization)
12.1.2 Min-Max Scaling (Normalization)
12.1.3 Square Root and Cube Root Transformations
12.1.4 Logarithmic Transformation
12.1.5 Exponential Transformation
12.1.6 Power Transformation
12.1.7 Inverse (Reciprocal) Transformation
12.1.8 Hyperbolic Arcsine Transformation
12.1.9 Ordered Quantile Normalization (Rank-Based Transformation)
12.1.10 Lambert W x F Transformation
12.1.11 Inverse Hyperbolic Sine Transformation
12.1.12 Box-Cox Transformation
12.1.13 Yeo-Johnson Transformation
12.1.14 RankGauss Transformation
12.1.15 Automatically Choosing the Best Transformation
12.2 Categorical Variables
12.2.1 One-Hot Encoding (Dummy Variables)
12.2.2 Label Encoding
12.2.3 Feature Hashing (Hash Encoding)
12.2.4 Binary Encoding
12.2.5 Base-N Encoding (Generalized Binary Encoding)
12.2.6 Frequency Encoding
12.2.7 Target Encoding (Mean Encoding)
12.2.8 Ordinal Encoding
12.2.9 Weight of Evidence Encoding
12.2.10 Helmert Encoding
12.2.11 Probability Ratio Encoding
12.2.12 Backward Difference Encoding
12.2.13 Leave-One-Out Encoding
12.2.14 James-Stein Encoding
12.2.15 M-Estimator Encoding
12.2.16 Thermometer Encoding
12.2.17 Choosing the Right Encoding Method
13 Imputation (Missing Data)
13.1 Introduction to Missing Data
13.1.1 Types of Imputation
13.1.2 When and Why to Use Imputation
13.1.3 Importance of Missing Data Treatment in Statistical Modeling
13.1.4 Prevalence of Missing Data Across Domains
13.1.5 Practical Considerations for Imputation
13.2 Theoretical Foundations of Missing Data
13.2.1 Definition and Classification of Missing Data
13.2.2 Missing Data Mechanisms
13.2.3 Relationship Between Mechanisms and Ignorability
13.3 Diagnosing the Missing Data Mechanism
13.3.1 Descriptive Methods
13.3.2 Statistical Tests for Missing Data Mechanisms
13.3.3 Assessing MAR and MNAR
13.4 Methods for Handling Missing Data
13.4.1 Basic Methods
13.4.2 Single Imputation Techniques
13.4.3 Machine Learning and Modern Approaches
13.4.4 Multiple Imputation
13.5 Evaluation of Imputation Methods
13.5.1 Statistical Metrics for Assessing Imputation Quality
13.5.2 Bias-Variance Tradeoff in Imputation
13.5.3 Sensitivity Analysis
13.5.4 Validation Using Simulated Data and Real-World Case Studies
13.6 Criteria for Choosing an Effective Approach
13.7 Challenges and Ethical Considerations
13.7.1 Challenges in High-Dimensional Data
13.7.2 Missing Data in Big Data Contexts
13.7.3 Ethical Concerns
13.8 Emerging Trends in Missing Data Handling
13.8.1 Advances in Neural Network Approaches
13.8.2 Integration with Reinforcement Learning
13.8.3 Synthetic Data Generation for Missing Data
13.8.4 Federated Learning and Privacy-Preserving Imputation
13.8.5 Imputation in Streaming and Online Data Environments
13.9 Application of Imputation
13.9.1 Visualizing Missing Data
13.9.2 How Many Imputations?
13.9.3 Generating Missing Data for Demonstration
13.9.4 Imputation with Mean, Median, and Mode
13.9.5 K-Nearest Neighbors (KNN) Imputation
13.9.6 Imputation with Decision Trees (rpart)
13.9.7 MICE (Multivariate Imputation via Chained Equations)
13.9.8 Amelia
13.9.9 missForest
13.9.10 Hmisc
13.9.11 mi
14 Model Specification Tests
14.1 Nested Model Tests
14.1.1 Wald Test
14.1.2 Likelihood Ratio Test
14.1.3 F-Test (for Linear Regression)
14.1.4 Chow Test
14.2 Non-Nested Model Tests
14.2.1 Vuong Test
14.2.2 Davidson–MacKinnon J-Test
14.2.3 Adjusted R²
14.2.4 Comparing Models with Transformed Dependent Variables
14.3 Heteroskedasticity Tests
14.3.1 Breusch–Pagan Test
14.3.2 White Test
14.3.3 Goldfeld–Quandt Test
14.3.4 Park Test
14.3.5 Glejser Test
14.3.6 Summary of Heteroskedasticity Tests
14.4 Functional Form Tests
14.4.1 Ramsey RESET Test (Regression Equation Specification Error Test)
14.4.2 Harvey–Collier Test
14.4.3 Rainbow Test
14.4.4 Summary of Functional Form Tests
14.5 Autocorrelation Tests
14.5.1 Durbin–Watson Test
14.5.2 Breusch–Godfrey Test
14.5.3 Ljung–Box Test (or Box–Pierce Test)
14.5.4 Runs Test
14.5.5 Summary of Autocorrelation Tests
14.6 Multicollinearity Diagnostics
14.6.1 Variance Inflation Factor
14.6.2 Tolerance Statistic
14.6.3 Condition Index and Eigenvalue Decomposition
14.6.4 Pairwise Correlation Matrix
14.6.5 Determinant of the Correlation Matrix
14.6.6 Summary of Multicollinearity Diagnostics
14.6.7 Addressing Multicollinearity
15 Variable Selection
15.1 Filter Methods (Statistical Criteria, Model-Agnostic)
15.1.1 Information Criteria-Based Selection
15.1.2 Univariate Selection Methods
15.1.3 Correlation-Based Feature Selection
15.1.4 Variance Thresholding
15.2 Wrapper Methods (Model-Based Subset Evaluation)
15.2.1 Best Subsets Algorithm
15.2.2 Stepwise Selection Methods
15.2.3 Branch-and-Bound Algorithm
15.2.4 Recursive Feature Elimination
15.3 Embedded Methods (Integrated into Model Training)
15.3.1 Regularization-Based Selection
15.3.2 Tree-Based Feature Importance
15.3.3 Genetic Algorithms
15.4 Summary Table
16 Hypothesis Testing
16.1 Null Hypothesis Significance Testing
16.1.1 Error Types in Hypothesis Testing
16.1.2 Hypothesis Testing Framework
16.1.3 Interpreting Hypothesis Testing Results
16.1.4 Understanding p-Values
16.1.5 The Role of Sample Size
16.1.6 p-Value Hacking
16.1.7 Practical vs. Statistical Significance
16.1.8 Mitigating the Misuse of p-Values
16.1.9 Wald Test
16.1.10 Likelihood Ratio Test
16.1.11 Lagrange Multiplier (Score) Test
16.1.12 Comparing Hypothesis Tests
16.2 Two One-Sided Tests Equivalence Testing
16.2.1 When to Use TOST?
16.2.2 Interpretation of the TOST Procedure
16.2.3 Relationship to Confidence Intervals
16.2.4 Example 1: Testing the Equivalence of Two Means
16.2.5 Advantages of TOST Equivalence Testing
16.2.6 When Not to Use TOST
16.3 False Discovery Rate
16.3.1 Benjamini-Hochberg Procedure
16.3.2 Benjamini-Yekutieli Procedure
16.3.3 Storey’s q-value Approach
16.3.4 Summary: False Discovery Rate Methods
16.4 Comparison of Testing Frameworks
17 Marginal Effects
17.1 Definition of Marginal Effects
17.1.1 Analytical Derivation of Marginal Effects
17.1.2 Numerical Approximation of Marginal Effects
17.2 Marginal Effects in Different Contexts
17.3 Marginal Effects Interpretation
17.4 Delta Method
17.5 Comparison: Delta Method vs. Alternative Approaches
17.5.1 Example: Applying the Delta Method in a logistic regression
17.6 Types of Marginal Effect
17.6.1 Average Marginal Effect
17.6.2 Marginal Effects at the Mean
17.6.3 Marginal Effects at the Average
17.7 Packages for Marginal Effects
17.7.1 marginaleffects Package (Recommended)
17.7.2 margins Package
17.7.3 mfx Package
17.7.4 Comparison of Packages
18 Moderation
18.1 Types of Moderation Analyses
18.2 Key Terminology
18.3 Moderation Model
18.4 Types of Interactions
18.5 Three-Way Interactions
18.6 Additional Resources
18.7 Application
18.7.1 emmeans Package
18.7.2 probemod Package
18.7.3 interactions Package
18.7.4 interactionR Package
18.7.5 sjPlot Package
18.7.6 Summary of Moderation Analysis Packages
19 Mediation
19.1 Traditional Approach
19.1.1 Steps in the Traditional Mediation Model
19.1.2 Graphical Representation of Mediation
19.1.3 Measuring Mediation
19.1.4 Assumptions in Linear Mediation Models
19.1.5 Testing for Mediation
19.1.6 Additional Considerations
19.1.7 Assumptions in Mediation Analysis
19.1.8 Indirect Effect Tests
19.1.9 Power Analysis for Mediation
19.1.10 Multiple Mediation Analysis
19.1.11 Multiple Treatments in Mediation
19.2 Causal Inference Approach to Mediation
19.2.1 Example: Traditional Mediation Analysis
19.2.2 Two Approaches in Causal Mediation Analysis
20 Prediction and Estimation
20.1 Conceptual Framing
20.1.1 Predictive Modeling
20.1.2 Estimation or Causal Inference
20.2 Mathematical Setup
20.2.1 Probability Space and Data
20.2.2 Loss Functions and Risk
20.3 Prediction in Detail
20.3.1 Empirical Risk Minimization and Generalization
20.3.2 Bias-Variance Decomposition
20.3.3 Example: Linear Regression for Prediction
20.3.4 Applications in Economics
20.4 Parameter Estimation and Causal Inference
20.4.1 Estimation in Parametric Models
20.4.2 Causal Inference Fundamentals
20.4.3 Role of Identification
20.4.4 Challenges
20.5 Causation versus Prediction
20.6 Illustrative Equations and Mathematical Contrasts
20.6.1 Risk Minimization vs. Consistency
20.6.2 Partial Derivatives vs. Predictions
20.6.3 Example: High-Dimensional Regularization
20.6.4 Potential Outcomes Notation
20.7 Extended Mathematical Points
20.7.1 M-Estimation and Asymptotic Theory
20.7.2 The Danger of Omitted Variables
20.7.3 Cross-Validation vs. Statistical Testing
20.8 Putting It All Together: Comparing Objectives
20.9 Conclusion
IV. CAUSAL INFERENCE
21 Causal Inference
21.1 The Ladder of Causation
21.2 The Formal Notation of Causality
21.3 The 7 Tools of Structural Causal Models
21.4 Simpson’s Paradox
21.4.1 What is Simpson’s Paradox?
21.4.2 Why is this Important?
21.4.3 Comparison between Simpson’s Paradox and Omitted Variable Bias
21.4.4 Illustrating Simpson’s Paradox: Marketing Campaign Success Rates
21.4.5 Why Does This Happen?
21.4.6 How Does Causal Inference Solve This?
21.4.7 Correcting Simpson’s Paradox with Regression Adjustment
21.4.8 Key Takeaways
21.5 Additional Resources
21.6 Experimental vs. Quasi-Experimental Designs
21.6.1 Criticisms of Quasi-Experimental Designs
21.7 Hierarchical Ordering of Causal Tools
21.8 Types of Validity in Research
21.8.1 Measurement Validity
21.8.2 Construct Validity
21.8.3 Criterion Validity
21.8.4 Internal Validity
21.8.5 External Validity
21.8.6 Ecological Validity
21.8.7 Statistical Conclusion Validity
21.8.8 Putting It All Together
21.9 Types of Subjects in a Treatment Setting
21.9.1 Non-Switchers
21.9.2 Switchers
21.9.3 Classification of Individuals Based on Treatment Assignment
21.10 Types of Treatment Effects
21.10.1 Average Treatment Effect
21.10.2 Conditional Average Treatment Effect
21.10.3 Intention-to-Treat Effect
21.10.4 Local Average Treatment Effects
21.10.5 Population vs. Sample Average Treatment Effects
21.10.6 Average Treatment Effects on the Treated and Control
21.10.7 Quantile Average Treatment Effects
21.10.8 Log-Odds Treatment Effects for Binary Outcomes
21.10.9 Summary Table: Treatment Effect Estimands
A. EXPERIMENTAL DESIGN
22 Experimental Design
22.1 Principles of Experimental Design
22.2 The Gold Standard: Randomized Controlled Trials
22.3 Selection Problem
22.3.1 The Observed Difference in Outcomes
22.3.2 Eliminating Selection Bias with Random Assignment
22.3.3 Another Representation Under Regression
22.4 Classical Experimental Designs
22.4.1 Completely Randomized Design
22.4.2 Randomized Block Design
22.4.3 Factorial Design
22.4.4 Crossover Design
22.4.5 Split-Plot Design
22.4.6 Latin Square Design
22.5 Advanced Experimental Designs
22.5.1 Semi-Random Experiments
22.5.2 Re-Randomization
22.5.3 Two-Stage Randomized Experiments
22.5.4 Two-Stage Randomized Experiments with Interference and Noncompliance
22.6 Emerging Research
22.6.1 Handling Zero-Valued Outcomes
23 Sampling
23.1 Population and Sample
23.2 Probability Sampling
23.2.1 Simple Random Sampling
23.2.2 Stratified Sampling
23.2.3 Systematic Sampling
23.2.4 Cluster Sampling
23.3 Non-Probability Sampling
23.3.1 Convenience Sampling
23.3.2 Quota Sampling
23.3.3 Snowball Sampling
23.4 Unequal Probability Sampling
23.5 Balanced Sampling
23.5.1 Cube Method for Balanced Sampling
23.5.2 Balanced Sampling with Stratification
23.5.3 Balanced Sampling in Cluster Sampling
23.5.4 Balanced Sampling in Two-Stage Sampling
23.6 Sample Size Determination
24 Analysis of Variance
24.1 Completely Randomized Design
24.1.1 Single-Factor Fixed Effects ANOVA
24.1.2 Single Factor Random Effects ANOVA
24.1.3 Two-Factor Fixed Effects ANOVA
24.1.4 Two-Way Random Effects ANOVA
24.1.5 Two-Way Mixed Effects ANOVA
24.2 Nonparametric ANOVA
24.2.1 Kruskal-Wallis Test (One-Way Nonparametric ANOVA)
24.2.2 Friedman Test (Nonparametric Two-Way ANOVA)
24.3 Randomized Block Designs
24.4 Nested Designs
24.4.1 Two-Factor Nested Design
24.4.2 Unbalanced Nested Two-Factor Designs
24.4.3 Random Factor Effects
24.5 Sample Size Planning for ANOVA
24.5.1 Balanced Designs
24.5.2 Single Factor Studies
24.5.3 Multi-Factor Studies
24.5.4 Procedure for Sample Size Selection
24.5.5 Randomized Block Experiments
24.6 Single Factor Covariance Model
24.6.1 Statistical Inference for Treatment Effects
24.6.2 Testing for Parallel Slopes
24.6.3 Adjusted Means
25 Multivariate Methods
25.1 Basic Understanding
25.1.1 Multivariate Random Vectors
25.1.2 Covariance Matrix
25.1.3 Equalities in Expectation and Variance
25.1.4 Multivariate Normal Distribution
25.1.5 Test of Multivariate Normality
25.1.6 Mean Vector Inference
25.1.7 General Hypothesis Testing
25.2 Multivariate Analysis of Variance
25.2.1 One-Way MANOVA
25.2.2 Profile Analysis
25.3 Statistical Test Selection for Comparing Means
B. QUASI-EXPERIMENTAL DESIGN
26 Quasi-Experimental Methods
26.1 Identification Strategy in Quasi-Experiments
26.2 Robustness Checks
26.3 Establishing Mechanisms
26.4 Limitations of Quasi-Experiments
26.5 Assumptions for Identifying Treatment Effects
26.5.1 Stable Unit Treatment Value Assumption
26.5.2 Conditional Ignorability Assumption
26.5.3 Overlap (Positivity) Assumption
26.6 Natural Experiments
26.6.1 The Problem of Reusing Natural Experiments
26.6.2 Statistical Challenges in Reusing Natural Experiments
26.6.3 Solutions: Multiple Testing Corrections
26.7 Design vs. Model-Based Approaches
26.7.1 Design-Based Perspective
26.7.2 Model-Based Perspective
26.7.3 Placing Methods Along a Spectrum
27 Regression Discontinuity
27.1 Conceptual Framework
27.2 Types of Regression Discontinuity Designs
27.3 Assumptions for RD Validity
27.4 Threats to RD Validity
27.4.1 Violation of Continuity in Covariates
27.4.2 Multiple Discontinuities
27.4.3 Manipulation of the Running Variable
27.5 Model Estimation Strategies
27.5.1 Parametric Models: Polynomial Regression
27.5.2 Nonparametric Models: Local Regression
27.6 Formal Definition
27.6.1 Identification Assumptions
27.7 Estimation and Inference
27.7.1 Local Randomization-Based Approach
27.7.2 Continuity-Based Approach
27.8 Specification Checks
27.8.1 Balance Checks
27.8.2 Sorting, Bunching, and Manipulation
27.8.3 Placebo Tests
27.8.4 Sensitivity to Bandwidth Choice
27.8.5 Assessing Sensitivity
27.8.6 Manipulation Robust Regression Discontinuity Bounds
27.9 Fuzzy RD Design
27.10 Regression Kink Design
27.11 Multi-cutoff
27.12 Multi-score
27.13 Steps for Sharp RD
27.14 Steps for Fuzzy RD
27.15 Steps for RDiT (Regression Discontinuity in Time)
27.16 Evaluation of an RD
27.17 Applications
27.17.1 Example 1
27.17.2 Example 2
27.17.3 Example 3
27.17.4 Example 4
28 Synthetic Difference-in-Differences
28.1 Understanding
28.2 Application
28.2.1 Block Treatment
28.2.2 Staggered Adoption
29 Difference-in-differences
29.1 Visualization
29.2 Simple Dif-n-dif
29.3 Notes
29.4 Standard Errors
29.5 Examples
29.5.1 Example by Doleac and Hansen (2020)
29.5.2 Example from Princeton
29.5.3 Example by Card and Krueger (1993)
29.5.4 Example by Butcher, McEwan, and Weerapana (2014)
29.6 One Difference
29.7 Two-way Fixed-effects
29.8 Multiple periods and variation in treatment timing
29.9 Staggered Dif-n-dif
29.9.1 Stacked DID
29.9.2 Goodman-Bacon Decomposition
29.9.3 DID with in and out treatment condition
29.9.4 Gardner (2022) and Borusyak, Jaravel, and Spiess (2021)
29.9.5 Clément De Chaisemartin and d’Haultfoeuille (2020)
29.9.6 Callaway and Sant’Anna (2021)
29.9.7 L. Sun and Abraham (2021)
29.9.8 Wooldridge (2022)
29.9.9 Doubly Robust DiD
29.9.10 Augmented/Forward DID
29.10 Multiple Treatments
29.11 Mediation Under DiD
29.12 Assumptions
29.12.1 Prior Parallel Trends Test
29.12.2 Placebo Test
29.12.3 Assumption Violations
29.12.4 Robustness Checks
30 Changes-in-Changes
30.1 Application
30.1.1 ECIC package
30.1.2 QTE package
31 Synthetic Control
31.1 Applications
31.1.1 Example 1
31.1.2 Example 2
31.1.3 Example 3
31.1.4 Example 4
31.2 Augmented Synthetic Control Method
31.3 Synthetic Control with Staggered Adoption
31.4 Bayesian Synthetic Control
31.5 Generalized Synthetic Control
31.6 Other Advances
32 Event Studies
32.1 Other Issues
32.1.1 Event Studies in marketing
32.1.2 Economic significance
32.1.3 Statistical Power
32.2 Testing
32.2.1 Parametric Test
32.2.2 Non-parametric Test
32.3 Sample
32.3.1 Confounders
32.4 Biases
32.5 Long-run event studies
32.5.1 Buy and Hold Abnormal Returns (BHAR)
32.5.2 Long-term Cumulative Abnormal Returns (LCARs)
32.5.3 Calendar-time Portfolio Abnormal Returns (CTARs)
32.6 Aggregation
32.6.1 Over Time
32.6.2 Across Firms + Over Time
32.7 Heterogeneity in the event effect
32.7.1 Common variables in marketing
32.8 Expected Return Calculation
32.8.1 Statistical Models
32.8.2 Economic Model
32.9 Application
32.9.1 Eventus
32.9.2 eventstudies
32.9.3 EventStudy
33 Instrumental Variables
33.1 Framework
33.2 Estimation
33.2.1 2SLS Estimation
33.2.2 IV-GMM
33.3 Inference
33.3.1 AR approach
33.3.2 tF Procedure
33.3.3 AK approach
33.4 Testing Assumptions
33.4.1 Relevance Assumption
33.4.2 Exogeneity Assumption
33.5 Negative R²
33.6 Treatment Intensity
33.7 Control Function
33.7.1 Simulation
33.8 New Advances
34 Matching Methods
34.1 Selection on Observables
34.1.1 MatchIt
34.1.2 designmatch
34.1.3 MatchingFrontier
34.1.4 Propensity Scores
34.1.5 Mahalanobis Distance
34.1.6 Coarsened Exact Matching
34.1.7 Genetic Matching
34.1.8 Entropy Balancing
34.1.9 Matching for high-dimensional data
34.1.10 Matching for time series-cross-section data
34.1.11 Matching for multiple treatments
34.1.12 Matching for multi-level treatments
34.1.13 Matching for repeated treatments
34.2 Selection on Unobservables
34.2.1 Rosenbaum Bounds
34.2.2 Relative Correlation Restrictions
34.2.3 Coefficient-stability Bounds
35 Interrupted Time Series
35.1 Autocorrelation
35.2 Multiple Groups
C. OTHER CONCERNS
36 Endogeneity
36.1 Endogenous Treatment
36.1.1 Measurement Error
36.1.2 Simultaneity
36.1.3 Endogenous Treatment Solutions
36.2 Endogenous Sample Selection
36.2.1 Mitigation-Based Selection
36.2.2 Preference-Based Selection
36.3 Implications for Causal Inference
36.3.1 Addressing Selection Bias
36.3.2 Tobit-2
36.3.3 Tobit-5
37 Other Biases
37.1 Aggregation Bias
37.1.1 Simpson’s Paradox
37.2 Contamination Bias
37.3 Survivorship Bias
37.4 Publication Bias
38 Controls
38.1 Bad Controls
38.1.1 M-bias
38.1.2 Bias Amplification
38.1.3 Overcontrol bias
38.1.4 Selection Bias
38.1.5 Case-control Bias
38.2 Good Controls
38.2.1 Omitted Variable Bias Correction
38.2.2 Omitted Variable Bias in Mediation Correction
38.3 Neutral Controls
38.3.1 Good Predictive Controls
38.3.2 Good Selection Bias
38.3.3 Bad Predictive Controls
38.3.4 Bad Selection Bias
38.4 Choosing Controls
39 Directed Acyclic Graph
39.1 Basic Notations
V. MISCELLANEOUS
40 Report
40.1 One summary table
40.2 Model Comparison
40.3 Changes in an estimate
40.4 Standard Errors
40.5 Coefficient Uncertainty and Distribution
40.6 Descriptive Tables
40.7 Visualizations and Plots
41 Exploratory Data Analysis
42 Sensitivity Analysis / Robustness Check
42.1 Specification curve
42.1.1 starbility
42.1.2 rdfanalysis
42.2 Coefficient stability
42.3 Omitted Variable Bias Quantification
43 Replication and Synthetic Data
43.1 The Replication Standard
43.1.1 Solutions for Empirical Replication
43.1.2 Free Data Repositories
43.1.3 Exceptions to Replication
43.1.4 Replication Landscape
43.2 Synthetic Data
43.2.1 Benefits of Synthetic Data
43.2.2 Concerns and Limitations
43.2.3 Further Insights on Synthetic Data
43.2.4 Generating Synthetic Data
43.3 Application
43.3.1 Original Dataset
43.3.2 Restricted Dataset
43.3.3 Synthpop
APPENDIX
A Appendix
A.1 Git
A.2 Short-cut
A.3 Function short-cut
A.4 Citation
A.5 Install all necessary packages/libraries on your local machine
B Bookdown cheat sheet
B.1 Operation
B.2 Math Expression / Syntax
B.2.1 Statistics Notation
B.3 Table
References
29.11 Mediation Under DiD
Check this post.
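Since this section is only a pointer to an external post, here is a minimal, illustrative R sketch (simulated data; not drawn from the linked post) of the naive regression-based decomposition sometimes used for mediation in a 2×2 DiD: estimate the treat × post effect with and without the candidate mediator and compare the two coefficients. All variable names and effect sizes below are assumptions for illustration, and, as the causal mediation discussion in Chapter 19 notes, conditioning on a post-treatment mediator is only valid under sequential-ignorability-type assumptions.

```r
# Illustrative sketch only (simulated data; not the method from the linked post).
# Naive mediation decomposition in a 2x2 DiD: compare the treat:post coefficient
# with and without the candidate mediator in the outcome equation.
set.seed(2024)
n  <- 2000
df <- data.frame(treat = rep(0:1, each = n / 2),
                 post  = rep(0:1, times = n / 2))
# Assumed structure: treatment moves the mediator, which moves the outcome.
df$m <- 0.5 * df$treat * df$post + rnorm(n)
df$y <- 1.0 * df$treat * df$post + 0.8 * df$m + rnorm(n)

fit_total  <- lm(y ~ treat * post, data = df)      # total DiD effect
fit_direct <- lm(y ~ treat * post + m, data = df)  # effect holding the mediator fixed

coef(fit_total)["treat:post"]   # ~1.4 = 1.0 + 0.8 * 0.5 (total)
coef(fit_direct)["treat:post"]  # ~1.0 (direct, under sequential ignorability)
# The difference (~0.4) is the naive "mediated" component.
```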