  • A Guide on Data Analysis
  • Preface
    • How to cite this book
  • 1 Introduction
    • 1.1 General Recommendations
  • 2 Prerequisites
    • 2.1 Matrix Theory
      • 2.1.1 Rank of a Matrix
      • 2.1.2 Inverse of a Matrix
      • 2.1.3 Definiteness of a Matrix
      • 2.1.4 Matrix Calculus
      • 2.1.5 Optimization in Scalar and Vector Spaces
      • 2.1.6 Cholesky Decomposition
    • 2.2 Probability Theory
      • 2.2.1 Axioms and Theorems of Probability
      • 2.2.2 Central Limit Theorem
      • 2.2.3 Random Variable
      • 2.2.4 Moment Generating Function
      • 2.2.5 Moments
      • 2.2.6 Skewness
      • 2.2.7 Kurtosis
      • 2.2.8 Distributions
    • 2.3 General Math
      • 2.3.1 Number Sets
      • 2.3.2 Summation Notation and Series
      • 2.3.3 Taylor Expansion
      • 2.3.4 Law of Large Numbers
      • 2.3.5 Convergence
      • 2.3.6 Sufficient Statistics and Likelihood
      • 2.3.7 Parameter Transformations
    • 2.4 Data Import/Export
      • 2.4.1 Key Limitations of R
      • 2.4.2 Solutions and Workarounds
      • 2.4.3 Medium-Sized Data
      • 2.4.4 Large-Sized Data
    • 2.5 Data Manipulation
  • I. BASIC
  • 3 Descriptive Statistics
    • 3.1 Numerical Measures
    • 3.2 Graphical Measures
      • 3.2.1 Shape
      • 3.2.2 Scatterplot
    • 3.3 Normality Assessment
      • 3.3.1 Graphical Assessment
      • 3.3.2 Summary Statistics
    • 3.4 Bivariate Statistics
      • 3.4.1 Two Continuous
      • 3.4.2 Categorical and Continuous
      • 3.4.3 Two Discrete
      • 3.4.4 General Approach to Bivariate Statistics
  • 4 Basic Statistical Inference
    • 4.1 Hypothesis Testing Framework
      • 4.1.1 Null and Alternative Hypotheses
      • 4.1.2 Errors in Hypothesis Testing
      • 4.1.3 The Role of Distributions in Hypothesis Testing
      • 4.1.4 The Test Statistic
      • 4.1.5 Critical Values and Rejection Regions
      • 4.1.6 Visualizing Hypothesis Testing
    • 4.2 Key Concepts and Definitions
      • 4.2.1 Random Sample
      • 4.2.2 Sample Statistics
      • 4.2.3 Distribution of the Sample Mean
    • 4.3 One-Sample Inference
      • 4.3.1 For Single Mean
      • 4.3.2 For Difference of Means, Independent Samples
      • 4.3.3 For Difference of Means, Paired Samples
      • 4.3.4 For Difference of Two Proportions
      • 4.3.5 For Single Proportion
      • 4.3.6 For Single Variance
      • 4.3.7 Non-parametric Tests
    • 4.4 Two-Sample Inference
      • 4.4.1 For Means
      • 4.4.2 For Variances
      • 4.4.3 Power
      • 4.4.4 Matched Pair Designs
      • 4.4.5 Nonparametric Tests for Two Samples
    • 4.5 Categorical Data Analysis
      • 4.5.1 Association Tests
      • 4.5.2 Ordinal Association
      • 4.5.3 Ordinal Trend
    • 4.6 Divergence Metrics and Tests for Comparing Distributions
      • 4.6.1 Kolmogorov-Smirnov Test
      • 4.6.2 Anderson-Darling Test
      • 4.6.3 Chi-Square Goodness-of-Fit Test
      • 4.6.4 Cramér-von Mises Test
      • 4.6.5 Kullback-Leibler Divergence
      • 4.6.6 Jensen-Shannon Divergence
      • 4.6.7 Hellinger Distance
      • 4.6.8 Bhattacharyya Distance
      • 4.6.9 Wasserstein Distance
      • 4.6.10 Energy Distance
      • 4.6.11 Total Variation Distance
      • 4.6.12 Summary
  • II. REGRESSION
  • 5 Linear Regression
    • 5.1 Ordinary Least Squares
      • 5.1.1 Simple Regression (Basic) Model
      • 5.1.2 Multiple Linear Regression
    • 5.2 Generalized Least Squares
      • 5.2.1 Infeasible Generalized Least Squares
      • 5.2.2 Feasible Generalized Least Squares
      • 5.2.3 Weighted Least Squares
    • 5.3 Maximum Likelihood
      • 5.3.1 Motivation for MLE
      • 5.3.2 Key Quantities for Inference
      • 5.3.3 Assumptions of MLE
      • 5.3.4 Properties of MLE
      • 5.3.5 Practical Considerations
      • 5.3.6 Comparison of MLE and OLS
      • 5.3.7 Applications of MLE
    • 5.4 Penalized (Regularized) Estimators
      • 5.4.1 Motivation for Penalized Estimators
      • 5.4.2 Ridge Regression
      • 5.4.3 Lasso Regression
      • 5.4.4 Elastic Net
      • 5.4.5 Tuning Parameter Selection
      • 5.4.6 Properties of Penalized Estimators
    • 5.5 Robust Estimators
      • 5.5.1 Motivation for Robust Estimation
      • 5.5.2 M-Estimators
      • 5.5.3 R-Estimators
      • 5.5.4 L-Estimators
      • 5.5.5 Least Trimmed Squares (LTS)
      • 5.5.6 S-Estimators
      • 5.5.7 MM-Estimators
      • 5.5.8 Practical Considerations
    • 5.6 Partial Least Squares
      • 5.6.1 Motivation for PLS
      • 5.6.2 Steps to Construct PLS Components
      • 5.6.3 Properties of PLS
      • 5.6.4 Comparison with Related Methods
  • 6 Non-Linear Regression
    • 6.1 Inference
      • 6.1.1 Linear Functions of the Parameters
      • 6.1.2 Nonlinear Functions of Parameters
    • 6.2 Non-linear Least Squares Estimation
      • 6.2.1 Iterative Optimization
      • 6.2.2 Derivative-Free
      • 6.2.3 Stochastic Heuristic
      • 6.2.4 Linearization
      • 6.2.5 Hybrid
      • 6.2.6 Comparison of Nonlinear Optimizers
    • 6.3 Practical Considerations
      • 6.3.1 Selecting Starting Values
      • 6.3.2 Handling Constrained Parameters
      • 6.3.3 Failure to Converge
      • 6.3.4 Convergence to a Local Minimum
      • 6.3.5 Model Adequacy and Estimation Considerations
    • 6.4 Application
      • 6.4.1 Nonlinear Estimation Using Gauss-Newton Algorithm
      • 6.4.2 Logistic Growth Model
      • 6.4.3 Nonlinear Plateau Model
  • 7 Generalized Linear Models
    • 7.1 Logistic Regression
      • 7.1.1 Logistic Model
      • 7.1.2 Likelihood Function
      • 7.1.3 Fisher Information Matrix
      • 7.1.4 Inference in Logistic Regression
      • 7.1.5 Application: Logistic Regression
    • 7.2 Probit Regression
      • 7.2.1 Probit Model
      • 7.2.2 Application: Probit Regression
    • 7.3 Binomial Regression
      • 7.3.1 Dataset Overview
      • 7.3.2 Apply Logistic Model
      • 7.3.3 Apply Probit Model
    • 7.4 Poisson Regression
      • 7.4.1 The Poisson Distribution
      • 7.4.2 Poisson Model
      • 7.4.3 Link Function Choices
      • 7.4.4 Application: Poisson Regression
    • 7.5 Negative Binomial Regression
      • 7.5.1 Negative Binomial Distribution
      • 7.5.2 Application: Negative Binomial Regression
      • 7.5.3 Fitting a Zero-Inflated Negative Binomial Model
    • 7.6 Quasi-Poisson Regression
      • 7.6.1 Is Quasi-Poisson Regression a Generalized Linear Model?
      • 7.6.2 Application: Quasi-Poisson Regression
    • 7.7 Multinomial Logistic Regression
      • 7.7.1 The Multinomial Distribution
      • 7.7.2 Modeling Probabilities Using Log-Odds
      • 7.7.3 Softmax Representation
      • 7.7.4 Log-Odds Ratio Between Two Categories
      • 7.7.5 Estimation
      • 7.7.6 Interpretation of Coefficients
      • 7.7.7 Application: Multinomial Logistic Regression
      • 7.7.8 Application: Gamma Regression
    • 7.8 Generalization of Generalized Linear Models
      • 7.8.1 Exponential Family
      • 7.8.2 Properties of GLM Exponential Families
      • 7.8.3 Structure of a Generalized Linear Model
      • 7.8.4 Components of a GLM
      • 7.8.5 Canonical Link
      • 7.8.6 Inverse Link Functions
      • 7.8.7 Estimation of Parameters in GLMs
      • 7.8.8 Inference
      • 7.8.9 Deviance
      • 7.8.10 Diagnostic Plots
      • 7.8.11 Goodness of Fit
      • 7.8.12 Over-Dispersion
  • 8 Linear Mixed Models
    • 8.1 Dependent Data
      • 8.1.1 Motivation: A Repeated Measurements Example
      • 8.1.2 Example: Linear Mixed Model for Repeated Measurements
      • 8.1.3 Random-Intercepts Model
      • 8.1.4 Covariance Models in Linear Mixed Models
      • 8.1.5 Covariance Structures in Mixed Models
    • 8.2 Estimation in Linear Mixed Models
      • 8.2.1 Interpretation of the Mixed Model Equations
      • 8.2.2 Derivation of the Mixed Model Equations
      • 8.2.3 Bayesian Interpretation of Linear Mixed Models
      • 8.2.4 Estimating the Variance-Covariance Matrix
    • 8.3 Inference in Linear Mixed Models
      • 8.3.1 Inference for Fixed Effects (β)
      • 8.3.2 Inference for Variance Components (θ)
    • 8.4 Information Criteria for Model Selection
      • 8.4.1 Akaike Information Criterion
      • 8.4.2 Corrected AIC
      • 8.4.3 Bayesian Information Criterion
      • 8.4.4 Practical Example with Linear Mixed Models
    • 8.5 Split-Plot Designs
      • 8.5.1 Example Setup
      • 8.5.2 Statistical Model for Split-Plot Designs
      • 8.5.3 Approaches to Analyzing Split-Plot Designs
      • 8.5.4 Application: Split-Plot Design
    • 8.6 Repeated Measures in Mixed Models
    • 8.7 Unbalanced or Unequally Spaced Data
      • 8.7.1 Variance-Covariance Structure: Power Model
    • 8.8 Application: Mixed Models in Practice
      • 8.8.1 Example 1: Pulp Brightness Analysis
      • 8.8.2 Example 2: Penicillin Yield (GLMM with Blocking)
      • 8.8.3 Example 3: Growth in Rats Over Time
      • 8.8.4 Example 4: Tree Water Use (Agridat)
  • 9 Nonlinear and Generalized Linear Mixed Models
    • 9.1 Nonlinear Mixed Models
    • 9.2 Generalized Linear Mixed Models
    • 9.3 Relationship Between NLMMs and GLMMs
    • 9.4 Marginal Properties of GLMMs
      • 9.4.1 Marginal Mean of y_i
      • 9.4.2 Marginal Variance of y_i
      • 9.4.3 Marginal Covariance of y
    • 9.5 Estimation in Nonlinear and Generalized Linear Mixed Models
      • 9.5.1 Estimation by Numerical Integration
      • 9.5.2 Estimation by Linearization
      • 9.5.3 Estimation by Bayesian Hierarchical Models
      • 9.5.4 Practical Implementation in R
    • 9.6 Application: Nonlinear and Generalized Linear Mixed Models
      • 9.6.1 Binomial Data: CBPP Dataset
      • 9.6.2 Count Data: Owl Dataset
      • 9.6.3 Binomial Example: Gotway Hessian Fly Data
      • 9.6.4 Nonlinear Mixed Model: Yellow Poplar Data
    • 9.7 Summary
  • 10 Nonparametric Regression
    • 10.1 Why Nonparametric?
      • 10.1.1 Flexibility
      • 10.1.2 Fewer Assumptions
      • 10.1.3 Interpretability
      • 10.1.4 Practical Considerations
      • 10.1.5 Balancing Parametric and Nonparametric Approaches
    • 10.2 Basic Concepts in Nonparametric Estimation
      • 10.2.1 Bias-Variance Trade-Off
      • 10.2.2 Kernel Smoothing and Local Averages
    • 10.3 Kernel Regression
      • 10.3.1 Basic Setup
      • 10.3.2 Nadaraya-Watson Kernel Estimator
      • 10.3.3 Priestley–Chao Kernel Estimator
      • 10.3.4 Gasser–Müller Kernel Estimator
      • 10.3.5 Comparison of Kernel-Based Estimators
      • 10.3.6 Bandwidth Selection
      • 10.3.7 Asymptotic Properties
      • 10.3.8 Derivation of the Nadaraya-Watson Estimator
    • 10.4 Local Polynomial Regression
      • 10.4.1 Local Polynomial Fitting
      • 10.4.2 Mathematical Form of the Solution
      • 10.4.3 Bias, Variance, and Asymptotics
      • 10.4.4 Special Case: Local Linear Regression
      • 10.4.5 Bandwidth Selection
      • 10.4.6 Asymptotic Properties Summary
    • 10.5 Smoothing Splines
      • 10.5.1 Properties and Form of the Smoothing Spline
      • 10.5.2 Choice of λ
      • 10.5.3 Connection to Reproducing Kernel Hilbert Spaces
    • 10.6 Confidence Intervals in Nonparametric Regression
      • 10.6.1 Asymptotic Normality
      • 10.6.2 Bootstrap Methods
      • 10.6.3 Practical Considerations
    • 10.7 Generalized Additive Models
      • 10.7.1 Estimation via Penalized Likelihood
      • 10.7.2 Interpretation of GAMs
      • 10.7.3 Model Selection and Smoothing Parameter Estimation
      • 10.7.4 Extensions of GAMs
    • 10.8 Regression Trees and Random Forests
      • 10.8.1 Regression Trees
      • 10.8.2 Random Forests
      • 10.8.3 Theoretical Insights
      • 10.8.4 Feature Importance in Random Forests
      • 10.8.5 Advantages and Limitations of Tree-Based Methods
    • 10.9 Wavelet Regression
      • 10.9.1 Wavelet Series Expansion
      • 10.9.2 Wavelet Regression Model
      • 10.9.3 Wavelet Shrinkage and Thresholding
    • 10.10 Multivariate Nonparametric Regression
      • 10.10.1 The Curse of Dimensionality
      • 10.10.2 Multivariate Kernel Regression
      • 10.10.3 Multivariate Splines
      • 10.10.4 Additive Models (GAMs)
      • 10.10.5 Radial Basis Functions
    • 10.11 Conclusion: The Evolving Landscape of Regression Analysis
      • 10.11.1 Key Takeaways
      • 10.11.2 The Art and Science of Regression
      • 10.11.3 Looking Forward
      • 10.11.4 Final Thoughts
  • III. RAMIFICATIONS
  • 11 Data
    • 11.1 Data Types
      • 11.1.1 Qualitative vs. Quantitative Data
      • 11.1.2 Other Ways to Classify Data
      • 11.1.3 Data by Observational Structure Over Time
    • 11.2 Cross-Sectional Data
    • 11.3 Time Series Data
      • 11.3.1 Statistical Properties of Time Series Models
      • 11.3.2 Common Time Series Processes
      • 11.3.3 Deterministic Time Trends
      • 11.3.4 Violations of Exogeneity in Time Series Models
      • 11.3.5 Consequences of Exogeneity Violations
      • 11.3.6 Highly Persistent Data
      • 11.3.7 Unit Root Testing
      • 11.3.8 Newey-West Standard Errors
    • 11.4 Repeated Cross-Sectional Data
      • 11.4.1 Key Characteristics
      • 11.4.2 Statistical Modeling for Repeated Cross-Sections
      • 11.4.3 Advantages of Repeated Cross-Sectional Data
      • 11.4.4 Disadvantages of Repeated Cross-Sectional Data
    • 11.5 Panel Data
      • 11.5.1 Advantages of Panel Data
      • 11.5.2 Disadvantages of Panel Data
      • 11.5.3 Sources of Variation in Panel Data
      • 11.5.4 Pooled OLS Estimator
      • 11.5.5 Individual-Specific Effects Model
      • 11.5.6 Random Effects Estimator
      • 11.5.7 Fixed Effects Estimator
      • 11.5.8 Tests for Assumptions in Panel Data Analysis
      • 11.5.9 Model Selection in Panel Data
      • 11.5.10 Alternative Estimators
      • 11.5.11 Application
    • 11.6 Choosing the Right Type of Data
    • 11.7 Data Quality and Ethical Considerations
  • 12 Variable Transformation
    • 12.1 Continuous Variables
      • 12.1.1 Standardization (Z-score Normalization)
      • 12.1.2 Min-Max Scaling (Normalization)
      • 12.1.3 Square Root and Cube Root Transformations
      • 12.1.4 Logarithmic Transformation
      • 12.1.5 Exponential Transformation
      • 12.1.6 Power Transformation
      • 12.1.7 Inverse (Reciprocal) Transformation
      • 12.1.8 Hyperbolic Arcsine Transformation
      • 12.1.9 Ordered Quantile Normalization (Rank-Based Transformation)
      • 12.1.10 Lambert W × F Transformation
      • 12.1.11 Inverse Hyperbolic Sine Transformation
      • 12.1.12 Box-Cox Transformation
      • 12.1.13 Yeo-Johnson Transformation
      • 12.1.14 RankGauss Transformation
      • 12.1.15 Automatically Choosing the Best Transformation
    • 12.2 Categorical Variables
      • 12.2.1 One-Hot Encoding (Dummy Variables)
      • 12.2.2 Label Encoding
      • 12.2.3 Feature Hashing (Hash Encoding)
      • 12.2.4 Binary Encoding
      • 12.2.5 Base-N Encoding (Generalized Binary Encoding)
      • 12.2.6 Frequency Encoding
      • 12.2.7 Target Encoding (Mean Encoding)
      • 12.2.8 Ordinal Encoding
      • 12.2.9 Weight of Evidence Encoding
      • 12.2.10 Helmert Encoding
      • 12.2.11 Probability Ratio Encoding
      • 12.2.12 Backward Difference Encoding
      • 12.2.13 Leave-One-Out Encoding
      • 12.2.14 James-Stein Encoding
      • 12.2.15 M-Estimator Encoding
      • 12.2.16 Thermometer Encoding
      • 12.2.17 Choosing the Right Encoding Method
  • 13 Imputation (Missing Data)
    • 13.1 Introduction to Missing Data
      • 13.1.1 Types of Imputation
      • 13.1.2 When and Why to Use Imputation
      • 13.1.3 Importance of Missing Data Treatment in Statistical Modeling
      • 13.1.4 Prevalence of Missing Data Across Domains
      • 13.1.5 Practical Considerations for Imputation
    • 13.2 Theoretical Foundations of Missing Data
      • 13.2.1 Definition and Classification of Missing Data
      • 13.2.2 Missing Data Mechanisms
      • 13.2.3 Relationship Between Mechanisms and Ignorability
    • 13.3 Diagnosing the Missing Data Mechanism
      • 13.3.1 Descriptive Methods
      • 13.3.2 Statistical Tests for Missing Data Mechanisms
      • 13.3.3 Assessing MAR and MNAR
    • 13.4 Methods for Handling Missing Data
      • 13.4.1 Basic Methods
      • 13.4.2 Single Imputation Techniques
      • 13.4.3 Machine Learning and Modern Approaches
      • 13.4.4 Multiple Imputation
    • 13.5 Evaluation of Imputation Methods
      • 13.5.1 Statistical Metrics for Assessing Imputation Quality
      • 13.5.2 Bias-Variance Tradeoff in Imputation
      • 13.5.3 Sensitivity Analysis
      • 13.5.4 Validation Using Simulated Data and Real-World Case Studies
    • 13.6 Criteria for Choosing an Effective Approach
    • 13.7 Challenges and Ethical Considerations
      • 13.7.1 Challenges in High-Dimensional Data
      • 13.7.2 Missing Data in Big Data Contexts
      • 13.7.3 Ethical Concerns
    • 13.8 Emerging Trends in Missing Data Handling
      • 13.8.1 Advances in Neural Network Approaches
      • 13.8.2 Integration with Reinforcement Learning
      • 13.8.3 Synthetic Data Generation for Missing Data
      • 13.8.4 Federated Learning and Privacy-Preserving Imputation
      • 13.8.5 Imputation in Streaming and Online Data Environments
    • 13.9 Application of Imputation
      • 13.9.1 Visualizing Missing Data
      • 13.9.2 How Many Imputations?
      • 13.9.3 Generating Missing Data for Demonstration
      • 13.9.4 Imputation with Mean, Median, and Mode
      • 13.9.5 K-Nearest Neighbors (KNN) Imputation
      • 13.9.6 Imputation with Decision Trees (rpart)
      • 13.9.7 MICE (Multivariate Imputation via Chained Equations)
      • 13.9.8 Amelia
      • 13.9.9 missForest
      • 13.9.10 Hmisc
      • 13.9.11 mi
  • 14 Model Specification Tests
    • 14.1 Nested Model Tests
      • 14.1.1 Wald Test
      • 14.1.2 Likelihood Ratio Test
      • 14.1.3 F-Test (for Linear Regression)
      • 14.1.4 Chow Test
    • 14.2 Non-Nested Model Tests
      • 14.2.1 Vuong Test
      • 14.2.2 Davidson–MacKinnon J-Test
      • 14.2.3 Adjusted R²
      • 14.2.4 Comparing Models with Transformed Dependent Variables
    • 14.3 Heteroskedasticity Tests
      • 14.3.1 Breusch–Pagan Test
      • 14.3.2 White Test
      • 14.3.3 Goldfeld–Quandt Test
      • 14.3.4 Park Test
      • 14.3.5 Glejser Test
      • 14.3.6 Summary of Heteroskedasticity Tests
    • 14.4 Functional Form Tests
      • 14.4.1 Ramsey RESET Test (Regression Equation Specification Error Test)
      • 14.4.2 Harvey–Collier Test
      • 14.4.3 Rainbow Test
      • 14.4.4 Summary of Functional Form Tests
    • 14.5 Autocorrelation Tests
      • 14.5.1 Durbin–Watson Test
      • 14.5.2 Breusch–Godfrey Test
      • 14.5.3 Ljung–Box Test (or Box–Pierce Test)
      • 14.5.4 Runs Test
      • 14.5.5 Summary of Autocorrelation Tests
    • 14.6 Multicollinearity Diagnostics
      • 14.6.1 Variance Inflation Factor
      • 14.6.2 Tolerance Statistic
      • 14.6.3 Condition Index and Eigenvalue Decomposition
      • 14.6.4 Pairwise Correlation Matrix
      • 14.6.5 Determinant of the Correlation Matrix
      • 14.6.6 Summary of Multicollinearity Diagnostics
      • 14.6.7 Addressing Multicollinearity
  • 15 Variable Selection
    • 15.1 Filter Methods (Statistical Criteria, Model-Agnostic)
      • 15.1.1 Information Criteria-Based Selection
      • 15.1.2 Univariate Selection Methods
      • 15.1.3 Correlation-Based Feature Selection
      • 15.1.4 Variance Thresholding
    • 15.2 Wrapper Methods (Model-Based Subset Evaluation)
      • 15.2.1 Best Subsets Algorithm
      • 15.2.2 Stepwise Selection Methods
      • 15.2.3 Branch-and-Bound Algorithm
      • 15.2.4 Recursive Feature Elimination
    • 15.3 Embedded Methods (Integrated into Model Training)
      • 15.3.1 Regularization-Based Selection
      • 15.3.2 Tree-Based Feature Importance
      • 15.3.3 Genetic Algorithms
    • 15.4 Summary Table
  • 16 Hypothesis Testing
    • 16.1 Null Hypothesis Significance Testing
      • 16.1.1 Error Types in Hypothesis Testing
      • 16.1.2 Hypothesis Testing Framework
      • 16.1.3 Interpreting Hypothesis Testing Results
      • 16.1.4 Understanding p-Values
      • 16.1.5 The Role of Sample Size
      • 16.1.6 p-Value Hacking
      • 16.1.7 Practical vs. Statistical Significance
      • 16.1.8 Mitigating the Misuse of p-Values
      • 16.1.9 Wald Test
      • 16.1.10 Likelihood Ratio Test
      • 16.1.11 Lagrange Multiplier (Score) Test
      • 16.1.12 Comparing Hypothesis Tests
    • 16.2 Two One-Sided Tests Equivalence Testing
      • 16.2.1 When to Use TOST?
      • 16.2.2 Interpretation of the TOST Procedure
      • 16.2.3 Relationship to Confidence Intervals
      • 16.2.4 Example 1: Testing the Equivalence of Two Means
      • 16.2.5 Advantages of TOST Equivalence Testing
      • 16.2.6 When Not to Use TOST
    • 16.3 False Discovery Rate
      • 16.3.1 Benjamini-Hochberg Procedure
      • 16.3.2 Benjamini-Yekutieli Procedure
      • 16.3.3 Storey’s q-value Approach
      • 16.3.4 Summary: False Discovery Rate Methods
    • 16.4 Comparison of Testing Frameworks
  • 17 Marginal Effects
    • 17.1 Definition of Marginal Effects
      • 17.1.1 Analytical Derivation of Marginal Effects
      • 17.1.2 Numerical Approximation of Marginal Effects
    • 17.2 Marginal Effects in Different Contexts
    • 17.3 Marginal Effects Interpretation
    • 17.4 Delta Method
    • 17.5 Comparison: Delta Method vs. Alternative Approaches
      • 17.5.1 Example: Applying the Delta Method in a logistic regression
    • 17.6 Types of Marginal Effect
      • 17.6.1 Average Marginal Effect
      • 17.6.2 Marginal Effects at the Mean
      • 17.6.3 Marginal Effects at the Average
    • 17.7 Packages for Marginal Effects
      • 17.7.1 marginaleffects Package (Recommended)
      • 17.7.2 margins Package
      • 17.7.3 mfx Package
      • 17.7.4 Comparison of Packages
  • 18 Moderation
    • 18.1 Types of Moderation Analyses
    • 18.2 Key Terminology
    • 18.3 Moderation Model
    • 18.4 Types of Interactions
    • 18.5 Three-Way Interactions
    • 18.6 Additional Resources
    • 18.7 Application
      • 18.7.1 emmeans Package
      • 18.7.2 probemod Package
      • 18.7.3 interactions Package
      • 18.7.4 interactionR Package
      • 18.7.5 sjPlot Package
      • 18.7.6 Summary of Moderation Analysis Packages
  • 19 Mediation
    • 19.1 Traditional Approach
      • 19.1.1 Steps in the Traditional Mediation Model
      • 19.1.2 Graphical Representation of Mediation
      • 19.1.3 Measuring Mediation
      • 19.1.4 Assumptions in Linear Mediation Models
      • 19.1.5 Testing for Mediation
      • 19.1.6 Additional Considerations
      • 19.1.7 Assumptions in Mediation Analysis
      • 19.1.8 Indirect Effect Tests
      • 19.1.9 Power Analysis for Mediation
      • 19.1.10 Multiple Mediation Analysis
      • 19.1.11 Multiple Treatments in Mediation
    • 19.2 Causal Inference Approach to Mediation
      • 19.2.1 Example: Traditional Mediation Analysis
      • 19.2.2 Two Approaches in Causal Mediation Analysis
  • 20 Prediction and Estimation
    • 20.1 Conceptual Framing
      • 20.1.1 Predictive Modeling
      • 20.1.2 Estimation or Causal Inference
    • 20.2 Mathematical Setup
      • 20.2.1 Probability Space and Data
      • 20.2.2 Loss Functions and Risk
    • 20.3 Prediction in Detail
      • 20.3.1 Empirical Risk Minimization and Generalization
      • 20.3.2 Bias-Variance Decomposition
      • 20.3.3 Example: Linear Regression for Prediction
      • 20.3.4 Applications in Economics
    • 20.4 Parameter Estimation and Causal Inference
      • 20.4.1 Estimation in Parametric Models
      • 20.4.2 Causal Inference Fundamentals
      • 20.4.3 Role of Identification
      • 20.4.4 Challenges
    • 20.5 Causation versus Prediction
    • 20.6 Illustrative Equations and Mathematical Contrasts
      • 20.6.1 Risk Minimization vs. Consistency
      • 20.6.2 Partial Derivatives vs. Predictions
      • 20.6.3 Example: High-Dimensional Regularization
      • 20.6.4 Potential Outcomes Notation
    • 20.7 Extended Mathematical Points
      • 20.7.1 M-Estimation and Asymptotic Theory
      • 20.7.2 The Danger of Omitted Variables
      • 20.7.3 Cross-Validation vs. Statistical Testing
    • 20.8 Putting It All Together: Comparing Objectives
    • 20.9 Conclusion
  • IV. CAUSAL INFERENCE
  • 21 Causal Inference
    • 21.1 The Ladder of Causation
    • 21.2 The Formal Notation of Causality
    • 21.3 The 7 Tools of Structural Causal Models
    • 21.4 Simpson’s Paradox
      • 21.4.1 What is Simpson’s Paradox?
      • 21.4.2 Why is this Important?
      • 21.4.3 Comparison between Simpson’s Paradox and Omitted Variable Bias
      • 21.4.4 Illustrating Simpson’s Paradox: Marketing Campaign Success Rates
      • 21.4.5 Why Does This Happen?
      • 21.4.6 How Does Causal Inference Solve This?
      • 21.4.7 Correcting Simpson’s Paradox with Regression Adjustment
      • 21.4.8 Key Takeaways
    • 21.5 Additional Resources
    • 21.6 Experimental vs. Quasi-Experimental Designs
      • 21.6.1 Criticisms of Quasi-Experimental Designs
    • 21.7 Hierarchical Ordering of Causal Tools
    • 21.8 Types of Validity in Research
      • 21.8.1 Measurement Validity
      • 21.8.2 Construct Validity
      • 21.8.3 Criterion Validity
      • 21.8.4 Internal Validity
      • 21.8.5 External Validity
      • 21.8.6 Ecological Validity
      • 21.8.7 Statistical Conclusion Validity
      • 21.8.8 Putting It All Together
    • 21.9 Types of Subjects in a Treatment Setting
      • 21.9.1 Non-Switchers
      • 21.9.2 Switchers
      • 21.9.3 Classification of Individuals Based on Treatment Assignment
    • 21.10 Types of Treatment Effects
      • 21.10.1 Average Treatment Effect
      • 21.10.2 Conditional Average Treatment Effect
      • 21.10.3 Intention-to-Treat Effect
      • 21.10.4 Local Average Treatment Effects
      • 21.10.5 Population vs. Sample Average Treatment Effects
      • 21.10.6 Average Treatment Effects on the Treated and Control
      • 21.10.7 Quantile Average Treatment Effects
      • 21.10.8 Log-Odds Treatment Effects for Binary Outcomes
      • 21.10.9 Summary Table: Treatment Effect Estimands
  • A. EXPERIMENTAL DESIGN
  • 22 Experimental Design
    • 22.1 Principles of Experimental Design
    • 22.2 The Gold Standard: Randomized Controlled Trials
    • 22.3 Selection Problem
      • 22.3.1 The Observed Difference in Outcomes
      • 22.3.2 Eliminating Selection Bias with Random Assignment
      • 22.3.3 Another Representation Under Regression
    • 22.4 Classical Experimental Designs
      • 22.4.1 Completely Randomized Design
      • 22.4.2 Randomized Block Design
      • 22.4.3 Factorial Design
      • 22.4.4 Crossover Design
      • 22.4.5 Split-Plot Design
      • 22.4.6 Latin Square Design
    • 22.5 Advanced Experimental Designs
      • 22.5.1 Semi-Random Experiments
      • 22.5.2 Re-Randomization
      • 22.5.3 Two-Stage Randomized Experiments
      • 22.5.4 Two-Stage Randomized Experiments with Interference and Noncompliance
    • 22.6 Emerging Research
      • 22.6.1 Covariate Balancing in Online A/B Testing: The Pigeonhole Design
      • 22.6.2 Handling Zero-Valued Outcomes
  • 23 Sampling
    • 23.1 Population and Sample
    • 23.2 Probability Sampling
      • 23.2.1 Simple Random Sampling
      • 23.2.2 Stratified Sampling
      • 23.2.3 Systematic Sampling
      • 23.2.4 Cluster Sampling
    • 23.3 Non-Probability Sampling
      • 23.3.1 Convenience Sampling
      • 23.3.2 Quota Sampling
      • 23.3.3 Snowball Sampling
    • 23.4 Unequal Probability Sampling
    • 23.5 Balanced Sampling
      • 23.5.1 Cube Method for Balanced Sampling
      • 23.5.2 Balanced Sampling with Stratification
      • 23.5.3 Balanced Sampling in Cluster Sampling
      • 23.5.4 Balanced Sampling in Two-Stage Sampling
    • 23.6 Sample Size Determination
  • 24 Analysis of Variance
    • 24.1 Completely Randomized Design
      • 24.1.1 Single-Factor Fixed Effects ANOVA
      • 24.1.2 Single Factor Random Effects ANOVA
      • 24.1.3 Two-Factor Fixed Effects ANOVA
      • 24.1.4 Two-Way Random Effects ANOVA
      • 24.1.5 Two-Way Mixed Effects ANOVA
    • 24.2 Nonparametric ANOVA
      • 24.2.1 Kruskal-Wallis Test (One-Way Nonparametric ANOVA)
      • 24.2.2 Friedman Test (Nonparametric Two-Way ANOVA)
    • 24.3 Randomized Block Designs
    • 24.4 Nested Designs
      • 24.4.1 Two-Factor Nested Design
      • 24.4.2 Unbalanced Nested Two-Factor Designs
      • 24.4.3 Random Factor Effects
    • 24.5 Sample Size Planning for ANOVA
      • 24.5.1 Balanced Designs
      • 24.5.2 Single Factor Studies
      • 24.5.3 Multi-Factor Studies
      • 24.5.4 Procedure for Sample Size Selection
      • 24.5.5 Randomized Block Experiments
    • 24.6 Single Factor Covariance Model
      • 24.6.1 Statistical Inference for Treatment Effects
      • 24.6.2 Testing for Parallel Slopes
      • 24.6.3 Adjusted Means
  • 25 Multivariate Methods
    • 25.1 Basic Understanding
      • 25.1.1 Multivariate Random Vectors
      • 25.1.2 Covariance Matrix
      • 25.1.3 Equalities in Expectation and Variance
      • 25.1.4 Multivariate Normal Distribution
      • 25.1.5 Test of Multivariate Normality
      • 25.1.6 Mean Vector Inference
      • 25.1.7 General Hypothesis Testing
    • 25.2 Multivariate Analysis of Variance
      • 25.2.1 One-Way MANOVA
      • 25.2.2 Profile Analysis
    • 25.3 Statistical Test Selection for Comparing Means
  • B. QUASI-EXPERIMENTAL DESIGN
  • 26 Quasi-Experimental Methods
    • 26.1 Identification Strategy in Quasi-Experiments
    • 26.2 Robustness Checks
    • 26.3 Establishing Mechanisms
    • 26.4 Limitations of Quasi-Experiments
    • 26.5 Assumptions for Identifying Treatment Effects
      • 26.5.1 Stable Unit Treatment Value Assumption
      • 26.5.2 Conditional Ignorability Assumption
      • 26.5.3 Overlap (Positivity) Assumption
    • 26.6 Natural Experiments
      • 26.6.1 The Problem of Reusing Natural Experiments
      • 26.6.2 Statistical Challenges in Reusing Natural Experiments
A Guide on Data Analysis

13.8 Emerging Trends in Missing Data Handling

13.8.1 Advances in Neural Network Approaches

Neural networks offer flexible, scalable approaches to missing data imputation that go beyond traditional methods such as mean imputation or multiple imputation by chained equations.

13.8.1.1 Variational Autoencoders (VAEs)

  • Overview: Variational Autoencoders (VAEs) are generative models that encode data into a latent space and reconstruct it, filling in missing values during reconstruction.

  • Advantages:

    • Handle complex, non-linear relationships between variables.
    • Scalable to high-dimensional datasets.
    • Generate probabilistic imputations, reflecting uncertainty.
  • Applications:

    • In marketing, VAEs can impute missing customer behavior data while accounting for seasonal and demographic variations.
    • In finance, VAEs assist in imputing missing stock price data by modeling dependencies among assets.

13.8.1.2 GANs for Missing Data

  • Generative Adversarial Networks (GANs): GANs pair a generator with a discriminator; for imputation (as in GAIN, a GAN-based imputer), the generator fills in missing entries while the discriminator tries to distinguish imputed values from observed ones.

  • Advantages:

    • Preserve data distributions and avoid over-smoothing.
    • Suitable for imputation in datasets with complex patterns or multi-modal distributions.
  • Applications:

    • In healthcare, GANs have been used to impute missing patient records while preserving patient privacy and data integrity.
    • In retail, GANs can model missing sales data to predict trends and optimize inventory.

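Both VAE- and GAN-based imputers share the same outer loop: initialize the missing cells, repeatedly reconstruct the data with the model, and copy the reconstruction back into the missing cells only. Below is a minimal Python sketch of that loop, with a hypothetical reconstruct() stand-in (column means) in place of a trained encoder-decoder or generator.

```python
# Iterative "reconstruct-and-refill" imputation loop shared by VAE- and
# GAN-based imputers. reconstruct() is a hypothetical stand-in (column
# means of the current fill); a real implementation would instead call a
# trained generative model's decoder or generator.

def column_means(rows):
    """Mean of each column of a fully filled numeric matrix."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def reconstruct(rows):
    # Stand-in for model.decode(model.encode(rows)): every row is
    # "reconstructed" as the vector of current column means.
    mu = column_means(rows)
    return [mu[:] for _ in rows]

def impute(rows, n_iter=10):
    """rows: list of lists; missing entries are None."""
    mask = [[v is None for v in r] for r in rows]
    # Initialize missing cells with the observed mean of their column.
    filled = [r[:] for r in rows]
    for j in range(len(rows[0])):
        obs = [r[j] for r in rows if r[j] is not None]
        init = sum(obs) / len(obs)
        for i, r in enumerate(filled):
            if mask[i][j]:
                r[j] = init
    # Alternate reconstruction and refilling of the missing cells only,
    # leaving observed cells untouched.
    for _ in range(n_iter):
        recon = reconstruct(filled)
        for i in range(len(filled)):
            for j in range(len(filled[0])):
                if mask[i][j]:
                    filled[i][j] = recon[i][j]
    return filled

data = [[1.0, 2.0], [3.0, None], [None, 6.0]]
print(impute(data))  # -> [[1.0, 2.0], [3.0, 4.0], [2.0, 6.0]]
```

The key design point carries over to the real models: the mask separates observed from missing entries, so the generative model is free to reconstruct everything while only the missing cells are ever overwritten.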
13.8.2 Integration with Reinforcement Learning

Reinforcement learning (RL) is increasingly being integrated into missing data strategies, particularly in dynamic or sequential data environments.

  • Markov Decision Processes (MDPs): RL models missing data handling as an MDP, where actions (imputations) are optimized based on rewards (accuracy of predictions or decisions).

  • Active Imputation:

    • RL can be used to actively query for missing data points, prioritizing those with the highest impact on downstream tasks.
    • Example: In customer churn prediction, RL can optimize the imputation of high-value customer records.
  • Applications:

    • Financial forecasting: RL models are used to impute missing transaction data dynamically, optimizing portfolio decisions.
    • Smart cities: RL-based models handle missing sensor data to enhance real-time decision-making in traffic management.

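The active-imputation idea above can be sketched as a one-step, bandit-style choice: score each missing cell by a reward proxy and query the highest-scoring one. The importance weights and the variance-based proxy below are illustrative assumptions; a full RL formulation would learn these values from downstream prediction feedback instead.

```python
# Active imputation sketch: choose which missing cell to query next by a
# reward proxy. Here the (assumed) proxy is the observed variance of a
# column times a per-column importance weight, so columns that vary a lot
# and matter downstream are queried first.
from statistics import pvariance

def next_query(rows, importance):
    """Return (row, col) of the highest-priority missing cell."""
    best, best_score = None, float("-inf")
    for j in range(len(rows[0])):
        obs = [r[j] for r in rows if r[j] is not None]
        if len(obs) < 2:
            continue  # not enough observations to estimate variance
        score = importance[j] * pvariance(obs)
        for i, r in enumerate(rows):
            if r[j] is None and score > best_score:
                best, best_score = (i, j), score
    return best

rows = [[1.0, 10.0], [1.1, None], [None, 30.0], [0.9, 20.0]]
# Column 1 varies far more than column 0, so its missing cell is queried.
print(next_query(rows, importance=[1.0, 1.0]))  # -> (1, 1)
```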
13.8.3 Synthetic Data Generation for Missing Data

Synthetic data generation has emerged as a robust solution to address missing data, providing flexibility and privacy.

  • Data Augmentation: Synthetic data is generated to augment datasets with missing values, reducing biases introduced by imputation.

  • Techniques:

    • Simulations: Monte Carlo simulations create plausible data points based on observed distributions.
    • Generative Models: GANs and VAEs generate realistic synthetic data that aligns with existing patterns.
  • Applications:

    • In fraud detection, synthetic datasets mitigate the impact of missing values on anomaly detection.
    • In insurance, synthetic data supports pricing models by filling in gaps from incomplete policyholder records.

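The Monte Carlo technique above can be sketched by drawing each missing cell from the empirical distribution of its column's observed values, repeated many times to yield several plausible synthetic completions rather than a single point imputation. The function name and arguments below are illustrative.

```python
# Monte Carlo completion sketch: every missing cell is filled with a
# random draw from that column's observed values, and the whole process
# is repeated n_draws times to produce multiple synthetic datasets.
import random

def monte_carlo_complete(rows, n_draws=100, seed=0):
    rng = random.Random(seed)
    observed = [[r[j] for r in rows if r[j] is not None]
                for j in range(len(rows[0]))]
    draws = []
    for _ in range(n_draws):
        completed = [[v if v is not None else rng.choice(observed[j])
                      for j, v in enumerate(r)] for r in rows]
        draws.append(completed)
    return draws

rows = [[1.0, 5.0], [2.0, None], [3.0, 7.0]]
draws = monte_carlo_complete(rows)
# Every imputed value is one of the column's observed values (5.0 or 7.0),
# and observed cells are never altered.
print({d[1][1] for d in draws} <= {5.0, 7.0})  # -> True
```

Because the completions vary across draws, downstream estimates can be computed on each draw and pooled, so the variability induced by the missingness is reflected in the final uncertainty rather than hidden by a single fill.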
13.8.4 Federated Learning and Privacy-Preserving Imputation

Federated learning has gained traction as a method for collaborative analysis while preserving data privacy.

  • Federated Imputation:
    • Distributed imputation algorithms operate on decentralized data, ensuring that sensitive information remains local.
    • Example: Hospitals collaboratively impute missing patient data without sharing individual records.
  • Privacy Mechanisms:
    • Differential privacy adds calibrated noise to shared statistics or imputed values, protecting individual-level data.
    • Homomorphic encryption allows computations on encrypted data, ensuring privacy throughout the imputation process.
  • Applications:
    • Healthcare: Federated learning imputes missing diagnostic data across clinics.
    • Banking: Collaborative imputation of financial transaction data supports risk modeling while adhering to regulations.

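A minimal sketch of federated, differentially private imputation along the lines above: each site releases only Laplace-noised sums and counts, and the server aggregates them into a global mean used to fill missing values. The privacy budget epsilon and the value bound B are illustrative assumptions that calibrate the noise.

```python
# Federated imputation sketch: sites share only noisy sufficient
# statistics (Laplace mechanism); raw records never leave a site. The
# server combines the noisy sums and counts into a global mean that can
# be used to impute missing values anywhere in the federation.
import math
import random

def laplace_noise(scale, rng):
    # Inverse-CDF sampling of a Laplace(0, scale) variate.
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def noisy_site_stats(values, epsilon, B, rng):
    """A site releases a perturbed sum and count; values lie in [0, B]."""
    s = sum(values) + laplace_noise(B / epsilon, rng)
    n = len(values) + laplace_noise(1 / epsilon, rng)
    return s, n

def federated_mean(sites, epsilon=1.0, B=100.0, seed=0):
    rng = random.Random(seed)
    stats = [noisy_site_stats(v, epsilon, B, rng) for v in sites]
    total = sum(s for s, _ in stats)
    count = sum(n for _, n in stats)
    return total / count

sites = [[60.0, 80.0], [70.0], [90.0, 50.0, 100.0]]
print(federated_mean(sites))  # noisy estimate near the true mean 75.0
```

Smaller epsilon means more noise and stronger privacy; in practice the budget is split across all statistics a site releases.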
13.8.5 Imputation in Streaming and Online Data Environments

The increasing use of streaming data in business and technology requires real-time imputation methods to ensure uninterrupted analysis.

  • Challenges:
    • Imputation must occur dynamically as data streams in.
    • Low latency and high accuracy are essential to maintain real-time decision-making.
  • Techniques:
    • Online Learning Algorithms: Update imputation models incrementally as new data arrives.
    • Sliding Window Methods: Use recent data to estimate and impute missing values in real time.
  • Applications:
    • IoT devices: Imputation in sensor networks for smart homes or industrial monitoring ensures continuous operation despite data transmission issues.
    • Financial markets: Streaming imputation models predict and fill gaps in real-time stock price feeds to inform trading algorithms.
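The sliding-window technique above can be sketched in a few lines: keep a fixed-length window of recent observed values and impute each missing reading with the window mean, updating incrementally as the stream advances. The class below is a minimal illustration, not a production streaming system.

```python
# Sliding-window imputation sketch for a data stream: missing readings
# (None) are replaced by the mean of the last `window` observed values;
# observed readings pass through and update the window.
from collections import deque

class SlidingWindowImputer:
    def __init__(self, window=5):
        self.recent = deque(maxlen=window)  # oldest values drop off

    def process(self, value):
        """Return the value, or an imputed one if it is missing."""
        if value is None:
            if not self.recent:
                return None  # nothing observed yet to impute from
            return sum(self.recent) / len(self.recent)
        self.recent.append(value)
        return value

imp = SlidingWindowImputer(window=3)
stream = [10.0, 12.0, None, 14.0, None]
print([imp.process(x) for x in stream])
# -> [10.0, 12.0, 11.0, 14.0, 12.0]
```

Because the window has fixed length, both memory use and per-item work are constant, which is what keeps latency low enough for the real-time settings described above.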