• A Guide on Data Analysis
  • Preface
    • How to Cite These Books
      • 0.0.1 Volume 1: Foundations of Data Analysis
      • 0.0.2 Volume 2: Regression Techniques for Data Analysis
      • 0.0.3 Volume 3: Advanced Modeling and Data Challenges
      • 0.0.4 Volume 4: Experimental Design
  • 1 Introduction
    • 1.1 General Recommendations
  • 2 Prerequisites
    • 2.1 Matrix Theory
      • 2.1.1 Rank of a Matrix
      • 2.1.2 Inverse of a Matrix
      • 2.1.3 Definiteness of a Matrix
      • 2.1.4 Matrix Calculus
      • 2.1.5 Optimization in Scalar and Vector Spaces
      • 2.1.6 Cholesky Decomposition
    • 2.2 Probability Theory
      • 2.2.1 Axioms and Theorems of Probability
      • 2.2.2 Central Limit Theorem
      • 2.2.3 Random Variable
      • 2.2.4 Moment Generating Function
      • 2.2.5 Moments
      • 2.2.6 Skewness
      • 2.2.7 Kurtosis
      • 2.2.8 Distributions
    • 2.3 General Math
      • 2.3.1 Number Sets
      • 2.3.2 Summation Notation and Series
      • 2.3.3 Taylor Expansion
      • 2.3.4 Law of Large Numbers
      • 2.3.5 Convergence
      • 2.3.6 Sufficient Statistics and Likelihood
      • 2.3.7 Parameter Transformations
    • 2.4 Data Import/Export
      • 2.4.1 Key Limitations of R
      • 2.4.2 Solutions and Workarounds
      • 2.4.3 Medium-Sized Data
      • 2.4.4 Large-Sized Data
    • 2.5 Data Manipulation
  • I. BASIC
  • 3 Descriptive Statistics
    • 3.1 Numerical Measures
    • 3.2 Graphical Measures
      • 3.2.1 Shape
      • 3.2.2 Scatterplot
    • 3.3 Normality Assessment
      • 3.3.1 Graphical Assessment
      • 3.3.2 Summary Statistics
    • 3.4 Bivariate Statistics
      • 3.4.1 Two Continuous
      • 3.4.2 Categorical and Continuous
      • 3.4.3 Two Discrete
      • 3.4.4 General Approach to Bivariate Statistics
  • 4 Basic Statistical Inference
    • 4.1 Hypothesis Testing Framework
      • 4.1.1 Null and Alternative Hypotheses
      • 4.1.2 Errors in Hypothesis Testing
      • 4.1.3 The Role of Distributions in Hypothesis Testing
      • 4.1.4 The Test Statistic
      • 4.1.5 Critical Values and Rejection Regions
      • 4.1.6 Visualizing Hypothesis Testing
    • 4.2 Key Concepts and Definitions
      • 4.2.1 Random Sample
      • 4.2.2 Sample Statistics
      • 4.2.3 Distribution of the Sample Mean
    • 4.3 One-Sample Inference
      • 4.3.1 For Single Mean
      • 4.3.2 For Difference of Means, Independent Samples
      • 4.3.3 For Difference of Means, Paired Samples
      • 4.3.4 For Difference of Two Proportions
      • 4.3.5 For Single Proportion
      • 4.3.6 For Single Variance
      • 4.3.7 Nonparametric Tests
    • 4.4 Two-Sample Inference
      • 4.4.1 For Means
      • 4.4.2 For Variances
      • 4.4.3 Power
      • 4.4.4 Matched Pair Designs
      • 4.4.5 Nonparametric Tests for Two Samples
    • 4.5 Categorical Data Analysis
      • 4.5.1 Association Tests
      • 4.5.2 Ordinal Association
      • 4.5.3 Ordinal Trend
    • 4.6 Divergence Metrics and Tests for Comparing Distributions
      • 4.6.1 Kolmogorov-Smirnov Test
      • 4.6.2 Anderson-Darling Test
      • 4.6.3 Chi-Square Goodness-of-Fit Test
      • 4.6.4 Cramér-von Mises Test
      • 4.6.5 Kullback-Leibler Divergence
      • 4.6.6 Jensen-Shannon Divergence
      • 4.6.7 Hellinger Distance
      • 4.6.8 Bhattacharyya Distance
      • 4.6.9 Wasserstein Distance
      • 4.6.10 Energy Distance
      • 4.6.11 Total Variation Distance
      • 4.6.12 Summary
  • II. REGRESSION
  • 5 Linear Regression
    • 5.1 Ordinary Least Squares
      • 5.1.1 Simple Regression (Basic) Model
      • 5.1.2 Multiple Linear Regression
    • 5.2 Generalized Least Squares
      • 5.2.1 Infeasible Generalized Least Squares
      • 5.2.2 Feasible Generalized Least Squares
      • 5.2.3 Weighted Least Squares
    • 5.3 Maximum Likelihood
      • 5.3.1 Motivation for MLE
      • 5.3.2 Key Quantities for Inference
      • 5.3.3 Assumptions of MLE
      • 5.3.4 Properties of MLE
      • 5.3.5 Practical Considerations
      • 5.3.6 Comparison of MLE and OLS
      • 5.3.7 Applications of MLE
    • 5.4 Penalized (Regularized) Estimators
      • 5.4.1 Motivation for Penalized Estimators
      • 5.4.2 Ridge Regression
      • 5.4.3 Lasso Regression
      • 5.4.4 Elastic Net
      • 5.4.5 Tuning Parameter Selection
      • 5.4.6 Properties of Penalized Estimators
    • 5.5 Robust Estimators
      • 5.5.1 Motivation for Robust Estimation
      • 5.5.2 \(M\)-Estimators
      • 5.5.3 \(R\)-Estimators
      • 5.5.4 \(L\)-Estimators
      • 5.5.5 Least Trimmed Squares (LTS)
      • 5.5.6 \(S\)-Estimators
      • 5.5.7 \(MM\)-Estimators
      • 5.5.8 Practical Considerations
    • 5.6 Partial Least Squares
      • 5.6.1 Motivation for PLS
      • 5.6.2 Steps to Construct PLS Components
      • 5.6.3 Properties of PLS
      • 5.6.4 Comparison with Related Methods
  • 6 Non-Linear Regression
    • 6.1 Inference
      • 6.1.1 Linear Functions of the Parameters
      • 6.1.2 Nonlinear Functions of Parameters
    • 6.2 Non-linear Least Squares Estimation
      • 6.2.1 Iterative Optimization
      • 6.2.2 Derivative-Free
      • 6.2.3 Stochastic Heuristic
      • 6.2.4 Linearization
      • 6.2.5 Hybrid
      • 6.2.6 Comparison of Nonlinear Optimizers
    • 6.3 Practical Considerations
      • 6.3.1 Selecting Starting Values
      • 6.3.2 Handling Constrained Parameters
      • 6.3.3 Failure to Converge
      • 6.3.4 Convergence to a Local Minimum
      • 6.3.5 Model Adequacy and Estimation Considerations
    • 6.4 Application
      • 6.4.1 Nonlinear Estimation Using Gauss-Newton Algorithm
      • 6.4.2 Logistic Growth Model
      • 6.4.3 Nonlinear Plateau Model
  • 7 Generalized Linear Models
    • 7.1 Logistic Regression
      • 7.1.1 Logistic Model
      • 7.1.2 Likelihood Function
      • 7.1.3 Fisher Information Matrix
      • 7.1.4 Inference in Logistic Regression
      • 7.1.5 Application: Logistic Regression
    • 7.2 Probit Regression
      • 7.2.1 Probit Model
      • 7.2.2 Application: Probit Regression
    • 7.3 Binomial Regression
      • 7.3.1 Dataset Overview
      • 7.3.2 Apply Logistic Model
      • 7.3.3 Apply Probit Model
    • 7.4 Poisson Regression
      • 7.4.1 The Poisson Distribution
      • 7.4.2 Poisson Model
      • 7.4.3 Link Function Choices
      • 7.4.4 Application: Poisson Regression
    • 7.5 Negative Binomial Regression
      • 7.5.1 Negative Binomial Distribution
      • 7.5.2 Application: Negative Binomial Regression
      • 7.5.3 Fitting a Zero-Inflated Negative Binomial Model
    • 7.6 Quasi-Poisson Regression
      • 7.6.1 Is Quasi-Poisson Regression a Generalized Linear Model?
      • 7.6.2 Application: Quasi-Poisson Regression
    • 7.7 Multinomial Logistic Regression
      • 7.7.1 The Multinomial Distribution
      • 7.7.2 Modeling Probabilities Using Log-Odds
      • 7.7.3 Softmax Representation
      • 7.7.4 Log-Odds Ratio Between Two Categories
      • 7.7.5 Estimation
      • 7.7.6 Interpretation of Coefficients
      • 7.7.7 Application: Multinomial Logistic Regression
      • 7.7.8 Application: Gamma Regression
    • 7.8 Generalization of Generalized Linear Models
      • 7.8.1 Exponential Family
      • 7.8.2 Properties of GLM Exponential Families
      • 7.8.3 Structure of a Generalized Linear Model
      • 7.8.4 Components of a GLM
      • 7.8.5 Canonical Link
      • 7.8.6 Inverse Link Functions
      • 7.8.7 Estimation of Parameters in GLMs
      • 7.8.8 Inference
      • 7.8.9 Deviance
      • 7.8.10 Diagnostic Plots
      • 7.8.11 Goodness of Fit
      • 7.8.12 Over-Dispersion
  • 8 Linear Mixed Models
    • 8.1 Dependent Data
      • 8.1.1 Motivation: A Repeated Measurements Example
      • 8.1.2 Example: Linear Mixed Model for Repeated Measurements
      • 8.1.3 Random-Intercepts Model
      • 8.1.4 Covariance Models in Linear Mixed Models
      • 8.1.5 Covariance Structures in Mixed Models
    • 8.2 Estimation in Linear Mixed Models
      • 8.2.1 Interpretation of the Mixed Model Equations
      • 8.2.2 Derivation of the Mixed Model Equations
      • 8.2.3 Bayesian Interpretation of Linear Mixed Models
      • 8.2.4 Estimating the Variance-Covariance Matrix
    • 8.3 Inference in Linear Mixed Models
      • 8.3.1 Inference for Fixed Effects (\(\beta\))
      • 8.3.2 Inference for Variance Components (\(\theta\))
    • 8.4 Information Criteria for Model Selection
      • 8.4.1 Akaike Information Criterion
      • 8.4.2 Corrected AIC
      • 8.4.3 Bayesian Information Criterion
      • 8.4.4 Practical Example with Linear Mixed Models
    • 8.5 Split-Plot Designs
      • 8.5.1 Example Setup
      • 8.5.2 Statistical Model for Split-Plot Designs
      • 8.5.3 Approaches to Analyzing Split-Plot Designs
      • 8.5.4 Application: Split-Plot Design
    • 8.6 Repeated Measures in Mixed Models
    • 8.7 Unbalanced or Unequally Spaced Data
      • 8.7.1 Variance-Covariance Structure: Power Model
    • 8.8 Application: Mixed Models in Practice
      • 8.8.1 Example 1: Pulp Brightness Analysis
      • 8.8.2 Example 2: Penicillin Yield (GLMM with Blocking)
      • 8.8.3 Example 3: Growth in Rats Over Time
      • 8.8.4 Example 4: Tree Water Use (Agridat)
  • 9 Nonlinear and Generalized Linear Mixed Models
    • 9.1 Nonlinear Mixed Models
    • 9.2 Generalized Linear Mixed Models
    • 9.3 Relationship Between NLMMs and GLMMs
    • 9.4 Marginal Properties of GLMMs
      • 9.4.1 Marginal Mean of \(y_i\)
      • 9.4.2 Marginal Variance of \(y_i\)
      • 9.4.3 Marginal Covariance of \(\mathbf{y}\)
    • 9.5 Estimation in Nonlinear and Generalized Linear Mixed Models
      • 9.5.1 Estimation by Numerical Integration
      • 9.5.2 Estimation by Linearization
      • 9.5.3 Estimation by Bayesian Hierarchical Models
      • 9.5.4 Practical Implementation in R
    • 9.6 Application: Nonlinear and Generalized Linear Mixed Models
      • 9.6.1 Binomial Data: CBPP Dataset
      • 9.6.2 Count Data: Owl Dataset
      • 9.6.3 Binomial Example: Gotway Hessian Fly Data
      • 9.6.4 Nonlinear Mixed Model: Yellow Poplar Data
    • 9.7 Summary
  • 10 Nonparametric Regression
    • 10.1 Why Nonparametric?
      • 10.1.1 Flexibility
      • 10.1.2 Fewer Assumptions
      • 10.1.3 Interpretability
      • 10.1.4 Practical Considerations
      • 10.1.5 Balancing Parametric and Nonparametric Approaches
    • 10.2 Basic Concepts in Nonparametric Estimation
      • 10.2.1 Bias-Variance Trade-Off
      • 10.2.2 Kernel Smoothing and Local Averages
    • 10.3 Kernel Regression
      • 10.3.1 Basic Setup
      • 10.3.2 Nadaraya–Watson Kernel Estimator
      • 10.3.3 Priestley–Chao Kernel Estimator
      • 10.3.4 Gasser–Müller Kernel Estimator
      • 10.3.5 Comparison of Kernel-Based Estimators
      • 10.3.6 Bandwidth Selection
      • 10.3.7 Asymptotic Properties
      • 10.3.8 Derivation of the Nadaraya–Watson Estimator
    • 10.4 Local Polynomial Regression
      • 10.4.1 Local Polynomial Fitting
      • 10.4.2 Mathematical Form of the Solution
      • 10.4.3 Bias, Variance, and Asymptotics
      • 10.4.4 Special Case: Local Linear Regression
      • 10.4.5 Bandwidth Selection
      • 10.4.6 Asymptotic Properties Summary
    • 10.5 Smoothing Splines
      • 10.5.1 Properties and Form of the Smoothing Spline
      • 10.5.2 Choice of \(\lambda\)
      • 10.5.3 Connection to Reproducing Kernel Hilbert Spaces
    • 10.6 Confidence Intervals in Nonparametric Regression
      • 10.6.1 Asymptotic Normality
      • 10.6.2 Bootstrap Methods
      • 10.6.3 Practical Considerations
    • 10.7 Generalized Additive Models
      • 10.7.1 Estimation via Penalized Likelihood
      • 10.7.2 Interpretation of GAMs
      • 10.7.3 Model Selection and Smoothing Parameter Estimation
      • 10.7.4 Extensions of GAMs
    • 10.8 Regression Trees and Random Forests
      • 10.8.1 Regression Trees
      • 10.8.2 Random Forests
      • 10.8.3 Theoretical Insights
      • 10.8.4 Feature Importance in Random Forests
      • 10.8.5 Advantages and Limitations of Tree-Based Methods
    • 10.9 Wavelet Regression
      • 10.9.1 Wavelet Series Expansion
      • 10.9.2 Wavelet Regression Model
      • 10.9.3 Wavelet Shrinkage and Thresholding
    • 10.10 Multivariate Nonparametric Regression
      • 10.10.1 The Curse of Dimensionality
      • 10.10.2 Multivariate Kernel Regression
      • 10.10.3 Multivariate Splines
      • 10.10.4 Additive Models (GAMs)
      • 10.10.5 Radial Basis Functions
    • 10.11 Conclusion: The Evolving Landscape of Regression Analysis
      • 10.11.1 Key Takeaways
      • 10.11.2 The Art and Science of Regression
      • 10.11.3 Looking Forward
      • 10.11.4 Final Thoughts
  • III. RAMIFICATIONS
  • 11 Data
    • 11.1 Data Types
      • 11.1.1 Qualitative vs. Quantitative Data
      • 11.1.2 Other Ways to Classify Data
      • 11.1.3 Data by Observational Structure Over Time
    • 11.2 Cross-Sectional Data
    • 11.3 Time Series Data
      • 11.3.1 Statistical Properties of Time Series Models
      • 11.3.2 Common Time Series Processes
      • 11.3.3 Deterministic Time Trends
      • 11.3.4 Violations of Exogeneity in Time Series Models
      • 11.3.5 Consequences of Exogeneity Violations
      • 11.3.6 Highly Persistent Data
      • 11.3.7 Unit Root Testing
      • 11.3.8 Newey-West Standard Errors
    • 11.4 Repeated Cross-Sectional Data
      • 11.4.1 Key Characteristics
      • 11.4.2 Statistical Modeling for Repeated Cross-Sections
      • 11.4.3 Advantages of Repeated Cross-Sectional Data
      • 11.4.4 Disadvantages of Repeated Cross-Sectional Data
    • 11.5 Panel Data
      • 11.5.1 Advantages of Panel Data
      • 11.5.2 Disadvantages of Panel Data
      • 11.5.3 Sources of Variation in Panel Data
      • 11.5.4 Pooled OLS Estimator
      • 11.5.5 Individual-Specific Effects Model
      • 11.5.6 Random Effects Estimator
      • 11.5.7 Fixed Effects Estimator
      • 11.5.8 Tests for Assumptions in Panel Data Analysis
      • 11.5.9 Model Selection in Panel Data
      • 11.5.10 Alternative Estimators
      • 11.5.11 Application
    • 11.6 Choosing the Right Type of Data
    • 11.7 Data Quality and Ethical Considerations
  • 12 Variable Transformation
    • 12.1 Continuous Variables
      • 12.1.1 Standardization (Z-score Normalization)
      • 12.1.2 Min-Max Scaling (Normalization)
      • 12.1.3 Square Root and Cube Root Transformations
      • 12.1.4 Logarithmic Transformation
      • 12.1.5 Exponential Transformation
      • 12.1.6 Power Transformation
      • 12.1.7 Inverse (Reciprocal) Transformation
      • 12.1.8 Hyperbolic Arcsine Transformation
      • 12.1.9 Ordered Quantile Normalization (Rank-Based Transformation)
      • 12.1.10 Lambert W × F Transformation
      • 12.1.11 Inverse Hyperbolic Sine Transformation
      • 12.1.12 Box-Cox Transformation
      • 12.1.13 Yeo-Johnson Transformation
      • 12.1.14 RankGauss Transformation
      • 12.1.15 Automatically Choosing the Best Transformation
    • 12.2 Categorical Variables
      • 12.2.1 One-Hot Encoding (Dummy Variables)
      • 12.2.2 Label Encoding
      • 12.2.3 Feature Hashing (Hash Encoding)
      • 12.2.4 Binary Encoding
      • 12.2.5 Base-N Encoding (Generalized Binary Encoding)
      • 12.2.6 Frequency Encoding
      • 12.2.7 Target Encoding (Mean Encoding)
      • 12.2.8 Ordinal Encoding
      • 12.2.9 Weight of Evidence Encoding
      • 12.2.10 Helmert Encoding
      • 12.2.11 Probability Ratio Encoding
      • 12.2.12 Backward Difference Encoding
      • 12.2.13 Leave-One-Out Encoding
      • 12.2.14 James-Stein Encoding
      • 12.2.15 M-Estimator Encoding
      • 12.2.16 Thermometer Encoding
      • 12.2.17 Choosing the Right Encoding Method
  • 13 Imputation (Missing Data)
    • 13.1 Introduction to Missing Data
      • 13.1.1 Types of Imputation
      • 13.1.2 When and Why to Use Imputation
      • 13.1.3 Importance of Missing Data Treatment in Statistical Modeling
      • 13.1.4 Prevalence of Missing Data Across Domains
      • 13.1.5 Practical Considerations for Imputation
    • 13.2 Theoretical Foundations of Missing Data
      • 13.2.1 Definition and Classification of Missing Data
      • 13.2.2 Missing Data Mechanisms
      • 13.2.3 Relationship Between Mechanisms and Ignorability
    • 13.3 Diagnosing the Missing Data Mechanism
      • 13.3.1 Descriptive Methods
      • 13.3.2 Statistical Tests for Missing Data Mechanisms
      • 13.3.3 Assessing MAR and MNAR
    • 13.4 Methods for Handling Missing Data
      • 13.4.1 Basic Methods
      • 13.4.2 Single Imputation Techniques
      • 13.4.3 Machine Learning and Modern Approaches
      • 13.4.4 Multiple Imputation
    • 13.5 Evaluation of Imputation Methods
      • 13.5.1 Statistical Metrics for Assessing Imputation Quality
      • 13.5.2 Bias-Variance Tradeoff in Imputation
      • 13.5.3 Sensitivity Analysis
      • 13.5.4 Validation Using Simulated Data and Real-World Case Studies
    • 13.6 Criteria for Choosing an Effective Approach
    • 13.7 Challenges and Ethical Considerations
      • 13.7.1 Challenges in High-Dimensional Data
      • 13.7.2 Missing Data in Big Data Contexts
      • 13.7.3 Ethical Concerns
    • 13.8 Emerging Trends in Missing Data Handling
      • 13.8.1 Advances in Neural Network Approaches
      • 13.8.2 Integration with Reinforcement Learning
      • 13.8.3 Synthetic Data Generation for Missing Data
      • 13.8.4 Federated Learning and Privacy-Preserving Imputation
      • 13.8.5 Imputation in Streaming and Online Data Environments
    • 13.9 Application of Imputation
      • 13.9.1 Visualizing Missing Data
      • 13.9.2 How Many Imputations?
      • 13.9.3 Generating Missing Data for Demonstration
      • 13.9.4 Imputation with Mean, Median, and Mode
      • 13.9.5 K-Nearest Neighbors (KNN) Imputation
      • 13.9.6 Imputation with Decision Trees (rpart)
      • 13.9.7 MICE (Multivariate Imputation via Chained Equations)
      • 13.9.8 Amelia
      • 13.9.9 missForest
      • 13.9.10 Hmisc
      • 13.9.11 mi
  • 14 Model Specification Tests
    • 14.1 Nested Model Tests
      • 14.1.1 Wald Test
      • 14.1.2 Likelihood Ratio Test
      • 14.1.3 F-Test (for Linear Regression)
      • 14.1.4 Chow Test
    • 14.2 Non-Nested Model Tests
      • 14.2.1 Vuong Test
      • 14.2.2 Davidson–MacKinnon J-Test
      • 14.2.3 Adjusted \(R^2\)
      • 14.2.4 Comparing Models with Transformed Dependent Variables
    • 14.3 Heteroskedasticity Tests
      • 14.3.1 Breusch–Pagan Test
      • 14.3.2 White Test
      • 14.3.3 Goldfeld–Quandt Test
      • 14.3.4 Park Test
      • 14.3.5 Glejser Test
      • 14.3.6 Summary of Heteroskedasticity Tests
    • 14.4 Functional Form Tests
      • 14.4.1 Ramsey RESET Test (Regression Equation Specification Error Test)
      • 14.4.2 Harvey–Collier Test
      • 14.4.3 Rainbow Test
      • 14.4.4 Summary of Functional Form Tests
    • 14.5 Autocorrelation Tests
      • 14.5.1 Durbin–Watson Test
      • 14.5.2 Breusch–Godfrey Test
      • 14.5.3 Ljung–Box Test (or Box–Pierce Test)
      • 14.5.4 Runs Test
      • 14.5.5 Summary of Autocorrelation Tests
    • 14.6 Multicollinearity Diagnostics
      • 14.6.1 Variance Inflation Factor
      • 14.6.2 Tolerance Statistic
      • 14.6.3 Condition Index and Eigenvalue Decomposition
      • 14.6.4 Pairwise Correlation Matrix
      • 14.6.5 Determinant of the Correlation Matrix
      • 14.6.6 Summary of Multicollinearity Diagnostics
      • 14.6.7 Addressing Multicollinearity
  • 15 Variable Selection
    • 15.1 Filter Methods (Statistical Criteria, Model-Agnostic)
      • 15.1.1 Information Criteria-Based Selection
      • 15.1.2 Univariate Selection Methods
      • 15.1.3 Correlation-Based Feature Selection
      • 15.1.4 Variance Thresholding
    • 15.2 Wrapper Methods (Model-Based Subset Evaluation)
      • 15.2.1 Best Subsets Algorithm
      • 15.2.2 Stepwise Selection Methods
      • 15.2.3 Branch-and-Bound Algorithm
      • 15.2.4 Recursive Feature Elimination
    • 15.3 Embedded Methods (Integrated into Model Training)
      • 15.3.1 Regularization-Based Selection
      • 15.3.2 Tree-Based Feature Importance
      • 15.3.3 Genetic Algorithms
    • 15.4 Summary Table
  • 16 Hypothesis Testing
    • 16.1 Null Hypothesis Significance Testing
      • 16.1.1 Error Types in Hypothesis Testing
      • 16.1.2 Hypothesis Testing Framework
      • 16.1.3 Interpreting Hypothesis Testing Results
      • 16.1.4 Understanding p-Values
      • 16.1.5 The Role of Sample Size
      • 16.1.6 p-Value Hacking
      • 16.1.7 Practical vs. Statistical Significance
      • 16.1.8 Mitigating the Misuse of p-Values
      • 16.1.9 Wald Test
      • 16.1.10 Likelihood Ratio Test
      • 16.1.11 Lagrange Multiplier (Score) Test
      • 16.1.12 Comparing Hypothesis Tests
    • 16.2 Two One-Sided Tests (TOST) Equivalence Testing
      • 16.2.1 When to Use TOST?
      • 16.2.2 Interpretation of the TOST Procedure
      • 16.2.3 Relationship to Confidence Intervals
      • 16.2.4 Example 1: Testing the Equivalence of Two Means
      • 16.2.5 Advantages of TOST Equivalence Testing
      • 16.2.6 When Not to Use TOST
    • 16.3 False Discovery Rate
      • 16.3.1 Benjamini-Hochberg Procedure
      • 16.3.2 Benjamini-Yekutieli Procedure
      • 16.3.3 Storey’s q-value Approach
      • 16.3.4 Summary of False Discovery Rate Methods
    • 16.4 Comparison of Testing Frameworks
  • 17 Marginal Effects
    • 17.1 Definition of Marginal Effects
      • 17.1.1 Analytical Derivation of Marginal Effects
      • 17.1.2 Numerical Approximation of Marginal Effects
    • 17.2 Marginal Effects in Different Contexts
    • 17.3 Marginal Effects Interpretation
    • 17.4 Delta Method
    • 17.5 Comparison: Delta Method vs. Alternative Approaches
      • 17.5.1 Example: Applying the Delta Method in a logistic regression
    • 17.6 Types of Marginal Effects
      • 17.6.1 Average Marginal Effect
      • 17.6.2 Marginal Effects at the Mean
      • 17.6.3 Marginal Effects at the Average
    • 17.7 Packages for Marginal Effects
      • 17.7.1 marginaleffects Package (Recommended)
      • 17.7.2 margins Package
      • 17.7.3 mfx Package
      • 17.7.4 Comparison of Packages
  • 18 Moderation
    • 18.1 Types of Moderation Analyses
    • 18.2 Key Terminology
    • 18.3 Moderation Model
    • 18.4 Types of Interactions
    • 18.5 Three-Way Interactions
    • 18.6 Additional Resources
    • 18.7 Application
      • 18.7.1 emmeans Package
      • 18.7.2 probemod Package
      • 18.7.3 interactions Package
      • 18.7.4 interactionR Package
      • 18.7.5 sjPlot Package
      • 18.7.6 Summary of Moderation Analysis Packages
    • 18.8 Interaction Debate: Binning Estimators vs. Generalized Additive Models
      • 18.8.1 The Stakes
      • 18.8.2 The Core Problem: When Linearity Fails
      • 18.8.3 Binning Estimator Approach
      • 18.8.4 Simonsohn’s Critique
      • 18.8.5 Simonsohn’s Core Criticism
      • 18.8.6 Generalized Additive Models Alternative
      • 18.8.7 Mathematical Foundations of the Disagreement
      • 18.8.8 Mathematical Example
      • 18.8.9 When to Use Each Method
      • 18.8.10 A Decision Tree
  • 19 Mediation
    • 19.1 Traditional Approach
      • 19.1.1 Steps in the Traditional Mediation Model
      • 19.1.2 Graphical Representation of Mediation
      • 19.1.3 Measuring Mediation
      • 19.1.4 Assumptions in Linear Mediation Models
      • 19.1.5 Testing for Mediation
      • 19.1.6 Additional Considerations
      • 19.1.7 Assumptions in Mediation Analysis
      • 19.1.8 Indirect Effect Tests
      • 19.1.9 Power Analysis for Mediation
      • 19.1.10 Multiple Mediation Analysis
      • 19.1.11 Multiple Treatments in Mediation
    • 19.2 Causal Inference Approach to Mediation
      • 19.2.1 Example: Traditional Mediation Analysis
      • 19.2.2 Two Approaches in Causal Mediation Analysis
  • 20 Prediction and Estimation
    • 20.1 Conceptual Framing
      • 20.1.1 Predictive Modeling
      • 20.1.2 Estimation or Causal Inference
    • 20.2 Mathematical Setup
      • 20.2.1 Probability Space and Data
      • 20.2.2 Loss Functions and Risk
    • 20.3 Prediction in Detail
      • 20.3.1 Empirical Risk Minimization and Generalization
      • 20.3.2 Bias-Variance Decomposition
      • 20.3.3 Example: Linear Regression for Prediction
      • 20.3.4 Applications in Economics
    • 20.4 Parameter Estimation and Causal Inference
      • 20.4.1 Estimation in Parametric Models
      • 20.4.2 Causal Inference Fundamentals
      • 20.4.3 Role of Identification
      • 20.4.4 Challenges
    • 20.5 Causation versus Prediction
    • 20.6 Illustrative Equations and Mathematical Contrasts
      • 20.6.1 Risk Minimization vs. Consistency
      • 20.6.2 Partial Derivatives vs. Predictions
      • 20.6.3 Example: High-Dimensional Regularization
      • 20.6.4 Potential Outcomes Notation
    • 20.7 Extended Mathematical Points
      • 20.7.1 M-Estimation and Asymptotic Theory
      • 20.7.2 The Danger of Omitted Variables
      • 20.7.3 Cross-Validation vs. Statistical Testing
    • 20.8 Putting It All Together: Comparing Objectives
    • 20.9 Conclusion
  • IV. CAUSAL INFERENCE
  • 21 Causal Inference
    • 21.1 The Formal Notation of Causality
    • 21.2 Simpson’s Paradox
      • 21.2.1 Comparison between Simpson’s Paradox and Omitted Variable Bias
      • 21.2.2 Illustrating Simpson’s Paradox: Marketing Campaign Success Rates
      • 21.2.3 Why Does This Happen?
      • 21.2.4 How Does Causal Inference Solve This?
      • 21.2.5 Correcting Simpson’s Paradox with Regression Adjustment
      • 21.2.6 Key Takeaways
    • 21.3 Experimental vs. Quasi-Experimental Designs
      • 21.3.1 Criticisms of Quasi-Experimental Designs
    • 21.4 Hierarchical Ordering of Causal Tools
    • 21.5 Types of Validity in Research
      • 21.5.1 Measurement Validity
      • 21.5.2 Construct Validity
      • 21.5.3 Criterion Validity
      • 21.5.4 Internal Validity
      • 21.5.5 External Validity
      • 21.5.6 Ecological Validity
      • 21.5.7 Statistical Conclusion Validity
      • 21.5.8 Putting It All Together
    • 21.6 Types of Subjects in a Treatment Setting
      • 21.6.1 Non-Switchers
      • 21.6.2 Switchers
      • 21.6.3 Classification of Individuals Based on Treatment Assignment
    • 21.7 Types of Treatment Effects
      • 21.7.1 Average Treatment Effect
      • 21.7.2 Conditional Average Treatment Effect
      • 21.7.3 Intention-to-Treat Effect
      • 21.7.4 Local Average Treatment Effects
      • 21.7.5 Population vs. Sample Average Treatment Effects
      • 21.7.6 Average Treatment Effects on the Treated and Control
      • 21.7.7 Quantile Average Treatment Effects
      • 21.7.8 Log-Odds Treatment Effects for Binary Outcomes
      • 21.7.9 Summary Table: Treatment Effect Estimands
  • A. EXPERIMENTAL DESIGN
  • 22 Experimental Design
    • 22.1 Principles of Experimental Design
    • 22.2 The Gold Standard: Randomized Controlled Trials
    • 22.3 Selection Problem
      • 22.3.1 The Observed Difference in Outcomes
      • 22.3.2 Eliminating Selection Bias with Random Assignment
      • 22.3.3 Another Representation Under Regression
    • 22.4 Classical Experimental Designs
      • 22.4.1 Completely Randomized Design
      • 22.4.2 Randomized Block Design
      • 22.4.3 Factorial Design
      • 22.4.4 Crossover Design
      • 22.4.5 Split-Plot Design
      • 22.4.6 Latin Square Design
    • 22.5 Advanced Experimental Designs
      • 22.5.1 Semi-Random Experiments
      • 22.5.2 Re-Randomization
      • 22.5.3 Two-Stage Randomized Experiments
      • 22.5.4 Two-Stage Randomized Experiments with Interference and Noncompliance
      • 22.5.5 Switchback Experiments with Surrogate Variables
    • 22.6 Emerging Research
      • 22.6.1 Covariate Balancing in Online A/B Testing: The Pigeonhole Design
      • 22.6.2 Handling Zero-Valued Outcomes
  • 23 Sampling
    • 23.1 Population and Sample
    • 23.2 Probability Sampling
      • 23.2.1 Simple Random Sampling
      • 23.2.2 Stratified Sampling
      • 23.2.3 Systematic Sampling
      • 23.2.4 Cluster Sampling
    • 23.3 Non-Probability Sampling
      • 23.3.1 Convenience Sampling
      • 23.3.2 Quota Sampling
      • 23.3.3 Snowball Sampling
    • 23.4 Unequal Probability Sampling
    • 23.5 Balanced Sampling
      • 23.5.1 Cube Method for Balanced Sampling
      • 23.5.2 Balanced Sampling with Stratification
      • 23.5.3 Balanced Sampling in Cluster Sampling
      • 23.5.4 Balanced Sampling in Two-Stage Sampling
    • 23.6 Sample Size Determination
  • 24 Analysis of Variance
    • 24.1 Completely Randomized Design
      • 24.1.1 Single-Factor Fixed Effects ANOVA
      • 24.1.2 Single-Factor Random Effects ANOVA
      • 24.1.3 Two-Factor Fixed Effects ANOVA
      • 24.1.4 Two-Way Random Effects ANOVA
      • 24.1.5 Two-Way Mixed Effects ANOVA
    • 24.2 Nonparametric ANOVA
      • 24.2.1 Kruskal-Wallis Test (Nonparametric One-Way ANOVA)
      • 24.2.2 Friedman Test (Nonparametric Two-Way ANOVA)
    • 24.3 Randomized Block Designs
    • 24.4 Nested Designs
      • 24.4.1 Two-Factor Nested Design
      • 24.4.2 Unbalanced Nested Two-Factor Designs
      • 24.4.3 Random Factor Effects
    • 24.5 Sample Size Planning for ANOVA
      • 24.5.1 Balanced Designs
      • 24.5.2 Single-Factor Studies
      • 24.5.3 Multi-Factor Studies
      • 24.5.4 Procedure for Sample Size Selection
      • 24.5.5 Randomized Block Experiments
    • 24.6 Single-Factor Covariance Model
      • 24.6.1 Statistical Inference for Treatment Effects
      • 24.6.2 Testing for Parallel Slopes
      • 24.6.3 Adjusted Means
  • 25 Multivariate Methods
    • 25.1 Basic Understanding
      • 25.1.1 Multivariate Random Vectors
      • 25.1.2 Covariance Matrix
      • 25.1.3 Equalities in Expectation and Variance
      • 25.1.4 Multivariate Normal Distribution
      • 25.1.5 Test of Multivariate Normality
      • 25.1.6 Mean Vector Inference
      • 25.1.7 General Hypothesis Testing
    • 25.2 Multivariate Analysis of Variance
      • 25.2.1 One-Way MANOVA
      • 25.2.2 Profile Analysis
A Guide on Data Analysis

40.1 Recommended Structure

40.1.1 Phase 1: Exploratory Data Analysis (EDA)

Understanding Your Data Landscape

Before embarking on any modeling endeavor, immerse yourself thoroughly in your data. Exploratory data analysis serves as the foundation upon which all subsequent analysis rests. This critical phase allows you to develop intuition about your dataset, identify potential challenges, and formulate preliminary hypotheses that will guide your modeling decisions.

Visual Exploration and Data Visualization

Begin by creating a comprehensive suite of visualizations that reveal the character and structure of your data. Univariate plots such as histograms, density plots, and boxplots illuminate the distribution of individual variables, revealing whether they follow normal, skewed, bimodal, or other distribution patterns. These visualizations immediately expose the presence of extreme values and help you understand the central tendency and spread of each variable.

For continuous variables, construct detailed histograms with appropriate bin widths to capture the true shape of the distribution. Overlay kernel density estimates to smooth out the discrete nature of histograms and reveal underlying patterns. Complement these with boxplots that succinctly display the five-number summary while making outliers immediately visible.
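A minimal sketch of this step in R, using the built-in `mtcars` data as a stand-in for your own variables:

```r
# Histogram with an overlaid kernel density estimate, then a boxplot,
# for a single continuous variable (mtcars$mpg stands in for your data).
library(ggplot2)

ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(aes(y = after_stat(density)), bins = 15,
                 fill = "grey85", colour = "white") +
  geom_density(linewidth = 1) +
  labs(x = "Miles per gallon", y = "Density")

# Boxplot: the five-number summary at a glance, with points beyond
# 1.5 * IQR drawn individually as potential outliers.
ggplot(mtcars, aes(y = mpg)) +
  geom_boxplot()
```

Adjust `bins` until the histogram captures the distribution's shape without over-smoothing or fragmenting it.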

For categorical variables, develop bar charts and frequency tables that show the distribution of observations across categories. Pay particular attention to class imbalance, as severely imbalanced categories can create challenges for certain modeling approaches and may require special handling techniques such as stratified sampling or synthetic minority oversampling.
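A quick imbalance check only needs counts and shares per category (again using `mtcars$cyl` as a stand-in):

```r
# Counts and proportions for a categorical variable; small shares in
# some categories signal class imbalance worth handling explicitly.
tab <- table(mtcars$cyl)
tab
round(prop.table(tab), 2)  # share of observations per category
```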

Next, transition to bivariate and multivariate visualizations that expose relationships between variables. Scatter plots reveal correlations, non-linear relationships, and interaction effects between continuous variables. When examining the relationship between a continuous outcome and categorical predictors, construct side-by-side boxplots or violin plots that simultaneously display distribution shape and central tendency across groups.

Correlation matrices presented as heatmaps provide an at-a-glance understanding of linear relationships among all continuous variables in your dataset. Use color gradients thoughtfully to make strong positive and negative correlations immediately apparent. Augment simple correlation coefficients with scatter plot matrices that allow you to visually inspect the nature of each pairwise relationship.
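In base R, both views take one line each; a sketch on a few numeric `mtcars` columns standing in for your own:

```r
# Correlation matrix over numeric columns, shown two ways: a heatmap
# for at-a-glance strength, and a scatter-plot matrix for inspecting
# the shape of each pairwise relationship.
num_vars <- mtcars[, c("mpg", "disp", "hp", "wt")]
cmat <- cor(num_vars)
round(cmat, 2)

heatmap(cmat, symm = TRUE, Rowv = NA, Colv = NA)  # keep original ordering
pairs(num_vars)
```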

For more complex multivariate patterns, consider dimension reduction techniques such as principal component analysis (PCA) or t-distributed stochastic neighbor embedding (t-SNE). While these methods will be explored more rigorously later, preliminary visualizations in reduced dimensional space can reveal clustering, separation between groups, or other high-dimensional structure that would otherwise remain hidden.
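A preliminary PCA view can be sketched with `prcomp` (standardizing first so no single variable dominates):

```r
# First two principal components of the standardized numeric variables;
# plotting the scores is a quick check for clustering or separation.
num_vars <- mtcars[, c("mpg", "disp", "hp", "wt")]
pca <- prcomp(num_vars, scale. = TRUE)
summary(pca)  # proportion of variance explained per component
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")
```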

Preliminary Statistical Results

Complement your visual exploration with descriptive statistics that quantify the properties you’ve observed graphically. Calculate measures of central tendency including means, medians, and modes for each variable. Assess spread through standard deviations, interquartile ranges, and ranges. For skewed distributions, report robust statistics that are less sensitive to extreme values.
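In R, the standard and robust summaries sit side by side (with `mtcars$mpg` as a stand-in):

```r
# Central tendency and spread, including robust alternatives (median,
# IQR, MAD) that are less sensitive to extreme values in skewed data.
x <- mtcars$mpg
c(mean   = mean(x),   median = median(x),
  sd     = sd(x),     iqr    = IQR(x),
  mad    = mad(x),    range  = diff(range(x)))
```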

Construct detailed contingency tables for categorical variables, including both counts and proportions. Calculate marginal and conditional distributions to understand how categories relate to one another. For key relationships of interest, compute preliminary effect sizes or correlation coefficients to quantify the strength of associations.

Perform initial hypothesis tests where appropriate, but interpret these exploratory results with appropriate caution. At this stage, you are generating hypotheses rather than testing pre-specified ones, so traditional significance thresholds should be applied conservatively. Consider adjusting for multiple comparisons if you conduct numerous exploratory tests, or better yet, clearly distinguish between confirmatory and exploratory findings in your narrative.
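Adjustment for multiple exploratory tests is a one-liner with `p.adjust`; the p-values below are hypothetical placeholders:

```r
# Adjusting a batch of exploratory p-values: Benjamini-Hochberg controls
# the false discovery rate; Bonferroni controls the family-wise error
# rate and is more conservative.
pvals <- c(0.01, 0.03, 0.20, 0.04)
p.adjust(pvals, method = "BH")
p.adjust(pvals, method = "bonferroni")
```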

Identifying Interesting Patterns, Structure, and Features

As you explore your data, remain vigilant for unexpected patterns that might inform your modeling strategy or reveal important substantive insights. Look for evidence of subgroups or clusters within your data that might suggest the need for hierarchical models, mixture models, or stratified analyses. Notice whether relationships between variables appear consistent across the full range of the data or if they change in character at certain thresholds.

Temporal patterns deserve special attention if your data have any time-series component. Plot variables across time to identify trends, seasonality, or structural breaks that might violate independence assumptions or require specialized time-series modeling approaches. Even in cross-sectional data, consider whether unobserved temporal factors might have introduced systematic patterns.

Geographic or spatial patterns should similarly be explored if your data have spatial attributes. Map-based visualizations can reveal spatial autocorrelation or clustering that standard models might miss. If present, such patterns may necessitate spatial statistical methods that explicitly model dependence structures.

Pay attention to the relationship between variance and mean across groups or conditions. Heteroscedasticity, where the variability of your outcome changes systematically with predictor values, will violate key assumptions of many standard models and may require variance-stabilizing transformations or more flexible modeling frameworks.

Outlier Detection and Characterization

Devote substantial attention to identifying and understanding outliers, which are observations that differ markedly from the overall pattern in your data. Begin with univariate outlier detection using methods such as the \(1.5 \times IQR\) rule for boxplots, which flags points falling more than 1.5 times the interquartile range beyond the first or third quartile. For normally distributed data, consider threshold rules based on standard deviations, such as flagging observations more than three standard deviations from the mean.
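Both rules are easy to encode directly (with `mtcars$mpg` standing in for your variable):

```r
# Flagging univariate outliers two ways: the 1.5 * IQR boxplot rule and
# a three-standard-deviation rule for roughly normal data.
x <- mtcars$mpg
q <- quantile(x, c(0.25, 0.75))
iqr <- q[2] - q[1]

flag_iqr <- x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
flag_sd  <- abs(x - mean(x)) > 3 * sd(x)

which(flag_iqr)  # indices flagged by the boxplot rule
which(flag_sd)   # indices flagged by the 3-SD rule
```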

Extend your outlier analysis to the multivariate space, where observations that appear unremarkable in any single dimension may nonetheless be anomalous in their combination of values. Mahalanobis distance measures how far each observation lies from the center of the multivariate distribution, accounting for correlations between variables. Cook’s distance and other influence diagnostics, while typically associated with model diagnostics, can also be calculated at this exploratory stage to identify observations that might exert disproportionate influence on subsequent analyses.
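A multivariate screen via `mahalanobis` might look like this; under multivariate normality the squared distances are approximately chi-squared with \(p\) degrees of freedom, which supplies a conventional cutoff:

```r
# Multivariate outlier screen: squared Mahalanobis distance from the
# multivariate center, compared against a chi-squared 97.5% cutoff.
X <- mtcars[, c("mpg", "hp", "wt")]
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))
cutoff <- qchisq(0.975, df = ncol(X))
which(d2 > cutoff)  # candidate multivariate outliers to investigate
```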

Crucially, resist the temptation to automatically discard outliers. Instead, investigate each carefully to understand its origin and nature.

  • Is it a data entry error that should be corrected?

  • Is it a legitimate but rare event that contains valuable information?

  • Does it represent a different population that should be analyzed separately?

Document your decisions transparently, presenting results both with and without questionable observations when appropriate, so readers can assess the robustness of your conclusions.

Consider the domain context when evaluating outliers. In some fields, extreme values may be the most scientifically interesting observations, while in others they may represent measurement errors or irrelevant anomalies. Consult with subject matter experts to properly interpret unusual observations and make informed decisions about their treatment.

40.1.2 Phase 2: Model Selection and Specification

Articulating Model Assumptions

Every statistical model rests on a foundation of assumptions, and making these explicit is essential for proper interpretation and assessment of your results. Begin by clearly stating the distributional assumptions your model makes about the outcome variable. Does your model assume normally distributed errors, or are you working within a generalized linear model framework that allows for binomial, Poisson, or other distributional families?

Detail the assumptions about the relationship between predictors and outcome. Most commonly, models assume linearity in parameters, meaning that the expected outcome changes by a constant amount for each unit change in a predictor (possibly after appropriate transformation or link function). If your model permits non-linear relationships through polynomial terms, splines, or other flexible forms, explain the functional form you’ve adopted and why.

Independence assumptions warrant careful consideration. Standard regression assumes that observations are independent of one another, but this is frequently violated in practice by clustering (students within schools, measurements within individuals), spatial dependence, or temporal autocorrelation. If such dependencies exist in your data structure, acknowledge them explicitly and describe how your model accounts for them, whether through mixed effects, robust standard errors, or specialized correlation structures.
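One common way to encode such dependence is a random intercept per cluster; a sketch using `lme4` and its bundled `sleepstudy` data (repeated reaction-time measurements nested within subjects):

```r
# Random-intercept model: measurements are nested within subjects, so a
# per-subject intercept absorbs the within-cluster dependence that a
# standard regression would wrongly treat as independent.
library(lme4)
m <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
summary(m)  # fixed effect of Days plus between-subject variance
```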

Homoscedasticity, the assumption of constant error variance, should be stated and later verified. Many standard inferential procedures assume that the variance of your outcome does not depend on predictor values or fitted values, though weighted regression or generalized linear models can accommodate heteroscedastic errors when this assumption is untenable.

Additional assumptions relevant to specific methods should be documented. For causal inference, state clearly what identification assumptions are necessary for causal interpretation, such as ignorability, no unmeasured confounding, or valid instrumental variables. For time series models, describe stationarity assumptions. For machine learning approaches, discuss assumptions about the relationship between training and test data distributions.

Justifying Your Modeling Approach

After articulating assumptions, provide a compelling rationale for why your chosen model is the most appropriate tool for addressing your research question. Connect the model selection directly to your scientific objectives. If your goal is prediction, emphasize the model’s predictive performance and its ability to generalize to new data. If your goal is inference about specific parameters, justify how the model structure allows for valid and efficient estimation of those parameters.

Consider the nature of your outcome variable in justifying your approach. Continuous outcomes measured on an interval or ratio scale typically call for linear regression or its extensions, while binary outcomes necessitate logistic regression or other classification methods. Count data often require Poisson or negative binomial regression, while time-to-event data demand survival analysis techniques. Ordinal outcomes merit specialized methods that respect the ordered nature of categories.
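The `glm` interface makes this mapping explicit through its `family` argument; a sketch with the binary `mtcars$am` (0/1 transmission type) as a stand-in outcome:

```r
# Matching the distributional family to the outcome type: a binary
# outcome calls for logistic regression via family = binomial.
m_logit <- glm(am ~ hp + wt, data = mtcars, family = binomial)
coef(m_logit)  # effects on the log-odds scale

# Counts would use family = poisson (or a negative binomial model if
# overdispersed); continuous outcomes reduce to lm() / family = gaussian.
```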

Discuss how your model handles the specific challenges present in your data. If you have high-dimensional data with more predictors than observations, explain your choice of regularization method such as ridge, lasso, or elastic net regression. If multicollinearity is a concern, describe how your approach mitigates its effects, whether through variable selection, principal component regression, or Bayesian methods with informative priors.

Address computational considerations when relevant. Some modeling approaches that are theoretically ideal may be computationally intractable for large datasets, while others scale efficiently. If you’ve made tradeoffs between statistical optimality and computational feasibility, acknowledge this transparently and describe any steps taken to validate that the chosen approach provides adequate performance.

Compare your chosen model to reasonable alternatives, explaining why you’ve selected one approach over others. This comparative discussion demonstrates that you’ve thoughtfully considered multiple options rather than defaulting to a familiar method. You might compare parametric versus non-parametric approaches, frequentist versus Bayesian frameworks, or simple versus complex model structures, weighing their relative advantages and limitations in your specific context.

Considering Interactions, Collinearity, and Dependence

Interaction effects represent situations where the effect of one predictor on the outcome depends on the value of another predictor. During model specification, consider whether substantive theory suggests important interactions, and explore whether your exploratory analysis revealed evidence of effect modification. Interaction terms can substantially improve model fit and provide crucial scientific insights, but they also increase model complexity and can make interpretation challenging.

When including interactions, think carefully about whether to also include the constituent main effects (you almost always should, to maintain the principle of marginality), and consider centering continuous variables before forming interaction terms to reduce collinearity and aid interpretation. Visualize predicted values across different combinations of interacting variables to help readers understand these complex relationships.
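The effect of centering can be seen directly in a small Python/NumPy sketch; the predictor distributions are hypothetical, chosen with means far from zero so the uncentered product is strongly collinear with its main effect:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(loc=10, scale=2, size=500)    # hypothetical predictors
x2 = rng.normal(loc=5, scale=0.5, size=500)   # with means far from zero

raw_inter = x1 * x2                                 # uncentered product
cen_inter = (x1 - x1.mean()) * (x2 - x2.mean())     # centered product

# The raw product is strongly correlated with its main effect;
# centering removes most of that nonessential collinearity.
r_raw = abs(np.corrcoef(x1, raw_inter)[0, 1])
r_cen = abs(np.corrcoef(x1, cen_inter)[0, 1])
print(round(r_raw, 2), round(r_cen, 2))
```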

Multicollinearity, the presence of strong linear relationships among predictors, can create serious problems for parameter estimation and interpretation. Severely collinear predictors lead to unstable coefficient estimates with inflated standard errors, making it difficult to isolate the individual effect of any single predictor. Assess collinearity using variance inflation factors (VIF), with values exceeding 5 or 10 typically indicating problematic levels.
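The VIF for predictor \(j\) is \(1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing that predictor on all the others. A self-contained Python/NumPy sketch on simulated data (the predictors and near-collinear construction are invented for the example):

```python
import numpy as np

def vif(X):
    # VIF_j = 1 / (1 - R_j^2), where R_j^2 is from regressing
    # column j on all other columns (with an intercept).
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)   # nearly collinear with x1
x3 = rng.normal(size=300)                   # independent predictor
v = vif(np.column_stack([x1, x2, x3]))
print(np.round(v, 1))   # x1 and x2 far above 10; x3 near 1
```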

When high collinearity is detected, several remedial strategies exist. You might remove one of a highly correlated pair of predictors based on theoretical considerations or measurement quality. Alternatively, combine collinear predictors into composite scores or indices that capture their shared information. Regularization methods such as ridge regression explicitly address collinearity by shrinking coefficient estimates. In some cases, severe collinearity simply reflects reality and must be acknowledged as a limitation, particularly when you need to include certain predictors for theoretical completeness despite their intercorrelation.

Dependence structures in your data require special modeling approaches. For clustered data, where observations are nested within groups, mixed effects (multilevel or hierarchical) models partition variance into within-group and between-group components and account for the correlation among observations from the same cluster. Specify both fixed effects that represent average relationships and random effects that allow these relationships to vary across clusters.

For longitudinal data with repeated measurements on the same units, consider growth curve models, generalized estimating equations (GEE), or transition models depending on your research question. Each approach handles the correlation among repeated measures differently and allows for different types of inference, so select the framework that best matches your substantive goals.

Spatial or network dependence calls for specialized models that explicitly represent connections between observations. Spatial autoregressive models, geographically weighted regression, or network autocorrelation models may be appropriate depending on the structure of spatial or social relationships in your data.

40.1.3 Phase 3: Model Fitting and Diagnostic Assessment

Evaluating Overall Model Fit

After estimating your model, systematically evaluate how well it fits the observed data. Begin with summary statistics that quantify the proportion of variance explained. For linear models, the coefficient of determination (\(R^2\)) indicates what fraction of outcome variance is captured by your predictors, while adjusted \(R^2\) penalizes model complexity to discourage overfitting. Recognize that while \(R^2\) provides a useful summary, it doesn’t tell the whole story about model adequacy, and even low \(R^2\) values can be scientifically important if they represent relationships that are difficult to predict.
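Both quantities are simple to compute from residuals; a Python/NumPy sketch on simulated data (the toy regression is invented for the example, with adjusted \(R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1)\)):

```python
import numpy as np

def r_squared(y, fitted, n_predictors):
    # R^2 = 1 - SS_res / SS_tot; adjusted R^2 penalizes extra parameters.
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    n = len(y)
    adj = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    return r2, adj

rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)          # signal variance 4, noise variance 1
X = np.column_stack([np.ones(100), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
r2, adj = r_squared(y, X @ beta, n_predictors=1)
print(round(r2, 2), round(adj, 2))        # near the population value of 0.8
```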

For generalized linear models, report appropriate pseudo-\(R^2\) measures such as McFadden’s, Nagelkerke’s, or Tjur’s \(R^2\), keeping in mind that these lack the direct interpretation of classical \(R^2\). Log-likelihood values and deviance statistics provide information about how well the model’s probability distribution matches the data, with comparisons to null or saturated models offering context for interpretation.

Information criteria including Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) balance goodness of fit against model complexity, rewarding fit while penalizing the inclusion of additional parameters. These are particularly valuable for comparing non-nested models, though differences of less than 2-3 units are generally considered negligible. BIC penalizes complexity more heavily than AIC and tends to favor simpler models, especially with large sample sizes.
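The tradeoff these criteria encode is easy to see with hypothetical numbers, where AIC \(= 2k - 2\log L\) and BIC \(= k\log n - 2\log L\); here a richer model improves the log-likelihood by 2 at the cost of 3 extra parameters:

```python
import numpy as np

def aic_bic(loglik, n_params, n_obs):
    # AIC = 2k - 2*logL;  BIC = k*log(n) - 2*logL.
    aic = 2 * n_params - 2 * loglik
    bic = n_params * np.log(n_obs) - 2 * loglik
    return aic, bic

# Hypothetical comparison (invented log-likelihoods, n = 500):
aic_small, bic_small = aic_bic(loglik=-1000.0, n_params=4, n_obs=500)
aic_big, bic_big = aic_bic(loglik=-998.0, n_params=7, n_obs=500)
print(aic_small, aic_big)   # 2008.0 2010.0 -> smaller model preferred
print(round(bic_small, 1), round(bic_big, 1))
```

Note that the AIC difference of 2 units is within the "negligible" band, while BIC, with its heavier penalty, favors the smaller model more decisively.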

For models intended for prediction, assess predictive performance using metrics appropriate to your outcome type. For continuous outcomes, examine mean squared error, root mean squared error, or mean absolute error. For binary outcomes, consider accuracy, sensitivity, specificity, positive and negative predictive values, area under the ROC curve (AUC), and calibration metrics. Critically, evaluate predictive performance on held-out data not used for model training to obtain honest estimates of generalization performance.
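For continuous outcomes, the three error metrics are one-liners; a tiny worked example with invented values:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # hypothetical held-out outcomes
y_pred = np.array([2.5, 5.0, 3.0, 8.0])   # hypothetical model predictions

mse = np.mean((y_true - y_pred) ** 2)     # penalizes large errors quadratically
rmse = np.sqrt(mse)                       # same units as the outcome
mae = np.mean(np.abs(y_true - y_pred))    # more robust to outlying errors
print(mse, round(rmse, 3), mae)           # 0.375 0.612 0.5
```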

Conduct formal goodness-of-fit tests where appropriate. The Hosmer-Lemeshow test for logistic regression, the deviance test for generalized linear models, or omnibus tests for model specification each provide statistical assessments of model adequacy, though remember that with large sample sizes, these tests may reject even models that fit adequately for practical purposes.

Verifying Model Assumptions Through Residual Analysis

Residual analysis forms the cornerstone of model diagnostics, as residuals (i.e., the differences between observed and fitted values) should exhibit certain properties if model assumptions hold. If your model is correctly specified and assumptions are satisfied, residuals should appear as random noise without systematic patterns.

Begin with residual plots that display residuals against fitted values. In a well-fitting model, this plot should show a random cloud of points with no discernible pattern, constant spread across the range of fitted values, and no systematic curvature. A funnel shape, where spread increases or decreases with fitted values, suggests heteroscedasticity. Curved patterns indicate that the assumed functional form may be incorrect and that transformations or additional predictors might improve the model.

For generalized linear models, use appropriate residuals such as deviance, Pearson, or quantile residuals rather than raw residuals, as these better approximate the expected properties under the model assumptions. Deviance residuals are particularly useful for assessing overall model fit, while Pearson residuals help evaluate the variance assumption.

Construct residual plots against each predictor variable to identify whether any individual predictor’s relationship with the outcome is misspecified. Non-random patterns in these plots suggest that the predictor may require transformation, that its effect may be non-linear, or that it may interact with other variables.

Q-Q (quantile-quantile) plots compare the distribution of residuals to the theoretical distribution assumed by your model, typically the normal distribution for linear regression. Points should fall approximately along a straight diagonal line if the distributional assumption is satisfied. Systematic departures from linearity, particularly in the tails, indicate non-normality. Heavy-tailed distributions (more extreme values than expected under normality) show points falling below the line in the lower tail and above it in the upper tail, while light-tailed distributions (fewer extreme values) show the reverse pattern.

For time series or spatially structured data, examine residual autocorrelation through autocorrelation function (ACF) plots or spatial correlograms. Significant autocorrelation in residuals indicates that your model has failed to account for temporal or spatial dependence, suggesting the need for more sophisticated modeling approaches that explicitly model correlation structures.

Identify influential observations using diagnostic measures such as Cook’s distance, DFBETAS, DFFITS, and leverage values. Influential points are those whose inclusion or exclusion would substantially alter model estimates or predictions. High leverage points have unusual predictor values that give them the potential for influence, while high influence points actually do substantially affect the fitted model. Investigate influential observations carefully, determining whether they represent errors, exceptional cases worthy of separate analysis, or legitimate data that should be retained.
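Cook's distance combines residual size and leverage: \(D_i = e_i^2 \, h_i / (p\, s^2 (1 - h_i)^2)\). A Python/NumPy sketch on simulated data with one deliberately planted influential point (the data-generating setup is invented for the example):

```python
import numpy as np

def cooks_distance(X, y):
    # D_i = e_i^2/(p*s^2) * h_i/(1-h_i)^2, with X including the
    # intercept column and p = number of columns of X.
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T    # hat matrix
    h = np.diag(H)                          # leverages
    e = y - H @ y                           # residuals
    s2 = e @ e / (n - p)                    # residual variance estimate
    return (e**2 / (p * s2)) * h / (1 - h) ** 2

rng = np.random.default_rng(4)
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.5, size=100)
x[0], y[0] = 4.0, -8.0                      # planted influential point
X = np.column_stack([np.ones(100), x])
d = cooks_distance(X, y)
print(int(np.argmax(d)))                    # 0: the planted point dominates
```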

Assess the variance inflation in parameter estimates due to collinearity by examining condition indices or variance decomposition proportions in addition to variance inflation factors. These diagnostics help you understand which specific parameters are most affected by collinearity and whether the instability is severe enough to warrant remedial action.

Test for heteroscedasticity formally using the Breusch-Pagan test, White test, or other appropriate diagnostics depending on your model type. If heteroscedasticity is detected, consider whether variance-stabilizing transformations, weighted least squares, or robust standard error estimators are appropriate remedies.
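The LM form of the Breusch-Pagan test regresses squared residuals on the predictors and compares \(n R^2\) to a chi-square distribution. A Python/NumPy/SciPy sketch on simulated data whose error spread grows with the predictor (the data-generating setup is invented for the example):

```python
import numpy as np
from scipy import stats

def breusch_pagan(X, resid):
    # LM form: regress squared residuals on X (including intercept);
    # LM = n * R^2 ~ chi-square(k) under homoscedasticity,
    # k = number of non-intercept regressors.
    n, p = X.shape
    e2 = resid**2
    beta, *_ = np.linalg.lstsq(X, e2, rcond=None)
    fit = X @ beta
    r2 = 1 - np.sum((e2 - fit) ** 2) / np.sum((e2 - e2.mean()) ** 2)
    lm = n * r2
    return lm, stats.chi2.sf(lm, p - 1)

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, size=300)
y = 1 + 2 * x + rng.normal(scale=0.4 * x)   # error spread grows with x
X = np.column_stack([np.ones(300), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
lm, pval = breusch_pagan(X, y - X @ beta)
print(round(pval, 6))   # small p-value -> heteroscedasticity detected
```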

For mixed effects models, examine residuals at each level of the hierarchy. Inspect level-1 (within-group) residuals for the usual regression diagnostics, and additionally examine level-2 (group-level) residuals and random effects to assess whether higher-level assumptions are satisfied and to identify outlying clusters.

When assumption violations are detected, consider their practical severity carefully. Minor violations may have negligible impact on inference, particularly with large samples where central limit theorem properties provide robustness. Severe violations require remedy through data transformation, alternative modeling approaches, robust methods, or explicit acknowledgment as a limitation.

40.1.4 Phase 4: Inference and Prediction

Drawing Valid Statistical Inferences

With a well-fitting model in hand, turn your attention to statistical inference about parameters of interest and the relationships they represent. Begin by reporting point estimates for all relevant parameters, including regression coefficients, odds ratios, hazard ratios, or other effect measures appropriate to your model type. Present these with appropriate measures of uncertainty, typically confidence intervals and p-values from hypothesis tests.

Interpret each parameter estimate in the context of your research question and in language accessible to your intended audience. For linear regression coefficients, explain the expected change in the outcome associated with a one-unit change in the predictor, holding other variables constant. For logistic regression, interpret odds ratios or convert to more intuitive probability scales for specific covariate values. For survival models, explain hazard ratios in terms of relative risk over time.

Attend carefully to the distinction between statistical significance and practical significance. Statistically significant effects may be too small to matter in practice, particularly with large samples, while non-significant effects may still be substantively important, especially when confidence intervals are wide due to limited power. Report and discuss both the magnitude and precision of estimates rather than focusing exclusively on whether p-values fall below arbitrary thresholds.

Consider the multiple testing problem if you’re conducting numerous hypothesis tests. When testing many hypotheses simultaneously, some will appear significant purely by chance. Address this through appropriate multiple testing corrections such as Bonferroni, Holm, or false discovery rate (FDR) methods, or through a hierarchical testing strategy that prioritizes certain comparisons. Alternatively, distinguish clearly between confirmatory tests of pre-specified hypotheses and exploratory analyses that generate hypotheses for future research.
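Both the Holm step-down and Benjamini-Hochberg step-up adjustments can be written in a few lines of Python/NumPy (the p-values below are invented for the example; Bonferroni is simply `min(1, m * p)` for each test):

```python
import numpy as np

def holm(pvals):
    # Holm step-down: the i-th smallest p-value is multiplied by
    # (m - i + 1), with a running max enforcing monotonicity.
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    m = len(p)
    adj = np.empty(m)
    running = 0.0
    for rank, idx in enumerate(order):
        running = max(running, (m - rank) * p[idx])
        adj[idx] = min(1.0, running)
    return adj

def bh_fdr(pvals):
    # Benjamini-Hochberg step-up: adjusted p = min over larger
    # p-values of p * m / rank.
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adj = np.empty(m)
    prev = 1.0
    for rank in range(m - 1, -1, -1):
        idx = order[rank]
        prev = min(prev, p[idx] * m / (rank + 1))
        adj[idx] = prev
    return adj

p = [0.001, 0.008, 0.039, 0.041, 0.20]
print(np.round(holm(p), 3))     # [0.005 0.032 0.117 0.117 0.2  ]
print(np.round(bh_fdr(p), 3))   # [0.005 0.02  0.051 0.051 0.2  ]
```

Note how the FDR adjustment retains more discoveries at the 0.05 level than Holm, reflecting its weaker (but often more useful) error-rate guarantee.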

For predictive models, generate predictions for new observations or for specific covariate profiles of interest. Provide prediction intervals that appropriately capture uncertainty, recognizing that prediction uncertainty includes both estimation uncertainty about parameters and inherent residual variation in individual observations. Visualize predictions across the range of key predictors to help readers understand model implications.
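For linear regression, the classic prediction interval is \(\hat{y}_0 \pm t_{n-p,\,1-\alpha/2}\, s\sqrt{1 + x_0^\top (X^\top X)^{-1} x_0}\), where the "1 +" term carries the residual variation that a confidence interval for the mean omits. A Python/NumPy/SciPy sketch on simulated data (the regression setup is invented for the example):

```python
import numpy as np
from scipy import stats

def prediction_interval(X, y, x0, alpha=0.05):
    # Prediction interval for a new observation at x0;
    # X and x0 include the intercept column/element.
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)
    se = np.sqrt(s2 * (1 + x0 @ XtX_inv @ x0))   # estimation + residual noise
    t = stats.t.ppf(1 - alpha / 2, n - p)
    yhat = x0 @ beta
    return yhat - t * se, yhat + t * se

rng = np.random.default_rng(6)
x = rng.normal(size=150)
y = 1 + 2 * x + rng.normal(size=150)             # true mean at x=0 is 1
X = np.column_stack([np.ones(150), x])
lo, hi = prediction_interval(X, y, x0=np.array([1.0, 0.0]))
print(round(lo, 2), round(hi, 2))                # roughly (-1, 3)
```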

Exploring Alternative Approaches to Support Inference

Strengthen your inferences by demonstrating robustness through alternative analytical approaches. A finding that persists across multiple reasonable modeling strategies is more credible than one that depends critically on specific modeling choices. This triangulation of evidence provides readers with greater confidence in your conclusions.

Conduct sensitivity analyses that explore how results change under different assumptions. Fit variants of your model that include or exclude potential confounders, use different functional forms for continuous predictors, apply different transformations to the outcome, or employ alternative link functions. If conclusions remain substantively similar across these variations, you can be more confident in their validity. If results are sensitive to specific modeling choices, acknowledge this and discuss which specification is most defensible based on theory and empirical evidence.

For causal inference questions, implement multiple analytical strategies if possible. Combine regression adjustment with propensity score methods, instrumental variables, difference-in-differences, or regression discontinuity designs depending on your data structure and research design. Agreement across methods that rely on different identifying assumptions substantially strengthens causal claims.

Employ resampling methods such as bootstrap or permutation tests to validate your inferential conclusions, particularly when sample sizes are modest or distributional assumptions are questionable. The bootstrap provides a way to estimate sampling distributions and standard errors without relying on parametric assumptions, while permutation tests offer exact significance tests for certain hypotheses.
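A percentile bootstrap is only a few lines of Python/NumPy; here it is applied to the median of skewed simulated data, where normal-theory intervals would be questionable (the data and resample count are invented for the example):

```python
import numpy as np

def bootstrap_ci(data, stat, n_boot=2000, alpha=0.05, seed=0):
    # Percentile bootstrap CI: resample with replacement, recompute
    # the statistic, and take empirical quantiles of the replicates.
    rng = np.random.default_rng(seed)
    n = len(data)
    reps = np.array([stat(data[rng.integers(0, n, size=n)])
                     for _ in range(n_boot)])
    return np.quantile(reps, [alpha / 2, 1 - alpha / 2])

rng = np.random.default_rng(7)
sample = rng.exponential(scale=2.0, size=200)   # skewed data
lo, hi = bootstrap_ci(sample, np.median)
print(round(lo, 2), round(hi, 2))
```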

Conduct subgroup analyses to examine whether relationships are consistent across different populations or contexts within your data. While these are exploratory and should be interpreted cautiously due to reduced power and multiple testing concerns, they can reveal important heterogeneity in effects and generate hypotheses about effect moderation that deserve investigation in future studies.

Implement cross-validation or other hold-out validation procedures for predictive models to honestly assess generalization performance. K-fold cross-validation, leave-one-out cross-validation, or train-test splits allow you to evaluate how well your model performs on data it hasn’t seen during training. This is essential for claims about predictive utility and for comparing the predictive performance of different modeling approaches.
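K-fold cross-validation is straightforward to implement by hand; a Python/NumPy sketch for an OLS fit on simulated data (the regression setup is invented for the example, and the cross-validated MSE should land near the true noise variance of 1):

```python
import numpy as np

def kfold_mse(X, y, k=5, seed=0):
    # K-fold cross-validated MSE for an OLS fit (X includes intercept):
    # shuffle indices, split into k folds, train on k-1 and score on the rest.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errs.append(np.mean((y[fold] - X[fold] @ beta) ** 2))
    return float(np.mean(errs))

rng = np.random.default_rng(8)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)     # noise variance = 1
X = np.column_stack([np.ones(200), x])
cv_mse = kfold_mse(X, y)
print(round(cv_mse, 2))              # close to the noise variance
```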

If you have access to multiple datasets addressing similar questions, consider replication analyses that fit your model to independent data. Successful replication provides the strongest possible evidence for the robustness and generalizability of your findings, while failures to replicate may indicate that initial results were sample-specific or resulted from chance variation.

For Bayesian analyses, conduct prior sensitivity analyses that examine how posterior inferences change under different prior specifications. If conclusions are similar under a range of reasonable priors, inference is robust to prior specification. If posteriors are highly sensitive to prior choice, either collect more data to allow the likelihood to dominate or acknowledge that definitive conclusions require stronger prior information.

40.1.5 Phase 5: Conclusions and Recommendations

Synthesizing Findings into Actionable Recommendations

In concluding your analysis, synthesize your findings into clear, actionable recommendations that directly address the original research questions or practical problems that motivated the investigation. Avoid simply restating results; instead, interpret their meaning and implications for theory, policy, or practice.

Connect your statistical findings back to the substantive domain, explaining what your results mean for real-world phenomena. If you’ve found that a particular intervention has a significant positive effect, discuss what decision-makers should do with this information. If you’ve built a predictive model, explain how it should be deployed and what level of performance users can expect in practice.

Prioritize your recommendations by importance and strength of evidence. Some findings will be central to your research questions and supported by robust evidence across multiple analyses, while others may be more peripheral or tentative. Help readers understand which conclusions are most secure and which require additional confirmation before being acted upon.

Acknowledge uncertainty in your recommendations. Statistical analysis rarely provides absolute certainty, and honest acknowledgment of uncertainty better serves decision-makers than false precision. Describe the range of plausible effects indicated by confidence intervals and discuss how remaining uncertainty might affect decisions.

If your analysis revealed unexpected findings, discuss their potential significance and implications for existing theory or practice. Surprising results often represent the most important scientific contributions, but they also require more scrutiny and replication before being accepted with high confidence.

Consider differential implications for different stakeholders or contexts. A finding that suggests one course of action for one group might have different implications for another, and careful analysis should recognize this heterogeneity in drawing conclusions.

Acknowledging Limitations with Specificity and Candor

Every analysis has limitations, and comprehensive acknowledgment of these limitations actually strengthens rather than weakens your work by demonstrating careful scientific reasoning and helping readers appropriately calibrate their confidence in your conclusions. Move beyond generic limitations to provide specific, honest assessment of factors that may limit the validity or generalizability of your findings.

Discuss limitations related to your data source and sampling. Is your sample representative of the population to which you wish to generalize, or might selection bias limit external validity? Are there important subgroups underrepresented or absent from your data? Does non-response or attrition introduce potential bias? Are key variables measured with error or missing for substantial proportions of observations?

Address methodological limitations in your analytical approach. Which assumptions of your chosen model are most questionable in your particular application? Are there known alternatives that might have advantages you couldn’t exploit due to data constraints or computational limitations? Does the observational nature of your data limit causal inference, even if you’ve attempted to address confounding through statistical adjustment?

Consider limitations in measurement and operationalization. Do your variables capture the theoretical constructs of interest with high fidelity, or are they imperfect proxies? Are there important dimensions of concepts that your measures don’t capture? Would different but equally defensible operationalizations lead to different conclusions?

Acknowledge temporal limitations. For cross-sectional data, note that you observe relationships at a single time point and cannot make claims about causal ordering or temporal dynamics. For longitudinal data, discuss whether your observation period is long enough to capture relevant changes and whether patterns might differ over longer time horizons.

Discuss limitations related to model complexity and specification. Have you potentially omitted important confounders or moderators due to data unavailability? Does your model impose functional form assumptions that, while reasonable, may not perfectly capture reality? Have you prioritized interpretability over predictive performance, or vice versa, and how might this choice limit certain uses of your findings?

For predictive models, clearly delineate the conditions under which predictions should be trusted and situations where the model may perform poorly. Discuss the training data’s representativeness and how concept drift or distribution shift might affect performance when the model is deployed in different contexts or time periods.

Address limitations in statistical power if applicable. Underpowered studies may fail to detect truly important effects, and confidence intervals may be too wide to provide useful guidance. Non-significant findings in underpowered studies should be interpreted as inconclusive rather than as evidence of null effects.

Charting a Path Forward: Future Research Directions

Conclude by outlining specific steps that could address the limitations you’ve identified and advance understanding beyond what your current analysis achieved. This forward-looking discussion demonstrates scientific maturity and provides a roadmap for continuing research on important questions.

For data-related limitations, describe what improved data collection efforts would look like. Should future studies employ different sampling strategies to improve representativeness? Would longitudinal designs that track individuals over time provide stronger evidence than cross-sectional data? Are there key variables that should be measured but weren’t available in your data? Would larger sample sizes enable detection of more subtle effects or more complex modeling?

Recommend methodological innovations or alternative analytical approaches that might overcome current limitations. Are there emerging statistical methods that would better address the particular challenges your data present? Would experimental or quasi-experimental designs provide stronger causal evidence? Could different modeling frameworks accommodate complexities that your current approach handles imperfectly?

Suggest directions for extending your findings. What related research questions naturally follow from your results? Are there important moderators or boundary conditions that should be explored? Would replication in different populations or contexts test the generalizability of your findings? Are there theoretical mechanisms linking your variables that require further investigation?

For applied work, discuss how implementation research could assess the effectiveness of your recommendations in practice. Statistical findings that seem promising in analysis may encounter challenges when deployed in real-world contexts, and careful evaluation of implementation is crucial for evidence-based practice.

Consider interdisciplinary connections that might enrich future investigation of your research questions. Would combining your quantitative approach with qualitative methods provide richer understanding? Could insights from other disciplines inform better model specification or theoretical development?

If your work identified measurement limitations, suggest how instrument development or validation studies could improve future research. Better measurement is often the key to scientific progress, and acknowledging measurement challenges while proposing solutions contributes meaningfully to your field.

Discuss how emerging data sources or technologies might enable future research that wasn’t possible for your current analysis. Could sensor data, administrative records, natural language processing of text data, or other innovations provide new windows into your research questions?

Finally, contextualize your work within the broader scientific enterprise. Position your analysis as one contribution within an accumulating body of evidence, acknowledging what remains to be learned and how the field should collectively move forward to advance understanding.

This expanded structure provides a comprehensive framework for conducting and presenting rigorous statistical analysis, emphasizing transparency, methodological awareness, and careful reasoning at every stage of the research process.