A Guide on Data Analysis
Preface
How to cite this book
More books
1 Introduction
  1.1 General Recommendations
2 Prerequisites
  2.1 Matrix Theory
    2.1.1 Rank of a Matrix
    2.1.2 Inverse of a Matrix
    2.1.3 Definiteness of a Matrix
    2.1.4 Matrix Calculus
    2.1.5 Optimization in Scalar and Vector Spaces
    2.1.6 Cholesky Decomposition
  2.2 Probability Theory
    2.2.1 Axioms and Theorems of Probability
    2.2.2 Central Limit Theorem
    2.2.3 Random Variable
    2.2.4 Moment Generating Function
    2.2.5 Moments
    2.2.6 Skewness
    2.2.7 Kurtosis
    2.2.8 Distributions
  2.3 General Math
    2.3.1 Number Sets
    2.3.2 Summation Notation and Series
    2.3.3 Taylor Expansion
    2.3.4 Law of Large Numbers
    2.3.5 Convergence
    2.3.6 Sufficient Statistics and Likelihood
    2.3.7 Parameter Transformations
  2.4 Data Import/Export
    2.4.1 Medium size
    2.4.2 Large size
  2.5 Data Manipulation
I. BASIC
3 Descriptive Statistics
  3.1 Numerical Measures
  3.2 Graphical Measures
    3.2.1 Shape
    3.2.2 Scatterplot
  3.3 Normality Assessment
    3.3.1 Graphical Assessment
    3.3.2 Summary Statistics
  3.4 Bivariate Statistics
    3.4.1 Two Continuous
    3.4.2 Categorical and Continuous
    3.4.3 Two Discrete
    3.4.4 General Approach to Bivariate Statistics
4 Basic Statistical Inference
  4.1 Hypothesis Testing Framework
    4.1.1 Null and Alternative Hypotheses
    4.1.2 Errors in Hypothesis Testing
    4.1.3 The Role of Distributions in Hypothesis Testing
    4.1.4 The Test Statistic
    4.1.5 Critical Values and Rejection Regions
    4.1.6 Visualizing Hypothesis Testing
  4.2 Key Concepts and Definitions
    4.2.1 Random Sample
    4.2.2 Sample Statistics
    4.2.3 Distribution of the Sample Mean
  4.3 One-Sample Inference
    4.3.1 For Single Mean
    4.3.2 For Difference of Means, Independent Samples
    4.3.3 For Difference of Means, Paired Samples
    4.3.4 For Difference of Two Proportions
    4.3.5 For Single Proportion
    4.3.6 For Single Variance
    4.3.7 Non-parametric Tests
  4.4 Two-Sample Inference
    4.4.1 For Means
    4.4.2 For Variances
    4.4.3 Power
    4.4.4 Matched Pair Designs
    4.4.5 Nonparametric Tests for Two Samples
  4.5 Categorical Data Analysis
    4.5.1 Association Tests
    4.5.2 Ordinal Association
    4.5.3 Ordinal Trend
  4.6 Divergence Metrics and Tests for Comparing Distributions
    4.6.1 Kolmogorov-Smirnov Test
    4.6.2 Anderson-Darling Test
    4.6.3 Chi-Square Goodness-of-Fit Test
    4.6.4 Cramér-von Mises Test
    4.6.5 Kullback-Leibler Divergence
    4.6.6 Jensen-Shannon Divergence
    4.6.7 Hellinger Distance
    4.6.8 Bhattacharyya Distance
    4.6.9 Wasserstein Distance
    4.6.10 Energy Distance
    4.6.11 Total Variation Distance
    4.6.12 Summary
II. REGRESSION
5 Linear Regression
  5.1 Ordinary Least Squares
    5.1.1 Simple Regression (Basic) Model
    5.1.2 Multiple Linear Regression
  5.2 Generalized Least Squares
    5.2.1 Infeasible Generalized Least Squares
    5.2.2 Feasible Generalized Least Squares
    5.2.3 Weighted Least Squares
  5.3 Maximum Likelihood
    5.3.1 Motivation for MLE
    5.3.2 Key Quantities for Inference
    5.3.3 Assumptions of MLE
    5.3.4 Properties of MLE
    5.3.5 Practical Considerations
    5.3.6 Comparison of MLE and OLS
    5.3.7 Applications of MLE
  5.4 Penalized (Regularized) Estimators
    5.4.1 Motivation for Penalized Estimators
    5.4.2 Ridge Regression
    5.4.3 Lasso Regression
    5.4.4 Elastic Net
    5.4.5 Tuning Parameter Selection
    5.4.6 Properties of Penalized Estimators
  5.5 Robust Estimators
    5.5.1 Motivation for Robust Estimation
    5.5.2 M-Estimators
    5.5.3 R-Estimators
    5.5.4 L-Estimators
    5.5.5 Least Trimmed Squares (LTS)
    5.5.6 S-Estimators
    5.5.7 MM-Estimators
    5.5.8 Practical Considerations
  5.6 Partial Least Squares
    5.6.1 Motivation for PLS
    5.6.2 Steps to Construct PLS Components
    5.6.3 Properties of PLS
    5.6.4 Comparison with Related Methods
6 Non-Linear Regression
  6.1 Inference
    6.1.1 Linear Functions of the Parameters
    6.1.2 Nonlinear Functions of Parameters
  6.2 Non-linear Least Squares Estimation
    6.2.1 Iterative Optimization
    6.2.2 Derivative-Free
    6.2.3 Stochastic Heuristic
    6.2.4 Linearization
    6.2.5 Hybrid
    6.2.6 Comparison of Nonlinear Optimizers
  6.3 Practical Considerations
    6.3.1 Selecting Starting Values
    6.3.2 Handling Constrained Parameters
    6.3.3 Failure to Converge
    6.3.4 Convergence to a Local Minimum
    6.3.5 Model Adequacy and Estimation Considerations
  6.4 Application
    6.4.1 Nonlinear Estimation Using Gauss-Newton Algorithm
    6.4.2 Logistic Growth Model
    6.4.3 Nonlinear Plateau Model
7 Generalized Linear Models
  7.1 Logistic Regression
    7.1.1 Logistic Model
    7.1.2 Likelihood Function
    7.1.3 Fisher Information Matrix
    7.1.4 Inference in Logistic Regression
    7.1.5 Application: Logistic Regression
  7.2 Probit Regression
    7.2.1 Probit Model
    7.2.2 Application: Probit Regression
  7.3 Binomial Regression
    7.3.1 Dataset Overview
    7.3.2 Apply Logistic Model
    7.3.3 Apply Probit Model
  7.4 Poisson Regression
    7.4.1 The Poisson Distribution
    7.4.2 Poisson Model
    7.4.3 Link Function Choices
    7.4.4 Application: Poisson Regression
  7.5 Negative Binomial Regression
    7.5.1 Negative Binomial Distribution
    7.5.2 Application: Negative Binomial Regression
    7.5.3 Fitting a Zero-Inflated Negative Binomial Model
  7.6 Quasi-Poisson Regression
    7.6.1 Is Quasi-Poisson Regression a Generalized Linear Model?
    7.6.2 Application: Quasi-Poisson Regression
  7.7 Multinomial Logistic Regression
    7.7.1 The Multinomial Distribution
    7.7.2 Modeling Probabilities Using Log-Odds
    7.7.3 Softmax Representation
    7.7.4 Log-Odds Ratio Between Two Categories
    7.7.5 Estimation
    7.7.6 Interpretation of Coefficients
    7.7.7 Application: Multinomial Logistic Regression
    7.7.8 Application: Gamma Regression
  7.8 Generalization of Generalized Linear Models
    7.8.1 Exponential Family
    7.8.2 Properties of GLM Exponential Families
    7.8.3 Structure of a Generalized Linear Model
    7.8.4 Components of a GLM
    7.8.5 Canonical Link
    7.8.6 Inverse Link Functions
    7.8.7 Estimation of Parameters in GLMs
    7.8.8 Inference
    7.8.9 Deviance
    7.8.10 Diagnostic Plots
    7.8.11 Goodness of Fit
    7.8.12 Over-Dispersion
8 Linear Mixed Models
  8.1 Dependent Data
    8.1.1 Motivation: A Repeated Measurements Example
    8.1.2 Example: Linear Mixed Model for Repeated Measurements
    8.1.3 Random-Intercepts Model
    8.1.4 Covariance Models in Linear Mixed Models
    8.1.5 Covariance Structures in Mixed Models
  8.2 Estimation in Linear Mixed Models
    8.2.1 Interpretation of the Mixed Model Equations
    8.2.2 Derivation of the Mixed Model Equations
    8.2.3 Bayesian Interpretation of Linear Mixed Models
    8.2.4 Estimating the Variance-Covariance Matrix
  8.3 Inference in Linear Mixed Models
    8.3.1 Inference for Fixed Effects (β)
    8.3.2 Inference for Variance Components (θ)
  8.4 Information Criteria for Model Selection
    8.4.1 Akaike Information Criterion
    8.4.2 Corrected AIC
    8.4.3 Bayesian Information Criterion
    8.4.4 Practical Example with Linear Mixed Models
  8.5 Split-Plot Designs
    8.5.1 Example Setup
    8.5.2 Statistical Model for Split-Plot Designs
    8.5.3 Approaches to Analyzing Split-Plot Designs
    8.5.4 Application: Split-Plot Design
  8.6 Repeated Measures in Mixed Models
  8.7 Unbalanced or Unequally Spaced Data
    8.7.1 Variance-Covariance Structure: Power Model
  8.8 Application: Mixed Models in Practice
    8.8.1 Example 1: Pulp Brightness Analysis
    8.8.2 Example 2: Penicillin Yield (GLMM with Blocking)
    8.8.3 Example 3: Growth in Rats Over Time
    8.8.4 Example 4: Tree Water Use (Agridat)
9 Nonlinear and Generalized Linear Mixed Models
  9.1 Nonlinear Mixed Models
  9.2 Generalized Linear Mixed Models
  9.3 Relationship Between NLMMs and GLMMs
  9.4 Marginal Properties of GLMMs
    9.4.1 Marginal Mean of y_i
    9.4.2 Marginal Variance of y_i
    9.4.3 Marginal Covariance of y
  9.5 Estimation in Nonlinear and Generalized Linear Mixed Models
    9.5.1 Estimation by Numerical Integration
    9.5.2 Estimation by Linearization
    9.5.3 Estimation by Bayesian Hierarchical Models
    9.5.4 Practical Implementation in R
  9.6 Application: Nonlinear and Generalized Linear Mixed Models
    9.6.1 Binomial Data: CBPP Dataset
    9.6.2 Count Data: Owl Dataset
    9.6.3 Binomial Example: Gotway Hessian Fly Data
    9.6.4 Nonlinear Mixed Model: Yellow Poplar Data
  9.7 Summary
10 Nonparametric Regression
  10.1 Why Nonparametric?
    10.1.1 Flexibility
    10.1.2 Fewer Assumptions
    10.1.3 Interpretability
    10.1.4 Practical Considerations
    10.1.5 Balancing Parametric and Nonparametric Approaches
  10.2 Basic Concepts in Nonparametric Estimation
    10.2.1 Bias-Variance Trade-Off
    10.2.2 Kernel Smoothing and Local Averages
  10.3 Kernel Regression
    10.3.1 Basic Setup
    10.3.2 Nadaraya-Watson Kernel Estimator
    10.3.3 Priestley–Chao Kernel Estimator
    10.3.4 Gasser–Müller Kernel Estimator
    10.3.5 Comparison of Kernel-Based Estimators
    10.3.6 Bandwidth Selection
    10.3.7 Asymptotic Properties
    10.3.8 Derivation of the Nadaraya-Watson Estimator
  10.4 Local Polynomial Regression
    10.4.1 Local Polynomial Fitting
    10.4.2 Mathematical Form of the Solution
    10.4.3 Bias, Variance, and Asymptotics
    10.4.4 Special Case: Local Linear Regression
    10.4.5 Bandwidth Selection
    10.4.6 Asymptotic Properties Summary
  10.5 Smoothing Splines
    10.5.1 Properties and Form of the Smoothing Spline
    10.5.2 Choice of λ
    10.5.3 Connection to Reproducing Kernel Hilbert Spaces
  10.6 Confidence Intervals in Nonparametric Regression
    10.6.1 Asymptotic Normality
    10.6.2 Bootstrap Methods
    10.6.3 Practical Considerations
  10.7 Generalized Additive Models
    10.7.1 Estimation via Penalized Likelihood
    10.7.2 Interpretation of GAMs
    10.7.3 Model Selection and Smoothing Parameter Estimation
    10.7.4 Extensions of GAMs
  10.8 Regression Trees and Random Forests
    10.8.1 Regression Trees
    10.8.2 Random Forests
    10.8.3 Theoretical Insights
    10.8.4 Feature Importance in Random Forests
    10.8.5 Advantages and Limitations of Tree-Based Methods
  10.9 Wavelet Regression
    10.9.1 Wavelet Series Expansion
    10.9.2 Wavelet Regression Model
    10.9.3 Wavelet Shrinkage and Thresholding
  10.10 Multivariate Nonparametric Regression
    10.10.1 The Curse of Dimensionality
    10.10.2 Multivariate Kernel Regression
    10.10.3 Multivariate Splines
    10.10.4 Additive Models (GAMs)
    10.10.5 Radial Basis Functions
  10.11 Conclusion: The Evolving Landscape of Regression Analysis
    10.11.1 Key Takeaways
    10.11.2 The Art and Science of Regression
    10.11.3 Looking Forward
    10.11.4 Final Thoughts
III. RAMIFICATIONS
11 Data
  11.1 Qualitative vs. Quantitative Data
    11.1.1 Uses and Advantages of Qualitative Data
    11.1.2 Uses and Advantages of Quantitative Data
    11.1.3 Limitations of Qualitative and Quantitative Data
  11.2 Levels of Measurement
  11.3 Data by Observational Structure Over Time
  11.4 Cross-Sectional Data
  11.5 Time Series Data
    11.5.1 Statistical Properties of Time Series Models
    11.5.2 Common Time Series Processes
    11.5.3 Deterministic Time Trends
    11.5.4 Violations of Exogeneity in Time Series Models
    11.5.5 Consequences of Exogeneity Violations
    11.5.6 Highly Persistent Data
    11.5.7 Unit Root Testing
    11.5.8 Newey-West Standard Errors
  11.6 Repeated Cross-Sectional Data
    11.6.1 Key Characteristics
    11.6.2 Statistical Modeling for Repeated Cross-Sections
    11.6.3 Advantages of Repeated Cross-Sectional Data
    11.6.4 Disadvantages of Repeated Cross-Sectional Data
  11.7 Panel Data
    11.7.1 Advantages of Panel Data
    11.7.2 Disadvantages of Panel Data
    11.7.3 Sources of Variation in Panel Data
    11.7.4 Pooled OLS Estimator
    11.7.5 Individual-Specific Effects Model
    11.7.6 Random Effects Estimator
    11.7.7 Fixed Effects Estimator
    11.7.8 Tests for Assumptions in Panel Data Analysis
    11.7.9 Model Selection in Panel Data
    11.7.10 Alternative Estimators
    11.7.11 Application
  11.8 Other Ways to Classify Data
    11.8.1 Primary vs. Secondary Data
    11.8.2 Structured, Semi-Structured, and Unstructured Data
    11.8.3 Big Data
    11.8.4 Internal vs. External Data (in Organizational Contexts)
    11.8.5 Proprietary vs. Public Data
  11.9 Choosing the Right Type of Data
  11.10 Data Quality and Ethical Considerations
12 Variable Transformation
  12.1 Continuous Variables
    12.1.1 Standardization
    12.1.2 Min-max scaling
    12.1.3 Square Root/Cube Root
    12.1.4 Logarithmic
    12.1.5 Exponential
    12.1.6 Power
    12.1.7 Inverse/Reciprocal
    12.1.8 Hyperbolic arcsine
    12.1.9 Ordered Quantile Norm
    12.1.10 Arcsinh
    12.1.11 Lambert W x F Transformation
    12.1.12 Inverse Hyperbolic Sine (IHS) transformation
    12.1.13 Box-Cox Transformation
    12.1.14 Yeo-Johnson Transformation
    12.1.15 RankGauss
    12.1.16 Summary
  12.2 Categorical Variables
13 Imputation (Missing Data)
  13.1 Introduction to Missing Data
    13.1.1 Types of Imputation
    13.1.2 When and Why to Use Imputation
    13.1.3 Importance of Missing Data Treatment in Statistical Modeling
    13.1.4 Prevalence of Missing Data Across Domains
    13.1.5 Practical Considerations for Imputation
  13.2 Theoretical Foundations of Missing Data
    13.2.1 Definition and Classification of Missing Data
    13.2.2 Missing Data Mechanisms
    13.2.3 Relationship Between Mechanisms and Ignorability
  13.3 Diagnosing the Missing Data Mechanism
    13.3.1 Descriptive Methods
    13.3.2 Statistical Tests for Missing Data Mechanisms
    13.3.3 Assessing MAR and MNAR
  13.4 Methods for Handling Missing Data
    13.4.1 Basic Methods
    13.4.2 Single Imputation Techniques
    13.4.3 Machine Learning and Modern Approaches
    13.4.4 Multiple Imputation
  13.5 Evaluation of Imputation Methods
    13.5.1 Statistical Metrics for Assessing Imputation Quality
    13.5.2 Bias-Variance Tradeoff in Imputation
    13.5.3 Sensitivity Analysis
    13.5.4 Validation Using Simulated Data and Real-World Case Studies
  13.6 Criteria for Choosing an Effective Approach
  13.7 Challenges and Ethical Considerations
    13.7.1 Challenges in High-Dimensional Data
    13.7.2 Missing Data in Big Data Contexts
    13.7.3 Ethical Concerns
  13.8 Emerging Trends in Missing Data Handling
    13.8.1 Advances in Neural Network Approaches
    13.8.2 Integration with Reinforcement Learning
    13.8.3 Synthetic Data Generation for Missing Data
    13.8.4 Federated Learning and Privacy-Preserving Imputation
    13.8.5 Imputation in Streaming and Online Data Environments
  13.9 Application of Imputation
    13.9.1 Visualizing Missing Data
    13.9.2 How Many Imputations?
    13.9.3 Generating Missing Data for Demonstration
    13.9.4 Imputation with Mean, Median, and Mode
    13.9.5 K-Nearest Neighbors (KNN) Imputation
    13.9.6 Imputation with Decision Trees (rpart)
    13.9.7 MICE (Multivariate Imputation via Chained Equations)
    13.9.8 Amelia
    13.9.9 missForest
    13.9.10 Hmisc
    13.9.11 mi
14 Model Specification Tests
  14.1 Nested Model Tests
    14.1.1 Wald Test
    14.1.2 Likelihood Ratio Test
    14.1.3 F-Test (for Linear Regression)
    14.1.4 Chow Test
  14.2 Non-Nested Model Tests
    14.2.1 Vuong Test
    14.2.2 Davidson–MacKinnon J-Test
    14.2.3 Adjusted R²
    14.2.4 Comparing Models with Transformed Dependent Variables
  14.3 Heteroskedasticity Tests
    14.3.1 Breusch–Pagan Test
    14.3.2 White Test
    14.3.3 Goldfeld–Quandt Test
    14.3.4 Park Test
    14.3.5 Glejser Test
    14.3.6 Summary of Heteroskedasticity Tests
  14.4 Functional Form Tests
    14.4.1 Ramsey RESET Test (Regression Equation Specification Error Test)
    14.4.2 Harvey–Collier Test
    14.4.3 Rainbow Test
    14.4.4 Summary of Functional Form Tests
  14.5 Autocorrelation Tests
    14.5.1 Durbin–Watson Test
    14.5.2 Breusch–Godfrey Test
    14.5.3 Ljung–Box Test (or Box–Pierce Test)
    14.5.4 Runs Test
    14.5.5 Summary of Autocorrelation Tests
  14.6 Multicollinearity Diagnostics
    14.6.1 Variance Inflation Factor
    14.6.2 Tolerance Statistic
    14.6.3 Condition Index and Eigenvalue Decomposition
    14.6.4 Pairwise Correlation Matrix
    14.6.5 Determinant of the Correlation Matrix
    14.6.6 Summary of Multicollinearity Diagnostics
    14.6.7 Addressing Multicollinearity
15 Variable Selection
  15.1 Filter Methods (Statistical Criteria, Model-Agnostic)
    15.1.1 Information Criteria-Based Selection
    15.1.2 Univariate Selection Methods
    15.1.3 Correlation-Based Feature Selection
    15.1.4 Variance Thresholding
  15.2 Wrapper Methods (Model-Based Subset Evaluation)
    15.2.1 Best Subsets Algorithm
    15.2.2 Stepwise Selection Methods
    15.2.3 Branch-and-Bound Algorithm
    15.2.4 Recursive Feature Elimination
  15.3 Embedded Methods (Integrated into Model Training)
    15.3.1 Regularization-Based Selection
    15.3.2 Tree-Based Feature Importance
    15.3.3 Genetic Algorithms
  15.4 Summary Table
16 Hypothesis Testing
  16.1 Null Hypothesis Significance Testing
    16.1.1 Error Types in Hypothesis Testing
    16.1.2 Hypothesis Testing Framework
    16.1.3 Interpreting Hypothesis Testing Results
    16.1.4 Understanding p-Values
    16.1.5 The Role of Sample Size
    16.1.6 p-Value Hacking
    16.1.7 Practical vs. Statistical Significance
    16.1.8 Mitigating the Misuse of p-Values
    16.1.9 Wald Test
    16.1.10 Likelihood Ratio Test
    16.1.11 Lagrange Multiplier (Score) Test
    16.1.12 Comparing Hypothesis Tests
  16.2 Two One-Sided Tests Equivalence Testing
    16.2.1 When to Use TOST?
    16.2.2 Interpretation of the TOST Procedure
    16.2.3 Relationship to Confidence Intervals
    16.2.4 Example 1: Testing the Equivalence of Two Means
    16.2.5 Advantages of TOST Equivalence Testing
    16.2.6 When Not to Use TOST
  16.3 False Discovery Rate
    16.3.1 Benjamini-Hochberg Procedure
    16.3.2 Benjamini-Yekutieli Procedure
    16.3.3 Storey’s q-value Approach
    16.3.4 Summary: False Discovery Rate Methods
  16.4 Comparison of Testing Frameworks
17 Marginal Effects
  17.1 Definition of Marginal Effects
    17.1.1 Analytical Derivation of Marginal Effects
    17.1.2 Numerical Approximation of Marginal Effects
  17.2 Marginal Effects in Different Contexts
  17.3 Marginal Effects Interpretation
  17.4 Delta Method
  17.5 Comparison: Delta Method vs. Alternative Approaches
    17.5.1 Example: Applying the Delta Method in a Logistic Regression
  17.6 Types of Marginal Effect
    17.6.1 Average Marginal Effect
    17.6.2 Marginal Effects at the Mean
    17.6.3 Marginal Effects at the Average
  17.7 Packages for Marginal Effects
    17.7.1 marginaleffects Package (Recommended)
    17.7.2 margins Package
    17.7.3 mfx Package
    17.7.4 Comparison of Packages
18 Moderation
  18.1 emmeans package
    18.1.1 Continuous by continuous
    18.1.2 Continuous by categorical
    18.1.3 Categorical by categorical
  18.2 probmod package
  18.3 interactions package
    18.3.1 Continuous interaction
    18.3.2 Categorical interaction
  18.4 interactionR package
  18.5 sjPlot package
19 Mediation
  19.1 Traditional Approach
    19.1.1 Assumptions
    19.1.2 Indirect Effect Tests
    19.1.3 Multiple Mediation
  19.2 Causal Inference Approach
    19.2.1 Example 1
  19.3 Model-based causal mediation analysis
20 Prediction and Estimation
  20.1 Conceptual Framing
    20.1.1 Predictive Modeling
    20.1.2 Estimation or Causal Inference
  20.2 Mathematical Setup
    20.2.1 Probability Space and Data
    20.2.2 Loss Functions and Risk
  20.3 Prediction in Detail
    20.3.1 Empirical Risk Minimization and Generalization
    20.3.2 Bias-Variance Decomposition
    20.3.3 Example: Linear Regression for Prediction
    20.3.4 Applications in Economics
  20.4 Parameter Estimation and Causal Inference
    20.4.1 Estimation in Parametric Models
    20.4.2 Causal Inference Fundamentals
    20.4.3 Role of Identification
    20.4.4 Challenges
  20.5 Causation versus Prediction
  20.6 Illustrative Equations and Mathematical Contrasts
    20.6.1 Risk Minimization vs. Consistency
    20.6.2 Partial Derivatives vs. Predictions
    20.6.3 Example: High-Dimensional Regularization
    20.6.4 Potential Outcomes Notation
  20.7 Extended Mathematical Points
    20.7.1 M-Estimation and Asymptotic Theory
    20.7.2 The Danger of Omitted Variables
    20.7.3 Cross-Validation vs. Statistical Testing
  20.8 Putting It All Together: Comparing Objectives
  20.9 Conclusion
IV. CAUSAL INFERENCE
21 Causal Inference
  21.1 Treatment effect types
    21.1.1 Average Treatment Effects
    21.1.2 Conditional Average Treatment Effects
    21.1.3 Intent-to-treat Effects
    21.1.4 Local Average Treatment Effects
    21.1.5 Population vs. Sample Average Treatment Effects
    21.1.6 Average Treatment Effects on the Treated and Control
    21.1.7 Quantile Average Treatment Effects
    21.1.8 Mediation Effects
    21.1.9 Log-odds Treatment Effects
A. EXPERIMENTAL DESIGN
22 Experimental Design
  22.1 Notes
  22.2 Semi-random Experiment
  22.3 Rerandomization
  22.4 Two-Stage Randomized Experiments with Interference and Noncompliance
  22.5 A/B Testing Caution
23 Sampling
  23.1 Simple Sampling
  23.2 Stratified Sampling
  23.3 Unequal Probability Sampling
  23.4 Balanced Sampling
    23.4.1 Cube
    23.4.2 Stratification
    23.4.3 Cluster
    23.4.4 Two-stage
24 Analysis of Variance (ANOVA)
  24.1 Completely Randomized Design (CRD)
    24.1.1 Single Factor Fixed Effects Model
    24.1.2 Single Factor Random Effects Model
    24.1.3 Two Factor Fixed Effect ANOVA
    24.1.4 Two-Way Random Effects ANOVA
    24.1.5 Two-Way Mixed Effects ANOVA
  24.2 Nonparametric ANOVA
    24.2.1 Kruskal-Wallis
    24.2.2 Friedman Test
  24.3 Sample Size Planning for ANOVA
    24.3.1 Balanced Designs
    24.3.2 Randomized Block Experiments
  24.4 Randomized Block Designs
    24.4.1 Tukey Test of Additivity
  24.5 Nested Designs
    24.5.1 Two-Factor Nested Designs
  24.6 Single Factor Covariance Model
25 Multivariate Methods
    25.0.1 Properties of MVN
    25.0.2 Mean Vector Inference
    25.0.3 General Hypothesis Testing
  25.1 MANOVA
    25.1.1 Testing General Hypotheses
    25.1.2 Profile Analysis
    25.1.3 Summary
  25.2 Principal Components
    25.2.1 Population Principal Components
    25.2.2 Sample Principal Components
    25.2.3 Application
  25.3 Factor Analysis
    25.3.1 Methods of Estimation
    25.3.2 Factor Rotation
    25.3.3 Estimation of Factor Scores
    25.3.4 Model Diagnostic
    25.3.5 Application
  25.4 Discriminant Analysis
    25.4.1 Known Populations
    25.4.2 Probabilities of Misclassification
    25.4.3 Unknown Populations / Nonparametric Discrimination
    25.4.4 Application
B. QUASI-EXPERIMENTAL DESIGN
26 Quasi-experimental
  26.1 Natural Experiments
27 Regression Discontinuity
  27.1 Estimation and Inference
    27.1.1 Local Randomization-based
    27.1.2 Continuity-based
  27.2 Specification Checks
    27.2.1 Balance Checks
    27.2.2 Sorting/Bunching/Manipulation
    27.2.3 Placebo Tests
    27.2.4 Sensitivity to Bandwidth Choice
    27.2.5 Manipulation Robust Regression Discontinuity Bounds
  27.3 Fuzzy RD Design
  27.4 Regression Kink Design
  27.5 Multi-cutoff
  27.6 Multi-score
  27.7 Steps for Sharp RD
  27.8 Steps for Fuzzy RD
  27.9 Steps for RDiT (Regression Discontinuity in Time)
  27.10 Evaluation of an RD
  27.11 Applications
    27.11.1 Example 1
    27.11.2 Example 2
    27.11.3 Example 3
    27.11.4 Example 4
28 Synthetic Difference-in-Differences
  28.1 Understanding
  28.2 Application
    28.2.1 Block Treatment
    28.2.2 Staggered Adoption
29 Difference-in-differences
  29.1 Visualization
  29.2 Simple Dif-n-dif
  29.3 Notes
  29.4 Standard Errors
  29.5 Examples
    29.5.1 Example by Doleac and Hansen (2020)
    29.5.2 Example from Princeton
    29.5.3 Example by Card and Krueger (1993)
    29.5.4 Example by Butcher, McEwan, and Weerapana (2014)
  29.6 One Difference
  29.7 Two-way Fixed-effects
  29.8 Multiple periods and variation in treatment timing
  29.9 Staggered Dif-n-dif
    29.9.1 Stacked DID
    29.9.2 Goodman-Bacon Decomposition
    29.9.3 DID with in and out treatment condition
    29.9.4 Gardner (2022) and Borusyak, Jaravel, and Spiess (2021)
    29.9.5 Clément De Chaisemartin and d’Haultfoeuille (2020)
    29.9.6 Callaway and Sant’Anna (2021)
    29.9.7 L. Sun and Abraham (2021)
    29.9.8 Wooldridge (2022)
    29.9.9 Doubly Robust DiD
    29.9.10 Augmented/Forward DID
  29.10 Multiple Treatments
  29.11 Mediation Under DiD
  29.12 Assumptions
    29.12.1 Prior Parallel Trends Test
    29.12.2 Placebo Test
    29.12.3 Assumption Violations
    29.12.4 Robustness Checks
30 Changes-in-Changes
  30.1 Application
    30.1.1 ECIC package
    30.1.2 QTE package
31 Synthetic Control
  31.1 Applications
    31.1.1 Example 1
    31.1.2 Example 2
    31.1.3 Example 3
    31.1.4 Example 4
  31.2 Augmented Synthetic Control Method
  31.3 Synthetic Control with Staggered Adoption
  31.4 Bayesian Synthetic Control
  31.5 Generalized Synthetic Control
  31.6 Other Advances
32 Event Studies
  32.1 Other Issues
    32.1.1 Event Studies in marketing
    32.1.2 Economic significance
    32.1.3 Statistical Power
  32.2 Testing
    32.2.1 Parametric Test
    32.2.2 Non-parametric Test
  32.3 Sample
    32.3.1 Confounders
  32.4 Biases
  32.5 Long-run event studies
    32.5.1 Buy and Hold Abnormal Returns (BHAR)
    32.5.2 Long-term Cumulative Abnormal Returns (LCARs)
    32.5.3 Calendar-time Portfolio Abnormal Returns (CTARs)
  32.6 Aggregation
    32.6.1 Over Time
    32.6.2 Across Firms + Over Time
  32.7 Heterogeneity in the event effect
    32.7.1 Common variables in marketing
  32.8 Expected Return Calculation
    32.8.1 Statistical Models
    32.8.2 Economic Model
  32.9 Application
    32.9.1 Eventus
    32.9.2 eventstudies
    32.9.3 EventStudy
33 Instrumental Variables
  33.1 Framework
  33.2 Estimation
    33.2.1 2SLS Estimation
    33.2.2 IV-GMM
  33.3 Inference
    33.3.1 AR approach
    33.3.2 tF Procedure
    33.3.3 AK approach
  33.4 Testing Assumptions
    33.4.1 Relevance Assumption
    33.4.2 Exogeneity Assumption
  33.5 Negative R²
  33.6 Treatment Intensity
  33.7 Control Function
    33.7.1 Simulation
  33.8 New Advances
34 Matching Methods
  34.1 Selection on Observables
    34.1.1 MatchIt
    34.1.2 designmatch
    34.1.3 MatchingFrontier
    34.1.4 Propensity Scores
    34.1.5 Mahalanobis Distance
    34.1.6 Coarsened Exact Matching
    34.1.7 Genetic Matching
    34.1.8 Entropy Balancing
    34.1.9 Matching for high-dimensional data
    34.1.10 Matching for time series-cross-section data
    34.1.11 Matching for multiple treatments
    34.1.12 Matching for multi-level treatments
    34.1.13 Matching for repeated treatments
  34.2 Selection on Unobservables
    34.2.1 Rosenbaum Bounds
    34.2.2 Relative Correlation Restrictions
    34.2.3 Coefficient-stability Bounds
35 Interrupted Time Series
  35.1 Autocorrelation
  35.2 Multiple Groups
C. OTHER CONCERNS
36 Endogeneity
  36.1 Endogenous Treatment
    36.1.1 Measurement Error
    36.1.2 Simultaneity
    36.1.3 Endogenous Treatment Solutions
  36.2 Endogenous Sample Selection
    36.2.1 Tobit-2
    36.2.2 Tobit-5
37 Other Biases
  37.1 Aggregation Bias
    37.1.1 Simpson’s Paradox
  37.2 Contamination Bias
  37.3 Survivorship Bias
  37.4 Publication Bias
38 Controls
  38.1 Bad Controls
    38.1.1 M-bias
    38.1.2 Bias Amplification
    38.1.3 Overcontrol bias
    38.1.4 Selection Bias
    38.1.5 Case-control Bias
  38.2 Good Controls
    38.2.1 Omitted Variable Bias Correction
    38.2.2 Omitted Variable Bias in Mediation Correction
  38.3 Neutral Controls
    38.3.1 Good Predictive Controls
    38.3.2 Good Selection Bias
    38.3.3 Bad Predictive Controls
    38.3.4 Bad Selection Bias
  38.4 Choosing Controls
39 Directed Acyclic Graph
  39.1 Basic Notations
V. MISCELLANEOUS
40 Report
  40.1 One summary table
  40.2 Model Comparison
  40.3 Changes in an estimate
  40.4 Standard Errors
  40.5 Coefficient Uncertainty and Distribution
  40.6 Descriptive Tables
  40.7 Visualizations and Plots
41 Exploratory Data Analysis
42 Sensitivity Analysis / Robustness Check
  42.1 Specification curve
    42.1.1 starbility
    42.1.2 rdfanalysis
  42.2 Coefficient stability
  42.3 Omitted Variable Bias Quantification
43 Replication and Synthetic Data
  43.1 The Replication Standard
    43.1.1 Solutions for Empirical Replication
    43.1.2 Free Data Repositories
    43.1.3 Exceptions to Replication
    43.1.4 Replication Landscape
  43.2 Synthetic Data
    43.2.1 Benefits of Synthetic Data
    43.2.2 Concerns and Limitations
    43.2.3 Further Insights on Synthetic Data
    43.2.4 Generating Synthetic Data
  43.3 Application
    43.3.1 Original Dataset
    43.3.2 Restricted Dataset
    43.3.3 Synthpop
APPENDIX
A Appendix
  A.1 Git
  A.2 Short-cut
  A.3 Function short-cut
  A.4 Citation
  A.5 Install all necessary packages/libraries on your local machine
B Bookdown cheat sheet
  B.1 Operation
  B.2 Math Expression / Syntax
    B.2.1 Statistics Notation
  B.3 Table
References
18.5 sjPlot package

For publication purposes (recommended, but more advanced): link