A Guide on Data Analysis
Preface
How to cite this book
More books
1 Introduction
2 Prerequisites
2.1 Matrix Theory
2.1.1 Rank
2.1.2 Inverse
2.1.3 Definiteness
2.1.4 Matrix Calculus
2.1.5 Optimization
2.2 Probability Theory
2.2.1 Axioms and Theorems of Probability
2.2.2 Central Limit Theorem
2.2.3 Random variable
2.2.4 Moment generating function
2.2.5 Moment
2.2.6 Distributions
2.3 General Math
2.3.1 Number Sets
2.3.2 Summation Notation and Series
2.3.3 Taylor Expansion
2.3.4 Law of large numbers
2.3.5 Law of Iterated Expectation
2.3.6 Convergence
2.3.7 Sufficient Statistics
2.3.8 Parameter transformations
2.4 Data Import/Export
2.4.1 Medium size
2.4.2 Large size
2.5 Data Manipulation
I. BASIC
3 Descriptive Statistics
3.1 Numerical Measures
3.2 Graphical Measures
3.2.1 Shape
3.2.2 Scatterplot
3.3 Normality Assessment
3.3.1 Graphical Assessment
3.3.2 Summary Statistics
3.4 Bivariate Statistics
3.4.1 Two Continuous
3.4.2 Categorical and Continuous
3.4.3 Two Discrete
3.5 Summary
3.5.1 Visualization
4 Basic Statistical Inference
4.1 One Sample Inference
4.1.1 The Mean
4.1.2 Single Variance
4.1.3 Single Proportion (p)
4.1.4 Power
4.1.5 Sample Size
4.1.6 Note
4.1.7 One-sample Non-parametric Methods
4.2 Two Sample Inference
4.2.1 Means
4.2.2 Variances
4.2.3 Power
4.2.4 Sample Size
4.2.5 Matched Pair Designs
4.2.6 Nonparametric Tests for Two Samples
4.3 Categorical Data Analysis
4.3.1 Inferences for Small Samples
4.3.2 Test of Association
4.3.3 Ordinal Association
4.4 Divergence Metrics and Test for Comparing Distributions
4.4.1 Kullback-Leibler Divergence
4.4.2 Jensen-Shannon Divergence
4.4.3 Wasserstein Distance
4.4.4 Kolmogorov-Smirnov Test
II. REGRESSION
5 Linear Regression
5.1 Ordinary Least Squares
5.1.1 Simple Regression (Basic Model)
5.1.2 Multiple Linear Regression
5.1.3 OLS Assumptions
5.1.4 Theorems
5.1.5 Variable Selection
5.1.6 Diagnostics
5.1.7 Model Validation
5.1.8 Finite Sample Properties
5.1.9 Large Sample Properties
5.2 Feasible Generalized Least Squares
5.2.1 Heteroskedasticity
5.2.2 Serial Correlation
5.3 Weighted Least Squares
5.4 Generalized Least Squares
5.5 Feasible Prais-Winsten
5.6 Feasible group level Random Effects
5.7 Ridge Regression
5.8 Principal Component Regression
5.9 Robust Regression
5.9.1 Least Absolute Residuals (LAR) Regression
5.9.2 Least Median of Squares (LMS) Regression
5.9.3 Iteratively Reweighted Least Squares (IRLS) Robust Regression
5.10 Maximum Likelihood
5.10.1 Motivation for MLE
5.10.2 Assumption
5.10.3 Properties
5.10.4 Compare to OLS
5.10.5 Application
6 Non-linear Regression
6.1 Inference
6.1.1 Linear Function of the Parameters
6.1.2 Nonlinear
6.2 Non-linear Least Squares
6.2.1 Alternative of Gauss-Newton Algorithm
6.2.2 Practical Considerations
6.2.3 Model/Estimation Adequacy
6.2.4 Application
7 Generalized Linear Models
7.1 Logistic Regression
7.1.1 Application
7.2 Probit Regression
7.3 Binomial Regression
7.4 Poisson Regression
7.4.1 Application
7.5 Negative Binomial Regression
7.6 Multinomial
7.7 Generalization
7.7.1 Estimation
7.7.2 Inference
7.7.3 Deviance
7.7.4 Diagnostic Plots
7.7.5 Goodness of Fit
7.7.6 Over-Dispersion
8 Linear Mixed Models
8.1 Dependent Data
8.1.1 Random-Intercepts Model
8.1.2 Covariance Models
8.2 Estimation
8.2.1 Estimating \(\mathbf{V}\)
8.3 Inference
8.3.1 Parameters \(\beta\)
8.3.2 Variance Components
8.4 Information Criteria
8.4.1 Akaike’s Information Criteria (AIC)
8.4.2 Corrected AIC (AICC)
8.4.3 Bayesian Information Criteria (BIC)
8.5 Split-Plot Designs
8.5.1 Application
8.6 Repeated Measures in Mixed Models
8.7 Unbalanced or Unequally Spaced Data
8.8 Application
8.8.1 Example 1 (Pulps)
8.8.2 Example 2 (Rats)
8.8.3 Example 3 (Agridat)
9 Nonlinear and Generalized Linear Mixed Models
9.1 Estimation
9.1.1 Estimation by Numerical Integration
9.1.2 Estimation by Linearization
9.1.3 Estimation by Bayesian Hierarchical Models
9.2 Application
9.2.1 Binomial (CBPP Data)
9.2.2 Count (Owl Data)
9.2.3 Binomial
9.2.4 Example from Schabenberger and Pierce (2001), section 8.4.1
9.3 Summary
III. RAMIFICATIONS
10 Model Specification
10.1 Nested Model
10.1.1 Chow test
10.2 Non-Nested Model
10.2.1 Davidson-MacKinnon test
10.3 Heteroskedasticity
10.3.1 Breusch-Pagan test
10.3.2 White test
11 Imputation (Missing Data)
11.1 Assumptions
11.1.1 Missing Completely at Random (MCAR)
11.1.2 Missing at Random (MAR)
11.1.3 Ignorable
11.1.4 Nonignorable
11.2 Solutions to Missing data
11.2.1 Listwise Deletion
11.2.2 Pairwise Deletion
11.2.3 Dummy Variable Adjustment
11.2.4 Imputation
11.2.5 Other methods
11.3 Criteria for Choosing an Effective Approach
11.4 Another Perspective
11.5 Diagnosing the Mechanism
11.5.1 MAR vs. MNAR
11.5.2 MCAR vs. MAR
11.6 Application
11.6.1 Imputation with mean / median / mode
11.6.2 KNN
11.6.3 rpart
11.6.4 MICE (Multivariate Imputation via Chained Equations)
11.6.5 Amelia
11.6.6 missForest
11.6.7 Hmisc
11.6.8 mi
12 Data
12.1 Cross-Sectional
12.2 Time Series
12.2.1 Deterministic Time trend
12.2.2 Feedback Effect
12.2.3 Dynamic Specification
12.2.4 Dynamically Complete
12.2.5 Highly Persistent Data
12.3 Repeated Cross Sections
12.3.1 Pooled Cross Section
12.4 Panel Data
12.4.1 Pooled OLS Estimator
12.4.2 Individual-specific effects model
12.4.3 Tests for Assumptions
12.4.4 Model Selection
12.4.5 Summary
12.4.6 Application
13 Variable Transformation
13.1 Continuous Variables
13.1.1 Standardization
13.1.2 Min-max scaling
13.1.3 Square Root/Cube Root
13.1.4 Logarithmic
13.1.5 Exponential
13.1.6 Power
13.1.7 Inverse/Reciprocal
13.1.8 Hyperbolic arcsine
13.1.9 Ordered Quantile Norm
13.1.10 Arcsinh
13.1.11 Lambert W x F Transformation
13.1.12 Inverse Hyperbolic Sine (IHS) transformation
13.1.13 Box-Cox Transformation
13.1.14 Yeo-Johnson Transformation
13.1.15 RankGauss
13.1.16 Summary
13.2 Categorical Variables
14 Hypothesis Testing
14.1 Types of hypothesis testing
14.2 Wald test
14.2.1 Multiple Hypothesis
14.2.2 Linear Combination
14.2.3 Estimate Difference in Coefficients
14.2.4 Application
14.2.5 Nonlinear
14.3 The likelihood ratio test
14.4 Lagrange Multiplier (Score)
14.5 Two One-Sided Tests (TOST) Equivalence Testing
15 Marginal Effects
15.1 Delta Method
15.2 Average Marginal Effect Algorithm
15.3 Packages
15.3.1 MarginalEffects
15.3.2 margins
15.3.3 mfx
16 Prediction and Estimation
16.1 Prediction
16.2 Parameter Estimation
16.3 Causation versus Prediction
17 Moderation
17.1 emmeans package
17.1.1 Continuous by continuous
17.1.2 Continuous by categorical
17.1.3 Categorical by categorical
17.2 probmod package
17.3 interactions package
17.3.1 Continuous interaction
17.3.2 Categorical interaction
17.4 interactionR package
17.5 sjPlot package
IV. CAUSAL INFERENCE
18 Causal Inference
18.1 Treatment effect types
18.1.1 Average Treatment Effects
18.1.2 Conditional Average Treatment Effects
18.1.3 Intent-to-treat Effects
18.1.4 Local Average Treatment Effects
18.1.5 Population vs. Sample Average Treatment Effects
18.1.6 Average Treatment Effects on the Treated and Control
18.1.7 Quantile Average Treatment Effects
18.1.8 Mediation Effects
18.1.9 Log-odds Treatment Effects
A. EXPERIMENTAL DESIGN
19 Experimental Design
19.1 Notes
19.2 Semi-random Experiment
19.3 Rerandomization
19.4 Two-Stage Randomized Experiments with Interference and Noncompliance
20 Sampling
20.1 Simple Sampling
20.2 Stratified Sampling
20.3 Unequal Probability Sampling
20.4 Balanced Sampling
20.4.1 Cube
20.4.2 Stratification
20.4.3 Cluster
20.4.4 Two-stage
21 Analysis of Variance (ANOVA)
21.1 Completely Randomized Design (CRD)
21.1.1 Single Factor Fixed Effects Model
21.1.2 Single Factor Random Effects Model
21.1.3 Two Factor Fixed Effect ANOVA
21.1.4 Two-Way Random Effects ANOVA
21.1.5 Two-Way Mixed Effects ANOVA
21.2 Nonparametric ANOVA
21.2.1 Kruskal-Wallis
21.2.2 Friedman Test
21.3 Sample Size Planning for ANOVA
21.3.1 Balanced Designs
21.3.2 Randomized Block Experiments
21.4 Randomized Block Designs
21.4.1 Tukey Test of Additivity
21.5 Nested Designs
21.5.1 Two-Factor Nested Designs
21.6 Single Factor Covariance Model
22 Multivariate Methods
22.0.1 Properties of MVN
22.0.2 Mean Vector Inference
22.0.3 General Hypothesis Testing
22.1 MANOVA
22.1.1 Testing General Hypotheses
22.1.2 Profile Analysis
22.1.3 Summary
22.2 Principal Components
22.2.1 Population Principal Components
22.2.2 Sample Principal Components
22.2.3 Application
22.3 Factor Analysis
22.3.1 Methods of Estimation
22.3.2 Factor Rotation
22.3.3 Estimation of Factor Scores
22.3.4 Model Diagnostic
22.3.5 Application
22.4 Discriminant Analysis
22.4.1 Known Populations
22.4.2 Probabilities of Misclassification
22.4.3 Unknown Populations / Nonparametric Discrimination
22.4.4 Application
B. QUASI-EXPERIMENTAL DESIGN
23 Quasi-experimental
23.1 Natural Experiments
24 Regression Discontinuity
24.1 Estimation and Inference
24.1.1 Local Randomization-based
24.1.2 Continuity-based
24.2 Specification Checks
24.2.1 Balance Checks
24.2.2 Sorting/Bunching/Manipulation
24.2.3 Placebo Tests
24.2.4 Sensitivity to Bandwidth Choice
24.2.5 Manipulation Robust Regression Discontinuity Bounds
24.3 Fuzzy RD Design
24.4 Regression Kink Design
24.5 Multi-cutoff
24.6 Multi-score
24.7 Steps for Sharp RD
24.8 Steps for Fuzzy RD
24.9 Steps for RDiT (Regression Discontinuity in Time)
24.10 Evaluation of an RD
24.11 Applications
24.11.1 Example 1
24.11.2 Example 2
24.11.3 Example 3
24.11.4 Example 4
25 Synthetic Difference-in-Differences
25.1 Understanding
25.2 Application
25.2.1 Block Treatment
25.2.2 Staggered Adoption
26 Difference-in-differences
26.1 Visualization
26.2 Simple Dif-n-dif
26.3 Notes
26.4 Standard Errors
26.5 Examples
26.5.1 Example by Doleac and Hansen (2020)
26.5.2 Example from Princeton
26.5.3 Example by Card and Krueger (1993)
26.5.4 Example by Butcher, McEwan, and Weerapana (2014)
26.6 One Difference
26.7 Two-way Fixed-effects
26.8 Multiple periods and variation in treatment timing
26.9 Staggered Dif-n-dif
26.9.1 Stacked DID
26.9.2 Goodman-Bacon Decomposition
26.9.3 DID with in and out treatment condition
26.9.4 Gardner (2022) and Borusyak, Jaravel, and Spiess (2021)
26.9.5 Clément De Chaisemartin and d’Haultfoeuille (2020)
26.9.6 Callaway and Sant’Anna (2021)
26.9.7 L. Sun and Abraham (2021)
26.9.8 Wooldridge (2022)
26.9.9 Doubly Robust DiD
26.10 Multiple Treatment
26.11 Mediation Under DiD
26.12 Assumptions
26.12.1 Prior Parallel Trends Test
26.12.2 Placebo Test
26.12.3 Assumption Violations
26.12.4 Robustness Checks
27 Synthetic Control
27.1 Applications
27.1.1 Example 1
27.1.2 Example 2
27.1.3 Example 3
27.1.4 Example 4
27.2 Synthetic Difference-in-Differences
27.3 Augmented Synthetic Control Method
27.4 Synthetic Controls with Staggered Adoption
27.5 Generalized Synthetic Control
28 Event Studies
28.1 Other Issues
28.1.1 Event Studies in marketing
28.1.2 Economic significance
28.1.3 Statistical Power
28.2 Testing
28.2.1 Parametric Test
28.2.2 Non-parametric Test
28.3 Sample
28.3.1 Confounders
28.4 Biases
28.5 Long-run event studies
28.5.1 Buy and Hold Abnormal Returns (BHAR)
28.5.2 Long-term Cumulative Abnormal Returns (LCARs)
28.5.3 Calendar-time Portfolio Abnormal Returns (CTARs)
28.6 Aggregation
28.6.1 Over Time
28.6.2 Across Firms + Over Time
28.7 Heterogeneity in the event effect
28.7.1 Common variables in marketing
28.8 Expected Return Calculation
28.8.1 Statistical Models
28.8.2 Economic Model
28.9 Application
28.9.1 Eventus
28.9.2 eventstudies
28.9.3 EventStudy
29 Instrumental Variables
29.1 Framework
29.2 Estimation
29.2.1 2SLS Estimation
29.2.2 IV-GMM
29.3 Inference
29.3.1 AR approach
29.3.2 tF Procedure
29.3.3 AK approach
29.4 Testing Assumptions
29.4.1 Relevance Assumption
29.4.2 Exogeneity Assumption
29.5 Negative \(R^2\)
29.6 Treatment Intensity
29.7 Control Function
29.7.1 Simulation
30 Matching Methods
30.1 Selection on Observables
30.1.1 MatchIt
30.1.2 designmatch
30.1.3 MatchingFrontier
30.1.4 Propensity Scores
30.1.5 Mahalanobis Distance
30.1.6 Coarsened Exact Matching
30.1.7 Genetic Matching
30.1.8 Entropy Balancing
30.1.9 Matching for time series-cross-section data
30.1.10 Matching for multiple treatments
30.1.11 Matching for multi-level treatments
30.1.12 Matching for repeated treatments
30.2 Selection on Unobservables
30.2.1 Rosenbaum Bounds
30.2.2 Relative Correlation Restrictions
30.2.3 Coefficient-stability Bounds
C. OTHER CONCERNS
33 Endogeneity
33.1 Endogenous Treatment
33.1.1 Measurement Error
33.1.2 Simultaneity
33.1.3 Endogenous Treatment Solutions
33.2 Endogenous Sample Selection
33.2.1 Tobit-2
33.2.2 Tobit-5
34 Interrupted Time Series
34.1 Autocorrelation
34.2 Multiple Groups
34.3 Recurrent Events
37 Other Biases
37.1 Aggregation Bias
37.1.1 Simpson’s Paradox
37.2 Contamination Bias
37.3 Survivorship Bias
37.4 Publication Bias
38 Controls
38.1 Bad Controls
38.1.1 M-bias
38.1.2 Bias Amplification
38.1.3 Overcontrol bias
38.1.4 Selection Bias
38.1.5 Case-control Bias
38.2 Good Controls
38.2.1 Omitted Variable Bias Correction
38.2.2 Omitted Variable Bias in Mediation Correction
38.3 Neutral Controls
38.3.1 Good Predictive Controls
38.3.2 Good Selection Bias
38.3.3 Bad Predictive Controls
38.3.4 Bad Selection Bias
38.4 Choosing Controls
V. MISCELLANEOUS
40 Mediation
40.1 Traditional Approach
40.1.1 Assumptions
40.1.2 Indirect Effect Tests
40.1.3 Multiple Mediations
40.2 Causal Inference Approach
40.2.1 Example 1
40.3 Model-based causal mediation analysis
41 Directed Acyclic Graph
41.1 Basic Notations
44 Report
44.1 One summary table
44.2 Model Comparison
44.3 Changes in an estimate
44.4 Standard Errors
44.5 Coefficient Uncertainty and Distribution
44.6 Descriptive Tables
44.7 Visualizations and Plots
45 Exploratory Data Analysis
48 Sensitivity Analysis / Robustness Check
48.1 Specification curve
48.1.1 starbility
48.1.2 rdfanalysis
48.2 Coefficient stability
49 Replication and Synthetic Data
49.1 The Replication Standard
49.1.1 Solutions for Empirical Replication
49.1.2 Free Data Repositories
49.1.3 Exceptions to Replication
49.2 Synthetic Data: An Overview
49.2.1 Benefits
49.2.2 Concerns
49.2.3 Further Insights on Synthetic Data
49.3 Application
APPENDIX
A Appendix
A.1 Git
A.2 Short-cut
A.3 Function short-cut
A.4 Citation
A.5 Install all necessary packages/libraries on your local machine
B Bookdown cheat sheet
B.1 Operation
B.2 Math Expression/Syntax
B.2.1 Statistics Notation
B.3 Table
References
27.2 Synthetic Difference-in-Differences
Reference: Arkhangelsky et al. (2021).
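For a sense of how this estimator is run in practice, here is a minimal R sketch based on the synthdid package's documented interface; the package, its panel.matrices/synthdid_estimate functions, and the bundled california_prop99 example data are assumptions drawn from that package's documentation, not code from this book.

```r
# Minimal sketch: synthetic difference-in-differences via the synthdid package.
# Assumes the synthdid package is installed; follows its documented workflow.
library(synthdid)

# Balanced panel bundled with the package: California Proposition 99 (smoking).
data("california_prop99")

# Reshape the long panel (unit, time, outcome, treatment) into the outcome
# matrix Y plus the number of control units (N0) and pre-treatment periods (T0).
setup <- panel.matrices(california_prop99)

# Point estimate of the average effect of treatment on the treated unit(s).
tau_hat <- synthdid_estimate(setup$Y, setup$N0, setup$T0)

# Placebo-based standard error.
se <- sqrt(vcov(tau_hat, method = "placebo"))

sprintf("Estimate: %1.2f (SE = %1.2f)", tau_hat, se)
plot(tau_hat)  # trajectories with the estimated unit and time weights
```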
References
Arkhangelsky, Dmitry, Susan Athey, David A. Hirshberg, Guido W. Imbens, and Stefan Wager. 2021. “Synthetic Difference-in-Differences.” American Economic Review 111 (12): 4088–4118.