A Guide on Data Analysis
29.12 Matching for repeated treatments
See the twang package in R: https://cran.r-project.org/web/packages/twang/vignettes/iptw.pdf