AI and ML for Social Scientists - Descriptive/causal inference vs. prediction

1 Inference (1): Descriptive inference

Goal of descriptive inference: Estimate a parameter in a population
- e.g., Research question: What is the average life satisfaction (or the unemployment rate) among Mannheim University students?
  - Q: Easy to find out? What could be the problem?

Table 1: Dataset/sample
$Unit\ i$	$Name$	$X1_{i}^{Age}$	$X2_{i}^{Educ.}$	$D_{i}^{Unempl.}$	$Y_{i}^{Lifesat.}$
1	Sofia	29	1	$1$	3
2	Sara	30	2	$1$	2
3	José	28	0	$0$	5
4	Yiwei	27	2	$1$	?
5	Julia	25	0	$0$	6
6	Hans	23	0	$1$	?
..	..	..	..	..	..
1000	Hugo	23	1	$0$	8

Table 1 displays our sample
- Assuming it were the population, we could add a vector $R_{i}$ that indicates whether each unit in the population has been sampled (cf. Abadie et al. 2020)
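
The idea of a sampling indicator $R_{i}$ can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population of 10,000 students, the distribution of life satisfaction, and simple random sampling are all hypothetical and not taken from the table above.

```python
import random
import statistics

random.seed(42)

# Hypothetical finite population: life satisfaction (0-10) for N students.
N = 10_000
population = [min(10, max(0, round(random.gauss(6, 2)))) for _ in range(N)]

# R_i = 1 if unit i is sampled, 0 otherwise (assumed: simple random sample).
n = 1_000
sampled_ids = set(random.sample(range(N), n))
R = [1 if i in sampled_ids else 0 for i in range(N)]

# The population parameter of interest (unknown in practice).
population_mean = statistics.mean(population)

# The estimate uses only the units with R_i = 1.
sample = [y for y, r in zip(population, R) if r == 1]
sample_mean = statistics.mean(sample)
standard_error = statistics.stdev(sample) / n ** 0.5

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f} (SE {standard_error:.2f})")
```

If sampling depended on life satisfaction itself (e.g., unhappy students respond less often), the sample mean would no longer be a good estimate; that is one answer to the question of what could go wrong.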

Goal of causal inference: Identify whether particular cause/treatment $D$ has a causal effect on $Y$ in a population
- e.g., Research question: What is the causal effect of unemployment $D$ on life satisfaction $Y$ among Mannheim students?

Table 2: Dataset/sample with potential outcomes
$Unit\ i$	$Name$	$X1_{i}^{Age}$	$X2_{i}^{Educ.}$	$D_{i}^{Unempl.}$	$Y_{i}^{Lifesat.}$	$Y_{i}(0)$	$Y_{i}(1)$
1	Sofia	29	1	$1$	3	?	3
2	Sara	30	2	$1$	2	?	2
3	José	28	0	$0$	5	5	?
4	Yiwei	27	2	$1$	?	?	?
5	Julia	25	0	$0$	6	6	?
6	Hans	23	0	$1$	?	?	?
..	..	..	..	..	..	..	..
1000	Hugo	23	1	$0$	8	8	?

Causal inference: Everyday notion of causality $\to$ formalized through potential outcomes framework (Rubin 1974, ~2012)
- $δ_{i} = Y_{i}(1) - Y_{i}(0)$, e.g., $δ_{Sofia} = \text{Life satisfaction}_{Sofia}(\text{Unemployed}) - \text{Life satisfaction}_{Sofia}(\text{Employed})$
- Fundamental problem of causal inference (FPCI; Holland 1986): We observe either $Y_{i}(1)$ or $Y_{i}(0)$, never both … missing data problem!
- Usual focus on average treatment effect: $ATE = E[Y_{i}(1) - Y_{i}(0)]$ (or the average treatment effect on the treated, ATT)
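
The notation above can be made concrete with a simulation in which, unlike in real data, both potential outcomes are known. This is a minimal sketch: the effect size of -2, the outcome distributions, and the coin-flip assignment of $D_{i}$ are assumptions chosen for illustration (unemployment is of course not randomly assigned in practice).

```python
import random
import statistics

random.seed(1)

n = 10_000
units = []
for _ in range(n):
    y0 = random.gauss(6, 1.5)       # Y_i(0): life satisfaction if employed (assumed)
    delta = random.gauss(-2, 0.5)   # delta_i: unit-level effect of unemployment (assumed)
    y1 = y0 + delta                 # Y_i(1): life satisfaction if unemployed
    d = random.randint(0, 1)        # D_i assigned by a fair coin flip (assumed)
    y_obs = y1 if d == 1 else y0    # FPCI: only one potential outcome is observed
    units.append((y0, y1, d, y_obs))

# Computable only because this is a simulation with both potential outcomes.
true_ate = statistics.mean([y1 - y0 for y0, y1, _, _ in units])

# What a researcher could compute from observed data alone.
treated = [y for _, _, d, y in units if d == 1]
control = [y for _, _, d, y in units if d == 0]
diff_in_means = statistics.mean(treated) - statistics.mean(control)

print(f"true ATE:            {true_ate:.2f}")
print(f"difference in means: {diff_in_means:.2f}")
```

Because $D_{i}$ is randomized here, the difference in observed means lands close to the true ATE; without randomization there is no such guarantee.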
Designs, methods & models (with examples from my own research)
- experiments (Bauer et al. 2019, Bauer & Clemm 2021, Bauer et al. 2021, Bauer & Poama 2020), matching, instrumental variables (Bauer & Fatke 2014), regression discontinuity design, difference-in-differences, fixed-effects model (Bauer 2015, 2019), etc. (e.g., Gangl 2010 for overview)
Potential outcomes & identification revolution (Imai 2011):
- Statistical inference: Models + statistical assumptions $\to$ Causal inference: Models + statistical assumptions + identification assumptions
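
Why identification assumptions are needed can be seen from a standard decomposition of the naive comparison between unemployed and employed units, sketched here in the notation used above:

$$
E[Y_{i} \mid D_{i} = 1] - E[Y_{i} \mid D_{i} = 0] = \underbrace{E[Y_{i}(1) - Y_{i}(0) \mid D_{i} = 1]}_{ATT} + \underbrace{E[Y_{i}(0) \mid D_{i} = 1] - E[Y_{i}(0) \mid D_{i} = 0]}_{\text{selection bias}}
$$

The selection bias term involves $E[Y_{i}(0) \mid D_{i} = 1]$, which is never observed. If one is willing to assume that treatment is independent of the potential outcomes, as in a randomized experiment, this term is zero and the ATT equals the ATE.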

Table 3: Causal inference and prediction from a missing data perspective
$Unit\ i$	$Name$	$X1_{i}^{Age}$	$X2_{i}^{Educ.}$	$D_{i}^{Unempl.}$	$Y_{i}^{Lifesat.}$	$Y_{i}(0)$	$Y_{i}(1)$
1	Sofia	29	1	$1$	3	?	3
2	Sara	30	2	$1$	2	?	2
3	José	28	0	$0$	5	5	?
4	Yiwei	27	2	$1$	?	?	?
5	Julia	25	0	$0$	6	6	?
6	Hans	23	0	$1$	?	?	?
..	..	..	..	..	..	..	..
1000	Hugo	23	1	$0$	8	8	?

Data perspective: Both causal inference and machine learning are about missing data!
Causal inference perspective
- Replace (predict) Sofia’s (and others’) missing potential outcome(s) on variable $Life satisfaction$ with other people’s observed outcomes!
Prediction/ML perspective
- Train model to predict missing observations on variable $Life satisfaction$ (see “?”s)
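
Both perspectives can be sketched on the seven rows shown in Table 3. The filling rule used here (replace every "?" with the mean observed outcome of the relevant group) is an assumption made for illustration; it is the crudest possible model and is only justified for causal purposes under strong identification assumptions.

```python
from statistics import mean

# Rows shown in Table 3: (name, D, Y); None marks a missing outcome ("?").
rows = [
    ("Sofia", 1, 3), ("Sara", 1, 2), ("José", 0, 5), ("Yiwei", 1, None),
    ("Julia", 0, 6), ("Hans", 1, None), ("Hugo", 0, 8),
]

# Mean observed outcome by treatment status (D = 0 employed, D = 1 unemployed).
mean_y = {
    d: mean([y for _, d_i, y in rows if d_i == d and y is not None])
    for d in (0, 1)
}

def fill_in(d, y):
    """Return (Y, Y(0), Y(1)) with every '?' replaced by a group mean."""
    # Prediction perspective: a missing observed outcome is predicted
    # from units with the same treatment status.
    y_hat = y if y is not None else mean_y[d]
    # Causal perspective: the missing potential outcome is borrowed
    # from units with the other treatment status.
    y0 = y_hat if d == 0 else mean_y[0]
    y1 = y_hat if d == 1 else mean_y[1]
    return y_hat, y0, y1

completed = {name: fill_in(d, y) for name, d, y in rows}

for name, (y, y0, y1) in completed.items():
    print(f"{name:6s} Y={y:.2f}  Y(0)={y0:.2f}  Y(1)={y1:.2f}")
```

In practice the group means would be replaced by a model that also uses $X_{i}$ (age, education), but the logic of filling in missing cells stays the same.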

Supervised machine learning (SML): Focuses on prediction problems
- Goal: Predict $Y_{i}$ using $X_{i}$
- Approach
  - Estimate a model on subset (training data)
  - Test this model’s predictive accuracy on another subset (test data); the model has not seen the test data outcomes $Y_{i}$
  - If accurate enough, use this model to predict missing data (the “?”s in Table 4)
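
The three steps above can be sketched in plain Python. Everything about the data is hypothetical: the data-generating process, the 700/200/100 split, and the very simple "model" (mean life satisfaction per education-by-unemployment cell) are assumptions chosen to keep the example self-contained, not a recommended modelling strategy.

```python
import random
from statistics import mean

random.seed(7)

def simulate_unit():
    """Hypothetical data-generating process with the columns of the dataset."""
    educ = random.randint(0, 2)
    unempl = random.randint(0, 1)
    lifesat = 6 + 0.5 * educ - 2 * unempl + random.gauss(0, 1)
    return {"age": random.randint(20, 30), "educ": educ,
            "unempl": unempl, "lifesat": lifesat}

data = [simulate_unit() for _ in range(1000)]
# Pretend the outcome is missing ("?") for the last 100 units.
labelled, unlabelled = data[:900], data[900:]

# 1. Estimate a model on a subset (training data).
random.shuffle(labelled)
train, test = labelled[:700], labelled[700:]
cells = {}
for u in train:
    cells.setdefault((u["educ"], u["unempl"]), []).append(u["lifesat"])
cell_means = {key: mean(values) for key, values in cells.items()}
train_mean = mean([u["lifesat"] for u in train])

def predict(u):
    # Fall back to the overall training mean for unseen feature combinations.
    return cell_means.get((u["educ"], u["unempl"]), train_mean)

# 2. Test predictive accuracy on data the model has not seen.
def rmse(units, predictor):
    return mean([(u["lifesat"] - predictor(u)) ** 2 for u in units]) ** 0.5

rmse_model = rmse(test, predict)
rmse_baseline = rmse(test, lambda u: train_mean)  # always predict the mean
print(f"test RMSE model: {rmse_model:.2f}, baseline: {rmse_baseline:.2f}")

# 3. If accurate enough, predict the missing outcomes.
predictions = [predict(u) for u in unlabelled]
```

Comparing against a naive baseline (always predicting the training mean) is one way to judge whether the model is "accurate enough".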

Table 4: Dataset/sample
$Unit\ i$	$Name$	$X1_{i}^{Age}$	$X2_{i}^{Educ.}$	$D_{i}^{Unempl.}$	$Y_{i}^{Lifesat.}$
1	Sofia	29	1	1	3
2	Sara	30	2	1	2
3	José	28	0	0	5
4	Yiwei	27	2	1	?
5	Julia	25	0	0	6
6	Hans	23	0	1	?
..	..	..	..	..	..
1000	Hugo	23	1	0	8

Q: Assume we want to predict life satisfaction. What are the features in the table above?

Methods & models: Linear/logistic regression, Penalized regression, classification and regression trees, nearest neighbor, neural networks/deep learning
Social science examples: Recidivism (Dressel & Farid 2018), deadly conflict (Cederman & Weidmann 2017), divorce (Heyman et al. 2001), mental health (Chancellor & De Choudhury 2020), poverty/wealth (Blumenstock 2015), unemployment (Sundsøy et al. 2017), sentiment (Martínez-Cámara et al. 2014, Bauer & Clemm 2021), vote shares/elections (Stoetzer et al. 2019)
Salganik et al. (2020): “Fragile Families Challenge”
- Asked 160 teams to build predictive models for life outcomes [Material hardship, GPA, Grit, Eviction, Job training, Layoff]
- “no one made very accurate predictions”
SML can be used to predict both missing observations in a dataset, e.g., in Table 4, but also to forecast future observations
- In the latter case, we would add a time variable $T$ to our dataset

Unsupervised machine learning (UML): Methods for finding patterns in data
- Goal: Classify $Y_{i}$, e.g., $Y_{i}^{Lifesat.}$, into groups that are similar
  - $Y_{i}$ are often texts, images, audio snippets, videos
  - e.g., groups of people with similar life satisfaction
- Approach: Use a model to find a lower-dimensional representation of $Y_{i}$ (sometimes using $X_{i}$)
  - No training in the SML sense, i.e., no data-splitting, necessary
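
As an illustration of grouping people with similar life satisfaction, the following sketch runs a one-dimensional k-means, one simple member of the cluster analysis family. The simulated scores, the choice of two clusters, and the fixed number of iterations are assumptions made for this example.

```python
import random
from statistics import mean

random.seed(3)

# Hypothetical life satisfaction scores from a low and a high group.
scores = ([random.gauss(3, 0.7) for _ in range(100)]
          + [random.gauss(7, 0.7) for _ in range(100)])

def kmeans_1d(values, k, iterations=20):
    """Plain k-means on one variable; returns the sorted cluster centers."""
    centers = random.sample(values, k)  # start from k random data points
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its cluster.
        centers = [mean(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return sorted(centers)

centers = kmeans_1d(scores, k=2)
print([round(c, 2) for c in centers])
```

Note that no outcome labels and no training/test split are involved: the algorithm only looks for structure in the scores themselves.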
Methods & models: Principal component, factor, cluster, latent class, and sequence analysis; topic modelling; community detection
Examples: Find topics in… newspaper articles (Barberà et al. 2021), open-ended responses (Bauer et al. 2017), academic publications (McFarland et al. 2013), TED talks (Schwemmer & Jungkunz 2019), media discourses (DiMaggio et al. 2013), state documents (Mohr et al. 2013), tweets (Dahal et al. 2019, Bauer); Community detection in… Twitter botnets (Lingam et al. 2020)
General insights
- Social scientists who apply ML are still rare!
- Distinction between SML and UML sometimes blurry (e.g., pretrained BERT models)

Beginning of the 19th century: Legendre and Gauss - method of least squares
- Earliest form of linear regression [astronomy, quantitative output values]
1936: Fisher - Linear Discriminant Analysis
1940s: various authors - Logistic Regression
1970: Nelder and Wedderburn - Generalized Linear Models (GLM) of which linear and logistic regression are special cases
By end of the 1970s: Many more techniques available but almost exclusively linear methods
- Fitting non-linear relationships was computationally infeasible at the time
By the 1980s: Better computing technology facilitates non-linear methods
Mid 1980s: Breiman, Friedman, Olshen and Stone - Classification and Regression Trees
- practical implementation including cross-validation for model selection
1986: Hastie/Tibshirani coin the term “generalized additive models” for a class of non-linear extensions to generalized linear models (+ practical software implementation)
Since then statistical learning has emerged as a new subfield!