Descriptive/causal inference vs. prediction

Learning outcomes/objectives: Understand the difference between descriptive inference, causal inference, and prediction, and how supervised and unsupervised machine learning relate to these goals.

1 Inference (1): Descriptive inference

  • Goal of descriptive inference: Estimate a parameter in a population
    • e.g., Research question: What is the average life satisfaction (or the unemployment rate) among Mannheim University students?
      • Q: Is this easy to find out? What could be the problem?
Table 1: Dataset/sample

| \(\text{Unit } i\) | Name | \(X1_{i}^{Age}\) | \(X2_{i}^{Educ.}\) | \(D_{i}^{Unempl.}\) | \(Y_{i}^{Lifesat.}\) |
|---|---|---|---|---|---|
| 1 | Sofia | 29 | 1 | \({\color{red}{1}}\) | 3 |
| 2 | Sara | 30 | 2 | \({\color{red}{1}}\) | 2 |
| 3 | José | 28 | 0 | \({\color{blue}{0}}\) | 5 |
| 4 | Yiwei | 27 | 2 | \({\color{red}{1}}\) | ? |
| 5 | Julia | 25 | 0 | \({\color{blue}{0}}\) | 6 |
| 6 | Hans | 23 | 0 | \({\color{red}{1}}\) | ? |
| … | … | … | … | … | … |
| 1000 | Hugo | 23 | 1 | \({\color{blue}{0}}\) | 8 |
  • Table 1 displays our sample
    • If we treated Table 1 as the population, we could add an indicator \(R_{i}\) that records whether unit \(i\) has been sampled (cf. Abadie et al. 2020)
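
To make the estimation task concrete, here is a minimal Python sketch (not from the original slides), assuming we simply plug in the five observed life satisfaction values from Table 1; with the real sample we would use all observed values:

```python
import numpy as np

# Observed life satisfaction values from Table 1 (units 4 and 6 are
# missing and therefore excluded); the values are the table's examples.
y_observed = np.array([3, 2, 5, 6, 8])

# Point estimate of the population mean (descriptive inference).
mean_hat = y_observed.mean()

# Standard error and a rough 95% normal-approximation confidence interval.
se = y_observed.std(ddof=1) / np.sqrt(len(y_observed))
print(f"Mean: {mean_hat:.2f}, 95% CI: [{mean_hat - 1.96 * se:.2f}, "
      f"{mean_hat + 1.96 * se:.2f}]")
```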

2 Inference (2): Causal inference

  • Goal of causal inference: Identify whether a particular cause/treatment \(D\) has a causal effect on \(Y\) in a population
    • e.g., Research question: What is the causal effect of unemployment \(D\) on life satisfaction \(Y\) among Mannheim students?
Table 2: Dataset/sample with potential outcomes

| \(\text{Unit } i\) | Name | \(X1_{i}^{Age}\) | \(X2_{i}^{Educ.}\) | \(D_{i}^{Unempl.}\) | \(Y_{i}^{Lifesat.}\) | \(Y_{i}({\color{blue}{0}})\) | \(Y_{i}({\color{red}{1}})\) |
|---|---|---|---|---|---|---|---|
| 1 | Sofia | 29 | 1 | \({\color{red}{1}}\) | 3 | ? | 3 |
| 2 | Sara | 30 | 2 | \({\color{red}{1}}\) | 2 | ? | 2 |
| 3 | José | 28 | 0 | \({\color{blue}{0}}\) | 5 | 5 | ? |
| 4 | Yiwei | 27 | 2 | \({\color{red}{1}}\) | ? | ? | ? |
| 5 | Julia | 25 | 0 | \({\color{blue}{0}}\) | 6 | 6 | ? |
| 6 | Hans | 23 | 0 | \({\color{red}{1}}\) | ? | ? | ? |
| … | … | … | … | … | … | … | … |
| 1000 | Hugo | 23 | 1 | \({\color{blue}{0}}\) | 8 | 8 | ? |

3 Inference (3): Causal inference

  • Causal inference: Everyday notion of causality \(\rightarrow\) formalized through the potential outcomes framework (Rubin 1974)

    • Individual treatment effect: \(\delta_{i} = Y_{i}({\color{red}{1}}) - Y_{i}({\color{blue}{0}})\), e.g., \(\delta_{Sofia}\) \(= \text{Life satisfaction}_{Sofia}({\color{red}{Unemployed}}) - \text{Life satisfaction}_{Sofia}({\color{blue}{Employed}})\)
    • Fundamental problem of causal inference (FPCI, Holland 1986): We either observe \(Y_{i}({\color{red}{1}})\) or \(Y_{i}({\color{blue}{0}})\), never both … a missing data problem!
    • Usual focus on the average treatment effect: \(ATE = E[Y_{i}(1) - Y_{i}(0)]\) (or the average treatment effect on the treated, ATT); a simple estimation sketch follows at the end of this section
  • Designs, methods & models (with examples from my own research)

    • experiments (Bauer et al. 2019, Bauer & Clemm 2021, Bauer et al. 2021, Bauer & Poama 2020), matching, instrumental variables (Bauer & Fatke 2014), regression discontinuity designs, difference-in-differences, fixed-effects models (Bauer 2015, 2019), etc. (see Gangl 2010 for an overview)
  • Potential outcomes & identification revolution (Imai 2011):

    • Statistical inference: Models + statistical assumptions \(\rightarrow\) Causal inference: Models + statistical assumptions + identification assumptions
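
As a rough illustration, here is a minimal sketch of the naive difference-in-means estimator, using the complete cases from Table 2 as hypothetical inputs; it recovers the ATE only under identification assumptions such as (as-if) random assignment of \(D\):

```python
import numpy as np

# Complete cases from Table 2: treatment D (1 = unemployed, 0 = employed)
# and observed life satisfaction Y.
D = np.array([1, 1, 0, 0, 0])   # Sofia, Sara, José, Julia, Hugo
Y = np.array([3, 2, 5, 6, 8])

# Naive difference-in-means estimator: E[Y | D = 1] - E[Y | D = 0].
# Unbiased for the ATE only if D is (as-if) randomly assigned.
ate_hat = Y[D == 1].mean() - Y[D == 0].mean()
print(f"Estimated ATE: {ate_hat:.2f}")   # (3+2)/2 - (5+6+8)/3 ≈ -3.83
```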

4 Inference (4): Missing data perspective

Table 3: Causal inference and prediction from a missing data perspective

| \(\text{Unit } i\) | Name | \(X1_{i}^{Age}\) | \(X2_{i}^{Educ.}\) | \(D_{i}^{Unempl.}\) | \(Y_{i}^{Lifesat.}\) | \(Y_{i}({\color{blue}{0}})\) | \(Y_{i}({\color{red}{1}})\) |
|---|---|---|---|---|---|---|---|
| 1 | Sofia | 29 | 1 | \({\color{red}{1}}\) | 3 | ? | 3 |
| 2 | Sara | 30 | 2 | \({\color{red}{1}}\) | 2 | ? | 2 |
| 3 | José | 28 | 0 | \({\color{blue}{0}}\) | 5 | 5 | ? |
| 4 | Yiwei | 27 | 2 | \({\color{red}{1}}\) | ? | ? | ? |
| 5 | Julia | 25 | 0 | \({\color{blue}{0}}\) | 6 | 6 | ? |
| 6 | Hans | 23 | 0 | \({\color{red}{1}}\) | ? | ? | ? |
| … | … | … | … | … | … | … | … |
| 1000 | Hugo | 23 | 1 | \({\color{blue}{0}}\) | 8 | 8 | ? |
  • Data perspective: Both causal inference and machine learning are about missing data!
  • Causal inference perspective
    • Replace (predict) Sofia’s (and others’) missing potential outcomes on the variable \(\text{Life satisfaction}\) with other, similar people’s observed outcomes (see the matching sketch below)!
  • Prediction/ML perspective
    • Train a model to predict the missing observations on the variable \(\text{Life satisfaction}\) (the “?”s in Table 3)
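
A minimal sketch of this replacement idea via nearest-neighbor matching on the covariates (hypothetical values from Table 3; in practice one would standardize covariates and use dedicated matching software):

```python
import numpy as np

# Complete cases from Table 3: covariates (age, education),
# treatment D (1 = unemployed), and observed life satisfaction Y.
X = np.array([[29, 1], [30, 2], [28, 0], [25, 0], [23, 1]])  # Sofia..Hugo
D = np.array([1, 1, 0, 0, 0])
Y = np.array([3, 2, 5, 6, 8])

# Impute Sofia's missing potential outcome Y_Sofia(0) with the observed
# outcome of the most similar untreated (D = 0) unit in covariate space.
sofia = X[0]
controls = np.where(D == 0)[0]
dists = np.linalg.norm(X[controls] - sofia, axis=1)
nearest = controls[np.argmin(dists)]
print(f"Imputed Y_Sofia(0) = {Y[nearest]}")  # nearest neighbor is José
```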

5 SML in the social sciences (1)

  • Supervised machine learning (SML): Focuses on prediction problems
    • Goal: Predict \(Y_{i}\) using \(X_{i}\)
    • Approach
      • Estimate a model on one subset of the data (training data)
      • Test this model’s predictive accuracy on another subset (test data); the model has not seen the test data outcomes \(\color{#984ea3}{Y_{i}}\)
      • If it is accurate enough, use this model to predict the missing data (the “?”s in Table 4); see the sketch at the end of this section
Table 4: Dataset/sample

| \(\text{Unit } i\) | Name | \(X1_{i}^{Age}\) | \(X2_{i}^{Educ.}\) | \(D_{i}^{Unempl.}\) | \(Y_{i}^{Lifesat.}\) |
|---|---|---|---|---|---|
| 1 | Sofia | 29 | 1 | 1 | 3 |
| 2 | Sara | 30 | 2 | 1 | 2 |
| 3 | José | 28 | 0 | 0 | 5 |
| 4 | Yiwei | 27 | 2 | 1 | ? |
| 5 | Julia | 25 | 0 | 0 | 6 |
| 6 | Hans | 23 | 0 | 1 | ? |
| … | … | … | … | … | … |
| 1000 | Hugo | 23 | 1 | 0 | 8 |
  • Q: Assume we want to predict life satisfaction. What are the features in the table above?
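
A minimal sketch of this train/test workflow; since the real data are not available here, the feature set and data-generating process below are simulated assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Simulated stand-in for Table 4: features are age (X1), education (X2),
# and unemployment (D); the outcome is life satisfaction (Y).
rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.integers(20, 35, n),   # age
    rng.integers(0, 3, n),     # education
    rng.integers(0, 2, n),     # unemployment
])
y = 7 - 2 * X[:, 2] + 0.3 * X[:, 1] + rng.normal(0, 1, n)  # assumed relationship

# 1) Estimate a model on the training subset.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# 2) Check predictive accuracy on the held-out test subset, whose
#    outcomes the model has not seen during training.
print(f"Test-set MAE: {mean_absolute_error(y_test, model.predict(X_test)):.2f}")

# 3) If accurate enough, predict the missing outcomes (the "?"s),
#    e.g., Yiwei: age 27, education 2, unemployed.
print(f"Prediction for Yiwei: {model.predict([[27, 2, 1]])[0]:.2f}")
```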

6 SML in the social sciences (2)

  • Methods & models: Linear/logistic regression, penalized regression, classification and regression trees, nearest neighbors, neural networks/deep learning

  • Social science examples: Recidivism (Dressel & Farid 2018), deadly conflict (Cederman & Weidmann 2017), divorce (Heyman et al. 2001), mental health (Chancellor & De Choudhury 2020), poverty/wealth (Blumenstock 2015), unemployment (Sundsøy et al. 2017), sentiment (Martínez-Cámara et al. 2014, Bauer & Clemm 2021), vote shares/elections (Stoetzer et al. 2019)

  • Salganik et al. (2020): “Fragile Families Challenge”

    • Asked 160 teams to build predictive models for life outcomes (material hardship, GPA, grit, eviction, job training, layoff)
    • Result: “no one made very accurate predictions”
  • SML can be used both to predict missing observations within a dataset (e.g., in Table 4) and to forecast future observations (see the sketch below)

    • In the latter case we would add a time variable \(T\) to our dataset
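
A minimal forecasting sketch along these lines, where the yearly values are simulated assumptions and a model trained on past periods predicts the most recent one:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated yearly averages of life satisfaction with a time variable T.
rng = np.random.default_rng(1)
T = np.arange(2010, 2024)
y = 6 + 0.1 * (T - 2010) + rng.normal(0, 0.2, len(T))

# Train on all past years, then forecast the most recent year.
model = LinearRegression().fit(T[:-1].reshape(-1, 1), y[:-1])
forecast = model.predict([[T[-1]]])[0]
print(f"Forecast for {T[-1]}: {forecast:.2f} (observed: {y[-1]:.2f})")
```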

7 UML in the social sciences

  • Unsupervised machine learning (UML): Methods for finding patterns in data

    • Goal: Classify \(Y_{i}\), e.g., \(Y_{i}^{Lifesat.}\), into groups that are similar
      • \(Y_{i}\) are often texts, images, audio snippets, or videos
      • e.g., groups of people with similar life satisfaction
    • Approach: Use a model to find a lower-dimensional representation of \(Y_{i}\) (sometimes using \(X_{i}\)); see the clustering sketch at the end of this section
      • No training/test data split is necessary
  • Methods & models: Principal component, factor, cluster, latent class, and sequence analysis; topic modelling; community detection

  • Examples: Finding topics in newspaper articles (Barberà et al. 2021), open-ended survey responses (Bauer et al. 2017), academic publications (McFarland et al. 2013), TED talks (Schwemmer & Jungkunz 2019), media discourses (DiMaggio et al. 2013), state documents (Mohr et al. 2013), and tweets (Dahal et al. 2019, Bauer); community detection in Twitter botnets (Lingam et al. 2020)

  • General insights

    • Social scientists who apply ML are still rare!
    • Distinction between SML and UML sometimes blurry (e.g., pretrained BERT models)
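
As an illustration of the clustering approach mentioned above, here is a minimal k-means sketch that groups units by hypothetical life satisfaction values (the number of clusters is an assumption):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical life satisfaction scores; no train/test split is needed
# because we only look for structure within the observed data.
y = np.array([3, 2, 5, 6, 8, 7, 1, 9, 4, 6]).reshape(-1, 1)

# Partition units into 3 groups with similar life satisfaction (k-means).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(y)
print("Cluster assignments:", kmeans.labels_)
print("Cluster means:", kmeans.cluster_centers_.ravel())
```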

8 Timeline of statistical learning (James et al. 2013, 6–7)

  • Beginning of the 19th century: Legendre and Gauss - method of least squares
    • Earliest form of linear regression; first applied in astronomy to predict quantitative output values
  • 1936: Fisher - Linear Discriminant Analysis
  • 1940s: various authors - Logistic Regression
  • Early 1970s: Nelder and Wedderburn - Generalized Linear Models (GLM), of which linear and logistic regression are special cases
  • By end of the 1970s: Many more techniques available but almost exclusively linear methods
    • Fitting non-linear relationships was computationally infeasible at the time
  • By the 1980s: Better computing technology facilitated non-linear methods
  • Mid 1980s: Breiman, Friedman, Olshen and Stone - Classification and Regression Trees
    • practical implementation including cross-validation for model selection
  • 1986: Hastie and Tibshirani coin the term “generalized additive models” for a class of non-linear extensions to generalized linear models (plus a practical software implementation)
  • Since then, statistical learning has emerged as a new subfield!

References

Abadie, Alberto, Susan Athey, Guido W Imbens, and Jeffrey M Wooldridge. 2020. “Sampling-Based Versus Design-Based Uncertainty in Regression Analysis.” Econometrica 88 (1): 265–96.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics. Springer.
Salganik, Matthew J, Ian Lundberg, Alexander T Kindel, Caitlin E Ahearn, Khaled Al-Ghoneim, Abdullah Almaatouq, Drew M Altschul, et al. 2020. “Measuring the Predictability of Life Outcomes with a Scientific Mass Collaboration.” Proceedings of the National Academy of Sciences 117 (15): 8398–8403.