Unit i | Name | X1: Age | X2: Educ. | D: Unempl. | Y: Lifesat. |
---|---|---|---|---|---|
1 | Sofia | 29 | 1 | 1 | 3 |
2 | Sara | 30 | 2 | 1 | 2 |
3 | José | 28 | 0 | 0 | 5 |
4 | Yiwei | 27 | 2 | 1 | ? |
5 | Julia | 25 | 0 | 0 | 6 |
6 | Hans | 23 | 0 | 1 | ? |
.. | .. | .. | .. | .. | .. |
1000 | Hugo | 23 | 1 | 0 | 8 |
Updated: Jul 18, 2024
Source: https://en.wikipedia.org/wiki/Big_data
Q: Assume we want to predict life satisfaction. What are the features in the table above?
Q: Where does the training data come from? Do we always have the outputs/outcome readily available?
Q: What is the training data you are using, and what is the missing data you want to predict?
Missing data could be future observations, but also observations that are missing in a dataset we have already collected; in other words, missing-data imputation simply predicts missing data points within an existing dataset.
Methods & models: linear/logistic regression; penalized regression; classification and regression trees; nearest neighbors; neural networks/deep learning
Social science examples: Recidivism (Dressel & Farid 2018), deadly conflict (Cederman & Weidmann 2017), divorce (Heyman et al. 2001), mental health (Chancellor & De Choudhury 2020), poverty/wealth (Blumenstock 2015), unemployment (Sundsøy et al. 2017), sentiment (Martínez-Cámara et al. 2014, Bauer & Clemm 2021), vote shares/elections (Stoetzer et al. 2019)
Salganik et al. (2020): “Fragile Families Challenge”
SML can be used both to predict missing observations in an existing dataset (e.g., in Table 1) and to forecast future observations.
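As a minimal sketch of this idea, the snippet below fits a linear regression (the first method in the list above) on the rows of Table 1 with observed life satisfaction and then predicts the outcome for the two units marked "?" (Yiwei and Hans). It uses only numpy; a real analysis would rely on a dedicated library and far more data.

```python
import numpy as np

# Observed rows of Table 1: age, education, unemployed (features)
X = np.array([
    [29, 1, 1],   # Sofia
    [30, 2, 1],   # Sara
    [28, 0, 0],   # José
    [25, 0, 0],   # Julia
    [23, 1, 0],   # Hugo
], dtype=float)
y = np.array([3, 2, 5, 6, 8], dtype=float)  # observed life satisfaction

# Fit linear regression by ordinary least squares (with intercept)
X_design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Predict life satisfaction for the units with missing outcomes
X_missing = np.array([
    [27, 2, 1],   # Yiwei
    [23, 0, 1],   # Hans
], dtype=float)
preds = np.column_stack([np.ones(len(X_missing)), X_missing]) @ beta
print(preds)  # imputed life-satisfaction values
```

The same logic carries over to forecasting: the "missing" rows would simply be future observations whose features are known but whose outcome has not yet occurred.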
Unsupervised machine learning (UML): Methods for finding patterns in data
Methods & models: principal component, factor, cluster, latent class, and sequence analysis; topic modelling; community detection
Examples: Find topics in… newspaper articles (Barberà et al. 2021), open-ended survey responses (Bauer et al. 2017), academic publications (McFarland et al. 2013), TED talks (Schwemmer & Jungkunz 2019), media discourses (DiMaggio et al. 2013), state documents (Mohr et al. 2013), tweets (Dahal et al. 2019, Bauer); community detection: Twitter botnets (Lingam et al. 2020)
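To illustrate the "finding patterns without labels" idea, here is a small sketch of principal component analysis (the first method listed above) on simulated survey data, where four items are driven by a single latent factor. The data are invented for illustration; the PCA itself is computed from scratch via the singular value decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 100 respondents answering 4 survey items that all load on
# one latent factor (e.g., generalized trust), plus measurement noise
latent = rng.normal(size=(100, 1))
loadings = np.array([[0.9, 0.8, 0.7, 0.6]])
items = latent @ loadings + 0.3 * rng.normal(size=(100, 4))

# Principal component analysis via SVD of the centered data matrix
centered = items - items.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
explained = S**2 / np.sum(S**2)  # share of variance per component

print(explained)  # first component should dominate
```

Unlike the supervised example, there is no outcome column here: the method discovers on its own that one dimension summarizes most of the variation across the four items.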
General insights: Social scientists who apply ML are still rare! The distinction between SML and UML is sometimes blurry (e.g., pretrained BERT models).
Big: Large datasets are a means to an end; they are not an end in themselves.
Always-on: Always-on big data enables the study of unexpected events and real-time measurement.
Non-reactive: Measurement in big data sources is much less likely to change behavior.
Incomplete: No matter how big your big data, it probably doesn’t have the information you want.
Inaccessible: Data held by companies and governments are difficult for researchers to access.
Nonrepresentative: Nonrepresentative data are bad for out-of-sample generalizations, but can be quite useful for within-sample comparisons.
Drifting: Population drift, usage drift, and system drift make it hard to use big data sources to study long-term trends.
Algorithmically confounded: Behavior in big data systems is not natural; it is driven by the engineering goals of the systems (the algorithm is a secret!).
Dirty: Big data sources can be loaded with junk and spam.
Sensitive: Some of the information that companies and governments have is sensitive.
Source: Salganik (2017)
Term paper (Hausarbeit) = an in-depth engagement with topics discussed in the seminar
Is the left–right scale a suitable measure of ideology?
Do (negative) experiences influence generalized trust?
How has trust been measured in the past, and how should we measure trust in the future?
A good research question: ends with a question mark; is informative
Common problems: too broad or too vague; too complicated; relevance unclear
Bibliography
Format
Use of AI/ChatGPT
“We estimated the world’s technological capacity to store, communicate, and compute information, tracking 60 analog and digital technologies during the period from 1986 to 2007. In 2007, humankind was able to store 2.9 × 10^20 optimally compressed bytes, communicate almost 2 × 10^21 bytes, and carry out 6.4 × 10^18 instructions per second on general-purpose computers. General-purpose computing capacity grew at an annual rate of 58%. The world’s capacity for bidirectional telecommunication grew at 28% per year, closely followed by the increase in globally stored information (23%). Humankind’s capacity for unidirectional information diffusion through broadcasting channels has experienced comparatively modest annual growth (6%). Telecommunication has been dominated by digital technologies since 1990 (99.9% in digital format in 2007), and the majority of our technological memory has been in digital format since the early 2000s (94% digital in 2007).”
Rather than programmers crafting data-processing rules by hand, could a computer automatically learn these rules by looking at data?
In addition to the summaries in the text, please provide a tabular overview.
Seminar: Digitalisierung, Künstliche Intelligenz und Demokratie