Preface
1 Introduction
1.1 Script & Material
1.2 About me
1.3 Who are you?
1.4 Content, objectives & readings
1.5 Tools and software we use
1.5.1 R: Why use it?
1.5.2 R: Where/how to study?
1.5.3 R: Installation and setup
1.5.4 Datacamp
1.5.5 Google Cloud
1.6 Research questions
1.6.1 Research questions: Types
1.6.2 Research questions: Descriptive (What?)
1.6.3 Research questions: Causal (Why?)
1.6.4 Research questions: Predictive
1.7 The digital revolution
1.8 The Internet (+ access)
1.9 Technology adoption: United States
1.10 Platform usage (1): Social Media Adoption
1.11 Platform usage (2): Social Media Adoption (Bar chart)
1.12 Platform usage (3): Social networking among the young
1.13 Platform usage (4): Daily hours of digital media
1.14 What is Computational Social Science (CSS)?
1.15 CSS: Opportunities
1.16 CSS: Challenges
1.17 Exercise: What data can reveal about you…
1.18 Example presentation: Barbera (2015)
2 Big data & new data sources (1)
2.1 Good practices in data analysis (X)
2.1.1 Why reproducibility?
2.1.2 Reproducibility: My current approach
2.2 Appetizer
2.3 What is data?
2.4 Big data: Quotes for a start
2.5 Big data: Definitions
2.6 Big data: The Vs
2.7 Big data: Analog age vs. digital age (1)
2.8 Big data: Analog age vs. digital age (2)
2.9 Big data: Repurposing
2.10 Presentation
2.11 Exercise: Ten common characteristics of big data (Salganik 2017)
2.12 New forms of data: Overview
2.13 Exercise: Big Data is not about the data
2.14 Where can we find big data sources?
3 Big data & new data sources (2)
3.1 Presentation
3.2 Example: Salience of issues
3.3 Google Trends: Caveats
3.4 Data: How is it stored?
3.5 Short lab: Create data of different sizes
3.6 Data & Databases
3.7 R Database Packages
3.8 SQL: Intro
3.9 SQL: Components of a query
3.10 Lab: Working with a SQL database
3.10.1 Creating a SQL database
4 Big data & new data sources (3)
4.1 Lab: Working with a SQL database
4.1.1 Querying a SQL database
4.1.2 Querying multiple SQL tables
4.1.3 Grouping and aggregating
4.2 Exercise: Local SQL database
4.3 Lab: Three strategies with a local SQL database
4.3.1 Strategy 1: Sample and Model
4.3.2 Strategy 2: Chunk and Pull
4.3.3 Strategy 3: Push Compute to Data
5 Data collection: Platform APIs (1)
5.1 Web APIs
5.2 API = Application Programming Interface
5.3 Why APIs?
5.4 Scraping: Decisions, decisions…
5.5 Types of APIs
5.6 Some APIs
5.7 R packages & access
5.8 (Response) Formats: JSON
5.9 (Response) Formats: XML
5.10 Authentication
5.11 Lab: Connect to Google Geocoding API
5.12 Lab: Connect to Twitter Academic API
6 Data collection: Platform APIs (2)
6.1 Data security & ethics (1): What might happen?
6.2 Data security & ethics (2): Yes…
6.3 Data security & ethics (3): Protection
6.4 Lab: Media Cloud API
6.5 Lab: Twitter API
6.6 Exercises: Media Cloud API & Twitter API
7 Machine learning: Basics (1)
7.1 API Reviews
7.2 Classical statistics vs. machine learning
7.3 Machine learning as a programming paradigm
7.4 Terminological differences (1)
7.5 Terminological differences (2)
7.6 Prediction: Mean
7.7 Prediction: Linear model (Equation) (1)
7.8 Prediction: Linear model (Equation) (2)
7.9 Prediction: Linear model (Visualization)
7.10 Prediction: Linear model (Estimation)
7.11 Prediction: Linear model (Prediction)
7.12 Exercise: What’s predicted?
7.13 Exercise: Discussion
7.14 Regression vs. Classification
7.15 Overview of Classification
7.16 Assessing Model Accuracy
8 Machine learning: Basics (2)
8.1 The Logistic Model
8.2 LR in R: Predicting Recidivism (1)
8.3 LR in R: Predicting Recidivism (2): Estimate model
8.4 LR in R: Predicting Recidivism (3): Use model to predict
8.5 LR in R: Predicting Recidivism (5)
8.6 Lab: Predicting recidivism (Classification)
8.6.1 Inspecting the dataset
8.6.2 Splitting the datasets
8.6.3 Comparing the scores of black and white defendants
8.6.4 Building a predictive model
8.6.5 Predicting values
8.6.6 Training error rate
8.6.7 Test error rate
8.6.8 Comparison to COMPAS score
8.6.9 Model comparisons
8.7 Exercise
9 Machine learning: Basics (3)
9.1 Recap: Simple setup to build a predictive model
9.2 Resampling methods (1)
9.3 Resampling methods (2): Cross-validation
9.4 Resampling methods (3): Validation set approach
9.5 Resampling methods (4): Leave-one-out cross-validation (LOOCV)
9.6 Resampling methods (5): Leave-one-out cross-validation (LOOCV)
9.7 Resampling methods (6): k-Fold Cross-Validation
9.8 Resampling methods (7): Some caveats
9.9 Exercise: Resampling methods
9.10 Lab: Resampling & cross-validation
9.10.1 Simple sampling
9.10.2 Validation set approach
9.10.3 Leave-one-out cross-validation (LOOCV)
9.10.4 k-Fold Cross-Validation
9.10.5 Comparing models
9.11 Other ML methods: Quick overview
9.12 Trade-Off: Prediction Accuracy vs. Model Interpretability
9.13 Exercise
10 Machine learning: Text classification - Unsupervised (1)
10.1 Text as Data
10.2 Language in NLP
10.3 (R-)Workflow for Text Analysis
10.3.1 Data collection
10.3.2 Data manipulation: Basics (1)
10.3.3 Data manipulation: Basics (2)
10.3.4 Data manipulation: Basics (3)
10.3.5 Data manipulation: tidytext example (1)
10.3.6 Data manipulation: tidytext example (2)
10.3.7 Vectorization: Basics
10.3.8 Vectorization: tidytext example
10.3.9 Vectorization: tm example
10.3.10 Analysis: Supervised vs. unsupervised
10.3.11 Topic Modeling
10.3.12 Topic Modeling: Latent Dirichlet Allocation (1)
10.3.13 Topic Modeling: Latent Dirichlet Allocation (2)
10.3.14 Topic Modeling: Structural Topic Models (vs. LDA)
11 Machine learning: Text classification - Unsupervised (2)
11.1 Lab: Structural Topic Model
11.1.1 Introduction
11.1.2 Setup
11.1.3 Data Pre-processing
11.1.4 Analysis: Structural Topic Model
11.1.5 Validation and Model Selection
11.1.6 Visualization and Model Interpretation
12 Machine learning: Text classification - Supervised (3)
12.1 Supervised vs. unsupervised learning (1)
12.2 Topic models
12.3 Tree-based methods
12.4 Classification trees
12.5 Advantages and Disadvantages of Trees (Ch. 8.1.4)
12.6 Bagging
12.7 Out-of-Bag (OOB) Error Estimation
12.8 Variable Importance Measures
12.9 Random forests
13 Machine learning: Text classification - Supervised (4)
13.1 Lab: Random Forest for text classification
13.1.1 Preparing the data: DTM
13.1.2 Training data & RF classifier training
13.1.3 Evaluating the RF classifier
13.1.4 Exploring variable relevance & importance
13.1.5 Adding predictions to unlabelled data
13.1.6 Adding predictions to labelled (training) data
13.1.7 Creating the final dataset
13.1.8 How to create a training dataset
14 Final session
15 Machine learning: Intro to deep learning
15.1 Artificial, machine and deep learning
15.2 Classical ML: What it does (1)
15.3 Classical ML: What it does (2)
15.4 The ‘deep’ in deep learning (1)
15.5 The ‘deep’ in deep learning (2)
15.6 The ‘deep’ in deep learning (3)
15.7 Understanding how DL works (1)
15.8 Understanding how DL works (2)
15.9 Understanding how DL works (3)
15.10 Achievements of deep learning
15.11 Short-term hype & promise of AI (Ch. 1.1.7, 1.1.8)
15.12 The universal workflow of machine learning
15.13 Getting started: Network anatomy
15.14 Layers: the building blocks of deep learning
15.15 Loss functions and optimizers
15.16 Keras & R packages
15.17 Installation
15.18 Lab: Predicting house prices: a regression example
16 Machine learning: APIs
16.1 Using ML APIs for research: Pros and Cons
16.2 Lab: Using Google ML APIs
16.2.1 Software
16.2.2 Install & load packages
16.2.3 Twitter: Authenticate & load data
16.2.4 Google: Authenticate
16.2.5 Translation API
16.2.6 NLP API: Sentiment
16.2.7 NLP API: Syntax
16.2.8 Analyzing images
16.2.9 References
17 Summary: Computational Social Science
18 References
19 Optional content & exercises
20 Big data in the cloud
20.1 SQL at scale
20.1.1 SQL at scale: Google BigQuery
20.1.2 Lab: Google BigQuery & three strategies
20.1.3 Exercise: Setting up & querying Google BigQuery
20.2 Optional-Exercise: Documentaries by Deutsche Welle
20.3 Optional-Exercise: Farecast and Google Flu
20.4 Lab (Skip!): Setting up GCP research credits
20.4.1 Google Translation API
20.5 Optional-Exercise: Download your data!
21 Appendix (old material)
21.1 Classifier performance & fairness (1): False positives & negatives
21.2 Classifier performance & fairness (2)
21.3 Lab: Classifier performance & fairness
21.3.1 Initial evaluation of the COMPAS scores
21.3.2 False positives, false negatives, and correct classification
21.3.3 Altering the threshold
21.3.4 Adjusting thresholds
22 References