Computational Social Science: Theory & Application
3.20 Strategies to work with big data
https://rviews.rstudio.com/2019/07/17/3-big-data-strategies-for-r/
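Strategy 1: Sample and Model (draw a sample small enough to fit in memory and fit the model on the sample)
Strategy 2: Chunk and Pull (split the data into separable chunks, pull each chunk separately and process the chunks one by one or in parallel)
Strategy 3: Push Compute to Data (let the database do the summarising and pull only the much smaller result into R)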
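The three strategies can be sketched in a few lines of R. This is a minimal illustration, not the code from the linked post: it assumes the DBI and RSQLite packages are installed, and it uses the built-in mtcars data in an in-memory SQLite database as a hypothetical stand-in for a large remote table. The table name `cars` and all object names are chosen for illustration only.

```r
library(DBI)

# Assumption: mtcars in an in-memory SQLite database stands in for a
# large table living on a remote database server.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "cars", mtcars)

# Strategy 1: Sample and Model
# Pull a random subset of rows and fit the model on the sample only.
sample_df <- dbGetQuery(
  con,
  "SELECT mpg, wt FROM cars ORDER BY RANDOM() LIMIT 16"
)
fit_sample <- lm(mpg ~ wt, data = sample_df)

# Strategy 2: Chunk and Pull
# Split the data by a grouping variable (here: cyl), pull one chunk at a
# time and compute on each chunk separately.
cyl_values <- dbGetQuery(
  con,
  "SELECT DISTINCT cyl FROM cars ORDER BY cyl"
)$cyl
chunk_means <- sapply(cyl_values, function(k) {
  chunk <- dbGetQuery(
    con,
    "SELECT mpg FROM cars WHERE cyl = ?",
    params = list(k)
  )
  mean(chunk$mpg)
})

# Strategy 3: Push Compute to Data
# Let the database aggregate and pull only the small summary table.
pushed <- dbGetQuery(
  con,
  "SELECT cyl, AVG(mpg) AS mean_mpg, COUNT(*) AS n
   FROM cars GROUP BY cyl ORDER BY cyl"
)

dbDisconnect(con)
```

Strategies 2 and 3 arrive at the same group means here. The difference is where the work happens: with chunk and pull, every row travels to R, one chunk at a time, while with push compute to data only one summary row per group leaves the database.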