Preface
1 The seminar
1.1 Script & Material
1.2 About me
1.3 Who are you?
1.4 Why are we using R?
1.5 Where/how to study R?
1.6 Installation and setup of R
1.7 Content & Objectives
1.8 Overview of some readings
1.9 Descriptive vs. causal research questions
1.9.1 Descriptive questions
1.9.2 Causal questions
2 What is big data?
2.1 What is data? (Wikipedia)
2.2 Big data: Quotes for a start
2.3 Definitions
2.4 The Vs
2.5 Exercise: Farecast and Google Flu
2.6 Exercise: Big Data is not about the data
2.7 Exercise: What data can reveal about you…
2.8 Exercise: Download your data!
2.9 Big Data: Challenges for Social Scientists
3 Measurement & Variables
3.1 Discussion: Objective vs. subjective reality
3.2 Example 1
3.3 Example 2
3.4 Measurement abstract
3.5 Scenarios, planned and realized measurements
3.6 Distribution(s) of measurements
4 Data: Fundamentals
4.1 Basics
4.2 Table format
4.3 (Empirical) Univariate distributions
4.4 (Empirical) Joint distributions
4.5 One more joint distribution
4.6 Theoretical (probability) distributions
5 Models
5.1 What is a model?
5.2 Example: Mean as a model
5.3 Example: Linear model (Equation)
5.4 Example: Linear model (Visualization)
5.5 Estimation
5.6 Prediction
5.7 Estimand, estimator and estimation
5.8 Associational vs. causal inference
5.9 Assumptions
6 New forms/types of data
6.1 The digital revolution
6.2 From traditional to new data
6.3 New forms of data
6.4 Where can we find such data?
7 Good practices in data analysis
7.1 Reproducibility & Replicability
7.2 Why reproducibility?
7.3 Reproducibility: My current approach
7.4 Reproducibility in practice
8 Capture and collect data
8.1 Web scraping: Basics
8.1.1 Scraping data from websites: Why?
8.1.2 Scraping the web: Two approaches
8.1.3 The rules of the game
8.1.4 The art of web scraping
8.2 Screen (Web) scraping
8.2.1 Scenarios
8.2.2 HTML: A primer
8.2.3 HTML: A primer (continued)
8.2.4 Beyond HTML
8.2.5 Parsing HTML code
8.3 Lab 1: Scraping tables
8.4 Exercise 1: Scraping tables
8.5 Today & repetition
8.6 Lab 2: Scraping (many) tables
8.7 Exercise 2.1: Scraping (many) tables
8.8 Exercise 2.2: Scraping (many) tables
8.9 Lab 3: Scraping unstructured data
8.10 Exercise 3: Scraping unstructured data
8.11 Scraping dynamic webpages: Selenium
8.12 Lab 4: Scraping web data behind web forms
8.13 Exercise 4: Scraping web data behind web forms
8.14 RSS: Scraping newspaper websites
8.15 Lab 4: Scraping newspaper websites
8.16 Web APIs
8.16.1 APIs
8.16.2 Types of APIs
8.16.3 Connecting with an API
8.16.4 JSON format (responses)
8.16.5 Authentication
8.16.6 R packages
8.16.7 Why APIs?
8.16.8 Scraping: Decisions, decisions…
8.17 Some APIs
8.18 Lab 5: Scraping data from APIs
8.19 Exercise 5: Scraping data from APIs
8.20 What comes next? 3rd session/day
8.21 Lab 6: Clarify API
8.22 Twitter’s APIs
8.23 Lab 7: Twitter’s streaming API
8.23.1 Authenticating
8.23.2 Collecting data from Twitter’s streaming API
8.24 Exercise 7: Twitter’s streaming API
8.25 Lab 8: Twitter’s REST API
8.25.1 Searching recent tweets
8.25.2 Extracting users’ profile information
8.25.3 Building friend and follower networks
8.25.4 Estimating ideology based on Twitter networks
8.25.5 Other types of data
8.26 Exercise 8: Twitter’s REST API
9 Encoding issues
9.1 Lab 9: Basics of character encoding in R
9.2 Exercise 9: Character encoding
10 Research examples
10.1 Dressel and Farid (2018): Predicting recidivism
10.2 Barberá (2015): Birds of the Same Feather Tweet Together
10.3 Edelman et al. (2017): Racial Discrimination in the Sharing Economy
10.4 Lazer et al. (2014): The Parable of Google Flu: Traps in Big Data Analysis
10.5 Swan (2013): The Quantified Self
10.6 Göbel & Munzert (2018): Political Advertising on the Wikipedia Marketplace of Information
10.7 Przepiorka et al. (2017): Order without Law
11 Storing and managing (big) data
11.1 Size, dimensions & creation of data
11.2 How is data stored?
11.3 Introduction to SQL
11.3.1 Databases
11.3.2 SQL
11.3.3 Components of a SQL query
11.3.4 SQL at scale: Google BigQuery
11.4 Lab 9: Working with a SQL database
11.4.1 Creating a SQL database
11.4.2 Querying a SQL database
11.4.3 Querying multiple SQL tables
11.4.4 Grouping and aggregating
11.5 Exercise 9: SQL database
11.6 Data warehouses
11.7 Lab 10: Setting up & using Google BigQuery
11.7.1 More advanced queries
11.8 Exercise 10: Setting up & querying Google BigQuery
12 Analyzing Big Data
12.1 Descriptive vs. causal questions (Repetition)
12.1.1 Descriptive questions (and analysis)
12.1.2 Causal questions (and analysis)
12.2 Lab 11: Descriptive statistics
12.3 Lab 12: Visualization
12.4 Lab 13: Sentiment analysis
12.5 Lab 14: Time in R
13 Data security & ethics
13.1 What could happen to your data?
13.2 Yes…
13.3 Protection against different problems
14 R Basics
14.1 Starting R and the help function
14.2 Objects, working directory and workspace
14.2.1 Example: Objects, working directory and workspace
14.2.2 Exercise: Working directory, objects and workspace
14.2.3 Solution: Working directory, objects and workspace
14.3 Calculations and logical comparisons
14.3.1 Example: Calculations and logical comparisons
14.3.2 Exercise: Calculations and logical comparisons (Homework)
14.3.3 Solution: Calculations and logical comparisons
14.4 How to write good code/workflow
14.5 Objects: Classes and their structure
14.5.1 Overview of the structure of object classes
14.5.2 Vectors: Numerical, logical and character
14.5.3 Factors and lists
14.6 Packages
14.6.1 Example: Packages
14.6.2 Exercise: Packages
14.6.3 Solution: Packages (Homework)
14.7 Data frames and data management
14.7.1 The basics
14.7.2 The attach() function
14.7.3 Example: The basics
14.7.4 Logic of accessing subsets of data frames
14.7.5 Recoding variables
14.8 dplyr: Grammar of data management (Hadley Wickham)
14.8.1 filter() & slice()
14.8.2 arrange(): Reorder/sort rows
14.8.3 select(): Subsetting and renaming
14.8.4 distinct() & unique(): Extract distinct/unique rows/values
14.8.5 mutate() & transform()
14.8.6 group_by(): Applying functions across groups
14.8.7 Chaining with dplyr
14.8.8 anti_join(): Merging data frames
Big Data and Social Science, Chapter 11: Storing and managing (big) data