About this workshop

1 Instructor: About me

Akad. Rat auf Zeit (a fixed-term academic position) at the University of Freiburg (Department of politics), postdoc at LMU Munich (statistics department, Frauke Kreuter), external fellow at the Mannheim Centre for European Social Research
Previously
PhD in Social Sciences at the University of Bern
Research/postdoctoral fellow at the Mannheim Centre for European Social Research and the European University Institute (Florence, Italy)
My research:
- Substantive: Political sociology (Trust, polarization, social media, fake news)
- Methods/data: Causal inference, experiments, text data, data visualization, machine learning (PCA, Topic models, RFs, BERT), big data (Google trends, Twitter)
- Started using R around 2009; first exposure to ML through topic models around 2015 (unsupervised learning)
  - Currently: Using ML for predicting populism (tweets), conceptions (open-ended answers), transcription, emotion detection
- Other GESIS workshops: Applied Data Visualization, Interactive Data Analysis with Shiny
Contact: mail@paulcbauer.de; www.paulcbauer.de; Twitter; Github

2 Your turn

Let’s check out the survey results…

Name?
Affiliation? Country?
What do you want to use machine learning for? (or research questions?)

3 Contact & Outline & Dates

Dates: Tuesday - Friday, 10am - 12pm; 1pm - 4pm
Course outline/content/dates: (see TOC on the left)
- Important: This is the first time I teach this workshop/material
- Day I
  - 1 Introduction (James et al. 2013, Ch. 1, 2)
  - 2 Regression I & Intro to Tidymodels (Kuhn & Silge 2022, Ch. 1-3; James et al. 2013, Ch. 3-3.3)
  - 3 Regression II (James et al. 2013, Ch. 3-3.3)
  - 4 Resampling methods (James et al. 2013, Ch. 5-5.2)
- Day II
  - 5 Classification I (James et al. 2013, Ch. 4-4.3)
  - 6 Classification II (James et al. 2013, Ch. 4-4.3)
  - 7 Bias & fairness in ML I: Classification (Mehrabi et al. 2021)
  - 8 Bias & fairness in ML II: Regression (Mehrabi et al. 2021)
- Day III
  - 9 Linear Model Selection and Regularization I (James et al. 2013, Ch. 6-6.4)
  - 10 Linear Model Selection and Regularization II (James et al. 2013, Ch. 6-6.4)
  - 11 Tree-Based Methods I (James et al. 2013, Ch. 8-8.2)
  - 12 Tree-Based Methods II (James et al. 2013, Ch. 8-8.2)
- Day IV
  - 13 Support Vector Machines (SVMs) (James et al. 2013, Ch. 9-9.5)
  - 14 Using Python with tidymodels in R
  - 15 Deep Learning (Chollet 2021, Ch. 1,2)
  - 16 Summary of workshop & outlook

4 Actual content (after first teaching)

+ Day I
    - 1 Workshop & Introduction
    - 2 Introduction
    - 3 Machine learning: Fundamental concepts
    - 4 Machine learning: Fundamental concepts & Data exploration
+ Day II
    - 5 Data exploration
    - 6 Data exploration & Linear regression
    - 7 Linear regression
    - 8 Linear regression & Logistic regression
+ Day III
    - 9 Logistic regression
    - 10 Preprocessing data & recipes & workflows
    - 11 Resampling & cross-validation
    - 12 Feature selection & regularization (regression)
+ Day IV
    - 13 Tree-based models
    - 14 Tree-based models
    - 15 Tuning models (Lab: Random forest)
    - 16 Text analysis

5 Script & material

Literature: See syllabus.
Website/script: https://bookdown.org/paulcbauer/applied_machine_learning/
- Find it: Google “paul bauer applied machine learning”
- Document = slides + script (zoom in/out with Ctrl + mouse wheel)
- Code: can all be found in the script
- Data: can usually be downloaded via links in the script. If not, we’ll share the files.
- Full screen: F11
- Navigation: TOCs on left and right
- Search document (upper left)
- Document generated with Quarto
Motivation: Have a go-to script for participants (and ourselves!)
Content: Mixture of theory, lab sessions, exercises and pure code examples for discussion

6 Strategy & Goals/Learning outcomes

Strategy: From the simple to the complex, slowly diving into the logic of machine learning using building blocks that we already know
Goals: By the end of the course participants will:
- understand key concepts underlying machine learning.
- be able to interpret and evaluate machine learning models.
- be able to critically assess model performance on different dimensions of quality.
- be able to use various machine learning models for predictive and classification purposes.
- have learned how to use the tidymodels framework for machine learning in R.
- have learned how to evaluate and visualize model performance using packages like ggplot2.
Difficulty: How to choose among the many methods & models!

7 Online vs. offline

Negative
- Screen fatigue
- Can’t run around to check your code
- Less engaging, less social
- Voice
- Screen sharing & less screen space than classroom
Positive
- Participation from everywhere
- That’s how we interact more and more
Rule(s):
- Please keep your camera on!
  - Animals/children/partners are a welcome distraction!
  - Yawning, leaving, looking bored etc. allowed!
  - Use a virtual background if you like!
  - Leaving, returning allowed!
- There are no stupid questions¹

8 Recommended readings

Important: The workshop does not require any prior reading.
However, our schedule is primarily based on the first two textbooks below, which we generally recommend for further reading; the other two references cover the sessions on bias & fairness and deep learning (see references on website):
- Kuhn, M., & Silge, J. (2022). Tidy modeling with R. O’Reilly Media, Inc.
  - Link: https://www.tmwr.org/
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013, corrected printing June 2023). An introduction to statistical learning. New York: Springer.
  - Link: https://hastie.su.domains/ISLR2/ISLRv2_corrected_June_2023.pdf
- Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., & Galstyan, A. (2021). A survey on bias and fairness in machine learning. ACM computing surveys (CSUR), 54(6), 1-35.
  - Link: https://arxiv.org/abs/1908.09635
- Chollet, F. (2021). Deep learning with Python. Simon and Schuster.
  - See R version: Chollet, F., Kalinowski, T., & Allaire, J. J. (2022). Deep learning with R, Second Edition. Manning.
  - Link: https://livebook.manning.com/book/deep-learning-with-r-second-edition/

9 Software we will use

Open-source software! (Q: Why?)
R (R Core Team 2023)²
- only viable competitor is Python
- Install the necessary packages using the code below.

# Install pacman once if it is not yet available
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
library(pacman)

# p_load() installs any missing packages and then loads all of them
p_load('reticulate', 'keras',
       'tidyverse', 'lemon', 'knitr', 'kableExtra', 'plotly', 'randomNames', 'dplyr',
       'stargazer', 'tidymodels', 'gghighlight', 'gt', 'ggplot2', 'patchwork',
       'latex2exp', 'rsample', 'skimr', 'modelsummary', 'DataExplorer', 'visdat',
       'haven', 'naniar', 'rpart', 'rattle', 'magick', 'vip', 'xgboost', 'tm',
       'sjlabelled', 'profvis', 'rsconnect', 'whereami', 'DT')

ggplot2³ (Wickham 2016)
Maybe: plotly⁴ (Sievert 2020)
Note: Ideally, cite the software you use in your research, especially when it is open-source (e.g., run citation("ggplot2")).
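
As a minimal sketch of the citation note above: base R's citation() returns a citation object that can be printed as plain text or converted to BibTeX. The object name r_cite is only illustrative, and the ggplot2 part is guarded because it assumes the package is installed.

```r
# Citation for R itself (works in any base R session)
r_cite <- citation()
print(r_cite, style = "text")

# The same entry as BibTeX, e.g. to paste into a .bib file
print(toBibtex(r_cite))

# Package citations work the same way; guarded in case ggplot2 is not installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  print(citation("ggplot2"), style = "text")
}
```

The BibTeX output can be copied into the bibliography file of a paper or thesis.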

10 Data we will use

Download it here
European Social Survey (ESS) [Round 10 - 2020. Democracy, Digital social contacts]
- Outcome: Life satisfaction (0-10)
- The ESS contains different outcomes amenable to both classification and regression as well as a lot of variables that could be used as features (~580 variables).
- Research: Shen, Yin, and Jiao (2023); Collins et al. (2015); Kaiser, Otterbach, and Sousa-Poza (2022); (Ciftci2023-kiv?); Pan and Cutumisu (2023); Prati (2022)
COMPAS
- Outcome: Recidivism (0,1)
- We will be using the dataset at LINK that is described by Angwin et al. (2016; see also Lee, Du, and Guerzhoy 2020; James et al. 2013)
- The data is based on the COMPAS risk assessment tool (RAT). RATs are increasingly being used to assess a criminal defendant’s probability of re-offending.
Datasets have been prepared for the workshop (e.g., variables renamed, data cleaned) to facilitate focus on course content
- But building ML models also involves all the tedious data management tasks that we neglect here (recoding, renaming, etc.)
- Missing values on the outcomes (life satisfaction, recidivism) were added to both datasets

Overview of COMPAS dataset variables

id: ID of prisoner, numeric
name: Name of prisoner, factor
compas_screening_date: Date of COMPAS screening, date
decile_score: the decile of the COMPAS score, numeric
is_recid: whether someone reoffended/recidivated (=1) or not (=0), numeric
is_recid_factor: same but factor variable
age: a continuous variable containing the age (in years) of the person, numeric
age_cat: age categorized
priors_count: number of prior crimes committed, numeric
sex: gender with levels “Female” and “Male”, factor
race: race of the person, factor
juv_fel_count: number of juvenile felonies, numeric
juv_misd_count: number of juvenile misdemeanors, numeric
juv_other_count: number of prior juvenile convictions that are not considered either felonies or misdemeanors, numeric
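
To make the variable overview more concrete, here is a minimal base-R sketch of a toy data frame with a few of these columns. All values are invented for illustration and are NOT taken from the real COMPAS data; the factor labels ("no", "yes") and the age cut points are assumptions as well. It also shows one way to separate rows with a missing outcome, since missing values were added to the outcome for the workshop.

```r
# Hypothetical toy rows that mimic the structure of the COMPAS data (invented values)
compas_toy <- data.frame(
  id           = 1:6,
  age          = c(23, 35, 41, 19, 52, 30),
  priors_count = c(0, 3, 1, 5, 0, 2),
  sex          = factor(c("Male", "Female", "Male", "Male", "Female", "Male")),
  decile_score = c(2, 7, 4, 9, 1, 5),
  is_recid     = c(0, 1, 0, 1, NA, NA)  # NA = outcome missing
)

# Factor version of the outcome (labels are an assumption)
compas_toy$is_recid_factor <- factor(compas_toy$is_recid,
                                     levels = c(0, 1),
                                     labels = c("no", "yes"))

# Categorized age (cut points are an assumption)
compas_toy$age_cat <- cut(compas_toy$age,
                          breaks = c(0, 25, 45, Inf),
                          labels = c("25 or younger", "26 to 45", "older than 45"))

# Rows with an observed outcome can be used for training and testing;
# rows with a missing outcome are the ones we would like to predict
labelled   <- compas_toy[!is.na(compas_toy$is_recid), ]
unlabelled <- compas_toy[is.na(compas_toy$is_recid), ]

str(compas_toy)
```

Keeping both a numeric and a factor version of the outcome is convenient because classification functions typically expect a factor, while summaries such as means need the numeric version.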

11 Tools and software

11.0.1 R: Why use it?

Free and open source (think of science in developing countries)
Good online-documentation
Lively community of users (forums etc.)
Pioneering role
Visualization capabilities
Intuitive
Cooperates with other programs
Used across wide variety of disciplines
Object-oriented programming language
Popularity (See popularity statistics on books, blogs, forums)
RStudio as powerful integrated development environment (IDE) for R
- Evolves into a scientific work suite optimizing workflow (replication, reproducibility etc.)
Institutions/people (Gary King, Andrew Gelman etc.)
Economic power (Revolution Analytics, Microsoft R Open)
Python is the only real competitor; it can be used from R (e.g., via the reticulate package!)⁵

11.0.2 R: Where/how to study?

If you haven’t used R so far, you will need to learn some basics. As a participant of the seminar you get 6 months of access to all the courses on DataCamp. DataCamp has become the go-to site for self-studying various data science skills (mostly software).

See this site for an overview of the R courses they offer. Basically, DataCamp offers a track “Data Scientist with R”.
While the introduction is free for everyone, you also have access to all other courses for six months.

If you like, you can also have a look at the other options below, but I would recommend that you start with DataCamp.

Try R: A short interactive intro to the language can be found here: http://tryr.codeschool.com/
Swirl: Learn R interactively within R itself: http://swirlstats.com/

11.0.3 R: Installation and setup

Below are some notes on the installation and setup of R and relevant packages on your own computer:

Install Rtools for Windows machines from CRAN (https://cran.r-project.org/bin/windows/Rtools/). If you are using OS X, you will need to install Xcode, available for free from the App Store. This will install a compiler (if you don’t have one installed), which is needed when installing packages from GitHub that require compilation from C++ source code.
Install the latest version of R from CRAN (https://cran.r-project.org/).
Install the latest version of RStudio (https://www.rstudio.com/products/RStudio/). RStudio is the editor we’ll rely on, i.e. we’ll write code in RStudio which is subsequently sent to and run within R.
Start RStudio and install & load the latest versions of various packages that we need.

install.packages("pacman", repos = "http://cran.us.r-project.org")
library(pacman)

# p_load() installs missing packages from CRAN and then loads them.
# Note: emo is not on CRAN; install it from GitHub first
# (devtools::install_github("hadley/emo")).
pacman::p_load(conflicted, tidyverse, foreign,
               knitr, printr,
               stargazer, plotly,
               scales, Matching,
               rgenoud, AER, lfe, plm,
               aod, randomizr, rdrobust, rddensity,
               reshape2, mnormt, rmarkdown,
               cobalt, haven, tidyselect,
               kableExtra, sandwich, lmtest, randomNames,
               DiagrammeR, textdata,
               RSelenium, wordcloud, keras,
               googleCloudVisionR, ggridges,
               stm, stminsights,
               ggthemes, RSQLite,
               emo,
               update = FALSE) # Set TRUE to update all

# Optional setup steps (uncomment and run once if needed):
# reticulate::install_miniconda()
# devtools::install_github("hadley/emo")
# Install Anaconda beforehand: https://www.anaconda.com/products/individual
# install.packages("tensorflow")
# tensorflow::install_tensorflow()
# reticulate::use_condaenv("r-tensorflow")
# Sometimes you have to run R in admin mode.

library(reticulate)
library(keras)
# Create a Python virtual environment and install Keras into it
virtualenv_create("myenv")
install_keras(method = "virtualenv", envname = "myenv")
# Tell reticulate to use this environment
use_virtualenv("myenv")

# Choose functions with the conflicted package
  conflict_prefer("mutate", "dplyr")
  conflict_prefer("group_by", "dplyr")
  conflict_prefer("ungroup", "dplyr")
  conflict_prefer("filter", "dplyr")
  conflict_prefer("pivot_wider", "tidyr")
  conflict_prefer("pivot_longer", "tidyr")
  conflict_prefer("arrange", "dplyr")
  conflict_prefer("layout", "plotly")
  conflict_prefer("select", "dplyr")
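
Why the conflict_prefer() calls are useful can be illustrated with a small sketch. Several packages export functions with the same name (e.g., filter() exists in both stats and dplyr), and which one R uses depends on the order in which packages were attached. The snippet below is only an illustration: the variable name filter_home is made up, and the dplyr part is guarded because it assumes dplyr is installed.

```r
# Which package does an unqualified call to filter() currently resolve to?
# In a fresh session without dplyr attached this is the stats package.
filter_home <- environmentName(environment(filter))
print(filter_home)

# An explicit namespace prefix is always unambiguous
if (requireNamespace("dplyr", quietly = TRUE)) {
  four_cyl <- dplyr::filter(mtcars, cyl == 4)
  print(nrow(four_cyl))
}
```

With conflicted loaded, an ambiguous unqualified call raises an error instead of silently picking one version; conflict_prefer() then records the preferred one for the session.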

You may also read up on how to create and “knit” R Markdown files. Essentially, such files allow you to integrate the analyses you conduct with the text you write, which is ideal for reproducibility. Here is an intro to the concept and a simple example: http://rmarkdown.rstudio.com/lesson-1.html.

11.0.4 DataCamp

Address: https://www.datacamp.com/
6-month access to all the courses you like (Q: Did everyone get the invitation?)
Provides various tracks
- Data Scientist with R
- Data Scientist with Python
Q: What is your experience with DataCamp? Do you like it?
Q: What makes DataCamp as a platform so powerful? Will it replace humans? (Memory!)

References

Angwin, Julia, Jeff Larson, Lauren Kirchner, and Surya Mattu. 2016. “Machine Bias.” https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.

Collins, Susan, Yizhou Sun, Michal Kosinski, David Stillwell, and Natasha Markuzon. 2015. “Are You Satisfied with Life?: Predicting Satisfaction with Life from Facebook.” In Social Computing, Behavioral-Cultural Modeling, and Prediction, 24–33. Springer International Publishing.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics. Springer.

Kaiser, Micha, Steffen Otterbach, and Alfonso Sousa-Poza. 2022. “Using Machine Learning to Uncover the Relation Between Age and Life Satisfaction.” Sci. Rep. 12 (1): 5263.

Lee, Claire S, Jeremy Du, and Michael Guerzhoy. 2020. “Auditing the COMPAS Recidivism Risk Assessment Tool: Predictive Modelling and Algorithmic Fairness in CS1.” In Proceedings of the 2020 ACM Conference on Innovation and Technology in Computer Science Education, 535–36. ITiCSE ’20. New York, NY, USA: Association for Computing Machinery.

Pan, Zexuan, and Maria Cutumisu. 2023. “Using Machine Learning to Predict UK and Japanese Secondary Students’ Life Satisfaction in PISA 2018.” Br. J. Educ. Psychol., December.

Prati, Gabriele. 2022. “Correlates of Quality of Life, Happiness and Life Satisfaction Among European Adults Older Than 50 Years: A Machine‐learning Approach.” Arch. Gerontol. Geriatr. 103 (November): 104791.

R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Shen, Xiaofang, Fei Yin, and Can Jiao. 2023. “Predictive Models of Life Satisfaction in Older People: A Machine Learning Approach.” Int. J. Environ. Res. Public Health 20 (3).

Sievert, Carson. 2020. Interactive Web-Based Data Visualization with R, Plotly, and Shiny. Chapman and Hall/CRC. https://plotly-r.com.

Wickham, Hadley. 2016. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.

Footnotes

¹ Please ask, no matter whether you think the question is stupid or you have missed something in the course. Any question is valuable and any repetition of content is useful.
² Creators: Core contributors and thousands of package authors.
³ Creators: https://github.com/tidyverse/ggplot2
⁴ Creators: https://github.com/plotly/plotly.js; https://github.com/ropensci/plotly
⁵ The seminar consists of a mix of theoretical and applied sessions. For the applied sessions we will rely on the software R. While there are various programs one could use, the reasons mentioned above speak for R (my personal view). The only real contender for data science is Python. See here for a nice overview of the differences between the two.