Data cleaning for social scientists

Author

Felix Lennert

Published

November 23, 2022

Preface

This is the script for a two-day workshop on data cleaning for the social sciences. I assume you are familiar with basic R concepts such as the different data types and how to index them, the general structure of the syntax, and how to make function calls.

Motivation

In the 21st century, social scientists are able to tap into wells of data that are deeper than ever before. Not only can we use more designed data, i.e., data that have been generated with the clear goal of performing research using them in mind, e.g., survey data, than ever. Also, the rise of the internet as a sensor for human behavior provides us with new opportunities.

Given their variety, these data sets may all come in different structures, and if we want to leverage them to improve our understanding of the world, we need to reshape them properly. Therefore, we are in need of tools that are capable and flexible enough to work with all different kinds of data while remaining easily accessible. In my opinion, R (R Core Team 2021), RStudio, and the tidyverse packages (Wickham et al. 2019) strike a good balance here. R is a powerful, flexible, statistical programming language, RStudio serves as a convenient Graphic User Interface (GUI) that can be used by researchers free of charge, and, finally, the tidyverse, and also its adjacent packages, are a collection of packages with concisely defined use cases and consistent syntax that covers a wide array of data science applications from acquiring data, cleaning data, transforming data, to finally visualizing them, and using them to draw inferences on the underlying real-world data-generation processes.

This script is going to introduce you to the following things: we start with the basics, i.e., the RStudio environment, Projects, and how to read data. Thereafter, I will introduce you to the idea behind “tidy data” (Wickham 2014). This describes a particular way of structuring data and is necessary insofar as, basically, all tidyverse packages require data in this structure to function properly. Then, once the data set is in a tidy format, we can delve into the actual transformation process. Finally, an introduction to data visualization will be provided.

Some final introductory remarks

This stuff is far from trivial. Just like learning a new language, mastering coding in R requires hard and consistent effort. In order to facilitate your learning, I will link to as many external resources as possible and provide all data sets used in the script so that you can run and adapt (“play with”) the code yourself.

Let me also tell you that you will run into problems. Constantly. Therefore, googling error messages takes up considerable space in every thorough data analysis endeavor. Usually, you can just copy the error message and throw it into Google. The tricky part here is to strike a balance between staying generalizable enough (i.e., do not include your self-chosen file names or replace them with the data type, i.e., tibble or df – short for Data.Frame). Platforms to find help and further guidance are for instance StackOverflow and the RStudio forum.