B Datasets

Data is the stuff of data science, the raw material that — given the right spells and concoctions of the data alchemist — is to be turned into gold. Although data is increasingly ubiquitous, it is difficult to turn raw data into something that makes sense or creates value. This should not surprise us: It also takes a lot of knowledge, skills, and effort to build a house out of a heap of dirt, stones, and wood. And while data science typically does not get our hands dirty, we should never underestimate the amount of effort and frustration involved in cleaning up some messy pile of data.

This chapter provides background information on the main datasets used in this book and their sources. Most of the data used throughout this book is either included in R (in the datasets package) or provided by R packages (e.g., those of the tidyverse). Occasionally, we create small toy datasets to illustrate a command or technical point, but the vast majority of analyses and visualizations in this book use real datasets that people have collected to answer empirical questions.
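As a minimal sketch of how to see which datasets are already available, the following R code lists the contents of the datasets package and, if it happens to be installed, of one tidyverse package. The choice of dplyr and of the sleep dataset is only for illustration and is not taken from this chapter.

```r
# List the datasets that ship with R's 'datasets' package.
# data(package = ...) returns a list whose 'results' element is a character
# matrix with the columns "Package", "LibPath", "Item", and "Title".
builtin <- data(package = "datasets")$results
print(head(builtin[, c("Item", "Title")]))

# Datasets provided by an add-on package (dplyr is an illustrative choice;
# the check skips this step if the package is not installed).
if (requireNamespace("dplyr", quietly = TRUE)) {
  print(head(data(package = "dplyr")$results[, c("Item", "Title")]))
}

# Inspect the structure of one built-in dataset without attaching a package.
str(datasets::sleep)
```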

To address the interests of psychologists and social scientists, we focus on people-related data, in which cases represent persons and variables provide information about them (e.g., characteristics such as age and gender, but also choices, opinions, and preferences). Aiming for real data that addresses scientific questions in psychology led us to two datasets that appear frequently throughout this book:

  1. Positive psychology: A dataset examining the effectiveness of web-based positive psychology interventions (Woodworth et al., 2017, 2018). (See Section B.1 for details.)

  2. False positive psychology: A dataset showing problematic research practices within psychology (J. P. Simmons et al., 2011; J. Simmons et al., 2014). (See Section B.2 for details.)

To make these datasets easy to use in R, we store them in easily accessible formats on a web server (at http://rpository.com/ds4psy/). The following sections describe the context in which the data were collected and give credit and references to the original sources. The concluding Section B.3 provides pointers to additional sources of data.
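Reading a file from this server into R could look roughly like the sketch below. Only the base URL is taken from the text above. The helper functions `dataset_url()` and `read_dataset()`, the CSV format, and the file name `some_file.csv` are hypothetical and serve only as illustrations; the actual file names and formats are given in the sections on the individual datasets.

```r
# Base URL of the web server, as stated in the text.
base_url <- "http://rpository.com/ds4psy/"

# Hypothetical helper: build the full URL of a data file on the server.
# Trailing slashes of the base are removed first to avoid "//" in the result.
dataset_url <- function(file, base = base_url) {
  paste0(sub("/+$", "", base), "/", file)
}

# Hypothetical helper: read a CSV file from the server into a data frame.
# Assumes the file is stored as CSV. Returns NULL (with a message) if the
# file cannot be read, e.g., because it does not exist or there is no network.
read_dataset <- function(file, base = base_url) {
  tryCatch(
    utils::read.csv(dataset_url(file, base), stringsAsFactors = FALSE),
    error = function(e) {
      message("Could not read ", file, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Usage (placeholder file name; replace it with an actual file on the server):
# df <- read_dataset("some_file.csv")
# if (!is.null(df)) str(df)
```

Wrapping the call in `tryCatch()` keeps a script running when the server or the network is unavailable, at the cost of having to check the result for `NULL`.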


Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632

Simmons, J., Nelson, L., & Simonsohn, U. (2014). Data from paper “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant”. Journal of Open Psychology Data, 2(1). https://doi.org/10.5334/jopd.aa

Woodworth, R. J., O’Brien-Malone, A., Diamond, M. R., & Schüz, B. (2017). Web-based positive psychology interventions: A reexamination of effectiveness. Journal of Clinical Psychology, 73(3), 218–232. https://doi.org/10.1002/jclp.22328

Woodworth, R. J., O’Brien-Malone, A., Diamond, M. R., & Schüz, B. (2018). Data from “Web-based positive psychology interventions: A reexamination of effectiveness”. Journal of Open Psychology Data, 6(1). https://doi.org/10.5334/jopd.35