B Datasets and sources
The word information, in this theory, is used in a special sense
that must not be confused with its ordinary usage.
In particular, information must not be confused with meaning.
Claude E. Shannon and Warren Weaver (1949)
Data is the raw material of data science. As the chapters of this book explore various types and shapes of data, we tacitly assume that analyzing and manipulating data can uncover some aspect that we did not see before and — by adopting new perspectives — enable valuable insights. While this is possible, of course, it should not be taken for granted. Data can contain information, but can also consist of mere bullshit, gibberish, and trash. Just as more data does not necessarily contain more information, data can be valuable or worthless, and thus guide or mislead our investigations.
In data science, data is the stuff that, given the right spells and concoctions of the data alchemist, is to be turned into gold. Although data is increasingly ubiquitous, it remains difficult to turn raw data into something that makes sense or creates value. This should not surprise us: Air and water appear to be ubiquitous in many parts of the world, yet any scarcity of them makes them immensely valuable. Similarly, it requires a lot of knowledge, skill, and effort to build a solid house out of a heap of dirt, wood, and stones. And while data science typically does not get our hands dirty, we should never underestimate the amount of effort and frustration involved in cleaning up some messy pile of data.
Datasets in ds4psy
A key aspect of mining data for hidden treasures is that its origin and source are documented and properly acknowledged. This appendix provides some background information on the main datasets used in this book and their sources. Most of the data used throughout this book is already included in R (in the datasets package) or provided by other R packages (e.g., the ds4psy and tidyverse packages). Occasionally, we create and use small toy datasets to illustrate a particular command or technical point, but the vast majority of analyses and visualizations throughout this book use real datasets that people have collected to answer empirical questions.
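As a minimal sketch of how to see which datasets a package provides, the following R code lists the dataset names in the datasets package and, if it is installed, in ds4psy. The variable names base_data and ds4psy_data are chosen here for illustration only, and the exact entries returned depend on the installed package versions.

```r
# Names of all datasets shipped with R's 'datasets' package.
# data() returns an object whose 'results' matrix has an "Item" column.
base_data <- data(package = "datasets")$results[, "Item"]
head(base_data)

# The same listing for ds4psy, guarded so that the code also runs
# when the package is not installed (an assumption about your setup):
if (requireNamespace("ds4psy", quietly = TRUE)) {
  ds4psy_data <- data(package = "ds4psy")$results[, "Item"]
  print(ds4psy_data)
}
```

Calling ?name on any listed dataset (with the corresponding package loaded) opens its documentation, which typically states the source of the data.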
To accommodate the special interests of social scientists, we focus on people-related data, in which cases represent persons and the variables provide information about them (e.g., characteristics like age, gender, etc., but also choices, opinions, preferences, etc.). Aiming for real data that addresses scientific questions in psychology prompted us to use two datasets that pop up frequently throughout this book:
Positive psychology: A dataset examining the effectiveness of web-based positive psychology interventions (Woodworth et al., 2017, 2018). (See Section B.1 for details.)
False positive psychology: A dataset showing problematic research practices within psychology (Simmons et al., 2011, 2014). (See Section B.2 for details.)
The following sections describe the context in which the data were collected and provide credit and references to the original sources. The concluding Section B.3 provides pointers to additional sources of data.