B.3 Resources

ds4psy: Datasets

B.3.1 Data in base R

As R includes a datasets package, every version of R comes with a collection of datasets. To learn which datasets exist and obtain basic information about them, call

library(help = "datasets") 

To obtain information about any particular dataset x, call ?x. Throughout this book, we use quite a few of the datasets in examples and exercises.

# Info on datasets:
?anscombe 
?cars
?sleep
?Titanic

# Check dimensions:
dim(ChickWeight)
dim(iris)
dim(sleep)     # Student's Sleep Data
dim(Titanic)   # see also dim(FFTrees::titanic)
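Beyond the help pages and `dim()`, a few base R functions give a quick overview of what a dataset contains. The following is a minimal sketch using only base R; the object name `ds` is chosen here for illustration.

```r
# List the datasets of the 'datasets' package as a character matrix
# (the $results element has the columns Package, LibPath, Item, and Title):
ds <- data(package = "datasets")$results
head(ds[, c("Item", "Title")])

# Inspect the structure and the first rows of a data frame:
str(sleep)
head(sleep, 3)

# Not every dataset is a data frame: Titanic is a 4-dimensional table.
class(Titanic)
#> [1] "table"
names(dimnames(Titanic))
#> [1] "Class"    "Sex"      "Age"      "Survived"
```

Checking `class()` or `str()` first helps to avoid surprises, as functions that expect a data frame may not work on tables or time series.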

Because the datasets are included to illustrate particular types of data or problems, they vary widely in size and shape. For instance, the Nile dataset contains a single time series: measurements of the annual flow of the river Nile for the years 1871 to 1970.

# ?Nile
length(Nile)
#> [1] 100
typeof(Nile)
#> [1] "double"

# Plot the time series (the color Seeblau requires the unikn package):
plot(Nile, col = unikn::Seeblau, lwd = 3)
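Although `typeof()` reports the storage type of the values, the years are stored as attributes of a time series object. The following sketch uses only base R functions to show how to query them; the name `nile_early` is chosen here for illustration.

```r
# Nile is stored as a time series ("ts") object:
class(Nile)
#> [1] "ts"

# Its time attributes encode the years of measurement:
start(Nile)      # first time point (year, position within year)
#> [1] 1871    1
end(Nile)        # last time point
#> [1] 1970    1
frequency(Nile)  # number of observations per year
#> [1] 1

# Extract a sub-period with window():
nile_early <- window(Nile, start = 1871, end = 1880)
length(nile_early)
#> [1] 10
```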

B.3.2 Data in R packages

Many R packages contain datasets for demonstration purposes. For instance, this book uses the datasets included in the ds4psy package and the following datasets from various tidyverse packages:

  • ggplot2: diamonds, economics, mpg, msleep, etc.
  • dplyr: starwars, band_members, band_instruments, nasa, storms, etc.
  • tidyr: table1 to table5, etc.
  • stringr: words, sentences, etc.
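Datasets in packages can be listed and accessed without attaching the package. The following is a minimal sketch that assumes the ggplot2 package is installed (the guard skips the example otherwise); the names `gg_data` and `mpg_data` are chosen here for illustration.

```r
# Assumption: ggplot2 is installed; otherwise, the block is skipped.
if (requireNamespace("ggplot2", quietly = TRUE)) {

  # List the names of all datasets that ggplot2 provides:
  gg_data <- data(package = "ggplot2")$results[, "Item"]
  print(gg_data)

  # Access a dataset via the :: operator, without calling library():
  mpg_data <- ggplot2::mpg
  print(dim(mpg_data))
}
```

Using `pkg::dataset` makes the origin of a dataset explicit, which helps when different packages provide datasets with similar names (e.g., Titanic in datasets vs. titanic in FFTrees).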

Other packages that provide many datasets, small and large, include:

  • babynames: Number of children of each sex given each name (for names given to at least 5 children), as recorded by the U.S. Social Security Administration (Wickham, 2019).

  • DAAG: Data for Data Analysis and Graphics Using R (Maindonald & Braun, 2003, 2007, 2010).

  • dslabs: Datasets used for training in HarvardX’s Data Science Professional Certificate (Irizarry & Gill, 2019).

  • eurostat: Tools for downloading data from the Eurostat database (Lahti et al., 2017).

  • FFTrees: Data for binary classification tasks: breastcancer, car, heartdisease, mushrooms, titanic, wine (Phillips et al., 2017).

  • fpp2: Data for Forecasting: Principles and practice (Hyndman & Athanasopoulos, 2018).

  • nycflights13: Data for all flights departing from NYC in 2013 (Wickham, 2019).

  • ISLR: Data for An Introduction to Statistical Learning with Applications in R (James, Witten, Hastie, & Tibshirani, 2013).

  • MASS: Support functions and datasets for Venables and Ripley’s MASS (Ripley et al., 2019).

  • psych: Data for the Personality-project.org (Revelle et al., 2018).

  • yarrr: Data on pirates, movies, auction, etc. (Phillips, 2018).

This list is incidental and guaranteed to be incomplete. See Rdatasets for a more systematic collection of over 1300 datasets distributed through R and its packages.
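As a local complement to such collections, base R can list the datasets of all packages that are currently installed. The following sketch uses the `data()` and `.packages()` functions of base R; its output depends on the packages installed on a given machine, and the names `all_data` and `n_per_pkg` are chosen here for illustration.

```r
# List the datasets of all installed packages (may take a moment):
all_data <- data(package = .packages(all.available = TRUE))$results

# Count the datasets per package and show the largest collections:
n_per_pkg <- sort(table(all_data[, "Package"]), decreasing = TRUE)
head(n_per_pkg)
```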

B.3.3 Online sources

The web is full of data, of course, but most of it requires sound data science and a healthy dose of scepticism to be of any use. Here are some good starting points for finding free data.

Collections of datasets

Economic datasets

  • FRED provides mostly time series data on economic trends

  • IPUMS provides census and survey data on various issues from around the world

  • Pew Research Center provides survey data (using it requires a free account)

  • UC DATA provides data in the areas of political, social, and health sciences

Specific datasets

  • PanTHERIA: A species-level database of life history, ecology, and geography of extant and recently extinct mammals