B.3.1 Data in base R
As R includes a datasets package, every version of R comes with a collection of datasets. To learn which datasets exist and obtain basic information about them, call
library(help = "datasets")
To obtain information about any particular dataset x, call ?x. Throughout this book, we use quite a few of these datasets in examples and exercises.
# Info on datasets:
?anscombe
?cars
?sleep
?Titanic

# Check dimensions:
dim(ChickWeight)
dim(iris)
dim(sleep)    # Student's Sleep Data
dim(Titanic)  # see also dim(FFTrees::titanic)
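The index of available datasets can also be queried programmatically. The following is a minimal sketch that turns the result of data() into a data frame; the object names ds_index and ds_info are only illustrative and do not appear elsewhere in this book.

```r
# Retrieve the index of the datasets package as a data frame:
ds_index <- as.data.frame(data(package = "datasets")$results,
                          stringsAsFactors = FALSE)

# Keep only the dataset names and their short titles:
ds_info <- ds_index[, c("Item", "Title")]

nrow(ds_info)  # number of index entries (varies across R versions)
head(ds_info)  # first few names and titles

# Check that the datasets inspected above are listed:
c("anscombe", "cars", "sleep", "Titanic") %in% ds_info$Item
```

Working with a data frame (rather than the help page) makes it possible to search or filter the index, e.g., with grepl() on the Title column.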
As the datasets are included to illustrate particular types of data or problems, they vary widely in size and shape. For instance, the Nile dataset contains a single time series of the annual flow of the river Nile from 1871 to 1970.
# ?Nile
length(Nile)
#> [1] 100
typeof(Nile)
#> [1] "double"
plot(Nile, col = unikn::Seeblau, lwd = 3)
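Since Nile is stored as a time series object, its time-related properties can be inspected with base R functions. A brief sketch (the name nile_early is only illustrative):

```r
# Nile is a time series (class "ts") with one value per year:
class(Nile)      # "ts"
start(Nile)      # first time point: 1871 1
end(Nile)        # last time point:  1970 1
frequency(Nile)  # 1 observation per year

# Extract a sub-period with window():
nile_early <- window(Nile, start = 1871, end = 1880)
length(nile_early)  # 10 annual values
```

Using window() instead of numeric indexing keeps the result a time series, so that plots retain the years on their x-axis.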
B.3.2 Data in R packages
Many R packages contain datasets for demonstration purposes. For instance, this book uses the datasets included in the ds4psy package and the following datasets from various tidyverse packages:
Other packages with many small and large datasets include:
babynames: Data on the number of children of each sex given each name (for names given to at least 5 children), provided by the U.S. Social Security Administration (Wickham, 2019).
DAAG: Data for Data Analysis and Graphics Using R (Maindonald & Braun, 2003, 2007, 2010).
dslabs: Datasets used for teaching in the HarvardX Data Science Professional Certificate (Irizarry & Gill, 2019).
eurostat: Tools for downloading data from the Eurostat database (Lahti et al., 2017).
FFTrees: Data for binary classification tasks, including wine (Phillips et al., 2017).
fpp2: Data for Forecasting: Principles and practice (Hyndman & Athanasopoulos, 2018).
nycflights13: Data for all flights departing from NYC in 2013 (Wickham, 2019).
ISLR: Data for An Introduction to Statistical Learning with Applications in R (James, Witten, Hastie, & Tibshirani, 2013).
MASS: Support functions and datasets for Venables and Ripley’s MASS (Ripley et al., 2019).
psych: Data for the Personality-project.org (Revelle et al., 2018).
yarrr: Data on auction, etc. (Phillips, 2018).
This list is incidental and guaranteed to be incomplete. See Rdatasets for a more systematic collection of over 1300 datasets distributed through R and its packages.
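To see which datasets an installed package provides, the data() function can be pointed at that package. The following sketch uses MASS as an example and assumes that it is installed (it is a recommended package that ships with most R distributions); the names pkg and pkg_data are only illustrative.

```r
# Assumption: the MASS package is installed.
pkg <- "MASS"

if (requireNamespace(pkg, quietly = TRUE)) {

  # Index of all datasets in the package:
  pkg_data <- as.data.frame(data(package = pkg)$results,
                            stringsAsFactors = FALSE)
  print(head(pkg_data[, c("Item", "Title")]))

  # Load a single dataset without attaching the package:
  data("Boston", package = pkg)
  print(dim(Boston))

} else {

  message("Package '", pkg, "' is not installed.")

}
```

Loading a dataset via data(..., package = ...) avoids attaching the entire package with library(), which helps to prevent name conflicts between packages.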
B.3.3 Online sources
The web is full of data, of course, but most of it needs sound data science and a sound dose of scepticism to be of any use. Here are some good starting points for finding free data.
Collections of datasets
Rdatasets: A collection of over 1300 datasets distributed through R and its packages
Kaggle: A place for data science projects (with many large datasets)
Wikidata: Wikipedia data
Gapminder: An independent Swedish foundation that fights misconceptions about global development with facts
Statistisches Bundesamt: German data on various issues
PanTHERIA: A species-level database of life history, ecology, and geography of extant and recently extinct mammals