B.3.1 Data in base R
As R includes the datasets package, every version of R comes with a collection of datasets. To learn which datasets exist and obtain basic information about them, call data(package = "datasets") or library(help = "datasets").
To obtain information about any particular dataset, call help() or ? with its name (e.g., ?Nile).
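A minimal sketch of both lookups, using only base R. Nile serves as the example dataset here; any other dataset name would work the same way.

```r
# List the datasets that ship with the datasets package.
# data() returns an object whose $results matrix has one row per dataset.
info <- data(package = "datasets")
head(info$results[, c("Item", "Title")])

# Open the documentation for one particular dataset (here: Nile).
help("Nile", package = "datasets")
```

In an interactive session, typing ?Nile is the usual shortcut for the help() call.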
Throughout this book, we use quite a few of the datasets in examples and exercises.
As the datasets are included to illustrate particular types of data or problems, they vary widely in size and shape.
For instance, the Nile dataset contains a single time series with measurement values of the annual flow of the river Nile from the years 1871 to 1970.
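These properties can be checked directly. The following sketch inspects Nile with base R functions; it is only one of several ways to look at a time series object.

```r
# Nile is stored as a time series (class "ts") with one value per year.
class(Nile)    # "ts"
start(Nile)    # 1871 1  (first year, first observation within that year)
end(Nile)      # 1970 1
length(Nile)   # 100 annual measurements

# Basic distribution of the annual flow values
summary(Nile)
```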
B.3.2 Data in R packages
Many R packages contain datasets for demonstration purposes. For instance, this book primarily uses datasets that come with the ds4psy R package (Neth, 2020): some are real, while others are smaller tables that were generated to highlight or practice particular aspects or tasks.
Including data in packages has both benefits and costs. It makes the corresponding tables easily accessible, but this convenience means that students no longer learn to retrieve and load real datasets, which often require extensive pre-processing. To alleviate this dilemma somewhat, we also store some datasets in a variety of formats on a web server (at http://rpository.com/ds4psy/; see Chapter 6 on Importing data).
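The difference between the two routes can be sketched with base R alone. In this illustration, the built-in airquality dataset stands in for packaged data, and a temporary CSV file stands in for a file retrieved from a server; neither is taken from ds4psy, and no file name on the web server is assumed.

```r
# Route 1: packaged data is available by name, already typed and cleaned.
df_pkg <- datasets::airquality

# Route 2: data stored in a file must be located, read, and checked.
# A temporary file (hypothetical stand-in for a downloaded file) is used here.
tmp <- tempfile(fileext = ".csv")
write.csv(df_pkg, tmp, row.names = FALSE)

df_file <- read.csv(tmp)   # the import step that packaged data skips
str(df_file)               # inspect column types after import

# Verify that the imported table has the same shape and column names.
identical(dim(df_pkg), dim(df_file))
identical(names(df_pkg), names(df_file))

unlink(tmp)                # clean up the temporary file
```

Real files rarely round-trip this cleanly: separators, encodings, and missing-value codes usually need to be specified during import.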
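Datasets from other packages that are used in this book include:
ggplot2: msleep (see Chapter 2)
dplyr: storms, etc. (see Chapters 3, 4, 5, and 8)
tidyr: table5, etc. (see Chapter 7)
stringr: sentences (see Chapter 9)
lubridate: lakers (see Chapter 10)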
Other packages with many small and large datasets include:
babynames: Data on the number of children of each sex given each name (for names given to at least 5 children), provided by the U.S. Social Security Administration (Wickham, 2019).
DAAG: Data for Data Analysis and Graphics Using R (Maindonald & Braun, 2003, 2007, 2010).
dslabs: Datasets used for training in HarvardX’s Data Science Professional Certificate (Irizarry & Gill, 2019).
eurostat: Tools for downloading data from the Eurostat database (Lahti et al., 2017).
FFTrees: Data to be used for solving binary classification tasks, e.g., wine (Phillips, Neth, Woike, & Gaissmaier, 2020).
fpp2: Data for Forecasting: Principles and practice (Hyndman & Athanasopoulos, 2018).
nycflights13: Data for all flights departing from NYC in 2013 (Wickham, 2019).
ISLR: Data for An Introduction to Statistical Learning with Applications in R (James, Witten, Hastie, & Tibshirani, 2013).
psych: Data for the Personality-project.org (Revelle et al., 2018).
yarrr: Data on auction, etc. (Phillips, 2017).
This list is incidental and guaranteed to be incomplete. See Rdatasets for a more systematic collection of over 1300 datasets distributed through R and its packages.
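To see which datasets an installed package provides, the data() function can be queried per package. The sketch below wraps this in a small helper; the function name list_package_data is hypothetical (not part of any package) and it returns an empty vector when the package is not installed.

```r
# Hypothetical helper: names of all datasets in an installed package.
list_package_data <- function(pkg) {
  # Return an empty result if the package is not installed.
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(character(0))
  }
  # data() lists datasets; the "Item" column holds their names.
  unname(data(package = pkg)$results[, "Item"])
}

head(list_package_data("datasets"))   # always available in base R
list_package_data("nycflights13")     # character(0) unless installed
```

A single dataset can then be accessed without attaching its package by using the :: operator, as in datasets::Nile.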
B.3.3 Online sources
The web is full of data, of course, but most of it needs sound data science and a healthy dose of scepticism to be of any use. Here are some good starting points for finding free data.
Collections of datasets
Rdatasets: A collection of over 1300 datasets distributed through R and its packages
Kaggle: A place for data science projects (with many large datasets)
Wikidata: Wikipedia data
Gapminder: An independent Swedish foundation that fights misconceptions about global development with facts
Statistisches Bundesamt: German data on various issues
PanTHERIA: A species-level database of life history, ecology, and geography of extant and recently extinct mammals
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112). Retrieved from http://faculty.marshall.usc.edu/gareth-james/ISL/
Neth, H. (2020). ds4psy: Data science for psychologists. Retrieved from https://CRAN.R-project.org/package=ds4psy
Phillips, N. (2017). yarrr: A companion to the e-book "yarrr!: The pirate’s guide to R". Retrieved from www.thepiratesguidetor.com
Phillips, N., Neth, H., Woike, J., & Gaissmaier, W. (2020). FFTrees: Generate, visualise, and evaluate fast-and-frugal decision trees. Retrieved from https://CRAN.R-project.org/package=FFTrees
Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). Retrieved from http://www.stats.ox.ac.uk/pub/MASS4
Wickham, H. (2019c). tidyverse: Easily install and load the ’tidyverse’. Retrieved from https://CRAN.R-project.org/package=tidyverse