4.5 Resources
The following resources cover the attitude and process of EDA in R (R Core Team, 2024) and the tidyverse (Wickham, 2023), ordered from more general to more specific:
4.5.1 EDA
Background readings
Key references on EDA include the books and papers by John W. Tukey (Tukey, 1969, 1977, 1980). An insightful overview is provided by Behrens (1997).
The broader implications of two distinct cultures in the use of statistical modeling are discussed by Breiman (2001) and the historical background and perspectives on data science are discussed by Donoho (2017).
Specific sources
Links to specific resources:
Exploratory Data Analysis with R (by Roger D. Peng) provides a book-length treatment of EDA.
To see a world in grains of sand (by Yihui Xie) illustrates potential solutions to overplotting.
Data analysis principles (by Karl Broman) may be a bit cryptic, but true.
R packages
For practical advice on the tidyverse packages dplyr and ggplot2:
read Chapter 7: Exploratory data analysis (EDA) of r4ds (Wickham & Grolemund, 2017) and complete the corresponding exercises.
study vignette("dplyr") and the ggplot2 vignettes (e.g., vignette("ggplot2-specs"));
study the examples at https://dplyr.tidyverse.org and https://ggplot2.tidyverse.org/reference/;
study the Posit cheatsheets on data transformation and data visualization
consult the R Graphics Cookbook (by Winston Chang) or the original ggplot2 book (Wickham, 2016)
check out various ggplot2 extensions
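As a minimal sketch of how to find the vignettes mentioned above, the following base R code lists the vignettes that ship with dplyr and ggplot2. The helper name list_vignettes() is hypothetical (it is not part of any package), and the code assumes nothing beyond base R; packages that are not installed are skipped.

```r
# Hypothetical helper (not from the text): return the vignette index of a
# package as a character matrix, or NULL if the package is not installed.
list_vignettes <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(NULL)
  }
  vignette(package = pkg)$results[, c("Item", "Title"), drop = FALSE]
}

# Show which vignettes are available for the two packages discussed here
for (pkg in c("dplyr", "ggplot2")) {
  print(list_vignettes(pkg))
}

# Open one vignette by name (interactive sessions only):
# vignette("dplyr")
```

The names in the Item column are the topics that can be passed to vignette(), so the listing also shows which vignette names exist in the installed package version.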
There are many other R packages that may become useful in the context of EDA. Examples include:
the codebook package automates the description of data frames
the dlookr package supports various tasks of data diagnosis, exploration, and transformation (including visualizations of missing data or outliers)
the sjmisc package provides miscellaneous utility functions, supporting data transformation tasks like recoding, dichotomizing or grouping variables
4.5.2 Visualization
Background readings
Given the variety of options, it is often difficult to decide when to use which type of plot (or geom). The landmark publications by Jacques Bertin (e.g., Bertin, 2011) and Edward R. Tufte (Tufte, 2001, 2006; Tufte et al., 1990) provide solid advice and many inspiring examples.
Specific sources
More specific resources on the principles of data visualization (with many beautiful or bizarre examples) include:
Various books (e.g., Cairo, 2012, 2016) and the weblog The Functional Art (by Alberto Cairo)
Various books (e.g., Yau, 2011, 2013) and the FlowingData site (by Nathan Yau)
The principle of proportional ink and the Calling Bullshit weblog (by Carl T. Bergstrom and Jevin West)
Data visualization principles (by Rafael A. Irizarry)
Data visualization: Basic principles (by Peter Aldhous)
More recent publications that are geared to the needs of aspiring data scientists include:
Data Visualization: A Practical Introduction (by Kieran Healy)
Fundamentals of Data Visualization (by Claus O. Wilke)
4.5.3 Miscellaneous links
The Simply Statistics blog (by Rafa Irizarry, Roger Peng, and Jeff Leek) provides many insightful and inspiring articles. For instance, the following posts relate to EDA:
Tukey, Design Thinking, and Better Questions (2019-04-17)
The Role of Theory in Data Analysis (2018-12-11)
Know your data. Really, really, know it: An article (by Randy Au, 2019-02-15) on what “knowing your data” means in applied contexts
[04_explore.Rmd updated on 2024-09-06 12:17:29.803733 by hn.]