4.5 Resources

The following resources address the mindset and process of EDA in R (R Core Team, 2024) and the tidyverse (Wickham, 2023), ordered from general to specific:

4.5.1 EDA

Background readings

Key references on EDA include the books and papers by John W. Tukey (Tukey, 1969, 1977, 1980). An insightful overview is provided by Behrens (1997).

Breiman (2001) discusses the broader implications of two distinct cultures in the use of statistical modeling. Donoho (2017) covers the historical background of data science and offers perspectives on its development.

Specific sources

Links to specific resources:

R packages

For practical advice on the tidyverse packages dplyr and ggplot2:

Many other R packages can be useful in the context of EDA. Examples include:

  • the codebook package automates the description of data frames

  • the dlookr package supports various tasks of data diagnosis, exploration, and transformation (including visualizations of missing data or outliers)

  • the sjmisc package provides miscellaneous utility functions, supporting data transformation tasks like recoding, dichotomizing or grouping variables
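
As a rough illustration of the kinds of tasks these packages automate, the following base-R sketch describes a data frame, counts missing values, flags potential outliers, and dichotomizes a variable. It deliberately avoids the packages' own functions, so nothing here reflects their actual interfaces. The use of the built-in `airquality` data and the names `description`, `ozone_outliers`, and `Temp_high` are assumptions made for this example.

```r
# Base-R sketch of tasks that codebook, dlookr, and sjmisc aim to automate.
# Assumption: the built-in `airquality` data serves as the example data frame.
df <- airquality

# 1. Describe the data frame: type and number of distinct values per variable
description <- data.frame(
  variable   = names(df),
  type       = vapply(df, function(x) class(x)[1], character(1)),
  n_distinct = vapply(df, function(x) length(unique(x)), integer(1)),
  row.names  = NULL
)

# 2. Diagnose missing data: count and proportion of NA values per variable
description$n_missing <- vapply(df, function(x) sum(is.na(x)), integer(1))
description$p_missing <- description$n_missing / nrow(df)

# 3. Flag potential outliers with the 1.5 * IQR rule used by boxplots
#    (boxplot.stats() ignores missing values)
ozone_outliers <- boxplot.stats(df$Ozone)$out

# 4. Dichotomize a variable: 1 if above the median, 0 otherwise
df$Temp_high <- as.integer(df$Temp > median(df$Temp))

print(description)
print(ozone_outliers)
print(table(df$Temp_high))
```

The dedicated packages typically wrap such steps in single function calls and add reporting or plotting on top, which is why they can save time once the underlying logic is understood.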

4.5.2 Visualization

Background readings

Given the variety of options, it is often difficult to decide when to use which type of plot (or geom). The landmark publications by Jacques Bertin (e.g., Bertin, 2011) and Edward R. Tufte (Tufte, 2001, 2006; Tufte et al., 1990) provide solid advice and many inspiring examples.

Specific sources

More specific resources on the principles of data visualization (with many beautiful or bizarre examples) include:

More recent publications that are geared to the needs of aspiring data scientists include:

4.5.3 Miscellaneous links

The Simply Statistics blog (by Rafa Irizarry, Roger Peng, and Jeff Leek) provides many insightful and inspiring articles. For instance, the following posts relate to EDA:

Know your data. Really, really, know it: An article (by Randy Au, 2019-02-15) on what “knowing your data” means in applied contexts



[04_explore.Rmd updated on 2024-09-06 12:17:29.803733 by hn.]

References

Behrens, J. T. (1997). Principles and procedures of exploratory data analysis. Psychological Methods, 2(2), 131–160. https://doi.org/10.1037/1082-989X.2.2.131
Bertin, J. (2011). Semiology of graphics: Diagrams, networks, maps (Vol. 1). ESRI Press.
Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3), 199–231. https://doi.org/10.1214/ss/1009213726
Cairo, A. (2012). The functional art: An introduction to information graphics and visualization. Berkeley, CA: New Riders.
Cairo, A. (2016). The truthful art: Data, charts, and maps for communication. Berkeley, CA: New Riders.
Donoho, D. (2017). 50 years of data science. Journal of Computational and Graphical Statistics, 26(4), 745–766. https://doi.org/10.1080/10618600.2017.1384734
R Core Team. (2024). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https://www.R-project.org
Tufte, E. R. (2001). The visual display of quantitative information (2nd ed.). Cheshire, CT: Graphics Press.
Tufte, E. R. (2006). Beautiful evidence (Vol. 1). Cheshire, CT: Graphics Press.
Tufte, E. R., Goeler, N. H., & Benson, R. (1990). Envisioning information (Vol. 126). Cheshire, CT: Graphics Press.
Tukey, J. W. (1969). Analyzing data: Sanctification or detective work? American Psychologist, 24(2), 83–91. https://doi.org/10.1037/h0027108
Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.
Tukey, J. W. (1980). We need both exploratory and confirmatory. The American Statistician, 34(1), 23–25. Retrieved from https://www.jstor.org/stable/2682991
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis (2nd ed.). Retrieved from https://ggplot2-book.org/
Wickham, H. (2023). tidyverse: Easily install and load the ’tidyverse’. Retrieved from https://tidyverse.tidyverse.org
Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. Retrieved from http://r4ds.had.co.nz
Yau, N. (2011). Visualize this: The FlowingData guide to design, visualization, and statistics. Hoboken, NJ: John Wiley & Sons.
Yau, N. (2013). Data points: Visualization that means something. Hoboken, NJ: John Wiley & Sons.