Breiman (2001) discusses the broader implications of two distinct cultures in the use of statistical modeling, and Donoho (2017) discusses the historical background of and perspectives on data science.
Links to particular resources:
For practical advice on the tidyverse packages dplyr and ggplot2:
study the RStudio cheatsheets on data transformation and data visualization
check out various ggplot2 extensions
There are many other R packages that may become useful in the context of EDA. Examples include:
the codebook package automates the description of data frames
the dlookr package supports various tasks of data diagnosis, exploration, and transformation (including visualizations of missing data or outliers)
the sjmisc package provides miscellaneous utility functions, supporting data transformation tasks like recoding, dichotomizing or grouping variables
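The transformation tasks named above (recoding, dichotomizing, and grouping variables) can be sketched in base R. This is only a minimal illustration of what these operations do, not the API of sjmisc or any other package listed here, which provide dedicated helper functions for the same purposes. The built-in mtcars data and the new column names (mpg_high, cyl_rec, hp_group) are assumptions chosen for this example.

```r
# Built-in example data; any data frame with numeric columns would do.
df <- mtcars

# Dichotomizing: split mpg at its median into two groups (TRUE/FALSE).
df$mpg_high <- df$mpg > median(df$mpg)

# Recoding: collapse the three cylinder counts (4, 6, 8) into two labels.
df$cyl_rec <- ifelse(df$cyl == 4, "four", "six_or_more")

# Grouping: bin horsepower into intervals of width 100.
df$hp_group <- cut(df$hp, breaks = seq(0, 400, by = 100))

# Inspect the derived variables as frequency tables.
table(df$mpg_high)
table(df$cyl_rec)
table(df$hp_group)
```

Tabulating each derived variable right after creating it is a cheap check that no cases were lost or misassigned during the transformation.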
Given the variety of options, it is often difficult to decide when to use which type of plot (or geom). The landmark publications by Jacques Bertin (e.g., Bertin, 2011) and Edward R. Tufte (Tufte, 2001, 2006; Tufte et al., 1990) provide solid advice and many inspiring examples.
More specific resources on the principles of data visualization (with many beautiful or bizarre examples) include:
Data visualization principles (by Rafael A. Irizarry)
Data visualization: Basic principles (by Peter Aldhous)
More recent publications that are geared to the needs of aspiring data scientists include Cairo (2012, 2016) and Yau (2011, 2013).
4.5.3 Miscellaneous links
The Simply Statistics blog (by Rafa Irizarry, Roger Peng, and Jeff Leek) provides many insightful and inspiring articles. For instance, the following posts relate to EDA:
Tukey, Design Thinking, and Better Questions (2019-04-17)
The Role of Theory in Data Analysis (2018-12-11)
[04_explore.Rmd updated on 2020-11-20 16:56:12 by hn.]
Behrens, J. T. (1997). Principles and procedures of exploratory data analysis. Psychological Methods, 2(2), 131–160. https://doi.org/10.1037/1082-989X.2.2.131
Bertin, J. (2011). Semiology of graphics: Diagrams, networks, maps (Vol. 1). ESRI Press.
Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3), 199–231. https://doi.org/10.1214/ss/1009213726
Cairo, A. (2012). The functional art: An introduction to information graphics and visualization. Berkeley CA: New Riders.
Cairo, A. (2016). The truthful art: Data, charts, and maps for communication. Berkeley CA: New Riders.
Donoho, D. (2017). 50 years of data science. Journal of Computational and Graphical Statistics, 26(4), 745–766. https://doi.org/10.1080/10618600.2017.1384734
R Core Team. (2020). R: A language and environment for statistical computing. Retrieved from https://www.R-project.org
Tufte, E. R. (2001). The visual display of quantitative information (2nd ed.). Cheshire, CT: Graphics Press.
Tufte, E. R. (2006). Beautiful evidence (Vol. 1). Cheshire, CT: Graphics Press.
Tufte, E. R., Goeler, N. H., & Benson, R. (1990). Envisioning information (Vol. 126). Cheshire, CT: Graphics Press.
Tukey, J. W. (1969). Analyzing data: Sanctification or detective work? American Psychologist, 24(2), 83–91. https://doi.org/10.1037/h0027108
Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.
Tukey, J. W. (1980). We need both exploratory and confirmatory. The American Statistician, 34(1), 23–25. Retrieved from https://www.jstor.org/stable/2682991
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Retrieved from https://ggplot2.tidyverse.org
Wickham, H. (2019c). tidyverse: Easily install and load the ’tidyverse’. Retrieved from https://CRAN.R-project.org/package=tidyverse
Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. Retrieved from http://r4ds.had.co.nz
Yau, N. (2011). Visualize this: The FlowingData guide to design, visualization, and statistics. Hoboken, NJ: John Wiley & Sons.
Yau, N. (2013). Data points: Visualization that means something. Hoboken, NJ: John Wiley & Sons.