Exploratory Data Analysis and Visualization
2024-05-16
1 Introduction
This document is an English translation of the book Análisis Exploratorio de Datos y Visualización. It covers the contents of an introductory course on exploratory data analysis and visualization in a university degree in Data Science. Exploratory data analysis is a very broad field, and it is not possible to teach all of its aspects in depth in a single course. This course, introductory in nature, aims to provide a solid foundation in the most important tools of the field, although any of the topics studied here could lead to a much deeper and more detailed treatment than the one we present.
1.1 Phases of an exploratory data analysis
The exploratory analysis of a dataset goes through the following phases:
Establish the objectives of our data analysis. Regardless of the particular objectives of each case, there is always a general objective that we can summarize as: discover and present the reality behind the data.
Find the data necessary for our study and import it into our work environment.
Understand the information contained in the data and how it is organized.
Select the appropriate tools to process and visualize the information provided by the data.
Transform and organize the data according to the objectives of the study and the requirements of the tools that we are going to use.
Apply models to establish relationships between the data or to make predictions.
Communicate the results of our data analysis, explaining the conclusions and justifying the results effectively with attractive graphs.
All these phases interact with each other; for example, in view of the final result of the study, it may be necessary to incorporate new data to complete or confirm the results. The development of these phases must therefore be understood as an iterative process.
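By way of illustration, the phases above can be sketched in a few lines of base R on a dataset bundled with R (the airquality data). This is only a toy example; the object names aq and fit are our own choices, and the real tools for each phase are developed in the following chapters.

```r
# A minimal sketch of the phases of an exploratory data analysis in base R,
# using the built-in airquality dataset (daily measurements, New York, 1973).

# Import: load a dataset bundled with R instead of reading an external file.
data(airquality)

# Understand: inspect how the information is organized.
str(airquality)
summary(airquality)

# Transform: keep only complete observations and add a derived variable.
aq <- airquality[complete.cases(airquality), ]
aq$TempC <- (aq$Temp - 32) * 5 / 9  # convert Fahrenheit to Celsius

# Visualize: relate ozone concentration to temperature.
plot(aq$TempC, aq$Ozone, xlab = "Temperature (C)", ylab = "Ozone (ppb)")

# Model: a first linear model relating the two variables.
fit <- lm(Ozone ~ TempC, data = aq)

# Communicate: report the fitted coefficients.
print(coef(fit))
```

Even in this tiny example the iteration appears naturally: after seeing the plot and the model, we might go back and look for more data, or choose a better model.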
1.2 Course content
The course content is organized as follows:
Chapter 1: Introduction. Presents the phases of an exploratory data analysis, a summary of the course content and the reasoned justification for the choice of R as a unified development environment.
Chapter 2: Basic aspects of R. Reviews the data types, data structures and functions of R that we will use.
Chapter 3: Data processing. Introduces the databases that we will use, the handling of the usual data storage formats and the data manipulation tools provided by the dplyr library.
Chapter 4: Static data visualization. Covers the grammar of graphics, its implementation in the powerful ggplot library and its use for the generation of static graphics.
Chapter 5: Time series. Due to its practical importance, this chapter studies the particular case of exploratory analysis of time series, including prediction models.
Chapter 6: Dynamic visualization. Introduces, through the ggplotly, highcharter and leaflet libraries, powerful interactive visualization tools with which the user can interact with the graphs.
Chapter 7: Dashboards. Introduces, through the shiny and flexdashboard libraries, the design and implementation of dashboards that allow us to present, in a dynamic, attractive and coherent way, the main indicators of our data exploration.
Chapter 8: Dimensionality reduction. Studies the problem of reducing the number of variables by means of linear combinations between them, using the correlation between pairs of variables and the principal component analysis method.
1.3 The choice of R
The work environment we are going to use is based on the usual combination of R, RStudio and RMarkdown. In this course, we will assume that the student is minimally familiar with these tools. The following tutorials are recommended as a basic introduction to their combined use: tutorial 1, tutorial 2 and tutorial 3. The RStudio summary tab and the Markdown summary tab may also be useful. In any case, Chapter 2 is dedicated to reviewing the basic usage of R, and an appendix has been added with the basic syntax of RMarkdown.
Among the different development platforms that we could use in Data Science, we decided to unify all the course content around R for the following reasons:
R is a free development environment for statistical applications and data analysis that has achieved enormous success and adoption worldwide. It works very well and is robust, that is, it generates very few unexpected failures, and it is easy to install and manage.
As R is the reference platform for data analysis, for any important library related to the field that is implemented in another language, such as python or javascript, packages appear in R that serve as an interface to that library. This is used intensively in this course, and it makes the learning curve of these tools much gentler, since everything is done from the same development environment. In particular, it allows a single course to cover a large collection of powerful tools that, if each were studied in its own development environment, would be much more complex to learn and impossible to address in a single course.
The combination of R, RStudio and RMarkdown is ideal for experimenting and becoming familiar with all the concepts studied in the course. Furthermore, these tools are used by such a large number of people that practically any problem or error that arises has already been resolved, and by searching the Internet it is generally easy to find the solution.
The number of libraries with tools developed in R is immense: the CRAN repository alone, which is the official reference repository for storing and managing libraries, holds more than 19,000 registered libraries. Some libraries, like ggplot2, which we use in the course, have more than 138 million downloads.
1.4 The virtues of a data scientist
Neutrality. The data must speak for itself: we must set aside any preconceived ideas about the results of our data analysis that could introduce bias into the selection of data and conclusions.
Critical spirit. We must analyze and evaluate in detail the quality, veracity and reliability of the source of the data we handle. For example, the data provided by a weather station is very reliable because the measuring instruments it uses are reliable. In contrast, the data on the number of daily COVID-19 infections contains significant errors, because the measurement process is imprecise: the logistics of registering cases were not able, especially at the beginning of the epidemic, to manage and record all the cases, and many cases, such as asymptomatic ones, simply fly under the radar. Beyond errors in the measurement systems, another important source of unreliability is the bias with which data can be communicated, depending on the interests of whoever communicates it. For example, saying that taxes have been lowered when the reduction is negligible and only affects a small group of the population is not technically false, but the data is being communicated with a bias that distorts the information that the data as a whole provides.
Attribute relevance assessment. Not all attributes are equally important. For example, when assessing the degree of development of a country (see [RRR18]), the infant mortality rate is especially relevant, because the effort of families to keep their children alive is always maximal, and it is one of the first parameters to improve or worsen when the situation of a country changes.
A ratio is better than an isolated numerical value. In general, an isolated numerical value provides information that is difficult to interpret. For example, the annual profit of a company is hard to interpret if it is not compared to something. Comparing it with the profit of the previous year, or with the amount of capital invested and the operating expenses, provides more useful information. In the same way, for the purpose of comparing wealth between countries, per capita income provides more useful information than the total income of a country.
Be aware of the limitations of the data we handle. In general, the data we handle is the result of aggregating very diverse values. For example, the per capita income of a country tells us nothing about the inequalities in wealth that exist within the country.
Avoid generalizations. Our natural tendency is to generalize from the particular data we know. For example, the media puts a lot of emphasis on reporting violent acts, which can lead us to think that, in general terms, we live in a very violent society. However, the reality shown by the data is that we live in the least violent society that has ever existed.
References
[He19] Kieran Healy. Data Visualization, Princeton University Press, 2019.
[Ir19] Rafael A. Irizarry. Introduction to Data Science, Taylor & Francis, 2019.
[RRR18] Hans Rosling, Ola Rosling and Anna Rosling. Factfulness: Diez razones por las que estamos equivocados sobre el mundo, Deusto, 2018.
[SH16] Angelo Santana and Carmen N. Hernández. R4ULPGC: Introducción a R, Grupo de Estadística de la Universidad de Las Palmas de G.C., 2016.
[WiÇeGa23] Hadley Wickham, Mine Çetinkaya-Rundel and Garrett Grolemund. R for Data Science (2e), O'Reilly Media, 2023.
[Xie15] Yihui Xie. Dynamic Documents with R and knitr (2e), Chapman & Hall/CRC, 2015.
[Xie23] Yihui Xie. Bookdown: Authoring Books and Technical Documents with R Markdown, 2023.