Chapter 1 Exploratory Data Analysis

1.1 Exploring Diamond Prices

We consider a dataset with prices (in US dollars) and other information on 53,940 round-cut diamonds. The first 6 rows are shown below.

library(tidyverse)
data(diamonds)
head(diamonds)
##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

The dataset includes both:

  • categorical (or factor) variables: cut, color, and clarity, and
  • quantitative (or numeric) variables: carat, depth, table, price, x, y, and z.
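
As a quick check of this classification, the sketch below asks R which columns are stored as factors and which as numbers. It assumes the diamonds data come from the ggplot2 package (loaded as part of the tidyverse); the object names categorical and quantitative are introduced here only for illustration.

```r
library(ggplot2)  # provides the diamonds dataset
data(diamonds)

# First class label of each column, e.g. "ordered" for cut, "numeric" for carat
sapply(diamonds, function(column) class(column)[1])

# Names of the factor (categorical) and numeric (quantitative) columns
categorical  <- names(diamonds)[sapply(diamonds, is.factor)]
quantitative <- names(diamonds)[sapply(diamonds, is.numeric)]

categorical
quantitative
```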

1.1.1 Boxplot of Diamond Prices

ggplot(data = diamonds, aes(x = price, y = cut, fill = cut)) +
  geom_boxplot(outlier.size = 0.01, outlier.alpha = 0.1) +
  stat_summary(fun = mean, geom = "point", shape = 4, color = "red", size = 3)

What do we notice about the relationship between price and cut? Is this surprising?

1.1.2 Histogram of carat size and quality of cut

Next, we examine a histogram displaying carat size and quality of cut.

ggplot(data=diamonds, aes(x=carat, fill=cut)) + geom_histogram()

How does the information in this plot help explain the surprising result we saw in the boxplot?

1.1.3 Table 1: Average carat size and price by quality of cut

diamonds %>% group_by(cut) %>% summarize(N=n(), 
                                         Avg_carat=mean(carat), 
                                         Avg_price=mean(price) )
## # A tibble: 5 x 4
##   cut           N Avg_carat Avg_price
##   <ord>     <int>     <dbl>     <dbl>
## 1 Fair       1610     1.05      4359.
## 2 Good       4906     0.849     3929.
## 3 Very Good 12082     0.806     3982.
## 4 Premium   13791     0.892     4584.
## 5 Ideal     21551     0.703     3458.

1.1.4 Scatterplot of price, carat size, and quality of cut

Next, we use a scatterplot to visualize cut, price, and carat size.

ggplot(data=diamonds, aes(x=carat, y=price, color=cut)) + geom_point()

What should we conclude about the relationship between price and quality of cut? Are better cuts generally more expensive? less expensive? about the same? Does the relationship between price and cut seem to depend on carat size?

1.1.5 Terminology

The diamonds dataset illustrates two statistical concepts:

Simpson’s Paradox refers to a situation in which an apparent relationship between two variables changes or reverses when one or more additional variables are considered.
Example: diamonds with higher-quality cuts appear less expensive, on average, than diamonds with lower-quality cuts, until we account for carat size.
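
One way to see this numerically is to compare average prices by cut within groups of diamonds of similar size. The following is a minimal sketch, not the only reasonable approach: the half-carat group width and the names carat_group and price_by_size_and_cut are arbitrary choices made for illustration.

```r
library(dplyr)
library(ggplot2)  # provides diamonds and the cut_width() binning helper

# Average price by cut, computed separately within half-carat size groups
price_by_size_and_cut <- diamonds %>%
  mutate(carat_group = cut_width(carat, width = 0.5, boundary = 0)) %>%
  group_by(carat_group, cut) %>%
  summarize(N = n(), Avg_price = mean(price), .groups = "drop")

# Show the first few size groups
print(price_by_size_and_cut, n = 15)
```

Comparing rows within the same carat_group compares diamonds of roughly equal size, which is the comparison the overall averages by cut do not make.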

An interaction between two variables X and Y occurs when the relationship between X and a third variable Z depends on Y.
Example: the relationship between cut and price depends on carat size, so there is an interaction between cut and carat size.
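
A rough way to look for this interaction is to draw a separate trend line for each cut. The sketch below assumes a straight line is an acceptable simplification of the price-carat relationship, which is only approximately true; the name interaction_plot is introduced here for illustration.

```r
library(ggplot2)

# One fitted straight line per cut; lines that are not parallel suggest
# that the price difference between cuts changes with carat size
interaction_plot <- ggplot(data = diamonds, aes(x = carat, y = price, color = cut)) +
  geom_smooth(method = "lm", se = FALSE)

interaction_plot
```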

1.2 Tidy Data

1.2.1 Representations of Data

Data can be displayed in many different tabular forms. We’ll discuss one useful form, called tidy data.

Learning Outcomes:

  1. Define tidy data.

  2. Recognize when data are in tidy form.

Consider the following representations of the same dataset, which displays the number of tuberculosis cases in different countries alongside each country's population. This example comes from R for Data Science by Wickham and Grolemund.

1.2.2 Representation 1

country      year   cases  population
Afghanistan  1999     745    19987071
Afghanistan  2000    2666    20595360
Brazil       1999   37737   172006362
Brazil       2000   80488   174504898
China        1999  212258  1272915272
China        2000  213766  1280428583

1.2.3 Representation 2

country      year  type             count
Afghanistan  1999  cases              745
Afghanistan  1999  population    19987071
Afghanistan  2000  cases             2666
Afghanistan  2000  population    20595360
Brazil       1999  cases            37737
Brazil       1999  population   172006362
Brazil       2000  cases            80488
Brazil       2000  population   174504898
China        1999  cases           212258
China        1999  population  1272915272
China        2000  cases           213766
China        2000  population  1280428583

1.2.4 Representation 3

country      year  rate
Afghanistan  1999  745/19987071
Afghanistan  2000  2666/20595360
Brazil       1999  37737/172006362
Brazil       2000  80488/174504898
China        1999  212258/1272915272
China        2000  213766/1280428583

1.2.5 Representation 4

Table A:

country        1999    2000
Afghanistan     745    2666
Brazil        37737   80488
China        212258  213766

Table B:

country            1999        2000
Afghanistan    19987071    20595360
Brazil        172006362   174504898
China        1272915272  1280428583

1.2.6 Variables and Observations

In this example, we have observed data on various countries at different points in time. The record for a single country in a given year is called an observation.

For each observation, we record the country, year, number of cases, and population. These are called variables.

1.2.7 Tidy Data

A dataset is said to be tidy when it satisfies the following conditions:

  1. Each variable has its own column.
  2. Each observation has its own row.
  3. Each value has its own cell.

In fact, any two of these imply the third.

Image from R for Data Science by Wickham and Grolemund
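
The second condition can be checked in code. The sketch below assumes the example tables that ship with the tidyr package (table1 and table2), which appear to match Representations 1 and 2; the names rows_per_obs_1 and rows_per_obs_2 are introduced here for illustration.

```r
library(dplyr)
library(tidyr)  # table1 and table2 ship with tidyr

# Count how many rows each country-year observation occupies
rows_per_obs_1 <- table1 %>% count(country, year)
rows_per_obs_2 <- table2 %>% count(country, year)

# In a tidy table every observation occupies exactly one row
all(rows_per_obs_1$n == 1)
all(rows_per_obs_2$n == 1)
```

A check like this only addresses the rule about observations and rows. Whether each column is really a single variable still requires looking at what the columns mean.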

1.2.8 Representation 1 Tidy

Representation 1 is in tidy form.

country      year   cases  population
Afghanistan  1999     745    19987071
Afghanistan  2000    2666    20595360
Brazil       1999   37737   172006362
Brazil       2000   80488   174504898
China        1999  212258  1272915272
China        2000  213766  1280428583
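
Because each variable has its own column, a new variable can be computed directly from the existing ones. The following is a minimal sketch assuming tidyr's built-in table1 matches Representation 1; the column name rate_per_10k and the scaling to cases per 10,000 people are illustrative choices.

```r
library(dplyr)
library(tidyr)  # table1 ships with tidyr

# Tuberculosis cases per 10,000 people for each country and year
rates <- table1 %>%
  mutate(rate_per_10k = cases / population * 10000)

rates
```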

1.2.9 Representation 2 not Tidy

Representation 2 is not in tidy form.

  • Observations are spread over multiple rows.
  • The variables cases and population do not have their own columns.
  • The column type contains variable names, not values.

country      year  type             count
Afghanistan  1999  cases              745
Afghanistan  1999  population    19987071
Afghanistan  2000  cases             2666
Afghanistan  2000  population    20595360
Brazil       1999  cases            37737
Brazil       1999  population   172006362
Brazil       2000  cases            80488
Brazil       2000  population   174504898
China        1999  cases           212258
China        1999  population  1272915272
China        2000  cases           213766
China        2000  population  1280428583
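
For readers curious how such a table could be tidied, one possible approach is tidyr's pivot_wider(), which spreads the entries of type into separate columns. This is a sketch that assumes tidyr's built-in table2 matches Representation 2; the name tidy2 is introduced here for illustration.

```r
library(dplyr)
library(tidyr)  # table2 ships with tidyr

# Turn the variable names stored in `type` into columns,
# filling them with the corresponding values from `count`
tidy2 <- table2 %>%
  pivot_wider(names_from = type, values_from = count)

tidy2
```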

1.2.10 Representation 3 not Tidy

Representation 3 is not in tidy form.

The variables cases and population do not have their own columns, but are combined in a single column called rate.

country      year  rate
Afghanistan  1999  745/19987071
Afghanistan  2000  2666/20595360
Brazil       1999  37737/172006362
Brazil       2000  80488/174504898
China        1999  212258/1272915272
China        2000  213766/1280428583
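
One possible way to tidy a combined column like rate is tidyr's separate(), which splits a text column at a chosen character. The sketch below assumes tidyr's built-in table3 matches Representation 3; the name tidy3 is introduced here for illustration.

```r
library(dplyr)
library(tidyr)  # table3 ships with tidyr

# Split "cases/population" at the slash into two columns;
# convert = TRUE turns the resulting text into numbers
tidy3 <- table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)

tidy3
```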

1.2.11 Representation 4 not Tidy

Representation 4 is not in tidy form.

  • The variable year is spread across multiple columns.
  • The variables cases and population are spread over multiple tables.

Table A:

country        1999    2000
Afghanistan     745    2666
Brazil        37737   80488
China        212258  213766

Table B:

country            1999        2000
Afghanistan    19987071    20595360
Brazil        172006362   174504898
China        1272915272  1280428583
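
A possible way to tidy this representation is to gather the year columns of each table into rows with tidyr's pivot_longer() and then join the two results. This sketch assumes tidyr's built-in table4a and table4b match Tables A and B; the names tidy4a, tidy4b, and tidy4 are introduced here for illustration.

```r
library(dplyr)
library(tidyr)  # table4a and table4b ship with tidyr

# Move the year columns into a single `year` column in each table
tidy4a <- table4a %>%
  pivot_longer(cols = c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4b <- table4b %>%
  pivot_longer(cols = c(`1999`, `2000`), names_to = "year", values_to = "population")

# Combine the two tables, matching rows on country and year
tidy4 <- left_join(tidy4a, tidy4b, by = c("country", "year")) %>%
  mutate(year = as.integer(year))  # column names arrive as text

tidy4
```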

1.2.12 Why Use Tidy Data

  • Data are often easiest to work with when they are in tidy form.

  • The tidyverse collection of R packages is useful for creating graphs and calculating summary statistics when data are in tidy form.

  • Sometimes there is good reason for data to not be in tidy form. This is fine, but it makes the data harder to work with.

  • In this class, we will focus on data that are already in tidy form. However, if you come across data on your own, you should check that they are tidy before attempting to use the techniques we’ll see in this class.

  • In CMSC/STAT 205: Data-Scientific Programming, we study how to convert data into tidy form if it is not already. More information can be found in R For Data Science by Wickham and Grolemund.
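
To illustrate the first two points, the sketch below computes a summary statistic and draws a graph directly from a tidy table. It assumes tidyr's built-in table1 matches Representation 1; the names cases_by_year and cases_plot are introduced here for illustration.

```r
library(dplyr)
library(ggplot2)
library(tidyr)  # table1 ships with tidyr

# Summary statistic: total cases across the three countries in each year
cases_by_year <- table1 %>%
  group_by(year) %>%
  summarize(total_cases = sum(cases))

cases_by_year

# Graph: each column maps directly onto one feature of the plot
cases_plot <- ggplot(data = table1, aes(x = year, y = cases, color = country)) +
  geom_line() +
  geom_point()

cases_plot
```

Both steps refer to variables by column name, which is possible because each variable has its own column.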