# Chapter 7 Exploratory Data Analysis

Author: CW

Status: On-going

Reviewer:

## 7.1 Introduction

### 7.1.1 Prerequisites

```
library(tidyverse)
library(nycflights13)
```

## 7.2 Questions

## 7.3 Variation

```
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
```

```
smaller <- diamonds %>%
  filter(carat < 3)
```

### 7.3.4 Exercises

- Explore the distribution of each of the `x`, `y`, and `z` variables in `diamonds`. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

```
# Remove implausible measurements (zeros and extreme outliers)
diamonds <- diamonds %>% filter(2 < y & y < 20 & 2 < x & 2 < z & z < 20)
ggplot(diamonds) +
  geom_freqpoly(aes(x = x), binwidth = 0.01)
```

```
ggplot(diamonds) +
  geom_freqpoly(aes(x = y), binwidth = 0.01)
```

```
ggplot(diamonds) +
  geom_freqpoly(aes(x = z), binwidth = 0.01)
```

```
# x and y are often nearly equal (black points); z is plotted in blue
ggplot(diamonds) +
  geom_point(aes(x = x, y = y)) +
  geom_point(aes(x = x, y = z), color = "blue") +
  coord_fixed()
```

`x` and `y` have nearly identical distributions and `z` is consistently smaller, so it seems that `x` and `y` are the length and width, and `z` is the depth.
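As a numeric check on the plots above, the medians of the three dimensions can be compared directly. This is a minimal sketch, assuming dplyr 1.0 or later for `across()`; the name `dims` is only illustrative, and the zero measurements are dropped with a simpler filter than the one used above.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

dims <- ggplot2::diamonds %>%
  # Drop rows where a dimension was recorded as zero
  filter(x > 0, y > 0, z > 0) %>%
  # Median of each dimension, in millimetres
  summarise(across(c(x, y, z), median))

dims
```

If `x` and `y` are close to each other and `z` is clearly smaller, that supports reading `z` as the depth.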

- Explore the distribution of `price`. Do you discover anything unusual or surprising? (Hint: Carefully think about the `binwidth` and make sure you try a wide range of values.)

```
# remove false data points
diamonds <- diamonds %>% filter(2 < y & y < 20 & 2 < x & 2 < z & z < 20)
ggplot(diamonds) +
  geom_freqpoly(aes(x = price), binwidth = 10) +
  xlim(c(1000, 2000))
```

`## Warning: Removed 44207 rows containing non-finite values (stat_bin).`

`## Warning: Removed 2 rows containing missing values (geom_path).`

Surprisingly, there are no diamonds priced around $1,500: the frequency polygon drops to zero in a narrow range there.
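The gap can also be checked numerically instead of visually. The following is a minimal sketch; the $50 bucket width, the 1300 to 1700 window, and the names `price_counts` and `bucket` are arbitrary choices made for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

price_counts <- ggplot2::diamonds %>%
  filter(between(price, 1300, 1700)) %>%
  # Group prices into $50-wide buckets labelled by their lower bound
  mutate(bucket = price %/% 50 * 50) %>%
  count(bucket)

price_counts
```

Buckets that contain no diamonds do not appear in the result at all, so a missing row signals an empty price range.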

- How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?

`diamonds %>% filter(carat == 0.99) %>% count()`

```
## # A tibble: 1 x 1
## n
## <int>
## 1 23
```

`diamonds %>% filter(carat == 1) %>% count()`

```
## # A tibble: 1 x 1
## n
## <int>
## 1 1556
```

```
ggplot(diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.01) +
  xlim(c(0.97, 1.03))
```

`## Warning: Removed 48599 rows containing non-finite values (stat_bin).`

There are far more 1-carat diamonds than 0.99-carat diamonds. I think the cause is psychological: 1 carat feels like a whole new level compared with 0.99 carat, so for makers a little more material yields much more value.
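The two separate `filter()` calls can be folded into one table covering every carat value near 1. This is a minimal sketch; the window of 0.965 to 1.035 and the name `near_one` are illustrative choices.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

near_one <- ggplot2::diamonds %>%
  # Keep carat values in a small window around 1
  filter(between(carat, 0.965, 1.035)) %>%
  # One row per distinct carat value, with its frequency in column n
  count(carat)

near_one
```

Reading down the `n` column shows where the count jumps relative to the neighbouring values.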

- Compare and contrast `coord_cartesian()` vs `xlim()` or `ylim()` when zooming in on a histogram. What happens if you leave `binwidth` unset? What happens if you try and zoom so only half a bar shows?

```
ggplot(diamonds) +
  geom_histogram(aes(x = carat)) +
  xlim(c(0.97, 1.035))
```

``## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.``

`## Warning: Removed 48599 rows containing non-finite values (stat_bin).`

`## Warning: Removed 1 rows containing missing values (geom_bar).`

```
ggplot(diamonds) +
  geom_histogram(aes(x = carat)) +
  coord_cartesian(xlim = c(0.97, 1.035))
```

``## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.``

```
ggplot(diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.01) +
  xlim(c(0.97, 1.035))
```

`## Warning: Removed 48599 rows containing non-finite values (stat_bin).`

```
ggplot(diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.01) +
  coord_cartesian(xlim = c(0.97, 1.035))
```

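`coord_cartesian()` plots and then cuts: it computes the bins from all the data and only zooms the view. `xlim()` cuts and then plots: it drops the data outside the limits first and bins what remains. So `xlim()` does not show the half bar, and when `binwidth` is unset the default 30 bins are spread over whatever data each approach keeps.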
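The difference can be made concrete by inspecting the bars each plot computes. The sketch below uses `layer_data()` from ggplot2, which returns the data a layer draws; the object names (`p_xlim`, `p_zoom`, `bars_xlim`, `bars_zoom`) are illustrative.

```r
library(ggplot2)

p_xlim <- ggplot(diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.01) +
  xlim(c(0.97, 1.035))

p_zoom <- ggplot(diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.01) +
  coord_cartesian(xlim = c(0.97, 1.035))

# One row per bar; the xlim() version warns about the rows it removed
bars_xlim <- suppressWarnings(layer_data(p_xlim))
bars_zoom <- layer_data(p_zoom)

nrow(bars_xlim)  # bars computed only from data inside the limits
nrow(bars_zoom)  # bars computed from the full range of carat
```

The zoomed plot still carries bars for the whole range of `carat`; they are simply outside the visible window.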

## 7.4 Missing values

### 7.4.1 Exercises

- What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?

In a bar chart, `NA` is treated as just another category. In a histogram, `NA` is dropped because the x axis is continuous and ordered, so a missing value has no position on it.

```
set.seed(0)
df <- tibble(norm = rnorm(100)) %>% mutate(inrange = ifelse(norm > 2, NA, norm))
ggplot(df) +
  geom_histogram(aes(x = inrange))
```

``## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.``

`## Warning: Removed 2 rows containing non-finite values (stat_bin).`

`geom_histogram()` removed the rows with `NA` values and issued a warning.

```
# replace() keeps the factor labels; ifelse() would return the integer codes
df <- diamonds %>% mutate(cut = replace(cut, y > 7, NA))
ggplot(df) + geom_bar(aes(x = cut))
```

`geom_bar()` does not remove `NA`; it treats it as another category and draws a bar for it.

- What does `na.rm = TRUE` do in `mean()` and `sum()`?

It removes `NA` values before the mean or sum is calculated. Without it, a single `NA` makes the result `NA`.
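A short illustration with a made-up vector (the values in `x` are arbitrary):

```r
x <- c(1, 2, NA, 4)

mean(x)                # NA: the missing value propagates
sum(x)                 # NA
mean(x, na.rm = TRUE)  # 2.333333 (7 / 3)
sum(x, na.rm = TRUE)   # 7
```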

## 7.5 Covariation

### 7.5.1 A categorical and continuous variable

#### 7.5.1.1 Exercises

- Use what you’ve learned to improve the visualisation of the departure times of cancelled vs. non-cancelled flights.

```
flights %>%
  mutate(cancelled = is.na(dep_time) | is.na(arr_time)) %>%
  ggplot() +
  geom_boxplot(aes(x = cancelled, y = dep_time))
```

`## Warning: Removed 8255 rows containing non-finite values (stat_boxplot).`

```
flights %>%
  mutate(cancelled = is.na(dep_time) | is.na(arr_time)) %>%
  filter(cancelled) %>%
  select(dep_time)
```

```
## # A tibble: 8,713 x 1
## dep_time
## <int>
## 1 2016
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 2041
## 7 2145
## 8 NA
## 9 NA
## 10 NA
## # ... with 8,703 more rows
```

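Puzzled by this question: how can cancelled flights have departure times? By the definition of `cancelled` used here, the rows above with a non-missing `dep_time` must have a missing `arr_time`, so they departed but no arrival was recorded. Presumably the exercise intends the scheduled departure time (`sched_dep_time`), which every flight has.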
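One possible reading of the exercise is sketched below. It assumes that a missing `dep_time` marks a cancelled flight and compares scheduled departure times instead; the column name `sched_dep_hour` and the object name `p` are made up for illustration.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

p <- flights %>%
  mutate(
    # Assumption: a flight with no recorded departure time was cancelled
    cancelled = is.na(dep_time),
    # sched_dep_time is stored as HHMM (e.g. 1530); convert to decimal hours
    sched_dep_hour = sched_dep_time %/% 100 + (sched_dep_time %% 100) / 60
  ) %>%
  ggplot() +
  geom_boxplot(aes(x = cancelled, y = sched_dep_hour))

p
```

Because every flight has a scheduled departure time, no rows are dropped and the two boxes are directly comparable.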

- What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?

```
ggplot(diamonds) +
  geom_point(aes(x = carat, y = price), color = "blue", alpha = 0.5)
```

```
ggplot(diamonds) +
  geom_point(aes(x = depth, y = price), color = "red", alpha = 0.5)
```

```
ggplot(diamonds) +
  geom_point(aes(x = table, y = price), color = "red", alpha = 0.5)
```

```
ggplot(diamonds) +
  geom_point(aes(x = x, y = price), color = "red", alpha = 0.5)
```

```
ggplot(diamonds) +
  geom_point(aes(x = z, y = price), color = "red", alpha = 0.5)
```

Volume (the `x`, `y`, and `z` dimensions) and weight (`carat`) are the most important variables for predicting price. Since volume is highly correlated with weight, the two can be treated as a single variable.

```
ggplot(diamonds) +
  geom_boxplot(aes(x = cut, y = carat))
```

Diamonds with a better `cut` tend to have a lower `carat`, which makes their `price` lower. So if we don't look at `carat`, it appears that a better `cut` has a lower `price`.
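The same pattern can be read from a summary table rather than from the boxplot. This is a minimal sketch; the name `by_cut` and the choice of the median as the summary statistic are illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

by_cut <- ggplot2::diamonds %>%
  group_by(cut) %>%
  summarise(
    median_carat = median(carat),
    median_price = median(price)
  )

by_cut
```

Comparing the two columns row by row shows whether the cuts with heavier typical diamonds are also the ones with higher typical prices.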

- Install the `ggstance` package, and create a horizontal boxplot. How does this compare to using `coord_flip()`?

`library(ggstance)`

```
##
## Attaching package: 'ggstance'
```

```
## The following objects are masked from 'package:ggplot2':
##
## geom_errorbarh, GeomErrorbarh
```

`ggplot(diamonds) + geom_boxplot(aes(x = cut, y = carat)) + coord_flip()`

`ggplot(diamonds) + geom_boxploth(aes(x = carat, y = cut))`

The two plots look the same, but the `geom_boxploth()` call reads more naturally because each variable is mapped to the axis on which it is actually drawn.
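As a side note, and assuming ggplot2 3.3.0 or later, `geom_boxplot()` can infer a horizontal orientation on its own when the discrete variable is mapped to `y`, so neither `ggstance` nor `coord_flip()` is strictly needed. A minimal sketch (the name `p` is illustrative):

```r
library(ggplot2)

# With a continuous x and a discrete y, the boxes are drawn horizontally
p <- ggplot(diamonds) +
  geom_boxplot(aes(x = carat, y = cut))

p
```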

- One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the `lvplot` package, and try using `geom_lv()` to display the distribution of `price` vs `cut`. What do you learn? How do you interpret the plots?

```
library(lvplot)
ggplot(diamonds) + geom_lv(aes(x = cut, y = price))
```

While the boxplot only shows a few quantiles and outliers, the letter-value plot shows many quantiles.

- Compare and contrast `geom_violin()` with a facetted `geom_histogram()`, or a coloured `geom_freqpoly()`. What are the pros and cons of each method?

```
ggplot(diamonds) +
  geom_histogram(aes(x = price)) +
  facet_wrap(~cut)
```

``## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.``

```
ggplot(diamonds) +
  geom_freqpoly(aes(x = price)) +
  facet_wrap(~cut)
```

``## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.``

```
ggplot(diamonds) +
  geom_violin(aes(x = cut, y = price))
```

```
ggplot(diamonds) +
  geom_lv(aes(x = cut, y = price))
```

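The violin plot is the best of these for comparing the shape of the density distribution across categories. The faceted plots show the actual counts, but shapes are harder to compare across separate panels.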
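The exercise also mentions a coloured `geom_freqpoly()`, which overlays all cuts in a single panel. The sketch below assumes ggplot2 3.3.0 or later for `after_stat()`; the `binwidth` of 500 is an arbitrary choice and `p` is an illustrative name. Plotting density instead of count keeps cuts with very different numbers of diamonds comparable.

```r
library(ggplot2)

p <- ggplot(diamonds) +
  geom_freqpoly(
    aes(x = price, y = after_stat(density), colour = cut),
    binwidth = 500
  )

p
```

The overlay makes differences between cuts easy to see on a shared axis, at the cost of lines that can overlap and obscure one another.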

- If you have a small dataset, it’s sometimes useful to use `geom_jitter()` to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to `geom_jitter()`. List them and briefly describe what each one does.