11 Introduction to data visualization

11.1 Setup

This chunk of R code loads the packages that we will be using.

library(tidyverse)
library(gapminder)

11.2 Anscombe’s Quartet

“Anscombe’s Quartet” comprises four data sets, each with eleven rows and two variables (x and y). The quartet was constructed by Francis Anscombe and published in a 1973 paper titled “Graphs in Statistical Analysis” in the journal The American Statistician (Anscombe 1973).

The quartet is included in base R, but in an untidy format:

anscombe

##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89

Read the quartet data in a tidy form.⁶ Note that the variables have been renamed “ex” and “why”.

anscombe_tidy <- read_csv("data/anscombe_tidy.csv")

anscombe_tidy

## # A tibble: 44 × 4
##    observation set      ex   why
##          <dbl> <chr> <dbl> <dbl>
##  1           1 I        10  8.04
##  2           2 I         8  6.95
##  3           3 I        13  7.58
##  4           4 I         9  8.81
##  5           5 I        11  8.33
##  6           6 I        14  9.96
##  7           7 I         6  7.24
##  8           8 I         4  4.26
##  9           9 I        12 10.8 
## 10          10 I         7  4.82
## # ℹ 34 more rows
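
If the CSV file is not at hand, a similar table can be rebuilt from the base R version. The following is a minimal sketch, not necessarily how the file was produced: the name anscombe_rebuilt is hypothetical, and it assumes that the observation column simply numbers the rows 1 to 11 within each set.

```r
library(tidyverse)

# Sketch: reshape base R's `anscombe` into one row per observation per set.
# `anscombe_rebuilt` is a hypothetical name used only for this illustration.
anscombe_rebuilt <- anscombe |>
  mutate(observation = row_number()) |>
  pivot_longer(
    cols = -observation,
    names_to = c(".value", "set"),    # "x1" becomes column x, set "1"
    names_pattern = "([xy])([1-4])"
  ) |>
  mutate(set = as.character(as.roman(as.integer(set)))) |>
  rename(ex = x, why = y) |>
  arrange(set, observation) |>
  select(observation, set, ex, why)

anscombe_rebuilt
```

The special ".value" name tells pivot_longer() to use the first part of each column name (x or y) as an output column, while the second part (the digit) becomes the value of set.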

11.2.1 Summary statistics

Each of the four sets in Anscombe’s Quartet has nearly identical summary statistics. Let’s calculate the mean of ex for each of the four sets:

# example
anscombe_tidy |>
  group_by(set) |>
  summarize(mean_ex = mean(ex))

## # A tibble: 4 × 2
##   set   mean_ex
##   <chr>   <dbl>
## 1 I           9
## 2 II          9
## 3 III         9
## 4 IV          9

11.2.1.1 Exercise

Using the following functions, calculate the summary statistics of ex and why, and the correlation coefficient between ex and why, for all four of the sets in the quartet:

statistic	function
mean	mean()
standard deviation	sd()
correlation coefficient	cor()

Solution

To create a table with these statistics by the four sets, we first group_by() and then summarize() (or summarise()). Notice that we include all of the calculations within a single summarize() function.

# solution
anscombe_tidy |> 
  group_by(set) |> 
  summarize(mean(ex), 
            sd(ex), 
            mean(why), 
            sd(why), 
            cor(ex, why))

## # A tibble: 4 × 6
##   set   `mean(ex)` `sd(ex)` `mean(why)` `sd(why)` `cor(ex, why)`
##   <chr>      <dbl>    <dbl>       <dbl>     <dbl>          <dbl>
## 1 I              9     3.32        7.50      2.03          0.816
## 2 II             9     3.32        7.50      2.03          0.816
## 3 III            9     3.32        7.5       2.03          0.816
## 4 IV             9     3.32        7.50      2.03          0.817
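
The column names in that table are the calculations themselves, wrapped in backticks. As a variation, each statistic can be given a name inside summarize(), in the same way as mean_ex in the earlier example. This sketch rebuilds the data from base R's anscombe so that it runs on its own; anscombe_rebuilt and the column names are hypothetical choices for the illustration.

```r
library(tidyverse)

# Stand-in for anscombe_tidy, rebuilt from base R's `anscombe` (an assumption
# made so this block is self-contained).
anscombe_rebuilt <- anscombe |>
  mutate(observation = row_number()) |>
  pivot_longer(
    cols = -observation,
    names_to = c(".value", "set"),
    names_pattern = "([xy])([1-4])"
  ) |>
  mutate(set = as.character(as.roman(as.integer(set)))) |>
  rename(ex = x, why = y)

# Same calculations as the solution, with a name for each summary column
quartet_summary <- anscombe_rebuilt |>
  group_by(set) |>
  summarize(
    mean_ex = mean(ex),
    sd_ex = sd(ex),
    mean_why = mean(why),
    sd_why = sd(why),
    cor_ex_why = cor(ex, why)
  )

quartet_summary
```

Named columns are easier to refer to in later steps, for example in a filter() or a plot.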

11.3 Visualizing the quartet

We will visualize the quartet using the R visualization package {ggplot2}.

The template of a ggplot() function call looks like this:

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

We use:

the data = argument to name the dataframe we are using.
the <GEOM_FUNCTION> to define how we want the data represented (as points, or lines, or bars, or…).
the mapping = argument to define which variables we want plotted.
the aes() function (for “aesthetic”) to specify how those variables are plotted (for example, on the x axis, on the y axis, or as colour).

11.3.1 Scatter plot

With this example, the dataframe we’re using is anscombe_tidy and we are plotting ex and why as points in a scatter plot.

# example
ggplot(anscombe_tidy) + 
  geom_point(mapping = aes(x = ex, y = why))

Now we will add another variable to those in our aes list. We will add the variable “set”, and use colour to differentiate each value within “set”. You will notice that the colour = specification goes inside the aes()—we are using colour to represent the variable “set”.

# example
ggplot(anscombe_tidy) + 
  geom_point(aes(x = ex, y = why, colour = set))

For the example below, we plot only set “I”. We start with our anscombe_tidy dataframe, and then using the pipe symbol, first filter() and then pass the results of the filter to our ggplot() function. Note that because the filtered dataframe is being passed, there is no specified data = in the ggplot() function: the data to be plotted is what is passed after the previous step in the pipe.

You will notice that the argument names mapping =, x =, and y = are not specified. As we saw in earlier examples, if the arguments are in the order that the function expects, they are interpreted correctly.

In this example, the code increases the size of the points and colours them purple. These arguments are outside the aes() argument, so they apply to all of the points.

# example
anscombe_tidy |>
  filter(set == "I") |>
  ggplot() + 
  geom_point(aes(ex, why), size = 3, colour = "purple")
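
To check that positional and named arguments describe the same plot, the two forms can be built side by side and compared. This is a sketch under assumptions: set_one is a hypothetical stand-in for the filtered data, built directly from base R's anscombe, and layer_data() from {ggplot2} is used to extract the data that each plot would draw.

```r
library(tidyverse)

# Hypothetical stand-in for the filtered data: set "I" taken from base R's `anscombe`
set_one <- tibble(set = "I", ex = anscombe$x1, why = anscombe$y1)

# Positional arguments, with the data passed in through the pipe
p_short <- set_one |>
  ggplot() +
  geom_point(aes(ex, why), size = 3, colour = "purple")

# The same plot with every argument named
p_long <- ggplot(data = set_one) +
  geom_point(mapping = aes(x = ex, y = why), size = 3, colour = "purple")

# Compare the data behind the point layer of each plot
all.equal(layer_data(p_short), layer_data(p_long))
```

The shorter form saves typing; the named form can be easier to read when a function takes many arguments.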

11.3.2 Facet plot

Another way to differentiate the sets is to use a facet plot. In this example, we use the function facet_wrap.

ggplot(anscombe_tidy) + 
  geom_point(aes(ex, why)) + 
  facet_wrap(~set)

Now we will add a trend line using the geom_smooth function.

the method = lm argument indicates a “linear model”, i.e. a standard regression line. (We will come back to the statistics that underlie this function in Modeling.) {ggplot2} provides access to other smoothing algorithms.
the se = FALSE argument turns off the shaded band around the line, which is based on the “standard error” (a measure of uncertainty in the estimate).

ggplot(anscombe_tidy) + 
  geom_point(aes(ex, why)) + 
  geom_smooth(aes(ex, why), method = lm, se = FALSE) +
  facet_wrap(~set)

But that duplicates the aes(ex, why) text…so we can move that into the ggplot() function. That way, the aesthetics apply to each of the geom_ calls.

p <- ggplot(anscombe_tidy, aes(ex, why)) + 
  geom_point() + 
  geom_smooth(method = lm, se = FALSE) + 
  facet_wrap(~set)
p
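
The four trend lines look alike even though the points do not. As a rough check, the slope and intercept of each least-squares line can be calculated by hand. This sketch uses the standard formulas for simple linear regression (slope = covariance of ex and why divided by the variance of ex), which are not introduced in this chapter; anscombe_rebuilt and line_fits are hypothetical names, and the data is rebuilt from base R's anscombe so the block runs on its own.

```r
library(tidyverse)

# Stand-in for anscombe_tidy, rebuilt from base R's `anscombe` (an assumption
# made so this block is self-contained).
anscombe_rebuilt <- anscombe |>
  mutate(observation = row_number()) |>
  pivot_longer(
    cols = -observation,
    names_to = c(".value", "set"),
    names_pattern = "([xy])([1-4])"
  ) |>
  mutate(set = as.character(as.roman(as.integer(set)))) |>
  rename(ex = x, why = y)

# Least-squares line for each set:
#   slope     = cov(ex, why) / var(ex)
#   intercept = mean(why) - slope * mean(ex)
line_fits <- anscombe_rebuilt |>
  group_by(set) |>
  summarize(
    slope = cov(ex, why) / var(ex),
    intercept = mean(why) - slope * mean(ex)
  )

line_fits
```

Each set gives a slope close to 0.5 and an intercept close to 3, which is why the lines drawn by geom_smooth() are almost the same in every facet.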

-30-
