Chapter 6 Exploratory data analysis (EDA)

6.1 Objectives

At the end of the chapter, readers will be able to

  • perform EDA using graphical methods
  • perform EDA using descriptive statistics
  • acquire basic skills to use ggplot2 and gtsummary packages

6.2 Introduction

Exploratory Data Analysis or EDA is the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations.

In statistics, exploratory data analysis (EDA) is an approach to data analysis to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modelling or hypothesis testing task.

Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data and possibly formulate hypotheses that could lead to new data collection (Source)

6.3 EDA using ggplot2 package

6.3.1 Usage of ggplot2

We always start with ggplot() function, the state the specific dataset. Next, we choose the aesthetic mapping (with aes()) represent the variable or variables to plot) and then we add on layers. These can be converted to R codes as below:

  • geom_X : geom_point(), geom_histogram()
  • scales (like scale_colour_brewer())
  • faceting specifications (like facet_wrap())
  • coordinate systems (like coord_flip()).

6.4 Preparation

6.4.1 Load the libraries

We will use these packages for data exploration:

  1. tidyverse package for plotting and data wrangling
  2. gtsummary package for summary statistics
  3. patchwork to combine plots
  4. here to point to the correct working directory
  5. readxl to read a MS Excel file

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. We may find more information about the package here

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(gtsummary)
## #BlackLivesMatter
library(patchwork)
library(here)
## here() starts at C:/Users/drkim/OneDrive - Universiti Sains Malaysia/multivar_data_analysis
library(readxl)

6.4.2 Read the dataset into R

Let us read the peptic ulcer data in MS Excel format. To do this, we will

  • load the readxl package
  • use the function read_xlsx() to read data into R
pep <- read_xlsx(here('data', 'peptic_ulcer.xlsx'))

Examine the data by looking a the number of observations, type of variables and name of variables.

glimpse(pep)
## Rows: 121
## Columns: 34
## $ age                 <dbl> 42, 66, 67, 19, 77, 39, 62, 71, 69, 97, 52, 21, 57…
## $ gender              <chr> "male", "female", "male", "male", "male", "male", …
## $ epigastric_pain     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "…
## $ vomiting            <chr> "no", "no", "no", "no", "yes", "no", "no", "yes", …
## $ nausea              <chr> "no", "no", "no", "no", "yes", "no", "no", "no", "…
## $ fever               <chr> "no", "no", "no", "no", "no", "yes", "no", "yes", …
## $ diarrhea            <chr> "no", "no", "yes", "no", "no", "no", "no", "yes", …
## $ malena              <chr> "no", "no", "no", "no", "no", "no", "no", "no", "n…
## $ onset_more_24_hrs   <chr> "no", "no", "no", "yes", "yes", "yes", "yes", "no"…
## $ NSAIDS              <chr> "no", "no", "yes", "no", "no", "no", "no", "no", "…
## $ septic_shock        <chr> "no", "no", "no", "no", "no", "no", "no", "no", "n…
## $ previous_OGDS       <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "…
## $ ASA                 <dbl> 1, 1, 1, 1, 2, 1, 2, 2, 1, 1, 2, 1, 2, 1, 1, 2, 2,…
## $ systolic            <dbl> 141, 197, 126, 90, 147, 115, 103, 159, 145, 105, 1…
## $ diastolic           <dbl> 98, 88, 73, 40, 82, 86, 55, 68, 75, 65, 74, 50, 86…
## $ inotropes           <chr> "no", "no", "no", "no", "no", "no", "no", "no", "n…
## $ pulse               <dbl> 109, 126, 64, 112, 89, 96, 100, 57, 86, 100, 109, …
## $ tenderness          <chr> "generalized", "generalized", "generalized", "loca…
## $ guarding            <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes", "n…
## $ hemoglobin          <dbl> 18.0, 12.0, 12.0, 12.0, 11.0, 18.0, 8.1, 13.3, 11.…
## $ twc                 <dbl> 6.0, 6.0, 13.0, 20.0, 21.0, 4.0, 5.0, 12.0, 6.0, 2…
## $ platelet            <dbl> 415, 292, 201, 432, 324, 260, 461, 210, 293, 592, …
## $ creatinine          <dbl> 135, 66, 80, 64, 137, 102, 69, 92, 94, 104, 58, 24…
## $ albumin             <chr> "27", "28", "32", "42", "38", "38", "30", "41", "N…
## $ PULP                <dbl> 2, 3, 3, 2, 7, 1, 2, 5, 3, 4, 2, 3, 4, 3, 5, 5, 1,…
## $ admission_to_op_hrs <dbl> 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6,…
## $ perforation         <dbl> 0.5, 1.0, 0.5, 0.5, 1.0, 1.0, 3.0, 1.5, 0.5, 1.5, …
## $ degree_perforation  <chr> "small", "small", "small", "small", "small", "smal…
## $ side_perforation    <chr> "distal stomach", "distal stomach", "distal stomac…
## $ ICU                 <chr> "no", "no", "no", "no", "yes", "no", "yes", "no", …
## $ SSSI                <chr> "no", "no", "no", "no", "no", "no", "no", "no", "n…
## $ anast_leak          <chr> "no", "no", "no", "no", "no", "no", "no", "no", "n…
## $ sepsis              <chr> "no", "no", "no", "no", "no", "no", "yes", "no", "…
## $ outcome             <chr> "alive", "alive", "alive", "alive", "alive", "aliv…

The are quite several variables. For the sake of simplicity, we will select variables which we think might be important for simple data exploration using the select() function.

pep <- pep %>% 
  select(age, systolic, diastolic, hemoglobin, twc, 
         ASA, PULP, perforation, gender, epigastric_pain, 
         malena, tenderness, degree_perforation, outcome) 

6.5 EDA in tables

We can get the overall descriptive statistics for the peptic ulcer data by using the tbl_summary function.

pep %>% 
  tbl_summary(
    statistic = list(all_continuous() ~ "{mean} ({sd})", 
                     all_categorical() ~ "{n} / {N} ({p}%)"),
    digits = all_continuous() ~ 2) %>% 
  modify_caption("Patient Characteristics (N = {N})") %>%
  as_gt()
TABLE 6.1: Patient Characteristics (N = 121)
Characteristic N = 1211
age 60.43 (18.05)
systolic 128.56 (24.51)
diastolic 72.07 (13.99)
hemoglobin 12.32 (3.33)
twc 13.03 (6.66)
ASA
    1 63 / 121 (52%)
    2 50 / 121 (41%)
    3 8 / 121 (6.6%)
PULP 3.53 (2.28)
perforation 1.22 (0.91)
gender
    female 25 / 121 (21%)
    male 96 / 121 (79%)
epigastric_pain 116 / 121 (96%)
malena 4 / 121 (3.3%)
tenderness
    generalized 84 / 121 (69%)
    localized 37 / 121 (31%)
degree_perforation
    large 26 / 121 (21%)
    moderate 20 / 121 (17%)
    small 75 / 121 (62%)
outcome
    alive 83 / 121 (69%)
    dead 38 / 121 (31%)
1 Mean (SD); n / N (%)

To stratify the descriptive statistics based on variable outcome, we use the argument by =.

tab_outcome <- pep %>% 
  tbl_summary(
    by = outcome, 
    statistic = list(all_continuous() ~ "{mean} ({sd})", 
                     all_categorical() ~ "{n} / {N} ({p}%)"),
    digits = all_continuous() ~ 2) %>%
  modify_caption("Patient Characteristics and Fatality (N = {N})")

tab_outcome %>%
  as_gt()
TABLE 6.2: Patient Characteristics and Fatality (N = 121)
Characteristic alive, N = 831 dead, N = 381
age 58.20 (18.53) 65.29 (16.14)
systolic 130.95 (25.06) 123.34 (22.71)
diastolic 73.25 (14.02) 69.50 (13.75)
hemoglobin 12.74 (3.14) 11.41 (3.57)
twc 12.75 (6.00) 13.64 (7.96)
ASA
    1 49 / 83 (59%) 14 / 38 (37%)
    2 28 / 83 (34%) 22 / 38 (58%)
    3 6 / 83 (7.2%) 2 / 38 (5.3%)
PULP 3.16 (2.35) 4.34 (1.89)
perforation 0.95 (0.60) 1.82 (1.16)
gender
    female 13 / 83 (16%) 12 / 38 (32%)
    male 70 / 83 (84%) 26 / 38 (68%)
epigastric_pain 79 / 83 (95%) 37 / 38 (97%)
malena 2 / 83 (2.4%) 2 / 38 (5.3%)
tenderness
    generalized 60 / 83 (72%) 24 / 38 (63%)
    localized 23 / 83 (28%) 14 / 38 (37%)
degree_perforation
    large 9 / 83 (11%) 17 / 38 (45%)
    moderate 13 / 83 (16%) 7 / 38 (18%)
    small 61 / 83 (73%) 14 / 38 (37%)
1 Mean (SD); n / N (%)

Next, we stratify the descriptive statistics based on gender also using the by = argument.

tab_gender <- pep %>% 
  tbl_summary(
    by = gender, 
    statistic = list(all_continuous() ~ "{mean} ({sd})", 
                     all_categorical() ~ "{n} / {N} ({p}%)"),
    digits = all_continuous() ~ 2) %>%
  modify_caption("**Patient Characteristics and Gender** (N = {N})")

tab_gender %>%
  as_gt()
TABLE 6.3: Patient Characteristics and Gender (N = 121)
Characteristic female, N = 251 male, N = 961
age 70.12 (11.07) 57.91 (18.70)
systolic 125.84 (26.17) 129.27 (24.16)
diastolic 66.68 (13.57) 73.48 (13.82)
hemoglobin 10.43 (3.09) 12.81 (3.22)
twc 11.06 (7.44) 13.54 (6.38)
ASA
    1 13 / 25 (52%) 50 / 96 (52%)
    2 11 / 25 (44%) 39 / 96 (41%)
    3 1 / 25 (4.0%) 7 / 96 (7.3%)
PULP 4.32 (1.86) 3.32 (2.34)
perforation 1.49 (1.14) 1.16 (0.83)
epigastric_pain 24 / 25 (96%) 92 / 96 (96%)
malena 1 / 25 (4.0%) 3 / 96 (3.1%)
tenderness
    generalized 17 / 25 (68%) 67 / 96 (70%)
    localized 8 / 25 (32%) 29 / 96 (30%)
degree_perforation
    large 7 / 25 (28%) 19 / 96 (20%)
    moderate 4 / 25 (16%) 16 / 96 (17%)
    small 14 / 25 (56%) 61 / 96 (64%)
outcome
    alive 13 / 25 (52%) 70 / 96 (73%)
    dead 12 / 25 (48%) 26 / 96 (27%)
1 Mean (SD); n / N (%)

We can combine the two tables using tbl_merge().

tbl_merge(
  tbls = list(tab_gender, tab_outcome),
  tab_spanner = c("**Gender**", "**Outcome**")) %>%
  as_gt()
TABLE 6.4: Patient Characteristics and Gender (N = 121)
Characteristic Gender Outcome
female, N = 251 male, N = 961 alive, N = 831 dead, N = 381
age 70.12 (11.07) 57.91 (18.70) 58.20 (18.53) 65.29 (16.14)
systolic 125.84 (26.17) 129.27 (24.16) 130.95 (25.06) 123.34 (22.71)
diastolic 66.68 (13.57) 73.48 (13.82) 73.25 (14.02) 69.50 (13.75)
hemoglobin 10.43 (3.09) 12.81 (3.22) 12.74 (3.14) 11.41 (3.57)
twc 11.06 (7.44) 13.54 (6.38) 12.75 (6.00) 13.64 (7.96)
ASA
    1 13 / 25 (52%) 50 / 96 (52%) 49 / 83 (59%) 14 / 38 (37%)
    2 11 / 25 (44%) 39 / 96 (41%) 28 / 83 (34%) 22 / 38 (58%)
    3 1 / 25 (4.0%) 7 / 96 (7.3%) 6 / 83 (7.2%) 2 / 38 (5.3%)
PULP 4.32 (1.86) 3.32 (2.34) 3.16 (2.35) 4.34 (1.89)
perforation 1.49 (1.14) 1.16 (0.83) 0.95 (0.60) 1.82 (1.16)
epigastric_pain 24 / 25 (96%) 92 / 96 (96%) 79 / 83 (95%) 37 / 38 (97%)
malena 1 / 25 (4.0%) 3 / 96 (3.1%) 2 / 83 (2.4%) 2 / 38 (5.3%)
tenderness
    generalized 17 / 25 (68%) 67 / 96 (70%) 60 / 83 (72%) 24 / 38 (63%)
    localized 8 / 25 (32%) 29 / 96 (30%) 23 / 83 (28%) 14 / 38 (37%)
degree_perforation
    large 7 / 25 (28%) 19 / 96 (20%) 9 / 83 (11%) 17 / 38 (45%)
    moderate 4 / 25 (16%) 16 / 96 (17%) 13 / 83 (16%) 7 / 38 (18%)
    small 14 / 25 (56%) 61 / 96 (64%) 61 / 83 (73%) 14 / 38 (37%)
outcome
    alive 13 / 25 (52%) 70 / 96 (73%)
    dead 12 / 25 (48%) 26 / 96 (27%)
gender
    female 13 / 83 (16%) 12 / 38 (32%)
    male 70 / 83 (84%) 26 / 38 (68%)
1 Mean (SD); n / N (%)

6.6 EDA with plots

6.6.1 One variable: Distribution of a categorical variable

6.6.1.1 Bar chart

The frequency of the outcome

pep %>% 
  group_by(outcome) %>% 
  summarise(freq = n())
## # A tibble: 2 × 2
##   outcome  freq
##   <chr>   <int>
## 1 alive      83
## 2 dead       38

To plot the distribution of a categorical variable, we can use a Bar chart.

ggplot(data = pep) + 
  geom_bar(mapping = aes(x = outcome)) +
  theme_bw()

But we can also pass the aes() to ggplot

ggplot(data = pep, mapping = aes(x = outcome)) + 
  geom_bar() +
  theme_bw()

Readers can combine dplyr for data wrangling and then ggplot2 (both are packages inside the tidyverse metapackage) to plot the data. For example, dplyr part for data wrangling:

pep_age <- pep %>% group_by(outcome) %>% 
  summarize(mean_age = mean(age)) 
pep_age
## # A tibble: 2 × 2
##   outcome mean_age
##   <chr>      <dbl>
## 1 alive       58.2
## 2 dead        65.3

And the ggplot2 part to make the plot:

ggplot(pep_age, mapping = aes(x = outcome, y = mean_age)) + 
  geom_col()

We can combine both tasks dplyr and ggplot together that will save time:

pep %>% 
  group_by(outcome) %>% 
  summarize(mean_age = mean(age)) %>% 
  ggplot(mapping = aes(x = outcome, y = mean_age, fill = outcome)) + 
  geom_col() +
  ylab("Mean age (Years)") +
  xlab("Outcome of ulcer") +
  theme_bw()

As we mentioned before, an excellent resource for graphics using ggplot2 is availble on this website. It is a must-go-to website. Of if you prefer the read the physical version, purchase the R Graphics Cookbook (Chang 2013).

6.6.2 One variable: Distribution of a numerical variable

6.6.2.1 Histogram

To plot the distribution of a numerical variable, we can plot a histogram. To specify the number of bin, we can use binwidth and add some customization.

ggplot(data = pep, mapping = aes(x = systolic)) + 
  geom_histogram(binwidth = 10, fill = "blue") +
  ylab("Frequency") +
  xlab("Systolic Blood Pressure") +
  ggtitle("Systolic BP distribution") +
  theme_bw()

6.6.2.2 Density curve

Let us create a density curve. A density curve also helps us examine the distribution of observations.

ggplot(data = pep, mapping = aes(x = diastolic)) + 
  geom_density() +
  xlab("Diastolic BP (mmHg)") +
  ylab("Density") +
  labs(title = "Density distribution for diastolic BP",
       caption = "Source : Peptic ulcer disease data") +
  theme_bw()

6.6.2.3 Histogram and density curve together

If we want to plot both the histogram and the density curve, we can use geom_histogram() and geom_density() in the single line of codes, we can combine two geoms.

ggplot(pep, mapping = aes(x = diastolic)) + 
  geom_histogram(mapping = aes(y = ..density..), binwidth = 10) + 
  geom_density(colour = 'red') +
  xlab("Diastolic BP (mmHg)") +
  ylab("Density") +
  labs(title = "Density distribution for diastolic BP",
       caption = "Source : Peptic ulcer disease data") +
  theme_bw()
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.

6.6.3 Two variables: Plotting a numerical and a categorical variable

6.6.3.1 Overlaying histograms and boxplot

By overlaying histograms, examine the distribution of a numerical variable (var age) based on variable outcome. First, we create an object called hist_age. Next, we create a boxplot object and name it as box-age. After that, we combine the two objects side-by-side using a vertical bar.

hist_age <- ggplot(data = pep, aes(x = age, fill = outcome)) +
    geom_histogram(binwidth = 5, aes(y = ..density..),  
                   position = "identity", alpha = 0.75) + 
  geom_density(alpha = 0.25) +
  xlab("Age") +
  ylab("Density") +
  labs(title = "Density distribution",
       caption = "Source : Peptic ulcer disease data") +
  theme_bw()
box_age <- ggplot(data = pep, aes(x = outcome, y = age)) +
  geom_boxplot(outlier.shape = NA) + 
  geom_dotplot(binaxis = "y", binwidth = 1, fill = NA, 
               colour = "blue", alpha = 0.75) +
  xlab('Outcome') + ylab('Age') +
  labs(title = "Box-plot",
       caption = "Source : Peptic ulcer disease data") +
  theme_bw()
hist_age |  box_age

hist_age /  box_age

Read more on patchwork package to learn to arrange multiple plot.

6.6.4 Three variables: Plotting a numerical and two categorical variables

It is hard to visualize three variables in a single histogram plot. But we can use facet_.() function to further split the plots.

6.6.5 Faceting the plots

We can see better plots if we split the histogram based on particular grouping. In this example, we stratify the distribution of variable age (a numerical variable) based on outcome and gender (both are categorical variables)

ggplot(data = pep, aes(x = age, fill = gender)) +
    geom_histogram(binwidth = 5, aes(y = ..density..),  
                   position = "identity", alpha = 0.25) + 
  geom_density(aes(colour = gender), alpha = 0.55) +
  xlab("Age") +
  ylab("Density") +
  labs(title = "Density distribution of age for outcome and gender",
       caption = "Source : Peptic ulcer disease data") +
  theme_bw() +
  facet_wrap( ~ outcome)

We can provide the summary statistics to complement the plots or vice versa.

pep %>% 
  group_by(outcome, gender) %>%
  summarize(mean_age = mean(age, na.rm = TRUE), 
            sd_age = sd(age, na.rm = TRUE))
## `summarise()` has grouped output by 'outcome'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 4
## # Groups:   outcome [2]
##   outcome gender mean_age sd_age
##   <chr>   <chr>     <dbl>  <dbl>
## 1 alive   female     70.2   11.4
## 2 alive   male       56.0   18.8
## 3 dead    female     70.1   11.2
## 4 dead    male       63.1   17.7

6.6.6 Line plot

Line graphs are typically used for visualizing how one continuous variable on the y-axis changes concerning another continuous variable on the x-axis. This is very useful for longitudinal data. Line graphs can also be used with a discrete variable on the x-axis. This is appropriate when the variable is ordered (e.g., “small,” “medium,” “large”).

Let us Load gapminder package and view the variables and the observations of the gapminder dataset.

library(gapminder)
glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

We can plot the relationship between variable life expectancy lifeExp only for countried in the Asia continent. To do so, we

  • specify the data. In this case the gapminder dataset
  • choose only Asia continent using the filter()
  • choose variable life expectancy for the y axis and year for the x axis. Then we will give different colours to represent different countries.
  • will use geom_line() to generate the line plot
gapminder %>% 
  filter(continent == "Asia") %>% 
  ggplot(mapping = aes(x = year, y = lifeExp, colour = country)) +
  geom_line(show.legend = FALSE) +
  xlab("Year") + ylab("Life Expectancy")

And the summary statistics for the plot.

gapminder %>%
  filter(continent == 'Asia') %>%
  filter(year == 1962 | year == 1982 | year == 2002) %>%
  group_by(year) %>%
  summarise(mean_life = mean(lifeExp, na.rm = TRUE), 
            sd_life = sd(lifeExp, na.rm = TRUE))
## # A tibble: 3 × 3
##    year mean_life sd_life
##   <int>     <dbl>   <dbl>
## 1  1962      51.6    9.82
## 2  1982      62.6    8.54
## 3  2002      69.2    8.37

We may use gtsummary package to generate publication ready table. We need to filter for Asia continent only to represent the line plot we generated before.

gapminder %>%
  select(continent, year, lifeExp, pop, gdpPercap) %>%
  filter(continent == "Asia") %>%
  filter(year == 1962 | year == 1982 | year == 2002) %>%
  mutate(pop = pop/1000000, gdpPercap = gdpPercap/1000) %>%
  tbl_strata(
    strata = continent,
    .tbl_fun = 
      ~.x %>% 
      tbl_summary( 
        by = year, 
        label = list(lifeExp ~ "Life Exp", pop ~ "Pop size", gdpPercap ~ "Percap GDP"),
        statistic = list(all_continuous() ~ "{mean} ({sd})",  
                         all_categorical() ~ "{n} / {N} ({p}%)"),
        missing = "no", 
        digits = all_continuous() ~ 1 
        )) %>% 
   modify_footnote(all_stat_cols() ~ "Mean(SD), Pop size in million and GDP in thousand") %>%
  as_gt()
Characteristic Asia
1962, N = 331 1982, N = 331 2002, N = 331
Life Exp 51.6 (9.8) 62.6 (8.5) 69.2 (8.4)
Pop size 51.4 (136.1) 79.1 (206.5) 109.1 (276.7)
Percap GDP 5.7 (16.4) 7.4 (8.7) 10.2 (11.2)
1 Mean(SD), Pop size in million and GDP in thousand

6.6.7 Plotting means and error bars

We want to error bar for life expectancy for Asia continent only. Error bar that contains mean and standard deviation

Our approach is use filter to choose Asia continent only filter(). Then generate the mean and SD for life expectancy using mutate(). Next is to plot the scatterplot (country vs mean life expectancy) geom_point() and then plot errorbar geom_errorbar()

gap_continent <- gapminder %>% 
  filter(continent == "Asia") %>%
  group_by(country) %>% 
  mutate(mean = mean(lifeExp), sd = sd(lifeExp)) %>%
  arrange(desc(mean))

Plot them with coord_flip()

gap_continent %>%   
  ggplot(mapping = aes(x = country, y = mean)) + 
  geom_point(mapping = aes(x = country, y = mean)) +
  geom_errorbar(mapping = aes(ymin = mean - sd, ymax = mean + sd),
                position = position_dodge()) +
  coord_flip()

6.6.8 Scatterplot with fit line

We are using the the peptic ulcer data pep where we will create a fit line between age and size of perforation. We then plot the scatterplot and and a fit line on it. We name the plot object generated as pep_fit.

pep_fit <- pep %>% 
  ggplot(mapping = aes(x = age, y = perforation, colour = outcome)) +
  geom_point(show.legend = FALSE) +
  geom_smooth(method = lm, se = FALSE) +
  ylab('Size of perforation') +
  xlab('Age of patient') +
  labs(title = 'Distribution and fit line',
       caption = 'Source: Peptic ulcer data') +
  theme_bw() 
pep_fit 
## `geom_smooth()` using formula = 'y ~ x'

Let us see if the patterns are similar for men and women. We will use facet_wrap() to split the plots based on variable gender.

pep_fit + facet_grid(. ~ gender)
## `geom_smooth()` using formula = 'y ~ x'

pep %>%
  select(gender, perforation, age) %>%
  tbl_summary(
    by = gender, 
    statistic = list(all_continuous() ~ "{mean} ({sd})",  
                     all_categorical() ~ "{n} / {N} ({p}%)"),
    missing = "no", 
    digits = all_continuous() ~ 2 ) %>% 
  modify_caption("**Patient Characteristics and Gender** (N = {N})") %>%
  as_gt()
TABLE 6.5: Patient Characteristics and Gender (N = 121)
Characteristic female, N = 251 male, N = 961
perforation 1.49 (1.14) 1.16 (0.83)
age 70.12 (11.07) 57.91 (18.70)
1 Mean (SD)

6.7 Summary

This chapters provides understanding to readers about exploratory data analysis (EDA) and skills to perform EDA and descriptive statistics. We learned about graphical methods in R and how to report descriptive statistics in tables. We show readers how to read a MS Excel data and then use three important packages to perform EDA (dplyr, ggplot2 and gtsummary packages) on the data. The excellent resources available to help readers perform more complicated exploratory data analysis include ggplot2: Elegant Graphics for Data Analysis webpage or its physical book (Chang 2013), ggplot2 package, gtsummary webpage (Sjoberg et al. 2021) and R for Data Science website or its physical book (Wickham and Grolemund 2017).

References

Chang, W. 2013. R Graphics Cookbook. Oreilly and Associate Series. O’Reilly Media, Incorporated. https://books.google.com.my/books?id=\_iVFgKTRYrQC.
Sjoberg, Daniel D., Karissa Whiting, Michael Curry, Jessica A. Lavery, and Joseph Larmarange. 2021. “Reproducible Summary Tables with the Gtsummary Package.” The R Journal 13: 570–80. https://doi.org/10.32614/RJ-2021-053.
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. Paperback; O’Reilly Media. http://r4ds.had.co.nz/.