Applied Biostats 2024: Chapter 3: Data in `R`

Science and making scientific conclusions relies heavily on the analysis of data. This is in act what makes science science. We observe things, and/or do experiments and then come to some tentative conclusion about the world, This informs our theoretical understanding and allows or science to go on! Thus, while theory is important to science, data is king. So we need to be able to keep and analyze data.

Learning goals: By the end of this chapter you should be able to get data into R and explore it. Specifically, students will be well-positioned to analyze data by:
- Understanding the tidy data format.
- Using the mutate function in R to manipulate data.
- Using the summarize function in R to summarize the data
- (and doing so by groups with the group_by function.
- Combining these operations with the pipe %>% operator.

Tidy data

Most often data are kept in a spreadsheet. If we’re luck data are tidy. 🤔 A tidy data structure usually makes statistical analysis and data visualization much easier.

A tibble is the name for the primary structure that holds data in the tidyverse. Tibbles do not ensure that our data are tidy but they do make this easier. In a tidy tibble, each variable is a column (i.e. a vector), and all so while all entries in a column must be of the same class, each column in a tibble can have its own class. In a tidy tibble, rows unite values of various variables of a single entity (i.e. an individual at a time).

If you have spent time in base R you are likely familiar with matrices, arrays, and data frames - don’t even worry about these. A tibble is much like a data frame, but has numerous features that make them easier to deal with. If you care, see chapter 10 of R for Data Science (Grolemund and Wickham 2018) for more info. Anyways, in this course we will focus on vectors and tibbles, ignoring arrays and matrices, and avoiding lists for as long as possible.

Wrangling data in R

Lets get started with a fun data set that is commonly used for trying things out in R – some cute penguins. The penguins data are available in the palmerpenguins package.

install.packages("palmerpenguins")
library(palmerpenguins)

Of course, the first thing we do when we have data is to look at it. so we use the gwalkr() function in the GWalkR package to bring up a simple GUI to explore our data.

# Let's use the gwalkr function in the GWalkR package to look at these data (ignore the code below)
visConfig <- '[{"config":{"defaultAggregated":false,"geoms":["auto"],"coordSystem":"generic","limit":-1},"encodings":{"dimensions":[{"fid":"c3BlY2llcw==","name":"species","basename":"species","semanticType":"nominal","analyticType":"dimension"},{"fid":"aXNsYW5k","name":"island","basename":"island","semanticType":"nominal","analyticType":"dimension"},{"fid":"c2V4","name":"sex","basename":"sex","semanticType":"nominal","analyticType":"dimension"},{"fid":"eWVhcg==","name":"year","basename":"year","semanticType":"nominal","analyticType":"dimension"},{"fid":"gw_mea_key_fid","name":"Measure names","analyticType":"dimension","semanticType":"nominal"}],"measures":[{"fid":"YmlsbF9sZW5ndGhfbW0=","name":"bill_length_mm","basename":"bill_length_mm","analyticType":"measure","semanticType":"quantitative","aggName":"sum"},{"fid":"YmlsbF9kZXB0aF9tbQ==","name":"bill_depth_mm","basename":"bill_depth_mm","analyticType":"measure","semanticType":"quantitative","aggName":"sum"},{"fid":"ZmxpcHBlcl9sZW5ndGhfbW0=","name":"flipper_length_mm","basename":"flipper_length_mm","analyticType":"measure","semanticType":"quantitative","aggName":"sum"},{"fid":"Ym9keV9tYXNzX2c=","name":"body_mass_g","basename":"body_mass_g","analyticType":"measure","semanticType":"quantitative","aggName":"sum"},{"fid":"gw_count_fid","name":"Row count","analyticType":"measure","semanticType":"quantitative","aggName":"sum","computed":true,"expression":{"op":"one","params":[],"as":"gw_count_fid"}},{"fid":"gw_mea_val_fid","name":"Measure values","analyticType":"measure","semanticType":"quantitative","aggName":"sum"}],"rows":[{"fid":"ZmxpcHBlcl9sZW5ndGhfbW0=","name":"flipper_length_mm","basename":"flipper_length_mm","analyticType":"measure","semanticType":"quantitative","aggName":"sum"}],"columns":[{"fid":"Ym9keV9tYXNzX2c=","name":"body_mass_g","basename":"body_mass_g","analyticType":"measure","semanticType":"quantitative","aggName":"sum"}],"color":[],"opacity":[],"size":[],"shape":[],"radius":[],"theta":[],"longitude":[],"latitude":[],"geoId":[],"details":[],"filters":[],"text":[]},"layout":{"showActions":true,"showTableSummary":false,"stack":"stack","interactiveScale":false,"zeroScale":true,"size":{"mode":"auto","width":1520,"height":150},"format":{},"geoKey":"name","resolve":{"x":false,"y":false,"color":false,"opacity":false,"shape":false,"size":false}},"visId":"gw_t6SM","name":"Chart 1"}]';gwalkr(penguins, visConfig=visConfig)

As we look at the penguins data set below we can get a sense of the utility of a tibble. We can see, not only the first few values of the data set, but also the class of each variable (chr for character, fct for factor, dbl for double – a continuous class of data). Recall that we can get a sense of a dataset with the glimpse() function in dplyr, and scroll through it with View(). Note to make things easier to look at, I only show species, sex, body_mass_g, and bill_length_mm.

While glimpse() and View() are among the very handy tidyverse functions, the real utility of tidyverse is that it gives us a unified way to deal with data. Usually when we get data we want to handle/clean it, summarize it, visualize the results and develop a statistical model from it. This is where tidyverse really shines!

We first focus on handling data with the dplyr package. Today we’ll talk about using it to mutate, and summarize() our data.

Modify data with mutate

Say we are curious about the ratio of body mass to wing length in these penguins. We can add this column by with the mutate function in the dplyr package. The easiest way to do this is to:

penguins %>%
  mutate(wt_bill_ratio  = body_mass_g / bill_length_mm )

mutate() cartoon:

mutate() adds (or overwrites) columns from data we have in the tibble (Figure 4).

summarize data

Having the entire raw data at our fingertips is useful, but we often want to describe our data. Common descriptions include the mean (R function, mean()), maximum (R function, max()), and minimum (R function, min()) values of a variable (see the next chapter for more info). We can reduce large tibbles to simple summaries with the summarize() function.

As we see in the code below and Figure 5 below and summarize() reduces the number of columns in our tibble. As with, mutate() we set the name of the new summary column to whatever we wish.

penguins %>%
  mutate(wt_bill_ratio = body_mass_g / bill_length_mm )%>%
  summarize( mean_wt_bill_ratio = mean(wt_bill_ratio, na.rm = TRUE), 
             min_wt_bill_ratio = min(wt_bill_ratio, na.rm = TRUE),
             max_wt_bill_ratio = max(wt_bill_ratio, na.rm = TRUE))

Note: The tibble this generates has as many rows as the classes of things we’re summarizing (in this case, eight, representing all combinations of sex (including unknown), and species we have).

summarize() cartoon:

summarize() reduces our big data set down to summaries of variables (Figure 5).

Combine summarize with group_by() to summarize by groups.

We often want to compare values of variables for different types of individuals in our sample. For example, we might want to compare the mean and ranges of wt_bill_ratio for different species and sexes of penguins. We can do this by grouping by group_by() categories before summarizing (be sure to ungroup() when you’re done!)

penguins %>%
  group_by(sex, species) %>%
  mutate(wt_bill_ratio = body_mass_g / bill_length_mm )%>%
  summarize( mean_wt_bill_ratio = mean(wt_bill_ratio, na.rm = TRUE), 
             min_wt_bill_ratio = min(wt_bill_ratio, na.rm = TRUE),
             max_wt_bill_ratio = max(wt_bill_ratio, na.rm = TRUE)) %>%
  ungroup()

Figure 6 shows two different visulizations of the relationship between penguin body mass and bill length by sex and species (generated from Figure 2). This shows that different choices in our exploratory data visualizations can reveal different patterns, so rememebr to mess around a bit with this.

Change variable type with mutate()

The authors of an article at fivethirtyeight.com ( link) looked at how long different boxers have gone undefeated. Figure 7 shows that Floyd Mayweather has gone a long time without a loss.

Let’s check out some of these data in R. I provide a simple summary (link) of the number of consecutive wins each boxer (name) had to start their career.

undefeated <- read_csv("https://github.com/ybrandvain/datasets/raw/master/undefeated.csv")
undefeated

Looking at these data we see a problem. *wins should be a number as it described the number of consecutive wins to start a career, but it is a character <chr> instead.** We can see what went wrong here (although in other cases, we may need to use the View() function). Here Joe Devlin is marked as have 1 5 consecutive wins to start his career. 1 5 is obviously not a number. Is it one, five or fifteen? We have two options to fix this.

Let’s do the first option first, because it’s easiest, and then the second option (because it’s correct)

A simple mutate() to change class

The easiest option is to take our data and simply change wins to a number by using the as.numeric() function inside of a mutate. Here we can overwrite wins, as follows:

undefeated %>%
  mutate(wins = as.numeric(wins))

You can see that this solved the problem – wins is now a <dbl>. But in doing so our entry, 1 5 turned into an NA. This is a bummer.

The actual value is likely either one, five, or fifteen, and it would be best to know the truth and make it right. Consulting the Consulting the original dataset, we see that the actual answer is fifteen. So let’s try to fix this!

Use Large Language Models to help you code

The problem was, this was hard for me. I knew I wanted to combine the case_when() function with mutate, but I kept getting errors. So I turned to ChatGPT for help. Importantly, I could clearly state my problem, so I gave ChatGPT this prompt

You can use mutate along with case_when and some string manipulation functions to clean the "1 5" value and ensure that the "wins" column is numeric. Here’s how you can do it:

undefeated <- undefeated %>%
  mutate(
    wins = case_when(
      wins == "1 5" ~ "15",         # Convert "1 5" to "15"
      TRUE ~ wins                   # Keep other values as they are
    ),
    wins = as.numeric(wins)          # Convert the entire column to numeric
  )

This will ensure that the wins column is correctly transformed into a numeric vector.

glimpse(undefeated)

THIS TOTALLY WORKED!!! So if you have some coding know how, you can use ChatGPT or other LLMs to help you when you run into tough problems.

Quiz

Key functions

Many require the dplyr package. I type dplyr:: before these, as it is a(n optional) way to let R know that a function comes from the dplyr package.

dplyr::mutate(): Add a new column, usually as some function of existing columns.
dplyr::summarize(): Condense a data set into summaries of variables.
dplyr::group_by(): Tell tidyverse that you want to perform the next operation on a groups (noted by the variable we group_by). Undo this with [ungroup()] when you are done.
as.numeric(): Change the class of a vector to numeric.

Chapter 3: Data in R

Authors

Affiliations

Published

DOI

Tidy data

Wrangling data in `R`

Modify data with `mutate`

`mutate()` cartoon:

`summarize` data

`summarize()` cartoon:

Combine `summarize` with `group_by()` to summarize by groups.

Change variable type with `mutate()`

A simple `mutate()` to change class

Use Large Language Models to help you code

Quiz

Key functions

Footnotes

References

Chapter 3: Data in R

Authors

Affiliations

Published

DOI

Tidy data

Wrangling data in R

Modify data with mutate

mutate() cartoon:

summarize data

summarize() cartoon:

Combine summarize with group_by() to summarize by groups.

Change variable type with mutate()

A simple mutate() to change class

Use Large Language Models to help you code

Quiz

Key functions

Footnotes

References

Wrangling data in `R`

Modify data with `mutate`

`mutate()` cartoon:

`summarize` data

`summarize()` cartoon:

Combine `summarize` with `group_by()` to summarize by groups.

Change variable type with `mutate()`

A simple `mutate()` to change class