4 Introduction to R

Learning Objectives

  1. Be able to download and open R and RStudio.

  2. Understand how to import data to R.

  3. Understand how to create a graph in R.

4.1 R and RStudio

So far we’ve discussed the nature of science and the structure of scientific papers. Now we’re going to introduce a central method that is required of most modern science: scientific computing. It is almost impossible to conduct science without a computer, especially since we need to analyze data. We also need to store the data, organize it, plot it, summarize, and report it. Lots of tools do that. In this book, we use the statistical software R (R Core Team 2020). It is among the most popular programs for analyzing scientific data and it is designed specifically for the workflow we use in this book. It is also free. Here’s how you get it.

  1. Download R: (https://mirror.las.iastate.edu/CRAN/). Follow the link above and choose your operating system: Linux, macOS, or Windows.

  2. Download RStudio: (https://posit.co/download/rstudio-desktop/)

*Alternatively, you can use a cloud version of RStudio: https://posit.cloud/plans/free

All of the examples in this book are generated using R (R Core Team 2020). Actually, that’s not quite correct. While R is the workhorse, the examples in this book are generated through an interface to R called RStudio (RStudio Team 2020).

R looks like this: Download only

RStudio looks like this: Download, open, and use

Once you’ve downloaded both programs, you’ll only need to open RStudio. It automatically uses R in the background. It is possible to do everything only in base R, but we prefer RStudio as a more user-friendly interface.1

How to Use This Chapter

This book is not meant as a stand-alone R reference. It is meant as a companion to university-level labs and lectures, in which students can work through examples with an instructor or TA nearby to fill in the gaps and troubleshoot.

When starting R, these are the types of questions many students have:

“Is this the right program?” “What is a script again?” “How do you make the arrow?” “What is that squiggly sign?” “I ran the code and nothing happened…”

In other words, we expect students to have lots of questions in this new and unfamiliar environment. Everyone started this way and the easiest way to find the answers is to ask an expert.

However, there are lots of excellent R guides out there for students who are interested in learning more detail. Here are a few of our favorites:

R for Data Science (free) - https://r4ds.had.co.nz/ (Grolemund and Wickham n.d.)

The R Book (Crawley 2012)

Getting Started with R: An Introduction for Biologists (Beckerman, Childs, and Petchey 2017)

Data Visualization: A Practical Introduction (Healy 2018)

dplyr and tidyr cheatsheets - (https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)

ggplot2 cheatsheet - (https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf)

4.2 Data Analysis Workflow

In this book, we focus on learning a few fundamental tasks that are common to the workflow of most data science projects (Wickham et al. 2019). Nearly every study that includes data has a workflow similar to that above. We gather data, get it into a program (Import), get it in the right format (Tidy), and then analyze it with plots (Visualize), models (Model), transformations (Transform), etc. When we’ve finished, we communicate the results to our peers. You’ll learn how to complete these steps in R because it’s designed specifically for this type of workflow. But the workflow applies regardless of the software you use.

4.3 Getting Started in RStudio

Before the fun stuff happens, we need to determine where things will be saved on our computer. If you’d prefer to skip this step, that is OK. Just be prepared for certain doom.

Create a folder on your computer for your analyses

For example, if this is for a class called Biology 280, you might create a folder called BIO280_R. If you have data to analyze (like an Excel file or a .csv), save it in this folder as well.

Open RStudio

Click the RStudio icon

Create a project

File -> New Project… -> Existing Directory -> Browse -> [NAME OF YOUR FOLDER]

You only need to do this once. After you create a project, all of the work you do within that project (data analysis, graphs, text) will be saved in it. If it all goes well, you should see a screen like this, with the name of your project in the upper right hand corner.

Open a script

File -> New File -> R Script or ctrl+shift+N

You should now see a screen like the one below, with four windows.

The window on the lower right shows all of the Files in the folder you created on your computer. If you add something to that folder from outside of R, it will show up here as well.

The window on the upper right shows your Environment. When you create something in R, like a new data frame or a plot, it will show up in the Environment (But it won’t be saved. More on that later).

The window on the upper left shows your Script. This is where you tell R what to do.

The window on the bottom left is the Console. It keeps a running list of all of the procedures you perform. For example, if you run code in the script, it will show up here. When something goes wrong, you’ll also see the error message here.

4.3.1 Install a package

Click anywhere in the script window so that you see the flashing cursor. Type the code below and then press ctrl+enter (cmd+return on a Mac). (NOTE: If you just hit enter without adding ctrl, it won’t work. It will just move you to the next line. Get in the habit of pressing ctrl+enter to run your code.)

#type this and then hit ctrl+enter
install.packages("tidyverse") # ctrl+enter

Like this:

The code above tells R to install a package called tidyverse. You only have to do this once. After it’s installed, it will always be available when you open R, but you’ll have to tell R when you want to use it each time by typing library(tidyverse).

Packages are bits of code that someone wrote and then converted into a series of shortcuts. R has thousands of packages for just about any task you can think of. The tidyverse package actually contains a bunch of other packages within it. As a result, when you install it for the first time, it will generate a lot of activity in your console, with red text and “https://..” links all over the place. That is all normal. Just give it a few minutes. You’ll know it’s done when you see the chevron > in the console.
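
As a minimal sketch of the install-once, load-every-time pattern described above, the snippet below checks whether a package is already installed before installing it. The helper name needs_install is our own invention for illustration; requireNamespace() and install.packages() are standard R functions.

```r
# Hypothetical helper (not part of R): TRUE if a package still needs installing
needs_install <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

# The stats package ships with R, so it never needs installing
needs_install("stats")

# For tidyverse, you could install only when it is missing.
# This line is commented out so that running the block never starts a download:
# if (needs_install("tidyverse")) install.packages("tidyverse")
```

Either way, you still need to run library(tidyverse) at the start of each session.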

4.3.2 Your first script

Now you are ready for the fun parts. To begin coding your first script, we are going to take an unorthodox approach. Instead of starting with first principles, we’ll start with the Visualize and Model steps from the workflow and then deconstruct that to learn the principles.

Copy the code below and paste it in your script. Then run the code (by pressing ctrl+enter from the first line down). Do not try to interpret it yet. There is a lot going on here. We’ll break it down next.

#Copy this code, paste it in your script, and run it.
#load a package
library(tidyverse) 

# make the data and "model" the mean and sd.
d <- mtcars %>% 
  group_by(cyl) %>% 
  mutate(mean = mean(mpg), sd = sd(mpg)) 

#plot the data
ggplot(data = d, aes(x = cyl,y = mpg)) +
  geom_point(shape = 21) +
  geom_pointrange(aes(y = mean, ymin = mean-sd, ymax = mean+sd)) +
  labs(y = "Miles Per Gallon",
       x = "Cylinders") 

Here is what you just did:

  1. Loaded the tidyverse package.

  2. Created a data frame d that was a modified version of the data frame mtcars.

  3. Added two new columns to d: one containing the mean mpg for each type of cylinder and another containing the standard deviation.

  4. Plotted miles per gallon as a function of cylinders as raw data.

  5. Added a mean and standard deviation to the plot.

  6. Modified the axis names.

If you’re a normal person, this should all be mysterious. Here’s the good news. The code above is about as complex as we will get in this book. It is also modular. That is crucial. It means that you don’t have to know every step to get started. Each batch of code that precedes %>% or + will run by itself.

Let’s break the code down into individual components:

1) Loaded the tidyverse package.

#Copy this code, paste it in your script, and run it.
#load a package
library(tidyverse) 

What it does: This code uses the function library() to load a package called tidyverse. The rest of the code depends on loading this package first.

Did it work? Check the output in the console (lower left window). Red text is normal. It does not necessarily mean there is an error. If you see the prompt >, that is a good sign. If you see words like "there is no package called…", "Error…", "failed…", then it probably didn’t work.

Things to check if it doesn’t work

  • Did you install the package first?
  • Did you misspell anything?
  • Did you add a capital letter somewhere?
  • Did you hit enter instead of ctrl-enter?

2) Created a data frame d that was a modified version of the data frame mtcars.

#Copy this code, paste it in your script, and run it.
#load a package
library(tidyverse)

d <- mtcars

What it does: This code creates a data frame called d that contains all of the data that are in mtcars. A data frame is just a table with rows and columns. Try running View(mtcars) and you’ll see what the data frame looks like. mtcars is one of many data frames that are built into R. It contains data on things like miles per gallon, weight, and horsepower for different types of cars.

The symbol <- is how we assign values to objects in R. The name on the left of the arrow receives whatever the code on the right produces. The arrow is a combination of the less-than sign < and the minus sign -. You will use this symbol all the time.
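
Here is a small sketch of assignment on its own, using a made-up object called x. It also shows why a stray space inside the arrow matters: R then reads the two characters as a comparison rather than an assignment.

```r
x <- 5     # assign the value 5 to an object named x
x          # typing the name prints the value

# With a space, R reads "<" and "-" separately: "is x less than -5?"
x < - 5    # this is a comparison; it leaves x unchanged
```

The last line returns FALSE instead of creating or changing anything, and it does so without an error message, which is why this mistake is easy to miss.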

Did it work? Do you see an object named d in the Environment window (upper right) that has “32 obs. of 11 variables”? If not, it didn’t work.

Visually, the d in the Environment window is the only thing that will automatically show up if it worked. Another way to check is simply to view the data frame, like this:

head(d)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

This shows a snapshot of the first few rows of the data frame that exists in the object d. It is essentially the same as any other spreadsheet you might make in another program, like Excel. Each row contains information on mpg, cylinders (cyl), horsepower (hp), etc. for each type of car.

You can isolate individual columns:

d %>% select(mpg, cyl) %>% head()
##                    mpg cyl
## Mazda RX4         21.0   6
## Mazda RX4 Wag     21.0   6
## Datsun 710        22.8   4
## Hornet 4 Drive    21.4   6
## Hornet Sportabout 18.7   8
## Valiant           18.1   6

Or check the types of columns:

str(d)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

This shows that the object d is a data frame with 32 observations (rows) of 11 variables (columns). It also lists the variables, gives the type of each one (num means numeric), and previews the first few values in each column. [Note: instead of a data frame, you might see the word tibble. That is another name for a data frame used by the tidyverse package. It has some important distinctions, but they are not relevant for this chapter].
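
If you only want the size of a data frame or its column names, a few base R functions give that directly. This is a small sketch using the same built-in mtcars data.

```r
d <- mtcars

dim(d)     # number of rows, then number of columns
nrow(d)    # rows only
ncol(d)    # columns only
names(d)   # the column names
```

The numbers from dim(d) should match the “32 obs. of 11 variables” shown in the Environment window.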

Things to check if it doesn’t work

  • Did you misspell anything?
  • Did you add a capital letter somewhere?
  • Did you hit enter instead of ctrl-enter?
  • Is there a space in the arrow < -? There shouldn’t be.

Challenges

  • Give the data frame a name other than d.
  • Select other columns using the select() function.

3) Added two new columns to d: one containing the mean mpg for each type of cylinder and another containing the standard deviation.

#Copy this code, paste it in your script, and run it.
#load a package
library(tidyverse) 

# make the data and "model" the mean and sd
d <- mtcars %>% 
  group_by(cyl) %>% 
  mutate(mean = mean(mpg), sd = sd(mpg)) 

What it does: This code adds two columns, one with the mean and one with the standard deviation (sd) of mpg for cars with different numbers of cylinders. We also have a new symbol %>% called a “pipe”. You can either type it directly or use the shortcut ctrl+shift+M. Think of it as a way of telling R “and then…”.

In sentence form, the code is saying this.

d <- mtcars %>%

create a data frame called d that contains all of the data that are in mtcars and then

group_by(cyl) %>%

assign each type of cylinder to a group and then

mutate(mean = mean(mpg), sd = sd(mpg))

create a new column called mean that contains the mean mpg for each type of cylinder. Also create a new column called sd that contains the standard deviation of mpg for each type of cylinder
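
To see what the pipe is doing, it can help to compare it with the same steps written without pipes. The sketch below assumes the dplyr package is available (it is installed as part of the tidyverse); the object names d_piped and d_nested are made up for this comparison.

```r
library(dplyr)  # installed as part of the tidyverse

# With pipes: read top to bottom as "and then"
d_piped <- mtcars %>%
  group_by(cyl) %>%
  mutate(mean = mean(mpg), sd = sd(mpg))

# Without pipes: the same steps, read from the inside out
d_nested <- mutate(group_by(mtcars, cyl), mean = mean(mpg), sd = sd(mpg))

# Both versions give the same new columns
identical(d_piped$mean, d_nested$mean)
identical(d_piped$sd, d_nested$sd)
```

The two versions do the same work. The piped version is usually easier to read because the steps appear in the order they happen.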

Did it work? Check the columns again with str(). Do you see the columns mean and sd now? Do you see the odd addendum that starts with “- attr(*, "groups")…”? If so, then it worked.

str(d)
## gropd_df [32 × 13] (S3: grouped_df/tbl_df/tbl/data.frame)
##  $ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num [1:32] 160 160 108 258 360 ...
##  $ hp  : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
##  $ vs  : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
##  $ mean: num [1:32] 19.7 19.7 26.7 19.7 15.1 ...
##  $ sd  : num [1:32] 1.45 1.45 4.51 1.45 2.56 ...
##  - attr(*, "groups")= tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ cyl  : num [1:3] 4 6 8
##   ..$ .rows: list<int> [1:3] 
##   .. ..$ : int [1:11] 3 8 9 18 19 20 21 26 27 28 ...
##   .. ..$ : int [1:7] 1 2 4 6 10 11 30
##   .. ..$ : int [1:14] 5 7 12 13 14 15 16 17 22 23 ...
##   .. ..@ ptype: int(0) 
##   ..- attr(*, ".drop")= logi TRUE

Things to check if it doesn’t work

  • Did you load the library library(tidyverse)? The pipes %>% only work if the tidyverse package is loaded.
  • Did you misspell anything?
  • Did you type out “cylinders” instead of using “cyl”? Computers don’t know those are related.
  • Did you add a capital letter somewhere?
  • Did you hit enter instead of ctrl-enter?
  • Is there a space in the arrow < - or the pipes % >%? There shouldn’t be.
  • Did you leave out the quotes when installing the package? install.packages("tidyverse") needs them; library(tidyverse) works with or without quotes.

Challenges

  • Summarize mpg by the number of gears instead of cylinders
  • Add a column that calculates the median in addition to the mean and sd

4) Plotted miles per gallon as a function of cylinders as raw data.

#Copy this code, paste it in your script, and run it.
#load a package
library(tidyverse) 

# make the data and "model" the mean and sd
d <- mtcars %>% 
  group_by(cyl) %>% 
  mutate(mean = mean(mpg), sd = sd(mpg)) 

#plot the data
ggplot(data = d, aes(x = cyl,y = mpg)) +
  geom_point(shape = 21) 

What it does: This code uses the powerful plotting package called ggplot2 (Wickham 2016). It was included when you installed the tidyverse. The “gg” stands for “The Grammar of Graphics” (Wilkinson 2012), a fundamental set of principles for producing just about any plot you can think of (and rules for why some types of plots are better than others).

Making anything with ggplot2 usually requires at least two things: 1) a call to ggplot(...) where we specify the data along with the x and y axes or other aesthetics, and 2) a call to geom_…, where we tell ggplot2 how to plot the data. There are lots of geoms, as you can see in this cheatsheet (https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf). For most practices in this book, we will use geom_point(), which simply adds a dot for each x-y coordinate that we specified in the aes() function.

Once we have our base plot, everything else is added with +. This can be a point of confusion. The + has a meaning similar to the pipe %>%, but ggplot2 only uses + to add layers. Accidentally typing %>% instead of + is a common mistake even for experienced coders (like authors of textbooks about data analysis).
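
One way to keep the two symbols straight is to see them side by side. In this sketch (assuming dplyr and ggplot2 are installed, as they are with the tidyverse), the pipe passes the data into ggplot(), and from that point on every layer is added with +. The object name p is just an example.

```r
library(dplyr)    # provides the pipe %>%
library(ggplot2)  # provides ggplot() and the geoms

# The pipe hands the data to ggplot(); from there on, layers are added with +
p <- mtcars %>%
  ggplot(aes(x = cyl, y = mpg)) +
  geom_point(shape = 21)

p  # printing the object draws the plot
```

A rough rule of thumb: use %>% while you are still working on the data, and + once you are inside a plot.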

You can see the iterative nature of ggplot by breaking it down further, adding one thing at a time.

#Copy this code, paste it in your script, and run it.
#load a package
library(tidyverse) 

# make the data and "model" the mean and sd
d <- mtcars %>% 
  group_by(cyl) %>% 
  mutate(mean = mean(mpg), sd = sd(mpg)) 

#Create a placeholder for a plot
ggplot()

#Assign the data frame d to the plot
ggplot(data = d) 

#Assign the aesthetics
#Put numbers from the column cyl on the x
#For each value of cyl on the x, add the corresponding value from the mpg column
ggplot(data = d, aes(x = cyl,y = mpg))

#Tell ggplot how to plot the x-y values with a geom.
ggplot(data = d, aes(x = cyl,y = mpg)) +
  geom_point()

#Change the default shape of the dots
#Any number from 0-25 will produce a different shape. (http://www.cookbook-r.com/Graphs/Shapes_and_line_types/)
ggplot(data = d, aes(x = cyl,y = mpg)) +
  geom_point(shape = 21) 

Did it work? The plot above should appear in the plot window (lower right).

Things to check if it doesn’t work

  • Did you leave a hanging plus + at the end of the code? If so, remove it.
  • Did you write ggplot2() instead of ggplot()?
  • Did you remember to assign the x and y axes within the aes() function?
  • Did you put a pipe %>% instead of a plus +?
  • Did you misspell anything?
  • Did you type out “cylinders” instead of using “cyl”? R doesn’t know those are related.
  • Did you add a capital letter somewhere?
  • Did you hit enter instead of ctrl-enter?
  • Is there a space in the arrow < - or the pipes % >%? There shouldn’t be.
  • Did you leave out the quotes when installing the package? install.packages("tidyverse") needs them; library(tidyverse) works with or without quotes.
  • Did you load the library library(tidyverse)? The pipes %>% only work if the tidyverse package is loaded.

Challenges

5) Added a mean and standard deviation to the plot.

#Copy this code, paste it in your script, and run it.
#load a package
library(tidyverse) 

# make the data and "model" the mean and sd
d <- mtcars %>% 
  group_by(cyl) %>% 
  mutate(mean = mean(mpg), sd = sd(mpg)) 

#plot the data
ggplot(data = d, aes(x = cyl,y = mpg)) +
  geom_point(shape = 21) +
  geom_pointrange(aes(y = mean, ymin = mean-sd, ymax = mean+sd))

What it does: The new geom geom_pointrange() adds the mean and standard deviation to the plot. If we only typed geom_pointrange(), it wouldn’t work. That’s because the geom needs three values: y, ymin, and ymax. It would inherit y (mpg) from ggplot(), but ymin and ymax haven’t been assigned yet. In this case, we want y to be the mean of each group. We want the error bar to range from ymin to ymax. ymin is the mean minus the standard deviation for each group (mean-sd). ymax is the mean plus the standard deviation (mean+sd). With those values, ggplot draws a point at y and a line from ymin to ymax.
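
To make the arithmetic concrete, here is a sketch that calculates the three values by hand for one group, the 6-cylinder cars, using only base R. The object names mpg6, m, and s are made up for this example.

```r
# mpg values for the 6-cylinder cars only
mpg6 <- mtcars$mpg[mtcars$cyl == 6]

m <- mean(mpg6)  # about 19.7, matching the mean column
s <- sd(mpg6)    # about 1.45, matching the sd column

# The three values geom_pointrange() needs for this group
c(ymin = m - s, y = m, ymax = m + s)
```

On the plot, the dot for 6 cylinders sits at y and the vertical line runs from ymin to ymax.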

Did it work? The plot above should appear in the plot window (lower right).

Things to check if it doesn’t work

  • Did you leave a hanging plus + at the end of the code? If so, remove it.
  • Did you write ggplot2() instead of ggplot()?
  • Did you remember to assign the x and y axes within the aes() function?
  • Did you put a pipe %>% instead of a plus +?
  • Did you misspell anything?
  • Did you type out “cylinders” instead of using “cyl”? R doesn’t know those are related.
  • Did you add a capital letter somewhere?
  • Did you hit enter instead of ctrl-enter?
  • Is there a space in the arrow < - or the pipes % >%? There shouldn’t be.
  • Did you leave out the quotes when installing the package? install.packages("tidyverse") needs them; library(tidyverse) works with or without quotes.
  • Did you load the library library(tidyverse)? The pipes %>% only work if the tidyverse package is loaded.

6) Modified the axis names.

#Copy this code, paste it in your script, and run it.
#load a package
library(tidyverse) 

# make the data and "model" the mean and sd
d <- mtcars %>% 
  group_by(cyl) %>% 
  mutate(mean = mean(mpg), sd = sd(mpg)) 

#plot the data
ggplot(data = d, aes(x = cyl,y = mpg)) +
  geom_point(shape = 21) +
  geom_pointrange(aes(y = mean, ymin = mean-sd, ymax = mean+sd)) +
  labs(y = "Miles Per Gallon",
       x = "Cylinders") 

What it does: The function labs() replaces the titles of the x and y axes with whatever we put in quotes.

Did it work? The plot above should appear in the plot window (lower right).

Things to check if it doesn’t work

  • Did you remember the comma?
  • Did you switch the x and y?
  • Did you leave a hanging plus + at the end of the code? If so, remove it.
  • Did you write ggplot2() instead of ggplot()?
  • Did you remember to assign the x and y axes within the aes() function?
  • Did you put a pipe %>% instead of a plus +?
  • Did you misspell anything?
  • Did you type out “cylinders” instead of using “cyl”? R doesn’t know those are related.
  • Did you add a capital letter somewhere?
  • Did you hit enter instead of ctrl-enter?
  • Is there a space in the arrow < - or the pipes % >%? There shouldn’t be.
  • Did you leave out the quotes when installing the package? install.packages("tidyverse") needs them; library(tidyverse) works with or without quotes.
  • Did you load the library library(tidyverse)? The pipes %>% only work if the tidyverse package is loaded.

Challenges

  • Rename the x and y axes
  • Add a title within the labs() function using title = "put your title here"

4.3.3 Importing your own data to RStudio

To get your own data into R, first save the data into the same folder as your project.

Look in the lower right panel of RStudio. Click on “Files”. Do you see the data set you saved? Click on it and choose Import Dataset….

Like this:

After you click Import Dataset…, you’ll see a preview of your data set like this:

Stop here and check the preview. Does everything look right? Are the column names correct? If not, you might need to check the box for “First Rows as Names” on the lower left.

From here, you have two options. 1) Click “Import”, or 2) Copy the code in the Preview Code box, cancel the preview, and paste the code into your script. We strongly recommend the second option.

Clicking will work, but if you come back to your script and need to reload the data, you’ll have to do this process again. If you instead copy the code and paste it into your script, then your code becomes self-contained and you won’t forget any steps in the future.

Here is where you can copy from. We don’t copy the View() part, but you can if you want:

Then paste it at the beginning of your script and run it. Our data set is called continents. Yours will probably be different, though.

Do you see the name of your data set in the upper right panel? If so, success! If not, re-try the steps above or ask your instructor. You are now ready to practice the coding you’ve learned on your own data set.
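
For reference, the code you copy from the Preview Code box usually looks something like the read_csv() line below. This is only a sketch: the file name example_data.csv and its contents are made up, and the file is written to a temporary folder so that the example runs on any computer. Your own code will point to your own file instead.

```r
library(readr)  # installed as part of the tidyverse

# Hypothetical stand-in for your own file: write a tiny CSV to a temporary folder
example_path <- file.path(tempdir(), "example_data.csv")
write_csv(data.frame(site = c("a", "b", "c"), count = c(4, 7, 2)), example_path)

# This is the kind of line the Preview Code box gives you
example_data <- read_csv(example_path, show_col_types = FALSE)
example_data
```

Because the import is now a line in your script, rerunning the script reloads the data without any clicking.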

Why to Code Instead of Click

R is a programming language, which means that it can only do what you tell it to do by typing. RStudio has a few clickable shortcuts, but it still requires nearly everything to be typed into a script.

There are other programs that conduct statistical and graphical analyses without using code. We choose to use R instead for several reasons.

Clicking Isn’t Actually Easier

Undergraduates are incredibly savvy with some aspects of computers, particularly in navigating social media platforms. But in our experience, students often struggle with even rudimentary tasks in programs that professors think are easy, such as Microsoft Excel or SPSS. These programs have their own bewildering array of shortcuts and buttons (Nash 2008). For example, while this Excel function might make perfect sense to a seasoned user

=STDEV.P(A$1:A$7)

it can be just as confusing to a new student as the similar function in R sd(data$column).
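
One detail worth knowing about these two functions is that they are similar but not identical. Excel’s STDEV.P divides by n (the population standard deviation), while R’s sd() divides by n - 1 (the sample standard deviation, which corresponds to Excel’s STDEV.S). The sketch below uses made-up numbers to show the difference.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up numbers for illustration

# R's sd() is the sample standard deviation (divides by n - 1),
# which corresponds to Excel's STDEV.S
sample_sd <- sd(x)

# The population version (divides by n), which corresponds to STDEV.P
population_sd <- sqrt(mean((x - mean(x))^2))

sample_sd       # about 2.14
population_sd   # 2 for these numbers
```

With a script, the choice between the two is visible in the code itself, rather than hidden behind whichever button was clicked.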

Similarly, while it may seem easier to run an ANOVA in SPSS by simply clicking the ANOVA button, this too is often misleading. Having helped students who are partway through a project in SPSS or other clickable programs, we almost always have to start their entire analysis over when a problem arises. The reason is that, by the time the ANOVA button is clicked, there have already been a series of steps in data preparation and uploading that might have generated a problem. In R, we can find these problems easily, because the script leaves a breadcrumb trail of each step. In non-scripting programs, there are no breadcrumbs, so solving the problem becomes much more complicated. And no matter what program you use or how simple your data seems, there will be problems to solve.

Data Ethics

A basic requirement in modern science is that the results of scientific findings can be reproduced by someone else. There are two levels to this. The first level of reproducibility is the description of the experimental approach, which is contained in a Methods section in a scientific publication. This ensures that someone else could read a Methods section and reproduce the steps of the experiment exactly without having to ask the author (who may no longer be alive or may just not respond to email).

The second level of reproducibility is in the analysis of a data set presented in a scientific publication. All analyses involve myriad human decisions. For example, what do we do with outliers (extreme data values that may be real or may be a result of data entry error or errors in the instruments)? What if half of our fish died in the middle of an experiment? Should we replace them with new ones? There are no easy answers to these questions. Each experiment has its own quirks and they will all involve subjective decisions by the scientist.

What do we do about these subjective decisions? The golden rule is to be transparent about them. First, describe them in the Methods and provide a justification for them. Second, always include a way for readers to easily find the raw data and any scripts. This is where using computer code over clicking makes a huge difference. If the raw data and script are available, then it is simple for someone else to run the analysis later and see the decisions you made about the data quirks. Different scientists will make different decisions about each of those quirks. The most important thing is not which decision is made per se, but that the trail of breadcrumbs exists to allow a decision to be transparent.

That may seem a little daunting. It is scary to have someone else see all of your decisions. But here’s the actual truth: The person who will benefit most from your transparent data and code is not another scientist. It is you. In two days, two months, or two years, you will eventually have to return to an old analysis. You’ll need it to wrap up that semester’s term paper or reanalyze something from your thesis. You will NOT NOT NOT NOT NOT remember what you have done, no matter how obvious it seemed when you were doing it. For that reason, having a reproducible script will save you hours, maybe weeks, of otherwise wasted time. Trust us…just trust us.

4.4 Tips and Tricks

4.4.1 pivot_longer()

Convert data from wide format to long format

library(gapminder)
library(tidyverse)

# First build a wide data set to practice on:
# one row per country, one column per year
my_data <- gapminder %>% 
  select(country, year, lifeExp) %>% 
  pivot_wider(names_from = year, values_from = lifeExp) 

my_data
## # A tibble: 142 × 13
##    country     `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` `1997` `2002` `2007`
##    <fct>        <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 Afghanistan   28.8   30.3   32.0   34.0   36.1   38.4   39.9   40.8   41.7   41.8   42.1   43.8
##  2 Albania       55.2   59.3   64.8   66.2   67.7   68.9   70.4   72     71.6   73.0   75.7   76.4
##  3 Algeria       43.1   45.7   48.3   51.4   54.5   58.0   61.4   65.8   67.7   69.2   71.0   72.3
##  4 Angola        30.0   32.0   34     36.0   37.9   39.5   39.9   39.9   40.6   41.0   41.0   42.7
##  5 Argentina     62.5   64.4   65.1   65.6   67.1   68.5   69.9   70.8   71.9   73.3   74.3   75.3
##  6 Australia     69.1   70.3   70.9   71.1   71.9   73.5   74.7   76.3   77.6   78.8   80.4   81.2
##  7 Austria       66.8   67.5   69.5   70.1   70.6   72.2   73.2   74.9   76.0   77.5   79.0   79.8
##  8 Bahrain       50.9   53.8   56.9   59.9   63.3   65.6   69.1   70.8   72.6   73.9   74.8   75.6
##  9 Bangladesh    37.5   39.3   41.2   43.5   45.3   46.9   50.0   52.8   56.0   59.4   62.0   64.1
## 10 Belgium       68     69.2   70.2   70.9   71.4   72.8   73.9   75.4   76.5   77.5   78.3   79.4
## # ℹ 132 more rows

#wide format to long format
my_data %>% 
  pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp")
## # A tibble: 1,704 × 3
##    country     year  lifeExp
##    <fct>       <chr>   <dbl>
##  1 Afghanistan 1952     28.8
##  2 Afghanistan 1957     30.3
##  3 Afghanistan 1962     32.0
##  4 Afghanistan 1967     34.0
##  5 Afghanistan 1972     36.1
##  6 Afghanistan 1977     38.4
##  7 Afghanistan 1982     39.9
##  8 Afghanistan 1987     40.8
##  9 Afghanistan 1992     41.7
## 10 Afghanistan 1997     41.8
## # ℹ 1,694 more rows
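
Notice in the output above that year is stored as text (<chr>) after pivoting, because it came from column names. The sketch below shows the same thing on a tiny made-up table, and one way to turn the year back into a number afterwards. The table wide and its values are invented for illustration, and the sketch assumes the tidyr package is installed (it comes with the tidyverse).

```r
library(tidyr)  # installed as part of the tidyverse

# Hypothetical wide table: one row per site, one column per year
wide <- data.frame(site = c("a", "b"),
                   `2001` = c(1.5, 2.0),
                   `2002` = c(1.7, 2.4),
                   check.names = FALSE)

long <- pivot_longer(wide, cols = -site, names_to = "year", values_to = "value")
long  # year is stored as text because it came from column names

# Convert year to a number afterwards if you need to do arithmetic with it
long$year <- as.integer(long$year)
str(long$year)
```

Whether you need the conversion depends on what you do next; plotting year on a continuous axis, for example, generally works better with numbers than with text.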

4.4.2 pivot_wider()

Convert data from long format to wide format

## # A tibble: 1,704 × 3
##    country     year  lifeExp
##    <fct>       <chr>   <dbl>
##  1 Afghanistan 1952     28.8
##  2 Afghanistan 1957     30.3
##  3 Afghanistan 1962     32.0
##  4 Afghanistan 1967     34.0
##  5 Afghanistan 1972     36.1
##  6 Afghanistan 1977     38.4
##  7 Afghanistan 1982     39.9
##  8 Afghanistan 1987     40.8
##  9 Afghanistan 1992     41.7
## 10 Afghanistan 1997     41.8
## # ℹ 1,694 more rows

#long format to wide format
my_data %>% 
  pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>% 
  pivot_wider(names_from = "year", values_from = "lifeExp") 
## # A tibble: 142 × 13
##    country     `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` `1997` `2002` `2007`
##    <fct>        <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 Afghanistan   28.8   30.3   32.0   34.0   36.1   38.4   39.9   40.8   41.7   41.8   42.1   43.8
##  2 Albania       55.2   59.3   64.8   66.2   67.7   68.9   70.4   72     71.6   73.0   75.7   76.4
##  3 Algeria       43.1   45.7   48.3   51.4   54.5   58.0   61.4   65.8   67.7   69.2   71.0   72.3
##  4 Angola        30.0   32.0   34     36.0   37.9   39.5   39.9   39.9   40.6   41.0   41.0   42.7
##  5 Argentina     62.5   64.4   65.1   65.6   67.1   68.5   69.9   70.8   71.9   73.3   74.3   75.3
##  6 Australia     69.1   70.3   70.9   71.1   71.9   73.5   74.7   76.3   77.6   78.8   80.4   81.2
##  7 Austria       66.8   67.5   69.5   70.1   70.6   72.2   73.2   74.9   76.0   77.5   79.0   79.8
##  8 Bahrain       50.9   53.8   56.9   59.9   63.3   65.6   69.1   70.8   72.6   73.9   74.8   75.6
##  9 Bangladesh    37.5   39.3   41.2   43.5   45.3   46.9   50.0   52.8   56.0   59.4   62.0   64.1
## 10 Belgium       68     69.2   70.2   70.9   71.4   72.8   73.9   75.4   76.5   77.5   78.3   79.4
## # ℹ 132 more rows

4.4.3 mutate()

Add a new column

my_data %>% 
  pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>% 
  mutate(something_new  = "newly something") # adds a new column called something_new that contains the words "newly something"
## # A tibble: 1,704 × 4
##    country     year  lifeExp something_new  
##    <fct>       <chr>   <dbl> <chr>          
##  1 Afghanistan 1952     28.8 newly something
##  2 Afghanistan 1957     30.3 newly something
##  3 Afghanistan 1962     32.0 newly something
##  4 Afghanistan 1967     34.0 newly something
##  5 Afghanistan 1972     36.1 newly something
##  6 Afghanistan 1977     38.4 newly something
##  7 Afghanistan 1982     39.9 newly something
##  8 Afghanistan 1987     40.8 newly something
##  9 Afghanistan 1992     41.7 newly something
## 10 Afghanistan 1997     41.8 newly something
## # ℹ 1,694 more rows
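
mutate() can also calculate a new column from existing ones. The following is a minimal sketch on a made-up tibble (toy and the column lifeExp_months are hypothetical names, and the values are invented).

```r
library(dplyr)

# Hypothetical data: life expectancy in years (made-up values)
toy <- tibble(
  country = c("A", "B"),
  lifeExp = c(30, 60)
)

# Add a column that converts years to months
toy_months <- toy %>%
  mutate(lifeExp_months = lifeExp * 12)

toy_months
```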

4.4.4 case_when()

Add a new column whose values are conditional on an existing column

my_data %>% 
  pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>% 
  mutate(something_new  = case_when(year == 1952 ~ "newly something",
                                     TRUE ~ "something else"))
## # A tibble: 1,704 × 4
##    country     year  lifeExp something_new  
##    <fct>       <chr>   <dbl> <chr>          
##  1 Afghanistan 1952     28.8 newly something
##  2 Afghanistan 1957     30.3 something else 
##  3 Afghanistan 1962     32.0 something else 
##  4 Afghanistan 1967     34.0 something else 
##  5 Afghanistan 1972     36.1 something else 
##  6 Afghanistan 1977     38.4 something else 
##  7 Afghanistan 1982     39.9 something else 
##  8 Afghanistan 1987     40.8 something else 
##  9 Afghanistan 1992     41.7 something else 
## 10 Afghanistan 1997     41.8 something else 
## # ℹ 1,694 more rows
# adds a new column called something_new that contains "newly something" when the year is 1952, and "something else" for all other years. 
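
case_when() accepts more than one condition. Conditions are checked from top to bottom, and each row gets the value of the first condition it satisfies. The sketch below uses a made-up tibble with a numeric year column; the name toy and the labels "early", "middle", and "late" are assumptions for illustration.

```r
library(dplyr)

# Hypothetical data with a numeric year column
toy <- tibble(year = c(1952, 1972, 1992, 2007))

toy_eras <- toy %>%
  mutate(era = case_when(year < 1970 ~ "early",    # checked first
                         year < 1990 ~ "middle",   # only reached if year >= 1970
                         TRUE ~ "late"))           # everything else

toy_eras
```

Because the order matters, put the most specific conditions first and the catch-all TRUE last.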

4.4.5 filter()

Filter by a number

my_data %>% 
  pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>% 
  filter(year == 1952 | year == 1972) # select individual years
## # A tibble: 284 × 3
##    country     year  lifeExp
##    <fct>       <chr>   <dbl>
##  1 Afghanistan 1952     28.8
##  2 Afghanistan 1972     36.1
##  3 Albania     1952     55.2
##  4 Albania     1972     67.7
##  5 Algeria     1952     43.1
##  6 Algeria     1972     54.5
##  7 Angola      1952     30.0
##  8 Angola      1972     37.9
##  9 Argentina   1952     62.5
## 10 Argentina   1972     67.1
## # ℹ 274 more rows

Filter by a range of numbers

my_data %>% 
  pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>% 
  filter(year >= 1972 & year <= 1992) # range of years; year is still text here, so R compares strings, which works because every year has four digits
## # A tibble: 710 × 3
##    country     year  lifeExp
##    <fct>       <chr>   <dbl>
##  1 Afghanistan 1972     36.1
##  2 Afghanistan 1977     38.4
##  3 Afghanistan 1982     39.9
##  4 Afghanistan 1987     40.8
##  5 Afghanistan 1992     41.7
##  6 Albania     1972     67.7
##  7 Albania     1977     68.9
##  8 Albania     1982     70.4
##  9 Albania     1987     72  
## 10 Albania     1992     71.6
## # ℹ 700 more rows
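
When year is stored as a number, the same two filters can be written a little more compactly with %in% and dplyr's between(). This is a sketch on a made-up tibble (toy and its values are hypothetical), not a replacement for the code above.

```r
library(dplyr)

# Hypothetical data with a numeric year column
toy <- tibble(
  country = c("A", "A", "A", "B", "B", "B"),
  year    = c(1952, 1972, 1992, 1952, 1972, 1992)
)

# Individual years: same idea as year == 1952 | year == 1972
toy_two_years <- toy %>%
  filter(year %in% c(1952, 1972))

# Range of years: same idea as year >= 1972 & year <= 1992 (both ends included)
toy_range <- toy %>%
  filter(between(year, 1972, 1992))

toy_two_years
toy_range
```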

Filter by text

my_data %>% 
  pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>% 
  filter(country == "Angola") # filter by country
## # A tibble: 12 × 3
##    country year  lifeExp
##    <fct>   <chr>   <dbl>
##  1 Angola  1952     30.0
##  2 Angola  1957     32.0
##  3 Angola  1962     34  
##  4 Angola  1967     36.0
##  5 Angola  1972     37.9
##  6 Angola  1977     39.5
##  7 Angola  1982     39.9
##  8 Angola  1987     39.9
##  9 Angola  1992     40.6
## 10 Angola  1997     41.0
## 11 Angola  2002     41.0
## 12 Angola  2007     42.7

Filter by text and numbers

my_data %>% 
  pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>% 
  filter(country == "Angola" & year >= 1982) # filter by country and year
## # A tibble: 6 × 3
##   country year  lifeExp
##   <fct>   <chr>   <dbl>
## 1 Angola  1982     39.9
## 2 Angola  1987     39.9
## 3 Angola  1992     40.6
## 4 Angola  1997     41.0
## 5 Angola  2002     41.0
## 6 Angola  2007     42.7

4.4.6 left_join()

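Combine two data sets that have at least one shared column. The two tibbles used here are shown below: first continents (one row per country), then the long-format life expectancy data.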

## # A tibble: 142 × 2
##    country     continent
##    <fct>       <fct>    
##  1 Afghanistan Asia     
##  2 Albania     Europe   
##  3 Algeria     Africa   
##  4 Angola      Africa   
##  5 Argentina   Americas 
##  6 Australia   Oceania  
##  7 Austria     Europe   
##  8 Bahrain     Asia     
##  9 Bangladesh  Asia     
## 10 Belgium     Europe   
## # ℹ 132 more rows
## # A tibble: 1,704 × 3
##    country     year  lifeExp
##    <fct>       <chr>   <dbl>
##  1 Afghanistan 1952     28.8
##  2 Afghanistan 1957     30.3
##  3 Afghanistan 1962     32.0
##  4 Afghanistan 1967     34.0
##  5 Afghanistan 1972     36.1
##  6 Afghanistan 1977     38.4
##  7 Afghanistan 1982     39.9
##  8 Afghanistan 1987     40.8
##  9 Afghanistan 1992     41.7
## 10 Afghanistan 1997     41.8
## # ℹ 1,694 more rows

For every country, assign a continent from the tibble called continents

my_data %>% 
  pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>% 
  left_join(continents)  # joins the tibble 'continents' to the long-format 'my_data'. Requires a shared column with common values; here the shared column is country and the shared values are the country names.
## # A tibble: 1,704 × 4
##    country     year  lifeExp continent
##    <fct>       <chr>   <dbl> <fct>    
##  1 Afghanistan 1952     28.8 Asia     
##  2 Afghanistan 1957     30.3 Asia     
##  3 Afghanistan 1962     32.0 Asia     
##  4 Afghanistan 1967     34.0 Asia     
##  5 Afghanistan 1972     36.1 Asia     
##  6 Afghanistan 1977     38.4 Asia     
##  7 Afghanistan 1982     39.9 Asia     
##  8 Afghanistan 1987     40.8 Asia     
##  9 Afghanistan 1992     41.7 Asia     
## 10 Afghanistan 1997     41.8 Asia     
## # ℹ 1,694 more rows
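
It can help to see what left_join() does when a row has no match. In the sketch below, the tibbles toy_life and toy_continents are made up for illustration, and the shared column is named explicitly with by = "country" instead of letting left_join() detect it.

```r
library(dplyr)

# Hypothetical data: country "C" has no entry in toy_continents
toy_life <- tibble(
  country = c("A", "B", "C"),
  lifeExp = c(30, 60, 45)
)
toy_continents <- tibble(
  country   = c("A", "B"),
  continent = c("Asia", "Europe")
)

# Keep every row of toy_life; add continent where the country matches
toy_joined <- toy_life %>%
  left_join(toy_continents, by = "country")

toy_joined  # country "C" is kept, with NA for continent
```

A left join never drops rows from the left-hand table, so unexpected NAs in the new column are usually a sign that the shared values do not match exactly (for example, different spellings of a country name).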

4.4.7 parse_number()

Extract numbers from a column.

For example, in the current data frame, year is a text string (‘chr’), so the plot treats each year as a separate category and labels every one on the x-axis, like this.

my_data %>% 
  pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>% 
  left_join(continents) %>% 
  ggplot(aes(x = year, y = lifeExp)) +
  geom_point()

That can be messy with lots of years. Let’s tell R that year should be a number, then re-plot it.

my_data %>% 
  pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>% 
  mutate(year = parse_number(year)) %>%         # convert year to a number
  left_join(continents) 
## # A tibble: 1,704 × 4
##    country      year lifeExp continent
##    <fct>       <dbl>   <dbl> <fct>    
##  1 Afghanistan  1952    28.8 Asia     
##  2 Afghanistan  1957    30.3 Asia     
##  3 Afghanistan  1962    32.0 Asia     
##  4 Afghanistan  1967    34.0 Asia     
##  5 Afghanistan  1972    36.1 Asia     
##  6 Afghanistan  1977    38.4 Asia     
##  7 Afghanistan  1982    39.9 Asia     
##  8 Afghanistan  1987    40.8 Asia     
##  9 Afghanistan  1992    41.7 Asia     
## 10 Afghanistan  1997    41.8 Asia     
## # ℹ 1,694 more rows

Now year is a number (aka a ‘double’ or ‘dbl’), so the plot will use a continuous x-axis with evenly spaced intervals instead of labeling every individual year.

my_data %>% 
  pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>% 
  mutate(year = parse_number(year)) %>% 
  left_join(continents) %>% 
  ggplot(aes(x = year, y = lifeExp)) +
  geom_point()

Another way to convert to a number is with as.numeric().

my_data %>% 
  pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>% 
  mutate(year = as.numeric(year)) %>% 
  left_join(continents) %>% 
  ggplot(aes(x = year, y = lifeExp)) +
  geom_point()
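
The two functions give the same result when the text contains only digits, as it does here. They differ when the text has extra characters. The sketch below uses a made-up vector (year_text, including the invented value "yr1957") to show the difference; parse_number() comes from the readr package.

```r
library(readr)

# Hypothetical text values: the second one has a non-numeric prefix
year_text <- c("1952", "yr1957")

# parse_number() drops the non-numeric characters around the first number it finds
parsed <- parse_number(year_text)

# as.numeric() returns NA (with a warning) for anything that is not purely a number
coerced <- suppressWarnings(as.numeric(year_text))

parsed
coerced
```

In other words, as.numeric() is the stricter choice, while parse_number() is more forgiving of messy input.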

separate()

Separate strings in a single column into multiple columns. To have something to separate, first combine country and continent into one column with unite()

my_data_united <- my_data %>% 
  pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>% 
  mutate(year = parse_number(year)) %>% 
  left_join(continents) %>% 
  unite("country_continent", c(country, continent))

For example, we might want to separate the column ‘country_continent’ into two independent columns

my_data_united %>% 
  separate(country_continent, c("country", "continent"), sep = "_")
## # A tibble: 1,704 × 4
##    country     continent  year lifeExp
##    <chr>       <chr>     <dbl>   <dbl>
##  1 Afghanistan Asia       1952    28.8
##  2 Afghanistan Asia       1957    30.3
##  3 Afghanistan Asia       1962    32.0
##  4 Afghanistan Asia       1967    34.0
##  5 Afghanistan Asia       1972    36.1
##  6 Afghanistan Asia       1977    38.4
##  7 Afghanistan Asia       1982    39.9
##  8 Afghanistan Asia       1987    40.8
##  9 Afghanistan Asia       1992    41.7
## 10 Afghanistan Asia       1997    41.8
## # ℹ 1,694 more rows

4.4.8 clean_names()

Use the janitor package to automatically fix the column names. This is especially helpful for messy data sets.

head(iris) # show only the first six rows; the full data set has 150
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

The column names in iris contain capitals and dots. We could rename them all by hand, but clean_names() will do this for us automatically.

library(janitor)
iris %>% clean_names() %>% head() # show only the first six rows
##   sepal_length sepal_width petal_length petal_width species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

4.4.9 group_by() and summarize()

Get summary statistics for each group in your data

my_data %>% 
  pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>% 
  mutate(year = as.numeric(year)) %>% 
  left_join(continents) 
## # A tibble: 1,704 × 4
##    country      year lifeExp continent
##    <fct>       <dbl>   <dbl> <fct>    
##  1 Afghanistan  1952    28.8 Asia     
##  2 Afghanistan  1957    30.3 Asia     
##  3 Afghanistan  1962    32.0 Asia     
##  4 Afghanistan  1967    34.0 Asia     
##  5 Afghanistan  1972    36.1 Asia     
##  6 Afghanistan  1977    38.4 Asia     
##  7 Afghanistan  1982    39.9 Asia     
##  8 Afghanistan  1987    40.8 Asia     
##  9 Afghanistan  1992    41.7 Asia     
## 10 Afghanistan  1997    41.8 Asia     
## # ℹ 1,694 more rows

What are the mean and standard deviation (sd) of life expectancy in each country?

my_data %>% 
  pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>% 
  mutate(year = as.numeric(year)) %>% 
  left_join(continents) %>% 
  group_by(country) %>% 
  summarize(avg_le = mean(lifeExp),
            sd = sd(lifeExp))
## # A tibble: 142 × 3
##    country     avg_le    sd
##    <fct>        <dbl> <dbl>
##  1 Afghanistan   37.5  5.10
##  2 Albania       68.4  6.32
##  3 Algeria       59.0 10.3 
##  4 Angola        37.9  4.01
##  5 Argentina     69.1  4.19
##  6 Australia     74.7  4.15
##  7 Austria       73.1  4.38
##  8 Bahrain       65.6  8.57
##  9 Bangladesh    49.8  9.03
## 10 Belgium       73.6  3.78
## # ℹ 132 more rows

What are the mean and standard deviation (sd) of life expectancy in each year?

my_data %>% 
  pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>% 
  mutate(year = as.numeric(year)) %>% 
  left_join(continents) %>% 
  group_by(year) %>% 
  summarize(avg_le = mean(lifeExp),
            sd = sd(lifeExp))
## # A tibble: 12 × 3
##     year avg_le    sd
##    <dbl>  <dbl> <dbl>
##  1  1952   49.1  12.2
##  2  1957   51.5  12.2
##  3  1962   53.6  12.1
##  4  1967   55.7  11.7
##  5  1972   57.6  11.4
##  6  1977   59.6  11.2
##  7  1982   61.5  10.8
##  8  1987   63.2  10.6
##  9  1992   64.2  11.2
## 10  1997   65.0  11.6
## 11  2002   65.7  12.3
## 12  2007   67.0  12.1

What are the mean and standard deviation (sd) of life expectancy in each year for each continent?

my_data %>% 
  pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>% 
  mutate(year = as.numeric(year)) %>% 
  left_join(continents) %>% 
  group_by(year, continent) %>% 
  summarize(avg_le = mean(lifeExp),
            sd = sd(lifeExp))
## # A tibble: 60 × 4
## # Groups:   year [12]
##     year continent avg_le     sd
##    <dbl> <fct>      <dbl>  <dbl>
##  1  1952 Africa      39.1 5.15  
##  2  1952 Americas    53.3 9.33  
##  3  1952 Asia        46.3 9.29  
##  4  1952 Europe      64.4 6.36  
##  5  1952 Oceania     69.3 0.191 
##  6  1957 Africa      41.3 5.62  
##  7  1957 Americas    56.0 9.03  
##  8  1957 Asia        49.3 9.64  
##  9  1957 Europe      66.7 5.30  
## 10  1957 Oceania     70.3 0.0495
## # ℹ 50 more rows
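
Notice the line "# Groups: year [12]" in the output above: after summarizing by two variables, the result is still grouped by the first one. The sketch below, on a made-up tibble (toy and its values are hypothetical), shows two optional additions: n() to count the rows in each group, and .groups = "drop" to return an ungrouped tibble.

```r
library(dplyr)

# Hypothetical data: two countries per continent in a single year
toy <- tibble(
  continent = c("Asia", "Asia", "Europe", "Europe"),
  year      = c(1952, 1952, 1952, 1952),
  lifeExp   = c(30, 40, 60, 70)
)

toy_summary <- toy %>%
  group_by(year, continent) %>%
  summarize(avg_le  = mean(lifeExp),
            sd      = sd(lifeExp),
            n       = n(),          # number of rows in each group
            .groups = "drop")       # remove the leftover grouping

toy_summary
```

Reporting n alongside the mean and sd makes it clear how many observations each summary is based on, which matters for small groups such as Oceania above.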

  1. Except for JR, an English professor who doesn’t understand any of this. We assume he is currently pontificating about the literary importance of using salve versus halve in the writings of Chaucer (who uses neither word). JR has a large collection of feathered pens and prefers to write on low gloss paper sourced from the Pacific Northeast.