4 Introduction to R
Learning Objectives
Be able to download and open R and RStudio.
Understand how to import data to R.
Understand how to create a graph in R.
4.1 R and RStudio
So far we’ve discussed the nature of science and the structure of scientific papers. Now we’re going to introduce a central method that is required of most modern science: scientific computing. It is almost impossible to conduct science without a computer, especially since we need to analyze data. We also need to store the data, organize it, plot it, summarize, and report it. Lots of tools do that. In this book, we use the statistical software R (R Core Team 2020). It is among the most popular programs for analyzing scientific data and it is designed specifically for the workflow we use in this book. It is also free. Here’s how you get it.
Download R: (https://mirror.las.iastate.edu/CRAN/). Follow the link above and choose your operating system - Linux, Mac OS, or Windows
Download RStudio: (https://posit.co/download/rstudio-desktop/)
*Alternatively, you can use a cloud version of RStudio: https://posit.cloud/plans/free
All of the examples in this book are generated using R (R Core Team 2020). Actually, that’s not quite correct. While R is the workhorse, the examples in this book are generated through an interface to R called RStudio (RStudio Team 2020).
R looks like this: Download only
RStudio looks like this: Download, open, and use
Once you’ve downloaded both programs, you’ll only need to open RStudio. It automatically uses R in the background. It is possible to do everything only in base R, but we prefer RStudio as a more user-friendly interface.1
How to Use This Chapter
This book is not meant as a stand-alone R reference. It is meant as a companion to university-level labs and lectures, in which students can work through examples with an instructor or TA nearby to fill in the gaps and troubleshoot.
When starting R, these are the types of questions many students have:
“Is this the right program?” “What is a script again?” “How do you make the arrow?” “What is that squiggly sign?” “I ran the code and nothing happened…”
In other words, we expect students to have lots of questions in this new and unfamiliar environment. Everyone started this way and the easiest way to find the answers is to ask an expert.
However, there are lots of excellent R guides out there for students who are interested in learning more detail. Here are a few of our favorites:
R for Data Science (free) - https://r4ds.had.co.nz/ (Grolemund and Wickham n.d.)
The R Book (Crawley 2012)
Getting Started with R: An Introduction for Biologists (Beckerman, Childs, and Petchey 2017)
Data Visualization: A Practical Introduction (Healy 2018)
dplyr and tidyr cheatsheets - (https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)
ggplot2 cheatsheet - (https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf)
4.2 Data Analysis Workflow
In this book, we focus on learning a few fundamental tasks that are common to the workflow of most data science projects (Wickham et al. 2019). Nearly every study that includes data has a workflow similar to that above. We gather data, get it into a program (Import), get it in the right format (Tidy), and then analyze it with plots (Visualize), Models, Transformations, etc. When we’ve finished, we communicate the results to our peers. You’ll learn how to complete these steps in R because its designed specifically for this type of workflow. But the workflow applies regardless of the software you use.
4.3 Getting Started in RStudio
Before the fun stuff happens, we need to determine where things will be saved on our computer. If you’d prefer to skip this step, that is OK. Just be prepared for certain doom.
Create a folder on your computer for your analyses
For example, if this is for a class called Biology 280, you might create a folder called BIO280_R
. If you have data to analyze (like an Excel file or a .csv), save it in this folder as well.
Open RStudio
Click the RStudio icon
Create a project
File -> New Project… -> Existing Directory -> Browse -> [NAME OF YOUR FOLDER]
You only need to do this once. After you create a project, all of the work you do within that project (data analysis, graphs, text) will be saved in it. If it all goes well, you should see a screen like this, with the name of your project in the upper right hand corner.
Open a script
File -> New File -> R Script or ctrl+shift+N
You should now see a screen like the one below, with four windows.
The window on the lower right shows all of the Files in the folder you created on your computer. If you add something to that folder from outside of R, it will show up here as well.
The window on the upper right shows your Environment. When you create something in R, like a new data frame or a plot, it will show up in the Environment (But it won’t be saved. More on that later).
The window on the upper left shows your Script. This is where you tell R what to do.
The window on the bottom left is the Console. It keeps a running list of all of the procedures you perform. For example, if you run code in the script, it will show up here. When something goes wrong, you’ll also see the error message here.
4.3.1 Install a package
Click anywhere on the script window so that you see the flashing prompt. Type the code below and then type ctrl+enter. (NOTE: If you just hit enter without adding ctrl, it won’t work. It will just move you to the next line. Get in the habit of typing ctrl+enter to run your code).
Like this:
The code above tells R to install a package called tidyverse. You only have to do this once. After it’s installed, it will always be available when you open R, but you’ll have to tell R when you want to use it each time by typing library(tidyverse)
.
Packages are bits of code that someone wrote and then converted into a series of shortcuts. R has 1000’s of packages for just about any task you can think of. The tidyverse package actually contains a bunch of other packages within it. As a result, when you install it for the first time, it will generate a lot of activity in your console, with red text and “https://..” links all over the place. That is all normal. Just give it a few minutes. You’ll know it’s done when you see the chevron >
in the console.
4.3.2 Your first script
Now you are ready for the fun parts. To begin coding your first script, we are going to take an unorthodox approach. Instead of starting with first principles, we’ll start with the Visualize and Model steps from the workflow and then deconstruct that to learn the principles.
Copy the code below and paste it in your script. Then run the code (by clicking ctrl+enter from the first line down). Do not try to interpret it yet. There is a lot going on here. We’ll break it down next.
#Copy this code, paste it in your script, and run it.
#load a package
library(tidyverse)
# make the data and "model" the mean and sd.
d <- mtcars %>%
group_by(cyl) %>%
mutate(mean = mean(mpg), sd = sd(mpg))
#plot the data
ggplot(data = d, aes(x = cyl,y = mpg)) +
geom_point(shape = 21) +
geom_pointrange(aes(y = mean, ymin = mean-sd, ymax = mean+sd)) +
labs(y = "Miles Per Gallon",
x = "Cylinders")
Here is what you just did:
Loaded the tidyverse package.
Created a data frame d that was a modified version of the data frame mtcars.
Added two new columns to d: one containing the mean mpg for each type of cylinder and another containing the standard deviation.
Plotted miles per gallon as a function of cylinders as raw data.
Added a mean and standard deviation to the plot.
Modified the axis names.
If you’re a normal person, this should all be mysterious. Here’s the good news. The code above is about as complex as we will get in this book. It is also modular. That is crucial. It means that you don’t have to know every step to get started. Each batch of code that precedes %>%
or +
will run by itself.
Let’s break the code down into individual components:
1) Loaded the tidyverse package.
What it does This code uses the function library()
to load a package called tidyverse. The rest of the code depends on loading this package first.
Did it work? Check the output in the console (lower left window). Red text is normal. It does not necessarily mean there is an error. If you see the prompt >
, that is a good sign. If you see words like "there is no package called…"
, "Error…"
, "failed…"
, then it probably didn’t work.
Things to check if it doesn’t work
- Did you install the package first?
- Did you misspell anything?
- Did you add a capital letter somewhere?
- Did you hit enter instead of ctrl-enter?
2) Created a data frame d that was a modified version of the data frame mtcars.
#Copy this code, paste it in your script, and run it.
#load a package
library(tidyverse)
d <- mtcars
What it does This code creates a data frame called d that contains all of the data that are in mtcars. A data frame is just a table with rows and columns. Try running View(mtcars)
and you’ll see what the data frame looks like. mtcars is one of many data frames that are built in to R. It contains data on things like miles per gallon, weight, and horsepower for different types of cars.
The symbol <-
is how we assign bits of code to objects in R. It’s a combination of the lesser than sign <
and the minus sign -
. You will use this symbol all the time.
Did it work? Do you see an object named d in the Environment window (upper right) that has “32 obs. of 11 variables”? If not, it didn’t work.
Visually, the d in the Environment window is the only thing that will automatically show up if it worked. Another way to check is simply to view the data frame, like this:
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
This shows a snapshot of the first few rows of the data frame that exists in the object d. It is essentially the same as any other spreadsheet you might make in another program, like Excel. Each row contains information on mpg, cylinders (cyl), horsepower(hp), etc. for each type of car.
You can isolate individual columns:
## mpg cyl
## Mazda RX4 21.0 6
## Mazda RX4 Wag 21.0 6
## Datsun 710 22.8 4
## Hornet 4 Drive 21.4 6
## Hornet Sportabout 18.7 8
## Valiant 18.1 6
Or check the types of columns:
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
This shows that the object d is a data frame with 32 observations (rows) of 11 variables (columns). It also lists the variables and gives a preview of the first 10 rows. [Note: instead of a data frame, you might see the word tibble. That is another name for a data frame used by the tidyverse package. It has some important distinctions, but they are not relevant for this chapter].
Things to check if it doesn’t work
- Did you misspell anything?
- Did you add a capital letter somewhere?
- Did you hit enter instead of ctrl-enter?
- Is there a space in the arrow
< -
? There shouldn’t be.
Challenges
- Give the data frame a different name other than d.
- Select other columns using the select() function.
3) Added two new columns to d: one containing the mean mpg for each type of cylinder and another containing the standard deviation.
#Copy this code, paste it in your script, and run it.
#load a package
library(tidyverse)
# make the data and "model" the mean and sd
d <- mtcars %>%
group_by(cyl) %>%
mutate(mean = mean(mpg), sd = sd(mpg))
What it does This code adds a column summarizing the mean and sd of mpg for cars with different numbers of cylinders. We also have a new symbol %>%
called a “pipe”. You can either type it directly or use a shortcut ctrl-shift-m. Think of it as a way of telling R “and then…”.
In sentence form, the code is saying this.
d <- mtcars %>%
create a data frame called d that contains all of the data that are in mtcars and then…
group_by(cyl) %>%
assign each type of cylinder to a group and then…
mutate(mean = mean(mpg), sd = sd(mpg))
create a new column called mean that contains the mean mpg’s for each type of cylinder. Also create a new column called sd that contains the standard deviation for each type of cylinder
Did it work? Check the columns again with str()
. Do you see the columns mean and sd now? Do you see the odd addendum that starts with “-attr(*,”groups”)…“? If so, then it worked.
## gropd_df [32 × 13] (S3: grouped_df/tbl_df/tbl/data.frame)
## $ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num [1:32] 160 160 108 258 360 ...
## $ hp : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
## $ vs : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
## $ mean: num [1:32] 19.7 19.7 26.7 19.7 15.1 ...
## $ sd : num [1:32] 1.45 1.45 4.51 1.45 2.56 ...
## - attr(*, "groups")= tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ cyl : num [1:3] 4 6 8
## ..$ .rows: list<int> [1:3]
## .. ..$ : int [1:11] 3 8 9 18 19 20 21 26 27 28 ...
## .. ..$ : int [1:7] 1 2 4 6 10 11 30
## .. ..$ : int [1:14] 5 7 12 13 14 15 16 17 22 23 ...
## .. ..@ ptype: int(0)
## ..- attr(*, ".drop")= logi TRUE
Things to check if it doesn’t work
- Did you load the library
library(tidyverse)
? The pipes%>%
only work if the tidyverse package is loaded. - Did you misspell anything?
- Did you type out “cylinders” instead of using “cyl”? Computers don’t know those are related.
- Did you add a capital letter somewhere?
- Did you hit enter instead of ctrl-enter?
- Is there a space in the arrow
< -
or the pipes% >%
? There shouldn’t be. - Did you put the package in quotes when calling the
library()
function:library("tidyverse")
?
Challenges
- Summarize mpg by the number of gears instead of cylinders
- Add a column that calculates the median in addition to the mean and sd
4) Plotted miles per gallon as a function of cylinders as raw data.
#Copy this code, paste it in your script, and run it.
#load a package
library(tidyverse)
# make the data and "model" the mean and sd
d <- mtcars %>%
group_by(cyl) %>%
mutate(mean = mean(mpg), sd = sd(mpg))
#plot the data
ggplot(data = d, aes(x = cyl,y = mpg)) +
geom_point(shape = 21)
What it does This code uses the powerful plotting package called ggplot2 (Wickham 2016). It included when you installed the tidyverse. The “gg” stands for “The Grammar of Graphics” (Wilkinson 2012), a fundamental set of principles for producing just about any plot you can think of (and rules for why some types of plots are better than others).
Making anything with ggplot2 usually requires at least two things: 1) a call to ggplot(...)
where we specify the data along with the x and y axes or other aesthetics, and 2) a call to geom_…
, where we tell ggplot2 how to plot the data. There are lots of geoms, as you can see in this cheatsheet (https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf). For most practices in this book, we will use geom_point()
, which simply adds a dot for each x-y coordinate that we specified in the aes()
function.
Once we have our base plot, everything else is added with +
. This can be a point of confusion. The +
has a similar meaning as the pipe %>%
, but ggplot2 only uses +
. Accidentally typing %>%
instead of +
is a common mistake even for experienced coders (like authors of textbooks about data analysis).
You can see the iterative nature of ggplot by breaking it down further, adding one thing at a time.
#Copy this code, paste it in your script, and run it.
#load a package
library(tidyverse)
# make the data and "model" the mean and sd
d <- mtcars %>%
group_by(cyl) %>%
mutate(mean = mean(mpg), sd = sd(mpg))
#Create a placeholder for a plot
ggplot()
#Assign the aesthetics
#Put numbers from the column cyl on the x
#For each value of cyl on the x, add the corresponding value from the mpg column
ggplot(data = d, aes(x = cyl,y = mpg))
#Tell ggplot how to plot the x-y values with a geom.
ggplot(data = d, aes(x = cyl,y = mpg)) +
geom_point()
#Change the default shape of the dots
#Any number from 1-25 will produce a different shape. (http://www.cookbook-r.com/Graphs/Shapes_and_line_types/)
ggplot(data = d, aes(x = cyl,y = mpg)) +
geom_point(shape = 21)
Did it work? The plot above should appear in the plot window (lower right).
Things to check if it doesn’t work
- Did you leave a hanging plus
+
at the end of the code? If so, remove it. - Did you write
ggplot2()
instead ofggplot()
? - Did you remember to assign the x and y axes within the
aes()
function? - Did you put a pipe
%>%
instead of a plus+
? - Did you misspell anything?
- Did you type out “cylinders” instead of using “cyl”? R doesn’t know those are related.
- Did you add a capital letter somewhere?
- Did you hit enter instead of ctrl-enter?
- Is there a space in the arrow
< -
or the pipes% >%
? There shouldn’t be. - Did you put the package in quotes when calling the
library()
function:library("tidyverse")
? - Did you load the library
library(tidyverse)
? The pipes%>%
only work if the tidyverse package is loaded.
Challenges
- Assign a different y variable
- Change the shape
- In addition to shape, make the colors green (HINT: see the cheatsheet (https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf))
5) Added a mean and standard deviation to the plot.
#Copy this code, paste it in your script, and run it.
#load a package
library(tidyverse)
# make the data and "model" the mean and sd
d <- mtcars %>%
group_by(cyl) %>%
mutate(mean = mean(mpg), sd = sd(mpg))
#plot the data
ggplot(data = d, aes(x = cyl,y = mpg)) +
geom_point(shape = 21) +
geom_pointrange(aes(y = mean, ymin = mean-sd, ymax = mean+sd))
What it does The new geom geom_pointrange()
adds the mean and standard deviation to the plot. If we only typed geom_pointrange()
, it wouldn’t work. That’s because the geom requires three values that we haven’t assigned yet: y, ymin, and ymax. In this case, we want y to be the mean of each group. We want the error bar to range from ymin to ymax. ymin is the mean minus the standard deviation for each group mean-sd. ymax is the mean plus the standard deviation mean+sd. With those values, ggplot draws a line from ymin to ymax.
Did it work? The plot above should appear in the plot window (lower right).
Things to check if it doesn’t work
- Did you leave a hanging plus
+
at the end of the code? If so, remove it. - Did you write
ggplot2()
instead ofggplot()
? - Did you remember to assign the x and y axes within the
aes()
function? - Did you put a pipe
%>%
instead of a plus+
? - Did you misspell anything?
- Did you type out “cylinders” instead of using “cyl”? R doesn’t know those are related.
- Did you add a capital letter somewhere?
- Did you hit enter instead of ctrl-enter?
- Is there a space in the arrow
< -
or the pipes% >%
? There shouldn’t be. - Did you put the package in quotes when calling the
library()
function:library("tidyverse")
? - Did you load the library
library(tidyverse)
? The pipes%>%
only work if the tidyverse package is loaded.
6) Modified the axis names.
#Copy this code, paste it in your script, and run it.
#load a package
library(tidyverse)
# make the data and "model" the mean and sd
d <- mtcars %>%
group_by(cyl) %>%
mutate(mean = mean(mpg), sd = sd(mpg))
#plot the data
ggplot(data = d, aes(x = cyl,y = mpg)) +
geom_point(shape = 21) +
geom_pointrange(aes(y = mean, ymin = mean-sd, ymax = mean+sd)) +
labs(y = "Miles Per Gallon",
x = "Cylinders")
What it does The function labs()
replaces the title of the x and y axis with whatever we put in quotes.
Did it work? The plot above should appear in the plot window (lower right).
Things to check if it doesn’t work
- Did you remember the comma?
- Did you switch the x and y?
- Did you leave a hanging plus + at the end of the code? If so, remove it.
- Did you write
ggplot2()
instead ofggplot()
? - Did you remember to assign the x and y axes within the
aes()
function? - Did you put a pipe
%>%
instead of a plus+
? - Did you misspell anything?
- Did you type out “cylinders” instead of using “cyl”? R doesn’t know those are related.
- Did you add a capital letter somewhere?
- Did you hit enter instead of ctrl-enter?
- Is there a space in the arrow
< -
or the pipes% >%
? There shouldn’t be. - Did you put the package in quotes when calling the
library()
function:library("tidyverse")
? - Did you load the library
library(tidyverse)
? The pipes%>%
only work if the tidyverse package is loaded.
Challenges
- Rename the x and y axes
- Add a title within the
labs()
function usingtitle = "put your title here"
4.3.3 Importing your own data to RStudio
To get your own data into R, first save the data into the same folder as your project.
Look in the lower right panel of RStudio. Click on “Files”. Do you see the data set you saved? Click on it and choose Import Dataset….
Like this:
After you click Import Dataset…., you’ll see a preview of your data set like this:
Stop here and check the preview. Does everything look right? Are the column names correct? If not, you might need to check the box for “First Rows as Names” on the lower left.
From here, you have two options. 1) Click “Import”, or 2) Copy the code in the Preview Code box, cancel the preview, and paste the code into your script. We strongly recommend the second option.
Clicking will work, but if you come back to your script and need to reload the data, you’ll have to do this process again. If you instead copy the code and paste it into your script, then your code becomes self-contained and you won’t forget any steps in the future.
Here is where you can copy from. We don’t copy the View()
part, but you can if you want:
Then paste it at the beginning of your script and run it. Our data set is called continents. Yours will probably be different, though.
Do you see the name of your data set in the upper right panel? If so, success! If not, re-try the steps above or ask your instructor. You are now ready to practice the coding you’ve learned on your own data set.
Why to Code Instead of Click
R is a programming language, which means that it can only do what you tell it to do by typing. RStudio has a few clickable shortcuts, but it still requires nearly everything to be typed into a script.
There are other programs that conduct statistical and graphical analyses without using code. We choose to use R instead for several reasons.
Clicking Isn’t Actually Easier
Undergraduates are incredibly savvy with some aspects of computers, particularly in nagivating social media platforms. But in our experience, students often struggle with even rudimentary tasks in programs that professors think are easy, such as Microsoft Excel or SPSS. These programs have their own bewildering array of shortcuts and buttons (Nash 2008). For example, while this Excel function might make perfect sense to a seasoned user
=STDEV.P(A$1:A$7)
it can be just as confusing to a new student as the similar function in R sd(data$column)
.
Similarly, while it may seem easier to run an ANOVA in SPSS by simply clicking the ANOVA button, this too is often misleading. Having helped students that are part way through a project in SPSS or other clickable programs, we almost always have to start their entire analysis over when a problem arises. The reason is that, by the time the ANOVA button is clicked, there have already been a series of steps in data preparation and uploading that might have generated a problem. In R, we can find these problems easily, because the script leaves a breadcrumb trail of each step. In non-scripting programs, there are no breadcrumbs, so solving the problem becomes much more complicated. And no matter what program you use or how simple your data seems, there will be problems to solve.
Data Ethics
A basic requirement in modern science is that the results of scientific findings could be reproduced by someone else. There are two levels to this. The first level of reproducibility is the description of the experimental approach, which is contained in a Methods section in a scientific publication. This ensures that someone else could read a Methods section and reproduce the steps of the experiment exactly without having to ask the author (who may no longer be alive or just doesn’t respond to email).
The second level of reproducibility is in the analysis of a data set presented in a scientific publication. All analyses involve myriad human decisions. For example, what do we do with outliers (extreme data values that may be real or may be a result of data entry error or errors in the instruments)? What if half of our fish died in the middle of an experiment? Should we replace them with new ones? There are no easy answers to these questions. Each experiment has its own quirks and they will all involve subjective decisions by the scientist.
What do we do about these subjective decisions? The golden rule is to be transparent about them. First, describe them in the Methods and provide a justification for them. Second, always include a way for readers to easily find the raw data and any scripts. This is where using computer code over clicking makes a huge difference. If the raw data and script are available, then it is simple for someone else to run the analysis later and see the decisions you made about the data quirks. Different scientists will make different decisions about each of those quirks. The most important thing is not which decision is made per se, but that the trail of breadcrumbs exists to allow a decision to be transparent.
That may seem a little daunting. It is scary to have someone else see all of your decisions. But here’s the actual truth: The person who will benefit most from your transparent data and code is not another scientist. It is you. In two days, two months, or two years, you will eventually have to return to an old analysis. You’ll need it to wrap up that semester’s term paper or reanalyze something from your thesis. You will NOT NOT NOT NOT NOT remember what you have done, no matter how obvious it seemed when you were doing it. For that reason, having script that is reproducible will save you hours, maybe weeks, of otherwise wasted time. Trust us…just trust us.
4.4 Tips and Tricks
4.4.1 pivot_longer()
Convert data from wide format to long format
library(gapminder)
library(tidyverse)
my_data <- gapminder %>%
select(country, year, lifeExp) %>%
pivot_wider(names_from = year, values_from = lifeExp)
my_data
## # A tibble: 142 × 13
## country `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` `1997` `2002` `2007`
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 28.8 30.3 32.0 34.0 36.1 38.4 39.9 40.8 41.7 41.8 42.1 43.8
## 2 Albania 55.2 59.3 64.8 66.2 67.7 68.9 70.4 72 71.6 73.0 75.7 76.4
## 3 Algeria 43.1 45.7 48.3 51.4 54.5 58.0 61.4 65.8 67.7 69.2 71.0 72.3
## 4 Angola 30.0 32.0 34 36.0 37.9 39.5 39.9 39.9 40.6 41.0 41.0 42.7
## 5 Argentina 62.5 64.4 65.1 65.6 67.1 68.5 69.9 70.8 71.9 73.3 74.3 75.3
## 6 Australia 69.1 70.3 70.9 71.1 71.9 73.5 74.7 76.3 77.6 78.8 80.4 81.2
## 7 Austria 66.8 67.5 69.5 70.1 70.6 72.2 73.2 74.9 76.0 77.5 79.0 79.8
## 8 Bahrain 50.9 53.8 56.9 59.9 63.3 65.6 69.1 70.8 72.6 73.9 74.8 75.6
## 9 Bangladesh 37.5 39.3 41.2 43.5 45.3 46.9 50.0 52.8 56.0 59.4 62.0 64.1
## 10 Belgium 68 69.2 70.2 70.9 71.4 72.8 73.9 75.4 76.5 77.5 78.3 79.4
## # ℹ 132 more rows
## # A tibble: 1,704 × 3
## country year lifeExp
## <fct> <chr> <dbl>
## 1 Afghanistan 1952 28.8
## 2 Afghanistan 1957 30.3
## 3 Afghanistan 1962 32.0
## 4 Afghanistan 1967 34.0
## 5 Afghanistan 1972 36.1
## 6 Afghanistan 1977 38.4
## 7 Afghanistan 1982 39.9
## 8 Afghanistan 1987 40.8
## 9 Afghanistan 1992 41.7
## 10 Afghanistan 1997 41.8
## # ℹ 1,694 more rows
4.4.2 pivot_wider()
Convert data from long format to wide format
## # A tibble: 1,704 × 3
## country year lifeExp
## <fct> <chr> <dbl>
## 1 Afghanistan 1952 28.8
## 2 Afghanistan 1957 30.3
## 3 Afghanistan 1962 32.0
## 4 Afghanistan 1967 34.0
## 5 Afghanistan 1972 36.1
## 6 Afghanistan 1977 38.4
## 7 Afghanistan 1982 39.9
## 8 Afghanistan 1987 40.8
## 9 Afghanistan 1992 41.7
## 10 Afghanistan 1997 41.8
## # ℹ 1,694 more rows
#long format to wide format
my_data %>%
pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>%
pivot_wider(names_from = "year", values_from = "lifeExp")
## # A tibble: 142 × 13
## country `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` `1997` `2002` `2007`
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 28.8 30.3 32.0 34.0 36.1 38.4 39.9 40.8 41.7 41.8 42.1 43.8
## 2 Albania 55.2 59.3 64.8 66.2 67.7 68.9 70.4 72 71.6 73.0 75.7 76.4
## 3 Algeria 43.1 45.7 48.3 51.4 54.5 58.0 61.4 65.8 67.7 69.2 71.0 72.3
## 4 Angola 30.0 32.0 34 36.0 37.9 39.5 39.9 39.9 40.6 41.0 41.0 42.7
## 5 Argentina 62.5 64.4 65.1 65.6 67.1 68.5 69.9 70.8 71.9 73.3 74.3 75.3
## 6 Australia 69.1 70.3 70.9 71.1 71.9 73.5 74.7 76.3 77.6 78.8 80.4 81.2
## 7 Austria 66.8 67.5 69.5 70.1 70.6 72.2 73.2 74.9 76.0 77.5 79.0 79.8
## 8 Bahrain 50.9 53.8 56.9 59.9 63.3 65.6 69.1 70.8 72.6 73.9 74.8 75.6
## 9 Bangladesh 37.5 39.3 41.2 43.5 45.3 46.9 50.0 52.8 56.0 59.4 62.0 64.1
## 10 Belgium 68 69.2 70.2 70.9 71.4 72.8 73.9 75.4 76.5 77.5 78.3 79.4
## # ℹ 132 more rows
4.4.3 mutate()
Add a new column
my_data %>%
pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>%
mutate(something_new = "newly something") # adds a new column called something_new that contains the words "newly something"
## # A tibble: 1,704 × 4
## country year lifeExp something_new
## <fct> <chr> <dbl> <chr>
## 1 Afghanistan 1952 28.8 newly something
## 2 Afghanistan 1957 30.3 newly something
## 3 Afghanistan 1962 32.0 newly something
## 4 Afghanistan 1967 34.0 newly something
## 5 Afghanistan 1972 36.1 newly something
## 6 Afghanistan 1977 38.4 newly something
## 7 Afghanistan 1982 39.9 newly something
## 8 Afghanistan 1987 40.8 newly something
## 9 Afghanistan 1992 41.7 newly something
## 10 Afghanistan 1997 41.8 newly something
## # ℹ 1,694 more rows
4.4.4 case_when()
Add a new column whose values are conditional on an existing column
my_data %>%
pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>%
mutate(something_new = case_when(year == 1952 ~ "newly something",
TRUE ~ "something else"))
## # A tibble: 1,704 × 4
## country year lifeExp something_new
## <fct> <chr> <dbl> <chr>
## 1 Afghanistan 1952 28.8 newly something
## 2 Afghanistan 1957 30.3 something else
## 3 Afghanistan 1962 32.0 something else
## 4 Afghanistan 1967 34.0 something else
## 5 Afghanistan 1972 36.1 something else
## 6 Afghanistan 1977 38.4 something else
## 7 Afghanistan 1982 39.9 something else
## 8 Afghanistan 1987 40.8 something else
## 9 Afghanistan 1992 41.7 something else
## 10 Afghanistan 1997 41.8 something else
## # ℹ 1,694 more rows
4.4.5 filter()
Filter by a number
my_data %>%
pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>%
filter(year == 1952 | year == 1972) # select individual years
## # A tibble: 284 × 3
## country year lifeExp
## <fct> <chr> <dbl>
## 1 Afghanistan 1952 28.8
## 2 Afghanistan 1972 36.1
## 3 Albania 1952 55.2
## 4 Albania 1972 67.7
## 5 Algeria 1952 43.1
## 6 Algeria 1972 54.5
## 7 Angola 1952 30.0
## 8 Angola 1972 37.9
## 9 Argentina 1952 62.5
## 10 Argentina 1972 67.1
## # ℹ 274 more rows
Filter by a range of numbers
my_data %>%
pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>%
filter(year >= 1972 & year <= 1992) # range of years
## # A tibble: 710 × 3
## country year lifeExp
## <fct> <chr> <dbl>
## 1 Afghanistan 1972 36.1
## 2 Afghanistan 1977 38.4
## 3 Afghanistan 1982 39.9
## 4 Afghanistan 1987 40.8
## 5 Afghanistan 1992 41.7
## 6 Albania 1972 67.7
## 7 Albania 1977 68.9
## 8 Albania 1982 70.4
## 9 Albania 1987 72
## 10 Albania 1992 71.6
## # ℹ 700 more rows
Filter by text
my_data %>%
pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>%
filter(country == "Angola") # filter by country
## # A tibble: 12 × 3
## country year lifeExp
## <fct> <chr> <dbl>
## 1 Angola 1952 30.0
## 2 Angola 1957 32.0
## 3 Angola 1962 34
## 4 Angola 1967 36.0
## 5 Angola 1972 37.9
## 6 Angola 1977 39.5
## 7 Angola 1982 39.9
## 8 Angola 1987 39.9
## 9 Angola 1992 40.6
## 10 Angola 1997 41.0
## 11 Angola 2002 41.0
## 12 Angola 2007 42.7
Filter by text and numbers
my_data %>%
pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>%
filter(country == "Angola" & year >= 1982) # filter by country and year
## # A tibble: 6 × 3
## country year lifeExp
## <fct> <chr> <dbl>
## 1 Angola 1982 39.9
## 2 Angola 1987 39.9
## 3 Angola 1992 40.6
## 4 Angola 1997 41.0
## 5 Angola 2002 41.0
## 6 Angola 2007 42.7
4.4.6 left_join()
Combine two data sets that have at least one shared column
## # A tibble: 142 × 2
## country continent
## <fct> <fct>
## 1 Afghanistan Asia
## 2 Albania Europe
## 3 Algeria Africa
## 4 Angola Africa
## 5 Argentina Americas
## 6 Australia Oceania
## 7 Austria Europe
## 8 Bahrain Asia
## 9 Bangladesh Asia
## 10 Belgium Europe
## # ℹ 132 more rows
## # A tibble: 1,704 × 3
## country year lifeExp
## <fct> <chr> <dbl>
## 1 Afghanistan 1952 28.8
## 2 Afghanistan 1957 30.3
## 3 Afghanistan 1962 32.0
## 4 Afghanistan 1967 34.0
## 5 Afghanistan 1972 36.1
## 6 Afghanistan 1977 38.4
## 7 Afghanistan 1982 39.9
## 8 Afghanistan 1987 40.8
## 9 Afghanistan 1992 41.7
## 10 Afghanistan 1997 41.8
## # ℹ 1,694 more rows
For every country, assign a continent from the tibble called continents
my_data %>%
pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>%
left_join(continents) #joins the tibble 'continents' to the tibble 'my_data'. Requires a shared column with common values. In this case, country is the column and the country names are the values.
## # A tibble: 1,704 × 4
## country year lifeExp continent
## <fct> <chr> <dbl> <fct>
## 1 Afghanistan 1952 28.8 Asia
## 2 Afghanistan 1957 30.3 Asia
## 3 Afghanistan 1962 32.0 Asia
## 4 Afghanistan 1967 34.0 Asia
## 5 Afghanistan 1972 36.1 Asia
## 6 Afghanistan 1977 38.4 Asia
## 7 Afghanistan 1982 39.9 Asia
## 8 Afghanistan 1987 40.8 Asia
## 9 Afghanistan 1992 41.7 Asia
## 10 Afghanistan 1997 41.8 Asia
## # ℹ 1,694 more rows
4.4.7 parse_number()
Extract numbers from a column.
For example, in the current data frame, year is a text string so every instance of year will be plotted, like this.
my_data %>%
pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>%
left_join(continents) %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_point()
That can be messy with lots of years. Let’s tell R that year should be a number and re-plot it.
my_data %>%
pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>%
mutate(year = parse_number(year)) %>% # convert year to a number
left_join(continents)
## # A tibble: 1,704 × 4
## country year lifeExp continent
## <fct> <dbl> <dbl> <fct>
## 1 Afghanistan 1952 28.8 Asia
## 2 Afghanistan 1957 30.3 Asia
## 3 Afghanistan 1962 32.0 Asia
## 4 Afghanistan 1967 34.0 Asia
## 5 Afghanistan 1972 36.1 Asia
## 6 Afghanistan 1977 38.4 Asia
## 7 Afghanistan 1982 39.9 Asia
## 8 Afghanistan 1987 40.8 Asia
## 9 Afghanistan 1992 41.7 Asia
## 10 Afghanistan 1997 41.8 Asia
## # ℹ 1,694 more rows
Now year is a number (aka a ‘double’ or ‘dbl’) and will plot intervals instead of every instance.
my_data %>%
pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>%
mutate(year = parse_number(year)) %>%
left_join(continents) %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_point()
Another way to convert to a number is with as.numeric()
my_data %>%
pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>%
mutate(year = as.numeric(year)) %>%
left_join(continents) %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_point()
### separate() Separate strings in a single column to multiple columns
my_data_united <- my_data %>%
pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>%
mutate(year = parse_number(year)) %>%
left_join(continents) %>%
unite("country_continent", c(country, continent))
For example, we might want to separate the column ‘country_continent’ to two independent columns
## # A tibble: 1,704 × 4
## country continent year lifeExp
## <chr> <chr> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8
## 2 Afghanistan Asia 1957 30.3
## 3 Afghanistan Asia 1962 32.0
## 4 Afghanistan Asia 1967 34.0
## 5 Afghanistan Asia 1972 36.1
## 6 Afghanistan Asia 1977 38.4
## 7 Afghanistan Asia 1982 39.9
## 8 Afghanistan Asia 1987 40.8
## 9 Afghanistan Asia 1992 41.7
## 10 Afghanistan Asia 1997 41.8
## # ℹ 1,694 more rows
4.4.8 clean_names()
Use the janitor package to automatically fix the column names. This is especially helpful for messy data sets
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 22 5.1 3.7 1.5 0.4 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 26 5.0 3.0 1.6 0.2 setosa
## 27 5.0 3.4 1.6 0.4 setosa
## 28 5.2 3.5 1.5 0.2 setosa
## 29 5.2 3.4 1.4 0.2 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 31 4.8 3.1 1.6 0.2 setosa
## 32 5.4 3.4 1.5 0.4 setosa
## 33 5.2 4.1 1.5 0.1 setosa
## 34 5.5 4.2 1.4 0.2 setosa
## 35 4.9 3.1 1.5 0.2 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 37 5.5 3.5 1.3 0.2 setosa
## 38 4.9 3.6 1.4 0.1 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 40 5.1 3.4 1.5 0.2 setosa
## 41 5.0 3.5 1.3 0.3 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 44 5.0 3.5 1.6 0.6 setosa
## 45 5.1 3.8 1.9 0.4 setosa
## 46 4.8 3.0 1.4 0.3 setosa
## 47 5.1 3.8 1.6 0.2 setosa
## 48 4.6 3.2 1.4 0.2 setosa
## 49 5.3 3.7 1.5 0.2 setosa
## 50 5.0 3.3 1.4 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 52 6.4 3.2 4.5 1.5 versicolor
## 53 6.9 3.1 4.9 1.5 versicolor
## 54 5.5 2.3 4.0 1.3 versicolor
## 55 6.5 2.8 4.6 1.5 versicolor
## 56 5.7 2.8 4.5 1.3 versicolor
## 57 6.3 3.3 4.7 1.6 versicolor
## 58 4.9 2.4 3.3 1.0 versicolor
## 59 6.6 2.9 4.6 1.3 versicolor
## 60 5.2 2.7 3.9 1.4 versicolor
## 61 5.0 2.0 3.5 1.0 versicolor
## 62 5.9 3.0 4.2 1.5 versicolor
## 63 6.0 2.2 4.0 1.0 versicolor
## 64 6.1 2.9 4.7 1.4 versicolor
## 65 5.6 2.9 3.6 1.3 versicolor
## 66 6.7 3.1 4.4 1.4 versicolor
## 67 5.6 3.0 4.5 1.5 versicolor
## 68 5.8 2.7 4.1 1.0 versicolor
## 69 6.2 2.2 4.5 1.5 versicolor
## 70 5.6 2.5 3.9 1.1 versicolor
## 71 5.9 3.2 4.8 1.8 versicolor
## 72 6.1 2.8 4.0 1.3 versicolor
## 73 6.3 2.5 4.9 1.5 versicolor
## 74 6.1 2.8 4.7 1.2 versicolor
## 75 6.4 2.9 4.3 1.3 versicolor
## 76 6.6 3.0 4.4 1.4 versicolor
## 77 6.8 2.8 4.8 1.4 versicolor
## 78 6.7 3.0 5.0 1.7 versicolor
## 79 6.0 2.9 4.5 1.5 versicolor
## 80 5.7 2.6 3.5 1.0 versicolor
## 81 5.5 2.4 3.8 1.1 versicolor
## 82 5.5 2.4 3.7 1.0 versicolor
## 83 5.8 2.7 3.9 1.2 versicolor
## 84 6.0 2.7 5.1 1.6 versicolor
## 85 5.4 3.0 4.5 1.5 versicolor
## 86 6.0 3.4 4.5 1.6 versicolor
## 87 6.7 3.1 4.7 1.5 versicolor
## 88 6.3 2.3 4.4 1.3 versicolor
## 89 5.6 3.0 4.1 1.3 versicolor
## 90 5.5 2.5 4.0 1.3 versicolor
## 91 5.5 2.6 4.4 1.2 versicolor
## 92 6.1 3.0 4.6 1.4 versicolor
## 93 5.8 2.6 4.0 1.2 versicolor
## 94 5.0 2.3 3.3 1.0 versicolor
## 95 5.6 2.7 4.2 1.3 versicolor
## 96 5.7 3.0 4.2 1.2 versicolor
## 97 5.7 2.9 4.2 1.3 versicolor
## 98 6.2 2.9 4.3 1.3 versicolor
## 99 5.1 2.5 3.0 1.1 versicolor
## 100 5.7 2.8 4.1 1.3 versicolor
## 101 6.3 3.3 6.0 2.5 virginica
## 102 5.8 2.7 5.1 1.9 virginica
## 103 7.1 3.0 5.9 2.1 virginica
## 104 6.3 2.9 5.6 1.8 virginica
## 105 6.5 3.0 5.8 2.2 virginica
## 106 7.6 3.0 6.6 2.1 virginica
## 107 4.9 2.5 4.5 1.7 virginica
## 108 7.3 2.9 6.3 1.8 virginica
## 109 6.7 2.5 5.8 1.8 virginica
## 110 7.2 3.6 6.1 2.5 virginica
## 111 6.5 3.2 5.1 2.0 virginica
## 112 6.4 2.7 5.3 1.9 virginica
## 113 6.8 3.0 5.5 2.1 virginica
## 114 5.7 2.5 5.0 2.0 virginica
## 115 5.8 2.8 5.1 2.4 virginica
## 116 6.4 3.2 5.3 2.3 virginica
## 117 6.5 3.0 5.5 1.8 virginica
## 118 7.7 3.8 6.7 2.2 virginica
## 119 7.7 2.6 6.9 2.3 virginica
## 120 6.0 2.2 5.0 1.5 virginica
## 121 6.9 3.2 5.7 2.3 virginica
## 122 5.6 2.8 4.9 2.0 virginica
## 123 7.7 2.8 6.7 2.0 virginica
## 124 6.3 2.7 4.9 1.8 virginica
## 125 6.7 3.3 5.7 2.1 virginica
## 126 7.2 3.2 6.0 1.8 virginica
## 127 6.2 2.8 4.8 1.8 virginica
## 128 6.1 3.0 4.9 1.8 virginica
## 129 6.4 2.8 5.6 2.1 virginica
## 130 7.2 3.0 5.8 1.6 virginica
## 131 7.4 2.8 6.1 1.9 virginica
## 132 7.9 3.8 6.4 2.0 virginica
## 133 6.4 2.8 5.6 2.2 virginica
## 134 6.3 2.8 5.1 1.5 virginica
## 135 6.1 2.6 5.6 1.4 virginica
## 136 7.7 3.0 6.1 2.3 virginica
## 137 6.3 3.4 5.6 2.4 virginica
## 138 6.4 3.1 5.5 1.8 virginica
## 139 6.0 3.0 4.8 1.8 virginica
## 140 6.9 3.1 5.4 2.1 virginica
## 141 6.7 3.1 5.6 2.4 virginica
## 142 6.9 3.1 5.1 2.3 virginica
## 143 5.8 2.7 5.1 1.9 virginica
## 144 6.8 3.2 5.9 2.3 virginica
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
The column names in iris contain capitals and dots. We could rename them all by hand, but clean_names()
will do this for us automatically.
## sepal_length sepal_width petal_length petal_width species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 22 5.1 3.7 1.5 0.4 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 26 5.0 3.0 1.6 0.2 setosa
## 27 5.0 3.4 1.6 0.4 setosa
## 28 5.2 3.5 1.5 0.2 setosa
## 29 5.2 3.4 1.4 0.2 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 31 4.8 3.1 1.6 0.2 setosa
## 32 5.4 3.4 1.5 0.4 setosa
## 33 5.2 4.1 1.5 0.1 setosa
## 34 5.5 4.2 1.4 0.2 setosa
## 35 4.9 3.1 1.5 0.2 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 37 5.5 3.5 1.3 0.2 setosa
## 38 4.9 3.6 1.4 0.1 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 40 5.1 3.4 1.5 0.2 setosa
## 41 5.0 3.5 1.3 0.3 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 44 5.0 3.5 1.6 0.6 setosa
## 45 5.1 3.8 1.9 0.4 setosa
## 46 4.8 3.0 1.4 0.3 setosa
## 47 5.1 3.8 1.6 0.2 setosa
## 48 4.6 3.2 1.4 0.2 setosa
## 49 5.3 3.7 1.5 0.2 setosa
## 50 5.0 3.3 1.4 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 52 6.4 3.2 4.5 1.5 versicolor
## 53 6.9 3.1 4.9 1.5 versicolor
## 54 5.5 2.3 4.0 1.3 versicolor
## 55 6.5 2.8 4.6 1.5 versicolor
## 56 5.7 2.8 4.5 1.3 versicolor
## 57 6.3 3.3 4.7 1.6 versicolor
## 58 4.9 2.4 3.3 1.0 versicolor
## 59 6.6 2.9 4.6 1.3 versicolor
## 60 5.2 2.7 3.9 1.4 versicolor
## 61 5.0 2.0 3.5 1.0 versicolor
## 62 5.9 3.0 4.2 1.5 versicolor
## 63 6.0 2.2 4.0 1.0 versicolor
## 64 6.1 2.9 4.7 1.4 versicolor
## 65 5.6 2.9 3.6 1.3 versicolor
## 66 6.7 3.1 4.4 1.4 versicolor
## 67 5.6 3.0 4.5 1.5 versicolor
## 68 5.8 2.7 4.1 1.0 versicolor
## 69 6.2 2.2 4.5 1.5 versicolor
## 70 5.6 2.5 3.9 1.1 versicolor
## 71 5.9 3.2 4.8 1.8 versicolor
## 72 6.1 2.8 4.0 1.3 versicolor
## 73 6.3 2.5 4.9 1.5 versicolor
## 74 6.1 2.8 4.7 1.2 versicolor
## 75 6.4 2.9 4.3 1.3 versicolor
## 76 6.6 3.0 4.4 1.4 versicolor
## 77 6.8 2.8 4.8 1.4 versicolor
## 78 6.7 3.0 5.0 1.7 versicolor
## 79 6.0 2.9 4.5 1.5 versicolor
## 80 5.7 2.6 3.5 1.0 versicolor
## 81 5.5 2.4 3.8 1.1 versicolor
## 82 5.5 2.4 3.7 1.0 versicolor
## 83 5.8 2.7 3.9 1.2 versicolor
## 84 6.0 2.7 5.1 1.6 versicolor
## 85 5.4 3.0 4.5 1.5 versicolor
## 86 6.0 3.4 4.5 1.6 versicolor
## 87 6.7 3.1 4.7 1.5 versicolor
## 88 6.3 2.3 4.4 1.3 versicolor
## 89 5.6 3.0 4.1 1.3 versicolor
## 90 5.5 2.5 4.0 1.3 versicolor
## 91 5.5 2.6 4.4 1.2 versicolor
## 92 6.1 3.0 4.6 1.4 versicolor
## 93 5.8 2.6 4.0 1.2 versicolor
## 94 5.0 2.3 3.3 1.0 versicolor
## 95 5.6 2.7 4.2 1.3 versicolor
## 96 5.7 3.0 4.2 1.2 versicolor
## 97 5.7 2.9 4.2 1.3 versicolor
## 98 6.2 2.9 4.3 1.3 versicolor
## 99 5.1 2.5 3.0 1.1 versicolor
## 100 5.7 2.8 4.1 1.3 versicolor
## 101 6.3 3.3 6.0 2.5 virginica
## 102 5.8 2.7 5.1 1.9 virginica
## 103 7.1 3.0 5.9 2.1 virginica
## 104 6.3 2.9 5.6 1.8 virginica
## 105 6.5 3.0 5.8 2.2 virginica
## 106 7.6 3.0 6.6 2.1 virginica
## 107 4.9 2.5 4.5 1.7 virginica
## 108 7.3 2.9 6.3 1.8 virginica
## 109 6.7 2.5 5.8 1.8 virginica
## 110 7.2 3.6 6.1 2.5 virginica
## 111 6.5 3.2 5.1 2.0 virginica
## 112 6.4 2.7 5.3 1.9 virginica
## 113 6.8 3.0 5.5 2.1 virginica
## 114 5.7 2.5 5.0 2.0 virginica
## 115 5.8 2.8 5.1 2.4 virginica
## 116 6.4 3.2 5.3 2.3 virginica
## 117 6.5 3.0 5.5 1.8 virginica
## 118 7.7 3.8 6.7 2.2 virginica
## 119 7.7 2.6 6.9 2.3 virginica
## 120 6.0 2.2 5.0 1.5 virginica
## 121 6.9 3.2 5.7 2.3 virginica
## 122 5.6 2.8 4.9 2.0 virginica
## 123 7.7 2.8 6.7 2.0 virginica
## 124 6.3 2.7 4.9 1.8 virginica
## 125 6.7 3.3 5.7 2.1 virginica
## 126 7.2 3.2 6.0 1.8 virginica
## 127 6.2 2.8 4.8 1.8 virginica
## 128 6.1 3.0 4.9 1.8 virginica
## 129 6.4 2.8 5.6 2.1 virginica
## 130 7.2 3.0 5.8 1.6 virginica
## 131 7.4 2.8 6.1 1.9 virginica
## 132 7.9 3.8 6.4 2.0 virginica
## 133 6.4 2.8 5.6 2.2 virginica
## 134 6.3 2.8 5.1 1.5 virginica
## 135 6.1 2.6 5.6 1.4 virginica
## 136 7.7 3.0 6.1 2.3 virginica
## 137 6.3 3.4 5.6 2.4 virginica
## 138 6.4 3.1 5.5 1.8 virginica
## 139 6.0 3.0 4.8 1.8 virginica
## 140 6.9 3.1 5.4 2.1 virginica
## 141 6.7 3.1 5.6 2.4 virginica
## 142 6.9 3.1 5.1 2.3 virginica
## 143 5.8 2.7 5.1 1.9 virginica
## 144 6.8 3.2 5.9 2.3 virginica
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
4.4.9 group_by() and summarize()
Get summary statistics for each group in your data
my_data %>%
pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>%
mutate(year = as.numeric(year)) %>%
left_join(continents)
## # A tibble: 1,704 × 4
## country year lifeExp continent
## <fct> <dbl> <dbl> <fct>
## 1 Afghanistan 1952 28.8 Asia
## 2 Afghanistan 1957 30.3 Asia
## 3 Afghanistan 1962 32.0 Asia
## 4 Afghanistan 1967 34.0 Asia
## 5 Afghanistan 1972 36.1 Asia
## 6 Afghanistan 1977 38.4 Asia
## 7 Afghanistan 1982 39.9 Asia
## 8 Afghanistan 1987 40.8 Asia
## 9 Afghanistan 1992 41.7 Asia
## 10 Afghanistan 1997 41.8 Asia
## # ℹ 1,694 more rows
What’s the average and sd of life expectancy in each country?
my_data %>%
pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>%
mutate(year = as.numeric(year)) %>%
left_join(continents) %>%
group_by(country) %>%
summarize(avg_le = mean(lifeExp),
sd = sd(lifeExp))
## # A tibble: 142 × 3
## country avg_le sd
## <fct> <dbl> <dbl>
## 1 Afghanistan 37.5 5.10
## 2 Albania 68.4 6.32
## 3 Algeria 59.0 10.3
## 4 Angola 37.9 4.01
## 5 Argentina 69.1 4.19
## 6 Australia 74.7 4.15
## 7 Austria 73.1 4.38
## 8 Bahrain 65.6 8.57
## 9 Bangladesh 49.8 9.03
## 10 Belgium 73.6 3.78
## # ℹ 132 more rows
What’s the average and sd of life expectancy in each year?
my_data %>%
pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>%
mutate(year = as.numeric(year)) %>%
left_join(continents) %>%
group_by(year) %>%
summarize(avg_le = mean(lifeExp),
sd = sd(lifeExp))
## # A tibble: 12 × 3
## year avg_le sd
## <dbl> <dbl> <dbl>
## 1 1952 49.1 12.2
## 2 1957 51.5 12.2
## 3 1962 53.6 12.1
## 4 1967 55.7 11.7
## 5 1972 57.6 11.4
## 6 1977 59.6 11.2
## 7 1982 61.5 10.8
## 8 1987 63.2 10.6
## 9 1992 64.2 11.2
## 10 1997 65.0 11.6
## 11 2002 65.7 12.3
## 12 2007 67.0 12.1
What’s the average and sd of life expectancy in each year for each continent?
my_data %>%
pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>%
mutate(year = as.numeric(year)) %>%
left_join(continents) %>%
group_by(year, continent) %>%
summarize(avg_le = mean(lifeExp),
sd = sd(lifeExp))
## # A tibble: 60 × 4
## # Groups: year [12]
## year continent avg_le sd
## <dbl> <fct> <dbl> <dbl>
## 1 1952 Africa 39.1 5.15
## 2 1952 Americas 53.3 9.33
## 3 1952 Asia 46.3 9.29
## 4 1952 Europe 64.4 6.36
## 5 1952 Oceania 69.3 0.191
## 6 1957 Africa 41.3 5.62
## 7 1957 Americas 56.0 9.03
## 8 1957 Asia 49.3 9.64
## 9 1957 Europe 66.7 5.30
## 10 1957 Oceania 70.3 0.0495
## # ℹ 50 more rows
Except for JR, an English professor who doesn’t understand any of this. We assume he is currently pontificating about the literary importance of using salve versus halve in the writings of Chaucer (who uses neither word). JR has a large collection of feathered pens and prefers to write on low gloss paper sourced from the Pacific Northeast.↩︎