Chapter 7 Wrangling Columns in R with Select, Rename, and Relocate
In this chapter, we will introduce you ways to wrangle columns in R. Data wrangling refers broadly to the many things you might do to make disorderly data more organized and tidy. You will often want to focus your analysis on particular variables, or columns, in your dataset. This chapter will show you how to include the columns you want, and exclude the columns you don’t want. You will also learn how to rename and relocate your columns in your dataset. We will also learn some good general practices for column naming, which is often more important than people realize. The main functions in the tidyverse for doing column functions are select(), rename(), rename_with(), and relocate(). The select() function can use a number of logical statements, and functional helpers, to decide whether to keep or exclude rows in your dataset.
7.1 Goals for this Chapter
- Understand select statements
- Select rows based on name and number
- Select rows with helpers including
starts_with()
,ends_with()
- Select rows with helpers including
contains()
,matches()
- Relocate column variables within your dataset with
relocate
- Rename column variables within your dataset with
rename()
andrename_with()
7.3 Pathway for this Chapter
This Chapter is part of the DATA WRANGLING pathway. Chapters in this pathway include
- What is Tidy Data?
- Selecting, Renaming, and Relocating Columns
- Filtering Rows
- Counting, Grouping, and Summarizing Rows
- Arranging and Ranking Rows
- Mutating to Make New Variables
- Rearranging Untidy data with {tidyr} and {pivotr}
7.4 Tidyselect Helpers in R
Tidyselect helpers in R are functions that help you select a specfic group of columns. A typical tidyselect helper function defines which variables match your criteria.
Typical examples of tidyselect helper functions include:
starts_with(“preop”) - matches preop_bp, preop_hr, preop_psa
ends_with(“pain”) - matches abd_pain, extremity_pain, wound_pain
contains(“bp”) - matches bp_screening, end_bp, first_bp_am
matches(“week[0-9]”) - matches week4, week8, week0
everything() - all columns
last_col() - the last column
num_range(“wk”, 1:6) - matches wk1, wk3, wk6, but not wk7
where(is.numeric) - selects variables of numeric type
Each of these examples of logical tests will each column in your dataframe with the included argument - the criteria within the parentheses, and results in a vector of TRUE or FALSE values for each column (variable) of the dataframe. The select() function will act on these TRUE and FALSE values to include (TRUE) or exclude (FALSE) the column variables from the resulting dataframe.
7.5 Selecting a Column Variables
The general format for select statements is:
select(variable1, variable_name3:variable_name17, variable55)
,
which selects by name the first variable, the 3rd through the 17th, and the 55th variable (you use just the variable names, the numbers are just to illustrate how to use the colon operator), and drops the rest.
You can also negatively select against variables, with the negative sign, as in
select(-sbp, -hr)
These logical statements can even be sequential within a select() function, to help clarify selections, as in when you want “age” but not “stage”.
select(contains("age"), -stage)
You can also select variables by their column number, (though this can get ugly if column numbers change), as in
select(1:5, 17:22)
7.5.1 Try this out
Copy the code block below to your RStudio Console, and run it to get started.
library(tidyverse)
library(medicaldata)
medicaldata::blood_storage %>%
select(everything()) %>%
head()
Now replace the argument to the select statement (everything()
) with 1:5
.
Run this - you should get only the first 5 columns, instead of all the columns.
Now try this with AnyAdjTherapy:TimeToRecurrence
as the argument to get the last 5 columns.
You can use the colon operator to get any continguous group of columns between the start:end
columns (inclusive of the start and end columns).
Experiment for yourself to get specific groups of contiguous columns.
7.6 Selecting Columns that are Not Contiguous
You can select any non-contiguous columns by inserting a comma between the columns selected. You can even select one column at a time. For example
RBC.Age.Group bGS BN+ OrganConfined PreopPSA PreopTherapy
1 3 3 0 0 14.08 1
2 3 2 0 1 10.50 0
3 3 3 0 1 6.98 1
4 2 1 0 1 4.40 0
5 2 2 0 1 21.40 0
6 3 1 0 0 5.10 0
Units sGS Age AA
1 6 1 72.1 0
2 2 3 73.6 0
3 1 1 67.5 0
4 2 3 65.8 0
5 3 3 63.2 0
6 1 3 65.4 0
You can mix single columns, ranges of columns, column names vs. numbers, and even change the order in which the columns are listed.
Try this yourself, with the cytomegalovirus dataset. Copy the code chunk below and run it in your RStudio Console pane. Then edit it to select the ID, patient demographics, the prior treatment variables, and the dose variables.
7.7 Selecting Columns With Logical Operators
You can select groups of columns with logical operators to combine selections. You can use
c()
to combine selections. An example:select(c(age:race, prior.radiation:prior.chemo))
.- the symbols
&
and|
to select the intersection or the union, respectively, of two sets of variables. An example:select(age:race & prior.radiation:prior.chemo)
. - the
!
symbol to select the complement of a set of variables. An example:select(!c(age:race, prior.radiation:prior.chemo))
.
Try this yourself, with the esoph_ca dataset. Copy the code chunk below and run it in your RStudio Console pane. Then edit it to select all variables except the ncases and ncontrols, using the !
symbol.
Now try editing the code chunk below to select all of the variables that are not related to dose or any kind of graft versus host disease (end in gvhd
) in the cytomegalovirus dataset. Use the !
and the |
symbols.
7.8 Further Challenges
Try selecting variables on your own computer.
Install R and Rstudio, as in Chapter 2.
Then make sure you have the following packages installed, as in the Installing Packages chapter:
tidyverse
medicaldata
Read in data from the medicaldata datasets with data(name)
Then try some of the following challenges
select in the cytomegalovirus dataset for
select in the opt dataset for
select in the covid_testing dataset for