Chapter 8 Wrangling Columns in R with Select, Rename, and Relocate

In this chapter, we will introduce you ways to wrangle columns in R. Data wrangling refers broadly to the many things you might do to make disorderly data more organized and tidy. You will often want to focus your analysis on particular variables, or columns, in your dataset. This chapter will show you how to include the columns you want, and exclude the columns you don’t want. You will also learn how to rename and relocate your columns in your dataset. We will also learn some good general practices for column naming, which is often more important than people realize. The main functions in the tidyverse for doing column functions are select(), rename(), rename_with(), and relocate(). The select() function can use a number of logical statements, and functional helpers, to decide whether to keep or exclude rows in your dataset.

Figure 6.1: Artwork by Allison Horst, https://www.allisonhorst.com

8.1 Goals for this Chapter

Understand select statements
Select rows based on name and number
Select rows with helpers including starts_with(), ends_with()
Select rows with helpers including contains(), matches()
Relocate column variables within your dataset with relocate
Rename column variables within your dataset with rename() and rename_with()

8.2 Packages needed for this Chapter

{tidyverse}

{janitor}

{medicaldata}

8.3 Pathway for this Chapter

This Chapter is part of the DATA WRANGLING pathway. Chapters in this pathway include

What is Tidy Data?
Selecting, Renaming, and Relocating Columns
Filtering Rows
Counting, Grouping, and Summarizing Rows
Arranging and Ranking Rows
Mutating to Make New Variables
Rearranging Untidy data with {tidyr} and {pivotr}

8.4 Tidyselect Helpers in R

Tidyselect helpers in R are functions that help you select a specfic group of columns. A typical tidyselect helper function defines which variables match your criteria.
Typical examples of tidyselect helper functions include:

starts_with(“preop”) - matches preop_bp, preop_hr, preop_psa
ends_with(“pain”) - matches abd_pain, extremity_pain, wound_pain
contains(“bp”) - matches bp_screening, end_bp, first_bp_am
matches(“week[0-9]”) - matches week4, week8, week0
everything() - all columns
last_col() - the last column
num_range(“wk”, 1:6) - matches wk1, wk3, wk6, but not wk7
where(is.numeric) - selects variables of numeric type

Each of these examples of logical tests will each column in your dataframe with the included argument - the criteria within the parentheses, and results in a vector of TRUE or FALSE values for each column (variable) of the dataframe. The select() function will act on these TRUE and FALSE values to include (TRUE) or exclude (FALSE) the column variables from the resulting dataframe.

8.5 Selecting a Column Variables

The general format for select statements is:

select(variable1, variable_name3:variable_name17, variable55),
which selects by name the first variable, the 3rd through the 17th, and the 55th variable (you use just the variable names, the numbers are just to illustrate how to use the colon operator), and drops the rest.

You can also negatively select against variables, with the negative sign, as in

select(-sbp, -hr)

These logical statements can even be sequential within a select() function, to help clarify selections, as in when you want “age” but not “stage”.

select(contains("age"), -stage)

You can also select variables by their column number, (though this can get ugly if column numbers change), as in

select(1:5, 17:22)

8.5.1 Try this out

Copy the code block below to your RStudio Console, and run it to get started.

library(tidyverse)
library(medicaldata) 
medicaldata::blood_storage %>% 
  select(everything()) %>% 
  head()

Now replace the argument to the select statement (everything()) with 1:5.

Run this - you should get only the first 5 columns, instead of all the columns.

Now try this with AnyAdjTherapy:TimeToRecurrence as the argument to get the last 5 columns.

You can use the colon operator to get any continguous group of columns between the start:end columns (inclusive of the start and end columns).

Experiment for yourself to get specific groups of contiguous columns.

8.6 Selecting Columns that are Not Contiguous

You can select any non-contiguous columns by inserting a comma between the columns selected. You can even select one column at a time. For example

medicaldata::blood_storage %>% 
  select(1, bGS:sGS, 3, AA) %>% 
  head()

  RBC.Age.Group bGS BN+ OrganConfined PreopPSA PreopTherapy Units sGS  Age AA
1             3   3   0             0    14.08            1     6   1 72.1  0
2             3   2   0             1    10.50            0     2   3 73.6  0
3             3   3   0             1     6.98            1     1   1 67.5  0
4             2   1   0             1     4.40            0     2   3 65.8  0
5             2   2   0             1    21.40            0     3   3 63.2  0
6             3   1   0             0     5.10            0     1   3 65.4  0

You can mix single columns, ranges of columns, column names vs. numbers, and even change the order in which the columns are listed.

Try this yourself, with the cytomegalovirus dataset. Copy the code chunk below and run it in your RStudio Console pane. Then edit it to select the ID, patient demographics, the prior treatment variables, and the dose variables.

medicaldata::cytomegalovirus %>% 
  select(everything()) %>% 
  head()

medicaldata::cytomegalovirus %>% 
  select(ID, age:race, prior.radiation:prior.transplant,
         TNC.dose:TBI.dose) %>% 
  head()

8.7 Selecting Columns With Logical Operators

You can select groups of columns with logical operators to combine selections. You can use

c() to combine selections. An example: select(c(age:race, prior.radiation:prior.chemo)).
the symbols & and | to select the intersection or the union, respectively, of two sets of variables. An example: select(age:race & prior.radiation:prior.chemo).
the ! symbol to select the complement of a set of variables. An example: select(!c(age:race, prior.radiation:prior.chemo)).

Try this yourself, with the esoph_ca dataset. Copy the code chunk below and run it in your RStudio Console pane. Then edit it to select all variables except the ncases and ncontrols, using the ! symbol.

  medicaldata::esoph_ca %>% 
  select(everything()) %>% 
  head()

  medicaldata::esoph_ca %>% 
  select(!ncases:ncontrols) %>% 
  head()

Now try editing the code chunk below to select all of the variables that are not related to dose or any kind of graft versus host disease (end in gvhd) in the cytomegalovirus dataset. Use the ! and the | symbols.

medicaldata::cytomegalovirus %>% 
  select(everything()) %>% 
  head()

medicaldata::cytomegalovirus %>% 
  select(!c(TNC.dose:TBI.dose | agvhd:time.to.cgvhd)) %>% 
  head()

8.8 Further Challenges

Try selecting variables on your own computer.

Install R and Rstudio, as in Chapter 2.

Then make sure you have the following packages installed, as in the Installing Packages chapter:

tidyverse
medicaldata

Read in data from the medicaldata datasets with data(name)

Then try some of the following challenges

select in the cytomegalovirus dataset for
select in the opt dataset for
select in the covid_testing dataset for

8.9 Explore More about Filtering

Some helpful resources include:

The RYouWithMe series from RLadiesSyndey post on selecting here.

The tidyverse guide to selecting variables, found here.

Suzan Baert’s blog series on dplyr, part 2 about wrangling columns is found here.