6.1 mutate()

What it does: Adds new columns or modifies current variables in the dataset.

Let’s say I want to create three new variables in the diamonds dataset (described in Section5) :

  1. One variable called JustOne where all of the values inside the column are 1.

  2. One variable called Values where all of the values inside are something:

  3. One variable called Simple where all the values equal TRUE

diamonds %>% 
  mutate(JustOne = 1,
         Values = "something",
         Simple = TRUE)
## # A tibble: 53,940 x 13
##    carat cut   color clarity depth table price     x     y     z JustOne
##    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>   <dbl>
##  1 0.23  Ideal E     SI2      61.5    55   326  3.95  3.98  2.43       1
##  2 0.21  Prem~ E     SI1      59.8    61   326  3.89  3.84  2.31       1
##  3 0.23  Good  E     VS1      56.9    65   327  4.05  4.07  2.31       1
##  4 0.290 Prem~ I     VS2      62.4    58   334  4.2   4.23  2.63       1
##  5 0.31  Good  J     SI2      63.3    58   335  4.34  4.35  2.75       1
##  6 0.24  Very~ J     VVS2     62.8    57   336  3.94  3.96  2.48       1
##  7 0.24  Very~ I     VVS1     62.3    57   336  3.95  3.98  2.47       1
##  8 0.26  Very~ H     SI1      61.9    55   337  4.07  4.11  2.53       1
##  9 0.22  Fair  E     VS2      65.1    61   337  3.87  3.78  2.49       1
## 10 0.23  Very~ H     VS1      59.4    61   338  4     4.05  2.39       1
## # ... with 53,930 more rows, and 2 more variables: Values <chr>,
## #   Simple <lgl>

Need a refresher on how a pipe (%>%) works? Go back to section 3.3.9!

mutate() can be used to create variables based on existing variables from the dataset.

Let’s use the existing variable price from the diamonds dataset to create a new column/variable named price200. The values for price200 calculated from price minus $200 (price is a default variable in diamonds):

diamonds %>% 
  mutate(price200 = price - 200)

You can also create multiple columns at once, separating each new variable with a comma.

diamonds %>% 
  mutate(price200 = price - 200,        # $200 OFF from the original price
         price20perc = price * 0.20,    # 20% of the original price
         price20percoff = price * 0.80, # 20% OFF from the original price 
         pricepercarat = price / carat, # ratio of price to carat
         pizza = depth ^ 2)             # Square the original depth

Notice that you can label the variables/new columns however you want (refer to 3.4.2 for naming conventions). In the example above, pizza is not a descriptive name for the new variable defined as the depth values squared. Make it easier for Future You to navigate your old code: choose simple, descriptive names.

Currently, the changes that we’ve made to the diamonds dataset have not been saved. We can see that diamonds does not have the newly created variables (price200, price20perc, pizza) when we execute str(diamonds). In order to save your changes, you must define the new dataset as an object using <-! In the code below, the modified diamonds dataset will be saved as a new object/dataset called diamonds.new. As a beginner, it is good practice to save your changes to a dataset under a new object name (e.g., diamonds.new) each time you make changes rather than saving over the original dataset name (e.g., diamonds). See 3.3.7 for more information on saving objects.

diamonds.new <- # saving changes to diamonds as a new object
  diamonds %>%  # original dataset
  mutate(price200 = price - 200,        # $200 OFF from the original price
         price20perc = price * .20,     # 20% of the original price
         price20percoff = price * 0.80, # 20% OFF from the original price
         pricepercarat = price / carat, # ratio of price to carat
         pizza = depth ^ 2)             # Square the original depth

6.1.1 Nesting Functions

We can also use other functions inside mutate() to create our new variable(s). For example, we might use the mean() function to calculate the average price value for all diamonds in the dataset. This is called nesting, where one function (mean()) “nests” inside another function (mutate()).

diamonds %>% 
  mutate(m = mean(price))

Nearly all of the functions used in this book are available by default or are loaded with library(tidyverse). However, you may come across many other kinds of functions that come from different packages. Make sure that the relevant packages are loaded in your library!

The example below creates four new columns to the diamonds dataset that calculate the mean, standard deviation, standard error, and median price value for all diamonds. The values in these columns will be the same for every row because R takes all of the values in price to calculate the mean/standard deviation/median.

diamonds %>% 
  mutate(m = mean(price),     # calculates the mean price
         sd = sd(price),      # calculates standard deviation
         med = median(price)) # calculates the median price
6.1.1.0.1 Exercises
  1. Using the built-in dataset, midwest (make sure tidyverse is loaded!). Each exercise comes with the desired output! That means you get the answer to each question but not how to get to the answer. To view the newly created variable, scroll all the way over to the right side of the output.
  1. Create a column named avg.pop.den which calculates average population density for the entire dataset (hint: use mean() and popdensity; all values in this column should be the same – Why?)
## # A tibble: 437 x 29
##      PID county state  area poptotal popdensity popwhite popblack
##    <int> <chr>  <chr> <dbl>    <int>      <dbl>    <int>    <int>
##  1   561 ADAMS  IL    0.052    66090      1271.    63917     1702
##  2   562 ALEXA~ IL    0.014    10626       759      7054     3496
##  3   563 BOND   IL    0.022    14991       681.    14477      429
##  4   564 BOONE  IL    0.017    30806      1812.    29344      127
##  5   565 BROWN  IL    0.018     5836       324.     5264      547
##  6   566 BUREAU IL    0.05     35688       714.    35157       50
##  7   567 CALHO~ IL    0.017     5322       313.     5298        1
##  8   568 CARRO~ IL    0.027    16805       622.    16519      111
##  9   569 CASS   IL    0.024    13437       560.    13384       16
## 10   570 CHAMP~ IL    0.058   173025      2983.   146506    16559
## # ... with 427 more rows, and 21 more variables: popamerindian <int>,
## #   popasian <int>, popother <int>, percwhite <dbl>, percblack <dbl>,
## #   percamerindan <dbl>, percasian <dbl>, percother <dbl>,
## #   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
## #   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
## #   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## #   percelderlypoverty <dbl>, inmetro <int>, category <chr>,
## #   avg.pop.den <dbl>
  1. Create a column named avg.area which calculates the average area for the entire dataset
## # A tibble: 437 x 29
##      PID county state  area poptotal popdensity popwhite popblack
##    <int> <chr>  <chr> <dbl>    <int>      <dbl>    <int>    <int>
##  1   561 ADAMS  IL    0.052    66090      1271.    63917     1702
##  2   562 ALEXA~ IL    0.014    10626       759      7054     3496
##  3   563 BOND   IL    0.022    14991       681.    14477      429
##  4   564 BOONE  IL    0.017    30806      1812.    29344      127
##  5   565 BROWN  IL    0.018     5836       324.     5264      547
##  6   566 BUREAU IL    0.05     35688       714.    35157       50
##  7   567 CALHO~ IL    0.017     5322       313.     5298        1
##  8   568 CARRO~ IL    0.027    16805       622.    16519      111
##  9   569 CASS   IL    0.024    13437       560.    13384       16
## 10   570 CHAMP~ IL    0.058   173025      2983.   146506    16559
## # ... with 427 more rows, and 21 more variables: popamerindian <int>,
## #   popasian <int>, popother <int>, percwhite <dbl>, percblack <dbl>,
## #   percamerindan <dbl>, percasian <dbl>, percother <dbl>,
## #   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
## #   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
## #   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## #   percelderlypoverty <dbl>, inmetro <int>, category <chr>,
## #   avg.area <dbl>
  1. Create a column called totadult which calculates the total number of adults in this dataset (hint: use popadults, sum(); requires nesting)
## # A tibble: 437 x 29
##      PID county state  area poptotal popdensity popwhite popblack
##    <int> <chr>  <chr> <dbl>    <int>      <dbl>    <int>    <int>
##  1   561 ADAMS  IL    0.052    66090      1271.    63917     1702
##  2   562 ALEXA~ IL    0.014    10626       759      7054     3496
##  3   563 BOND   IL    0.022    14991       681.    14477      429
##  4   564 BOONE  IL    0.017    30806      1812.    29344      127
##  5   565 BROWN  IL    0.018     5836       324.     5264      547
##  6   566 BUREAU IL    0.05     35688       714.    35157       50
##  7   567 CALHO~ IL    0.017     5322       313.     5298        1
##  8   568 CARRO~ IL    0.027    16805       622.    16519      111
##  9   569 CASS   IL    0.024    13437       560.    13384       16
## 10   570 CHAMP~ IL    0.058   173025      2983.   146506    16559
## # ... with 427 more rows, and 21 more variables: popamerindian <int>,
## #   popasian <int>, popother <int>, percwhite <dbl>, percblack <dbl>,
## #   percamerindan <dbl>, percasian <dbl>, percother <dbl>,
## #   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
## #   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
## #   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## #   percelderlypoverty <dbl>, inmetro <int>, category <chr>,
## #   totadult <int>
  1. Create a new column called tot.minus.white calculating the difference between poptotal and popwhite. Using other variables in the dataset, how else could you create the tot.minus.white column without using poptotal and popwhite. (hint: popblack, popamerindian, popasian, popother)?
## # A tibble: 437 x 30
##      PID county state  area poptotal popdensity popwhite popblack
##    <int> <chr>  <chr> <dbl>    <int>      <dbl>    <int>    <int>
##  1   561 ADAMS  IL    0.052    66090      1271.    63917     1702
##  2   562 ALEXA~ IL    0.014    10626       759      7054     3496
##  3   563 BOND   IL    0.022    14991       681.    14477      429
##  4   564 BOONE  IL    0.017    30806      1812.    29344      127
##  5   565 BROWN  IL    0.018     5836       324.     5264      547
##  6   566 BUREAU IL    0.05     35688       714.    35157       50
##  7   567 CALHO~ IL    0.017     5322       313.     5298        1
##  8   568 CARRO~ IL    0.027    16805       622.    16519      111
##  9   569 CASS   IL    0.024    13437       560.    13384       16
## 10   570 CHAMP~ IL    0.058   173025      2983.   146506    16559
## # ... with 427 more rows, and 22 more variables: popamerindian <int>,
## #   popasian <int>, popother <int>, percwhite <dbl>, percblack <dbl>,
## #   percamerindan <dbl>, percasian <dbl>, percother <dbl>,
## #   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
## #   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
## #   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## #   percelderlypoverty <dbl>, inmetro <int>, category <chr>,
## #   tot.minus.white <int>, tot.minus.white2 <int>
  1. Create a new column called child.to.adult that calculates the ratio of percchildbelowpovert to percadultpoverty (i.e., for every adult that is in poverty, what proportion of children are in poverty? Compare to the question from 1a: Why are the values in child.to.adult all different?)
## # A tibble: 437 x 29
##      PID county state  area poptotal popdensity popwhite popblack
##    <int> <chr>  <chr> <dbl>    <int>      <dbl>    <int>    <int>
##  1   561 ADAMS  IL    0.052    66090      1271.    63917     1702
##  2   562 ALEXA~ IL    0.014    10626       759      7054     3496
##  3   563 BOND   IL    0.022    14991       681.    14477      429
##  4   564 BOONE  IL    0.017    30806      1812.    29344      127
##  5   565 BROWN  IL    0.018     5836       324.     5264      547
##  6   566 BUREAU IL    0.05     35688       714.    35157       50
##  7   567 CALHO~ IL    0.017     5322       313.     5298        1
##  8   568 CARRO~ IL    0.027    16805       622.    16519      111
##  9   569 CASS   IL    0.024    13437       560.    13384       16
## 10   570 CHAMP~ IL    0.058   173025      2983.   146506    16559
## # ... with 427 more rows, and 21 more variables: popamerindian <int>,
## #   popasian <int>, popother <int>, percwhite <dbl>, percblack <dbl>,
## #   percamerindan <dbl>, percasian <dbl>, percother <dbl>,
## #   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
## #   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
## #   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## #   percelderlypoverty <dbl>, inmetro <int>, category <chr>,
## #   child.to.adult <dbl>
  1. Create a new column named ratio.adult which calculates the ratio of adults in this dataset (hint: popadults, poptotal).
## # A tibble: 437 x 29
##      PID county state  area poptotal popdensity popwhite popblack
##    <int> <chr>  <chr> <dbl>    <int>      <dbl>    <int>    <int>
##  1   561 ADAMS  IL    0.052    66090      1271.    63917     1702
##  2   562 ALEXA~ IL    0.014    10626       759      7054     3496
##  3   563 BOND   IL    0.022    14991       681.    14477      429
##  4   564 BOONE  IL    0.017    30806      1812.    29344      127
##  5   565 BROWN  IL    0.018     5836       324.     5264      547
##  6   566 BUREAU IL    0.05     35688       714.    35157       50
##  7   567 CALHO~ IL    0.017     5322       313.     5298        1
##  8   568 CARRO~ IL    0.027    16805       622.    16519      111
##  9   569 CASS   IL    0.024    13437       560.    13384       16
## 10   570 CHAMP~ IL    0.058   173025      2983.   146506    16559
## # ... with 427 more rows, and 21 more variables: popamerindian <int>,
## #   popasian <int>, popother <int>, percwhite <dbl>, percblack <dbl>,
## #   percamerindan <dbl>, percasian <dbl>, percother <dbl>,
## #   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
## #   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
## #   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## #   percelderlypoverty <dbl>, inmetro <int>, category <chr>,
## #   ratio.adult <dbl>
  1. Create a new column named perc.adult that calculates the percentage of the total population that are adults per county (hint: build from 1f; * 100).
## # A tibble: 437 x 29
##      PID county state  area poptotal popdensity popwhite popblack
##    <int> <chr>  <chr> <dbl>    <int>      <dbl>    <int>    <int>
##  1   561 ADAMS  IL    0.052    66090      1271.    63917     1702
##  2   562 ALEXA~ IL    0.014    10626       759      7054     3496
##  3   563 BOND   IL    0.022    14991       681.    14477      429
##  4   564 BOONE  IL    0.017    30806      1812.    29344      127
##  5   565 BROWN  IL    0.018     5836       324.     5264      547
##  6   566 BUREAU IL    0.05     35688       714.    35157       50
##  7   567 CALHO~ IL    0.017     5322       313.     5298        1
##  8   568 CARRO~ IL    0.027    16805       622.    16519      111
##  9   569 CASS   IL    0.024    13437       560.    13384       16
## 10   570 CHAMP~ IL    0.058   173025      2983.   146506    16559
## # ... with 427 more rows, and 21 more variables: popamerindian <int>,
## #   popasian <int>, popother <int>, percwhite <dbl>, percblack <dbl>,
## #   percamerindan <dbl>, percasian <dbl>, percother <dbl>,
## #   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
## #   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
## #   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## #   percelderlypoverty <dbl>, inmetro <int>, category <chr>,
## #   perc.adult <dbl>
  1. Using the built-in dataset, presidential, create a column named duration that calculates the total number of days each president held in office.
## # A tibble: 11 x 5
##    name       start      end        party      duration 
##    <chr>      <date>     <date>     <chr>      <drtn>   
##  1 Eisenhower 1953-01-20 1961-01-20 Republican 2922 days
##  2 Kennedy    1961-01-20 1963-11-22 Democratic 1036 days
##  3 Johnson    1963-11-22 1969-01-20 Democratic 1886 days
##  4 Nixon      1969-01-20 1974-08-09 Republican 2027 days
##  5 Ford       1974-08-09 1977-01-20 Republican  895 days
##  6 Carter     1977-01-20 1981-01-20 Democratic 1461 days
##  7 Reagan     1981-01-20 1989-01-20 Republican 2922 days
##  8 Bush       1989-01-20 1993-01-20 Republican 1461 days
##  9 Clinton    1993-01-20 2001-01-20 Democratic 2922 days
## 10 Bush       2001-01-20 2009-01-20 Republican 2922 days
## 11 Obama      2009-01-20 2017-01-20 Democratic 2922 days
  1. Using the built-in dataset, economics, create a column named perc.unemploy that calculates the percentage of the population that is unemployed.
## # A tibble: 574 x 7
##    date         pce    pop psavert uempmed unemploy perc.unemploy
##    <date>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>         <dbl>
##  1 1967-07-01  507. 198712    12.6     4.5     2944          1.48
##  2 1967-08-01  510. 198911    12.6     4.7     2945          1.48
##  3 1967-09-01  516. 199113    11.9     4.6     2958          1.49
##  4 1967-10-01  512. 199311    12.9     4.9     3143          1.58
##  5 1967-11-01  517. 199498    12.8     4.7     3066          1.54
##  6 1967-12-01  525. 199657    11.8     4.8     3018          1.51
##  7 1968-01-01  531. 199808    11.7     5.1     2878          1.44
##  8 1968-02-01  534. 199920    12.3     4.5     3001          1.50
##  9 1968-03-01  544. 200056    11.7     4.1     2877          1.44
## 10 1968-04-01  544  200208    12.3     4.6     2709          1.35
## # ... with 564 more rows
  1. Using the built-in dataset, txhousing,
  1. Create a column named successrate that calculates the percent of houses that sell of the total listings available (hint: sales, listings).
## # A tibble: 8,602 x 10
##    city   year month sales volume median listings inventory  date
##    <chr> <int> <int> <dbl>  <dbl>  <dbl>    <dbl>     <dbl> <dbl>
##  1 Abil~  2000     1    72 5.38e6  71400      701       6.3 2000 
##  2 Abil~  2000     2    98 6.50e6  58700      746       6.6 2000.
##  3 Abil~  2000     3   130 9.28e6  58100      784       6.8 2000.
##  4 Abil~  2000     4    98 9.73e6  68600      785       6.9 2000.
##  5 Abil~  2000     5   141 1.06e7  67300      794       6.8 2000.
##  6 Abil~  2000     6   156 1.39e7  66900      780       6.6 2000.
##  7 Abil~  2000     7   152 1.26e7  73500      742       6.2 2000.
##  8 Abil~  2000     8   131 1.07e7  75000      765       6.4 2001.
##  9 Abil~  2000     9   104 7.62e6  64500      771       6.5 2001.
## 10 Abil~  2000    10   101 7.04e6  59300      764       6.6 2001.
## # ... with 8,592 more rows, and 1 more variable: successrate <dbl>
  1. Create a column called failrate that calculates the percent of houses that do not sell of the total listings available (hint: build from 4a).
## # A tibble: 8,602 x 10
##    city    year month sales volume median listings inventory  date failrate
##    <chr>  <int> <int> <dbl>  <dbl>  <dbl>    <dbl>     <dbl> <dbl>    <dbl>
##  1 Abile~  2000     1    72 5.38e6  71400      701       6.3 2000      89.7
##  2 Abile~  2000     2    98 6.50e6  58700      746       6.6 2000.     86.9
##  3 Abile~  2000     3   130 9.28e6  58100      784       6.8 2000.     83.4
##  4 Abile~  2000     4    98 9.73e6  68600      785       6.9 2000.     87.5
##  5 Abile~  2000     5   141 1.06e7  67300      794       6.8 2000.     82.2
##  6 Abile~  2000     6   156 1.39e7  66900      780       6.6 2000.     80  
##  7 Abile~  2000     7   152 1.26e7  73500      742       6.2 2000.     79.5
##  8 Abile~  2000     8   131 1.07e7  75000      765       6.4 2001.     82.9
##  9 Abile~  2000     9   104 7.62e6  64500      771       6.5 2001.     86.5
## 10 Abile~  2000    10   101 7.04e6  59300      764       6.6 2001.     86.8
## # ... with 8,592 more rows

Let’s take a look at some other useful functions.

6.1.1.1 recode()

What it does: modifies the values within a variable. Here is the basic structure for using recode:

data %>% mutate(Variable = recode(Variable, "old value" = "new value"))

Example

diamonds %>% 
  mutate(cut.new = recode(cut,
                          "Ideal" = "IDEAL"))

In the code below, we created a new variable named cut.new that is defined as the original cut variable except that values originally listed as "Ideal" now read "IDEAL".

We can also recode multiple values at a time.

diamonds %>% 
  mutate(cut.new = recode(cut,
                          "Ideal" = "IDEAL",
                          "Fair" = "Okay",
                          "Premium" = "pizza"))

Of course, this new edit is intended to be a bit silly. When recoding data, it is important to be descriptive so that Future You can navigate your code. Again, remember that the changes made to diamonds in the above code have not been saved to the dataset; diamonds will not contain the cut.new variable.

Most commonly, I use recode() to fix inconsistent labeling.

Example

# creating a dataset with 2 variables (Sex , TestScore)
Sex <- factor(c("male", "m", "M", "Female", "Female", "Female"))
TestScore <- c(10, 20, 10, 25, 12, 5)
dataset <- tibble(Sex, TestScore)
str(dataset)
## Classes 'tbl_df', 'tbl' and 'data.frame':    6 obs. of  2 variables:
##  $ Sex      : Factor w/ 4 levels "Female","m","M",..: 4 2 3 1 1 1
##  $ TestScore: num  10 20 10 25 12 5

We see that Sex has 4 categories (Female, m, M, and male). We know that m, M, and male are not three different categorical values. We meant to have 2 categories representing the two most common biolgical sexes - male and female. Let’s recode all of the male values in Sex to match the female equivalent value, Female.

# creating a new variable (Sex.new) with recoded values 
# from the original variable (Sex)
dataset %>% 
  mutate(Sex.new = recode(Sex, 
                          "m" = "Male",
                          "M" = "Male",
                          "male" = "Male"))
## # A tibble: 6 x 3
##   Sex    TestScore Sex.new
##   <fct>      <dbl> <fct>  
## 1 male          10 Male   
## 2 m             20 Male   
## 3 M             10 Male   
## 4 Female        25 Female 
## 5 Female        12 Female 
## 6 Female         5 Female

Notice that there are now three variables: Sex, TestScore, and Sex.new. The Sex.new variable contains two categorical values: Male and Female. You must save this new dataset’s edits as an object (using <-) to save these changes! Otherwise, the next time you use dataset, the Sex.new variable will not be available.

dataset.new <- # saving the changes to a new object
  dataset %>% 
  mutate(Sex.new = recode(Sex, 
                          "m" = "Male",
                          "M" = "Male",
                          "male" = "Male"))
str(dataset.new)
## Classes 'tbl_df', 'tbl' and 'data.frame':    6 obs. of  3 variables:
##  $ Sex      : Factor w/ 4 levels "Female","m","M",..: 4 2 3 1 1 1
##  $ TestScore: num  10 20 10 25 12 5
##  $ Sex.new  : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 1

The original object, dataset remains the same as before. Only dataset.new includes the additional variable, Sex.new.

str(dataset)
## Classes 'tbl_df', 'tbl' and 'data.frame':    6 obs. of  2 variables:
##  $ Sex      : Factor w/ 4 levels "Female","m","M",..: 4 2 3 1 1 1
##  $ TestScore: num  10 20 10 25 12 5