6.1 mutate()
What it does: Adds new columns or modifies current variables in the dataset.
Let’s say I want to create three new variables in the diamonds
dataset (described in Section5) :
One variable called
JustOne
where all of the values inside the column are1
.One variable called
Values
where all of the values inside aresomething
:One variable called
Simple
where all the values equalTRUE
diamonds %>%
mutate(JustOne = 1,
Values = "something",
Simple = TRUE)
## # A tibble: 53,940 x 13
## carat cut color clarity depth table price x y z JustOne
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 1
## 2 0.21 Prem~ E SI1 59.8 61 326 3.89 3.84 2.31 1
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 1
## 4 0.290 Prem~ I VS2 62.4 58 334 4.2 4.23 2.63 1
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 1
## 6 0.24 Very~ J VVS2 62.8 57 336 3.94 3.96 2.48 1
## 7 0.24 Very~ I VVS1 62.3 57 336 3.95 3.98 2.47 1
## 8 0.26 Very~ H SI1 61.9 55 337 4.07 4.11 2.53 1
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 1
## 10 0.23 Very~ H VS1 59.4 61 338 4 4.05 2.39 1
## # ... with 53,930 more rows, and 2 more variables: Values <chr>,
## # Simple <lgl>
Need a refresher on how a pipe (%>%
) works? Go back to section 3.3.9!
mutate()
can be used to create variables based on existing variables from the dataset.
Let’s use the existing variable price
from the diamonds
dataset to create a new column/variable named price200
. The values for price200
calculated from price
minus $200 (price
is a default variable in diamonds
):
diamonds %>%
mutate(price200 = price - 200)
You can also create multiple columns at once, separating each new variable with a comma.
diamonds %>%
mutate(price200 = price - 200, # $200 OFF from the original price
price20perc = price * 0.20, # 20% of the original price
price20percoff = price * 0.80, # 20% OFF from the original price
pricepercarat = price / carat, # ratio of price to carat
pizza = depth ^ 2) # Square the original depth
Notice that you can label the variables/new columns however you want (refer to 3.4.2 for naming conventions). In the example above, pizza
is not a descriptive name for the new variable defined as the depth
values squared. Make it easier for Future You to navigate your old code: choose simple, descriptive names.
Currently, the changes that we’ve made to the diamonds
dataset have not been saved. We can see that diamonds
does not have the newly created variables (price200
, price20perc
, pizza
) when we execute str(diamonds)
. In order to save your changes, you must define the new dataset as an object using <-
! In the code below, the modified diamonds
dataset will be saved as a new object/dataset called diamonds.new
. As a beginner, it is good practice to save your changes to a dataset under a new object name (e.g., diamonds.new
) each time you make changes rather than saving over the original dataset name (e.g., diamonds
). See 3.3.7 for more information on saving objects.
diamonds.new <- # saving changes to diamonds as a new object
diamonds %>% # original dataset
mutate(price200 = price - 200, # $200 OFF from the original price
price20perc = price * .20, # 20% of the original price
price20percoff = price * 0.80, # 20% OFF from the original price
pricepercarat = price / carat, # ratio of price to carat
pizza = depth ^ 2) # Square the original depth
6.1.1 Nesting Functions
We can also use other functions inside mutate()
to create our new variable(s). For example, we might use the mean()
function to calculate the average price
value for all diamonds in the dataset. This is called nesting, where one function (mean()
) “nests” inside another function (mutate()
).
diamonds %>%
mutate(m = mean(price))
Nearly all of the functions used in this book are available by default or are loaded with library(tidyverse)
. However, you may come across many other kinds of functions that come from different packages. Make sure that the relevant packages are loaded in your library!
The example below creates four new columns to the diamonds
dataset that calculate the mean, standard deviation, standard error, and median price
value for all diamonds. The values in these columns will be the same for every row because R takes all of the values in price
to calculate the mean/standard deviation/median.
diamonds %>%
mutate(m = mean(price), # calculates the mean price
sd = sd(price), # calculates standard deviation
med = median(price)) # calculates the median price
6.1.1.0.1 Exercises
- Using the built-in dataset,
midwest
(make sure tidyverse is loaded!). Each exercise comes with the desired output! That means you get the answer to each question but not how to get to the answer. To view the newly created variable, scroll all the way over to the right side of the output.
- Create a column named
avg.pop.den
which calculates average population density for the entire dataset (hint: usemean()
andpopdensity
; all values in this column should be the same – Why?)
## # A tibble: 437 x 29
## PID county state area poptotal popdensity popwhite popblack
## <int> <chr> <chr> <dbl> <int> <dbl> <int> <int>
## 1 561 ADAMS IL 0.052 66090 1271. 63917 1702
## 2 562 ALEXA~ IL 0.014 10626 759 7054 3496
## 3 563 BOND IL 0.022 14991 681. 14477 429
## 4 564 BOONE IL 0.017 30806 1812. 29344 127
## 5 565 BROWN IL 0.018 5836 324. 5264 547
## 6 566 BUREAU IL 0.05 35688 714. 35157 50
## 7 567 CALHO~ IL 0.017 5322 313. 5298 1
## 8 568 CARRO~ IL 0.027 16805 622. 16519 111
## 9 569 CASS IL 0.024 13437 560. 13384 16
## 10 570 CHAMP~ IL 0.058 173025 2983. 146506 16559
## # ... with 427 more rows, and 21 more variables: popamerindian <int>,
## # popasian <int>, popother <int>, percwhite <dbl>, percblack <dbl>,
## # percamerindan <dbl>, percasian <dbl>, percother <dbl>,
## # popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
## # poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
## # percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## # percelderlypoverty <dbl>, inmetro <int>, category <chr>,
## # avg.pop.den <dbl>
- Create a column named
avg.area
which calculates the average area for the entire dataset
## # A tibble: 437 x 29
## PID county state area poptotal popdensity popwhite popblack
## <int> <chr> <chr> <dbl> <int> <dbl> <int> <int>
## 1 561 ADAMS IL 0.052 66090 1271. 63917 1702
## 2 562 ALEXA~ IL 0.014 10626 759 7054 3496
## 3 563 BOND IL 0.022 14991 681. 14477 429
## 4 564 BOONE IL 0.017 30806 1812. 29344 127
## 5 565 BROWN IL 0.018 5836 324. 5264 547
## 6 566 BUREAU IL 0.05 35688 714. 35157 50
## 7 567 CALHO~ IL 0.017 5322 313. 5298 1
## 8 568 CARRO~ IL 0.027 16805 622. 16519 111
## 9 569 CASS IL 0.024 13437 560. 13384 16
## 10 570 CHAMP~ IL 0.058 173025 2983. 146506 16559
## # ... with 427 more rows, and 21 more variables: popamerindian <int>,
## # popasian <int>, popother <int>, percwhite <dbl>, percblack <dbl>,
## # percamerindan <dbl>, percasian <dbl>, percother <dbl>,
## # popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
## # poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
## # percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## # percelderlypoverty <dbl>, inmetro <int>, category <chr>,
## # avg.area <dbl>
- Create a column called
totadult
which calculates the total number of adults in this dataset (hint: usepopadults
,sum()
; requires nesting)
## # A tibble: 437 x 29
## PID county state area poptotal popdensity popwhite popblack
## <int> <chr> <chr> <dbl> <int> <dbl> <int> <int>
## 1 561 ADAMS IL 0.052 66090 1271. 63917 1702
## 2 562 ALEXA~ IL 0.014 10626 759 7054 3496
## 3 563 BOND IL 0.022 14991 681. 14477 429
## 4 564 BOONE IL 0.017 30806 1812. 29344 127
## 5 565 BROWN IL 0.018 5836 324. 5264 547
## 6 566 BUREAU IL 0.05 35688 714. 35157 50
## 7 567 CALHO~ IL 0.017 5322 313. 5298 1
## 8 568 CARRO~ IL 0.027 16805 622. 16519 111
## 9 569 CASS IL 0.024 13437 560. 13384 16
## 10 570 CHAMP~ IL 0.058 173025 2983. 146506 16559
## # ... with 427 more rows, and 21 more variables: popamerindian <int>,
## # popasian <int>, popother <int>, percwhite <dbl>, percblack <dbl>,
## # percamerindan <dbl>, percasian <dbl>, percother <dbl>,
## # popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
## # poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
## # percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## # percelderlypoverty <dbl>, inmetro <int>, category <chr>,
## # totadult <int>
- Create a new column called
tot.minus.white
calculating the difference betweenpoptotal
andpopwhite
. Using other variables in the dataset, how else could you create thetot.minus.white
column without usingpoptotal
andpopwhite
. (hint:popblack
,popamerindian
,popasian
,popother
)?
## # A tibble: 437 x 30
## PID county state area poptotal popdensity popwhite popblack
## <int> <chr> <chr> <dbl> <int> <dbl> <int> <int>
## 1 561 ADAMS IL 0.052 66090 1271. 63917 1702
## 2 562 ALEXA~ IL 0.014 10626 759 7054 3496
## 3 563 BOND IL 0.022 14991 681. 14477 429
## 4 564 BOONE IL 0.017 30806 1812. 29344 127
## 5 565 BROWN IL 0.018 5836 324. 5264 547
## 6 566 BUREAU IL 0.05 35688 714. 35157 50
## 7 567 CALHO~ IL 0.017 5322 313. 5298 1
## 8 568 CARRO~ IL 0.027 16805 622. 16519 111
## 9 569 CASS IL 0.024 13437 560. 13384 16
## 10 570 CHAMP~ IL 0.058 173025 2983. 146506 16559
## # ... with 427 more rows, and 22 more variables: popamerindian <int>,
## # popasian <int>, popother <int>, percwhite <dbl>, percblack <dbl>,
## # percamerindan <dbl>, percasian <dbl>, percother <dbl>,
## # popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
## # poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
## # percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## # percelderlypoverty <dbl>, inmetro <int>, category <chr>,
## # tot.minus.white <int>, tot.minus.white2 <int>
- Create a new column called
child.to.adult
that calculates the ratio ofpercchildbelowpovert
topercadultpoverty
(i.e., for every adult that is in poverty, what proportion of children are in poverty? Compare to the question from 1a: Why are the values inchild.to.adult
all different?)
## # A tibble: 437 x 29
## PID county state area poptotal popdensity popwhite popblack
## <int> <chr> <chr> <dbl> <int> <dbl> <int> <int>
## 1 561 ADAMS IL 0.052 66090 1271. 63917 1702
## 2 562 ALEXA~ IL 0.014 10626 759 7054 3496
## 3 563 BOND IL 0.022 14991 681. 14477 429
## 4 564 BOONE IL 0.017 30806 1812. 29344 127
## 5 565 BROWN IL 0.018 5836 324. 5264 547
## 6 566 BUREAU IL 0.05 35688 714. 35157 50
## 7 567 CALHO~ IL 0.017 5322 313. 5298 1
## 8 568 CARRO~ IL 0.027 16805 622. 16519 111
## 9 569 CASS IL 0.024 13437 560. 13384 16
## 10 570 CHAMP~ IL 0.058 173025 2983. 146506 16559
## # ... with 427 more rows, and 21 more variables: popamerindian <int>,
## # popasian <int>, popother <int>, percwhite <dbl>, percblack <dbl>,
## # percamerindan <dbl>, percasian <dbl>, percother <dbl>,
## # popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
## # poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
## # percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## # percelderlypoverty <dbl>, inmetro <int>, category <chr>,
## # child.to.adult <dbl>
- Create a new column named
ratio.adult
which calculates the ratio of adults in this dataset (hint:popadults
,poptotal
).
## # A tibble: 437 x 29
## PID county state area poptotal popdensity popwhite popblack
## <int> <chr> <chr> <dbl> <int> <dbl> <int> <int>
## 1 561 ADAMS IL 0.052 66090 1271. 63917 1702
## 2 562 ALEXA~ IL 0.014 10626 759 7054 3496
## 3 563 BOND IL 0.022 14991 681. 14477 429
## 4 564 BOONE IL 0.017 30806 1812. 29344 127
## 5 565 BROWN IL 0.018 5836 324. 5264 547
## 6 566 BUREAU IL 0.05 35688 714. 35157 50
## 7 567 CALHO~ IL 0.017 5322 313. 5298 1
## 8 568 CARRO~ IL 0.027 16805 622. 16519 111
## 9 569 CASS IL 0.024 13437 560. 13384 16
## 10 570 CHAMP~ IL 0.058 173025 2983. 146506 16559
## # ... with 427 more rows, and 21 more variables: popamerindian <int>,
## # popasian <int>, popother <int>, percwhite <dbl>, percblack <dbl>,
## # percamerindan <dbl>, percasian <dbl>, percother <dbl>,
## # popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
## # poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
## # percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## # percelderlypoverty <dbl>, inmetro <int>, category <chr>,
## # ratio.adult <dbl>
- Create a new column named
perc.adult
that calculates the percentage of the total population that are adults per county (hint: build from 1f;* 100
).
## # A tibble: 437 x 29
## PID county state area poptotal popdensity popwhite popblack
## <int> <chr> <chr> <dbl> <int> <dbl> <int> <int>
## 1 561 ADAMS IL 0.052 66090 1271. 63917 1702
## 2 562 ALEXA~ IL 0.014 10626 759 7054 3496
## 3 563 BOND IL 0.022 14991 681. 14477 429
## 4 564 BOONE IL 0.017 30806 1812. 29344 127
## 5 565 BROWN IL 0.018 5836 324. 5264 547
## 6 566 BUREAU IL 0.05 35688 714. 35157 50
## 7 567 CALHO~ IL 0.017 5322 313. 5298 1
## 8 568 CARRO~ IL 0.027 16805 622. 16519 111
## 9 569 CASS IL 0.024 13437 560. 13384 16
## 10 570 CHAMP~ IL 0.058 173025 2983. 146506 16559
## # ... with 427 more rows, and 21 more variables: popamerindian <int>,
## # popasian <int>, popother <int>, percwhite <dbl>, percblack <dbl>,
## # percamerindan <dbl>, percasian <dbl>, percother <dbl>,
## # popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
## # poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
## # percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## # percelderlypoverty <dbl>, inmetro <int>, category <chr>,
## # perc.adult <dbl>
- Using the built-in dataset,
presidential
, create a column namedduration
that calculates the total number of days each president held in office.
## # A tibble: 11 x 5
## name start end party duration
## <chr> <date> <date> <chr> <drtn>
## 1 Eisenhower 1953-01-20 1961-01-20 Republican 2922 days
## 2 Kennedy 1961-01-20 1963-11-22 Democratic 1036 days
## 3 Johnson 1963-11-22 1969-01-20 Democratic 1886 days
## 4 Nixon 1969-01-20 1974-08-09 Republican 2027 days
## 5 Ford 1974-08-09 1977-01-20 Republican 895 days
## 6 Carter 1977-01-20 1981-01-20 Democratic 1461 days
## 7 Reagan 1981-01-20 1989-01-20 Republican 2922 days
## 8 Bush 1989-01-20 1993-01-20 Republican 1461 days
## 9 Clinton 1993-01-20 2001-01-20 Democratic 2922 days
## 10 Bush 2001-01-20 2009-01-20 Republican 2922 days
## 11 Obama 2009-01-20 2017-01-20 Democratic 2922 days
- Using the built-in dataset,
economics
, create a column namedperc.unemploy
that calculates the percentage of the population that is unemployed.
## # A tibble: 574 x 7
## date pce pop psavert uempmed unemploy perc.unemploy
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1967-07-01 507. 198712 12.6 4.5 2944 1.48
## 2 1967-08-01 510. 198911 12.6 4.7 2945 1.48
## 3 1967-09-01 516. 199113 11.9 4.6 2958 1.49
## 4 1967-10-01 512. 199311 12.9 4.9 3143 1.58
## 5 1967-11-01 517. 199498 12.8 4.7 3066 1.54
## 6 1967-12-01 525. 199657 11.8 4.8 3018 1.51
## 7 1968-01-01 531. 199808 11.7 5.1 2878 1.44
## 8 1968-02-01 534. 199920 12.3 4.5 3001 1.50
## 9 1968-03-01 544. 200056 11.7 4.1 2877 1.44
## 10 1968-04-01 544 200208 12.3 4.6 2709 1.35
## # ... with 564 more rows
- Using the built-in dataset,
txhousing
,
- Create a column named
successrate
that calculates the percent of houses that sell of the total listings available (hint:sales
,listings
).
## # A tibble: 8,602 x 10
## city year month sales volume median listings inventory date
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abil~ 2000 1 72 5.38e6 71400 701 6.3 2000
## 2 Abil~ 2000 2 98 6.50e6 58700 746 6.6 2000.
## 3 Abil~ 2000 3 130 9.28e6 58100 784 6.8 2000.
## 4 Abil~ 2000 4 98 9.73e6 68600 785 6.9 2000.
## 5 Abil~ 2000 5 141 1.06e7 67300 794 6.8 2000.
## 6 Abil~ 2000 6 156 1.39e7 66900 780 6.6 2000.
## 7 Abil~ 2000 7 152 1.26e7 73500 742 6.2 2000.
## 8 Abil~ 2000 8 131 1.07e7 75000 765 6.4 2001.
## 9 Abil~ 2000 9 104 7.62e6 64500 771 6.5 2001.
## 10 Abil~ 2000 10 101 7.04e6 59300 764 6.6 2001.
## # ... with 8,592 more rows, and 1 more variable: successrate <dbl>
- Create a column called
failrate
that calculates the percent of houses that do not sell of the total listings available (hint: build from 4a).
## # A tibble: 8,602 x 10
## city year month sales volume median listings inventory date failrate
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abile~ 2000 1 72 5.38e6 71400 701 6.3 2000 89.7
## 2 Abile~ 2000 2 98 6.50e6 58700 746 6.6 2000. 86.9
## 3 Abile~ 2000 3 130 9.28e6 58100 784 6.8 2000. 83.4
## 4 Abile~ 2000 4 98 9.73e6 68600 785 6.9 2000. 87.5
## 5 Abile~ 2000 5 141 1.06e7 67300 794 6.8 2000. 82.2
## 6 Abile~ 2000 6 156 1.39e7 66900 780 6.6 2000. 80
## 7 Abile~ 2000 7 152 1.26e7 73500 742 6.2 2000. 79.5
## 8 Abile~ 2000 8 131 1.07e7 75000 765 6.4 2001. 82.9
## 9 Abile~ 2000 9 104 7.62e6 64500 771 6.5 2001. 86.5
## 10 Abile~ 2000 10 101 7.04e6 59300 764 6.6 2001. 86.8
## # ... with 8,592 more rows
Let’s take a look at some other useful functions.
6.1.1.1 recode()
What it does: modifies the values within a variable. Here is the basic structure for using recode
:
data %>% mutate(Variable = recode(Variable, "old value" = "new value"))
Example
diamonds %>%
mutate(cut.new = recode(cut,
"Ideal" = "IDEAL"))
In the code below, we created a new variable named cut.new
that is defined as the original cut
variable except that values originally listed as "Ideal"
now read "IDEAL"
.
We can also recode multiple values at a time.
diamonds %>%
mutate(cut.new = recode(cut,
"Ideal" = "IDEAL",
"Fair" = "Okay",
"Premium" = "pizza"))
Of course, this new edit is intended to be a bit silly. When recoding data, it is important to be descriptive so that Future You can navigate your code. Again, remember that the changes made to diamonds
in the above code have not been saved to the dataset; diamonds
will not contain the cut.new
variable.
Most commonly, I use recode()
to fix inconsistent labeling.
Example
# creating a dataset with 2 variables (Sex , TestScore)
Sex <- factor(c("male", "m", "M", "Female", "Female", "Female"))
TestScore <- c(10, 20, 10, 25, 12, 5)
dataset <- tibble(Sex, TestScore)
str(dataset)
## Classes 'tbl_df', 'tbl' and 'data.frame': 6 obs. of 2 variables:
## $ Sex : Factor w/ 4 levels "Female","m","M",..: 4 2 3 1 1 1
## $ TestScore: num 10 20 10 25 12 5
We see that Sex
has 4 categories (Female, m, M, and male). We know that m, M, and male are not three different categorical values. We meant to have 2 categories representing the two most common biolgical sexes - male and female. Let’s recode all of the male values in Sex
to match the female equivalent value, Female
.
# creating a new variable (Sex.new) with recoded values
# from the original variable (Sex)
dataset %>%
mutate(Sex.new = recode(Sex,
"m" = "Male",
"M" = "Male",
"male" = "Male"))
## # A tibble: 6 x 3
## Sex TestScore Sex.new
## <fct> <dbl> <fct>
## 1 male 10 Male
## 2 m 20 Male
## 3 M 10 Male
## 4 Female 25 Female
## 5 Female 12 Female
## 6 Female 5 Female
Notice that there are now three variables: Sex
, TestScore
, and Sex.new
. The Sex.new
variable contains two categorical values: Male and Female. You must save this new dataset’s edits as an object (using <-
) to save these changes! Otherwise, the next time you use dataset
, the Sex.new
variable will not be available.
dataset.new <- # saving the changes to a new object
dataset %>%
mutate(Sex.new = recode(Sex,
"m" = "Male",
"M" = "Male",
"male" = "Male"))
str(dataset.new)
## Classes 'tbl_df', 'tbl' and 'data.frame': 6 obs. of 3 variables:
## $ Sex : Factor w/ 4 levels "Female","m","M",..: 4 2 3 1 1 1
## $ TestScore: num 10 20 10 25 12 5
## $ Sex.new : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 1
The original object, dataset
remains the same as before. Only dataset.new
includes the additional variable, Sex.new
.
str(dataset)
## Classes 'tbl_df', 'tbl' and 'data.frame': 6 obs. of 2 variables:
## $ Sex : Factor w/ 4 levels "Female","m","M",..: 4 2 3 1 1 1
## $ TestScore: num 10 20 10 25 12 5