6.6 arrange()
Function: Lets you sort the rows of a dataset by the values of a variable, in ascending or descending order (where an ordering applies to your values). This works for both numerical and non-numerical values.
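Arrange cut in ascending order. Because cut is an ordered factor, the rows follow its level order (Fair, Good, Very Good, Premium, Ideal) rather than alphabetical order: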
diamonds %>% arrange(cut)
Arrange price by numerical order (lowest to highest):
diamonds %>% arrange(price)
Arrange cut in descending order of its factor levels (Ideal down to Fair):
diamonds %>% arrange(desc(cut))
Arrange price in descending numerical order (highest to lowest):
diamonds %>% arrange(desc(price))
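The sorting rules above can be tried on a very small dataset first. The sketch below is only an illustration: the tibble `toy` and its columns `size` and `cost` are made up for this example and are not part of diamonds. It assumes dplyr (part of the tidyverse) is installed, and it contrasts how a character column, a numeric column, and an ordered factor are sorted.

```r
library(dplyr)  # part of the tidyverse; provides tibble(), arrange(), desc()

# Hypothetical toy data, made up for this illustration
toy <- tibble(
  size = c("small", "large", "medium", "large"),
  cost = c(3, 10, 7, 12)
)

# Character column: alphabetical order (large, large, medium, small)
by_size <- toy %>% arrange(size)

# Numeric column: highest to lowest (12, 10, 7, 3)
by_cost_desc <- toy %>% arrange(desc(cost))

# Ordered factor: rows follow the level order, not the alphabet
toy_fct <- toy %>%
  mutate(size = factor(size,
                       levels = c("small", "medium", "large"),
                       ordered = TRUE))
by_level <- toy_fct %>% arrange(size)   # small, medium, large, large

# Two sort keys: size levels first, then cost from high to low within a size
by_both <- toy_fct %>% arrange(size, desc(cost))
```

Printing `by_size` and `by_level` side by side shows why a factor such as cut does not come out in A-to-Z order.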
6.6.1 Exercises
In the following exercises, it is important to type out the code rather than to copy and paste it. This code should produce zero errors (unless otherwise specified); refer back to the troubleshooting section (3.6) if necessary.
- Practice typing and executing these problems, and explain what each line of code does. Make sure tidyverse is loaded with library(tidyverse).
The example below shows how each problem should be answered: the purpose of every line is explained in a comment.
## Example illustrating how to answer these problems:
diamonds %>%                           # utilizes the diamonds dataset
  group_by(color, clarity) %>%         # groups data by color and clarity variables
  mutate(price200 = mean(price)) %>%   # creates new variable (average price within each color-clarity group)
  ungroup() %>%                        # data no longer grouped by color and clarity
  mutate(random10 = 10 + price) %>%    # new variable, original price + $10
  select(cut, color,                   # retain only these listed columns
         clarity, price,
         price200, random10) %>%
  arrange(color) %>%                   # sorts the rows by color
  group_by(cut) %>%                    # group data by cut
  mutate(dis = n_distinct(price),      # counts the number of unique price values per cut
         rowID = row_number()) %>%     # numbers each row consecutively within each cut
  ungroup()                            # final ungrouping of data
It is an excellent exercise to execute one line at a time to visualize how each line of code changes the output. Refer back to 3.6.2 for more information on this technique!
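One possible way to run a pipeline one line at a time is to save each stage under its own name and inspect it. This is a sketch rather than the method from section 3.6.2, which may describe the technique differently. The names `step1`, `step2`, and `step3` are made up for this illustration, and the code assumes dplyr and ggplot2 (which supplies diamonds) are installed.

```r
library(dplyr)
library(ggplot2)  # supplies the diamonds dataset

# Step 1: run only the first two lines of the example and inspect the result
step1 <- diamonds %>%
  group_by(color, clarity)

# Step 2: add the next line; price200 now holds the group average
step2 <- step1 %>%
  mutate(price200 = mean(price))

# Step 3: add ungroup(); the rows are unchanged, only the grouping is removed
step3 <- step2 %>%
  ungroup()

group_vars(step1)  # "color" "clarity"
group_vars(step3)  # character(0)
```

Printing each step, for example with `step2`, shows exactly which column or grouping that line added.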
library(tidyverse)
## Problem A
midwest %>%
  group_by(state) %>%
  summarize(poptotalmean = mean(poptotal),
            poptotalmed = median(poptotal),
            popmax = max(poptotal),
            popmin = min(poptotal),
            popdistinct = n_distinct(poptotal),
            popfirst = first(poptotal),
            popany = any(poptotal < 5000),
            popany2 = any(poptotal > 2000000)) %>%
  ungroup()
## # A tibble: 5 x 9
## state poptotalmean poptotalmed popmax popmin popdistinct popfirst popany
## <chr> <dbl> <dbl> <int> <int> <int> <int> <lgl>
## 1 IL 112065. 24486. 5.11e6 4373 101 66090 TRUE
## 2 IN 60263. 30362. 7.97e5 5315 92 31095 FALSE
## 3 MI 111992. 37308 2.11e6 1701 83 10145 TRUE
## 4 OH 123263. 54930. 1.41e6 11098 88 25371 FALSE
## 5 WI 67941. 33528 9.59e5 3890 72 15682 TRUE
## # ... with 1 more variable: popany2 <lgl>
## Problem B
midwest %>%
  group_by(state) %>%
  summarize(num5k = sum(poptotal < 5000),
            num2mil = sum(poptotal > 2000000),
            numrows = n()) %>%
  ungroup()
## # A tibble: 5 x 4
## state num5k num2mil numrows
## <chr> <int> <int> <int>
## 1 IL 1 1 102
## 2 IN 0 0 92
## 3 MI 1 1 83
## 4 OH 0 0 88
## 5 WI 2 0 72
## Problem C
# part I
midwest %>%
  group_by(county) %>%
  summarize(x = n_distinct(state)) %>%
  arrange(desc(x)) %>%
  ungroup()
## # A tibble: 320 x 2
## county x
## <chr> <int>
## 1 CRAWFORD 5
## 2 JACKSON 5
## 3 MONROE 5
## 4 ADAMS 4
## 5 BROWN 4
## 6 CLARK 4
## 7 CLINTON 4
## 8 JEFFERSON 4
## 9 LAKE 4
## 10 WASHINGTON 4
## # ... with 310 more rows
# part II
# How does n() differ from n_distinct()?
# When would they be the same? different?
midwest %>%
  group_by(county) %>%
  summarize(x = n()) %>%
  ungroup()
## # A tibble: 320 x 2
## county x
## <chr> <int>
## 1 ADAMS 4
## 2 ALCONA 1
## 3 ALEXANDER 1
## 4 ALGER 1
## 5 ALLEGAN 1
## 6 ALLEN 2
## 7 ALPENA 1
## 8 ANTRIM 1
## 9 ARENAC 1
## 10 ASHLAND 2
## # ... with 310 more rows
# part III
# hint:
# - How many distinct county values are there within each county group?
# - Can there be more than 1 distinct county value within each county group?
# - What if we replace 'county' with 'state'?
midwest %>%
  group_by(county) %>%
  summarize(x = n_distinct(county)) %>%
  ungroup()
## # A tibble: 320 x 2
## county x
## <chr> <int>
## 1 ADAMS 1
## 2 ALCONA 1
## 3 ALEXANDER 1
## 4 ALGER 1
## 5 ALLEGAN 1
## 6 ALLEN 1
## 7 ALPENA 1
## 8 ANTRIM 1
## 9 ARENAC 1
## 10 ASHLAND 1
## # ... with 310 more rows
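If the difference between n() and n_distinct() is still unclear after working through Problem C, a tiny dataset can make it easier to see. The sketch below uses a made-up tibble `visits` with hypothetical columns `owner` and `pet` (not from midwest) and assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical data, made up for this illustration
visits <- tibble(
  owner = c("Ana", "Ana", "Ana", "Ben", "Ben"),
  pet   = c("cat", "cat", "dog", "dog", "dog")
)

counts <- visits %>%
  group_by(owner) %>%
  summarize(rows   = n(),                  # number of rows in each group
            pets   = n_distinct(pet),      # number of unique pet values in each group
            owners = n_distinct(owner)) %>% # unique values of the grouping variable itself
  ungroup()

counts
# Ana: rows = 3, pets = 2, owners = 1
# Ben: rows = 2, pets = 1, owners = 1
```

Here n() and n_distinct() agree only when every row in a group carries a different value of the counted variable.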
## Problem D
diamonds %>%
  group_by(clarity) %>%
  summarize(a = n_distinct(color),
            b = n_distinct(price),
            c = n()) %>%
  ungroup()
## # A tibble: 8 x 4
## clarity a b c
## <ord> <int> <int> <int>
## 1 I1 7 632 741
## 2 SI2 7 4904 9194
## 3 SI1 7 5380 13065
## 4 VS2 7 5051 12258
## 5 VS1 7 3926 8171
## 6 VVS2 7 2409 5066
## 7 VVS1 7 1623 3655
## 8 IF 7 902 1790
## Problem E
# part I
diamonds %>%
  group_by(color, cut) %>%
  summarize(m = mean(price),
            s = sd(price)) %>%
  ungroup()
## # A tibble: 35 x 4
## color cut m s
## <ord> <ord> <dbl> <dbl>
## 1 D Fair 4291. 3286.
## 2 D Good 3405. 3175.
## 3 D Very Good 3470. 3524.
## 4 D Premium 3631. 3712.
## 5 D Ideal 2629. 3001.
## 6 E Fair 3682. 2977.
## 7 E Good 3424. 3331.
## 8 E Very Good 3215. 3408.
## 9 E Premium 3539. 3795.
## 10 E Ideal 2598. 2956.
## # ... with 25 more rows
# part II
diamonds %>%
  group_by(cut, color) %>%
  summarize(m = mean(price),
            s = sd(price)) %>%
  ungroup()
## # A tibble: 35 x 4
## cut color m s
## <ord> <ord> <dbl> <dbl>
## 1 Fair D 4291. 3286.
## 2 Fair E 3682. 2977.
## 3 Fair F 3827. 3223.
## 4 Fair G 4239. 3610.
## 5 Fair H 5136. 3886.
## 6 Fair I 4685. 3730.
## 7 Fair J 4976. 4050.
## 8 Good D 3405. 3175.
## 9 Good E 3424. 3331.
## 10 Good F 3496. 3202.
## # ... with 25 more rows
# part III
# hint:
# - How good is the sale if the price of diamonds equaled msale?
# - e.g., the diamonds are x% off the original price in msale.
diamonds %>%
  group_by(cut, color, clarity) %>%
  summarize(m = mean(price),
            s = sd(price),
            msale = m * 0.80) %>%
  ungroup()
## # A tibble: 276 x 6
## cut color clarity m s msale
## <ord> <ord> <ord> <dbl> <dbl> <dbl>
## 1 Fair D I1 7383 5899. 5906.
## 2 Fair D SI2 4355. 3260. 3484.
## 3 Fair D SI1 4273. 3019. 3419.
## 4 Fair D VS2 4513. 3383. 3610.
## 5 Fair D VS1 2921. 2550. 2337.
## 6 Fair D VVS2 3607 3629. 2886.
## 7 Fair D VVS1 4473 5457. 3578.
## 8 Fair D IF 1620. 525. 1296.
## 9 Fair E I1 2095. 824. 1676.
## 10 Fair E SI2 4172. 3055. 3338.
## # ... with 266 more rows
## Problem F
diamonds %>%
  group_by(cut) %>%
  summarize(potato = mean(depth),
            pizza = mean(price),
            popcorn = median(y),
            pineapple = potato - pizza,
            papaya = pineapple ^ 2,
            peach = n()) %>%
  ungroup()
## # A tibble: 5 x 7
## cut potato pizza popcorn pineapple papaya peach
## <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Fair 64.0 4359. 6.1 -4295. 18444586. 1610
## 2 Good 62.4 3929. 5.99 -3866. 14949811. 4906
## 3 Very Good 61.8 3982. 5.77 -3920. 15365942. 12082
## 4 Premium 61.3 4584. 6.06 -4523. 20457466. 13791
## 5 Ideal 61.7 3458. 5.26 -3396. 11531679. 21551
## Problem G
# part I
diamonds %>%
  group_by(color) %>%
  summarize(m = mean(price)) %>%
  mutate(x1 = str_c("Diamond color ", color),
         x2 = 5) %>%
  ungroup()
## # A tibble: 7 x 4
## color m x1 x2
## <ord> <dbl> <chr> <dbl>
## 1 D 3170. Diamond color D 5
## 2 E 3077. Diamond color E 5
## 3 F 3725. Diamond color F 5
## 4 G 3999. Diamond color G 5
## 5 H 4487. Diamond color H 5
## 6 I 5092. Diamond color I 5
## 7 J 5324. Diamond color J 5
# part II
# What does the first ungroup() do? Is it useful here? Why/why not?
# Why isn't there a closing ungroup() after the mutate()?
diamonds %>%
  group_by(color) %>%
  summarize(m = mean(price)) %>%
  ungroup() %>%
  mutate(x1 = str_c("Diamond color ", color),
         x2 = 5)
## # A tibble: 7 x 4
## color m x1 x2
## <ord> <dbl> <chr> <dbl>
## 1 D 3170. Diamond color D 5
## 2 E 3077. Diamond color E 5
## 3 F 3725. Diamond color F 5
## 4 G 3999. Diamond color G 5
## 5 H 4487. Diamond color H 5
## 6 I 5092. Diamond color I 5
## 7 J 5324. Diamond color J 5
## Problem H
# part I
diamonds %>%
  group_by(color) %>%
  mutate(x1 = price * 0.5) %>%
  summarize(m = mean(x1)) %>%
  ungroup()
## # A tibble: 7 x 2
## color m
## <ord> <dbl>
## 1 D 1585.
## 2 E 1538.
## 3 F 1862.
## 4 G 2000.
## 5 H 2243.
## 6 I 2546.
## 7 J 2662.
# part II
# What's the difference between part I and II?
diamonds %>%
  group_by(color) %>%
  mutate(x1 = price * 0.5) %>%
  ungroup() %>%
  summarize(m = mean(x1))
## # A tibble: 1 x 1
## m
## <dbl>
## 1 1966.
Why is grouping data necessary?
Why is ungrouping data necessary?
When should you ungroup data?
If the code does not contain group_by(), do you still need ungroup() at the end? For example, does data() %>% mutate(newVar = 1 + 2) require ungroup()?
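When thinking about these questions, it can help to check directly whether a result is still grouped. The sketch below is one way to do that, using dplyr's is_grouped_df() and group_vars(). The tibble `scores` and its columns `team` and `pts` are made up for this illustration, and the code assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical data, made up for this illustration
scores <- tibble(team = c("a", "a", "b"), pts = c(1, 3, 5))

# No group_by(): the result was never grouped
plain <- scores %>% mutate(newVar = 1 + 2)
is_grouped_df(plain)    # FALSE

# group_by() followed by mutate(): the grouping stays until ungroup() is called
grouped <- scores %>%
  group_by(team) %>%
  mutate(avg = mean(pts))
is_grouped_df(grouped)  # TRUE
group_vars(grouped)     # "team"

# Grouped vs. ungrouped summarize(): one row per team vs. one row overall
per_team <- grouped %>% summarize(m = mean(pts))                # a = 2, b = 5
overall  <- grouped %>% ungroup() %>% summarize(m = mean(pts))  # 3
```

The same two checks can be run on the result of any problem above to see whether a trailing ungroup() changes anything.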