6 Lecture 4 - 21/12/2020
In this lecture we will learn another R programming approach based on the tidyverse
collection of packages. This is an alternative to the base R code we learnt in the first lectures.
6.1 Tidyverse
Tidyverse is a collection of R packages designed for data science (see Figure 6.1). All the packages share an underlying design philosophy, grammar, and data structures. See here for more details.
Tidyverse functions are often faster than their base R equivalents when working with data frames, although this is not guaranteed in every case. They also offer a more consistent syntax and are designed around data frames rather than vectors.
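The startup messages that follow are printed when the collection is attached. A minimal sketch, assuming the tidyverse package is already installed (the versions listed depend on the installation):

```r
# Attach the core tidyverse packages (ggplot2, dplyr, tidyr, readr, ...)
library(tidyverse)
```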
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.0 ✓ dplyr 1.0.5
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
6.2 The pipe operator
Let’s consider a general R function named f with argument x. We usually apply f with the standard call f(x).
An alternative is given by the pipe operator %>%, which comes from the magrittr package and is re-exported by dplyr (see here for more details). With the pipe, the same call is written x %>% f().
Basically, the pipe tells R to pass x as the first argument of the function f. The shortcut to type the pipe operator in RStudio is CTRL/CMD + Shift + M.
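The equivalence between the two calling styles can be checked with a small sketch. The function f and the value of x below are hypothetical and serve only as an illustration:

```r
library(dplyr)  # makes the pipe operator %>% available

# f is a made-up function used only for this illustration
f <- function(x) x^2

x <- 3
f(x)        # standard call
x %>% f()   # same call written with the pipe
```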
We simulate a sample of data in order to run some simple examples with the pipe operator. By using the function sample we draw randomly (without replacement) 5 numbers between 1 and 20 (1:20).
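A minimal sketch of the draw follows. The seed behind the printed output is not stated, so a rerun will generally give different numbers; calling set.seed() beforehand makes the result reproducible.

```r
# Draw 5 distinct integers from 1:20 (sampling without replacement is the default)
x <- sample(1:20, size = 5)
x
```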
## [1] 17 11 1 20 5
We are now interested in computing the log transformation of the vector x. By adopting the standard R programming we would use:
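As a sketch, assuming x holds the five values printed above (entered here by hand so that the example is self-contained):

```r
x <- c(17, 11, 1, 20, 5)  # the sampled values shown above
log(x)                    # natural logarithm of each element
```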
## [1] 2.833213 2.397895 0.000000 2.995732 1.609438
while with the pipe operator we have
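A sketch of the piped version, under the same assumption that x holds the values printed above:

```r
library(dplyr)            # makes %>% available
x <- c(17, 11, 1, 20, 5)  # the sampled values shown above
x %>% log()               # x becomes the first argument of log
```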
## [1] 2.833213 2.397895 0.000000 2.995732 1.609438
where x is taken as the first argument of the function log. It is also possible to include other arguments, such as for example the base of the logarithm (in this case equal to 5). In this case note that x %>% f(y) is equivalent to f(x, y).
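Both forms can be sketched as follows, again assuming x holds the values printed earlier:

```r
library(dplyr)            # makes %>% available
x <- c(17, 11, 1, 20, 5)  # the sampled values shown above

log(x, base = 5)          # standard call with an extra argument
x %>% log(base = 5)       # piped call: x is inserted as the first argument
```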
## [1] 1.760374 1.489896 0.000000 1.861353 1.000000
## [1] 1.760374 1.489896 0.000000 1.861353 1.000000
We now want to apply the log transformation and then round the corresponding output to 2 digits. This requires the use of two functions (log and round). In general, when we apply 3 functions (f and then g and finally h), we have that x %>% f %>% g %>% h is equivalent to h(g(f(x))).
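A sketch of the two-step computation in nested and piped form, assuming x holds the values printed earlier:

```r
library(dplyr)            # makes %>% available
x <- c(17, 11, 1, 20, 5)  # the sampled values shown above

round(log(x), 2)          # nested calls are read from the inside out
x %>% log() %>% round(2)  # piped calls are read from left to right
```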
## [1] 2.83 2.40 0.00 3.00 1.61
## [1] 2.83 2.40 0.00 3.00 1.61
We now add a new function: after rounding the log output we compute the sum of the 5 numbers:
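The three-step chain can be sketched like this, assuming x holds the values printed earlier:

```r
library(dplyr)            # makes %>% available
x <- c(17, 11, 1, 20, 5)  # the sampled values shown above

sum(round(log(x), 2))               # nested version
x %>% log() %>% round(2) %>% sum()  # piped version
```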
## [1] 9.84
## [1] 9.84
We now want to use the sum result as the base of the log transformation of the number 5:
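A sketch of this step, assuming x holds the values printed earlier. In the piped version the dot marks where the result of the previous step has to be placed:

```r
library(dplyr)            # makes %>% available
x <- c(17, 11, 1, 20, 5)  # the sampled values shown above

log(5, base = sum(round(log(x), 2)))                   # nested version
x %>% log() %>% round(2) %>% sum() %>% log(5, base = .)  # piped version
```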
## [1] 0.7039008
## [1] 0.7039008
The symbol . is the placeholder and is used when the output of the previous pipe should not be used as the first input of the following function. In general, x %>% f(y, z = .) is equivalent to f(y, z = x).
When it is not convenient to use the pipe:
- when the pipes are longer than 10 steps. In this case the suggestion is to create intermediate objects with meaningful names (that can help in understanding what the code does);
- when you have multiple inputs or outputs (e.g. when there is no primary object being transformed but two or more objects being combined together).
6.3 dplyr verbs
dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:
- select: pick variables (columns) based on their names
- filter: pick observations (rows) based on their values
- mutate: add new variables that are functions of existing variables
- summarise: reduce multiple values down to a single summary (e.g. mean)
- arrange: change the ordering of the rows
All verbs work similarly:
- the first argument is a data frame;
- the subsequent arguments describe what to do with the data frame using the variable names (without quotes);
- the result is a new data frame.
In the following we will take into account all the dplyr verbs by considering the diamonds dataset which contains the prices and other attributes of almost 54,000 diamonds (see ?diamonds). If we use the function class to understand the nature of diamonds we get the following output:
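A minimal sketch of the call, assuming the tidyverse is installed (the diamonds data ships with ggplot2, which is attached together with the tidyverse):

```r
library(tidyverse)  # attaches dplyr and ggplot2 (home of the diamonds data)

class(diamonds)     # classes of the object, from most to least specific
```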
## [1] "tbl_df" "tbl" "data.frame"
The term tbl (tibble) is the tidyverse version of a classical R data frame. Tibbles are very similar to data frames (they just contain/display more information) and are designed to be used with the tidyverse syntax style.
To get the list and the type of variables included in diamonds we can use the standard str or the corresponding tidyverse function named glimpse:
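The two calls can be sketched as follows, assuming the tidyverse is installed. The exact layout of the printout depends on the package versions:

```r
library(tidyverse)  # attaches dplyr and ggplot2 (home of the diamonds data)

str(diamonds)       # base R structure summary
glimpse(diamonds)   # tidyverse counterpart, one line per column
```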
## tibble[,10] [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
## Rows: 53,940
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…
6.3.1 Verb 1: select
This verb is used to select some of the columns by name. For example, to select the variable named carat we use the following code:
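One way to write this selection, as a sketch that assumes the tidyverse is installed:

```r
library(tidyverse)  # attaches dplyr and ggplot2 (home of the diamonds data)

# Keep only the column carat; the column name is written without quotes
diamonds %>% select(carat)
```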
## # A tibble: 53,940 x 1
## carat
## <dbl>
## 1 0.23
## 2 0.21
## 3 0.23
## 4 0.29
## 5 0.31
## 6 0.24
## 7 0.24
## 8 0.26
## 9 0.22
## 10 0.23
## # … with 53,930 more rows
Selecting more than one column is very simple:
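A sketch with four columns listed one by one, assuming the tidyverse is installed:

```r
library(tidyverse)  # attaches dplyr and ggplot2 (home of the diamonds data)

# Columns are returned in the order in which they are listed
diamonds %>% select(carat, cut, color, price)
```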
## # A tibble: 53,940 x 4
## carat cut color price
## <dbl> <ord> <ord> <int>
## 1 0.23 Ideal E 326
## 2 0.21 Premium E 326
## 3 0.23 Good E 327
## 4 0.29 Premium I 334
## 5 0.31 Good J 335
## 6 0.24 Very Good J 336
## 7 0.24 Very Good I 336
## 8 0.26 Very Good H 337
## 9 0.22 Fair E 337
## 10 0.23 Very Good H 338
## # … with 53,930 more rows
Given that carat, cut and color are consecutive, the following code is also possible:
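A sketch of the range syntax, where first:last selects every column from first to last. It assumes the tidyverse is installed:

```r
library(tidyverse)  # attaches dplyr and ggplot2 (home of the diamonds data)

# carat:color covers carat, cut and color; price is added separately
diamonds %>% select(carat:color, price)
```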
## # A tibble: 53,940 x 4
## carat cut color price
## <dbl> <ord> <ord> <int>
## 1 0.23 Ideal E 326
## 2 0.21 Premium E 326
## 3 0.23 Good E 327
## 4 0.29 Premium I 334
## 5 0.31 Good J 335
## 6 0.24 Very Good J 336
## 7 0.24 Very Good I 336
## 8 0.26 Very Good H 337
## 9 0.22 Fair E 337
## 10 0.23 Very Good H 338
## # … with 53,930 more rows
Moreover, it is also possible to select all the columns whose name starts with the letter “c” by using starts_with combined with the select function:
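A sketch of the helper in use, assuming the tidyverse is installed:

```r
library(tidyverse)  # attaches dplyr and ggplot2 (home of the diamonds data)

# starts_with() matches column names by prefix
diamonds %>% select(starts_with("c"))
```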
## # A tibble: 53,940 x 4
## carat cut color clarity
## <dbl> <ord> <ord> <ord>
## 1 0.23 Ideal E SI2
## 2 0.21 Premium E SI1
## 3 0.23 Good E VS1
## 4 0.29 Premium I VS2
## 5 0.31 Good J SI2
## 6 0.24 Very Good J VVS2
## 7 0.24 Very Good I VVS1
## 8 0.26 Very Good H SI1
## 9 0.22 Fair E VS2
## 10 0.23 Very Good H VS1
## # … with 53,930 more rows
It is also possible to specify a criterion which excludes some variables from the selection. For example, to select all the columns but carat we use the - symbol:
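A sketch of the exclusion, assuming the tidyverse is installed:

```r
library(tidyverse)  # attaches dplyr and ggplot2 (home of the diamonds data)

# The minus sign drops carat and keeps every other column
diamonds %>% select(-carat)
```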
## # A tibble: 53,940 x 9
## cut color clarity depth table price x y z
## <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # … with 53,930 more rows
We proceed similarly to select all the columns except those whose name starts with “c”:
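A sketch that combines the minus sign with starts_with, assuming the tidyverse is installed:

```r
library(tidyverse)  # attaches dplyr and ggplot2 (home of the diamonds data)

# Drop every column whose name begins with "c"
diamonds %>% select(-starts_with("c"))
```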
## # A tibble: 53,940 x 6
## depth table price x y z
## <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 61.5 55 326 3.95 3.98 2.43
## 2 59.8 61 326 3.89 3.84 2.31
## 3 56.9 65 327 4.05 4.07 2.31
## 4 62.4 58 334 4.2 4.23 2.63
## 5 63.3 58 335 4.34 4.35 2.75
## 6 62.8 57 336 3.94 3.96 2.48
## 7 62.3 57 336 3.95 3.98 2.47
## 8 61.9 55 337 4.07 4.11 2.53
## 9 65.1 61 337 3.87 3.78 2.49
## 10 59.4 61 338 4 4.05 2.39
## # … with 53,930 more rows
6.3.2 Verb 2: filter
The verb filter can be used to select the observations satisfying some criterion. Consider, for example, the selection of the diamonds with variable cut equal to the category “Premium”:
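One way to express this condition, as a sketch assuming the tidyverse is installed:

```r
library(tidyverse)  # attaches dplyr and ggplot2 (home of the diamonds data)

# Keep only the rows for which the condition is TRUE
diamonds %>% filter(cut == "Premium")
```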
## # A tibble: 13,791 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 2 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 3 0.22 Premium F SI1 60.4 61 342 3.88 3.84 2.33
## 4 0.2 Premium E SI2 60.2 62 345 3.79 3.75 2.27
## 5 0.32 Premium E I1 60.9 58 345 4.38 4.42 2.68
## 6 0.24 Premium I VS1 62.5 57 355 3.97 3.94 2.47
## 7 0.29 Premium F SI1 62.4 58 403 4.24 4.26 2.65
## 8 0.22 Premium E VS2 61.6 58 404 3.93 3.89 2.41
## 9 0.22 Premium D VS2 59.3 62 404 3.91 3.88 2.31
## 10 0.3 Premium J SI2 59.3 61 405 4.43 4.38 2.61
## # … with 13,781 more rows
It is also possible to include more conditions. Select for example the diamonds with cut equal to “Premium” AND color equal to “D” (the AND can be specified by using & or a comma):
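The two equivalent spellings can be sketched as follows, assuming the tidyverse is installed:

```r
library(tidyverse)  # attaches dplyr and ggplot2 (home of the diamonds data)

# Conditions joined explicitly with &
diamonds %>% filter(cut == "Premium" & color == "D")

# Conditions separated by a comma are combined with AND as well
diamonds %>% filter(cut == "Premium", color == "D")
```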
## # A tibble: 1,603 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.22 Premium D VS2 59.3 62 404 3.91 3.88 2.31
## 2 0.3 Premium D SI1 62.6 59 552 4.23 4.27 2.66
## 3 0.71 Premium D SI2 61.7 59 2768 5.71 5.67 3.51
## 4 0.71 Premium D VS2 62.5 60 2770 5.65 5.61 3.52
## 5 0.7 Premium D VS2 58 62 2773 5.87 5.78 3.38
## 6 0.72 Premium D SI1 62.7 59 2782 5.73 5.69 3.58
## 7 0.7 Premium D SI1 62.8 60 2782 5.68 5.66 3.56
## 8 0.72 Premium D SI2 62 60 2795 5.73 5.69 3.54
## 9 0.71 Premium D SI1 62.7 60 2797 5.67 5.71 3.57
## 10 0.71 Premium D SI1 61.3 58 2797 5.73 5.75 3.52
## # … with 1,593 more rows
## # A tibble: 1,603 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.22 Premium D VS2 59.3 62 404 3.91 3.88 2.31
## 2 0.3 Premium D SI1 62.6 59 552 4.23 4.27 2.66
## 3 0.71 Premium D SI2 61.7 59 2768 5.71 5.67 3.51
## 4 0.71 Premium D VS2 62.5 60 2770 5.65 5.61 3.52
## 5 0.7 Premium D VS2 58 62 2773 5.87 5.78 3.38
## 6 0.72 Premium D SI1 62.7 59 2782 5.73 5.69 3.58
## 7 0.7 Premium D SI1 62.8 60 2782 5.68 5.66 3.56
## 8 0.72 Premium D SI2 62 60 2795 5.73 5.69 3.54
## 9 0.71 Premium D SI1 62.7 60 2797 5.67 5.71 3.57
## 10 0.71 Premium D SI1 61.3 58 2797 5.73 5.75 3.52
## # … with 1,593 more rows
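If we are interested in filtering the diamonds with a price between 500 and 600 dollars, the following two codes can be adopted. Note that the two outputs do not have the same number of rows (2,360 against 2,407), which suggests that the two versions treat the endpoints 500 and 600 differently: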
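Two candidate formulations are sketched below. They are an assumption about what produced the two outputs, not a reconstruction of the original code. The first uses strict inequalities, while between() includes both endpoints, which would be consistent with the second output having more rows.

```r
library(tidyverse)  # attaches dplyr and ggplot2 (home of the diamonds data)

# Strict inequalities: prices of exactly 500 or 600 are excluded
diamonds %>% filter(price > 500, price < 600)

# between() is inclusive: prices of exactly 500 or 600 are kept
diamonds %>% filter(between(price, 500, 600))
```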
## # A tibble: 2,360 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.35 Ideal I VS1 60.9 57 552 4.54 4.59 2.78
## 2 0.3 Premium D SI1 62.6 59 552 4.23 4.27 2.66
## 3 0.3 Ideal D SI1 62.5 57 552 4.29 4.32 2.69
## 4 0.3 Ideal D SI1 62.1 56 552 4.3 4.33 2.68
## 5 0.42 Premium I SI2 61.5 59 552 4.78 4.84 2.96
## 6 0.28 Ideal G VVS2 61.4 56 553 4.19 4.22 2.58
## 7 0.32 Ideal I VVS1 62 55.3 553 4.39 4.42 2.73
## 8 0.31 Very Good G SI1 63.3 57 553 4.33 4.3 2.73
## 9 0.31 Premium G SI1 61.8 58 553 4.35 4.32 2.68
## 10 0.24 Premium E VVS1 60.7 58 553 4.01 4.03 2.44
## # … with 2,350 more rows
## # A tibble: 2,407 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.35 Ideal I VS1 60.9 57 552 4.54 4.59 2.78
## 2 0.3 Premium D SI1 62.6 59 552 4.23 4.27 2.66
## 3 0.3 Ideal D SI1 62.5 57 552 4.29 4.32 2.69
## 4 0.3 Ideal D SI1 62.1 56 552 4.3 4.33 2.68
## 5 0.42 Premium I SI2 61.5 59 552 4.78 4.84 2.96
## 6 0.28 Ideal G VVS2 61.4 56 553 4.19 4.22 2.58
## 7 0.32 Ideal I VVS1 62 55.3 553 4.39 4.42 2.73
## 8 0.31 Very Good G SI1 63.3 57 553 4.33 4.3 2.73
## 9 0.31 Premium G SI1 61.8 58 553 4.35 4.32 2.68
## 10 0.24 Premium E VVS1 60.7 58 553 4.01 4.03 2.44
## # … with 2,397 more rows
The output of a selection can be saved in a new data frame as follows:
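A sketch of the assignment, assuming the tidyverse is installed. The object name premium_diamonds is hypothetical and chosen only for this illustration:

```r
library(tidyverse)  # attaches dplyr and ggplot2 (home of the diamonds data)

# Store the filtered rows in a new object (name chosen for illustration)
premium_diamonds <- diamonds %>% filter(cut == "Premium")
dim(premium_diamonds)  # number of rows and columns of the new data frame
```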
To select diamonds with cut equal to “Fair” OR “Good” we can use the | operator or the %in% operator:
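Both versions can be sketched as follows, assuming the tidyverse is installed:

```r
library(tidyverse)  # attaches dplyr and ggplot2 (home of the diamonds data)

# Two conditions joined with OR
diamonds %>% filter(cut == "Fair" | cut == "Good")

# Membership test against a vector of accepted categories
diamonds %>% filter(cut %in% c("Fair", "Good"))
```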
## # A tibble: 6,516 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 2 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 3 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 4 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
## 5 0.3 Good J SI1 63.4 54 351 4.23 4.29 2.7
## 6 0.3 Good J SI1 63.8 56 351 4.23 4.26 2.71
## 7 0.3 Good I SI2 63.3 56 351 4.26 4.3 2.71
## 8 0.23 Good F VS1 58.2 59 402 4.06 4.08 2.37
## 9 0.23 Good E VS1 64.1 59 402 3.83 3.85 2.46
## 10 0.31 Good H SI1 64 54 402 4.29 4.31 2.75
## # … with 6,506 more rows
## # A tibble: 6,516 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 2 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 3 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 4 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
## 5 0.3 Good J SI1 63.4 54 351 4.23 4.29 2.7
## 6 0.3 Good J SI1 63.8 56 351 4.23 4.26 2.71
## 7 0.3 Good I SI2 63.3 56 351 4.26 4.3 2.71
## 8 0.23 Good F VS1 58.2 59 402 4.06 4.08 2.37
## 9 0.23 Good E VS1 64.1 59 402 3.83 3.85 2.46
## 10 0.31 Good H SI1 64 54 402 4.29 4.31 2.75
## # … with 6,506 more rows
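The code diamonds %>% filter(cut == c("Fair","Good")) does not perform what we mean to do: == recycles the vector c("Fair","Good") along cut and compares the two element by element, so only part of the matching rows is kept. For an interesting discussion about the difference between == and %in% see here.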
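The difference can be seen on a small vector. The vector cut_values below is a hypothetical toy example, not part of the diamonds data:

```r
# Toy vector made up for this illustration
cut_values <- c("Fair", "Good", "Good", "Fair")

# == recycles c("Fair", "Good") and compares position by position:
# "Fair"=="Fair", "Good"=="Good", "Good"=="Fair", "Fair"=="Good"
cut_values == c("Fair", "Good")    # TRUE TRUE FALSE FALSE

# %in% checks each element against the whole set
cut_values %in% c("Fair", "Good")  # TRUE TRUE TRUE TRUE
```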
6.3.3 Verb 3: summarise
This verb can be used to compute summary statistics. For example the following code computes the mean and median of price.
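A sketch of the call, assuming the tidyverse is installed:

```r
library(tidyverse)  # attaches dplyr and ggplot2 (home of the diamonds data)

# Each argument collapses the column into a single value
diamonds %>% summarise(mean(price), median(price))
```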
## # A tibble: 1 x 2
## `mean(price)` `median(price)`
## <dbl> <dbl>
## 1 3933. 2401
The output is automatically labelled but it is also possible to specify different labels as follows:
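A sketch with explicit labels, using the names that appear in the output below and assuming the tidyverse is installed:

```r
library(tidyverse)  # attaches dplyr and ggplot2 (home of the diamonds data)

# name = expression sets the label of each summary column
diamonds %>%
  summarise(mean_price = mean(price),
            median_price = median(price))
```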
## # A tibble: 1 x 2
## mean_price median_price
## <dbl> <dbl>
## 1 3933. 2401
It is also possible to compute how many observations satisfy a certain condition (together with the corresponding proportion and percentage), e.g. the number of diamonds with a price > 15000$:
diamonds %>%
  summarise(veryexp = sum(price > 15000),             # number of diamonds above 15000
            veryexpprop = mean(price > 15000),        # proportion
            veryexpperc = mean(price > 15000) * 100)  # percentage
## # A tibble: 1 x 3
## veryexp veryexpprop veryexpperc
## <int> <dbl> <dbl>
## 1 1655 0.0307 3.07
It is also interesting to compute summary statistics conditionally on the categories of a qualitative variable (i.e. a factor). This can be done by combining the group_by function with the summarise function. Let’s compute for example the mean, min and max price conditionally on the categories of cut:
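A sketch of the grouped summary, assuming the tidyverse is installed:

```r
library(tidyverse)  # attaches dplyr and ggplot2 (home of the diamonds data)

# One row of summaries for each category of cut
diamonds %>%
  group_by(cut) %>%
  summarise(mean(price), min(price), max(price))
```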
## # A tibble: 5 x 4
## cut `mean(price)` `min(price)` `max(price)`
## <ord> <dbl> <int> <int>
## 1 Fair 4359. 337 18574
## 2 Good 3929. 327 18788
## 3 Very Good 3982. 336 18818
## 4 Premium 4584. 326 18823
## 5 Ideal 3458. 326 18806
In particular, group_by splits the original data set into different groups and for each of them the requested summary statistics are computed. In this case we obtain 5 values of the mean, min and max price, one for each category of the variable cut. It is also possible to condition on two different factors, such as for example cut and color:
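A sketch with two grouping factors, assuming the tidyverse is installed. Whether an informational message about the grouping of the result is printed depends on the dplyr version:

```r
library(tidyverse)  # attaches dplyr and ggplot2 (home of the diamonds data)

# One row for each combination of cut and color
diamonds %>%
  group_by(cut, color) %>%
  summarise(mean(price))
```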
## `summarise()` has grouped output by 'cut'. You can override using the `.groups` argument.
## # A tibble: 35 x 3
## # Groups: cut [5]
## cut color `mean(price)`
## <ord> <ord> <dbl>
## 1 Fair D 4291.
## 2 Fair E 3682.
## 3 Fair F 3827.
## 4 Fair G 4239.
## 5 Fair H 5136.
## 6 Fair I 4685.
## 7 Fair J 4976.
## 8 Good D 3405.
## 9 Good E 3424.
## 10 Good F 3496.
## # … with 25 more rows
In this case each category of cut is combined with each category of color and then the mean price is computed.
The function group_by can also be combined with summarise to compute a frequency distribution, by means of the n() function which gives the current group size:
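A sketch of the frequency table, using the column name AbsFreq that appears in the output below and assuming the tidyverse is installed:

```r
library(tidyverse)  # attaches dplyr and ggplot2 (home of the diamonds data)

# n() counts the rows of each group
diamonds %>%
  group_by(cut) %>%
  summarise(AbsFreq = n())
```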
## # A tibble: 5 x 2
## cut AbsFreq
## <ord> <int>
## 1 Fair 1610
## 2 Good 4906
## 3 Very Good 12082
## 4 Premium 13791
## 5 Ideal 21551
This is an alternative to the standard table function introduced in Lecture 3:
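The base R counterpart can be sketched as follows, assuming ggplot2 is installed since it provides the diamonds data:

```r
library(ggplot2)  # provides the diamonds data

# Absolute frequencies of the categories of cut
table(diamonds$cut)
```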
##
## Fair Good Very Good Premium Ideal
## 1610 4906 12082 13791 21551
Similarly, percentages can be easily computed from AbsFreq, since newly created quantities are immediately available for further computations:
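One possible way to obtain the percentages is sketched below, assuming the tidyverse is installed. Dividing by nrow(diamonds) is a choice made for this sketch; the column name Perc is taken from the output that follows.

```r
library(tidyverse)  # attaches dplyr and ggplot2 (home of the diamonds data)

# AbsFreq is created first and reused in the next argument
diamonds %>%
  group_by(cut) %>%
  summarise(AbsFreq = n(),
            Perc = AbsFreq / nrow(diamonds) * 100)
```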
## # A tibble: 5 x 3
## cut AbsFreq Perc
## <ord> <int> <dbl>
## 1 Fair 1610 2.98
## 2 Good 4906 9.10
## 3 Very Good 12082 22.4
## 4 Premium 13791 25.6
## 5 Ideal 21551 40.0
A shorter alternative for computing the frequency table is given by the count function, which lets you quickly count the unique values of one or more variables. Basically, df %>% count(a, b) is roughly equivalent to df %>% group_by(a, b) %>% summarise(n = n()). Here below we have an example:
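A sketch of the shortcut, assuming the tidyverse is installed:

```r
library(tidyverse)  # attaches dplyr and ggplot2 (home of the diamonds data)

# The counts are stored in a column named n by default
diamonds %>% count(cut)
```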
## # A tibble: 5 x 2
## cut n
## <ord> <int>
## 1 Fair 1610
## 2 Good 4906
## 3 Very Good 12082
## 4 Premium 13791
## 5 Ideal 21551
6.3.4 Verb 4: mutate
The verb mutate can be used to create a new column in the data frame. For example, let’s create a new column given by a categorical variable with two categories: “Yes” if the price is < 1000$, “No” otherwise. To do this we will make use of the ifelse function, already introduced during Lecture 2. We will call the new column newcol and will save the new data frame (with one more column) in a new object named newdiamonds:
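A sketch of this step, assuming the tidyverse is installed:

```r
library(tidyverse)  # attaches dplyr and ggplot2 (home of the diamonds data)

# ifelse(test, value_if_true, value_if_false) is applied row by row
newdiamonds <- diamonds %>%
  mutate(newcol = ifelse(price < 1000, "Yes", "No"))
```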
To derive the frequency distribution of newcol we proceed as described above, computing also percentages by means of mutate:
newdiamonds %>%
count(newcol) %>%
#summarise(perc=n/nrow(newdiamonds)*100)
mutate(perc=n/nrow(newdiamonds)*100)
## # A tibble: 2 x 3
## newcol n perc
## <chr> <int> <dbl>
## 1 No 39441 73.1
## 2 Yes 14499 26.9
6.3.5 Verb 5: arrange
The verb arrange can be used to sort observations with respect to the values of a given variable. For example, we can sort diamonds according to price (by default the ascending order is adopted):
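A sketch of the sorting step followed by tail, assuming the tidyverse is installed:

```r
library(tidyverse)  # attaches dplyr and ggplot2 (home of the diamonds data)

# Sort by increasing price and show the last 6 rows
diamonds %>%
  arrange(price) %>%
  tail()
```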
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24
## 2 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01
## 3 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
## 4 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
## 5 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04
## 6 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
With the function tail we have a preview of the 6 bottom lines which contain, in this case, the diamonds with the highest price.
If we need to use a descending ordering we will use the desc function inside arrange:
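A sketch of the descending sort, assuming the tidyverse is installed:

```r
library(tidyverse)  # attaches dplyr and ggplot2 (home of the diamonds data)

# desc() reverses the ordering, so the most expensive diamonds come first
diamonds %>% arrange(desc(price))
```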
## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
## 2 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04
## 3 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
## 4 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
## 5 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01
## 6 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24
## 7 2.04 Premium H SI1 58.1 60 18795 8.37 8.28 4.84
## 8 2 Premium I VS1 60.8 59 18795 8.13 8.02 4.91
## 9 1.71 Premium F VS2 62.3 59 18791 7.57 7.53 4.7
## 10 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
## # … with 53,930 more rows
Let’s now use the arrange function in order to sort the categories of a factor (cut) according to the corresponding frequencies:
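One way to obtain this table is sketched below, assuming the tidyverse is installed. The column name freq is taken from the output that follows:

```r
library(tidyverse)  # attaches dplyr and ggplot2 (home of the diamonds data)

# Count the rows of each category, then sort from most to least frequent
diamonds %>%
  group_by(cut) %>%
  summarise(freq = n()) %>%
  arrange(desc(freq))
```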
## # A tibble: 5 x 2
## cut freq
## <ord> <int>
## 1 Ideal 21551
## 2 Premium 13791
## 3 Very Good 12082
## 4 Good 4906
## 5 Fair 1610
6.4 Exercises Lecture 4
See Section 7.7.