Chapter 4 Lab 2 - 07/02/2023

In this lecture we will learn another R programming approach, based on the tidyverse package. It is an alternative to the base R code we learnt in the first lectures.

4.1 Tidyverse

Tidyverse is a collection of R packages designed for data science (see Figure 4.1). All the packages share an underlying design philosophy, grammar, and data structures. See here for more details.

Figure 4.1: Packages included in tidyverse

Tidyverse functions are often, though not always, faster than their base R counterparts, because many of them are implemented in a computationally efficient manner. They also have a more consistent syntax and are designed to work with data frames rather than with single vectors.

4.2 Install and load a package

Before using a package, two steps are necessary:

  1. install the package: this has to be done only once (unless you re-install R, or change or reset your computer). It is like buying a light bulb and installing it in the lamp, as described in Figure 4.2: you do this only once, not every time you need some light in your room. This step can be performed by using the RStudio menu, through Tools - Install Packages, as shown in Figure ??. Behind this menu shortcut, RStudio uses the install.packages function.

  2. load the package: this is like switching on the light once you have an installed light bulb, something that is done every time you need some light in the room (see Figure 4.2). Similarly, a package has to be loaded in every R session in which you need some of its functions. To load the tidyverse package we proceed as follows:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Figure 4.2: Install and load an R package
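
The two steps can also be written directly in code. The following is only a minimal sketch and not part of the original lab: the helper name is_installed is hypothetical, while requireNamespace, install.packages and library are standard R functions. The guard avoids re-installing a package that is already available.

```r
# Hypothetical helper: returns TRUE if the package is already installed.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Step 1 (only once): install the package if it is missing.
if (!is_installed("tidyverse")) {
  install.packages("tidyverse")
}

# Step 2 (in every new R session): load the package.
library(tidyverse)
```

Running the block a second time skips the installation and only loads the package.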

4.3 The pipe operator

Let’s consider a general R function named f with argument x. We usually use the following approach when we need to apply f:

f(x)

An alternative is given by the pipe operator %>%, which comes from the magrittr package and is made available when dplyr (or tidyverse) is loaded (see here for more details). It works as follows:

x %>% f() 
#this is equivalent to f(x)

Basically, the pipe tells R to pass x as the first argument of the function f. The RStudio keyboard shortcut for typing the pipe operator is Ctrl+Shift+M (Cmd+Shift+M on macOS).

We simulate a sample of data in order to run some simple examples with the pipe operator. By using the function sample we draw randomly (without replacement) 5 numbers between 1 and 20 (1:20).

set.seed(4)
x = sample(1:20, 5)
x
## [1] 11 19  3  7 12

We are now interested in computing the log transformation of the vector x. By adopting the standard R programming we would use:

log(x)
## [1] 2.397895 2.944439 1.098612 1.945910 2.484907

while with the pipe operator we have

x %>% log()
## [1] 2.397895 2.944439 1.098612 1.945910 2.484907
# the parentheses can also be omitted (x %>% log) when there are no other arguments

where x is taken as the first argument of the function log. It is also possible to include other arguments, such as the base of the logarithm (here equal to 5). Note that x %>% f(y) is equivalent to f(x, y).

#standard programming
log(x, base=5)
## [1] 1.4898961 1.8294828 0.6826062 1.2090620 1.5439593
#pipe based programming
x %>% log(base=5)
## [1] 1.4898961 1.8294828 0.6826062 1.2090620 1.5439593

We now want to apply the log transformation and then round the corresponding output to 2 digits. This requires the use of two functions (log and round). In general, when we apply 3 functions (f, then g and finally h), x %>% f %>% g %>% h is equivalent to h(g(f(x))).

# standard programming
round(log(x), 2)
## [1] 2.40 2.94 1.10 1.95 2.48
# pipe based programming
x %>% log %>% round(2)
## [1] 2.40 2.94 1.10 1.95 2.48

We now add a new function: after rounding the log output we compute the sum of the 5 numbers

# standard programming
sum(round(log(x),2))
## [1] 10.87
# pipe based programming
x %>% log %>% round(2) %>% sum 
## [1] 10.87

We now want to use the sum as the base of the log transformation of the number 5:

# standard programming
log(5, base = sum(round(log(x),2)))
## [1] 0.674532
# pipe based programming
x %>% log %>% round(2) %>% sum %>% log(5,base=.) 
## [1] 0.674532

The symbol . is the placeholder and is used when the output of the previous pipe should not be used as the first input of the following function. In general, x %>% f(y, z = .) is equivalent to f(y, z = x).
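
As a further illustration of the placeholder, the sketch below uses the base R function seq. The object names by_two and from_two are made up for this example; dplyr is loaded only because it makes %>% available.

```r
library(dplyr)  # makes the %>% operator available

# With the placeholder, 2 is used as the `by` argument:
# this is the same as seq(1, 10, by = 2).
by_two <- 2 %>% seq(1, 10, by = .)
by_two

# Without the placeholder, 2 becomes the first argument:
# this is the same as seq(2, 10).
from_two <- 2 %>% seq(10)
from_two
```

The first call returns the odd numbers from 1 to 9, the second the integers from 2 to 10.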

The pipe is not convenient in the following cases:

  • when the pipes are longer than 10 steps. In this case the suggestion is to create intermediate objects with meaningful names (that can help understanding what the code does);
  • when you have multiple inputs or outputs (e.g. when there is no primary object being transformed but two or more objects being combined together).
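
The first suggestion can be sketched with the same operations used above. The chain here is deliberately short, and the names of the intermediate objects (log_x, log_x_rounded, total_log) are only illustrative choices.

```r
library(dplyr)  # makes the %>% operator available

set.seed(4)
x <- sample(1:20, 5)

# Intermediate objects with meaningful names, one step at a time.
log_x <- log(x)
log_x_rounded <- round(log_x, 2)
total_log <- sum(log_x_rounded)

# The same computation written as a single pipe.
total_log_piped <- x %>% log() %>% round(2) %>% sum()

# Both approaches give the same result.
all.equal(total_log, total_log_piped)
```

With a long sequence of steps, the named objects make it easier to inspect each intermediate result.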

4.4 dplyr verbs

dplyr (a package in the tidyverse collection) is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

  • select: pick variables (columns) based on their names
  • filter: pick observations (rows) based on their values
  • mutate: add new variables that are functions of existing variables
  • summarise: reduce multiple values down to a single summary (e.g. mean)
  • arrange: change the ordering of the rows

All verbs work similarly:

  • the first argument is a data frame;
  • the subsequent arguments describe what to do with the data frame using the variable names (without quotes);
  • the result is a new data frame.
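
These three points can be checked on a tiny made-up data frame (the object toy and its columns are invented for this sketch and are not part of the lab data). The verb filter is used here only as an example; it is described in detail later.

```r
library(dplyr)

# Made-up data, only for illustration.
toy <- tibble(id = 1:4, value = c(10, 20, 30, 40))

# First argument: a data frame. Second argument: a condition
# written with the column name, without quotes.
kept <- filter(toy, value > 15)

# The same call written with the pipe.
kept_piped <- toy %>% filter(value > 15)

# The result is a new data frame; the original one is unchanged.
nrow(kept)
nrow(toy)
```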

In the following we illustrate the dplyr verbs by considering the diamonds dataset (included in the ggplot2 package), which contains the prices and other attributes of almost 54,000 diamonds (see ?diamonds). If we use the function class to understand the nature of diamonds we get the following output:

class(diamonds)
## [1] "tbl_df"     "tbl"        "data.frame"

The term tbl (tibble) denotes the tidyverse version of a classical R data frame. Tibbles are very similar to data frames (they just contain and display more information) and are designed to be used with the tidyverse syntax style.
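
A classical data frame can be converted into a tibble with as_tibble. The short sketch below uses a made-up data frame (df and its columns are invented for illustration) to compare the two classes.

```r
library(dplyr)

# A classical data frame with made-up values.
df <- data.frame(id = 1:3, label = c("a", "b", "c"))

# The corresponding tibble.
tb <- as_tibble(df)

class(df)
class(tb)

# A tibble is still a data frame.
is.data.frame(tb)
```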

To get the list and the type of variables included in diamonds we can use the standard str or the corresponding tidyverse function named glimpse:

str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
glimpse(diamonds)
## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…

4.4.1 Verb 1: select

This verb is used to select some of the columns by name. For example, to select the variable named carat we use the following code:

diamonds %>% select(carat) 
## # A tibble: 53,940 × 1
##    carat
##    <dbl>
##  1  0.23
##  2  0.21
##  3  0.23
##  4  0.29
##  5  0.31
##  6  0.24
##  7  0.24
##  8  0.26
##  9  0.22
## 10  0.23
## # … with 53,930 more rows

Selecting more than one column is very simple:

diamonds %>% select(carat, cut, color, price)
## # A tibble: 53,940 × 4
##    carat cut       color price
##    <dbl> <ord>     <ord> <int>
##  1  0.23 Ideal     E       326
##  2  0.21 Premium   E       326
##  3  0.23 Good      E       327
##  4  0.29 Premium   I       334
##  5  0.31 Good      J       335
##  6  0.24 Very Good J       336
##  7  0.24 Very Good I       336
##  8  0.26 Very Good H       337
##  9  0.22 Fair      E       337
## 10  0.23 Very Good H       338
## # … with 53,930 more rows

Given that carat, cut and color are consecutive, the following code is also possible:

diamonds %>% select(carat : color, price)
## # A tibble: 53,940 × 4
##    carat cut       color price
##    <dbl> <ord>     <ord> <int>
##  1  0.23 Ideal     E       326
##  2  0.21 Premium   E       326
##  3  0.23 Good      E       327
##  4  0.29 Premium   I       334
##  5  0.31 Good      J       335
##  6  0.24 Very Good J       336
##  7  0.24 Very Good I       336
##  8  0.26 Very Good H       337
##  9  0.22 Fair      E       337
## 10  0.23 Very Good H       338
## # … with 53,930 more rows

Moreover, it is also possible to select all the columns whose name starts with the letter “c” by using starts_with combined with the select function:

diamonds %>% select(starts_with("c"))
## # A tibble: 53,940 × 4
##    carat cut       color clarity
##    <dbl> <ord>     <ord> <ord>  
##  1  0.23 Ideal     E     SI2    
##  2  0.21 Premium   E     SI1    
##  3  0.23 Good      E     VS1    
##  4  0.29 Premium   I     VS2    
##  5  0.31 Good      J     SI2    
##  6  0.24 Very Good J     VVS2   
##  7  0.24 Very Good I     VVS1   
##  8  0.26 Very Good H     SI1    
##  9  0.22 Fair      E     VS2    
## 10  0.23 Very Good H     VS1    
## # … with 53,930 more rows

It is also possible to specify a criterion which excludes from the selection some variables. For example, to select all the columns but carat we use the - symbol:

diamonds %>% select(-carat)
## # A tibble: 53,940 × 9
##    cut       color clarity depth table price     x     y     z
##    <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10 Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # … with 53,930 more rows

We proceed similarly to select all the columns but not the ones with a name starting with “c”:

diamonds %>% select(- starts_with("c"))
## # A tibble: 53,940 × 6
##    depth table price     x     y     z
##    <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  61.5    55   326  3.95  3.98  2.43
##  2  59.8    61   326  3.89  3.84  2.31
##  3  56.9    65   327  4.05  4.07  2.31
##  4  62.4    58   334  4.2   4.23  2.63
##  5  63.3    58   335  4.34  4.35  2.75
##  6  62.8    57   336  3.94  3.96  2.48
##  7  62.3    57   336  3.95  3.98  2.47
##  8  61.9    55   337  4.07  4.11  2.53
##  9  65.1    61   337  3.87  3.78  2.49
## 10  59.4    61   338  4     4.05  2.39
## # … with 53,930 more rows
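
The selection forms seen so far can be compared side by side on a one-row tibble. The object toy is made up for this sketch; only its column names matter, and names() is used to show which columns are kept.

```r
library(dplyr)

# Made-up one-row tibble with some of the diamonds column names.
toy <- tibble(carat = 0.23, cut = "Ideal", color = "E",
              depth = 61.5, price = 326L)

# A range of consecutive columns plus one more column.
range_names <- toy %>% select(carat:color, price) %>% names()

# Columns whose name starts with "c".
c_names <- toy %>% select(starts_with("c")) %>% names()

# All columns except carat.
no_carat_names <- toy %>% select(-carat) %>% names()

# All columns except those starting with "c".
no_c_names <- toy %>% select(-starts_with("c")) %>% names()

range_names
c_names
no_carat_names
no_c_names
```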

Another useful application is the selection of only the numerical variables. This can be performed by using the select_if function combined with is.numeric (the latter tests whether an object is of numeric type):

diamonds %>% 
  select_if(is.numeric)
## # A tibble: 53,940 × 7
##    carat depth table price     x     y     z
##    <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23  61.5    55   326  3.95  3.98  2.43
##  2  0.21  59.8    61   326  3.89  3.84  2.31
##  3  0.23  56.9    65   327  4.05  4.07  2.31
##  4  0.29  62.4    58   334  4.2   4.23  2.63
##  5  0.31  63.3    58   335  4.34  4.35  2.75
##  6  0.24  62.8    57   336  3.94  3.96  2.48
##  7  0.24  62.3    57   336  3.95  3.98  2.47
##  8  0.26  61.9    55   337  4.07  4.11  2.53
##  9  0.22  65.1    61   337  3.87  3.78  2.49
## 10  0.23  59.4    61   338  4     4.05  2.39
## # … with 53,930 more rows
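
In recent versions of dplyr (1.0.0 or later, which is an assumption about the installed version) select_if is marked as superseded, and the same selection can be written with select combined with where. The sketch below compares the two forms on a made-up tibble.

```r
library(dplyr)

# Made-up data, only for illustration.
toy <- tibble(carat = c(0.23, 0.21),
              cut = c("Ideal", "Premium"),
              price = c(326L, 326L))

# The form used in this section.
old_style <- toy %>% select_if(is.numeric)

# The equivalent form based on where().
new_style <- toy %>% select(where(is.numeric))

names(old_style)
names(new_style)
```

Both calls keep the same columns (carat and price); select_if still works, so either form can be used.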

The returned output can then be used to compute jointly all the averages of the numerical variables by using the apply function (see ??):

diamonds %>% 
  select_if(is.numeric) %>% 
  apply(2, mean)
##        carat        depth        table        price            x            y 
##    0.7979397   61.7494049   57.4571839 3932.7997219    5.7311572    5.7345260 
##            z 
##    3.5387338
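
As an alternative to apply, the same averages can be obtained with the summarise verb combined with across. This is only a sketch on made-up data, and it assumes dplyr 1.0.0 or later, where across and where are available.

```r
library(dplyr)

# Made-up data, only for illustration.
toy <- tibble(carat = c(0.2, 0.4),
              cut = c("Ideal", "Premium"),
              price = c(300, 500))

# Mean of every numeric column, returned as a one-row tibble.
means <- toy %>% summarise(across(where(is.numeric), mean))
means

# The apply-based version returns a named numeric vector instead.
means_apply <- toy %>% select(where(is.numeric)) %>% apply(2, mean)
means_apply
```

The main difference is the type of the output: a tibble in the first case, a named vector in the second.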

4.4.2 Verb 2: filter

The verb filter can be used to select the observations satisfying some criterion. Consider, for example, the selection of the diamonds with variable cut equal to the category “Premium”:

diamonds %>% filter(cut == "Premium")
## # A tibble: 13,791 × 10
##    carat cut     color clarity depth table price     x     y     z
##    <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31
##  2  0.29 Premium I     VS2      62.4    58   334  4.2   4.23  2.63
##  3  0.22 Premium F     SI1      60.4    61   342  3.88  3.84  2.33
##  4  0.2  Premium E     SI2      60.2    62   345  3.79  3.75  2.27
##  5  0.32 Premium E     I1       60.9    58   345  4.38  4.42  2.68
##  6  0.24 Premium I     VS1      62.5    57   355  3.97  3.94  2.47
##  7  0.29 Premium F     SI1      62.4    58   403  4.24  4.26  2.65
##  8  0.22 Premium E     VS2      61.6    58   404  3.93  3.89  2.41
##  9  0.22 Premium D     VS2      59.3    62   404  3.91  3.88  2.31
## 10  0.3  Premium J     SI2      59.3    61   405  4.43  4.38  2.61
## # … with 13,781 more rows

It is also possible to include more conditions. For example, select the diamonds with cut equal to “Premium” AND color equal to “D” (the AND can be specified by using & or a comma):

diamonds %>% filter(cut == "Premium" & color == "D") 
## # A tibble: 1,603 × 10
##    carat cut     color clarity depth table price     x     y     z
##    <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.22 Premium D     VS2      59.3    62   404  3.91  3.88  2.31
##  2  0.3  Premium D     SI1      62.6    59   552  4.23  4.27  2.66
##  3  0.71 Premium D     SI2      61.7    59  2768  5.71  5.67  3.51
##  4  0.71 Premium D     VS2      62.5    60  2770  5.65  5.61  3.52
##  5  0.7  Premium D     VS2      58      62  2773  5.87  5.78  3.38
##  6  0.72 Premium D     SI1      62.7    59  2782  5.73  5.69  3.58
##  7  0.7  Premium D     SI1      62.8    60  2782  5.68  5.66  3.56
##  8  0.72 Premium D     SI2      62      60  2795  5.73  5.69  3.54
##  9  0.71 Premium D     SI1      62.7    60  2797  5.67  5.71  3.57
## 10  0.71 Premium D     SI1      61.3    58  2797  5.73  5.75  3.52
## # … with 1,593 more rows
diamonds %>% filter(cut == "Premium" , color == "D") 
## # A tibble: 1,603 × 10
##    carat cut     color clarity depth table price     x     y     z
##    <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.22 Premium D     VS2      59.3    62   404  3.91  3.88  2.31
##  2  0.3  Premium D     SI1      62.6    59   552  4.23  4.27  2.66
##  3  0.71 Premium D     SI2      61.7    59  2768  5.71  5.67  3.51
##  4  0.71 Premium D     VS2      62.5    60  2770  5.65  5.61  3.52
##  5  0.7  Premium D     VS2      58      62  2773  5.87  5.78  3.38
##  6  0.72 Premium D     SI1      62.7    59  2782  5.73  5.69  3.58
##  7  0.7  Premium D     SI1      62.8    60  2782  5.68  5.66  3.56
##  8  0.72 Premium D     SI2      62      60  2795  5.73  5.69  3.54
##  9  0.71 Premium D     SI1      62.7    60  2797  5.67  5.71  3.57
## 10  0.71 Premium D     SI1      61.3    58  2797  5.73  5.75  3.52
## # … with 1,593 more rows
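
The equivalence between & and the comma can be verified on a small made-up tibble (toy and its values are invented for this sketch):

```r
library(dplyr)

# Made-up data, only for illustration.
toy <- tibble(cut = c("Premium", "Premium", "Ideal", "Premium"),
              color = c("D", "E", "D", "D"),
              price = c(404, 326, 552, 2768))

# AND written with &.
with_and <- toy %>% filter(cut == "Premium" & color == "D")

# AND written with a comma between the conditions.
with_comma <- toy %>% filter(cut == "Premium", color == "D")

# Both keep the same rows (the first and the last one).
with_and$price
with_comma$price
```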

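If we are interested in filtering the diamonds with a price between 500 and 600 dollars, the following two approaches can be adopted. Note that they are not exactly equivalent: the first one uses strict inequalities and excludes the prices equal to 500 or 600, while between(price, 500, 600) includes both endpoints (it corresponds to price >= 500 & price <= 600). This is why the two outputs have a different number of rows (2,360 and 2,407):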

diamonds %>% filter(price > 500 & price < 600)
## # A tibble: 2,360 × 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.35 Ideal     I     VS1      60.9  57     552  4.54  4.59  2.78
##  2  0.3  Premium   D     SI1      62.6  59     552  4.23  4.27  2.66
##  3  0.3  Ideal     D     SI1      62.5  57     552  4.29  4.32  2.69
##  4  0.3  Ideal     D     SI1      62.1  56     552  4.3   4.33  2.68
##  5  0.42 Premium   I     SI2      61.5  59     552  4.78  4.84  2.96
##  6  0.28 Ideal     G     VVS2     61.4  56     553  4.19  4.22  2.58
##  7  0.32 Ideal     I     VVS1     62    55.3   553  4.39  4.42  2.73
##  8  0.31 Very Good G     SI1      63.3  57     553  4.33  4.3   2.73
##  9  0.31 Premium   G     SI1      61.8  58     553  4.35  4.32  2.68
## 10  0.24 Premium   E     VVS1     60.7  58     553  4.01  4.03  2.44
## # … with 2,350 more rows
diamonds %>% filter(between(price, 500,600))
## # A tibble: 2,407 × 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.35 Ideal     I     VS1      60.9  57     552  4.54  4.59  2.78
##  2  0.3  Premium   D     SI1      62.6  59     552  4.23  4.27  2.66
##  3  0.3  Ideal     D     SI1      62.5  57     552  4.29  4.32  2.69
##  4  0.3  Ideal     D     SI1      62.1  56     552  4.3   4.33  2.68
##  5  0.42 Premium   I     SI2      61.5  59     552  4.78  4.84  2.96
##  6  0.28 Ideal     G     VVS2     61.4  56     553  4.19  4.22  2.58
##  7  0.32 Ideal     I     VVS1     62    55.3   553  4.39  4.42  2.73
##  8  0.31 Very Good G     SI1      63.3  57     553  4.33  4.3   2.73
##  9  0.31 Premium   G     SI1      61.8  58     553  4.35  4.32  2.68
## 10  0.24 Premium   E     VVS1     60.7  58     553  4.01  4.03  2.44
## # … with 2,397 more rows
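
How between treats the endpoints can be checked on a few made-up prices placed around the two limits (the tibble toy is invented for this sketch):

```r
library(dplyr)

# Made-up prices just below, on and just above the two limits.
toy <- tibble(price = c(499, 500, 550, 600, 601))

# Strict inequalities: the limits 500 and 600 are excluded.
strict <- toy %>% filter(price > 500 & price < 600)

# between() includes both limits.
inclusive <- toy %>% filter(between(price, 500, 600))

# Same rows as between(), written with >= and <=.
closed <- toy %>% filter(price >= 500 & price <= 600)

strict$price
inclusive$price
closed$price
```

The strict version keeps only 550, while the other two keep 500, 550 and 600.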

The output of a selection can always be saved in a new data frame as follows:

out = diamonds %>% filter(between(price, 500,600))

To select diamonds with cut equal to “Fair” OR “Good” we can use the | operator or the %in% operator:

diamonds %>% filter(cut=="Fair" | cut=="Good")
## # A tibble: 6,516 × 10
##    carat cut   color clarity depth table price     x     y     z
##    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Good  E     VS1      56.9    65   327  4.05  4.07  2.31
##  2  0.31 Good  J     SI2      63.3    58   335  4.34  4.35  2.75
##  3  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
##  4  0.3  Good  J     SI1      64      55   339  4.25  4.28  2.73
##  5  0.3  Good  J     SI1      63.4    54   351  4.23  4.29  2.7 
##  6  0.3  Good  J     SI1      63.8    56   351  4.23  4.26  2.71
##  7  0.3  Good  I     SI2      63.3    56   351  4.26  4.3   2.71
##  8  0.23 Good  F     VS1      58.2    59   402  4.06  4.08  2.37
##  9  0.23 Good  E     VS1      64.1    59   402  3.83  3.85  2.46
## 10  0.31 Good  H     SI1      64      54   402  4.29  4.31  2.75
## # … with 6,506 more rows
diamonds %>% filter(cut %in% c("Fair","Good"))
## # A tibble: 6,516 × 10
##    carat cut   color clarity depth table price     x     y     z
##    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Good  E     VS1      56.9    65   327  4.05  4.07  2.31
##  2  0.31 Good  J     SI2      63.3    58   335  4.34  4.35  2.75
##  3  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
##  4  0.3  Good  J     SI1      64      55   339  4.25  4.28  2.73
##  5  0.3  Good  J     SI1      63.4    54   351  4.23  4.29  2.7 
##  6  0.3  Good  J     SI1      63.8    56   351  4.23  4.26  2.71
##  7  0.3  Good  I     SI2      63.3    56   351  4.26  4.3   2.71
##  8  0.23 Good  F     VS1      58.2    59   402  4.06  4.08  2.37
##  9  0.23 Good  E     VS1      64.1    59   402  3.83  3.85  2.46
## 10  0.31 Good  H     SI1      64      54   402  4.29  4.31  2.75
## # … with 6,506 more rows

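The code diamonds %>% filter(cut == c("Fair","Good")) does not do what we want: == compares cut with the vector c("Fair","Good") element by element, recycling the shorter vector, so only some of the matching rows are returned. For an interesting discussion about the difference between == and %in% see here.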
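
The difference can be seen on a four-row made-up tibble (toy is invented for this sketch) in which every row is either “Fair” or “Good”:

```r
library(dplyr)

# Made-up data: all four rows should be selected.
toy <- tibble(cut = c("Good", "Fair", "Fair", "Good"))

# == recycles c("Fair", "Good") along the column, so the comparison is
# Good == Fair, Fair == Good, Fair == Fair, Good == Good.
with_equal <- toy %>% filter(cut == c("Fair", "Good"))

# %in% checks each value of cut against the whole set.
with_in <- toy %>% filter(cut %in% c("Fair", "Good"))

nrow(with_equal)
nrow(with_in)
```

Only the last two rows pass the == comparison, while %in% keeps all four.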

4.5 Exercises Lab 2

4.5.1 Exercise 1

Write the code for the following operations:

  1. Simulate a vector of 15 values from the continuous Uniform distribution defined between 0 and 10 (see ?runif). Set the seed equal to 99.
  2. Round the numbers in the vector to two digits.
  3. Sort the values in the vector in descending order (see ?sort)
  4. Display a few of the largest values with head().

Use both the standard R programming approach and the more modern approach based on the use of the pipe %>% (remember to load the tidyverse library).

4.5.2 Exercise 2

Consider the mtcars data set which is available in R. Type the following code to explore the variables included in the data set (for an explanation of the variables see ?mtcars):

library(tidyverse)
glimpse(mtcars)
## Rows: 32
## Columns: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…

Use the following code to create a new variable named car_model that contains the names of the cars, now available as row names.

mtcars = mtcars %>%
        rownames_to_column("car_model")
glimpse(mtcars)
## Rows: 32
## Columns: 12
## $ car_model <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive"…
## $ mpg       <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, …
## $ cyl       <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, …
## $ disp      <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.…
## $ hp        <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180…
## $ drat      <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, …
## $ wt        <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.15…
## $ qsec      <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.9…
## $ vs        <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, …
## $ am        <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, …
## $ gear      <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, …
## $ carb      <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, …

  1. How many observations and variables are available?

  2. Print (on the screen) the hp variable using the select() function. Try also to use the pull() function. What is the difference?

  3. Print out all but the hp column using the select() function.

  4. Print out the following variables: mpg, hp, vs, am, gear. Suggestion: use : if necessary.

  5. Select all the observations which have mpg>20 AND hp>100. How many observations do you select?

  6. Select all the observations which have mpg>20 OR hp>100. How many observations do you select? Suggestion: the OR operator is implemented with |.

  7. Select all the observations which have exactly 6 cylinders. How many observations do you select? Suggestion: the equality operator is ==.