5.2 Essential tibble commands
Whenever working with rectangular data structures — data consisting of multiple cases (rows) and variables (columns) — our first step (in a tidyverse context) is to create or transform the data into a tibble.
A tibble is a rectangular data table and a modern, simpler version of the data.frame construct in R.
As the tibble package (Müller & Wickham, 2020) is part of the core tidyverse, we can load it as follows:
library(tidyverse)
Tibbles vs. data frames
Whereas data frames were present in base R in the 1990s (and a part of the S programming language on which R is based), tibbles first appeared in the dplyr package and in the form of an R package tibble (v1.0) in 2016. Nevertheless, tibbles essentially are simpler data frames. In contrast to the base R behavior of data frames, turning data into tibbles is more restrictive. Specifically,
- tibbles do not change the types of input variables (e.g., strings are not converted to factors);
- tibbles do not change the names of input variables and do not use row names.
Tibbles also allow for non-syntactic variable (column) names. For instance, in contrast to the default behavior of data frames, the variable names in tibbles can start with a number or contain spaces:
tb <- tibble(
  `1 age` = c(20, 33, 28, 23),
  ` sex` = c("m", "f", "f", "m")
)
tb
#> # A tibble: 4 × 2
#> `1 age` ` sex`
#> <dbl> <chr>
#> 1 20 m
#> 2 33 f
#> 3 28 f
#> 4 23 m
To refer to these names, they need to be enclosed in backticks (` `). For instance:
# Refer to non-syntactic names:
tb$`1 age`
#> [1] 20 33 28 23
tb$` sex`[2]
#> [1] "f"
tb %>%
  filter(`1 age` > 20) %>%
  arrange(`1 age`)
#> # A tibble: 3 × 2
#> `1 age` ` sex`
#> <dbl> <chr>
#> 1 23 m
#> 2 28 f
#> 3 33 f
The differences between tibbles and data frames are not important to us at this point. (The main differences concern printing, subsetting, and the recycling behavior of vector elements when creating tibbles. See vignette("tibble") for the details.) For our present purposes, tibbles are preferable to data frames, as they are easier to understand, easier to manipulate, and reduce the chance of unpleasant surprises.
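As a brief illustration of two of these differences (name handling and recycling), the following sketch contrasts data.frame() with tibble(). The object names df_names, tb_names, df_rec, and tb_rec are hypothetical and only used for this illustration:

```r
library(tibble)

# (a) Names: data.frame() repairs non-syntactic names by default
#     (check.names = TRUE), whereas tibble() keeps them as entered:
df_names <- data.frame(`1 age` = c(20, 33, 28, 23))
tb_names <- tibble(`1 age` = c(20, 33, 28, 23))
names(df_names)
#> [1] "X1.age"
names(tb_names)
#> [1] "1 age"

# (b) Recycling: data.frame() recycles a shorter vector whose length
#     divides the number of rows, whereas tibble() only recycles
#     vectors of length 1 and signals an error otherwise:
df_rec <- data.frame(x = 1:4, y = 1:2)  # y is recycled to 1, 2, 1, 2
tb_rec <- try(tibble(x = 1:4, y = 1:2), silent = TRUE)
inherits(tb_rec, "try-error")
#> [1] TRUE
```

The stricter behavior of tibble() is an example of how tibbles reduce the chance of unpleasant surprises: a length mismatch is reported immediately rather than silently filled in.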
Creating tibbles
The question "How can we create tibbles?" is more relevant to our concerns. This chapter covers three ways of creating tibbles:
- as_tibble() converts (or “coerces”) an existing rectangle of data (e.g., a data frame) into a tibble.
- tibble() converts several vectors into (the columns of) a tibble.
- tribble() converts a table (entered row-by-row) into a tibble.
We will illustrate each of these commands in the following sections.
Practice
Before we start creating tibbles, inspect the 3-item list of commands more closely.
The three commands yield the same type of output (i.e., a data table of the tibble variety), but require different inputs.
- Which kind of input(s) does each command expect and how do these inputs need to be structured and formatted (e.g., do they contain parentheses, commas, etc.)?
5.2.1 as_tibble() (from rectangles)
We use the as_tibble() function to create a tibble from data that already is in rectangular format (e.g., a data frame or matrix).
- Starting from a data frame:
## Using the data frame `sleep`: ------
# ?datasets::sleep  # provides background information on the data set.
df <- datasets::sleep  # copy

# Convert df into a tibble (tb):
tb <- as_tibble(df)
As always, we can apply some standard functions for inspecting df and tb:
# Inspect the data frame df: ----
dim(df)
#> [1] 20 3
is.data.frame(df)
#> [1] TRUE
head(df)
#> extra group ID
#> 1 0.7 1 1
#> 2 -1.6 1 2
#> 3 -0.2 1 3
#> 4 -1.2 1 4
#> 5 -0.1 1 5
#> 6 3.4 1 6
str(df)
#> 'data.frame': 20 obs. of 3 variables:
#> $ extra: num 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
#> $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
#> $ ID : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
As tibbles are data frames, we can use the same commands on tb:
# Inspect the tibble tb: ----
dim(tb)
#> [1] 20 3
is.data.frame(tb) # => tibbles ARE data frames.
#> [1] TRUE
head(tb)
#> # A tibble: 6 × 3
#> extra group ID
#> <dbl> <fct> <fct>
#> 1 0.7 1 1
#> 2 -1.6 1 2
#> 3 -0.2 1 3
#> 4 -1.2 1 4
#> 5 -0.1 1 5
#> 6 3.4 1 6
str(tb)
#> tibble [20 × 3] (S3: tbl_df/tbl/data.frame)
#> $ extra: num [1:20] 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
#> $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
#> $ ID : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
However, when working with tibbles, we can use some additional commands (is_tibble() replaces the deprecated is.tibble()):
is_tibble(tb)
#> [1] TRUE
glimpse(tb)
#> Rows: 20
#> Columns: 3
#> $ extra <dbl> 0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0, 1.9, 0.8, …
#> $ group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
#> $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
The most obvious advantage of a tibble is that it can simply be printed to see the most important information about a table of data: its dimensions, the types of its variables (columns), and the values of its first rows:
tb
#> # A tibble: 20 × 3
#> extra group ID
#> <dbl> <fct> <fct>
#> 1 0.7 1 1
#> 2 -1.6 1 2
#> 3 -0.2 1 3
#> 4 -1.2 1 4
#> 5 -0.1 1 5
#> 6 3.4 1 6
#> 7 3.7 1 7
#> 8 0.8 1 8
#> 9 0 1 9
#> 10 2 1 10
#> 11 1.9 2 1
#> 12 0.8 2 2
#> 13 1.1 2 3
#> 14 0.1 2 4
#> 15 -0.1 2 5
#> 16 4.4 2 6
#> 17 5.5 2 7
#> 18 1.6 2 8
#> 19 4.6 2 9
#> 20 3.4 2 10
- Starting from an existing matrix:
# create a 5 x 4 matrix of random numbers:
mx <- matrix(rnorm(n = 20, mean = 100, sd = 10), nrow = 5, ncol = 4)
mx
#> [,1] [,2] [,3] [,4]
#> [1,] 118.95193 93.60005 105.04955 100.36123
#> [2,] 95.69531 104.55450 82.82991 102.05999
#> [3,] 97.42731 107.04837 92.15541 96.38943
#> [4,] 82.36837 110.35104 91.49092 107.58163
#> [5,] 104.60097 93.91074 75.85792 92.73295
As matrices are rectangular data structures, coercing a matrix into a tibble also works with the as_tibble() command:
tx <- as_tibble(mx)
tx
#> # A tibble: 5 × 4
#> V1 V2 V3 V4
#> <dbl> <dbl> <dbl> <dbl>
#> 1 119. 93.6 105. 100.
#> 2 95.7 105. 82.8 102.
#> 3 97.4 107. 92.2 96.4
#> 4 82.4 110. 91.5 108.
#> 5 105. 93.9 75.9 92.7
Note that, whereas the matrix mx contained no column names, the corresponding tibble tx contains default variable names:
names(tx)
#> [1] "V1" "V2" "V3" "V4"
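Depending on the installed version of tibble, converting a matrix without column names may issue a warning about name repair. One way to avoid relying on default names is to name the columns of the matrix before converting it. A minimal sketch (the seed, the column names a to d, and the object names mx2 and tx2 are assumptions made for this illustration):

```r
library(tibble)

set.seed(123)  # arbitrary seed, only for reproducibility
mx2 <- matrix(rnorm(n = 20, mean = 100, sd = 10), nrow = 5, ncol = 4)

# Assign column names to the matrix before converting it:
colnames(mx2) <- c("a", "b", "c", "d")
tx2 <- as_tibble(mx2)

names(tx2)
#> [1] "a" "b" "c" "d"
dim(tx2)
#> [1] 5 4
```

The as_tibble() function also has a .name_repair argument for controlling how missing or problematic names are handled; see ?as_tibble for the options available in your version.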
Practice
- Convert some other R datasets (e.g., datasets::attitude, datasets::mtcars, and datasets::Orange) into tibbles and inspect their dimensions and contents.
  - What types of variables (columns) do they contain?
  - What is the basic unit of an observation (row)?
- Obtain the same information that you get by printing a tibble tb (i.e., its dimensions, types of variables, and values of the first rows) about some data frame df. How many commands do you need?
# data:
df <- datasets::mtcars
tb <- as_tibble(df)

tb  # print tibble

dim(df)  # provides dimensions
str(df)  # provides types of variables
df       # provides variables and values
For relatively small data tables, using one versus several short commands may seem comparable. But for larger data sets, using tibbles is much more convenient.
5.2.2 tibble() (from columns/vectors)
How can we create a tibble when we do not yet have a rectangular data structure? A common case of this type is that we have several vectors (i.e., linear data structures) and want to combine them into a tibble (i.e., tabular data structure). Importantly, the vectors will become the variables (columns) of our tibble.
Use the tibble() function when the data to be turned into a tibble appears as a collection of vectors (which will become the tibble’s columns). For instance, imagine we wanted to create a tibble that stores the following information about a family:
id | name | age | gender | drives | married_2 |
---|---|---|---|---|---|
1 | Adam | 46 | male | TRUE | Eva |
2 | Eva | 48 | female | TRUE | Adam |
3 | Xaxi | 21 | female | FALSE | Zenon |
4 | Yota | 19 | female | TRUE | NA |
5 | Zack | 17 | male | FALSE | NA |
One way of viewing this table is as a series of variables that are the columns of the table (rather than its rows). Each column consists of a variable name and the same number of (here: 5) values, which can be of different types (here: numbers, characters, or Boolean truth values). Each column may or may not contain missing values (entered as NA).
The tibble() function expects that each column of the table is entered as a vector:
## Create a tibble from vectors (column-by-column):
fm <- tibble(
  id = c(1, 2, 3, 4, 5),  # OR: id = 1:5,
  name = c("Adam", "Eva", "Xaxi", "Yota", "Zack"),
  age = c(46, 48, 21, 19, 17),
  gender = c("male", rep("female", 3), "male"),
  drives = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  married_2 = c("Eva", "Adam", "Zenon", NA, NA)
)

fm  # prints the tibble:
#> # A tibble: 5 × 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zenon
#> 4 4 Yota 19 female TRUE <NA>
#> 5 5 Zack 17 male FALSE <NA>
Note some details:
- Each vector is labeled by the variable (column) name, which is not put into quotes;
- Avoid spaces within variable (column) names (or enclose names in backticks if you really must use spaces);
- All vectors need to have the same length (only vectors of length 1 are recycled);
- Each vector is of a single type (numeric, character, logical, or a categorical factor);
- Consecutive vectors are separated by commas (but there is no comma after the final vector).
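Two of these details can be illustrated with a small sketch: a column can be a categorical factor, and a vector of length 1 is recycled to the length of the other columns. The tibble pets and its contents are hypothetical and only serve as an example:

```r
library(tibble)

pets <- tibble(
  name  = c("Rex", "Tom", "Kim"),
  kind  = factor(c("dog", "cat", "dog")),  # a categorical factor
  owner = "Adam"                           # length 1: recycled to all rows
)

dim(pets)
#> [1] 3 3
pets$owner
#> [1] "Adam" "Adam" "Adam"
levels(pets$kind)
#> [1] "cat" "dog"
```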
A neat feature of using tibble() for creating a new tibble is that later vectors may use the values of earlier vectors:
# Using earlier vectors when defining later ones:
abc <- tibble(
  ltr = LETTERS[1:5],
  n = 1:5,
  l_n = paste(ltr, n, sep = "_"),  # combining ltr with n
  n_sq = n^2                       # squaring n
)

abc  # prints the tibble:
#> # A tibble: 5 × 4
#> ltr n l_n n_sq
#> <chr> <int> <chr> <dbl>
#> 1 A 1 A_1 1
#> 2 B 2 B_2 4
#> 3 C 3 C_3 9
#> 4 D 4 D_4 16
#> 5 E 5 E_5 25
Practice
- Find some tabular data online (e.g., on Wikipedia) and enter it as a tibble by applying tibble() to a list of vectors.
5.2.3 tribble() (from rows)
The tribble() function comes into play when the data to be used appears as a collection of rows (or already is in tabular form).
For instance, when copying and pasting the above family data from an electronic document, it is easy to insert commas between consecutive cell values and use tribble() for converting it (row-by-row) into a tibble:
## Create a tibble from tabular data (row-by-row):
fm2 <- tribble(
  ~id, ~name, ~age, ~gender, ~drives, ~married_2,
  #--|------|-----|--------|----------|----------|
  1, "Adam", 46, "male", TRUE, "Eva",
  2, "Eva", 48, "female", TRUE, "Adam",
  3, "Xaxi", 21, "female", FALSE, "Zenon",
  4, "Yota", 19, "female", TRUE, NA,
  5, "Zack", 17, "male", FALSE, NA )

fm2  # prints the tibble:
#> # A tibble: 5 × 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zenon
#> 4 4 Yota 19 female TRUE <NA>
#> 5 5 Zack 17 male FALSE <NA>
Note some details:
- The column names are preceded by ~ and become the variable names of the tibble;
- Consecutive entries are separated by a comma, but there is no comma after the final entry;
- The line #--|------|-----|--------|----------|----------| is commented out and can be omitted;
- The type of each column is determined by the type of the corresponding cell values. For instance, the NA values in fm2 are missing character values because the entries above were characters (entered in quotes).
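The last point can be checked with a short sketch. The tibble tr and its values are hypothetical: entering numbers with an L suffix yields an integer column, and an NA in a column of character strings becomes a missing character value:

```r
library(tibble)

tr <- tribble(
  ~n, ~label,
  1L, "one",   # 1L is an integer, "one" a character string
  2L, NA       # NA takes on the type of its column (character)
)

typeof(tr$n)
#> [1] "integer"
typeof(tr$label)
#> [1] "character"
is.na(tr$label)
#> [1] FALSE  TRUE
```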
Check
If tibble() and tribble() really are alternative commands, then the contents of our objects fm and fm2 should be identical:
# Are fm and fm2 equal?
all.equal(fm, fm2)
#> [1] TRUE
Practice
- Enter the tibble abc (from above) by using the tribble() function.
Another way of creating tibbles is importing data (e.g., with the read_csv() or read_delim() functions of the readr package). The readr package will be covered in the next chapter on Importing data (Chapter 6).
5.2.4 Accessing tibble parts
Once we have a tibble, we typically want to access individual parts of it. Although we already know how rectangular data structures in R can be accessed by indexing (see Section 1.5) and how we can select columns or rows of tibbles by dplyr commands (see Section 3.2), it is helpful to revisit various ways of subsetting tables in the context of tibbles.
We can distinguish between three cases:
- accessing variables (columns),
- accessing cases (rows), and
- accessing cells.
1. Variables (columns)
We will select columns from the family tibble fm (defined above):
fm
#> # A tibble: 5 × 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zenon
#> 4 4 Yota 19 female TRUE <NA>
#> 5 5 Zack 17 male FALSE <NA>
As each column of a tibble is a vector, obtaining a column of the tibble amounts to obtaining the corresponding vector. We can access this vector by its name (label) or by its number (column position):
# Get the name column of fm (as a vector):
fm$name       # by label (with $)
#> [1] "Adam" "Eva" "Xaxi" "Yota" "Zack"
fm[["name"]]  # by label (with [[]])
#> [1] "Adam" "Eva" "Xaxi" "Yota" "Zack"
fm[[2]]       # by number (with [[]])
#> [1] "Adam" "Eva" "Xaxi" "Yota" "Zack"
Actually, we know even more ways of obtaining the name information from fm.
However, note that the following commands (which use either base R indexing or dplyr commands) all return a tibble, rather than a vector:
# Get the name column of fm (as a 5 x 1 tibble):
fm[ , 2]
#> # A tibble: 5 × 1
#> name
#> <chr>
#> 1 Adam
#> 2 Eva
#> 3 Xaxi
#> 4 Yota
#> 5 Zack
select(fm, 2)
#> # A tibble: 5 × 1
#> name
#> <chr>
#> 1 Adam
#> 2 Eva
#> 3 Xaxi
#> 4 Yota
#> 5 Zack
select(fm, name)
#> # A tibble: 5 × 1
#> name
#> <chr>
#> 1 Adam
#> 2 Eva
#> 3 Xaxi
#> 4 Yota
#> 5 Zack
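If we want to stay within a dplyr pipe but need a vector rather than a tibble, the dplyr function pull() extracts a single column as a vector. The following sketch re-creates two columns of the family data under the hypothetical name fam, so that it can be run on its own:

```r
library(tibble)
library(dplyr)

# Re-create two columns of the family tibble (from above):
fam <- tibble(
  name = c("Adam", "Eva", "Xaxi", "Yota", "Zack"),
  age  = c(46, 48, 21, 19, 17)
)

# select() returns a tibble, pull() returns the column as a vector:
fam %>% select(name) %>% is_tibble()
#> [1] TRUE
fam %>% pull(name)
#> [1] "Adam" "Eva"  "Xaxi" "Yota" "Zack"
identical(pull(fam, name), fam$name)
#> [1] TRUE
```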
Practice
- Use analogous commands to obtain the age information of fm either as a vector or as a tibble.
# Get the age column of fm (as a vector):
fm$age       # by name (with $)
fm[["age"]]  # by name (with [[]])
fm[[3]]      # by number (with [[]])

# Get the age column of fm (as a 5 x 1 tibble):
fm[ , 3]
select(fm, 3)
select(fm, age)
- Extract the price column of ggplot2::diamonds in at least 3 different ways and verify that they all yield the same mean price.
2. Cases (rows)
Extracting specific rows of a tibble amounts to filtering a tibble and typically yields smaller tibbles (as a row may contain entries of different types). As we are familiar with essential dplyr commands (see Section 3.2), we can achieve this by filtering rows that meet some condition with dplyr::filter() or by selecting rows by their position with dplyr::slice(). However, it is also possible to specify the desired rows by logical subsetting (i.e., specifying a condition that results in a Boolean value) or by specifying the desired row number (in numeric subsetting).
The following examples illustrate how we can obtain rows from the family tibble fm (defined above):
fm  # family tibble (defined above):
#> # A tibble: 5 × 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zenon
#> 4 4 Yota 19 female TRUE <NA>
#> 5 5 Zack 17 male FALSE <NA>
The dplyr command filter() selects cases (rows) based on some condition(s).
Thus, it is similar to logical subsetting (i.e., indexing the rows by tests of column variables that evaluate to vectors of TRUE or FALSE):
fm %>% filter(id > 2)
#> # A tibble: 3 × 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 3 Xaxi 21 female FALSE Zenon
#> 2 4 Yota 19 female TRUE <NA>
#> 3 5 Zack 17 male FALSE <NA>
fm %>% filter(age < 18)
#> # A tibble: 1 × 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 5 Zack 17 male FALSE <NA>
fm %>% filter(drives == TRUE)
#> # A tibble: 3 × 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 4 Yota 19 female TRUE <NA>
Here are the same 3 filters by using logical subsetting:
fm[fm$id > 2, ]
#> # A tibble: 3 × 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 3 Xaxi 21 female FALSE Zenon
#> 2 4 Yota 19 female TRUE <NA>
#> 3 5 Zack 17 male FALSE <NA>
fm[fm$age < 18, ]
#> # A tibble: 1 × 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 5 Zack 17 male FALSE <NA>
fm[fm$drives == TRUE, ]
#> # A tibble: 3 × 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 4 Yota 19 female TRUE <NA>
The dplyr command slice() chooses cases (rows) based on their ordinal number.
Thus, it is similar to numeric subsetting (i.e., indexing the rows of a data table):
fm %>% slice(5)  # get row 5
#> # A tibble: 1 × 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 5 Zack 17 male FALSE <NA>
fm %>% slice(3:5)  # get rows 3 to 5
#> # A tibble: 3 × 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 3 Xaxi 21 female FALSE Zenon
#> 2 4 Yota 19 female TRUE <NA>
#> 3 5 Zack 17 male FALSE <NA>
fm %>% slice(2, 4)  # get rows 2 and 4
#> # A tibble: 2 × 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 2 Eva 48 female TRUE Adam
#> 2 4 Yota 19 female TRUE <NA>
Here are the same 3 selections by using numeric subsetting:
fm[5, ]  # get row 5
#> # A tibble: 1 × 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 5 Zack 17 male FALSE <NA>
fm[3:5, ]  # get rows 3 to 5
#> # A tibble: 3 × 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 3 Xaxi 21 female FALSE Zenon
#> 2 4 Yota 19 female TRUE <NA>
#> 3 5 Zack 17 male FALSE <NA>
fm[c(2, 4), ]  # get rows 2 and 4
#> # A tibble: 2 × 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 2 Eva 48 female TRUE Adam
#> 2 4 Yota 19 female TRUE <NA>
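The equivalence of filter() and logical subsetting holds for the examples above because the tested columns contain no missing values. When a condition evaluates to NA, the two approaches can differ. The following sketch uses a small hypothetical tibble ages to illustrate this; the behavior of `[` described in the comments reflects the tibble versions we are aware of and should be verified on your installation:

```r
library(tibble)
library(dplyr)

ages <- tibble(
  name = c("Ann", "Bob", "Cem"),  # hypothetical data
  age  = c(25, NA, 17)
)

# The condition is NA for Bob:
ages$age > 18
#> [1]  TRUE    NA FALSE

# filter() only keeps rows for which the condition is TRUE:
ages %>% filter(age > 18)     # 1 row (Ann)

# Logical subsetting with [ typically returns an additional row
# of NA values for the case in which the condition is NA:
ages[ages$age > 18, ]

# Wrapping the condition in which() discards the NA case:
ages[which(ages$age > 18), ]  # 1 row (Ann)
```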
Practice
- Extract all diamonds from ggplot2::diamonds that have at least two carat. How many of them are there and what is their average price?
# (1) In several steps:
# Save data ggplot2::diamonds as dm:
dm <- ggplot2::diamonds

# Filter dm by condition:
dm_2 <- filter(dm, carat >= 2)
nrow(dm_2)  # => 2154 rows (cases)

# Compute the mean price of dm_2 (in 3 ways):
mean(dm_2$price)
mean(dm_2[["price"]])
mean(dm_2[[7]])  # => US-$ 14843.66

# (2) In one pipe:
ggplot2::diamonds %>%
  filter(carat >= 2) %>%
  summarise(nr = n(),
            mn_price = mean(price))
3. Cells
Accessing the values of individual tibble cells is relatively rare, but can be achieved by
- explicitly providing both row number r and column number c (as [r, c]), or by
- first extracting the column (as a vector v) and then providing the desired row number r (v[r]).
fm  # family tibble (defined above):
#> # A tibble: 5 × 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zenon
#> 4 4 Yota 19 female TRUE <NA>
#> 5 5 Zack 17 male FALSE <NA>
# Getting specific cell values:
fm$name[4]  # getting the name of the 4th row
#> [1] "Yota"
fm[4, 2]    # getting the same name by row and column numbers
#> # A tibble: 1 × 1
#> name
#> <chr>
#> 1 Yota
# Note: What if we don't know the row number?
which(fm$name == "Yota") # get the row number that matches the name "Yota"
#> [1] 4
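As indexing with [r, c] returns a 1 x 1 tibble, it can be useful to know that double brackets with a row and a column index return the cell value itself. The following sketch re-creates two columns of the family data under the hypothetical name fam and also shows a dplyr alternative that does not require the row number:

```r
library(tibble)
library(dplyr)

fam <- tibble(
  name = c("Adam", "Eva", "Xaxi", "Yota", "Zack"),
  age  = c(46, 48, 21, 19, 17)
)

fam[[4, 1]]       # row 4 of column 1, as a vector
#> [1] "Yota"
fam[[4, "name"]]  # the same cell, using the column name
#> [1] "Yota"

# Cell access without knowing the row number:
fam %>% filter(name == "Yota") %>% pull(age)
#> [1] 19
```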
In practice, accessing individual cell values is mostly needed to check for specific cell values and to change or correct erroneous entries by re-assigning them to a different value.
# Checking and changing cell values: ------
# Check: "Who is Xaxi's spouse?" (in 3 different ways):
fm[fm$name == "Xaxi", ]$married_2
#> [1] "Zenon"
fm$married_2[3]
#> [1] "Zenon"
fm[3, 6]
#> # A tibble: 1 × 1
#> married_2
#> <chr>
#> 1 Zenon
# Change: "Zenon" is actually "Zeus" (in 3 different ways):
fm[fm$name == "Xaxi", ]$married_2 <- "Zeus"
fm$married_2[3] <- "Zeus"
fm[3, 6] <- "Zeus"

# Check for successful change:
fm
#> # A tibble: 5 × 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zeus
#> 4 4 Yota 19 female TRUE <NA>
#> 5 5 Zack 17 male FALSE <NA>
By contrast, a relatively common task is to check an entire tibble (e.g., for the existence or count of missing values, or to replace them by some other value):
# Checking for, counting, and changing missing values: ------
fm  # family tibble (defined above):
#> # A tibble: 5 × 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zeus
#> 4 4 Yota 19 female TRUE <NA>
#> 5 5 Zack 17 male FALSE <NA>
# (a) Check for missing values:
is.na(fm) # checks each cell value for being NA
#> id name age gender drives married_2
#> [1,] FALSE FALSE FALSE FALSE FALSE FALSE
#> [2,] FALSE FALSE FALSE FALSE FALSE FALSE
#> [3,] FALSE FALSE FALSE FALSE FALSE FALSE
#> [4,] FALSE FALSE FALSE FALSE FALSE TRUE
#> [5,] FALSE FALSE FALSE FALSE FALSE TRUE
# (b) Count the number of missing values:
sum(is.na(fm)) # counts missing values (by adding up all TRUE values)
#> [1] 2
# (c) Change all missing values:
fm[is.na(fm)] <- "A MISSING value!"

# Check for successful change:
fm
#> # A tibble: 5 × 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zeus
#> 4 4 Yota 19 female TRUE A MISSING value!
#> 5 5 Zack 17 male FALSE A MISSING value!
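When only some columns should be changed, or when each column needs its own replacement value, missing values can also be counted and replaced column by column. The following sketch uses colSums() from base R and replace_na() from the tidyr package. The tibble fam, its missing age value, and the replacement "unknown" are assumptions made for this illustration:

```r
library(tibble)
library(tidyr)

fam <- tibble(
  name      = c("Xaxi", "Yota", "Zack"),
  age       = c(21, NA, 17),        # hypothetical missing age
  married_2 = c("Zenon", NA, NA)
)

# Count missing values per column:
colSums(is.na(fam))
#>      name       age married_2 
#>         0         1         2

# Replace missing values only in the married_2 column:
fam_2 <- replace_na(fam, list(married_2 = "unknown"))

sum(is.na(fam_2))  # the missing age value remains
#> [1] 1
```

Leaving the numeric column untouched keeps its missing value visible as NA, so that later computations (e.g., mean(fam_2$age, na.rm = TRUE)) can still handle it explicitly.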
Practice
- Determine the number and the percentage of missing values in the datasets dplyr::starwars and dplyr::storms.
# starwars:
sum(is.na(dplyr::starwars)) # 101 missing values
mean(is.na(dplyr::starwars)) # 8.93%
# storms:
sum(is.na(dplyr::storms)) # 13056 missing values
mean(is.na(dplyr::storms)) # 10.03%
5.2.5 From tibbles to data frames
As any tibble also is (a special type of) data frame, we rarely need to convert a tibble tb into a data frame.
However, some R functions require an original data frame, mostly because they expect tb[ , i] to return the i-th column of tb as a vector, when it actually will return another tibble:
# Using the tibble fm (from above):
class(fm) # tibbles are a kind of data.frame
#> [1] "tbl_df" "tbl" "data.frame"
fm[ , 2]  # yields the 2nd column as a tibble (!)
#> # A tibble: 5 × 1
#> name
#> <chr>
#> 1 Adam
#> 2 Eva
#> 3 Xaxi
#> 4 Yota
#> 5 Zack
fm[[2]]  # yields the 2nd column as a vector
#> [1] "Adam" "Eva" "Xaxi" "Yota" "Zack"
For rare cases like this, it is good to know that R has an as.data.frame() function that allows turning a tibble into a data frame:
# Turn the tibble fm into a data frame:
df_fm <- as.data.frame(fm)

class(df_fm)  # an ordinary data.frame
#> [1] "data.frame"
df_fm[ , 2]   # yields the 2nd column as a vector (!)
#> [1] "Adam" "Eva" "Xaxi" "Yota" "Zack"
df_fm[[2]]    # yields the 2nd column as a vector
#> [1] "Adam" "Eva" "Xaxi" "Yota" "Zack"
5.2.6 Conclusion
Our focus in this section was on creating tibbles and accessing parts of tibbles. Prior to this chapter, we were already working with tibbles, but encountered them mostly as objects provided by packages or as inputs of dplyr and ggplot2 commands. More advanced transformations of tibbles were discussed in the context of Transforming data (Chapter 3). The following chapters will continue to use tibbles and teach us new ways of importing (Chapter 6) and combining (Chapter 8) tibbles, or wrangling them into various shapes (Chapter 7).
At this point, it may seem as if tibbles are the only data structure that we will ever need. This impression is wrong, but has a simple reason: In this book, we focus on rectangular data that can conveniently be stored as a tibble. Although it is impressive how many things can be expressed in this format, tibbles are just a convenient way of starting our expedition into data science. Clearly, there are lots of types of information that are of immense scientific interest, but not easily stored in this format — for instance images, texts, sounds, tastes, and most natural phenomena (e.g., psychological, economic, or social processes).