5.2 Essential tibble commands

Whenever working with rectangular data structures — data consisting of multiple cases (rows) and variables (columns) — our first step (in a tidyverse context) is to create or transform the data into a tibble. A tibble is a rectangular data table and a modern and simpler version of the data.frame construct in R.

Tibbles vs. data frames

Tibbles essentially are simpler data frames. In contrast to the base R behavior of data frames, turning data into tibbles is stricter. Specifically,

  • tibbles do not change the types of input variables (e.g., strings are never converted to factors, whereas data.frame did so by default before R 4.0.0),

  • tibbles do not change the names of input variables and do not use row names.

Tibbles also allow non-syntactic variable (column) names. For instance, in contrast to data frames created with the default settings of data.frame, the variable names in tibbles can start with a number or contain spaces:
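A minimal sketch of this difference, assuming the tibble package is installed. The column names `2 be` and `or not` and their values are invented for illustration:

```r
library(tibble)

# A tibble keeps non-syntactic column names exactly as entered:
tb <- tibble(`2 be` = c(1, 2, 3), `or not` = c("a", "b", "c"))
names(tb)  # "2 be"   "or not"

# By default (check.names = TRUE), data.frame() repairs such names:
df <- data.frame(`2 be` = c(1, 2, 3), `or not` = c("a", "b", "c"))
names(df)  # "X2.be"  "or.not"
```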

To refer to these names, we need to enclose them in backticks ` ` (see footnote 1). For instance:
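The following sketch shows one way of referring to such names. The tibble and its column names are again hypothetical and are re-created here so that the example runs on its own:

```r
library(tibble)

tb <- tibble(`2 be` = c(1, 2, 3), `or not` = c("a", "b", "c"))

# With $, a non-syntactic name must be enclosed in backticks:
tb$`2 be`
tb$`2 be` * 10

# Inside [[ ]], the name is an ordinary character string:
tb[["or not"]]
```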

The differences between tibbles and data frames are not important to us at this point. (The main differences concern printing, subsetting, and the recycling behavior of vector elements when creating tibbles. See vignette("tibble") in case you are interested.) For our purposes, tibbles are preferable to data frames, as they are easier to understand, easier to manipulate, and reduce the chance of unpleasant surprises.

Creating tibbles

The question “How can we create tibbles?” is more relevant to our concerns. This chapter covers 3 ways of creating tibbles:

  1. as_tibble converts (or “coerces”) an existing rectangle of data (e.g., a data frame) into a tibble.

  2. tibble converts several vectors into (the columns of) a tibble.

  3. tribble converts a table (entered row-by-row) into a tibble.

We will illustrate each of these commands in the following sections.

Practice

Before we start creating tibbles, inspect the 3-item list of commands more closely. The 3 commands yield the same type of output (i.e., a data table of the tibble variety), but require different inputs. Ask yourself which kind of input each command takes and how these inputs need to be structured and formatted (e.g., do they contain parentheses, commas, etc.).

5.2.1 as_tibble (rectangles)

We use as_tibble to create a tibble from data that already is in rectangular format (e.g., a data frame or matrix).

  1. Starting from a data frame:
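As a sketch, the conversion could look as follows. The choice of datasets::iris is an assumption made for illustration (the data set used originally is not shown here); the object names df and tb follow the text:

```r
library(tibble)

df <- datasets::iris   # an ordinary data frame
tb <- as_tibble(df)    # the same data as a tibble

class(df)              # "data.frame"
class(tb)              # "tbl_df" "tbl" "data.frame"
is.data.frame(tb)      # a tibble still counts as a data frame
```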

As always, we can apply some standard functions for inspecting df and tb:

As tibbles are data frames, we can use the same commands on tb:

However, when using tibble, we can use some additional commands:

The most obvious advantage of a tibble is that it can simply be printed to see the most important information about a table of data: Its dimensions, types of variables (columns), and the values of the first rows:
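The inspection steps described above can be sketched like this, again assuming datasets::iris as example data. The particular functions shown are a typical selection, not necessarily the ones used originally:

```r
library(tibble)

df <- datasets::iris
tb <- as_tibble(df)

# Standard inspection functions work on data frames and tibbles alike:
dim(df)
names(tb)
str(df)

# glimpse() is exported by tibble and gives a compact overview of all columns:
glimpse(tb)

# Printing a tibble shows its dimensions, column types, and the first rows:
tb
print(tb, n = 3)   # limit the output to the first 3 rows
```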

  2. Starting from an existing matrix:

As matrices are rectangular data structures, coercing a matrix into a tibble also works with the as_tibble command:

Note that, whereas the matrix mx contained no column names, the corresponding tibble tx contains default variable names:
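A hedged sketch of the matrix case follows. The contents of mx are invented. Recent versions of tibble warn when a matrix without column names is converted without a naming strategy, so this example passes .name_repair = "unique" explicitly; the resulting names may therefore differ from the defaults of older tibble versions:

```r
library(tibble)

# A 2 x 3 matrix without column names (values are arbitrary):
mx <- matrix(1:6, nrow = 2)
colnames(mx)   # NULL

# Coerce to a tibble and let tibble generate unique column names:
tx <- as_tibble(mx, .name_repair = "unique")

names(tx)      # generated names, one per matrix column
tx
```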

Practice

  1. Convert some other R datasets (e.g., datasets::attitude, datasets::mtcars, and datasets::Orange) into tibbles and inspect their dimensions and contents.

    • What types of variables (columns) do they contain?
    • What is the basic unit of an observation (row)?
  2. Obtain the same information that you get by printing a tibble tb (i.e., its dimensions, types of variables, and values of the first rows) about some data frame df. How many commands do you need?

For relatively small data tables, using one versus several short commands may seem comparable. But for larger data sets, using tibbles is much more convenient (see footnote 2).

5.2.2 tibble (columns)

How can we create a tibble when we do not yet have a rectangular data structure? A common case of this type is that we have several vectors (i.e., linear data structures) and want to combine them into a tibble (i.e., tabular data structure). Importantly, the vectors will become the variables (columns) of our tibble.

Use the tibble command when the data to be turned into a tibble appears as a collection of vectors (which will become the tibble’s columns). For instance, imagine we wanted to create a tibble that stores the following information about a family:

Table 5.1: Example data of some family.
id  name  age  gender  drives  married_2
1   Adam  46   male    TRUE    Eva
2   Eva   48   female  TRUE    Adam
3   Xaxi  21   female  FALSE   Zenon
4   Yota  19   female  TRUE    NA
5   Zack  17   male    FALSE   NA

One way of viewing this table is as a series of variables that are the columns of the table (rather than its rows). Each column consists of a variable name and the same number of (here: 5) values, which can be of different types (here: numbers, characters, or Boolean truth values). Each column may or may not contain missing values (entered as NA).

The tibble command expects that each column of the table is entered as a vector:
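A sketch of how the family data of Table 5.1 could be entered column by column. The object name fm is the one used in the text; the exact formatting is an assumption:

```r
library(tibble)

# Each column of Table 5.1 is entered as one vector of length 5:
fm <- tibble(
  id        = c(1, 2, 3, 4, 5),
  name      = c("Adam", "Eva", "Xaxi", "Yota", "Zack"),
  age       = c(46, 48, 21, 19, 17),
  gender    = c("male", "female", "female", "female", "male"),
  drives    = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  married_2 = c("Eva", "Adam", "Zenon", NA, NA)   # no comma after the final vector
)

fm   # print the tibble
```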

Note some details:

  • Each vector is labeled by the variable (column) name, which is not put into quotes;

  • Avoid spaces within variable (column) names (or enclose names in backticks if you really must use spaces);

  • All vectors need to have the same length;

  • Each vector is of a single type (numeric, character, logical, or a categorical factor);

  • Consecutive vectors are separated by commas (but there is no comma after the final vector).

A neat feature of using tibble for creating a new tibble is that later vectors may use the values of earlier vectors:
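The following sketch illustrates this feature. The tibble nums and its columns are hypothetical and serve only to show that columns are evaluated in order:

```r
library(tibble)

nums <- tibble(
  n   = 1:5,
  sq  = n^2,           # uses the values of n
  odd = n %% 2 == 1,   # uses the values of n
  ltr = letters[n]     # uses n as an index into letters
)

nums
```

Reversing the order (e.g., defining sq before n) would fail, as n would not yet exist when sq is evaluated.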

Practice

Find some tabular data online (e.g., on Wikipedia) and enter it as a tibble by applying tibble to a list of vectors.

5.2.3 tribble (rows)

Use tribble when the data to be used appears as a collection of rows (or already is in tabular form).

For instance, when you copy and paste the above family data from an electronic document, it is easy to insert commas between consecutive cell values and use tribble to convert it (row-by-row) into a tibble:
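A sketch of the row-by-row version of the family data, using the object name fm2 from the text. The column alignment is a matter of taste and not required by tribble:

```r
library(tibble)

fm2 <- tribble(
  ~id, ~name,  ~age, ~gender,  ~drives, ~married_2,
  #--|-----|-----|-----|--------|--------|
  1,   "Adam", 46,   "male",   TRUE,    "Eva",
  2,   "Eva",  48,   "female", TRUE,    "Adam",
  3,   "Xaxi", 21,   "female", FALSE,   "Zenon",
  4,   "Yota", 19,   "female", TRUE,    NA,
  5,   "Zack", 17,   "male",   FALSE,   NA
)

fm2
```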

Note some details:

  • The column names are preceded by ~ and become the variable names of the tibble;

  • Consecutive entries are separated by a comma, but there is no comma after the final entry;

  • The line #--|-----|-----|-----|--------|--------| is commented out and can be omitted;

  • The type of each column is determined by the type of the corresponding cell values. For instance, the NA values in fm2 are missing character values because the entries above were characters (entered in quotes).

Check

If tibble and tribble really are alternative commands, then the contents of our objects fm and fm2 should be identical:
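One way to run this check is sketched below. Both objects are re-created here so that the example is self-contained; the use of all.equal is an assumption about how the comparison could be made:

```r
library(tibble)

# Column-wise definition:
fm <- tibble(
  id        = c(1, 2, 3, 4, 5),
  name      = c("Adam", "Eva", "Xaxi", "Yota", "Zack"),
  age       = c(46, 48, 21, 19, 17),
  gender    = c("male", "female", "female", "female", "male"),
  drives    = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  married_2 = c("Eva", "Adam", "Zenon", NA, NA)
)

# Row-wise definition:
fm2 <- tribble(
  ~id, ~name,  ~age, ~gender,  ~drives, ~married_2,
  1,   "Adam", 46,   "male",   TRUE,    "Eva",
  2,   "Eva",  48,   "female", TRUE,    "Adam",
  3,   "Xaxi", 21,   "female", FALSE,   "Zenon",
  4,   "Yota", 19,   "female", TRUE,    NA,
  5,   "Zack", 17,   "male",   FALSE,   NA
)

# Compare names, types, and values of both tibbles:
all.equal(fm, fm2)
```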

Practice

  • Enter the tibble abc (from above) by using tribble.

Another way of creating tibbles is importing data (e.g., with readr::read_csv or readr::read_delim). The readr package will be covered in the next chapter.

5.2.4 Accessing tibble parts

Once we have a tibble, we typically want to access individual parts of it. Although we already know how rectangular data structures in R can be accessed by indexing (see Section 1.5) and how we can select columns or rows of tibbles by dplyr commands (see Section 3.2), it is helpful to revisit various ways of subsetting tables in the context of tibbles.

We can distinguish between 3 cases:

  • accessing variables (columns),
  • accessing cases (rows), and
  • accessing cells.
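
1. Variables (columns)

For the first case, a column can be extracted either as a vector or as a one-column tibble. The following sketch assumes the family tibble fm from above (re-created here so that the example runs on its own) and uses the name column as an arbitrary example:

```r
library(tibble)
library(dplyr)

fm <- tibble(
  id        = c(1, 2, 3, 4, 5),
  name      = c("Adam", "Eva", "Xaxi", "Yota", "Zack"),
  age       = c(46, 48, 21, 19, 17),
  gender    = c("male", "female", "female", "female", "male"),
  drives    = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  married_2 = c("Eva", "Adam", "Zenon", NA, NA)
)

# Column as a vector:
v1 <- fm$name
v2 <- fm[["name"]]
v3 <- fm[[2]]          # by column number
v4 <- pull(fm, name)   # dplyr

# Column as a tibble (5 rows, 1 column):
t1 <- fm[, "name"]
t2 <- fm[, 2]
t3 <- select(fm, name) # dplyr
```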

Practice

  1. Use analogous commands to obtain the age information of fm either as a vector or a tibble.
  2. Extract the price column of ggplot2::diamonds in at least 3 different ways and verify that they all yield the same mean price.

2. Cases (rows)

Extracting specific rows of a tibble amounts to filtering a tibble and typically yields smaller tibbles (as a row may contain entries of different types). As we are familiar with essential dplyr commands (see Section 3.2), we can achieve this by filtering rows based on a condition with dplyr::filter or by selecting rows by position with dplyr::slice. However, it is also possible to specify the desired rows by logical subsetting (i.e., specifying a condition that results in a Boolean value) or by specifying the desired row numbers (in numeric subsetting).

The following examples illustrate how we can obtain rows from the family tibble fm (defined above):

The dplyr command filter selects cases (rows) based on some condition(s). Thus, it is similar to logical subsetting (i.e., indexing the rows by tests of column variables that evaluate to vectors of TRUE or FALSE):

Here are the same 3 filters by using logical subsetting:
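The correspondence between filter and logical subsetting can be sketched as follows. The three conditions are assumptions chosen for illustration (not necessarily the original ones), and fm is re-created so that the example is self-contained:

```r
library(tibble)
library(dplyr)

fm <- tibble(
  id        = c(1, 2, 3, 4, 5),
  name      = c("Adam", "Eva", "Xaxi", "Yota", "Zack"),
  age       = c(46, 48, 21, 19, 17),
  gender    = c("male", "female", "female", "female", "male"),
  drives    = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  married_2 = c("Eva", "Adam", "Zenon", NA, NA)
)

# Three filters with dplyr:
f1 <- filter(fm, age > 40)
f2 <- filter(fm, gender == "female", drives)
f3 <- filter(fm, name %in% c("Xaxi", "Zack"))

# The same rows by logical subsetting (note the comma after the condition):
s1 <- fm[fm$age > 40, ]
s2 <- fm[fm$gender == "female" & fm$drives, ]
s3 <- fm[fm$name %in% c("Xaxi", "Zack"), ]

all.equal(f1, s1)
```

One difference is worth keeping in mind: filter drops rows for which the condition evaluates to NA, whereas logical subsetting with [ returns a row of NA values for them. The conditions above avoid the column married_2 for this reason.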

The dplyr command slice chooses cases (rows) based on their ordinal number. Thus, it is similar to numeric subsetting (i.e., indexing the rows of a data table):

Here are the same 3 selections by using numeric subsetting:
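A corresponding sketch for slice and numeric subsetting. The three selections are assumptions chosen for illustration, and fm is re-created so that the example runs on its own:

```r
library(tibble)
library(dplyr)

fm <- tibble(
  id        = c(1, 2, 3, 4, 5),
  name      = c("Adam", "Eva", "Xaxi", "Yota", "Zack"),
  age       = c(46, 48, 21, 19, 17),
  gender    = c("male", "female", "female", "female", "male"),
  drives    = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  married_2 = c("Eva", "Adam", "Zenon", NA, NA)
)

# Three selections with dplyr:
s1 <- slice(fm, 1)      # first row
s2 <- slice(fm, 2:4)    # rows 2 to 4
s3 <- slice(fm, n())    # last row

# The same rows by numeric subsetting:
n1 <- fm[1, ]
n2 <- fm[2:4, ]
n3 <- fm[nrow(fm), ]

all.equal(s2, n2)
```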

Practice

Extract all diamonds from ggplot2::diamonds that have at least 2 carat. How many of them are there and what is their mean price?

3. Cells

Accessing the values of individual tibble cells is relatively rare, but can be achieved by

  1. explicitly providing both row number r and column number c (as [r, c]), or by

  2. first extracting the column (as a vector v) and then providing the desired row number r (v[r]).

In practice, accessing individual cell values is mostly needed to check for specific cell values and to change or correct erroneous entries by re-assigning them to a different value.
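Both ways of accessing a cell, and a re-assignment, can be sketched as follows. The corrected age value is hypothetical and only illustrates the mechanics; fm is re-created so that the example is self-contained:

```r
library(tibble)

fm <- tibble(
  id        = c(1, 2, 3, 4, 5),
  name      = c("Adam", "Eva", "Xaxi", "Yota", "Zack"),
  age       = c(46, 48, 21, 19, 17),
  gender    = c("male", "female", "female", "female", "male"),
  drives    = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  married_2 = c("Eva", "Adam", "Zenon", NA, NA)
)

# 1. Row and column number: returns a 1 x 1 tibble
cell_tb <- fm[3, 2]

# 2. Extract the column as a vector, then index the row: returns a value
cell_v <- fm$name[3]    # "Xaxi"

# Check and correct an entry (assuming, hypothetically, that Zack is 18):
fm$age[5] == 17         # TRUE
fm$age[5] <- 18
fm$age
```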

By contrast, a relatively common task is to check an entire tibble (e.g., for the existence or count of missing values, or to replace them by some other value):
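A sketch of such checks on the family tibble fm (re-created here). The replacement value "nobody" is invented for illustration:

```r
library(tibble)

fm <- tibble(
  id        = c(1, 2, 3, 4, 5),
  name      = c("Adam", "Eva", "Xaxi", "Yota", "Zack"),
  age       = c(46, 48, 21, 19, 17),
  gender    = c("male", "female", "female", "female", "male"),
  drives    = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  married_2 = c("Eva", "Adam", "Zenon", NA, NA)
)

# is.na() applied to a table yields a logical matrix of the same shape:
any(is.na(fm))       # TRUE: there are missing values
sum(is.na(fm))       # 2: count of missing values
colSums(is.na(fm))   # missing values per column

# Replace missing values in one column (in a copy of fm):
fm_new <- fm
fm_new$married_2[is.na(fm_new$married_2)] <- "nobody"
sum(is.na(fm_new))   # 0
```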

Practice

Determine the number and the percentage of missing values in the datasets dplyr::starwars and dplyr::storms.

5.2.5 From tibbles to data frames

As any tibble also is (a special type of) data frame, we rarely need to convert a tibble tb into a data frame. However, some R functions require an original data frame — mostly because they expect tb[ , i] to return the i-th column of tb as a vector, when it actually will return another tibble:
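The difference can be seen in a small sketch. The contents of df and tb are invented for illustration:

```r
library(tibble)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

df[, 1]    # a data frame drops to a vector here
tb[, 1]    # a tibble returns a 3 x 1 tibble

tb[[1]]    # explicit way to get the column as a vector
```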

For rare cases like this, it is good to know that R has an as.data.frame function that allows turning a tibble into a data frame:
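A minimal sketch of this conversion, using a small invented tibble tb:

```r
library(tibble)

tb <- tibble(x = 1:3, y = c("a", "b", "c"))

df <- as.data.frame(tb)

class(tb)   # "tbl_df" "tbl" "data.frame"
class(df)   # "data.frame"
df[, 1]     # returns a vector again
```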

5.2.6 Conclusion

Our focus in this section was on creating tibbles and accessing parts of tibbles. Previously, we were working with tibbles, but encountered them mostly as the inputs of dplyr and ggplot2 commands. More advanced transformations of tibbles were discussed in Chapter 3 on Transforming data. The following chapters will continue to use tibbles and teach us new ways of importing them and wrangling them into various shapes.

At this point, it may seem as if tibbles are the only data structure that we will ever need. This impression is wrong, but has a simple reason: In this book, we focus on rectangular data that can conveniently be stored as a tibble. Although it is impressive how many things can be expressed in this format, tibbles are just a convenient way of starting our exploration of data science. Clearly, there are lots of types of information that are of immense scientific interest, but not easily stored in this format — for instance images, texts, sounds, tastes, and most natural phenomena (e.g., psychological, economic, or social processes).


  1. For compatibility reasons, I recommend avoiding variable names that contain spaces or do not start with a letter.

  2. In particular, seeing the types of all variables is more difficult when using data frames. The command sapply(df, class) works, but is difficult to understand.