12.2 Essentials of iteration

As this book covers only the essentials, we will mostly focus on for and while loops here. In later sections, we briefly introduce the notion of functional programming (see Section 12.2.3) and learn how loops can be replaced by the base R apply (R Core Team, 2020) and purrr’s map (Henry & Wickham, 2020) families of functions (see Section 12.2.4).

12.2.1 Loops

Loops are structures of code that cause some code to be executed repeatedly. In R, we can distinguish between 2 basic versions:

  1. for loops are indicated when we know the number of required iterations in advance;

  2. while loops are more general and indicated when we only know a condition that should end a loop.

We will discuss both types of loops in the following sections.

The toy data tb used in the examples of this section:

#> # A tibble: 6 x 5
#>      id   age height shoesize    IQ
#>   <int> <dbl>  <dbl>    <dbl> <dbl>
#> 1     1    21    173       38    89
#> 2     2    26    193       43    93
#> 3     3    24    171       41    92
#> 4     4    32    191       43    97
#> 5     5    26    156       36   110
#> 6     6    28    172       34   117

Using a for loop

When we want to execute some code repeatedly and know the number of required iterations in advance, a for loop is indicated. To create a for loop, we typically ask and answer 3 questions:

  1. Body: What is the task performed (and corresponding code executed) in the loop?

  2. Sequence: Over which sequence should we iterate? How many iterations are there?

  3. Output: What is the result of the loop: What type of object and how many instances?

After answering these questions, a for loop can be designed in reverse order (from output to body):

Loops for iteration

Assuming we know how often we want to do something.

Example: Compute the squares of the integers from 1 to 10.

  1. Body: Square of some number i.

  2. Sequence: i from 1 to 10.

  3. Output: A vector of 10 numbers.

Implementation:
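A minimal sketch of such a loop, built in reverse order from output to body (the object name output matches the note below; everything else is an illustrative choice):

```r
# Output: pre-allocate a numeric vector with 10 elements
output <- vector("double", 10)

# Sequence: i runs from 1 to 10
for (i in 1:10) {
  # Body: store the square of the current number i
  output[[i]] <- i^2
}

output
```

Pre-allocating output with its final length avoids growing the vector inside the loop.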

Note: In the for loop, we use output[[i]], rather than output[i], to refer to the i-th element of output. Using single brackets [] would have worked as well, but the double brackets [[]] make it clear that we want to remove a level of the hierarchy and assign something to a single element of output. (See Ch. 20.5.2 Subsetting recursive vectors (lists) for more details on this distinction.)

Loops over data

In the context of data science, we often want to iterate over the rows or columns of data tables. Let’s load some toy data to work with:

Suppose we wanted to obtain the means of the variables from age to IQ. We could call the mean function for each desired variable. Repeating this call for each variable would look as follows:
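A minimal sketch of these repeated calls. As an assumption, tb is rebuilt here as a plain data frame that contains only the six rows printed in this section (the book's own tb is a tibble and may differ):

```r
# Hypothetical stand-in for tb: only the six rows printed in this section
tb <- data.frame(
  id       = 1:6,
  age      = c(21, 26, 24, 32, 26, 28),
  height   = c(173, 193, 171, 191, 156, 172),
  shoesize = c(38, 43, 41, 43, 36, 34),
  IQ       = c(89, 93, 92, 97, 110, 117)
)

# One call per variable: the same step is repeated 4 times
mean(tb$age)
mean(tb$height)
mean(tb$shoesize)
mean(tb$IQ)
```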

However, the statement “for each variable” in the previous sentence shows that we are dealing with an instance of iteration here. When dealing with computers, repetition of identical steps or commands is a signal that there are more efficient ways to accomplish the same task.

How could we use a for loop here? To design this loop, we need to answer our 3 questions from above:

  1. Body: We want to compute the mean of 4 columns (age to IQ) in tb.

  2. Sequence: We want to iterate over columns 2 to 5 (i.e., 4 iterations).

  3. Output: The result of the loop is a vector of type “double”, containing 4 elements.

Notes

  • We remove the 1st column, as no computation is needed for it.

  • The i-th column of tibble tb (or data frame df) can be accessed via tb[[i]] (or df[[i]]).
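The three answers and the notes above can be sketched as follows. The data frame is again a hypothetical six-row stand-in for tb, and tb_2 denotes tb without its 1st column:

```r
# Hypothetical stand-in for tb (six rows only)
tb <- data.frame(
  id       = 1:6,
  age      = c(21, 26, 24, 32, 26, 28),
  height   = c(173, 193, 171, 191, 156, 172),
  shoesize = c(38, 43, 41, 43, 36, 34),
  IQ       = c(89, 93, 92, 97, 110, 117)
)

tb_2 <- tb[, -1]                   # remove the 1st column (id)

output <- vector("double", 4)      # Output: 4 numbers of type double
for (i in 1:4) {                   # Sequence: the 4 columns of tb_2
  output[[i]] <- mean(tb_2[[i]])   # Body: mean of the i-th column
}

output
```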

The range of a for loop can be defined with a special function:

The base R function seq_along() returns an integer vector of positions. When its argument is a table (e.g., a data frame or a tibble), it returns the integer indices of all columns (i.e., from 1 to the number of columns), so that the last for loop could be re-written as follows:

Another way of iterating over all columns of a table tb_2 could loop from 1 to ncol(tb_2):
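Both ways of defining the range can be sketched side by side. The data frame tb_2 is a hypothetical stand-in that holds the four columns age to IQ of the six rows printed in this section, and the names output_2 and output_3 are illustrative:

```r
# Hypothetical stand-in for tb_2 (tb without its id column)
tb_2 <- data.frame(
  age      = c(21, 26, 24, 32, 26, 28),
  height   = c(173, 193, 171, 191, 156, 172),
  shoesize = c(38, 43, 41, 43, 36, 34),
  IQ       = c(89, 93, 92, 97, 110, 117)
)

seq_along(tb_2)   # integer positions of the columns: 1 2 3 4

# Variant 1: range defined by seq_along()
output_2 <- vector("double", ncol(tb_2))
for (i in seq_along(tb_2)) {
  output_2[[i]] <- mean(tb_2[[i]])
}

# Variant 2: range defined as 1 to ncol()
output_3 <- vector("double", ncol(tb_2))
for (i in 1:ncol(tb_2)) {
  output_3[[i]] <- mean(tb_2[[i]])
}
```

The two variants agree here, but seq_along() is the safer choice: for a table with zero columns, 1:ncol() would count down from 1 to 0 and iterate twice, whereas seq_along() yields an empty sequence.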

Practice

  1. Rewrite the for loop to compute the means of columns 2 to 5 of tb (i.e., without simplifying tb to tb_2 first).

Solution

We create a new output vector output_4 and need to change 2 things:

  • the column numbers of the for loop statement (from 1:4 to 2:5).

  • the index to which we assign our current mean mn should be decreased to i - 1 (to assign the mean of column 2 to the 1st element of output_4).
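A sketch of this solution, assuming a hypothetical six-row stand-in for tb:

```r
# Hypothetical stand-in for tb (six rows only)
tb <- data.frame(
  id       = 1:6,
  age      = c(21, 26, 24, 32, 26, 28),
  height   = c(173, 193, 171, 191, 156, 172),
  shoesize = c(38, 43, 41, 43, 36, 34),
  IQ       = c(89, 93, 92, 97, 110, 117)
)

output_4 <- vector("double", 4)

for (i in 2:5) {              # loop over columns 2 to 5 of tb
  mn <- mean(tb[[i]])         # mean of the current column
  output_4[[i - 1]] <- mn     # column 2 goes to element 1, etc.
}

output_4
```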

  2. We have learned that creating a for loop requires knowing (a) the data type of the loop results and (b) a data structure that can collect these results. This is simple and straightforward if each iteration yields a single number, as then all results can be stored in a vector.

However, things get more complicated when for loops yield tables, lists, or plots as outputs. Try creating similar for loops that return the summary and a histogram (using the base R function hist) of each variable in tb (or each variable of tb_2).

Solution

Creating a summary:

The following for loop is almost identical to the one above (computing the means of columns 2:5 of tb). However, we initialize summaries as a vector of mode = "list", which allows storing more complex objects in its elements:
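A sketch of such a loop, assuming a hypothetical six-row stand-in for tb:

```r
# Hypothetical stand-in for tb (six rows only)
tb <- data.frame(
  id       = 1:6,
  age      = c(21, 26, 24, 32, 26, 28),
  height   = c(173, 193, 171, 191, 156, 172),
  shoesize = c(38, 43, 41, 43, 36, 34),
  IQ       = c(89, 93, 92, 97, 110, 117)
)

# A list can hold one summary object per element
summaries <- vector(mode = "list", length = 4)

for (i in 2:5) {
  summaries[[i - 1]] <- summary(tb[[i]])
}

summaries[[2]]   # summary of the 2nd variable (height)
```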

The following code uses the R command hist() (from the graphics package included in R) to create a histogram for a specific variable (column) of tb:

The following code uses a loop over the tb_2 data (which was created as tb without the 1st column) to create a histogram for each variable (column) of tb_2. Each histogram is stored in a list out, so that individual plots can be plotted later (using the plot() command on an element of out):
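A sketch of such a loop over a hypothetical six-row stand-in for tb_2. As an assumption that differs from a plain hist() call, plot = FALSE is used so that each histogram is only computed and stored, not drawn, inside the loop:

```r
# Hypothetical stand-in for tb_2 (tb without its id column)
tb_2 <- data.frame(
  age      = c(21, 26, 24, 32, 26, 28),
  height   = c(173, 193, 171, 191, 156, 172),
  shoesize = c(38, 43, 41, 43, 36, 34),
  IQ       = c(89, 93, 92, 97, 110, 117)
)

out <- vector(mode = "list", length = ncol(tb_2))

for (i in seq_along(tb_2)) {
  # Compute (but do not draw) the histogram of the i-th variable
  out[[i]] <- hist(tb_2[[i]], plot = FALSE)
  # Store the variable name as the label used when plotting later
  out[[i]]$xname <- names(tb_2)[i]
}

# Drawing a stored histogram later:
# plot(out[[2]])
```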


12.2.2 For loop variations

We can distinguish between 4 variations of the basic theme of the for loop:

1. Modifying an existing object, instead of creating a new object

Example: We want to rescale every column of a table (tibble or data frame).

  1. We could rescale all columns of tb in 4 separate steps:

  2. Alternatively, we could use a for loop to modify an existing data structure.

Here are the answers to our 3 questions regarding loops:

  1. Body: apply rescale01() to every column of tb.

  2. Sequence: A list of columns (i.e., iterate over each column with seq_along(tb)).

  3. Output: tb (i.e., identical dimensions to the input).
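Both options can be sketched as follows. The helper rescale01() is not defined in this section, so the version below (rescaling a numeric vector to the range from 0 to 1) is an assumption, as is the four-column stand-in for tb:

```r
# Assumed helper: rescale a numeric vector to the range [0, 1]
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Hypothetical stand-in for tb with 4 numeric columns
tb <- data.frame(
  age      = c(21, 26, 24, 32, 26, 28),
  height   = c(173, 193, 171, 191, 156, 172),
  shoesize = c(38, 43, 41, 43, 36, 34),
  IQ       = c(89, 93, 92, 97, 110, 117)
)

# Option 1 would repeat the same step for every column:
# tb$age    <- rescale01(tb$age)
# tb$height <- rescale01(tb$height)   # etc.

# Option 2: modify the existing object inside a for loop
for (i in seq_along(tb)) {
  tb[[i]] <- rescale01(tb[[i]])
}

tb
```

As the loop overwrites the columns of tb, no separate output object needs to be created.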

2. Looping over names or values, instead of indices

So far, we used for loops to iterate over the numeric indices of x with for (i in seq_along(x)), and then extracted the i-th value of x with x[[i]].

There are 2 additional common loop patterns:

  1. loop over the elements of xs with for (x in xs): useful when only caring about side effects

  2. loop over the names of xs with for (nm in names(xs)): useful when names are needed for files or plots.

Whenever creating named output, make sure to also provide names to the results vector:

Note that the basic iteration over the numeric indices is the most general form, because given a position we can extract both the current name and the current value:
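The three patterns can be compared in a small sketch. The named vector xs and the object results are hypothetical examples:

```r
xs <- c(a = 1.5, b = 2.5, c = 4)   # hypothetical named vector

# 1. Loop over the elements: only the values are available
for (x in xs) {
  print(x)
}

# 2. Loop over the names: provide names to the results vector
results <- vector("double", length(xs))
names(results) <- names(xs)
for (nm in names(xs)) {
  results[[nm]] <- xs[[nm]]^2
}

# 3. Loop over the indices: name and value can both be recovered
for (i in seq_along(xs)) {
  name  <- names(xs)[[i]]
  value <- xs[[i]]
  cat(i, name, value, "\n")
}
```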

3. Handling outputs of unknown length

  • Problem: Knowing the number of iterations, but not knowing how long the output will be.
  • Solution: Increase the size of the output within a loop OR use a list object to collect loop results.

For example, imagine we knew the first N = 1000 digits of pi (i.e., a series of digits 314159 etc.), but wanted to count how frequently a specific target subsequence (e.g., target = 13) occurs in this sequence.

The following code uses pi_100k — available in the ds4psy package or from a file pi_100k.txt — to read the first N = 1000 digits of pi into a scalar character object pi_1000:

The following code uses a for loop to answer the questions: How many times and at which positions does the target <- 13 occur in pi_1000?

The answers are that 13 occurs 12 times in pi_1000. Its 1st occurrence is at position 111 and its 12-th occurrence at position 972.

Note that we could specify the number of iterations (i.e., N - 1 loops, from 1 to 999), but not the number of elements in target_positions.

Growing the target_positions vector inside the loop (by target_positions <- c(target_positions, i) every time a new target is found) is quite slow and inefficient, as R has to copy the entire vector on each extension. However, this is not problematic as long as we only do this once and for a relatively small problem (like a loop with 999 iterations).
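The approach of growing the result vector can be sketched on a smaller problem. As assumptions, digits holds only the first 41 digits of pi (without the decimal point) as a stand-in for pi_1000, and the target is the character string "32" rather than 13:

```r
# Stand-in for pi_1000: the first 41 digits of pi
digits <- "31415926535897932384626433832795028841971"
target <- "32"
N <- nchar(digits)

target_positions <- c()   # output of unknown length

for (i in 1:(N - 1)) {
  # Extract the 2 digits at positions i and i + 1
  digits_i <- substr(digits, i, i + 1)
  if (digits_i == target) {
    # Grow the output vector by the current position
    target_positions <- c(target_positions, i)
  }
}

target_positions          # 16 28
length(target_positions)  # 2
```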

A more efficient solution could initialize target_positions to a list (which can take any data object as an element) and then store any instance of finding the target at the i-th position of pi_1000 as the i-th element of the list. Once the loop is finished, we can use unlist() to flatten the list to a vector:

This way, we can initialize the length of target_positions before entering the for loop. This makes it possible to assign any new target to target_positions[i], but makes the list much larger than it actually needs to be. The advantages and disadvantages of these different options should be considered for the specific problem at hand.
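A sketch of the list-based alternative, again assuming the first 41 digits of pi as a stand-in for pi_1000 and the character target "32":

```r
# Stand-in for pi_1000: the first 41 digits of pi
digits <- "31415926535897932384626433832795028841971"
target <- "32"
N <- nchar(digits)

# Pre-allocate one (empty) list element per iteration
target_positions <- vector("list", N - 1)

for (i in 1:(N - 1)) {
  if (substr(digits, i, i + 1) == target) {
    target_positions[[i]] <- i   # store a hit at the i-th element
  }
}

# Flatten the list: empty (NULL) elements are dropped
target_positions <- unlist(target_positions)
target_positions   # 16 28
```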

4. Handling sequences of unknown length

  • Problem: The number of iterations is not known in advance.
  • Solution: Use a while loop, with a condition to stop the loop.

Sometimes we cannot know in advance how many iterations our loop should run for. This is common when dealing with random outcomes or running simulations that need to reach some threshold value to stop.

We can address these problems with a while loop. Actually, a while loop is simpler than a for loop because it only has 2 components, a condition and a body:
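The two components can be seen in a small sketch that doubles a value until it exceeds a threshold (the names value and steps and the threshold of 100 are hypothetical):

```r
value <- 1
steps <- 0

while (value <= 100) {   # condition: checked before every iteration
  value <- value * 2     # body: executed as long as the condition holds
  steps <- steps + 1
}

value   # 128
steps   # 7
```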

A while loop is also more general than a for loop because we can write any for loop as a while loop, but not vice versa. For instance, any for loop with N steps:

can be re-written as a while loop that uses a counter variable i for the number of iterations and a condition that the maximum number of steps N must not be exceeded:
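The correspondence can be sketched for N = 5 steps (the task of storing squares and the object names are illustrative):

```r
N <- 5

# for loop with N steps
squares_for <- vector("double", N)
for (i in 1:N) {
  squares_for[[i]] <- i^2
}

# Equivalent while loop with an explicit counter
squares_while <- vector("double", N)
i <- 1                         # initialize the counter
while (i <= N) {               # condition: do not exceed N steps
  squares_while[[i]] <- i^2
  i <- i + 1                   # increment the counter
}

identical(squares_for, squares_while)   # TRUE
```

Forgetting the increment i <- i + 1 would leave the condition TRUE forever, which is the kind of maintenance that the for loop handles automatically.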

As this requires explicit maintenance (here: the initialization and incrementation of a counter), we prefer using for loops when the number of iterations is known in advance.

However, we often do not know in advance how many iterations we will need. For instance, let’s ask ourselves the following question:

  • At which position in the first 1000 digits of pi do we first encounter the subsequence 13?

Actually, we do know the answer to this problem from target_positions above: The 1st occurrence of 13 in pi_1000 is at position 111. Knowing a solution makes this a good practice problem.

Assuming that we know nothing else of the sequence, we cannot do that sort of iteration with the for loop (unless we loop over the entire sequence, as we did above). The following while loop solves this task by incrementally increasing i to inspect the corresponding digits (at positions i and i + 1) of pi_1000 as long as we meet the condition digits_i != target:
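A sketch of such a while loop on a smaller problem. As assumptions, digits holds only the first 41 digits of pi (without the decimal point) as a stand-in for pi_1000, and the target is the character string "32":

```r
# Stand-in for pi_1000: the first 41 digits of pi
digits <- "31415926535897932384626433832795028841971"
target <- "32"

i <- 1
digits_i <- substr(digits, i, i + 1)   # digits at positions i and i + 1

while (digits_i != target) {           # condition: target not yet found
  i <- i + 1                           # move on by 1 position
  digits_i <- substr(digits, i, i + 1)
}

i   # 16
```

This loop only stops because the target actually occurs in digits.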

A danger of while loops is that they may never stop. For instance, if we asked:

  • At which position in the first 1000 digits of pi do we first encounter the subsequence 999?

we could slightly modify our code above (adapting digits_i to extract 3-digit sequences):

The answer is: The digits 999 first appear in pi_1000 at position 763.

However, if we changed our target to 123 to ask the analogous question:

  • At which position in the first 1000 digits of pi do we first encounter the subsequence 123?

the same while loop would encounter an error message:

The source of this error becomes obvious when realizing that i is set to a value of 1001: We simply did not find an instance of 123 in the first 1000 digits of pi and the counter is trying to access its 1001st digit, which is undefined (NA) and hence causes an error in our condition (digits_i != target).

To prevent this sort of error, we could modify our condition to also stop the while loop after the maximum number of possible steps has been reached. In our case, the while loop only makes sense as long as we do not exceed the number of characters in pi_1000, so that we can add the requirement (i <= nchar(pi_1000)) as an additional (conjunctive, i.e., using &&) test to our condition:

This way, the while loop is limited to a maximum of nchar(pi_1000) = 1000 iterations. If the counter i shows an impossible value of 1001, we can conclude that the target sequence was not found.
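A sketch of the guarded loop, wrapped in a hypothetical helper function find_first() and applied to the first 41 digits of pi as a stand-in for pi_1000. Because this sketch compares character strings, an unguarded loop would not raise the error described above, but would run indefinitely:

```r
# Stand-in for pi_1000: the first 41 digits of pi
digits <- "31415926535897932384626433832795028841971"

# Hypothetical helper: position of the 1st occurrence of target in digits
find_first <- function(digits, target) {
  len <- nchar(target)
  i <- 1
  digits_i <- substr(digits, i, i + len - 1)

  # Continue only while the target is not found AND i is a valid position
  while ((digits_i != target) && (i <= nchar(digits))) {
    i <- i + 1
    digits_i <- substr(digits, i, i + len - 1)
  }
  i
}

find_first(digits, "979")   # 13
find_first(digits, "123")   # 42 (i.e., nchar(digits) + 1: not found)
```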

12.2.3 Functional programming

For loops vs. functionals

In R, for loops are not as important as in most other programming languages, because R is a functional programming language. This means that it is possible to replace many for loops by wrapping up the body of a for loop in a function, and then calling that function.

Example

Consider a simple data table df:

Assume that our goal is getting the mean of every column of df. The standard solution for this task is using a for loop:

As we will want to compute the means of every column pretty frequently, we abstract the for loop into a dedicated col_mean function:
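Both steps can be sketched as follows. The contents of df are a hypothetical example with 3 numeric columns:

```r
# Hypothetical data table with 3 numeric columns
df <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(10, 20, 30, 40),
  c = c(2, 4, 8, 16)
)

# Step 1: for loop computing the mean of every column
output <- vector("double", length(df))
for (i in seq_along(df)) {
  output[[i]] <- mean(df[[i]])
}

# Step 2: the same loop, wrapped into a function
col_mean <- function(df) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[[i]] <- mean(df[[i]])
  }
  output
}

col_mean(df)   # 2.5 25.0 7.5
```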

But now we are no longer satisfied with the mean, but also want the median and standard deviation of every column. Of course we could write analogous functions col_median() and col_sd(), but they would only differ in 1 line from col_mean above. Hence, what we really want is to pass a function (like mean or median) to another function, which is exactly what we will learn to do next.

Motivation: Generalizing functions

Consider the following 3 functions:

The 3 functions only vary in their last digit. Given this degree of duplicated code, we would make the last element an additional argument to a more general function:

By doing this, we have reduced the amount of code and the corresponding chance of errors, and made it easier to generalise to new situations.
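This generalization step can be sketched as follows. The function bodies are an assumption: three functions that differ only in their final exponent, which fits the description of functions that vary only in their last digit:

```r
# Three functions that differ only in their exponent (assumed bodies)
f1 <- function(x) abs(x - mean(x))^1
f2 <- function(x) abs(x - mean(x))^2
f3 <- function(x) abs(x - mean(x))^3

# General function: the exponent becomes an additional argument i
f <- function(x, i) abs(x - mean(x))^i

x <- c(1, 2, 6)
f(x, 2)    # same result as f2(x)
```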

Application: Using functions as arguments to functions

We can achieve exactly the same thing for 3 similar functions — like col_mean(), col_median(), and col_sd() — by adding an argument to a function that supplies the function to apply to each column:

The new function col_summary() accepts data and a function fun as its arguments. The for loop in its body ensures that fun is applied to each element of data. When data is a data frame, its elements are its variables (columns).

Here is what happens when we pass data = df and fun = mean to this col_summary() function:

We obtain the means of all columns as the expected results (corresponding to those above). But the real power of col_summary() lies in the fact that we can pass a variety of functions to it:

Passing these functions computes the medians or standard deviations of all columns in data.
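A sketch of col_summary() and its use with different functions. The function name and its arguments data and fun are taken from the text, whereas the function body and the contents of df are assumptions:

```r
# Hypothetical data table with 3 numeric columns
df <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(10, 20, 30, 40),
  c = c(2, 4, 8, 16)
)

# Apply the function fun to every element (column) of data
col_summary <- function(data, fun) {
  out <- vector("double", length(data))
  for (i in seq_along(data)) {
    out[[i]] <- fun(data[[i]])
  }
  out
}

col_summary(df, mean)     # means of all columns
col_summary(df, median)   # medians of all columns
col_summary(df, sd)       # standard deviations of all columns
```

Note that fun is passed without parentheses (mean, not mean()), as we pass the function itself rather than the result of calling it.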

Overall, the idea of passing a function (like mean or median) as an argument (here: fun) to another function (here: col_summary()) is an extremely powerful feature of R and one reason for calling R a functional programming language.

12.2.4 Replacing loops in base R and purrr

The apply family of functions of base R and the map functions of the purrr package provide functions that eliminate the need for many for loops.

Note that using apply or map is not necessarily faster than using for loops. The chief benefit of using these functions is not speed, but clarity: They make code easier to write and read.

12.2.4.1 Using apply functions

The base R apply family of functions (i.e., apply(), lapply(), tapply(), etc.) replaces for loops over data structures by applying a function to designated parts of a data structure. To accomplish this, apply() takes the following arguments:

  • X is an array or matrix (i.e., rectangular data);
  • MARGIN indicates the dimension over which the function is applied: 1 = rows, 2 = columns;
  • FUN is the function to be applied.

Examples

We can use base::apply to solve the problems addressed by col_summary above:

In this call, MARGIN = 2 instructed apply to apply some function FUN to each column of X. Changing the argument to MARGIN = 1 applies FUN to each row of X:
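Both settings of MARGIN can be sketched on a hypothetical data frame df (which apply() converts into a matrix before applying FUN):

```r
# Hypothetical data table with 3 numeric columns
df <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(10, 20, 30, 40),
  c = c(2, 4, 8, 16)
)

# MARGIN = 2: apply FUN to each column of X
col_means <- apply(X = df, MARGIN = 2, FUN = mean)
col_means

# MARGIN = 1: apply FUN to each row of X
row_means <- apply(X = df, MARGIN = 1, FUN = mean)
row_means

# Other functions can be passed in the same way
apply(X = df, MARGIN = 2, FUN = sd)
```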

Note some variants of apply:

  • lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.

  • See also sapply() and vapply() for simplifying arrays and returning vectors.
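The differences between these variants can be sketched on a hypothetical data frame df:

```r
# Hypothetical data table with 3 numeric columns
df <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(10, 20, 30, 40),
  c = c(2, 4, 8, 16)
)

# lapply(): always returns a list (1 element per column)
res_l <- lapply(df, mean)

# sapply(): simplifies the result (here: to a named numeric vector)
res_s <- sapply(df, mean)

# vapply(): like sapply(), but the type and length of each result
# must be declared in advance (here: 1 number per column)
res_v <- vapply(df, mean, FUN.VALUE = numeric(1))
```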

12.2.4.2 Using map functions

The purrr package (Henry & Wickham, 2020) provides a family of map functions as modern and more consistent versions of apply.

The main goal of using purrr functions (instead of for loops) is to break common list manipulation challenges into smaller and independent pieces. This strategy involves 2 steps, each of which scales down the problem:

  1. Solving the problem for a single element of the list.
    Once we have solved that problem, purrr takes care of generalising the solution to every element in the list.

  2. Breaking a complex problem down into smaller sub-problems that allow us to advance towards a solution.
    With purrr, we get many small pieces that we can compose together with the pipe (%>%).

This scaling-down strategy makes it easier to solve new problems and to understand our solutions to old problems when we re-read older code.

Essential map functions

The pattern of looping over a vector, doing something to each element, and saving the results is so common that the purrr package provides a family of functions for it. There is a separate function for each type of output:

  • map() creates a list.
  • map_lgl() creates a logical vector.
  • map_int() creates an integer vector.
  • map_dbl() creates a double vector.
  • map_chr() creates a character vector.

Each function takes a vector .x as input, applies a function .f to each element, and then returns a new vector that has the same length (and the same names) as the input vector. The type of the output vector is determined by the suffix of the map function (e.g., map_chr() returns the output as a character vector).

Practice

  • Predict the output of the following variants of map, then check your predictions:

Details

It is worth pointing out some details regarding the uses of the map variants:

  • Without the suffix _dbl, map(df, sd) returns a list (with an element for every column of df).

  • With the map functions, our focus is on the function/operation, not the bookkeeping of the for loop. This is even more obvious when using the pipe:

  • map functions also use the generic ... argument to allow passing additional arguments to the function indicated by .f:

  • map functions preserve the names of .x:
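The details above can be sketched as follows, assuming that the purrr package is installed and using a hypothetical data frame df. The native pipe |> (available in R 4.1 or later) is used here as a stand-in for %>%:

```r
library(purrr)

# Hypothetical data table with 3 numeric columns
df <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(10, 20, 30, 40),
  c = c(2, 4, 8, 16)
)

# Without a suffix, map() returns a list
sd_list <- map(df, sd)

# With the suffix _dbl, the result is a double vector
sd_vec <- map_dbl(df, sd)

# With a pipe, the focus is on the operation
mn_vec <- df |> map_dbl(mean)

# Additional arguments (here: trim) are passed on to .f via ...
mn_trim <- map_dbl(df, mean, trim = 0.25)

# The names of .x are preserved in the output
names(map_int(df, length))
```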

Shortcuts

There are a few shortcuts to save typing in the .f argument of map.

Imagine we want to fit a linear model to each group in a dataset.

The following example splits up the mtcars dataset into 3 pieces (by the number of cylinders) and fits the same linear model to each piece. The linear model is supplied as an anonymous function:

As the syntax for creating an anonymous function in R is quite verbose, purrr provides a one-sided formula as a shortcut:

Here, the . is used as a pronoun that refers to the current list element.

When inspecting many models, we may want to extract a summary statistic like \(R^{2}\). To do that we need to first run summary() on our models and then extract the component called r.squared. We could do this using the shorthand for anonymous functions:

But extracting named components from a list of elements is a common operation, so purrr provides an even shorter shortcut: We can use a string:

Alternatively, we can also use an integer to select elements by their position:
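The shortcuts described above can be sketched as follows, assuming that the purrr package is installed. The model formula mpg ~ wt and the list x are assumptions chosen for illustration:

```r
library(purrr)

# Split mtcars into 3 pieces (by number of cylinders)
by_cyl <- split(mtcars, mtcars$cyl)

# Anonymous function (verbose)
models_1 <- map(by_cyl, function(df) lm(mpg ~ wt, data = df))

# One-sided formula (shortcut): . refers to the current list element
models_2 <- map(by_cyl, ~ lm(mpg ~ wt, data = .))

# Extracting R^2: 1st summarize each model, then extract r.squared
model_summaries <- map(models_2, summary)
r2_formula <- map_dbl(model_summaries, ~ .$r.squared)   # formula shortcut
r2_string  <- map_dbl(model_summaries, "r.squared")     # string shortcut

# An integer selects elements by their position
x <- list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9))
map_dbl(x, 2)   # 2 5 8
```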

12.2.4.3 Comparing map vs. apply

The map functions of purrr are modeled on the apply functions:

  • lapply() is basically identical to map(), except that map is consistent with all the other functions in purrr, and we can use the shortcuts for .f.

  • sapply() is a wrapper around lapply() that automatically simplifies the output (which can yield unexpected results).

  • vapply() is a safe alternative to sapply() because you supply an additional argument that defines the type. The only problem with vapply() is that it’s a lot of typing: vapply(df, is.numeric, logical(1)) is equivalent to map_lgl(df, is.numeric).

An advantage of vapply() over purrr’s map functions is that it can also produce matrices — the map functions only ever produce vectors.
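Both points can be sketched in base R on a hypothetical data frame df:

```r
# Hypothetical data table with 3 numeric columns
df <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(10, 20, 30, 40),
  c = c(2, 4, 8, 16)
)

# 1 logical value per column (cf. purrr's map_lgl(df, is.numeric))
is_num <- vapply(df, is.numeric, logical(1))
is_num

# 2 values per column (minimum and maximum): the result is a matrix
rng <- vapply(df, range, numeric(2))
rng
```

Here, rng is a matrix with 2 rows (minimum and maximum) and 1 column per variable of df.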

12.2.4.4 Advanced aspects of purrr

See the following sections of r4ds (Wickham & Grolemund, 2017) for more advanced issues in combination with the map functions of purrr:

  • 21.6: Dealing with failure for using the adverbs safely, possibly, and quietly in combination with map functions.

  • 21.7: Mapping over multiple arguments for the map2() and pmap() variants (for supplying multiple arguments to a function), as well as invoke_map() (for supplying multiple functions).

  • 21.8: Walk provides alternatives to map when calling functions for their side effects (see walk(), walk2() and pwalk() for details).

Rather than digging deeper into the functional programming paradigm of purrr, the exercises of this chapter (in Section 12.4) will focus on good old-fashioned for loops and only cover essential aspects of apply and map.

References

Henry, L., & Wickham, H. (2020). purrr: Functional programming tools. Retrieved from https://CRAN.R-project.org/package=purrr

R Core Team. (2020). R base: A language and environment for statistical computing. Retrieved from https://www.R-project.org

Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. Retrieved from http://r4ds.had.co.nz