6 Iteration

Having learned how to use conditionals (in Chapter 4) and how to write functions (in Chapter 5), we now turn to the notion of iteration, which means to repeatedly execute a process. As R is a functional programming language, the notion of iteration is deeply embedded in its DNA. Iteration in R can be explicit (by enclosing code in loops), but often remains implicit (by computing operations over vectors or applying functions to data structures).

This chapter concludes Part 2 on programming basics in R.

Preparation

Alternative introductions to iteration include:

Preflections


To reflect upon the notion and uses of iteration, try answering the following questions:

  • What does the word iteration mean?

  • Why and when do we want to execute code repeatedly?

  • What other means of repeating code have we encountered in R?

6.1 Introduction

‘Begin at the beginning,’ the King said gravely,
‘and go on till you come to the end: then stop.’

Lewis Carroll: Alice’s Adventures in Wonderland (Chapter XII)

The king’s instruction is so general that it may seem vacuous. Nevertheless, it aptly describes what happens when a human or a computer executes a rule-based process. But rather than moving straight from the beginning to the end, a characteristic feature of many computer programs is that they execute code in an iterative fashion.

The term derives from the Latin verb iterare (“to repeat”, from iterum, “again”), and to “re-iterate” something is to repeat something. In ordinary language, proceeding in an “iterative” fashion typically means to proceed step-by-step until some end or goal is reached.

In programming contexts, the term iteration means to repeat a process or procedure. However, unless this process or procedure includes some random element, strict repetition of a deterministic sequence of steps would only yield the same outcome. To be useful, we typically do not want to repeat an exact series of calculations, but handle different data inputs in each repetition. If an overall problem can be decomposed into many similar sub-problems, successfully solving each sub-problem helps to solve the overall problem.

Just as we have previously emphasized for functions (in Section 5.1), iteration provides a means of abstraction: By encapsulating a process (e.g., in a loop), we view it as a unit that we can repeat. As we can treat a loop as a black box that is defined in terms of its inputs and outputs, iteration and functions are closely related. Rather than constructing a loop, we can always write a corresponding function and call this function repeatedly (with different inputs).
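As a minimal sketch of this correspondence (the vector x and the function square() are hypothetical names that are only introduced for illustration), a loop and a series of function calls can compute the same values:

```r
x <- c(2, 4, 6)  # hypothetical inputs

# (a) Loop: repeat a process for each input
out_loop <- rep(NA, length(x))
for (i in seq_along(x)) {
  out_loop[i] <- x[i]^2
}

# (b) Function: encapsulate the process and call it with different inputs
square <- function(v) { v^2 }
out_fun <- c(square(x[1]), square(x[2]), square(x[3]))

out_loop
#> [1]  4 16 36
all.equal(out_loop, out_fun)
#> [1] TRUE
```

In both variants, the process (squaring a number) is treated as a unit that is repeated with different inputs.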

In R, we distinguish between explicit and implicit iteration. Explicit iteration involves loop structures (Section 6.2), whereas implicit iteration applies or maps functions to data structures (Section 6.3).

6.2 Explicit iteration: Loops

Loops are structures of code that cause some code (i.e., a collection of R expressions) to be executed repeatedly. The series of expressions or loop <body> to be repeated is enclosed in curly brackets { <body> }. To instruct R when and how often the <body> is to be evaluated, any loop requires some condition or criterion that indicates whether to continue with another iteration or to stop and exit the loop.

R provides three basic structures for explicit iteration:

  1. for loops define an explicit index variable and the range of values that this variable should assume. They are indicated when we know (or can determine) the number of required iterations in advance.

  2. while loops are more general and specify the condition for repeating a process. They are indicated when we primarily know a condition that should end a loop.

  3. repeat loops re-iterate a process until a break keyword is encountered. They are similar to while loops, but require an explicit termination signal.

When defining explicit loops, two keyword constructs provide additional control over the flow of information:

  • next abandons the <body> of the current loop iteration and begins the next iteration of the same loop.

  • break causes an exit from the <body> of the current (innermost) loop and continues with the code after this loop.

We will discuss these three types of loops in the following sections.
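The effects of next and break can be sketched in a small for loop. The specific conditions (skipping even numbers and stopping beyond 7) are arbitrary choices for illustration:

```r
for (i in 1:10) {

  if (i %% 2 == 0) { next }   # skip the rest of the body for even numbers
  if (i > 7) { break }        # exit the loop as soon as i exceeds 7

  print(i)                    # only reached for odd numbers up to 7

}
#> [1] 1
#> [1] 3
#> [1] 5
#> [1] 7
```

Here, next moves on to the next value of i, whereas break abandons the loop altogether (so that the value 10 is never reached).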

6.2.1 for loops

The general structure of a for loop in R is as follows:

for (i in <seq>){
  
    <body>

}

In this template, i is an index variable that assumes a new value for every iteration of the loop, <seq> denotes the sequence of values of i, and <body> can be an arbitrary series of R expressions. Note that defining <seq> (e.g., as 1:n) implies that the number of repetitions n is known or can be determined (e.g., from the shape of some data structure).

To be helpful, the loop <body> must either evoke desired side effects (e.g., print or plot something) or modify some data structure that collects the results. If the current overall (or global) problem can be decomposed into many (local) sub-problems, solving each sub-problem and storing its solution in some data structure helps to solve the overall problem.

Before writing a for loop, we first must figure out its range and the type of output we need or want. Correspondingly, two key questions for creating a for loop are:

  • Which range: Over which variable and range of values do we want to iterate (to construct <seq>)?

  • Which output: How do we provide the local results of each loop iteration or collect them to answer our global problem?

A general recipe for creating loops includes the following steps:

  1. Decompose an overall problem into many similar sub-problems.

  2. Anticipate the desired output of each sub-problem and prepare a data structure to collect this output.

  3. Define an index variable i, a sequence <seq>, and a loop <body> that solves each sub-problem and collects its result in the output.

  4. Use the loop output to solve the overall problem.

In the following, we illustrate this recipe for simple examples.

Using for loops

To illustrate the creation of loops (and later the application of functions), we first create a simple data structure. Here is a data frame df that contains the numbers from 1 to 20 in 5 rows and 4 columns, with its variables named A to D:

df <- data.frame(matrix(1:20, ncol = 4, byrow = TRUE))
names(df) <- LETTERS[1:4]
df
#>    A  B  C  D
#> 1  1  2  3  4
#> 2  5  6  7  8
#> 3  9 10 11 12
#> 4 13 14 15 16
#> 5 17 18 19 20

Our goal for various loops in this chapter will be to compute arithmetic results (e.g., the sum, mean, etc.) over the rows and columns of this data. Although this helps to illustrate the creation of loops, we emphasize that we usually do not need loops for solving such simple tasks. For instance, there even exist R functions for computing the sum of (numeric) rows or columns:

rowSums(df)
#> [1] 10 26 42 58 74
colSums(df)
#>  A  B  C  D 
#> 45 50 55 60

But if we were ignorant of these functions (or had to solve tasks that lack a dedicated function), a general strategy for solving such tasks consists in using for loops to derive and collect the desired results.

Sums of rows

We can apply our general loop recipe to the current overall problem (of computing the row sums of a data frame):

  1. Decomposition: We obtain the row sums of a data frame df by computing the sum of each individual row.

  2. Expected output: We expect to obtain a vector of numeric sums and initialize out as 5 missing values.

  3. We define a for loop that increments an index variable i from 1 to nrow(df). The loop <body> must sum the values of each row and collect them in out.

  4. As our solution to the overall problem, we simply print out to view the computed row sums.

# A. Prepare output:
out <- rep(NA, nrow(df))

# B. Create for loop:
for (i in 1:nrow(df)){
  
  # a. Type the name of each variable (bad):
  out[i] <- df$A[i] + df$B[i] + df$C[i] + df$D[i]  
  
  # b. Sum the i-th row of df (better):
  out[i] <- sum(df[i, ])  
  
}

# C. Use output:
out
#> [1] 10 26 42 58 74

Note that we implemented two different solutions in the <body> of our loop:

  • Variant a. refers to each variable (column) of df by its name and explicitly adds the corresponding values. This solution requires typing each variable name, but could also handle data frames with non-numeric columns.

  • Variant b. uses numerical indexing of df to select each row of df and adds their values by the sum() function. This solution works when all values are numeric.

Both variants yield the same solution (which is why we could keep both of them), but b. is more elegant than a.

In this example, solving the overall problem only consisted in printing the solutions stored in out. But in real applications we would probably continue doing other things with them. We can verify that we obtained the expected sums by comparing the values of out with the corresponding results of the rowSums() function:

# Check:
all.equal(out, rowSums(df))
#> [1] TRUE

Sums of columns

Given that our data frame df consists only of numbers, summing its rows and columns seems very similar. However, knowing that a data frame stores variables (or vectors) in columns, rather than in rows, suggests that both directions may not be quite so similar. Nevertheless, we can solve the overall problem of computing all column sums of df with a for loop by making only minimal changes (changing the definition of <seq> from nrow() to ncol() and indexing the 2nd dimension of df in the loop <body>):

# A. Prepare output:
out <- rep(NA, ncol(df))

# B. Create for loop:
# for (i in 1:ncol(df)){   # explicit <seq> range
for (i in seq_along(df)){  # implicit <seq> range
  
  out[i] <- sum(df[ , i])  # i-th column
  
}

# C. Use output:
out
#> [1] 45 50 55 60

In defining the range <seq> of this for loop, we also replaced the explicit value range definition (as a numeric vector 1:ncol(df)) by the seq_along() function. When the range of loop values can be determined by a data structure, seq_along() allows us to define the value range in a more implicit fashion. Its output depends on its type of input: When used on a vector v, seq_along(v) returns a vector of integers from 1 to length(v). When used on a data frame df, seq_along(df) returns an integer vector of the column numbers (i.e., the indices of the list elements of df).
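A further reason for preferring seq_along() shows up for empty data structures. The following sketch uses a hypothetical empty vector v and a counter n_iter (both introduced only for illustration):

```r
v <- numeric(0)  # an empty vector (hypothetical edge case)

1:length(v)      # counts backwards from 1 to 0
#> [1] 1 0
seq_along(v)     # an empty sequence
#> integer(0)

n_iter <- 0      # count the number of loop iterations
for (i in seq_along(v)) { n_iter <- n_iter + 1 }
n_iter
#> [1] 0
```

A loop over 1:length(v) would run twice here (for the values 1 and 0), whereas a loop over seq_along(v) is not executed at all.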

Again, let’s verify that out contains the same result as the corresponding function:

# Check:
all.equal(out, colSums(df), check.attributes = FALSE)
#> [1] TRUE

Thus, our second for loop also yielded the same results as the corresponding function.

Let’s practice our knowledge and skills on for loops before we study while and repeat loops.

Practice

  1. Create for loops for computing the row and column means and medians of df.

  2. What would change, if we added next as the last statement of a loop’s <body>? What would change, if we added break as the last statement of a loop’s <body>? (Predict and then try out both in your solutions to 1.)

  3. It seems that 1:n and seq_along(1:n) yield identical results. Explain the difference between them and why seq_along() still makes sense.

Solution

# Yes: 
all.equal(1:5, seq_along(1:5))
#> [1] TRUE
# but only because each element in the vector 1:5 corresponds to the 
# (numeric index of the) i-th element of 1:5 (which is what seq_along(v) provides). 

# Differences between v and seq_along(v) are far more common:
all.equal(5:1, seq_along(1:5))
#> [1] "Mean relative difference: 1"
all.equal(runif(5), seq_along(runif(5)))
#> [1] "Mean relative difference: 4.434031"
all.equal(letters[1:5], seq_along(letters[1:5]))
#> [1] "Modes: character, numeric"              
#> [2] "target is character, current is numeric"

# seq_along() is more general than 1:n: 
# seq_along() corresponds to 1:length(v) for vectors v.
# seq_along() corresponds to 1:ncol(df) for data frames df. 

  4. How could the for loops that iterate over all rows of df be re-written with seq_along()?

Solution

By using seq_along() for a vector (i.e., a column) of df (e.g., df[, 1] or df[[1]]).

  5. Use a for loop to write a function that computes the factorial \(n!\) (for a non-negative integer n).

Hint: The factorial \(n!\) is also computed by the base R function factorial() and was previously defined as a recursive fac() function (in Section 5.4.3).

Solution

The most complicated parts of the following fac_for() function are the conditionals to catch some special cases. By contrast, the for loop to multiply fac by each number i is simple:

fac_for <- function(n){
  
  # Catch some special cases:
  if (is.na(n) || is.character(n) || n < 0) { return(NA) }
  if (!ds4psy::is_wholenumber(n)) { stop("n must be an integer >= 0") }
  
  fac <- 1  # initialize output
  
  for (i in seq_len(n)){
    
    fac <- i * fac
    
  }
  
  return(fac)
  
}

After handling some special cases (by an early return() and a call to stop()), we initialize an output variable fac to a value of 1. Next, we set up the loop <seq> to iterate over seq_len(n). This corresponds to 1:n for positive n, but is empty for n = 0 (whereas 1:0 would yield the values 1 and 0 and thus turn the product into 0, rather than the correct value of 1). The loop <body> consists of a single line that updates the value of fac to the product i * fac. After the loop and at the end of the function, we return the current value of fac.

Let’s check the fac_for() function for some scalar inputs:

# Check:
fac_for(NA)
#> [1] NA
fac_for(-1)
#> [1] NA
fac_for(3/2)
#> Error in fac_for(3/2): n must be an integer >= 0
fac_for(0)
#> [1] 1
fac_for(1)
#> [1] 1
fac_for(2)  
#> [1] 2
fac_for(10)
#> [1] 3628800

all.equal(fac_for(123), factorial(123))
#> [1] TRUE

This suggests that fac_for() works as intended (as long as we only provide scalar inputs for n).

However, note that the for loop was not really necessary to solve this task. An even simpler solution could directly compute the product of the vector elements 1:n:

fac_prod <- function(n){
  
  # Catch some special cases:
  if (is.na(n) || is.character(n) || n < 0) { return(NA) }
  if (!ds4psy::is_wholenumber(n)) { stop("n must be an integer >= 0") }
  
  return(prod(seq_len(n)))  # seq_len(0) is empty, so that prod() yields 1
  
}

# Check:
fac_prod(NA)
#> [1] NA
fac_prod(-1)
#> [1] NA
fac_prod(3/2)
#> Error in fac_prod(3/2): n must be an integer >= 0
fac_prod(0)
#> [1] 1
fac_prod(1)
#> [1] 1
fac_prod(2)  
#> [1] 2
fac_prod(10)
#> [1] 3628800

all.equal(fac_prod(123), factorial(123))
#> [1] TRUE

In addition to for loops, the second construct of explicit iteration in R is a while loop.

6.2.2 while loops

The structure of a while loop seems even simpler than a for loop because it only has two components, a <condition> and a <body>:

while (<condition>) {

    <body>
}

However, this structural simplicity is deceiving, as it hides requirements on both the <condition> and the <body> of the loop. To execute the <body> of a while loop, its <condition> must initially evaluate to TRUE. More importantly, to ensure that the while loop eventually stops, its <body> must contain code that eventually changes the <condition> from TRUE to FALSE.

A while loop is somewhat more general than a for loop, as any for loop can easily be written as a while loop, but not necessarily vice versa. For instance, a for loop with n steps:

for (i in 1:n){
  
  <body>

}

can be re-written as a while loop that uses a counter variable i for the number of iterations and a condition that some maximum number of steps n must not be exceeded:

i <- 1  # initialize counter

while (i <= n){
  
  <body>
  
  i <- i + 1  # increment counter
  
}

As while loops require some explicit maintenance (here: the initialization and incrementation of a counter), we prefer using for loops when the number of iterations is available in advance. However, we often cannot know in advance how many iterations we will need — and that’s what while loops are for.
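As a concrete sketch of this translation, the column sums of df (computed by a for loop above) can also be obtained by a while loop. The data frame is re-created here so that the example is self-contained:

```r
# Data (as above):
df <- data.frame(matrix(1:20, ncol = 4, byrow = TRUE))
names(df) <- LETTERS[1:4]

out <- rep(NA, ncol(df))  # prepare output
i <- 1                    # initialize counter

while (i <= ncol(df)) {

  out[i] <- sum(df[ , i])  # i-th column
  i <- i + 1               # increment counter

}

out
#> [1] 45 50 55 60
```

Forgetting to increment i in the loop <body> would leave the condition TRUE forever, which illustrates the maintenance burden of while loops.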

Using while loops

As an example of a while loop use case, let’s throw two fair coins until we obtain two heads at once (where heads is denoted by the character "H" and the alternative outcome tails by "T"). As we do not know how long this will take, we throw our coins in a while loop and always check the results for our target event.

Here is a basic version of a while loop that runs until the desired event occurs:

# Initialize:
two_heads <- FALSE

while (!two_heads) {
  
  # Throw coins:
  cur_throw <- sample(c("H", "T"), size = 2, replace = TRUE)
  
  # Feedback (on Console):  
  cat(cur_throw, "\n", sep = "")
  
  # Stopping criterion:  
  if (all(cur_throw == "H")) { 
    two_heads <- TRUE
  }
  
}
#> TH
#> TH
#> TH
#> TT
#> TT
#> TH
#> TT
#> HH

The cat() function concatenates and prints the character vector cur_throw (and \n is a special new line character).

Our task gets trickier when we also want to collect the intermediate results. Lacking a given counter variable i, we can explicitly create one and increment it each time the loop is being executed. Additionally, we create an all_throws vector that will store our results. However, since we do not know the number of iterations, we can only initialize it to a scalar and then assign current events to the i-th element of all_throws:

# Initialize:
two_heads  <- FALSE
all_throws <- NA
i <- 0

while (!two_heads) {
  
  # Increment counter: 
  i <- i + 1  
  
  # Throw coins:
  cur_throw <- sample(c("H", "T"), size = 2, replace = TRUE)
  
  # Feedback (on Console):   
  cat(i, ": ", cur_throw, "\n", sep = "")
  
  # Collect results:
  all_throws[i] <- paste(cur_throw, collapse = "")
  
  # Stopping criterion:
  if (all(cur_throw == "H")) { 
    two_heads <- TRUE
  }
  
}
#> 1: TT
#> 2: TT
#> 3: HT
#> 4: TH
#> 5: HT
#> 6: TH
#> 7: HH

# Output: 
all_throws
#> [1] "TT" "TT" "HT" "TH" "HT" "TH" "HH"

This works, but increasing the length of the all_throws vector on each iteration is not a good idea. An alternative that may be faster is to initialize all_throws to a longer vector. In the current example, we can estimate that the random target event should occur once every four iterations on average (as each throw yields two heads with a probability of 1/4), but randomness can require many more throws with low, but non-negligible probabilities.
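One possible way of pre-allocating the output vector is sketched below. The upper bound max_n and the seed value are assumptions that were chosen only for illustration:

```r
set.seed(1)  # hypothetical seed (for reproducibility only)

# Initialize:
max_n      <- 100                        # assumed upper bound on throws
all_throws <- rep(NA_character_, max_n)  # pre-allocated output
two_heads  <- FALSE
i <- 0

while (!two_heads && i < max_n) {

  i <- i + 1
  cur_throw <- sample(c("H", "T"), size = 2, replace = TRUE)
  all_throws[i] <- paste(cur_throw, collapse = "")
  two_heads <- all(cur_throw == "H")

}

# Drop the unused elements:
all_throws <- all_throws[1:i]
```

Note that adding the condition i < max_n changes the logic of the loop: It now also ends when max_n throws have failed to produce the target event, so that the final value of two_heads should be checked before using the results.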

6.2.3 repeat loops

A repeat loop is even more general than a while loop: It continues to iterate through the loop <body> until the stop signal break is encountered. Its general structure is:

repeat {
  
  <body>
    
  if (<condition>) { break }
  
}

Actually, a repeat loop is very similar to a while loop: Just as we need to ensure that the <condition> of while eventually switches from being TRUE to being FALSE, a repeat loop must ensure that some <condition> eventually turns TRUE so that encountering break exits the loop. Both loop types risk getting lost in infinite looping, unless some event actually occurs and causes an exit. The key difference is that a repeat loop explicitly encodes the break, whereas a while loop leaves the break implicit in the truth value of its <condition>.
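The correspondence between both loop types can be sketched with two simple counters (the variables i and j and the limit of 3 are arbitrary choices for illustration):

```r
# (a) while: condition is checked before each iteration
i <- 0
while (i < 3) {
  i <- i + 1
}
i
#> [1] 3

# (b) repeat: body is executed first, an explicit break exits the loop
j <- 0
repeat {
  j <- j + 1
  if (j >= 3) { break }
}
j
#> [1] 3
```

Note that the condition of the repeat loop (j >= 3) is the negation of the while condition (i < 3). As the check occurs at the end of its <body>, this repeat loop is executed at least once.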

Using repeat loops

To illustrate a repeat loop, we move from flipping coins to rolling dice. Let’s roll a pair of dice (each showing numbers from 1 to 6) until a combination of 1 and 2 is being rolled (a compound event known, at least in Germany, as “Mäxle”). The word “until” signals that we cannot know how many iterations this will require. Hence, we use a repeat loop that repeats a process until our goal is reached:

repeat {
  
  dice <- sort(sample(1:6, size = 2, replace = TRUE))
  
  print(dice)
  
  if ((dice[1] == 1) && (dice[2] == 2)) { 
    break # required 
  }
  
  next  # optional
  
}
#> [1] 2 4
#> [1] 2 5
#> [1] 4 5
#> [1] 5 5
#> [1] 1 5
#> [1] 1 6
#> [1] 1 2

Note that using sort() to arrange the numeric outcomes of sample() in ascending order simplifies our stopping condition: We only need to check whether the first element of dice is 1 and the second is 2, rather than also checking the opposite order.

As we have emphasized for while loops (above), it is crucial that the stopping condition of our repeat loop will eventually evaluate to TRUE, so that the break keyword is encountered. Otherwise, the loop would repeat eternally, or until some external force (or a lack of electrical power) terminates it.

The core requirement of loops is that decomposing an overall problem into many similar sub-problems helps to solve the overall problem. As we will see in the next section on implicit iteration (in Section 6.3), there are (usually) easier and smarter ways for addressing iterative tasks than using explicit loops. But whereas there may be more elegant ways of solving the simple tasks for which we have created loops so far, the general notion of loops can also solve problems for which there exist no dedicated functions. And although loops are often denounced as somewhat clumsy or slow in R, there is nothing wrong with being explicit about an iteration.

Practice

Before we move on from explicit to implicit iteration, let’s practice while and repeat loops:

  1. Write a while and a repeat loop that throws three fair coins (each with an outcome of either “H” or “T”) until all three show the same outcome. Ensure that the loop outputs provide both the exact sequence of the obtained outcomes and the number of iterations needed to solve the problem.

Hint: When storing all outcomes in a suitable data structure, its shape represents the number of iterations.

  2. Re-write our example of a coin flipping while loop (from above) as a for loop that allows for a large number of iterations (e.g., for (i in 1:1000) and a break keyword). What are the advantages and disadvantages of for vs. while loops?

  3. Re-write our example of a dice rolling repeat loop (from above) as a while loop and a for loop that allows for a large number of iterations. What are the advantages and disadvantages of repeat vs. while vs. for loops?

6.3 Implicit iteration: Functional programming

Implicit iteration is a key competence of R — and something we already are familiar with. For instance, many operations (e.g., arithmetic ones) are vectorized in R, so that we do not need to use a loop (e.g., to multiply a vector v by a scalar constant c). As a consequence, explicit iteration is not as common in R as it is in most other programming languages. And when a loop primarily traverses or modifies a data structure, it is possible to replace explicit loops by wrapping up their <body> in a function, and then repeatedly apply that function to the data.
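To make the contrast concrete, here is a minimal sketch that multiplies a hypothetical vector v by a scalar constant, first by an explicit loop and then by vectorized arithmetic. (The constant is named k here, rather than c, to avoid confusion with the c() function.)

```r
v <- c(1, 2, 3, 4)  # hypothetical vector
k <- 10             # scalar constant

# (a) Explicit iteration:
out <- rep(NA, length(v))
for (i in seq_along(v)) {
  out[i] <- v[i] * k
}

# (b) Implicit iteration (vectorized arithmetic):
v * k
#> [1] 10 20 30 40

all.equal(out, v * k)
#> [1] TRUE
```

Both variants yield the same values, but the vectorized expression is shorter and leaves the iteration to R.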

In Chapter 5, we saw that functions provide a way of abstracting and encapsulating a series of instructions: Instead of writing a loop, we can enclose the required code in a function and then call this function repeatedly. As R provides sophisticated ways of repeatedly applying functions to data structures, it is known as a functional programming language.

In Section 2.2, we cited:

To understand computations in R, two slogans are helpful:
- Everything that exists is an object.
- Everything that happens is a function call.

John Chambers

A superficial reading of this statement could mis-interpret it as suggesting a division between (or dichotomy of) objects and functions. However, in Chapter 5, we have seen that functions are objects that are defined like other objects (by assigning the output of the function() function to an object name). Now it is time to break down another barrier between two seemingly distinct concepts: That between data and functions. As we will see, R provides several ways of avoiding explicit loops and these ways work by using functions as data arguments that are being passed to other functions.
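The idea of passing a function as an argument to another function can be sketched with a small helper. The function apply_to_both() and its arguments are hypothetical and only serve as an illustration:

```r
# A function that accepts another function as its argument fun:
apply_to_both <- function(x, y, fun) {
  c(fun(x), fun(y))
}

apply_to_both(1:3, 4:6, sum)
#> [1]  6 15
apply_to_both(1:3, 4:6, max)
#> [1] 3 6
```

The same helper yields different results, depending on which function is passed as fun. The functions introduced in the following sections rely on the same principle.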

Whereas most beginners find the abstract notion of implicit iteration (i.e., applying functions to data structures) difficult to grasp, it is actually simpler than many explicit loops. As we focus on the essentials in this book, we only briefly introduce the base R apply() (R Core Team, 2023a) and purrr’s map() family of functions (Henry & Wickham, 2023).

6.3.1 Applying functions to data structures

R provides a family of functions for implicit looping (e.g., apply(), lapply(), and tapply()). They all have in common that they take some data structure X and the name of some function FUN as inputs and provide some output after applying FUN to X. They differ in the data structures for inputs and outputs, and which other arguments they allow for.

In Section 6.2 above, our first two examples of for loops computed the row and column sums of some data frame df:

# Data structure (from above):
df
#>    A  B  C  D
#> 1  1  2  3  4
#> 2  5  6  7  8
#> 3  9 10 11 12
#> 4 13 14 15 16
#> 5 17 18 19 20

# Solutions:
rowSums(df)
#> [1] 10 26 42 58 74
colSums(df)
#>  A  B  C  D 
#> 45 50 55 60

Rather than defining explicit loops that solve this task, we will now solve it with implicit iteration.

Using apply()

Given that we have the sum() function to compute the sum (of a numeric vector), it would be nice to instruct R to apply this function to each row of the data structure df. The apply() function does exactly this:

apply(X = df, MARGIN = 1, FUN = sum)
#> [1] 10 26 42 58 74

Note that the arguments of apply() require some data (X = df), a function to be applied (FUN = sum), and some detail how the function should be applied to the data. This detail is provided by the MARGIN argument. For a rectangular data structure (e.g., a matrix or data frame), MARGIN specifies the dimension over which FUN should be applied to X: A value of 1 denotes rows, whereas 2 denotes columns. Thus, we can easily obtain the result of our 2nd for loop from Section 6.2 above, which computed the column sums of df:

apply(X = df, MARGIN = 2, FUN = sum)
#>  A  B  C  D 
#> 45 50 55 60

Note that we passed the name of the sum() function as a data argument of apply(). From studying the documentation of apply() (by evaluating ?apply), we can learn that — in addition to three required arguments X, MARGIN and FUN — it also allows for optional arguments (...) to be passed to FUN. For instance, if our data had contained any missing values, we could have added the argument na.rm = TRUE to be passed to the sum() function (to only add all non-missing values).
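As a sketch of passing such an optional argument, the following example uses a copy of our data (named df_na, a hypothetical name) in which one value was replaced by a missing value:

```r
# Data (as above), but with one missing value:
df_na <- data.frame(matrix(1:20, ncol = 4, byrow = TRUE))
names(df_na) <- LETTERS[1:4]
df_na$B[2] <- NA

apply(X = df_na, MARGIN = 2, FUN = sum)                # NA is propagated
#>  A  B  C  D 
#> 45 NA 55 60
apply(X = df_na, MARGIN = 2, FUN = sum, na.rm = TRUE)  # NA is removed
#>  A  B  C  D 
#> 45 44 55 60
```

As na.rm is not an argument of apply(), it is passed on to each call of sum().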

Equipped with the power to apply functions to data, we can avoid many explicit loops. Instead, tasks like the first Practice task above (which asked to compute the row and column means and medians of df) can immediately be solved by applying suitable functions to our data structure.

Using lapply()

To deal with different data structures, base R also provides several variants of apply(). The lapply() function applies a function FUN to a list X and returns a list of the same length as X. Each element of the returned list is the result of applying FUN to the corresponding element of X.

To provide some examples, we can define a list ls as follows:

ls <- list(a = 1L:10L, 
           b = seq(-1, +1, by = 1/2), 
           c = c(TRUE, FALSE, FALSE, TRUE)
           )
ls
#> $a
#>  [1]  1  2  3  4  5  6  7  8  9 10
#> 
#> $b
#> [1] -1.0 -0.5  0.0  0.5  1.0
#> 
#> $c
#> [1]  TRUE FALSE FALSE  TRUE

Although the elements of this list are vectors of different length and type, we can compute the sum() or mean() of each list element by:

# Apply FUN to list elements:
lapply(X = ls, FUN = sum)
#> $a
#> [1] 55
#> 
#> $b
#> [1] 0
#> 
#> $c
#> [1] 2
lapply(X = ls, FUN = mean)
#> $a
#> [1] 5.5
#> 
#> $b
#> [1] 0
#> 
#> $c
#> [1] 0.5

Actually, since a data frame also is a list of vectors (see Chapter 3), we can compute the column sums of df (computed explicitly by a for loop and implicitly by apply(df, MARGIN = 2, FUN = sum) above) alternatively by:

# Apply FUN to data frame variables (columns):
lapply(X = df, FUN = sum)
#> $A
#> [1] 45
#> 
#> $B
#> [1] 50
#> 
#> $C
#> [1] 55
#> 
#> $D
#> [1] 60

As lapply() not only accepts list inputs, but also returns its output as the elements of a list, it can be maximally flexible. For instance, if the function denoted by FUN returns more than one value as its output, lapply() can still cope with them:

# Apply FUN to data frame variables (columns):
lapply(df, range)
#> $A
#> [1]  1 17
#> 
#> $B
#> [1]  2 18
#> 
#> $C
#> [1]  3 19
#> 
#> $D
#> [1]  4 20
lapply(df, quantile)
#> $A
#>   0%  25%  50%  75% 100% 
#>    1    5    9   13   17 
#> 
#> $B
#>   0%  25%  50%  75% 100% 
#>    2    6   10   14   18 
#> 
#> $C
#>   0%  25%  50%  75% 100% 
#>    3    7   11   15   19 
#> 
#> $D
#>   0%  25%  50%  75% 100% 
#>    4    8   12   16   20

Despite the unmatched flexibility of lists, we often would prefer non-list outputs, which is where sapply() comes into play.

Using sapply()

When we prefer simpler output formats to lists, we can replace lapply(X, FUN) by sapply(X, FUN) (where s stands for simplified). Essentially, sapply() is a user-friendlier version of lapply() that returns its output as a named vector (if possible):

# Apply FUN to list elements (and simplify output):
sapply(X = ls, FUN = sum)
#>  a  b  c 
#> 55  0  2
sapply(X = ls, FUN = mean)
#>   a   b   c 
#> 5.5 0.0 0.5

For our data frame df, using sapply() to apply a function to each variable (column) simplifies the output to a named vector or a matrix (if needed):

# Apply FUN to data frame (and simplify output):
sapply(df, sum)       # as a vector
#>  A  B  C  D 
#> 45 50 55 60
sapply(df, range)     # as a matrix
#>       A  B  C  D
#> [1,]  1  2  3  4
#> [2,] 17 18 19 20
sapply(df, quantile)  # as a matrix
#>       A  B  C  D
#> 0%    1  2  3  4
#> 25%   5  6  7  8
#> 50%   9 10 11 12
#> 75%  13 14 15 16
#> 100% 17 18 19 20
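A caveat of this convenience is that the output type of sapply() depends on the results of FUN. In the following sketch (the function evens() and the list ls2 are hypothetical names introduced for illustration), the results differ in length, so that sapply() cannot simplify them and returns a list:

```r
# A function whose output length depends on its input:
evens <- function(v) { v[v %% 2 == 0] }

ls2 <- list(a = 1:4, b = 1:3)
res <- sapply(ls2, evens)

is.list(res)  # output was not simplified
#> [1] TRUE
lengths(res)  # results of unequal length
#> a b 
#> 2 1
```

When the type of output matters (e.g., inside of functions), a stricter alternative like vapply() may be preferable.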

The apply() family of functions has even more members. We will briefly introduce some of them, but see the documentation and examples of vapply(), mapply(), rapply() and tapply() for more specialized and sophisticated versions of the apply() function.

vapply()

The vapply() function is similar to sapply(), but provides additional control over the shape and type of the output. Its FUN.VALUE argument provides a template for the value returned by each call of FUN: If FUN.VALUE has a length of 1, vapply() returns a vector of length(X), otherwise an array that matches the structure of FUN.VALUE. While this can be safer, using vapply() can require a lot of typing: vapply(df, is.numeric, logical(1)) provides its output as logical values (and is equivalent to purrr::map_lgl(df, is.numeric), see below).
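The role of FUN.VALUE can be sketched with our data frame df (re-created here so that the example is self-contained):

```r
# Data (as above):
df <- data.frame(matrix(1:20, ncol = 4, byrow = TRUE))
names(df) <- LETTERS[1:4]

# FUN.VALUE of length 1 (one number per column):
vapply(df, sum, FUN.VALUE = numeric(1))
#>  A  B  C  D 
#> 45 50 55 60

# FUN.VALUE of length 2 (two numbers per column):
vapply(df, range, FUN.VALUE = numeric(2))
#>       A  B  C  D
#> [1,]  1  2  3  4
#> [2,] 17 18 19 20
```

If the results of FUN do not match the template (e.g., when combining range with numeric(1)), vapply() stops with an error, rather than silently changing its output format.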

mapply()

The mapply() function provides a multi-variate version of sapply(). Essentially, mapply(FUN, ...) applies FUN to the first elements of each ... argument, the second elements, the third elements, etc. Its ... argument accepts multiple arguments to vectorize over, which will be recycled to a common length.

mapply(sample, x = 1:4, size = 1:4)
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] 2 1
#> 
#> [[3]]
#> [1] 3 2 1
#> 
#> [[4]]
#> [1] 3 4 1 2
mapply(sample, x = 1:4, size = 1:4, replace = TRUE)
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] 1 1
#> 
#> [[3]]
#> [1] 1 1 3
#> 
#> [[4]]
#> [1] 2 2 3 2

tapply()

The tapply() function is probably the most impressive apply() member. Its INDEX argument accepts a list of one or more factors (or a formula) that are applied to X. This allows computing sophisticated aggregate summaries over combinations of data values for vectors or data frames. (The documentation of ?tapply() refers to these combinations as a “ragged array”, which essentially means that they can contain a variable number of elements.)

Here are some basic examples for applying tapply() to data structures:

  • Summaries of vector values: Given a vector of dice throws:

# Data:
throws <- sample(1:6, size = 30, replace = TRUE)
throws
#>  [1] 5 2 4 2 1 6 2 3 6 3 2 4 5 1 3 4 3 3 4 5 6 1 3 2 5 4 4 5 5 3

we can use tapply() to compute summaries over their possible values:

# Summary of vector values:
tapply(X = throws, INDEX = throws, FUN = length)
#> 1 2 3 4 5 6 
#> 3 5 7 6 6 3
table(throws)  # almost identical
#> throws
#> 1 2 3 4 5 6 
#> 3 5 7 6 6 3
# but:
tapply(X = throws, INDEX = throws, FUN = sum)
#>  1  2  3  4  5  6 
#>  3 10 21 24 30 18

  • Contingency tables (from data frames): Given the warpbreaks data (from the datasets package):

#> # A tibble: 54 × 3
#>    breaks wool  tension
#>     <dbl> <fct> <fct>  
#>  1     26 A     L      
#>  2     30 A     L      
#>  3     54 A     L      
#>  4     25 A     L      
#>  5     70 A     L      
#>  6     52 A     L      
#>  7     51 A     L      
#>  8     26 A     L      
#>  9     67 A     L      
#> 10     18 A     M      
#> # ℹ 44 more rows

we can use tapply() to create summaries of variable levels/factor combinations:

# Contingency table from data.frame:
tapply(X = warpbreaks$breaks, INDEX = warpbreaks[ ,   2], FUN = sum)
#>   A   B 
#> 838 682
tapply(X = warpbreaks$breaks, INDEX = warpbreaks[ , 2:3], FUN = sum)
#>     tension
#> wool   L   M   H
#>    A 401 216 221
#>    B 254 259 169

# Averages over groups:
tapply(warpbreaks$breaks, warpbreaks[ ,   2], mean)
#>        A        B 
#> 31.03704 25.25926
tapply(warpbreaks$breaks, warpbreaks[ , 2:3], mean)
#>     tension
#> wool        L        M        H
#>    A 44.55556 24.00000 24.55556
#>    B 28.22222 28.77778 18.77778
  • Averages over groups: Let’s assume some dataset pcg that includes the initials of professors, their courses, and corresponding grades (for individual students or assignments):
#> # A tibble: 100 × 3
#>    prof  course  grade
#>    <chr> <chr>   <int>
#>  1 C.D.  arts        3
#>  2 A.B.  arts        1
#>  3 B.C.  science     2
#>  4 B.C.  arts        3
#>  5 C.D.  arts        3
#>  6 C.D.  science     4
#>  7 B.C.  arts        1
#>  8 A.B.  science     1
#>  9 A.B.  data        4
#> 10 B.C.  data        2
#> # ℹ 90 more rows

we can use tapply() to compute the average grade over (combinations of) variable levels:

# Compute group averages:
tapply(pcg$grade, pcg$prof,    mean)
#>     A.B.     B.C.     C.D. 
#> 2.371429 2.533333 2.285714
tapply(pcg$grade, pcg[ , 2],   mean)
#>     arts     data  science 
#> 2.078947 2.545455 2.620690
tapply(pcg$grade, pcg[ , 1:2], mean)
#>       course
#> prof       arts     data  science
#>   A.B. 2.071429 2.833333 2.222222
#>   B.C. 2.000000 2.625000 2.916667
#>   C.D. 2.142857 2.230769 2.625000
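
One way to think about tapply() is as a two-step process: first split the values into groups, then apply a function to each group. The following sketch illustrates this reading with small hypothetical data (the vectors score and group are not part of the chapter's datasets):

```r
# Hypothetical data:
score <- c(10, 20, 30, 40, 50, 60)
group <- c("a", "b", "a", "b", "a", "c")

# Group-wise means with tapply():
m_tapply <- tapply(X = score, INDEX = group, FUN = mean)
m_tapply
#>  a  b  c 
#> 30 30 60

# Same values in two explicit steps:
parts   <- split(score, group)  # 1. split values by group
m_split <- sapply(parts, mean)  # 2. apply function to each part

all(m_tapply == m_split)
#> [1] TRUE
```

The values and names agree, but the data structures differ slightly: tapply() returns an array (here with one dimension), whereas sapply() returns a named vector.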

rapply()

The rapply() function is a recursive version of lapply() with additional control over how the result is structured. Its classes argument selects the elements to which the function is applied, and its how argument determines the structure of the output. (See ?rapply() for its documentation and examples.)
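
As a minimal sketch of how these two arguments work together, consider a small nested list (hypothetical data, not taken from the documentation):

```r
# A nested list with mixed element types:
nested <- list(a = 1:3, b = "text", c = list(d = 4, e = "more"))

# Apply mean() only to numeric elements and flatten the result:
r_unlist <- rapply(nested, f = mean,
                   classes = c("integer", "numeric"), how = "unlist")
r_unlist
#>   a c.d 
#>   2   4

# Apply toupper() only to character elements, keeping the list structure:
r_replace <- rapply(nested, f = toupper,
                    classes = "character", how = "replace")
r_replace$c$e
#> [1] "MORE"
```

With how = "unlist", elements that do not match classes are dropped (as the default of deflt is NULL); with how = "replace", they remain unchanged.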

Overall, the apply() family of functions provides a powerful toolbox for applying functions to data structures. However, the variety of function names and their (sometimes inconsistent) arguments make it difficult to map the family members to tasks and to master each function. In the likely case that you find the family of apply() functions confusing, you need not despair: The purrr package provides a similar range of functions in a more consistent way.

6.3.2 Mapping functions of purrr

The R package purrr (Henry & Wickham, 2023) contains a family of map() functions that provide updated and more consistent versions of the apply() functions. The main goal of using purrr functions (instead of for loops) is to break common list manipulation challenges into smaller and independent pieces. This strategy scales down the problem, solves its sub-problems, and then tries to solve the overall problem from those pieces:

  1. Decomposing a complex problem into smaller sub-problems that allow us to advance towards a solution.

  2. Solving the sub-problem for a single element of a list.

  3. Constructing an overall solution from the sub-problem solutions.

Once we have solved the sub-problem, purrr takes care of generalizing the solution to every element in the list and constructing an overall solution from those pieces.

The scaling-down and up-again strategy of purrr matches our general strategy for explicit iteration (above), but is implemented in more concise expressions. While this affords a greater level of abstraction, it also makes it easier to solve related problems and to understand the solutions to old problems when re-reading older code.
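
The strategy can be sketched as follows. The list samples and the helper function spread() are hypothetical and serve only to illustrate the three steps:

```r
library(purrr)

# Hypothetical data: a list of numeric vectors of different lengths
samples <- list(s1 = c(2, 4, 6), s2 = c(10, 20), s3 = c(1, 1, 1, 9))

# Steps 1 and 2: Solve the sub-problem for a single element
spread <- function(v) { max(v) - min(v) }
spread(samples[[1]])
#> [1] 4

# Step 3: Let map_dbl() generalize the solution to all elements
spreads <- map_dbl(samples, spread)
spreads
#> s1 s2 s3 
#>  4 10  8
```

Writing and testing spread() on one element first keeps the sub-problem small; the mapping function then handles the bookkeeping (creating the output vector, keeping names) that a for loop would require us to spell out.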

Essential map() functions

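The pattern of looping over a vector, doing something to each element, and saving the results is so common that the purrr package provides a family of functions for it. To control the type of output, there is a separate function for each output type:

  • map() returns a list
  • map_lgl() returns a logical vector
  • map_int() returns an integer vector
  • map_dbl() returns a double vector
  • map_chr() returns a character vector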

Each of these functions takes a vector (or data frame) .x as input, applies a function .f to each element, and returns a new vector that has the same length (and the same names) as the input. The type of the output vector is determined by the suffix _? of the map() function (e.g., map_chr() returns the output as a character vector).

Examples

The most common use of map() applies a function .f to each variable/column of a data frame .x. For instance, we can use map_dbl() on the data frame df (from above) to compute various statistical measures (with numeric doubles as output):

library(purrr)

map_dbl(.x = df, .f = sum)  # sum of all 4 columns/variables
#>  A  B  C  D 
#> 45 50 55 60
map_dbl(df, mean)           # mean
#>  A  B  C  D 
#>  9 10 11 12
map_dbl(df, sd)             # standard deviation 
#>        A        B        C        D 
#> 6.324555 6.324555 6.324555 6.324555
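
For comparison with base R, the following sketch shows that map_dbl() behaves much like vapply() with a numeric template. Note that the definition of df below is an assumption: it is reconstructed to match the sums, means, and standard deviations shown above and may differ from the chapter's original definition.

```r
library(purrr)

# Assumption: a data frame that reproduces the outputs shown above
df <- as.data.frame(matrix(1:20, nrow = 5, byrow = TRUE))
names(df) <- c("A", "B", "C", "D")

m_purrr <- map_dbl(df, mean)                         # purrr
m_base  <- vapply(df, mean, FUN.VALUE = numeric(1))  # base R

all.equal(m_purrr, m_base)
#> [1] TRUE
```

Both functions fail with an error if .f does not return a single number for some element, whereas sapply() would silently return a different data structure.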

map() variants

Several variants of the map() function accommodate a different number of arguments:

  • map() applies a function .f to 1 argument .x
  • map2() applies a function .f to 2 arguments .x and .y
  • pmap() applies a function .f to 3 or more arguments (provided as a list .l)

As before, typical uses of these functions specify the expected data type of the output as a suffix (after an underscore _), as in map2_dbl() or pmap_dbl(). Here is a numeric example that uses three different mapping functions:

# Create data:
tb <- data.frame(n_1 = sample(1:9, 10, replace = TRUE),
                 n_2 = sample(1:3, 10, replace = TRUE))
tb
#>    n_1 n_2
#> 1    9   3
#> 2    7   2
#> 3    1   2
#> 4    9   1
#> 5    1   3
#> 6    9   2
#> 7    5   1
#> 8    2   2
#> 9    6   1
#> 10   8   1

# Functions:
square <- function(x){ x^2 }
power  <- function(x, y){ x^y }

# Mapping multiple arguments:
tb$square <- map_dbl(.x = tb$n_1, .f = square)               # 1 argument
tb$power  <- map2_dbl(tb$n_1, tb$n_2, power)                 # 2 arguments
tb$sum_3  <- pmap_dbl(list(tb$n_1, tb$n_2, tb$square), sum)  # 3 arguments

# Result:
tb
#>    n_1 n_2 square power sum_3
#> 1    9   3     81   729    93
#> 2    7   2     49    49    58
#> 3    1   2      1     1     4
#> 4    9   1     81     9    91
#> 5    1   3      1     1     5
#> 6    9   2     81    81    92
#> 7    5   1     25     5    31
#> 8    2   2      4     4     8
#> 9    6   1     36     6    43
#> 10   8   1     64     8    73

Although we cannot cover them in detail here, it should be clear that the ability to apply map() functions (with a flexible number of arguments) to data structures provides a very convenient and powerful programming tool.
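
When the list passed to pmap() has names, its elements are matched to the arguments of .f by name rather than by position. A minimal sketch, using a hypothetical helper function describe() and made-up data:

```r
library(purrr)

# Hypothetical helper with 3 named arguments:
describe <- function(name, n, unit) { paste0(name, ": ", n, " ", unit) }

# The list names match the argument names of describe():
arg_list <- list(unit = c("pieces", "liters"),
                 name = c("apples", "water"),
                 n    = c(3, 2))

labels <- pmap_chr(arg_list, describe)
labels
#> [1] "apples: 3 pieces" "water: 2 liters"
```

Matching by name makes the call robust against the order of list elements (which is deliberately shuffled here).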

See the purrr documentation (e.g., at https://purrr.tidyverse.org/) or Section 12.3 of the ds4psy book (Neth, 2023a) for additional details on purrr’s mapping functions.

Practice

  1. Use apply() for computing the row and column means and medians of df.

  2. Explain the difference between lapply() and sapply() in your own words.

  3. Predict the outputs of the following variants of map(), then check your predictions:

map(df, mean)
map_dbl(df, mean)
map_int(df, mean)
map_lgl(df, mean)
map_chr(df, mean)

6.4 Conclusion

Iteration involves breaking down a problem into similar sub-problems and then repeatedly solving all sub-problems to solve the overall problem. To enable a solution by iteration, the overall problem must consist of many similar sub-problems.

6.4.1 Summary

As R is a functional programming language, iteration can be explicit or implicit:

  • Explicit iteration uses for loops, while loops, or repeat loops. Whereas for loops usually assume that we know the number of required iterations, loops using while or repeat require that we can specify the conditions for entering or ending loops.

  • Implicit iteration uses vectorized functions or the families of base R apply() or purrr map() functions to avoid loops by directly applying functions to data structures. This is more concise and elegant, but also more abstract than using explicit iteration.
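
As a compact reminder, here is the same simple task (computing square roots of some hypothetical values x) solved by explicit and by implicit iteration:

```r
x <- c(1, 4, 9, 16)

# Explicit iteration: for loop with pre-allocated output
out_loop <- numeric(length(x))
for (i in seq_along(x)) {
  out_loop[i] <- sqrt(x[i])
}

# Implicit iteration (1): vectorized function
out_vec <- sqrt(x)

# Implicit iteration (2): applying a function to each element
out_apply <- sapply(x, sqrt)

out_loop
#> [1] 1 2 3 4
identical(out_loop, out_vec) && identical(out_vec, out_apply)
#> [1] TRUE
```

For a task this simple, the vectorized version is clearly preferable; loops and mapping functions earn their place when the operation on each element is more complex.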

Another summary of iteration contents is provided by Section 12.4 of the ds4psy book (Neth, 2023a).

6.4.2 Resources

Manuals or book chapters

Cheatsheets

For basic information on explicit iteration with loops in R, see the contributed cheatsheets on Base R:

Figure 6.1: Base R summary from Posit cheatsheets.

For an overview of the map() functions of the purrr package, see the corresponding summary from Posit cheatsheets:

Figure 6.2: Overview of the map() functions of purrr from Posit cheatsheets.

For additional resources on iteration, see Section 12.7 of the ds4psy book (Neth, 2023a).

6.4.3 Preview

This chapter concludes Part 2 of this book and our introduction to programming basics in R. The upcoming Part 3 addresses various aspects of visualizing data. The introductory Chapter 7 reflects on the purposes, elements, and evaluation of visualizations. Chapters 8 and 9 will then enable us to create visualizations (using base R functions or the ggplot2 package). Chapter 10 deals with the representation and manipulation of colors and color palettes.

6.5 Exercises

Here are four exercises from Section 12.5 (Exercises) of the ds4psy book (Neth, 2023a):

6.5.1 Fibonacci loops

6.5.2 Looping for divisors

6.5.3 Dice loops

6.5.4 Cumulative savings revisited

6.5.5 Explicit vs. implicit iteration

Explain the difference between explicit and implicit iteration in your own words. Then answer the following questions (e.g., by thinking of examples):

  • Which types of loops can immediately be replaced by implicit iteration?

  • Which aspects or types of loops cannot be replaced by implicit iteration?

6.5.6 Applying a divisor checking function

  • Re-solve Part 1 of Exercise 2 by creating an is_divisor(x, N) verification function (that evaluates to TRUE iff x is a divisor of N) and using a function from the apply() family of functions to apply is_divisor() to a data structure that contains the natural numbers from 1 to 1000.

Beware: Is there a simpler solution to achieve the same result in R?

6.5.7 Bonus exercise: Explaining mapping functions

As this exercise involves advanced mapping functions, it is optional, but still worth trying.

  1. Predict, evaluate, and explain the results of the following expressions (which are examples copied from the documentation of the corresponding functions):
# sapply(): ----
i39 <- sapply(3:9, seq) # data: list of vectors
sapply(i39, fivenum)

# mapply(): ---- 
mapply(rep, 1:4, 4:1)
mapply(rep, times = 1:4, MoreArgs = list(x = 42))

# tapply(): ----
tibble::as_tibble(datasets::warpbreaks)  # data frame
tapply(warpbreaks$breaks, warpbreaks[,-1], sum)
  2. Try to re-write the expressions from 1. with purrr mapping functions.

This concludes our exercises on iteration by creating loops and applying functions.