## 12.2 Essentials of iteration

Because this book covers only the essentials, this section concentrates on for and while loops. Later sections briefly introduce the notion of functional programming (see Section 12.2.3) and show how loops can be replaced by the base R apply (R Core Team, 2020) and purrr’s map (Henry & Wickham, 2020) families of functions (see Section 12.2.4).

### 12.2.1 Loops

Loops are structures of code that cause some code to be executed repeatedly. In R, we can distinguish between 2 basic versions:

1. for loops are indicated when we know the number of required iterations in advance;

2. while loops are more general and indicated when we only know a condition that should end a loop.

We will discuss both types of loops in the following sections.
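As a first orientation, the following minimal sketch contrasts the two loop types. The variable names and the stopping value of 100 are arbitrary choices for illustration, not taken from the examples in this chapter:

```r
# for loop: the number of iterations (here: 3) is known in advance
for (i in 1:3) {
  print(i)
}

# while loop: we only know the condition that should end the loop
n <- 1
while (n < 100) {  # repeat as long as n is below 100
  n <- n * 2       # double n on every iteration
}
n                  # first power of 2 that is not below 100
#> [1] 128
```

The while loop needs no counter: it simply stops as soon as its condition is no longer TRUE.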


#### Using a for loop

When we want to execute some code repeatedly and know the number of required iterations in advance, a for loop is indicated. To create a for loop, we are typically asking and answering 3 questions:

1. Body: What is the task performed (and corresponding code executed) in the loop?

2. Sequence: Over which sequence should be iterated? How many iterations are there?

3. Output: What is the result of the loop: What type of object and how many instances?

After answering these questions, a for loop can be designed in reverse order (from output to body):

# (a) prepare output:
output <- vector(<type>, length(start:end))

# (b) for loop:
for (i in start:end) {                    # sequence

  output[[i]] <- result_i                 # body: compute result_i and collect it in output

}

# (c) use output:
output  

#### Loops for iteration

The simplest use of a for loop is to repeat a computation a known number of times.

Example: Compute the squares of the integers from 1 to 10.

1. Body: Square of some number i.

2. Sequence: i from 1 to 10.

3. Output: A vector of 10 numbers.

Implementation:

# (a) prepare output:
output <- vector("double", length(1:10))

# (b) for loop:
for (i in 1:10) {

  sq <- i^2
  print(paste0("i = ", i, ": sq = ", sq))  # for debugging
  output[[i]] <- sq

}
#> [1] "i = 1: sq = 1"
#> [1] "i = 2: sq = 4"
#> [1] "i = 3: sq = 9"
#> [1] "i = 4: sq = 16"
#> [1] "i = 5: sq = 25"
#> [1] "i = 6: sq = 36"
#> [1] "i = 7: sq = 49"
#> [1] "i = 8: sq = 64"
#> [1] "i = 9: sq = 81"
#> [1] "i = 10: sq = 100"

# (c) use output:
output
#>  [1]   1   4   9  16  25  36  49  64  81 100

Note: In the for loop, we use output[[i]] rather than output[i] to refer to the i-th element of output. For an atomic vector like output, a single [] would have worked as well, but the double [[]] makes it clear that we want to remove a level of the hierarchy and assign something to a single element of output. (See Ch. 20.5.2 Subsetting recursive vectors (lists) for more details on this distinction.)
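The difference between the two operators only becomes visible for lists. The following sketch uses two small objects that are made up for illustration (v and l are not part of the chapter's data):

```r
v <- c(1, 4, 9)                 # an atomic vector
identical(v[2], v[[2]])         # TRUE: both return the number 4

l <- list(a = 1:3, b = "text")  # a list (recursive vector)
class(l[1])                     # "list": [ ] keeps the list wrapper
class(l[[1]])                   # "integer": [[ ]] returns the element itself
```

For a list, l[1] is still a list of length 1, whereas l[[1]] is the integer vector stored inside it.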

#### Loops over data

In the context of data science, we often want to iterate over the rows or columns of data tables. Let’s load some toy data to work with:

# Load data:
tb <- ds4psy::tb  # from ds4psy package

# inspect tb:
dim(tb)   # 100 cases x 5 variables
#> [1] 100   5
head(tb)  # first 6 cases
#> # A tibble: 6 x 5
#>      id   age height shoesize    IQ
#>   <dbl> <dbl>  <dbl>    <dbl> <dbl>
#> 1     1    21    173       38    89
#> 2     2    26    193       43    93
#> 3     3    24    171       41    92
#> 4     4    32    191       43    97
#> 5     5    26    156       36   110
#> 6     6    28    172       34   117

Suppose we wanted to obtain the means of the variables from age to IQ. We could call the mean function for each desired variable, repeating the call once per variable:

# (a) Means: ----
mean(tb$age)
#> [1] 26.29
mean(tb$height)
#> [1] 177.78
mean(tb$shoesize)
#> [1] 39.05
mean(tb$IQ)
#> [1] 104.85

However, the statement “for each variable” in the previous sentence shows that we are dealing with an instance of iteration here. When dealing with computers, repetition of identical steps or commands is a signal that there are more efficient ways to accomplish the same task.

How could we use a for loop here? To design this loop, we need to answer our 3 questions from above:

1. Body: We want to compute the mean of 4 columns (age to IQ) in tb.

2. Sequence: We want to iterate over columns 2 to 5 (i.e., 4 iterations).

3. Output: The result of the loop is a vector of type “double”, containing 4 elements.

#### Notes

• We remove the 1st column, as no computation is needed for it.

• The i-th column of tibble tb (or data frame df) can be accessed via tb[[i]] (or df[[i]]).

# Prepare data:
library(dplyr)  # provides %>% and select()
tb_2 <- tb %>% select(-1)
tb_2
#> # A tibble: 100 x 4
#>      age height shoesize    IQ
#>    <dbl>  <dbl>    <dbl> <dbl>
#>  1    21    173       38    89
#>  2    26    193       43    93
#>  3    24    171       41    92
#>  4    32    191       43    97
#>  5    26    156       36   110
#>  6    28    172       34   117
#>  7    20    166       35   107
#>  8    31    172       34   110
#>  9    18    192       32    88
#> 10    22    176       39   111
#> # … with 90 more rows

# (a) prepare output:
output <- vector("double", 4)

# (b) for loop:
for (i in 1:4) {

  mn <- mean(tb_2[[i]])

  output[[i]] <- mn

}

# (c) use output:
output
#> [1]  26.29 177.78  39.05 104.85

The range of a for loop can be defined with a special function:

seq_along(tb_2)  # loop through COLUMNS of a df/table
#> [1] 1 2 3 4

The base R function seq_along() returns an integer vector with the positions of the elements of its argument. When its argument is a table (e.g., a data frame or a tibble), it returns the integer indices of all columns, so that the last for loop could be re-written as follows:

# (a) prepare output:
output_2 <- vector("double", 4)

# (b) for loop:
for (i in seq_along(tb_2)) {

  mn <- mean(tb_2[[i]])

  output_2[[i]] <- mn

}

# (c) use output:
output_2
#> [1]  26.29 177.78  39.05 104.85

Another way of iterating over all columns of a table tb_2 could loop from 1 to ncol(tb_2):

# (a) prepare output:
output_3 <- vector("double", 4)

# (b) for loop:
for (i in 1:ncol(tb_2)) {

  mn <- mean(tb_2[[i]])

  output_3[[i]] <- mn

}

# (c) use output:
output_3
#> [1]  26.29 177.78  39.05 104.85
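Both ways of defining the sequence give the same result here, but they differ in an edge case: a table without any columns. The sketch below illustrates this with two hypothetical toy tables (toy and empty are made up and not part of the ds4psy package). It also sizes the output vector from the data rather than hard-coding its length:

```r
# Hypothetical toy tables (not from the ds4psy package):
toy   <- data.frame(age = c(20, 24, 28), height = c(170, 180, 190))
empty <- toy[, 0]   # same rows, but 0 columns

seq_along(toy)      # column indices: 1 2
1:ncol(toy)         # same result:    1 2

seq_along(empty)    # integer(0): a loop over it runs 0 times
1:ncol(empty)       # 1 0: a loop would run twice, on columns that do not exist

# (a) prepare output (length taken from the data):
output_sa <- vector("double", length(toy))

# (b) for loop:
for (i in seq_along(toy)) {
  output_sa[[i]] <- mean(toy[[i]])
}

# (c) use output:
output_sa
```

For this reason, seq_along() is usually the more defensive choice when the number of columns is not known to be positive.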

#### Practice

1. Rewrite the for loop to compute the means of columns 2 to 5 of tb (i.e., without simplifying tb to tb_2 first).

#### Solution

We create a new output vector output_4 and need to change 2 things:

• the column numbers of the for loop statement (from 1:4 to 2:5).

• the index to which we assign our current mean mn should be decreased to i - 1 (to assign the mean of column 2 to the 1st element of output_4).

# Using data:
# tb

# (a) prepare output:
output_4 <- vector("double", 4)

# (b) for loop:
for (i in 2:5) {

  mn <- mean(tb[[i]])

  output_4[[i - 1]] <- mn

}

# (c) use output:
output_4
#> [1]  26.29 177.78  39.05 104.85

# Verify equality:
all.equal(output, output_4)
#> [1] TRUE
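
An alternative to shifting the index by hand is to store the column numbers in a vector and loop over the positions of that vector. The following is only a sketch of this idea: toy, cols, and output_5 are hypothetical names, and the toy data merely stands in for tb:

```r
# Hypothetical toy data standing in for tb (an id column plus 4 numeric columns):
toy <- data.frame(id = 1:3,
                  age = c(20, 24, 28), height = c(170, 180, 190),
                  shoesize = c(38, 40, 42), IQ = c(90, 100, 110))

cols <- 2:5                                  # columns to summarize

# (a) prepare output:
output_5 <- vector("double", length(cols))
names(output_5) <- names(toy)[cols]          # label results by column name

# (b) for loop:
for (k in seq_along(cols)) {                 # k counts 1, 2, 3, 4
  output_5[[k]] <- mean(toy[[cols[k]]])      # k-th result comes from column cols[k]
}

# (c) use output:
output_5
```

Here the loop index k always matches the position in the output vector, so no i - 1 correction is needed, and the names make clear which mean belongs to which variable.
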

2. We have learned that creating a for loop requires knowing (a) the data type of the loop results and (b) a data structure that can collect these results. This is simple and straightforward if each iteration of a for loop results in a single number, as then all results can be stored in a vector.

However, things get more complicated when for loops yield tables, lists, or plots as outputs. Try creating similar for loops that return the summary and a histogram (using the base R function hist()) of each variable in tb (or each variable of tb_2).

#### Solution

Creating a summary:

# (b) Summary: ----
summary(tb$age)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>   15.00   22.00   24.50   26.29   30.25   46.00
summary(tb$height)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>   150.0   171.8   177.0   177.8   185.0   206.0
summary(tb$shoesize)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>   29.00   36.00   39.00   39.05   42.00   47.00
s <- summary(tb$IQ)
s
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>    85.0    97.0   104.0   104.8   113.0   145.0

# What type of object is s?
typeof(s)     # "double"
#> [1] "double"
is.vector(s)  # FALSE
#> [1] FALSE
is.table(s)   # TRUE
#> [1] TRUE

The following for loop is almost identical to the one above (computing the means of columns 2:5 of tb). However, we initialize summaries as a vector of mode "list", which allows each of its elements to store a more complex object:

# Loop:
# (a) prepare output:
summaries <- vector(mode = "list", length = 4)  # initialize a list with 4 (empty) elements

# (b) for loop:
for (i in 2:5) {

  sm <- summary(tb[[i]])

  summaries[[i - 1]] <- sm

}

# (c) use output:
summaries  # print summaries:
#> [[1]]
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>   15.00   22.00   24.50   26.29   30.25   46.00
#>
#> [[2]]
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>   150.0   171.8   177.0   177.8   185.0   206.0
#>
#> [[3]]
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>   29.00   36.00   39.00   39.05   42.00   47.00
#>
#> [[4]]
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>    85.0    97.0   104.0   104.8   113.0   145.0
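
Once the results are collected in a list, individual elements can be retrieved with [[ ]] or, if the list is named, with $. The following sketch assumes a small hypothetical data frame (toy) instead of tb and names the list elements after the columns:

```r
# Hypothetical toy data (not from the ds4psy package):
toy <- data.frame(age = c(20, 24, 28), height = c(170, 180, 190))

# (a) prepare output: a named list with one element per column
toy_summaries <- vector(mode = "list", length = ncol(toy))
names(toy_summaries) <- names(toy)

# (b) for loop:
for (i in seq_along(toy)) {
  toy_summaries[[i]] <- summary(toy[[i]])
}

# (c) use output:
toy_summaries[[2]]             # summary of the 2nd column (height)
toy_summaries$age[["Mean"]]    # a single value from the summary of age
```

Naming the list elements makes later access independent of the column order.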

The following code uses the R command hist() (from the graphics package included in R) to create a histogram for a specific variable (column) of tb:

# Examples of histograms: ----
# hist(tb$age)
# hist(tb$height)
hist(tb_2$shoesize)
h <- hist(tb$IQ)  # save graphic as an object

h                 # print the saved histogram object
#> $breaks
#> [1]  80  90 100 110 120 130 140 150
#>
#> $counts
#> [1] 12 29 29 20  8  1  1
#>
#> $density
#> [1] 0.012 0.029 0.029 0.020 0.008 0.001 0.001
#>
#> $mids
#> [1]  85  95 105 115 125 135 145
#>
#> $xname
#> [1] "tb$IQ"
#>

#> $x
#> [1] 1 2 3
#>
#> $y
#> [1] 4 5

z %>% map_int(length)
#> x y
#> 3 2

#### Shortcuts

There are a few shortcuts to save typing in the .f argument of map. Imagine we want to fit a linear model to each group in a dataset. The following example splits the mtcars dataset into 3 pieces (by the number of cylinders, cyl) and fits the same linear model to each piece. The linear model is supplied as an anonymous function:

mtcars %>% group_by(cyl) %>% count()
#> # A tibble: 3 x 2
#> # Groups:   cyl [3]
#>     cyl     n
#>   <dbl> <int>
#> 1     4    11
#> 2     6     7
#> 3     8    14

models <- mtcars %>%
  split(.$cyl) %>%                            # split data into 3 sets (by cyl)
  map(function(df) lm(mpg ~ wt, data = df))   # map lm function to each set

models # 3 linear models
#> $`4`
#>
#> Call:
#> lm(formula = mpg ~ wt, data = df)
#>
#> Coefficients:
#> (Intercept)           wt
#>      39.571       -5.647
#>
#>
#> $`6`
#>
#> Call:
#> lm(formula = mpg ~ wt, data = df)
#>
#> Coefficients:
#> (Intercept)           wt
#>       28.41        -2.78
#>
#>
#> $`8`
#>
#> Call:
#> lm(formula = mpg ~ wt, data = df)
#>
#> Coefficients:
#> (Intercept)           wt
#>      23.868       -2.192

As the syntax for creating an anonymous function in R is quite verbose, purrr provides a one-sided formula as a shortcut:

models <- mtcars %>%
  split(.$cyl) %>%
  map(~lm(mpg ~ wt, data = .))

models
#> $`4`
#>
#> Call:
#> lm(formula = mpg ~ wt, data = .)
#>
#> Coefficients:
#> (Intercept)           wt
#>      39.571       -5.647
#>
#>
#> $`6`
#>
#> Call:
#> lm(formula = mpg ~ wt, data = .)
#>
#> Coefficients:
#> (Intercept)           wt
#>       28.41        -2.78
#>
#>
#> $`8`
#>
#> Call:
#> lm(formula = mpg ~ wt, data = .)
#>
#> Coefficients:
#> (Intercept)           wt
#>      23.868       -2.192

Here, the . is used as a pronoun that refers to the current list element.

When inspecting many models, we may want to extract a summary statistic like $R^{2}$. To do that, we first need to run summary() on our models and then extract the component called r.squared. We could do this using the shorthand for anonymous functions:

models %>%
  map(summary) %>%
  map_dbl(~.$r.squared)

But extracting named components from a list of elements is a common operation, so purrr provides an even shorter shortcut: we can pass the name of the component as a string:

models %>%
  map(summary) %>%
  map_dbl("r.squared")

Alternatively, we can also use an integer to select elements by their position:

x <- list(list(1, 2, 3),
          list(4, 5, 6),
          list(7, 8, 9))

x %>% map_dbl(2)  # 2nd element of each sub-list of x
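To see what the position shortcut does, it can help to spell out the same extraction in base R. The following sketch assumes that the shortcut simply picks the element at the given position from each sub-list. It uses vapply() with an explicit anonymous function, and the name second is a hypothetical choice:

```r
x <- list(list(1, 2, 3),
          list(4, 5, 6),
          list(7, 8, 9))

# Extract the 2nd element of each sub-list and return a double vector:
second <- vapply(x, function(el) el[[2]], numeric(1))
second
#> [1] 2 5 8
```

The purrr shortcut saves us from writing the anonymous function and the output template numeric(1).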

#### 12.2.4.3 Comparing map vs. apply

The map functions of purrr are modeled on the apply functions:

• lapply is basically identical to map, except that map is consistent with all the other functions in purrr, and we can use the shortcuts for .f.

• sapply is a wrapper around lapply that automatically simplifies the output (which can yield unexpected results).

• vapply is a safe alternative to sapply because you supply an additional argument that defines the type. The only problem with vapply is that it’s a lot of typing: vapply(df, is.numeric, logical(1)) is equivalent to map_lgl(df, is.numeric).

• One advantage of vapply over purrr’s map functions is that it can also produce matrices — the map functions only ever produce vectors.
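
The differences between the apply variants can be illustrated with a small example in base R. The data frame df below is a hypothetical toy object, chosen only to contain columns of different types:

```r
# Hypothetical toy data with numeric and character columns:
df <- data.frame(a = c(1, 2, 3), b = c("x", "y", "z"), c = c(1.5, 2.5, 3.5))

res_l <- lapply(df, is.numeric)              # always returns a list
res_s <- sapply(df, is.numeric)              # simplified to a named logical vector
res_v <- vapply(df, is.numeric, logical(1))  # same result, but the type is checked

# With empty input, sapply() cannot know the type and falls back to a list:
sapply(list(), is.numeric)                   # list()
vapply(list(), is.numeric, logical(1))       # logical(0)

# vapply() can also return a matrix (here: min and max of each numeric column):
rng <- vapply(df[c("a", "c")], range, numeric(2))
rng
```

The empty-input case shows why sapply() can yield unexpected results inside functions, whereas vapply() always returns the declared type.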

#### 12.2.4.4 Advanced aspects of purrr

See the following sections of r4ds (Wickham & Grolemund, 2017) for more advanced issues in combination with the map functions of purrr:

• 21.6: Dealing with failure for using the adverbs safely, possibly, and quietly in combination with map functions.

• 21.7: Mapping over multiple arguments for the map2 variants, as well as invoke_map.

• 21.8: Walk to use walk, walk2 and pwalk for calling functions for their side effects.
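
To give a first impression of these functions, here is a minimal sketch. It assumes that the purrr package is installed, and the inputs and the names safe_log, logs, and sums are made up for illustration:

```r
library(purrr)

# possibly() wraps a function so that an error yields a default value instead:
safe_log <- possibly(log, otherwise = NA_real_)
logs <- map_dbl(list(1, "a", 100), safe_log)  # log("a") fails, so NA is returned
logs

# map2_dbl() iterates over two vectors in parallel:
sums <- map2_dbl(c(1, 2, 3), c(10, 20, 30), ~ .x + .y)
sums

# walk() calls a function only for its side effect (here: printing):
walk(c("first", "second"), print)
```

The r4ds sections listed above explain these functions and their variants in detail.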

Rather than digging deeper into the functional programming paradigm of purrr, the exercises of this chapter (in Section 12.4) will focus on good old-fashioned for loops and only cover essential aspects of apply and map.

### References

Henry, L., & Wickham, H. (2020). purrr: Functional programming tools. Retrieved from https://CRAN.R-project.org/package=purrr

R Core Team. (2020). R: A language and environment for statistical computing. Retrieved from https://www.R-project.org

Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. Retrieved from http://r4ds.had.co.nz