## 12.2 Essentials of iteration

As this book only covers the essentials, we will mostly focus on `for` and `while` loops here.
In later sections, we briefly introduce the notion of functional programming (see Section 12.2.3) and learn how loops can be replaced by the **base** R `apply` (R Core Team, 2020) and **purrr**’s `map` (Henry & Wickham, 2020) families of functions (see Section 12.2.4).

### 12.2.1 Loops

Loops are structures of code that cause some code to be executed repeatedly. In R, we can distinguish between 2 basic versions:

- `for` loops are indicated when we know the number of required iterations in advance;
- `while` loops are more general and indicated when we only know a condition that should end a loop.

We will discuss both types of loops in the following sections.

#### Using a `for` loop

When we want to execute some code repeatedly and know the number of required iterations in advance, a `for` loop is indicated.
To create a `for` loop, we typically ask and answer 3 questions:

- *Body*: What is the task performed (and corresponding code executed) in the loop?
- *Sequence*: Over which sequence should we iterate? *How many* iterations are there?
- *Output*: What is the *result* of the loop: What *type* of object and *how many* instances?

After answering these questions, a `for` loop can be designed in reverse order (from output to body).

#### Loops for iteration

We start with a case in which we know how often we want to do something.

**Example:** Compute the squares of the integers from 1 to 10.

- *Body*: Square of some number `i`.
- *Sequence*: `i` from 1 to 10.
- *Output*: A vector of 10 numbers.

**Implementation:**

```
# (a) prepare output:
output <- vector("double", length(1:10))
# (b) for loop:
for (i in 1:10) {
sq <- i^2
print(paste0("i = ", i, ": sq = ", sq)) # for debugging
output[[i]] <- sq
}
#> [1] "i = 1: sq = 1"
#> [1] "i = 2: sq = 4"
#> [1] "i = 3: sq = 9"
#> [1] "i = 4: sq = 16"
#> [1] "i = 5: sq = 25"
#> [1] "i = 6: sq = 36"
#> [1] "i = 7: sq = 49"
#> [1] "i = 8: sq = 64"
#> [1] "i = 9: sq = 81"
#> [1] "i = 10: sq = 100"
# (c) use output:
output
#> [1] 1 4 9 16 25 36 49 64 81 100
```

**Note:** In the `for` loop, we use `output[[i]]`, rather than `output[i]`, to refer to the `i`-th element of `output`.
Actually, using a single `[]` would have worked as well, but the double `[[]]` makes it clear that we want to remove a level of the hierarchy and
assign something to a single element of `output`.
(See Ch. 20.5.2 Subsetting recursive vectors (lists) for more details on this distinction.)

#### Loops over data

In the context of data science, we often want to iterate over the rows or columns of data tables. Let’s load some toy data to work with:

```
## Load data:
tb <- ds4psy::tb # from ds4psy package
# tb <- readr::read_csv2("../data/tb.csv") # from local file
# tb <- readr::read_csv2("http://rpository.com/ds4psy/data/tb.csv") # online
# inspect tb:
dim(tb) # 100 cases x 5 variables
#> [1] 100 5
head(tb)
#> # A tibble: 6 x 5
#> id age height shoesize IQ
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 21 173 38 89
#> 2 2 26 193 43 93
#> 3 3 24 171 41 92
#> 4 4 32 191 43 97
#> 5 5 26 156 36 110
#> 6 6 28 172 34 117
```

Suppose we wanted to obtain the means of the variables from `age` to `IQ`.
We could call the `mean` function for each desired variable. Repeating this call for each variable would look as follows:

```
# (a) Means: ----
mean(tb$age)
#> [1] 26.29
mean(tb$height)
#> [1] 177.78
mean(tb$shoesize)
#> [1] 39.05
mean(tb$IQ)
#> [1] 104.85
```

However, the statement “for each variable” in the previous sentence shows that we are dealing with an instance of *iteration* here.
When dealing with computers, *repetition* of identical steps or commands is a signal that there are more efficient ways to accomplish the same task.

How could we use a `for` loop here? To design this loop, we need to answer our 3 questions from above:

- *Body*: We want to compute the mean of 4 columns (`age` to `IQ`) in `tb`.
- *Sequence*: We want to iterate over columns 2 to 5 (i.e., 4 iterations).
- *Output*: The *result* of the loop is a vector of type “double”, containing 4 elements.

#### Notes

- We remove the 1st column, as no computation is needed for it.

- The i-th column of tibble `tb` (or data frame `df`) can be accessed via `tb[[i]]` (or `df[[i]]`).

```
# Prepare data:
tb_2 <- tb %>% select(-1)
tb_2
#> # A tibble: 100 x 4
#> age height shoesize IQ
#> <dbl> <dbl> <dbl> <dbl>
#> 1 21 173 38 89
#> 2 26 193 43 93
#> 3 24 171 41 92
#> 4 32 191 43 97
#> 5 26 156 36 110
#> 6 28 172 34 117
#> 7 20 166 35 107
#> 8 31 172 34 110
#> 9 18 192 32 88
#> 10 22 176 39 111
#> # … with 90 more rows
# (a) prepare output:
output <- vector("double", 4)
# (b) for loop:
for (i in 1:4){
mn <- mean(tb_2[[i]])
output[[i]] <- mn
}
# (c) use output:
output
#> [1] 26.29 177.78 39.05 104.85
```

The range of a `for` loop can be defined with a special function:

The **base** R function `seq_along()` returns an integer vector of the indices of its argument.
When its argument is a table (e.g., a data frame or a tibble), it returns the indices of all columns (here: `1:4`), so that the last `for` loop could be re-written as follows:

```
# (a) prepare output:
output_2 <- vector("double", 4)
# (b) for loop:
for (i in seq_along(tb_2)){
mn <- mean(tb_2[[i]])
output_2[[i]] <- mn
}
# (c) use output:
output_2
#> [1] 26.29 177.78 39.05 104.85
```

Another way of iterating over all columns of a table `tb_2` is to loop from 1 to `ncol(tb_2)`.
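The following is a minimal sketch of this variant. To keep it self-contained, it uses a small made-up data frame `toy` (an assumption for illustration, standing in for `tb_2`) and a new output vector `output_3` (a hypothetical name):

```r
# Hypothetical stand-in for tb_2 (values made up for illustration):
toy <- data.frame(age      = c(21, 26, 24),
                  height   = c(173, 193, 171),
                  shoesize = c(38, 43, 41))

# (a) prepare output:
output_3 <- vector("double", ncol(toy))

# (b) for loop from 1 to the number of columns:
for (i in 1:ncol(toy)) {
  output_3[[i]] <- mean(toy[[i]])
}

# (c) use output:
output_3
```

Note that `1:ncol(x)` counts down (1, 0) when a table has zero columns, whereas `seq_along(x)` returns an empty sequence in this case, which makes `seq_along()` the safer choice.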

#### Practice

- Rewrite the `for` loop to compute the means of columns 2 to 5 of `tb` (i.e., without simplifying `tb` to `tb_2` first).

#### Solution

We create a new output vector `output_4` and need to change 2 things:

- the column numbers of the `for` loop statement (from `1:4` to `2:5`).
- the index to which we assign our current mean `mn` should be decreased to `i - 1` (to assign the mean of column 2 to the 1st element of `output_4`).

```
# Using data:
# tb
# (a) prepare output:
output_4 <- vector("double", 4)
# (b) for loop:
for (i in 2:5){
mn <- mean(tb[[i]])
output_4[[i - 1]] <- mn
}
# (c) use output:
output_4
#> [1] 26.29 177.78 39.05 104.85
# Verify equality:
all.equal(output, output_4)
#> [1] TRUE
```

- We have learned that creating a `for` loop requires knowing (a) the data type of the loop results and (b) a data structure that can collect these results. This is simple and straightforward if each iteration of the `for` loop results in a single number, as then all results can be stored in a vector.

However, things get more complicated when `for` loops yield tables, lists, or plots as outputs.
Try creating similar `for` loops that return the `summary` and a histogram (using the **base** R function `hist`) of each variable in `tb` (or each variable of `tb_2`).

#### Solution

Creating a `summary`:

```
# (b) Summary: ----
summary(tb$age)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 15.00 22.00 24.50 26.29 30.25 46.00
summary(tb$height)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 150.0 171.8 177.0 177.8 185.0 206.0
summary(tb$shoesize)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 29.00 36.00 39.00 39.05 42.00 47.00
s <- summary(tb$IQ)
s
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 85.0 97.0 104.0 104.8 113.0 145.0
# What type of object is s?
typeof(s) # "double"
#> [1] "double"
is.vector(s) # FALSE
#> [1] FALSE
is.table(s) # TRUE
#> [1] TRUE
```

The following `for` loop is almost identical to the one above (computing the `mean` of columns 2:5 of `tb`).
However, we initialize the `summaries` vector with `mode = "list"`, which allows storing more complex objects in a vector:

```
# Loop:
# (a) prepare output:
summaries <- vector(mode = "list", length = 4) # initialize to a vector of lists!
# (b) for loop:
for (i in 2:5){
sm <- summary(tb[[i]])
summaries[[i - 1]] <- sm
}
# (c) use output:
summaries # print summaries:
#> [[1]]
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 15.00 22.00 24.50 26.29 30.25 46.00
#>
#> [[2]]
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 150.0 171.8 177.0 177.8 185.0 206.0
#>
#> [[3]]
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 29.00 36.00 39.00 39.05 42.00 47.00
#>
#> [[4]]
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 85.0 97.0 104.0 104.8 113.0 145.0
```

The following code uses the R command `hist()` (from the **graphics** package included in R) to create a histogram for a specific variable (column) of `tb`:

```
h <- hist(tb$IQ) # create histogram and store its components
h # print the stored histogram object
#> $breaks
#> [1] 80 90 100 110 120 130 140 150
#>
#> $counts
#> [1] 12 29 29 20 8 1 1
#>
#> $density
#> [1] 0.012 0.029 0.029 0.020 0.008 0.001 0.001
#>
#> $mids
#> [1] 85 95 105 115 125 135 145
#>
#> $xname
#> [1] "tb$IQ"
#>
#> $equidist
#> [1] TRUE
#>
#> attr(,"class")
#> [1] "histogram"
```

The following code uses a loop over the `tb_2` data (which was created as `tb` without the 1st column) to create a histogram for each variable (column) of `tb_2`. Each histogram is stored in a list `out`, so that individual plots can be plotted later (using the `plot()` command on an element of `out`):

```
# Data:
tb_2
#> # A tibble: 100 x 4
#> age height shoesize IQ
#> <dbl> <dbl> <dbl> <dbl>
#> 1 21 173 38 89
#> 2 26 193 43 93
#> 3 24 171 41 92
#> 4 32 191 43 97
#> 5 26 156 36 110
#> 6 28 172 34 117
#> 7 20 166 35 107
#> 8 31 172 34 110
#> 9 18 192 32 88
#> 10 22 176 39 111
#> # … with 90 more rows
# names(tb_2[1])
hist(tb_2[[1]], main = paste0("Histogram of ", names(tb_2[1]), " values:"), xlab = names(tb_2[1]))
# For loop:
out <- vector("list", 4)
for (i in seq_along(tb_2)) { # loop through COLUMNS of tb_2:
print(i)
var_name <- names(tb_2[i])
title <- paste0("Histogram of ", var_name, " values:")
x_lab <- var_name
out[[i]] <- hist(tb_2[[i]], main = title, xlab = x_lab)
} # end for.
#> [1] 1
#> [1] 2
#> [1] 3
#> [1] 4
```

### 12.2.2 For loop variations

We can distinguish between 4 variations of the basic theme of the `for` loop:

#### 1. Modifying an existing object, instead of creating a new object

**Example:** We want to rescale every column of a table (tibble or data frame).

```
set.seed(1) # for reproducible results
tb <- tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
tb
#> # A tibble: 10 x 4
#> a b c d
#> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.626 1.51 0.919 1.36
#> 2 0.184 0.390 0.782 -0.103
#> 3 -0.836 -0.621 0.0746 0.388
#> 4 1.60 -2.21 -1.99 -0.0538
#> 5 0.330 1.12 0.620 -1.38
#> 6 -0.820 -0.0449 -0.0561 -0.415
#> 7 0.487 -0.0162 -0.156 -0.394
#> 8 0.738 0.944 -1.47 -0.0593
#> 9 0.576 0.821 -0.478 1.10
#> 10 -0.305 0.594 0.418 0.763
# Define rescale01 function:
rescale01 <- function(x) {
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}
```

- We could rescale all columns of `tb` in 4 separate steps:

```
df <- tb # copy
# (a) each column individually:
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
df
#> # A tibble: 10 x 4
#> a b c d
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.0860 1 1 1
#> 2 0.419 0.699 0.953 0.466
#> 3 0 0.428 0.710 0.645
#> 4 1 0 0 0.484
#> 5 0.479 0.896 0.897 0
#> 6 0.00624 0.582 0.665 0.352
#> 7 0.544 0.590 0.630 0.359
#> 8 0.647 0.848 0.178 0.482
#> 9 0.581 0.815 0.520 0.905
#> 10 0.218 0.754 0.828 0.782
```

- Alternatively, we could use a `for` loop to modify an existing data structure.

Here are the **answers** to our 3 questions regarding loops:

- *Body*: Apply `rescale01()` to every column of `tb`.
- *Sequence*: A list of columns (i.e., iterate over each column with `seq_along(tb)`).
- *Output*: `tb` (i.e., identical dimensions to the input).

```
# (b) loop that modifies an existing object:
df2 <- tb # copy
for (i in seq_along(df2)) {
df2[[i]] <- rescale01(df2[[i]])
}
# Output:
df2
#> # A tibble: 10 x 4
#> a b c d
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.0860 1 1 1
#> 2 0.419 0.699 0.953 0.466
#> 3 0 0.428 0.710 0.645
#> 4 1 0 0 0.484
#> 5 0.479 0.896 0.897 0
#> 6 0.00624 0.582 0.665 0.352
#> 7 0.544 0.590 0.630 0.359
#> 8 0.647 0.848 0.178 0.482
#> 9 0.581 0.815 0.520 0.905
#> 10 0.218 0.754 0.828 0.782
# Verify equality:
all.equal(df, df2)
#> [1] TRUE
```

#### 2. Looping over names or values, instead of indices

So far, we used `for` loops to loop over *numeric indices* of `x` with `for (i in seq_along(x))`,
and then extracted the `i`-th value of `x` with `x[[i]]`.

There are 2 additional common loop patterns:

- loop over the *elements* of `xs` with `for (x in xs)`: useful when only caring about side effects;
- loop over the *names* of `xs` with `for (nm in names(xs))`: useful when names are needed for files or plots.

Whenever creating named output, make sure to also provide *names* to the results vector:

```
x <- c(a = 1, b = 2, c = 3) # a named vector
results <- vector("list", length(x))
names(results) <- names(x)
results
#> $a
#> NULL
#>
#> $b
#> NULL
#>
#> $c
#> NULL
```

Note that the basic iteration over the numeric indices is the most general form, because given a position we can extract both the current name and the current value:
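As a minimal sketch of this pattern, the following loop iterates over the positions of a small named vector and extracts both the name and the value at each position. The vector `xs` and its contents are made up for illustration:

```r
# Hypothetical named vector (for illustration only):
xs <- c(a = 10, b = 20, c = 30)

results <- vector("list", length(xs))  # prepare output
names(results) <- names(xs)            # carry names over to the output

for (i in seq_along(xs)) {
  name  <- names(xs)[[i]]  # current name
  value <- xs[[i]]         # current value
  results[[i]] <- paste0(name, " = ", value)
}

results
```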

#### 3. Handling outputs of unknown length

**Problem:** Knowing the number of iterations, but not knowing how long the output will be.

**Solution:** Increase the size of the output within a loop OR use a list object to collect loop results.

For example, imagine we knew the first `N = 1000` digits of pi (i.e., a series of digits `314159` etc.), but wanted to count how frequently a specific target subsequence (e.g., `target = 13`) occurs in this sequence.

The following code uses `pi_100k` (available in the **ds4psy** package or from a file pi_100k.txt)
to read the first `N = 1000` digits of pi into a scalar character object `pi_1000`:

```
# Data:
## Orig. data source <http://www.geom.uiuc.edu/~huberty/math5337/groupe/digits.html>
pi_100k <- ds4psy::pi_100k # from ds4psy package
## From local data file:
# pi_100k <- readLines("./data/pi_100k.txt") # from local data file
# pi_data <- "http://rpository.com/ds4psy/data/pi_100k.txt" # URL of online data file
# pi_100k <- readLines(pi_data) # read from online source
N <- 1000
pi_1000 <- paste0(substr(pi_100k, 1, 1), substr(pi_100k, 3, (N + 1))) # skip the "." at position 2!
nchar(pi_1000) # 1000 (qed)
#> [1] 1000
substr(pi_1000, 1, 10) # first 10 digits
#> [1] "3141592653"
```

The following code uses a `for` loop to answer the questions: How many times and at which positions does the target `13` occur in `pi_1000`?

```
# initialize:
count <- 0 # count number of occurrences
target <- 13
target_positions <- c() # vector to store positions
# loop:
for (i in 1:(N-1)){
# Current 2-digit sequence:
digits_i <- substr(pi_1000, i, i+1) # as character
digits_i <- as.integer(digits_i) # as integer
if (digits_i == target){
count <- count + 1 # increment count
target_positions <- c(target_positions, i) # add the current i to target_positions
} # end if.
} # end for.
# Results:
count
#> [1] 12
target_positions
#> [1] 111 282 364 382 526 599 628 735 745 760 860 972
count == length(target_positions)
#> [1] TRUE
```

The answers are that 13 occurs 12 times in `pi_1000`. Its 1st occurrence is at position 111 and its 12th occurrence at position 972.

Note that we could specify the number of iterations (i.e., `N - 1` loops, from 1 to 999), but not the number of elements in `target_positions`.

Growing the `target_positions` vector by `i` every time a new target is found (by `target_positions <- c(target_positions, i)`) is quite slow and inefficient, as a new and longer vector is created on each extension.
However, this is not problematic as long as we only do this once and for a relatively small problem (like a loop with 999 iterations).

A more efficient solution could initialize `target_positions` as a *list* (which can take any data object as an element) and then store any instance of finding the `target` at the `i`-th position of `pi_1000` as the `i`-th element of the list. Once the loop is finished, we can use `unlist()` to flatten the list to a vector:

```
# initialize:
count <- 0 # count number of occurrences
target <- 13
target_positions <- vector("list", (N - 1)) # empty list (of N - 1 elements) to store positions
# loop:
for (i in 1:(N-1)){
# Current 2-digit sequence:
digits_i <- substr(pi_1000, i, i+1) # as character
digits_i <- as.integer(digits_i) # as integer
if (digits_i == target){
count <- count + 1 # increment count
target_positions[i] <- i # store the current i as the i-th element of target_positions
} # end if.
} # end for.
# Results:
count
#> [1] 12
# target_positions # is a list with mostly NULL
target_positions <- unlist(target_positions) # flatten list to vector (removing NULL elements)
target_positions
#> [1] 111 282 364 382 526 599 628 735 745 760 860 972
count == length(target_positions)
#> [1] TRUE
```

This way, we could initialize the length of `target_positions` before entering the `for` loop.
This made it possible to assign any new target to `target_positions[i]`, but made the list much larger than it actually needed to be.
The advantages and disadvantages of these different options should be considered for the specific problem at hand.

#### 4. Handling sequences of unknown length

**Problem:** The number of iterations is not known in advance.

**Solution:** Use a `while` loop, with a `condition` to stop the loop.

Sometimes we cannot know in advance how many iterations our loop should run for. This is common when dealing with random outcomes or running simulations that need to reach some threshold value to stop.

We can address these problems with a `while` loop.

Actually, a `while` loop is simpler than a `for` loop because it only has 2 components, a condition and a body.
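The two components can be sketched as follows. The generic form is shown as a comment, followed by a runnable toy example whose variable `x` and limit of 100 are made up for illustration:

```r
# Generic form (not run):
# while (condition) {
#   body
# }

# Runnable toy example: double x until it exceeds a limit
x <- 1
while (x <= 100) {  # condition: checked before every iteration
  x <- x * 2        # body: must eventually make the condition FALSE
}
x
```

If the body never changes anything that the condition depends on, the loop will not terminate on its own.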

A `while` loop is also more general than a `for` loop because we can write any `for` loop as a `while` loop, but not vice versa.
For instance, any `for` loop with `N` steps can be re-written as a `while` loop that uses a counter variable `i` for the number of iterations and a condition that the maximum number of steps `N` must not be exceeded.
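Both forms can be sketched side by side. The value of `N`, the task of squaring `i`, and the names `out_for` and `out_while` are assumptions chosen for illustration:

```r
N <- 5  # hypothetical number of steps

# for loop with N steps:
out_for <- vector("double", N)
for (i in seq_len(N)) {
  out_for[[i]] <- i^2
}

# equivalent while loop with an explicit counter:
out_while <- vector("double", N)
i <- 1                   # initialize counter
while (i <= N) {         # condition: do not exceed N steps
  out_while[[i]] <- i^2
  i <- i + 1             # increment counter
}

identical(out_for, out_while)
```

The `while` version needs two extra lines (for initializing and incrementing the counter), and forgetting either of them changes the behavior of the loop.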

As this requires explicit maintenance (here: the initialization and incrementing of a counter), we prefer using `for` loops when the number of iterations is known in advance.

However, we often do not know in advance how many iterations we will need. For instance, let’s ask ourselves the following question:

- At which position in the first 1000 digits of pi do we first encounter the subsequence `13`?

Actually, we do know the answer to this problem from `target_positions` above: The 1st occurrence of 13 in `pi_1000` is at position 111. Knowing a solution makes this a good practice problem.

Assuming that we know nothing else of the sequence, we cannot do that sort of iteration with the `for` loop (unless we loop over the entire sequence, as we did above).
The following `while` loop solves this task by incrementally increasing `i` to inspect the corresponding digits (at positions `i` and `i + 1`) of `pi_1000` as long as we meet the condition `digits_i != target`:

```
# Data:
# pi_1000 # 1000 digits of pi (from above)
target <- 13
i <- 1 # initialize position counter
digits_i <- as.integer(substr(pi_1000, i, i+1)) # 2-digit integer starting at position i
while (digits_i != target){
i <- i + 1 # increment position counter
digits_i <- as.integer(substr(pi_1000, i, i+1)) # 2-digit integer starting at position i
} # end while.
# Check results:
i # Position of 1st target:
#> [1] 111
substr(pi_1000, i, i+nchar(target)-1) # Digits at target position(s)
#> [1] "13"
```

A danger of `while` loops is that they may never stop. For instance, if we asked:

- At which position in the first 1000 digits of pi do we first encounter the subsequence `999`?

we could slightly modify our code above (adapting `digits_i` to look for a 3-digit number):

```
# Data:
# pi_1000 # 1000 digits of pi (from above)
target <- 999 # 3-digit target (found)
# target <- 123 # alternative target (yielding error)
i <- 1 # initialize position counter
digits_i <- as.integer(substr(pi_1000, i, i+2)) # 3-digit integer starting at position i
while (digits_i != target){
i <- i + 1 # increment position counter
digits_i <- as.integer(substr(pi_1000, i, i+2)) # 3-digit integer starting at position i
} # end while.
# Check results:
i # Position of 1st target:
#> [1] 763
substr(pi_1000, i, i+nchar(target)-1) # Digits at target position(s)
#> [1] "999"
```

The answer is: The digits 999 first appear in `pi_1000` at position 763.

However, if we changed our target to `123` to ask the analogous question:

- At which position in the first 1000 digits of pi do we first encounter the subsequence `123`?

the same `while` loop would stop with an error.

The source of this error becomes obvious when realizing that `i` is set to a value of 1001:
We simply did not find an instance of `123` in the first 1000 digits of pi and the counter is trying to access its 1001st digit, which is undefined (`NA`) and hence causes an error in our condition (`digits_i != target`).
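This failure can be reproduced on a much shorter string. The following sketch assumes a made-up 5-character string `digits` and a 2-digit target that does not occur in it. It wraps the unguarded loop in `tryCatch()` so that the error message is captured rather than stopping the script:

```r
digits <- "31415"  # short hypothetical digit string (5 characters)
target <- 99       # 2-digit target that does not occur in digits

i <- 1  # initialize position counter
digits_i <- as.integer(substr(digits, i, i + 1))

# Run the unguarded loop inside tryCatch() to capture the error message:
msg <- tryCatch({
  while (digits_i != target) {
    i <- i + 1
    digits_i <- as.integer(substr(digits, i, i + 1))  # NA beyond the end
  }
  "no error"
}, error = function(e) conditionMessage(e))

i    # counter value when the loop stopped
msg  # message of the error raised by the condition
```

Here, the counter ends up one position beyond the end of the string (`nchar(digits) + 1`), which mirrors the value of 1001 for `pi_1000`.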

To prevent this sort of error, we could modify our condition to also stop the `while` loop after the maximum number of possible steps has been reached.
In our case, the `while` loop only makes sense as long as we do not exceed the number of characters in `pi_1000`, so that we can add the requirement `(i <= nchar(pi_1000))` as an additional (conjunctive, i.e., using `&&`) test to our condition:

```
# Data:
# pi_1000 # 1000 digits of pi (from above)
# target <- 999 # 3-digit target (found)
target <- 123 # alternative target (not found)
i <- 1 # initialize position counter
digits_i <- as.integer(substr(pi_1000, i, i+2)) # 3-digit integer starting at position i
while ( (digits_i != target) && (i <= nchar(pi_1000)) ){
i <- i + 1 # increment position counter
digits_i <- as.integer(substr(pi_1000, i, i+2)) # 3-digit integer starting at position i
} # end while.
# position of 1st target:
i
#> [1] 1001
```

This way, the `while` loop is limited to a maximum of `nchar(pi_1000)` = 1000 iterations.
If the counter `i` shows an impossible value of 1001, we can conclude that the `target` sequence was not found.

#### Practice

- Combine a `for` loop and a `while` loop to find the positions of the first 3 occurrences of `13` in `pi_1000`.

**Hint:** Inspecting `target_positions` (computed above) tells you the solution.
But it’s still instructive to combine both types of loops to solve this problem.

```
# Data:
# pi_1000
first_3 <- rep(NA, 3) # initialize output
target <- 13
i <- 1 # initialize position counter
digits_i <- as.integer(substr(pi_1000, i, i+1)) # 2-digit integer starting at position i
for (n in seq_along(first_3)){
while (digits_i != target){
i <- i + 1 # increment position counter
digits_i <- as.integer(substr(pi_1000, i, i+1)) # 2-digit integer starting at position i
} # end while loop.
first_3[n] <- i # store current position
i <- i + 1 # increment position counter
digits_i <- as.integer(substr(pi_1000, i, i+1)) # 2-digit integer starting at position i
} # end for.
# Solution:
first_3
#> [1] 111 282 364
# Verify equality:
all.equal(first_3, target_positions[1:3])
#> [1] TRUE
```

### 12.2.3 Functional programming

#### For loops vs. functionals

In R, `for` loops are not as important as in most other programming languages, because R is a *functional* programming language.
This means that it is possible to replace many `for` loops by wrapping up the body of a `for` loop in a function, and then calling that function.

#### Example

Consider a simple data table `df`:

```
# Data:
set.seed(1) # for reproducible results
df <- tibble(a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10))
df
#> # A tibble: 10 x 4
#> a b c d
#> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.626 1.51 0.919 1.36
#> 2 0.184 0.390 0.782 -0.103
#> 3 -0.836 -0.621 0.0746 0.388
#> 4 1.60 -2.21 -1.99 -0.0538
#> 5 0.330 1.12 0.620 -1.38
#> 6 -0.820 -0.0449 -0.0561 -0.415
#> 7 0.487 -0.0162 -0.156 -0.394
#> 8 0.738 0.944 -1.47 -0.0593
#> 9 0.576 0.821 -0.478 1.10
#> 10 -0.305 0.594 0.418 0.763
```

Assume that our goal is getting the mean of every column of `df`.
The standard solution for this task is using a `for` loop:

```
# prepare output vector:
out <- vector("double", ncol(df))
# for loop over columns i:
for (i in seq_along(df)) {
out[[i]] <- mean(df[[i]])
} # end for.
out
#> [1] 0.1322028 0.2488450 -0.1336732 0.1207302
```

As we will want to compute the means of every column pretty frequently, we abstract the `for` loop into a dedicated `col_mean` function:

```
col_mean <- function(data) {
out <- vector("double", length(data))
for (i in seq_along(data)) {
out[i] <- mean(data[[i]]) # apply function to i-th column of df
} # end for.
return(out)
}
# Check:
col_mean(df)
#> [1] 0.1322028 0.2488450 -0.1336732 0.1207302
```

But now we are no longer satisfied with the *mean*, but also want the *median* and *standard deviation* of every column.
Of course we could write analogous functions `col_median()` and `col_sd()`, but they would only differ in 1 line from `col_mean` above.
Hence, what we really want is to *pass a function* (like `mean` or `median`) to another function, which is exactly what we will learn to do next.

#### Motivation: Generalizing functions

Consider the following 3 functions:

```
f1 <- function(x) {abs(x - mean(x)) ^ 1}
f2 <- function(x) {abs(x - mean(x)) ^ 2}
f3 <- function(x) {abs(x - mean(x)) ^ 3}
```

The 3 functions only vary in their last digit (i.e., the exponent). Given this degree of duplicated code, we would make the exponent an *additional argument* to a more general function.
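A minimal sketch of such a general function is shown below. The function name `f`, the argument name `i`, and the test values in `x` are assumptions chosen for illustration:

```r
# General function with the exponent as an additional argument i:
f <- function(x, i) {abs(x - mean(x)) ^ i}

# Check: f() with i = 2 reproduces f2() from above:
f2 <- function(x) {abs(x - mean(x)) ^ 2}
x <- c(1, 2, 6)  # hypothetical values
all.equal(f(x, 2), f2(x))
```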

By doing this, we have reduced the amount of code, the corresponding chance of errors, and made it easier to generalise to new situations.

#### Application: Using functions as arguments to functions

We can achieve exactly the same thing for 3 similar functions (like `col_mean()`, `col_median()`, and `col_sd()`) by adding an argument to a function that supplies the *function* to apply to each column:

```
col_summary <- function(data, fun) {
# prepare vector of outputs:
out <- vector("double", length(data))
# loop over data:
for (i in seq_along(data)) {
# apply fun to the i-th element of data:
out[i] <- fun(data[[i]]) # !!!
} # end for.
return(out)
}
# Repeat use of col_mean (from above):
col_summary(df, fun = mean) # Note: same functionality/results as above.
#> [1] 0.1322028 0.2488450 -0.1336732 0.1207302
# New uses:
col_summary(df, fun = median)
#> [1] 0.256575548 0.491872279 0.009218122 -0.056559219
col_summary(df, fun = sd)
#> [1] 0.7805860 1.0695148 0.9556076 0.8085646
```

The idea of passing a *function* (like `mean` or `median`) as an *argument* (here: `fun`) to another *function* (here: `col_summary`) is an extremely powerful idea,
and it is one reason for calling R a *functional* programming language.

### 12.2.4 Replacing loops in **base** R and **purrr**

The `apply` family of functions of **base** R and the `map` functions of the **purrr** package provide functions that eliminate the need for many `for` loops.

Note that using `apply` or `map` is not necessarily faster than using `for` loops.
The chief benefit of using these functions is not speed, but *clarity*: They make code easier to write and read.

#### 12.2.4.1 Using `apply` functions

The **base** R `apply` family of functions (i.e., `apply()`, `lapply()`, `tapply()`, etc.) replaces `for` loops over data structures by applying a function to designated parts of a (rectangular) data structure. To accomplish this, `apply` takes the following arguments:

- the `X` argument takes an array or matrix (i.e., rectangular data; a data frame is coerced to a matrix);
- the `MARGIN` argument: `1` = rows, `2` = columns;
- `FUN` is the function to be applied.

#### Examples

We can use `base::apply` to solve the problems addressed by `col_summary` above:

```
# Data:
# df # from above
dim(df) # 10 rows, 4 columns:
#> [1] 10 4
# apply FUN to columns:
apply(X = df, MARGIN = 2, FUN = mean) # mean of every column
#> a b c d
#> 0.1322028 0.2488450 -0.1336732 0.1207302
apply(X = df, MARGIN = 2, FUN = median) # median of every column
#> a b c d
#> 0.256575548 0.491872279 0.009218122 -0.056559219
apply(X = df, MARGIN = 2, FUN = sd) # SD of every column
#> a b c d
#> 0.7805860 1.0695148 0.9556076 0.8085646
```

In these calls, `MARGIN = 2` instructed `apply` to apply some function `FUN` to each *column* of `X`.
Changing the `MARGIN` argument to `MARGIN = 1` will apply `FUN` to each *row* of `X`:

```
# apply FUN to rows:
apply(X = df, MARGIN = 1, FUN = mean) # mean of every row
#> [1] 0.79074607 0.31320878 -0.24865815 -0.66564396 0.17430122 -0.33413132
#> [7] -0.01971167 0.03802378 0.50471947 0.36740756
apply(X = df, MARGIN = 1, FUN = median) # median of every row
#> [1] 1.13882846 0.28674328 -0.27333780 -1.02157837 0.47466676 -0.23556165
#> [7] -0.08599288 0.33950565 0.69850127 0.50592144
apply(X = df, MARGIN = 1, FUN = sd) # SD of every row
#> [1] 0.9776398 0.3722034 0.5752509 1.7923824 1.0852032 0.3669620 0.3723940
#> [8] 1.0949578 0.6893583 0.4701560
```

**Note** some variants of `apply`:

- `lapply` returns a *list* of the same length as `X`, each element of which is the result of applying `FUN` to the corresponding element of `X`.
- See also `sapply()` and `vapply()` for variants that simplify the results and return vectors or arrays.
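The differences between these variants can be sketched with a small example. The data frame `dat` and the result names are made up for illustration. As a data frame is a list of columns, all 3 functions iterate over its columns:

```r
# Hypothetical data frame (its columns are the elements iterated over):
dat <- data.frame(a = c(1, 2, 3), b = c(10, 20, 60))

res_l <- lapply(dat, mean)  # a list with 2 named elements
res_s <- sapply(dat, mean)  # simplified to a named double vector
res_v <- vapply(dat, mean, FUN.VALUE = numeric(1))  # like sapply, but checks each result

str(res_l)
res_s
res_v
```

With `vapply()`, the `FUN.VALUE` argument declares the expected type and length of each result, so that an unexpected result raises an error rather than silently changing the output type.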

#### 12.2.4.2 Using `map` functions

The **purrr** package (Henry & Wickham, 2020) provides a family of `map` functions as modern and more consistent versions of `apply`.

The main goal of using **purrr** functions (instead of `for` loops) is to break common list manipulation challenges into smaller and independent pieces.
This strategy involves 2 steps, each of which scales down the problem:

- Solving the problem for a *single element* of the list.
  Once we have solved that problem, **purrr** takes care of generalising the solution to every element in the list.

- Breaking a complex problem down into *smaller sub-problems* that allow us to advance towards a solution.
  With **purrr**, we get many small pieces that we can compose together with the pipe (`%>%`).

This scaling-down strategy makes it easier to solve new problems and to understand our solutions to old problems when we re-read older code.

#### Essential `map` functions

The pattern of looping over a vector, doing something to each element, and saving the results is so common that the **purrr** package provides a family of functions for it. There is a separate function for each type of output:

- `map()` makes a *list*.
- `map_lgl()` makes a *logical vector*.
- `map_int()` makes an *integer vector*.
- `map_dbl()` makes a *double vector*.
- `map_chr()` makes a *character vector*.

Each function takes a vector `.x` as input, applies a function `.f` to each element, and then returns a new vector that’s the same length (and has the same names) as the input vector. The type of the output vector is determined by the suffix of the `map` function (e.g., `map_chr` would return the output as a character vector).

#### Examples

Using a `map` variant on the above example (with doubles as output):
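The following sketch illustrates such calls. To remain self-contained, it uses a small made-up data frame `dat` as a stand-in for the table `df` from above, and the result names are hypothetical:

```r
library(purrr)

# Hypothetical stand-in for the table df used above:
dat <- data.frame(a = c(1, 2, 3), b = c(10, 20, 60))

m_mean   <- map_dbl(dat, mean)    # mean of every column
m_median <- map_dbl(dat, median)  # median of every column
m_sd     <- map_dbl(dat, sd)      # SD of every column

m_mean
m_median
m_sd
```

Each call returns a named double vector with one element per column, so no output vector needs to be prepared in advance.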

#### Practice

- Predict the output of the following variants of `map`, then check your predictions:

#### Details

It is worth pointing out some details regarding the different `map` variants:

- Without the suffix `_dbl`, `map(df, sd)` returns a *list* (with an element for every column of `df`).

- With `map`, our focus is on the function/operation, not the bookkeeping of the `for` loop. This is even more obvious when using the pipe:

```
df %>% map_dbl(mean)
#> a b c d
#> 0.1322028 0.2488450 -0.1336732 0.1207302
df %>% map_dbl(median)
#> a b c d
#> 0.256575548 0.491872279 0.009218122 -0.056559219
df %>% map_dbl(sd)
#> a b c d
#> 0.7805860 1.0695148 0.9556076 0.8085646
```

- `map` also uses the generic `...` argument to allow passing additional arguments to the function indicated by `.f`:

```
map_dbl(df, mean, trim = 0.5)
#> a b c d
#> 0.256575548 0.491872279 0.009218122 -0.056559219
map_dbl(df, sd, na.rm = FALSE)
#> a b c d
#> 0.7805860 1.0695148 0.9556076 0.8085646
```

- `map` preserves the names of `.x`:
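A minimal sketch of this behavior, using a small hypothetical list `z`:

```r
library(purrr)

z <- list(x = 1:3, y = 4:5)  # hypothetical named list

lens <- map_int(z, length)   # number of elements in each list entry
lens
#> x y 
#> 3 2
```

The names `x` and `y` of the input list reappear as names of the output vector.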

#### Shortcuts

There are a few shortcuts to save typing in the `.f` argument of `map`.

Imagine we want to fit a linear model to each group in a dataset.

The following example splits up the `mtcars` dataset into 3 pieces (by number of cylinders, `cyl`)
and fits the same linear model to each piece.
The linear model is supplied as an *anonymous function*:

```
mtcars %>%
  group_by(cyl) %>%
  count()
#> # A tibble: 3 x 2
#> # Groups: cyl [3]
#> cyl n
#> <dbl> <int>
#> 1 4 11
#> 2 6 7
#> 3 8 14
models <- mtcars %>%
  split(.$cyl) %>%  # split data into 3 sets (by cyl)
  map(function(df) lm(mpg ~ wt, data = df))  # map lm function to each set
models  # 3 linear models
#> $`4`
#>
#> Call:
#> lm(formula = mpg ~ wt, data = df)
#>
#> Coefficients:
#> (Intercept) wt
#> 39.571 -5.647
#>
#>
#> $`6`
#>
#> Call:
#> lm(formula = mpg ~ wt, data = df)
#>
#> Coefficients:
#> (Intercept) wt
#> 28.41 -2.78
#>
#>
#> $`8`
#>
#> Call:
#> lm(formula = mpg ~ wt, data = df)
#>
#> Coefficients:
#> (Intercept) wt
#> 23.868 -2.192
```

As the syntax for creating an *anonymous function* in R is quite verbose, **purrr** provides a *one-sided formula* as a shortcut:

```
models <- mtcars %>%
  split(.$cyl) %>%
  map(~lm(mpg ~ wt, data = .))
models
#> $`4`
#>
#> Call:
#> lm(formula = mpg ~ wt, data = .)
#>
#> Coefficients:
#> (Intercept) wt
#> 39.571 -5.647
#>
#>
#> $`6`
#>
#> Call:
#> lm(formula = mpg ~ wt, data = .)
#>
#> Coefficients:
#> (Intercept) wt
#> 28.41 -2.78
#>
#>
#> $`8`
#>
#> Call:
#> lm(formula = mpg ~ wt, data = .)
#>
#> Coefficients:
#> (Intercept) wt
#> 23.868 -2.192
```

Here, the `.` is used as a pronoun that refers to the current list element.

When inspecting many models, we may want to extract a summary statistic like \(R^{2}\).
To do that we need to first run `summary()` on our `models` and then extract the component called `r.squared`.
We could do this using the *shorthand for anonymous functions*:
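A minimal sketch of this two-step extraction, which rebuilds the three models from `mtcars` so that it runs on its own (the name `r2` is chosen here for illustration):

```r
library(purrr)

# Fit one linear model per cylinder group (as above):
models <- mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .))

# Summarize each model, then pull out the R^2 value of each summary:
r2 <- models %>%
  map(summary) %>%
  map_dbl(~ .$r.squared)

r2  # named double vector with one R^2 value per group ("4", "6", "8")
```

The first `map` returns a list of model summaries, and `map_dbl` then reduces this list to a plain numeric vector.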

But extracting named components from a list of elements is a common operation, so **purrr** provides an even shorter shortcut: We can use a *string*:
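A minimal sketch of the string shortcut, again rebuilding the models so that the example is self-contained (the name `r2_by_name` is chosen here for illustration):

```r
library(purrr)

models <- mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .))

# A string as .f extracts the component of that name from each element:
r2_by_name <- models %>%
  map(summary) %>%
  map_dbl("r.squared")
```

Passing `"r.squared"` yields the same result as the formula `~ .$r.squared`, with less typing.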

Alternatively, we can also use an *integer* to select elements by their position:
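A minimal sketch of selecting by position, using a small hypothetical nested list `x`:

```r
library(purrr)

x <- list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9))  # hypothetical data

# An integer as .f extracts the element at that position from each sub-list:
second <- x %>% map_dbl(2)
second
#> [1] 2 5 8
```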

#### 12.2.4.3 Comparing `map` vs. `apply`

The `map` functions of **purrr** are modeled on the `apply` functions:

- `lapply` is basically identical to `map`, except that `map` is consistent with all the other functions in **purrr**, and we can use the shortcuts for `.f`.

- `sapply` is a wrapper around `lapply` that automatically simplifies the output (which can yield unexpected results).

- `vapply` is a safe alternative to `sapply` because you supply an additional argument that defines the type. The only problem with `vapply` is that it’s a lot of typing: `vapply(df, is.numeric, logical(1))` is equivalent to `map_lgl(df, is.numeric)`.

- One advantage of `vapply` over **purrr**’s `map` functions is that it can also produce *matrices*, whereas the `map` functions only ever produce *vectors*.
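These correspondences can be checked with a minimal sketch. The data frames `df` and `num` are hypothetical and defined only for this illustration:

```r
library(purrr)

# Hypothetical data frame with mixed column types:
df <- data.frame(a = c(1.5, 2.5), b = c("x", "y"), c = c(3L, 4L))

res_lapply <- lapply(df, is.numeric)               # list
res_map    <- map(df, is.numeric)                  # list with the same content
res_vapply <- vapply(df, is.numeric, logical(1))   # logical vector
res_lgl    <- map_lgl(df, is.numeric)              # logical vector

identical(res_lapply, res_map)
#> [1] TRUE
identical(res_vapply, res_lgl)
#> [1] TRUE

# sapply() decides on its output type based on the input:
empty_sapply <- sapply(list(), is.numeric)   # an empty list, not logical(0)
empty_lgl    <- map_lgl(list(), is.numeric)  # logical(0), as requested

# vapply() can return a matrix (here: min and max of each column):
num <- data.frame(a = c(1.5, 2.5), c = c(3, 4))  # hypothetical numeric data
rng <- vapply(num, range, numeric(2))            # 2 x 2 matrix
```

The empty-input case shows why `sapply` is risky inside functions: code that expects a logical vector receives a list instead.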

#### 12.2.4.4 Advanced aspects of **purrr**

See the following sections of r4ds (Wickham & Grolemund, 2017) for more advanced issues in combination with the `map` functions of **purrr**:

- 21.6: Dealing with failure for using the adverbs `safely`, `possibly`, and `quietly` in combination with `map` functions.

- 21.7: Mapping over multiple arguments for the `map2` variants, as well as `invoke_map`.

- 21.8: Walk to use `walk`, `walk2` and `pwalk` for calling functions for their side effects.
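To give a first impression of these tools, the following minimal sketch uses one function from each of the three topics. The inputs and the helper name `safe_log` are hypothetical, and the sketch is not a substitute for the r4ds sections:

```r
library(purrr)

# possibly(): wrap a function so that errors yield a default value
safe_log <- possibly(log, otherwise = NA_real_)
logs <- map_dbl(list(1, 10, "a"), safe_log)  # log("a") fails and yields NA

# map2_dbl(): iterate over two vectors in parallel
sums <- map2_dbl(c(1, 2, 3), c(4, 5, 6), function(x, y) x + y)
sums
#> [1] 5 7 9

# walk(): call a function only for its side effect (here: printing)
walk(c("a", "b"), print)
#> [1] "a"
#> [1] "b"
```

Without `possibly`, the single failing element would stop the entire `map_dbl` call and discard the results for the other elements.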

Rather than digging deeper into the functional programming paradigm of **purrr**, the exercises of this chapter (in Section 12.4) will focus on good old-fashioned `for` loops and only cover essential aspects of `apply` and `map`.

### References

Henry, L., & Wickham, H. (2020). *purrr: Functional programming tools*. Retrieved from https://CRAN.R-project.org/package=purrr

R Core Team. (2020). *R: A language and environment for statistical computing*. Retrieved from https://www.R-project.org

Wickham, H., & Grolemund, G. (2017). *R for data science: Import, tidy, transform, visualize, and model data*. Retrieved from http://r4ds.had.co.nz