# Chapter 11 Functional programming and iterations

So far, you have learned a lot about data wrangling and analysis, but nothing about customizing R itself. This changes now, as you will be introduced to functions. Furthermore, every operation so far has been applied to one single object (read: a vector or a data.frame/tibble). Iteration means that you perform the same operation on multiple objects, data sets, you name it.

Today’s session will all be about following the DRY principle. DRY stands for Don’t Repeat Yourself. “Why not?,” you may ask. Well, the problem with copy-and-pasting code is that you have to change all the variable names in every instance of your code. RStudio has a nice Search-and-Replace function which might facilitate that, but this practice still bears the danger of writing code that contains errors. This is where you will need to make use of the tools that R offers to iterate over a couple of elements, perform operations on them, and return the results. An example:

example_strings <- c("this", "is", "how", "a", "for", "loop", "works")

for (i in seq_along(example_strings)) {
print(example_strings[[i]])
}
## [1] "this"
## [1] "is"
## [1] "how"
## [1] "a"
## [1] "for"
## [1] "loop"
## [1] "works"

Another option – from the tidyverse – is the purrr package:

library(tidyverse)
walk(example_strings, print)
## [1] "this"
## [1] "is"
## [1] "how"
## [1] "a"
## [1] "for"
## [1] "loop"
## [1] "works"

So, what has this code done? In both cases, it has taken the function print() and applied it to every element of our vector. Copying-and-pasting would have looked like this:

print(example_strings[[1]])
## [1] "this"
print(example_strings[[2]])
## [1] "is"
print(example_strings[[3]])
## [1] "how"
print(example_strings[[4]])
## [1] "a"
print(example_strings[[5]])
## [1] "for"
print(example_strings[[6]])
## [1] "loop"
print(example_strings[[7]])
## [1] "works"
print(example_strings[[7]])
## [1] "works"

Damn, I pasted the last instance twice. In this case, the mistake is obvious, but oftentimes it is not.

In the following, I will provide you a more extensive introduction into conditional statements, functions, loops, and the purrr package.

## 11.1 Flow control

Sometimes you want your code to only run in specific cases. For mutate(), I have already shown you conditional imputation of values with case_when(). A more general approach for conditionally running code in R is the if statement. It looks as follows:

if (conditional_statement evaluates to TRUE) {
do_something
}

They also have an extension – if…else:

if (conditional_statement evaluates to TRUE) {
do_something
} else {
do_something_else
}

Imagine that I want R to tell me whether a number it draws is smaller than or equal to five:

set.seed(1234)
x <- sample(10, 1)

if (x <= 5) {
print("x is smaller than or equals 5")
} 

If x is 5 or smaller (say, 3), the if statement prints the message. If the condition is not met, nothing happens. That is the case here: with this seed, the draw is greater than 5, so the code above prints nothing.

Now I could extend it by another if statement:

if (x > 5) {
print("x is greater than 5")
}
## [1] "x is greater than 5"

But else allows me to take a shortcut:

if (x <= 5) {
print("x is smaller than or equals 5")
} else {
print("x is greater than 5")
}
## [1] "x is greater than 5"

Please note that the condition inside the if statement needs to be a logical vector of length one (hence, either TRUE or FALSE). In older versions of R, a longer vector triggers a warning and only the first value is used; since R 4.2.0, it is an error instead:

if (c(TRUE, FALSE, TRUE)) {
print("example")
}
## Warning in if (c(TRUE, FALSE, TRUE)) {: the condition has length > 1 and only
## the first element will be used
## [1] "example"
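
One way to respect the length-one requirement is to collapse a logical vector with any() or all() before handing it to if. The following is a minimal sketch; the vector draws and its values are made up for illustration.

```r
# Hypothetical example values, not taken from the text above
draws <- c(3, 7, 10)

# draws <= 5 is a logical vector of length 3: TRUE FALSE FALSE
small <- draws <= 5

# any() and all() each collapse it into a single TRUE or FALSE
if (any(small)) {
  print("at least one draw is smaller than or equals 5")
}

if (all(small)) {
  print("every draw is smaller than or equals 5")
} else {
  print("some draws are greater than 5")
}
```

For element-wise decisions on a whole vector, ifelse() or dplyr's if_else() are the vectorized counterparts.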

## 11.2 Functions

So far, every call you have made within R contained a function. Even the most basic operations, such as c() for building vectors, rely on functions. Functions are the verbs of R: they do something to your objects. Hence, you as someone who obeys the principles of DRY can make good use of them. Whenever you need to copy code to perform certain tasks on an object, you can also put those tasks into a function and just provide the function with the objects.

Imagine you want to rescale some variables in a tibble (an example I took from R4DS):

set.seed(1234)

df <- tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)

df$a <- (df$a - min(df$a, na.rm = TRUE)) /
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) /
  (max(df$b, na.rm = TRUE) - min(df$b, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) /
  (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) /
  (max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))

Given that you now know how to loop over the tibble, you can certainly reduce the amount of copy-pasting here.

df <- tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)

for (i in seq_along(df)) {
df[[i]] <- (df[[i]] - min(df[[i]], na.rm = TRUE)) /
(max(df[[i]], na.rm = TRUE) - min(df[[i]], na.rm = TRUE))
}

However, the operation within the loop is generalizable: it always only takes a vector of numeric values as input, performs some actions on them and returns another vector of the same length, but rescaled into a range from 0 to 1. Hence, the operation fulfills the requirements for putting it into a function.

Doing so has some advantages:

• If an error occurs, you can simply change the function in one place – when you define it – instead of changing all the occurrences in your code
• It will certainly make your code easier to read: rescale0to1 is a more concise description than (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)). (Did you see what I did there? I already replaced the arguments with a generic variable. You can use it to write the function yourself in Exercise 1.)

### 11.2.1 Writing your own functions

When you define functions in R, you need to follow a certain structure:

function_name <- function(argument_1, argument_2, argument_n) {
function body
}

• The function_name is the thing you will call (e.g., mean()). In general, it should be a verb, it should be concise, and it should be in snake_case.
• The arguments are what you need to provide the function with (e.g., mean(1:10)).
• The function body contains the operations which are performed on the arguments. It can contain other functions as well, which need to be defined beforehand (e.g., sum(1:10) / length(1:10)). It is advisable to split up the function body into pieces that are as small as possible.
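
To make the structure concrete, here is a minimal sketch of a small function. The name center_values and its argument names are made up for this illustration; the function subtracts the mean from every value of a numeric vector.

```r
# center_values() is a hypothetical helper, not part of any package
center_values <- function(x, na_rm = TRUE) {
  # function body: compute the mean once, then subtract it from every value
  x_mean <- mean(x, na.rm = na_rm)
  x - x_mean
}

center_values(c(1, 2, 3))
## [1] -1  0  1
```

The name is a verb in snake_case, x and na_rm are the arguments, and the two lines between the curly braces form the function body.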

### 11.2.2 An example: Roulette

In the following, I will guide you through a quick example on how you could use functions to play an extremely basic game of Roulette with R. You provide it with two values (how much you bet and which number you choose) and R takes care of the rest.

So what does the function need to do? First, it needs to draw a number between 0 and 36. Second, it needs to compare the bet and its corresponding number. Third, it needs to return the respective result.

play_roulette <- function(bet, number) {
draw <- sample(0:36, 1)
tibble(
winning_number = draw,
your_number = number,
your_bet = bet,
your_return = if (number == draw) {
bet * 36
} else {
0
}
)
}

play_roulette(bet = 1, number = 35)
## # A tibble: 1 × 4
##   winning_number your_number your_bet your_return
##            <int>       <dbl>    <dbl>       <dbl>
## 1             15          35        1           0

But how to make sure that I do not bet on a number which I cannot bet on (i.e., numbers greater than 36)? Or, put differently, how to forbid values? Use stop(). Besides, how to set default values for the arguments? Just use argument = default.

play_roulette_restricted <- function(bet = 1, number) {
if (number > 36) stop("You can only bet on numbers between 0 and 36.")
draw <- sample(0:36, 1)
tibble(
winning_number = draw,
your_number = number,
your_bet = bet,
your_return = if (number == draw) {
bet * 36
} else {
0
}
)
#return(tbl_return)
}
play_roulette_restricted(number = 3)
## # A tibble: 1 × 4
##   winning_number your_number your_bet your_return
##            <int>       <dbl>    <dbl>       <dbl>
## 1              1           3        1           0
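
The check above only guards the upper bound. A slightly stricter validation could be sketched as follows; check_number() is a hypothetical helper introduced for this illustration, and tryCatch() is used only to show what happens when stop() is triggered.

```r
# check_number() is a hypothetical helper, not one of the chapter's functions
check_number <- function(number) {
  # Roulette numbers are the whole numbers from 0 to 36
  if (!is.numeric(number) || length(number) != 1) {
    stop("You need to bet on exactly one number.")
  }
  if (number < 0 || number > 36 || number != round(number)) {
    stop("You can only bet on whole numbers between 0 and 36.")
  }
  invisible(number)
}

check_number(35) # passes silently

# stop() raises an error; tryCatch() catches it so the message can be inspected
result <- tryCatch(
  check_number(37),
  error = function(e) conditionMessage(e)
)
result
## [1] "You can only bet on whole numbers between 0 and 36."
```

Inside a function such as play_roulette_restricted(), a call to such a helper could replace the single if (number > 36) line.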

The function returns the result of the last call, i.e., the tibble. If you want to be more explicit about what it should return, use return():

play_roulette_basic <- function(bet = 1, number) {
if (number > 36) stop("You can only bet on numbers between 0 and 36.")
draw <- sample(0:36, 1)
if (number == draw) {
return(str_c("Nice, you won", as.character(bet * 36), "Dollars", sep = " "))
} else {
return("I'm sorry, you lost.")
}
}
play_roulette_basic(number = 35)
## [1] "I'm sorry, you lost."

### 11.2.3 Functional programming with tidyverse functions

The majority of dplyr verbs use so-called tidy evaluation, which is a framework for controlling how expressions and variables in your code are evaluated by the tidyverse functions. The two main things here are data masking and tidy selection. The former facilitates computing on values within the data set and refers to functions such as filter(), where you can just type in variable names instead of tediously typing name_of_df$var_name. The latter aims to facilitate working with the columns in the data set. It is provided by the tidyselect package and allows you, for instance, to work with code such as tbl %>% select(starts_with("a")). More examples can be found using ?dplyr_tidy_select.

I will not go into detail here but rather stick to the implications this has for you. If you are interested in the theoretical underpinnings, read the chapter on “Metaprogramming” in Advanced R by Hadley Wickham.

#### 11.2.3.1 Providing the variable in the function call

If your function takes a user-supplied variable as an argument, you need to embrace this argument with {{ }} in the pipeline. For instance, the following function calculates the mean, median, and standard deviation of a variable.

my_summary <- function(tbl, var) {
  tbl %>%
    summarize(
      mean = mean({{ var }}),
      median = median({{ var }}),
      sd = sd({{ var }})
    )
}

mtcars %>% my_summary(cyl)
##     mean median       sd
## 1 6.1875      6 1.785922

If the variable names are supplied in a character vector, you need all_of():

summarize_mean <- function(data, vars) {
  data %>% summarize(n = n(), across({{ vars }}, mean))
}

mtcars %>%
  group_by(cyl) %>%
  summarize_mean(all_of(c("hp", "mpg"))) %>%
  glimpse()
## Rows: 3
## Columns: 4
## $ cyl <dbl> 4, 6, 8
## $ n   <int> 11, 7, 14
## $ hp  <dbl> 82.63636, 122.28571, 209.21429
## $ mpg <dbl> 26.66364, 19.74286, 15.10000

Another handy thing is changing the variable names in the output depending on the input names. Here, you can use glue syntax and :=:

my_summary_w_names <- function(tbl, var){
  tbl %>%
    summarize(
      "mean_{{ var }}" := mean({{ var }}),
      "median_{{ var }}" := median({{ var }}),
      "sd_{{ var }}" := sd({{ var }})
    )
}

mtcars %>% my_summary_w_names(cyl)
##   mean_cyl median_cyl   sd_cyl
## 1   6.1875          6 1.785922

Find more on programming with dplyr in this vignette.

### 11.2.4 Further readings

If you want to learn more about functional programming, check out the following resources:

## 11.3 Iteration

Strictly speaking, there are three kinds of loops: for, repeat, and while. I will touch upon for and while, because they are more straightforward than repeat. repeat loops will repeat a task until you tell them to stop by hitting the escape button or adding a condition up front. Interactive programming (hitting the escape button to break a loop) is not a desirable practice, and while loops have internalized the condition already. Hence, repeat loops do not appear to have any advantage and I can leave them out deliberately.

### 11.3.1 for loops

for loops are the sort of loops you will have to work with more often as they allow you to loop over a predefined number of elements. For this sake, I will briefly revise how you index vectors, lists, and tibbles.

• The ith element of a vector can be accessed by using either [[i]] or [i].
• The ith element of a list can be obtained by using [[i]]; [i] would return a sub-list instead of the element. The second element of the ith element in a list (if it were a vector or a list) can be obtained using [[i]][[2]] etc.
• The ith column of a tibble can be accessed as a vector using [[i]]. The second value of the ith column of a tibble can be accessed using [[i]][[2]].

How does that matter for for loops? Remember the example I showed you in the beginning? All a for loop does is iterate over a vector of values and insert them in place of a placeholder.

example_strings <- c("this", "is", "how", "a", "for", "loop", "works")

for (i in seq_along(example_strings)) {
  print(example_strings[[i]])
}
## [1] "this"
## [1] "is"
## [1] "how"
## [1] "a"
## [1] "for"
## [1] "loop"
## [1] "works"
seq_along(example_strings) # seq_along looks like this
## [1] 1 2 3 4 5 6 7
# hence, the first iteration looks like this.

print(example_strings[[seq_along(example_strings)[[1]]]])
## [1] "this"
# translates to
print(example_strings[[1]])
## [1] "this"

However, this course is about data analysis. So, I have a tibble with different cars and I want to perform some operations on some columns. In this case, I want the average value for every column where it makes sense.

cars_tbl <- mtcars %>%
  rownames_to_column(var = "model_name") %>%
  select(mpg, cyl, disp, hp, gear)
glimpse(cars_tbl)
## Rows: 32
## Columns: 5
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
output <- double(length = ncol(cars_tbl))
output <- set_names(output, colnames(cars_tbl))
# names don't look good -- for loop and change them to "mean_*" using the str_c-function

for (i in seq_along(cars_tbl)) {
output[[i]] <- mean(cars_tbl[[i]])
}

If you wanted to loop over a tibble and just perform operations on certain variables using dplyr syntax, you could also draw the variable names from a vector. However, this requires a slightly different approach: you cannot simply refer to the variable name in a pipeline. You need to index into the .data pronoun.

relevant_columns <- c("mpg", "cyl", "disp", "hp", "gear")

for (var in relevant_columns) {
mtcars %>% count(.data[[var]]) %>% print()
}
##     mpg n
## 1  10.4 2
## 2  13.3 1
## 3  14.3 1
## 4  14.7 1
## 5  15.0 1
## 6  15.2 2
## 7  15.5 1
## 8  15.8 1
## 9  16.4 1
## 10 17.3 1
## 11 17.8 1
## 12 18.1 1
## 13 18.7 1
## 14 19.2 2
## 15 19.7 1
## 16 21.0 2
## 17 21.4 2
## 18 21.5 1
## 19 22.8 2
## 20 24.4 1
## 21 26.0 1
## 22 27.3 1
## 23 30.4 2
## 24 32.4 1
## 25 33.9 1
##   cyl  n
## 1   4 11
## 2   6  7
## 3   8 14
##     disp n
## 1   71.1 1
## 2   75.7 1
## 3   78.7 1
## 4   79.0 1
## 5   95.1 1
## 6  108.0 1
## 7  120.1 1
## 8  120.3 1
## 9  121.0 1
## 10 140.8 1
## 11 145.0 1
## 12 146.7 1
## 13 160.0 2
## 14 167.6 2
## 15 225.0 1
## 16 258.0 1
## 17 275.8 3
## 18 301.0 1
## 19 304.0 1
## 20 318.0 1
## 21 350.0 1
## 22 351.0 1
## 23 360.0 2
## 24 400.0 1
## 25 440.0 1
## 26 460.0 1
## 27 472.0 1
##     hp n
## 1   52 1
## 2   62 1
## 3   65 1
## 4   66 2
## 5   91 1
## 6   93 1
## 7   95 1
## 8   97 1
## 9  105 1
## 10 109 1
## 11 110 3
## 12 113 1
## 13 123 2
## 14 150 2
## 15 175 3
## 16 180 3
## 17 205 1
## 18 215 1
## 19 230 1
## 20 245 2
## 21 264 1
## 22 335 1
##   gear  n
## 1    3 15
## 2    4 12
## 3    5  5

Every for loop consists of three components:

• Output: In the beginning, I create a double vector output <- double(length = ncol(cars_tbl)). As you can see here, I determine the length of the vector up front. This is for efficiency: if you were to grow the vector in every iteration (using c()), the loop would become very slow. This is especially important if you work with large data sets.
• Sequence: i in seq_along(cars_tbl) tells the for loop what to loop over.
• Body: output[[i]] <- mean(cars_tbl[[i]]). The actual code. Performs the operation on the respective column cars_tbl[[whatever 'i']] and stores the resulting value in the pre-defined output vector at position i.

One problem with for loops is that they are considered slow. They are not, at least not if you stick to the following rules:

• Always pre-allocate space – make sure that R does not have to expand your objects
• Do as much as you can outside the loop – every operation inside the loop will be repeated every time the loop is repeated
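
The first rule can be illustrated with a small experiment. This is a rough sketch with made-up function names (grow_vector() and fill_vector()); timings will differ between machines, so no numbers are shown.

```r
n <- 1e4

# Variant 1: grow the vector in every iteration (R has to copy it again and again)
grow_vector <- function(n) {
  out <- c()
  for (i in seq_len(n)) {
    out <- c(out, i^2)
  }
  out
}

# Variant 2: pre-allocate the full length and fill it in place
fill_vector <- function(n) {
  out <- double(length = n)
  for (i in seq_len(n)) {
    out[[i]] <- i^2
  }
  out
}

# Both variants return the same values
identical(grow_vector(n), fill_vector(n))

# system.time() reports how long each variant takes on your machine
system.time(grow_vector(n))
system.time(fill_vector(n))
```

The larger you make n, the more clearly the difference between the two variants should show.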

#### 11.3.1.1 Variations

In general, you will come across three different problems with for loops.

• Modifying an existing object
• Length of output is unknown
• Sequences are of unknown length
##### 11.3.1.1.1 Modifying the existing object

Remember the for loop with the cars_tbl? I could have performed the same operation storing it in the very same tibble again:

for (i in seq_along(cars_tbl)) {
cars_tbl[[i]] <- mean(cars_tbl[[i]])
}

However, in this case it preserves the number of rows and changes all the values to the respective measure. Hence, I need to slice() it.

cars_tbl_sliced <- cars_tbl %>%
slice(1)
##### 11.3.1.1.2 Length of output is unknown

Sometimes, you do not know how long your output object will be. This is the case, for instance, if you simulate vectors of random length. Normally, you would just put the values into a vector. However, if you do not know the length, you would have to ask R to grow the vector in every iteration, which is extremely inefficient. The solution is to use a list. You always know how many iterations your loop will have. Hence, you can create a list of this exact length and then just store the results in the list (as lists do not care about the length of their individual elements). Afterwards, you can unlist() or flatten_*() the list into a vector.

a_list <- vector(mode = "list", length = 10L)
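
A complete version of this pattern might look as follows. It is a minimal sketch: the seed, the range of random lengths, and the names n_iterations, n_values, and all_values are my own choices for illustration.

```r
set.seed(1234)

n_iterations <- 10L
a_list <- vector(mode = "list", length = n_iterations)

for (i in seq_along(a_list)) {
  # each iteration produces a vector of random length between 1 and 5
  n_values <- sample(5, 1)
  a_list[[i]] <- rnorm(n_values)
}

# collapse the list into one long double vector
all_values <- unlist(a_list)

length(a_list)     # always 10: one element per iteration
length(all_values) # sum of the individual lengths, depends on the random draws
```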
##### 11.3.1.1.3 Unknown sequence length

Occasionally, you also do not know how long your input sequence is. Instead, you want to loop until a certain condition is met. This is where while loops come in handy (but this is the only use case I could think of).

The basic structure of while loops is as follows:

while (condition) {
code
}

What could an example look like? (See the note at the end of this chapter.) The following loop keeps running until three heads have appeared in a row, that is, until the condition length(head) < 3 is no longer met.

Please note that both vectors which are to be modified within the loop – indicator and head – need to be created beforehand. If I had not created head beforehand, the loop would not have started because there would not have been any vector to assess the length.

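indicator <- 0 # counts the number of coin flips
head <- c() # collects consecutive heads, needs to exist before the loop starts
while (length(head) < 3) {
  # flip the coin
  if (sample(2, 1) == 1) {
    x <- "head"
  } else {
    x <- "tail"
  }
  # extend the streak on head, start over on tail
  if (x == "head") {
    head <- c(head, 1)
  } else {
    head <- c()
  }
  indicator <- indicator + 1
}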
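
A variation on the same idea, sketched under my own assumptions: the loop below rolls a die until the running total reaches a threshold. The names threshold, total, and n_draws as well as the max_iterations safeguard are illustrative choices, not part of the example above.

```r
set.seed(1234)

threshold <- 20
max_iterations <- 1000 # safeguard against an endless loop

total <- 0
n_draws <- 0

while (total < threshold && n_draws < max_iterations) {
  total <- total + sample(6, 1) # roll a six-sided die
  n_draws <- n_draws + 1
}

n_draws # number of rolls that were needed
total   # first running total that reached the threshold
```

The second part of the condition is a cheap insurance: if the stopping rule could never be fulfilled, the loop would still end after max_iterations rounds.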

### 11.3.2 purrr::map()

Loops are good because they make everything very explicit. However, they are often tedious to type. The purrr package provides functions which enable you to iterate over vectors, data frames/tibbles, and lists. Apart from that, it has a lot of functions to work with lists as well. I will only cover the former functions. If you are interested in using purrr for working with lists, check out this extensive tutorial by Jenny Bryan.

In the beginning of this chapter, I used the walk() function. This function is related to map() as it iterates over a vector and applies a function to its respective elements. The difference is that walk() doesn’t store the results, map() does.

#### 11.3.2.1 The basics

The structure of the map() function looks like this:

map(vector_or_list, function_to_apply, additional_arguments_if_needed)

map() always returns a list.

If you want the output to be in a different format, there are different, type-specific map() functions.

• map_dfr() returns a data frame – by binding the rows
• map_dfc() returns a data frame – by binding the columns
• map_dbl() returns a double vector
• map_chr() returns a character vector
• map_lgl() returns a logical vector

In the following I will demonstrate the function of map() with a simple example. The basic vector I will map over is:

example_dbl <- c(1.5, 1.3, 1.8, 1.9, 2.3)

In the first example, I just add 10 to the vector. In order to do so, I first need to create a function which adds 10.

add_10 <- function(x) {
x + 10
}
map(example_dbl, add_10)
## [[1]]
## [1] 11.5
##
## [[2]]
## [1] 11.3
##
## [[3]]
## [1] 11.8
##
## [[4]]
## [1] 11.9
##
## [[5]]
## [1] 12.3
map_dbl(example_dbl, add_10)
## [1] 11.5 11.3 11.8 11.9 12.3
map_chr(example_dbl, add_10) # does not make sense though
## [1] "11.500000" "11.300000" "11.800000" "11.900000" "12.300000"
##### 11.3.2.1.1 Anonymous functions

In the previous example, I defined the function beforehand. map() also allows you to define the function within the call using a so-called anonymous function. In the formula shorthand used here, the function’s argument is .x, which stands for the respective input element.

map_dbl(example_dbl, ~{
.x + 10
})
## [1] 11.5 11.3 11.8 11.9 12.3
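
The formula shorthand is only one of several ways to write an anonymous function. The sketch below compares it with a regular function(x) definition and with the \(x) shorthand that base R offers from version 4.1.0 onwards; the object names res_formula, res_function, and res_lambda are made up for this comparison.

```r
library(purrr)

example_dbl <- c(1.5, 1.3, 1.8, 1.9, 2.3)

# purrr's formula shorthand, as used above
res_formula <- map_dbl(example_dbl, ~ .x + 10)

# a regular anonymous function
res_function <- map_dbl(example_dbl, function(x) x + 10)

# the backslash shorthand (requires R 4.1.0 or later)
res_lambda <- map_dbl(example_dbl, \(x) x + 10)

# all three return the same double vector
identical(res_formula, res_function)
identical(res_function, res_lambda)
```

Which form you pick is mostly a matter of taste; the explicit forms let you choose a more telling argument name than .x.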

The for loop which calculated the mean for the cars_tbl would have looked like this in purrr:

map(cars_tbl, mean)
## $mpg
## [1] 20.09062
##
## $cyl
## [1] 6.1875
##
## $disp
## [1] 230.7219
##
## $hp
## [1] 146.6875
##
## $gear
## [1] 3.6875

When I put it into a tibble, names are preserved:

map_dfc(cars_tbl, mean)
## # A tibble: 1 × 5
##     mpg   cyl  disp    hp  gear
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  20.1  6.19  231.  147.  3.69

#### 11.3.2.2 Mapping over multiple arguments

Sometimes you want to apply a function to multiple arguments. Think, for example, of the sample() function. It requires at least two arguments: the size of the sample you draw and the element space x you draw the sample from. In the following call, x is 10 in every iteration, while size runs from 1 to 5.

map2(10, 1:5, sample, replace = TRUE)
## [[1]]
## [1] 5
##
## [[2]]
## [1] 6 7
##
## [[3]]
## [1] 3 3 7
##
## [[4]]
## [1]  9  8  3 10
##
## [[5]]
## [1] 6 7 2 5 5

However, map2() itself always returns a list. The typed variants such as map2_dbl() require every result to have length one, which is not the case here. You can take care of this using flatten_*().

map2(10, 5, sample) %>% flatten_dbl()
## [1]  9  3  8 10  7

If you provide it with vectors that are longer than 1, map2() will not perform the operation on every possible combination of the two vectors. Instead, it iterates over both vectors simultaneously: the first iteration uses the first value of each vector, the second iteration the second value of each vector, etc.

map2(c(10, 5), c(5, 3), sample) 
## [[1]]
## [1]  9  4  8 10  3
##
## [[2]]
## [1] 2 3 1
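
If you do want every combination, one option is to build the combinations first and then hand them to map2(). The following is a minimal sketch using base R's expand.grid(); the names sizes, pools, paired, combos, and all_combos are made up for this illustration.

```r
library(purrr)

sizes <- c(1, 2)
pools <- c(10, 20)

# map2() pairs the elements: sample(10, 1) and sample(20, 2), so two results
paired <- map2(pools, sizes, sample)
length(paired)
## [1] 2

# expand.grid() builds every combination of the two vectors first
combos <- expand.grid(pool = pools, size = sizes)
all_combos <- map2(combos$pool, combos$size, sample)
length(all_combos)
## [1] 4

# the lengths of the results show which sample size was used in each call
lengths(all_combos)
## [1] 1 1 2 2
```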

If you want to map over more than two arguments, pmap() is the way to go. If you work with functions which need multiple values as arguments, you can store the vectors containing the respective values in a tibble. You should name the columns according to the function’s arguments.

An example here is drawing numbers from a normal distribution with rnorm(). The function takes three arguments: n (the number of values to be drawn), mean, and sd.

tibble(
n = 10,
mean = 1:10,
sd = 0.5
) %>%
pmap(rnorm)
## [[1]]
##  [1] 1.0840927 1.1774841 0.9739474 0.9020327 0.6754651 0.4451164 1.4246371
##  [8] 1.0111813 1.4155703 0.3778561
##
## [[2]]
##  [1] 2.084513 2.336583 1.986862 1.904304 1.609047 3.029081 2.375251 2.912104
##  [9] 2.040030 1.684295
##
## [[3]]
##  [1] 2.243356 2.681950 3.113151 3.506845 3.126375 2.414026 3.334357 2.174950
##  [9] 2.817074 2.841941
##
## [[4]]
##  [1] 3.025877 4.460029 3.688564 3.832982 4.697574 4.318337 3.945784 4.256881
##  [9] 4.199636 4.831428
##
## [[5]]
##  [1] 5.137947 5.253136 5.173776 4.811381 5.048810 5.819372 4.562204 5.060880
##  [9] 5.681065 4.882689
##
## [[6]]
##  [1] 5.473309 5.565108 5.804936 5.576325 5.869680 5.792790 5.908475 6.203528
##  [9] 6.312317 6.839103
##
## [[7]]
##  [1] 6.965653 6.839580 7.735503 7.852165 7.021622 6.833671 6.088882 7.705631
##  [9] 6.581209 6.438119
##
## [[8]]
##  [1] 9.521883 8.117511 7.983371 6.633890 7.950105 8.488016 8.206934 8.456161
##  [9] 8.991866 8.584554
##
## [[9]]
##  [1]  8.745631  9.352090  8.900792  8.730965  7.572121  8.605177  9.243907
##  [8] 10.084016  9.250347  9.310105
##
## [[10]]
##  [1]  9.517048 10.081327  8.960881 10.242613 10.348384 10.092757 10.350367
##  [8] 10.155841 10.380231 10.921232

### References

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. First edition. Sebastopol, CA: O’Reilly.

1. I have taken this example from the R for Data Science book. I hardly ever work with while loops. The only use case from my day-to-day work is web-scraping, where I want to loop over pages until a certain threshold is reached. Therefore, I could not really come up with an example myself.