12.3 Functional programming

In Section 1.2 of Chapter 1 and Section 11.1 of Chapter 11 we cited:

To understand computations in R, two slogans are helpful:
- Everything that exists is an object.
- Everything that happens is a function call.

John Chambers

A superficial reading of this statement could suggest a dichotomy between objects and functions. However, we have already seen (in Section 11.2) that functions are objects and are defined like other objects. Now it is time to break down another barrier between two seemingly distinct concepts: the one between data and functions. As this section will show, R provides ways of avoiding explicit loops, and these ways work by treating functions as data that are passed to other functions.

Please note: As this section covers advanced aspects of iteration, it can be skipped on a first reading of this chapter and course.

12.3.1 For loops vs. functionals

In R, for loops are not as important as in most other programming languages, because R is a functional programming language. This means that it is possible to replace many for loops by wrapping up the body of a for loop in a function, and then repeatedly calling or applying that function.

Example

To motivate the idea of functional programming, consider a simple data table df:

# Data:
set.seed(1)  # for reproducible results

df <- tibble(a = rnorm(10),
             b = rnorm(10),
             c = rnorm(10),
             d = rnorm(10))
df
#> # A tibble: 10 × 4
#>         a       b       c       d
#>     <dbl>   <dbl>   <dbl>   <dbl>
#>  1 -0.626  1.51    0.919   1.36  
#>  2  0.184  0.390   0.782  -0.103 
#>  3 -0.836 -0.621   0.0746  0.388 
#>  4  1.60  -2.21   -1.99   -0.0538
#>  5  0.330  1.12    0.620  -1.38  
#>  6 -0.820 -0.0449 -0.0561 -0.415 
#>  7  0.487 -0.0162 -0.156  -0.394 
#>  8  0.738  0.944  -1.47   -0.0593
#>  9  0.576  0.821  -0.478   1.10  
#> 10 -0.305  0.594   0.418   0.763

Assume that our goal is getting the mean of every column of df. The standard solution for this task is using a for loop:

# prepare output vector: 
out <- vector("double", ncol(df)) 

# for loop over columns i: 
for (i in seq_along(df)) {
  
  out[[i]] <- mean(df[[i]])

} # end for. 

out
#> [1]  0.1322028  0.2488450 -0.1336732  0.1207302

As we will want to compute the means of every column pretty frequently, we abstract the for loop into a dedicated col_mean() function:

col_mean <- function(data) {
  
  out <- vector("double", length(data))
  
  for (i in seq_along(data)) {
    
    out[i] <- mean(data[[i]])  # apply function to i-th column of data
    
  } # end for. 
  
  return(out)
}

# Check: 
col_mean(df)
#> [1]  0.1322028  0.2488450 -0.1336732  0.1207302

But suppose we are no longer satisfied with the mean and also want the median and standard deviation of every column. Of course, we could write analogous functions col_median() and col_sd(), but they would differ from col_mean() in only one line. What we really want is a way to pass a function (like mean() or median()) to another function, which is exactly what we will learn to do next.

12.3.2 Generalizing functions

In previous chapters, we used functions to create and transform data objects. An important next step when using R consists in using functions as arguments or data objects (i.e., as inputs to other functions). To motivate this step, consider the following three functions:

f1 <- function(x) {abs(x - mean(x)) ^ 1}
f2 <- function(x) {abs(x - mean(x)) ^ 2}
f3 <- function(x) {abs(x - mean(x)) ^ 3}

These three functions differ only in their exponent (the last digit of each definition). Given this degree of duplicated code, we would turn the exponent into an additional argument of a more general function:

f <- function(x, i) {abs(x - mean(x)) ^ i}

By doing this, we have reduced the amount of code, lowered the corresponding chance of errors, and made it easier to generalise to new situations.
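As a quick check of the generalized function, the following sketch applies f() to a small vector x. The values of x are made up for illustration and are not data from this chapter:

```r
# Generalized function (as defined above):
f <- function(x, i) {abs(x - mean(x)) ^ i}

# Hypothetical example data (with a mean of 3):
x <- c(1, 2, 3, 6)

f(x, 1)  # absolute deviations from the mean
#> [1] 2 1 0 3
f(x, 2)  # squared deviations from the mean
#> [1] 4 1 0 9

# The general function reproduces a specialized version:
f2 <- function(x) {abs(x - mean(x)) ^ 2}
all.equal(f(x, 2), f2(x))
#> [1] TRUE
```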

Application: Using functions as arguments to functions

We can achieve exactly the same thing for three similar functions — like col_mean(), col_median(), and col_sd() — by adding an argument to a function that supplies the function to apply to each column:

col_summary <- function(data, fun) {
  
  # prepare vector of outputs:
  out <- vector("double", length(data))
  
  # loop over data: 
  for (i in seq_along(data)) {
    
    # apply fun to the i-th element of data:
    out[i] <- fun(data[[i]])  # !!!
    
  } # end for. 
  
  return(out)
  
}

The new function col_summary() accepts data and a function fun as its two arguments. The for loop in its body ensures that the function fun() is applied to each element of data. When data is a data frame, its elements are its variables (columns).

Here is what happens when we pass data = df and fun = mean to this col_summary() function:

# Repeat use of col_mean (from above): 
col_summary(data = df, fun = mean)
#> [1]  0.1322028  0.2488450 -0.1336732  0.1207302

We obtain the means of all columns as the expected results (corresponding to those above). But the real power of col_summary() lies in the fact that we can pass a variety of functions to it:

# Passing other functions:
col_summary(df, fun = median)
#> [1]  0.256575548  0.491872279  0.009218122 -0.056559219
col_summary(df, fun = sd)
#> [1] 0.7805860 1.0695148 0.9556076 0.8085646

Passing these functions computes the medians or standard deviations of all columns in data.

Functions as data

We know that functions are “objects” just like any other object in R (e.g., some data or variable). In the examples above, we used functions (i.e., mean(), median(), and sd()) as data that were passed to other functions (here: as the fun argument of col_summary()). More precisely, we merely wrote the names of functions without parentheses (e.g., mean rather than mean()) as an argument to another function (col_summary()). R looks up the function object that is bound to this name and passes it on, and we trusted that the other function would know what to do with it (here: call it as fun()).

Overall, the idea of passing the name of a function (like mean or median) as an argument (here: fun) to another function (here: col_summary()) is an extremely powerful feature of R and the main reason for calling R a functional programming language. All this may seem a bit confusing at first, but is actually quite simple when realizing that functions (or their names) can be seen as just another type of data: They tell other functions which task is to be performed.
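The idea of treating functions as data can be illustrated by a minimal sketch. The helper apply_fun() and the vector v are hypothetical names that were introduced only for this illustration:

```r
# A function that takes another function as an argument
# (apply_fun is a hypothetical helper, not part of base R):
apply_fun <- function(x, fun) {
  fun(x)  # call whatever function was passed in
}

v <- c(1, 2, 3, 4, 10)  # made-up data

apply_fun(v, mean)    # existing function, passed without parentheses
#> [1] 4
apply_fun(v, median)
#> [1] 3

# An anonymous function can be passed in the same way:
apply_fun(v, function(z) max(z) - min(z))  # width of the range
#> [1] 9

# Functions can also be stored in a list, like any other data:
funs <- list(mean = mean, median = median, max = max)
sapply(funs, function(f) f(v))
#>   mean median    max 
#>      4      3     10
```

Storing functions in a list and looping over them shows that a function is an ordinary value: It can be assigned, collected, and handed to other functions.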

Applying/mapping functions to data

There are special R functions that apply or map other functions to existing data structures. The apply() family of functions of base R and the corresponding map() functions of the purrr package eliminate the need for many for loops. We will briefly consider both families in the next two sections (Sections 12.3.3 and 12.3.4).

Note that using apply() or map() is not necessarily faster than using for loops. The chief benefit of using these functions is not speed, but clarity: They make code easier to write and read.

12.3.3 Using `apply()` functions

The apply() family of functions of base R (e.g., apply(), lapply(), and tapply()) replaces for loops over existing data structures by applying a function to designated parts of a data structure. The apply() function itself works on rectangular data and takes the following arguments:

an X argument accepts an array or matrix (i.e., rectangular data) as input (a data frame is converted into a matrix first);
a MARGIN argument specifies a direction: 1 = rows, 2 = columns;
a FUN argument provides the function to be applied.

Examples

We can use base::apply to solve the problems addressed by col_summary above:

# Data:
# df # from above
dim(df)  # 10 rows, 4 columns:
#> [1] 10  4

# apply FUN to columns:
apply(X = df, MARGIN = 2, FUN = mean)    # mean of every column
#>          a          b          c          d 
#>  0.1322028  0.2488450 -0.1336732  0.1207302
apply(X = df, MARGIN = 2, FUN = median)  # median of every column
#>            a            b            c            d 
#>  0.256575548  0.491872279  0.009218122 -0.056559219
apply(X = df, MARGIN = 2, FUN = sd)      # SD of every column
#>         a         b         c         d 
#> 0.7805860 1.0695148 0.9556076 0.8085646

In these calls, MARGIN = 2 instructed apply() to apply some function FUN to each column of X. Changing the MARGIN argument to MARGIN = 1 will apply FUN to each row of X:

# apply FUN to rows:
apply(X = df, MARGIN = 1, FUN = mean)    # mean of every row
#>  [1]  0.79074607  0.31320878 -0.24865815 -0.66564396  0.17430122 -0.33413132
#>  [7] -0.01971167  0.03802378  0.50471947  0.36740756
apply(X = df, MARGIN = 1, FUN = median)  # median of every row
#>  [1]  1.13882846  0.28674328 -0.27333780 -1.02157837  0.47466676 -0.23556165
#>  [7] -0.08599288  0.33950565  0.69850127  0.50592144
apply(X = df, MARGIN = 1, FUN = sd)      # SD of every row
#>  [1] 0.9776398 0.3722034 0.5752509 1.7923824 1.0852032 0.3669620 0.3723940
#>  [8] 1.0949578 0.6893583 0.4701560
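The effect of MARGIN is easier to verify by hand on a tiny matrix. The following sketch uses made-up values and also illustrates a caveat: As apply() converts a data frame into a matrix first, columns of mixed types are all turned into character strings (the objects m and dd are hypothetical):

```r
# A small matrix with 2 rows and 3 columns (made-up values):
m <- matrix(1:6, nrow = 2)
m
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> [2,]    2    4    6

apply(m, MARGIN = 1, FUN = sum)  # row sums
#> [1]  9 12
apply(m, MARGIN = 2, FUN = sum)  # column sums
#> [1]  3  7 11

# Caveat: A data frame with mixed types becomes a character matrix:
dd <- data.frame(n = 1:2, s = c("a", "b"))
apply(dd, MARGIN = 2, FUN = class)
#>           n           s 
#> "character" "character"
```

For data frames with columns of different types, looping over columns with lapply() or the map() functions avoids this conversion.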

Note some variants of apply():

lapply() returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
See also sapply() and vapply() for using or returning vectors and simplifying arrays.
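The differences between these variants can be sketched with a small named list (the list x and its values are made up for illustration):

```r
x <- list(a = 1:3, b = 4:6)  # made-up list of two integer vectors

lapply(x, mean)  # always returns a list
#> $a
#> [1] 2
#> 
#> $b
#> [1] 5

sapply(x, mean)  # simplifies to a named vector when possible
#> a b 
#> 2 5

# vapply() additionally checks that each result matches FUN.VALUE:
vapply(x, mean, FUN.VALUE = numeric(1))
#> a b 
#> 2 5
```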

Figure 12.5: Replacing loops by applying functions. (Image based on this post on the Learning Machines blog and created by the R package meme.)

12.3.4 Using `map()` functions

The purrr package (Henry & Wickham, 2020) contains a family of map() functions that provide updated and more consistent versions of apply().

The main goal of using purrr functions (instead of for loops) is to break common list manipulation challenges into smaller and independent pieces. This strategy involves two steps, each of which scales down the problem:

Solving the problem for a single element of a list. Once we have solved that problem, purrr takes care of generalising the solution to every element in the list.
Breaking a complex problem down into smaller sub-problems that allow us to advance towards a solution. With purrr, we get many small pieces that we can compose together with the pipe (%>%).

This scaling-down strategy makes it easier to solve new problems and to understand our solutions to old problems when we re-read older code.

Essential `map()` functions

The pattern of looping over a vector, doing something to each element, and saving the results is so common that the purrr package provides a family of functions for it. There is a separate function for each type of output:

map() creates a list.
map_lgl() creates a logical vector.
map_int() creates an integer vector.
map_dbl() creates a double vector.
map_chr() creates a character vector.

Each function takes a vector .x as input, applies a function .f to each element, and returns a new vector that has the same length (and the same names) as the input. The type of the output vector is determined by the suffix of the map() function (e.g., map_chr() returns a character vector).
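The following sketch applies each of these functions to the same small input. The list x and the helper collapse_dash() are hypothetical and only serve to show the different output types:

```r
library(purrr)

x <- list(a = 1:3, b = 4:6)  # made-up input with names a and b

# Hypothetical helper that turns a vector into a single string:
collapse_dash <- function(v) paste(v, collapse = "-")

map(x, length)         # list with elements a and b
map_int(x, length)     # integer vector
#> a b 
#> 3 3
map_dbl(x, mean)       # double vector
#> a b 
#> 2 5
map_lgl(x, is.numeric) # logical vector
#>    a    b 
#> TRUE TRUE
map_chr(x, collapse_dash)  # character vector
#>       a       b 
#> "1-2-3" "4-5-6"
```

In each case, the output has one element per element of x and inherits its names.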

Figure 12.6: Replacing loops by mapping functions. (Image based on this post on the Learning Machines blog and created by the R package meme.)

Examples

The most common use of map() applies a function to each column/variable of a data frame. For instance, we can use map_dbl() on the data frame df (from above) to compute various statistical measures (with numeric doubles as output):

map_dbl(.x = df, .f = mean)  # mean of all 4 columns/variables
#>          a          b          c          d 
#>  0.1322028  0.2488450 -0.1336732  0.1207302
map_dbl(df, median)          # median ~
#>            a            b            c            d 
#>  0.256575548  0.491872279  0.009218122 -0.056559219
map_dbl(df, sd)              # standard deviation 
#>         a         b         c         d 
#> 0.7805860 1.0695148 0.9556076 0.8085646

The main varieties of the map() function accommodate different numbers of arguments:

map() applies a function .f to 1 argument .x
map2() applies a function .f to 2 arguments .x and .y
pmap() applies a function .f to a list .l of arguments (of any number, but typically 3 or more)

Again, typical uses of these functions specify the expected data type of the output as a suffix after an underscore (e.g., map2_dbl() or pmap_dbl()). Here is a numeric example that uses three different map() functions:

# Functions:
square <- function(x){ x^2 }
expone <- function(x, y){ x^y }

# Data:
tb <- tibble(n_1 = sample(1:9, 100, replace = TRUE),
             n_2 = sample(1:3, 100, replace = TRUE))

# map functions to every row of tb (using the pipe):
tb %>% 
  mutate(sqr = purrr::map_dbl(.x = n_1, .f = square),     # 1 argument
         exp = purrr::map2_dbl(n_1, n_2, expone),         # 2 arguments
         sum = purrr::pmap_dbl(list(n_1, n_2, sqr), sum)  # 3+ arguments
         )
#> # A tibble: 100 × 5
#>      n_1   n_2   sqr   exp   sum
#>    <int> <int> <dbl> <dbl> <dbl>
#>  1     6     1    36     6    43
#>  2     8     1    64     8    73
#>  3     7     1    49     7    57
#>  4     1     1     1     1     3
#>  5     4     1    16     4    21
#>  6     8     1    64     8    73
#>  7     9     2    81    81    92
#>  8     9     2    81    81    92
#>  9     7     2    49    49    58
#> 10     4     1    16     4    21
#> # … with 90 more rows

The ability to apply map() functions (with a flexible number of arguments) to data structures within pipes provides a very convenient and powerful programming tool.
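As the example above relies on random samples, its results change on every run. A smaller sketch with fixed (made-up) inputs makes the logic of map2() and pmap() easier to verify by hand. The object names base, power, and args are hypothetical:

```r
library(purrr)

# Made-up inputs of equal length:
base  <- c(2, 3, 4)
power <- c(3, 2, 1)

# map2() walks along two vectors in parallel:
map2_dbl(base, power, function(x, y) x^y)  # 2^3, 3^2, 4^1
#> [1] 8 9 4

# pmap() matches the names of the list elements to argument names:
args <- list(a = c(1, 2, 3), b = c(10, 20, 30), c = c(100, 200, 300))
pmap_dbl(args, function(a, b, c) a + b + c)
#> [1] 111 222 333
```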

Practice

Predict the outputs of the following variants of map(), then check your predictions:

map(df, mean)
map_dbl(df, mean)
map_int(df, mean)
map_lgl(df, mean)
map_chr(df, mean)

Details

It is worth pointing out some details regarding the uses of the map variants:

Without the suffix _dbl, map(df, sd) returns a list (with an element for every column of df).
With the map functions, our focus is on the function/operation, not the bookkeeping of the for loop. This is even more obvious when using the pipe:

df %>% map_dbl(mean)
#>          a          b          c          d 
#>  0.1322028  0.2488450 -0.1336732  0.1207302
df %>% map_dbl(median)
#>            a            b            c            d 
#>  0.256575548  0.491872279  0.009218122 -0.056559219
df %>% map_dbl(sd)
#>         a         b         c         d 
#> 0.7805860 1.0695148 0.9556076 0.8085646

map() functions also use the generic ... argument to allow passing additional arguments to the function indicated by .f:

map_dbl(df, mean, trim = 0.5)
#>            a            b            c            d 
#>  0.256575548  0.491872279  0.009218122 -0.056559219
map_dbl(df, sd, na.rm = FALSE)
#>         a         b         c         d 
#> 0.7805860 1.0695148 0.9556076 0.8085646

map() functions preserve the names of .x:

z <- list(x = 1:3, y = 4:5)
z
#> $x
#> [1] 1 2 3
#> 
#> $y
#> [1] 4 5
z %>% map_int(length)
#> x y 
#> 3 2

Shortcuts

There are a few shortcuts to save typing in the .f argument of the map() family of functions.

Imagine we want to fit a linear model to each group in a dataset: The following example splits up the mtcars dataset into three pieces (by the value of the cyl variable) and fits the same linear model to each piece. The linear model is supplied as an anonymous function:

mtcars %>%
  group_by(cyl) %>% 
  count()
#> # A tibble: 3 × 2
#> # Groups:   cyl [3]
#>     cyl     n
#>   <dbl> <int>
#> 1     4    11
#> 2     6     7
#> 3     8    14

models <- mtcars %>% 
  split(.$cyl) %>%                            # split data into 3 sets (by cyl) 
  map(function(df) lm(mpg ~ wt, data = df))   # map lm() function to each set

models  # 3 linear models
#> $`4`
#> 
#> Call:
#> lm(formula = mpg ~ wt, data = df)
#> 
#> Coefficients:
#> (Intercept)           wt  
#>      39.571       -5.647  
#> 
#> 
#> $`6`
#> 
#> Call:
#> lm(formula = mpg ~ wt, data = df)
#> 
#> Coefficients:
#> (Intercept)           wt  
#>       28.41        -2.78  
#> 
#> 
#> $`8`
#> 
#> Call:
#> lm(formula = mpg ~ wt, data = df)
#> 
#> Coefficients:
#> (Intercept)           wt  
#>      23.868       -2.192

As the syntax for creating an anonymous function in R is quite long and complicated, purrr provides a one-sided formula as a shortcut:

models <- mtcars %>% 
  split(.$cyl) %>% 
  map(~lm(mpg ~ wt, data = .))

models  # 3 linear models
#> $`4`
#> 
#> Call:
#> lm(formula = mpg ~ wt, data = .)
#> 
#> Coefficients:
#> (Intercept)           wt  
#>      39.571       -5.647  
#> 
#> 
#> $`6`
#> 
#> Call:
#> lm(formula = mpg ~ wt, data = .)
#> 
#> Coefficients:
#> (Intercept)           wt  
#>       28.41        -2.78  
#> 
#> 
#> $`8`
#> 
#> Call:
#> lm(formula = mpg ~ wt, data = .)
#> 
#> Coefficients:
#> (Intercept)           wt  
#>      23.868       -2.192

Here, the symbol . is used as a pronoun that refers to the current list element.
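To see that the formula shortcut is just a more compact way of writing an anonymous function, we can compare both notations on a small vector (the vector v and the computation are made up for illustration). Within a formula, purrr accepts both . and .x as names for the current element:

```r
library(purrr)

v <- c(1, 2, 3)  # made-up input

map_dbl(v, function(x) x^2 + 1)  # anonymous function
#> [1]  2  5 10
map_dbl(v, ~ .x^2 + 1)           # formula shortcut with .x
#> [1]  2  5 10
map_dbl(v, ~ .^2 + 1)            # formula shortcut with .
#> [1]  2  5 10
```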

When inspecting many models, we may want to extract a summary statistic, like \(R^{2}\). To do that we need to first run summary() on our models and then extract the component called r.squared. We could do this using the shorthand for anonymous functions:

models %>% 
  map(summary) %>% 
  map_dbl(~.$r.squared)

But extracting named components from a list of elements is a common operation, so purrr provides an even shorter shortcut: We can use a string:

models %>% 
  map(summary) %>% 
  map_dbl("r.squared")

Alternatively, we can also use an integer value to select elements by their position:

x <- list(list(1, 2, 3), 
          list(4, 5, 6), 
          list(7, 8, 9))

x %>% map_dbl(2)  # 2nd element of each sub-list of x
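Both extraction shortcuts can be tried out on small lists. In the following sketch, x is the list from above, whereas people is a made-up list of records that was added for illustration:

```r
library(purrr)

# Extracting by position:
x <- list(list(1, 2, 3), 
          list(4, 5, 6), 
          list(7, 8, 9))
map_dbl(x, 2)
#> [1] 2 5 8

# Extracting by name (made-up records):
people <- list(list(name = "Ann", age = 31),
               list(name = "Bob", age = 27))
map_chr(people, "name")
#> [1] "Ann" "Bob"
map_dbl(people, "age")
#> [1] 31 27
```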

12.3.5 Comparing `map()` vs. `apply()`

The map() functions of purrr are modeled on the apply() functions of base R:

lapply() is basically identical to map(), except that the map() functions are more consistent with the other functions in purrr, and we can use some shortcuts for .f.
sapply() is a wrapper around lapply() that automatically simplifies the output (which can yield unexpected results).
vapply() is a safe alternative to sapply() because we supply an additional argument that defines the type of the output. The only drawback of vapply() is that it requires more typing: vapply(df, is.numeric, logical(1)) is equivalent to map_lgl(df, is.numeric).

An advantage of vapply() over purrr’s map() functions is that it can also produce matrices — the map() functions only ever produce vectors.
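The following sketch compares both approaches on two made-up data frames (d and num are hypothetical objects) and shows a case in which vapply() returns a matrix:

```r
library(purrr)

d <- data.frame(a = c(1, 2, 3), b = c("x", "y", "z"))  # made-up data

vapply(d, is.numeric, FUN.VALUE = logical(1))
#>     a     b 
#>  TRUE FALSE
map_lgl(d, is.numeric)
#>     a     b 
#>  TRUE FALSE

# With a FUN.VALUE of length 2, vapply() returns a matrix
# (with one column per element of the input):
num <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
vapply(num, range, FUN.VALUE = numeric(2))
#>      a b
#> [1,] 1 4
#> [2,] 3 6
```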

12.3.6 Advanced aspects of purrr

See the following sections of r4ds (Wickham & Grolemund, 2017) for more advanced issues in combination with the map functions of purrr:

21.6: Dealing with failure for using the adverbs safely, possibly, and quietly in combination with map() functions.
21.7: Mapping over multiple arguments for the map2() and pmap() variants (for supplying multiple arguments to a function), as well as invoke_map() (for supplying multiple functions).
21.8: Walk provides alternatives to map when calling functions for their side effects, like plotting visualizations or writing to files. (See walk(), walk2() and pwalk() for details.)
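To give a first impression of these tools, here is a minimal sketch that uses possibly() and walk() on made-up inputs. The names inputs and safe_log10 are hypothetical, and the r4ds sections listed above remain the reference for details:

```r
library(purrr)

inputs <- list(1, "a", 100)  # made-up inputs; log10("a") throws an error

# possibly() wraps a function so that errors yield a default value:
safe_log10 <- possibly(log10, otherwise = NA_real_)
map_dbl(inputs, safe_log10)
#> [1]  0 NA  2

# walk() calls a function for its side effects (here: printing):
walk(c("x", "y"), print)
#> [1] "x"
#> [1] "y"
```

Without the wrapper, the single failing element would abort the entire map_dbl() call.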

Rather than digging deeper into the functional programming paradigm of purrr, the exercises of this chapter (in Section 12.5) will focus on good old-fashioned for loops and only cover essential aspects of apply() and map().

References

Henry, L., & Wickham, H. (2020). purrr: Functional programming tools. Retrieved from https://CRAN.R-project.org/package=purrr

Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. Retrieved from http://r4ds.had.co.nz