12.3 Functional programming
In Section 1.2 of Chapter 1 and Section 11.1 of Chapter 11 we cited:
To understand computations in R, two slogans are helpful:
- Everything that exists is an object.
- Everything that happens is a function call.
John Chambers
A superficial reading of this statement could misinterpret it as suggesting a dichotomy between objects and functions. However, we have already seen (in Section 11.2) that functions are objects that are defined like other objects. Now it is time to break down another barrier between two seemingly distinct concepts: that between data and functions. As this section will show, R provides ways of avoiding explicit loops, and these ways work by passing functions as data to other functions.
Please note: As this section covers advanced aspects of iteration, it can be skipped on a first reading of this chapter and course.
12.3.1 For loops vs. functionals
In R, for loops are not as important as in most other programming languages, because R is a functional programming language.
This means that it is possible to replace many for loops by wrapping up the body of a for loop in a function, and then repeatedly calling or applying that function.
Example
To motivate the idea of functional programming, consider a simple data table df:
# Data:
set.seed(1) # for reproducible results
df <- tibble(a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10))
df
#> # A tibble: 10 × 4
#> a b c d
#> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.626 1.51 0.919 1.36
#> 2 0.184 0.390 0.782 -0.103
#> 3 -0.836 -0.621 0.0746 0.388
#> 4 1.60 -2.21 -1.99 -0.0538
#> 5 0.330 1.12 0.620 -1.38
#> 6 -0.820 -0.0449 -0.0561 -0.415
#> 7 0.487 -0.0162 -0.156 -0.394
#> 8 0.738 0.944 -1.47 -0.0593
#> 9 0.576 0.821 -0.478 1.10
#> 10 -0.305 0.594 0.418 0.763
Assume that our goal is getting the mean of every column of df.
The standard solution for this task is using a for loop:
# prepare output vector:
out <- vector("double", ncol(df))
# for loop over columns i:
for (i in seq_along(df)) {
out[[i]] <- mean(df[[i]])
} # end for.
out
#> [1] 0.1322028 0.2488450 -0.1336732 0.1207302
As we will want to compute the means of every column pretty frequently, we abstract the for loop into a dedicated col_mean() function:
col_mean <- function(data) {
out <- vector("double", length(data))
for (i in seq_along(data)) {
out[i] <- mean(data[[i]]) # apply function to i-th column of df
} # end for.
return(out)
}
# Check:
col_mean(df)
#> [1] 0.1322028 0.2488450 -0.1336732 0.1207302
Suppose that we now want not only the mean, but also the median and standard deviation of every column.
Of course, we could write analogous functions col_median() and col_sd(), but they would differ from col_mean() above in only one line.
Hence, what we really want is to pass a function (like mean() or median()) to some other function that then applies it to data, which is exactly what we will learn to do next.
12.3.2 Generalizing functions
In previous chapters, we used functions to create and transform data objects. An important next step when using R consists in using functions as arguments or data objects (i.e., as inputs to other functions). To motivate this step, consider the following three functions:
f1 <- function(x) {abs(x - mean(x)) ^ 1}
f2 <- function(x) {abs(x - mean(x)) ^ 2}
f3 <- function(x) {abs(x - mean(x)) ^ 3}
These three functions only vary in their last digit. Given this degree of duplicated code, we would make the last element an additional argument to a more general function:
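A minimal sketch of such a generalized function could look as follows. The function name f and the argument name i are assumptions chosen for illustration:

```r
# Generalized version of f1(), f2(), and f3():
# the exponent becomes an additional argument i (name assumed).
f <- function(x, i) {
  abs(x - mean(x)) ^ i
}

x <- c(1, 2, 6)        # mean(x) is 3
dev_1 <- f(x, i = 1)   # absolute deviations: 2 1 3
dev_2 <- f(x, i = 2)   # squared deviations:  4 1 9
```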
By doing this, we have reduced the amount of code and the corresponding chance of errors, and made it easier to generalise to new situations.
Application: Using functions as arguments to functions
We can achieve exactly the same thing for three similar functions (like col_mean(), col_median(), and col_sd()) by adding an argument to a function that supplies the function to apply to each column:
col_summary <- function(data, fun) {
# prepare vector of outputs:
out <- vector("double", length(data))
# loop over data:
for (i in seq_along(data)) {
# apply fun to the i-th element of data:
out[i] <- fun(data[[i]]) # !!!
} # end for.
return(out)
}
The new function col_summary() accepts data and a function fun as its two arguments.
The for loop in its body ensures that the function fun() is applied to each element of data.
When data is a data frame, its elements are its variables (columns).
Here is what happens when we pass data = df and fun = mean to this col_summary() function:
# Repeat use of col_mean (from above):
col_summary(data = df, fun = mean)
#> [1] 0.1322028 0.2488450 -0.1336732 0.1207302
We obtain the means of all columns as the expected results (corresponding to those above).
But the real power of col_summary() lies in the fact that we can pass a variety of functions to it:
# Passing other functions:
col_summary(df, fun = median)
#> [1] 0.256575548 0.491872279 0.009218122 -0.056559219
col_summary(df, fun = sd)
#> [1] 0.7805860 1.0695148 0.9556076 0.8085646
Passing these functions computes the median or standard deviation of each column in data.
Functions as data
We know that functions are “objects” just like any other object in R (e.g., some data or variable).
In the examples above, we used functions (i.e., mean(), median(), and sd()) as data that were passed to other functions (here: as the fun argument of col_summary()). More precisely, we merely passed the names of functions (e.g., the name mean of the function mean()) as an argument to another function (col_summary()) and trusted that this other function would know what to do with the function whose name was passed on.
Overall, the idea of passing the name of a function (like mean or median) as an argument (here: fun) to another function (here: col_summary()) is an extremely powerful feature of R and the main reason for calling R a functional programming language.
All this may seem a bit confusing at first, but is actually quite simple when realizing that functions (or their names) can be seen as just another type of data: They tell other functions which task is to be performed.
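The idea of treating a function as data can be sketched in a few lines of base R. The helper apply_fun() and the vector x are hypothetical names used only for this illustration. The sketch also shows that the function passed on does not need a name at all:

```r
# apply_fun() (hypothetical helper) receives a function as data
# in its fun argument and simply calls it on x:
apply_fun <- function(x, fun) {
  fun(x)
}

x <- c(2, 4, 4, 10)

res_mean   <- apply_fun(x, mean)      # 5
res_median <- apply_fun(x, median)    # 4

# An anonymous function works just as well as a named one:
res_range  <- apply_fun(x, function(v) max(v) - min(v))  # 8
```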
Applying/mapping functions to data
There are special R functions that apply or map other functions to existing data structures.
The apply() family of functions of base R and the corresponding map() functions of the purrr package eliminate the need for many for loops. We will briefly consider both families in the next two sections (Sections 12.3.3 and 12.3.4).
Note that using apply() or map() is not necessarily faster than using for loops.
The chief benefit of using these functions is not speed, but clarity: They make code easier to write and read.
12.3.3 Using apply() functions
The apply() family of functions of base R (i.e., apply(), lapply(), tapply(), etc.) replaces for loops over existing data structures by applying a function to designated parts of a (rectangular) data structure.
To accomplish this, apply() takes the following arguments:
- an X argument accepts an array or matrix (i.e., rectangular data) as input;
- a MARGIN argument specifies a direction: 1 = rows, 2 = columns;
- a FUN argument provides the function to be applied.
Examples
We can use base::apply() to solve the problems addressed by col_summary() above:
# Data:
# df # from above
dim(df) # 10 rows, 4 columns:
#> [1] 10 4
# apply FUN to columns:
apply(X = df, MARGIN = 2, FUN = mean) # mean of every column
#> a b c d
#> 0.1322028 0.2488450 -0.1336732 0.1207302
apply(X = df, MARGIN = 2, FUN = median) # median of every column
#> a b c d
#> 0.256575548 0.491872279 0.009218122 -0.056559219
apply(X = df, MARGIN = 2, FUN = sd) # SD of every column
#> a b c d
#> 0.7805860 1.0695148 0.9556076 0.8085646
In these calls, MARGIN = 2 instructed apply() to apply the function FUN to each column of X.
Changing the MARGIN argument to MARGIN = 1 applies FUN to each row of X:
# apply FUN to rows:
apply(X = df, MARGIN = 1, FUN = mean) # mean of every row
#> [1] 0.79074607 0.31320878 -0.24865815 -0.66564396 0.17430122 -0.33413132
#> [7] -0.01971167 0.03802378 0.50471947 0.36740756
apply(X = df, MARGIN = 1, FUN = median) # median of every row
#> [1] 1.13882846 0.28674328 -0.27333780 -1.02157837 0.47466676 -0.23556165
#> [7] -0.08599288 0.33950565 0.69850127 0.50592144
apply(X = df, MARGIN = 1, FUN = sd) # SD of every row
#> [1] 0.9776398 0.3722034 0.5752509 1.7923824 1.0852032 0.3669620 0.3723940
#> [8] 1.0949578 0.6893583 0.4701560
Note some variants of apply():
- lapply() applies a function FUN to a list X and returns a list of the same length as X. Each element of the returned list is the result of applying FUN to the corresponding element of X.
- See also sapply() and vapply() for using or returning vectors and simplifying arrays.
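The differences between these variants can be illustrated with a small base R sketch. The list x and the result names are assumptions made up for this example:

```r
# A small named list (hypothetical example data):
x <- list(a = 1:3, b = c(2, 4, 6, 8))

# lapply() always returns a list of the same length as x:
res_l <- lapply(x, mean)

# sapply() tries to simplify the result (here: to a named vector):
res_s <- sapply(x, mean)

# vapply() also returns a vector, but we must declare the
# expected type and length of each result via FUN.VALUE:
res_v <- vapply(x, mean, FUN.VALUE = numeric(1))
```

All three calls compute the same means (2 for a, 5 for b), but only lapply() keeps them in a list.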
12.3.4 Using map() functions
The purrr package (Henry & Wickham, 2023) contains a family of map() functions that provide updated and more consistent versions of apply().
The main goal of using purrr functions (instead of for loops) is to break common list manipulation challenges into smaller and independent pieces. This strategy involves two steps, each of which scales down the problem:
- Solving the problem for a single element of a list. Once we have solved that problem, purrr takes care of generalizing the solution to every element in the list.
- Breaking a complex problem down into smaller sub-problems that allow us to advance towards a solution.
With purrr, we get many small pieces that we can compose together with the pipe (%>%).
This scaling-down strategy makes it easier to solve new problems and to understand our solutions to old problems when we re-read older code.
Essential map() functions
The pattern of looping over a vector, doing something to each element, and saving the results is so common that the purrr package provides a family of functions for it. There is a separate function for each type of output:
- map() creates a list.
- map_lgl() creates a logical vector.
- map_int() creates an integer vector.
- map_dbl() creates a double vector.
- map_chr() creates a character vector.
Each function takes a vector .x as input, applies a function .f to each element, and then returns a new vector that has the same length (and the same names) as the input vector. The type of the output vector is determined by the suffix _? of the map() function (e.g., map_chr() returns the output as a character vector).
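The following sketch contrasts the output types of these variants. It assumes that the purrr package is installed; the list x and the result names are made up for illustration:

```r
library(purrr)

# A small named list (hypothetical example data):
x <- list(a = 1:3, b = letters[1:2])

out_list <- map(x, length)          # list with elements a and b
out_int  <- map_int(x, length)      # integer vector: a = 3, b = 2
out_lgl  <- map_lgl(x, is.numeric)  # logical vector: a = TRUE, b = FALSE
out_chr  <- map_chr(x, class)       # character vector: "integer", "character"
```

Each variant throws an error if .f returns something that does not match the promised type, which makes the typed variants safer than a plain map() followed by manual conversion.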
Examples
The most common use of map() applies a function to each column/variable of a data frame.
For instance, we can use map_dbl() on the data frame df (from above) to compute various statistical measures (with numeric doubles as output):
map_dbl(.x = df, .f = mean) # mean of all 4 columns/variables
#> a b c d
#> 0.1322028 0.2488450 -0.1336732 0.1207302
map_dbl(df, median) # median ~
#> a b c d
#> 0.256575548 0.491872279 0.009218122 -0.056559219
map_dbl(df, sd) # standard deviation
#> a b c d
#> 0.7805860 1.0695148 0.9556076 0.8085646
The main varieties of the map() function accommodate different numbers of arguments:
- map() applies a function .f to 1 argument .x
- map2() applies a function .f to 2 arguments .x and .y
- pmap() applies a function .f to a list .l of arguments (typically 3 or more)
Again, typical uses of these functions specify the expected data type of the output as a suffix after an underscore _ (e.g., map2_dbl() or pmap_dbl()).
Here is a numeric example that uses three different map() functions:
# Functions:
square <- function(x){ x^2 }
expone <- function(x, y){ x^y }
# Data:
tb <- tibble(n_1 = sample(1:9, 100, replace = TRUE),
n_2 = sample(1:3, 100, replace = TRUE))
# map functions to every row of tb (using the pipe):
tb %>%
mutate(sqr = purrr::map_dbl(.x = n_1, .f = square), # 1 argument (column n_1 of the piped data)
exp = purrr::map2_dbl(n_1, n_2, expone), # 2 arguments
sum = purrr::pmap_dbl(list(n_1, n_2, sqr), sum) # 3+ arguments
)
#> # A tibble: 100 × 5
#> n_1 n_2 sqr exp sum
#> <int> <int> <dbl> <dbl> <dbl>
#> 1 6 1 36 6 43
#> 2 8 1 64 8 73
#> 3 7 1 49 7 57
#> 4 1 1 1 1 3
#> 5 4 1 16 4 21
#> 6 8 1 64 8 73
#> 7 9 2 81 81 92
#> 8 9 2 81 81 92
#> 9 7 2 49 49 58
#> 10 4 1 16 4 21
#> # … with 90 more rows
The ability to apply map() functions (with a flexible number of arguments) to data structures within pipes provides a very convenient and powerful programming tool.
Details
It is worth pointing out some details regarding the uses of the map variants:
- Without the suffix _dbl, map(df, sd) returns a list (with an element for every column of df).
- With the map functions, our focus is on the function/operation, not the bookkeeping of the for loop. This is even more obvious when using the pipe:
df %>% map_dbl(mean)
#> a b c d
#> 0.1322028 0.2488450 -0.1336732 0.1207302
df %>% map_dbl(median)
#> a b c d
#> 0.256575548 0.491872279 0.009218122 -0.056559219
df %>% map_dbl(sd)
#> a b c d
#> 0.7805860 1.0695148 0.9556076 0.8085646
- map() functions also use the generic ... argument to allow passing additional arguments to the function indicated by .f:
map_dbl(df, mean, trim = 0.5)
#> a b c d
#> 0.256575548 0.491872279 0.009218122 -0.056559219
map_dbl(df, sd, na.rm = FALSE)
#> a b c d
#> 0.7805860 1.0695148 0.9556076 0.8085646
- map() functions preserve the names of .x:
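A minimal sketch of this behavior, assuming that purrr is installed (the list z and the result name are made up for illustration):

```r
library(purrr)

# A named list (hypothetical example data):
z <- list(x = 1:3, y = 4:5)

# The names x and y of the input reappear in the output:
n_elements <- map_int(z, length)   # x = 3, y = 2
```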
Shortcuts
There are a few shortcuts to save typing in the .f argument of the map() family of functions.
Imagine we want to fit a linear model to each group in a dataset:
The following example splits up the mtcars dataset into three pieces (by the value of the cyl variable) and fits the same linear model to each piece.
The linear model is supplied as an anonymous function:
mtcars %>%
group_by(cyl) %>%
count()
#> # A tibble: 3 × 2
#> # Groups: cyl [3]
#> cyl n
#> <dbl> <int>
#> 1 4 11
#> 2 6 7
#> 3 8 14
models <- mtcars %>%
split(.$cyl) %>% # split data into 3 sets (by cyl)
map(function(df) lm(mpg ~ wt, data = df)) # map lm() function to each set
models # 3 linear models
#> $`4`
#>
#> Call:
#> lm(formula = mpg ~ wt, data = df)
#>
#> Coefficients:
#> (Intercept) wt
#> 39.571 -5.647
#>
#>
#> $`6`
#>
#> Call:
#> lm(formula = mpg ~ wt, data = df)
#>
#> Coefficients:
#> (Intercept) wt
#> 28.41 -2.78
#>
#>
#> $`8`
#>
#> Call:
#> lm(formula = mpg ~ wt, data = df)
#>
#> Coefficients:
#> (Intercept) wt
#> 23.868 -2.192
As the syntax for creating an anonymous function in R is quite long and complicated, purrr provides a one-sided formula as a shortcut:
models <- mtcars %>%
split(.$cyl) %>%
map(~lm(mpg ~ wt, data = .))
models # 3 linear models
#> $`4`
#>
#> Call:
#> lm(formula = mpg ~ wt, data = .)
#>
#> Coefficients:
#> (Intercept) wt
#> 39.571 -5.647
#>
#>
#> $`6`
#>
#> Call:
#> lm(formula = mpg ~ wt, data = .)
#>
#> Coefficients:
#> (Intercept) wt
#> 28.41 -2.78
#>
#>
#> $`8`
#>
#> Call:
#> lm(formula = mpg ~ wt, data = .)
#>
#> Coefficients:
#> (Intercept) wt
#> 23.868 -2.192
Here, the symbol . is used as a pronoun that refers to the current list element.
When inspecting many models, we may want to extract a summary statistic, like \(R^{2}\).
To do that we need to first run summary() on our models and then extract the component called r.squared.
We could do this using the shorthand for anonymous functions:
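A sketch of this formula-based approach could look as follows. It assumes that purrr is installed and rebuilds the models without the pipe, so that it runs on its own; the intermediate names pieces, summaries, and r2 are made up for illustration:

```r
library(purrr)

# Rebuild the three models (one per value of cyl):
pieces <- split(mtcars, mtcars$cyl)
models <- map(pieces, ~ lm(mpg ~ wt, data = .))

# Step 1: run summary() on every model.
summaries <- map(models, summary)

# Step 2: extract r.squared with the formula shorthand,
# where . refers to the current model summary:
r2 <- map_dbl(summaries, ~ .$r.squared)
```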
But extracting named components from a list of elements is a common operation, so purrr provides an even shorter shortcut: We can use a string:
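A sketch of the string shortcut, under the same assumptions (purrr installed, intermediate names made up for illustration):

```r
library(purrr)

# Model summaries for the three groups of mtcars (by cyl):
pieces    <- split(mtcars, mtcars$cyl)
summaries <- map(pieces, ~ summary(lm(mpg ~ wt, data = .)))

# Passing a string as .f extracts the component of that name
# from every element of the list:
r2_by_name <- map_dbl(summaries, "r.squared")
```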
Alternatively, we can also use an integer value to select elements by their position:
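A sketch of the position shortcut, assuming that purrr is installed (the nested list x is made-up example data):

```r
library(purrr)

# A list of three sub-lists (hypothetical example data):
x <- list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9))

# Passing an integer as .f extracts the element at that
# position from every sub-list:
second <- map_dbl(x, 2)   # 2 5 8
```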
12.3.5 Comparing map() vs. apply()
The map() functions of purrr are modeled on the apply() functions of base R:
- lapply() is basically identical to map(), except that the map() functions are more consistent with the other functions in purrr, and we can use some shortcuts for .f.
- sapply() is a wrapper around lapply() that automatically simplifies the output (which can yield unexpected results).
- vapply() is a safe alternative to sapply() because you supply an additional argument that defines the type. The only problem with vapply() is that it requires a lot of typing: vapply(df, is.numeric, logical(1)) is equivalent to map_lgl(df, is.numeric).
An advantage of vapply() over purrr’s map() functions is that it can also produce matrices, whereas the map() functions only ever produce vectors.
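The matrix case can be sketched in base R. The data frame d and the result name are assumptions made up for this example:

```r
# A small data frame (hypothetical example data):
d <- data.frame(a = c(1, 5, 3), b = c(10, 20, 40))

# range() returns two values (minimum and maximum) per column.
# Declaring FUN.VALUE = numeric(2) lets vapply() bind these
# results into a matrix with one column per variable:
ranges <- vapply(d, range, FUN.VALUE = numeric(2))
```

Here, ranges is a 2 x 2 matrix whose first row holds the minima (1 and 10) and whose second row holds the maxima (5 and 40).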
12.3.6 Advanced aspects of purrr
See the following sections of r4ds (Wickham & Grolemund, 2017) for more advanced issues in combination with the map functions of purrr:
- 21.6: Dealing with failure for using the adverbs safely, possibly, and quietly in combination with map() functions.
- 21.7: Mapping over multiple arguments for the map2() and pmap() variants (for supplying multiple arguments to a function), as well as invoke_map() (for supplying multiple functions).
- 21.8: Walk provides alternatives to map when calling functions for their side effects, like plotting visualizations or writing to files. (See walk(), walk2() and pwalk() for details.)
Rather than digging deeper into the functional programming paradigm endorsed by purrr, the exercises of this chapter (in Section 12.5) will focus on good old-fashioned for loops and only cover basic aspects of apply() and map().