9 Split, Apply, Combine

In this chapter, as we transition into the practical side of R usage, we discuss a tremendously useful strategy in data analysis, split-apply-combine, through implementing the apply family of functions and purrr’s map functions. Both speak to the idea of split-apply-combine that we will frequently encounter in various scenarios of problem solving.

Below we first start with the apply family based on what we have learned from using and writing R functions. Apply functions are vectorized functions that minimize the need to write loops explicitly, and they allow us to apply a function to data structures in a more elegant and efficient manner.

While the apply family is powerful, it can be considered legacy functionality. Therefore, after meeting the apply family, we then introduce the modern solutions to the types of problems previously approached by the apply family: a group of map functions provided by the purrr package.

9.1 The split-apply-combine strategy

The split-apply-combine strategy, at its core, involves breaking up a big problem into manageable pieces, operating on each piece independently, and then putting all the pieces back together.

We may apply this strategy in a couple of applications, including performing group-wise operations in data manipulations, creating summaries for each group, sequence generation for specified number of times, and model fitting to each panel of panel data.

When utilizing the split-apply-combine strategy, it is recommended that we start with thinking about the input data structure and the desired output data structure of our task, and then identify the function that meets our goal.

9.2 Apply family of functions

Let’s start with the apply family:

apply applies a function to arrays and matrices
tapply applies a function over subsets of a vector
lapply applies a function to lists
sapply simplifies list apply
vapply is the list apply that returns a vector
mapply is the multiple argument list apply
rapply recursively applies a function to a list
eapply applies a function to each entry in an environment

They are quite versatile, considering that

they can operate on different data structures (matrix, list, data frame etc.);
the function to be applied can be passed to specific parts of the data (rows, columns, groups, rows and columns etc.) to handle a task;
and they can return desired format of the output that is not necessarily the same as the input format.

This chapter discusses the following functions: apply; tapply and its cousin by; and lapply and its variants sapply and mapply. The split-apply-combine approach embedded in these base functions are also utilized by functions from add-on packages such as purrr, albeit with greater efficiency.

For illustrative purposes, we’ll use several pre-installed datasets in the base package datasets: USPersonalExpenditure, UCBAdmissions, ToothGrowth, and state.x77. These datasets are directly available to us.

9.3 `apply()`

apply(X, MARGIN, FUN, ...) takes an input object X, and applies a function to margins of the array.

X can be an array, matrix, or a data frame. If the input X is a data frame, R will convert it into a matrix. When the data frame has columns of different types, apply() will convert the columns to one type; and the data frame becomes a matrix. If keeping the different data types of the columns is important, we should then use lapply(), which we discuss later.

MARGIN is a vector indicating the subscripts which the function will be applied over. For instance, when X is a matrix, 1 indicates rows, 2 indicates columns, and c(1, 2) indicates both rows and columns. When MARGIN = 1, apply() will call the function FUN once for each row. Besides, MARGIN can also be a character vector that selects dimension names if X has named dimnames, such as row names or column names.

FUN is the function to be applied.

apply() returns a vector, array, or list of values.

Example 1: calculating summary statistics

To show how apply() works, here we’ll use a pre-installed dataset USPersonalExpenditure, which is a matrix. Use ?USPersonalExpenditure to find the description of the dataset.

class(USPersonalExpenditure)

## [1] "matrix" "array"

USPersonalExpenditure consists of United States personal expenditures (in billions of dollars) for the categories food and tobacco, household operation, medical and health, personal care, and private education for the years 1940, 1945, 1950, 1955 and 1960.

USPersonalExpenditure

##                       1940   1945  1950 1955  1960
## Food and Tobacco    22.200 44.500 59.60 73.2 86.80
## Household Operation 10.500 15.500 29.00 36.5 46.20
## Medical and Health   3.530  5.760  9.71 14.0 21.10
## Personal Care        1.040  1.980  2.45  3.4  5.40
## Private Education    0.341  0.974  1.80  2.6  3.64

To calculate the sum of personal expenditure across the years for each category, we use apply() to apply the function sum() to each row. The output is a vector.

apply(USPersonalExpenditure, 1, sum)

##    Food and Tobacco Household Operation  Medical and Health       Personal Care   Private Education 
##             286.300             137.700              54.100              14.270               9.355

Note that apply() uses the rownames from the matrix to identify the elements of the resulting vector or matrix. That’s why we are seeing the food categories in the outputs.

rownames(USPersonalExpenditure)

## [1] "Food and Tobacco"    "Household Operation" "Medical and Health"  "Personal Care"       "Private Education"

The code below finds the maximum personal expenditure across the categories for each year. It applies the function max() to each column and returns a vector.

apply(USPersonalExpenditure, 2, max)

## 1940 1945 1950 1955 1960 
## 22.2 44.5 59.6 73.2 86.8

If we apply range() to the columns of USPersonalExpenditure, we get a matrix in return. range() returns a vector of two elements, the minimum and the maximum.

apply(USPersonalExpenditure, 2, range)

##        1940   1945 1950 1955  1960
## [1,]  0.341  0.974  1.8  2.6  3.64
## [2,] 22.200 44.500 59.6 73.2 86.80

Example 2: Simpson’s Paradox (I)

Let’s use the famous 1973 UC Berkeley admissions data UCBAdmissions to explore some interesting questions. This is a subset of the complete data examined in a study published on Science in 1975.

In the fall of 1973, the University of California, Berkeley’s graduate division admitted about 44% of male applicants and 35% of female applicants. The school officials were worried about the difference, or bias, in the admission rates between male and female applicants. They asked a statistician to analyze the data, who was one of the authors of the Science paper.

The UCBAdmissions dataset provides aggregate data on applicants to graduate school at Berkeley for the six largest departments in 1973, classified by admission and gender. The description of the dataset can be found in its help file.

UCBAdmissions

## , , Dept = A
## 
##           Gender
## Admit      Male Female
##   Admitted  512     89
##   Rejected  313     19
## 
## , , Dept = B
## 
##           Gender
## Admit      Male Female
##   Admitted  353     17
##   Rejected  207      8
## 
## , , Dept = C
## 
##           Gender
## Admit      Male Female
##   Admitted  120    202
##   Rejected  205    391
## 
## , , Dept = D
## 
##           Gender
## Admit      Male Female
##   Admitted  138    131
##   Rejected  279    244
## 
## , , Dept = E
## 
##           Gender
## Admit      Male Female
##   Admitted   53     94
##   Rejected  138    299
## 
## , , Dept = F
## 
##           Gender
## Admit      Male Female
##   Admitted   22     24
##   Rejected  351    317

UCBAdmissions is a 3-dimensional array resulting from cross-tabulating 4526 observations on 3 variables.

Dimension	Name	Levels
1	Admit	Admitted, Rejected
2	Gender	Male, Female
3	Dept	A, B, C, D, E, F

Now let’s go back to what worried Berkeley’s officials: the overall acceptance rates for female and male applicants. The formula of the overall acceptance rates is simply the number of admitted female and male applicants divided by the total number of male and female applicants.

How do we get the numbers from the array?

First, we calculate the total number for both male and female applicants. We pass the function sum() to FUN and sum up the values in each level of dimension Gender.

applicants <- apply(UCBAdmissions, "Gender", sum)
applicants

##   Male Female 
##   2691   1835

As we see here, when the input X has named dimnames, MARGIN can be a character vector selecting dimension names.

The same result can be achieved by replacing the dimension name with its number.

apply(UCBAdmissions, 2, sum)

##   Male Female 
##   2691   1835

Next, we calculate the number of female and male applicants in the admitted group. Our approach here is to get the numbers for both admitted and rejected applicants and extract the admitted applicants from the result.

We need to find each combination by Admit and Gender to apply the function sum(). Therefore, MARGIN is c("Admit","Gender"). Then we sum up their values across the departments.

apply(UCBAdmissions, c("Admit","Gender"), sum)

##           Gender
## Admit      Male Female
##   Admitted 1198    557
##   Rejected 1493   1278

The output is a matrix.

class(apply(UCBAdmissions, c("Admit","Gender"), sum))

## [1] "matrix" "array"

Then we extract the “Admitted” applicants from the output matrix.

apply(UCBAdmissions, c("Admit","Gender"), sum)["Admitted",]

##   Male Female 
##   1198    557

# apply(UCBAdmissions["Admitted",,], "Gender", sum)

Summarizing the two steps above, we have the number of admitted applicants in both genders.

admitted <- apply(UCBAdmissions, c("Admit","Gender"), sum)["Admitted",]
admitted

##   Male Female 
##   1198    557

Now we are ready to calculate the acceptance rates.

round(admitted / applicants * 100, 2)

##   Male Female 
##  44.52  30.35

It seems that the acceptance rate in the six departments for female applicants is much lower than the acceptance rate for the male applicants.

However, when the statisticians examined the data, they discovered that within specific departments, this bias against women went away. The acceptance rate for female applicants was higher than the acceptance rate for male applicants in several cases.

Let’s get the number of applicants for both genders in each department.

applicants2 <- apply(UCBAdmissions, c("Gender","Dept"), sum)
applicants2

##         Dept
## Gender     A   B   C   D   E   F
##   Male   825 560 325 417 191 373
##   Female 108  25 593 375 393 341

And the number of admitted applicants as well.

admitted2 <- UCBAdmissions["Admitted",,]
admitted2

##         Dept
## Gender     A   B   C   D   E   F
##   Male   512 353 120 138  53  22
##   Female  89  17 202 131  94  24

The acceptance rates are:

round(admitted2/applicants2*100, 2)

##         Dept
## Gender       A     B     C     D     E     F
##   Male   62.06 63.04 36.92 33.09 27.75  5.90
##   Female 82.41 68.00 34.06 34.93 23.92  7.04

But why? Because more women had applied to departments that admitted a small percentage of applicants, such as English, than to departments that admitted a large percentage of applicants, such as mechanical engineering¹³.

This phenomenon is called the Simpson’s Paradox. It occurs in probability and statistics in which a trend appears in several groups of data but disappears or reverses when the groups are combined.

Example 3: Simpson’s Paradox (II)

Another example of the Simpson’s Paradox is the survival rates of the third class passengers and crew members on Titanic. The data is available in R in Titanic. It is a 4-dimensional array resulting from cross-tabulating 2201 observations on 4 variables.

Dimension	Name	Levels
1	Class	1st, 2nd, 3rd, Crew
2	Sex	Male, Female
3	Age	Child, Adult
4	Survived	No, Yes

If we compare the survival rates for adults, we’ll find that the numbers for third class passengers and crew members in the adults are close.

round(apply(Titanic, c(1,3,4), sum)[,"Adult","Yes"] /
        apply(Titanic, c(1,3), sum)[,"Adult"] * 100, 2)

##   1st   2nd   3rd  Crew 
## 61.76 36.02 24.08 23.95

The survival rate is 24.08% for the third class passengers and 23.95% for the crew members.

However, if we further break the data down by gender, the survival rates are higher for crew members compared to the third class passengers for both men and women.

round(apply(Titanic, c(1,2,3,4), sum)[,,"Adult","Yes"] / 
        apply(Titanic, c(1,2,3), sum)[,,"Adult"] * 100, 2)

##       Sex
## Class   Male Female
##   1st  32.57  97.22
##   2nd   8.33  86.02
##   3rd  16.23  46.06
##   Crew 22.27  86.96

Why do you think that happened?

`FUN`

FUN can be a named function or an anonymous function. We get an anonymous function if we choose not to give the function a name. This is useful when it’s not worth the effort to figure out a name for the function.

For example, the one-liner function below is an anonymous function. There is no need to name it if we are going to use it only once inside the apply() function.

apply(USPersonalExpenditure, 2, function(x) sum(x) * 10)

##    1940    1945    1950    1955    1960 
##  376.11  687.14 1025.60 1297.00 1631.40

FUN can take optional arguments ... .

For instance, na.rm is the second argument to mean(), although in this case we don’t have NAs to worry about.

apply(USPersonalExpenditure, 1, mean, na.rm = TRUE)

Every time that apply() calls mean(), the first argument will be a row of USPersonalExpenditure and the second argument will be na.rm = TRUE. The function call will be mean(row, na.rm = TRUE).

9.4 `tapply()`, `by()`

The second member of the apply family functions is tapply() and its cousin by.

tapply() and by() apply a function to groups of values.

`tapply()`

tapply(X, INDEX, FUN, ...) applies a function to each (non-empty) group of values given by a unique combination of the levels of certain factors.

X is an input vector. INDEX is a factor that defines the groups. The factor level identifies the group of each vector element in X. The vector and the factor are of the same length.

Vector	Factor
9	A
25	B
32	C
14	B
2	C
100	A

Example 4: aggregating data by group

We’ll use the dataset ToothGrowth to show how tapply() aggregates data by group.

ToothGrowth recorded the effect of Vitamin C on tooth growth in Guinea pigs.

head(ToothGrowth)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

dose is dose and a numeric vector. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 milligrams/day) by one of two delivery methods. len is tooth length and a numeric vector.

supp is supplement type and a factor. The two supplement types are orange juice OJ or vitamin C VC. The factor level in supp identifies the group of each element in len and dose.

To get the mean of length by supplement types, we use tapply() to apply mean() to each group of supp. The result is a vector.

tapply(ToothGrowth$len, ToothGrowth$supp, mean)

tapply() can also manage multiple categories. This is handled by the argument INDEX, a list of one or more factors, each of same length as X. The elements are coerced to factors.

tapply(ToothGrowth$len, list(ToothGrowth$supp, ToothGrowth$dose), mean)

##      0.5     1     2
## OJ 13.23 22.70 26.06
## VC  7.98 16.77 26.14

The result is a matrix.

`by()`

by(data, INDICES, FUN, ...) applies a function to a data frame, split by factors. by() is a wrapper for tapply() applied to data frames. The function returns a list.

The argument data normally is a data frame, but can possibly be a matrix. INDICES is a factor or a list of factors, each of length nrow(data).

by() calls the function FUN for each group within a data frame.

The example below summarizes the data by supp, and organizes the summary statistics in a reader friendly manner.

by(ToothGrowth, ToothGrowth$supp, summary)

## ToothGrowth$supp: OJ
##       len        supp         dose      
##  Min.   : 8.20   OJ:30   Min.   :0.500  
##  1st Qu.:15.53   VC: 0   1st Qu.:0.500  
##  Median :22.70           Median :1.000  
##  Mean   :20.66           Mean   :1.167  
##  3rd Qu.:25.73           3rd Qu.:2.000  
##  Max.   :30.90           Max.   :2.000  
## ------------------------------------------------------------------------------------- 
## ToothGrowth$supp: VC
##       len        supp         dose      
##  Min.   : 4.20   OJ: 0   Min.   :0.500  
##  1st Qu.:11.20   VC:30   1st Qu.:0.500  
##  Median :16.50           Median :1.000  
##  Mean   :16.96           Mean   :1.167  
##  3rd Qu.:23.10           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

Same as tapply(), by() can handle multiple groups.

by(ToothGrowth$len, list(ToothGrowth$supp, ToothGrowth$dose), summary)

## : OJ
## : 0.5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.20    9.70   12.25   13.23   16.18   21.50 
## ------------------------------------------------------------------------------------- 
## : VC
## : 0.5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.20    5.95    7.15    7.98   10.90   11.50 
## ------------------------------------------------------------------------------------- 
## : OJ
## : 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.50   20.30   23.45   22.70   25.65   27.30 
## ------------------------------------------------------------------------------------- 
## : VC
## : 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.60   15.28   16.50   16.77   17.30   22.50 
## ------------------------------------------------------------------------------------- 
## : OJ
## : 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22.40   24.57   25.95   26.06   27.07   30.90 
## ------------------------------------------------------------------------------------- 
## : VC
## : 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.50   23.38   25.95   26.14   28.80   33.90

9.5 `lapply()`, `sapply()`, `mapply()`

lapply(), sapply() and mapply() apply a function over a list or vector.

lapply() returns a list.

lapply(ToothGrowth, is.numeric)

## $len
## [1] TRUE
## 
## $supp
## [1] FALSE
## 
## $dose
## [1] TRUE

sapply() is a user-friendly version of lapply(), which returns a vector or a matrix.

sapply(ToothGrowth, is.numeric)

##   len  supp  dose 
##  TRUE FALSE  TRUE

mapply() is a multivariate version of sapply().

`lapply()`

lapply(X, FUN, ...) applies a function over a list or vector. lapply() returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.

For instance, lapply(mylist, mean) iterates over the list mylist to get the mean of three vectors. The result is a list.

mylist <- list(1:10, 10:1, -5:5)
lapply(mylist, mean)

## [[1]]
## [1] 5.5
## 
## [[2]]
## [1] 5.5
## 
## [[3]]
## [1] 0

Example 5: working with lists

Now let’s see a more meaningful example using the dataset state.x77. state.x77 is a matrix with 50 rows and 8 columns that gives the statistics of the 50 states of the United States. These include population, income, life expectancy, murder rate, percent of high-school graduates, and a few others.

head(state.x77, 3)

##         Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
## Alabama       3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska         365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona       2212   4530        1.8    70.55    7.8    58.1    15 113417

We transformed and reorganized the dataset into a list of data frames ¹⁴ for the purpose of illustration.

state.df <- data.frame(state.x77)
state.list <- setNames(split(state.df, seq(nrow(state.df))), rownames(state.df))
head(state.list, 3)

## $Alabama
##         Population Income Illiteracy Life.Exp Murder HS.Grad Frost  Area
## Alabama       3615   3624        2.1    69.05   15.1    41.3    20 50708
## 
## $Alaska
##        Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## Alaska        365   6315        1.5    69.31   11.3    66.7   152 566432
## 
## $Arizona
##         Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## Arizona       2212   4530        1.8    70.55    7.8    58.1    15 113417

Let’s use the list to answer a few questions. First, what is the average population across the states?

pop <- lapply(state.list, "[[", "Population") 
mean(unlist(pop))

## [1] 4246.42

Note that in the case of functions like +, %*%, [[, etc., the function name must be backquoted or quoted.

pop <- lapply(state.list, "[[", "Population") 
pop <- lapply(state.list, `[[`, "Population")

Next, what is the average number of people per square miles in each state?

pop_den <- function(x) x[["Population"]]/x[["Area"]]
lapply(state.list, pop_den)

## $Alabama
## [1] 0.07129053
## 
## $Alaska
## [1] 0.0006443845
## 
## $Arizona
## [1] 0.01950325
## 
## $Arkansas
## [1] 0.04061989
## 
## $California
## [1] 0.1355709
## 
## $Colorado
## [1] 0.02448779
## 
## $Connecticut
## [1] 0.6375977
## 
## $Delaware
## [1] 0.2921292
## 
## $Florida
## [1] 0.1530227
## 
## $Georgia
## [1] 0.08491037
## 
## $Hawaii
## [1] 0.1350973
## 
## $Idaho
## [1] 0.009833448
## 
## $Illinois
## [1] 0.2008503
## 
## $Indiana
## [1] 0.1471867
## 
## $Iowa
## [1] 0.05114317
## 
## $Kansas
## [1] 0.02787729
## 
## $Kentucky
## [1] 0.08542245
## 
## $Louisiana
## [1] 0.08470955
## 
## $Maine
## [1] 0.03421734
## 
## $Maryland
## [1] 0.4167425
## 
## $Massachusetts
## [1] 0.7429083
## 
## $Michigan
## [1] 0.1603569
## 
## $Minnesota
## [1] 0.049452
## 
## $Mississippi
## [1] 0.04949679
## 
## $Missouri
## [1] 0.06909196
## 
## $Montana
## [1] 0.005124084
## 
## $Nebraska
## [1] 0.02018749
## 
## $Nevada
## [1] 0.005369054
## 
## $`New Hampshire`
## [1] 0.08995237
## 
## $`New Jersey`
## [1] 0.9750033
## 
## $`New Mexico`
## [1] 0.009422462
## 
## $`New York`
## [1] 0.3779139
## 
## $`North Carolina`
## [1] 0.1115005
## 
## $`North Dakota`
## [1] 0.009195502
## 
## $Ohio
## [1] 0.261989
## 
## $Oklahoma
## [1] 0.03947254
## 
## $Oregon
## [1] 0.02374615
## 
## $Pennsylvania
## [1] 0.2637548
## 
## $`Rhode Island`
## [1] 0.8875119
## 
## $`South Carolina`
## [1] 0.09316791
## 
## $`South Dakota`
## [1] 0.008965835
## 
## $Tennessee
## [1] 0.1009727
## 
## $Texas
## [1] 0.04668223
## 
## $Utah
## [1] 0.01465358
## 
## $Vermont
## [1] 0.05093342
## 
## $Virginia
## [1] 0.1252137
## 
## $Washington
## [1] 0.05346252
## 
## $`West Virginia`
## [1] 0.07474034
## 
## $Wisconsin
## [1] 0.08425749
## 
## $Wyoming
## [1] 0.003868193

Example 6: sequence generation

The input object of lapply() can also be a vector.

In the example below, n takes 5 and is passed to the function fun 3 times.

fun <- function(n){
  x <- rnorm(n) 
  y <- sign(mean(x))*rexp(n, rate = abs(1/mean(x)))
  list(X = x, Y = y)
}

lapply(rep(5, 3), fun)

## [[1]]
## [[1]]$X
## [1] -0.1494530  0.4332426 -0.6613785  1.4628078  1.0065455
## 
## [[1]]$Y
## [1] 0.02492883 0.29671244 1.36094897 0.08249475 0.36579978
## 
## 
## [[2]]
## [[2]]$X
## [1] -1.08800885 -0.08653661 -0.52323807 -1.36850595  0.08314541
## 
## [[2]]$Y
## [1] -0.004306292 -0.047003701 -0.048976420 -0.165990336 -0.540931057
## 
## 
## [[3]]
## [[3]]$X
## [1] -0.6191953  0.6530385 -0.6964274  0.3268801 -0.3784654
## 
## [[3]]$Y
## [1] -0.33223373 -0.18075812 -0.50677210 -0.18233192 -0.09451082

This usage of lapply() can be seen in sequence generations.

Example 7: getting data types of a data frame

Recall earlier we have mentioned that apply() works on the rows of a data frame when its columns are of the same type. If columns are not of the same type, this is when lapply() comes to rescue.

lapply() can be applied to a data frame, since data frame is a kind of list. The function passed to the argument FUN will be applied to each column of the data frame.

One application of this is to use lapply() to check the types of columns in a data frame. The output is a list.

lapply(ToothGrowth, class)

## $len
## [1] "numeric"
## 
## $supp
## [1] "factor"
## 
## $dose
## [1] "numeric"

If we want to turn the output to a vector, we can unlist() the output list.

unlist(lapply(ToothGrowth, class))

##       len      supp      dose 
## "numeric"  "factor" "numeric"

`sapply()`

sapply(X, FUN, ...) will try to simplify the result of lapply() if possible. The output of it can be a vector or a matrix.

If the result of lapply() is a list where every element is length 1, then using sapply() to run the same code will return a vector.

lapply(ToothGrowth, is.numeric)

## $len
## [1] TRUE
## 
## $supp
## [1] FALSE
## 
## $dose
## [1] TRUE

If we use sapply() instead, we will get the same output, but in a vector rather than a list.

sapply(ToothGrowth, is.numeric)

##   len  supp  dose 
##  TRUE FALSE  TRUE

If the result is a list where every element is a vector of the same length larger than 1, then using sapply() to run the same code will return a matrix.

This is an example we used earlier to explain lapply(). The output is a list with vectors longer than 1.

fun <- function(n){
  x <- rnorm(n) 
  y <- sign(mean(x))*rexp(n, rate = abs(1/mean(x)))
  list(X = x, Y = y)
}
lapply(rep(5, 3), fun)

## [[1]]
## [[1]]$X
## [1] -0.6612219 -1.4763768  1.9622459 -0.4203603 -0.4776670
## 
## [[1]]$Y
## [1] -0.043078436 -0.179093574 -0.423547688 -0.004014341 -0.136557351
## 
## 
## [[2]]
## [[2]]$X
## [1]  0.7185358 -1.3383289 -0.5433609 -1.3012167 -0.5769427
## 
## [[2]]$Y
## [1] -2.0314867 -0.8605874 -0.1730227 -0.9388725 -0.1447821
## 
## 
## [[3]]
## [[3]]$X
## [1]  0.9158401  1.1212610  0.4716513  0.4286128 -1.1353724
## 
## [[3]]$Y
## [1] 0.24942982 0.13744108 0.02898536 0.27205685 0.56304738

In such cases, using sapply() returns a matrix.

sapply(rep(5, 3), fun)

##   [,1]      [,2]      [,3]     
## X numeric,5 numeric,5 numeric,5
## Y numeric,5 numeric,5 numeric,5

`mapply()`

mapply(FUN, ...) is a multivariate version of sapply(). mapply() applies FUN to the first elements of each ... argument, the second elements, the third elements, and so on.

Note that the first argument of mapply() is a function, unlike lapply() and sapply().

mapply() applies the function element-wise to vectors or lists.

Let’s return to state.list and revisit the question on population density. With mapply(), our solution can be like something below. The output is a vector rather than a list.

pop <- lapply(state.list, "[[", "Population")
area <- lapply(state.list, "[[", "Area")
mapply("/", pop, area)

##        Alabama         Alaska        Arizona       Arkansas     California       Colorado    Connecticut 
##   0.0712905261   0.0006443845   0.0195032491   0.0406198864   0.1355708904   0.0244877898   0.6375976964 
##       Delaware        Florida        Georgia         Hawaii          Idaho       Illinois        Indiana 
##   0.2921291625   0.1530227399   0.0849103714   0.1350972763   0.0098334482   0.2008502547   0.1471867468 
##           Iowa         Kansas       Kentucky      Louisiana          Maine       Maryland  Massachusetts 
##   0.0511431687   0.0278772910   0.0854224464   0.0847095482   0.0342173351   0.4167424932   0.7429082545 
##       Michigan      Minnesota    Mississippi       Missouri        Montana       Nebraska         Nevada 
##   0.1603569354   0.0494520047   0.0494967862   0.0690919632   0.0051240839   0.0201874926   0.0053690542 
##  New Hampshire     New Jersey     New Mexico       New York North Carolina   North Dakota           Ohio 
##   0.0899523651   0.9750033240   0.0094224624   0.3779139052   0.1115004713   0.0091955019   0.2619890177 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina   South Dakota      Tennessee 
##   0.0394725364   0.0237461532   0.2637548370   0.8875119161   0.0931679074   0.0089658350   0.1009727062 
##          Texas           Utah        Vermont       Virginia     Washington  West Virginia      Wisconsin 
##   0.0466822312   0.0146535763   0.0509334197   0.1252136752   0.0534625207   0.0747403407   0.0842574912 
##        Wyoming 
##   0.0038681934

9.6 `purrr` map functions

The apply family bridges the gap between traditional loops and functional programming by allowing us to apply functions to vectors, arrays, and lists more elegantly and efficiently. However, for new R projects, it’s advisable to use the purrr functions for looping tasks wherever we can. purrr provides a consistent syntax and optimized functions compared to their base R counterparts.

Below we focus on the map family of functions, mainly map() and map2().

But note that there are many other useful purrr functions, such as pluck() and keep().

inputs and outputs

The map functions transform their input by applying a function to each element of a list or atomic vector and returning an object of the same length as the input.

Like the apply family, these functions work with particular input and output data structures. Learn more here.

Input	Output	`purrr` function(s)
1 vector	list	`map()`
1 vector	vector of desired type	`map_lgl()` (logical) `map_int()` (integer) `map_dbl()` (double) `map_chr()` (character)
2 vectors	list	`map2()`
2 vectors	vector of desired type	`map2_lgl()` (logical) `map2_int()` (integer) `map2_dbl()` (double) `map2_chr()` (character)

As shown in the table above, map() always returns a list. The returned vector can be defined by the suffix _lgl(), _int(), _dbl() and _chr() , which returns a logical, integer, double, or character vector respectively.

key arguments

map(.x, .f) functions take two key arguments .x and .f. .x is the input data structure. .f is a function, which can be a named function, an anonymous function, or a formula. The anonymous function can be written either using R’s anonymous function shorthand map(x, y, \(x, y) x/y) or map(x, y, function(x, y) x/y).

map2(.x, .y, .f) maps over two inputs. .x and .y are a pair of vectors, usually the same length.

`tidyverse` consistency

Since purrr is part of tidyverse¹⁵, we can join multiple steps together either using the magrittr pipe %>%, or the base pipe R |>.

It’s recommended that we use the base tools because they don’t require that we load magrittr and work everywhere, not just in purrr functions.

examples

Below we rewrite some examples using map functions for what we used apply family previously.

library(purrr)

map() applies a function to each element of a vector, a pattern we should be quite familiar with after working with the apply family.

Previously we used lapply() to evaluate whether each element of the data frame ToothGrowth is numeric or not.

lapply(ToothGrowth, is.numeric)

## $len
## [1] TRUE
## 
## $supp
## [1] FALSE
## 
## $dose
## [1] TRUE

We now use map() to repeat the task.

map(ToothGrowth, is.numeric)

## $len
## [1] TRUE
## 
## $supp
## [1] FALSE
## 
## $dose
## [1] TRUE

The results are the same.

all.equal(lapply(ToothGrowth, is.numeric), map(ToothGrowth, is.numeric))

## [1] TRUE

We can use map_lgl() to return a logical vector.

ToothGrowth |> map_lgl(is.numeric)

##   len  supp  dose 
##  TRUE FALSE  TRUE

In the example below, map_dbl() gets the mean of each vector in the list mylist and returns a double vector, similar to the example above.

mylist <- list(1:10, 10:1, -5:5)
sapply(mylist, mean)

## [1] 5.5 5.5 0.0

mylist |> map_dbl(mean)

## [1] 5.5 5.5 0.0

map2_int() performs operations between two vectors and returns a double vector. Below we use map() and map2() to replace lapply() and mapply() respectively.

pop <- lapply(state.list, "[[", "Population")
area <- lapply(state.list, "[[", "Area")
result1 <- mapply("/", pop, area)

pop2 <- state.list |> map_dbl("Population") 
area2 <- state.list |> map_dbl("Area") 
result2 <- map2_dbl(pop2, area2, \(pop2, area2) pop2/area2)
# map2_dbl(pop2, area2, function(pop2, area2) pop2/area2)

all.equal(result1, result2)

## [1] TRUE

In addition to extracting components deep in a data structure, map() can also be used for sequence generation.

fun <- function(n){
  x <- rnorm(n) 
  y <- sign(mean(x))*rexp(n, rate = abs(1/mean(x)))
  list(X = x, Y = y)
}

set.seed(2024)
result1 <- lapply(rep(5, 3), fun)

set.seed(2024)
result2 <- rep(5, 3) |> map(fun)

all.equal(result1, result2)

## [1] TRUE

The function fun is applied to each element of the vector rep(5, 3). So 5 is applied to the function 3 times.

As the authors summarized in the paper, “The graduate departments that are easier to enter tend to be those that require more mathematics in the undergraduate preparatory curriculum. The bias in the aggregated data stems not from any pattern of discrimination on the part of admissions committees, which seem quite fair on the whole, but apparently from prior screening at earlier levels of the educational system. Women are shunted by their socialization and education toward fields of graduate study that are generally more crowded, less productive of completed degrees, and less well funded, and that frequently offer poorer professional employment prospects.” (Bickel, Hammel, O’Connell, 1975, p. 402)↩︎
Following this post, the rows in the data frame state.df became elements of a new list.↩︎
tidyverse is introduced in the chapter R Documentation.↩︎