9 Split, Apply, Combine
In this chapter, as we transition into the practical side of R usage, we discuss a tremendously useful strategy in data analysis, split-apply-combine, by working with the apply family of functions and `purrr`’s map functions. Both implement the idea of split-apply-combine, which we will encounter frequently in problem solving.
We start with the apply family, building on what we have learned about using and writing R functions. Apply functions hide the loop: they call a function on each piece of a data structure for us, so we rarely need to write loops explicitly and our code becomes more compact and readable.
While the apply family is powerful, it can be considered legacy functionality. Therefore, after meeting the apply family, we introduce the modern solution to the same types of problems: the group of `map` functions provided by the `purrr` package.
9.1 The split-apply-combine strategy
The split-apply-combine strategy, at its core, involves breaking up a big problem into manageable pieces, operating on each piece independently, and then putting all the pieces back together.
This strategy applies to many tasks, including group-wise operations in data manipulation, per-group summaries, generating sequences a specified number of times, and fitting a model to each panel of panel data.
When using the split-apply-combine strategy, we should start by thinking about the input data structure and the desired output data structure of our task, and then identify the function that meets our goal.
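As a minimal sketch of the three steps, assuming the built-in `ToothGrowth` data used later in this chapter (the object names `pieces`, `means`, and `combined` are illustrative, not from the original text):

```r
# Split: break the tooth lengths into one piece per supplement type
pieces <- split(ToothGrowth$len, ToothGrowth$supp)

# Apply: compute the mean of each piece independently
means <- lapply(pieces, mean)

# Combine: put the per-group results back together in one named vector
combined <- unlist(means)
combined
```

The functions in the rest of the chapter perform these three steps in a single call.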
9.2 Apply family of functions
Let’s start with the apply family:
- `apply` applies a function to arrays and matrices
- `tapply` applies a function over subsets of a vector
- `lapply` applies a function to lists
- `sapply` simplifies list apply
- `vapply` is the list apply that returns a vector
- `mapply` is the multiple argument list apply
- `rapply` recursively applies a function to a list
- `eapply` applies a function to each entry in an environment
They are quite versatile, considering that
- they can operate on different data structures (matrix, list, data frame etc.);
- the function to be applied can be passed to specific parts of the data (rows, columns, groups, rows and columns etc.) to handle a task;
- and they can return desired format of the output that is not necessarily the same as the input format.
This chapter discusses the following functions: `apply()`; `tapply()` and its cousin `by()`; and `lapply()` and its variants `sapply()` and `mapply()`. The split-apply-combine approach embedded in these base functions is also used by functions from add-on packages such as `purrr`, which offer a more consistent interface.
For illustrative purposes, we’ll use several pre-installed datasets from the base package `datasets`: `USPersonalExpenditure`, `UCBAdmissions`, `ToothGrowth`, and `state.x77`. These datasets are directly available to us.
9.3 apply()
`apply(X, MARGIN, FUN, ...)` takes an input object `X` and applies the function `FUN` to the margins of the array.
`X` can be an array, a matrix, or a data frame. If `X` is a data frame, `apply()` first converts it into a matrix, so columns of different types are coerced to a single common type. If keeping the different column types is important, we should use `lapply()` instead, which we discuss later.
`MARGIN` is a vector giving the subscripts that the function is applied over. For instance, when `X` is a matrix, 1 indicates rows, 2 indicates columns, and `c(1, 2)` indicates both rows and columns. When `MARGIN = 1`, `apply()` calls the function `FUN` once for each row. `MARGIN` can also be a character vector that selects dimensions by name if `X` has named `dimnames`, that is, if the dimensions themselves carry names.
`FUN` is the function to be applied.
`apply()` returns a vector, array, or list of values.
Example 1: calculating summary statistics
To show how `apply()` works, we’ll use the pre-installed dataset `USPersonalExpenditure`, which is a matrix. Use `?USPersonalExpenditure` to read the description of the dataset.
## [1] "matrix" "array"
`USPersonalExpenditure` consists of United States personal expenditures (in billions of dollars) for the categories food and tobacco, household operation, medical and health, personal care, and private education for the years 1940, 1945, 1950, 1955, and 1960.
## 1940 1945 1950 1955 1960
## Food and Tobacco 22.200 44.500 59.60 73.2 86.80
## Household Operation 10.500 15.500 29.00 36.5 46.20
## Medical and Health 3.530 5.760 9.71 14.0 21.10
## Personal Care 1.040 1.980 2.45 3.4 5.40
## Private Education 0.341 0.974 1.80 2.6 3.64
To calculate the total personal expenditure across the years for each category, we use `apply()` to apply the function `sum()` to each row. The output is a vector.
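The call itself is not shown in the text. A sketch consistent with the description, with `MARGIN = 1` selecting rows (the name `row_totals` is illustrative), is:

```r
# MARGIN = 1: call sum() once per row, i.e. once per expenditure category
row_totals <- apply(USPersonalExpenditure, 1, sum)
row_totals
```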
## Food and Tobacco Household Operation Medical and Health Personal Care Private Education
## 286.300 137.700 54.100 14.270 9.355
Note that `apply()` uses the `rownames` of the matrix to label the elements of the resulting vector or matrix. That’s why we see the expenditure categories in the output.
## [1] "Food and Tobacco" "Household Operation" "Medical and Health" "Personal Care" "Private Education"
The code below finds the maximum personal expenditure across the categories for each year. It applies the function `max()` to each column and returns a vector.
## 1940 1945 1950 1955 1960
## 22.2 44.5 59.6 73.2 86.8
If we apply `range()` to the columns of `USPersonalExpenditure`, we get a matrix in return, because `range()` returns a vector of two elements, the minimum and the maximum, for each column.
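The two column-wise calls described above can be sketched as follows (the names `col_max` and `col_range` are illustrative). The shape of the result depends on how many values `FUN` returns per column:

```r
# MARGIN = 2: one call per column (year); max() returns one value each,
# so the result is a vector of length 5
col_max <- apply(USPersonalExpenditure, 2, max)

# range() returns two values per column, so the results are bound
# together into a 2 x 5 matrix (row 1 = minimum, row 2 = maximum)
col_range <- apply(USPersonalExpenditure, 2, range)
dim(col_range)
```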
## 1940 1945 1950 1955 1960
## [1,] 0.341 0.974 1.8 2.6 3.64
## [2,] 22.200 44.500 59.6 73.2 86.80
Example 2: Simpson’s Paradox (I)
Let’s use the famous 1973 UC Berkeley admissions data `UCBAdmissions` to explore some interesting questions. This is a subset of the complete data examined in a study published in Science in 1975.
In the fall of 1973, the graduate division of the University of California, Berkeley admitted about 44% of male applicants and 35% of female applicants. School officials were worried that the difference in admission rates between male and female applicants reflected bias. They asked a statistician, who became one of the authors of the Science paper, to analyze the data.
The `UCBAdmissions` dataset provides aggregate data on applicants to graduate school at Berkeley for the six largest departments in 1973, classified by admission and gender. The description of the dataset can be found in its help file.
## , , Dept = A
##
## Gender
## Admit Male Female
## Admitted 512 89
## Rejected 313 19
##
## , , Dept = B
##
## Gender
## Admit Male Female
## Admitted 353 17
## Rejected 207 8
##
## , , Dept = C
##
## Gender
## Admit Male Female
## Admitted 120 202
## Rejected 205 391
##
## , , Dept = D
##
## Gender
## Admit Male Female
## Admitted 138 131
## Rejected 279 244
##
## , , Dept = E
##
## Gender
## Admit Male Female
## Admitted 53 94
## Rejected 138 299
##
## , , Dept = F
##
## Gender
## Admit Male Female
## Admitted 22 24
## Rejected 351 317
`UCBAdmissions` is a 3-dimensional array resulting from cross-tabulating 4526 observations on 3 variables.
Dimension | Name | Levels |
---|---|---|
1 | Admit | Admitted, Rejected |
2 | Gender | Male, Female |
3 | Dept | A, B, C, D, E, F |
Now let’s go back to what worried Berkeley’s officials: the overall acceptance rates for female and male applicants. The overall acceptance rate for each gender is simply the number of admitted applicants of that gender divided by the total number of applicants of that gender.
How do we get the numbers from the array?
First, we calculate the total number of male and female applicants. We pass the function `sum()` to `FUN` and sum up the values within each level of the dimension `Gender`.
## Male Female
## 2691 1835
As we see here, when the input `X` has named `dimnames`, `MARGIN` can be a character vector selecting dimensions by name.
The same result can be achieved by replacing the dimension name with its number.
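The two equivalent calls described above are not shown; a sketch that matches the printed totals (object names are illustrative) is:

```r
# Select the margin by name: sum every cell that shares a level of Gender
applicants <- apply(UCBAdmissions, "Gender", sum)

# Select the same margin by position: Gender is the second dimension
applicants_by_pos <- apply(UCBAdmissions, 2, sum)

applicants
```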
## Male Female
## 2691 1835
Next, we calculate the number of female and male applicants in the admitted group. Our approach is to get the counts for both admitted and rejected applicants and then extract the admitted applicants from the result.
We need to apply the function `sum()` to each combination of `Admit` and `Gender`, so `MARGIN` is `c("Admit", "Gender")`. This sums the counts across the departments.
## Gender
## Admit Male Female
## Admitted 1198 557
## Rejected 1493 1278
The output is a matrix.
## [1] "matrix" "array"
Then we extract the “Admitted” applicants from the output matrix.
## Male Female
## 1198 557
Summarizing the two steps above, we have the number of admitted applicants for both genders.
## Male Female
## 1198 557
Now we are ready to calculate the acceptance rates.
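Putting the pieces together, the calculation might look like the sketch below. The intermediate names (`by_admit_gender`, `admitted`, `applicants`, `rates`) are illustrative; the original code is not shown.

```r
# Counts for each Admit x Gender combination, summed over departments
by_admit_gender <- apply(UCBAdmissions, c("Admit", "Gender"), sum)

# Keep only the "Admitted" row of the resulting matrix
admitted <- by_admit_gender["Admitted", ]

# Total applicants per gender
applicants <- apply(UCBAdmissions, "Gender", sum)

# Acceptance rate in percent, rounded to two decimals
rates <- round(admitted / applicants * 100, 2)
rates
```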
## Male Female
## 44.52 30.35
It seems that the acceptance rate in the six departments for female applicants is much lower than the acceptance rate for the male applicants.
However, when the statisticians examined the data, they discovered that within specific departments, this bias against women went away. The acceptance rate for female applicants was higher than the acceptance rate for male applicants in several cases.
Let’s get the number of applicants for both genders in each department.
## Dept
## Gender A B C D E F
## Male 825 560 325 417 191 373
## Female 108 25 593 375 393 341
And the number of admitted applicants as well.
## Dept
## Gender A B C D E F
## Male 512 353 120 138 53 22
## Female 89 17 202 131 94 24
The acceptance rates are:
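A sketch of the per-department calculation, assuming the same approach as before but keeping `Dept` as a margin (object names are illustrative):

```r
# Applicants for each Gender x Dept combination (summing over Admit)
applicants_dept <- apply(UCBAdmissions, c("Gender", "Dept"), sum)

# Admitted applicants: take the "Admitted" slice of the array directly
admitted_dept <- UCBAdmissions["Admitted", , ]

# Both objects are 2 x 6 (Gender x Dept), so they divide element by element
rates_dept <- round(admitted_dept / applicants_dept * 100, 2)
rates_dept
```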
## Dept
## Gender A B C D E F
## Male 62.06 63.04 36.92 33.09 27.75 5.90
## Female 82.41 68.00 34.06 34.93 23.92 7.04
But why? Because more women had applied to departments that admitted a small percentage of applicants, such as English, than to departments that admitted a large percentage of applicants, such as mechanical engineering [13].
This phenomenon is called Simpson’s paradox. It occurs in probability and statistics when a trend that appears in several groups of data disappears or reverses when the groups are combined.
Example 3: Simpson’s Paradox (II)
Another example of Simpson’s paradox is the survival rates of the third class passengers and crew members on the Titanic. The data is available in R as `Titanic`. It is a 4-dimensional array resulting from cross-tabulating 2201 observations on 4 variables.
Dimension | Name | Levels |
---|---|---|
1 | Class | 1st, 2nd, 3rd, Crew |
2 | Sex | Male, Female |
3 | Age | Child, Adult |
4 | Survived | No, Yes |
If we compare the survival rates for adults, we’ll find that the rates for adult third class passengers and adult crew members are close.
round(apply(Titanic, c(1,3,4), sum)[,"Adult","Yes"] /
apply(Titanic, c(1,3), sum)[,"Adult"] * 100, 2)
## 1st 2nd 3rd Crew
## 61.76 36.02 24.08 23.95
The survival rate is 24.08% for the third class passengers and 23.95% for the crew members.
However, if we further break the data down by gender, the survival rates are higher for crew members compared to the third class passengers for both men and women.
round(apply(Titanic, c(1,2,3,4), sum)[,,"Adult","Yes"] /
apply(Titanic, c(1,2,3), sum)[,,"Adult"] * 100, 2)
## Sex
## Class Male Female
## 1st 32.57 97.22
## 2nd 8.33 86.02
## 3rd 16.23 46.06
## Crew 22.27 86.96
Why do you think that happened?
FUN
`FUN` can be a named function or an anonymous function. We get an anonymous function when we define a function without giving it a name. This is useful when it’s not worth the effort to figure out a name for the function.
For example, the one-liner function below is an anonymous function. There is no need to name it if we are going to use it only once inside the `apply()` call.
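The call is not shown in the text. One anonymous function that is consistent with the printed output is a column sum scaled by 10; the scaling factor is inferred from the numbers and is an assumption, not something the text states.

```r
# function(x) sum(x) * 10 is defined inline and never given a name;
# apply() calls it once per column (MARGIN = 2)
scaled_totals <- apply(USPersonalExpenditure, 2, function(x) sum(x) * 10)
scaled_totals
```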
## 1940 1945 1950 1955 1960
## 376.11 687.14 1025.60 1297.00 1631.40
`FUN` can take optional arguments, which are passed through `...`.
For instance, `na.rm` is an optional argument of `mean()`, although in this case we don’t have `NA`s to worry about.
Every time `apply()` calls `mean()`, the first argument is a row of `USPersonalExpenditure` and `na.rm = TRUE` is passed along with it. The function call is `mean(row, na.rm = TRUE)`.
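A sketch of such a call, where everything after `FUN` is forwarded to `mean()` unchanged (the name `row_means` is illustrative):

```r
# na.rm = TRUE is not an argument of apply(); it travels through `...`
# and reaches mean() on every call, as in mean(row, na.rm = TRUE)
row_means <- apply(USPersonalExpenditure, 1, mean, na.rm = TRUE)
row_means
```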
9.4 tapply(), by()
The second member of the apply family is `tapply()`, together with its cousin `by()`.
`tapply()` and `by()` apply a function to groups of values.
tapply()
`tapply(X, INDEX, FUN, ...)` applies a function to each (non-empty) group of values given by a unique combination of the levels of certain factors.
`X` is an input vector. `INDEX` is a factor that defines the groups: the factor level identifies the group of each element in `X`. The vector and the factor are of the same length.
Vector | Factor |
---|---|
9 | A |
25 | B |
32 | C |
14 | B |
2 | C |
100 | A |
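Using the values from the table above, a small sketch shows how the factor routes each element of the vector to its group. The choice of `sum()` as the function and the names `x`, `f`, and `group_sums` are illustrative.

```r
x <- c(9, 25, 32, 14, 2, 100)                  # the Vector column
f <- factor(c("A", "B", "C", "B", "C", "A"))   # the Factor column

# Group A holds 9 and 100, group B holds 25 and 14, group C holds 32 and 2
group_sums <- tapply(x, f, sum)
group_sums
```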
Example 4: aggregating data by group
We’ll use the dataset `ToothGrowth` to show how `tapply()` aggregates data by group.
`ToothGrowth` records the effect of vitamin C on tooth growth in guinea pigs.
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
`dose` is a numeric vector giving the dose. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 milligrams/day) by one of two delivery methods. `len` is a numeric vector giving the tooth length.
`supp` is a factor giving the supplement type. The two supplement types are orange juice (`OJ`) and ascorbic acid (`VC`), a form of vitamin C. The factor level in `supp` identifies the group of each element in `len` and `dose`.
To get the mean tooth length by supplement type, we use `tapply()` to apply `mean()` to each group of `supp`. The result is a one-dimensional array that prints like a named vector.
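Neither the call nor its output is shown at this point. A sketch of what the call could look like (the name `supp_means` is illustrative):

```r
# X = tooth lengths, INDEX = supplement type, FUN = mean
supp_means <- tapply(ToothGrowth$len, ToothGrowth$supp, mean)
round(supp_means, 2)
```

The two values agree with the group means reported by `summary()` in the `by()` example later in this section.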
`tapply()` can also handle multiple grouping variables. In that case, the argument `INDEX` is a list of factors, each of the same length as `X`. Elements that are not already factors are coerced to factors.
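A sketch of a two-factor call that is consistent with the output printed next (the name `dose_means` is illustrative). Note that `dose` is numeric and is coerced to a factor with levels 0.5, 1, and 2:

```r
# One cell per combination of supplement type (rows) and dose (columns)
dose_means <- tapply(ToothGrowth$len,
                     list(ToothGrowth$supp, ToothGrowth$dose),
                     mean)
round(dose_means, 2)
```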
## 0.5 1 2
## OJ 13.23 22.70 26.06
## VC 7.98 16.77 26.14
The result is a matrix.
by()
`by(data, INDICES, FUN, ...)` applies a function to a data frame split by factors. `by()` is a wrapper for `tapply()` applied to data frames. It returns an object of class `by` that holds the result for each group.
The argument `data` is normally a data frame, but it can also be a matrix. `INDICES` is a factor or a list of factors, each of length `nrow(data)`.
`by()` calls the function `FUN` once for each group within the data frame.
The example below summarizes the data by `supp` and organizes the summary statistics in a reader-friendly manner.
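The call is not shown; the group labels in the output (`ToothGrowth$supp: OJ`) suggest something like the following sketch, where the name `supp_summary` is illustrative:

```r
# Split the whole data frame by supp, then call summary() on each piece
supp_summary <- by(ToothGrowth, ToothGrowth$supp, summary)
supp_summary
```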
## ToothGrowth$supp: OJ
## len supp dose
## Min. : 8.20 OJ:30 Min. :0.500
## 1st Qu.:15.53 VC: 0 1st Qu.:0.500
## Median :22.70 Median :1.000
## Mean :20.66 Mean :1.167
## 3rd Qu.:25.73 3rd Qu.:2.000
## Max. :30.90 Max. :2.000
## -------------------------------------------------------------------------------------
## ToothGrowth$supp: VC
## len supp dose
## Min. : 4.20 OJ: 0 Min. :0.500
## 1st Qu.:11.20 VC:30 1st Qu.:0.500
## Median :16.50 Median :1.000
## Mean :16.96 Mean :1.167
## 3rd Qu.:23.10 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
Like `tapply()`, `by()` can handle multiple grouping factors.
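A sketch consistent with the output printed next, assuming the tooth lengths were summarized for every combination of supplement type and dose (the name `len_summary` is illustrative):

```r
# INDICES is a list of two grouping variables, giving 2 x 3 = 6 groups
len_summary <- by(ToothGrowth$len,
                  list(ToothGrowth$supp, ToothGrowth$dose),
                  summary)
len_summary
```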
## : OJ
## : 0.5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.20 9.70 12.25 13.23 16.18 21.50
## -------------------------------------------------------------------------------------
## : VC
## : 0.5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.20 5.95 7.15 7.98 10.90 11.50
## -------------------------------------------------------------------------------------
## : OJ
## : 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.50 20.30 23.45 22.70 25.65 27.30
## -------------------------------------------------------------------------------------
## : VC
## : 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.60 15.28 16.50 16.77 17.30 22.50
## -------------------------------------------------------------------------------------
## : OJ
## : 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22.40 24.57 25.95 26.06 27.07 30.90
## -------------------------------------------------------------------------------------
## : VC
## : 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.50 23.38 25.95 26.14 28.80 33.90
9.5 lapply(), sapply(), mapply()
`lapply()`, `sapply()`, and `mapply()` apply a function over a list or vector.
`lapply()` returns a list.
## $len
## [1] TRUE
##
## $supp
## [1] FALSE
##
## $dose
## [1] TRUE
`sapply()` is a user-friendly version of `lapply()`, which returns a vector or a matrix.
## len supp dose
## TRUE FALSE TRUE
`mapply()` is a multivariate version of `sapply()`.
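The three functions can be compared side by side in a short sketch. The `is.numeric` calls match the outputs shown above; the `mapply()` call with `rep` and all object names are illustrative additions.

```r
# lapply(): one result per column, always returned as a list
as_list <- lapply(ToothGrowth, is.numeric)

# sapply(): same computation, simplified to a named logical vector
as_vec <- sapply(ToothGrowth, is.numeric)

# mapply(): walks two inputs in parallel, here rep(1, 3), rep(2, 2), rep(3, 1);
# the results differ in length, so they stay in a list
parallel <- mapply(rep, 1:3, 3:1)
```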
lapply()
`lapply(X, FUN, ...)` applies a function over a list or vector. `lapply()` returns a list of the same length as `X`, each element of which is the result of applying `FUN` to the corresponding element of `X`.
For instance, `lapply(mylist, mean)` iterates over the list `mylist` to get the mean of each of its three vectors. The result is a list.
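The definition of `mylist` is not shown. The sketch below uses a hypothetical list of three vectors chosen so that the means match the printed output (5.5, 5.5, and 0); the actual vectors in the original may differ.

```r
# Hypothetical stand-in for mylist: three numeric vectors of different lengths
mylist <- list(1:10, seq(1, 10, by = 0.5), -5:5)

# mean() is called once per list element; the result is a list of length 3
list_means <- lapply(mylist, mean)
list_means
```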
## [[1]]
## [1] 5.5
##
## [[2]]
## [1] 5.5
##
## [[3]]
## [1] 0
Example 5: working with lists
Now let’s see a more meaningful example using the dataset `state.x77`. `state.x77` is a matrix with 50 rows and 8 columns that gives statistics for the 50 states of the United States. These include population, income, life expectancy, murder rate, percent of high-school graduates, and a few others.
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
For the purpose of illustration, we transform and reorganize the dataset into a list of data frames [14].
state.df <- data.frame(state.x77)
state.list <- setNames(split(state.df, seq(nrow(state.df))), rownames(state.df))
head(state.list, 3)
## $Alabama
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
##
## $Alaska
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
##
## $Arizona
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
Let’s use the list to answer a few questions. First, what is the average population across the states?
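The code for this step is not shown. One way to get the average, assuming `state.list` is built as above and the extraction function `[[` is passed by its quoted name (the names `pop` and `avg_pop` are illustrative):

```r
state.df <- data.frame(state.x77)
state.list <- setNames(split(state.df, seq(nrow(state.df))), rownames(state.df))

# "[[" is the extraction function; "Population" is forwarded to it through `...`,
# so each call is equivalent to state.list[[i]][["Population"]]
pop <- lapply(state.list, "[[", "Population")

# Flatten the list of 50 numbers and average them
avg_pop <- mean(unlist(pop))
avg_pop
```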
## [1] 4246.42
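Note that for functions like `+`, `%*%`, and `[[`, the function name must be backquoted or quoted when it is passed as `FUN`.
Next, what is the population density, in people per square mile, of each state? Because `Population` is recorded in thousands, the values below are in thousands of people per square mile.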
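A sketch of a call that produces one density per state, assuming an anonymous function that divides the two columns of each one-row data frame (the name `density` is illustrative):

```r
state.df <- data.frame(state.x77)
state.list <- setNames(split(state.df, seq(nrow(state.df))), rownames(state.df))

# Each element s is a one-row data frame for a single state
density <- lapply(state.list, function(s) s$Population / s$Area)
head(density, 3)
```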
## $Alabama
## [1] 0.07129053
##
## $Alaska
## [1] 0.0006443845
##
## $Arizona
## [1] 0.01950325
##
## $Arkansas
## [1] 0.04061989
##
## $California
## [1] 0.1355709
##
## $Colorado
## [1] 0.02448779
##
## $Connecticut
## [1] 0.6375977
##
## $Delaware
## [1] 0.2921292
##
## $Florida
## [1] 0.1530227
##
## $Georgia
## [1] 0.08491037
##
## $Hawaii
## [1] 0.1350973
##
## $Idaho
## [1] 0.009833448
##
## $Illinois
## [1] 0.2008503
##
## $Indiana
## [1] 0.1471867
##
## $Iowa
## [1] 0.05114317
##
## $Kansas
## [1] 0.02787729
##
## $Kentucky
## [1] 0.08542245
##
## $Louisiana
## [1] 0.08470955
##
## $Maine
## [1] 0.03421734
##
## $Maryland
## [1] 0.4167425
##
## $Massachusetts
## [1] 0.7429083
##
## $Michigan
## [1] 0.1603569
##
## $Minnesota
## [1] 0.049452
##
## $Mississippi
## [1] 0.04949679
##
## $Missouri
## [1] 0.06909196
##
## $Montana
## [1] 0.005124084
##
## $Nebraska
## [1] 0.02018749
##
## $Nevada
## [1] 0.005369054
##
## $`New Hampshire`
## [1] 0.08995237
##
## $`New Jersey`
## [1] 0.9750033
##
## $`New Mexico`
## [1] 0.009422462
##
## $`New York`
## [1] 0.3779139
##
## $`North Carolina`
## [1] 0.1115005
##
## $`North Dakota`
## [1] 0.009195502
##
## $Ohio
## [1] 0.261989
##
## $Oklahoma
## [1] 0.03947254
##
## $Oregon
## [1] 0.02374615
##
## $Pennsylvania
## [1] 0.2637548
##
## $`Rhode Island`
## [1] 0.8875119
##
## $`South Carolina`
## [1] 0.09316791
##
## $`South Dakota`
## [1] 0.008965835
##
## $Tennessee
## [1] 0.1009727
##
## $Texas
## [1] 0.04668223
##
## $Utah
## [1] 0.01465358
##
## $Vermont
## [1] 0.05093342
##
## $Virginia
## [1] 0.1252137
##
## $Washington
## [1] 0.05346252
##
## $`West Virginia`
## [1] 0.07474034
##
## $Wisconsin
## [1] 0.08425749
##
## $Wyoming
## [1] 0.003868193
Example 6: sequence generation
The input object of `lapply()` can also be a vector.
In the example below, `rep(5, 3)` is a vector of three 5s, so the function `fun` is called three times, each time with `n` equal to 5.
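fun <- function(n){
  # n draws from a standard normal distribution
  x <- rnorm(n)
  # n exponential draws with mean |mean(x)|, carrying the sign of mean(x)
  y <- sign(mean(x)) * rexp(n, rate = abs(1 / mean(x)))
  list(X = x, Y = y)
}
lapply(rep(5, 3), fun)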
## [[1]]
## [[1]]$X
## [1] -0.1494530 0.4332426 -0.6613785 1.4628078 1.0065455
##
## [[1]]$Y
## [1] 0.02492883 0.29671244 1.36094897 0.08249475 0.36579978
##
##
## [[2]]
## [[2]]$X
## [1] -1.08800885 -0.08653661 -0.52323807 -1.36850595 0.08314541
##
## [[2]]$Y
## [1] -0.004306292 -0.047003701 -0.048976420 -0.165990336 -0.540931057
##
##
## [[3]]
## [[3]]$X
## [1] -0.6191953 0.6530385 -0.6964274 0.3268801 -0.3784654
##
## [[3]]$Y
## [1] -0.33223373 -0.18075812 -0.50677210 -0.18233192 -0.09451082
This use of `lapply()` is common in sequence generation, such as drawing repeated random samples.
Example 7: getting data types of a data frame
Recall that `apply()` converts a data frame into a matrix, so it works as expected only when all columns are of the same type. When the columns are of different types, `lapply()` comes to the rescue.
`lapply()` can be applied to a data frame because a data frame is a kind of list. The function passed to the argument `FUN` is applied to each column of the data frame.
One application is to use `lapply()` to check the types of the columns in a data frame. The output is a list.
## $len
## [1] "numeric"
##
## $supp
## [1] "factor"
##
## $dose
## [1] "numeric"
If we want to turn the output into a vector, we can `unlist()` the output list.
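The two steps described above can be sketched as follows, assuming `class()` is the function used to report the column types (the names `col_types` and `col_types_vec` are illustrative):

```r
# One class per column, returned as a list
col_types <- lapply(ToothGrowth, class)

# Flatten the list into a named character vector
col_types_vec <- unlist(col_types)
col_types_vec
```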
## len supp dose
## "numeric" "factor" "numeric"
sapply()
`sapply(X, FUN, ...)` tries to simplify the result of `lapply()` if possible. Its output can be a vector or a matrix.
If the result of `lapply()` is a list where every element has length 1, then running the same code with `sapply()` returns a vector.
## $len
## [1] TRUE
##
## $supp
## [1] FALSE
##
## $dose
## [1] TRUE
If we use `sapply()` instead, we get the same values, but in a vector rather than a list.
## len supp dose
## TRUE FALSE TRUE
If the result is a list where every element has the same length greater than 1, then running the same code with `sapply()` returns a matrix.
This is the example we used earlier to explain `lapply()`. Each call to `fun` returns a list of two elements, `X` and `Y`, each of which is a numeric vector of length 5.
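fun <- function(n){
  # n draws from a standard normal distribution
  x <- rnorm(n)
  # n exponential draws with mean |mean(x)|, carrying the sign of mean(x)
  y <- sign(mean(x)) * rexp(n, rate = abs(1 / mean(x)))
  list(X = x, Y = y)
}
lapply(rep(5, 3), fun)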
## [[1]]
## [[1]]$X
## [1] -0.6612219 -1.4763768 1.9622459 -0.4203603 -0.4776670
##
## [[1]]$Y
## [1] -0.043078436 -0.179093574 -0.423547688 -0.004014341 -0.136557351
##
##
## [[2]]
## [[2]]$X
## [1] 0.7185358 -1.3383289 -0.5433609 -1.3012167 -0.5769427
##
## [[2]]$Y
## [1] -2.0314867 -0.8605874 -0.1730227 -0.9388725 -0.1447821
##
##
## [[3]]
## [[3]]$X
## [1] 0.9158401 1.1212610 0.4716513 0.4286128 -1.1353724
##
## [[3]]$Y
## [1] 0.24942982 0.13744108 0.02898536 0.27205685 0.56304738
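In such cases, `sapply()` returns a matrix. Here each cell of the matrix holds a numeric vector of length 5, which is why the cells print as `numeric,5`.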
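A sketch of the `sapply()` call and of how to reach inside the resulting matrix. The seed and the name `m` are illustrative; the random values will differ from those printed in the text.

```r
fun <- function(n) {
  x <- rnorm(n)
  y <- sign(mean(x)) * rexp(n, rate = abs(1 / mean(x)))
  list(X = x, Y = y)
}

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
m <- sapply(rep(5, 3), fun)

dim(m)      # 2 rows (X and Y) by 3 columns (one per call)
m[[1, 1]]   # the five X values from the first call
```

Because every cell is itself a vector, `m` is a matrix built on a list rather than on a numeric vector.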
## [,1] [,2] [,3]
## X numeric,5 numeric,5 numeric,5
## Y numeric,5 numeric,5 numeric,5
mapply()
`mapply(FUN, ...)` is a multivariate version of `sapply()`. `mapply()` applies `FUN` to the first elements of each `...` argument, then to the second elements, the third elements, and so on.
Note that the first argument of `mapply()` is the function, unlike in `lapply()` and `sapply()`.
`mapply()` applies the function element-wise to vectors or lists.
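A toy sketch of the element-wise behavior, with two illustrative integer vectors:

```r
# FUN comes first; the two vectors are walked in parallel:
# (1, 4), then (2, 5), then (3, 6)
pair_sums <- mapply(function(x, y) x + y, 1:3, 4:6)
pair_sums
```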
Let’s return to `state.list` and revisit the question on population density. With `mapply()`, our solution can look like the code below. The output is a vector rather than a list.
pop <- lapply(state.list, "[[", "Population")
area <- lapply(state.list, "[[", "Area")
mapply("/", pop, area)
## Alabama Alaska Arizona Arkansas California Colorado Connecticut
## 0.0712905261 0.0006443845 0.0195032491 0.0406198864 0.1355708904 0.0244877898 0.6375976964
## Delaware Florida Georgia Hawaii Idaho Illinois Indiana
## 0.2921291625 0.1530227399 0.0849103714 0.1350972763 0.0098334482 0.2008502547 0.1471867468
## Iowa Kansas Kentucky Louisiana Maine Maryland Massachusetts
## 0.0511431687 0.0278772910 0.0854224464 0.0847095482 0.0342173351 0.4167424932 0.7429082545
## Michigan Minnesota Mississippi Missouri Montana Nebraska Nevada
## 0.1603569354 0.0494520047 0.0494967862 0.0690919632 0.0051240839 0.0201874926 0.0053690542
## New Hampshire New Jersey New Mexico New York North Carolina North Dakota Ohio
## 0.0899523651 0.9750033240 0.0094224624 0.3779139052 0.1115004713 0.0091955019 0.2619890177
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina South Dakota Tennessee
## 0.0394725364 0.0237461532 0.2637548370 0.8875119161 0.0931679074 0.0089658350 0.1009727062
## Texas Utah Vermont Virginia Washington West Virginia Wisconsin
## 0.0466822312 0.0146535763 0.0509334197 0.1252136752 0.0534625207 0.0747403407 0.0842574912
## Wyoming
## 0.0038681934
9.6 purrr map functions
The apply family bridges the gap between traditional loops and functional programming by allowing us to apply functions to vectors, arrays, and lists more elegantly. However, for new R projects, it’s advisable to use `purrr` functions for looping tasks wherever we can. Compared with their base R counterparts, `purrr` functions offer a more consistent syntax and more predictable output types.
Below we focus on the map family of functions, mainly `map()` and `map2()`.
But note that there are many other useful `purrr` functions, such as `pluck()` and `keep()`.
inputs and outputs
The map functions transform their input by applying a function to each element of a list or atomic vector and returning an object of the same length as the input.
Like the apply family, these functions work with particular input and output data structures.
Input | Output | purrr function(s) |
---|---|---|
1 vector | list | map() |
1 vector | vector of desired type | map_lgl(), map_int(), map_dbl(), map_chr() |
2 vectors | list | map2() |
2 vectors | vector of desired type | map2_lgl(), map2_int(), map2_dbl(), map2_chr() |
As shown in the table above, `map()` always returns a list. To get an atomic vector instead, we add a suffix to the function name: `_lgl()`, `_int()`, `_dbl()`, and `_chr()` return a logical, integer, double, or character vector respectively.
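A short sketch of how the suffix controls the output type, assuming `purrr` is installed. The input list `x` and the result names are illustrative.

```r
library(purrr)

x <- list(a = 1:3, b = 4:6)

as_list <- map(x, sum)        # list with elements a and b
as_int  <- map_int(x, sum)    # named integer vector
as_dbl  <- map_dbl(x, mean)   # named double vector
as_chr  <- map_chr(x, \(v) paste(v, collapse = "-"))  # named character vector
```

If the function returns a value of the wrong type or length for the chosen suffix, the typed variants raise an error instead of silently changing the output shape.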
key arguments
`map(.x, .f)` takes two key arguments, `.x` and `.f`. `.x` is the input data structure. `.f` is a function, which can be a named function, an anonymous function, or a formula. With two inputs, for example, the anonymous function can be written either using R’s anonymous function shorthand, `map2(x, y, \(x, y) x/y)`, or in full, `map2(x, y, function(x, y) x/y)`.
`map2(.x, .y, .f)` maps over two inputs. `.x` and `.y` are a pair of vectors, usually of the same length.
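The three ways of writing `.f` can be compared in a small sketch, assuming `purrr` is installed. The vectors `x` and `y` are illustrative, and `map2_dbl()` is used so the result is a plain double vector.

```r
library(purrr)

x <- c(10, 20, 30)
y <- c(2, 4, 5)

r_short   <- map2_dbl(x, y, \(a, b) a / b)          # shorthand, R >= 4.1.0
r_full    <- map2_dbl(x, y, function(a, b) a / b)   # full anonymous function
r_formula <- map2_dbl(x, y, ~ .x / .y)              # purrr formula notation
```

All three calls divide the elements pairwise and give the same result.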
tidyverse consistency
Since `purrr` is part of `tidyverse` [15], we can join multiple steps together using either the `magrittr` pipe `%>%` or the base R pipe `|>`.
It’s recommended that we use the base tools (the `|>` pipe and the `\(x)` shorthand, both available since R 4.1.0) because they don’t require loading `magrittr` and they work everywhere, not just in `purrr` functions.
examples
Below we rewrite some of the earlier apply family examples using `map` functions.
`map()` applies a function to each element of a vector, a pattern we should be quite familiar with after working with the apply family.
Previously we used `lapply()` to evaluate whether each column of the data frame `ToothGrowth` is numeric or not.
## $len
## [1] TRUE
##
## $supp
## [1] FALSE
##
## $dose
## [1] TRUE
We now use `map()` to repeat the task.
## $len
## [1] TRUE
##
## $supp
## [1] FALSE
##
## $dose
## [1] TRUE
The results are the same.
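The comparison itself is not shown. A sketch of how the two results might be checked, assuming `purrr` is installed (the names `res_lapply`, `res_map`, and `same` are illustrative):

```r
library(purrr)

res_lapply <- lapply(ToothGrowth, is.numeric)
res_map    <- map(ToothGrowth, is.numeric)

# Both are named lists of three logicals
same <- all.equal(res_lapply, res_map)
same
```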
## [1] TRUE
We can use `map_lgl()` to return a logical vector.
## len supp dose
## TRUE FALSE TRUE
- In the example below, `map_dbl()` gets the mean of each vector in the list `mylist` and returns a double vector, similar to the example above.
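The calls are not shown. The sketch below assumes `purrr` is installed and uses a hypothetical `mylist` whose means match the printed output; the original list may differ. The base R call is included for comparison.

```r
library(purrr)

# Hypothetical stand-in for mylist
mylist <- list(1:10, seq(1, 10, by = 0.5), -5:5)

base_means  <- sapply(mylist, mean)    # base R: simplifies when it can
purrr_means <- map_dbl(mylist, mean)   # purrr: always a double vector
```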
## [1] 5.5 5.5 0.0
## [1] 5.5 5.5 0.0
- `map2_dbl()` performs an operation on two vectors in parallel and returns a double vector. Below we use the `map()` and `map2()` variants `map_dbl()` and `map2_dbl()` to replace `lapply()` and `mapply()` respectively.
pop <- lapply(state.list, "[[", "Population")
area <- lapply(state.list, "[[", "Area")
result1 <- mapply("/", pop, area)
pop2 <- state.list |> map_dbl("Population")
area2 <- state.list |> map_dbl("Area")
result2 <- map2_dbl(pop2, area2, \(pop2, area2) pop2/area2)
# map2_dbl(pop2, area2, function(pop2, area2) pop2/area2)
all.equal(result1, result2)
## [1] TRUE
- In addition to extracting components deep in a data structure, `map()` can also be used for sequence generation.
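fun <- function(n){
  # n draws from a standard normal distribution
  x <- rnorm(n)
  # n exponential draws with mean |mean(x)|, carrying the sign of mean(x)
  y <- sign(mean(x)) * rexp(n, rate = abs(1 / mean(x)))
  list(X = x, Y = y)
}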
set.seed(2024)
result1 <- lapply(rep(5, 3), fun)
set.seed(2024)
result2 <- rep(5, 3) |> map(fun)
all.equal(result1, result2)
## [1] TRUE
The function `fun` is applied to each element of the vector `rep(5, 3)`, so it is called three times, each time with `n` equal to 5.
[13] As the authors summarized in the paper, “The graduate departments that are easier to enter tend to be those that require more mathematics in the undergraduate preparatory curriculum. The bias in the aggregated data stems not from any pattern of discrimination on the part of admissions committees, which seem quite fair on the whole, but apparently from prior screening at earlier levels of the educational system. Women are shunted by their socialization and education toward fields of graduate study that are generally more crowded, less productive of completed degrees, and less well funded, and that frequently offer poorer professional employment prospects.” (Bickel, Hammel, O’Connell, 1975, p. 402)
[14] Following this post, the rows in the data frame `state.df` became elements of a new list.
[15] `tidyverse` is introduced in the chapter R Documentation.