22 Iteration
22.1 Setup
This chunk of R code loads the packages that we will be using.
22.1.1 Reading & Resources
Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund, R for Data Science (2nd ed.), “27 Iteration”
JD Long and Paul Teetor, R Cookbook, 2nd ed., “Iterating with a Loop”, 2019
22.2 Iteration
Like functions, iteration is a way to reduce
repetition in your code, and
repetition in copy-and-paste tasks.
Iteration:
helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets. (R4DS, “21 Iteration”)
A rule to follow:
Never copy and paste more than twice.
A simple example, where we create a dataframe (tibble) named “df” with 4 columns of 50 records each, each one with a normal distribution centred on 100. (This is a variation on what’s in the “Iteration” chapter of R4DS.)
set.seed(8675309)
df <- tibble(
a = rnorm(50, mean = 100, sd = 10),
b = rnorm(50, mean = 100, sd = 10),
c = rnorm(50, mean = 100, sd = 10),
d = rnorm(50, mean = 100, sd = 10)
)
df
## # A tibble: 50 × 4
## a b c d
## <dbl> <dbl> <dbl> <dbl>
## 1 90.0 101. 107. 96.8
## 2 107. 101. 86.5 109.
## 3 93.8 91.7 103. 114.
## 4 120. 78.6 99.9 89.0
## 5 111. 102. 95.4 94.0
## 6 110. 90.5 102. 87.7
## 7 100. 93.5 105. 79.1
## 8 107. 107. 92.4 90.7
## 9 106. 110. 89.3 103.
## 10 109. 101. 97.4 96.0
## # ℹ 40 more rows
Calculate the median of variable a
:
median(df$a)
## [1] 100.3177
Calculate the median of the other three variables:
# solution
median(df$b)
## [1] 101.0977
median(df$c)
## [1] 99.91222
median(df$d)
## [1] 101.5601
22.3 For-Loop
Or rather than repeat the same function four times, you can write a “for-loop”, which executes the code within the loop for as many times as is specified.
There are three parts in a for-loop:
Define the output object before you start the loop—this makes your loop more efficient. Notice that the code below uses
ncol()
to count the number of columns in our dataframe “df”The sequence, which defines the number of times the loop iterates
The body of the code.
output <- vector(mode = "double", length = ncol(df)) # 1. output
# the loop
# - "i" is the counter that
for (i in seq_along(df)) { # 2. sequence
output[[i]] <- median(df[[i]]) # 3. body
}
# print the output object
output
## [1] 100.31772 101.09775 99.91222 101.56009
22.4 Detour: square brackets!
These are accessors: a method of accessing data by defining the location of that data in the object.
Square brackets inside square brackets is the base R method of referencing a particular location in a list.
One square bracket: the first list in the object
- think of this as a subset of the object
df[1]
## # A tibble: 50 × 1
## a
## <dbl>
## 1 90.0
## 2 107.
## 3 93.8
## 4 120.
## 5 111.
## 6 110.
## 7 100.
## 8 107.
## 9 106.
## 10 109.
## # ℹ 40 more rows
Two square brackets: the contents of the first object
df[[1]]
## [1] 90.03418 107.21824 93.82791 120.29392 110.65416 109.87220 100.27454 106.72872 105.72067
## [10] 109.03678 84.50448 110.22638 101.50083 93.40036 90.05411 119.72459 95.58198 90.99363
## [19] 98.49412 91.72106 119.85826 100.44005 95.95718 95.27001 95.85177 106.83234 106.90201
## [28] 105.33492 98.13895 103.82946 103.76184 111.53530 115.74903 105.88527 93.84955 97.71089
## [37] 100.63917 96.86526 97.51501 98.46205 93.64743 99.96206 100.45615 106.84072 97.47957
## [46] 98.48961 91.25069 80.23099 102.37663 100.36091
Two values inside two square brackets: this is a single value
- [[row, column]]
df[[50, 1]]
## [1] 100.3609
For a good explanation of R’s accessors and the application in selecting vector and matrix elements, see
“[, [[, $: R accessors explained”) by Christopher Brown
22.5 Data frame iteration
Let’s calculate the mean of every column in the mtcars
data frame:
mtcars
## # A tibble: 32 × 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
## # ℹ 22 more rows
Advice from R4DS:
Think about the output, sequence, and body before you start writing the loop.
- Note that we can use the
names()
function to pull the variable names (column names) from the data table.
# create the output vector
output <- vector("double", ncol(mtcars))
# the names() function
names(output)
## NULL
# assign the names to the `output` vector
names(output) <- names(mtcars)
# the loop
for (i in names(mtcars)) {
output[i] <- mean(mtcars[[i]])
}
# print the output
output
## mpg cyl disp hp drat wt qsec vs am
## 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750 0.437500 0.406250
## gear carb
## 3.687500 2.812500
22.6 A very practical case
(This is modified from “21.3.5 Exercises” in R for Data Science, which uses CSV files.)
Imagine you have a directory full of Excel files that you want to read and then combine into a single dataframe. This circumstance is encountered in many data analysis situations—perhaps the finance department of your company produces a monthly report with the data from a single month, but your interest is to compare many months of data over a few years. The challenge is to merge the data into a single R object containing all of the available data.
For this example, we have three files that are identically structured. Because we now know how to write a loop, we can write some code that will read each file, and then merge them together.
The steps are:
- Create a list with the names of the files of interest (that is, the Excel files) in the sub-folder “data_monthly”. The function
dir()
returns a character vector of the names of the files (or directories) in the named directory.
In the code below, note the use of a regular expression "\\.xls*"
to define the pattern for any Excel file, including the wild card “*” which identifies both the older “.xls” extension or the newer “.xlsx”.
all_files <- dir("data_monthly/", pattern = "\\.xls*", full.names = TRUE)
all_files
## [1] "data_monthly/2019-01_Jan.xlsx" "data_monthly/2019-02-Feb.xlsx" "data_monthly/2019-03-Mar.xlsx"
- With the
length()
function, we can find out how many files there are, and can use that to assign our object:
The all_files
and df_list
objects are used in our loop:
for (i in seq_along(all_files)) {
df_list[[i]] <- readxl::read_excel(all_files[[i]])
}
This creates an interesting R object—it’s a list, but a list of tibbles, each one with the contents of a single Excel file. This structure is sometimes referred to as “nested”—the tibbles are nested inside the list.
Let’s take a look:
df_list
## [[1]]
## # A tibble: 3 × 3
## ref_date units price
## <dttm> <dbl> <dbl>
## 1 2019-01-01 00:00:00 14 97
## 2 2019-01-08 00:00:00 5 42
## 3 2019-01-24 00:00:00 24 24
##
## [[2]]
## # A tibble: 3 × 3
## ref_date units price
## <dttm> <dbl> <dbl>
## 1 2019-02-14 00:00:00 14 97
## 2 2019-02-17 00:00:00 5 42
## 3 2019-02-24 00:00:00 24 24
##
## [[3]]
## # A tibble: 3 × 3
## ref_date units price
## <dttm> <dbl> <dbl>
## 1 2019-03-14 00:00:00 1 2
## 2 2019-03-17 00:00:00 24 24
## 3 2019-03-24 00:00:00 14 97
The last step is to use the function bind_rows()
(from {dplyr}) to combine the list of data frames into a single data frame.
df <- dplyr::bind_rows(df_list)
df
## # A tibble: 9 × 3
## ref_date units price
## <dttm> <dbl> <dbl>
## 1 2019-01-01 00:00:00 14 97
## 2 2019-01-08 00:00:00 5 42
## 3 2019-01-24 00:00:00 24 24
## 4 2019-02-14 00:00:00 14 97
## 5 2019-02-17 00:00:00 5 42
## 6 2019-02-24 00:00:00 24 24
## 7 2019-03-14 00:00:00 1 2
## 8 2019-03-17 00:00:00 24 24
## 9 2019-03-24 00:00:00 14 97
22.6.0.1 An alternative
In this approach, we create the output object to be a list with the file names:
# this is the same
df2_list <- vector("list", length(all_files))
# assign the file names
names(df2_list) <- all_files
# the loop
for (fname in all_files) {
df2_list[[fname]] <- readxl::read_excel(fname)
}
df2 <- bind_rows(df2_list)
df2
## # A tibble: 9 × 3
## ref_date units price
## <dttm> <dbl> <dbl>
## 1 2019-01-01 00:00:00 14 97
## 2 2019-01-08 00:00:00 5 42
## 3 2019-01-24 00:00:00 24 24
## 4 2019-02-14 00:00:00 14 97
## 5 2019-02-17 00:00:00 5 42
## 6 2019-02-24 00:00:00 24 24
## 7 2019-03-14 00:00:00 1 2
## 8 2019-03-17 00:00:00 24 24
## 9 2019-03-24 00:00:00 14 97
-30-