22 Iteration

22.1 Setup

This chunk of R code loads the packages that we will be using.

library(tidyverse)

library(tidyverse)
library(gapminder)

22.1.1 Reading & Resources

Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund, R for Data Science (2nd ed.), “27 Iteration”

JD Long and Paul Teetor, R Cookbook, 2nd ed., “Iterating with a Loop”, 2019

22.2 Iteration

Like functions, iteration is a way to reduce

repetition in your code, and
repetition in copy-and-paste tasks.

Iteration:

helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets. (R4DS, “21 Iteration”)

A rule to follow:

Never copy and paste more than twice.

A simple example, where we create a dataframe (tibble) named “df” with 4 columns of 50 records each, each one with a normal distribution centred on 100. (This is a variation on what’s in the “Iteration” chapter of R4DS.)

set.seed(8675309)

df <- tibble(
  a = rnorm(50, mean = 100, sd = 10),
  b = rnorm(50, mean = 100, sd = 10),
  c = rnorm(50, mean = 100, sd = 10),
  d = rnorm(50, mean = 100, sd = 10)
)
  
df

## # A tibble: 50 × 4
##        a     b     c     d
##    <dbl> <dbl> <dbl> <dbl>
##  1  90.0 101.  107.   96.8
##  2 107.  101.   86.5 109. 
##  3  93.8  91.7 103.  114. 
##  4 120.   78.6  99.9  89.0
##  5 111.  102.   95.4  94.0
##  6 110.   90.5 102.   87.7
##  7 100.   93.5 105.   79.1
##  8 107.  107.   92.4  90.7
##  9 106.  110.   89.3 103. 
## 10 109.  101.   97.4  96.0
## # ℹ 40 more rows

Calculate the median of variable a:

median(df$a)

## [1] 100.3177

Calculate the median of the other three variables:

# solution
median(df$b)

## [1] 101.0977

median(df$c)

## [1] 99.91222

median(df$d)

## [1] 101.5601

22.3 For-Loop

Or rather than repeat the same function four times, you can write a “for-loop”, which executes the code within the loop for as many times as is specified.

There are three parts in a for-loop:

Define the output object before you start the loop—this makes your loop more efficient. Notice that the code below uses ncol() to count the number of columns in our dataframe “df”
The sequence, which defines the number of times the loop iterates
The body of the code.

output <- vector(mode = "double", length = ncol(df))  # 1. output

# the loop
# - "i" is the counter that 
for (i in seq_along(df)) {            # 2. sequence
  output[[i]] <- median(df[[i]])      # 3. body
}

# print the output object
output

## [1] 100.31772 101.09775  99.91222 101.56009

22.4 Detour: square brackets!

These are accessors: a method of accessing data by defining the location of that data in the object.

Square brackets inside square brackets is the base R method of referencing a particular location in a list.

One square bracket: the first list in the object

think of this as a subset of the object

df[1]

## # A tibble: 50 × 1
##        a
##    <dbl>
##  1  90.0
##  2 107. 
##  3  93.8
##  4 120. 
##  5 111. 
##  6 110. 
##  7 100. 
##  8 107. 
##  9 106. 
## 10 109. 
## # ℹ 40 more rows

Two square brackets: the contents of the first object

df[[1]]

##  [1]  90.03418 107.21824  93.82791 120.29392 110.65416 109.87220 100.27454 106.72872 105.72067
## [10] 109.03678  84.50448 110.22638 101.50083  93.40036  90.05411 119.72459  95.58198  90.99363
## [19]  98.49412  91.72106 119.85826 100.44005  95.95718  95.27001  95.85177 106.83234 106.90201
## [28] 105.33492  98.13895 103.82946 103.76184 111.53530 115.74903 105.88527  93.84955  97.71089
## [37] 100.63917  96.86526  97.51501  98.46205  93.64743  99.96206 100.45615 106.84072  97.47957
## [46]  98.48961  91.25069  80.23099 102.37663 100.36091

Two values inside two square brackets: this is a single value

[[row, column]]

df[[50, 1]]

## [1] 100.3609

For a good explanation of R’s accessors and the application in selecting vector and matrix elements, see

“[, [[, $: R accessors explained”) by Christopher Brown
R Cookbook, 2nd ed., “2.9 Selecting Vector Elements”

22.5 Data frame iteration

Let’s calculate the mean of every column in the mtcars data frame:

mtcars

## # A tibble: 32 × 11
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # ℹ 22 more rows

Advice from R4DS:

Think about the output, sequence, and body before you start writing the loop.

Note that we can use the names() function to pull the variable names (column names) from the data table.

# create the output vector
output <- vector("double", ncol(mtcars))

# the names() function 
names(output)

## NULL

# assign the names to the `output` vector
names(output) <- names(mtcars)

# the loop
for (i in names(mtcars)) {
  output[i] <- mean(mtcars[[i]])
}

# print the output
output

##        mpg        cyl       disp         hp       drat         wt       qsec         vs         am 
##  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750   0.437500   0.406250 
##       gear       carb 
##   3.687500   2.812500

22.6 A very practical case

(This is modified from “21.3.5 Exercises” in R for Data Science, which uses CSV files.)

Imagine you have a directory full of Excel files that you want to read and then combine into a single dataframe. This circumstance is encountered in many data analysis situations—perhaps the finance department of your company produces a monthly report with the data from a single month, but your interest is to compare many months of data over a few years. The challenge is to merge the data into a single R object containing all of the available data.

For this example, we have three files that are identically structured. Because we now know how to write a loop, we can write some code that will read each file, and then merge them together.

The steps are:

Create a list with the names of the files of interest (that is, the Excel files) in the sub-folder “data_monthly”. The function dir() returns a character vector of the names of the files (or directories) in the named directory.

In the code below, note the use of a regular expression "\\.xls*" to define the pattern for any Excel file, including the wild card “*” which identifies both the older “.xls” extension or the newer “.xlsx”.

all_files <- dir("data_monthly/", pattern = "\\.xls*", full.names = TRUE)
all_files

## [1] "data_monthly/2019-01_Jan.xlsx" "data_monthly/2019-02-Feb.xlsx" "data_monthly/2019-03-Mar.xlsx"

With the length() function, we can find out how many files there are, and can use that to assign our object:

df_list <- vector("list", length(all_files))

The all_files and df_list objects are used in our loop:

for (i in seq_along(all_files)) {
  df_list[[i]] <- readxl::read_excel(all_files[[i]])
}

This creates an interesting R object—it’s a list, but a list of tibbles, each one with the contents of a single Excel file. This structure is sometimes referred to as “nested”—the tibbles are nested inside the list.

Let’s take a look:

df_list

## [[1]]
## # A tibble: 3 × 3
##   ref_date            units price
##   <dttm>              <dbl> <dbl>
## 1 2019-01-01 00:00:00    14    97
## 2 2019-01-08 00:00:00     5    42
## 3 2019-01-24 00:00:00    24    24
## 
## [[2]]
## # A tibble: 3 × 3
##   ref_date            units price
##   <dttm>              <dbl> <dbl>
## 1 2019-02-14 00:00:00    14    97
## 2 2019-02-17 00:00:00     5    42
## 3 2019-02-24 00:00:00    24    24
## 
## [[3]]
## # A tibble: 3 × 3
##   ref_date            units price
##   <dttm>              <dbl> <dbl>
## 1 2019-03-14 00:00:00     1     2
## 2 2019-03-17 00:00:00    24    24
## 3 2019-03-24 00:00:00    14    97

The last step is to use the function bind_rows() (from {dplyr}) to combine the list of data frames into a single data frame.

df <- dplyr::bind_rows(df_list)

df

## # A tibble: 9 × 3
##   ref_date            units price
##   <dttm>              <dbl> <dbl>
## 1 2019-01-01 00:00:00    14    97
## 2 2019-01-08 00:00:00     5    42
## 3 2019-01-24 00:00:00    24    24
## 4 2019-02-14 00:00:00    14    97
## 5 2019-02-17 00:00:00     5    42
## 6 2019-02-24 00:00:00    24    24
## 7 2019-03-14 00:00:00     1     2
## 8 2019-03-17 00:00:00    24    24
## 9 2019-03-24 00:00:00    14    97

22.6.0.1 An alternative

In this approach, we create the output object to be a list with the file names:

# this is the same
df2_list <- vector("list", length(all_files))
# assign the file names
names(df2_list) <- all_files

# the loop
for (fname in all_files) {
  df2_list[[fname]] <- readxl::read_excel(fname)
}


df2 <- bind_rows(df2_list)
df2

## # A tibble: 9 × 3
##   ref_date            units price
##   <dttm>              <dbl> <dbl>
## 1 2019-01-01 00:00:00    14    97
## 2 2019-01-08 00:00:00     5    42
## 3 2019-01-24 00:00:00    24    24
## 4 2019-02-14 00:00:00    14    97
## 5 2019-02-17 00:00:00     5    42
## 6 2019-02-24 00:00:00    24    24
## 7 2019-03-14 00:00:00     1     2
## 8 2019-03-17 00:00:00    24    24
## 9 2019-03-24 00:00:00    14    97

-30-

Assignment week 5: optional {#assignment5}

23 Data wrangling—continued