17.4 Loops over multiple indices with a design matrix

So far we’ve covered simple loops with a single index. But how do you loop over multiple indices? You could nest several loops inside one another, but nested loops are hard to read and tedious to write. Instead, I recommend using a design matrix to reduce a loop over several indices to a single loop with just one index. Here’s how you do it:

Let’s say you want to calculate the mean, median, and standard deviation of some quantitative variable for all combinations of two factors. For a concrete example, let’s calculate these summary statistics for the age of pirates for every combination of college and sex.

To do this, we’ll start by creating a design matrix. This matrix will have one row for every combination of our two factors. To create it, we’ll use the expand.grid() function. This function takes several vectors as arguments and returns a dataframe with all combinations of the values of those vectors. For our two factors, college and sex, we’ll enter all the factor values we want. Additionally, we’ll add NA columns for the three summary statistics we want to calculate:

design.matrix <- expand.grid("college" = c("JSSFP", "CCCC"), # college factor
                             "sex" = c("male", "female"), # sex factor
                             "median.age" = NA, # NA columns for our future calculations
                             "mean.age" = NA, #...
                             "sd.age" = NA, #...
                             stringsAsFactors = FALSE)

Here’s how the design matrix looks:

design.matrix
##   college    sex median.age mean.age sd.age
## 1   JSSFP   male         NA       NA     NA
## 2    CCCC   male         NA       NA     NA
## 3   JSSFP female         NA       NA     NA
## 4    CCCC female         NA       NA     NA
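To see the pattern behind this output, here is a minimal sketch with made-up vectors (letter and number are illustrative names, not part of the pirates data). expand.grid() returns one row per combination, so the number of rows is the product of the vector lengths, and the first argument cycles fastest:

```r
# Toy vectors, invented purely for illustration
grid <- expand.grid(letter = c("a", "b", "c"),
                    number = 1:2,
                    stringsAsFactors = FALSE)

nrow(grid)   # 3 * 2 = 6 rows, one per combination
grid$letter  # "a" "b" "c" "a" "b" "c" (first argument cycles fastest)
grid$number  # 1 1 1 2 2 2 (second argument changes every 3 rows)
```

That ordering is why, in our design matrix, college alternates on every row while sex changes only every second row.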

As you can see, the design matrix contains all combinations of our factors, plus three NA columns for our future statistics. Now that we have the matrix, we can use a single loop where the index is the row of design.matrix, and the index values are the row numbers from 1 to nrow(design.matrix). For each index value (that is, for each row), we’ll get the value of each factor (college and sex) by indexing the current row of the design matrix. We’ll then subset the pirates dataframe with those factor values, calculate our summary statistics, and assign them to the current row of the design matrix:

for (row.i in seq_len(nrow(design.matrix))) {

  # Get factor values for current row
  college.i <- design.matrix$college[row.i]
  sex.i <- design.matrix$sex[row.i]

  # Subset pirates with current factor values
  data.temp <- subset(pirates,
                      college == college.i & sex == sex.i)

  # Calculate statistics
  median.i <- median(data.temp$age)
  mean.i <- mean(data.temp$age)
  sd.i <- sd(data.temp$age)

  # Assign statistics to row.i of design.matrix
  design.matrix$median.age[row.i] <- median.i
  design.matrix$mean.age[row.i] <- mean.i
  design.matrix$sd.age[row.i] <- sd.i

}

Let’s look at the result to see if it worked!

design.matrix
##   college    sex median.age mean.age sd.age
## 1   JSSFP   male         31       32    2.6
## 2    CCCC   male         24       23    4.3
## 3   JSSFP female         33       34    3.5
## 4    CCCC female         26       26    3.4

Sweet! Our loop filled in the NA values with the statistics we wanted.
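If you want to convince yourself that the single loop really does the same job as nested loops, here is a small self-contained sketch. It does not use the real pirates data: pirates.toy is a hypothetical stand-in with made-up ages, and design.toy and nested.means are names chosen just for this example. To keep it short, it computes only the mean.

```r
# Hypothetical stand-in for the pirates data (ages are made up)
pirates.toy <- data.frame(
  college = c("JSSFP", "JSSFP", "CCCC", "CCCC",
              "JSSFP", "JSSFP", "CCCC", "CCCC"),
  sex     = c("male", "male", "male", "male",
              "female", "female", "female", "female"),
  age     = c(30, 34, 22, 26, 31, 35, 24, 28),
  stringsAsFactors = FALSE
)

colleges <- c("JSSFP", "CCCC")
sexes <- c("male", "female")

# Version 1: design matrix, one loop over its rows
design.toy <- expand.grid(college = colleges,
                          sex = sexes,
                          mean.age = NA,
                          stringsAsFactors = FALSE)

for (row.i in seq_len(nrow(design.toy))) {
  data.temp <- subset(pirates.toy,
                      college == design.toy$college[row.i] &
                        sex == design.toy$sex[row.i])
  design.toy$mean.age[row.i] <- mean(data.temp$age)
}

# Version 2: nested loops, with a counter to track where each result goes
nested.means <- rep(NA, length(colleges) * length(sexes))
counter <- 0

for (sex.i in sexes) {            # outer loop: slower-changing factor
  for (college.i in colleges) {   # inner loop: faster-changing factor
    counter <- counter + 1
    data.temp <- subset(pirates.toy,
                        college == college.i & sex == sex.i)
    nested.means[counter] <- mean(data.temp$age)
  }
}

# Both versions give the same means in the same order
all.equal(design.toy$mean.age, nested.means)
```

The nested version needs an extra counter and one more level of nesting for every factor you add. With a design matrix, adding a third factor just means adding one more argument to expand.grid().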