Chapter 13 For Loops

This is a topic I have wanted to discuss for a long time. People read blog posts from 2014–2016 and assume that for loops in R are bad: you should not use them, loops in R are slow, and so on. This chapter will help you understand how to use them more effectively.

R loops are not especially slow compared to other interpreted languages like Python or Ruby, but yes, they are slower than vectorized code. You can often get faster code by doing more of the work with vectorization than with a plain loop.

13.1 Initialize objects before loops

Create the vectors that will store your results before the loop starts. Allocating the memory up front makes an R loop much faster, and creating a vector is a single vectorized C function call, so it is cheap.

R has a few functions to create the type of vector you need: integer, numeric, character, and logical are the most common ones for these cases. A numeric vector can hold Date values as well. It is always beneficial to start with a pre-built vector to store the values.
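
To make the difference concrete, here is a minimal sketch (grow and prealloc are just illustrative names) comparing a vector grown inside the loop with one allocated up front.

n <- 1e4

# growing the result with c() reallocates and copies the vector on every iteration
grow <- function(n) {
  out <- integer(0)
  for (i in 1:n) {
    out <- c(out, i * 2L)
  }
  out
}

# integer(n) allocates all the memory once, before the loop starts
prealloc <- function(n) {
  out <- integer(n)
  for (i in 1:n) {
    out[[i]] <- i * 2L
  }
  out
}

microbenchmark::microbenchmark(
  grow = grow(n),
  prealloc = prealloc(n),
  times = 10
)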

13.2 Use simple data-types

Data types are the most common reason people don't get speed out of R. If you run a loop over a data.frame, R has to check the constraints of a data.frame on every modification, such as all columns having the same length, to make sure you are not breaking the structure, and it may also copy the object on each modification. The same code can be orders of magnitude faster if we just use a simple list.

The data.table package provides an interface to set values inside a data.table without creating a copy, which makes it faster for most use cases. Let's compare how fast it is.

# write into the first column by its numeric index, one row at a time,
# in place and without copying the data.table
set_dt_num =  function(
  data_table, 
  n_row
  ) {
  for(i in 1:n_row){
    data.table::set(
      x = data_table,
      i = i,
      j = 1L,
      value = i * 2L
    )
  }
}

# the same, but addressing the column by name
set_dt_col = function(
  data_table, 
  n_row
  ) {
  for(i in 1:n_row){
    data.table::set(
      x = data_table,
      i = i,
      j = "x",
      value = i * 2L
    )
  }
}
n_row <- 1e3

data_table <- data.table::data.table(x = integer(n_row))
data_frame <- data.frame(x = integer(n_row))

microbenchmark::microbenchmark(
  set_df_col = {
    for(i in 1:n_row){
      data_frame$x[[i]] <- i * 2L
    }
  },
  set_dt_num = set_dt_num(data_table , n_row ),
  set_dt_col = set_dt_col(data_table , n_row ),
  times = 10
)
## Unit: milliseconds
##        expr    min     lq    mean  median     uq     max neval cld
##  set_df_col 6.4192 6.7703 7.12948 7.20885 7.4998  7.7697    10  a 
##  set_dt_num 6.9962 7.2269 7.51301 7.44040 7.8234  8.4001    10  ab
##  set_dt_col 7.3007 7.3746 8.21150 7.70270 8.4784 11.8445    10   b

This code used to give me around a 200x improvement over base R in earlier versions. From R 4.0 onward R manages memory much more efficiently, and base R performs better in this test. Let me try it on a larger data set.

n_row <- 1e5

data_table <- data.table::data.table(x = integer(n_row))
data_frame <- data.frame(x = integer(n_row))

microbenchmark::microbenchmark(
  set_df_col = {
    for(i in 1:n_row){
      data_frame$x[[i]] <- i * 2L
    }
  },
  set_dt_num = set_dt_num(data_table , n_row ),
  set_dt_col = set_dt_col(data_table , n_row ),
  times = 10
)
## Unit: milliseconds
##        expr       min        lq      mean    median        uq       max neval
##  set_df_col 5636.7793 5713.2144 5923.5910 5815.6623 5999.2714 6667.3990    10
##  set_dt_num  726.3032  743.6081  765.2136  754.3738  772.6807  871.3212    10
##  set_dt_col  757.6400  767.5334  787.3691  788.5969  797.3524  825.1221    10
##  cld
##    b
##   a 
##   a

Now we can see some improvement over base R. On the bigger data set data.table is roughly 8x faster than the data.frame loop. This was just to establish that data.table performs well on bigger data sets. Yet we can still get better performance by moving to a lower-level data structure.

n_row <- 1e5

data_table <- data.table::data.table(x = integer(n_row))
data_list <- list(x = integer(n_row))

microbenchmark::microbenchmark(
  set_list_col = {
    for(i in 1:n_row){
      data_list$x[[i]] <- i*2L
    }
  },
  set_dt_num = set_dt_num(data_table , n_row ),
  set_dt_col = set_dt_col(data_table , n_row ),
  times = 10
)
## Unit: milliseconds
##          expr      min       lq      mean   median       uq       max neval cld
##  set_list_col  14.7344  15.8604  16.59509  16.6949  17.3308   18.3125    10  a 
##    set_dt_num 756.9703 759.1276 801.20570 786.8388 819.9383  940.1052    10   b
##    set_dt_col 758.6093 794.6850 856.10497 850.2959 907.3545 1015.2980    10   b

Just by moving from a data.frame to a list we get a substantial improvement: the list version is roughly 50x faster than data.table here and around 350x faster than the data.frame loop from the previous benchmark, which is huge. But wait, we can do even better. We haven't tried the most atomic structure in R, the vector. Let's benchmark it again with vectors.

n_row <- 1e5

x <- integer(n_row)
data_list <- list(x = integer(n_row))

microbenchmark::microbenchmark(
  set_list_col = {
    for(i in 1:n_row){
      data_list$x[[i]] <- i*2L
    }
  },
  set_vector = {
    for(i in 1:n_row){
      x[[i]] <- i*2L
    }
  }
)
## Unit: milliseconds
##          expr     min       lq      mean  median      uq     max neval cld
##  set_list_col 15.0020 16.34730 16.946135 17.0637 17.3738 20.5493   100   b
##    set_vector  6.0714  6.94135  7.248512  7.2372  7.5281  8.8899   100  a

We were able to squeeze out a bit more than 2x extra speed with a plain base R vector. So finally we have a vector that can do the entire computation in about 7 ms, while the data.frame loop over the same 100,000 rows took around 5,800 ms, which makes the vector version roughly 800x faster. All you need to do is remember to ask: can we do this with a simpler data type?

This is the best change you can make to speed up your code. I prefer looping over a vector perhaps 90% of the time. In the cases where that is not possible, I like to convert the data.frame to a list, run the loop, and then convert it back to a data.frame, because a data.frame is itself just a list with some extra constraints. Remembering this will help you a lot in speeding up your code.
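
As a rough sketch of that workflow (df_to_modify and the doubling step are made up purely for illustration):

df_to_modify <- data.frame(x = 1:1e4, y = rnorm(1e4))

# drop down to a plain list, loop cheaply, then rebuild the data.frame
tmp_list <- as.list(df_to_modify)

for (i in seq_along(tmp_list$x)) {
  tmp_list$x[[i]] <- tmp_list$x[[i]] * 2L
}

df_to_modify <- as.data.frame(tmp_list)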

13.3 apply family

Use the apply family for concise and efficient code. Apply functions make your code shorter and more readable, and because you don't have to write the loop yourself you can focus on the task. Whenever possible you should use an apply function, because at times it can be more efficient than a loop. There are only 3 functions that you really need to know; you can get away with just these 3 to solve perhaps 90% of your tasks.

  1. lapply for almost everything
  2. apply for looping over rows
  3. mapply for looping over multiple vectors as an argument to a single function

There is vapply too, for strictly getting a vector in return. I use it rarely because you can unlist the result of lapply for the same effect. Map is nothing more than mapply with SIMPLIFY set to FALSE. Reduce can also be useful, but only in rare circumstances.
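
A quick sketch of the calling patterns on toy inputs (the inputs are made up just to show the shapes):

# lapply: loop over one vector and get a list back
lapply(1:3, function(i) i ^ 2)

# apply: loop over the rows (MARGIN = 1) of a matrix or data.frame
apply(matrix(1:6, nrow = 2), 1, sum)

# mapply: loop over several vectors in parallel
mapply(function(a, b) a + b, 1:3, 4:6)

# vapply pins down the return type; unlist(lapply(...)) gives the same values
vapply(1:3, function(i) i ^ 2, numeric(1))

# Map is mapply with SIMPLIFY = FALSE, so it always returns a list
Map(`+`, 1:3, 4:6)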

There are certain things you need to know about apply functions.

13.3.1 apply functions are not much faster than loops

Many people wrongly assume that using apply functions means your code is vectorized, which is not true at all. Apply functions are loops under the hood; they are meant for convenience, not for speed.

microbenchmark::microbenchmark(
  lapply = lapply(1:1e3, rnorm),
  forloop = for(i in 1:1e3){
      rnorm(i)
    },
  times = 10
)
## Unit: milliseconds
##     expr     min      lq     mean  median      uq     max neval cld
##   lapply 29.3617 30.5030 31.36278 31.1392 32.5946 33.3186    10  a 
##  forloop 30.6841 31.9347 33.87415 33.7539 34.1403 40.8678    10   b

I have tested this on bigger vectors and the results are almost identical; the difference is not large. lapply gives you an optimized loop to begin with, so prefer lapply wherever it fits, but don't be afraid to use a plain loop either, as the speed is mostly the same.

13.3.2 Nested lapply calls have the same speed as a normal lapply

# the %>% pipe used below comes from the magrittr package
microbenchmark::microbenchmark(
  nested = lapply(1:1e3, function(x){
    rnorm(x)
  }) %>% 
  lapply(function(x){
      sum(x)
  }),
  normal = lapply(1:1e3, function(x){
    rnorm(x) %>% 
      sum()
  }),
  times = 10
)
## Unit: milliseconds
##    expr     min      lq     mean   median      uq     max neval cld
##  nested 31.5218 32.7724 34.51463 33.60145 34.4416 44.8997    10   a
##  normal 33.1568 35.4852 36.95852 36.19680 37.4690 45.3356    10   a

As you can see, chaining multiple lapply calls doesn't slow the code down. However, it's not standard practice, and I would not recommend making your code harder to read by nesting multiple lapply calls.

So let me summarize: lapply is marginally better than a loop, but it's nowhere near the speed of vectorized code. Let's talk about the fastest way to speed up your code.

13.4 Vectorize your code

R is vectorized to the core. Most functions in R, even the comparison operators, work on whole vectors at once. This is a core strength of R. If you can break your task down into vectorized operations, you can make it faster even after adding more steps to it. Let's take an example.
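
Before that example, a tiny illustration of what "vectorized" means at the smallest scale: a single call handles the whole vector at once.

(1:5) > 3
## [1] FALSE FALSE FALSE  TRUE  TRUE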

dummy_text <- sample(
  x = letters,
  size = 1e3,
  replace = TRUE
)

dummy_category <- sample(
  x = c(1,2,3),
  size = 1e3,
  replace = TRUE
)

main_table <- data.frame(dummy_text, dummy_category)

Now this table has 1,000 pieces of text that I want to join into one big corpus per category. Anybody familiar with other programming languages like Python, Java, or C++ will reach for a loop to solve it. If you try that approach it might go like this.

join_text_norm <- function(df = main_table){
  
  text <- character(length(unique(df$dummy_category)))
  
  for(i in seq_along(df$dummy_category)){
    if ( df$dummy_category[[i]] == 1 ) {
        text[[1]] <- paste0(text[[1]], df$dummy_text[[i]])
    } else   if ( df$dummy_category[[i]] == 2 ) {
        text[[2]] <- paste0(text[[2]], df$dummy_text[[i]])
    } else {
      text[[3]] <- paste0(text[[3]], df$dummy_text[[i]])
    }
  }
  
  return(text)
  
}

join_text_norm()
## [1] "rusibzesfwgmqllolipftcofeuxkkobbrxlidqbllwptugacdikvvbtusgvlrbxirjvklkjbtbtktjzifmdzlkjvkxkdyoxculhlmaaqpesxgtrcbebxecunlejjzjrjzodqkkuvlvfsrtjrsmewprbujxipsxippyfsfdvwmmbpzmikggwcriwcccdraettaokrodmhhgsiixdsfhulqrunxqiobwgtddccrssdpggrqnzgwqlczeienlidqrjdpdojxcrjntplmljqmzdkqhwkjzvlyuqwsilluhyfdunxabxcreysmzvqfhfzoeflgfagvxgyxqnnnj"     
## [2] "urbgiagrmtzkasuejhsyvfstpyhgkkqaycrxjcrmraobjikfaioptlxmqkrouyyghtoodcsfobfsbutqfzfamheshzasgpyyddnfpllxolvmolhtprhkoysnwhacnhsngbhyuyupgnazwomqzifybzsiyetvpbmufpfpqsprjqxxjcrxaaqqmxzdtpsatprdkeehpgybjfdzujoupwjduuymhetgebwlmdwphqvhibsjmrvtfmbybgzswkupcntajjemowwnuwvmtucfzdflehtxppcoiuusbyivhwhwlwyairwwwakltzhgqtudqzjhplnqcxd"            
## [3] "hcvqtxyeewjjbumjuglemwzxaukocnevkiytzecsabofxldodrpwbqwikycaqfkuttzjrldzzffstmxrnktjspjannquhvxmlrcowxjqutlexbjjwafvhdkemxgnbabstlyjszvdzpzoeqcvpvmfbvnalsdgabllzbpcxxddclcpgqccsrdjsfnctwdmxcosujewacwfzokjprgqmqovjyodfzevrwsihahxmblwcysajbabnxtybqzcpfrfizhjybkmmkapbvzjqhbsfnjopbpuumqvslhbbiorlheylgugwpxpazccdlqsqiniltwagstftclqvlkypekvksx"

This is not the most optimized function, but it gets the job done. And I am breaking a golden rule here.

13.4.1 Never repeat a calculation

In the above code I could save some time by storing the current text and category in variables, and stop R from extracting them from the data.frame again and again.

join_text_saved <- function(
  df = main_table
  ){
  
  text <- character(length(unique(df$dummy_category)))

  for(i in seq_along(df$dummy_category)){
    curr_text <- df$dummy_text[[i]]
    curr_cat <- df$dummy_category[[i]]
    
    if (curr_cat  == 1 ) {
        text[[1]] <- paste0(text[[1]], curr_text)
    } else   if ( curr_cat == 2 ) {
        text[[2]] <- paste0(text[[2]], curr_text)
    } else {
      text[[3]] <- paste0(text[[3]], curr_text)
    }
  }
  
  return(text)
}

join_text_saved()
## [1] "rusibzesfwgmqllolipftcofeuxkkobbrxlidqbllwptugacdikvvbtusgvlrbxirjvklkjbtbtktjzifmdzlkjvkxkdyoxculhlmaaqpesxgtrcbebxecunlejjzjrjzodqkkuvlvfsrtjrsmewprbujxipsxippyfsfdvwmmbpzmikggwcriwcccdraettaokrodmhhgsiixdsfhulqrunxqiobwgtddccrssdpggrqnzgwqlczeienlidqrjdpdojxcrjntplmljqmzdkqhwkjzvlyuqwsilluhyfdunxabxcreysmzvqfhfzoeflgfagvxgyxqnnnj"     
## [2] "urbgiagrmtzkasuejhsyvfstpyhgkkqaycrxjcrmraobjikfaioptlxmqkrouyyghtoodcsfobfsbutqfzfamheshzasgpyyddnfpllxolvmolhtprhkoysnwhacnhsngbhyuyupgnazwomqzifybzsiyetvpbmufpfpqsprjqxxjcrxaaqqmxzdtpsatprdkeehpgybjfdzujoupwjduuymhetgebwlmdwphqvhibsjmrvtfmbybgzswkupcntajjemowwnuwvmtucfzdflehtxppcoiuusbyivhwhwlwyairwwwakltzhgqtudqzjhplnqcxd"            
## [3] "hcvqtxyeewjjbumjuglemwzxaukocnevkiytzecsabofxldodrpwbqwikycaqfkuttzjrldzzffstmxrnktjspjannquhvxmlrcowxjqutlexbjjwafvhdkemxgnbabstlyjszvdzpzoeqcvpvmfbvnalsdgabllzbpcxxddclcpgqccsrdjsfnctwdmxcosujewacwfzokjprgqmqovjyodfzevrwsihahxmblwcysajbabnxtybqzcpfrfizhjybkmmkapbvzjqhbsfnjopbpuumqvslhbbiorlheylgugwpxpazccdlqsqiniltwagstftclqvlkypekvksx"
microbenchmark::microbenchmark(
  join_text_norm(df = main_table),
  join_text_saved(df = main_table)
)
## Unit: milliseconds
##                              expr    min      lq     mean  median     uq    max
##   join_text_norm(df = main_table) 4.3522 5.05395 5.342287 5.26005 5.3965 8.7694
##  join_text_saved(df = main_table) 3.9887 4.62080 4.988805 4.84410 5.0288 8.8924
##  neval cld
##    100   b
##    100  a

We did not save much, but we still shaved off a fraction of a millisecond over just 1,000 iterations. Not repeating calculations is an excellent habit, especially when you compute the same things again and again.

Now coming back to the point. You could take the loop approach, just like every other programming language would. Or you can try a vectorized approach with the built-in paste0 function and its collapse argument.

collapsed_fun <- function(
  df = main_table
  ){
  # split by category, paste each group's text together, then flatten to a vector
  text <- df %>% 
    split(f = df$dummy_category) %>% 
    lapply(function(x)
      paste0(x$dummy_text, collapse = "")
    ) %>% 
    unlist()
  
  return(text)
}

collapsed_fun(main_table)
##                                                                                                                                                                                                                                                                                                                                                     1 
##      "rusibzesfwgmqllolipftcofeuxkkobbrxlidqbllwptugacdikvvbtusgvlrbxirjvklkjbtbtktjzifmdzlkjvkxkdyoxculhlmaaqpesxgtrcbebxecunlejjzjrjzodqkkuvlvfsrtjrsmewprbujxipsxippyfsfdvwmmbpzmikggwcriwcccdraettaokrodmhhgsiixdsfhulqrunxqiobwgtddccrssdpggrqnzgwqlczeienlidqrjdpdojxcrjntplmljqmzdkqhwkjzvlyuqwsilluhyfdunxabxcreysmzvqfhfzoeflgfagvxgyxqnnnj" 
##                                                                                                                                                                                                                                                                                                                                                     2 
##             "urbgiagrmtzkasuejhsyvfstpyhgkkqaycrxjcrmraobjikfaioptlxmqkrouyyghtoodcsfobfsbutqfzfamheshzasgpyyddnfpllxolvmolhtprhkoysnwhacnhsngbhyuyupgnazwomqzifybzsiyetvpbmufpfpqsprjqxxjcrxaaqqmxzdtpsatprdkeehpgybjfdzujoupwjduuymhetgebwlmdwphqvhibsjmrvtfmbybgzswkupcntajjemowwnuwvmtucfzdflehtxppcoiuusbyivhwhwlwyairwwwakltzhgqtudqzjhplnqcxd" 
##                                                                                                                                                                                                                                                                                                                                                     3 
## "hcvqtxyeewjjbumjuglemwzxaukocnevkiytzecsabofxldodrpwbqwikycaqfkuttzjrldzzffstmxrnktjspjannquhvxmlrcowxjqutlexbjjwafvhdkemxgnbabstlyjszvdzpzoeqcvpvmfbvnalsdgabllzbpcxxddclcpgqccsrdjsfnctwdmxcosujewacwfzokjprgqmqovjyodfzevrwsihahxmblwcysajbabnxtybqzcpfrfizhjybkmmkapbvzjqhbsfnjopbpuumqvslhbbiorlheylgugwpxpazccdlqsqiniltwagstftclqvlkypekvksx"

Let’s compare it with the loop approach.

microbenchmark::microbenchmark(
  join_text_norm(),
  join_text_saved(),
  collapsed_fun()
)
## Unit: microseconds
##               expr    min      lq     mean median      uq     max neval cld
##   join_text_norm() 4339.0 4843.35 5139.115 5086.5 5259.20  7970.6   100   b
##  join_text_saved() 4005.1 4409.35 4881.698 4687.7 4807.75 15375.6   100   b
##    collapsed_fun()  928.9 1077.45 1199.531 1153.2 1236.45  4802.2   100  a

The collapsed function is faster than every other approach on just 1,000 rows. Imagine doing it for 1 million; the gap in favor of the vectorized version only gets bigger.

The real reason is that vectorized code loops in optimized C under the hood, which is almost always faster than a loop written in R; at times it can be orders of magnitude faster. Thus you can sometimes get away with doing more work with vectors than with loops.

13.4.2 Vectorized code can do 2 or 3 extra steps in less time

There is a classic example of this that I read in the book Efficient R Programming, and I was amazed when I understood why it happens.

if_norm <- function(
  x,
  size
){
  y <- character(size)
  for(i in seq_len(size)){
    value <- x[[i]]
    if(value == 0){
      y[[i]] <- "zero"
    } else if(value == 1){
      y[[i]] <- "one"
    } else {
      y[[i]] <- "many"
    }
  }
  return(y)
}


if_vector <- function(
  x,
  size
){

  y <- character(size)
  
  y[1:size] <- "many"
  y[x == 1] <- "one"
  y[x == 0] <- "zero"
  
  return(y)
}

Both functions return the same vector. However, the loop version does the minimum amount of work, while the vectorized version performs redundant operations.

size <- 1e3

x <- sample(
  x = c(0,1,2),
  size = size,
  replace = TRUE
)

all.equal(
  if_norm(x, size),
  if_vector(x, size)
)
## [1] TRUE

Let's check the speed of both functions. Even though the vectorized solution takes 3 passes and overwrites some values just to get the answer, it is still faster, because vectorized code is so fast that you can get away with doing more and still save time. Think of vectorized code as Flash Gordon: it can be faster even when it's doing more than it should.

microbenchmark::microbenchmark(
  minimal = if_norm(x,size),
  vectorized = if_vector(x, size)
)
## Unit: microseconds
##        expr   min     lq    mean median     uq    max neval cld
##     minimal 167.8 174.30 190.271 189.35 196.85  262.4   100   b
##  vectorized  20.6  22.35  64.926  23.30  24.75 3979.3   100  a

13.5 Understanding non-vectorized code

There are times when you only need scalar values, and in those cases vectorized code is redundant. I see many people who don't understand the difference use vectorized operators inside a loop on scalar values, when non-vectorized code would have been more efficient. Let's check it with an example.

n_size <- 1e4

binary_df <- data.frame(
  x = sample(
    x = c(TRUE, FALSE),
    size = n_size,
    replace = TRUE
  ),
  y = sample(
    x = c(TRUE, FALSE),
    size = n_size,
    replace = TRUE
  ),
  z = sample(
    x = c(TRUE, FALSE),
    size = n_size,
    replace = TRUE
  )
)

Let's find all the rows where every variable is TRUE. The fastest method would be a vectorized solution like this:

all_true <- binary_df$x & binary_df$y & binary_df$z

But suppose you are computing the exact same thing inside a loop.

### vectorized code
vect_all_true <- function(
  df
){
  y <- logical(nrow(df))
  
  for(i in seq_along(df$x)){
    y[i] <-  df$x[i] & df$y[i] & df$z[i]
  }
  
  return(y)
}

### scalar code
scalar_all_true <- function(
  df
){
  y <- logical(nrow(df))
  
  for(i in seq_along(df$x)){
    y[[i]] <-  df$x[[i]] && df$y[[i]] && df$z[[i]]
  }
  
  return(y)
}
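
The only difference between the two functions is the operator: & is the vectorized form that works element-wise on whole vectors, while && works on single values and short-circuits as soon as the answer is known. A tiny sketch of the distinction (purely illustrative):

c(TRUE, FALSE) & c(TRUE, TRUE)    # vectorized: returns TRUE FALSE
TRUE && FALSE                     # scalar: returns a single FALSE
FALSE && stop("never evaluated")  # short-circuits, the right-hand side is skipped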

Let’s compare both the functions for speed.

microbenchmark::microbenchmark(
  vect_all_true(df = binary_df),
  scalar_all_true(df = binary_df),
  times = 10
)
## Unit: milliseconds
##                             expr     min      lq     mean   median      uq
##    vect_all_true(df = binary_df) 22.7146 23.4207 25.31963 24.85225 25.0116
##  scalar_all_true(df = binary_df) 11.7749 12.3177 13.18605 12.58995 12.8907
##      max neval cld
##  34.2915    10   b
##  17.1805    10  a

I know this is not a great example, because it can easily be vectorized, but in cases where you really are working on individual scalar values, non-vectorized code gives you speed. In our example we get about twice the speed, which matters even on such a small data set, and the difference grows with the number of rows.

13.6 Do as little as possible inside a loop

R is an interpreted language: everything you write inside a loop runs on every iteration. The best thing you can do is to be parsimonious about the code inside a loop. There are a number of steps you can take to speed a loop up a bit more.

  1. Calculate anything that doesn't change before the loop (see the sketch after this list)
  2. Initialize objects before the loop
  3. Iterate over as few elements as possible
  4. Call as few functions inside the loop as possible
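
A minimal sketch of the first two points (center_slow and center_fast are made-up names): hoist anything that does not change out of the loop and pre-allocate the result.

x <- rnorm(1e4)

# slower: mean(x) is recomputed on every single iteration
center_slow <- function(x) {
  out <- numeric(length(x))
  for (i in seq_along(x)) {
    out[[i]] <- x[[i]] - mean(x)
  }
  out
}

# faster: the mean is calculated once, before the loop
center_fast <- function(x) {
  out <- numeric(length(x))
  m <- mean(x)
  for (i in seq_along(x)) {
    out[[i]] <- x[[i]] - m
  }
  out
}

microbenchmark::microbenchmark(
  slow = center_slow(x),
  fast = center_fast(x),
  times = 10
)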

The main tip is to get out of the loop as quickly as possible. There is another crucial thing you can do to speed up your code.

13.6.1 Combine vectorized code inside a loop

The best approach is to figure out which parts of the code can be optimized with a vectorized solution and which parts genuinely require a loop. The key is to use as few loops as possible and as much vectorized code as possible. This is the same thinking that helps when parallelizing code, too.

n_size <- 1e5

hr_df <- data.table::data.table(
  department = sample(
    x = letters[1:5],
    size = n_size,
    replace = TRUE
  ),
  salary = sample(
    x = 1e3:1e4,
    size = n_size,
    replace = TRUE
  )
)

Let's try to find out how much money each department is paying in salaries.

sum_salary <- function(
  df
){

  answer <- list()

  # loop only over the handful of unique departments; the filter and sum inside are vectorized
  for(dep in unique(df$department)){

    value <- df[
      department == dep,
      sum(salary, na.rm = TRUE)
    ]

    answer[[dep]] <- value

  }

  return(answer)

}

microbenchmark::microbenchmark(
  sum_salary(df = hr_df)
)
## Unit: milliseconds
##                    expr   min     lq     mean median      uq     max neval
##  sum_salary(df = hr_df) 7.876 8.4403 9.346993 8.7304 9.10345 23.2677   100

I am doing as much vectorized computation as possible in this scenario, and that is why this code runs pretty fast. If I wrote a loop that walks through all 100,000 rows one by one, it would be around 100x slower. This is a neat trick you should use whenever you can: do as little as possible inside a loop.
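
As a side note, when the grouping itself can be vectorized, data.table can express the whole aggregation in one grouped call; a sketch of that alternative (not benchmarked here):

# one grouped, vectorized call instead of looping over departments in R
hr_df[, .(total_salary = sum(salary, na.rm = TRUE)), by = department]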

13.7 Conclusion

This chapter focused mainly on loops and how to optimize them. Loops are necessary, and these tips will help you run them better:

  1. Vectorize your code
  2. You can do more with vectorization and still be faster than a loop
  3. Use vectorized and scalar code with care
  4. Combine vectorized code with loops to gain maximum power
  5. Initialize your objects before the loop
  6. Use simpler data types inside loops
  7. apply functions are not faster than loops
  8. Nested apply functions don't necessarily mean slower code, but you should avoid them for readability
  9. Don't repeat the same calculation
  10. Cache or save intermediate results
  11. Run garbage collection for heavy calculations