Chapter 13 For Loops

This is a topic I have wanted to discuss for a long time. People read blog posts from 2014–2016 and assume that for loops in R are bad: you should not use them, loops in R are slow, and so on. This chapter will help you understand how to use them more effectively.

R loops are not especially slow compared to other interpreted languages like Python or Ruby, but yes, they are slower than vectorized code. You can often get faster code by doing more of the work with vectorization than with a plain loop.

13.1 Initialize objects before loops

Create the vectors that will store your results before the loop starts. Allocating the memory up front makes an R loop much faster, and creating a vector is a single vectorized C function call, so it is cheap.

R has a few functions to create the type of vector you need: integer, numeric, character, and logical are the most common ones for these cases. A numeric vector can hold Date values as well. It is always beneficial to start with a pre-built vector to store the values.
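
To make the difference concrete, here is a minimal sketch (grow and prealloc are just illustrative names) comparing a vector grown inside the loop with one allocated up front.

n <- 1e4

# growing the result with c() reallocates and copies the vector on every iteration
grow <- function(n) {
  out <- integer(0)
  for (i in 1:n) {
    out <- c(out, i * 2L)
  }
  out
}

# integer(n) allocates all the memory once, before the loop starts
prealloc <- function(n) {
  out <- integer(n)
  for (i in 1:n) {
    out[[i]] <- i * 2L
  }
  out
}

microbenchmark::microbenchmark(
  grow = grow(n),
  prealloc = prealloc(n),
  times = 10
)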

13.2 Use simple data-types

Data types are the most common reason people don't get speed out of R. If you run a loop over a data.frame, R has to check the constraints of a data.frame on every modification, such as all columns having the same length, to make sure you are not breaking the structure, and it may also copy the object on each modification. The same code can be orders of magnitude faster if we just use a simple list.

The data.table package provides an interface to set values inside a data.table without creating a copy, which makes it faster for most use cases. Let's compare how fast it is.

# write into the first column by its numeric index, one row at a time,
# in place and without copying the data.table
set_dt_num =  function(
  data_table, 
  n_row
  ) {
  for(i in 1:n_row){
    data.table::set(
      x = data_table,
      i = i,
      j = 1L,
      value = i * 2L
    )
  }
}

# the same, but addressing the column by name
set_dt_col = function(
  data_table, 
  n_row
  ) {
  for(i in 1:n_row){
    data.table::set(
      x = data_table,
      i = i,
      j = "x",
      value = i * 2L
    )
  }
}
n_row <- 1e3

data_table <- data.table::data.table(x = integer(n_row))
data_frame <- data.frame(x = integer(n_row))

microbenchmark::microbenchmark(
  set_df_col = {
    for(i in 1:n_row){
      data_frame$x[[i]] <- i * 2L
    }
  },
  set_dt_num = set_dt_num(data_table , n_row ),
  set_dt_col = set_dt_col(data_table , n_row ),
  times = 10
)
## Unit: milliseconds
##        expr    min     lq    mean  median     uq     max neval cld
##  set_df_col 6.4192 6.7703 7.12948 7.20885 7.4998  7.7697    10  a 
##  set_dt_num 6.9962 7.2269 7.51301 7.44040 7.8234  8.4001    10  ab
##  set_dt_col 7.3007 7.3746 8.21150 7.70270 8.4784 11.8445    10   b

This code used to give me around a 200x improvement over base R in earlier versions. From R 4.0 onward R manages memory much more efficiently, and base R performs better in this test. Let me try it on a larger data set.

n_row <- 1e5

data_table <- data.table::data.table(x = integer(n_row))
data_frame <- data.frame(x = integer(n_row))

microbenchmark::microbenchmark(
  set_df_col = {
    for(i in 1:n_row){
      data_frame$x[[i]] <- i * 2L
    }
  },
  set_dt_num = set_dt_num(data_table , n_row ),
  set_dt_col = set_dt_col(data_table , n_row ),
  times = 10
)
## Unit: milliseconds
##        expr       min        lq      mean    median        uq       max neval
##  set_df_col 5636.7793 5713.2144 5923.5910 5815.6623 5999.2714 6667.3990    10
##  set_dt_num  726.3032  743.6081  765.2136  754.3738  772.6807  871.3212    10
##  set_dt_col  757.6400  767.5334  787.3691  788.5969  797.3524  825.1221    10
##  cld
##    b
##   a 
##   a

Now we can see some improvement over base R. On the bigger data set data.table is roughly 8x faster than the data.frame loop. This was just to establish that data.table performs well on bigger data sets. Yet we can still get better performance by moving to a lower-level data structure.

n_row <- 1e5

data_table <- data.table::data.table(x = integer(n_row))
data_list <- list(x = integer(n_row))

microbenchmark::microbenchmark(
  set_list_col = {
    for(i in 1:n_row){
      data_list$x[[i]] <- i*2L
    }
  },
  set_dt_num = set_dt_num(data_table , n_row ),
  set_dt_col = set_dt_col(data_table , n_row ),
  times = 10
)
## Unit: milliseconds
##          expr      min       lq      mean   median       uq       max neval cld
##  set_list_col  14.7344  15.8604  16.59509  16.6949  17.3308   18.3125    10  a 
##    set_dt_num 756.9703 759.1276 801.20570 786.8388 819.9383  940.1052    10   b
##    set_dt_col 758.6093 794.6850 856.10497 850.2959 907.3545 1015.2980    10   b

Just by moving from a data.frame to a list we get a substantial improvement: the list version is roughly 50x faster than data.table here and around 350x faster than the data.frame loop from the previous benchmark, which is huge. But wait, we can do even better. We haven't tried the most atomic structure in R, the vector. Let's benchmark it again with vectors.

n_row <- 1e5

x <- integer(n_row)
data_list <- list(x = integer(n_row))

microbenchmark::microbenchmark(
  set_list_col = {
    for(i in 1:n_row){
      data_list$x[[i]] <- i*2L
    }
  },
  set_vector = {
    for(i in 1:n_row){
      x[[i]] <- i*2L
    }
  }
)
## Unit: milliseconds
##          expr     min       lq      mean  median      uq     max neval cld
##  set_list_col 15.0020 16.34730 16.946135 17.0637 17.3738 20.5493   100   b
##    set_vector  6.0714  6.94135  7.248512  7.2372  7.5281  8.8899   100  a

We were able to squeeze out a bit more than 2x extra speed with a plain base R vector. So finally we have a vector that can do the entire computation in about 7 ms, while the data.frame loop over the same 100,000 rows took around 5,800 ms, which makes the vector version roughly 800x faster. All you need to do is remember to ask: can we do this with a simpler data type?

This is the best change you can make to speed up your code. I prefer looping over a vector perhaps 90% of the time. In the cases where that is not possible, I like to convert the data.frame to a list, run the loop, and then convert it back to a data.frame, because a data.frame is itself just a list with some extra constraints. Remembering this will help you a lot in speeding up your code.
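
As a rough sketch of that workflow (df_to_modify and the doubling step are made up purely for illustration):

df_to_modify <- data.frame(x = 1:1e4, y = rnorm(1e4))

# drop down to a plain list, loop cheaply, then rebuild the data.frame
tmp_list <- as.list(df_to_modify)

for (i in seq_along(tmp_list$x)) {
  tmp_list$x[[i]] <- tmp_list$x[[i]] * 2L
}

df_to_modify <- as.data.frame(tmp_list)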

13.3 apply family

Use the apply family for concise and efficient code. Apply functions make your code shorter and more readable, and because you don't have to write the loop yourself you can focus on the task. Whenever possible you should use an apply function, because at times it can be more efficient than a loop. There are only 3 functions that you really need to know; you can get away with just these 3 to solve perhaps 90% of your tasks.

  1. lapply for almost everything
  2. apply for looping over rows
  3. mapply for looping over multiple vectors as an argument to a single function

There is vapply too, for strictly getting a vector in return. I use it rarely because you can unlist the result of lapply for the same effect. Map is nothing more than mapply with SIMPLIFY set to FALSE. Reduce can also be useful, but only in rare circumstances.
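
A quick sketch of the calling patterns on toy inputs (the inputs are made up just to show the shapes):

# lapply: loop over one vector and get a list back
lapply(1:3, function(i) i ^ 2)

# apply: loop over the rows (MARGIN = 1) of a matrix or data.frame
apply(matrix(1:6, nrow = 2), 1, sum)

# mapply: loop over several vectors in parallel
mapply(function(a, b) a + b, 1:3, 4:6)

# vapply pins down the return type; unlist(lapply(...)) gives the same values
vapply(1:3, function(i) i ^ 2, numeric(1))

# Map is mapply with SIMPLIFY = FALSE, so it always returns a list
Map(`+`, 1:3, 4:6)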

There are certain things you need to know about apply functions.

13.3.1 apply functions are not much faster than loops

Many people wrongly assume that using apply functions means your code is vectorized, which is not true at all. Apply functions are loops under the hood; they are meant for convenience, not for speed.

microbenchmark::microbenchmark(
  lapply = lapply(1:1e3, rnorm),
  forloop = for(i in 1:1e3){
      rnorm(i)
    },
  times = 10
)
## Unit: milliseconds
##     expr     min      lq     mean  median      uq     max neval cld
##   lapply 29.3617 30.5030 31.36278 31.1392 32.5946 33.3186    10  a 
##  forloop 30.6841 31.9347 33.87415 33.7539 34.1403 40.8678    10   b

I have tested this on bigger vectors and the results are almost identical; the difference is not large. lapply gives you an optimized loop to begin with, so prefer lapply wherever it fits, but don't be afraid to use a plain loop either, as the speed is mostly the same.

13.3.2 Nested lapply calls have the same speed as a normal lapply

# the %>% pipe used below comes from the magrittr package
microbenchmark::microbenchmark(
  nested = lapply(1:1e3, function(x){
    rnorm(x)
  }) %>% 
  lapply(function(x){
      sum(x)
  }),
  normal = lapply(1:1e3, function(x){
    rnorm(x) %>% 
      sum()
  }),
  times = 10
)
## Unit: milliseconds
##    expr     min      lq     mean   median      uq     max neval cld
##  nested 31.5218 32.7724 34.51463 33.60145 34.4416 44.8997    10   a
##  normal 33.1568 35.4852 36.95852 36.19680 37.4690 45.3356    10   a

As you can see, chaining multiple lapply calls doesn't slow the code down. However, it's not standard practice, and I would not recommend making your code harder to read by nesting multiple lapply calls.

So let me summarize: lapply is marginally better than a loop, but it's nowhere near the speed of vectorized code. Let's talk about the fastest way to speed up your code.

13.4 Vectorize your code

R is vectorized to the core. Most functions in R, even the comparison operators, work on whole vectors at once. This is a core strength of R. If you can break your task down into vectorized operations, you can make it faster even after adding more steps to it. Let's take an example.
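
Before that example, a tiny illustration of what "vectorized" means at the smallest scale: a single call handles the whole vector at once.

(1:5) > 3
## [1] FALSE FALSE FALSE  TRUE  TRUE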

dummy_text <- sample(
  x = letters,
  size = 1e3,
  replace = TRUE
)

dummy_category <- sample(
  x = c(1,2,3),
  size = 1e3,
  replace = TRUE
)

main_table <- data.frame(dummy_text, dummy_category)

Now this table has 1,000 pieces of text that I want to join into one big corpus per category. Anybody familiar with other programming languages like Python, Java, or C++ will reach for a loop to solve it. If you try that approach it might go like this.

join_text_norm <- function(df = main_table){
  
  text <- character(length(unique(df$dummy_category)))
  
  for(i in seq_along(df$dummy_category)){
    if ( df$dummy_category[[i]] == 1 ) {
        text[[1]] <- paste0(text[[1]], df$dummy_text[[i]])
    } else   if ( df$dummy_category[[i]] == 2 ) {
        text[[2]] <- paste0(text[[2]], df$dummy_text[[i]])
    } else {
      text[[3]] <- paste0(text[[3]], df$dummy_text[[i]])
    }
  }
  
  return(text)
  
}

join_text_norm()
## [1] "rusibzesfwgmqllolipftcofeuxkkobbrxlidqbllwptugacdikvvbtusgvlrbxirjvklkjbtbtktjzifmdzlkjvkxkdyoxculhlmaaqpesxgtrcbebxecunlejjzjrjzodqkkuvlvfsrtjrsmewprbujxipsxippyfsfdvwmmbpzmikggwcriwcccdraettaokrodmhhgsiixdsfhulqrunxqiobwgtddccrssdpggrqnzgwqlczeienlidqrjdpdojxcrjntplmljqmzdkqhwkjzvlyuqwsilluhyfdunxabxcreysmzvqfhfzoeflgfagvxgyxqnnnj"     
## [2] "urbgiagrmtzkasuejhsyvfstpyhgkkqaycrxjcrmraobjikfaioptlxmqkrouyyghtoodcsfobfsbutqfzfamheshzasgpyyddnfpllxolvmolhtprhkoysnwhacnhsngbhyuyupgnazwomqzifybzsiyetvpbmufpfpqsprjqxxjcrxaaqqmxzdtpsatprdkeehpgybjfdzujoupwjduuymhetgebwlmdwphqvhibsjmrvtfmbybgzswkupcntajjemowwnuwvmtucfzdflehtxppcoiuusbyivhwhwlwyairwwwakltzhgqtudqzjhplnqcxd"            
## [3] "hcvqtxyeewjjbumjuglemwzxaukocnevkiytzecsabofxldodrpwbqwikycaqfkuttzjrldzzffstmxrnktjspjannquhvxmlrcowxjqutlexbjjwafvhdkemxgnbabstlyjszvdzpzoeqcvpvmfbvnalsdgabllzbpcxxddclcpgqccsrdjsfnctwdmxcosujewacwfzokjprgqmqovjyodfzevrwsihahxmblwcysajbabnxtybqzcpfrfizhjybkmmkapbvzjqhbsfnjopbpuumqvslhbbiorlheylgugwpxpazccdlqsqiniltwagstftclqvlkypekvksx"

This is not the most optimized function, but it gets the job done. And I am breaking a golden rule here.

13.4.1 Never repeat a calculation

In the above code I could save some time by storing the current text and category in variables, and stop R from extracting them from the data.frame again and again.

join_text_saved <- function(
  df = main_table
  ){
  
  text <- character(length(unique(df$dummy_category)))

  for(i in seq_along(df$dummy_category)){
    curr_text <- df$dummy_text[[i]]
    curr_cat <- df$dummy_category[[i]]
    
    if (curr_cat  == 1 ) {
        text[[1]] <- paste0(text[[1]], curr_text)
    } else   if ( curr_cat == 2 ) {
        text[[2]] <- paste0(text[[2]], curr_text)
    } else {
      text[[3]] <- paste0(text[[3]], curr_text)
    }
  }
  
  return(text)
}

join_text_saved()
## [1] "rusibzesfwgmqllolipftcofeuxkkobbrxlidqbllwptugacdikvvbtusgvlrbxirjvklkjbtbtktjzifmdzlkjvkxkdyoxculhlmaaqpesxgtrcbebxecunlejjzjrjzodqkkuvlvfsrtjrsmewprbujxipsxippyfsfdvwmmbpzmikggwcriwcccdraettaokrodmhhgsiixdsfhulqrunxqiobwgtddccrssdpggrqnzgwqlczeienlidqrjdpdojxcrjntplmljqmzdkqhwkjzvlyuqwsilluhyfdunxabxcreysmzvqfhfzoeflgfagvxgyxqnnnj"     
## [2] "urbgiagrmtzkasuejhsyvfstpyhgkkqaycrxjcrmraobjikfaioptlxmqkrouyyghtoodcsfobfsbutqfzfamheshzasgpyyddnfpllxolvmolhtprhkoysnwhacnhsngbhyuyupgnazwomqzifybzsiyetvpbmufpfpqsprjqxxjcrxaaqqmxzdtpsatprdkeehpgybjfdzujoupwjduuymhetgebwlmdwphqvhibsjmrvtfmbybgzswkupcntajjemowwnuwvmtucfzdflehtxppcoiuusbyivhwhwlwyairwwwakltzhgqtudqzjhplnqcxd"            
## [3] "hcvqtxyeewjjbumjuglemwzxaukocnevkiytzecsabofxldodrpwbqwikycaqfkuttzjrldzzffstmxrnktjspjannquhvxmlrcowxjqutlexbjjwafvhdkemxgnbabstlyjszvdzpzoeqcvpvmfbvnalsdgabllzbpcxxddclcpgqccsrdjsfnctwdmxcosujewacwfzokjprgqmqovjyodfzevrwsihahxmblwcysajbabnxtybqzcpfrfizhjybkmmkapbvzjqhbsfnjopbpuumqvslhbbiorlheylgugwpxpazccdlqsqiniltwagstftclqvlkypekvksx"
microbenchmark::microbenchmark(
  join_text_norm(df = main_table),
  join_text_saved(df = main_table)
)
## Unit: milliseconds
##                              expr    min      lq     mean  median     uq    max
##   join_text_norm(df = main_table) 4.3522 5.05395 5.342287 5.26005 5.3965 8.7694
##  join_text_saved(df = main_table) 3.9887 4.62080 4.988805 4.84410 5.0288 8.8924
##  neval cld
##    100   b
##    100  a

We did not save much, but we still shaved off a fraction of a millisecond over just 1,000 iterations. Not repeating calculations is an excellent habit, especially when you compute the same things again and again.

Now coming back to the point. You could take the loop approach, just like every other programming language would. Or you can try a vectorized approach with the built-in paste0 function and its collapse argument.

collapsed_fun <- function(
  df = main_table
  ){
  # split by category, paste each group's text together, then flatten to a vector
  text <- df %>% 
    split(f = df$dummy_category) %>% 
    lapply(function(x)
      paste0(x$dummy_text, collapse = "")
    ) %>% 
    unlist()
  
  return(text)
}

collapsed_fun(main_table)
##                                                                                                                                                                                                                                                                                                                                                     1 
##      "rusibzesfwgmqllolipftcofeuxkkobbrxlidqbllwptugacdikvvbtusgvlrbxirjvklkjbtbtktjzifmdzlkjvkxkdyoxculhlmaaqpesxgtrcbebxecunlejjzjrjzodqkkuvlvfsrtjrsmewprbujxipsxippyfsfdvwmmbpzmikggwcriwcccdraettaokrodmhhgsiixdsfhulqrunxqiobwgtddccrssdpggrqnzgwqlczeienlidqrjdpdojxcrjntplmljqmzdkqhwkjzvlyuqwsilluhyfdunxabxcreysmzvqfhfzoeflgfagvxgyxqnnnj" 
##                                                                                                                                                                                                                                                                                                                                                     2 
##             "urbgiagrmtzkasuejhsyvfstpyhgkkqaycrxjcrmraobjikfaioptlxmqkrouyyghtoodcsfobfsbutqfzfamheshzasgpyyddnfpllxolvmolhtprhkoysnwhacnhsngbhyuyupgnazwomqzifybzsiyetvpbmufpfpqsprjqxxjcrxaaqqmxzdtpsatprdkeehpgybjfdzujoupwjduuymhetgebwlmdwphqvhibsjmrvtfmbybgzswkupcntajjemowwnuwvmtucfzdflehtxppcoiuusbyivhwhwlwyairwwwakltzhgqtudqzjhplnqcxd" 
##                                                                                                                                                                                                                                                                                                                                                     3 
## "hcvqtxyeewjjbumjuglemwzxaukocnevkiytzecsabofxldodrpwbqwikycaqfkuttzjrldzzffstmxrnktjspjannquhvxmlrcowxjqutlexbjjwafvhdkemxgnbabstlyjszvdzpzoeqcvpvmfbvnalsdgabllzbpcxxddclcpgqccsrdjsfnctwdmxcosujewacwfzokjprgqmqovjyodfzevrwsihahxmblwcysajbabnxtybqzcpfrfizhjybkmmkapbvzjqhbsfnjopbpuumqvslhbbiorlheylgugwpxpazccdlqsqiniltwagstftclqvlkypekvksx"

Let’s compare it with the loop approach.

microbenchmark::microbenchmark(
  join_text_norm(),
  join_text_saved(),
  collapsed_fun()
)
## Unit: microseconds
##               expr    min      lq     mean median      uq     max neval cld
##   join_text_norm() 4339.0 4843.35 5139.115 5086.5 5259.20  7970.6   100   b
##  join_text_saved() 4005.1 4409.35 4881.698 4687.7 4807.75 15375.6   100   b
##    collapsed_fun()  928.9 1077.45 1199.531 1153.2 1236.45  4802.2   100  a

The collapsed function is faster than every other approach on just 1,000 rows. Imagine doing it for 1 million; the gap in favor of the vectorized version only gets bigger.

The real reason is that vectorized code loops in optimized C under the hood, which is almost always faster than a loop written in R; at times it can be orders of magnitude faster. Thus you can sometimes get away with doing more work with vectors than with loops.

13.4.2 Vectorized code can do 2 or 3 extra steps in less time

There is a classic example of this that I read in the book Efficient R Programming, and I was amazed when I understood why it happens.

if_norm <- function(
  x,
  size
){
  y <- character(size)
  for(i in seq_len(size)){
    value <- x[[i]]
    if(value == 0){
      y[[i]] <- "zero"
    } else if(value == 1){
      y[[i]] <- "one"
    } else {
      y[[i]] <- "many"
    }
  }
  return(y)
}


if_vector <- function(
  x,
  size
){

  y <- character(size)
  
  y[1:size] <- "many"
  y[x == 1] <- "one"
  y[x == 0] <- "zero"
  
  return(y)
}

Both functions return the same vector. However, the loop version does the minimum amount of work, while the vectorized version performs redundant operations.

size <- 1e3

x <- sample(
  x = c(0,1,2),
  size = size,
  replace = TRUE
)

all.equal(
  if_norm(x, size),
  if_vector(x, size)
)
## [1] TRUE

Let's check the speed of both functions. Even though the vectorized solution takes 3 passes and overwrites some values just to get the answer, it is still faster, because vectorized code is so fast that you can get away with doing more and still save time. Think of vectorized code as Flash Gordon: it can be faster even when it's doing more than it should.

microbenchmark::microbenchmark(
  minimal = if_norm(x,size),
  vectorized = if_vector(x, size)
)
## Unit: microseconds
##        expr   min     lq    mean median     uq    max neval cld
##     minimal 167.8 174.30 190.271 189.35 196.85  262.4   100   b
##  vectorized  20.6  22.35  64.926  23.30  24.75 3979.3   100  a

13.5 Understanding non-vectorized code

There are times when you only need scalar values, and in those cases vectorized code is redundant. I see many people who don't understand the difference use vectorized operators inside a loop on scalar values, when non-vectorized code would have been more efficient. Let's check it with an example.

n_size <- 1e4

binary_df <- data.frame(
  x = sample(
    x = c(TRUE, FALSE),
    size = n_size,
    replace = TRUE
  ),
  y = sample(
    x = c(TRUE, FALSE),
    size = n_size,
    replace = TRUE
  ),
  z = sample(
    x = c(TRUE, FALSE),
    size = n_size,
    replace = TRUE
  )
)

Let's find all the rows where every variable is TRUE. The fastest method would be a vectorized solution like this:

all_true <- binary_df$x & binary_df$y & binary_df$z

But suppose you are computing the exact same thing inside a loop.

### vectorized code
vect_all_true <- function(
  df
){
  y <- logical(nrow(df))
  
  for(i in seq_along(df$x)){
    y[i] <-  df$x[i] & df$y[i] & df$z[i]
  }
  
  return(y)
}

### scalar code
scalar_all_true <- function(
  df
){
  y <- logical(nrow(df))
  
  for(i in seq_along(df$x)){
    y[[i]] <-  df$x[[i]] && df$y[[i]] && df$z[[i]]
  }
  
  return(y)
}
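
The only difference between the two functions is the operator: & is the vectorized form that works element-wise on whole vectors, while && works on single values and short-circuits as soon as the answer is known. A tiny sketch of the distinction (purely illustrative):

c(TRUE, FALSE) & c(TRUE, TRUE)    # vectorized: returns TRUE FALSE
TRUE && FALSE                     # scalar: returns a single FALSE
FALSE && stop("never evaluated")  # short-circuits, the right-hand side is skipped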

Let’s compare both the functions for speed.

microbenchmark::microbenchmark(
  vect_all_true(df = binary_df),
  scalar_all_true(df = binary_df),
  times = 10
)
## Unit: milliseconds
##                             expr     min      lq     mean   median      uq
##    vect_all_true(df = binary_df) 22.7146 23.4207 25.31963 24.85225 25.0116
##  scalar_all_true(df = binary_df) 11.7749 12.3177 13.18605 12.58995 12.8907
##      max neval cld
##  34.2915    10   b
##  17.1805    10  a

I know this is not a great example, because it can easily be vectorized, but in cases where you really are working on individual scalar values, non-vectorized code gives you speed. In our example we get about twice the speed, which matters even on such a small data set, and the difference grows with the number of rows.

13.6 Do as little as possible inside a loop

R is an interpreted language: everything you write inside a loop runs on every iteration. The best thing you can do is to be parsimonious about the code inside a loop. There are a number of steps you can take to speed a loop up a bit more.

  1. Calculate anything that doesn't change before the loop (see the sketch after this list)
  2. Initialize objects before the loop
  3. Iterate over as few elements as possible
  4. Call as few functions inside the loop as possible
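
A minimal sketch of the first two points (center_slow and center_fast are made-up names): hoist anything that does not change out of the loop and pre-allocate the result.

x <- rnorm(1e4)

# slower: mean(x) is recomputed on every single iteration
center_slow <- function(x) {
  out <- numeric(length(x))
  for (i in seq_along(x)) {
    out[[i]] <- x[[i]] - mean(x)
  }
  out
}

# faster: the mean is calculated once, before the loop
center_fast <- function(x) {
  out <- numeric(length(x))
  m <- mean(x)
  for (i in seq_along(x)) {
    out[[i]] <- x[[i]] - m
  }
  out
}

microbenchmark::microbenchmark(
  slow = center_slow(x),
  fast = center_fast(x),
  times = 10
)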

The main tip is to get out of the loop as quickly as possible. There is another crucial thing you can do to speed up your code.

13.6.1 Combine vectorized code inside a loop

The best approach is to figure out which parts of the code can be optimized with a vectorized solution and which parts genuinely require a loop. The key is to use as few loops as possible and as much vectorized code as possible. This is the same thinking that helps when parallelizing code, too.

n_size <- 1e5

hr_df <- data.table::data.table(
  department = sample(
    x = letters[1:5],
    size = n_size,
    replace = TRUE
  ),
  salary = sample(
    x = 1e3:1e4,
    size = n_size,
    replace = TRUE
  )
)

Let's try to find out how much money each department is paying in salaries.

sum_salary <- function(
  df
){

  answer <- list()

  # loop only over the handful of unique departments; the filter and sum inside are vectorized
  for(dep in unique(df$department)){

    value <- df[
      department == dep,
      sum(salary, na.rm = TRUE)
    ]

    answer[[dep]] <- value

  }

  return(answer)

}

microbenchmark::microbenchmark(
  sum_salary(df = hr_df)
)
## Unit: milliseconds
##                    expr   min     lq     mean median      uq     max neval
##  sum_salary(df = hr_df) 7.876 8.4403 9.346993 8.7304 9.10345 23.2677   100

I am doing as much vectorized computation as possible in this scenario, and that is why this code runs pretty fast. If I wrote a loop that walks through all 100,000 rows one by one, it would be around 100x slower. This is a neat trick you should use whenever you can: do as little as possible inside a loop.
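
As a side note, when the grouping itself can be vectorized, data.table can express the whole aggregation in one grouped call; a sketch of that alternative (not benchmarked here):

# one grouped, vectorized call instead of looping over departments in R
hr_df[, .(total_salary = sum(salary, na.rm = TRUE)), by = department]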

13.7 Conclusion

This chapter focused mainly on loops and how to optimize them. Loops are necessary, and these tips will help you run them better:

  1. Vectorize your code
  2. You can do more with vectorization and still be faster than a loop
  3. Use vectorized and scalar code with care
  4. Combine vectorized code with loops to gain maximum power
  5. Initialize your objects before the loop
  6. Use simpler data types inside loops
  7. apply functions are not faster than loops
  8. Nested apply functions don't necessarily mean slower code, but you should avoid them for readability
  9. Don't repeat the same calculation
  10. Cache or save intermediate results
  11. Run garbage collection for heavy calculations