Chapter 14 For Loops
This is a topic I have wanted to discuss for a long time. People read blog posts from 2014-2016 and assume that for loops in R are bad, that you should never use them, that loops in R are slow, and so on. This chapter will help you understand how to use them more effectively.
R loops are not especially slow compared to other interpreted languages like Python or Ruby, but yes, they are slower than vectorized code. You can often get faster code by doing more work through vectorization than through a simple loop.
14.1 initialize objects before loops
Create the vectors that will store your results before the loop starts. Because the memory is allocated up front, the loop itself runs a lot faster, and creating a vector is a single vectorized C call, so it is cheap. R has a few functions to create the type of vector you need: integer, numeric, character and logical are the most common ones, and a numeric vector can also hold Date values. It is always beneficial to start with a pre-allocated vector to store the values.
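A minimal sketch of the difference, assuming we want to fill n values (growing the vector with c() inside the loop is the slow pattern to avoid):
n <- 1e4

# slow: the result vector is copied and re-allocated on every iteration
grown <- c()
for (i in 1:n) {
  grown <- c(grown, i * 2L)
}

# fast: allocate the full length once, then fill the slots in place
filled <- integer(n)
for (i in 1:n) {
  filled[[i]] <- i * 2L
}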
14.2 use simple data-types
Data types are the most common reason people don’t get speed out of R. If you run a loop over a data.frame, R has to check the constraints of a data.frame on every modification, such as all columns having the same length, to make sure you are not corrupting the structure, and it also creates a copy on each modification. The same code can be hundreds of times faster if we just use a simple list.
The data.table package provides an interface to set values inside a data.table without creating a copy, which makes it faster for most use cases. Let’s compare how fast it is.
set_dt_num <- function(
  data_table,
  n_row
) {
  for(i in 1:n_row){
    data.table::set(
      x = data_table,
      i = i,
      j = 1L,
      value = i * 2L
    )
  }
}
set_dt_col <- function(
  data_table,
  n_row
) {
  for(i in 1:n_row){
    data.table::set(
      x = data_table,
      i = i,
      j = "x",
      value = i * 2L
    )
  }
}
n_row <- 1e3

data_table <- data.table::data.table(x = integer(n_row))
data_frame <- data.frame(x = integer(n_row))
microbenchmark::microbenchmark(
  set_df_col = {
    for(i in 1:n_row){
      data_frame$x[[i]] <- i * 2L
    }
  },
  set_dt_num = set_dt_num(data_table, n_row),
  set_dt_col = set_dt_col(data_table, n_row),
  times = 10
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## set_df_col 10.267001 10.5887 11.218911 11.138451 11.767000 12.431101 10
## set_dt_num 3.045801 3.0838 3.822981 3.623002 3.837501 6.831201 10
## set_dt_col 3.152602 3.4841 3.922271 3.580301 3.903402 7.084401 10
This code used to give me around a 200x speed-up over base R in earlier versions. From R 4.0 onward, R manages memory much more efficiently, so base R performs better in this test. Let me try it on a larger data set.
n_row <- 1e5

data_table <- data.table::data.table(x = integer(n_row))
data_frame <- data.frame(x = integer(n_row))
microbenchmark::microbenchmark(
  set_df_col = {
    for(i in 1:n_row){
      data_frame$x[[i]] <- i * 2L
    }
  },
  set_dt_num = set_dt_num(data_table, n_row),
  set_dt_col = set_dt_col(data_table, n_row),
  times = 10
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## set_df_col 5348.5332 5490.4668 5675.8209 5587.4427 5923.5778 6119.5538 10
## set_dt_num 313.2469 325.0008 358.3290 353.7714 385.9434 414.7643 10
## set_dt_col 326.3563 337.5951 385.5643 351.7514 436.3054 493.8522 10
Now we can see some improvement over base R. On a bigger data set, data.table gives us roughly a 10x or better speed-up. This was just to establish that data.table performs well on bigger data sets. Yet we can still get better performance by moving to a lower-level data structure.
n_row <- 1e5

data_table <- data.table::data.table(x = integer(n_row))
data_list <- list(x = integer(n_row))
microbenchmark::microbenchmark(
  set_list_col = {
    for(i in 1:n_row){
      data_list$x[[i]] <- i * 2L
    }
  },
  set_dt_num = set_dt_num(data_table, n_row),
  set_dt_col = set_dt_col(data_table, n_row),
  times = 10
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## set_list_col 23.9616 24.4937 28.42005 26.80065 28.8722 38.2675 10
## set_dt_num 311.7625 314.1495 322.61970 323.88640 327.3298 337.5023 10
## set_dt_col 329.6029 337.0087 364.85992 347.76210 363.7729 523.5365 10
Just by moving from a data.frame to a list we get a substantial improvement: the list loop finishes in roughly 27 ms, more than 10x faster than data.table and around 200x faster than the earlier data.frame loop, which is huge. But wait, we can do better. We haven’t tried the most atomic structure in R: the vector. Let’s benchmark it again with a vector.
n_row <- 1e5

x <- integer(n_row)
data_list <- list(x = integer(n_row))
microbenchmark::microbenchmark(
  set_list_col = {
    for(i in 1:n_row){
      data_list$x[[i]] <- i * 2L
    }
  },
  set_vector = {
    for(i in 1:n_row){
      x[[i]] <- i * 2L
    }
  }
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## set_list_col 23.843401 24.54130 26.79498 25.82325 27.28535 38.4216 100
## set_vector 9.725901 10.12715 11.29298 10.69910 11.14740 25.6903 100
We were able to squeeze out another 2x or more with a plain base R vector. So finally we have a vector that does the entire computation in roughly 11 ms, while the worst data.frame version needed over 5 seconds, making our code around 500x faster. All you need to do is keep asking: can we do this with a simpler data type?
This is the best change you can make to your code to make it run faster. I prefer looping over a vector about 90% of the time. In the cases where that’s not possible, I like to convert the data.frame to a list, run the loop, and then convert it back to a data.frame, because a data.frame is itself just a list with some constraints. Remembering this will help you a lot in speeding up your code.
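A minimal sketch of that round trip, with made-up names, assuming a data.frame df whose column x we want to fill element by element:
df <- data.frame(x = integer(10))

# drop down to a plain list, loop over it, then rebuild the data.frame
tmp <- as.list(df)
for (i in seq_along(tmp$x)) {
  tmp$x[[i]] <- i * 2L
}
df <- as.data.frame(tmp)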
14.3 apply family
Use the apply family for concise and readable code. apply functions make your code shorter, and because you don’t have to write the loop yourself, you can focus on the task. Whenever possible you should use an apply function; there are times when it can be more efficient than a loop. There are only 3 functions you really need to know, and they will cover roughly 90% of your tasks.
- lapply for almost everything
- apply for looping over rows
- mapply for looping over multiple vectors passed as arguments to a single function
There is also vapply for strictly getting a vector in return. I have used it very rarely, because you can unlist the result of lapply for the same effect. Map is nothing more than mapply with the SIMPLIFY parameter set to FALSE. Reduce can also be used, but only in rare circumstances.
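A quick sketch of the three in action, using toy data just to show the shape of each call:
# lapply: apply a function to every element of a vector or list, get a list back
lengths_list <- lapply(c("a", "bb", "ccc"), nchar)

# apply: loop over the rows (MARGIN = 1) or columns (MARGIN = 2) of a matrix
row_sums <- apply(matrix(1:6, nrow = 2), 1, sum)

# mapply: loop over several vectors in parallel, feeding them to one function
pasted <- mapply(function(a, b) paste0(a, b), letters[1:3], 1:3)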
There are certain things you need to know about apply functions.
14.3.1 apply functions are not much faster than loops
Many people wrongly assume that using apply functions means their code is vectorized, which is not true at all. Apply functions are loops under the hood; they are meant for convenience, not for speed.
microbenchmark::microbenchmark(
  lapply = lapply(1:1e3, rnorm),
  forloop = for(i in 1:1e3){
    rnorm(i)
  },
  times = 10
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## lapply 48.1757 50.2958 52.84344 51.1705 52.7505 70.8977 10
## forloop 49.9527 50.4819 52.87576 52.3894 55.2797 58.3301 10
I have tested this on bigger vectors and the results are almost identical; the difference is negligible. lapply gives you an optimized loop to begin with, so prefer lapply wherever possible, but don’t be afraid to use a loop either, since the speed is mostly the same.
14.3.2 Nested lapply calls have the same speed as a normal lapply
microbenchmark::microbenchmark(
  nested = lapply(1:1e3, function(x){
    rnorm(x)
  }) %>%
    lapply(function(x){
      sum(x)
    }),
  normal = lapply(1:1e3, function(x){
    rnorm(x) %>%
      sum()
  }),
  times = 10
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## nested 51.1481 54.0475 55.71928 54.53290 56.1110 66.4323 10
## normal 54.4771 56.5014 57.85445 57.09725 58.1304 63.5937 10
As you can see, nesting multiple lapply calls doesn’t slow the code down. However, it’s not standard practice, and I would not recommend making your code harder to read by nesting multiple lapply functions.
So let me summarize: lapply is better than loops, but it’s nowhere near the speed of vectorized code. Let’s talk about the fastest way to speed up your code.
14.4 Vectorize your code
R is vectorized to the core. Most functions in R are vectorized; even the comparison operators are. This is a core strength of R. If you can break your task down into vectorized operations, you can make it faster even when that adds more steps.
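A tiny illustration of what vectorized means here, with throwaway values chosen only for the shape of the call:
v <- c(2, 5, 9)
v > 4     # compares every element at once: FALSE TRUE TRUE
v * 2L    # arithmetic works on the whole vector too
Now let’s take a bigger example.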
dummy_text <- sample(
  x = letters,
  size = 1e3,
  replace = TRUE
)

dummy_category <- sample(
  x = c(1,2,3),
  size = 1e3,
  replace = TRUE
)

main_table <- data.frame(dummy_text, dummy_category)
Now this table has 1000 pieces of text that I want to join into one large corpus per category. Anybody familiar with other programming languages like Python, Java or C++ will reach for a loop that can solve it. That approach might go like this.
join_text_norm <- function(df = main_table){

  text <- character(length(unique(df$dummy_category)))

  for(i in seq_along(df$dummy_category)){
    if ( df$dummy_category[[i]] == 1 ) {
      text[[1]] <- paste0(text[[1]], df$dummy_text[[i]])
    } else if ( df$dummy_category[[i]] == 2 ) {
      text[[2]] <- paste0(text[[2]], df$dummy_text[[i]])
    } else {
      text[[3]] <- paste0(text[[3]], df$dummy_text[[i]])
    }
  }
  return(text)
}
join_text_norm()
## [1] "akwnggdwpcjfrtqnmnxqmpdlmnqasxgmpdszivvsynzgzuxbjltfremtmxvwayattlwsubqiegrsmmewjjiflewkcevhuxkkabbjyreobilalktotxdodkwvquqkrgovirrjyaqfbytztkrtdsfbzsfnmtjzsjaysgxumvxmnaraqscolrlxoxvuullhtrwfkvszodwqlgnnfsstyjszgevyoqgppfnaispqyfbbcxzdsciigpqmwcrpjcpsrjxcttokcaodsmuhjhedtkprgqrvnjgoqybwjdcowadsumdgmeabzqfilrcmrvhnylbmkqjoygskcjtjpqubjq"
## [2] "chiuvtpjyjqnkjzpluaouwqbzexassqintwbpwxjzrktfzmfdzfnfhwgyhyrsrcvgtawlzeaumuugsmcovtpnyekgkkqoyrxiiczmdqjbwpbugacpbvbtckkuytbotjzdifofdflskutqxzfavkkpdncqllhdfexgtcbxemtnelxzjjaojndamxbaljrmhwydpevxisbxmpgsblxmdmbtmckpvfcidscqfncdpxedramaoqgixzwdsftzojdueqyxozevgzudwsiujcuslpgwhqctgwsbbptycvpbsizjifrmmdpdbjnlnjopdb"
## [3] "pxplgtnqrikgjvkxaetnukykzvqyzaxgjznmmmespebewlkqueelsteovungkttlflahuibzxftgqylkoslbpjejhytzxafsfuocrxlrtacsfaioplfdlxirmswlbxikvlkjtugjhlczomfztbmjrkheshxaopnhyuydlppweblcjqoejpjhbjryhkkkhuvecvhbsnrtsbprbuuzpuzjcnvmazwvoldzfaidfvylzpwpzyelbgugcrsfwrqajcwtxxrqmsxiacpswfahprkelqmhoupdfibrftdhjphrxbyrnyezblmjadwngxhqhwqfezetjidbzxwkrapbvzhsmlfpmzp"
This is not the most optimized function, but it gets the job done. And I am breaking a golden rule here.
14.4.1 never repeat a calculation
In the above code I could save some time by storing the current text and category in variables, and stop R from computing them again and again.
join_text_saved <- function(
  df = main_table
){

  text <- character(length(unique(df$dummy_category)))

  for(i in seq_along(df$dummy_category)){
    curr_text <- df$dummy_text[[i]]
    curr_cat  <- df$dummy_category[[i]]

    if ( curr_cat == 1 ) {
      text[[1]] <- paste0(text[[1]], curr_text)
    } else if ( curr_cat == 2 ) {
      text[[2]] <- paste0(text[[2]], curr_text)
    } else {
      text[[3]] <- paste0(text[[3]], curr_text)
    }
  }
  return(text)
}
join_text_saved()
## [1] "akwnggdwpcjfrtqnmnxqmpdlmnqasxgmpdszivvsynzgzuxbjltfremtmxvwayattlwsubqiegrsmmewjjiflewkcevhuxkkabbjyreobilalktotxdodkwvquqkrgovirrjyaqfbytztkrtdsfbzsfnmtjzsjaysgxumvxmnaraqscolrlxoxvuullhtrwfkvszodwqlgnnfsstyjszgevyoqgppfnaispqyfbbcxzdsciigpqmwcrpjcpsrjxcttokcaodsmuhjhedtkprgqrvnjgoqybwjdcowadsumdgmeabzqfilrcmrvhnylbmkqjoygskcjtjpqubjq"
## [2] "chiuvtpjyjqnkjzpluaouwqbzexassqintwbpwxjzrktfzmfdzfnfhwgyhyrsrcvgtawlzeaumuugsmcovtpnyekgkkqoyrxiiczmdqjbwpbugacpbvbtckkuytbotjzdifofdflskutqxzfavkkpdncqllhdfexgtcbxemtnelxzjjaojndamxbaljrmhwydpevxisbxmpgsblxmdmbtmckpvfcidscqfncdpxedramaoqgixzwdsftzojdueqyxozevgzudwsiujcuslpgwhqctgwsbbptycvpbsizjifrmmdpdbjnlnjopdb"
## [3] "pxplgtnqrikgjvkxaetnukykzvqyzaxgjznmmmespebewlkqueelsteovungkttlflahuibzxftgqylkoslbpjejhytzxafsfuocrxlrtacsfaioplfdlxirmswlbxikvlkjtugjhlczomfztbmjrkheshxaopnhyuydlppweblcjqoejpjhbjryhkkkhuvecvhbsnrtsbprbuuzpuzjcnvmazwvoldzfaidfvylzpwpzyelbgugcrsfwrqajcwtxxrqmsxiacpswfahprkelqmhoupdfibrftdhjphrxbyrnyezblmjadwngxhqhwqfezetjidbzxwkrapbvzhsmlfpmzp"
microbenchmark::microbenchmark(
  join_text_norm(df = main_table),
  join_text_saved(df = main_table)
)
## Unit: milliseconds
## expr min lq mean median uq
## join_text_norm(df = main_table) 6.290101 6.889651 7.455970 7.189351 7.535502
## join_text_saved(df = main_table) 5.745501 6.085952 6.685338 6.473151 7.051451
## max neval
## 15.2699 100
## 11.5883 100
We did not save much, but we still shaved off about a millisecond over just 1000 iterations. Not repeating a calculation is an excellent habit, especially when you compute several things again and again.
Now coming back to the point. You could take the loop approach, just like every other programming language would. Or you can take a vectorized approach with the built-in paste0 function and its collapse argument.
collapsed_fun <- function(
  df = main_table
){
  text <- df %>%
    split(f = dummy_category) %>%
    lapply(function(x)
      paste0(x$dummy_text, collapse = "")
    ) %>%
    unlist()
  return(text)
}
collapsed_fun(main_table)
## 1
## "akwnggdwpcjfrtqnmnxqmpdlmnqasxgmpdszivvsynzgzuxbjltfremtmxvwayattlwsubqiegrsmmewjjiflewkcevhuxkkabbjyreobilalktotxdodkwvquqkrgovirrjyaqfbytztkrtdsfbzsfnmtjzsjaysgxumvxmnaraqscolrlxoxvuullhtrwfkvszodwqlgnnfsstyjszgevyoqgppfnaispqyfbbcxzdsciigpqmwcrpjcpsrjxcttokcaodsmuhjhedtkprgqrvnjgoqybwjdcowadsumdgmeabzqfilrcmrvhnylbmkqjoygskcjtjpqubjq"
## 2
## "chiuvtpjyjqnkjzpluaouwqbzexassqintwbpwxjzrktfzmfdzfnfhwgyhyrsrcvgtawlzeaumuugsmcovtpnyekgkkqoyrxiiczmdqjbwpbugacpbvbtckkuytbotjzdifofdflskutqxzfavkkpdncqllhdfexgtcbxemtnelxzjjaojndamxbaljrmhwydpevxisbxmpgsblxmdmbtmckpvfcidscqfncdpxedramaoqgixzwdsftzojdueqyxozevgzudwsiujcuslpgwhqctgwsbbptycvpbsizjifrmmdpdbjnlnjopdb"
## 3
## "pxplgtnqrikgjvkxaetnukykzvqyzaxgjznmmmespebewlkqueelsteovungkttlflahuibzxftgqylkoslbpjejhytzxafsfuocrxlrtacsfaioplfdlxirmswlbxikvlkjtugjhlczomfztbmjrkheshxaopnhyuydlppweblcjqoejpjhbjryhkkkhuvecvhbsnrtsbprbuuzpuzjcnvmazwvoldzfaidfvylzpwpzyelbgugcrsfwrqajcwtxxrqmsxiacpswfahprkelqmhoupdfibrftdhjphrxbyrnyezblmjadwngxhqhwqfezetjidbzxwkrapbvzhsmlfpmzp"
Let’s compare it with the loop approach.
microbenchmark::microbenchmark(
  join_text_norm(),
  join_text_saved(),
  collapsed_fun()
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## join_text_norm() 6.353301 6.641401 7.238728 7.070050 7.397151 12.855901 100
## join_text_saved() 5.706501 5.964101 6.506474 6.430801 6.640600 11.475901 100
## collapsed_fun() 1.500201 1.706001 1.887568 1.792001 1.930351 7.804401 100
The collapsed function is faster than all the other approaches for just 1000 rows. Imagine doing this for 1 million. In those cases a vectorized function can be around 1000 times faster than a loop.
The real reason is that vectorized code uses optimized C loops under the hood, which are almost always faster than loops written in R. That is why you can sometimes get away with doing more with vectors than with loops.
14.4.2 Vectorized code can do 2 or 3 steps more in less time
There is a classic example that I read in the book Efficient R Programming, and I was amazed to see why it happened.
if_norm <- function(
  x,
  size
){
  y <- character(size)
  for(i in 1:size){
    value <- x[[i]]
    if(value == 0){
      y[[i]] <- "zero"
    } else if(value == 1){
      y[[i]] <- "one"
    } else {
      y[[i]] <- "many"
    }
  }
  return(y)
}
if_vector <- function(
  x,
  size
){
  y <- character(size)

  y[1:size] <- "many"
  y[x == 1] <- "one"
  y[x == 0] <- "zero"

  return(y)
}
Both functions return the same vector. The normal function does the minimum work per element, while the vectorized function does redundant passes over the data.
size <- 1e3

x <- sample(
  x = c(0,1,2),
  size = size,
  replace = TRUE
)

all.equal(
  if_norm(x, size),
  if_vector(x, size)
)
## [1] TRUE
Let’s check the speed of both functions. Even though the vectorized solution takes 3 passes and overwrites some values just to get the answer, it is still faster, because vectorized code is so fast that you can afford to do more and still save time. Think of vectorized code as Flash Gordon: it can be faster even when it’s doing more than it should.
microbenchmark::microbenchmark(
  minimal = if_norm(x, size),
  vectorized = if_vector(x, size)
)
## Unit: microseconds
## expr min lq mean median uq max neval
## minimal 265.301 269.501 288.16599 273.4010 290.3510 440.201 100
## vectorized 28.502 30.101 93.47598 31.6005 32.8015 6133.701 100
14.5 Understanding non-vectorized code
There are times when you only need scalar values, and in those cases vectorized code is redundant. I see many people who don’t understand the difference use vectorized operators inside a loop on scalar values, when the non-vectorized versions would have been more efficient. Let’s check it with an example.
n_size <- 1e4

binary_df <- data.frame(
  x = sample(
    x = c(TRUE, FALSE),
    size = n_size,
    replace = TRUE
  ),
  y = sample(
    x = c(TRUE, FALSE),
    size = n_size,
    replace = TRUE
  ),
  z = sample(
    x = c(TRUE, FALSE),
    size = n_size,
    replace = TRUE
  )
)
Let’s find all the rows where every variable is TRUE. The fastest method is a vectorized solution like this:
all_true <- binary_df$x & binary_df$y & binary_df$z
But suppose you are using it inside a loop to find the exact same thing.
### vectorized code
vect_all_true <- function(
  df
){
  y <- logical(nrow(df))

  for(i in seq_along(df$x)){
    y[i] <- df$x[i] & df$y[i] & df$z[i]
  }

  return(y)
}
### scalar code
scalar_all_true <- function(
  df
){
  y <- logical(nrow(df))

  for(i in seq_along(df$x)){
    y[[i]] <- df$x[[i]] && df$y[[i]] && df$z[[i]]
  }

  return(y)
}
Let’s compare both the functions for speed.
microbenchmark::microbenchmark(
  vect_all_true(df = binary_df),
  scalar_all_true(df = binary_df),
  times = 10
)
## Unit: milliseconds
## expr min lq mean median uq
## vect_all_true(df = binary_df) 32.2671 34.5365 36.24843 35.39615 36.8008
## scalar_all_true(df = binary_df) 17.7967 19.0181 21.86647 22.37185 22.9407
## max neval
## 43.8001 10
## 26.7061 10
I know this is not the best example, because the task can be vectorized easily, but when you really are working on individual scalar values, non-vectorized code gives you speed. In our example we get about twice the speed on a fairly small data set, and the difference grows with the number of rows.
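The reason, as a small sketch: && works on single values and short-circuits, so the remaining conditions are never evaluated once the answer is known, whereas & builds full vectors and always evaluates every operand.
slow_check <- function() { Sys.sleep(0.1); TRUE }

FALSE && slow_check()   # returns FALSE immediately; slow_check() never runs
FALSE &  slow_check()   # evaluates slow_check() anyway, then returns FALSE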
14.6 Do as little as possible inside a loop
R is an interpreted language: everything you write inside a loop runs on every iteration. The best thing you can do is be parsimonious with the code inside a loop. There are a number of steps you can take to speed up a loop a bit more (see the sketch after this list).
- Calculate results before the loop
- Initialize objects before the loop
- Iterate over as few elements as possible
- Call as few functions inside the loop as possible
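A small sketch of the first two points, with made-up names: an invariant calculation is hoisted out of the loop and the result is pre-allocated, so the loop body does as little as possible:
vals <- runif(1e4)

# wasteful: mean(vals) is recomputed on every single iteration
out <- numeric(length(vals))
for (i in seq_along(vals)) {
  out[i] <- vals[i] - mean(vals)
}

# better: compute the invariant once, outside the loop, and reuse it
centre <- mean(vals)
out <- numeric(length(vals))
for (i in seq_along(vals)) {
  out[i] <- vals[i] - centre
}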
The main tip is to get out of the loop as quickly as possible. There is another very crucial thing you can do to speed up your code.
14.6.1 Combine Vectorized code inside a loop
The best approach is to figure out which parts of the code can be optimized with a vectorized solution and which parts genuinely require a loop. The key is to use as few loops and as much vectorized code as possible. This is the same idea that helps when parallelizing code, too.
n_size <- 1e5

hr_df <- data.table::data.table(
  department = sample(
    x = letters[1:5],
    size = n_size,
    replace = TRUE
  ),
  salary = sample(
    x = 1e3:1e4,
    size = n_size,
    replace = TRUE
  )
)
Let’s try to find out how much money each department is paying in salary.
sum_salary <- function(
  df
){

  answer <- list()

  for(dep in unique(df$department)){
    value <- df[
      department == dep,
      sum(salary, na.rm = TRUE)
    ]

    answer[[dep]] <- value
  }

  return(answer)
}
microbenchmark::microbenchmark(
  sum_salary(df = hr_df)
)
## Unit: milliseconds
## expr min lq mean median uq max
## sum_salary(df = hr_df) 12.7145 13.54815 14.39076 13.9214 14.26445 32.7294
## neval
## 100
I am doing as much vectorized calculation as possible in this scenario, and that is why this code runs pretty fast. If I wrote a loop that walked over all 1e5 rows individually, it would be around 100x slower. This is a neat trick you should use whenever you can: do as little as possible inside a loop.
14.7 Conclusion
This chapter mainly focused on loops and how to optimize them. Loops are necessary, and these tips will help you run them better:
- Vectorize your code
- You can do more with vectorization and still be faster than a loop
- Use vectorized and scalar code with care
- Combine vectorized code with loops to gain maximum power
- Initialize your objects before the loop
- Use simpler data types inside the loop
- apply functions are not faster than loops
- Nested apply functions don’t necessarily mean slower code, but you should avoid them
- Don’t repeat the same calculation
- Cache or save the results
- Run garbage collection for heavy calculations (see the sketch below)
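The last point was not demonstrated in this chapter; a minimal sketch of the idea, with a made-up object name:
big_obj <- rnorm(1e7)   # some heavy intermediate result
rm(big_obj)             # drop it as soon as it is no longer needed
gc()                    # trigger garbage collection to reclaim the memory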