Chapter 12 Release Memory

Sometimes, even after applying all the best techniques to save memory, you still need more space. In those cases R gives you 2 techniques, and they are useful in different contexts. I mostly recommend using gc() inside a loop and rm() inside an ETL script or a big function.

12.1 Use rm()

The rm() function deletes objects from your R environment so that the garbage collector can reclaim their memory for further operations. It’s excellent for cases where you have a lot of intermediate results saved inside an R script that runs as an ETL job, like this:

## compute the median of mtcars$mpg by hand
## (mtcars has 32 rows, an even count, so we average the two middle values)
value <- length(mtcars$mpg)
sorted_mpg <- sort(mtcars$mpg)
median_values <- sorted_mpg[value/2] + sorted_mpg[(value/2) + 1]
median_mpg <- median_values / 2

## the intermediate objects are no longer needed, so remove them
rm(list = c("value", "sorted_mpg", "median_values"))

In scenarios like this, where you are running a script and storing intermediate results for further processing, it’s fine to use the rm() function. (Note that rm() only removes the binding; the memory itself is reclaimed at the next garbage collection, which R triggers automatically when it needs space.) However, when you wrap the same logic inside a function, only the final result survives and everything else is discarded once the function returns, so you don’t need rm() at all. The only exception is when your function is huge. I once wrote an ETL script where I composed multiple functions into a bigger one that ran as an ETL job. The data created there was huge and it crashed my RAM every time, so I had to modify it to delete each object as soon as I was done with it.

# ## Bad style: all three datasets sit in RAM at the same time, so you
# ## will always need more RAM than the computation actually requires.
# 
# bad_func <- function(){
#   
#   a <- load_huge_datasets("some file")
#   b <- load_huge_datasets("some other file or db")
#   c <- load_huge_datasets("some other file or db")
#   
#   results_a <- long_computations(a)
#   results_b <- other_long_computations(b)
#   results_c <- other_long_computations(c)
#   
#   final_results <- results_a + results_b + results_c
#   
#   return(
#     final_results
#   )
# }
# 
# 
# ## Good style: only one dataset is in RAM at a time; each object is
# ## removed as soon as we are done with it, making room for the next one.
# 
# good_func <- function(){
#   
#   a <- load_huge_datasets("some file")
#   results_a <- long_computations(a)
#   rm(a)
# 
#   b <- load_huge_datasets("some other file or db")
#   results_b <- other_long_computations(b)
#   rm(b)
#   
#   c <- load_huge_datasets("some other file or db")
#   results_c <- other_long_computations(c)
#   rm(c)
#   
#   final_results <- results_a + results_b + results_c
#   
#   return(
#     final_results
#   )
#   
# }

Read both functions above. If you are ever faced with a situation where RAM is crucial, make a habit of writing code like good_func, where rm() is used inside the function to make room for more objects.

I would not advise anyone to use this function unless you are writing an ETL script on data large enough that you need the space back after every step. Even in those scenarios you can often get away with not storing intermediate results at all, as the sketch below shows. But if you must, and only in those very rare cases, this function should be used. In broader cases R’s garbage collection will have you covered.
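For instance, the median calculation from earlier needs no intermediate objects at all once you lean on base R’s median(); with nothing left lying around, there is nothing for rm() to do:

## the median example from above, with no intermediates to rm()
median_mpg <- median(mtcars$mpg)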

12.2 Use gc()

Garbage collection is a process by which a program returns memory to the OS or frees it for its own usage. R has significantly improved its memory management in recent years. Most of the blog posts you read about R being slow and memory hungry were written during 2014-2017; they aren’t true anymore.

12.2.1 R version 3.5

In R version 3.5 ALTREP (alternative representations) was introduced, which saves a lot of memory in special cases like loops over sequences. Packages like vroom achieved significant speed-ups because they no longer have to load all the data into memory and can load only as much as they need. If you want to read more about it you should visit this page: https://svn.r-project.org/R/branches/ALTREP/ALTREP.html.
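You can see ALTREP at work yourself. Since R 3.5 a sequence like 1:n is stored compactly (essentially just its endpoints), so even a huge vector costs almost nothing until its values are actually materialized:

x <- 1:1e9            ## stored as a compact ALTREP sequence, not a billion integers
print(object.size(x)) ## a few hundred bytes, nowhere near the ~4 GB a full vector needs
head(x)               ## values are generated on demand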

12.2.2 R version 4.0

In R version 4.0 reference counting was introduced. Previously R kept track of how many names pointed to an object with the NAMED mechanism, which had only 3 values: 0, 1 and 2, where 2 effectively meant “possibly many”. With real reference counting R can increase and decrease these counts, keeping much better track of how many names point to the same memory address and avoiding unnecessary copies. You can read more about it in a blog post here: https://msmith.de/2020/05/14/exploring-r-reference-counting.html.
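A handy way to watch R’s copy-on-modify behaviour yourself is tracemem(), which prints a message every time an object is duplicated. A minimal sketch:

x <- runif(1e6)
tracemem(x)    ## report whenever this object gets duplicated
y <- x         ## no copy yet: both names reference the same memory
y[1] <- 0      ## tracemem prints here, since modifying y forces a real copy
untracemem(y)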

So the gist of the matter is that R has been improving performance and memory management for a very long time.

Now coming back to the main point. R has a function named gc() that runs the garbage collector and releases memory that is not needed anymore.
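Calling it is as simple as it gets, and its return value doubles as a quick memory report:

mem <- gc()   ## force a collection; returns a matrix of R's memory usage
mem           ## 'used' and 'max used' for Ncells (objects) and Vcells (data)

It’s mostly useful to run it in 2 situations.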

12.2.3 Inside a heavy loop

Loops are slow in R, and sometimes when you are doing heavy manipulation inside a loop, R may consume more memory than it needs and slow down everything that follows. This is the point where I use gc() inside a loop.

gc_after_a_number <- 1e2

for(i in 1:1e4){
  ### heavy calculations
  if(i %% gc_after_a_number == 0){
    gc()  ### free unused memory after every 100 iterations
  }
}

In the code above I am running a long loop, but after every 100 iterations I free up memory by calling gc(). How many iterations to wait between calls depends entirely on what you are doing inside the loop; it could be anywhere from 10 to 10 thousand depending on the situation. It surely helps and I would recommend it to everyone, but only for very heavy calculations.

12.2.4 Anything that takes more than 30 seconds

If you know a function takes more than 30 seconds to execute, I suggest running gc() after it. It helps keep R’s memory in check. It’s useful even if you are running a shiny app or an API: when something already takes more than 30 seconds, the few milliseconds a garbage collection adds will certainly not be your main problem.
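As a hypothetical sketch (load_big_data and train_model are placeholders, not real functions), the pattern looks like this:

big_data <- load_big_data("somewhere")  ## placeholder: assume this loads a huge dataset
model <- train_model(big_data)          ## placeholder: assume this takes minutes
rm(big_data)                            ## drop the raw input once we have the result
gc()                                    ## hand the memory back before the next step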

12.3 Cache / Store calculations

It’s good practice in general not to repeat the same calculations over and over again. Let’s see that with a simple function.

It’s a version of the very famous FizzBuzz problem: print “foo” when a number is divisible by 3, “bar” when it’s divisible by 5, and “foobar” when it’s divisible by both 3 and 5, i.e. by 15.

foo <- function(
  num
){
  x <- data.frame(y = 1:num)
  z <- integer(num)
  
  for(i in 1:num){
    if(
        x$y[[i]] %% 3 == 0 &&
        x$y[[i]] %% 5 == 0
    ){
      z[[i]] <- "foobar"
    } else if( x$y[[i]] %% 3 == 0){
      z[[i]] <- "foo"
    } else if( x$y[[i]] %% 5 == 0){
      z[[i]] <- "bar"
    } else {
      z[[i]] <- i
    }
  }
  
  return(z)
  
}

Now look at the code above. Is it readable? What if I want to change the number of iterations? I would have to change it in 2 different places, but that’s okay because I have wrapped it in a function. The key point is that I am also calculating x$y[[i]] multiple times, and if I have to change that expression I will have to change it in 4 places. Let’s store the calculation in a variable, making the code more readable and good enough that future changes can be done easily.

bar <- function(
  num
){
  
  x <- data.frame(y = 1:num)
  z <- integer(num)
  
  for(i in 1:num){
    curr_value <- x$y[[i]]  # look the value up once and reuse it
    
    if(
        curr_value %% 3 == 0 &&
        curr_value %% 5 == 0
    ){
      z[[i]] <- "foobar"
    } else if( curr_value %% 3 == 0){
      z[[i]] <- "foo"
    } else if( curr_value %% 5 == 0){
      z[[i]] <- "bar"
    } else {
      z[[i]] <- i
    }
  }
  return(z)
}

Let’s check which of the two implementations is faster.

microbenchmark::microbenchmark(
  foo(1e4),
  bar(1e4)
)
## Unit: milliseconds
##        expr     min       lq     mean   median       uq     max neval
##  foo(10000) 35.9884 37.68765 42.10543 40.42580 44.01300 93.2937   100
##  bar(10000) 17.4109 18.22870 19.91110 19.28375 20.38545 34.1456   100

We save almost half the time just by storing one intermediate calculation. You will be amazed to see how often you can do this. And it makes your code

  • more performant

  • more readable

  • more adaptable

  • more memory efficient

Don’t force the computer to calculate something it has already calculated. This simple idea has given rise to big tools like memcached and Redis. R too has the ability to cache its results. There is a package named memoise that can store the result of a previous calculation in RAM. It’s a classic CPU vs RAM trade-off: you use some RAM so that your CPU doesn’t have to do the same work over and over again.

Let’s try to calculate factorials to demonstrate how we can use it.

fact <- function(num){
  if( num == 0){
    return(1)
  } else {
    return(
      num * fact(num - 1)
    )
  }
}

Now, because it is a recursive function (which in R you should avoid writing at all costs), every call recomputes the entire chain from scratch. A call like fact(5) expands like this:

  1. fact(5) = 5 * fact(4)

  2. fact(4) = 4 * fact(3)

  3. fact(3) = 3 * fact(2)

  4. fact(2) = 2 * fact(1)

  5. fact(1) = 1 * fact(0)

  6. fact(0) = 1

Call fact(100) a second time and all 100 multiplications are done again, even though nothing has changed. That makes it a valid candidate for memoise.

## same recursive factorial, but this time we will wrap it with memoise
mem_fact <- function(num){
  if( num == 0){
    return(1)
  } else {
    return(
      num * mem_fact(num - 1)
    )
  }
}

## rebind the name to the memoised version; the recursive calls inside
## also go through the cache, because they look up mem_fact at call time
mem_fact <- memoise::memoise(mem_fact)

This is how you write a simple memoised function. Let’s compare the results.

microbenchmark::microbenchmark(
  fact(100),
  mem_fact(100)
)
## Unit: microseconds
##           expr    min      lq    mean   median       uq       max neval
##      fact(100) 73.701 75.4010 134.686  78.0510  90.1515  4612.501   100
##  mem_fact(100) 89.200 94.1505 632.618 101.3005 116.4005 51373.001   100

In this case we haven’t gained any performance over the base code, but if we run this function multiple times with the same input it will just fetch the result from RAM and skip the computation, and then it’s worth the effort. We will go through examples in the next chapters where memoise is faster too.
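To see the cache actually pay off, call a memoised function twice with the same input; the second call skips the work entirely. A minimal sketch, using Sys.sleep() as a stand-in for an expensive computation:

slow_square <- function(x){
  Sys.sleep(2)  ## stand-in for 2 seconds of heavy work
  x^2
}
mem_square <- memoise::memoise(slow_square)

system.time(mem_square(10))  ## ~2 seconds: computed and cached
system.time(mem_square(10))  ## ~0 seconds: fetched straight from RAM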

memoise is not the only package that does this sort of caching. There is a package named cachem, which is used in shiny, and you will find other products like Redis which do the same thing at scale. It’s an important technique to know for saving both time and memory.
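As a small sketch of how the two fit together (assuming memoise 2.0 or later, where caches come from cachem), you can give a memoised function a cache with a fixed size limit so it can never eat all your RAM:

## cap the cache at roughly 100 MB; least recently used results are evicted first
cache <- cachem::cache_mem(max_size = 100 * 1024^2)
mem_square_capped <- memoise::memoise(slow_square, cache = cache)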

12.4 Conclusion

In this chapter we learned how to free up memory in R. This should help you manage your R environment better. Let’s recap what we have studied so far.

  1. Don’t use rm() unless necessary.
  2. rm() should only be used in an ETL script or a big function.
  3. Remove objects as soon as you are done with them.
  4. Most of the time you can avoid rm() entirely.
  5. Use gc() inside heavy loops after every few iterations.
  6. Use gc() after any heavy operation that takes more than 30 seconds.
  7. Don’t repeat calculations; cache them instead.