Chapter 11 Release Memory

Sometimes even after applying all the best techniques to save memory you still need more space. In those cases you need to understand that R has 2 techniques available to you. These techniques are mostly useful in different context. I mostly recommend using gc() inside a loop and rm() inside an ETL or R script.

11.1 use rm()

rm function deletes any object from R and releases some memory that could be used for further operations. It’s excellent for cases where you have a lot of intermediate results saved inside an R script that is supposed to run through an ETL. like this

value <- length(mtcars$mpg)
sorted_mpg <- sort(mtcars$mpg)
median_values <- sorted_mpg[value/2] + sorted_mpg[(value/2) + 1]
median_mpg <- median_values / 2

rm(list = c("value", "sorted_mpg", "median_values"))

In these scenarios where you are running scripts like these and storing intermediate result for further processing it’s okay to use rm function. However when you wrap it inside a function only the final result stays and everything else is lost anyways so you don’t need to use rm function at all.

I would not advice anyone to use this function unless you are writing an ETL script on a data that is large enough that you would need space after every iteration. Even in those scenarios you can get away with not using intermediate variables for storing the results. But if you must and only and only in those very rare cases this function should be used. In broader cases R’s garbage collection will have you covered.

11.2 use gc()

Garbage collection is a process by which a software or a program returns the memory back to OS or frees it for it’s own usage. R has significantly improved it’s memory management in recent years. All the blogs you read about R being slow and memory hungry are mostly written during 2014-2017. They aren’t true anymore.

11.2.1 R version 3.5

In R version 3.5 altrep was introduced which helps in saving a lot of memory in special cases like loops. Packages like vroom have achieved significant speed ups because now they didn’t have to load all the data in memory and can only load as much as they need. If you want to read more about it you should visit this page https://svn.r-project.org/R/branches/ALTREP/ALTREP.html.

11.2.2 R version 4.0

In R version 4.0 ARC ( automatic reference counting )was introduced in R. Previously R used to keep a track of how many copies an object has with NRC ( named reference counting )and that had only 3 option, 0, 1 and 2; where I think 2 meant many. Now with ARC R can increase and decrease these numbers for keeping a better track of how many object points to the same memory address. You can read more about in a blog post here: https://msmith.de/2020/05/14/exploring-r-reference-counting.html#:~:text=The%20CRAN%20NEWS%20accompanying%20R,further%20optimizations%20in%20the%20future..

So the gist of the matter is that R has been improving performance and memory management for a very long time.

Now coming back to the main point. R has a function by name gc to run a garbage collector and release the memory that is not needed anymore. It’s useful to run it in mostly 2 situations.

11.2.3 Inside a heavy loop

Loops are slow in R and sometime when are you doing heavy manipulation in a loop. R might consume more memory than it’s needed and will slow the further processes. This is the point I use a gc() inside a loop.

gc_after_a_number <- 1e2

for(i in 1:1e4){
  ### heavy calculations
  if( i %% gc_after_a_number == 0){
    gc()
  }
}

Here in the code above I am running a long loop but after every 100 iterations I am freeing up memory by running a gc() function. After how many iterations do you want to use a loop is totally dependent on what are you doing inside a loop. It could go from anywhere between 10 to 10 thousand depending upon the situation. It surely helps and I would recommend everyone to use it but only for very heavy calculations.

11.2.4 anything that takes more than 30 seconds

If you know a function takes more than 30 second to execute. I would suggest you to run a gc() after it. It helps keep a check on R’s memory. It’s useful even if you are running a shiny app or an API. If anything takes more than 30 second it’s worth running the gc() function because than adding a few milliseconds will not be your main problem for sure.

11.3 Cache / Store calculations

It’s a good practice in general to not repeat the same calculations over and over again. let’s check that with simple function.

It’s a very famous problem of print foo when a number is divisible by 3 printing bar when it’s divisible by 5 and foobar when it’s divisible by both 3 and 5 or 15.

foo <- function(
  num
){
  x <- data.frame(y = 1:num)
  z <- integer(num)
  
  for(i in 1:num){
    if(
        x$y[[i]] %% 3 == 0 &&
        x$y[[i]] %% 5 == 0
    ){
      z <- "foobar"
    } else if( x$y[[i]] %% 3 == 0){
      z <- "foo"
    } else if ( x$y[[i]] %% 5 == 0){
      z <- "bar"
    } else {
      z <- i
    }
  }
  
  return(z)
  
}

Now look at the code above is it readable? what If I want to change the number of iterations. I will have to change it at 2 different places but that’s okay because I have wrapped in a function. But the key point is that I am also calculating x$y[[i]] multiple times and If I have make changes to it I will have to change it in 4 places. Let’s make it a little readable by storing the calculation and thus making our code readable and make it good enough that future changes can be done easily.

bar <- function(
  num
){
  
  x <- data.frame(y = 1:num)
  z <- integer(num)
  
  for(i in 1:num){
    curr_value <- x$y[[i]]
    
    if(
        curr_value %% 3 == 0 &&
        curr_value %% 5 == 0
    ){
       z <- "foobar"
    } else if( curr_value %% 3 == 0){
      z <- "foo"
    } else if ( curr_value %% 5 == 0){
      z <- "bar"
    } else {
      z <- i
    }
  }
  return(z)
}

Let’s check which is faster of the above implementations.

microbenchmark::microbenchmark(
  foo(1e4),
  bar(1e4)
)
## Unit: milliseconds
##        expr     min       lq     mean  median      uq     max neval cld
##  foo(10000) 22.3464 24.53065 26.27622 25.3623 27.1846 47.4518   100   b
##  bar(10000) 10.7788 12.05795 13.51551 12.5670 13.3212 28.4267   100  a

We are saving almost half the time just because we are saving some intermediate calculations. You will be amazed to see how often can you do this. And it makes your code

  • more performant

  • more readable

  • more adaptable

  • more memory efficient

Don’t force the computer to calculate something it has already calculated. This simple thing has brought into life big concepts like mem-cache or reddis. R too has the ability to cache it’s results. There is a package by name memoise. It can store the results of a previous calculation in RAM. It’s a classical case of RAM vs CPU trade off. You are using some RAM so that your CPU doesn’t have to work over and over again.

Let’s try to calculate factorials to demonstrate how we can use it.

fact <- function(num){
  if( num == 0){
    return(1)
  } else {
    return(
      num * fact(num - 1)
    )
  }
}

Now because being a recursive function, which you should avoid writing at all costs, It does the same computation over and over again. It goes something like this

  1. fact(0) * fact(1)

    1. fact(1) * fact(2)

      1. fact(2) * fact(3)

        1. fact(4) * fact(5)

          1. fact(5) * fact(6)

            1. ……………….

now it’s calculating same thing over and over again without caching it. It makes it a valid candidate for memoise.

mem_fact <- function(num){
  if( num == 0){
    return(1)
  } else {
    return(
      num * mem_fact(num - 1)
    )
  }
}

mem_fact <- memoise::memoise(mem_fact)

This is how you write a simple memoise function. Let’s compare the results.

microbenchmark::microbenchmark(
  fact(100),
  mem_fact(100)
)
## Unit: microseconds
##           expr  min   lq    mean median    uq     max neval cld
##      fact(100) 39.2 39.7  63.307  39.95 40.45  2272.5   100   a
##  mem_fact(100) 52.4 55.1 326.264  56.75 59.40 26781.1   100   a

In this case we haven’t achieved much performance over the base code but if we run this function multiple times it will just fetch the results from RAM and save computations in which case it will be worth the effort. We will go through example in next chapters where memoise will be faster too…

memoise is not the only package which does this sort of caching. There is a package by name cachem which is used in shiny and you will find some other products like redis which do the same thing but at scale. It’s an important technique to know for saving both time and memory.

11.4 Conclusion

In this chapter we learned how to free up memory for more usage in R. This should help you manage your R environment better. Let’s see what we have studied so far.

  1. Don’t use rm unless necessary
  2. rm should only be used in an ETL or a manual script
  3. Most of the time you can avoid rm
  4. use gc() inside heavy loops after every few iteration
  5. use gc() after every heavy transaction which takes more than a couple of seconds.
  6. Don’t repeat calculations