Chapter 10 Pass By Value-Reference

When we pass a value to a function, the language has to decide whether the function works on the original object or on a copy of it. If it is acceptable for the function to change the object, we can hand over the original and let it change. Otherwise we create a copy and let the function modify that copy at will without affecting the original.

Understanding this concept is crucial if you want to write efficient code. Let’s dive deeper into it.

10.1 Understanding the system

There are broadly 2 systems for passing objects from one function to another. Let’s understand both of them.

10.1.1 Pass by Value

This is when a copy of the original object is created and passed to the function. Because the function only receives a copy, whatever it does to the object doesn’t affect the original. Let’s check it with an example.

x <- list(y = 1:10)

pass_by_value <- function(x){
  x$y <- 10:1
}

pass_by_value(x)
x$y
##  [1]  1  2  3  4  5  6  7  8  9 10

Here x was passed to the function and modified there, yet the original remains the same because the function only changed its own copy of the object (well, not precisely, but we will discuss that later).
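If the caller does want the modified version under pass-by-value semantics, the usual pattern is to return it from the function and assign the result. The sketch below illustrates this; reverse_y and x_new are hypothetical names used only for this illustration.

```r
x <- list(y = 1:10)

# Hypothetical helper: modifies its local copy and returns it explicitly
reverse_y <- function(x){
  x$y <- rev(x$y)
  x
}

x_new <- reverse_y(x)

x$y      # the original is untouched: 1 to 10
x_new$y  # the returned copy holds the change: 10 to 1
```

The change only becomes visible to the caller through the return value, which is why the earlier call appeared to do nothing.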

10.1.2 Pass by reference

This is when you pass the object itself rather than a copy. In effect you pass a pointer to the original object, so if the function changes the object, it changes the original. Let’s run the same example again, this time with an environment.

x <- new.env()
x$y <- 1:10

pass_by_reference <- function(x){
  x$y <- 10:1
}

pass_by_reference(x)
x$y
##  [1] 10  9  8  7  6  5  4  3  2  1

Now x was passed by reference and no copy was made for the function. So when the function changed the object, the original object changed.

Hopefully it is now clear what the two terms mean in practice.
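Reference semantics also mean that two names can point at the same environment. The following is a minimal sketch of that, where counter, increment and alias are hypothetical names chosen for illustration.

```r
counter <- new.env()
counter$n <- 0

# Hypothetical helper: changes the environment it receives, not a copy of it
increment <- function(env){
  env$n <- env$n + 1
  invisible(env)
}

increment(counter)
increment(counter)
after_two_calls <- counter$n   # 2

# Assigning an environment to a new name does not copy it either
alias <- counter
alias$n <- 100
counter$n                      # 100, both names refer to the same object
```

This is convenient, but it is also why changes to an environment can surprise you: any code holding a reference can alter it.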

10.2 Copy on modify

R has no built-in way to specify when to pass by value and when to pass by reference. Because these are the only 2 options most people know, many assume that R creates a copy of an object every time it is passed to a function. But R does things differently, with a mechanism called copy-on-modify. There are more detailed write-ups of it elsewhere, and its nuances are subtle enough that you shouldn’t worry much about them while writing code. I will try to simplify the concept from a practical point of view so that you can use it in real life without much thought.

R effectively passes an object by reference until you modify it. Let’s check it live:

mt_tbl <- data.frame(mtcars)

tracemem(mt_tbl)
## [1] "<00000000264B40B8>"
dummy_tbl <- mt_tbl
## No tracemem yet

mpg_col <- as.character(mt_tbl$mpg)
## No tracemem yet

mt_tbl[
  (mt_tbl$cyl == 6) &
  (mt_tbl$hp > 90),
]
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## No tracemem yet

subset(
  mt_tbl,
  cyl == 6)
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## No tracemem yet

tracemem marks an object so that R prints a message, with the old and new memory addresses, every time that object is copied. So far it hasn’t printed anything, even though the object passed through several functions, each of which must be calling more functions internally. No copy of the object was made, because so far we haven’t modified anything. Now look at the code below.

library(dplyr) # provides %>%, filter, group_by, summarise and select

mt_tbl %>%
  filter(cyl == 6,
         hp > 90) %>%
  group_by(gear) %>%
  summarise(n()) %>%
  select(gear)
## tracemem[0x00000000264b40b8 -> 0x000000002075db90]: initialize <Anonymous> filter_rows filter.data.frame filter group_by summarise select %>% eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> render_cur_session render_book render_book_script in_dir <Anonymous> <Anonymous> 
## tracemem[0x000000002075db90 -> 0x000000002075dc40]: initialize <Anonymous> filter_rows filter.data.frame filter group_by summarise select %>% eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> render_cur_session render_book render_book_script in_dir <Anonymous> <Anonymous>
## # A tibble: 3 x 1
##    gear
##   <dbl>
## 1     3
## 2     4
## 3     5

dplyr converts the data.frame to a tibble along the way, and that conversion triggers tracemem. This is one of the reasons I love and recommend data.table to everybody: it manages memory very efficiently, in my experience on par with an in-memory database table. If you are really concerned about memory, use data.table.

new_tbl <- mt_tbl %>%
  filter(cyl == 6,
         hp > 90) %>%
  group_by(gear) %>%
  summarise(n()) %>%
  select(gear)
## tracemem[0x00000000264b40b8 -> 0x0000000020845bd0]: initialize <Anonymous> filter_rows filter.data.frame filter group_by summarise select %>% eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> render_cur_session render_book render_book_script in_dir <Anonymous> <Anonymous> 
## tracemem[0x0000000020845bd0 -> 0x0000000020845c80]: initialize <Anonymous> filter_rows filter.data.frame filter group_by summarise select %>% eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> render_cur_session render_book render_book_script in_dir <Anonymous> <Anonymous>

Here the data is modified somewhere along the pipeline, and thus a copy is created. The actual rules are quite complicated, but in simple terms, as long as you don’t modify anything R doesn’t create a copy and everything is passed down by reference.

It impacts speed too. Let’s check it with an example.
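The same behaviour can be seen on a plain vector, without any packages. The sketch below assumes your R build has memory profiling enabled (tracemem stops with an error otherwise). The variable names are hypothetical, and the addresses themselves will differ on every run, so only their equality is compared.

```r
a <- c(1, 2, 3)
addr_a <- tracemem(a)          # start tracing; returns the address as a string

b <- a                         # a second name for the same object, no copy yet
addr_b_shared <- tracemem(b)

b[[1]] <- 10                   # first modification: R copies here and tracemem prints a message
addr_b_copied <- tracemem(b)

untracemem(a)
untracemem(b)

addr_a == addr_b_shared        # TRUE: before the modification both names share one object
addr_a == addr_b_copied        # FALSE: after it, b lives at a new address
a                              # still 1 2 3
```

Assignment alone costs nothing; the copy is paid for only at the first modification.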

foo <- function(x){
  sum(x)
}

bar <- function(x){
  x[[1]] <- 1
  sum(x)
}

As you can see, the two functions are nearly identical. The only difference is that bar modifies the object while foo does not. Let’s run a speed test.

x <- rnorm(1e7)

microbenchmark::microbenchmark(
  foo = foo(x),
  bar = bar(x),
  times = 10
)
## Unit: milliseconds
##  expr     min      lq     mean   median      uq     max neval cld
##   foo  9.1626  9.2340  9.77099  9.60895 10.1424 10.8262    10  a 
##   bar 32.1337 32.2695 33.11242 32.70875 32.9868 36.5435    10   b

As you can see, bar is slower because it creates a copy of the object. You might assume that R creates a copy every time you change an object, but that is not the case: R is smart enough to get away with a single copy. Let’s create a function that changes more elements of x and see the difference.

bar_new <- function(x){
  x[[1]] <- 1
  x[[10]] <- 10
  x[[1e3]] <- 1e3
  sum(x)
}

microbenchmark::microbenchmark(
  foo = foo(x),
  bar = bar(x),
  bar_new = bar_new(x),
  times = 10
)
## Unit: milliseconds
##     expr     min      lq     mean  median      uq      max neval cld
##      foo  8.8944  9.1426  9.37689  9.2620  9.3713  10.7012    10  a 
##      bar 31.8602 32.5249 43.24724 33.1897 37.7848 125.5525    10   b
##  bar_new 31.7165 31.9476 33.00687 32.2612 33.1326  36.9782    10   b

As you can see, foo and bar differ significantly in performance, but bar and bar_new do not. That is because bar_new also creates a copy, but keeps working on that one copy for the rest of the function.

So R is smart enough to know when to create a copy and when not to. Once a copy is created inside a function, R keeps reusing it. We can gain speed and memory benefits by making sure all the modifications are done inside a single function, so that R doesn’t create many copies.

Instead of calling bar 3 times, it’s better to call bar_new once so that the object isn’t copied multiple times. See the difference for yourself, and try to keep all the modifications close together and in as few functions as possible.

microbenchmark::microbenchmark(
  bar = {
    bar(x)
    bar(x)
    bar(x)
    },
  bar_new = bar_new(x),
  times = 10
)
## Unit: milliseconds
##     expr     min      lq      mean  median       uq      max neval cld
##      bar 97.3352 97.4706 108.60056 98.9124 100.5602 189.8442    10   a
##  bar_new 32.2984 33.0995  75.09429 36.3545 104.0295 247.6080    10   a

The medians (about 99 ms against 36 ms) show that it is best to group these modifications together.
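One way to group modifications is to pass all the positions and values to a single function call. This is only a sketch of the idea: set_one and set_many are hypothetical helpers, and how many copies R actually makes can vary with the R version.

```r
# Hypothetical helper: one modification per call, so each call can copy its input
set_one <- function(v, i, value){
  v[[i]] <- value
  v
}

# Hypothetical helper: all modifications in one call, so one copy is enough
set_many <- function(v, idx, values){
  v[idx] <- values
  v
}

x <- numeric(1e6)

# Three separate calls, each working on an object that is still bound elsewhere
x1 <- set_one(x, 1, 1)
x2 <- set_one(x1, 10, 10)
x_steps <- set_one(x2, 1000, 1000)

# One grouped call
x_grouped <- set_many(x, c(1, 10, 1000), c(1, 10, 1000))

identical(x_steps, x_grouped)  # TRUE: same result, fewer function calls
```

Both routes give the same vector; the grouped version simply gives R fewer occasions to copy it.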

So the gist of the matter is:

  1. R passes everything by reference until you modify it
  2. R creates a copy when you modify the object
  3. You should keep all the modifications of an object in the same function

10.3 Using pass by reference

As mentioned before, R has no way of specifying when an object will be passed by reference and when it will be passed by value. And there are certainly times when you wish you had passed it by value, and times when you wish you had passed it by reference.

When you modify something inside a function, you create a copy of it. Take the example of a loop outside and inside a function.

x <- numeric(10)
for(i in 1:10){
  x[[i]] <- rnorm(1)
}
x
##  [1]  0.09345514  0.30016724  1.37123862 -0.75835430  0.93895510 -0.19666791
##  [7] -0.66356191  0.71313355  1.65562438  1.02137192

It modifies the object in place. Now let’s wrap it in a function and see what happens.

x <- numeric(10)

foo <- function(x){
  for(i in 1:10){
    x[[i]] <- rnorm(1)
  }
  return(x)
}

foo(x)
##  [1] -0.32193547 -0.12438617 -0.28606108 -0.29332355 -0.08433643 -0.46792772
##  [7] -0.62798252  0.88368918  1.30927814 -0.04613028
x
##  [1] 0 0 0 0 0 0 0 0 0 0

Now x is not modified, because the function modified its own copy. This matters when you are running a long job that might take hours to complete, only to hit an error in the middle. You might want to restart the loop from the exact position where it stopped, and with this sort of code you cannot. Let’s generate an error in the code and use a bigger number.

total_length <- 1e2
set.seed(1)

x <- numeric(total_length)

foo <- function(number){
  y <- sample(1:total_length,1)
  for(i in 1:total_length){
    number[[i]] <- i
    if(y == i){
      stop(sprintf("there is an error at %s", y))
    }
  }
  return(number)
}

foo(x) ## You will get an Error
## Error in foo(x): there is an error at 68

If you run this code you will get an error at some iteration and x will still be the same. All the processing done up to that moment is lost. That is not what you want: even if each iteration took just 2 minutes to run, this could mean hours of lost work in some scenarios.

There are 4 options in R that provide mutable objects or pass-by-reference semantics.

  1. R6 Classes
  2. environments
  3. data.table
  4. listenv

I wouldn’t recommend writing an R6 class just to run a simple loop; however, if your use case is complex, R6 is a valid solution. We already saw how environments can be used for pass by reference. But passing environments around is not a good idea: it requires you to know a lot about the language and to be very careful with what you are doing. Hence I prefer only 2 approaches, one with data.table and the other with the listenv package.

Their use cases are very different, though. listenv should be used where lists are better suited to the task, while data.table should be used where data.frames or vectors are better suited. Doing it with listenv is very easy. It’s the same code with just the new listenv object.

foo_list <- function(list){
  y <- sample(1:total_length,1)
  for(i in 1:total_length){
    list$x[[i]] <- i
    if(y == i){
      stop(sprintf("there is an error at %s", y))
    }
  }
}

list_env <- listenv::listenv()
list_env$x <- numeric(total_length)

foo_list(list_env)
## Error in foo_list(list_env): there is an error at 39

Again we got an error, but this time all the changes made before it have been saved in x.

list_env$x
##   [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
##  [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39  0  0  0  0  0  0  0  0  0  0  0
##  [51]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##  [76]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
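For comparison, the same checkpointing idea also works with a plain environment from base R, which is what makes the packages above possible in the first place. This is a sketch rather than a recommendation. state and fill_until_error are hypothetical names, and the failing iteration is passed in as fail_at instead of being sampled so that the result is reproducible.

```r
total_length <- 100

state <- new.env()
state$x <- numeric(total_length)

# Hypothetical helper: writes progress into the environment as it goes
fill_until_error <- function(env, fail_at){
  for(i in seq_len(total_length)){
    env$x[[i]] <- i
    if(i == fail_at){
      stop(sprintf("there is an error at %s", fail_at))
    }
  }
  invisible(env)
}

# try() keeps the script running so that we can inspect what was saved
result <- try(fill_until_error(state, fail_at = 39), silent = TRUE)

inherits(result, "try-error")        # TRUE, the loop did fail
sum(state$x != 0)                    # 39 values were written before the error
next_i <- which(state$x == 0)[[1]]   # 40, the position to resume from
```

Because the work is stored in the environment, the first untouched position tells you where a restarted loop should begin.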

The same thing can be done with data.table as well. Let’s write a new function for it. data.table has 2 ways of assigning values by reference.

  1. With the := operator, which is slower per call but useful for inserting many values at once rather than one by one
  2. With the set function, which is faster when you need to insert data one by one.

Let’s use the second approach to write a function.

x_dt <- data.table::data.table(x = numeric(total_length))

foo_dt <- function(dt){
  y <- sample(1:total_length,1)
  for(i in 1:total_length){
    data.table::set(
      x = dt,
      i = i,
      j = "x",
      value = i)

    if(y == i){
      stop(sprintf("there is an error at %s", y))
    }
  }
}

foo_dt(x_dt)
## Error in foo_dt(x_dt): there is an error at 1

Just like before, even though we got an error we can still see the values that were written before the loop stopped.

x_dt$x
##   [1] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
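For contrast with the set loop, here is a sketch of the first approach, where := assigns a whole block of rows in one call. It assumes the data.table package is installed. dt_block and fill_block are hypothetical names used only here.

```r
library(data.table)

dt_block <- data.table(x = numeric(10))

# Hypothetical helper: writes several rows at once, by reference
fill_block <- function(tbl, idx){
  tbl[idx, x := as.numeric(idx)]
  invisible(tbl)
}

fill_block(dt_block, 1:5)

dt_block$x   # rows 1 to 5 are filled, the rest are still 0
```

The function never returns or reassigns the table, yet the caller sees the change, because := modifies the data.table in place.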

So let me make things simpler. When you want to modify objects in place, use 1 of these 2 approaches. When you are working with data.frames and vectors use data.table, and when you are working with anything else use the listenv approach.

10.4 Conclusion

This chapter focused on how to save memory in your R program by using objects through reference and avoiding copies of them. Let’s summarize what we have covered.

  1. Keep all the modifications of an object in a single function
  2. Use pass by reference through listenv and data.table to save memory
  3. Avoid creating multiple copies of an object at all costs