1.8 The Role of Physical Memory

The learning objectives of this section are to:

  • Describe how memory is used in R sessions to store R objects

Generally speaking, R stores and manipulates all objects in the physical memory of your computer (i.e. the RAM). Therefore, it’s important to be aware of the limits of your computing environment with respect to available memory and how that may affect your ability to use R. In the event that your computer’s physical memory is insufficient for some of your work, there have been some developments that allow R users to deal with objects out of physical memory and we will discuss them below.

The first thing that is worth keeping in mind as you use R is how much physical memory your computer actually has. Typically, you can figure this out by looking at your operating system’s settings. For example, as of this writing, Roger has a 2015-era MacBook with 8 GB of RAM. Of course, the amount of RAM available to R will be quite a bit less than that, but it’s a useful upper bound. If you plan to read into R an object that is 16 GB on this computer, you’re going to have to ask Roger for a new computer.

The pryr package provides a number of useful functions for interrogating the memory usage of your R session. Perhaps the most basic is the mem_used() function, which tells you how much memory your current R session is using.

library(pryr)
mem_used()
156 MB

The primary use of this function is to make sure your memory usage in R isn’t getting too big. If the output from mem_used() is in the neighborhood of 75%-80% of your total physical RAM, you might need to consider a few things.

First, you might consider removing a few very large objects in your workspace. You can see the memory usage of objects in your workspace by calling the object_size() function.

ls()  ## Show objects in workspace
 [1] "a"                   "a_tale"              "andrew_tracks"      
 [4] "b"                   "cases"               "check_months"       
 [7] "check_tracks"        "check_weekdays"      "denver"             
[10] "ext_tracks"          "ext_tracks_colnames" "ext_tracks_file"    
[13] "ext_tracks_widths"   "join_funcs"          "katrina"            
[16] "katrina_reduced"     "knots_to_mph"        "logdates"           
[19] "logs"                "m"                   "mc_tibl"            
[22] "meso_url"            "miami"               "msg"                
[25] "old"                 "pasted_states"       "readr_functions"    
[28] "regular_expression"  "shapes"              "start_end_vowel"    
[31] "state_tbl"           "string_to_search"    "team_standings"     
[34] "teams"               "to_trim"             "two_cities"         
[37] "two_s"               "VADeaths"            "vowel_state_lgl"    
[40] "wc_table"            "worldcup"            "x"                  
[43] "y"                   "zika_brazil"         "zika_file"          
object_size(worldcup)
56 kB

The object_size() function will print the number of bytes (or kilobytes, or megabytes) that a given object is using in your R session. If you want to see the memory usage of the five largest objects in your workspace, you can use the following code.

library(magrittr)
sapply(ls(), function(x) object.size(get(x))) %>% sort %>% tail(5)
    worldcup       denver check_tracks   ext_tracks        miami 
       56216       222768       239848      1842472     13121560 

Note: We have had to use the object.size() function here (see note below) because the current version of object_size() in pryr throws an error for certain types of objects.

Here we can see that the miami and ext_tracks objects (created in previous chapters of this book) are currently taking up the most memory in our R session. Since we no longer need those objects, we can remove them from the workspace and free up some memory.

mem_used()
156 MB
rm(ext_tracks, miami)
mem_used()
155 MB

Here you can see how much memory we save by deleting these two objects. But you may be wondering why the savings aren’t larger, given the sizes reported by object.size() above. This has to do with the internal representation of the miami object, which is of the class ggmap. Occasionally, certain types of R objects can appear to take up more memory than they actually do, in which case functions like object.size() will get confused.

The change in memory usage caused by executing an R expression can be viewed more simply with the mem_change() function. We can see what happens when we remove three more objects.

mem_change(rm(check_tracks, denver, b))
-460 kB

Here the decrease is about 460 kB.

R has a built-in function called object.size() that also calculates the size of an object, but it uses a slightly different calculation than object_size() in pryr. While the two functions will generally agree for most objects, they will differ for things like functions and formulas, which have enclosing environments attached to them. Similarly, objects with shared elements (e.g. character vectors) may result in different computations of their size. The compare_size() function in pryr allows you to see how the two functions compare in their calculations. We will discuss these concepts more in the next chapter.
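The effect of shared elements can be illustrated with base R alone. The following is a minimal sketch, not taken from pryr: the names x and shared are ours, and it assumes a list whose three elements all refer to the same vector. Because object.size() does not detect the sharing, it counts the vector once per list element, so its estimate is roughly three times the size of one vector even though no copies were made.

```r
# A vector of one million doubles (roughly 8 MB).
x <- numeric(1e6)

# A list whose three elements all refer to the same vector; no copies are made.
shared <- list(x, x, x)

size_x <- as.numeric(object.size(x))
size_shared <- as.numeric(object.size(shared))

# object.size() counts x once per list element, so this ratio is close to 3.
size_shared / size_x
```

An estimate like this is an upper bound on what the list really costs, which is one reason why removing an object can free less memory than its reported size suggests.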

1.8.1 Back of the Envelope Calculations

When reading in large datasets or creating large R objects, it’s often useful to do a back of the envelope calculation of how much memory the object will occupy in the R session (ideally before creating the object). To do this it’s useful to know roughly how much memory different types of atomic data types in R use.

It’s difficult to generalize how much memory is used by data types in R, but on most 64-bit systems today, integers are 32 bits (4 bytes) and double-precision floating point numbers (numerics in R) are 64 bits (8 bytes). Furthermore, character data are usually 1 byte per character. Because most data come in the form of numbers (integer or numeric) and letters, just knowing these three pieces of information can be useful for doing many back of the envelope calculations.

For example, an integer vector takes up roughly 4 bytes times the number of elements in the vector. Even a zero-length vector still requires some memory to represent the data structure itself.

object_size(integer(0))
40 B

However, for longer vectors, the overhead stays roughly constant, and the size of the object is determined by the number of elements.

object_size(integer(1000))  ## 4 bytes per integer
4.04 kB
object_size(numeric(1000))  ## 8 bytes per numeric
8.04 kB
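One way to check the per-element cost without worrying about the fixed overhead is to compare two vectors of different lengths. The sketch below uses base R’s object.size() rather than pryr, and the variable names are ours. The overhead cancels in the difference, which should leave 4 bytes per integer and 8 bytes per numeric.

```r
# Sizes in bytes, using base R's object.size().
int_1000 <- as.numeric(object.size(integer(1000)))
int_2000 <- as.numeric(object.size(integer(2000)))
num_1000 <- as.numeric(object.size(numeric(1000)))
num_2000 <- as.numeric(object.size(numeric(2000)))

# The fixed overhead cancels in the difference, leaving the per-element cost.
bytes_per_integer <- (int_2000 - int_1000) / 1000
bytes_per_numeric <- (num_2000 - num_1000) / 1000

c(integer = bytes_per_integer, numeric = bytes_per_numeric)
```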

If you are reading in tabular data of integers and floating point numbers, you can roughly estimate the memory requirements for that table by multiplying the number of rows by the total memory required per row across all of the columns (4 bytes for each integer column, 8 bytes for each numeric column). This can be a useful exercise to do before reading in large datasets. If you accidentally read in a dataset that requires more memory than your computer has available, you may end up freezing your R session (or even your computer).
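As a rough illustration of this kind of estimate, the sketch below defines a hypothetical helper, estimate_table_bytes(), which is not part of base R or pryr. It assumes 4 bytes per integer value and 8 bytes per numeric value, ignores character columns and all overhead, and uses made-up table dimensions. The second half compares the estimate against object.size() for a small data frame that is cheap to create.

```r
# Hypothetical helper: estimated bytes for a table of integer and numeric columns.
# Assumes 4 bytes per integer and 8 bytes per numeric value; ignores overhead.
estimate_table_bytes <- function(n_rows, n_integer_cols, n_numeric_cols) {
  n_rows * (4 * n_integer_cols + 8 * n_numeric_cols)
}

# Made-up example: 1,500,000 rows with 20 integer and 100 numeric columns.
est_bytes <- estimate_table_bytes(1.5e6, n_integer_cols = 20, n_numeric_cols = 100)
est_bytes / 1e9   # estimate in GB (decimal)

# Sanity check on a small table that is cheap to create.
small <- data.frame(id = integer(1e5), value = numeric(1e5))
actual <- as.numeric(object.size(small))
estimate <- estimate_table_bytes(1e5, n_integer_cols = 1, n_numeric_cols = 1)
actual / estimate   # close to 1; the small excess is data frame overhead
```

For the made-up table, the estimate is 1.32 GB, which would already be a sizable share of the RAM on an 8 GB machine before any copies are made during analysis.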

The .Machine object in R (found in the base package) can give you specific details about how your computer and operating system store different types of data.

str(.Machine)
List of 18
 $ double.eps           : num 2.22e-16
 $ double.neg.eps       : num 1.11e-16
 $ double.xmin          : num 2.23e-308
 $ double.xmax          : num 1.8e+308
 $ double.base          : int 2
 $ double.digits        : int 53
 $ double.rounding      : int 5
 $ double.guard         : int 0
 $ double.ulp.digits    : int -52
 $ double.neg.ulp.digits: int -53
 $ double.exponent      : int 11
 $ double.min.exp       : int -1022
 $ double.max.exp       : int 1024
 $ integer.max          : int 2147483647
 $ sizeof.long          : int 8
 $ sizeof.longlong      : int 8
 $ sizeof.longdouble    : int 16
 $ sizeof.pointer       : int 8

The floating point representation of a decimal number contains a set of bits representing the exponent and another set of bits representing the significand or the mantissa. Here the number of bits used for the exponent is 11, from double.exponent, and the precision of the significand is 53 bits, from the double.digits element. Of those 53 bits, only 52 are stored explicitly (the leading bit is implied), so together with 1 bit for the sign, each double precision floating point number requires 1 + 11 + 52 = 64 bits, or 8 bytes, to store.

For integers, we can see that the maximum integer, indicated by integer.max, is 2147483647. Taking the base 2 log of that number shows that it requires 31 bits to encode. Because we need another bit to encode the sign of the number, the total number of bits for an integer is 32, or 4 bytes.
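The bit counts for both types can be reproduced directly from .Machine. This is a small sketch with variable names of our choosing. For doubles it assumes the usual IEEE 754 layout, in which double.digits includes an implied leading bit that is not stored, and one further bit holds the sign.

```r
# Integers: bits needed for the largest magnitude, plus one sign bit.
magnitude_bits <- ceiling(log2(.Machine$integer.max))
integer_bits <- magnitude_bits + 1

# Doubles (assuming IEEE 754): double.digits counts the implied leading bit
# of the significand, so one fewer bit is actually stored.
stored_significand_bits <- .Machine$double.digits - 1
double_bits <- 1 + .Machine$double.exponent + stored_significand_bits

c(integer_bytes = integer_bits / 8, double_bytes = double_bits / 8)
```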

Much of the point of this discussion of memory is to determine if your computer has sufficient memory to do the work you want to do. If you determine that the data you’re working with cannot be completely stored in memory for a given R session, then you may need to resort to alternate tactics. We discuss one such alternative in the section below, “Working with large datasets”.

1.8.2 Internal Memory Management in R

If you’re familiar with other programming languages like C, you’ll notice that you do not need to explicitly allocate and de-allocate memory for objects in R. This is because R has a garbage collection system that recycles unused memory and gives it back to R. This happens automatically without the need for user intervention.

Roughly, R will periodically cycle through all of the objects that have been created and see if there are still any references to the object somewhere in the session. If there are no references, the object is garbage-collected and the memory returned. Under normal usage, the garbage collection is not noticeable, but occasionally, when working with very large R objects, you may notice a “hiccup” in your R session when R triggers a garbage collection to reclaim unused memory. There’s not really anything you can do about this except not panic when it happens.

The gc() function in the base package can be used to explicitly trigger a garbage collection in R. Calling gc() explicitly is never actually needed, but it does produce some output that is worth understanding.

gc()
          used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells 2047348 109.4    3205452 171.2  3205452 171.2
Vcells 4929073  37.7   15695277 119.8 19618287 149.7

The used column gives you the amount of memory currently being used by R. The distinction between Ncells and Vcells is not important here; the mem_used() function in pryr essentially gives you the total memory represented by the two rows of this column. The gc trigger column gives you the amount of memory that can be used before a garbage collection is triggered. Generally, you will see this number go up as you allocate more objects and use more memory. The max used column shows the maximum space used since the last call to gc(reset = TRUE) and is not particularly useful.
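To see garbage collection reclaim memory, you can watch the used column for Vcells (vector cells, 8 bytes each) before and after removing a large object. The following is a minimal sketch with variable names of our choosing. It calls gc() explicitly only so that the numbers can be read at known points, not because doing so is required.

```r
# Vcells in use before allocating anything large (each Vcell is 8 bytes).
vcells_before <- gc()["Vcells", "used"]

# Allocate ten million doubles, i.e. about ten million Vcells (80 MB).
big <- numeric(1e7)
vcells_with_big <- gc()["Vcells", "used"]

# Remove the only reference to the vector, then collect again.
rm(big)
vcells_after <- gc()["Vcells", "used"]

c(before = vcells_before, with_big = vcells_with_big, after = vcells_after)
```

The middle value should be roughly ten million cells higher than the other two, showing that the memory is returned once no references to the object remain.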