1.8 The Role of Physical Memory
The learning objectives of this section are to:
- Describe how memory is used in R sessions to store R objects
Generally speaking, R stores and manipulates all objects in the physical memory of your computer (i.e. the RAM). Therefore, it’s important to be aware of the limits of your computing environment with respect to available memory and how that may affect your ability to use R. In the event that your computer’s physical memory is insufficient for some of your work, there have been some developments that allow R users to deal with objects out of physical memory and we will discuss them below.
The first thing that is worth keeping in mind as you use R is how much physical memory your computer actually has. Typically, you can figure this out by looking at your operating system’s settings. For example, as of this writing, Roger has a 2015-era MacBook with 8 GB of RAM. Of course, the amount of RAM available to R will be quite a bit less than that, but it’s a useful upper bound. If you plan to read into R an object that is 16 GB on this computer, you’re going to have to ask Roger for a new computer.
The pryr package provides a number of useful functions for interrogating the memory usage of your R session. Perhaps the most basic is the mem_used() function, which tells you how much memory your current R session is using.
library(pryr)
Registered S3 method overwritten by 'pryr':
  method      from
  print.bytes Rcpp
mem_used()
376 MB
The primary use of this function is to make sure your memory usage in R isn’t getting too big. If the output from mem_used() is in the neighborhood of 75% to 80% of your total physical RAM, you might need to consider a few things.
First, you might consider removing a few very large objects in your workspace. You can see the memory usage of objects in your workspace by calling the object_size() function.
ls() ## Show objects in workspace
 [1] "a"                   "a_tale"              "andrew_tracks"
 [4] "b"                   "cases"               "check_months"
 [7] "check_tracks"        "check_weekdays"      "denver"
[10] "ext_tracks"          "ext_tracks_colnames" "ext_tracks_file"
[13] "ext_tracks_widths"   "join_funcs"          "katrina"
[16] "katrina_reduced"     "knots_to_mph"        "logdates"
[19] "logs"                "m"                   "maps_api_key"
[22] "mc_tibl"             "meso_url"            "miami"
[25] "msg"                 "old"                 "pasted_states"
[28] "readr_functions"     "regular_expression"  "shapes"
[31] "start_end_vowel"     "state_tbl"           "string_to_search"
[34] "team_standings"      "teams"               "to_trim"
[37] "two_cities"          "two_s"               "VADeaths"
[40] "vowel_state_lgl"     "wc_table"            "worldcup"
[43] "x"                   "y"                   "zika_brazil"
[46] "zika_file"
object_size(worldcup)
61.2 kB
The object_size() function will print the number of bytes (or kilobytes, or megabytes) that a given object is using in your R session. If you want to see the memory usage of the 5 largest objects in your workspace, you can use the following code.
library(magrittr)
sapply(ls(), function(x) object.size(get(x))) %>% sort %>% tail(5)
    worldcup       denver check_tracks   ext_tracks        miami
       61464       223376       287488      2795080     13123912
Note: We have had to use the object.size() function here (see note below) because the current version of pryr throws an error for certain types of objects.
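The same ranking can be done without magrittr. The sketch below wraps the idea in a small helper that uses only base R; the name largest_objects() and the demonstration environment demo_env are made up for illustration and are not part of base R or pryr.

```r
# Hypothetical helper: report the n largest objects in an environment,
# measured in bytes with base R's object.size().
largest_objects <- function(n = 5, envir = globalenv()) {
  sizes <- vapply(
    ls(envir),
    function(name) as.numeric(object.size(get(name, envir = envir))),
    numeric(1)
  )
  tail(sort(sizes), n)  # named vector of bytes, largest last
}

# Demonstrate on a throwaway environment so the workspace is left untouched
demo_env <- new.env()
assign("small", integer(10), envir = demo_env)
assign("medium", numeric(1000), envir = demo_env)
assign("large", numeric(100000), envir = demo_env)

top_two <- largest_objects(2, demo_env)
print(top_two)
```

Calling largest_objects() with no arguments would rank the objects in your global workspace instead.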
Here we can see that the miami and ext_tracks objects (created in previous chapters of this book) are currently taking up the most memory in our R session. Since we no longer need those objects, we can remove them from the workspace and free up some memory.
mem_used()
376 MB
rm(ext_tracks, miami)
mem_used()
373 MB
Here you can see how much memory we save by deleting these two objects. But you may be wondering why there isn’t a larger savings, given the sizes reported above. This has to do with the internal representation of the miami object, which is of the class ggmap. Occasionally, certain types of R objects can appear to take up more memory than they actually do, in which case functions like object_size() will get confused.
Viewing the change in memory usage caused by executing an R expression can be simplified using the mem_change() function. We can see what happens when we remove the next three largest objects.
mem_change(rm(check_tracks, denver, b))
-539 kB
Here the decrease is about 540 kB.
A> R has a built-in function called object.size() that also calculates the size of an object, but it uses a slightly different calculation than pryr. While the two functions will generally agree for most objects, for things like functions and formulas, which have enclosing environments attached to them, they will differ. Similarly, objects with shared elements (e.g. character vectors) may result in different computations of their size. The compare_size() function in pryr allows you to see how the two functions compare in their calculations. We will discuss these concepts more in the next chapter.
1.8.1 Back of the Envelope Calculations
When reading in large datasets or creating large R objects, it’s often useful to do a back of the envelope calculation of how much memory the object will occupy in the R session (ideally before creating the object). To do this it’s useful to know roughly how much memory the different atomic data types in R use.
It’s difficult to generalize how much memory is used by data types in R, but on most 64 bit systems today, integers are 32 bits (4 bytes) and double-precision floating point numbers (numerics in R) are 64 bits (8 bytes). Furthermore, character data are usually 1 byte per character. Because most data come in the form of numbers (integer or numeric) and letters, just knowing these three bits of information can be useful for doing many back of the envelope calculations.
For example, an integer vector is roughly 4 bytes times the number of elements in the vector. Even a zero-length vector requires some memory to represent the data structure itself.
object_size(integer(0))
48 B
However, for longer vectors, the overhead stays roughly constant, and the size of the object is determined by the number of elements.
object_size(integer(1000)) ## 4 bytes per integer
4.05 kB
object_size(numeric(1000)) ## 8 bytes per numeric
8.05 kB
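One way to check the per-element figures yourself is to compare two vectors of different lengths, so that the fixed overhead cancels out. This is a small sketch using base R's object.size() rather than pryr; the helper name bytes_per_element() is made up for illustration.

```r
# Hypothetical helper: estimate bytes per element from the size difference
# between a vector of length n and one of length 2n. The fixed per-vector
# overhead appears in both sizes and cancels in the subtraction.
bytes_per_element <- function(make_vector, n = 1000) {
  small <- as.numeric(object.size(make_vector(n)))
  large <- as.numeric(object.size(make_vector(2 * n)))
  (large - small) / n
}

int_bytes <- bytes_per_element(integer)  # expected to be 4 on typical builds
dbl_bytes <- bytes_per_element(numeric)  # expected to be 8 on typical builds
print(c(integer = int_bytes, numeric = dbl_bytes))
```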
If you are reading in tabular data of integers and floating point numbers, you can roughly estimate the memory requirements for that table by multiplying the number of rows by the number of bytes in each row, that is, 4 bytes for every integer column plus 8 bytes for every numeric column. This can be a useful exercise to do before reading in large datasets. If you accidentally read in a dataset that requires more memory than your computer has available, you may end up freezing your R session (or even your computer).
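As a rough sketch of that calculation, the helper below multiplies the row count by the bytes per row. The function name estimate_table_bytes() and the table dimensions are made up for illustration, and the estimate ignores character columns and the small fixed overhead of a data frame.

```r
# Hypothetical helper: back of the envelope size of a table with only
# integer (4 bytes) and numeric (8 bytes) columns.
estimate_table_bytes <- function(n_rows, n_int_cols = 0, n_num_cols = 0) {
  n_rows * (4 * n_int_cols + 8 * n_num_cols)
}

# A made-up table: 1 million rows, 3 integer columns, 7 numeric columns
estimate <- estimate_table_bytes(1e6, n_int_cols = 3, n_num_cols = 7)
estimate / 1e6  # size in MB: 1e6 * (12 + 56) / 1e6 = 68

# Sanity check against a real (smaller) data frame
n <- 1e5
df <- data.frame(a = integer(n), b = numeric(n), c = numeric(n))
estimate_small <- estimate_table_bytes(n, n_int_cols = 1, n_num_cols = 2)
actual_small <- as.numeric(object.size(df))
print(c(estimate = estimate_small, actual = actual_small))
```

The measured size should come out slightly above the estimate, since the data frame also stores column names and other attributes.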
The .Machine object in R (found in the base package) can give you specific details about how your computer/operating system stores different types of data.
str(.Machine)
List of 28
 $ double.eps               : num 2.22e-16
 $ double.neg.eps           : num 1.11e-16
 $ double.xmin              : num 2.23e-308
 $ double.xmax              : num 1.8e+308
 $ double.base              : int 2
 $ double.digits            : int 53
 $ double.rounding          : int 5
 $ double.guard             : int 0
 $ double.ulp.digits        : int -52
 $ double.neg.ulp.digits    : int -53
 $ double.exponent          : int 11
 $ double.min.exp           : int -1022
 $ double.max.exp           : int 1024
 $ integer.max              : int 2147483647
 $ sizeof.long              : int 8
 $ sizeof.longlong          : int 8
 $ sizeof.longdouble        : int 16
 $ sizeof.pointer           : int 8
 $ longdouble.eps           : num 1.08e-19
 $ longdouble.neg.eps       : num 5.42e-20
 $ longdouble.digits        : int 64
 $ longdouble.rounding      : int 5
 $ longdouble.guard         : int 0
 $ longdouble.ulp.digits    : int -63
 $ longdouble.neg.ulp.digits: int -64
 $ longdouble.exponent      : int 15
 $ longdouble.min.exp       : int -16382
 $ longdouble.max.exp       : int 16384
The floating point representation of a decimal number contains a set of bits representing the exponent and another set of bits representing the significand or the mantissa. Here the number of bits used for the exponent is 11, from double.exponent, and the significand has 53 bits of precision, from the double.digits element. Only 52 of those significand bits are actually stored, because the leading bit of a normalized number is implicit, and one more bit stores the sign. Together, 1 + 11 + 52 gives the 64 bits, or 8 bytes, that each double precision floating point number requires.
For integers, we can see that the maximum integer, indicated by integer.max, is 2147483647. If we take the base 2 log of that number, we see that it requires 31 bits to encode. Because we need another bit to encode the sign of the number, the total number of bits for an integer is 32, or 4 bytes.
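These bit counts can be worked out directly from .Machine. The sketch below assumes the usual IEEE 754 double layout, in which one bit holds the sign and the leading significand bit is implicit rather than stored; the variable names are ours.

```r
# Double precision: sign + exponent + stored significand bits
sign_bits          <- 1L
exponent_bits      <- .Machine$double.exponent     # 11
stored_signif_bits <- .Machine$double.digits - 1L  # 53 digits of precision, 52 stored
double_bits        <- sign_bits + exponent_bits + stored_signif_bits

# Integers: bits needed for the largest magnitude, plus one sign bit
magnitude_bits <- ceiling(log2(.Machine$integer.max))  # 31
integer_bits   <- magnitude_bits + 1

print(c(double_bytes = double_bits / 8, integer_bytes = integer_bits / 8))
```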
Much of the point of this discussion of memory is to determine if your computer has sufficient memory to do the work you want to do. If you determine that the data you’re working with cannot be completely stored in memory for a given R session, then you may need to resort to alternate tactics. We discuss one such alternative in the section below, “Working with large datasets.”
1.8.2 Internal Memory Management in R
If you’re familiar with other programming languages like C, you’ll notice that you do not need to explicitly allocate and de-allocate memory for objects in R. This is because R has a garbage collection system that recycles unused memory and gives it back to R. This happens automatically without the need for user intervention.
Roughly, R will periodically cycle through all of the objects that have been created and see if there are still any references to the object somewhere in the session. If there are no references, the object is garbage-collected and the memory returned. Under normal usage, the garbage collection is not noticeable, but occasionally, when working with very large R objects, you may notice a “hiccup” in your R session when R triggers a garbage collection to reclaim unused memory. There’s not really anything you can do about this except not panic when it happens.
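The role of references can be seen in a small experiment. This is only a sketch: used_mb() is a helper we define here (not a base R function), it forces a collection by calling gc() and adds up the megabytes in use, and the exact numbers will vary from session to session.

```r
# Hypothetical helper: run a garbage collection and return total Mb in use
# (column 2 of the gc() matrix is the "(Mb)" figure for the used column)
used_mb <- function() sum(gc()[, 2])

baseline <- used_mb()

x <- numeric(1e7)  # roughly 80 MB of doubles
y <- x             # a second name for the same object; no copy is made

rm(x)              # one reference is gone, but y still points to the data
with_one_ref <- used_mb()

rm(y)              # no references remain, so the memory can be reclaimed
after_release <- used_mb()

print(c(baseline = baseline,
        with_one_ref = with_one_ref,
        after_release = after_release))
```

Memory use should stay high after the first rm(), because y still refers to the vector, and drop back toward the baseline only after the last reference is removed.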
The gc() function in the base package can be used to explicitly trigger a garbage collection in R. Calling gc() explicitly is never actually needed, but it does produce some output that is worth understanding.
gc()
           used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
Ncells  2570669 137.3    4425981 236.4         NA  4425981 236.4
Vcells 28583343 218.1   52839845 403.2      65536 52594427 401.3
The used column gives you the amount of memory currently being used by R. The distinction between Ncells and Vcells is not important; the mem_used() function in pryr essentially gives you the sum of this column. The gc trigger column gives you the amount of memory that can be used before a garbage collection is triggered. Generally, you will see this number go up as you allocate more objects and use more memory. The max used column shows the maximum space used since the last call to gc(reset = TRUE) and is not particularly useful.
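Because gc() returns a matrix, these columns can also be pulled out programmatically. The following is a minimal sketch with variable names of our own choosing; it selects the first "(Mb)" column by position, since that column name is repeated in the matrix.

```r
g <- gc()

# Total memory in use, in Mb: the "(Mb)" figure next to the used column,
# summed over the Ncells and Vcells rows
total_used_mb <- sum(g[, 2])

# Cells in use, and the level at which the next collection is triggered
used_cells    <- g[, "used"]
trigger_cells <- g[, "gc trigger"]

print(total_used_mb)
print(rbind(used = used_cells, trigger = trigger_cells))

# Reset the "max used" statistics so later calls measure from this point on
invisible(gc(reset = TRUE))
```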