7 Using Textual and Binary Formats for Storing Data
There are a variety of ways that data can be stored, including structured text files like CSV or tab-delimited, or more complex binary formats. However, there is an intermediate format that is textual, but not as simple as something like CSV. The format is native to R and is somewhat readable because of its textual nature.
One can create a more descriptive representation of an R object by using the dput() or dump() functions. The dump() and dput() functions are useful because the resulting textual format is editable and, in the case of corruption, potentially recoverable. Unlike writing out a table or CSV file, dump() and dput() preserve the metadata (sacrificing some readability), so that another user doesn’t have to specify it all over again. For example, we can preserve the class of each column of a table or the levels of a factor variable.
Textual formats can work much better with version control programs like Subversion or Git, which can only track changes meaningfully in text files. In addition, textual formats can be longer-lived; if there is corruption somewhere in the file, it can be easier to fix the problem because one can just open the file in an editor and look at it (although this would probably only be done in a worst-case scenario!). Finally, textual formats adhere to the Unix philosophy, if that means anything to you.
There are a few downsides to using these intermediate textual formats. The format is not very space-efficient, because all of the metadata is specified. Also, it is really only partially readable. In some instances it might be preferable to have data stored in a CSV file and then have a separate code file that specifies the metadata.
One way to pass data around is by deparsing the R object with dput() and reading it back in (parsing it) using dget().
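As a minimal sketch of deparsing, the example below prints the dput() representation of a small data frame and then parses that text back into an object. The data frame y and its values are made up for illustration, and the capture.output() and parse() steps are just one convenient way to do the round trip in memory; dget() does the equivalent parse-and-evaluate step when reading from a file.

```r
# Illustrative data frame; the column names and values are made up
y <- data.frame(a = 1, b = "a")

# Deparse the object: dput() prints R code that would recreate 'y'
dput(y)

# Capture the deparsed text, then parse and evaluate it to rebuild the object
code_text <- capture.output(dput(y))
y_copy <- eval(parse(text = code_text))

# Compare the rebuilt object with the original
all.equal(y, y_copy)
```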
Notice that the
dput() output is in the form of R code and that it
preserves metadata like the class of the object, the row names, and
the column names.
The output of
dput() can also be saved directly to a file.
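A short sketch of the file-based version follows. The file name y.R and the use of tempdir() are assumptions made here so that the example does not write into the working directory; any writable path would do.

```r
# Illustrative data frame; the column names and values are made up
y <- data.frame(a = 1, b = "a")

# Hypothetical output location inside the session's temporary directory
y_file <- file.path(tempdir(), "y.R")

# Send the dput() output to a file instead of the console
dput(y, file = y_file)

# Read the deparsed object back in from the file
new_y <- dget(y_file)
new_y
```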
Multiple objects can be deparsed at once using the dump() function and read back in using source().
We can dump() R objects to a file by passing a character vector of their names.
The inverse of dump() is source().
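The dump() and source() pairing might look like the sketch below. The object names x and y, their values, and the file name data.R are illustrative assumptions; the file is placed under tempdir() only to keep the example self-contained.

```r
# Two illustrative objects to store together
x <- "foo"
y <- data.frame(a = 1, b = "a")

# Hypothetical output location inside the session's temporary directory
dump_file <- file.path(tempdir(), "data.R")

# Deparse both objects into one file by naming them in a character vector
dump(c("x", "y"), file = dump_file)

# Remove the originals, then recreate them by sourcing the file
rm(x, y)
source(dump_file)

str(y)
x
```

Unlike dget(), which returns a single object that you assign yourself, source() recreates the objects under their original names.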
7.2 Binary Formats
The complement to the textual format is the binary format, which is sometimes necessary to use for efficiency purposes, or because there’s just no useful way to represent data in a textual manner. Also, with numeric data, one can often lose precision when converting to and from a textual format, so it’s better to stick with a binary format.
The key functions for converting R objects into a binary format are save(), save.image(), and serialize(). Individual R objects can be saved to a file using the save() function.
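A minimal sketch of saving and restoring individual objects is shown below. The objects a and b and the file name mydata.rda are illustrative, and the tempdir() location is an assumption made to keep the example self-contained.

```r
# Two illustrative objects
a <- data.frame(x = rnorm(100), y = runif(100))
b <- c(3, 4.4, 1 / 3)

# Hypothetical output location inside the session's temporary directory
rda_file <- file.path(tempdir(), "mydata.rda")

# Save 'a' and 'b' to a single binary file
save(a, b, file = rda_file)

# Remove the originals, then load them back under their original names
rm(a, b)
loaded <- load(rda_file)

# load() returns the names of the objects it restored
loaded
```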
If you have a lot of objects that you want to save to a file, you can save all objects in your workspace using the save.image() function.
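The sketch below saves the whole workspace and then inspects the result. The file name mydata.RData and the tempdir() location are assumptions for illustration; loading into a separate environment is a choice made here so that the example does not overwrite objects in the current session.

```r
# A couple of illustrative objects in the workspace
a <- 1:10
b <- letters[1:3]

# Hypothetical output location inside the session's temporary directory
rdata_file <- file.path(tempdir(), "mydata.RData")

# Save every object in the workspace (the global environment) to one file
save.image(file = rdata_file)

# Load the file into a fresh environment and list what it contained
restored <- new.env()
restored_names <- load(rdata_file, envir = restored)
restored_names
```

Calling load(rdata_file) without the envir argument would instead restore the objects directly into the current environment.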
Notice that I’ve used the .rda extension when using save() and the .RData extension when using save.image(). This is just my personal preference; you can use whatever file extension you want. The save() and save.image() functions do not care. However, .rda and .RData are fairly common extensions and you may want to use them because they are recognized by other software.
The serialize() function is used to convert individual R objects into a binary format that can be communicated across an arbitrary connection. This may get sent to a file, but it could get sent over a network or other connection.
When you call serialize() on an R object with the connection argument set to NULL, the output will be a raw vector, which R prints in hexadecimal format.
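As a small sketch, the example below serializes an illustrative list with connection = NULL so that the bytes are returned rather than written to a connection. The object x is made up for illustration.

```r
# Illustrative object to serialize
x <- list(1, 2, 3)

# With connection = NULL, serialize() returns the bytes as a raw vector
raw_bytes <- serialize(x, connection = NULL)

# Confirm the type and look at the first few bytes (printed in hexadecimal)
is.raw(raw_bytes)
head(raw_bytes, 10)
```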
If you want, this can be sent to a file, but in that case you are better off using something like save().
The benefit of the
serialize() function is that it is the only way
to perfectly represent an R object in an exportable format, without
losing precision or any metadata. If that is what you need, then
serialize() is the function for you.
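To illustrate the point about precision and metadata, the sketch below round-trips an object through serialize() and unserialize() and contrasts that with a trip through default text formatting. The object obj, its attribute, and the use of format() as the textual conversion are assumptions chosen for illustration.

```r
# Illustrative object with a non-terminating decimal, a factor, and an attribute
obj <- list(
  value  = 1 / 3,
  groups = factor(c("low", "high", "low"), levels = c("low", "high"))
)
attr(obj, "note") <- "example metadata"

# Binary round trip: serialize to raw bytes, then rebuild the object
bytes <- serialize(obj, connection = NULL)
obj_copy <- unserialize(bytes)

# The copy matches the original exactly, including levels and attributes
identical(obj, obj_copy)   # TRUE

# Textual round trip with default formatting (7 significant digits)
text_value <- as.numeric(format(obj$value))
text_value == obj$value    # FALSE: precision was lost in the text form
```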