2.2 Defining R objects
To understand computations in R, two slogans are helpful:
- Everything that exists is an object.
- Everything that happens is a function call.
Everything that exists in R must be represented somehow. A generic term for something that we do not know much about is an object. In R, we can distinguish between two main types of objects:
- data objects are typically passive containers for values (e.g., numbers, truth values, or strings of text), whereas
- function objects are active elements or tools: They do things with data.
2.2.1 Data objects
When using R, we typically create data objects that store the information that we care about (e.g., some data file). To achieve our goals (e.g., understand or reveal some new aspect of the data), we use or design functions as tools for manipulating and changing data (e.g., inputs) to create new data (e.g., outputs).
Creating and changing objects by assignment
Objects are created or changed by assignment using the <- operator and the following structure:
obj_name <- value
Here are definitions of four different types of objects:
lg <- TRUE
n1 <- 1
n2 <- 2L
cr <- "hi"
To determine the type of these objects, we can evaluate the
typeof() function on each of them:
typeof(lg)
#> [1] "logical"
typeof(n1)
#> [1] "double"
typeof(n2)
#> [1] "integer"
typeof(cr)
#> [1] "character"
To do something with these objects, we can apply other functions to them:
!lg        # negate a logical
#> [1] FALSE
n1         # print an object's current value
#> [1] 1
n1 + n2    # add 2 numeric objects
#> [1] 3
nchar(cr)  # number of characters
#> [1] 2
To change an object, we need to assign something else to it:
n1 <- 8  # change by re-assignment
n1
#> [1] 8
n1 + n2
#> [1] 10
Note that both of the last two code chunks contained the line n1 + n2, but yielded different results. The reason is that n1 was initially assigned to 1, but then changed (or rather re-assigned) to 8.
This implies that R needs to keep track of all our current variable bindings. This is done in R’s so-called environment.
As R’s current environment is an important feature of R, here is another example:
Changing an object by re-assigning it creates multiple “generations” of an object.
In the following code, the logical object
lg refers to two different truth values at different times:
lg  # assigned to TRUE (above)
#> [1] TRUE
lg <- !lg  # change by re-assignment
lg  # assigned to FALSE (now)
#> [1] FALSE
Here, the value of the logical object lg changed from TRUE to FALSE by the assignment lg <- !lg.
As this assignment negates the current value of the same object, the value of
lg to the left of the assignment operator
<- differs from the value of
lg to its right. Importantly, whenever an object (e.g.,
lg) is re-assigned, its previous value(s) are lost, unless they have been assigned to (or stored as) a different object.
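As a minimal sketch of this point, the following lines keep the earlier value of a logical object by copying it to a second object before re-assigning it. The name lg_old is hypothetical and only chosen for illustration.

```r
lg <- TRUE     # first "generation" of lg
lg_old <- lg   # store the current value under a different name
lg <- !lg      # re-assign lg: its own previous value is overwritten
lg             # FALSE
lg_old         # TRUE (still available, as it was stored separately)
```

Without the second line, the value TRUE would no longer be bound to any name after the re-assignment.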
Naming objects (both data objects and functions) is an art in itself. A good general recommendation is to always aim for consistency and clarity. This may sound trivial, but if you have ever tried to understand someone else’s code (including your own from a while ago), you know how astonishingly hard this can be.
Here are some generic recommendations (some of which may be personal preferences):
- Always aim for short but clear and descriptive names.
  - data objects can be abstract (e.g., v_output) or short words or abbreviations,
  - functions should be verbs (like print()) or composita.
- Honor existing conventions (e.g., using N for sample or population sizes, …).
- Create new conventions when this promotes consistency (e.g., giving objects that belong together similar names, or calling all functions that plot something plot_...()).
- Use only lowercase letters and numbers for names (as they are easy to type), and absolutely avoid all special characters, as they may not exist or look very different on other people’s computers.
- Use snake_case for combined names, rather than camelCase.
- Perhaps most importantly: Break any of those rules if there are good (i.e., justifiable) reasons for this.
2.2.2 Data types
Throughout this book, we will work with the following data types:
- logical values (aka. Boolean values, of type logical)
- numbers (of type integer or double)
- text or string data (of type character)
- dates and times (with various data types)
Note that we distinguish data types and data structures (aka. data shapes). The terms type and structure are often used interchangeably, but data structures actually combine data objects of particular types into larger shapes.
We already defined objects of type “integer,” “double,” “character” and “logical” above.
Two elementary functions that can be applied to any R object are typeof() and length():
typeof(TRUE)
#> [1] "logical"
typeof(10L)
#> [1] "integer"
typeof(10)
#> [1] "double"
typeof("oops")
#> [1] "character"
length(FALSE)
#> [1] 1
length(100L)
#> [1] 1
length(123)
#> [1] 1
length("hoopla")
#> [1] 1
Elementary objects with a length of 1 are known as scalar objects.
In the following sections, we will briefly consider some examples of each of the four main data types.
The simplest data types are logical values (aka. Boolean values).
They only exist in two varieties:
A <- TRUE
B <- FALSE
It is possible in R to abbreviate TRUE and FALSE by T and F, but as T and F are non-protected names and can also be set to other values, these abbreviations should be avoided.
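The following sketch illustrates why the abbreviations are risky. It re-assigns T inside local(), so that the change does not leak into the rest of the session; the object name res is hypothetical.

```r
res <- local({
  T <- FALSE   # allowed: T is an ordinary, non-protected name
  c(abbreviation = T, full_name = TRUE)
})
res   # the abbreviation now disagrees with TRUE
```

Code that relies on T or F silently changes its meaning after such an assignment, whereas TRUE and FALSE always keep their values.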
By combining logical values with logical operators, we can create more complex logical statements:
!A     # negation
#> [1] FALSE
A & B  # logical AND
#> [1] FALSE
A | B  # logical OR
#> [1] TRUE
With these logical values and operators, we can define quite fancy statements of predicate logic. For instance, the following statements verify the validity of De Morgan’s Laws (e.g., on Wikipedia) in R:
A <- TRUE   # set to either TRUE or FALSE.
B <- FALSE  # set to either TRUE or FALSE.
# Irrespective of the values of A and B,
# the following should ALWAYS evaluate to TRUE:
# (1) not (A or B) = not A and not B:
!(A | B) == (!A & !B)
# (2) not (A and B) = not A or not B:
!(A & B) == (!A | !B)
Irrespective of the truth values of A and B, the statements (1) and (2) are always TRUE.
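Rather than changing A and B by hand, all four combinations can be checked at once. This is a sketch that uses vectors (which are only introduced later); the names law_1 and law_2 are hypothetical, and the extra parentheses around the negated left-hand sides merely make the grouping explicit.

```r
A <- c(TRUE, TRUE, FALSE, FALSE)
B <- c(TRUE, FALSE, TRUE, FALSE)

law_1 <- (!(A | B)) == (!A & !B)   # (1) not (A or B) = not A and not B
law_2 <- (!(A & B)) == (!A | !B)   # (2) not (A and B) = not A or not B

all(law_1)   # TRUE: (1) holds for every combination
all(law_2)   # TRUE: (2) holds for every combination
```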
A noteworthy feature of R is that logical values are interpreted as numbers when the context suggests this interpretation.
In these cases, any value of
TRUE is interpreted as the number 1 and any value of
FALSE is interpreted as the number 0.
For instance, when logical values appear in calculations:
TRUE + FALSE
#> [1] 1
TRUE - FALSE + TRUE
#> [1] 2
3 * TRUE - 11 * FALSE/7
#> [1] 3
The same interpretation of truth values is made when applying arithmetic functions to (vectors of) truth values:
sum(c(TRUE, FALSE, TRUE))
#> [1] 2
mean(c(TRUE, FALSE, FALSE))
#> [1] 0.3333333
Calculating with logical values may seem a bit strange at first, but provides a useful bridge between logical and numeric data types.
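A typical use of this bridge is counting how many values meet a condition. The following sketch uses a hypothetical numeric vector named scores: the comparison yields logical values, which sum() and mean() then treat as 1 and 0.

```r
scores <- c(-10, 0, 2, 4, 6)   # hypothetical numeric values
is_pos <- scores > 0           # FALSE FALSE TRUE TRUE TRUE

sum(is_pos)    # number of positive values: 3
mean(is_pos)   # proportion of positive values: 0.6
```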
Numbers can be represented and entered into R in a variety of ways.
In most cases, they are entered using the decimal notation, as the result of computations, or in scientific notation (using e\(x\) to indicate multiplication by the \(x\)-th power of 10).
By default, R represents all numbers as data of type double.
Here are different ways of entering the number 3.14 and then testing for its type:
typeof(3.14)          # decimal
#> [1] "double"
typeof(314/100)       # ratio
#> [1] "double"
typeof(3.14e0)        # scientific notation
#> [1] "double"
typeof(round(pi, 2))  # a built-in constant
#> [1] "double"
If we specifically want to represent a number as an integer, it must not contain a fractional part and be followed by L (reminiscent of the “Long” data type in C):
typeof(123)
#> [1] "double"
typeof(123L)
#> [1] "integer"
# Note:
# 12.3L  # would yield a warning and a double, as integers may not contain fractional values.
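The distinction between integers and doubles also matters for the results of calculations. The following lines are a small sketch of how the type of a result depends on the types of its inputs:

```r
typeof(5L + 2L)     # "integer": adding two integers stays an integer
typeof(5L + 2)      # "double": mixing integer and double yields a double
typeof(5L / 2L)     # "double": division always yields a double
typeof(5L %/% 2L)   # "integer": integer division of two integers
as.integer(3.9)     # 3: converting to integer drops the fractional part
```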
Three special numeric values are NaN (not a number), Inf, and -Inf:
0/0
#> [1] NaN
1/0
#> [1] Inf
-1/0
#> [1] -Inf
Note that the data type of these special numbers is still double:
typeof(1/0)
#> [1] "double"
typeof(0/0)
#> [1] "double"
Note that we used
/ to indicate a fraction or the “divided by” operator.
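Because these special values are ordinary doubles, they cannot be detected by their type. As a brief sketch, base R provides dedicated test functions for this purpose:

```r
is.nan(0/0)         # TRUE
is.finite(1/0)      # FALSE
is.infinite(-1/0)   # TRUE
is.finite(3.14)     # TRUE
Inf - Inf           # NaN: the result is undefined
1/Inf               # 0
```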
Numbers are primarily useful for calculating other numbers from them. This is done either by applying numeric functions or by applying arithmetic operators to numeric objects.
- Here are some common numeric functions to be applied to numeric objects:6
# define some numeric object:
nums <- c(-10, 0, 2, 4, 6)
# basic functions:
min(nums)   # minimum
#> [1] -10
max(nums)   # maximum
#> [1] 6
sum(nums)   # sum
#> [1] 2
# statistical functions:
mean(nums)  # mean
#> [1] 0.4
var(nums)   # variance
#> [1] 38.8
sd(nums)    # standard deviation
#> [1] 6.228965
- Here are examples of the most common arithmetic operators:
x <- 5
y <- 2
+ x      # keeping sign
#> [1] 5
- y      # reversing sign
#> [1] -2
x + y    # addition
#> [1] 7
x - y    # subtraction
#> [1] 3
x * y    # multiplication
#> [1] 10
x / y    # division
#> [1] 2.5
x ^ y    # exponentiation
#> [1] 25
x %/% y  # integer division
#> [1] 2
x %% y   # remainder of integer division (x mod y)
#> [1] 1
When an arithmetic expression contains more than one operator, the issue of operator precedence arises. Fortunately, R uses the same precedence rules as we have learned in school — the so-called “BEDMAS” order:
1 / 2 * 3    # left to right
#> [1] 1.5
1 + 2 * 3    # precedence: */ before +-
#> [1] 7
(1 + 2) * 3  # changing order by parentheses
#> [1] 9
2^1/2 == 1
#> [1] TRUE
2^(1/2) == sqrt(2)
#> [1] TRUE
?Syntax provides a longer list of operator precedence.
However, using parentheses to structure longer (arithmetic or logical) expressions always increases transparency.
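Two cases in which parentheses prevent surprises are sketched below. In R, exponentiation is evaluated before the unary minus, and repeated exponentiation is evaluated from right to left:

```r
-2^2      # -4: read as -(2^2)
(-2)^2    #  4: the parentheses change the result
2^3^2     # 512: read as 2^(3^2)
(2^3)^2   #  64
```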
Numbers can also be compared to other numbers. When comparing numbers (i.e., applying comparison operators to them), we get logical values (i.e., scalars of type “logical” that are either TRUE or FALSE).
- For instance, each of the following comparisons of numeric values yields a logical object (i.e., either TRUE or FALSE) as its result:
2 > 1   # larger than
#> [1] TRUE
2 >= 2  # larger than or equal to
#> [1] TRUE
2 < 1   # smaller than
#> [1] FALSE
2 <= 1  # smaller than or equal to
#> [1] FALSE
== tests for the equality of objects, whereas
!= tests for inequality (or non-equality):
1 == 1  # == ... equality
#> [1] TRUE
1 != 1  # != ... inequality
#> [1] FALSE
A common error of R novices is to use = instead of ==. As = can be used as the assignment operator (instead of <-), this often yields unexpected results or “assignment” errors.
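The difference between the two operators can be seen in a short sketch: a single = changes an object, whereas == only asks a question about it.

```r
x = 5    # assignment (same effect as x <- 5)
x == 5   # comparison: TRUE
x == 6   # comparison: FALSE
x        # still 5, as comparing does not change x
```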
The == operator often yields unexpected results when checking the equality of two numbers.
As computers store (real) numbers as approximations, x == y often evaluates to FALSE even when we mathematically know that x and y should be equal. For example:
x <- sqrt(2)
x^2 == 2  # should be TRUE, but:
#> [1] FALSE
# Reason:
x^2 - 2   # tiny numeric difference
#> [1] 4.440892e-16
When checking for the equality of numbers, we need to use functions that allow for minimal tolerances due to the way in which computers represent so-called floating point numbers. One such function is the all.equal() function:
all.equal(x^2, 2)
#> [1] TRUE
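The same issue arises with simple decimal fractions. The sketch below compares two hypothetical objects a and b in three ways. As all.equal() returns a description of the difference (rather than FALSE) when two numbers differ, it is usually wrapped in isTRUE() when a single logical value is needed.

```r
a <- 0.1 + 0.2
b <- 0.3

a == b                    # FALSE: exact comparison fails
isTRUE(all.equal(a, b))   # TRUE: equal within a small tolerance
abs(a - b) < 1e-9         # TRUE: explicit tolerance check
```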
Text data (also called “strings”) is represented as data of type character.
To distinguish character objects from the names of other R objects, they need to be surrounded by double quotes (as in “hello”) or single quotes (as in ‘bye’).
Special characters (that have a special meaning in R) are escaped with a backslash (see ?Quotes for details).
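As a brief sketch of quoting and escaping, the following lines use the escape sequences for a double quote, a tab, and a new line. The object names quote_dq and quote_sq are hypothetical.

```r
quote_dq <- "She said \"hi\""   # escaped double quotes inside double quotes
quote_sq <- 'She said "hi"'     # same text, using single quotes outside
identical(quote_dq, quote_sq)   # TRUE

nchar("a\tb")                   # 3: \t counts as one (tab) character
cat("line 1\nline 2\n")         # \n starts a new line when printed by cat()
```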
The length of a word
w is not determined by
length(w), but by a special function
nchar() (for “number of characters”).
The following proves that
word is a four-letter word:
nchar("word")
#> [1] 4
Alphabetical characters come in two varieties: lowercase (e.g., c) and uppercase.
R comes with two corresponding built-in constants:
letters
#>  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
#> [20] "t" "u" "v" "w" "x" "y" "z"
LETTERS
#>  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
#> [20] "T" "U" "V" "W" "X" "Y" "Z"
and special functions that evaluate or change objects of type character:
slogan <- "Make America great again."
nchar(slogan)
#> [1] 25
toupper(slogan)
#> [1] "MAKE AMERICA GREAT AGAIN."
tolower(slogan)
#> [1] "make america great again."
In terms of representation, the difference between A and a is not one of data type.
They are both objects of type “character”, but different ones.
Dates and times
Dates and times are complicated data types, as their definition and interpretation needs to consider a lot of context, irregularities, and conventions. At this early point in our R careers, we only need to know that such types exist and consider two particular functions that return the current date and time:
Sys.Date()
#> [1] "2021-03-07"
Sys.time()
#> [1] "2021-03-07 09:12:39 CET"
Note the data type of both:
typeof(Sys.Date())
#> [1] "double"
typeof(Sys.time())
#> [1] "double"
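That typeof() reports "double" hints at how dates are stored: as numbers with an additional class label. The following sketch uses a fixed date (stored in a hypothetical object d) rather than the current one, so that its results do not change from day to day:

```r
d <- as.Date("2021-03-07")          # a fixed date
typeof(d)                           # "double": stored as a number
class(d)                            # "Date": but labeled as a date
as.numeric(as.Date("1970-01-02"))   # 1: days since 1970-01-01
d + 1                               # the following day
```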
A special type of value is
NA, which stands for not available, not applicable, or missing values:
ms <- NA  # NA_integer_
ms
#> [1] NA
NA is of a logical data type, but other data types have their own type of NA: NA_integer_ (of type integer), NA_real_ (of type double), and NA_character_ (of type character).
But as they are flexibly changed into each other when necessary, we can ignore this distinction.
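The flexible conversion of NA can be observed directly. In this sketch, NA adopts the type of the values it is combined with, and most computations involving NA yield NA unless missing values are explicitly removed. The object name v is hypothetical.

```r
typeof(NA)             # "logical"
typeof(c(1, NA))       # "double": NA is converted to fit the numbers
typeof(c("a", NA))     # "character"

v <- c(1, NA, 3)
sum(v)                 # NA: the result is unknown
sum(v, na.rm = TRUE)   # 4: missing values are removed first
```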
Functions are R objects that serve as tools — they do stuff with data (as input) to often create new data (as output). Alternative terms for function — depending on context — are command, method, procedure, or strategy.
Theoretically, a function provides a mapping from one set of objects (e.g., inputs or x-values) to another set of objects (e.g., outputs or y-values). Internally, a function is a computer program that takes some input(s) and yields some output(s) or side-effects. Most functions are only a few lines of code long, but others may contain thousands of lines of code or call many other functions.
The great thing about functions is that they are tools that encapsulate processes — they are “abstraction devices.” Just like we usually do not care how some technological device (e.g., a phone) gets its task done, we can treat a function as a black box that is designed to perform some task. If we provide it with the right kind of input, a function typically does something and returns some output. If all works well (i.e., we get our desired task done), we never need to care about how a function happens to look or work inside.
R comes pre-loaded with perhaps a few hundred functions. However, one of the main perks of R is that more than 16,000 R packages contributed by R developers provide countless additional functions that someone else considered useful enough to create and share. So rather than ever learning all R functions, it is important that we learn how to find those that are useful to solve our problems and how to explore and use them in a productive fashion. And as we get more experienced, we will also learn how to create our own R functions (by, surprise, using an R function). But before we do all that, we first need to learn how to use a function that is already provided to us.
Whenever we are interested in an existing function, we can obtain help on it by typing its name preceded by a question mark.
For instance, if we wanted to learn how the
substr() function worked, we would evaluate the following command in our Console:
?substr()  # yields documentation
?substr    # (works as well)
Do not be discouraged if some of the function’s documentation seems cryptic at first. This is perfectly normal — and as our knowledge of R grows, we will keep discovering new nuggets of wisdom in R’s very extensive help pages.7 And even when not understanding anything of a function’s documentation, trying out its Examples usually provides some idea what the function is all about.
Importantly, functions have a name and accept arguments (in round parentheses).
For instance, a function
fun_name() could have the following structure:
fun_name(x, arg_1 = 1, arg_2 = 99)
Arguments are named slots that allow providing inputs to functions.
The value of an argument is either required (i.e., must be provided by the user) or is set to a default value (which is used if the argument is not provided).
In our example structure, the argument x is required, but arg_1 and arg_2 have default values.
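To make the role of defaults tangible, here is a sketch that gives fun_name() a hypothetical body (it merely adds up its three arguments; the text above only specifies the function's arguments, not what it does):

```r
# Hypothetical definition: the body is an assumption for illustration.
fun_name <- function(x, arg_1 = 1, arg_2 = 99) {
  x + arg_1 + arg_2
}

fun_name(x = 0)              # 100: both defaults are used
fun_name(x = 0, arg_2 = 1)   # 2: one default is overridden
fun_name(0, 10, 20)          # 30: all arguments provided by position

# Omitting the required argument x yields an error:
res <- tryCatch(fun_name(), error = function(e) "error")
res                          # "error"
```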
We use the
substr() function to illustrate the use of arguments with an actual function.
?substr() describes its purpose, arguments, and many details and examples for its usage.
To identify an argument, we can use its name or its position in the order of arguments:
substr(x = "perspective", start = 4, stop = 8)  # explicating argument names
#> [1] "spect"
substr("perspective", 4, 8)                     # using the order of arguments
#> [1] "spect"
Note that there is no space between the function name and the parentheses and multiple arguments are separated by commas. Although it is faster and more convenient to omit argument names, explicating argument names is always safer. For instance, calling functions with explicit argument names would still work if the author of a function added or changed the order of arguments:
substr(start = 4, x = "perspective", stop = 8)  # explicit names (in different order)
#> [1] "spect"
Note that a function’s documentation typically mentions the data types of its input(s), its output(s), and its argument(s). This is important, as most functions are designed to work with specific data types.
As R and R packages contain countless functions, an important skill consists in exploring new functions. Exploring a new function is a bit like conducting a small research study. To be successful, we need a mix of theoretical guidance and empirical observations to become familiar with a new object. When exploring an R function, we should always ask the following questions:
- Purpose: What does this function do?
- Arguments: What inputs does it take?
- Outputs: Which outputs does it yield?
- Limits: Which boundary conditions apply to its use?
Note that we currently explore functions from a user’s perspective. When we later learn to write our own functions, we will ask the same questions from a designer’s perspective.
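Some of these questions can be answered from the Console. The following sketch inspects the arguments of substr() with the base R functions formals() and args(), and then probes its limits with a few inputs. The object name arg_names is hypothetical.

```r
arg_names <- names(formals(substr))   # Arguments: which inputs does it take?
arg_names                             # "x" "start" "stop"
args(substr)                          # prints the function's argument list

substr("perspective", 1, 3)           # Outputs: "per"
substr("perspective", 9, 99)          # Limits: a stop beyond the end yields "ive"
```

Such quick experiments complement, but do not replace, reading the function's documentation.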
Example: The ds4psy package provides a function
plot_fn() that is deliberately kept cryptic and obscure to illustrate how a function and its arguments can be explored. Most actual R functions will be easier to explore, but they also require some active exploration to become familiar with them. Here is what we normally do to explore an existing function:
library(ds4psy)  # load package
?plot_fn         # get documentation
The documentation (shown in the Help window of RStudio) answers most of our questions. It also provides some examples, which we can copy and evaluate for ourselves:
# Basics:
plot_fn()
# Exploring options:
plot_fn(x = 2, A = TRUE)
plot_fn(x = 3, A = FALSE, E = TRUE)
plot_fn(x = 4, A = TRUE, B = TRUE, D = TRUE)
plot_fn(x = 5, A = FALSE, B = TRUE, E = TRUE, f = c("black", "white", "gold"))
plot_fn(x = 7, A = TRUE, B = TRUE, F = TRUE, f = c("steelblue", "white", "forestgreen"))
This illustrates that
plot_fn() creates a range of plots. Its argument names are uninformative, as they are named by single lowercase or uppercase letters. However, the documentation tells us what type of data each argument needs to be (e.g., a numeric or logical value) and what the default value is.
See 1.2.3 Exploring functions for examples of exploring simple and complex functions.
Testing for missing values
An important function in R is
is.na(): It tests whether an object is missing or contains any missing values:
is.na(x)
#> [1] FALSE
is.na(ms)
#> [1] TRUE
and returns logical value(s) of either TRUE or FALSE.
Note that even a missing R object (i.e., an object for which is.na() returns TRUE) needs to exist to be evaluated.
Thus, the following would yield an error, unless a unicorn object existed in our current environment:
is.na(unicorn)
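The difference between a missing value and a non-existing object can be sketched with the base R functions exists() and tryCatch(). The sketch assumes that no object named unicorn has been defined before, and the text of the replacement message is made up for illustration.

```r
exists("unicorn")   # FALSE, unless such an object was defined earlier

# Evaluating a non-existing object fails; tryCatch() catches the error:
res <- tryCatch(is.na(unicorn),
                error = function(e) "error: object not found")
res

unicorn <- NA       # now the object exists, but its value is missing
is.na(unicorn)      # TRUE
```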
This section provides some additional examples to help you think about and practice basic R data types and functions.
Assume you loaded some table of
data (e.g., from the tidyverse package tidyr) to practice your R skills:
data <- tidyr::table1
data
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583
When further analyzing and changing data, it is quite possible that you make errors at some point. Suppose that data was valuable to your project and you were afraid of messing it up along the way.
- How could you ensure that you could always retrieve your original data?
Solution: Store a backup copy of
data by assigning it to another object (which is not manipulated).
However, note the difference between the following alternatives:
data_backup <- tidyr::table1  # backup of original data
data_backup <- data           # backup of current data
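The first alternative stores the table as originally provided, whereas the second stores whatever data contains at the time of the assignment (including any earlier changes). The following sketch illustrates the second case with a small hypothetical data frame (dat, instead of tidyr::table1, so that it runs without additional packages):

```r
dat <- data.frame(id = 1:3, cases = c(745, 2666, 37737))  # hypothetical table

dat_backup <- dat            # backup of the current state of dat
dat$cases <- dat$cases * 0   # an (erroneous) change to dat

dat$cases                    # 0 0 0
dat_backup$cases             # 745 2666 37737: the backup is unaffected

dat <- dat_backup            # restore dat from the backup
```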
Predict, evaluate, and explain the result of the following commands (for different combinations of logical values of P and Q):
P <- TRUE
Q <- FALSE
(!P | Q) == !(P & !Q)
Solution: The expression always evaluates to TRUE as its two sub-expressions (!P | Q) and !(P & !Q) are alternative ways of expressing the logical conditional “if P, then Q” in R. The R way of checking all four possible combinations at once would use vectors:
P <- c(TRUE, TRUE, FALSE, FALSE)
Q <- c(TRUE, FALSE, TRUE, FALSE)
(!P | Q) == !(P & !Q)
Using the vectors essentially creates the following table:
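One way to display such a table in R is sketched below. The column names lhs, rhs, and equal, as well as the object name tab, are hypothetical labels for the two sub-expressions and their comparison.

```r
P <- c(TRUE, TRUE, FALSE, FALSE)
Q <- c(TRUE, FALSE, TRUE, FALSE)

tab <- data.frame(P, Q,
                  lhs = (!P | Q),     # left sub-expression
                  rhs = !(P & !Q))    # right sub-expression
tab$equal <- tab$lhs == tab$rhs       # TRUE in every row
tab
```

Each row corresponds to one combination of P and Q, and only the combination of P being TRUE and Q being FALSE makes both sub-expressions FALSE.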
Predict, evaluate, and explain the result of the following commands:
word <- "Donaudampfschifffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft"
length(word)
length("word")
Hint: Think before starting to count.
Predict, evaluate, and explain the result of the following commands:
ppbs <- "parapsychological bullshit"
substr(ppbs)
substr(ppbs, 5, 10)
substr(start = 1, ppbs)
substr(stop = 17, ppbs, 11)
substr(stop = 99, start = -11, ppbs)
In this example, the R object nums is defined as a vector of five numbers, which we will cover below.↩︎
In the old days, people used to get and read books to get this kind of information. One of the most astonishing things about R is that all of its documentation is already available on your computer when installing an R package.↩︎