2.2 Defining R objects
To understand computations in R, two slogans are helpful:
- Everything that exists is an object.
- Everything that happens is a function call.
(John Chambers)
Everything that exists in R must be represented somehow. A generic term for something that we do not know much about is an object. In R, we can distinguish between two main types of objects:
- data objects are typically passive containers for values (e.g., numbers, truth values, or strings of text), whereas
- function objects are active elements or tools: They do things with data.
When considering any data object, we can always ask two questions:
- What type of data is it?
- In what shape is the data represented?
Particular combinations of data types and shapes are known as data structures. Thus, an important goal of this and the next chapter is to distinguish between different data types (in Section 2.2.2) and data structures (in Section 3.2). To illustrate the data types and data structures that are available in R, we first need to explain how to define corresponding data objects. This involves doing something, i.e., using functions.
2.2.1 Data objects
When using R, we typically create data objects that store the information that we care about (e.g., some data file). To achieve our goals (e.g., understand or reveal some new aspect of the data), we use or design functions as tools for manipulating and changing data (e.g., inputs) to create new data (e.g., outputs).
Creating and changing objects by assignment
Objects are created or changed by assignment using the <- operator and the following structure:
obj_name <- value
Here are definitions of four different types of objects:
lg <- TRUE
n1 <- 1
n2 <- 2L
cr <- "hi"
To determine the type of these objects, we can evaluate the typeof() function on each of them:
typeof(lg)
#> [1] "logical"
typeof(n1)
#> [1] "double"
typeof(n2)
#> [1] "integer"
typeof(cr)
#> [1] "character"
To do something with these objects, we can apply other functions to them:
!lg # negate a logical
#> [1] FALSE
n1 # print an object's current value
#> [1] 1
n1 + n2 # add 2 numeric objects
#> [1] 3
nchar(cr) # number of characters
#> [1] 2
To change an object, we need to assign something else to it:
n1 <- 8 # change by re-assignment
n1
#> [1] 8
n1 + n2
#> [1] 10
Note that the last two code chunks contained the lines n1 and n1 + n2 twice, but yielded different results.
The reason is that n1 was initially assigned to 1, but then changed (or rather re-assigned) to 8.
This implies that R needs to keep track of all our current variable bindings. This is done in R’s so-called environment.
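To make the idea of variable bindings more concrete, here is a minimal sketch using the base R functions exists() and rm(). The object name tmp_obj is hypothetical and only serves this illustration:

```r
# tmp_obj is a hypothetical name, used only for illustration
tmp_obj <- 42        # create a binding in the current environment
exists("tmp_obj")    # is there an object of this name?
#> [1] TRUE
rm(tmp_obj)          # remove the binding again
exists("tmp_obj")
#> [1] FALSE
```

Once a binding has been removed, evaluating its name results in an "object not found" error.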
As R’s current environment is an important feature of R, here is another example:
Changing an object by re-assigning it creates multiple “generations” of an object.
In the following code, the logical object lg refers to two different truth values at different times:
lg # assigned to TRUE (above)
#> [1] TRUE
lg <- !lg # change by re-assignment
lg # assigned to FALSE (now)
#> [1] FALSE
Here, the value of the logical object lg changed from TRUE to FALSE by the assignment lg <- !lg.
As this assignment negates the current value of the same object, the value of lg to the left of the assignment operator <- differs from the value of lg to its right. Importantly, whenever an object (e.g., lg) is re-assigned, its previous value(s) are lost, unless they have been assigned to (or stored as) a different object.
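As a minimal sketch of how a previous value can be preserved, we can store it in a second object before re-assigning the first. The name lg_old is an assumption made for this illustration, and lg is defined anew so that the example is self-contained:

```r
lg <- TRUE       # initial value
lg_old <- lg     # store the current value under a different (hypothetical) name
lg <- !lg        # re-assign lg
lg               # new value
#> [1] FALSE
lg_old           # the previous value is still available
#> [1] TRUE
```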
Naming objects
Naming objects (both data objects and functions) is an art in itself. A good general recommendation is to always aim for consistency and clarity. This may sound trivial, but if you have ever tried to understand someone else’s code (including your own from a while ago), it is astonishing how hard this can be.
Here are some generic recommendations (some of which may be personal preferences):
- Always aim for short but clear and descriptive names:
  - data objects can be abstract (e.g., abc, t_1, v_output) or short words or abbreviations (e.g., data, cur_df),
  - functions should be verbs (like print()) or compound names (e.g., plot_bar(), write_data()).
- Honor existing conventions (e.g., using v for vectors, i and j for indices, x and y for coordinates, n or N for sample or population sizes, …).
- Create new conventions when this promotes consistency (e.g., giving objects that belong together similar names, or calling all functions that plot something with plot_...(), that compute something with comp_...(), etc.).
- Use only lowercase letters and numbers for names (as they are easy to type), and absolutely avoid all special characters, as they may not exist or look very different on other people’s computers.
- Use snake_case for combined names, rather than camelCase, and, perhaps most importantly,
- Break any of those rules if there are good (i.e., justifiable) reasons for this.
2.2.2 Data types
For any data object, we distinguish between its shape and its type. The shape of an object mostly depends on its structure. As this chapter uses only a single data structure (i.e., vectors), we will address issues of data shape later (in Chapter 3 on Data structures).
Here, we focus on data types (which are also described as data modes in R). Throughout this book, we will work with the following data types:
- logical values (aka. Boolean values, of type logical)
- numbers (of type integer or double)
- text or string data (of type character)
- dates and times (with various data types)
We already defined objects of type “integer,” “double,” “character” and “logical” above.
To check the type of a data object, two elementary functions that can be applied to any R object are typeof() and mode():
typeof(TRUE)
#> [1] "logical"
typeof(10L)
#> [1] "integer"
typeof(10)
#> [1] "double"
typeof("oops")
#> [1] "character"
mode(TRUE)
#> [1] "logical"
mode(10L)
#> [1] "numeric"
mode(10)
#> [1] "numeric"
mode("oops")
#> [1] "character"
If we want to check objects for having a particular data type, the following functions allow asking more specific questions:
is.character(TRUE)
#> [1] FALSE
is.double(10L)
#> [1] FALSE
is.integer(10)
#> [1] FALSE
is.numeric("oops")
#> [1] FALSE
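The checks above all return FALSE. As a complement, here is a small sketch of the same family of functions applied to objects of the matching type. Note that is.numeric() is TRUE for both integers and doubles:

```r
is.character("oops")
#> [1] TRUE
is.double(10)
#> [1] TRUE
is.integer(10L)
#> [1] TRUE
is.numeric(10L)   # integers count as numeric
#> [1] TRUE
is.numeric(10)    # so do doubles
#> [1] TRUE
is.logical(TRUE)
#> [1] TRUE
```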
To check the shape of a data object, two basic functions that can be applied to many R objects are length() and str():
length(FALSE)
#> [1] 1
length(100L)
#> [1] 1
length(123)
#> [1] 1
length("hoopla")
#> [1] 1
str(FALSE)
#> logi FALSE
str(100L)
#> int 100
str(123)
#> num 123
str("hoopla")
#> chr "hoopla"
The length() function actually describes a basic property of the most fundamental data structure in R: atomic vectors.
Vectors with a length of 1 are known as scalar objects.
As the objects checked here are pretty simple (i.e., they all are scalars), the results of these functions are pretty boring. But as we move on to more complex data structures, we will learn more ways of checking object shapes and encounter richer data structures.
At this point, it is good to note that even data objects that may look more complicated can internally be simple objects. For instance, dates and times (discussed below) are internally stored as numbers.
In the following sections, we explore additional examples of each of the four main data types.
Logicals
The simplest data types are logical values (aka. Boolean values).
Logical values exist in exactly two varieties: TRUE and FALSE.
A <- TRUE
B <- FALSE
It is possible in R to abbreviate TRUE and FALSE by T and F. However, as T and F are non-protected names and can also be set to other values, these abbreviations should be avoided.
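The following sketch illustrates why T and F are risky: unlike the reserved words TRUE and FALSE, they are ordinary objects that can be re-assigned. The final rm() call removes our own definition, so that T refers to the built-in value again:

```r
T == TRUE    # by default, T stands for TRUE
#> [1] TRUE
T <- FALSE   # but T can be re-assigned (do not do this in real code)
T == TRUE
#> [1] FALSE
rm(T)        # remove our definition; the built-in T is visible again
T == TRUE
#> [1] TRUE
```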
By combining logical values with logical operators, we can create more complex logical expressions:
!A # negation
#> [1] FALSE
A & B # logical AND
#> [1] FALSE
A | B # logical OR
#> [1] TRUE
A == !!A # equality (+ double negation)
#> [1] TRUE
Note that the result of evaluating any logical expression is a logical value (i.e., TRUE or FALSE).
By combining logical values and operators, we can express quite fancy statements of propositional logic. For instance, the following statements verify the validity of De Morgan’s Laws (e.g., on Wikipedia) in R:
A <- TRUE # set to either TRUE or FALSE.
B <- FALSE # set to either TRUE or FALSE.

# Irrespective of the values of A and B,
# the following should ALWAYS evaluate to TRUE:
# (1) not (A or B) = not A and not B:
!(A | B) == (!A & !B)
# (2) not (A and B) = not A or not B:
!(A & B) == (!A | !B)
Irrespective of the truth values of A and B, the statements (1) and (2) are always TRUE.
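Rather than changing A and B by hand, we can check all four combinations of truth values at once. The following is a sketch under two assumptions: it uses vectors (created by c(), which are covered later), and the result names law_1 and law_2 are made up for this illustration. It also adds explicit parentheses around the negated terms, since the ! operator binds less tightly than == in R (see ?Syntax):

```r
# All four combinations of truth values for A and B:
A <- c(TRUE, TRUE, FALSE, FALSE)
B <- c(TRUE, FALSE, TRUE, FALSE)

# (1) not (A or B) = not A and not B:
law_1 <- (!(A | B)) == (!A & !B)
# (2) not (A and B) = not A or not B:
law_2 <- (!(A & B)) == (!A | !B)

all(law_1)   # TRUE for every combination?
#> [1] TRUE
all(law_2)
#> [1] TRUE
```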
A noteworthy feature of R is that logical values are interpreted as numbers when the context suggests this interpretation.
In these cases, any value of TRUE is interpreted as the number 1 and any value of FALSE is interpreted as the number 0.
For instance, when logical values appear in calculations:
TRUE + FALSE
#> [1] 1
TRUE - FALSE + TRUE
#> [1] 2
3 * TRUE - 11 * FALSE/7
#> [1] 3
The same interpretation of truth values is made when applying arithmetic functions to (vectors of) truth values:
sum(c(TRUE, FALSE, TRUE))
#> [1] 2
mean(c(TRUE, FALSE, FALSE))
#> [1] 0.3333333
Calculating with logical values may seem a bit strange at first, but provides a useful bridge between logical and numeric data types.
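A typical use of this bridge is counting. As a sketch (the vector scores and the threshold of 5 are made-up values for illustration), a comparison yields a vector of logical values, whose sum is the number of TRUE cases and whose mean is their proportion:

```r
scores <- c(3, 8, 1, 9)   # hypothetical data
above  <- scores > 5      # logical vector
above
#> [1] FALSE  TRUE FALSE  TRUE
sum(above)                # number of values above 5
#> [1] 2
mean(above)               # proportion of values above 5
#> [1] 0.5
```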
Numbers
Numbers can be represented and entered into R in a variety of ways.
In most cases, they are entered in decimal notation, obtained as the result of computations, or entered in scientific notation (using the e\(x\) notation to indicate multiplication by the \(x\)-th power of 10).
By default, R represents all numbers as data of type double.
Here are different ways of entering the number 3.14 and then testing for its type:
typeof(3.14) # decimal
#> [1] "double"
typeof(314/100) # ratio
#> [1] "double"
typeof(3.14e0) # scientific notation
#> [1] "double"
typeof(round(pi, 2)) # a built-in constant
#> [1] "double"
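As a further illustration of scientific notation, the following sketch checks that e\(x\) shifts the decimal point by \(x\) places (to the right for positive and to the left for negative exponents). The numbers are arbitrary examples:

```r
1e3 == 1000        # 1 times 10^3
#> [1] TRUE
2.5e-2 == 0.025    # 2.5 times 10^(-2)
#> [1] TRUE
typeof(1e3)        # still a double, even without a fractional part
#> [1] "double"
```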
If we specifically want to represent a number as an integer, it must not contain a fractional part and must be followed by L (reminiscent of the “Long” data type in C):
typeof(123)
#> [1] "double"
typeof(123L)
#> [1] "integer"
# Note:
# 3.14L # issues a warning, as
# integers cannot contain fractional values.
Note that entering a decimal number with an added L (e.g., 3.14L) would return the decimal number with a warning, as integers cannot contain fractional values.
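To illustrate how the two numeric types interact, here is a brief sketch using base R functions. Adding two integers yields an integer, whereas the / operator and any mix with a double yield a double. The as.integer() function drops the fractional part of a number:

```r
typeof(5L + 2L)            # integer + integer
#> [1] "integer"
typeof(5L + 2)             # integer + double
#> [1] "double"
typeof(5L / 2L)            # division yields a double
#> [1] "double"
as.integer(3.14)           # conversion truncates towards zero
#> [1] 3
typeof(as.integer(3.14))
#> [1] "integer"
```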
Three special numeric values are infinity (positive Inf and negative -Inf) and “not a number” (NaN):
1/0 # positive infinity
#> [1] Inf
-1/0 # negative infinity
#> [1] -Inf
0/0 # not defined
#> [1] NaN
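In these examples, we use the / operator to indicate a fraction or “division by” operation.
The results computed by R follow the conventions of floating-point arithmetic (the IEEE 754 standard): dividing a non-zero number by zero yields Inf or -Inf, and 0/0 yields NaN, rather than an error.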
Note that NaN is different from a missing number (denoted in R as NA) and that the data type of these special numbers is still double:
typeof(1/0)
#> [1] "double"
typeof(0/0)
#> [1] "double"
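Base R provides functions for testing for these special values. The following sketch shows how they differ, in particular with respect to NaN versus NA:

```r
is.nan(0/0)         # NaN is "not a number"
#> [1] TRUE
is.nan(NA)          # a missing value is not NaN
#> [1] FALSE
is.na(0/0)          # but NaN also counts as missing
#> [1] TRUE
is.finite(1/0)      # Inf is not finite
#> [1] FALSE
is.infinite(-1/0)
#> [1] TRUE
```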
Numbers are primarily useful for calculating other numbers from them. This is done either by applying numeric functions or by applying arithmetic operators to numeric objects.
- Here are some common numeric functions to be applied to numeric objects (see footnote 7):
# define some numeric object:
nums <- c(-10, 0, 2, 4, 6)

# basic functions:
min(nums) # minimum
#> [1] -10
max(nums) # maximum
#> [1] 6
sum(nums) # sum
#> [1] 2
# statistical functions:
mean(nums) # mean
#> [1] 0.4
var(nums) # variance
#> [1] 38.8
sd(nums) # standard deviation
#> [1] 6.228965
- Here are examples of the most common arithmetic operators:
x <- 5
y <- 2

+ x # keeping sign
#> [1] 5
- y # reversing sign
#> [1] -2
x + y # addition
#> [1] 7
x - y # subtraction
#> [1] 3
x * y # multiplication
#> [1] 10
x / y # division
#> [1] 2.5
x ^ y # exponentiation
#> [1] 25
x %/% y # integer division
#> [1] 2
x %% y # remainder of integer division (x mod y)
#> [1] 1
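The operators %/% and %% belong together: the integer quotient times the divisor plus the remainder recovers the original number. The following sketch checks this relation and shows the behavior for a negative number, where %/% rounds down and the remainder takes the sign of the divisor:

```r
x <- 5
y <- 2
(x %/% y) * y + (x %% y) == x   # quotient and remainder recover x
#> [1] TRUE

(-5) %/% 2   # rounds down (towards -Inf), not towards zero
#> [1] -3
(-5) %% 2    # remainder has the sign of the divisor
#> [1] 1
```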
When an arithmetic expression contains more than one operator, the issue of operator precedence arises. Fortunately, R uses the same precedence rules as we have learned in school — the so-called “BEDMAS” order:
- Brackets (),
- Exponents ^,
- Division / and Multiplication *,
- Addition + and Subtraction -
1 / 2 * 3 # left to right
#> [1] 1.5
1 + 2 * 3 # precedence: */ before +-
#> [1] 7
(1 + 2) * 3 # changing order by parentheses
#> [1] 9
2^1/2 == 1
#> [1] TRUE
2^(1/2) == sqrt(2)
#> [1] TRUE
Calling ?Syntax provides a longer list of operator precedence rules.
However, using parentheses to structure longer (arithmetic or logical) expressions always increases transparency.
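Two cases in which parentheses prevent surprises involve the exponentiation operator. As the following sketch shows, ^ binds more tightly than a leading minus sign, and a chain of ^ operators is evaluated from right to left:

```r
-2^2      # read as -(2^2)
#> [1] -4
(-2)^2
#> [1] 4
2^3^2     # read as 2^(3^2)
#> [1] 512
(2^3)^2
#> [1] 64
```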
Numbers can also be compared to other numbers. When comparing numbers (i.e., applying comparison operators to them), we get logical values (i.e., scalars of type “logical” that are either TRUE or FALSE).
- For instance, each of the following comparisons of numeric values yields a logical object (i.e., either TRUE or FALSE) as its result:
2 > 1 # larger than
#> [1] TRUE
2 >= 2 # larger than or equal to
#> [1] TRUE
2 < 1 # smaller than
#> [1] FALSE
2 <= 1 # smaller than or equal to
#> [1] FALSE
The operator == tests for the equality of objects, whereas != tests for inequality (or non-equality):
1 == 1 # == ... equality
#> [1] TRUE
1 != 1 # != ... inequality
#> [1] FALSE
A common error of R novices is to use = instead of ==.
As = can also be used as an assignment operator (like <-), this often yields unexpected results or “assignment” errors.
Importantly, the == operator often yields unexpected results when checking the equality of two numbers.
As computers store (real) numbers as approximations, x == y often evaluates to FALSE even when we mathematically know that x and y should be equal. For example:
x <- sqrt(2)
x^2 == 2 # should be TRUE, but:
#> [1] FALSE
# Reason:
x^2 - 2 # tiny numeric difference
#> [1] 4.440892e-16
When checking for the equality of numbers, we need to use functions that allow for minimal tolerances due to the way in which computers represent so-called floating point numbers. One such function is the all.equal() function:
all.equal(x^2, 2)
#> [1] TRUE
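A caveat when using all.equal() in conditions: if two values differ, it does not return FALSE, but a character string describing the difference. A common idiom is therefore to wrap the call in isTRUE(). The following sketch uses arbitrary example values:

```r
0.1 + 0.2 == 0.3                    # exact comparison fails
#> [1] FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))   # comparison with tolerance
#> [1] TRUE
is.character(all.equal(1, 1.1))     # a description, not FALSE
#> [1] TRUE
isTRUE(all.equal(1, 1.1))           # safe to use in conditions
#> [1] FALSE
```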
Characters
Text data (also called “strings”) is represented as data of type character.
To distinguish character objects from the names of other R objects, they need to be surrounded by double quotes (as in "hello") or single quotes (as in 'bye').
Special characters (that have a special meaning in R) are escaped with a backslash (e.g., \" for a double quote within a double-quoted string; see ?Quotes for details).
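Here is a short sketch of quoting and escaping. The object name quote_str and its content are made up for this illustration:

```r
identical("hello", 'hello')   # both kinds of quotes define the same string
#> [1] TRUE
quote_str <- "say \"hi\""     # \" is a double quote inside double quotes
nchar(quote_str)              # the backslashes are not counted
#> [1] 8
writeLines(quote_str)         # prints the string without escapes
#> say "hi"
nchar("a\nb")                 # \n is a single newline character
#> [1] 3
```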
The length of a word w is not determined by length(w), but by a special function nchar() (for “number of characters”).
The following shows that "word" is a four-letter word:
nchar("word")
#> [1] 4
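To see the difference between both functions side by side, consider the following sketch (the object name w follows the text above). The last line applies nchar() to a vector of two strings, which are counted separately:

```r
w <- "word"
length(w)              # one element (a scalar)
#> [1] 1
nchar(w)               # four characters
#> [1] 4
nchar(c("a", "bcd"))   # nchar() counts within each element
#> [1] 1 3
```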
Alphabetical characters come in two varieties:
lowercase (e.g., a, b, c) and uppercase (e.g., A, B, C).
R comes with two corresponding built-in constants:
letters
#>  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
#> [20] "t" "u" "v" "w" "x" "y" "z"
LETTERS
#>  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
#> [20] "T" "U" "V" "W" "X" "Y" "Z"
These constants provide vectors containing the 26 letters of the Latin alphabet (in lowercase vs. uppercase, respectively). R also provides special functions that evaluate or change objects of type character:
slogan <- "Make America great again."
nchar(slogan)
#> [1] 25
toupper(slogan)
#> [1] "MAKE AMERICA GREAT AGAIN."
tolower(slogan)
#> [1] "make america great again."
In terms of representation, the difference between A and a is not one of data type.
They are both objects of type “character”, but different ones.
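This point can be checked directly. In the following sketch, both letters have the same data type, but are not equal as values unless we convert one of them:

```r
"A" == "a"                    # different character values
#> [1] FALSE
typeof("A") == typeof("a")    # same data type
#> [1] TRUE
toupper("a") == "A"           # equal after conversion
#> [1] TRUE
```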
Dates and times
Dates and times are more complicated data types, not because they are complicated per se, but because their definition and interpretation need to account for a lot of context, conventions, and some irregularities. At this early point in our R careers, we only need to know that such data types exist and consider two particular functions that return the current date and time:
Sys.Date()
#> [1] "2022-04-22"
Sys.time()
#> [1] "2022-04-22 15:47:36 CEST"
Note the data type (and mode) of both:
typeof(Sys.Date())
#> [1] "double"
mode(Sys.Date())
#> [1] "numeric"
unclass(Sys.Date()) # show internal representation
#> [1] 19104
typeof(Sys.time())
#> [1] "double"
mode(Sys.time())
#> [1] "numeric"
unclass(Sys.time()) # show internal representation
#> [1] 1650635256
Thus, R internally stores dates and times as numbers (of type “double”), but then interprets them to print them in a format that facilitates our interpretation of them. Note that this format is subject to many conventions and idiosyncrasies (e.g., regarding the arrangement of elements and local time zones).
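To make the link between numbers and dates more tangible, the following sketch converts in both directions. It reuses the date and the numbers from the output above and assumes the conventional origin of 1970-01-01 (in UTC), from which dates are counted in days and times in seconds. The object names d and tm are made up for this illustration:

```r
d <- as.Date("2022-04-22")
unclass(d)                               # days since 1970-01-01
#> [1] 19104
as.Date(19104, origin = "1970-01-01")    # from number back to date
#> [1] "2022-04-22"
d + 1                                    # date arithmetic: the next day
#> [1] "2022-04-23"

# seconds since 1970-01-01, displayed in UTC:
tm <- as.POSIXct(1650635256, origin = "1970-01-01", tz = "UTC")
format(tm, "%Y-%m-%d %H:%M:%S")
#> [1] "2022-04-22 13:47:36"
```

The time shown here is two hours earlier than in the output above, as CEST is two hours ahead of UTC.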
Missing values
A special type of value is NA, which stands for not available, not applicable, or missing values:
typeof(NA)
#> [1] "logical"
mode(NA)
#> [1] "logical"
ms <- NA # NA_integer_
ms
#> [1] NA
By default, NA is of a logical data type, but other data types have their own type of NA value: NA_integer_ (of type integer), NA_real_ (of type double), and NA_character_ (of type character).
But as they are flexibly changed into each other when necessary, we can usually ignore this distinction.
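The following sketch shows the types of the different NA values and how a plain NA adopts the type of the other elements when it is combined with them in a vector:

```r
typeof(NA_integer_)
#> [1] "integer"
typeof(NA_real_)
#> [1] "double"
typeof(NA_character_)
#> [1] "character"

# Within a vector, NA adopts the type of the other elements:
typeof(c(1, NA))
#> [1] "double"
typeof(c("a", NA))
#> [1] "character"
```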
In R, NA values are typically “contagious” in the sense of creating more NA values when functions are applied to them:
NA + 1
#> [1] NA
sum(1, 2, NA, 4)
#> [1] NA
but many functions have ways of instructing R to ignore missing values:
sum(1, 2, NA, 4, na.rm = TRUE)
#> [1] 7
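The same logic applies to other functions. As a sketch (the vector v is made-up data), we can use is.na() to locate and count missing values, and na.rm = TRUE to compute a mean of the remaining values:

```r
v <- c(1, 2, NA, 4)     # hypothetical data with one missing value
is.na(v)                # which values are missing?
#> [1] FALSE FALSE  TRUE FALSE
sum(is.na(v))           # number of missing values
#> [1] 1
mean(v)                 # NA propagates
#> [1] NA
mean(v, na.rm = TRUE)   # mean of the 3 remaining values
#> [1] 2.333333
```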
7. In this example, the R object nums is defined as a vector of five numbers, which we will cover below.