1.2 Data vs. functions

“To understand computations in R, two slogans are helpful:
- Everything that exists is an object.
- Everything that happens is a function call.”
John Chambers

Using R and most other programming languages consists in

  1. defining or loading data (material, values), and
  2. evaluating functions (actions, verbs).

Confusingly, both data and functions in R are objects (stuff), and evaluating functions typically returns or creates new data objects (more stuff). To distinguish data from functions, think of data as matter (i.e., material, stuff, or values) that is being measured or manipulated, and of functions as procedures (i.e., actions, operations, or verbs) that measure or manipulate data.
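
As a small illustration of this distinction, the following sketch checks which objects are data and which are functions. The object names n and result are chosen here for illustration only:

```r
# A data object: the number 42, stored under the (hypothetical) name n
n <- 42

# Both data and functions are objects that we can inspect:
is.function(n)    # is n a function?
#> [1] FALSE
is.function(sum)  # is sum a function?
#> [1] TRUE
is.numeric(n)     # is n numeric data?
#> [1] TRUE

# Applying a function to data returns a new data object:
result <- sum(n, 1)
result
#> [1] 43
```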

In the following, we will introduce some ways to describe data (by shape and type), learn to create new data objects (by assigning them with <-), and apply some pre-defined functions to these data objects.

1.2.1 Data objects

In R, different data objects are characterized by their shape and by their type.

  1. The most common shapes of data are:

    • scalars: atomic objects (i.e., a data point, with a length of 1);
    • vectors: a chain/sequence of objects of the same type (i.e., extending in 1 dimension: length);
    • data frames/matrices/tibbles: rectangular data (i.e., tables with 2 dimensions: rows vs. columns).
  2. The most common types of data are:

    • numeric data (of type double or integer);
    • text data (of type character);
    • logical data (of type logical, a.k.a. Boolean values).

There are other data types (e.g., to represent dates, times, time series, or geographical data), but those do not need to concern us here.
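
To make these shapes and types more concrete, here is a brief sketch using base R. The object names are made up for illustration, and the functions c() and data.frame(), which create vectors and tables, are used here ahead of their proper introduction:

```r
# Shapes: scalar, vector, and rectangular data
a_scalar <- 3.14                                       # a single value
a_vector <- c(1, 2, 3)                                 # several values of the same type
a_table  <- data.frame(id = 1:2, name = c("A", "B"))   # rows and columns

length(a_scalar)
#> [1] 1
length(a_vector)
#> [1] 3
dim(a_table)   # number of rows and columns
#> [1] 2 2

# Types: numeric (double or integer), character, and logical
typeof(3.14)
#> [1] "double"
typeof(2L)
#> [1] "integer"
typeof("hello")
#> [1] "character"
typeof(TRUE)
#> [1] "logical"
```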

1.2.2 Using functions

In R, functions are ‘action objects’ that are applied to ‘data objects’ (a function’s so-called arguments, e.g., a1 and a2) by specifying the function name and providing the arguments in round parentheses (i.e., function(a1, a2)). Think of a function as a process that takes some input (its arguments) and transforms it into output (the result returned by the function). An example of a function call with 2 arguments is:

sum(1, 2)
#> [1] 3

Here, the function sum is applied to 2 numeric arguments 1 and 2 (or a data structure that consists of 2 numeric scalars). Evaluating the expression sum(1, 2) returns a new data object 3, which is the sum of the 2 original arguments.

Note that we are currently using functions that are built into R or provided by R packages. As we progress, we will learn to write our own functions (in Chapter 11). Once we have written many useful functions, we can share them with others by writing R packages.

Practice

  • Evaluate the following expressions (i.e., functions with arguments) and try to describe what they are doing (in terms of applying functions to data arguments to obtain new data):
# Functions with numeric arguments:
sum(1, 2, -3, 4)
min(1, 2, -3, 4)
mean(1, 2, -3, 4)
sqrt(121)

# Functions with characters as arguments:
paste0("ab", "cd")
substr(x = "television", start = 5, stop = 10)
isTRUE(2 > 1 & nchar("abc") == 4)

Obtaining help and omitting argument names

To obtain help on an existing function, you can call ? and the name of the desired function. For example:

# Note: Evaluate 
?substr
# to see the documentation of the substr() function.

The argument names of functions can be omitted. In this case, R assumes that the arguments are entered in the order in which they appear in the function definition. When argument names are stated explicitly, their order can be switched. Thus, the following function calls all evaluate to the same result:

# Ways of calling the same function:
substr(x = "television", start = 5, stop = 10)
substr("television", 5, 10)
substr(start = 5, x = "television", stop = 10)
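
One way to verify that these calls are equivalent is to store their results and compare them with identical(). The following sketch uses the hypothetical object names r1 to r3 and also shows that the order matters once the argument names are omitted:

```r
# Three ways of writing the same call:
r1 <- substr(x = "television", start = 5, stop = 10)  # all names, default order
r2 <- substr("television", 5, 10)                     # no names, order matters
r3 <- substr(start = 5, x = "television", stop = 10)  # all names, order switched

r1
#> [1] "vision"
identical(r1, r2)
#> [1] TRUE
identical(r1, r3)
#> [1] TRUE

# Without names, switching the order changes the meaning:
substr("television", 10, 5)  # now start = 10 and stop = 5
#> [1] ""
```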

Practice

  • Try predicting the outcome of substr("television", 20, 30) and check your prediction by evaluating it (and some variants).
substr("television", 20,  30)  

# variants: 
substr("television",  1,  30)
substr("television",  1,   4)
substr("television",  9,  99)
substr("abcdefg",  4,  6)

1.2.3 Exploring functions

Exploring new functions is an important task and skill when dealing with a functional programming language like R. We can think of a function as a black box that accepts certain inputs (called arguments) and transforms them into outputs. Understanding a new function means mapping the relation between its inputs and outputs. Importantly, we do not need to understand how the function internally transforms inputs into outputs. Instead, adopting a “functional” perspective on a particular function involves recognizing

  • its purpose: What does this function do? What inputs does it take and what output does it yield?

  • its options: What arguments does this function provide? How do they modify the function’s output?

In the following, we will explore both a simple and a more complex function to illustrate how functions can be understood by studying the relation between their inputs and outputs.

Exploring a simple function

Many functions are almost self-explanatory and relatively easy to use, as they require some data and offer only a small number of arguments that specify how the data is to be processed. Ideally, the name of a function should state what it does. An example of such a straightforward function is the sum() function, which — surprise, surprise — computes the sum of its arguments:

sum(1, 2, -3, 4)
#> [1] 4

This seems trivial, but evaluating ?sum reveals that sum() also allows for an argument na.rm that is set to the logical value of FALSE by default. This argument essentially asks “Should missing values be removed?”. To see and understand its effects, we need to include a missing value in our numeric arguments. In R, missing values are denoted as NA:

sum(1, 2, -3, NA, 4)  
#> [1] NA
# is the same as:
sum(1, 2, -3, NA, 4, na.rm = FALSE)
#> [1] NA

# Note the difference:
sum(1, 2, -3, NA, 4, na.rm = TRUE)
#> [1] 4

These explorations show that sum() normally yields a missing value (NA) when its arguments include a missing value (NA), as na.rm = FALSE by default. To instruct the sum() function to ignore or remove missing values from its computation, we need to specify na.rm = TRUE.
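
The na.rm argument is not specific to sum(). As a brief sketch, the base R functions max() and min() accept the same argument and behave analogously (checking ?max or the help page of any other function remains advisable before relying on this):

```r
# A missing value also propagates through max():
max(1, 2, -3, NA, 4)
#> [1] NA

# Removing missing values before computing the result:
max(1, 2, -3, NA, 4, na.rm = TRUE)
#> [1] 4
min(1, 2, -3, NA, 4, na.rm = TRUE)
#> [1] -3
```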

Exploring a complex function

Learning to use a more complex function is a bit like getting a new toy that comes with many buttons: What happens when we push this one? What if we combine these two, but not this other one? Given the ubiquity of functions in R, understanding how they work is an important skill. In fact, exploring a new function is a bit like doing research: As functions typically behave systematically, we can test hypotheses about the effects of their arguments. Discovering the systematic properties of some system can be entertaining and exciting.

To illustrate this process, the ds4psy package contains a function plot_fn() that intentionally obscures the meaning of its arguments. In the following, we will explore the plot_fn() function to find out what its arguments do. Loading the package and evaluating plot_fn() with its default arguments yields the following:

library(ds4psy)  # load the ds4psy package
plot_fn()

It seems that the function plots several colored squares in a row. However, calling it repeatedly yields a varying number of squares, so there appears to be some random element to the function. When entering plot_fn() in the editor or console, placing the cursor within the parentheses (), and hitting the Tab key of our keyboard (within the RStudio IDE), we see that plot_fn() accepts a range of arguments:

  • x A (natural) number. Default: x = NA.
  • y A (decimal) number. Default: y = 1.
  • A Boolean. Default: A = TRUE.
  • B Boolean. Default: B = FALSE.
  • C Boolean. Default: C = TRUE.
  • D Boolean. Default: D = FALSE.
  • E Boolean. Default: E = FALSE.
  • F Boolean. Default: F = FALSE.
  • f A color palette (e.g., as a vector). Default: f = c(rev(pal_seeblau), "white", pal_pinky). Note: Using colors of the unikn package by default.
  • g A color (e.g., as a character). Default: g = "white".

Calling a function without arguments is the same as calling the function with its default arguments (except for random elements):

plot_fn()  # is the same as calling the function with its default settings:
plot_fn(x = NA, y = 1, A = TRUE, B = FALSE, C = TRUE, D = FALSE, E = FALSE, F = FALSE)
# (except for random elements). 
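
Besides using the Tab key, the arguments and default values of a function can also be inspected in code. The following sketch uses a hypothetical toy function toy_fn() (not part of ds4psy) so that it runs without any package, together with the base R function formals():

```r
# A hypothetical toy function (not part of ds4psy) with three default values:
toy_fn <- function(x = NA, y = 1, A = TRUE) {
  paste("x:", x, "| y:", y, "| A:", A)
}

# List the argument names and look up one default value:
names(formals(toy_fn))
#> [1] "x" "y" "A"
formals(toy_fn)$y
#> [1] 1

# Calling the function without arguments uses its defaults:
identical(toy_fn(), toy_fn(x = NA, y = 1, A = TRUE))
#> [1] TRUE
```

With the ds4psy package loaded, formals(plot_fn) or args(plot_fn) should list the arguments of plot_fn() in the same way.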

Its long list of arguments promises that plot_fn() can do more than we have discovered so far. But as the argument names and the documentation of plot_fn() are deliberately kept rather cryptic, we need to call the function with different combinations of argument values to find out what exactly these arguments do. So let’s investigate each argument in turn…

  • x is a (natural) number. Let’s try out some values for x:
plot_fn(x = 1)

plot_fn(x = 2)

plot_fn(x = 3)

plot_fn(x = 7)

The results of these tests suggest that x specifies the number of squares to plot. However, if x is set to NA or some non-natural number (e.g., x = 1/2), a random natural number is chosen for x (see for yourself, to verify this).
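
How plot_fn() implements this fallback internally is not shown here. One plausible approach is sketched below with a hypothetical helper function choose_n() and an arbitrary upper limit max_n (both are assumptions made for illustration, not the actual code of plot_fn()):

```r
# Hypothetical helper, NOT the actual code of plot_fn():
choose_n <- function(x = NA, max_n = 10) {
  # TRUE only if x is a number like 1, 2, 3, ...
  is_natural <- is.numeric(x) && !is.na(x) && x >= 1 && x == round(x)
  if (is_natural) {
    x                    # keep a valid natural number
  } else {
    sample(1:max_n, 1)   # otherwise draw a random natural number
  }
}

choose_n(7)
#> [1] 7
choose_n(NA)    # a random number from 1 to 10
choose_n(1/2)   # also a random number from 1 to 10
```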

Regarding the 2nd argument y, we have been informed:

  • y is a (decimal) number, with default: y = 1. Let’s try some values that differ from y = 1:
plot_fn(x = 5, y = 2)

plot_fn(x = 5, y = 3.14)

Strangely, nothing new seems to happen — but we also did not receive an error message. A possible reason for this is that the value of y governs some property of the plot that is currently invisible or switched off. The long list of Boolean arguments (i.e., logical arguments that are either TRUE or FALSE) suggests that there is a lot to switch on and off in plot_fn(). So let’s continue with our explorations of the arguments A to F and return to y later:

plot_fn(x = 5, y = 1, A = TRUE)

plot_fn(x = 5, y = 1, A = FALSE)

The argument A seems to determine the orientation of the squares from a horizontal row (A = TRUE) to a vertical column (A = FALSE). We continue with B:

plot_fn(x = 5, y = 1, A = TRUE, B = FALSE)

plot_fn(x = 5, y = 1, A = TRUE, B = TRUE)

Interestingly, B seems to switch from a linear plot to a pie plot. To understand this better, we should check what happens when we also vary A:

plot_fn(x = 5, y = 1, A = FALSE, B = FALSE)

plot_fn(x = 5, y = 1, A = FALSE, B = TRUE)

This confirms our intuition: Setting B = TRUE plots the row or column of squares in a circular fashion (i.e., using polar coordinates).

To see what C does, we set it to FALSE:

plot_fn(x = 5, y = 1, A = FALSE, B = TRUE, C = FALSE)

It seems that the order of colors has changed from regular to random. Let’s verify this for a row with 11 squares:

plot_fn(x = 11, C = TRUE)

plot_fn(x = 11, C = FALSE)

This is in line with our expectations, so we are satisfied for now. So on to argument D:

plot_fn(x = 11, D = FALSE)

plot_fn(x = 11, D = TRUE)

Setting D = TRUE seems to have caused some white lines to appear between the squares. As lines have additional parameters (e.g., their width and color), we now could reconsider the argument y that we failed to understand above. Let’s add a value of y = 5 to our last command:

plot_fn(x = 11, y = 5, D = TRUE)

This supports our hypothesis that y may regulate the line width between or around squares. As these lines are white, our attention shifts to another argument with a default of g = "white" (with a letter “g” in lowercase). Let’s vary y and g (changing g = "white" to g = "black") to observe the effect:

plot_fn(x = 11, y = 8, D = TRUE, g = "black")

Again, our hunches are confirmed (or, for the logical positivists among us, not rejected yet). Before we decipher the meaning of the remaining arguments, let’s double-check what we know so far by creating a plot with 5 elements that are arranged in a column and plotted in a circular fashion, with lines of width 4 in a “darkblue” color:

plot_fn(x = 5, y = 4, A = FALSE, B = TRUE, C = FALSE, D = TRUE, g = "darkblue")

Again, our intuitions are confirmed, which motivates us to tackle arguments E and F together:

plot_fn(x = 5, E = TRUE,  F = FALSE)

plot_fn(x = 5, E = FALSE, F = TRUE)

plot_fn(x = 5, E = TRUE,  F = TRUE)

We infer that E = TRUE causes the squares to be labeled (with numbers), whereas F = TRUE adds a label above the plot (indicating the number of elements). To see whether this also holds for other versions of the plot, we add both arguments to our more complicated command from above:

plot_fn(x = 5, y = 4, A = FALSE, B = TRUE, C = FALSE, D = TRUE, E = TRUE, F = TRUE, g = "darkblue")

This satisfies us and leaves us only with the lowercase f as the final argument (not to be confused with uppercase F). Its default setting of

  • Default: f = c(rev(pal_seeblau), "white", pal_pinky)

suggests that this regulates the colors of the squares or rings. To check this, we redo our previous command with a version of f that replaces the complicated c(...) expression with a simple color “gold”:

plot_fn(x = 5, y = 4, A = FALSE, B = TRUE, C = FALSE, D = TRUE, E = TRUE, F = TRUE, g = "darkblue", f = "gold")

In Section 1.4, we will learn that c(...) defines a vector. To check the effect of a vector of colors on our plot, we try to set f to a vector of 2 color names: f = c("black", "white"), and the line color g to "gold":

plot_fn(x = 5, y = 4, A = FALSE, B = TRUE, C = FALSE, D = TRUE, E = TRUE, F = TRUE, 
        f = c("black", "white"), g = "gold")

The result shows that plot_fn() uses the colors entered in f to create some color gradient (as the plot contains various shades of grey).
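
We do not know how plot_fn() computes this gradient internally. As a sketch of the general idea, the base R function colorRampPalette() (from the grDevices package, which is loaded by default) interpolates between given colors. The object names grey_ramp and shades are chosen for illustration:

```r
# Create a function that interpolates between two colors:
grey_ramp <- colorRampPalette(c("black", "white"))

# Request 5 colors along the gradient (returned as hexadecimal color codes):
shades <- grey_ramp(5)
shades          # from black via shades of grey to white
length(shades)
#> [1] 5
```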

This concludes our exploration of plot_fn() — leaving us with the impression that this function can be used to create a wide variety of plots. Overall, our explorations have discovered the following meanings of the plot_fn() function’s arguments:

  • x A (natural) number that specifies the number of squares to plot. Default: x = NA (choosing a random natural number for x).
  • y A (decimal) number that specifies the width of lines around squares when D = TRUE. Default: y = 1.
  • A A Boolean value that regulates the orientation of the plot from a horizontal row (A = TRUE) to a vertical column (A = FALSE). Default: A = TRUE (i.e., row).
  • B A Boolean value that regulates the coordinate system from linear to circular/polar. Default: B = FALSE (i.e., linear).
  • C A Boolean value that regulates whether squares are sorted or unsorted. Default: C = TRUE (i.e., sorted).
  • D A Boolean value that specifies whether lines should be shown around squares. Default: D = FALSE (i.e., no lines).
  • E A Boolean value that specifies whether the parts should be labeled (with numbers). Default: E = FALSE (i.e., no labels).
  • F A Boolean value that specifies whether the plot should be labeled (with the number of parts). Default: F = FALSE.
  • f A color palette (e.g., as a vector). Default: f = c(rev(pal_seeblau), "white", pal_pinky), using color palettes from the unikn package.
  • g A color name (e.g., as a character) for the line around squares. Default: g = "white".
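
Our strategy of varying only one or two arguments at a time was a practical necessity: The six Boolean arguments alone can be combined in 2^6 = 64 ways. As a rough sketch, the base R function expand.grid() can enumerate these combinations (the object name combos is made up for illustration):

```r
# All combinations of the six Boolean arguments A to F:
combos <- expand.grid(A = c(TRUE, FALSE), B = c(TRUE, FALSE),
                      C = c(TRUE, FALSE), D = c(TRUE, FALSE),
                      E = c(TRUE, FALSE), F = c(TRUE, FALSE))

nrow(combos)      # number of combinations
#> [1] 64
head(combos, 3)   # the first three combinations
```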

Practice

  • You can practice exploring another complex function in Exercise 1.2 below (see Section 1.8.1).
    More generally, exploring functions is a task that will accompany and enrich your entire R career.

In Chapter 11, we will learn to write our own functions. But before we do so, we need to learn more about other objects in R, and then spend several chapters on learning functions written by the authors of R and various packages that allow us to manipulate objects. Most importantly at this point, we need to find out how objects can be defined and named, and what kinds of objects exist in R.

1.2.4 Defining objects

To define a new object o as x, use the assignment operator <- (i.e., o <- x) and note that object names are case-sensitive (i.e., a and A are different object names).

For example, we can assign the output of a function to an object:

s <- sum(1, 2)
s
#> [1] 3

Here, we created a new object (named s) and assigned the sum of 1 and 2 to it. As s is a numeric object (with the value 3), we can now apply any numeric function to it:

s * 2
#> [1] 6
s^2
#> [1] 9

Here is another example of creating 2 simple objects and applying simple arithmetic functions to them:

o <- 10  # assign/set o to 10
O <-  5  # assign/set O to  5

# Computing with objects: 
o * O
#> [1] 50
o * 0
#> [1] 0
o / O * 0
#> [1] 0

In R, objects can be described by their length and type.

  • Objects with a length of 1 are called scalars, longer objects (i.e., length(object) > 1) are vectors or lists.

  • The type of an object corresponds to the type of the value that has been assigned to it (e.g., numeric, character, or logical).

Hence, we can change an object’s type by re-assigning it:

o <- 100
o
#> [1] 100

# Check type: 
is.numeric(o)
#> [1] TRUE
is.character(o)
#> [1] FALSE
is.logical(o)
#> [1] FALSE

# Re-assigning o:
o <- paste0("ene", " ", "mene", " ", "mu")
o
#> [1] "ene mene mu"

# Check type: 
is.numeric(o)
#> [1] FALSE
is.character(o)
#> [1] TRUE
is.logical(o)
#> [1] FALSE
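
Rather than asking a series of yes/no questions, we can also query an object's type and length directly. The following sketch uses the base R functions typeof(), class(), length(), and nchar():

```r
o <- 100
typeof(o)   # internal storage type
#> [1] "double"
class(o)
#> [1] "numeric"
length(o)
#> [1] 1

o <- "ene mene mu"
typeof(o)
#> [1] "character"
length(o)   # a single character string, regardless of its number of letters
#> [1] 1
nchar(o)    # number of characters within the string
#> [1] 11
```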

Practice

  • Assign o to 1 > 2 and check the type of the resulting object.
# Re-assign o:
o <- 1 > 2
o

# Check type: 
is.numeric(o)
is.character(o)
is.logical(o)

1.2.5 Naming objects

Finding good names for variables and functions is key for writing transparent code, but can often be challenging, especially as our understanding of a topic changes over time.

Rules

When naming objects, beware of the following characteristics and constraints:

  • R is case sensitive (so tea_pot, Tea_pot and tea_Pot are different names that denote 3 different objects);
  • No spaces inside variable names (even though names like `tea pot` are possible when enclosed in backticks);

  • No special keywords that are reserved for R commands (e.g., TRUE, FALSE, function, for, in, next, break, if, else, repeat, while, NULL, Inf, NaN, NA and some variants like NA_character_, NA_integer_, …).
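
These rules can be checked with a few short commands. In the following sketch, the attempt to assign a value to the reserved word TRUE is wrapped in tryCatch(), so that the resulting error is caught and replaced by a message of our own wording:

```r
# Case matters: these are two different objects
tea_pot <- 1
Tea_pot <- 2
tea_pot == Tea_pot
#> [1] FALSE

# A name containing a space must be enclosed in backticks:
`tea pot` <- 3
`tea pot` + 1
#> [1] 4

# Assigning to a reserved word fails (the error is caught here):
attempt <- tryCatch(
  eval(parse(text = "TRUE <- 5")),
  error = function(e) "error: cannot assign to TRUE"
)
attempt
#> [1] "error: cannot assign to TRUE"
```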

Recommendations

Naming objects is partly a matter of style, but clarity and consistency are always important. Sensible recommendations include:

  • Aim for short and consistent names;

  • Avoid dots and special characters (e.g., spaces, quotes, etc.) in names;

  • Consistently use either snake_case (with underscores) or camelCase (with capitalised first letters) for combined names;

  • Avoid lazy shortcuts like T for TRUE and F for FALSE. They may save a few milliseconds when typing a value, but create predictable confusion when someone sets T <- FALSE somewhere.
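
The following sketch shows why this shortcut is risky: Unlike the reserved word TRUE, the name T is an ordinary object that can be overwritten.

```r
T            # by default, T is a shortcut for TRUE
#> [1] TRUE

T <- FALSE   # but T is an ordinary name that can be re-assigned
T
#> [1] FALSE
TRUE         # the reserved word TRUE is unaffected
#> [1] TRUE

rm(T)        # remove our object to restore the default
T
#> [1] TRUE
```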