11.2 Essentials of functions

Realizing that functions are powerful tools (see Section 11.1) whets our appetite for creating our own functions. Thus, this section first illustrates how new functions can be defined in R (Section 11.2.1). However, the process of defining new functions is closely connected to our ability to understand functions (Section 11.2.2) and to check functions (Section 11.2.3). We conclude by mentioning some issues of style that matter for creating understandable and usable functions (Section 11.2.4).

11.2.1 Defining functions

The basic structure for writing a new function looks as follows:

name <- function(<args>) {<body>}

Before going into any more detail, let’s pause for a moment and note some remarkable facts: In R, functions are objects, just like any other object (e.g., data, parameters, or variables). Consequently, a new function is defined, just like any other object, by choosing its name and assigning some content to it. Thus, the key difference between simpler data objects and the “action object” of a function must be in its content: Rather than only describing what an object is, defining a function must include an instruction on what to do.
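The claim that functions are ordinary objects can be illustrated with a minimal sketch. The names add_one, plus_one, and tools are hypothetical and only serve this illustration:

```r
# A function is created by assigning it to a name, like any other object:
add_one <- function(x) {
  x + 1
}

is.function(add_one)  # TRUE: add_one is a function object
class(add_one)        # "function"

# Like other objects, a function can be copied to a new name ...
plus_one <- add_one
plus_one(1:3)         # 2 3 4

# ... or stored in a list and retrieved from it:
tools <- list(inc = add_one, avg = mean)
tools$inc(10)          # 11
tools$avg(c(1, 2, 3))  # 2
```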

Defining an instruction on “what to do” immediately raises the question: Do what with what? This question helps us disentangle the abstract structure of a function definition into its parts. Writing a function requires three distinct considerations:

  1. Do what?: What is the task to be performed by the function? And what would be a good name for the function? Function names should clearly convey their purpose (“Which task does this function solve?”), resemble verbs (ideally say what the function does, e.g., plot, or state its output, e.g., mean), and should be short (as they will be typed many times).

  2. With what?: What does the function work with? Which inputs does it take? Any input that is needed for the function to perform its task must be supplied as an argument in <args>. Note that <args> are enclosed by round parentheses (...) and multiple arguments are separated by commas. We further characterize and distinguish between different types of arguments:

    • Arguments typically include data (e.g., scalars, vectors, tables, or anything else that the function uses to perform its task) and can contain additional details or parameters (e.g., instructions on how something is to be done: Which type of calculation is desired? How should NA values be treated? Which color should be used? etc.). Data arguments are typically listed first.

    • Any argument can be optional (by including a default value that is used when no other argument value is provided) or required (aka. mandatory, by not providing a default value). It is common to make key data arguments required, and use optional arguments for details (e.g., setting parameter values).

    • If a function calls other functions and we want to allow users to pass arguments to these latter functions, we can add ... (aka. “dot-dot-dot”) as a special argument. However, this requires that the functions called within our function can deal with whatever arguments users later provide. Otherwise, passing arguments via ... can cause problems and unexpected results.

  3. Do how?: Once we have identified which task is to be performed with which input, the key question remaining is: How is the task to be solved? The <body> of a function uses the inputs provided by <args> to perform the task for which the function is created. Thus, the part in the curly brackets {...} is typically the longest and most complicated part of a function. Again, this part can be split into multiple parts:

    • Although some functions are primarily used for their side-effects (e.g., load or save an object, print a message, or create a visualization, etc.), most functions are created to (also) return some sort of output (e.g., some computed result). This is achieved by calling the return() function in the function <body> (e.g., return(result)), typically as its last statement.[1] When the function does not contain an explicit return() statement, it returns the value of the last expression evaluated in the function.

    • The output of a function can assume many different data types (scalar, vector, list, table, etc.). For instance, the result of 1:3 + 3:1 is a vector 4, 4, 4.
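The distinctions between required arguments, optional arguments, and ... can be sketched with a small example. The function mean_rounded() and its argument names are hypothetical; the sketch assumes that passing extra arguments on to mean() is what we want:

```r
# v is required, digits is optional (default: 1),
# and ... is passed on to mean():
mean_rounded <- function(v, digits = 1, ...) {
  result <- round(mean(v, ...), digits = digits)
  return(result)
}

mean_rounded(c(1, 2, 4))                    # 2.3 (default digits used)
mean_rounded(c(1, 2, 4), digits = 0)        # 2
mean_rounded(c(1, 2, NA, 4))                # NA
mean_rounded(c(1, 2, NA, 4), na.rm = TRUE)  # 2.3 (na.rm is passed to mean())

# A possible pitfall of ...: a misspelled argument name is silently ignored
mean_rounded(c(1, 2, NA, 4), na_rm = TRUE)  # NA
```

The last call shows why ... can yield unexpected results: mean() accepts and ignores unknown arguments, so the typo goes unnoticed.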

When taking these considerations into account, a more detailed template of a typical R function is:

task_name <- function(arg_1, arg_2 = FALSE) {
  
  # 1. Check inputs: 
  {Verify that arg_1 and arg_2 are appropriate inputs.}
    
  # 2. Solve task by using arguments: 
  {Use arg_1 and arg_2 to solve the task.}
    
  # 3. Collect and assign outputs:
  result <- {The task solution.}
      
  # 4. Return output: 
  return(result)
    
}

In this template, a new function task_name() is being defined. Its task is solved by accepting two arguments: arg_1 is a mandatory argument (as no default is provided) and arg_2 is an optional argument (with a default value of FALSE). In the function body, the arguments are first checked (e.g., whether they have appropriate types and values) and then used to solve the task. The task solution is then assigned to an object result and returned by return(result).

Typical uses of this function task_name() could look as follows:

# Using this function:
task_name(arg_1 = x)                 # providing only required argument
task_name(arg_1 = y, arg_2 = FALSE)  # providing both arguments
task_name(arg_2 = TRUE, arg_1 = z)   # providing both arguments in reverse order
task_name(z, FALSE)                  # providing arguments by argument order
task_name(z)                         # providing only the mandatory (1st) argument

We see that, although the function task_name() was purely abstract, we can say quite a bit about its functionality by merely knowing its arguments. To render the interplay of a function and its arguments more concrete, let’s consider some examples of simple functions.

A power function

Consider the following definition of a power() function:

power <- function(x, exp = 1) {
  x^exp
}

Note that the function’s definition only includes a single line of code. Thus, our power() function really is just a wrapper for R’s arithmetic operator ^, but still allows illustrating all key features of a function. We can explicate its definition as follows:

  • The function power() has two arguments: A mandatory argument x and an optional argument exp with a default value of 1.

  • The function computes x^exp and returns the result (as this is the final statement in the function’s body).

  • Thus, the task performed by power consists in raising x to the exp-th power.

Perhaps not quite as obvious are the following three observations about our power() function:

  • Although the function does not check or verify its inputs, x and exp are assumed to be numeric. Providing other types of inputs may cause errors.

  • Naming the function’s 2nd argument exp (for “exponent”) works, but is easily confused with the base R function exp() for computing exponential values (e.g., exp(1) \(= e \approx 2.718\)).

  • Given what we know about R, the function probably also works for vector inputs of x. However, it is harder to guess what happens if exp is a vector, or if both arguments are vectors.

Checking a function

It is very important to run a range of checks after writing a new function. When doing these checks, examples that use unusual inputs (like NA, NULL, or arguments of different types) are particularly informative. Here are some possible ways of checking how power() works:

# Check:
power(x = 3)
#> [1] 3
power(x = 3, exp = 2)
#> [1] 9

# Note that the function also works for vector inputs: 
power(x = 1:5, exp = 2) 
#> [1]  1  4  9 16 25
power(x = 2,   exp = 1:4)
#> [1]  2  4  8 16

# Note what happens when both x and exp are vectors: 
power(x = 1:3, exp = 1:3)
#> [1]  1  4 27
power(x = 1:2, exp = 1:4)
#> [1]  1  4  1 16

# Note what happens with NA values: 
power(x = NA) 
#> [1] NA
power(x = 3, exp = NA) 
#> [1] NA
# => NA values are 'contagious'.

When creating a new function, it is always a good idea to explore and test its limits. Here are some boundary cases that would result in warning or error messages:

# Warning:
power(x = 1:2, exp = 1:3)

# Errors:
power()                  # no argument(s)
power(x = "A")           # x is non-numeric argument
power(x = 3, exp = "B")  # exp is non-numeric argument

It is not necessarily problematic when functions return warnings or errors. In fact, such messages are often very helpful for understanding functions. As a function’s author or designer (i.e., programmer), we primarily need to decide whether returning an error is justified, given the intended use of the function. If a user enters no arguments or the wrong type of argument to a function, yielding an error can be the appropriate way of saying: This is wrong and will fail. But expert programmers also aim to see their functions from the perspectives of their future users: Was this input reasonable, given the function’s purpose? Do the names of the function and its arguments clearly signal their purposes? Will users know which types and shapes of data are to be provided as inputs? What else may users want to achieve with this function? Many misunderstandings can be avoided by choosing transparent names (for both the function and its arguments) and providing good documentation and examples for a function. And if you anticipate many unconventional uses of a function or its arguments, it may be polite to check user inputs for their shape or type, and issue messages or warnings to the user if something was unexpected, missing, or wrong.
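Checking user inputs could look as follows. This is only a sketch of one possible design: the name power_checked() and the wording of its messages are assumptions, and other ways of handling unexpected inputs are equally defensible:

```r
# Hypothetical variant of power() that checks its inputs:
power_checked <- function(x, exp = 1) {

  # Stop with an informative error if an input has the wrong type:
  if (!is.numeric(x))   { stop("x must be numeric, not ", class(x)[1]) }
  if (!is.numeric(exp)) { stop("exp must be numeric, not ", class(exp)[1]) }

  # Notify the user if vectors of different lengths will be recycled:
  if (length(x) > 1 && length(exp) > 1 && length(x) != length(exp)) {
    message("x and exp differ in length: The shorter vector is recycled.")
  }

  x^exp
}

power_checked(3, 2)      # 9
power_checked(1:2, 1:4)  # 1 4 1 16 (plus a message about recycling)
power_checked(NA_real_)  # NA (a numeric missing value passes the check)
# power_checked("A")     # Error: x must be numeric, not character
# power_checked(NA)      # Error: a plain NA is of type logical
```

Note the trade-off in the last line: a strict type check also rejects a plain NA (which is logical), whereas the original power() returned NA in this case. Whether this is desirable depends on the function's intended use.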

Omitting argument names

In R, it is possible to omit the argument names of functions. If this is done, the values provided are assigned to the arguments in the order in which they are provided:

# Omitting argument names:
power(3)     # names can be omitted, and 
#> [1] 3
power(3, 2)  # arguments are used in order given.
#> [1] 9
power(1:5, 2)
#> [1]  1  4  9 16 25
power(2, 1:4)
#> [1]  2  4  8 16

Although omitting argument names saves typing, it is typically more informative to explicitly state the arguments. This makes it more transparent to future readers of your code (including yourself) which value is assigned to which argument and has the advantage that you can enter arguments in any order:

# When arguments are named: 
power(exp = 3, x = 2)  # order is irrelevant.
#> [1] 8
# => Recommendation: Always provide argument names!

Thus, although laziness will occasionally cause us to omit argument names (particularly for common functions), it is good practice to always provide argument names, especially when we want others to read and understand our code.

Practice

Let’s try defining some additional functions:

  1. Write a new function that computes the n-th root of a number (or vector of numbers) x, then check it, and explore its limits.

Hint: The mathematical fact that \(\sqrt[n]{x} = x^{1/n}\) may be helpful for solving this task.

  2. One of the first functions that we encountered in this book (in Sections 1.2.5 and 1.2.2) was sum(). Incidentally, the sum() function is more complex than it first seemed, as its arguments can be a combination of vectors and scalars:

sum(1, 2, 3, 4)
#> [1] 10
sum(1, c(2, 3), 4)
#> [1] 10
sum(c(1, 2), c(3, 4))
#> [1] 10

We now have learned that values are assigned by their position to function arguments when we omit argument names. With this in mind, evaluate the following expressions and explain why they yield different results:

sum(1, 2, 3, NA, na.rm = TRUE)
#> [1] 6
sum(1, 2, 3, NA, TRUE)
#> [1] NA

Hint: What do sum(1, 2, 3, NA) and sum(TRUE) evaluate to?

Explicit return()

A more explicit version of our power() function from above could look as follows:

power <- function(x, exp = 1) {
  
  result <- x^exp
  
  return(result)
}

Whereas the shorter version of the function relied on returning its last (and only) expression, this version makes it more explicit what is being computed and returned. As functions get larger and more complicated, it is generally a good idea to include explicit return() statements. Importantly, a function can include multiple return statements and is exited as soon as a return() is reached. For instance, the following variant would never print its final disclaimer:

power_joke <- function(x, exp = 1) {
  
  result <- x^exp
  
  return(result)
  
  "Just kidding"
}
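Multiple return() statements are mostly useful for leaving a function early. As a minimal sketch (the name power_safe() and the choice of returning NA for unsuitable inputs are assumptions for illustration):

```r
# Hypothetical variant with two return() statements:
power_safe <- function(x, exp = 1) {

  # Exit early for non-numeric inputs:
  if (!is.numeric(x) || !is.numeric(exp)) {
    return(NA)
  }

  # Only reached for numeric inputs:
  result <- x^exp
  return(result)
}

power_safe(3, 2)  # 9  (2nd return() is reached)
power_safe("A")   # NA (1st return() is reached, the rest is skipped)
```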

Practice

  • Test the power_joke() function (by evaluating it with various arguments) and try to obtain the "Just kidding" line as an output.

Solution

The following expressions are suited to explore the power_joke() function:

# Checks: ------ 
power_joke(5)
power_joke(5, 2)
power_joke(1/2, 1/2) # non-integers 
power_joke(1:5, 2)   # vectors 1 
power_joke(2, 1:5)   # vectors 2
power_joke(1:5, 1:5) # vectors 1+2
power_joke(NA)       # missing values

# Errors: 
power_joke()
power_joke("A")

# => "Just kidding" is never reached.

These tests show that the final expression (i.e., the character string “Just kidding”) is never reached.

  • In a 2nd step, comment out the line return(result) and re-run the same checks.

Solution

# Commenting out return(result): ------ 
power_joke <- function(x, exp = 1) {
  
  result <- x^exp
  
  # return(result)
  
  "Just kidding"
}

# Checks:
power_joke(5)
power_joke(5, 2)
power_joke(1/2, 1/2) # non-integers
power_joke(1:5, 2)   # vectors 1 
power_joke(2, 1:5)   # vectors 2
power_joke(1:5, 1:5) # vectors 1+2
power_joke(NA)       # missing values

# Errors: 
power_joke()
power_joke("A")

Without return(result), the function always returns “Just kidding”, unless an earlier error occurs.

11.2.2 Understanding functions

How can we understand a function? Even when we are completely clueless about a function, we can always try to understand it by using it with different inputs and seeing what happens. This treats the function as a black box and is exactly what we did to explore the plot_fn() and plot_fun() functions of ds4psy (in Section 1.2.5 and Exercise 1 of Chapter 1). But having progressed further in our R career, we now dare to look inside the black box and ask: How does this function transform its inputs into outputs? Asking and answering such how questions promotes a mechanistic understanding of a function that not only provides us with an idea about the function’s purpose (or “function”), but also enables us to criticize and improve it.

Example

Let’s define a new function describe() and try to understand what it does by asking how it transforms its inputs into outputs:

describe <- function(v, rm_na = TRUE){
  
  # (a) Check v: 
  if (all(is.na(v))) {return(v)}
  if (all(is.null(v))) {return(v)}
  
  if (!is.numeric(v)) { 
    message("v must be numeric:")
    return(v)
  }
  
  # (b) Compute some metrics:
  mn <- mean(v, na.rm = rm_na)
  md <- median(v, na.rm = rm_na)
  min <- min(v, na.rm = rm_na)
  max <- max(v, na.rm = rm_na)
  q25 <- quantile(v, .25, na.rm = TRUE)
  q75 <- quantile(v, .75, na.rm = TRUE)
  nr_NA <- sum(is.na(v))
  
  # (c) Create output vector:
  out <- c(min, q25, md, mn, q75, max, nr_NA)
  names(out) <- c("min", "q25", "md", "mn", "q75", "max", "nr_NA")
  
  # (d) Return output:
  return(out)  
  
}

This example of defining a describe() function illustrates the difference between using and understanding a function — and that the definition of a function can get long and complicated. As you can think of a function as a program to tackle a specific task, it is not uncommon for function bodies to stretch over dozens or even hundreds of lines of code. The longer and more complicated a function gets, the more difficult it is to understand and — from a programmer’s perspective — to write and to debug the function. For this reason, programmers typically try to decompose long and complex functions into smaller parts, which can then be delegated to shorter functions. But understanding a function that calls many other functions then implies that we must understand (or trust) the other functions before we can understand the current function.
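Decomposing a longer function into shorter ones can be sketched as follows. All names here (is_describable(), get_range(), get_center(), describe_short()) are hypothetical, and the sketch computes fewer measures than describe() to stay brief:

```r
# Helper 1: Is v a numeric vector with at least one non-missing value?
is_describable <- function(v) {
  is.numeric(v) && !all(is.na(v))
}

# Helper 2: Extreme values of v
get_range <- function(v, rm_na = TRUE) {
  c(min = min(v, na.rm = rm_na), max = max(v, na.rm = rm_na))
}

# Helper 3: Measures of central tendency of v
get_center <- function(v, rm_na = TRUE) {
  c(md = median(v, na.rm = rm_na), mn = mean(v, na.rm = rm_na))
}

# Main function: Delegates its sub-tasks to the helpers
describe_short <- function(v, rm_na = TRUE) {
  if (!is_describable(v)) { return(v) }
  c(get_range(v, rm_na), get_center(v, rm_na), nr_NA = sum(is.na(v)))
}

describe_short(1:10)
describe_short(c(1:10, NA))
describe_short("A")  # returned unchanged
```

The main function is now easy to read, but understanding it fully requires understanding (or trusting) its three helpers.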

By contrast, using a very long and complex function does not need to be difficult. In fact, when calling functions like mutate(), ggplot(), or summarise() we typically do not notice that we implicitly call upon the mighty machinery of the entire dplyr and ggplot2 packages. It is conceivable (either as a spooky dystopia, or as a marvelous feat of ‘artificial intelligence’) that we could simply run some do_stats() or write_paper() function and let the computer do our job. But as long as other programmers and machine learning have not yet solved these tasks, we need to learn how to use, write, and understand functions to address them.

Practice

Before reading on, describe the describe() function (defined in the previous code chunk):

  • What types of inputs does it take?
  • What do its different parts (a) to (d) do?
  • What outputs will it return?
  • What is the task, goal, or purpose of this function?
  • Which calls will yield errors?

Check your predictions by copying and calling the function with various arguments.

Solution

Gaining a mechanistic understanding of a function implies that we understand how its outputs depend on its inputs. Eventually, this should also indicate the function’s purpose, but first we simply describe what a function does with its inputs.

The describe() function could be described as follows:

  • The function describe() has two arguments: A mandatory argument v and an optional argument rm_na with a default value of TRUE.

  • The function first examines its input argument v:

    • Is v NULL, or does it consist only of NA values? If so, it simply returns v.

    • Is v non-numeric? If so, it prints a message to the user and returns v.

  • The function then computes seven different statistical measures. This illustrates that functions can do multiple things and typically use other functions to do so. For some of these functions (e.g., mean), the describe() function passes the value of its optional argument rm_na to another function’s na.rm argument. However, for another function (quantile), describe() does not use its rm_na argument, but always provides na.rm = TRUE.

  • The function then creates a numeric vector out that includes the 7 computed measures in a specific order and adds names to the vector.

  • The function then returns the vector out.

  • Overall, the task addressed by describe() is to provide a range of descriptive statistics of a numeric vector v.
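The different treatment of quantile() can be explored directly. The following sketch shows that quantile() behaves differently from mean() and median() when missing values are not removed. That this is the reason for always setting na.rm = TRUE in describe() is an assumption, as the function's definition does not state it:

```r
v_na <- c(1:10, NA)

# mean() and median() return NA when na.rm = FALSE and v contains NA:
mean(v_na, na.rm = FALSE)    # NA
median(v_na, na.rm = FALSE)  # NA

# quantile() stops with an error in the same situation:
res <- tryCatch(
  quantile(v_na, .25, na.rm = FALSE),
  error = function(e) "error"
)
res  # "error"

# With na.rm = TRUE, quantile() ignores the missing value:
quantile(v_na, .25, na.rm = TRUE)  # 3.25
```

Passing rm_na on to quantile() would thus make describe() fail for inputs like c(v, NA) with rm_na = FALSE, rather than returning a vector with some NA entries.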

11.2.3 Checking functions

Once we have gained a basic understanding of a function, we can check both the function and our understanding of it by using it with a variety of arguments. Ideally, we should use our understanding to predict what happens when calling the function with a specific argument and then use the function to verify or falsify our prediction.

As we saw above, the results of such checks are more informative if you use the function not only with its intended inputs. Using unusual and probably unintended inputs (like NA, NULL, or inputs of different data types) will show you the limits of a function. And given the importance of vectors in R, a good question to ask about a new function is: Does this function only work with scalar inputs, or does it also work with vectors?
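One way of turning predictions into explicit checks is to write them as assertions. The following sketch uses base R's stopifnot() with the short power() function from Section 11.2.1 (re-defined here so that the code is self-contained); using stopifnot() for this purpose is a suggestion, not a requirement:

```r
power <- function(x, exp = 1) {
  x^exp
}

# Each line states a prediction; stopifnot() is silent if all of them hold:
stopifnot(
  power(3) == 3,                          # scalar input, default exp
  power(3, 2) == 9,                       # scalar inputs
  identical(power(1:3, 2), c(1, 4, 9)),   # vector x
  identical(power(2, 1:3), c(2, 4, 8)),   # vector exp
  is.na(power(NA))                        # missing value
)

# A wrong prediction is falsified by an error:
check <- tryCatch(
  stopifnot(power(2, 3) == 6),
  error = function(e) "prediction falsified"
)
check  # "prediction falsified"
```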

Example

Here are some possible ways of checking how the describe() function works:

# Check:
v <- 1:10
describe(v)
#>   min   q25    md    mn   q75   max nr_NA 
#>  1.00  3.25  5.50  5.50  7.75 10.00  0.00
describe(c(NA, v, NA))
#>   min   q25    md    mn   q75   max nr_NA 
#>  1.00  3.25  5.50  5.50  7.75 10.00  2.00
describe(v, rm_na = FALSE)
#>   min   q25    md    mn   q75   max nr_NA 
#>  1.00  3.25  5.50  5.50  7.75 10.00  0.00
describe(c(v, NA), rm_na = FALSE)
#>   min   q25    md    mn   q75   max nr_NA 
#>    NA  3.25    NA    NA  7.75    NA  1.00

# Note:
describe(NA)
#> [1] NA
describe(NULL)
#> NULL
describe(c(NA, NA, NA))
#> [1] NA NA NA
describe("A")
#> [1] "A"
describe(tibble::tibble(v = 1:10))
#> # A tibble: 10 × 1
#>        v
#>    <int>
#>  1     1
#>  2     2
#>  3     3
#>  4     4
#>  5     5
#>  6     6
#>  7     7
#>  8     8
#>  9     9
#> 10    10

# Note: The following calls yield errors:
# describe()
# describe(x)

Actually, describe() is — apart from subtle differences — quite similar to the base R function summary():

# Compare with base::summary function: 
summary(v)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    1.00    3.25    5.50    5.50    7.75   10.00
summary(c(NA, v, NA))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>    1.00    3.25    5.50    5.50    7.75   10.00       2

# But note differences: 
summary(NA)
#>    Mode    NA's 
#> logical       1
summary(NULL)
#> Length  Class   Mode 
#>      0   NULL   NULL
summary("A")
#>    Length     Class      Mode 
#>         1 character character
summary(tibble::tibble(v = 1:10))
#>        v        
#>  Min.   : 1.00  
#>  1st Qu.: 3.25  
#>  Median : 5.50  
#>  Mean   : 5.50  
#>  3rd Qu.: 7.75  
#>  Max.   :10.00

# Error for: 
# summary()

Practice

  • Predict the result of describe(c(NULL, NA)). Then evaluate the expression and explain its result.

describe(c(NULL, NA))

# Hint: Check the result of 
c(NULL, NA)

11.2.4 Issues of style

Creating new functions only makes sense when someone can understand and use them. Hence, writing a new function always needs to take into account the viewpoint of its users, even if those will mostly be our future selves.

In any art or craft, issues of style are important, partially a matter of taste, and largely a matter of practice and experience. Just as the work of architects, designers, and authors tends to mature in an exchange with colleagues, and with more time and multiple revisions, computer programming tends to benefit from feedback and well-organized teams. But even for an individual programmer, writing good functions is a journey, rather than a destination, and a life-long aspiration.

The primary goal of programming new functions is providing some required functionality in a clear and transparent fashion. Here are some general guidelines and questions that help to achieve this goal:

  1. Task/goal/purpose: Be aware of the task, goal, and purpose of a function. Explicating these “functional” aspects may involve answering a range of related questions:

    • Task: What task does the function perform?
    • Mechanism: How does the function solve this task?
    • Audience: For whom does the function solve this task?

  2. Arguments: Consider the requirements of a function:

    • What does the function need to achieve its goal?
    • Of what shape or type are the objects that the function uses as its inputs?
    • Which arguments are necessary or required, which ones are optional?

  3. Result: Consider the output of a function:

    • What should the function return?
    • Of what shape or type are the objects that the function provides as its outputs?
    • Are there any side-effects to consider?

  4. Naming: Make sure that you choose good names for both a function and its arguments.

    • Do the names clearly convey the task, goal and purpose of the function and its arguments?
    • Are all names succinct and easy to type?
    • Do all names correspond to the names of related functions?

  5. Formatting: Code each function so that its internal structure is clear and it is easy to read and understand (i.e., use blank lines between different parts and automatic indentation to signal the hierarchy level of elements).

  6. Documentation: Provide clear comments that explain what the function does, why it is doing it in a particular way, and anything else that may be difficult to understand.

  7. Examples: Any new function needs to be checked to demonstrate that it actually solves its task. What are typical use cases of this function and under which conditions will it fail? Providing informative examples renders the life of future users (including ourselves) much easier.
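As a sketch of how these guidelines might play out, here is a documented variant of the power() function from Section 11.2.1. The names raise_to_power(), base, and exponent are hypothetical choices (the latter avoids the confusion with R's exp() function noted above), and the comment layout is just one of many possible conventions:

```r
# raise_to_power(): Raise a number (or numeric vector) to a power.
#
# Arguments:
# - base:     A numeric scalar or vector (required).
# - exponent: A numeric scalar or vector (optional, default: 1).
#
# Returns: A numeric vector of base^exponent (NA values are propagated).
#
# Examples:
# raise_to_power(base = 3)                  # 3
# raise_to_power(base = 2, exponent = 3)    # 8
# raise_to_power(base = 1:3, exponent = 2)  # 1 4 9
raise_to_power <- function(base, exponent = 1) {

  # 1. Check inputs (stop with an error if they are not numeric):
  stopifnot(is.numeric(base), is.numeric(exponent))

  # 2. Solve task:
  result <- base^exponent

  # 3. Return output:
  return(result)
}
```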

Ultimately, programming functions always involves a considerable portion of psychology. Thus, we can always ask “Who is the audience of this function?” and try to anticipate the needs, preferences, and wishes of our future users. Will they be able to understand and use our function? How robust is our function when users provide different data types or may misinterpret its purpose or scope? Although we probably are the primary users of our new function, anticipating possible misconceptions and responding to user feedback are important aspects of ensuring that it will remain useful in the future.


  1. In most functions, the return() statement is the final statement of the function <body>. See Chapter 19.6: Return values for special cases in which it makes sense to provide earlier and multiple return() statements or provide invisible() return values (e.g., to write pipeable functions).