11.3 Essentials of conditionals

Whereas most of our scripts so far relied on being executed linearly (in a top-down, left-to-right, line-by-line fashion), using functions implies jumping around in large amounts of code. Strictly speaking, we have also been using statements that were parsed from right to left (e.g., assignments like x <- 1) or bottom-to-top (e.g., when assigning a multi-line pipe of dplyr statements to an object). Also, given that we have been using functions all along, we really have been jumping around in base R code since our very first session.56

This section addresses a special type of controlling information flow: When thinking about the flow of information through a program (which can be a single R function, or an entire system of R packages), we often come to junctions at which we want to say: Based on some criterion being met, we either want to do this, or, if the criterion is not met, do something else. Such junctions are an important feature of any programming language and are typically handled by special functions, special control structures, or conditionals (i.e., “if-then” or “if-then-else” statements).

11.3.1 Flow control

Creating functions often requires controlling the flow of information within the body of a function. We can distinguish between several ways in which this can be achieved:

  • Special functions (e.g., return, print, or stop) can cause side-effects or skip code (e.g., by exiting the function).

  • Functions often incorporate iteration and loops, which are covered in the next chapter (i.e., Chapter 12 on Iteration).

  • Testing input arguments or distinguishing between several cases requires the conditional execution of code, discussed in this section. In the definition of describe() above, we have seen that functions frequently require checking some properties of their inputs, distinguishing between cases, and controlling the flow of data processing based on test results. This is the job of conditional statements, which exist in many different forms. In this section, we only cover the most essential types.

11.3.2 If-then

A conditional statement conducts a test (which evaluates to either TRUE or FALSE) and executes additional code based on the value of the test. The simplest conditional in R is the if function, which implements the logic of if-then in the following if (test) {...} structure:

Here, test must evaluate to a single Boolean value (i.e., either TRUE or FALSE). If test is TRUE the code in the subsequent {...} is executed (here: "ok" is printed to the Console) – otherwise the code in the subsequent {...} is skipped, as if it was not there or commented out:

Note that if test is a Boolean value, we do not need to ask for the condition test == TRUE.
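As a minimal sketch of this structure (the name test and the string "ok" follow the text, but the concrete values and the helper name flag are assumptions for illustration):

```r
test <- TRUE              # a single logical value
if (test) {
  print("ok")             # executed, as test is TRUE
}

test <- FALSE
if (test) {
  print("ok")             # skipped, as test is FALSE
}

# Comparing a logical value to TRUE is redundant:
flag <- TRUE              # hypothetical variable
same <- identical(flag == TRUE, flag)
same
```

Only the first conditional prints anything, and the final comparison shows that flag == TRUE carries no more information than flag itself.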

11.3.3 If-then-else

If a test fails, we often want something else to happen. To accommodate this desire, a slightly more complicated form of if statement includes an additional {...} after an else statement:

Here, the truth value of test determines whether the 1st or the 2nd {...} is executed. As test must be either TRUE or FALSE, we either see “case 1” printed (if test is TRUE) or “case 2” printed (if test is FALSE).
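A minimal sketch of such an if-then-else statement (the labels "case 1" and "case 2" follow the text; the value of test and the name result are assumptions):

```r
test <- FALSE

if (test) {
  print("case 1")         # skipped, as test is FALSE
} else {
  print("case 2")         # executed instead
}

# A conditional also returns a value, which can be assigned:
result <- if (test) "case 1" else "case 2"
result
```

Note that the else keyword is placed on the same line as the closing } of the first block, so that R does not treat the if statement as complete before reaching else.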

The following sequence illustrates how tests work (and can fail to work):

11.3.4 Vectorized ifelse

A crucial limitation of R’s basic if statement is that its test accepts only a single TRUE or FALSE value. However, when writing functions, we often want to make them work with vectors of input values, rather than a single input. Testing multiple values at once is possible with the ifelse(test, yes, no) function that uses vectorized test, yes, and no arguments (which are recycled to the same length):

Note that the yes and no values used with ifelse should typically be of the same type, and NA values remain NA:
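The following sketch illustrates both points with made-up values (the vector x and the labels are assumptions, not the original example):

```r
x <- c(2, -1, NA, 5)      # hypothetical input vector with a missing value

# Each element of x is tested; yes and no are recycled to the length of x:
label <- ifelse(x > 0, "positive", "not positive")
label

# Mixing types in yes and no silently coerces the result (here: to character):
mixed <- ifelse(c(TRUE, FALSE), "positive", 0)
mixed
```

The third element of label is NA, as the test NA > 0 is itself NA.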

11.3.5 More complex tests

The condition test of a conditional statement can contain multiple tests. If so, each individual test must evaluate to either TRUE or FALSE and the different tests are linked with && or ||, which work like the logical connectors & and |, but are evaluated sequentially (from left to right) and stop as soon as the overall result is determined:
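A small sketch of this sequential evaluation (the object x and the labels are assumptions chosen for illustration):

```r
x <- NULL                 # hypothetical input that may be missing

# && stops at the first FALSE, so x > 0 is never evaluated here:
res <- if (!is.null(x) && x > 0) "positive" else "not positive or missing"
res

# || stops at the first TRUE:
ok <- is.null(x) || x > 0
ok

# By contrast, & and | work element-wise on entire vectors:
both <- c(TRUE, FALSE) & c(TRUE, TRUE)
both
```

Placing the cheaper or protective test first (here: !is.null(x)) is a common way to guard a second test that would otherwise fail.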

Example

Here’s a way to fix our problem from above (i.e., evaluating “grandmother” as “male”) by implementing a more comprehensive test:

A vectorized version of this if-then-else statement can be written with ifelse(), but will still mis-classify anything not considered when designing the test (e.g., stepmothers, broomsticks, etc.):
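The original listings are not reproduced here. The following sketch assumes a simple test on family terms (the names person and people, and the specific terms, are hypothetical) to illustrate both the more comprehensive test and its remaining blind spots:

```r
# A more comprehensive test for a single value, linking two tests by ||:
person <- "grandmother"
sex_1 <- if (person == "mother" || person == "grandmother") {
  "female"
} else {
  "male"
}
sex_1

# A vectorized version uses ifelse() and the element-wise | operator:
people <- c("mother", "grandmother", "father", "stepmother")
sex_v <- ifelse(people == "mother" | people == "grandmother", "female", "male")
sex_v   # the unforeseen label "stepmother" is still classified as "male"
```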

More cases

As we can replace any {...} in a conditional statement if (test) {...} else {...} by another conditional statement, we can distinguish more than 2 cases:

Here, 2 cases are contingent on their corresponding condition being TRUE; otherwise, the final {...} is reached and "else" is printed. Thus, an “else case” often serves as a generic case that occurs when none of the earlier tests are true.

Note that the following variant of this conditional is different:

Here, the final {...} is contingent on another test_3 being TRUE. Thus, the conditions that the final "else" is being printed are not only that test_1 and test_2 are both FALSE but also that test_3 is TRUE. If all 3 tests fail, none of the cases is reached and nothing is printed.

Note

  • When a test evaluates to TRUE, the corresponding {...} is evaluated and any later instances of test and {...} are skipped. Thus, only a single case of {...} is evaluated, even if multiple tests would evaluate to TRUE.
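A minimal sketch of this behavior, using a made-up numeric value x and hypothetical size labels:

```r
x <- 15                   # hypothetical value

if (x > 10) {
  size <- "large"         # reached: the first test that is TRUE
} else if (x > 5) {
  size <- "medium"        # skipped, although x > 5 is also TRUE
} else {
  size <- "small"         # generic case if both tests fail
}

size
```

As the order of tests determines which case is reached, more specific tests should generally precede more general ones.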

11.3.6 Switch

A useful alternative to overly complicated if statements is switch, which selects one case out of a list of alternative cases on the basis of some keyword or number. For example, the following function do_op() uses a character argument op to distinguish between several operations:

In do_op(), the operation op was explicitly specified (as a character variable). If switch is used with a numeric expression i, it selects the i-th case (with i being coerced into an integer). For example:

The final stop() statement in the above uses of switch() ensures that we would notice function calls with arguments for which we did not provide an alternative. The effects of the stop() function are quite drastic: It abandons the execution of the current expression and yields an error.
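The bodies of the original functions are not reproduced here. The following sketch shows how functions named do_op() and get_i() could be written (their arguments, keywords, and messages are assumptions):

```r
# Character-based switch: op is matched against the names of the cases.
do_op <- function(x, y, op) {
  switch(op,
         plus  = x + y,
         minus = x - y,
         times = x * y,
         stop("Unknown op: ", op))      # unnamed final argument: no match
}

do_op(2, 3, "plus")

# Number-based switch: the i-th alternative is selected.
get_i <- function(i) {
  switch(i, "one", "two", "three", stop("No case for i = 4"))
}

get_i(2)

# Catching the error to inspect its message without stopping the script:
msg <- tryCatch(do_op(2, 3, "power"), error = function(e) conditionMessage(e))
msg
```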

11.3.7 Avoiding conditionals

Conditionals are an important element of any programming language. However, for cleaning and transforming data, R provides alternatives for tasks that would require conditionals in other programming languages. Especially when data is stored in rectangular tables (i.e., columns of vectors), we can often avoid conditionals.

As an example, consider the following table dt that provides information on seven people and is a variant of our very first table from Chapter 1 (see Section 1.5.2):

Table 11.1: Basic information on seven people.

name     sex      age   height
Adam     male      21      165
Bertha   female    23      170
Cecily   female    22      168
Dora     female    19      172
Eve      female    21       NA
Nero     male      18      185
Zeno     male      24      182

Suppose someone objected to the sex variable and demanded to replace it by a numeric variable gender that re-codes “male” as 1 and “female” as 2. We can create and initialize this variable (with missing values) as follows:

To determine the values of the new variable, someone coming from a different programming language may be tempted to use conditional statements:

However, in R versions before 4.2.0, evaluating these conditionals results in a warning “the condition has length > 1 and only the first element will be used” and the resulting column dt$gender shows a value of 1 for every person. This is because the conditional if (dt$sex == "male") ... only checked the first case (here: the record of Adam) and then recycled the output (1, as dt$sex == "male" is TRUE for Adam) to the length of the desired output dt$gender. By contrast, the test of the second conditional evaluates to FALSE (as dt$sex == "female" is FALSE for Adam) so that its consequence dt$gender <- 2 is never used. Hence, using a conditional in this situation is not just bad style, but yields a warning and an erroneous result. (Since R 4.2.0, a condition of length greater than one yields an error rather than a warning.)

What can we do instead? The R solution to this problem is logical indexing or subsetting (see Section 1.4.6):

Here, the specific values of one variable (dt$sex) are used to assign a value to another variable (dt$gender). As a test like dt$sex == "male" evaluates to TRUE or FALSE for every element of dt$sex, logical indexing works like a conditional statement on the entire vector.
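A sketch of this recoding by logical indexing, assuming that dt is rebuilt as a plain data frame from the values of Table 11.1 (the original chapter may use a tibble instead):

```r
# Rebuilding the table as a base R data frame (an assumption):
dt <- data.frame(
  name   = c("Adam", "Bertha", "Cecily", "Dora", "Eve", "Nero", "Zeno"),
  sex    = c("male", "female", "female", "female", "female", "male", "male"),
  age    = c(21, 23, 22, 19, 21, 18, 24),
  height = c(165, 170, 168, 172, NA, 185, 182),
  stringsAsFactors = FALSE
)

dt$gender <- NA                        # initialize with missing values

# Each test yields one TRUE/FALSE per row, which selects the rows to change:
dt$gender[dt$sex == "male"]   <- 1
dt$gender[dt$sex == "female"] <- 2

dt$gender
```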

We have encountered other commands that also act like and replace conditional statements. Whenever we use (logical or numeric) indexing or the dplyr verbs filter() or select() to subset, we effectively limit our dataset based on some condition(s). For instance, the following expressions all yield the same result and could be described as instructing R to reduce dt by the conditional statement “if a variable is not named sex, keep it”:
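A sketch of such equivalent expressions in base R, assuming a reduced copy of dt as a plain data frame (the object names dt_a, dt_b, and dt_c are hypothetical; the dplyr variant is shown only as a comment, as it requires loading the package):

```r
dt <- data.frame(
  name = c("Adam", "Bertha", "Cecily"),
  sex  = c("male", "female", "female"),
  age  = c(21, 23, 22),
  stringsAsFactors = FALSE
)

dt_a <- dt[, names(dt) != "sex"]             # logical indexing of columns
dt_b <- dt[, -which(names(dt) == "sex")]     # negative numeric indexing
dt_c <- subset(dt, select = -sex)            # base R subset() function
# dplyr::select(dt, -sex)                    # dplyr equivalent (not run here)

names(dt_a)
```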

Similarly, functions can replace a series of conditionals. For example, the cut() function of base R divides a numeric range into discrete intervals (i.e., assigns numeric values to the levels of a factor). Conceptually, this corresponds to a series of conditional statements, as the following example illustrates:

Example: Using cut() to discretize a continuous variable

In Chapter 1, we generated a vector age by sampling 1000 random values from a Poisson distribution (see Section 1.6.4):

The following plot shows the distribution of age values:57

Suppose we wanted to categorize the age values into three categories “under 18”, “young adult (aged 18 to 30)”, and “over 30”. We could do this by a series of conditionals, but also by using the cut() function with an appropriate setting of its breaks argument:

Note that cut() created a factor variable (assigned to age_cat), whose levels we can re-name. The number of values in each category corresponds to cutting the original distribution into three sections (or adding the heights of all bars within a section into the height of a new bar):

Thus, cut() allows categorizing data into discrete bins. For more complicated ways of transforming continuous data into categorical data, see the rbin package.
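The following sketch shows how cut() can replace nested conditionals. The seed and the Poisson parameter lambda are assumptions (the original values are not given here), and the name age_cat_2 is hypothetical:

```r
set.seed(123)                          # assumed seed, for reproducibility
age <- rpois(n = 1000, lambda = 21)    # assumed mean of the distribution

# Intervals are right-closed by default: (-Inf, 17], (17, 30], (30, Inf)
age_cat <- cut(age,
               breaks = c(-Inf, 17, 30, Inf),
               labels = c("under 18", "18 to 30", "over 30"))
table(age_cat)

# The same categories expressed as nested vectorized conditionals:
age_cat_2 <- ifelse(age < 18, "under 18",
                    ifelse(age <= 30, "18 to 30", "over 30"))

all(as.character(age_cat) == age_cat_2)
```

Both approaches assign the same category to every value, but cut() states the boundaries in one place and returns a factor with ordered levels.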

Overall, R provides not only conditional statements, but also various ways of avoiding them. As a consequence, R uses fewer conditionals than most other programming languages. This raises the question: When should we use conditionals? As a general heuristic, we primarily use conditionals in R when the contents of objects are variable and currently unknown (e.g., when writing new functions). By contrast, when working with objects and structures that are fully defined (e.g., rows and columns of existing data), indexing is to be preferred.

Practice

Let’s practice what we have learned about using and avoiding conditionals in R.

  1. A conditional nursery rhyme

Consider the following check_flow() function:

The function appears to implement some nursery rhyme, but is really messy, unfortunately.58 Hence, we need to clean up this code before we can even begin to understand the function.

  • Format the function so that it becomes easier to read and parse.

Solution

A possible solution would indent commands, place any } on a new line, and generally introduce lots of white space, as follows:

  • Describe and try to understand this function. What does it do and how does it do it?

  • Answer the following questions and predict the results:

    • Which cases does the 1st conditional statement distinguish?
    • When is the 1st switch statement reached? When is the 2nd switch statement reached?
    • What is the difference between the print and the return statements?
    • Under which conditions does the function return "raus bist du"?
    • What happens when you call check_flow() or check_flow(NA)?
  • Test your predictions by evaluating the following calls of the check_flow() function.

Solution

The following expressions are suited to check our check_flow() function:

  2. Using switch() without stop()
  • What happens with switch() statements, if the final stop() argument is omitted and the first argument does not match one of the cases?

Solution

We can easily modify get_i() from above to test this:

Using i = 3 and i = 4 yields the expected results. However, using i = 5 yields neither a warning nor an error (whereas a final stop() would have raised an error). This illustrates that, when the first argument is numeric, the final entry of switch() does not work like the else statement in a conditional.
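A sketch of such a test, using a hypothetical variant get_i_2() with four alternatives and no stop() (the function name and its labels are assumptions):

```r
get_i_2 <- function(i) {
  switch(i, "one", "two", "three", "four")   # no final stop()
}

get_i_2(3)
get_i_2(4)

# An index beyond the number of alternatives silently returns NULL:
out <- get_i_2(5)
is.null(out)

# By contrast, with a character first argument, an unnamed final
# argument does serve as a default case:
default <- switch("z", a = "first", b = "second", "no match")
default
```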

  • Try replacing the final stop() in switch() with message() or warning(). What changes?

Solution

The following variant of get_i() illustrates the differences:

This example illustrates the differences between issuing a message(), a warning(), and a stop() (i.e., an error). Again, the case of i = 10 illustrates that the final entry of switch() does not work like the else statement in a conditional.
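One way to make these differences visible is to catch each type of condition. In the following sketch, the function names get_i_3() and classify() and all message texts are hypothetical:

```r
get_i_3 <- function(i) {
  switch(i,
         "one",
         message("Just a note"),       # informs, then continues
         warning("Be careful"),        # warns, then continues
         stop("Cannot continue"))      # error: aborts the evaluation
}

# Report which type of condition (if any) a call signals:
classify <- function(i) {
  tryCatch({
    get_i_3(i)
    "no condition"
  },
  message = function(m) "message",
  warning = function(w) "warning",
  error   = function(e) "error")
}

types <- sapply(c(1, 2, 3, 4, 10), classify)
types
```

The last element shows that i = 10 signals no condition at all, as no alternative is selected.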

  3. Recoding by ifelse()

In Section 11.3.7, we used logical indexing (rather than conditionals) to recode variables of a data table dt (e.g., for creating a new variable gender). However, perhaps we could have used the vectorized ifelse() function to recode the variable sex as gender?

  • Would the following expression yield the desired result? Why or why not?

Solution

The expression fails to work as intended. It evaluates to the desired vector, but this vector is not assigned to dt$gender. Instead, all values of dt$gender are set to 2 (which clearly is problematic).

  • How could we modify the previous ifelse() expression to work as intended?

Solution

The following version would work (in this particular case), as a vector of values is assigned to dt$gender:

  • Why would logical indexing still be better?

Solution

Using the last ifelse() statement would assign 2 (i.e., “female”) to any record for which dt$sex == "male" is FALSE. This happens to work in this limited example, but is error-prone in any real environment (which is likely to include additional gender labels). Thus, using logical indexing is safer, as only cases for which dt$sex == "female" (rather than any non-“male” label) are re-coded as a dt$gender value of 2.
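The difference becomes visible as soon as an unforeseen label occurs. The following sketch uses a made-up vector sex with a hypothetical third label "other":

```r
sex <- c("male", "female", "other", "male")   # hypothetical labels

# ifelse() assigns 2 to everything that is not "male":
gender_ifelse <- ifelse(sex == "male", 1, 2)
gender_ifelse

# Logical indexing only recodes the labels that are explicitly tested:
gender_index <- rep(NA, length(sex))
gender_index[sex == "male"]   <- 1
gender_index[sex == "female"] <- 2
gender_index
```

With logical indexing, the unforeseen label remains NA and can be detected later, rather than being silently recoded as 2.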

  4. More unconditional recoding

Assume we wanted to update two facts in the tibble dt (from above):

  a. the height of Eve is measured to be 158 cm (i.e., should no longer be NA)

  b. Adam turned 22 (i.e., his age needs to be adjusted)
  • Explain what the following conditionals would do (to a copy dt2) and why they would fail to make the desired corrections:

Solution

The tests of both conditionals yield a vector of logical values. When evaluating them in R versions before 4.2.0, a warning message informs us that only their first element is used (later versions raise an error instead). As the first element happens to be FALSE for the test dt2$name == "Eve", no height value is being changed. However, as Adam is the first name on the list, the first element of dt2$name == "Adam" is TRUE and all age values are changed to 22 (as the single value 22 is recycled to the length of dt2$age). Overall, the first conditional does not change anything and the second conditional changes too much.

  • How could we make the desired corrections?

Solution

A good solution would recode Eve’s height and Adam’s age by logical indexing:

The following alternatives using ifelse() would also work, but are less elegant than logical indexing:
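A sketch of both approaches, assuming copies dt2 and dt3 of the relevant columns as plain data frames (dt3 is a hypothetical name for the copy used with ifelse()):

```r
dt2 <- data.frame(
  name   = c("Adam", "Bertha", "Cecily", "Dora", "Eve", "Nero", "Zeno"),
  age    = c(21, 23, 22, 19, 21, 18, 24),
  height = c(165, 170, 168, 172, NA, 185, 182),
  stringsAsFactors = FALSE
)
dt3 <- dt2

# Logical indexing: only the selected elements are changed.
dt2$height[dt2$name == "Eve"] <- 158
dt2$age[dt2$name == "Adam"]   <- 22

# ifelse(): the entire column is re-created, keeping old values elsewhere.
dt3$height <- ifelse(dt3$name == "Eve", 158, dt3$height)
dt3$age    <- ifelse(dt3$name == "Adam", 22, dt3$age)

all.equal(dt2, dt3)
```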


  56. Even using functions implies a linear evaluation on some level. For instance, any function needs to be defined or loaded before it can be used.

  57. In this case, the values of age were created as integer values (from a Poisson distribution), but creating them as continuous values would make no difference for our present purposes.

  58. Actually, this example illustrates pretty well how the functions of students tend to look when they first start writing functions. Imagine searching for a typo in code formatted like this…