Chapter 5 Programming in R

It’s very often the case that the functions provided by base R, tidyverse, or whatever other package you’re using are not sufficient for the particular question you need to answer. In that case, it becomes necessary to design your own functions and procedures, and this involves programming.

R is a full-fledged programming language and offers all of the usual programming constructs. Learning a few of these gives you the ability to customize your projects to fulfill your specific data analysis needs.

We’ll need tidyverse in this chapter.

library(tidyverse)

5.2 User-Defined Functions

A function is a process which accepts inputs, performs a series of operations on them, and produces an output. We’ve used dozens of functions in the course so far, including those included in base R such as mean, sd, sum, etc, as well as those included in extra packages like tidyverse, such as ggplot, mutate, str_replace, etc. In this section, we’ll see how to write our own functions. The syntax for writing functions in R is:


<FUNCTION NAME> <- function(<LIST OF INPUT NAMES>){
  <SEQUENCE OF OPERATIONS TO BE PERFORMED>
  return(<OUTPUT NAME>)
}
 

To illustrate how to define functions in R, we introduce the notion of skewness. Skewness is a statistic that measures how symmetrical a continuous variable’s distribution is, as visualized by its histogram. A perfectly symmetrical distribution has a skewness of 0. A right-skewed distribution (one which is longer to the right of the median) has a positive skewness value, and a left-skewed distribution (one which is longer to the left of the median) has a negative skewness value. In a symmetrical distribution, the mean and median are equal. In a right-skewed distribution, the mean is larger than the median, and in a left-skewed distribution, the mean is less than the median.

There are various ways to define skewness numerically, but a particularly simple one, known as Pearson’s median skewness, is given by \[3\times \frac{\textrm{mean}-\textrm{median}}{\textrm{standard deviation}}.\] Let’s write a function called skewness that calculates this. Notice that the name chosen for the input is v, which reflects the fact that the input is to be a vector of numbers. Notice also that a commented statement appears in the body of the function. This is a very good practice; it helps the reader understand what you’re doing, and it can remind you what you were thinking when you revisit your code later.

skewness <- function(v){
  # This is Pearson's median skewness
  answer <- 3 * (mean(v) - median(v)) / sd(v)
  return(answer)
}

A code chunk that defines a function must obviously first be executed before it can be used, but when executing a function definition code chunk, you should always place the cursor after the final } before hitting ctrl+enter.

Once our function is defined, we can call it with any vector of numbers as input. Let’s find the skewness of the price variable in diamonds:

skewness(diamonds$price)
## [1] 1.151891

The positive skewness tells us that the histogram is skewed to the right, as we can also see directly:

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price))

The variable answer introduced within the function body above is only used to name the output that is to be reported back out of the function. By default, functions in R will always return the last computed value as the output unless instructed otherwise. So, the following definition of skewness would also work:

skewness <- function(v){
  # This is Pearson's median skewness
  3 * (mean(v) - median(v)) / sd(v)
}
skewness(diamonds$price)
## [1] 1.151891

It’s often a good practice to name your output variable, though, and use an explicit return statement to make it clear what your intended output is.

You can also use a function to display a printed sentence. Notice that we are not using a return statement below since cat statements are NULL objects in R and therefore cannot be returned as usable output by a function.

skewness <- function(v){
  value <- 3 * (mean(v) - median(v)) / sd(v)
  cat("The value of Pearson's median skewness is ", value, ".\nA positive value indicates a right-skewed distribution.", sep = "")
}

skewness(diamonds$price)
## The value of Pearson's median skewness is 1.151891.
## A positive value indicates a right-skewed distribution.

Functions can also have more than one input and/or output. For example, let’s write a function which accepts the three coefficients of a quadratic equation \(ax^2+bx+c=0\) as inputs and returns the (possibly more than one) solution as its output. Technically, functions in R will not return more than one output, but we can work around that by returning a single vector of outputs with our two solutions as entries. (As noted above, without the return statement, the function below would only return the value of sol2.)

quad_solve <- function(a, b, c){
  # sol1 and sol2 are the values produced by the Quadratic Formula
  sol1 <- (-b + sqrt(b^2 - 4*a*c))/(2*a)
  sol2 <- (-b - sqrt(b^2 - 4*a*c))/(2*a)
  return(c(sol1, sol2))
}
quad_solve(1, 1, -6)
## [1]  2 -3

As you know, some quadratic equations have only one or even no real solutions. For example, the equation \(x^2-8x+16=0\) has only one solution, namely \(x=4\). Let’s see how our function handles this:

quad_solve(1, -8, 16)
## [1] 4 4

This makes sense, since we can interpret the double appearance of 4 as the result of the way \(x^2-8x+16\) factors as \((x-4)(x-4)\). What about equations with no real solutions, such as \(2x^2-3x+4=0\)?

quad_solve(2, -3, 4)
## Warning in sqrt(b^2 - 4 * a * c): NaNs produced

## Warning in sqrt(b^2 - 4 * a * c): NaNs produced
## [1] NaN NaN

NaN means “not a number,” and it’s showing up here because \(b^2-4ac<0\), and thus our function is trying to take the square root of a negative number. (Apparently, base R does not know about complex numbers.) In the next section, we’ll see a way to avoid this kind of confusing output.

5.2.1 Exercises

These exercises require the nycflights13 library.

  1. Write a function that accepts two numerical vectors of the same length as input and returns the number of times the same number occurs in the same entry of each. For example, if the two vectors are c(2, 5, -2, 8, 7) and c(3, 5, -2, 2, 7), the function should return the number 3.

  2. Recall that if \(a\) and \(b\) are the legs of a right triangle and \(c\) is the hypotenuse, then the Pythagorean Theorem says \(a^2+b^2=c^2\). Write a function that accepts the values of \(a\) and \(b\) as input and returns the value of \(c\) as output. Your output should be a sentence of the form, “The length of the hypotenuse is 4.1342.”

  3. Write a function that accepts a vector as input and counts the number of NA entries. Then use your function to state how many NAs there are in the dep_delay column of flights.

5.3 Conditional Statements

A conditional statement is one which executes only when a given condition is satisfied. In R, these take the form of if...else statements:


if (<CONDITION>){
  <ACTION TO BE PERFORMED WHEN CONDITION IS TRUE>
} else {
  <ACTION TO BE PERFORMED WHEN CONDITION IS FALSE>
}
 

For example, let’s write a conditional statement that states whether the cars in mpg have, on average, better highway or city fuel efficiencies. (Of course, we already know the answer):

hwy_avg <- mean(mpg$hwy)
cty_avg <- mean(mpg$cty)

if (cty_avg > hwy_avg){
  cat("City gas mileage is better.\nThe mean value of `cty` is ",cty_avg,".", sep = "")
} else {
  cat("Highway gas mileage is better.\nThe mean value of `hwy` is ",hwy_avg,".", sep = "")
}
## Highway gas mileage is better.
## The mean value of `hwy` is 23.44017.

Conditional statements play a very important role within functions by providing a way to check the input for problems before proceeding with the function operations. This is called a value check. A value check provides the programmer with the opportunity to specify a meaningful error message if the input doesn’t satisfy the check condition. Here’s how it works for our quad_solve function from the previous section. Recall that for quadratic equations with no real solutions, we were left with a confusing warning message and the results NaN.

quad_solve <- function(a, b, c){
  # First check whether there are actually real solutions.  If not, stop and return an 
  # error message.
  if (b^2 - 4*a*c < 0){
    stop("no real solutions", call. = FALSE)
  }
  
  # sol1 and sol2 are the values produced by the Quadratic Formula
  sol1 <- (-b + sqrt(b^2 - 4*a*c))/(2*a)
  sol2 <- (-b - sqrt(b^2 - 4*a*c))/(2*a)
  return(c(sol1, sol2))
}
quad_solve(2, -3, 4)
## Error: no real solutions

In Section 2.5, we introduced the functions ifelse and case_when to categorize continuous variables. Each one operates with a logic similar to an if...else statement: A condition is checked, the outcome of which dictates what happens next. However, the purpose of these two categorization functions is to generate a value. For example, consider the following two pieces of code:

x <- 5
y <- 6

if (x < y){
  z <- x - y
} else {
  z <- y - x
}

z
## [1] -1

and

x <- 5
y <- 6

z <- ifelse(x < y,
            x - y,
            y - x)

z
## [1] -1

The result is the same; both pieces of code assign -1 to z. However, they do it differently. The if...else statement makes the assignment to z within the body. The ifelse takes on the value of -1 itself, and then assigns this to z. This highlights the difference between these two constructs: The body of an if...else performs whatever sequence of operations you give it, while the only job of ifelse (and case_when) is to return a value.

ifelse and case_when are thus more limited in scope than if...else, but one of their major benefits is that they are vectorized, meaning they can accept an entire vector of inputs and perform the operation individually on each entry, producing an output vector of results. This is why these two functions work within mutate, as in Section 2.5; the code we saw there applied each of these functions not just to a number, but to the entire vector arr_delay.

For example, suppose we re-do our ifelse calculation above but replace x and y with vectors. Before looking at the answer, try to predict what the value of z will be:

x <- c(5, 1, 8)
y <- c(6, 4, 6)

z <- ifelse(x < y,
            x - y,
            y - x)

z
## [1] -1 -3 -2

Because ifelse is vectorized, it individually calculates the z value for each pair of corresponding x and y values and stores the answer in the corresponding spot in the z vector.

5.3.1 Exercises

  1. Write a function which calculates the area of a circle when the radius is input. Include a value check that returns an error message if the input radius is negative.
  2. Write a function that accepts a number as input and returns its absolute value as output. (Do not use the built-in abs function.)
  3. Write a function which accepts an integer as input and determines whether it is even or odd. (Hint: Use the %% operation.) Include a value check that makes sure the input is an integer.
  4. Write a function which accepts a number as input and determines whether it is positive, negative, or zero.
  5. Write a function that accepts a car’s miles-per-gallon highway fuel efficiency as input and assigns it a value of “poor,” “fair,” “good,” or “excellent” depending on what range it falls within. Decide for yourself what these ranges should be.
  6. Use your function from the previous exercise to add a column to mpg that categorizes each observation’s hwy value.
  7. Write a function which accepts two numerical vectors of the same length as input and states, entry-by-entry, which vector has the smaller entry or states that they’re equal. For example, if the two vectors are v1 <- c(3, 4, 1, -2) and v2 <- c(2, 6, 1, 0), the output (with line breaks included) should be:
    The entry of v2 is smaller.
    The entry of v1 is smaller.
    The entries are equal.
    The entry of v1 is smaller.

5.4 Loops

It’s very common for an algorithm to be called on to perform the same action repeatedly, possibly with a different input each time. For example, think about the process of finding the sum of the numbers 1 through 5. You start with 1 and add the next number, 2, to it. Your sum at that point is 3. Then you start with 3 and add the next number, 3, to it. You sum is now 6. Then you start with 6 and add the next number, 4, to it, obtaining a sum of 10. Then you start with 10 and add the next number, 5, to it, arriving at a sum of 15. Since 5 was the last number, you’re done and can report an answer of 15.

This should seem repetitive to the point of being tedious; this monotonous, iterative task is exactly the kind of thing we should relegate to a computer, and the way to do so is with a for loop. The basic structure of a for loop is:


for (<INDEX> in <INDEX VECTOR>){
  <BODY>
}
 

INDEX can have any name of your choosing. INDEX VECTOR is the vector that the index values live in as entries. BODY is a command or sequence of commands to be executed. The body will often refer to INDEX. After the body has done its job, the value of INDEX will be incremented to the next value of INDEX VECTOR, and the body will be executed again with the new INDEX value. This process will repeat until the last value of INDEX in INDEX VECTOR is reached.

The following shows how this works by enacting the summing algorithm above. A few notes:

  • We have to initialize the value of Sum to 0 before we start adding numbers to it. Any for loop whose job is to perform some kind of accumulation requires that the accumulated total be initialized to a starting value before entering the loop.
  • i is the index name, and the index vector 1:5 is the sequence of numbers from 1 to 5.
  • During each pass through the loop, the value of Sum is updated by adding the current index to it.
Sum <- 0
for (i in 1:5){
  Sum <- Sum + i 
}

Sum
## [1] 15

for loops can iterate through any vector, not just sequences of integers. Recall that we can think of a tibble as a vector whose entries are the columns. This means we can even use tibbles as our index vectors. Let’s write a loop that will print the mean of each column of diamonds whose data type is “double.” Notice that we have to use a conditional statement inside our loop to check that the column is of the right type before attempting to compute a mean:

for (col in diamonds){
  if (type_sum(col) == "dbl"){
    print(mean(col))
  }
}
## [1] 0.7979397
## [1] 61.7494
## [1] 57.45718
## [1] 5.731157
## [1] 5.734526
## [1] 3.538734

There’s an important technical consideration that applies to loops. The loops above either produced a single value or printed a list of values. It’s often the case instead that the output of a loop is an entire vector that accumulates its entries with each successive iteration. We’ll first walk through the most obvious way to do this and then see a much faster (but less obvious) approach.

First, the obvious, slow method. Suppose we’ve flipped a coin four times and have a vector containing the results: c("H", "H", "T", "H"). We flip again and get “T.” We can update our vector as follows:

vec <- c("H", "H", "T", "H")
vec <- c(vec, "T")

vec
## [1] "H" "H" "T" "H" "T"

This is like updating Sum in the loop above. Suppose now that we want to generate a list of 25 coin flips and store the results in a vector. We can simulate a coin flip in R by randomly generating a 0 or 1, each having a 50% chance of being generated. The 0:1 argument below indicates that our generated integers are chosen from the sequence starting at 0 and ending at 1, and the 1 argument indicates that we’re generating one number from that sequence.

sample(0:1, 1)
## [1] 1

Now to get a sequence of 25 coin flips, let’s construct a for loop with 25 iterations, each one producing a random 0 or 1. Let’s say 0 is heads and 1 is tails. To store the results in a vector, we start by initializing vec2 to an empty vector and then update it during each iteration as above.

vec2 <- NULL

for (i in 1:25){
  flip <- sample(0:1, 1)
  if (flip == 0){
    vec2 <- c(vec2, "H")
  } else {
    vec2 <- c(vec2, "T")
  }
}

vec2
##  [1] "H" "H" "H" "H" "T" "T" "T" "T" "T" "T" "H" "H" "H" "H" "T" "H" "H" "T" "T" "T" "H" "H"
## [23] "H" "T" "H"

This seems to have taken no time at all, but we’ve actually made R do a lot of extra work. Allocating memory space for a vector takes time, and with every pass through the loop, we’re creating a new vector vec2 which R has to create from scratch 25 times. It would be much better to create a 25-entry vec2 at the beginning of the algorithm (with NA or 0 or whatever you want for each entry) and then update the entries during each iteration. Updating an entry takes much less time than creating an entire vector. This idea is known as pre-allocation, and it’s a very good idea any time your loop’s job is to produce an accumulated vector as output. Here’s how we can do this:

# Initialize our vector to be a 25-entry vector with NA in each entry.
vec3 <- rep(NA, 25) 

for (i in 1:25){
  flip <- sample(0:1, 1)
  if (flip == 0){
    vec3[i] <- "H" # We only update the entry, not the entire vector
  } else {
    vec3[i] <- "T" # We only update the entry, not the entire vector
  }
}

vec3
##  [1] "T" "T" "H" "H" "H" "H" "T" "T" "T" "H" "T" "T" "T" "T" "T" "H" "T" "H" "T" "T" "T" "H"
## [23] "H" "H" "H"

The primary action performed is the assignment of an “H” or “T” to entry i in vec3 during each iteration, as opposed to the previous algorithm, which creates an entirely new copy of vec2 during each iteration.

For a loop with only 25 iterations, it’s hard to notice any speed difference, but the difference becomes quite noticeable as the number of iterations increases. Let’s run each loop again with 100,000 iterations and measure the run times. This is done by marking the system time (using Sys.time()) at the beginning and end of each loop and then subtracting them to find the time elapsed.

# This is the algorithm that does not pre-allocate the output.

start_time <- Sys.time()

vec2 <- NULL

for (i in 1:100000){
  flip <- sample(0:1, 1)
  if (flip == 0){
    vec2 <- c(vec2, "H")
  } else {
    vec2 <- c(vec2, "T")
  }
}

end_time <- Sys.time()

end_time - start_time
## Time difference of 44.00795 secs
# This is the algorithm that pre-allocates the output.

start_time <- Sys.time()

vec3 <- rep(NA, 100000) 

for (i in 1:100000){
  flip <- sample(0:1, 1)
  if (flip == 0){
    vec3[i] <- "H"
  } else {
    vec3[i] <- "T"
  }
}

end_time <- Sys.time()

end_time - start_time
## Time difference of 1.201111 secs

As you can see, the increase in speed resulting from pre-allocation is dramatic.


Some iterative processes don’t terminate after a predetermined number of iterations but instead when some type of stopping condition is met. For example, suppose we want to simulate an experiment that counts the number of times it takes to get a “doubles” (a dice roll that has the same number on each die) on a pair of six-sided dice. We could run a for loop in which we simulate a dice roll using random number generators, but what would we use for the index vector? We don’t know how many iterations we’ll need to get a doubles. Instead, we can use a while loop, the syntax for which is:


while (<CONDITION>){
  <ACTION TO BE PERFORMED WHILE CONDITION IS TRUE>
}
 

Let’s simulate the dice-rolling scenario above and count how many rolls it take to get our first doubles.

# First we create a logical variable called `doubles` that states whether a doubles 
# has yet been rolled.  We initialize it to `FALSE`:

doubles <- FALSE

# We also want to keep track of how many rolls we've done.  We'll create a `counter` 
# variable and initialize it to 0.  We'll then increment it after each roll.

counter <- 0

# We're going to simulate a dice roll within the body of the loop, but only if we haven't
# yet gotten a doubles.  We should therefore only enter the loop while the `doubles` value
# is `FALSE`.  (If `doubles` is `FALSE`, then `!doubles` is `TRUE`.)

while(!doubles){
  
  # We roll the dice by generating a random integer from 1 to 6 for each die:
  
  die1 <- sample(1:6, 1)
  die2 <- sample(1:6, 1)
  
  # Since we just rolled the dice, we have to increment `counter`:
  
  counter <- counter + 1
  
  # Now we check whether we rolled a doubles, and if so, we change the value of `doubles`
  # to `TRUE`.  This means `!doubles` would be `FALSE`, and the loop would end. If we did 
  # not roll a doubles, `doubles` will remain `FALSE`, and the loop will be executed 
  # again.
  
  if (die1 == die2){
    doubles <- TRUE
  }
}

# Once `doubles` becomes `TRUE` and we exit the loop, we check the final value of 
# `counter` to see how many rolls we needed to get a doubles:

counter
## [1] 3

We could even run this experiment 1000 times and store the results in a pre-allocated output vector:

num_rolls <- 1000
results <- rep(NA, num_rolls)

for (i in 1:num_rolls){
  
  doubles <- FALSE
  counter <- 0

  while(!doubles){
    
    die1 <- sample(1:6, 1)
    die2 <- sample(1:6, 1)
    
    counter <- counter + 1
    
    if (die1 == die2){
      doubles <- TRUE
    }
  }
  
  results[i] <- counter
  
}

Now it would be fun to examine some of the statistics:

mean(results)
## [1] 5.947
min(results)
## [1] 1
max(results)
## [1] 42

On average, it took about 6 rolls to get a doubles. This makes sense – there are 36 possible rolls of the dice, 6 of which are doubles, so the probability of rolling a doubles is 1/6. We would thus expect to get a doubles about once every six rolls.

It looks like we also got a doubles on our first roll at least one time, and, amazingly, it once took 42 rolls to get a doubles!


The collection of programming constructs we saw in this chapter is very far from comprehensive, and they were mostly presented without a concern for computational efficiency or even best coding practices. However, you can go a long way with print-to-screen commands, conditional statements, user-defined functions, and loops.

5.4.1 Exercises

  1. Write a loop that prints the first 15 perfect squares to the screen. (A perfect square is a number with an integer square root, such as 36, 81, etc). Do this using the print function and then again using the cat function. What do you notice about the difference between the outputs.

  2. Write a function that accepts a positive integer n as input and returns the sum of the first n perfect squares. (For example, if n is 4, the function would return 30, which is the value of \(1^2+2^2+3^2+4^2\).)

  3. How likely is rolling a 9 on a pair of six-sided dice? Answer this question by simulating 1000 dice rolls, keeping track of the number of times you roll a 9. Then find the percentage of times you rolled a 9.

  4. Write a loop that prints the number of NAs in each column of msleep.

  5. Re-do the previous exercise, but instead of printing the number of NAs in each column, create an ouput vector that stores each number of NAs in its entries. Remember to pre-allocate space for this vector.

  6. A random walk is a statistical process that, in its simplest form, can be described as follows: “Start on a number line at 0. Flip a coin. If you get heads, step one unit to the right. If you get tails, step one unit to the left. Then repeat.” Use a while loop to simulate this random walk, and stop it the first time the walker reaches 5 on the number line. How many steps did it take?

  7. Write a function that accepts an n value as input and then performs the random walk from the previous exercise n times. Accumulate a pre-allocated vector of length n that stores the number of steps required to reach 5 in each of your n random walks. The output of your function should be a printed sentence that states the mean, maximum, and minimum values of this vector.

  8. In the random walk from the previous two exercises, there was a 50-50 chance of stepping left or right. Repeat the previous two exercises, but try to think of a way to make the chances of stepping right 75% and the chances of stepping left 25%. What do you predict will happen when you count the number of steps required to reach 5?

  9. When accumulating an output vector using a while loop, you often won’t know how long it will be, which makes pre-allocation difficult. How might this be addressed?

5.5 Project

Project Description: Baseball is very much a game of numbers, and there seem to be an endless number of statistics, called sabermetrics, which attempt to quantify a player’s performance.

In this project, you will use the Batting data set from Lahman to make an argument for the player whom you think is the best home run hitter of all time. You will do so by computing several home run sabermetrics, some of which will require customized code.

Instructions:

  1. Using the Batting data set, create data sets that contain the following players, along with all of their batting data.

    1. The top 20 leaders in home runs hit in a single season.
    2. The top 20 leaders in total home runs hit over a career.
  2. The players who showed up on either of the lists above should be your candidates for the best home run hitter of all time. Merge the two data sets from the previous problem.

  3. For the merged data set from the previous problem, perform a grouped summary in which you compute each player’s career totals in home runs, at bats, and strikeouts.

  4. Add the following sabermetric columns to the grouped data set:

    1. The number of at bats per home run throughout the player’s career. This measures the frequency with which the player hits home runs. (A low number is better than a high number.)
    2. The number of strikeouts per home run thoughout the player’s career. This measures the ability of the player to hit for power while maintaining the discipline to avoid swinging too hard or swinging at bad pitches. (Again, a low number is better than a high number here.)
  5. One more sabermetric worth adding to our data set is one that measures a player’s home run consistency over time. This will be the number of consecutive seasons a player hit, for example, 20 or more home runs. This calculation will require some programming.

    1. Write a function called max_consec that accepts a vector vec of numbers and a positive integer m as input. The function should return the longest streak of consecutive entries in vec which are greater than or equal to m. For example, if vec is c(12, 17, 25, 32, 20, 19, 21, 23, 35, 20) and m is 20, the output should be 4 since the longest run of consecutive entries greater than or equal to 20 is 4.
    2. Use your function from part (a) to add the following columns to your data set for each of your candidates: the longest consecutive run of seasons with 20 or more, 30 or more, 40 or more, 50 or more, and 60 or more home runs.
  6. Based only on your calculations above, who, in your opinion is the greatest home run hitter of all time? Who’s second? Who’s third?

Guidelines:

Write up your analysis in an R Markdown report, making sure to follow the usual guidelines as stated for the Chapters 1 and 2 projects. Additionally:

  • Carefully describe any preliminary cleaning you do to prepare the data for your analysis.
  • Use only your analysis to make your opinion; avoid the temptation to bring in outside information you might have such as PED use, etc. This is certainly not a comprehensive list of considerations to be taken into account when ranking the all-time best home run hitters. No baseball knowledge is required or expected to perform this analysis.
  • For your max_consec function, be sure to add comments to your code to make it more understandable.
  • You can look up a player’s name from the playerID value as follows:
playerInfo("trammal01")
##        playerID nameFirst nameLast
## 18533 trammal01      Alan Trammell

Grading Rubric:

  • Transformation: Did you correctly perform all necessary data transformations? (30 points)
  • Programming: Did you get your max_consec function to work? (30 points)
  • Narrative: Is it clear what you’re trying to do in this project? Do you maintain a readable narrative throughout that guides the reader through your analysis? (20 points)
  • Professionalism: Does your report look nice? Do you provide insights based on your analysis? Is your code clear and readable? Did you follow the guidelines listed above? (15 points)