# Chapter 5 Programming in R

It’s very often the case that the functions provided by base R, **tidyverse**, or whatever other package you’re using are not sufficient for the particular question you need to answer. In that case, it becomes necessary to design your own functions and procedures, and this involves programming.

R is a full-fledged programming language and offers all of the usual programming constructs. Learning a few of these gives you the ability to customize your projects to fulfill your specific data analysis needs.

We’ll need **tidyverse** in this chapter.

`library(tidyverse)`

## 5.1 Print to Screen

It’s often necessary to print a message or value to the console window, for example, to produce readable, user-friendly output or to check how a piece of code is working. There are two functions that do this: `print`

and `cat`

:

`print("Hello World!")`

`## [1] "Hello World!"`

`cat("Hello world!")`

`## Hello world!`

One obvious difference between these two functions is that `print`

displays quotation marks while `cat`

does not. The actual difference between `print`

and `cat`

is the type of object they return. To see this, we can store the output of each function in a named variable and then check that variable’s data type.

`<- print("Hello world!") print_output `

`## [1] "Hello world!"`

`type_sum(print_output)`

`## [1] "chr"`

`<- cat("Hello world!") cat_output `

`## Hello world!`

`type_sum(cat_output)`

`## [1] "NULL"`

The output of the `print`

statement is a character string, while that of the `cat`

statement is “NULL.” This means that the `cat`

output is not an R object and can’t be operated on further. It’s just a display of characters on the screen.

You can also use a variable as a `print`

or `cat`

argument, and the value of that variable will be printed. (Just don’t put quotes around the variable name.)

```
<- 5
x
print(x)
```

`## [1] 5`

`cat(x)`

`## 5`

You can also print a mix of text and variables. The `cat`

function is tailor-made for this (`cat`

is actually short for “concatenate”), while it takes a little more work with `print`

. It’s interesting to note that `print`

displays more decimal accuracy than `cat`

.

```
<- 10
i
print(paste("The square root of",i,"is about",sqrt(i),"."))
```

`## [1] "The square root of 10 is about 3.16227766016838 ."`

`cat("The square root of",i,"is about",sqrt(i),".")`

`## The square root of 10 is about 3.162278 .`

By default, `print`

and `cat`

separate output pieces by spaces, but if you’re like me, you’re bothered by the space before the closing period above. You can change the default separation by using the optional `sep`

argument. For the above, we don’t want a space before the final period, so we’ll use `sep = ""`

. We’ll have to incorporate spaces into our text, though:

```
<- 10
i
print(paste("The square root of ",i," is about ",sqrt(i),".", sep = ""))
```

`## [1] "The square root of 10 is about 3.16227766016838."`

`cat("The square root of ",i," is about ",sqrt(i),".", sep = "")`

`## The square root of 10 is about 3.162278.`

You can also include a line break within `cat`

with the symbol `\n`

:

`cat("This is the first line.\nThis is the second line.")`

```
## This is the first line.
## This is the second line.
```

`print`

and `cat`

have the same basic functionality for our purposes, so usually either one will serve our needs. We’ll explore the differences more in upcoming exercises.

## 5.2 User-Defined Functions

A *function* is a process which accepts inputs, performs a series of operations on them, and produces an output. We’ve used dozens of functions in the course so far, including those included in base R such as `mean`

, `sd`

, `sum`

, etc, as well as those included in extra packages like **tidyverse**, such as `ggplot`

, `mutate`

, `str_replace`

, etc. In this section, we’ll see how to write our own functions. The syntax for writing functions in R is:

```
<FUNCTION NAME> <- function(<LIST OF INPUT NAMES>){
<SEQUENCE OF OPERATIONS TO BE PERFORMED>
return(<OUTPUT NAME>)
}
```

To illustrate how to define functions in R, we introduce the notion of *skewness*. Skewness is a statistic that measures how symmetrical a continuous variable’s distribution is, as visualized by its histogram. A perfectly symmetrical distribution has a skewness of 0. A right-skewed distribution (one which is longer to the right of the median) has a positive skewness value, and a left-skewed distribution (one which is longer to the left of the median) has a negative skewness value. In a symmetrical distribution, the mean and median are equal. In a right-skewed distribution, the mean is larger than the median, and in a left-skewed distribution, the mean is less than the median.

There are various ways to define skewness numerically, but a particularly simple one, known as *Pearson’s median skewness*, is given by
\[3\times \frac{\textrm{mean}-\textrm{median}}{\textrm{standard deviation}}.\]
Let’s write a function called `skewness`

that calculates this. Notice that the name chosen for the input is `v`

, which reflects the fact that the input is to be a vector of numbers. Notice also that a commented statement appears in the body of the function. This is a very good practice; it helps the reader understand what you’re doing, and it can remind you what you were thinking when you revisit your code later.

```
<- function(v){
skewness # This is Pearson's median skewness
<- 3 * (mean(v) - median(v)) / sd(v)
answer return(answer)
}
```

A code chunk that defines a function must obviously first be executed before it can be used, but when executing a function definition code chunk, you should always place the cursor after the final `}`

before hitting ctrl+enter.

Once our function is defined, we can call it with any vector of numbers as input. Let’s find the skewness of the `price`

variable in `diamonds`

:

`skewness(diamonds$price)`

`## [1] 1.151891`

The positive skewness tells us that the histogram is skewed to the right, as we can also see directly:

```
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = price))
```

The variable `answer`

introduced within the function body above is only used to name the output that is to be reported back out of the function. By default, functions in R will always return the last computed value as the output unless instructed otherwise. So, the following definition of `skewness`

would also work:

```
<- function(v){
skewness # This is Pearson's median skewness
3 * (mean(v) - median(v)) / sd(v)
}
```

`skewness(diamonds$price)`

`## [1] 1.151891`

It’s often a good practice to name your output variable, though, and use an explicit `return`

statement to make it clear what your intended output is.

You can also use a function to display a printed sentence. Notice that we are *not* using a `return`

statement below since `cat`

statements are `NULL`

objects in R and therefore cannot be returned as usable output by a function.

```
<- function(v){
skewness <- 3 * (mean(v) - median(v)) / sd(v)
value cat("The value of Pearson's median skewness is ", value, ".\nA positive value indicates a right-skewed distribution.", sep = "")
}
skewness(diamonds$price)
```

```
## The value of Pearson's median skewness is 1.151891.
## A positive value indicates a right-skewed distribution.
```

Functions can also have more than one input and/or output. For example, let’s write a function which accepts the three coefficients of a quadratic equation \(ax^2+bx+c=0\) as inputs and returns the (possibly more than one) solution as its output. Technically, functions in R will not return more than one output, but we can work around that by returning a single *vector* of outputs with our two solutions as entries. (As noted above, without the `return`

statement, the function below would only return the value of `sol2`

.)

```
<- function(a, b, c){
quad_solve # sol1 and sol2 are the values produced by the Quadratic Formula
<- (-b + sqrt(b^2 - 4*a*c))/(2*a)
sol1 <- (-b - sqrt(b^2 - 4*a*c))/(2*a)
sol2 return(c(sol1, sol2))
}
```

`quad_solve(1, 1, -6)`

`## [1] 2 -3`

As you know, some quadratic equations have only one or even no real solutions. For example, the equation \(x^2-8x+16=0\) has only one solution, namely \(x=4\). Let’s see how our function handles this:

`quad_solve(1, -8, 16)`

`## [1] 4 4`

This makes sense, since we can interpret the double appearance of 4 as the result of the way \(x^2-8x+16\) factors as \((x-4)(x-4)\). What about equations with no real solutions, such as \(2x^2-3x+4=0\)?

`quad_solve(2, -3, 4)`

```
## Warning in sqrt(b^2 - 4 * a * c): NaNs produced
## Warning in sqrt(b^2 - 4 * a * c): NaNs produced
```

`## [1] NaN NaN`

`NaN`

means “not a number,” and it’s showing up here because \(b^2-4ac<0\), and thus our function is trying to take the square root of a negative number. (Apparently, base R does not know about complex numbers.) In the next section, we’ll see a way to avoid this kind of confusing output.

### 5.2.1 Exercises

These exercises require the **nycflights13** library.

Write a function that accepts two numerical vectors of the same length as input and returns the number of times the same number occurs in the same entry of each. For example, if the two vectors are

`c(2, 5, -2, 8, 7)`

and`c(3, 5, -2, 2, 7)`

, the function should return the number 3.Recall that if \(a\) and \(b\) are the legs of a right triangle and \(c\) is the hypotenuse, then the Pythagorean Theorem says \(a^2+b^2=c^2\). Write a function that accepts the values of \(a\) and \(b\) as input and returns the value of \(c\) as output. Your output should be a sentence of the form, “The length of the hypotenuse is 4.1342.”

Write a function that accepts a vector as input and counts the number of

`NA`

entries. Then use your function to state how many`NA`

s there are in the`dep_delay`

column of`flights`

.

## 5.3 Conditional Statements

A conditional statement is one which executes only when a given condition is satisfied. In R, these take the form of `if...else`

statements:

```
if (<CONDITION>){
<ACTION TO BE PERFORMED WHEN CONDITION IS TRUE>
} else {
<ACTION TO BE PERFORMED WHEN CONDITION IS FALSE>
}
```

For example, let’s write a conditional statement that states whether the cars in `mpg`

have, on average, better highway or city fuel efficiencies. (Of course, we already know the answer):

```
<- mean(mpg$hwy)
hwy_avg <- mean(mpg$cty)
cty_avg
if (cty_avg > hwy_avg){
cat("City gas mileage is better.\nThe mean value of `cty` is ",cty_avg,".", sep = "")
else {
} cat("Highway gas mileage is better.\nThe mean value of `hwy` is ",hwy_avg,".", sep = "")
}
```

```
## Highway gas mileage is better.
## The mean value of `hwy` is 23.44017.
```

Conditional statements play a very important role within functions by providing a way to check the input for problems before proceeding with the function operations. This is called a *value check*. A value check provides the programmer with the opportunity to specify a meaningful error message if the input doesn’t satisfy the check condition. Here’s how it works for our `quad_solve`

function from the previous section. Recall that for quadratic equations with no real solutions, we were left with a confusing warning message and the results `NaN`

.

```
<- function(a, b, c){
quad_solve # First check whether there are actually real solutions. If not, stop and return an
# error message.
if (b^2 - 4*a*c < 0){
stop("no real solutions", call. = FALSE)
}
# sol1 and sol2 are the values produced by the Quadratic Formula
<- (-b + sqrt(b^2 - 4*a*c))/(2*a)
sol1 <- (-b - sqrt(b^2 - 4*a*c))/(2*a)
sol2 return(c(sol1, sol2))
}
```

`quad_solve(2, -3, 4)`

`## Error: no real solutions`

In Section 2.5, we introduced the functions `ifelse`

and `case_when`

to categorize continuous variables. Each one operates with a logic similar to an `if...else`

statement: A condition is checked, the outcome of which dictates what happens next. However, the purpose of these two categorization functions is to generate a value. For example, consider the following two pieces of code:

```
<- 5
x <- 6
y
if (x < y){
<- x - y
z else {
} <- y - x
z
}
z
```

`## [1] -1`

and

```
<- 5
x <- 6
y
<- ifelse(x < y,
z - y,
x - x)
y
z
```

`## [1] -1`

The result is the same; both pieces of code assign -1 to `z`

. However, they do it differently. The `if...else`

statement makes the assignment to `z`

within the body. The `ifelse`

takes on the value of -1 itself, and then assigns this to `z`

. This highlights the difference between these two constructs: The body of an `if...else`

performs whatever sequence of operations you give it, while the only job of `ifelse`

(and `case_when`

) is to return a value.

`ifelse`

and `case_when`

are thus more limited in scope than `if...else`

, but one of their major benefits is that they are *vectorized*, meaning they can accept an entire vector of inputs and perform the operation individually on each entry, producing an output vector of results. This is why these two functions work within `mutate`

, as in Section 2.5; the code we saw there applied each of these functions not just to a number, but to the entire vector `arr_delay`

.

For example, suppose we re-do our `ifelse`

calculation above but replace `x`

and `y`

with vectors. Before looking at the answer, try to predict what the value of `z`

will be:

```
<- c(5, 1, 8)
x <- c(6, 4, 6)
y
<- ifelse(x < y,
z - y,
x - x)
y
z
```

`## [1] -1 -3 -2`

Because `ifelse`

is vectorized, it individually calculates the `z`

value for each pair of corresponding `x`

and `y`

values and stores the answer in the corresponding spot in the `z`

vector.

### 5.3.1 Exercises

- Write a function which calculates the area of a circle when the radius is input. Include a value check that returns an error message if the input radius is negative.
- Write a function that accepts a number as input and returns its absolute value as output. (Do not use the built-in
`abs`

function.) - Write a function which accepts an integer as input and determines whether it is even or odd. (Hint: Use the
`%%`

operation.) Include a value check that makes sure the input is an integer. - Write a function which accepts a number as input and determines whether it is positive, negative, or zero.
- Write a function that accepts a car’s miles-per-gallon highway fuel efficiency as input and assigns it a value of “poor,” “fair,” “good,” or “excellent” depending on what range it falls within. Decide for yourself what these ranges should be.
- Use your function from the previous exercise to add a column to
`mpg`

that categorizes each observation’s`hwy`

value. - Write a function which accepts two numerical vectors of the same length as input and states, entry-by-entry, which vector has the smaller entry or states that they’re equal. For example, if the two vectors are
`v1 <- c(3, 4, 1, -2)`

and`v2 <- c(2, 6, 1, 0)`

, the output (with line breaks included) should be:

The entry of v2 is smaller.

The entry of v1 is smaller.

The entries are equal.

The entry of v1 is smaller.

## 5.4 Loops

It’s very common for an algorithm to be called on to perform the same action repeatedly, possibly with a different input each time. For example, think about the process of finding the sum of the numbers 1 through 5. You start with 1 and add the next number, 2, to it. Your sum at that point is 3. Then you start with 3 and add the next number, 3, to it. You sum is now 6. Then you start with 6 and add the next number, 4, to it, obtaining a sum of 10. Then you start with 10 and add the next number, 5, to it, arriving at a sum of 15. Since 5 was the last number, you’re done and can report an answer of 15.

This should seem repetitive to the point of being tedious; this monotonous, iterative task is exactly the kind of thing we should relegate to a computer, and the way to do so is with a `for`

loop. The basic structure of a `for`

loop is:

```
for (<INDEX> in <INDEX VECTOR>){
<BODY>
}
```

`INDEX`

can have any name of your choosing. `INDEX VECTOR`

is the vector that the index values live in as entries. `BODY`

is a command or sequence of commands to be executed. The body will often refer to `INDEX`

. After the body has done its job, the value of `INDEX`

will be incremented to the next value of `INDEX VECTOR`

, and the body will be executed again with the new `INDEX`

value. This process will repeat until the last value of `INDEX`

in `INDEX VECTOR`

is reached.

The following shows how this works by enacting the summing algorithm above. A few notes:

- We have to initialize the value of
`Sum`

to 0 before we start adding numbers to it. Any`for`

loop whose job is to perform some kind of accumulation requires that the accumulated total be initialized to a starting value before entering the loop. `i`

is the index name, and the index vector`1:5`

is the sequence of numbers from 1 to 5.- During each pass through the loop, the value of
`Sum`

is updated by adding the current index to it.

```
<- 0
Sum for (i in 1:5){
<- Sum + i
Sum
}
Sum
```

`## [1] 15`

`for`

loops can iterate through any vector, not just sequences of integers. Recall that we can think of a tibble as a vector whose entries are the columns. This means we can even use tibbles as our index vectors. Let’s write a loop that will print the mean of each column of `diamonds`

whose data type is “double.” Notice that we have to use a conditional statement inside our loop to check that the column is of the right type before attempting to compute a mean:

```
for (col in diamonds){
if (type_sum(col) == "dbl"){
print(mean(col))
} }
```

```
## [1] 0.7979397
## [1] 61.7494
## [1] 57.45718
## [1] 5.731157
## [1] 5.734526
## [1] 3.538734
```

There’s an important technical consideration that applies to loops. The loops above either produced a single value or printed a list of values. It’s often the case instead that the output of a loop is an entire vector that accumulates its entries with each successive iteration. We’ll first walk through the most obvious way to do this and then see a much faster (but less obvious) approach.

First, the obvious, slow method. Suppose we’ve flipped a coin four times and have a vector containing the results: `c("H", "H", "T", "H")`

. We flip again and get “T.” We can update our vector as follows:

```
<- c("H", "H", "T", "H")
vec <- c(vec, "T")
vec
vec
```

`## [1] "H" "H" "T" "H" "T"`

This is like updating `Sum`

in the loop above. Suppose now that we want to generate a list of 25 coin flips and store the results in a vector. We can simulate a coin flip in R by randomly generating a 0 or 1, each having a 50% chance of being generated. The `0:1`

argument below indicates that our generated integers are chosen from the sequence starting at 0 and ending at 1, and the `1`

argument indicates that we’re generating one number from that sequence.

`sample(0:1, 1)`

`## [1] 1`

Now to get a sequence of 25 coin flips, let’s construct a `for`

loop with 25 iterations, each one producing a random 0 or 1. Let’s say 0 is heads and 1 is tails. To store the results in a vector, we start by initializing `vec2`

to an empty vector and then update it during each iteration as above.

```
<- NULL
vec2
for (i in 1:25){
<- sample(0:1, 1)
flip if (flip == 0){
<- c(vec2, "H")
vec2 else {
} <- c(vec2, "T")
vec2
}
}
vec2
```

```
## [1] "H" "H" "H" "H" "T" "T" "T" "T" "T" "T" "H" "H" "H" "H" "T" "H" "H" "T" "T" "T" "H" "H"
## [23] "H" "T" "H"
```

This seems to have taken no time at all, but we’ve actually made R do a lot of extra work. Allocating memory space for a vector takes time, and with every pass through the loop, we’re creating a new vector `vec2`

which R has to create from scratch 25 times. It would be much better to create a 25-entry `vec2`

at the beginning of the algorithm (with `NA`

or `0`

or whatever you want for each entry) and then update the entries during each iteration. Updating an entry takes much less time than creating an entire vector. This idea is known as *pre-allocation*, and it’s a very good idea any time your loop’s job is to produce an accumulated vector as output. Here’s how we can do this:

```
# Initialize our vector to be a 25-entry vector with NA in each entry.
<- rep(NA, 25)
vec3
for (i in 1:25){
<- sample(0:1, 1)
flip if (flip == 0){
<- "H" # We only update the entry, not the entire vector
vec3[i] else {
} <- "T" # We only update the entry, not the entire vector
vec3[i]
}
}
vec3
```

```
## [1] "T" "T" "H" "H" "H" "H" "T" "T" "T" "H" "T" "T" "T" "T" "T" "H" "T" "H" "T" "T" "T" "H"
## [23] "H" "H" "H"
```

The primary action performed is the assignment of an “H” or “T” to entry `i`

in `vec3`

during each iteration, as opposed to the previous algorithm, which creates an entirely new copy of `vec2`

during each iteration.

For a loop with only 25 iterations, it’s hard to notice any speed difference, but the difference becomes quite noticeable as the number of iterations increases. Let’s run each loop again with 100,000 iterations and measure the run times. This is done by marking the system time (using `Sys.time()`

) at the beginning and end of each loop and then subtracting them to find the time elapsed.

```
# This is the algorithm that does not pre-allocate the output.
<- Sys.time()
start_time
<- NULL
vec2
for (i in 1:100000){
<- sample(0:1, 1)
flip if (flip == 0){
<- c(vec2, "H")
vec2 else {
} <- c(vec2, "T")
vec2
}
}
<- Sys.time()
end_time
- start_time end_time
```

`## Time difference of 44.00795 secs`

```
# This is the algorithm that pre-allocates the output.
<- Sys.time()
start_time
<- rep(NA, 100000)
vec3
for (i in 1:100000){
<- sample(0:1, 1)
flip if (flip == 0){
<- "H"
vec3[i] else {
} <- "T"
vec3[i]
}
}
<- Sys.time()
end_time
- start_time end_time
```

`## Time difference of 1.201111 secs`

As you can see, the increase in speed resulting from pre-allocation is dramatic.

Some iterative processes don’t terminate after a predetermined number of iterations but instead when some type of stopping condition is met. For example, suppose we want to simulate an experiment that counts the number of times it takes to get a “doubles” (a dice roll that has the same number on each die) on a pair of six-sided dice. We could run a `for`

loop in which we simulate a dice roll using random number generators, but what would we use for the index vector? We don’t know how many iterations we’ll need to get a doubles. Instead, we can use a `while`

loop, the syntax for which is:

```
while (<CONDITION>){
<ACTION TO BE PERFORMED WHILE CONDITION IS TRUE>
}
```

Let’s simulate the dice-rolling scenario above and count how many rolls it take to get our first doubles.

```
# First we create a logical variable called `doubles` that states whether a doubles
# has yet been rolled. We initialize it to `FALSE`:
<- FALSE
doubles
# We also want to keep track of how many rolls we've done. We'll create a `counter`
# variable and initialize it to 0. We'll then increment it after each roll.
<- 0
counter
# We're going to simulate a dice roll within the body of the loop, but only if we haven't
# yet gotten a doubles. We should therefore only enter the loop while the `doubles` value
# is `FALSE`. (If `doubles` is `FALSE`, then `!doubles` is `TRUE`.)
while(!doubles){
# We roll the dice by generating a random integer from 1 to 6 for each die:
<- sample(1:6, 1)
die1 <- sample(1:6, 1)
die2
# Since we just rolled the dice, we have to increment `counter`:
<- counter + 1
counter
# Now we check whether we rolled a doubles, and if so, we change the value of `doubles`
# to `TRUE`. This means `!doubles` would be `FALSE`, and the loop would end. If we did
# not roll a doubles, `doubles` will remain `FALSE`, and the loop will be executed
# again.
if (die1 == die2){
<- TRUE
doubles
}
}
# Once `doubles` becomes `TRUE` and we exit the loop, we check the final value of
# `counter` to see how many rolls we needed to get a doubles:
counter
```

`## [1] 3`

We could even run this experiment 1000 times and store the results in a pre-allocated output vector:

```
<- 1000
num_rolls <- rep(NA, num_rolls)
results
for (i in 1:num_rolls){
<- FALSE
doubles <- 0
counter
while(!doubles){
<- sample(1:6, 1)
die1 <- sample(1:6, 1)
die2
<- counter + 1
counter
if (die1 == die2){
<- TRUE
doubles
}
}
<- counter
results[i]
}
```

Now it would be fun to examine some of the statistics:

`mean(results)`

`## [1] 5.947`

`min(results)`

`## [1] 1`

`max(results)`

`## [1] 42`

On average, it took about 6 rolls to get a doubles. This makes sense – there are 36 possible rolls of the dice, 6 of which are doubles, so the probability of rolling a doubles is 1/6. We would thus expect to get a doubles about once every six rolls.

It looks like we also got a doubles on our first roll at least one time, and, amazingly, it once took 42 rolls to get a doubles!

The collection of programming constructs we saw in this chapter is very far from comprehensive, and they were mostly presented without a concern for computational efficiency or even best coding practices. However, you can go a long way with print-to-screen commands, conditional statements, user-defined functions, and loops.

### 5.4.1 Exercises

Write a loop that prints the first 15 perfect squares to the screen. (A perfect square is a number with an integer square root, such as 36, 81, etc). Do this using the

`print`

function and then again using the`cat`

function. What do you notice about the difference between the outputs.Write a function that accepts a positive integer

`n`

as input and returns the sum of the first`n`

perfect squares. (For example, if`n`

is 4, the function would return 30, which is the value of \(1^2+2^2+3^2+4^2\).)How likely is rolling a 9 on a pair of six-sided dice? Answer this question by simulating 1000 dice rolls, keeping track of the number of times you roll a 9. Then find the percentage of times you rolled a 9.

Write a loop that prints the number of

`NA`

s in each column of`msleep`

.Re-do the previous exercise, but instead of printing the number of

`NA`

s in each column, create an ouput vector that stores each number of`NA`

s in its entries. Remember to pre-allocate space for this vector.A

*random walk*is a statistical process that, in its simplest form, can be described as follows: “Start on a number line at 0. Flip a coin. If you get heads, step one unit to the right. If you get tails, step one unit to the left. Then repeat.” Use a`while`

loop to simulate this random walk, and stop it the first time the walker reaches 5 on the number line. How many steps did it take?Write a function that accepts an

`n`

value as input and then performs the random walk from the previous exercise`n`

times. Accumulate a pre-allocated vector of length`n`

that stores the number of steps required to reach 5 in each of your`n`

random walks. The output of your function should be a printed sentence that states the mean, maximum, and minimum values of this vector.In the random walk from the previous two exercises, there was a 50-50 chance of stepping left or right. Repeat the previous two exercises, but try to think of a way to make the chances of stepping right 75% and the chances of stepping left 25%. What do you predict will happen when you count the number of steps required to reach 5?

When accumulating an output vector using a

`while`

loop, you often won’t know how long it will be, which makes pre-allocation difficult. How might this be addressed?

## 5.5 Project

**Project Description:** Baseball is very much a game of numbers, and there seem to be an endless number of statistics, called *sabermetrics*, which attempt to quantify a player’s performance.

In this project, you will use the `Batting`

data set from **Lahman** to make an argument for the player whom you think is the best home run hitter of all time. You will do so by computing several home run sabermetrics, some of which will require customized code.

**Instructions: **

Using the

`Batting`

data set, create data sets that contain the following players, along with all of their batting data.- The top 20 leaders in home runs hit in a single season.
- The top 20 leaders in total home runs hit over a career.

The players who showed up on either of the lists above should be your candidates for the best home run hitter of all time. Merge the two data sets from the previous problem.

For the merged data set from the previous problem, perform a grouped summary in which you compute each player’s career totals in home runs, at bats, and strikeouts.

Add the following sabermetric columns to the grouped data set:

- The number of at bats per home run throughout the player’s career. This measures the frequency with which the player hits home runs. (A low number is better than a high number.)
- The number of strikeouts per home run thoughout the player’s career. This measures the ability of the player to hit for power while maintaining the discipline to avoid swinging too hard or swinging at bad pitches. (Again, a low number is better than a high number here.)

One more sabermetric worth adding to our data set is one that measures a player’s home run consistency over time. This will be the number of consecutive seasons a player hit, for example, 20 or more home runs. This calculation will require some programming.

- Write a function called
`max_consec`

that accepts a vector`vec`

of numbers and a positive integer`m`

as input. The function should return the longest streak of consecutive entries in`vec`

which are greater than or equal to`m`

. For example, if`vec`

is`c(12, 17, 25, 32, 20, 19, 21, 23, 35, 20)`

and`m`

is 20, the output should be 4 since the longest run of consecutive entries greater than or equal to 20 is 4. - Use your function from part (a) to add the following columns to your data set for each of your candidates: the longest consecutive run of seasons with 20 or more, 30 or more, 40 or more, 50 or more, and 60 or more home runs.

- Write a function called
Based only on your calculations above, who, in your opinion is the greatest home run hitter of all time? Who’s second? Who’s third?

**Guidelines:**

Write up your analysis in an R Markdown report, making sure to follow the usual guidelines as stated for the Chapters 1 and 2 projects. Additionally:

- Carefully describe any preliminary cleaning you do to prepare the data for your analysis.
- Use
*only*your analysis to make your opinion; avoid the temptation to bring in outside information you might have such as PED use, etc. This is certainly not a comprehensive list of considerations to be taken into account when ranking the all-time best home run hitters. No baseball knowledge is required or expected to perform this analysis. - For your
`max_consec`

function, be sure to add comments to your code to make it more understandable. - You can look up a player’s name from the
`playerID`

value as follows:

`playerInfo("trammal01")`

```
## playerID nameFirst nameLast
## 18533 trammal01 Alan Trammell
```

**Grading Rubric:**

**Transformation**: Did you correctly perform all necessary data transformations? (30 points)**Programming**: Did you get your`max_consec`

function to work? (30 points)**Narrative**: Is it clear what you’re trying to do in this project? Do you maintain a readable narrative throughout that guides the reader through your analysis? (20 points)**Professionalism**: Does your report look nice? Do you provide insights based on your analysis? Is your code clear and readable? Did you follow the guidelines listed above? (15 points)