Chapter 5 Programming in R
It’s very often the case that the functions provided by base R, tidyverse, or whatever other package you’re using are not sufficient for the particular question you need to answer. In that case, it becomes necessary to design your own functions and procedures, and this involves programming.
R is a full-fledged programming language and offers all of the usual programming constructs. Learning a few of these gives you the ability to customize your projects to fulfill your specific data analysis needs.
We’ll need tidyverse in this chapter.
library(tidyverse)
5.1 Print to Screen
It’s often necessary to print a message or value to the console window, for example, to produce readable, user-friendly output or to check how a piece of code is working. There are two functions that do this: print and cat:
print("Hello World!")
## [1] "Hello World!"
cat("Hello world!")
## Hello world!
One obvious difference between these two functions is that print displays quotation marks while cat does not. A more fundamental difference is the value each function returns. To see this, we can store the output of each function in a named variable and then check that variable’s data type.
print_output <- print("Hello world!")
## [1] "Hello world!"
type_sum(print_output)
## [1] "chr"
cat_output <- cat("Hello world!")
## Hello world!
type_sum(cat_output)
## [1] "NULL"
The print function returns its argument (here, a character string), while cat returns NULL. This means that cat leaves nothing behind that can be operated on further; its only effect is to display characters on the screen.
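The check above can also be done with base R alone. In this minimal sketch (the variable names are just for illustration), identical confirms that print hands back its argument unchanged, and is.null confirms that cat hands back NULL:

```r
msg <- "Hello world!"

# print() displays its argument and also returns it (invisibly)
print_output <- print(msg)
identical(print_output, msg)

# cat() displays its arguments but returns NULL
cat_output <- cat(msg, "\n")
is.null(cat_output)
```

Both checks should come back TRUE.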
You can also use a variable as a print or cat argument, and the value of that variable will be printed. (Just don’t put quotes around the variable name.)
x <- 5
print(x)
## [1] 5
cat(x)
## 5
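You can also print a mix of text and variables. The cat function is tailor-made for this (cat is actually short for “concatenate”), while it takes a little more work with print, which needs paste to join the pieces first. It’s interesting to note that the print version displays more decimal places than the cat version. This is because paste converts numbers to text using up to 15 significant digits, while cat uses 7 by default.
i <- 10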
print(paste("The square root of",i,"is about",sqrt(i),"."))
## [1] "The square root of 10 is about 3.16227766016838 ."
cat("The square root of",i,"is about",sqrt(i),".")
## The square root of 10 is about 3.162278 .
By default, paste and cat separate output pieces by spaces, but if you’re like me, you’re bothered by the space before the closing period above. You can change the default separation by using the optional sep argument. For the above, we don’t want a space before the final period, so we’ll use sep = "". We’ll have to incorporate spaces into our text, though:
i <- 10
print(paste("The square root of ",i," is about ",sqrt(i),".", sep = ""))
## [1] "The square root of 10 is about 3.16227766016838."
cat("The square root of ",i," is about ",sqrt(i),".", sep = "")
## The square root of 10 is about 3.162278.
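If you want to control how many digits appear, one option is to round the number before printing it. This is a sketch using the base R functions round and format; the choice of 3 decimal places is arbitrary:

```r
i <- 10

# round() keeps 3 decimal places before the value is pasted into the sentence
msg <- paste("The square root of ", i, " is about ", round(sqrt(i), 3), ".", sep = "")
print(msg)

# format() with nsmall keeps trailing zeros, which round() alone would drop
cat("One half is ", format(0.5, nsmall = 2), ".\n", sep = "")
```

Rounding changes only what is displayed here; the value returned by sqrt(i) is unaffected.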
You can also include a line break within cat with the symbol \n:
cat("This is the first line.\nThis is the second line.")
## This is the first line.
## This is the second line.
print and cat have the same basic functionality for our purposes, so usually either one will serve our needs. We’ll explore the differences more in upcoming exercises.
5.2 User-Defined Functions
A function is a process which accepts inputs, performs a series of operations on them, and produces an output. We’ve used dozens of functions in the course so far, including those included in base R such as mean, sd, sum, etc., as well as those included in extra packages like tidyverse, such as ggplot, mutate, str_replace, etc. In this section, we’ll see how to write our own functions. The syntax for writing functions in R is:
<FUNCTION NAME> <- function(<LIST OF INPUT NAMES>){
<SEQUENCE OF OPERATIONS TO BE PERFORMED>
return(<OUTPUT NAME>)
}
To illustrate how to define functions in R, we introduce the notion of skewness. Skewness is a statistic that measures how symmetrical a continuous variable’s distribution is, as visualized by its histogram. A perfectly symmetrical distribution has a skewness of 0. A right-skewed distribution (one which is longer to the right of the median) has a positive skewness value, and a left-skewed distribution (one which is longer to the left of the median) has a negative skewness value. In a symmetrical distribution, the mean and median are equal. In a right-skewed distribution, the mean is larger than the median, and in a left-skewed distribution, the mean is less than the median.
There are various ways to define skewness numerically, but a particularly simple one, known as Pearson’s median skewness, is given by
\[3\times \frac{\textrm{mean}-\textrm{median}}{\textrm{standard deviation}}.\]
Let’s write a function called skewness that calculates this. Notice that the name chosen for the input is v, which reflects the fact that the input is to be a vector of numbers. Notice also that a commented statement appears in the body of the function. This is a very good practice; it helps the reader understand what you’re doing, and it can remind you what you were thinking when you revisit your code later.
skewness <- function(v){
  # This is Pearson's median skewness
  answer <- 3 * (mean(v) - median(v)) / sd(v)
  return(answer)
}
A code chunk that defines a function must obviously first be executed before it can be used, but when executing a function definition code chunk, you should always place the cursor after the final } before hitting ctrl+enter.
Once our function is defined, we can call it with any vector of numbers as input. Let’s find the skewness of the price variable in diamonds:
skewness(diamonds$price)
## [1] 1.151891
The positive skewness tells us that the histogram is skewed to the right, as we can also see directly:
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = price))
The variable answer introduced within the function body above is only used to name the output that is to be reported back out of the function. By default, functions in R will always return the last computed value as the output unless instructed otherwise. So, the following definition of skewness would also work:
skewness <- function(v){
  # This is Pearson's median skewness
  3 * (mean(v) - median(v)) / sd(v)
}
skewness(diamonds$price)
## [1] 1.151891
It’s often a good practice to name your output variable, though, and use an explicit return statement to make it clear what your intended output is.
You can also use a function to display a printed sentence. Notice that we are not using a return statement below: cat returns NULL, so there is no usable output for the function to return.
skewness <- function(v){
  value <- 3 * (mean(v) - median(v)) / sd(v)
  cat("The value of Pearson's median skewness is ", value, ".\nA positive value indicates a right-skewed distribution.", sep = "")
}
skewness(diamonds$price)
## The value of Pearson's median skewness is 1.151891.
## A positive value indicates a right-skewed distribution.
Functions can also have more than one input and/or output. For example, let’s write a function which accepts the three coefficients of a quadratic equation \(ax^2+bx+c=0\) as inputs and returns the (possibly more than one) solution as its output. Technically, functions in R will not return more than one output, but we can work around that by returning a single vector of outputs with our two solutions as entries. (As noted above, without the return statement, the function below would only return the value of sol2.)
quad_solve <- function(a, b, c){
  # sol1 and sol2 are the values produced by the Quadratic Formula
  sol1 <- (-b + sqrt(b^2 - 4*a*c))/(2*a)
  sol2 <- (-b - sqrt(b^2 - 4*a*c))/(2*a)
  return(c(sol1, sol2))
}
quad_solve(1, 1, -6)
## [1] 2 -3
As you know, some quadratic equations have only one or even no real solutions. For example, the equation \(x^2-8x+16=0\) has only one solution, namely \(x=4\). Let’s see how our function handles this:
quad_solve(1, -8, 16)
## [1] 4 4
This makes sense, since we can interpret the double appearance of 4 as the result of the way \(x^2-8x+16\) factors as \((x-4)(x-4)\). What about equations with no real solutions, such as \(2x^2-3x+4=0\)?
quad_solve(2, -3, 4)
## Warning in sqrt(b^2 - 4 * a * c): NaNs produced
## Warning in sqrt(b^2 - 4 * a * c): NaNs produced
## [1] NaN NaN
NaN means “not a number,” and it’s showing up here because \(b^2-4ac<0\), and thus our function is trying to take the square root of a negative number. (Base R does support complex numbers, but sqrt only returns a complex result when its input is already of complex type.) In the next section, we’ll see a way to avoid this kind of confusing output.
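As a side note, base R can work with complex numbers when asked to. The following is a minimal sketch, not a replacement for the quad_solve function above; the name quad_solve_complex is hypothetical. It converts the discriminant to complex type with as.complex before taking the square root:

```r
# Hypothetical variant of quad_solve that allows complex solutions.
quad_solve_complex <- function(a, b, c){
  # as.complex() makes sqrt() return a complex number instead of NaN
  disc <- as.complex(b^2 - 4*a*c)
  sol1 <- (-b + sqrt(disc))/(2*a)
  sol2 <- (-b - sqrt(disc))/(2*a)
  return(c(sol1, sol2))
}

quad_solve_complex(2, -3, 4)
```

The output is always a complex vector, even when the solutions happen to be real, which is one reason a value check may still be the better choice for this chapter’s purposes.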
Exercises
To complete the exercises below, download this R Notebook file and open it in RStudio using the Visual Editor. Then enter your answers in the space provided.
These exercises require the nycflights13 library.
- Write a function that accepts two numerical vectors of the same length as input and returns the number of times the same number occurs in the same entry of each. For example, if the two vectors are c(2, 5, -2, 8, 7) and c(3, 5, -2, 2, 7), the function should return the number 3.
- Recall that if \(a\) and \(b\) are the legs of a right triangle and \(c\) is the hypotenuse, then the Pythagorean Theorem says \(a^2+b^2=c^2\). Write a function that accepts the values of \(a\) and \(b\) as input and returns the value of \(c\) as output. Your output should be a sentence of the form, “The length of the hypotenuse is 4.1342.”
- Write a function that accepts a vector as input and counts the number of NA entries. Then use your function to state how many NAs there are in the dep_delay column of flights.
- Write a function which accepts a number as input and determines whether it is positive, negative, or zero.
- Write a function that accepts a car’s miles-per-gallon highway fuel efficiency as input and assigns it a value of “poor,” “fair,” “good,” or “excellent” depending on what range it falls within. Decide for yourself what these ranges should be.
- Use your function from the previous exercise to add a column to mpg that categorizes each observation’s hwy value.
- Write a function which accepts two numerical vectors of the same length as input and states, entry-by-entry, which vector has the smaller entry or states that they’re equal. For example, if the two vectors are v1 <- c(3, 4, 1, -2) and v2 <- c(2, 6, 1, 0), the output (with line breaks included) should be:
The entry of v2 is smaller.
The entry of v1 is smaller.
The entries are equal.
The entry of v1 is smaller.
5.3 Conditional Statements
A conditional statement is one which executes only when a given condition is satisfied. In R, these take the form of if...else statements:
if (<CONDITION>){
<ACTION TO BE PERFORMED WHEN CONDITION IS TRUE>
} else {
<ACTION TO BE PERFORMED WHEN CONDITION IS FALSE>
}
For example, let’s write a conditional statement that states whether the cars in mpg have, on average, better highway or city fuel efficiencies. (Of course, we already know the answer.)
hwy_avg <- mean(mpg$hwy)
cty_avg <- mean(mpg$cty)
if (cty_avg > hwy_avg){
  cat("City gas mileage is better.\nThe mean value of `cty` is ",cty_avg,".", sep = "")
} else {
  cat("Highway gas mileage is better.\nThe mean value of `hwy` is ",hwy_avg,".", sep = "")
}
## Highway gas mileage is better.
## The mean value of `hwy` is 23.44017.
Conditional statements play a very important role within functions by providing a way to check the input for problems before proceeding with the function operations. This is called a value check. A value check provides the programmer with the opportunity to specify a meaningful error message if the input doesn’t satisfy the check condition. Here’s how it works for our quad_solve function from the previous section. Recall that for quadratic equations with no real solutions, we were left with a confusing warning message and the results NaN.
quad_solve <- function(a, b, c){
  # First check whether there are actually real solutions. If not, stop and return an
  # error message.
  if (b^2 - 4*a*c < 0){
    stop("no real solutions", call. = FALSE)
  }
  # sol1 and sol2 are the values produced by the Quadratic Formula
  sol1 <- (-b + sqrt(b^2 - 4*a*c))/(2*a)
  sol2 <- (-b - sqrt(b^2 - 4*a*c))/(2*a)
  return(c(sol1, sol2))
}
quad_solve(2, -3, 4)
## Error: no real solutions
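When a function signals an error with stop, the code that calls it can decide how to respond. The following sketch repeats the quad_solve definition so that it runs on its own and then uses base R’s tryCatch to catch the error. The wrapper name try_quad_solve and the choice to return NA are assumptions made for illustration:

```r
quad_solve <- function(a, b, c){
  # Value check: stop if there are no real solutions
  if (b^2 - 4*a*c < 0){
    stop("no real solutions", call. = FALSE)
  }
  sol1 <- (-b + sqrt(b^2 - 4*a*c))/(2*a)
  sol2 <- (-b - sqrt(b^2 - 4*a*c))/(2*a)
  return(c(sol1, sol2))
}

# Hypothetical wrapper: report the problem and return NA instead of halting
try_quad_solve <- function(a, b, c){
  tryCatch(
    quad_solve(a, b, c),
    error = function(e){
      cat("Could not solve: ", conditionMessage(e), "\n", sep = "")
      NA
    }
  )
}

try_quad_solve(1, 1, -6)
try_quad_solve(2, -3, 4)
```

The first call returns the two solutions as before; the second prints the error message and returns NA, so a longer script could keep running.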
In Section 2.5, we introduced the functions ifelse and case_when to categorize continuous variables. Each one operates with a logic similar to an if...else statement: A condition is checked, the outcome of which dictates what happens next. However, the purpose of these two categorization functions is to generate a value. For example, consider the following two pieces of code:
x <- 5
y <- 6
if (x < y){
  z <- x - y
} else {
  z <- y - x
}
z
## [1] -1
and
x <- 5
y <- 6
z <- ifelse(x < y,
            x - y,
            y - x)
z
## [1] -1
The result is the same; both pieces of code assign -1 to z. However, they do it differently. The if...else statement makes the assignment to z within the body. The ifelse call itself evaluates to -1, and this value is then assigned to z. This highlights the difference between these two constructs: The body of an if...else performs whatever sequence of operations you give it, while the only job of ifelse (and case_when) is to return a value.
ifelse and case_when are thus more limited in scope than if...else, but one of their major benefits is that they are vectorized, meaning they can accept an entire vector of inputs and perform the operation individually on each entry, producing an output vector of results. This is why these two functions work within mutate, as in Section 2.5; the code we saw there applied each of these functions not just to a number, but to the entire vector arr_delay.
For example, suppose we re-do our ifelse calculation above but replace x and y with vectors. Before looking at the answer, try to predict what the value of z will be:
x <- c(5, 1, 8)
y <- c(6, 4, 6)
z <- ifelse(x < y,
            x - y,
            y - x)
z
## [1] -1 -3 -2
Because ifelse is vectorized, it individually calculates the z value for each pair of corresponding x and y values and stores the answer in the corresponding spot in the z vector.
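The same idea applies to categorizing. As a small sketch with made-up data (the vector delays and the 15-minute cutoff are assumptions for illustration), a single ifelse call labels every entry at once:

```r
# Made-up arrival delays in minutes
delays <- c(-3, 20, 0, 45)

# ifelse checks the condition entry by entry and returns a vector of labels
status <- ifelse(delays > 15, "late", "on time")
status
```

The result has one label per entry of delays, in the same order.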
Exercises
To complete the exercises below, download this R Notebook file and open it in RStudio using the Visual Editor. Then enter your answers in the space provided.
- Write a function which calculates the area of a circle when the radius is input. Include a value check that returns an error message if the input radius is negative.
- Write a function that accepts a number as input and returns its absolute value as output. (Do not use the built-in abs function.)
- Write a function which accepts an integer as input and determines whether it is even or odd. (Hint: Use the %% operation.) Include a value check that makes sure the input is an integer.
5.4 Loops
It’s very common for an algorithm to be called on to perform the same action repeatedly, possibly with a different input each time. For example, think about the process of finding the sum of the numbers 1 through 5. You start with 1 and add the next number, 2, to it. Your sum at that point is 3. Then you start with 3 and add the next number, 3, to it. Your sum is now 6. Then you start with 6 and add the next number, 4, to it, obtaining a sum of 10. Then you start with 10 and add the next number, 5, to it, arriving at a sum of 15. Since 5 was the last number, you’re done and can report an answer of 15.
This should seem repetitive to the point of being tedious; this monotonous, iterative task is exactly the kind of thing we should relegate to a computer, and the way to do so is with a for loop. The basic structure of a for loop is:
for (<INDEX> in <INDEX VECTOR>){
<BODY>
}
INDEX can have any name of your choosing. INDEX VECTOR is the vector that the index values live in as entries. BODY is a command or sequence of commands to be executed. The body will often refer to INDEX. After the body has done its job, the value of INDEX will be advanced to the next value of INDEX VECTOR, and the body will be executed again with the new INDEX value. This process will repeat until the last value of INDEX in INDEX VECTOR is reached.
The following shows how this works by enacting the summing algorithm above. A few notes:
- We have to initialize the value of Sum to 0 before we start adding numbers to it. Any for loop whose job is to perform some kind of accumulation requires that the accumulated total be initialized to a starting value before entering the loop.
- i is the index name, and the index vector 1:5 is the sequence of numbers from 1 to 5.
- During each pass through the loop, the value of Sum is updated by adding the current index to it.
Sum <- 0
for (i in 1:5){
  Sum <- Sum + i
}
Sum
## [1] 15
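The same accumulation pattern works for other operations. Here is a sketch that multiplies instead of adds, so the starting value is 1 rather than 0; the variable name Product is just for illustration. The built-in prod function provides a check:

```r
# For a running product, the starting value must be 1 (starting at 0 would give 0)
Product <- 1
for (i in 1:5){
  Product <- Product * i
}
Product

# Built-in check: the product of the numbers 1 through 5
prod(1:5)
```

Both lines should display 120.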
for loops can iterate through any vector, not just sequences of integers. Recall that we can think of a tibble as a vector whose entries are the columns. This means we can even use tibbles as our index vectors. Let’s write a loop that will print the mean of each column of diamonds whose data type is “double.” Notice that we have to use a conditional statement inside our loop to check that the column is of the right type before attempting to compute a mean:
for (col in diamonds){
  if (type_sum(col) == "dbl"){
    print(mean(col))
  }
}
## [1] 0.7979397
## [1] 61.7494
## [1] 57.45718
## [1] 5.731157
## [1] 5.734526
## [1] 3.538734
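The output above doesn’t say which mean belongs to which column. One way to include the names is to loop over the column names instead of over the columns themselves. This sketch uses the built-in iris data frame and the base R check is.numeric so that it runs without any packages; those substitutions are assumptions, not the code used above:

```r
for (col_name in names(iris)){
  # Pull out the column that goes with the current name
  col <- iris[[col_name]]
  # Only numeric columns have a meaningful mean
  if (is.numeric(col)){
    cat("The mean of ", col_name, " is ", mean(col), ".\n", sep = "")
  }
}
```

The Species column is skipped because it is a factor, not a numeric vector.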
There’s an important technical consideration that applies to loops. The loops above either produced a single value or printed a list of values. It’s often the case instead that the output of a loop is an entire vector that accumulates its entries with each successive iteration. We’ll first walk through the most obvious way to do this and then see a much faster (but less obvious) approach.
First, the obvious, slow method. Suppose we’ve flipped a coin four times and have a vector containing the results: c("H", "H", "T", "H"). We flip again and get “T.” We can update our vector as follows:
vec <- c("H", "H", "T", "H")
vec <- c(vec, "T")
vec
## [1] "H" "H" "T" "H" "T"
This is like updating Sum in the loop above. Suppose now that we want to generate a list of 25 coin flips and store the results in a vector. We can simulate a coin flip in R by randomly generating a 0 or 1, each having a 50% chance of being generated. The 0:1 argument below indicates that our generated integers are chosen from the sequence starting at 0 and ending at 1, and the 1 argument indicates that we’re generating one number from that sequence.
sample(0:1, 1)
## [1] 0
Now to get a sequence of 25 coin flips, let’s construct a for loop with 25 iterations, each one producing a random 0 or 1. Let’s say 0 is heads and 1 is tails. To store the results in a vector, we start by initializing vec2 to an empty vector and then update it during each iteration as above.
vec2 <- NULL
for (i in 1:25){
  flip <- sample(0:1, 1)
  if (flip == 0){
    vec2 <- c(vec2, "H")
  } else {
    vec2 <- c(vec2, "T")
  }
}
vec2
## [1] "H" "T" "T" "H" "T" "H" "T" "H" "T" "H" "H" "H" "T" "H" "T" "H" "T" "T" "H" "H" "H" "H" "H" "T" "T"
This seems to have taken no time at all, but we’ve actually made R do a lot of extra work. Allocating memory space for a vector takes time, and with every pass through the loop, we’re building a new, longer copy of vec2, so R ends up creating the vector from scratch 25 times. It would be much better to create a 25-entry vector at the beginning of the algorithm (with NA or 0 or whatever you want for each entry) and then update the entries during each iteration. Updating an entry takes much less time than creating an entire vector. This idea is known as pre-allocation, and it’s a very good idea any time your loop’s job is to produce an accumulated vector as output. Here’s how we can do this:
# Initialize our vector to be a 25-entry vector with NA in each entry.
vec3 <- rep(NA, 25)
for (i in 1:25){
  flip <- sample(0:1, 1)
  if (flip == 0){
    vec3[i] <- "H" # We only update the entry, not the entire vector
  } else {
    vec3[i] <- "T" # We only update the entry, not the entire vector
  }
}
vec3
## [1] "H" "T" "T" "H" "H" "H" "H" "H" "H" "H" "T" "T" "H" "T" "H" "H" "T" "H" "H" "T" "T" "H" "T" "T" "T"
The primary action performed is the assignment of an “H” or “T” to entry i in vec3 during each iteration, as opposed to the previous algorithm, which creates an entirely new copy of vec2 during each iteration.
For a loop with only 25 iterations, it’s hard to notice any speed difference, but the difference becomes quite noticeable as the number of iterations increases. Let’s run each loop again with 100,000 iterations and measure the run times. This is done by marking the system time (using Sys.time()) at the beginning and end of each loop and then subtracting them to find the time elapsed.
# This is the algorithm that does not pre-allocate the output.
start_time <- Sys.time()
vec2 <- NULL
for (i in 1:100000){
  flip <- sample(0:1, 1)
  if (flip == 0){
    vec2 <- c(vec2, "H")
  } else {
    vec2 <- c(vec2, "T")
  }
}
end_time <- Sys.time()
end_time - start_time
## Time difference of 1.035769 mins
# This is the algorithm that pre-allocates the output.
start_time <- Sys.time()
vec3 <- rep(NA, 100000)
for (i in 1:100000){
  flip <- sample(0:1, 1)
  if (flip == 0){
    vec3[i] <- "H"
  } else {
    vec3[i] <- "T"
  }
}
end_time <- Sys.time()
end_time - start_time
## Time difference of 1.515949 secs
As you can see, the increase in speed resulting from pre-allocation is dramatic.
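A more compact way to time code is base R’s system.time, which runs an expression and reports the elapsed time. The sketch below wraps both approaches in functions (the names grow_flips and prealloc_flips are hypothetical) and uses a smaller number of iterations so that it finishes quickly; actual timings will vary by machine:

```r
# Hypothetical helper: grows the output vector one entry at a time
grow_flips <- function(n){
  vec <- NULL
  for (i in 1:n){
    flip <- sample(0:1, 1)
    vec <- c(vec, if (flip == 0) "H" else "T")
  }
  vec
}

# Hypothetical helper: pre-allocates the output vector, then fills it in
prealloc_flips <- function(n){
  vec <- rep(NA, n)
  for (i in 1:n){
    flip <- sample(0:1, 1)
    vec[i] <- if (flip == 0) "H" else "T"
  }
  vec
}

system.time(flips_grow <- grow_flips(10000))
system.time(flips_pre <- prealloc_flips(10000))
```

The "elapsed" entry of each result is the wall-clock time in seconds.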
Some iterative processes don’t terminate after a predetermined number of iterations but instead when some type of stopping condition is met. For example, suppose we want to simulate an experiment that counts the number of rolls it takes to get a “doubles” (a dice roll that has the same number on each die) on a pair of six-sided dice. We could run a for loop in which we simulate a dice roll using random number generators, but what would we use for the index vector? We don’t know how many iterations we’ll need to get a doubles. Instead, we can use a while loop, the syntax for which is:
while (<CONDITION>){
<ACTION TO BE PERFORMED WHILE CONDITION IS TRUE>
}
Let’s simulate the dice-rolling scenario above and count how many rolls it takes to get our first doubles.
# First we create a logical variable called `doubles` that states whether a doubles
# has yet been rolled. We initialize it to `FALSE`:
doubles <- FALSE
# We also want to keep track of how many rolls we've done. We'll create a `counter`
# variable and initialize it to 0. We'll then increment it after each roll.
counter <- 0
# We're going to simulate a dice roll within the body of the loop, but only if we haven't
# yet gotten a doubles. We should therefore only enter the loop while the `doubles` value
# is `FALSE`. (If `doubles` is `FALSE`, then `!doubles` is `TRUE`.)
while(!doubles){
  # We roll the dice by generating a random integer from 1 to 6 for each die:
  die1 <- sample(1:6, 1)
  die2 <- sample(1:6, 1)
  # Since we just rolled the dice, we have to increment `counter`:
  counter <- counter + 1
  # Now we check whether we rolled a doubles, and if so, we change the value of `doubles`
  # to `TRUE`. This means `!doubles` would be `FALSE`, and the loop would end. If we did
  # not roll a doubles, `doubles` will remain `FALSE`, and the loop will be executed
  # again.
  if (die1 == die2){
    doubles <- TRUE
  }
}
# Once `doubles` becomes `TRUE` and we exit the loop, we check the final value of
# `counter` to see how many rolls we needed to get a doubles:
counter
## [1] 14
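Another way to write a loop with a stopping condition is base R’s repeat, which runs until it reaches a break statement. The following sketch simulates the same experiment; the call to set.seed is an addition that makes the random rolls reproducible, and the seed value is arbitrary:

```r
set.seed(1)   # arbitrary seed so that the result can be reproduced

counter <- 0
repeat {
  die1 <- sample(1:6, 1)
  die2 <- sample(1:6, 1)
  counter <- counter + 1
  # Leave the loop as soon as a doubles is rolled
  if (die1 == die2){
    break
  }
}
counter
```

With repeat, no separate logical variable is needed, since the break statement plays the role of the stopping condition.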
We could even run this experiment 1000 times and store the results in a pre-allocated output vector:
num_rolls <- 1000
results <- rep(NA, num_rolls)
for (i in 1:num_rolls){
  doubles <- FALSE
  counter <- 0
  while(!doubles){
    die1 <- sample(1:6, 1)
    die2 <- sample(1:6, 1)
    counter <- counter + 1
    if (die1 == die2){
      doubles <- TRUE
    }
  }
  results[i] <- counter
}
Now it would be fun to examine some of the statistics:
mean(results)
## [1] 5.852
min(results)
## [1] 1
max(results)
## [1] 43
On average, it took about 6 rolls to get a doubles. This makes sense – there are 36 possible rolls of the dice, 6 of which are doubles, so the probability of rolling a doubles is 1/6. We would thus expect to get a doubles about once every six rolls.
It looks like we also got a doubles on our first roll at least one time, and, amazingly, it once took 43 rolls to get a doubles!
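We can compare these simulated values with what probability theory predicts. The number of rolls needed to get the first doubles follows a geometric distribution with success probability 1/6. This sketch uses base R’s rgeom, which counts the failures before the first success, so we add 1 to count the successful roll as well; the variable names are just for illustration:

```r
p <- 1/6

# Theoretical mean number of rolls needed
1/p

# Probability that more than 42 rolls are needed (the first 42 rolls all fail)
(1 - p)^42

# Simulate 1000 experiments without writing a loop
results_geom <- rgeom(1000, p) + 1
mean(results_geom)
```

The simulated mean will differ slightly from run to run, but it should land near the theoretical value of 6.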
The collection of programming constructs we saw in this chapter is very far from comprehensive, and they were mostly presented without a concern for computational efficiency or even best coding practices. However, you can go a long way with print-to-screen commands, conditional statements, user-defined functions, and loops.
Exercises
To complete the exercises below, download this R Notebook file and open it in RStudio using the Visual Editor. Then enter your answers in the space provided.
- Write a loop that prints the first 15 perfect squares to the screen. (A perfect square is a number with an integer square root, such as 36, 81, etc.) Do this using the print function and then again using the cat function. What do you notice about the difference between the outputs?
- Write a function that accepts a positive integer n as input and returns the sum of the first n perfect squares. (For example, if n is 4, the function would return 30, which is the value of \(1^2+2^2+3^2+4^2\).)
- How likely is rolling a 9 on a pair of six-sided dice? Answer this question by simulating 1000 dice rolls, keeping track of the number of times you roll a 9. Then find the percentage of times you rolled a 9.
- Write a loop that prints the number of NAs in each column of msleep.
- Re-do the previous exercise, but instead of printing the number of NAs in each column, create an output vector that stores each number of NAs in its entries. Remember to pre-allocate space for this vector.
- A random walk is a statistical process that, in its simplest form, can be described as follows: “Start on a number line at 0. Flip a coin. If you get heads, step one unit to the right. If you get tails, step one unit to the left. Then repeat.” Use a while loop to simulate this random walk, and stop it the first time the walker reaches 5 on the number line. How many steps did it take?
- Write a function that accepts an n value as input and then performs the random walk from the previous exercise n times. Accumulate a pre-allocated vector of length n that stores the number of steps required to reach 5 in each of your n random walks. The output of your function should be a printed sentence that states the mean, maximum, and minimum values of this vector.
- In the random walk from the previous two exercises, there was a 50-50 chance of stepping left or right. Repeat the previous two exercises, but try to think of a way to make the chances of stepping right 75% and the chances of stepping left 25%. What do you predict will happen when you count the number of steps required to reach 5?
- When accumulating an output vector using a while loop, you often won’t know how long it will be, which makes pre-allocation difficult. How might this be addressed?
5.5 Project
Project Description: Baseball is very much a game of numbers, and there seem to be an endless number of statistics, called sabermetrics, which attempt to quantify a player’s performance.
In this project, you will use the Batting data set from Lahman to make an argument for the player whom you think is the best home run hitter of all time. You will do so by computing several home run sabermetrics, some of which will require customized code.
Instructions:
Using the Batting data set, create data sets that contain the following players, along with all of their batting data.
- The top 20 leaders in home runs hit in a single season.
- The top 20 leaders in total home runs hit over a career.
The players who showed up on either of the lists above should be your candidates for the best home run hitter of all time. Merge the two data sets from the previous problem.
For the merged data set from the previous problem, perform a grouped summary in which you compute each player’s career totals in home runs, at bats, and strikeouts.
Add the following sabermetric columns to the grouped data set:
- The number of at bats per home run throughout the player’s career. This measures the frequency with which the player hits home runs. (A low number is better than a high number.)
- The number of strikeouts per home run throughout the player’s career. This measures the ability of the player to hit for power while maintaining the discipline to avoid swinging too hard or swinging at bad pitches. (Again, a low number is better than a high number here.)
One more sabermetric worth adding to our data set is one that measures a player’s home run consistency over time. This will be the number of consecutive seasons a player hit, for example, 20 or more home runs. This calculation will require some programming.
- Write a function called max_consec that accepts a vector vec of numbers and a positive integer m as input. The function should return the length of the longest streak of consecutive entries in vec which are greater than or equal to m. For example, if vec is c(12, 17, 25, 32, 20, 19, 21, 23, 35, 20) and m is 20, the output should be 4 since the longest run of consecutive entries greater than or equal to 20 has length 4.
- Use your function from part (a) to add the following columns to your data set for each of your candidates: the longest consecutive run of seasons with 20 or more, 30 or more, 40 or more, 50 or more, and 60 or more home runs.
Based only on your calculations above, who, in your opinion, is the greatest home run hitter of all time? Who’s second? Who’s third?
Guidelines:
Write up your analysis in an R Markdown report, making sure to follow the usual guidelines as stated for the Chapters 1, 2, and 4 projects. Additionally:
- Carefully describe any preliminary cleaning you do to prepare the data for your analysis.
- Use only your analysis to make your opinion; avoid the temptation to bring in outside information you might have such as PED use, etc. This is certainly not a comprehensive list of considerations to be taken into account when ranking the all-time best home run hitters. No baseball knowledge is required or expected to perform this analysis.
- For your max_consec function, be sure to add comments to your code to make it more understandable.
- You can look up a player’s name from the playerID value as follows:
playerInfo("trammal01")
## playerID nameFirst nameLast
## 18533 trammal01 Alan Trammell
Grading Rubric:
- Transformation: Did you correctly perform all necessary data transformations? (30 points)
- Programming: Did you get your max_consec function to work? (30 points)
- Narrative: Is it clear what you’re trying to do in this project? Do you maintain a readable narrative throughout that guides the reader through your analysis? (20 points)
- Professionalism: Does your report look nice? Do you provide insights based on your analysis? Is your code clear and readable? Did you follow the guidelines listed above? (15 points)