1 Introduction

This book is an introduction to coding for baseball analytics. It will be most helpful if you already have a basic understanding of R. Here are some resources:

The first three chapters are about acquiring data.

Lahman: the Lahman package contains several data frames about players, teams, pitching, hitting, fielding, and more. The version of the Lahman package used throughout this guide has data from 1871 to 2021.

FanGraphs: this chapter is about data collected from the FanGraphs website or through an R package. While this book was being written, FanGraphs restricted data export on its website to members only, but the data can still be accessed through R.

Statcast: every MLB park is equipped with Statcast tracking technology, which makes play-by-play data available. Because Statcast is fairly new, the data goes back only to 2015.
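To make the Lahman description above concrete, here is a minimal sketch. It assumes the Lahman package is already installed. Batting is one of the data frames that ships with the package and yearID is its season column. The range of seasons you see depends on the package version you have installed, so it may not match the version used in this guide.

```r
# Assumes the Lahman package has already been installed.
library(Lahman)

# Batting is one of the data frames included in the package.
head(Batting)

# First and last seasons covered by the installed version.
range(Batting$yearID)
```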

Following those is a chapter on Restructuring Data. It covers pivoting data and all types of joins.

Next is a chapter on Visualizations. It covers strike zone plots, pitch release plots, spray charts, contour plots, and heat maps. It also shows how to write your own functions and how to use case_when().

The last chapter is about Modeling. There are examples of Simple Linear Regression and Multiple Linear Regression.

Preparing RStudio:

Before running any of this code yourself, you will need to install and load the tidyverse package. For a description of what a package is, check out this link. To install a package, call install.packages() with the name of the package inside quotes. To load it, use the library() function (the package name does not need to be in quotes here). A package only needs to be installed once, but it must be loaded in every new R session.
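The install-then-load steps for tidyverse can be sketched as follows. The requireNamespace() check is an addition for convenience, not something the book requires: it skips the installation when the package is already present.

```r
# Install tidyverse only if it is not already installed
# (this guard is optional; the package name goes in quotes).
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Load the package for the current session (quotes are optional here).
library(tidyverse)
```

The same pattern works for any other package: swap in its name in both calls.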

Follow this same process to install and load the knitr and kableExtra packages. The FanGraphs chapter also requires the stringr package (which is loaded automatically as part of the tidyverse). Any other necessary packages will be specified in the chapter that uses them.

Other Notes:

At the end of the book is an appendix containing the code for all of the graphs throughout the guide.