7 Modeling
7.1 Simple Linear Regression
Predicting Winning Percentage
For this section we will be using the Teams table.

team_pred <- Teams %>%
  filter(yearID >= 2012, yearID < 2020)
Say we want to predict winning percentage using a team’s run differential. To do this we need to create two new variables: run differential (RD) and winning percentage (WPCT). Run differential is calculated with runs scored (R) and runs allowed (RA). Calculating winning percentage requires the amount of wins and losses.
team_pred <- team_pred %>%
  mutate(RD = R - RA,
         WPCT = W / (W + L))
Now it’s time to build the model. The lm() function is used to fit linear models. It requires a formula, following the format: response ~ predictor(s). A response variable is the variable we are trying to predict. Predictor variables are used to make predictions. In this example, we will predict a team’s winning percentage (response) using its run differential (predictor).
model1 <- lm(WPCT ~ RD, data = team_pred)
model1
##
## Call:
## lm(formula = WPCT ~ RD, data = team_pred)
##
## Coefficients:
## (Intercept) RD
## 0.4999830 0.0006089
The general equation of a simple linear regression line is: \(\hat{y} = b_0 + b_1 x\). The symbols have the following meaning:
- \(\hat{y}\) = predicted value of the response variable
- \(b_0\) = estimated y-intercept for the line
- \(b_1\) = estimated slope for the line
- \(x\) = predictor value to plug in
This output provides us with an equation we can use to predict a team’s winning percentage based on their run differential:
\[\text{Predicted Win Pct} = 0.500 + 0.00061 * \text{Run Diff}\]
The estimated y-intercept for this model is 0.500. This means that we would predict a winning percentage of 0.500 for a team with a run differential of 0. In other words, teams that score and allow the same number of runs are predicted to be average. This makes a lot of sense for this example, but we will see later that y-intercept interpretations aren’t always meaningful like this.
The estimated slope for our line is 0.00061. This tells us that our predicted winning percentage increases by 0.00061 for each additional run a team scores (or each fewer run it allows). Over a 162-game season, a change in winning percentage of 0.00061 is equivalent to around \(162 * 0.00061 = 0.1\) wins. In other words, an increase of around 10 runs scored (or prevented) is associated with around 1 more win in a season.
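To put these numbers to work, we can plug a run differential into the fitted equation by hand. This is a minimal sketch using the coefficients printed above; the run differential of +100 is just a hypothetical example value.

```r
# Coefficients copied from the model output above
b0 <- 0.4999830  # estimated intercept
b1 <- 0.0006089  # estimated slope

rd <- 100  # hypothetical run differential of +100

wpct_hat <- b0 + b1 * rd    # predicted winning percentage
wins_hat <- wpct_hat * 162  # predicted wins over a 162-game season

wpct_hat  # about 0.561
wins_hat  # about 91 wins
```

With the model object itself, predict(model1, newdata = data.frame(RD = 100)) returns the same prediction without copying coefficients by hand.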
The summary() function can give us some additional information about our linear regression model.
summary(model1)
##
## Call:
## lm(formula = WPCT ~ RD, data = team_pred)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.060481 -0.016448 0.000136 0.014840 0.081565
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.000e-01 1.630e-03 306.74 <2e-16 ***
## RD 6.089e-04 1.414e-05 43.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02525 on 238 degrees of freedom
## Multiple R-squared: 0.8862, Adjusted R-squared: 0.8857
## F-statistic: 1854 on 1 and 238 DF, p-value: < 2.2e-16
Another important thing to look at is the p-value. A p-value tells us how likely results at least this extreme would be to occur if “nothing was going on”. We often compare a p-value to a pre-chosen significance level, alpha, to decide if our results would be unusual in a world where our variables weren’t related to one another. The most frequently used significance level is alpha = 0.05. Our model’s p-value is less than 2.2e-16, or 0.00000000000000022, which is much smaller than alpha. This suggests that run differential is helpful in predicting winning percentage, because results like this are incredibly unlikely to occur if the two variables are not related.
R-squared values are used to describe how well the model fits the data. In this model, the R-squared value is 0.8862. This is saying that around 88.6% of variability in team winning percentage is explained by run differential.
This should not be the exclusive method to assess model fit, but it helps give us a good idea.
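These quantities can also be pulled out of a model programmatically rather than read off the printed summary. The sketch below fits a small model on simulated data (so it runs on its own) and extracts the R-squared and slope p-value from the summary() object; the same pattern works on model1.

```r
set.seed(1)

# Simulated data standing in for the team data
x <- 1:50
y <- 0.5 + 0.01 * x + rnorm(50, sd = 0.1)

fit <- lm(y ~ x)
s <- summary(fit)

s$r.squared                      # proportion of variability explained
s$coefficients["x", "Pr(>|t|)"]  # p-value for the slope
```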
To visualize this data we can make a scatter plot and fit a line using geom_smooth(). If no method is specified, geom_smooth() will pick a smoothing method automatically based on the data. Since we fit a linear model above, the method should be “lm”.
team_pred %>%
  ggplot(aes(x = RD, y = WPCT)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  theme_bw() +
  theme(text = element_text(family = "serif"))
7.2 Multiple Linear Regression
Linear Weights
We are going to predict runs using singles, doubles, triples, home runs, walks, hit-by-pitches, strikeouts, non-strikeouts, stolen bases, caught stealing, and sacrifice flies.
First, we need to create a variable for singles (‘X1B’) and a variable for non-strikeouts (‘nonSO’).
team_pred <- team_pred %>%
  mutate(X1B = H - X2B - X3B - HR,
         nonSO = AB - H - SO)
model2 <- lm(R ~ X1B + X2B + X3B + HR + BB + HBP + SO + nonSO + SB + CS + SF,
             data = team_pred)
summary(model2)
##
## Call:
## lm(formula = R ~ X1B + X2B + X3B + HR + BB + HBP + SO + nonSO +
## SB + CS + SF, data = team_pred)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.81 -14.37 0.89 13.04 69.67
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -81.21901 148.24516 -0.548 0.58432
## X1B 0.39787 0.03233 12.307 < 2e-16 ***
## X2B 0.85061 0.06261 13.586 < 2e-16 ***
## X3B 1.03933 0.17326 5.999 7.74e-09 ***
## HR 1.46736 0.04932 29.753 < 2e-16 ***
## BB 0.27187 0.02782 9.772 < 2e-16 ***
## HBP 0.42777 0.10061 4.252 3.09e-05 ***
## SO -0.07631 0.03220 -2.370 0.01864 *
## nonSO -0.06349 0.03324 -1.910 0.05740 .
## SB 0.19421 0.06119 3.174 0.00171 **
## CS -0.48760 0.21522 -2.266 0.02441 *
## SF 0.54241 0.20892 2.596 0.01004 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.41 on 228 degrees of freedom
## Multiple R-squared: 0.9264, Adjusted R-squared: 0.9228
## F-statistic: 260.7 on 11 and 228 DF, p-value: < 2.2e-16
Most of the p-values from this model have stars next to them. Per the significance codes, * means significant at alpha = 0.05, ** at alpha = 0.01, and *** at alpha = 0.001. The intercept’s p-value is the only one without stars. A p-value of 0.584 suggests that we may not need the intercept. This makes sense, since a team that does none of these things (the predictor variables) would score 0 runs.
Now let’s try fitting a model with no intercept. We can accomplish this by adding “- 1” at the end of the model formula.
model2 <- lm(R ~ X1B + X2B + X3B + HR + BB + HBP + SO + nonSO + SB + CS + SF - 1,
             data = team_pred)
summary(model2)
##
## Call:
## lm(formula = R ~ X1B + X2B + X3B + HR + BB + HBP + SO + nonSO +
## SB + CS + SF - 1, data = team_pred)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.01 -13.86 0.80 13.12 68.72
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## X1B 0.39067 0.02949 13.249 < 2e-16 ***
## X2B 0.85129 0.06250 13.620 < 2e-16 ***
## X3B 1.03715 0.17295 5.997 7.78e-09 ***
## HR 1.46017 0.04747 30.763 < 2e-16 ***
## BB 0.26895 0.02726 9.865 < 2e-16 ***
## HBP 0.42674 0.10043 4.249 3.13e-05 ***
## SO -0.09294 0.01073 -8.662 8.47e-16 ***
## nonSO -0.08079 0.01037 -7.792 2.28e-13 ***
## SB 0.19451 0.06109 3.184 0.00166 **
## CS -0.51912 0.20707 -2.507 0.01287 *
## SF 0.53241 0.20780 2.562 0.01105 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.38 on 229 degrees of freedom
## Multiple R-squared: 0.9992, Adjusted R-squared: 0.9991
## F-statistic: 2.453e+04 on 11 and 229 DF, p-value: < 2.2e-16
This model’s p-value is less than 2.2e-16, which is incredibly small. The R-squared value is 0.9992, meaning that 99.9% of variability in runs is explained by these 11 variables. (Keep in mind that R-squared is calculated differently for no-intercept models, so it isn’t directly comparable to the R-squared of the model with an intercept.)
Here is a table of each variable in the multiple linear regression model and its coefficient:
kable(model2$coefficients, col.names = "Coefficient") %>%
kable_styling("striped", full_width = FALSE)
| | Coefficient |
| --- | --- |
| X1B | 0.3906674 |
| X2B | 0.8512873 |
| X3B | 1.0371486 |
| HR | 1.4601679 |
| BB | 0.2689488 |
| HBP | 0.4267400 |
| SO | -0.0929440 |
| nonSO | -0.0807918 |
| SB | 0.1945057 |
| CS | -0.5191153 |
| SF | 0.5324062 |
The coefficient for stolen bases is 0.1945. This means that a team is predicted to score an additional 0.1945 runs for every base they steal, given that all other variables remain constant. Similarly, the coefficient of -0.5191 for caught stealing says that predicted runs decrease by 0.5191 every time a team is caught stealing, again holding all other variables constant.
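One way to use these linear weights is to price out a season’s worth of events by hand. The sketch below multiplies the weights from the table above by a hypothetical team stat line (the season totals are made-up round numbers, not real data) and sums the result to get predicted runs.

```r
# Linear weights copied from the coefficient table above
weights <- c(X1B = 0.3906674, X2B = 0.8512873, X3B = 1.0371486,
             HR = 1.4601679, BB = 0.2689488, HBP = 0.4267400,
             SO = -0.0929440, nonSO = -0.0807918, SB = 0.1945057,
             CS = -0.5191153, SF = 0.5324062)

# Hypothetical team season totals (made-up values, same order as the weights)
stats <- c(X1B = 900, X2B = 280, X3B = 25, HR = 200, BB = 500, HBP = 60,
           SO = 1300, nonSO = 2700, SB = 90, CS = 35, SF = 45)

pred_runs <- sum(weights * stats)  # element-wise product, then total
pred_runs  # about 752 runs
```

This is exactly what predict(model2) does internally for each row of the data, since the model has no intercept.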
To see how good this model is we will plot the predicted runs against the actual runs. The line represents a perfect model.
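A sketch of that plot, assuming model2, team_pred, and the tidyverse setup from earlier in the chapter: the fitted() values are the model’s predicted runs, and geom_abline() draws the y = x line a perfect model would follow.

```r
library(ggplot2)

team_pred %>%
  mutate(pred_R = fitted(model2)) %>%
  ggplot(aes(x = pred_R, y = R)) +
  geom_point(alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0) +  # perfect-model line: predicted = actual
  labs(x = "Predicted Runs", y = "Actual Runs") +
  theme_bw()
```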
The points stay close to the line, which indicates that the model fits the data very well. This supports the same conclusion as the R-squared value.