# Chapter 10 Inference using simulation methods

[This “chapter” is essentially the same as 4, but with some of the notes I wrote in the document as we went.]

This section is kind of a grab-bag of computational techniques in support of statistics and inference. We’ll introduce some simulation methods: randomization and the bootstrap. We’ll use the `infer` package.

## 10.1 Reading in external data

So far, we have been working with data that came from within R packages. But, most data we are interested in doesn’t come built into R! So, we need to read the data into our environment. The R package `readr` will help with many common file formats. (Another file-reading package is called `haven`, and helps you read in specialty data formats from other statistical packages like SAS and SPSS. There’s also `readxl` for Excel sheets. If you use a particular kind of data, you can probably find a package to read it into R!)

We are going to work with two datasets, one called `Physicians.csv` and the other called `policekillings.csv`. (I don’t have provenance information on these datasets, sorry!)

Download those two .csv files. You can locate the files on my GitHub. Select “raw” to see the raw version of the file, then right-click to download it.

Once you’ve downloaded those files, place them into the same folder as your RMarkdown file, or inside a sub-folder called something like `data`.

Let’s read in `Physicians.csv` first.
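The reading code itself didn’t survive in these notes; a minimal sketch, assuming the file was saved into a `data` sub-folder:

```
library(readr)

# read the CSV into a tibble; read_csv prints a message about parsed column types
doctors <- read_csv("data/Physicians.csv")
```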

As with other tidyverse functions, `read_csv` gives us a message, telling us how it parsed the data. We can change those defaults if we want to.

Right off the bat, we notice something strange: the number of doctors is being read in as a character variable, meaning it is not numeric. We’ll come back to that.

#### 10.1.0.1 Renaming variables

The variable names in this dataset are hard to use,

`## [1] "State" "Physicians per 100,000 Population"`

In order to run something like `skim()`, we’d have to use backticks to write out the variable name.

| Name | Piped data |
|---|---|
| Number of rows | 53 |
| Number of columns | 2 |
| Column type frequency: character | 1 |
| Group variables | None |

**Variable type: character**

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Physicians per 100,000 Population | 0 | 1 | 3 | 4 | 0 | 49 | 0 |

We can manually rename that variable,
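The renaming code isn’t shown in these notes; a sketch of one way to do it, using `rename()` with backticks around the long name:

```
library(dplyr)

# give the long column a short, easy-to-type name
doctors <- doctors %>%
  rename(Physicians = `Physicians per 100,000 Population`)
doctors
```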

```
## # A tibble: 53 x 2
## State Physicians
## <chr> <chr>
## 1 Alaska 514
## 2 Alabama 331
## 3 Arkansas 321
## 4 Arizona 370
## 5 California 370
## 6 Colorado 371
## 7 Connecticut 464
## 8 D.C. 1612
## 9 Delaware 563
## 10 Florida 357
## # … with 43 more rows
```

That’s better!

Or, we could use a package to automatically fix all the names in a dataset.

```
# install.packages("janitor")
library(janitor)
library(skimr)

doctors %>%
  clean_names() %>%
  skim(physicians_per_100_000_population)
```

| Name | Piped data |
|---|---|
| Number of rows | 53 |
| Number of columns | 2 |
| Column type frequency: character | 1 |
| Group variables | None |

**Variable type: character**

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| physicians_per_100_000_population | 0 | 1 | 3 | 4 | 0 | 49 | 0 |

With either of these approaches, you would need to overwrite the original data with the renamed data in order to use it. For the sake of having a short name, let’s use the `rename()` function.

#### 10.1.0.2 Searching for strange issues in variables

But, we’ve still got a variable type issue. When we run summary statistics, R complains because there is something weird in the data. Let’s try `count()`ing up the values in the data.
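A sketch of that count, arranged in descending order of the (character) values so oddities sort to the top; this assumes the renamed `doctors` data:

```
doctors %>%
  count(Physicians) %>%          # tally each distinct value
  arrange(desc(Physicians))      # character sort puts "N/A" before the digits
```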

```
## # A tibble: 49 x 2
## Physicians n
## <chr> <int>
## 1 N/A 1
## 2 644 1
## 3 575 1
## 4 563 1
## 5 514 1
## 6 510 1
## 7 506 1
## 8 504 1
## 9 485 1
## 10 478 2
## # … with 39 more rows
```

Now we see it! There’s an “N/A” string that R doesn’t know how to deal with. We can actually tell `read_csv` to treat it as missing in our data-reading-in code, using the `na` argument.

Of course, we then have to redo our renaming.
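Putting those steps together, a sketch (the file location and the exact set of `na` strings are assumptions):

```
# re-read, treating "N/A" as missing, then redo the renaming
doctors <- read_csv("data/Physicians.csv", na = c("", "NA", "N/A")) %>%
  rename(Physicians = `Physicians per 100,000 Population`)

# Physicians now parses as numeric, so we can summarize it
doctors %>%
  summarize(mean = mean(Physicians, na.rm = TRUE))
```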

```
## # A tibble: 1 x 1
## mean
## <dbl>
## 1 429.
```

- Now, you try reading in the `policekillings.csv` dataset.

### 10.1.1 Joining data

In the last section, we talked about one-table verbs from `dplyr`. But, another task we often need to do is combine data from one table with data from another table.

The command that allows you to combine data frames is a “join”, which corresponds to the database operation called a `JOIN`. We will focus on the `left_join()` here, but there are many other types of joins.

When two database tables (or data frames) are joined, the rows of one table get matched to the rows of the other. Computers are far better suited for this task than humans, but the computer needs to be told what the criteria are for matching up the rows. The variable(s) upon which the matching is based are called a **key**. Both tables must contain the key columns, but there are a variety of ways in which the matching can be done.

Joins can be complicated, so you may want to look at a visual explanation:

Let’s say we want to join our `doctors` data and our `policekillings` data. We need a variable to “join on.” In this case, we want to match states with states. Unfortunately, `doctors` has state names written out and `policekillings` just has the state abbreviations. We need to change one to match the other.

Luckily, R comes with built-in vectors of state names (`state.name`) and state abbreviations (`state.abb`).

```
## [1] "Alabama" "Alaska" "Arizona" "Arkansas"
## [5] "California" "Colorado" "Connecticut" "Delaware"
## [9] "Florida" "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa" "Kansas"
## [17] "Kentucky" "Louisiana" "Maine" "Maryland"
## [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska" "Nevada"
## [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
## [41] "South Dakota" "Tennessee" "Texas" "Utah"
## [45] "Vermont" "Virginia" "Washington" "West Virginia"
## [49] "Wisconsin" "Wyoming"
```

```
## [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN" "IA"
## [16] "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH" "NJ"
## [31] "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VT"
## [46] "VA" "WA" "WV" "WI" "WY"
```

We can use this to make a new variable in the `policekillings` dataset.

```
library(skimr)

policekillings <- policekillings %>%
  mutate(state_long = state.name[match(state, state.abb)])
skim(policekillings)
```

| Name | policekillings |
|---|---|
| Number of rows | 47 |
| Number of columns | 3 |
| Column type frequency: character | 2 |
| Column type frequency: numeric | 1 |
| Group variables | None |

**Variable type: character**

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| state | 0 | 1.00 | 2 | 2 | 0 | 47 | 0 |
| state_long | 1 | 0.98 | 4 | 14 | 0 | 46 | 0 |

**Variable type: numeric**

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| deaths | 0 | 1 | 9.94 | 12.58 | 1 | 4 | 7 | 10.5 | 74 | ▇▁▁▁▁ |

Now, we can do some joining. After all, if we want to understand the relationship between physicians and police killings, we have to have both variables in the same data frame.

Before we start, let’s consider the dimensions of our two data frames.

`## [1] 47 3`

`## [1] 53 2`

They don’t have the same number of rows, so we should think about how large we want our data to be at the end of our join. Do we want to keep all 53 rows in `doctors` and match elements from `policekillings`, filling with NA values where there aren’t matches? Or, are we more interested in `policekillings`, so that we just want to match its 47 states with elements from `doctors`?

In databases, the default JOIN type is INNER JOIN. In this case, only the rows that are present in *both* data frames get returned. In R, I find I most often use a `left_join`, which retains *all* of the records from the first data frame, regardless of whether there is a match in the second data frame.

Let’s try a `left_join`.
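A sketch of the naive call that triggers the error below (with `policekillings` as the “left” table; the object name `joinedData` matches the skim output further down):

```
# no `by` argument, and the two tables share no column names
joinedData <- policekillings %>%
  left_join(doctors)
```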

```
## Error: `by` must be supplied when `x` and `y` have no common variables.
## ℹ Use `by = character()` to perform a cross-join.
```

This throws an error, because it is expecting there to be a variable with the same name in each data set. To get it to work, we need to specify which variables to join on.
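A sketch, assuming `doctors` still has its state column named `State` and `policekillings` has the full names in `state_long`:

```
# match state_long in the left table to State in the right table
joinedData <- policekillings %>%
  left_join(doctors, by = c("state_long" = "State"))
skim(joinedData)
```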

| Name | joinedData |
|---|---|
| Number of rows | 47 |
| Number of columns | 4 |
| Column type frequency: character | 2 |
| Column type frequency: numeric | 2 |
| Group variables | None |

**Variable type: character**

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| state | 0 | 1.00 | 2 | 2 | 0 | 47 | 0 |
| state_long | 1 | 0.98 | 4 | 14 | 0 | 46 | 0 |

**Variable type: numeric**

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| deaths | 0 | 1.00 | 9.94 | 12.58 | 1 | 4 | 7 | 10.5 | 74 | ▇▁▁▁▁ |
| Physicians | 1 | 0.98 | 404.00 | 79.10 | 269 | 342 | 394 | 460.0 | 644 | ▆▇▆▂▁ |

Notice that this dataset only has 47 observations, because there were only 47 in `policekillings` (the “left” in our `left_join`). We could have reversed the order of the datasets to get something slightly different.

```
## Error: Problem with `summarise()` input `skimmed`.
## x Names must be unique.
## x These names are duplicated:
## * "n_missing" at locations 2 and 3.
## * "complete_rate" at locations 4 and 5.
## * "character.min" at locations 6 and 7.
## * "character.max" at locations 8 and 9.
## * "character.empty" at locations 10 and 11.
## * ...
## ℹ Use argument `names_repair` to specify repair strategy.
## ℹ Input `skimmed` is `purrr::map2(...)`.
## ℹ The error occurred in group 1: skim_type = "character".
```

Now, the data has 51 observations! (We could have gotten the same result by running `right_join` with the data specified the same way as our first join.)

## 10.2 Fitting a model

While that was a good example for joining data, we shouldn’t use it for modeling, because a major condition for linear regression is violated by this dataset. (Which one?)

So, let’s read in another dataset to make some new models.

## 10.3 Testing and training data

One computational strategy for **assess**ing a model is using training and testing data. Splitting the data this way lets us determine whether our model only works well on the particular data we fit it on, or whether it generalizes to a larger set of data.

The basic idea is to break your dataset into two parts: one for training the model (running `lm()`) and one for testing the model (running `predict()`).

```
set.seed(42)
which_train <- sample(1:219, size = 131, replace = FALSE)

training <- FirstYearGPA %>%
  slice(which_train)
testing <- FirstYearGPA %>%
  slice(-which_train)
```

Notice that I had to “set the seed” to make this document reproducible. That essentially means I’m placing the beginning of the pseudo-random number generator in R at the same spot every time.

We can now run a model,
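A sketch of fitting the model on the training set; the object name `m1` matches the one used later in the shrinkage computation:

```
# fit on the training rows only
m1 <- lm(GPA ~ HSGPA + HU + White, data = training)
summary(m1)
```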

```
##
## Call:
## lm(formula = GPA ~ HSGPA + HU + White, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.83142 -0.27430 0.01072 0.26880 0.78655
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.260965 0.298795 4.220 4.61e-05 ***
## HSGPA 0.384141 0.085603 4.487 1.60e-05 ***
## HU 0.015373 0.004253 3.615 0.000432 ***
## White 0.427565 0.085591 4.995 1.90e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3707 on 127 degrees of freedom
## Multiple R-squared: 0.3723, Adjusted R-squared: 0.3575
## F-statistic: 25.11 on 3 and 127 DF, p-value: 8.028e-13
```

Notice that this model is slightly different than if we’d run the model on the full dataset,

```
## (Intercept) HSGPA HU White
## 0.93345915 0.50740401 0.01532817 0.26564358
```

We can use our trained model to make predictions on our testing data.

We might also want to have the residuals for these predictions, so we can compute those as well.
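A sketch of both steps, assuming the trained model is stored as `m1`; the column names `yhats` and `residuals` are the ones used in the code that follows:

```
# add predictions and prediction errors as new columns of the testing data
testing <- testing %>%
  mutate(yhats = predict(m1, newdata = testing),
         residuals = GPA - yhats)
```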

Look at your data object in your environment. What has happened to it?

Once we have this information, we can compute all sorts of useful stuff with it. For example, we could find the “cross-validation correlation,” the correlation between the predictions and the actual values.

```
## # A tibble: 1 x 1
## cor
## <dbl>
## 1 0.444
```

If we square this number, it’s akin to an \(R^2\) value,

```
## # A tibble: 1 x 2
## cor R2
## <dbl> <dbl>
## 1 0.444 0.197
```

Sometimes, we want to quantify the difference between the \(R^2\) for the training data and the squared cross validation correlation.

```
testing %>%
  summarize(cor = cor(GPA, yhats)) %>%
  mutate(R2 = cor^2, shrinkage = summary(m1)$r.squared - R2)
```

```
## # A tibble: 1 x 3
## cor R2 shrinkage
## <dbl> <dbl> <dbl>
## 1 0.444 0.197 0.175
```

You want smaller shrinkage values.

We can also look at the residual plots for our testing data,

```
library(ggplot2)
ggplot(testing, aes(x = yhats, y = residuals)) +
  geom_point() +
  geom_smooth(se = FALSE)
```

Those don’t look too bad to me. We could look closer to see if the mean and standard deviation of the prediction errors look good.

```
## # A tibble: 1 x 2
## `mean(residuals)` `sd(residuals)`
## <dbl> <dbl>
## 1 -0.0757 0.423
```

We’d like the mean to be close to 0, and the standard deviation to be close to the standard deviation of the error term from the fit to the training sample.

`## [1] 0.3663646`

Those look pretty good to me!

## 10.4 Inference based on distributional approximations

```
##
## Call:
## lm(formula = GPA ~ HSGPA + HU + White, data = FirstYearGPA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.09479 -0.27638 0.02287 0.25411 0.84538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.933459 0.245673 3.800 0.000189 ***
## HSGPA 0.507404 0.070197 7.228 8.42e-12 ***
## HU 0.015328 0.003667 4.180 4.24e-05 ***
## White 0.265644 0.064519 4.117 5.47e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3856 on 215 degrees of freedom
## Multiple R-squared: 0.3231, Adjusted R-squared: 0.3136
## F-statistic: 34.21 on 3 and 215 DF, p-value: < 2.2e-16
```

If you’ve taken a statistics class before, you have likely seen inference based on distributional approximations. This is inference using known probability distributions like the Normal distribution, t-distribution, F-distribution, etc.

In the case of a slope coefficient, one way to make inference would be to create a confidence interval,

“what are some other reasonable values we could have observed?”

\[ \hat{\beta}_1 \pm t^* \cdot SE_{\hat{\beta}_1} \]

“We are 95% confident that for a 1-unit increase in x, we would see an [increase/decrease] in the response of between [bottom] and [top] units.”

The other inferential task often done with distributional approximations is hypothesis testing. For regression, the hypotheses are

“is this number different from 0?”

\[ H_0: \beta_1 = 0 \\ H_A: \beta_1 \neq 0 \]

We can test the hypothesis by locating

\[ t = \frac{\hat{\beta}_1}{SE_{\hat{\beta}_1}} \]

in a t-distribution with \(df=n-k-1\). When you look at a regression summary, you see p-values based on \(t\) values,

```
##
## Call:
## lm(formula = GPA ~ HSGPA + HU + White, data = FirstYearGPA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.09479 -0.27638 0.02287 0.25411 0.84538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.933459 0.245673 3.800 0.000189 ***
## HSGPA 0.507404 0.070197 7.228 8.42e-12 ***
## HU 0.015328 0.003667 4.180 4.24e-05 ***
## White 0.265644 0.064519 4.117 5.47e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3856 on 215 degrees of freedom
## Multiple R-squared: 0.3231, Adjusted R-squared: 0.3136
## F-statistic: 34.21 on 3 and 215 DF, p-value: < 2.2e-16
```

You can generate confidence intervals based on distributional approximations using `confint()`,
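A sketch, assuming the full-data model is stored in an object (the name `m2` here is hypothetical):

```
# fit on the full dataset, then ask for 95% confidence intervals
m2 <- lm(GPA ~ HSGPA + HU + White, data = FirstYearGPA)
confint(m2)
```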

```
## 2.5 % 97.5 %
## (Intercept) 0.44922395 1.41769436
## HSGPA 0.36904250 0.64576552
## HU 0.00810044 0.02255589
## White 0.13847373 0.39281343
```

However, these distributional approximations only hold if the conditions are upheld. A more general solution is to use simulation-based inference.

## 10.5 Simulation-based inference

Recall that the inference for linear regression relies on the assumptions that the following three conditions are met:

- Linearity
- Normality of Residuals
- Equality of Variance

We know how to assess whether these conditions are met, and we have learned a few techniques for correcting them when they are not (e.g. transformations). Today, we will learn a few techniques based on simulation for making *non-parametric* inferences. Such inferences do not rely on stringent assumptions about the distribution of the residuals.

## 10.6 Randomization (Permutation) Tests

Randomization tests go with hypothesis tests: “Is this number different from 0?”

We’ll consider how we could make inference about a relationship using simulation methods. To do this, we’ll use a randomization test.

Before we begin, let’s examine the relationship between two variables from the `FirstYearGPA` dataset graphically.

```
ggplot(data = FirstYearGPA, aes(x = SATV, y = GPA)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)
```

There appears to be some linear association between these variables, but it is not particularly strong. We can quantify this relationship using the slope coefficient,
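A sketch of fitting that model (the object name `modSATV` is hypothetical):

```
# simple linear regression of GPA on verbal SAT score
modSATV <- lm(GPA ~ SATV, data = FirstYearGPA)
summary(modSATV)
```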

```
##
## Call:
## lm(formula = GPA ~ SATV, data = FirstYearGPA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.14057 -0.32756 0.03134 0.34259 1.16723
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.0684133 0.2204478 9.383 < 2e-16 ***
## SATV 0.0016986 0.0003609 4.706 4.5e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4444 on 217 degrees of freedom
## Multiple R-squared: 0.09261, Adjusted R-squared: 0.08842
## F-statistic: 22.15 on 1 and 217 DF, p-value: 4.499e-06
```

In this case the value of the slope is 0.00169, which is not large, but appears to be statistically significant. However, the validity of the hypothesis test shown in the default regression table relies on the conditions for simple linear regression being met. Let’s assume that in this case the assumptions are **not** met. Could we still feel confident that the slope is non-zero?

If `GPA` and `SATV` were really related, then there is a real relationship binding the \(i^{th}\) value of `GPA` to the \(i^{th}\) value of `SATV`. In this case, it would not make sense to link the \(i^{th}\) value of `GPA` to some other value of `SATV`. But if the relationship between these two variables were in fact zero, then it wouldn’t matter how we matched up the entries in the variables!

The basic idea of the permutation test is to shuffle the mapping between the two variables many times (i.e. sample *without replacement*), and examine the distribution of the resulting slope coefficient. If the actual value of the slope is a rare member of that distribution, then we have evidence that the null hypothesis of \(\beta_1=0\) might not be true.

- Execute the following code several times. Get a feel for how the regression line can change, depending on the permutation of the data.
- Do you still believe that the slope is non-zero?

```
ggplot(data = FirstYearGPA, aes(x = mosaic::shuffle(SATV),
                                y = GPA)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)
```

The procedure for the randomization test is simple. We simply shuffle the response variable and compute the resulting slope with the explanatory variable. But we do this many times, until we have a sense of the distribution of that slope coefficient. We then examine where the observed slope falls in that distribution.

```
library(infer)
slopetest <- FirstYearGPA %>%
  specify(response = GPA, explanatory = SATV) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5000, type = "permute") %>%
  calculate(stat = "slope")

library(ggplot2)
ggplot(data = slopetest, aes(x = stat)) +
  geom_density() +
  geom_vline(xintercept = 0.0016986, color = "red")
```

Visually, this looks pretty extreme! The observed slope (the red line) falls far outside the bulk of the permutation distribution.

Of course, we can also explicitly find where in the distribution the observed slope lies.
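With `infer`, this is `get_p_value()`; a sketch, plugging in the observed slope from the regression above:

```
# two-sided p-value: proportion of permuted slopes at least as extreme as observed
slopetest %>%
  get_p_value(obs_stat = 0.0016986, direction = "two-sided")
```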

```
## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step. See
## `?get_p_value()` for more information.
```

```
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0
```

The `direction` argument tells R whether to compute the amount of data to the right or the left of the test statistic, or in both directions. In this case, our alternative hypothesis is \(\beta_1 \neq 0\), so we want both sides of the distribution.

Finally, we can find a non-parametric 95% interval for the slope coefficient under the null distribution. The interpretation here is that if our actual slope value fell inside this interval, we would **not** consider it statistically significant.
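A sketch, using `infer`’s percentile interval on the permutation distribution:

```
# middle 95% of the permuted slopes
slopetest %>%
  get_confidence_interval(level = 0.95, type = "percentile")
```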

```
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 -0.000737 0.000746
```

- Compare this confidence interval to the one returned by `confint()` on the original model. Why are they different?
- Perform the above procedure to test the slope value to predict `GPA` using `SATM`.

```
slopetest <- FirstYearGPA %>%
  specify(response = GPA, explanatory = SATM) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5000, type = "permute") %>%
  calculate(stat = "slope")

lm(GPA ~ SATM, data = FirstYearGPA)
```

```
##
## Call:
## lm(formula = GPA ~ SATM, data = FirstYearGPA)
##
## Coefficients:
## (Intercept) SATM
## 2.333499 0.001202
```

```
ggplot(data = slopetest, aes(x = stat)) +
  geom_density() +
  geom_vline(xintercept = 0.001202, color = "red")
```

- Perform the above procedure to test the slope value to predict `GPA` using `HSGPA`.

```
##
## Call:
## lm(formula = salary ~ discipline, data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50748 -24611 -4429 19138 113516
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 108548 2227 48.751 < 2e-16 ***
## disciplineB 9480 3019 3.141 0.00181 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29960 on 395 degrees of freedom
## Multiple R-squared: 0.02436, Adjusted R-squared: 0.02189
## F-statistic: 9.863 on 1 and 395 DF, p-value: 0.001813
```

```
slopetest <- Salaries %>%
  specify(response = salary, explanatory = sex) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5000, type = "permute") %>%
  calculate(stat = "slope")
```

```
## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the order
## "Female" - "Male", or divided in the order "Female" / "Male" for ratio-based
## statistics. To specify this order yourself, supply `order = c("Female", "Male")`
## to the calculate() function.
```

```
##
## Call:
## lm(formula = salary ~ sex, data = Salaries)
##
## Coefficients:
## (Intercept) sexMale
## 101002 14088
```

```
ggplot(data = slopetest, aes(x = stat)) +
  geom_density() +
  geom_vline(xintercept = 14088, color = "red")
```

#### 10.6.0.1 The Bootstrap

A similar simulation technique is the bootstrap. However, the bootstrap is most useful for answering the question “what are some other values we could have observed?” (confidence intervals).

In bootstrapping, we repeatedly sample cases from our data set, *with replacement*, and approximate the sampling distribution of a statistic. The key idea here is that even though the solution to the regression model is deterministic, the data itself is assumed to be randomly sampled, and so all of the estimates that we make in a regression model are **random**. If the data changes, then the estimates change as well. The bootstrap gives us a non-parametric understanding of the distribution of those estimates. Once again, the advantage to this method is that we can construct meaningful confidence intervals for, say, the slope of the regression line, without having to assume that the residuals are normally distributed.

We’ll use this technique to create a confidence interval for the same slope coefficient we were studying in the randomization problem. Remember, here’s what that relationship looks like:

```
ggplot(data = FirstYearGPA, aes(x = SATV, y = GPA)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)
```

To create a bootstrap sample, we select rows from the data frame uniformly at random, but with replacement.

```
# this needs the mosaic package again!
ggplot(data = mosaic::resample(FirstYearGPA),
       aes(x = SATV, y = GPA)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = lm, se = FALSE)
```

- Note the differences between this plot and the previous one. What do you notice?
- Run the preceding code several times. What is changing? What is staying the same?

##### 10.6.0.1.1 Bootstrap distributions and confidence intervals

One advantage of the bootstrap is that it allows us to construct a sampling distribution for the slope coefficient that is not dependent upon the conditions for linear regression being met.

The original confidence intervals for our SLR model depend upon the conditions being true.

```
##
## Call:
## lm(formula = salary ~ discipline, data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50748 -24611 -4429 19138 113516
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 108548 2227 48.751 < 2e-16 ***
## disciplineB 9480 3019 3.141 0.00181 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29960 on 395 degrees of freedom
## Multiple R-squared: 0.02436, Adjusted R-squared: 0.02189
## F-statistic: 9.863 on 1 and 395 DF, p-value: 0.001813
```

```
## 2.5 % 97.5 %
## (Intercept) 104171.0 112925.87
## disciplineB 3545.7 15414.83
```

Now let’s create a bootstrap distribution for the regression coefficients.

```
# 5000 bootstrap samples; more samples give a smoother distribution
slopeboot <- FirstYearGPA %>%
  specify(response = GPA, explanatory = SATV) %>%
  generate(reps = 5000, type = "bootstrap") %>%
  calculate(stat = "slope")

ggplot(data = slopeboot, aes(x = stat)) +
  geom_histogram() +
  geom_vline(xintercept = 0.0016986, color = "red")
```

The bootstrap distribution will always be centered around the sample statistic from our real data, but shows us some other likely values for the coefficient (essentially, sampling error). One way to quantify this variability is to create a confidence interval.

There are several methods for constructing confidence intervals from the bootstrap. My preferred method does not require that the bootstrap distribution be normal, but works best when it is roughly symmetric. In this case, we simply use the percentiles of the bootstrap distribution to build confidence intervals. This method makes the most sense in most cases.
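A sketch of the percentile method, applied to the `slopeboot` distribution from above:

```
# 2.5th and 97.5th percentiles of the bootstrapped slopes
slopeboot %>%
  get_confidence_interval(level = 0.95, type = "percentile")
```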

```
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.00106 0.00230
```

“I am 95% confident that for a 1-point increase in SATV, the college GPA will increase between 0.00106 and 0.00230.” This interval does not cross 0, so we are pretty confident that there **is** a relationship between SATV and GPA.

Create a bootstrap sample of at least 2000 and construct a confidence interval using the bootstrap. How does this interval differ from the typical confidence interval?

Construct bootstrapped confidence intervals for the \(GPA \sim SATV\) and \(GPA \sim SATM\) models.

```
##
## Call:
## lm(formula = UTurn ~ PageNum, data = Cars2015)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5867 -1.4420 -0.0501 1.5068 6.7082
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.226693 0.941281 43.799 <2e-16 ***
## PageNum -0.014043 0.005944 -2.362 0.02 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.288 on 108 degrees of freedom
## Multiple R-squared: 0.04913, Adjusted R-squared: 0.04033
## F-statistic: 5.581 on 1 and 108 DF, p-value: 0.01995
```

```
slopetest <- Cars2015 %>%
  specify(response = UTurn, explanatory = PageNum) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5000, type = "permute") %>%
  calculate(stat = "slope")

ggplot(data = slopetest, aes(x = stat)) +
  geom_density() +
  geom_vline(xintercept = -0.014043, color = "red")
```

```
slopeboot <- Cars2015 %>%
  specify(response = UTurn, explanatory = PageNum) %>%
  generate(reps = 5000, type = "bootstrap") %>%
  calculate(stat = "slope")

ggplot(data = slopeboot, aes(x = stat)) +
  geom_histogram()
```

```
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 -0.0298 0.00156
```

“I’m 99% confident that as we page through Consumer Reports, the U-turn diameter for a car changes by between -0.0298 and 0.00156 feet per page.” Since this interval crosses 0, we cannot conclude that the slope differs from 0.

## 10.7 Further Reading

## 10.8 Homework

Use the data from the `nycflights13` package to answer some questions about airline flight delays.

- First, you will want to familiarize yourself with the datasets available in the package. There are multiple datasets.

You may need to make some data visualizations or summary statistics to better understand the data.

- Determine which airline had the longest delays. Since we do not have airline codes memorized, we want to see the full name of the airlines, not the codes. In order to answer this question, you will need to join two datasets together. One dataset you will need is `flights`.

Use the following code skeleton to get started.

```
library(dplyr)
library(tidyr)
library(nycflights13)

flights %>%
  drop_na(arr_delay) %>%                                  # drop flights with missing delays
  left_join(airlines, by = c("carrier" = "carrier")) %>%  # join on the carrier code
  group_by(name) %>%
  summarize(meandelay = mean(arr_delay)) %>%              # compute mean arrival delay
  arrange(desc(meandelay))                                # sort longest delays first
```

```
## # A tibble: 16 x 2
## name meandelay
## <chr> <dbl>
## 1 Frontier Airlines Inc. 21.9
## 2 AirTran Airways Corporation 20.1
## 3 ExpressJet Airlines Inc. 15.8
## 4 Mesa Airlines Inc. 15.6
## 5 SkyWest Airlines Inc. 11.9
## 6 Envoy Air 10.8
## 7 Southwest Airlines Co. 9.65
## 8 JetBlue Airways 9.46
## 9 Endeavor Air Inc. 7.38
## 10 United Air Lines Inc. 3.56
## 11 US Airways Inc. 2.13
## 12 Virgin America 1.76
## 13 Delta Air Lines Inc. 1.64
## 14 American Airlines Inc. 0.364
## 15 Hawaiian Airlines Inc. -6.92
## 16 Alaska Airlines Inc. -9.93
```

Considering only flights arriving in Baltimore (BWI), see which single variable is the best predictor for delay.

Use bootstrapping to generate a confidence interval for the slope coefficient of your model.

### References

Efron, Brad. 1979. “Bootstrap Methods: Another Look at the Jackknife.” *Ann. Statist.* 7 (1). http://projecteuclid.org/euclid.aos/1176344552.

Hesterberg, Tim. 2015. “What Teachers Should Know About the Bootstrap.” *The American Statistician* 69 (4). http://arxiv.org/abs/1411.5279.

Ismay, Chester, and Albert Y Kim. 2021. *Statistical Inference via Data Science*. Chapman; Hall/CRC. https://moderndive.com/index.html.