Chapter 10 Inference using simulation methods
[This “chapter” is essentially the same as Chapter 4, but with some of the notes I wrote in the document as we went.]
This section is kind of a grab-bag of computational techniques in support of statistics and inference. We’ll introduce some simulation methods: randomization and the bootstrap. We’ll use the infer package.
10.1 Reading in external data
So far, we have been working with data that came from within R packages. But most data we are interested in doesn’t come built into R! So, we need to read the data into our environment. The R package readr will help with many common file formats. (Another file-reading package is called haven, and helps you read in specialty data formats from other statistical packages like SAS and SPSS. There’s also readxl for Excel sheets. If you use a particular kind of data, you can probably find a package to read it into R!)
We are going to work with two datasets, one called Physicians.csv and the other called policekillings.csv. (I don’t have provenance information on these datasets, sorry!)
Download those two .csv files.
You can locate the files on my GitHub. Select “raw” to see the raw version of the file.
Then, right-click to download the file.
Once you’ve downloaded those files, place them into the same folder as your RMarkdown file, or inside a sub-folder called something like data.
Let’s read in Physicians.csv first.
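A minimal call might look like this (a sketch: I’m assuming you saved the file in a data sub-folder, and naming the result doctors because that name is used below):

library(readr)
doctors <- read_csv("data/Physicians.csv")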
As with other tidyverse functions, read_csv gives us a message, telling us how it parsed the data. We can change those defaults if we want to.
Right off the bat, we notice something strange: the number of doctors is being read in as a character variable, meaning it is not numeric. We’ll come back to that.
10.1.0.1 Renaming variables
The variable names in this dataset are hard to use,
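Printing them with base R’s names() shows why:

names(doctors)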
## [1] "State" "Physicians per 100,000 Population"
In order to run something like skim(), we’d have to use backticks to write out the variable name.
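For example (assuming the skimr package is loaded):

library(skimr)
doctors %>%
  skim(`Physicians per 100,000 Population`)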
Name | Piped data
---|---
Number of rows | 53
Number of columns | 2
Column type frequency: character | 1
Group variables | None

Variable type: character

skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace
---|---|---|---|---|---|---|---
Physicians per 100,000 Population | 0 | 1 | 3 | 4 | 0 | 49 | 0
We can manually rename that variable,
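A sketch using dplyr’s rename() (the backticks are needed because of the spaces in the original name):

library(dplyr)
doctors %>%
  rename(Physicians = `Physicians per 100,000 Population`)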
## # A tibble: 53 x 2
## State Physicians
## <chr> <chr>
## 1 Alaska 514
## 2 Alabama 331
## 3 Arkansas 321
## 4 Arizona 370
## 5 California 370
## 6 Colorado 371
## 7 Connecticut 464
## 8 D.C. 1612
## 9 Delaware 563
## 10 Florida 357
## # … with 43 more rows
That’s better!
Or, we could use a package to automatically fix all the names in a dataset.
# install.packages("janitor")
library(janitor)
doctors %>%
  clean_names() %>%
  skim(physicians_per_100_000_population)
Name | Piped data
---|---
Number of rows | 53
Number of columns | 2
Column type frequency: character | 1
Group variables | None

Variable type: character

skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace
---|---|---|---|---|---|---|---
physicians_per_100_000_population | 0 | 1 | 3 | 4 | 0 | 49 | 0
With either of these approaches, you would need to overwrite the original data with the renamed data in order to use it. For the sake of having a short name, let’s use the rename() function.
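That overwrite might look like:

doctors <- doctors %>%
  rename(Physicians = `Physicians per 100,000 Population`)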
10.1.0.2 Searching for strange issues in variables
But we’ve still got a variable type issue. When we run summary statistics, R complains because there is something weird in the data. Let’s try count()ing up the values in the data.
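A sketch (arranging in descending order so any odd character values sort to the top):

doctors %>%
  count(Physicians) %>%
  arrange(desc(Physicians))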
## # A tibble: 49 x 2
## Physicians n
## <chr> <int>
## 1 N/A 1
## 2 644 1
## 3 575 1
## 4 563 1
## 5 514 1
## 6 510 1
## 7 506 1
## 8 504 1
## 9 485 1
## 10 478 2
## # … with 39 more rows
Now we see it! There’s a strange “N/A” value that R doesn’t know how to deal with. We can actually add this to our data-reading-in code in read_csv.
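read_csv’s na argument lets us list the strings to treat as missing (the path is again an assumption):

doctors <- read_csv("data/Physicians.csv", na = c("", "NA", "N/A"))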
Of course, we then have to redo our renaming.
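After renaming, a numeric summary works (a sketch; na.rm = TRUE skips the remaining missing value):

doctors <- doctors %>%
  rename(Physicians = `Physicians per 100,000 Population`)
doctors %>%
  summarize(mean = mean(Physicians, na.rm = TRUE))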
## # A tibble: 1 x 1
## mean
## <dbl>
## 1 429.
- Now, you try reading in the policekillings.csv dataset.
10.1.1 Joining data
In the last section, we talked about one-table verbs from dplyr. But another task we often need to do is combine data from one table with data from another table.
The command that allows you to combine data frames is a “join”, which corresponds to the database operation called a JOIN. We will focus on the left_join() here, but there are many other types of joins.
When two database tables (or data frames) are joined, the rows of one table get matched to the rows of the other. Computers are far better suited to this task than humans, but the computer needs to be told what the criteria are for matching up the rows. The variable(s) upon which the matching is based are called a key. Both tables must contain the key columns, and there are a variety of ways in which the matching can be done.
Joins can be complicated, so you may want to look at a visual explanation.
Let’s say we want to join our doctors data and our policekillings data. We need a variable to “join on.” In this case, we want to match states with states. Unfortunately, doctors has state names written out and policekillings just has the state abbreviations. We need to change one to match the other.
Luckily, R comes with a dataset of state names and abbreviations.
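These live in the built-in base R vectors state.name and state.abb:

state.name
state.abb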
## [1] "Alabama" "Alaska" "Arizona" "Arkansas"
## [5] "California" "Colorado" "Connecticut" "Delaware"
## [9] "Florida" "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa" "Kansas"
## [17] "Kentucky" "Louisiana" "Maine" "Maryland"
## [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska" "Nevada"
## [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
## [41] "South Dakota" "Tennessee" "Texas" "Utah"
## [45] "Vermont" "Virginia" "Washington" "West Virginia"
## [49] "Wisconsin" "Wyoming"
## [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN" "IA"
## [16] "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH" "NJ"
## [31] "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VT"
## [46] "VA" "WA" "WV" "WI" "WY"
We can use this to make a new variable in the policekillings dataset.
policekillings <- policekillings %>%
  mutate(state_long = state.name[match(state, state.abb)])
skim(policekillings)
Name | policekillings
---|---
Number of rows | 47
Number of columns | 3
Column type frequency: character | 2
Column type frequency: numeric | 1
Group variables | None

Variable type: character

skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace
---|---|---|---|---|---|---|---
state | 0 | 1.00 | 2 | 2 | 0 | 47 | 0
state_long | 1 | 0.98 | 4 | 14 | 0 | 46 | 0

Variable type: numeric

skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist
---|---|---|---|---|---|---|---|---|---|---
deaths | 0 | 1 | 9.94 | 12.58 | 1 | 4 | 7 | 10.5 | 74 | ▇▁▁▁▁
Now, we can do some joining. After all, if we want to understand the relationship between physicians and police killings, we have to have both variables in the same data frame.
Before we start, let’s consider the dimensions of our two data frames.
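Base R’s dim() gives rows and columns:

dim(policekillings)
dim(doctors)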
## [1] 47 3
## [1] 53 2
They don’t have the same number of rows, so we should think about how large we want our data to be at the end of our join. Do we want to keep all 53 rows in doctors and match elements from policekillings, filling with NA values where there aren’t matches? Or are we more interested in policekillings, and want to just match the 47 states with elements from doctors?
In databases, the default JOIN type is an INNER JOIN, in which only the rows that are present in both data frames get returned. In R, I find I most often use a left_join, which retains all of the records from the first data frame, regardless of whether there is a match in the second data frame.
Let’s try a left_join.
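A naive first attempt (a sketch):

left_join(policekillings, doctors)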
## Error: `by` must be supplied when `x` and `y` have no common variables.
## ℹ use `by = character()` to perform a cross-join.
This throws an error, because it is expecting there to be a variable with the same name in each data set. To get it to work, we need to specify which variables to join on.
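For example (the by argument maps state_long in policekillings onto State in doctors; joinedData matches the name in the output below):

joinedData <- policekillings %>%
  left_join(doctors, by = c("state_long" = "State"))
skim(joinedData)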
Name | joinedData
---|---
Number of rows | 47
Number of columns | 4
Column type frequency: character | 2
Column type frequency: numeric | 2
Group variables | None

Variable type: character

skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace
---|---|---|---|---|---|---|---
state | 0 | 1.00 | 2 | 2 | 0 | 47 | 0
state_long | 1 | 0.98 | 4 | 14 | 0 | 46 | 0

Variable type: numeric

skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist
---|---|---|---|---|---|---|---|---|---|---
deaths | 0 | 1.00 | 9.94 | 12.58 | 1 | 4 | 7 | 10.5 | 74 | ▇▁▁▁▁
Physicians | 1 | 0.98 | 404.00 | 79.10 | 269 | 342 | 394 | 460.0 | 644 | ▆▇▆▂▁
Notice that this dataset only has 47 observations, because there were only 47 in policekillings (the “left” in our left_join). We could have reversed the order of the datasets to get something slightly different.
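Something like this (a sketch, with the roles of the two data frames reversed):

doctors %>%
  left_join(policekillings, by = c("State" = "state_long")) %>%
  skim()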
## Error: Problem with `summarise()` input `skimmed`.
## x Names must be unique.
## x These names are duplicated:
## * "n_missing" at locations 2 and 3.
## * "complete_rate" at locations 4 and 5.
## * "character.min" at locations 6 and 7.
## * "character.max" at locations 8 and 9.
## * "character.empty" at locations 10 and 11.
## * ...
## ℹ Use argument `names_repair` to specify repair strategy.
## ℹ Input `skimmed` is `purrr::map2(...)`.
## ℹ The error occurred in group 1: skim_type = "character".
Now, the data has 51 observations! (We could have gotten the same result by running right_join with the data specified the same way as our first join.)
10.2 Fitting a model
While that was a good example for joining data, we shouldn’t use it for modeling because a major condition for linear regression is violated with the data set. (Which one?)
So, let’s read in another dataset to make some new models.
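One possibility (an assumption on my part: the FirstYearGPA data used below ships with the Stat2Data package, but you could instead read in a CSV with read_csv):

# install.packages("Stat2Data")
library(Stat2Data)
data(FirstYearGPA)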
10.3 Testing and training data
One computational strategy for assessing a model is using training and testing data. This lets us determine whether our model only looks good because it works well on the particular data we fit it on, or whether it generalizes to a larger set of data.
The basic idea is to break your dataset into two parts: one for training the model (running lm()) and one for testing the model (running predict()).
set.seed(42)
which_train <- sample(1:219, size = 131, replace = FALSE)
training <- FirstYearGPA %>%
  slice(which_train)
testing <- FirstYearGPA %>%
  slice(-which_train)
Notice that I had to “set the seed” to make this document reproducible. That essentially means I’m starting R’s pseudo-random number generator at the same spot every time.
We can now run a model,
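The model is saved as m1 here, since that name is reused below:

m1 <- lm(GPA ~ HSGPA + HU + White, data = training)
summary(m1)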
##
## Call:
## lm(formula = GPA ~ HSGPA + HU + White, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.83142 -0.27430 0.01072 0.26880 0.78655
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.260965 0.298795 4.220 4.61e-05 ***
## HSGPA 0.384141 0.085603 4.487 1.60e-05 ***
## HU 0.015373 0.004253 3.615 0.000432 ***
## White 0.427565 0.085591 4.995 1.90e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3707 on 127 degrees of freedom
## Multiple R-squared: 0.3723, Adjusted R-squared: 0.3575
## F-statistic: 25.11 on 3 and 127 DF, p-value: 8.028e-13
Notice that this model is slightly different from the one we’d get by running the model on the full dataset,
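where coef() pulls out just the coefficient estimates:

coef(lm(GPA ~ HSGPA + HU + White, data = FirstYearGPA))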
## (Intercept) HSGPA HU White
## 0.93345915 0.50740401 0.01532817 0.26564358
We can use our trained model to make predictions on our testing data.
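A sketch using predict(), storing the predictions in a new column called yhats (that name is used below):

testing <- testing %>%
  mutate(yhats = predict(m1, newdata = testing))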
We might also want to have the residuals for these predictions, so we can compute those as well.
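Again a sketch (observed GPA minus predicted):

testing <- testing %>%
  mutate(residuals = GPA - yhats)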
Look at your data object in your environment. What has happened to it?
Once we have this information, we can compute all sorts of useful stuff with it. For example, we could find the “cross-validation correlation,” the correlation between the predictions and the actual values.
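In code, with the predictions and actual values both living in testing:

testing %>%
  summarize(cor = cor(GPA, yhats))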
## # A tibble: 1 x 1
## cor
## <dbl>
## 1 0.444
If we square this number, it’s akin to an \(R^2\) value,
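adding the square with mutate():

testing %>%
  summarize(cor = cor(GPA, yhats)) %>%
  mutate(R2 = cor^2)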
## # A tibble: 1 x 2
## cor R2
## <dbl> <dbl>
## 1 0.444 0.197
Sometimes, we want to quantify the difference between the \(R^2\) for the training data and the squared cross-validation correlation; this difference is known as shrinkage.
testing %>%
  summarize(cor = cor(GPA, yhats)) %>%
  mutate(R2 = cor^2, shrinkage = summary(m1)$r.squared - R2)
## # A tibble: 1 x 3
## cor R2 shrinkage
## <dbl> <dbl> <dbl>
## 1 0.444 0.197 0.175
You want smaller shrinkage values.
We can also look at the residual plots for our testing data,
library(ggplot2)
ggplot(testing, aes(x = yhats, y = residuals)) +
  geom_point() +
  geom_smooth(se = FALSE)
Those don’t look too bad to me. We could look closer to see if the mean and standard deviation of the prediction errors look good.
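In code:

testing %>%
  summarize(mean(residuals), sd(residuals))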
## # A tibble: 1 x 2
## `mean(residuals)` `sd(residuals)`
## <dbl> <dbl>
## 1 -0.0757 0.423
We’d like the mean to be close to 0, and the standard deviation to be close to the standard deviation of the error term from the fit to the training sample.
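One way to pull that out (a sketch: this is the standard deviation of the training residuals, which is close to, though not identical to, the residual standard error summary(m1)$sigma):

sd(residuals(m1))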
## [1] 0.3663646
Those look pretty good to me!
10.4 Inference based on distributional approximations
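Throughout this section we’ll use a model fit to the full FirstYearGPA dataset; I’ll call it m2 (the name is my choice):

m2 <- lm(GPA ~ HSGPA + HU + White, data = FirstYearGPA)
summary(m2)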
##
## Call:
## lm(formula = GPA ~ HSGPA + HU + White, data = FirstYearGPA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.09479 -0.27638 0.02287 0.25411 0.84538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.933459 0.245673 3.800 0.000189 ***
## HSGPA 0.507404 0.070197 7.228 8.42e-12 ***
## HU 0.015328 0.003667 4.180 4.24e-05 ***
## White 0.265644 0.064519 4.117 5.47e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3856 on 215 degrees of freedom
## Multiple R-squared: 0.3231, Adjusted R-squared: 0.3136
## F-statistic: 34.21 on 3 and 215 DF, p-value: < 2.2e-16
If you’ve taken a statistics class before, you have likely seen inference based on distributional approximations. This is inference using known probability distributions like the Normal distribution, t-distribution, F-distribution, etc.
In the case of a slope coefficient, one way to make inference would be to create a confidence interval, which answers the question “what are some other reasonable values we could have observed?”
\[ \hat{\beta}_1 \pm t^* \cdot SE_{\hat{\beta}_1} \]
“We are 95% confident that for a 1-unit increase in x, we would see an [increase/decrease] in the response of between [bottom] and [top] units.”
The other inferential task often done with distributional approximations is hypothesis testing, which asks “is this number different from 0?” For regression, the hypotheses are
\[ H_0: \beta_1 = 0 \\ H_A: \beta_1 \neq 0 \]
We can test the hypothesis by locating
\[ t = \frac{\hat{\beta}_1}{SE_{\hat{\beta}_1}} \]
in a t-distribution with \(df=n-k-1\). When you look at a regression summary, you see p-values based on \(t\) values,
##
## Call:
## lm(formula = GPA ~ HSGPA + HU + White, data = FirstYearGPA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.09479 -0.27638 0.02287 0.25411 0.84538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.933459 0.245673 3.800 0.000189 ***
## HSGPA 0.507404 0.070197 7.228 8.42e-12 ***
## HU 0.015328 0.003667 4.180 4.24e-05 ***
## White 0.265644 0.064519 4.117 5.47e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3856 on 215 degrees of freedom
## Multiple R-squared: 0.3231, Adjusted R-squared: 0.3136
## F-statistic: 34.21 on 3 and 215 DF, p-value: < 2.2e-16
You can generate confidence intervals based on distributional approximations using confint(),
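here applied to the m2 fit from above:

confint(m2)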
## 2.5 % 97.5 %
## (Intercept) 0.44922395 1.41769436
## HSGPA 0.36904250 0.64576552
## HU 0.00810044 0.02255589
## White 0.13847373 0.39281343
However, these distributional approximations only hold if the conditions for regression are met. A more general solution is to use simulation-based inference.
10.5 Simulation-based inference
Recall that the inference for linear regression relies on the assumptions that the following three conditions are met:
- Linearity
- Normality of Residuals
- Equality of Variance
We know how to assess whether these conditions are met, and we have learned a few techniques for correcting them when they are not (e.g. transformations). Today, we will learn a few techniques based on simulation for making non-parametric inferences. Such inferences do not rely on stringent assumptions about the distribution of the residuals.
10.6 Randomization (Permutation) Tests
Goes with hypothesis tests: “Is this number different from 0?”
We’ll consider how we could make inference about a relationship using simulation methods. To do this, we’ll use a randomization test.
Before we begin, let’s examine the relationship between two variables from the FirstYearGPA dataset graphically.
ggplot(data = FirstYearGPA, aes(x = SATV, y = GPA)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)
There appears to be some linear association between these variables, but it is not particularly strong. We can quantify this relationship using the slope coefficient,
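In code:

summary(lm(GPA ~ SATV, data = FirstYearGPA))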
##
## Call:
## lm(formula = GPA ~ SATV, data = FirstYearGPA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.14057 -0.32756 0.03134 0.34259 1.16723
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.0684133 0.2204478 9.383 < 2e-16 ***
## SATV 0.0016986 0.0003609 4.706 4.5e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4444 on 217 degrees of freedom
## Multiple R-squared: 0.09261, Adjusted R-squared: 0.08842
## F-statistic: 22.15 on 1 and 217 DF, p-value: 4.499e-06
In this case the value of the slope is 0.00169, which is not large, but appears to be statistically significant. However, the validity of the hypothesis test shown in the default regression table relies on the conditions for simple linear regression being met. Let’s assume that in this case the assumptions are not met. Could we still feel confident that the slope is non-zero?
If GPA and SATV were really related, then there is a real relationship binding the \(i^{th}\) value of GPA to the \(i^{th}\) value of SATV. In this case it would not make sense to link the \(i^{th}\) value of GPA to some other value of SATV. But if the relationship between these two variables was in fact zero, then it wouldn’t matter how we matched up the entries in the variables!
The basic idea of the permutation test is to shuffle the mapping between the two variables many times (i.e. sample without replacement), and examine the distribution of the resulting slope coefficient. If the actual value of the slope is a rare member of that distribution, then we have evidence that the null hypothesis of \(\beta_1=0\) might not be true.
- Execute the following code several times. Get a feel for how the regression line can change, depending on the permutation of the data.
- Do you still believe that the slope is non-zero?
ggplot(data = FirstYearGPA, aes(x = mosaic::shuffle(SATV), y = GPA)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)
The procedure for the randomization test is simple: shuffle the response variable, compute the resulting slope with the explanatory variable, and repeat many times, until we have a sense of the distribution of that slope coefficient. We then examine where the observed slope falls in that distribution.
library(infer)
slopetest <- FirstYearGPA %>%
  specify(response = GPA, explanatory = SATV) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5000, type = "permute") %>%
  calculate(stat = "slope")
library(ggplot2)
ggplot(data = slopetest, aes(x = stat)) +
  geom_density() +
  geom_vline(xintercept = 0.0016986, color = "red")
Visually, the observed slope (the red line) falls far outside the permutation distribution!
Of course, we can also explicitly find where in the distribution the observed slope lies.
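infer’s get_p_value() does this; we pass the observed slope as obs_stat:

slopetest %>%
  get_p_value(obs_stat = 0.0016986, direction = "two-sided")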
## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step. See
## `?get_p_value()` for more information.
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0
The direction argument tells R whether to compute the amount of data to the right or the left of the test statistic, or in both directions. In this case, our alternative hypothesis is \(\beta_1 \neq 0\), so we want both sides of the distribution.
Finally, we can find a non-parametric 95% interval from this permutation distribution of the slope. The interpretation here is that if our actual slope value fell inside this interval, we would not consider it statistically significant.
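A sketch using infer’s get_confidence_interval():

slopetest %>%
  get_confidence_interval(level = 0.95)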
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 -0.000737 0.000746
- Compare this interval to the one returned by confint() on the original model. Why are they different?
- Perform the above procedure to test the slope value to predict GPA using SATM.
slopetest <- FirstYearGPA %>%
  specify(response = GPA, explanatory = SATM) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5000, type = "permute") %>%
  calculate(stat = "slope")
lm(GPA ~ SATM, data = FirstYearGPA)
##
## Call:
## lm(formula = GPA ~ SATM, data = FirstYearGPA)
##
## Coefficients:
## (Intercept) SATM
## 2.333499 0.001202
ggplot(data = slopetest, aes(x = stat)) +
  geom_density() +
  geom_vline(xintercept = 0.001202, color = "red")
- Perform the above procedure to test the slope value to predict GPA using HSGPA.
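As another worked example, here is a model using the Salaries data (an assumption on my part: this dataset comes with the carData package):

library(carData)
summary(lm(salary ~ discipline, data = Salaries))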
##
## Call:
## lm(formula = salary ~ discipline, data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50748 -24611 -4429 19138 113516
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 108548 2227 48.751 < 2e-16 ***
## disciplineB 9480 3019 3.141 0.00181 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29960 on 395 degrees of freedom
## Multiple R-squared: 0.02436, Adjusted R-squared: 0.02189
## F-statistic: 9.863 on 1 and 395 DF, p-value: 0.001813
slopetest <- Salaries %>%
  specify(response = salary, explanatory = sex) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5000, type = "permute") %>%
  calculate(stat = "slope")
## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the order
## "Female" - "Male", or divided in the order "Female" / "Male" for ratio-based
## statistics. To specify this order yourself, supply `order = c("Female", "Male")`
## to the calculate() function.
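The observed slope comes from fitting the corresponding linear model:

lm(salary ~ sex, data = Salaries)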
##
## Call:
## lm(formula = salary ~ sex, data = Salaries)
##
## Coefficients:
## (Intercept) sexMale
## 101002 14088
ggplot(data = slopetest, aes(x = stat)) +
  geom_density() +
  geom_vline(xintercept = 14088, color = "red")
10.6.0.1 The Bootstrap
A similar simulation technique is the bootstrap. However, the bootstrap is most useful for answering the question “what are some other values we could have observed?” (i.e., confidence intervals).
In bootstrapping, we repeatedly sample cases from our data set, with replacement, and approximate the sampling distribution of a statistic. The key idea here is that even though the solution to the regression model is deterministic, the data itself is assumed to be randomly sampled, and so all of the estimates that we make in a regression model are random. If the data changes, then the estimates change as well. The bootstrap gives us a non-parametric understanding of the distribution of those estimates. Once again, the advantage to this method is that we can construct meaningful confidence intervals for, say, the slope of the regression line, without having to assume that the residuals are normally distributed.
We’ll use this technique to create a confidence interval for the same slope coefficient we were studying in the randomization problem. Remember, here’s what that relationship looks like:
ggplot(data = FirstYearGPA, aes(x = SATV, y = GPA)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)
To create a bootstrap sample, we select rows from the data frame uniformly at random, but with replacement.
# this needs the mosaic package again!
ggplot(data = mosaic::resample(FirstYearGPA),
       aes(x = SATV, y = GPA)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = lm, se = FALSE)
- Note the differences between this plot and the previous one. What do you notice?
- Run the preceding code several times. What is changing? What is staying the same?
10.6.0.1.1 Bootstrap distributions and confidence intervals
One advantage of the bootstrap is that it allows us to construct a sampling distribution for the slope coefficient that is not dependent upon the conditions for linear regression being met.
The original confidence intervals for our SLR model depend upon the conditions being true.
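For example, here are the parametric summary and confint() results for the salary ~ discipline model from above (saved as m_disc, a name I’m choosing here):

m_disc <- lm(salary ~ discipline, data = Salaries)
summary(m_disc)
confint(m_disc)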
##
## Call:
## lm(formula = salary ~ discipline, data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50748 -24611 -4429 19138 113516
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 108548 2227 48.751 < 2e-16 ***
## disciplineB 9480 3019 3.141 0.00181 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29960 on 395 degrees of freedom
## Multiple R-squared: 0.02436, Adjusted R-squared: 0.02189
## F-statistic: 9.863 on 1 and 395 DF, p-value: 0.001813
## 2.5 % 97.5 %
## (Intercept) 104171.0 112925.87
## disciplineB 3545.7 15414.83
Now let’s create a bootstrap distribution for the regression coefficients.
# 5000 bootstrap replicates; in general, more is better
slopeboot <- FirstYearGPA %>%
  specify(response = GPA, explanatory = SATV) %>%
  generate(reps = 5000, type = "bootstrap") %>%
  calculate(stat = "slope")
ggplot(data = slopeboot, aes(x = stat)) +
  geom_histogram() +
  geom_vline(xintercept = 0.0016986, color = "red")
The bootstrap distribution will always be centered around the sample statistic from our real data, but shows us some other likely values for the coefficient (essentially, sampling error). One way to quantify this variability is to create a confidence interval.
There are several methods for constructing confidence intervals from the bootstrap. My preferred method does not require that the bootstrap distribution be normal, but works best when it is roughly symmetric. In this case we simply use the percentiles of the bootstrap distribution to build confidence intervals. This method makes sense in most cases.
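With infer, that is get_confidence_interval() with the percentile type:

slopeboot %>%
  get_confidence_interval(level = 0.95, type = "percentile")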
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.00106 0.00230
“I am 95% confident that for a 1-point increase in SATV, the college GPA will increase by between 0.000944 and 0.00242.” (Your endpoints will differ slightly from run to run.) This interval does not cross 0, so we are pretty confident that there is a relationship between SATV and GPA.
Create a bootstrap sample of at least 2000 replicates and construct a confidence interval using the bootstrap. How does this interval differ from the typical confidence interval?
Construct bootstrapped confidence intervals for the \(GPA \sim SATV\) and \(GPA \sim SATM\) models.
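Here is a worked example with a different dataset, Cars2015 (an assumption on my part: this dataset comes with the Lock5Data package):

library(Lock5Data)
data(Cars2015)
summary(lm(UTurn ~ PageNum, data = Cars2015))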
##
## Call:
## lm(formula = UTurn ~ PageNum, data = Cars2015)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5867 -1.4420 -0.0501 1.5068 6.7082
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.226693 0.941281 43.799 <2e-16 ***
## PageNum -0.014043 0.005944 -2.362 0.02 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.288 on 108 degrees of freedom
## Multiple R-squared: 0.04913, Adjusted R-squared: 0.04033
## F-statistic: 5.581 on 1 and 108 DF, p-value: 0.01995
slopetest <- Cars2015 %>%
  specify(response = UTurn, explanatory = PageNum) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5000, type = "permute") %>%
  calculate(stat = "slope")

ggplot(data = slopetest, aes(x = stat)) +
  geom_density() +
  geom_vline(xintercept = -0.014043, color = "red")

slopeboot <- Cars2015 %>%
  specify(response = UTurn, explanatory = PageNum) %>%
  generate(reps = 5000, type = "bootstrap") %>%
  calculate(stat = "slope")

ggplot(data = slopeboot, aes(x = stat)) +
  geom_histogram()
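A percentile interval from the bootstrap distribution produced the output below (the 95% level is my assumption):

slopeboot %>%
  get_confidence_interval(level = 0.95)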
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 -0.0298 0.00156
“I’m 99% confident that as we page through Consumer Reports, the U-turn diameter of the cars changes by between -0.0308 and 0.00145 feet per page.” (Your endpoints will differ slightly between runs.) Since this interval crosses 0, we would not consider the slope statistically significant.
10.7 Further Reading
For more on the bootstrap and simulation-based inference, see Efron (1979), Hesterberg (2015), and Ismay and Kim (2021), listed in the references below.
10.8 Homework
Use the data from the nycflights13 package to answer some questions about airline flight delays.
- First, you will want to familiarize yourself with the datasets available in the package. There are multiple datasets: flights, airlines, airports, planes, and weather.
You may need to make some data visualizations or summary statistics to better understand the data.
- Determine which airline had the longest delays. Since we do not have airline codes memorized, we want to see the full name of the airlines, not the codes. In order to answer this question, you will need to join two datasets together. One dataset you will need is flights.
Use the following code skeleton to get started.
library(dplyr)
library(tidyr)
library(nycflights13)
flights %>%
  drop_na(arr_delay) %>%                                  # drop flights with no recorded arrival delay
  left_join(airlines, by = c("carrier" = "carrier")) %>%  # join on the carrier code to get full airline names
  group_by(name) %>%
  summarize(meandelay = mean(arr_delay)) %>%              # compute mean arrival delay per airline
  arrange(desc(meandelay))                                # sort from longest to shortest delay
## # A tibble: 16 x 2
## name meandelay
## <chr> <dbl>
## 1 Frontier Airlines Inc. 21.9
## 2 AirTran Airways Corporation 20.1
## 3 ExpressJet Airlines Inc. 15.8
## 4 Mesa Airlines Inc. 15.6
## 5 SkyWest Airlines Inc. 11.9
## 6 Envoy Air 10.8
## 7 Southwest Airlines Co. 9.65
## 8 JetBlue Airways 9.46
## 9 Endeavor Air Inc. 7.38
## 10 United Air Lines Inc. 3.56
## 11 US Airways Inc. 2.13
## 12 Virgin America 1.76
## 13 Delta Air Lines Inc. 1.64
## 14 American Airlines Inc. 0.364
## 15 Hawaiian Airlines Inc. -6.92
## 16 Alaska Airlines Inc. -9.93
Considering only flights arriving in Baltimore (BWI), see which single variable is the best predictor for delay.
Use bootstrapping to generate a confidence interval for the slope coefficient of your model.
References
Efron, Bradley. 1979. “Bootstrap Methods: Another Look at the Jackknife.” Annals of Statistics 7 (1). http://projecteuclid.org/euclid.aos/1176344552.
Hesterberg, Tim. 2015. “What Teachers Should Know About the Bootstrap.” The American Statistician 69 (4). http://arxiv.org/abs/1411.5279.
Ismay, Chester, and Albert Y. Kim. 2021. Statistical Inference via Data Science. Chapman & Hall/CRC. https://moderndive.com/index.html.