Part 4 Hypothesis Testing

A hypothesis test is a statistical method for checking how strongly a dataset supports a given hypothesis. Our focus here will be on two-sample hypothesis tests, which are used extensively to investigate whether two groups differ in a statistically meaningful way. Picking up on the examples given in Part 3:

  • An online retailer tests whether customer engagement differs between website design A and website design B,
  • A pharmaceutical company tests whether the efficacy of a new blood pressure medication differs from an existing competitor’s product, and
  • A polling organization tests whether the strength of support for one mayoral candidate differs from another.

A challenge in working with limited data in practice is determining whether an observed difference between two groups is “real” or due to random chance. For example, suppose that the pharmaceutical company finds that its medication is more effective at lowering blood pressure than a competitor’s product, based on a clinical trial with 200 participants. Does this mean the medication is truly more effective in general, or is the difference due to random chance? The purpose of a hypothesis test is to help resolve these competing explanations.

The general steps to a hypothesis test are as follows:

  1. Specify a hypothesis about two groups.
  2. Collect a random sample of data from both groups.
  3. Test the hypothesis through a few short calculations.
  4. Draw a conclusion about whether two groups differ.
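Each of these steps maps onto only a few lines of R. As a quick preview, here is a minimal sketch using simulated data (the two groups, their means, and the 0.05 threshold are illustrative choices, not taken from a real dataset):

```r
# Step 1: hypothesize that groups A and B have the same population mean.
# Step 2: collect a random sample from each group (simulated here).
set.seed(42)
group_a <- rnorm(100, mean = 20, sd = 5)  # hypothetical sample from group A
group_b <- rnorm(100, mean = 22, sd = 5)  # hypothetical sample from group B

# Step 3: test the hypothesis with a two-sample t-test.
result <- t.test(group_a, group_b)

# Step 4: draw a conclusion by comparing the p-value to a threshold.
if (result$p.value < 0.05) {
  conclusion <- "The groups differ significantly."
} else {
  conclusion <- "No significant difference detected."
}
```

We walk through each of these pieces in detail below.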

In what follows, we first discuss the mechanics of hypothesis testing in R, and how to interpret the output at an intuitive level. We’ll then discuss some of the background theory toward the end of this section.

4.1 Data: LA Metro bike share trips

LA Metro runs a bike share program to improve the accessibility of the public transit system and promote cycling. Users within the bike share program can choose from among four pass types: annual ($150/year), monthly ($17/month), daily ($5/day), and “walk-up” (free). For the annual, monthly, and daily passes, users can ride free for the first 30 minutes of each trip and pay $1.75 per 30 minutes thereafter. Walk-up users pay $1.75 per 30 minutes, with no initial free period.

Suppose LA Metro is considering an overhaul of its pass program, including adjusting the length of the initial free period each day. As a first step toward understanding the revenue impacts of adjusting the pass program, they wish to understand whether users of the various pass types differ in their cycling trips. In this example, we’ll demonstrate how hypothesis testing can be used to investigate whether bike share usage differs across the various pass types.

We will conduct our analysis using a random sample of 500 bike share trips. Let’s start by loading the data (and some helpful packages):

library(tidyverse)
library(MetBrewer)
library(hrbrthemes)
library(pander)
bike = read.csv("bikeshare_sample.csv", stringsAsFactors = TRUE)

We will focus on three variables: trip duration (in minutes), trip type (one way or round trip), and pass type (annual, monthly, daily, walk-up). To simplify the dataset, we can use subset() to hold onto only the columns we want and drop the rest:

bike = subset(bike, select = c("duration","trip_type","pass_type"))

The command summary() gives some basic descriptive statistics:

summary(bike)
##     duration           trip_type          pass_type  
##  Min.   :  1.00   One Way   :405   Annual Pass : 46  
##  1st Qu.:  8.00   Round Trip: 95   Monthly Pass:266  
##  Median : 16.00                    One Day Pass: 51  
##  Mean   : 25.88                    Walk-up     :137  
##  3rd Qu.: 28.00                                      
##  Max.   :288.00

4.2 Two-sample tests for means

Formally, a hypothesis test involves specifying a null hypothesis \((H_0)\) and an alternate hypothesis \((H_a)\). In a two-sample test for means, the null hypothesis is always that the two groups have the same population mean.

Suppose we wish to compare the mean trip duration of monthly passholders and walk-ups. Then we can write the null and alternate hypotheses as follows.

\[\begin{align} H_0 &: \mu_m = \mu_w \\ H_a &: \mu_m \neq \mu_w. \end{align}\]

As a starting point, we visualize the trip times grouped by pass type:

annual_mean = mean(subset(bike, pass_type == "Annual Pass")$duration)
monthly_mean = mean(subset(bike, pass_type == "Monthly Pass")$duration)
oneday_mean = mean(subset(bike, pass_type == "One Day Pass")$duration)
walkup_mean = mean(subset(bike, pass_type == "Walk-up")$duration)

cols = met.brewer("Monet", 4)
ggplot(bike, aes(x = pass_type, y = duration, fill = pass_type)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(color = "black", size = 0.2, alpha = 0.5) +
  theme_ipsum(base_family = "Helvetica", axis_title_size = 14) +
  scale_fill_manual(values = cols) +
  ylim(c(0, 100)) +
  theme(legend.position = "none", plot.title = element_text(size = 14)) +
  xlab("") +
  ylab("Average trip duration (minutes)") +
  annotate("point", x = 1, y = annual_mean, shape = 18, size = 2, color = "black") +
  annotate("point", x = 2, y = monthly_mean, shape = 18, size = 2, color = "black") +
  annotate("point", x = 3, y = oneday_mean, shape = 18, size = 2, color = "black") +
  annotate("point", x = 4, y = walkup_mean, shape = 18, size = 2, color = "black")

The black diamonds in the box plot above show the mean trip duration, and the horizontal black lines show the median trip duration. Note that the mean trip duration is longer for one-day passholders and walk-ups than for annual and monthly passholders. Is this difference indicative of a true difference between passholders, or is it specific to our random sample of trip data? We can determine which explanation is more plausible using a hypothesis test for means.

Let’s first split the data by pass type:

annual = bike[bike$pass_type=="Annual Pass",]
monthly = bike[bike$pass_type=="Monthly Pass",]
daily = bike[bike$pass_type=="One Day Pass",]
walkup = bike[bike$pass_type=="Walk-up",]

Suppose we first want to investigate whether the mean trip duration differs between annual and monthly passholders. We can use the t.test() function to compare the trip durations for both groups:

result <- t.test(annual$duration,monthly$duration)
pander(result)
Welch Two Sample t-test: annual$duration and monthly$duration

  Test statistic    df      P value         Alternative hypothesis   mean of x   mean of y
  --------------   -----   -------------   ----------------------   ---------   ---------
      -4.524       265.3   9.142e-06 ***         two.sided            10.22       18.24

From the output above, we can see that the average trip time for annual passholders was 10.2 minutes, while the average trip time for monthly passholders was 18.2 minutes. The p-value is extremely small (about 0.000009), which suggests it is extremely unlikely this difference is due to chance alone. Therefore, we conclude that monthly passholders tend to have longer trip durations than annual passholders, and that this difference is statistically significant at the \(\alpha = 0.05\) level.
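To demystify the output, we can reproduce Welch’s t-statistic by hand: it is the difference in sample means divided by its standard error. The sketch below uses simulated durations (not the bike share file), so the numbers are purely illustrative:

```r
set.seed(1)
x <- rexp(46, rate = 1/10)   # stand-in for annual durations (hypothetical)
y <- rexp(266, rate = 1/18)  # stand-in for monthly durations (hypothetical)

# Welch's t-statistic: difference in means over its standard error,
# with each group contributing its own variance term
se <- sqrt(var(x) / length(x) + var(y) / length(y))
t_manual <- (mean(x) - mean(y)) / se

# Compare against R's built-in implementation (t.test defaults to Welch)
t_builtin <- unname(t.test(x, y)$statistic)
all.equal(t_manual, t_builtin)
```

The two statistics agree, confirming that t.test() is doing exactly this calculation.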

A crucial condition for the analysis above to be valid is that the data be a random sample of the population, or as close to random as possible. This often requires thinking carefully about where the data came from and what the relevant population is. If the sample is not random, then it may be biased, which means the conclusions from the hypothesis test may be invalid.

4.3 Two-sample tests for proportions

Performing a hypothesis test for proportions is conceptually similar. Let’s suppose LA Metro wants to investigate whether the proportion of one-way vs. round trips differs across the various pass types.

We begin with a little data processing to store the relevant statistics:

# Frequency of one way trips by pass type
f_a = sum(annual$trip_type=="One Way") 
f_m = sum(monthly$trip_type=="One Way") 
f_d = sum(daily$trip_type=="One Way") 
f_w = sum(walkup$trip_type=="One Way") 

# Total number of trips by pass type
n_a = nrow(annual) 
n_m = nrow(monthly) 
n_d = nrow(daily)
n_w = nrow(walkup) 

Next let’s visualize the four proportions based on the counts above:

props <- data.frame(
  name=factor(c("Annual","Monthly","Daily","Walkup"), levels = c("Annual","Monthly","Daily","Walkup")),  
  value=c(f_a/n_a,f_m/n_m,f_d/n_d,f_w/n_w))

p_a = f_a/n_a
p_m = f_m/n_m
p_d = f_d/n_d
p_w = f_w/n_w

ggplot(props, aes(x=name, y=value, fill = name)) + 
  geom_bar(stat = "identity")+
  scale_fill_manual(values = cols)+
  theme_ipsum(base_family = "Helvetica", axis_title_size = 14)+
  theme(legend.position="none",plot.title = element_text(size=11)) +
  ylab("Proportion of one-way trips")+
  geom_errorbar(aes(x=1, ymin=p_a - 2*sqrt(p_a*(1-p_a)/n_a), ymax=p_a + 2*sqrt(p_a*(1-p_a)/n_a)), width=0.4, colour="black", alpha=0.9, size=0.25)+
    geom_errorbar(aes(x=2, ymin=p_m - 2*sqrt(p_m*(1-p_m)/n_m), ymax=p_m + 2*sqrt(p_m*(1-p_m)/n_m)), width=0.4, colour="black", alpha=0.9, size=0.25)+
    geom_errorbar(aes(x=3, ymin=p_d - 2*sqrt(p_d*(1-p_d)/n_d), ymax=p_d + 2*sqrt(p_d*(1-p_d)/n_d)), width=0.4, colour="black", alpha=0.9, size=0.25)+
    geom_errorbar(aes(x=4, ymin=p_w - 2*sqrt(p_w*(1-p_w)/n_w), ymax=p_w + 2*sqrt(p_w*(1-p_w)/n_w)), width=0.4, colour="black", alpha=0.9, size=0.25)+
  xlab("")

Let’s start by comparing annual passholders with walk-ups. From the plot above, we can see that within our random sample of 500 trips, annual passholders are more likely than walk-ups to make one-way trips. Is this result statistically significant at an \(\alpha = 0.05\) significance level?

To investigate, we perform a two-sample proportion test. In R, this is done by calling the prop.test() function:

freqs = c(f_a,f_w) # store frequencies of one-way trips by annual passholders and walkups
totals = c(n_a,n_w) # store total number of trips by annual and walkups
result <- prop.test(freqs,totals) # run two sample hypothesis test for proportions
pander(result)
2-sample test for equality of proportions with continuity correction: freqs out of totals

  Test statistic   df   P value      Alternative hypothesis   prop 1   prop 2
  --------------   --   ----------   ----------------------   ------   ------
       8.59        1    0.00338 **         two.sided          0.913    0.6788

The summary table above shows that the probability an annual passholder takes a one-way trip is 0.91, whereas the same probability for walk-ups is 0.68 – a difference of 0.23. Because we observe a tiny p-value of \(0.003 < 0.05\), we can conclude that this difference is statistically significant at the \(\alpha = 0.05\) level. In other words, this \(p\)-value means that if annual and walk-up passholders were equally likely to make one-way trips in general, then the likelihood we would observe a difference as large as 0.23 just by chance is about 3 in 1000. This is very unlikely, so we conclude the underlying population proportions are in fact not equal.
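To see what prop.test() is doing under the hood, we can rebuild the statistic from the counts. The counts below (42 of 46 annual trips and 93 of 137 walk-up trips being one way) are reconstructed from the group sizes in the summary() output and the proportions reported by the test, so treat them as illustrative. The test pools both groups to estimate a common proportion under \(H_0\):

```r
f <- c(42, 93)   # one-way trips: annual, walk-up (reconstructed from output above)
n <- c(46, 137)  # total trips per group

# Pooled proportion: the common value both groups would share under H0
p_pool <- sum(f) / sum(n)

# Z statistic without continuity correction; its square is the
# chi-squared statistic that prop.test reports
se <- sqrt(p_pool * (1 - p_pool) * (1 / n[1] + 1 / n[2]))
z  <- (f[1] / n[1] - f[2] / n[2]) / se

chisq_manual <- unname(prop.test(f, n, correct = FALSE)$statistic)
all.equal(z^2, chisq_manual)
```

Note that prop.test() applies a continuity correction by default (as in the output above), which shrinks the statistic slightly; setting correct = FALSE makes the built-in result match the hand calculation exactly.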

Suppose we now want to compare annual passholders to monthly passholders. We can see from the bar chart above that annual passholders are slightly more likely to make one-way trips than monthly passholders. Does this hold in general?

freqs = c(f_a,f_m) # store frequencies of one-way trips by annual and monthly passholders
totals = c(n_a,n_m) # store total number of trips by annual and monthly passholders
result <- prop.test(freqs,totals) # run two sample hypothesis test for proportions
pander(result)
2-sample test for equality of proportions with continuity correction: freqs out of totals

  Test statistic   df   P value   Alternative hypothesis   prop 1   prop 2
  --------------   --   -------   ----------------------   ------   ------
      0.2898       1    0.5903          two.sided          0.913    0.8722

The difference in likelihood of making a one-way trip between annual and monthly passholders is \(0.91 - 0.87 = 0.04\). The large p-value of \(0.59\) suggests that this observed difference is not statistically significant. In words, if the true population proportions were the same, there is a 59% chance we would observe a difference of at least 0.04. So, we can’t reject the null hypothesis of equal population proportions.

4.4 One-sample tests for means and proportions

Instead of testing whether two groups are different, we can also perform a one-sample test, which compares one group against a reference value. For example, suppose we know that the mean duration of bike share trips by monthly passholders was 22 minutes in the previous three years, and we wish to test whether it has changed this year. Because the reference value of 22 minutes is not based on a random sample of data, we treat it as a fixed number. The null and alternate hypotheses then become:

\[\begin{align} H_0 &: \mu = 22 \\ H_a &: \mu \neq 22. \end{align}\]

In R, we run the hypothesis test in the same way, this time replacing the second sample with mu = 22:

result <- t.test(monthly$duration, mu = 22)
pander(result)
One Sample t-test: monthly$duration

  Test statistic   df    P value     Alternative hypothesis   mean of x
  --------------   ---   ---------   ----------------------   ---------
      -2.519       265   0.01235 *         two.sided            18.24
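The reported p-value can be recovered directly from the test statistic and degrees of freedom: it is the probability that a t-distributed random variable lands at least 2.519 standard errors from zero, in either direction. Plugging in the numbers from the table above:

```r
t_stat <- -2.519  # test statistic from the output above
df     <- 265     # degrees of freedom (n - 1)

# Two-sided p-value: probability of a t value at least this extreme
p_value <- 2 * pt(-abs(t_stat), df)
```

This reproduces the P value column (0.01235), and since it falls below 0.05, we reject the null hypothesis that the mean trip duration is still 22 minutes.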

4.5 Background theory

Here we’ll give a high-level idea of the background theory for hypothesis testing. We will use the one-sample hypothesis test for means to convey the main ideas, which carry over to one-sample proportion tests and both types of two-sample tests as well.

In a one-sample hypothesis test, the test statistic is

\[ t = \frac{\bar{x} - \mu}{s / \sqrt{n}}. \]

What is the interpretation of \(t\) here? Recall from Part 3 that the sample mean \(\bar{x}\) is a random variable whose sampling distribution has mean \(\mu\) and standard error \(\frac{s}{\sqrt{n}}\). Therefore, if \(\mu\) is the true mean, then the observed \(\bar{x}\) lies \(t\) standard errors away from the mean of the sampling distribution.

To illustrate, suppose in a random sample of \(n = 100\) trips by monthly passholders, we obtain a sample mean of \(\bar{x} = 27\) minutes. Suppose now we wish to test whether the true mean is \(22\) minutes. In this case, the hypotheses are

\[\begin{align} H_0 &: \mu = 22 \\ H_a &: \mu \neq 22. \end{align}\]
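To make the test statistic concrete, suppose the sample standard deviation in this example were \(s = 25\) minutes (an assumed value for illustration; it is not given above). Then the calculation is just a few lines:

```r
x_bar <- 27   # sample mean from the example above
mu0   <- 22   # hypothesized mean under H0
n     <- 100  # sample size
s     <- 25   # sample standard deviation (assumed for illustration)

# The observed mean is t standard errors from the hypothesized mean
t_stat <- (x_bar - mu0) / (s / sqrt(n))  # 5 / 2.5 = 2 standard errors

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 1)
```

Under these assumed numbers, \(\bar{x} = 27\) sits 2 standard errors above the hypothesized mean, which already puts it in the tail of the sampling distribution.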

If the null hypothesis were true, then the distribution for \(\bar{x}\) would look something like:

cols = met.brewer("Monet")
set.seed(1)

# Load the full trip dataset and keep monthly passholders only
bike = read.csv("bikeshare.csv", stringsAsFactors = TRUE)
bike = bike[bike$pass_type == "Monthly Pass",]
mu = 22   # hypothesized mean under the null
n = 100   # size of each random sample

# Draw 10,000 random samples and record each sample mean
x_bar <- list()
for (i in 1:10000) {
  bike_sample <- bike[sample(1:nrow(bike), n),]
  x_bar[[i]] = mean(bike_sample$duration)
}

# Shift the simulated means so the distribution is centered at the
# hypothesized mean, as it would be if the null were true
delt = mu - mean(unlist(x_bar))
CLT <- data.frame(x_bar = unlist(x_bar) + delt)

ggplot(CLT, aes(x = x_bar)) +
  geom_histogram(binwidth = 0.2, color = "white", fill = cols[9]) +
  theme_ipsum(base_family = "Helvetica", axis_title_size = 14) +
  xlab("Mean duration of one way trips (minutes)") +
  ylab("Frequency") +
  geom_segment(aes(x = mu, y = 0, xend = mu, yend = 450), size = 0.5, linetype = "dashed", color = cols[6]) +
  geom_segment(aes(x = 27, y = 0, xend = 27, yend = 80), size = 0.5, linetype = "dashed", color = cols[6]) +
  annotate("text", x = mu + 1, y = 425, label = expression(paste(mu, ' = 22'))) +
  annotate("text", x = 27 + 1, y = 70, label = expression(paste(bar(x), ' = 27'))) +
  scale_x_continuous(breaks = 15:30)

Note how far \(\bar{x} = 27\) is from the hypothesized mean of \(\mu = 22\) minutes. That is, if the true mean really were 22 minutes, it would be rare for us to observe a sample mean as extreme as 27 from a sample of size 100. In fact, across the 10,000 random draws of the data, the sample mean fell more than 5 minutes from the hypothesized mean (in either direction) only 353 times, or in 3.53% of cases. This small probability is the \(p\)-value – the likelihood of seeing a deviation from the mean this large if the null hypothesis were true.
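The same empirical p-value calculation can be sketched without the bike share file by simulating the sampling distribution directly. The standard error of 2.5 below is an assumed value for illustration, so the resulting fraction will differ from the 3.53% reported above:

```r
set.seed(123)
mu0 <- 22   # hypothesized mean under the null
se  <- 2.5  # assumed standard error of the sample mean (illustrative)

# Draw 10,000 sample means as they would arise if the null were true
x_bars <- rnorm(10000, mean = mu0, sd = se)

# Empirical two-sided p-value: fraction of draws at least 5 minutes
# from the hypothesized mean (since |27 - 22| = 5)
p_empirical <- mean(abs(x_bars - mu0) >= 5)

# Theoretical counterpart under normality
p_theory <- 2 * pnorm(-5 / se)
```

The empirical fraction converges to the theoretical tail probability as the number of draws grows, which is exactly the logic the t-test formalizes.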