## 35.6 Hypothesis testing

### 35.6.1 Introduction

The regression line is computed from the *sample*,
assuming a linear relationship actually exists in the *population*.
The (unknown) regression line in the *population* is

\[
\hat{y} = {\beta_0} + {\beta_1} x.
\]
From the *sample*,
the *estimate* of the *population* regression line
(Appendix C)
is

\[
\hat{y} = {b_0} + {b_1} x.
\]
That is,
the intercept in the population is \(\beta_0\)
(estimated by \(b_0\)),
and the slope in the population is \(\beta_1\)
(estimated by \(b_1\)).
The *sample* can be used to ask questions about the
*population* regression coefficients.
As usual,
the sample values can vary from sample to sample
(and so have a *sampling distribution*).

Usually questions are asked about the *slope*,
because the slope explains the
*relationship* between the two variables
(Sect. 35.5).

### 35.6.2 Hypotheses: Assumption

The null hypothesis is
the usual
‘no relationship’ hypothesis.
In this context,
‘no relationship’ means that the slope is zero
(Sect. 35.5.2).
Hence,
the null hypotheses (about the *population*) is:

- \(H_0\): \(\beta_1 = 0\).

This hypothesis proposes that \(b_1\) is not zero because of sampling variation.
As part of the decision-making process,
the null hypothesis is initially **assumed** to be true.

For the red deer data (Sect. 33.2), determining if a relationship exists between the age of the deer, and the weight of their molars, would test these hypotheses:

- \(H_0\): \(\beta_1 = 0\);
- \(H_1\): \(\beta_1 \ne 0\)

The parameter is \(\beta\), the population slope for the regression equation predicting molar weight from age.

The alternative hypothesis is two-tailed, based on the RQ.

### 35.6.3 Sampling distribution: Expectation

**Assuming** the null hypothesis is true (that \(\beta_1=0\)),
we can describe what values the sample slope \(b_1\) are **expected** to take,
through *sampling variation*.
The variation in the sample slope from sample to sample
can be described
(Fig. 35.6)
using:

- an approximate normal distribution,
- with a mean of \(\beta_1 = 0\) (from \(H_0\)), and
- a standard deviation,
called the
*standard error of the slope*, of \(\text{s.e.}(b_1)\).

The standard error is found using software (jamovi: Fig. 35.7; SPSS: Fig. 35.8).

### 35.6.4 The test statistic: Observation

The **observed**
sample slope was \(b_1 = -0.181\).
The *test statistic* would be found using
the usual approach:

\[\begin{align*} t &= \frac{ b_1 - \beta_1}{\text{s.e.}(b_1)} \\ &= \frac{-0.181 - 0}{0.0289} = -6.27, \end{align*}\] where the values of \(b_1\) and \(\text{s.e.}(b_1)\) are taken from the software output. The \(t\)-score is also reported by the software.

### 35.6.5 \(P\)-value: Consistency with assumption

To determine if the statistic is **consistent** with the null hypothesis,
the \(P\)-value can be approximated using the 68–95–99.7 rule,
or taken from software output
(jamovi: Fig. 35.7;
SPSS: Fig. 35.8).
Using software,
the \(P\)-value is \(P<0.001\).

We write:

The sample presents very strong evidence (\(t = -6.27\); one-tailed \(P<0.001\)) that the slope in the population between age of the deer and molar weight is not zero (slope: \(-0.181\)).

**Example 35.4 (Emergency department patients) **A study examined the relationship between the number of emergency department (ED) patients
and the number of days following the distribution of monthly welfare monies (Brunette et al. 1991) from 1986 to 1988 in Minneapolis, USA.

The data (extracted from Fig. 2 of Brunette et al. (1991)) is plotted in Fig. 35.9, and the jamovi analysis shown in Fig. 35.10.

The regression line is estimated as

\[ \hat{y} = 150.19 - 0.348x, \] where \(y\) represents the mean number of ED patients, and \(x\) the number of days since welfare distribution.

This regression equation suggests that each extra day after welfare distribution is associated with a *decrease* in the number of ED patients of about 0.35.
(It may be easier to understand this way:
‘each 10 extra days after welfare distribution is associated with a *decrease* in the number of ED patients of about \(10\times 0.35 = 3.5\).’)

The scatterplot and the regression equation suggests a negative relationship between the number of ED patients and the days after distribution. However, we know that every sample is likely to be different, so the relationship may not actually be present in the population. So we test these hypotheses:

- \(H_0\): \(\beta_1 = 0\) where \(\beta_1\) is the
*population*slope - \(H_1\): \(\beta_1 \ne 0\) (i.e., two-tailed, based on the authors’ aim)

The output shows that the test statistic is \(t = -7.45\), which is very large; unsurprisingly, the two-tailed \(P\)-value is very small: \(P<0.001\).

We write:

There is very strong evidence (\(t = -7.45\); two-tailed \(P < 0.001\)) of a relationship between the mean number of ED patients and the number of days after welfare distribution (slope: \(-0.348\)).