4.2 Two Sample Inference

4.2.1 Means

Suppose we have 2 sets of observations,

  • \(y_1,..., y_{n_y}\)
  • \(x_1,...,x_{n_x}\)

that are random samples from two independent populations with means \(\mu_y\) and \(\mu_x\) and variances \(\sigma^2_y\), \(\sigma^2_x\). Our goal is to compare \(\mu_y\) and \(\mu_x\), or to test whether \(\sigma^2_y = \sigma^2_x\).

Large Sample Tests

Assume that \(n_y\) and \(n_x\) are large (\(\ge 30\)). Then,

\[ E(\bar{y} - \bar{x}) = \mu_y - \mu_x \\ Var(\bar{y} - \bar{x}) = \sigma^2_y /n_y + \sigma^2_x/n_x \]


\[ Z = \frac{\bar{y}-\bar{x} - (\mu_y - \mu_x)}{\sqrt{\sigma^2_y /n_y + \sigma^2_x/n_x}} \sim N(0,1) \]

by the Central Limit Theorem. For large samples, we can replace the variances by their unbiased estimators (\(s^2_y, s^2_x\)) and get the same large-sample distribution.

An approximate \(100(1-\alpha) \%\) CI for \(\mu_y - \mu_x\) is given by:

\[ \bar{y} - \bar{x} \pm z_{\alpha/2}\sqrt{s^2_y/n_y + s^2_x/n_x} \]

We can test the hypotheses

\[ H_0: \mu_y - \mu_x = \delta_0 \\ H_A: \mu_y - \mu_x \neq \delta_0 \]

at the \(\alpha\)-level with the statistic:

\[ z = \frac{\bar{y}-\bar{x} - \delta_0}{\sqrt{s^2_y /n_y + s^2_x/n_x}} \]

and reject \(H_0\) if \(|z| > z_{\alpha/2}\).
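As an illustration (not from the source; made-up data, assuming numpy and scipy are available), the large-sample z-test and CI can be sketched as:

```python
# Sketch of the large-sample two-sample z-test; data are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.normal(5.0, 2.0, size=50)   # sample from population Y (n_y >= 30)
x = rng.normal(4.0, 2.5, size=60)   # sample from population X (n_x >= 30)

# Standard error: sqrt(s_y^2 / n_y + s_x^2 / n_x)
se = np.sqrt(y.var(ddof=1) / len(y) + x.var(ddof=1) / len(x))

# Test H0: mu_y - mu_x = delta_0 = 0 against the two-sided alternative
z = (y.mean() - x.mean()) / se
p_value = 2 * stats.norm.sf(abs(z))

# Approximate 100(1 - alpha)% CI for mu_y - mu_x, alpha = 0.05
lo, hi = y.mean() - x.mean() + np.array([-1.0, 1.0]) * stats.norm.ppf(0.975) * se
```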

If \(\delta_0 = 0\), we are testing whether the two means are equal.

Small Sample Tests

If the two samples are independent and drawn from normal distributions, iid \(N(\mu_y,\sigma^2_y)\) and iid \(N(\mu_x,\sigma^2_x)\), we can base inference on the t-distribution.

Then we have two cases, depending on whether the population variances are equal.

Equal Variance


  • iid: so that \(var(\bar{y}) = \sigma^2_y / n_y ; var(\bar{x}) = \sigma^2_x / n_x\)
  • Independence between samples: no observation from one sample can influence any observation from the other sample, so that

\[ \begin{aligned} var(\bar{y} - \bar{x}) &= var(\bar{y}) + var(\bar{x}) - 2cov(\bar{y},\bar{x}) \\ &= var(\bar{y}) + var(\bar{x}) \\ &= \sigma^2_y / n_y + \sigma^2_x / n_x \end{aligned} \]

Let \(\sigma^2 = \sigma^2_y = \sigma^2_x\). Then \(s^2_y\) and \(s^2_x\) are both unbiased estimators of \(\sigma^2\), so we can pool them.

The pooled variance estimate \[ s^2 = \frac{(n_y - 1)s^2_y + (n_x - 1)s^2_x}{(n_y-1)+(n_x-1)} \] has \(n_y + n_x -2\) df.

Then the test statistic

\[ T = \frac{\bar{y}- \bar{x} -(\mu_y - \mu_x)}{s\sqrt{1/n_y + 1/n_x}} \sim t_{n_y + n_x -2} \]

\(100(1 - \alpha) \%\) CI for \(\mu_y - \mu_x\) is

\[ \bar{y} - \bar{x} \pm t_{n_y + n_x -2;\alpha/2} \, s\sqrt{1/n_y + 1/n_x} \]
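A sketch of the pooled-variance test in Python (hypothetical data; the manual computation is checked against scipy's equal-variance t-test):

```python
# Pooled (equal-variance) two-sample t-test; data are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.normal(10.0, 3.0, size=12)
x = rng.normal(8.0, 3.0, size=15)
ny, nx = len(y), len(x)

# Pooled variance with n_y + n_x - 2 df
s2 = ((ny - 1) * y.var(ddof=1) + (nx - 1) * x.var(ddof=1)) / (ny + nx - 2)

t = (y.mean() - x.mean()) / np.sqrt(s2 * (1 / ny + 1 / nx))
p_value = 2 * stats.t.sf(abs(t), ny + nx - 2)

# scipy's built-in equal-variance test gives the same answer
t_ref, p_ref = stats.ttest_ind(y, x, equal_var=True)
```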

Hypothesis testing:
\[ H_0: \mu_y - \mu_x = \delta_0 \\ H_1: \mu_y - \mu_x \neq \delta_0 \]

we reject \(H_0\) if \(|t| > t_{n_y + n_x -2;\alpha/2}\).

Unequal Variance

We can check the assumptions as follows:


  1. Two samples are independent
    1. Scatter plots
    2. Correlation coefficient (if normal)
  2. Independence of observation in each sample
    1. Test for serial correlation
  3. For each sample, homogeneity of variance
    1. Scatter plots
    2. Formal tests
  4. Normality
  5. Equality of variances (homogeneity of variance between samples)
    1. F-test
    2. Bartlett test
    3. Modified Levene Test (Brown-Forsythe Test)

To compare 2 normal \(\sigma^2_y \neq \sigma^2_x\), we use the test statistic:

\[ T = \frac{\bar{y}- \bar{x} -(\mu_y - \mu_x)}{\sqrt{s^2_y/n_y + s^2_x/n_x}} \]

In this case, T does not follow the t-distribution (its distribution depends on the ratio of the unknown variances \(\sigma^2_y, \sigma^2_x\)). For small sample sizes, we can approximate the test by the Welch-Satterthwaite method (Satterthwaite 1946): we assume T can be approximated by a t-distribution and adjust the degrees of freedom.

Let \(w_y = s^2_y /n_y\) and \(w_x = s^2_x /n_x\) (the \(w\)'s are the squares of the respective standard errors).
Then, the degrees of freedom are

\[ v = \frac{(w_y + w_x)^2}{w^2_y / (n_y-1) + w^2_x / (n_x-1)} \]

Since \(v\) is usually fractional, we truncate it down to the nearest integer.

\(100 (1-\alpha) \%\) CI for \(\mu_y - \mu_x\) is

\[ \bar{y} - \bar{x} \pm t_{v,\alpha/2} \sqrt{s^2_y/n_y + s^2_x /n_x} \]

Reject \(H_0\) if \(|t| > t_{v,\alpha/2}\), where

\[ t = \frac{\bar{y} - \bar{x}-\delta_0}{\sqrt{s^2_y/n_y + s^2_x /n_x}} \]
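The Welch-Satterthwaite computation can be sketched as follows (hypothetical data; note scipy's Welch test uses the fractional \(v\) rather than truncating, so the manual result matches it exactly):

```python
# Welch-Satterthwaite approximate df and test; data are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
y = rng.normal(5.0, 1.0, size=10)
x = rng.normal(5.0, 4.0, size=14)

wy = y.var(ddof=1) / len(y)   # squared standard error of ybar
wx = x.var(ddof=1) / len(x)   # squared standard error of xbar

# Welch-Satterthwaite degrees of freedom
v = (wy + wx) ** 2 / (wy ** 2 / (len(y) - 1) + wx ** 2 / (len(x) - 1))

t = (y.mean() - x.mean()) / np.sqrt(wy + wx)
p_value = 2 * stats.t.sf(abs(t), v)

# scipy's Welch test (equal_var=False) uses the same approximation
t_ref, p_ref = stats.ttest_ind(y, x, equal_var=False)
```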

4.2.2 Variances

\[ F_{ndf,ddf}= \frac{s^2_1}{s^2_2} \]

where \(s^2_1 > s^2_2\), \(ndf = n_1 - 1\), and \(ddf = n_2 - 1\).

F-test


\[ H_0: \sigma^2_y = \sigma^2_x \\ H_a: \sigma^2_y \neq \sigma^2_x \]

Consider the test statistic,

\[ F= \frac{s^2_y}{s^2_x} \]

Reject \(H_0\) if

  • \(F>f_{n_y -1,n_x -1,\alpha/2}\) or
  • \(F<f_{n_y -1,n_x -1,1-\alpha/2}\)

where \(f_{n_y -1,n_x -1,\alpha/2}\) and \(f_{n_y -1,n_x -1,1-\alpha/2}\) are the upper and lower \(\alpha/2\) critical points of an F-distribution with \(n_y-1\) and \(n_x-1\) degrees of freedom.
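scipy has no direct analogue of R's var.test, but the F-test is easy to sketch (hypothetical data):

```python
# Two-sample F-test for equality of variances; data are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
y = rng.normal(0.0, 2.0, size=25)
x = rng.normal(0.0, 1.0, size=30)
alpha = 0.05

F = y.var(ddof=1) / x.var(ddof=1)             # F = s_y^2 / s_x^2
dfy, dfx = len(y) - 1, len(x) - 1

upper = stats.f.ppf(1 - alpha / 2, dfy, dfx)  # upper alpha/2 critical point
lower = stats.f.ppf(alpha / 2, dfy, dfx)      # lower alpha/2 critical point
reject = (F > upper) or (F < lower)

# Equivalent two-sided p-value
p_value = 2 * min(stats.f.sf(F, dfy, dfx), stats.f.cdf(F, dfy, dfx))
```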


  • This test depends heavily on the assumption of normality.
  • In particular, it can give too many significant results when observations come from long-tailed distributions (i.e., positive kurtosis).
  • If we cannot find support for normality, we can use tests that are robust to non-normality, such as the Modified Levene Test (Brown-Forsythe Test).

#>  F test to compare two variances
#> data:  irisVe and irisVi
#> F = 0.51842, num df = 49, denom df = 49, p-value = 0.02335
#> alternative hypothesis: true ratio of variances is not equal to 1
#> 95 percent confidence interval:
#>  0.2941935 0.9135614
#> sample estimates:
#> ratio of variances 
#>          0.5184243

Modified Levene Test (Brown-Forsythe Test)

  • The test considers averages of absolute deviations rather than squared deviations, and is hence less sensitive to long-tailed distributions.
  • The test still performs well for normal data.

For each sample, we consider the absolute deviation of each observation from the median:

\[ d_{y,i} = |y_i - y_{.5}| \\ d_{x,i} = |x_i - x_{.5}| \]

Then, the test statistic is

\[ t_L^* = \frac{\bar{d}_y-\bar{d}_x}{s \sqrt{1/n_y + 1/n_x}} \]

The pooled variance \(s^2\) is given by:

\[ s^2 = \frac{\sum_{i=1}^{n_y}(d_{y,i}-\bar{d}_y)^2 + \sum_{j=1}^{n_x}(d_{x,j}-\bar{d}_x)^2}{n_y + n_x -2} \]

  • If the error terms have constant variance and \(n_y\) and \(n_x\) are not extremely small, then \(t_L^* \sim t_{n_x + n_y -2}\)
  • We reject the null hypothesis when \(|t_L^*| > t_{n_y + n_x -2;\alpha/2}\)
  • This is just the two-sample t-test applied to the absolute deviations.
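The recipe above can be sketched in Python (hypothetical data). Note that scipy's levene with center="median" is the Brown-Forsythe test; for two groups its F statistic equals \((t_L^*)^2\), so we can cross-check:

```python
# Modified Levene (Brown-Forsythe) test via absolute deviations from the median.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
y = rng.normal(0.0, 1.0, size=30)
x = rng.normal(0.0, 2.0, size=30)

d_y = np.abs(y - np.median(y))   # d_{y,i} = |y_i - median(y)|
d_x = np.abs(x - np.median(x))   # d_{x,i} = |x_i - median(x)|

# Two-sample pooled t-test applied to the absolute deviations
t_L, p_value = stats.ttest_ind(d_y, d_x, equal_var=True)

# scipy's Brown-Forsythe test; for two groups its F statistic equals t_L^2
F_ref, p_ref = stats.levene(y, x, center="median")
```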
#>  Two Sample t-test
#> data:  dVe and dVi
#> t = -2.5584, df = 98, p-value = 0.01205
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -0.12784786 -0.01615214
#> sample estimates:
#> mean of x mean of y 
#>     0.154     0.226

# small samples t-test  
#>  Welch Two Sample t-test
#> data:  irisVe and irisVi
#> t = -14.625, df = 89.043, p-value < 2.2e-16
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -0.7951002 -0.6048998
#> sample estimates:
#> mean of x mean of y 
#>     1.326     2.026

4.2.3 Power

Consider \(\sigma^2_y = \sigma^2_x = \sigma^2\).
Under the assumption of equal variances, we take equal-size samples from both groups (\(n_y = n_x = n\)).

For 1-sided testing,

\[ H_0: \mu_y - \mu_x \le 0 \\ H_a: \mu_y - \mu_x > 0 \]

\(\alpha\)-level z-test rejects \(H_0\) if

\[ z = \frac{\bar{y} - \bar{x}}{\sigma \sqrt{2/n}} > z_{\alpha} \]

The power function is

\[ \pi(\mu_y - \mu_x) = \Phi\left(-z_{\alpha} + \frac{\mu_y -\mu_x}{\sigma}\sqrt{n/2}\right) \]

We need a sample size \(n\) that gives at least \(1-\beta\) power when \(\mu_y - \mu_x = \delta\), where \(\delta\) is the smallest difference that we want to detect.

Setting the power equal to \(1 - \beta\):

\[ \Phi(-z_{\alpha} + \frac{\delta}{\sigma}\sqrt{n/2}) = 1 - \beta \]

4.2.4 Sample Size

Then, the sample size is

\[ n = 2\left(\frac{\sigma (z_{\alpha} + z_{\beta})}{\delta}\right)^2 \]

For 2-sided test, replace \(z_{\alpha}\) with \(z_{\alpha/2}\).
As with the one-sample case, to perform an exact 2-sample t-test sample size calculation, we must use a non-central t-distribution.

A correction that gives the approximate t-test sample size can be obtained by using the z-test n value in the formula:
\[ n^* = 2\left(\frac{\sigma (t_{2n-2;\alpha} + t_{2n-2;\beta})}{\delta}\right)^2 \]

where we use \(\alpha/2\) in place of \(\alpha\) for the two-sided test.
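To make the formulas concrete, here is a sketch with made-up inputs (\(\delta = 0.5\), \(\sigma = 1\), \(\alpha = 0.05\), power \(0.8\)) of the z-based sample size and the t-correction:

```python
# Per-group sample size for the one-sided two-sample z-test, plus t-correction.
import math
from scipy import stats

alpha, beta = 0.05, 0.20      # 80% power
sigma, delta = 1.0, 0.5       # smallest difference worth detecting (made up)

z_a = stats.norm.ppf(1 - alpha)
z_b = stats.norm.ppf(1 - beta)

n = math.ceil(2 * (sigma * (z_a + z_b) / delta) ** 2)   # round up

# Power check at mu_y - mu_x = delta
power = stats.norm.cdf(-z_a + (delta / sigma) * math.sqrt(n / 2))

# Approximate t-test correction: plug the z-based n into the t-quantile formula
t_a = stats.t.ppf(1 - alpha, 2 * n - 2)
t_b = stats.t.ppf(1 - beta, 2 * n - 2)
n_star = math.ceil(2 * (sigma * (t_a + t_b) / delta) ** 2)
```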

4.2.5 Matched Pair Designs

We have two treatments

| Subject | Treatment A | Treatment B | Difference          |
|---------|-------------|-------------|---------------------|
| 1       | \(y_1\)     | \(x_1\)     | \(d_1 = y_1 - x_1\) |
| 2       | \(y_2\)     | \(x_2\)     | \(d_2 = y_2 - x_2\) |
| ...     | ...         | ...         | ...                 |
| n       | \(y_n\)     | \(x_n\)     | \(d_n = y_n - x_n\) |

We assume \(y_i \sim^{iid} N(\mu_y, \sigma^2_y)\) and \(x_i \sim^{iid} N(\mu_x,\sigma^2_x)\), but since \(y_i\) and \(x_i\) are measured on the same subject, they are correlated.


\[ \mu_D = E(y_i - x_i) = \mu_y -\mu_x \\ \sigma^2_D = var(y_i - x_i) = var(y_i) + var(x_i) -2cov(y_i,x_i) \]

If the matching induces positive correlation, then the variance of the difference of the measurements is reduced compared to the independent case. This is the point of Matched Pair Designs. Although the covariance can be negative, giving a larger variance of the difference than in the independent case, usually the covariance is positive: both \(y_i\) and \(x_i\) are large for many of the same subjects, and both measurements are small for others. (We still assume that different subjects respond independently of each other, which is necessary for the iid assumption within groups.)

Let \(d_i = y_i - x_i\), then

  • \(\bar{d} = \bar{y}-\bar{x}\) is the sample mean of the \(d_i\)
  • \(s_d^2=\frac{1}{n-1}\sum_{i=1}^n (d_i - \bar{d})^2\) is the sample variance of the difference

Once the data are converted to differences, we are back to One Sample Inference and can use its tests and CIs.
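A sketch with made-up paired data: a shared subject effect induces the positive correlation, and the one-sample t-test on the differences matches scipy's paired test:

```python
# Matched-pair analysis: reduce to differences, then one-sample inference.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 20
subject = rng.normal(0.0, 2.0, size=n)        # subject effect -> correlation
y = subject + rng.normal(1.0, 1.0, size=n)    # treatment A
x = subject + rng.normal(0.0, 1.0, size=n)    # treatment B

d = y - x                                     # d_i = y_i - x_i
t, p_value = stats.ttest_1samp(d, 0.0)        # one-sample t-test on differences

# Identical to scipy's paired t-test
t_ref, p_ref = stats.ttest_rel(y, x)
```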

4.2.6 Nonparametric Tests for Two Samples

For Matched Pair Designs, we can use the One-sample Non-parametric Methods.

Assume that \(Y\) and \(X\) are random variables with CDFs \(F_Y\) and \(F_X\). Then \(Y\) is stochastically larger than \(X\) if, for all real numbers \(u\), \(P(Y > u) \ge P(X > u)\).

Equivalently, \(P(Y \le u) \le P(X \le u)\), i.e., \(F_Y(u) \le F_X(u)\), which we abbreviate as \(F_Y < F_X\).

If the two distributions are identical except that one is shifted relative to the other, then each distribution can be indexed by a location parameter, say \(\theta_y\) and \(\theta_x\). In this case, \(Y\) is stochastically larger than \(X\) if \(\theta_y > \theta_x\).

Consider the hypotheses,

\[ H_0: F_Y = F_X \\ H_a: F_Y < F_X \]

where the alternative is the upper one-sided alternative.

  • We can also consider the lower one-sided alternative \(H_a: F_Y > F_X\), or the two-sided alternative

\[ H_a: F_Y < F_X \text{ or } F_Y > F_X \]

  • In this case, we don’t use \(H_a: F_Y \neq F_X\), as that allows arbitrary differences between the distributions without requiring one to be stochastically larger than the other.

If the distributions only differ in terms of their location parameters, we can focus hypothesis tests on the parameters (e.g., \(H_0: \theta_y = \theta_x\) vs. \(\theta_y > \theta_x\))

We have 2 equivalent nonparametric tests that consider the hypothesis mentioned above

  1. Wilcoxon rank test
  2. Mann-Whitney U test

Wilcoxon rank test

  1. Combine all \(n= n_y + n_x\) observations and rank them in ascending order.
  2. Sum the ranks of the \(y\)’s and \(x\)’s separately. Let \(w_y\) and \(w_x\) be these sums. (\(w_y + w_x = 1 + 2 + ... + n = n(n+1)/2\))
  3. Reject \(H_0\) if \(w_y\) is large (equivalently, \(w_x\) is small)

Under \(H_0\), any arrangement of the \(y\)’s and \(x\)’s is equally likely to occur, and there are \((n_y + n_x)!/(n_y! n_x!)\) possible arrangements.

  • Technically, for each arrangement we can compute the values of \(w_y\) and \(w_x\), and thus generate the distribution of the statistic under the null hypothesis.
  • This can be computationally intensive, so in practice a large-sample approximation is often used instead:

wilcox.test(irisVe, irisVi,
  alternative = "two.sided",
  conf.level = 0.95,
  exact = FALSE,
  correct = TRUE
)
#>  Wilcoxon rank sum test with continuity correction
#> data:  irisVe and irisVi
#> W = 49, p-value < 2.2e-16
#> alternative hypothesis: true location shift is not equal to 0

Mann-Whitney U test

The Mann-Whitney test is computed as follows:

  1. Compare each \(y_i\) with each \(x_j\).
    Let \(u_y\) be the number of pairs in which \(y_i > x_j\) and \(u_x\) be the number of pairs in which \(y_i < x_j\) (assume there are no ties). There are \(n_y n_x\) such comparisons and \(u_y + u_x = n_y n_x\).
  2. Reject \(H_0\) if \(u_y\) is large (or \(u_x\) is small)

Mann-Whitney U test and Wilcoxon rank test are related:
\[ u_y = w_y - n_y(n_y+1) /2 \\ u_x = w_x - n_x(n_x +1)/2 \]

An \(\alpha\)-level test rejects \(H_0\) if \(u_y \ge u_{n_y,n_x,\alpha}\), where \(u_{n_y,n_x,\alpha}\) is the upper \(\alpha\) critical point of the null distribution of the random variable, U.

The p-value is defined to be \(P(U \ge u_y) = P(U \le u_x)\). One advantage of the Mann-Whitney U test is that we can use either \(u_y\) or \(u_x\) to carry out the test.

For large \(n_y\) and \(n_x\), the null distribution of U can be well approximated by a normal distribution with mean \(E(U) = n_y n_x /2\) and variance \(var(U) = n_y n_x (n+1)/12\). A large sample z-test can be based on the statistic:

\[ z = \frac{u_y - n_y n_x /2 -1/2}{\sqrt{n_y n_x (n+1)/12}} \]

The test rejects \(H_0\) at level \(\alpha\) if \(z \ge z_{\alpha}\) or if \(u_y \ge u_{n_y,n_x,\alpha}\) where

\[ u_{n_y, n_x, \alpha} \approx n_y n_x /2 + 1/2 + z_{\alpha}\sqrt{n_y n_x (n+1)/12} \]

For the two-sided test, we use the test statistics \(u_{max} = \max(u_y,u_x)\) and \(u_{min} = \min(u_y, u_x)\), and the p-value is given by

\[ p\text{-value} = 2P(U \ge u_{max}) = 2P(U \le u_{min}) \]

If there are ties (\(y_i = x_j\)), we count 1/2 towards both \(u_y\) and \(u_x\). The sampling distribution is then no longer exactly the same, but the large-sample approximation is still reasonable.
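The rank-sum and U relations above can be checked numerically (made-up data; note that recent scipy versions report the U statistic for the first argument of mannwhitneyu, so we check either orientation):

```python
# Wilcoxon rank sums and the Mann-Whitney U relation; data are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
y = rng.normal(1.0, 1.0, size=12)
x = rng.normal(0.0, 1.0, size=15)
ny, nx = len(y), len(x)

ranks = stats.rankdata(np.concatenate([y, x]))  # rank all n = ny + nx values
w_y = ranks[:ny].sum()                          # rank sum of the y's
w_x = ranks[ny:].sum()                          # rank sum of the x's

u_y = w_y - ny * (ny + 1) / 2                   # u_y from the rank sum
u_x = w_x - nx * (nx + 1) / 2                   # u_x from the rank sum

u_ref, p_value = stats.mannwhitneyu(y, x, alternative="two-sided")
```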


Satterthwaite, Franklin E. 1946. “An Approximate Distribution of Estimates of Variance Components.” Biometrics Bulletin 2 (6): 110–14.