4.4 Two-Sample Inference

4.4.1 For Means

Suppose we have two sets of observations:

  • $y_1, \ldots, y_{n_y}$
  • $x_1, \ldots, x_{n_x}$

These are random samples from two independent populations with means $\mu_y$ and $\mu_x$ and variances $\sigma_y^2$ and $\sigma_x^2$. Our goal is to compare $\mu_y$ and $\mu_x$, or to test whether $\sigma_y^2 = \sigma_x^2$.


4.4.1.1 Large Sample Tests

If $n_y$ and $n_x$ are large ($\ge 30$), the Central Limit Theorem implies that $\bar{y} - \bar{x}$ is approximately normal, with:

  • Expectation: $E(\bar{y} - \bar{x}) = \mu_y - \mu_x$
  • Variance: $\mathrm{Var}(\bar{y} - \bar{x}) = \frac{\sigma_y^2}{n_y} + \frac{\sigma_x^2}{n_x}$

The test statistic is:

$$Z = \frac{\bar{y} - \bar{x} - (\mu_y - \mu_x)}{\sqrt{\frac{\sigma_y^2}{n_y} + \frac{\sigma_x^2}{n_x}}} \sim N(0, 1)$$

For large samples, the variances can be replaced by their unbiased estimators $s_y^2$ and $s_x^2$, yielding the same large-sample distribution.

Confidence Interval

An approximate $100(1-\alpha)\%$ confidence interval for $\mu_y - \mu_x$ is:

$$\bar{y} - \bar{x} \pm z_{\alpha/2} \sqrt{\frac{s_y^2}{n_y} + \frac{s_x^2}{n_x}}$$
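As a quick sketch of this interval in R (the simulated samples and their sizes are illustrative assumptions, not data from the text):

```r
# Hypothetical samples, large enough for the normal approximation
set.seed(1)
y <- rnorm(40, mean = 12, sd = 3)
x <- rnorm(35, mean = 10, sd = 4)

# Approximate 95% CI for mu_y - mu_x
alpha <- 0.05
diff_means <- mean(y) - mean(x)
se <- sqrt(var(y) / length(y) + var(x) / length(x))
ci <- diff_means + c(-1, 1) * qnorm(1 - alpha / 2) * se
ci
```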

Hypothesis Test

Testing:

$$H_0: \mu_y - \mu_x = \delta_0 \quad \text{vs.} \quad H_a: \mu_y - \mu_x \ne \delta_0$$

The test statistic:

$$z = \frac{\bar{y} - \bar{x} - \delta_0}{\sqrt{\frac{s_y^2}{n_y} + \frac{s_x^2}{n_x}}}$$

Reject $H_0$ at the $\alpha$-level if:

$$|z| > z_{\alpha/2}$$

If $\delta_0 = 0$, this tests whether the two means are equal.

# Large sample test
y <- c(10, 12, 14, 16, 18)
x <- c(9, 11, 13, 15, 17)

# Mean and variance
mean_y <- mean(y)
mean_x <- mean(x)
var_y <- var(y)
var_x <- var(x)
n_y <- length(y)
n_x <- length(x)

# Test statistic
z <- (mean_y - mean_x) / sqrt(var_y / n_y + var_x / n_x)
p_value <- 2 * (1 - pnorm(abs(z)))

list(z = z, p_value = p_value)
#> $z
#> [1] 0.5
#> 
#> $p_value
#> [1] 0.6170751

4.4.1.2 Small Sample Tests

If the samples are small, assume the data come from independent normal distributions:

  • $y_i \sim N(\mu_y, \sigma_y^2)$

  • $x_i \sim N(\mu_x, \sigma_x^2)$

We base inference on the Student's t distribution, with two cases depending on whether the variances are equal. The assumptions, and ways to check them, are:

Assumption                                                      Tests                                           Plots
Independent and Identically Distributed (i.i.d.) Observations   Test for serial correlation
Independence Between Samples                                    Correlation coefficient                         Scatterplot
Normality                                                       See Normality Assessment                        See Normality Assessment
Equality of Variances                                           1. F-Test                                       1. Boxplots with overlaid means
                                                                2. Levene's Test                                2. Residual spread plots
                                                                3. Modified Levene Test (Brown-Forsythe Test)
                                                                4. Bartlett's Test
4.4.1.2.1 Equal Variances

Assumptions

  1. Independent and Identically Distributed (i.i.d.) Observations

Assume that observations in each sample are i.i.d., which implies:

$$\mathrm{var}(\bar{y}) = \frac{\sigma_y^2}{n_y}, \quad \mathrm{var}(\bar{x}) = \frac{\sigma_x^2}{n_x}$$

  2. Independence Between Samples

The samples are assumed to be independent, meaning no observation from one sample influences observations from the other. This independence allows us to write:

$$\mathrm{var}(\bar{y} - \bar{x}) = \mathrm{var}(\bar{y}) + \mathrm{var}(\bar{x}) - 2\,\mathrm{cov}(\bar{y}, \bar{x}) = \mathrm{var}(\bar{y}) + \mathrm{var}(\bar{x}) = \frac{\sigma_y^2}{n_y} + \frac{\sigma_x^2}{n_x}$$

This calculation uses $\mathrm{cov}(\bar{y}, \bar{x}) = 0$, which follows from the independence between the samples.

  3. Normality Assumption

We assume that the underlying populations are normally distributed. This assumption justifies the use of the Student’s T Distribution, which is critical for hypothesis testing and constructing confidence intervals.

  4. Equality of Variances

If the population variances are equal, i.e., $\sigma_y^2 = \sigma_x^2 = \sigma^2$, then $s_y^2$ and $s_x^2$ are both unbiased estimators of $\sigma^2$. This allows us to pool the variances.

The pooled variance estimator is calculated as:

$$s^2 = \frac{(n_y - 1)s_y^2 + (n_x - 1)s_x^2}{(n_y - 1) + (n_x - 1)}$$

The pooled variance estimate has degrees of freedom equal to:

$$df = n_y + n_x - 2$$

Test Statistic

The test statistic is:

$$T = \frac{\bar{y} - \bar{x} - (\mu_y - \mu_x)}{s\sqrt{\frac{1}{n_y} + \frac{1}{n_x}}} \sim t_{n_y + n_x - 2}$$

Confidence Interval

A $100(1-\alpha)\%$ confidence interval for $\mu_y - \mu_x$ is:

$$\bar{y} - \bar{x} \pm t_{n_y + n_x - 2,\, \alpha/2}\, s \sqrt{\frac{1}{n_y} + \frac{1}{n_x}}$$

Hypothesis Test

Testing:

$$H_0: \mu_y - \mu_x = \delta_0 \quad \text{vs.} \quad H_a: \mu_y - \mu_x \ne \delta_0$$

Reject $H_0$ if:

$$|T| > t_{n_y + n_x - 2,\, \alpha/2}$$

# Small sample test with equal variance
t_test_equal <- t.test(y, x, var.equal = TRUE)
t_test_equal
#> 
#>  Two Sample t-test
#> 
#> data:  y and x
#> t = 0.5, df = 8, p-value = 0.6305
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -3.612008  5.612008
#> sample estimates:
#> mean of x mean of y 
#>        14        13
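As a sanity check, the pooled-variance formulas above can be applied by hand to the same small vectors; the statistic, degrees of freedom, and p-value should match the t.test() output:

```r
y <- c(10, 12, 14, 16, 18)
x <- c(9, 11, 13, 15, 17)
n_y <- length(y); n_x <- length(x)

# Pooled variance and t statistic from the formulas above
s2 <- ((n_y - 1) * var(y) + (n_x - 1) * var(x)) / (n_y + n_x - 2)
t_stat <- (mean(y) - mean(x)) / sqrt(s2 * (1 / n_y + 1 / n_x))
df <- n_y + n_x - 2
p_val <- 2 * pt(-abs(t_stat), df)
c(t = t_stat, df = df, p = p_val)   # t = 0.5, df = 8, p ~ 0.6305
```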

4.4.1.2.2 Unequal Variances

Assumptions

  1. Independent and Identically Distributed (i.i.d.) Observations

Assume that observations in each sample are i.i.d., which implies:

$$\mathrm{var}(\bar{y}) = \frac{\sigma_y^2}{n_y}, \quad \mathrm{var}(\bar{x}) = \frac{\sigma_x^2}{n_x}$$

  2. Independence Between Samples

The samples are assumed to be independent, meaning no observation from one sample influences observations from the other. This independence allows us to write:

$$\mathrm{var}(\bar{y} - \bar{x}) = \mathrm{var}(\bar{y}) + \mathrm{var}(\bar{x}) - 2\,\mathrm{cov}(\bar{y}, \bar{x}) = \mathrm{var}(\bar{y}) + \mathrm{var}(\bar{x}) = \frac{\sigma_y^2}{n_y} + \frac{\sigma_x^2}{n_x}$$

This calculation uses $\mathrm{cov}(\bar{y}, \bar{x}) = 0$, which follows from the independence between the samples.

  3. Normality Assumption

We assume that the underlying populations are normally distributed. This assumption justifies the use of the Student’s T Distribution, which is critical for hypothesis testing and constructing confidence intervals.

  4. Unequal Variances

$$\sigma_y^2 \ne \sigma_x^2$$

Test Statistic

The test statistic is:

$$T = \frac{\bar{y} - \bar{x} - (\mu_y - \mu_x)}{\sqrt{\frac{s_y^2}{n_y} + \frac{s_x^2}{n_x}}}$$

Degrees of Freedom (Welch-Satterthwaite Approximation) (Satterthwaite 1946)

The degrees of freedom are approximated by:

$$v = \frac{\left(\frac{s_y^2}{n_y} + \frac{s_x^2}{n_x}\right)^2}{\frac{(s_y^2/n_y)^2}{n_y - 1} + \frac{(s_x^2/n_x)^2}{n_x - 1}}$$

Since $v$ is generally fractional, truncate it to an integer when using t-tables; software such as R's t.test uses the fractional value directly.

Confidence Interval

A $100(1-\alpha)\%$ confidence interval for $\mu_y - \mu_x$ is:

$$\bar{y} - \bar{x} \pm t_{v, \alpha/2} \sqrt{\frac{s_y^2}{n_y} + \frac{s_x^2}{n_x}}$$

Hypothesis Test

Testing:

$$H_0: \mu_y - \mu_x = \delta_0 \quad \text{vs.} \quad H_a: \mu_y - \mu_x \ne \delta_0$$

Reject H0 if:

$$|T| > t_{v, \alpha/2}$$

where

$$T = \frac{\bar{y} - \bar{x} - \delta_0}{\sqrt{s_y^2/n_y + s_x^2/n_x}}$$

# Small sample test with unequal variance
t_test_unequal <- t.test(y, x, var.equal = FALSE)
t_test_unequal
#> 
#>  Welch Two Sample t-test
#> 
#> data:  y and x
#> t = 0.5, df = 8, p-value = 0.6305
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -3.612008  5.612008
#> sample estimates:
#> mean of x mean of y 
#>        14        13
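For these particular vectors the Welch-Satterthwaite degrees of freedom work out to exactly 8 (equal variances, equal sample sizes), which is why the Welch output matches the pooled test. Computing the statistic and $v$ by hand:

```r
y <- c(10, 12, 14, 16, 18)
x <- c(9, 11, 13, 15, 17)
vy <- var(y) / length(y)   # s_y^2 / n_y
vx <- var(x) / length(x)   # s_x^2 / n_x

# Welch statistic and Welch-Satterthwaite df from the formulas above
t_stat <- (mean(y) - mean(x)) / sqrt(vy + vx)
v <- (vy + vx)^2 / (vy^2 / (length(y) - 1) + vx^2 / (length(x) - 1))
c(t = t_stat, df = v)   # t = 0.5, df = 8
```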

4.4.2 For Variances

To compare the variances of two independent samples, we can use the F-test. The test statistic is defined as:

$$F_{ndf,\, ddf} = \frac{s_1^2}{s_2^2}$$

where $s_1^2 > s_2^2$, and $ndf = n_1 - 1$ and $ddf = n_2 - 1$ are the numerator and denominator degrees of freedom, respectively.


4.4.2.1 F-Test

The hypotheses for the F-test are:

$$H_0: \sigma_y^2 = \sigma_x^2 \;\;\text{(equal variances)} \quad \text{vs.} \quad H_a: \sigma_y^2 \ne \sigma_x^2 \;\;\text{(unequal variances)}$$

The test statistic is:

$$F = \frac{s_y^2}{s_x^2}$$

where $s_y^2$ and $s_x^2$ are the sample variances of the two groups.

Decision Rule

Reject H0 if:

  • $F > F_{n_y - 1,\, n_x - 1,\, \alpha/2}$ (upper critical value), or

  • $F < F_{n_y - 1,\, n_x - 1,\, 1 - \alpha/2}$ (lower critical value).

Here:

  • $F_{n_y - 1,\, n_x - 1,\, \alpha/2}$ and $F_{n_y - 1,\, n_x - 1,\, 1 - \alpha/2}$ are critical points of the F-distribution with $n_y - 1$ and $n_x - 1$ degrees of freedom.

Assumptions

  • The F-test requires that the data in both groups follow a normal distribution.
  • The F-test is sensitive to deviations from normality (e.g., heavy-tailed distributions). If the normality assumption is violated, it may lead to an inflated Type I error rate (false positives).

Limitations and Alternatives

  1. Sensitivity to Non-Normality:
    • When data have long-tailed distributions (positive kurtosis), the F-test may produce misleading results.
    • To assess normality, see Normality Assessment.
  2. Robust Alternatives:
    • If the normality assumption is in doubt, consider robust procedures such as Levene's Test or the Modified Levene Test (Brown-Forsythe Test), covered below.
# Load iris dataset
data(iris)

# Subset data for two species
irisVe <- iris$Petal.Width[iris$Species == "versicolor"]
irisVi <- iris$Petal.Width[iris$Species == "virginica"]

# Perform F-test
f_test <- var.test(irisVe, irisVi)

# Display results
f_test
#> 
#>  F test to compare two variances
#> 
#> data:  irisVe and irisVi
#> F = 0.51842, num df = 49, denom df = 49, p-value = 0.02335
#> alternative hypothesis: true ratio of variances is not equal to 1
#> 95 percent confidence interval:
#>  0.2941935 0.9135614
#> sample estimates:
#> ratio of variances 
#>          0.5184243

4.4.2.2 Levene’s Test

Levene’s Test is a robust method for testing the equality of variances across multiple groups. Unlike the F-test, it is less sensitive to departures from normality and is particularly useful for handling non-normal distributions and datasets with outliers. The test works by analyzing the deviations of individual observations from their group mean or median.

Test Procedure

  1. Compute the absolute deviations of each observation from its group mean or median:
    • For group y: $d_{y,i} = |y_i - \text{Central Value}_y|$
    • For group x: $d_{x,j} = |x_j - \text{Central Value}_x|$
    • The "central value" can be either the mean (classic Levene's test) or the median (the Modified Levene Test (Brown-Forsythe Test) variation, which is more robust for non-normal data).
  2. Perform a one-way ANOVA on the absolute deviations to test for differences in group variances.

Hypotheses

  • Null Hypothesis (H0): All groups have equal variances.
  • Alternative Hypothesis (Ha): At least one group has a variance different from the others.

Test Statistic

The Levene test statistic is calculated as an ANOVA on the absolute deviations. Let:

  • $k$: Number of groups,

  • $n_i$: Number of observations in group $i$,

  • $n$: Total number of observations.

The test statistic is:

$$W = \frac{(n - k) \sum_{i=1}^{k} n_i (\bar{d}_i - \bar{d})^2}{(k - 1) \sum_{i=1}^{k} \sum_{j=1}^{n_i} (d_{i,j} - \bar{d}_i)^2}$$

where:

  • $d_{i,j}$: Absolute deviations within group $i$,

  • $\bar{d}_i$: Mean of the absolute deviations for group $i$,

  • $\bar{d}$: Overall mean of the absolute deviations.

Under the null hypothesis, $W \sim F_{k-1,\, n-k}$.

Decision Rule

  • Compute the test statistic $W$.
  • Reject $H_0$ at significance level $\alpha$ if: $W > F_{k-1,\, n-k,\, \alpha}$
# Load required package
library(car)

# Perform Levene's Test; note that car::leveneTest defaults to
# center = median (the Brown-Forsythe variant), not the mean
levene_test_mean <- leveneTest(Petal.Width ~ Species, data = iris)

# Perform Levene's Test with center = median made explicit (same result)
levene_test_median <-
    leveneTest(Petal.Width ~ Species, data = iris, center = median)

# Display results
levene_test_mean
#> Levene's Test for Homogeneity of Variance (center = median)
#>        Df F value    Pr(>F)    
#> group   2  19.892 2.261e-08 ***
#>       147                      
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
levene_test_median
#> Levene's Test for Homogeneity of Variance (center = median)
#>        Df F value    Pr(>F)    
#> group   2  19.892 2.261e-08 ***
#>       147                      
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output includes:

  • Df: Degrees of freedom for the numerator and denominator.

  • F-value: The computed value of the test statistic W.

  • p-value: The probability of observing such a value under the null hypothesis.

  • If the p-value is less than α, reject H0 and conclude that the group variances are significantly different.

  • Otherwise, fail to reject H0 and conclude there is no evidence of a difference in variances.
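Because Levene's procedure is just a one-way ANOVA on the absolute deviations, the F value above can be reproduced without the car package (using the median as the central value, matching car's default):

```r
data(iris)

# Absolute deviations from each group's median
d <- abs(iris$Petal.Width -
             ave(iris$Petal.Width, iris$Species, FUN = median))

# One-way ANOVA on the deviations reproduces the Levene/Brown-Forsythe F
iris_d <- data.frame(d = d, Species = iris$Species)
anova_table <- anova(lm(d ~ Species, data = iris_d))
anova_table[["F value"]][1]   # ~ 19.892, matching leveneTest
```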

Advantages of Levene’s Test

  1. Robustness:

    • Handles non-normal data and outliers better than the F-test.
  2. Flexibility:

    • By choosing the center value (mean or median), it can adapt to different data characteristics:

      • Use the mean for symmetric distributions.

      • Use the median for non-normal or skewed data.

  3. Versatility:

    • Extends naturally to comparisons of more than two groups.

4.4.2.3 Modified Levene Test (Brown-Forsythe Test)

The Modified Levene Test is a robust alternative to the F-test for comparing variances between two groups. Instead of using squared deviations (as in the F-test), this test considers the absolute deviations from the median, making it less sensitive to non-normal data and long-tailed distributions. It is, however, still appropriate for normally distributed data.

For each sample, compute the absolute deviations from the median:

$$d_{y,i} = |y_i - y_{.5}| \quad \text{and} \quad d_{x,i} = |x_i - x_{.5}|$$

where $y_{.5}$ and $x_{.5}$ denote the sample medians.

Let:

  • $\bar{d}_y$ and $\bar{d}_x$ be the means of the absolute deviations for groups $y$ and $x$, respectively.

The test statistic is:

$$t_L = \frac{\bar{d}_y - \bar{d}_x}{s\sqrt{\frac{1}{n_y} + \frac{1}{n_x}}}$$

where the pooled variance $s^2$ is:

$$s^2 = \frac{\sum_{i=1}^{n_y} (d_{y,i} - \bar{d}_y)^2 + \sum_{j=1}^{n_x} (d_{x,j} - \bar{d}_x)^2}{n_y + n_x - 2}$$

Assumptions

  1. Constant Variance of Error Terms:
    The test assumes equal error variances in each group under the null hypothesis.

  2. Moderate Sample Size:
    The approximation $t_L \sim t_{n_y + n_x - 2}$ holds well for moderate or large sample sizes.

Decision Rule

  • Compute $t_L$ using the formula above.
  • Reject the null hypothesis of equal variances if: $|t_L| > t_{n_y + n_x - 2;\, \alpha/2}$

This is equivalent to applying a two-sample t-test to the absolute deviations.

# Absolute deviations from the median
dVe <- abs(irisVe - median(irisVe))
dVi <- abs(irisVi - median(irisVi))

# Perform t-test on absolute deviations
levene_test <- t.test(dVe, dVi, var.equal = TRUE)

# Display results
levene_test
#> 
#>  Two Sample t-test
#> 
#> data:  dVe and dVi
#> t = -2.5584, df = 98, p-value = 0.01205
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -0.12784786 -0.01615214
#> sample estimates:
#> mean of x mean of y 
#>     0.154     0.226

For small sample sizes, use the unequal variance t-test directly on the original data as a robust alternative:

# Small sample t-test with unequal variances
small_sample_test <- t.test(irisVe, irisVi, var.equal = FALSE)

# Display results
small_sample_test
#> 
#>  Welch Two Sample t-test
#> 
#> data:  irisVe and irisVi
#> t = -14.625, df = 89.043, p-value < 2.2e-16
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -0.7951002 -0.6048998
#> sample estimates:
#> mean of x mean of y 
#>     1.326     2.026

4.4.2.4 Bartlett’s Test

The Bartlett’s Test is a statistical procedure for testing the equality of variances across multiple groups. It assumes that the data in each group are normally distributed and is sensitive to deviations from normality. When the assumption of normality holds, Bartlett’s Test is more powerful than Levene’s Test.

Hypotheses for Bartlett’s Test

  • Null Hypothesis (H0): All groups have equal variances.
  • Alternative Hypothesis (Ha): At least one group has a variance different from the others.

The test statistic for Bartlett’s Test is:

$$B = \frac{(n - k)\log(S_p^2) - \sum_{i=1}^{k} (n_i - 1)\log(S_i^2)}{1 + \frac{1}{3(k-1)}\left(\sum_{i=1}^{k} \frac{1}{n_i - 1} - \frac{1}{n - k}\right)}$$

Where:

  • $k$: Number of groups,

  • $n_i$: Number of observations in group $i$,

  • $n = \sum_{i=1}^{k} n_i$: Total number of observations,

  • $S_i^2$: Sample variance of group $i$,

  • $S_p^2$: Pooled variance, given by: $S_p^2 = \frac{\sum_{i=1}^{k} (n_i - 1) S_i^2}{n - k}$

Under the null hypothesis, the test statistic $B \sim \chi^2_{k-1}$.

Assumptions

  1. Normality: The data in each group must follow a normal distribution.
  2. Independence: Observations within and between groups must be independent.
  3. Equal Sample Sizes (Optional): Bartlett’s Test is more robust if sample sizes are approximately equal.

Decision Rule

  • Compute the test statistic B.
  • Compare $B$ to the critical value of the Chi-Square distribution with $k - 1$ degrees of freedom at level $\alpha$.
  • Reject $H_0$ if: $B > \chi^2_{k-1,\, \alpha}$

Alternatively, use the p-value:

  • Reject $H_0$ if the p-value $\le \alpha$.
# Perform Bartlett's Test
bartlett_test <- bartlett.test(Petal.Width ~ Species, data = iris)

# Display results
bartlett_test
#> 
#>  Bartlett test of homogeneity of variances
#> 
#> data:  Petal.Width by Species
#> Bartlett's K-squared = 39.213, df = 2, p-value = 3.055e-09

The output includes:

  • Bartlett’s K-squared: The value of the test statistic B.

  • df: Degrees of freedom ($k - 1$), where $k$ is the number of groups.

  • p-value: The probability of observing such a value of B under H0.

  • If the p-value is less than α, reject H0 and conclude that the variances are significantly different across groups.

  • If the p-value is greater than α, fail to reject H0 and conclude that there is no significant evidence of variance differences.
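The K-squared value reported above can be verified by evaluating the formula for $B$ directly:

```r
data(iris)
groups <- split(iris$Petal.Width, iris$Species)
ni  <- sapply(groups, length)   # group sizes
si2 <- sapply(groups, var)      # group variances
k <- length(groups)
n <- sum(ni)

# Pooled variance, then numerator and correction term of B
sp2 <- sum((ni - 1) * si2) / (n - k)
num <- (n - k) * log(sp2) - sum((ni - 1) * log(si2))
den <- 1 + (sum(1 / (ni - 1)) - 1 / (n - k)) / (3 * (k - 1))
B <- num / den
B   # ~ 39.213, matching bartlett.test's K-squared
```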

Limitations of Bartlett’s Test

  1. Sensitivity to Non-Normality:
    Bartlett’s Test is highly sensitive to departures from normality. Even slight deviations can lead to misleading results.

  2. Not Robust to Outliers:
    Outliers can disproportionately affect the test result.

  3. Alternatives:
    If the normality assumption is violated, use robust alternatives such as Levene's Test or the Modified Levene Test (Brown-Forsythe Test).
Advantages of Bartlett’s Test

  • High Power: Bartlett’s Test is more powerful than robust alternatives when the normality assumption holds.

  • Simple Implementation: The test is easy to perform and interpret.

4.4.3 Power

To evaluate the power of a test, we consider the situation where the variances are equal across groups:

$$\sigma_y^2 = \sigma_x^2 = \sigma^2$$

Under the assumption of equal variances, we take equal sample sizes from both groups, i.e., $n_y = n_x = n$.

Hypotheses for One-Sided Testing

We are testing:

$$H_0: \mu_y - \mu_x \le 0 \quad \text{vs.} \quad H_a: \mu_y - \mu_x > 0$$

Test Statistic

The $\alpha$-level z-test rejects $H_0$ if the test statistic:

$$z = \frac{\bar{y} - \bar{x}}{\sigma\sqrt{\frac{2}{n}}} > z_\alpha$$

where:

  • $\bar{y}$ and $\bar{x}$ are the sample means,

  • $\sigma$ is the common standard deviation,

  • $z_\alpha$ is the critical value from the standard normal distribution.

Power Function

The power of the test, denoted $\pi(\mu_y - \mu_x)$, is the probability of correctly rejecting $H_0$ when $\mu_y - \mu_x$ takes some specified value. Under the alternative hypothesis, the power function is:

$$\pi(\mu_y - \mu_x) = \Phi\left(-z_\alpha + \frac{\mu_y - \mu_x}{\sigma}\sqrt{\frac{n}{2}}\right)$$

where:

  • $\Phi$ is the cumulative distribution function (CDF) of the standard normal distribution,

  • $\frac{\mu_y - \mu_x}{\sigma}\sqrt{\frac{n}{2}}$ represents the standardized effect size.
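As a sketch, the power function can be evaluated directly; the per-group size $n = 50$, $\sigma = 1$, and true difference $0.5$ below are illustrative assumptions:

```r
alpha <- 0.05
sigma <- 1
n     <- 50      # per-group sample size (hypothetical)
delta <- 0.5     # assumed true difference mu_y - mu_x

# Power of the one-sided z-test at this alternative
power <- pnorm(-qnorm(1 - alpha) + (delta / sigma) * sqrt(n / 2))
power   # about 0.80
```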

Determining the Required Sample Size

To achieve a desired power of $1 - \beta$ when the true difference is $\delta$ (the smallest difference of interest), we solve for the required sample size $n$. The power equation is:

$$\Phi\left(-z_\alpha + \frac{\delta}{\sigma}\sqrt{\frac{n}{2}}\right) = 1 - \beta$$

Rearranging for $n$, the required sample size is:

$$n = \frac{2\sigma^2}{\delta^2}(z_\alpha + z_\beta)^2$$

where:

  • $\sigma$: The common standard deviation,

  • $z_\alpha$: The critical value for the Type I error rate $\alpha$ (one-sided test),

  • $z_\beta$: The critical value for the Type II error rate $\beta$ (related to power $1 - \beta$),

  • $\delta$: The minimum detectable difference between the means.

# Parameters
alpha <- 0.05   # Significance level
beta <- 0.2     # Type II error rate (1 - Power = 0.2)
sigma <- 1      # Common standard deviation
delta <- 0.5    # Minimum detectable difference

# Critical values
z_alpha <- qnorm(1 - alpha)
z_beta <- qnorm(1 - beta)

# Sample size calculation
n <- (2 * sigma ^ 2 * (z_alpha + z_beta) ^ 2) / delta ^ 2

# Output the required sample size (per group)
ceiling(n)
#> [1] 50

Sample Size for Two-Sided Tests

For a two-sided test, replace $z_\alpha$ with $z_{\alpha/2}$ to account for the two-tailed critical region:

$$n = 2\left(\frac{\sigma(z_{\alpha/2} + z_\beta)}{\delta}\right)^2$$

This ensures that the test has the required power $1 - \beta$ to detect a difference of size $\delta$ between the means at significance level $\alpha$.

Adjustment for the Exact t-Test

When conducting an exact two-sample t-test for small sample sizes, the sample size calculation involves the non-central t-distribution. An approximate correction can be applied using the critical values from the t-distribution instead of the z-distribution.

The adjusted sample size is:

$$n = 2\left(\frac{\sigma\,(t_{2n-2;\alpha/2} + t_{2n-2;\beta})}{\delta}\right)^2$$

Where:

  • $t_{2n-2;\alpha/2}$: The critical value of the t-distribution with $2n - 2$ degrees of freedom for significance level $\alpha/2$,

  • $t_{2n-2;\beta}$: The critical value of the t-distribution with $2n - 2$ degrees of freedom for power $1 - \beta$.

Because $n$ appears on both sides, this equation is solved iteratively, starting from the z-based value of $n$.

This correction adjusts for the increased variability of the t-distribution, especially important for small sample sizes.

# Parameters
alpha <- 0.05    # Significance level
power <- 0.8     # Desired power
sigma <- 1       # Common standard deviation
delta <- 0.5     # Minimum detectable difference

# Calculate sample size for two-sided test
sample_size <-
    power.t.test(
        delta = delta,
        sd = sigma,
        sig.level = alpha,
        power = power,
        type = "two.sample",
        alternative = "two.sided"
    )

# Display results
sample_size
#> 
#>      Two-sample t test power calculation 
#> 
#>               n = 63.76576
#>           delta = 0.5
#>              sd = 1
#>       sig.level = 0.05
#>           power = 0.8
#>     alternative = two.sided
#> 
#> NOTE: n is number in *each* group

Key Insights

  1. Z-Test vs. T-Test:
    For large samples, the normal approximation (z-test) works well. For small samples, the t-test correction using the t-distribution is essential.

  2. Effect of Power and Significance Level:

    • Increasing power ($1 - \beta$) or decreasing $\alpha$ requires larger sample sizes.

    • A smaller minimum detectable difference ($\delta$) also requires a larger sample size.

  3. Two-Sided Tests:
    Two-sided tests require larger sample sizes compared to one-sided tests due to the split critical region.

Formula Summary

Test Type            Formula for Sample Size
One-Sided Test       $n = 2\left(\frac{\sigma(z_\alpha + z_\beta)}{\delta}\right)^2$
Two-Sided Test       $n = 2\left(\frac{\sigma(z_{\alpha/2} + z_\beta)}{\delta}\right)^2$
Approximate t-Test   $n = 2\left(\frac{\sigma(t_{2n-2;\alpha/2} + t_{2n-2;\beta})}{\delta}\right)^2$

4.4.4 Matched Pair Designs

In matched pair designs, two treatments are compared by measuring responses for the same subjects under both treatments. This ensures that the effects of subject-to-subject variability are minimized, as each subject serves as their own control.


We have two treatments, and the data are structured as follows:

Subject   Treatment A   Treatment B   Difference
1         $y_1$         $x_1$         $d_1 = y_1 - x_1$
2         $y_2$         $x_2$         $d_2 = y_2 - x_2$
...       ...           ...           ...
n         $y_n$         $x_n$         $d_n = y_n - x_n$

Here:

  • $y_i$ represents the observation under Treatment A,

  • $x_i$ represents the observation under Treatment B,

  • $d_i = y_i - x_i$ is the difference for subject $i$.

Assumptions

  1. Observations $y_i$ and $x_i$ are measured on the same subjects, inducing correlation.
  2. The differences $d_i$ are independent and identically distributed (i.i.d.) and follow a normal distribution: $d_i \sim N(\mu_D, \sigma_D^2)$

Mean and Variance of the Difference

The mean difference $\mu_D$ and the variance $\sigma_D^2$ are given by:

$$\mu_D = E(y_i - x_i) = \mu_y - \mu_x$$

$$\sigma_D^2 = \mathrm{Var}(y_i - x_i) = \mathrm{Var}(y_i) + \mathrm{Var}(x_i) - 2\,\mathrm{Cov}(y_i, x_i)$$

  • If the covariance between $y_i$ and $x_i$ is positive (the typical case), the variance of the differences $\sigma_D^2$ is reduced compared to the independent-sample case.
  • This is the key benefit of Matched Pair Designs: reduced variability increases the precision of estimates.
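A small simulation illustrates the reduction; the shared subject-effect model below is a hypothetical sketch, not part of the original example:

```r
set.seed(42)
subject_effect <- rnorm(100, mean = 50, sd = 10)    # shared per-subject level
y <- subject_effect + rnorm(100, mean = 2, sd = 2)  # Treatment A
x <- subject_effect + rnorm(100, mean = 0, sd = 2)  # Treatment B

# Variance of the paired differences vs. the sum for independent samples
var(y - x)        # small: the subject effect cancels
var(y) + var(x)   # far larger: what independence would imply
```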

Sample Statistics

For the differences $d_i = y_i - x_i$:

  • The sample mean of the differences: $\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i = \bar{y} - \bar{x}$

  • The sample variance of the differences: $s_d^2 = \frac{1}{n-1}\sum_{i=1}^{n} (d_i - \bar{d})^2$

Once the data are converted into differences $d_i$, the problem reduces to one-sample inference. We can use tests and confidence intervals (CIs) for the mean of a single sample.

Hypothesis Test

We test the following hypotheses:

$$H_0: \mu_D = 0 \quad \text{vs.} \quad H_a: \mu_D \ne 0$$

The test statistic is:

$$t = \frac{\bar{d}}{s_d/\sqrt{n}} \sim t_{n-1}$$

where $n$ is the number of subjects.

  • Reject $H_0$ at significance level $\alpha$ if: $|t| > t_{n-1,\, \alpha/2}$

Confidence Interval

A $100(1-\alpha)\%$ confidence interval for $\mu_D$ is:

$$\bar{d} \pm t_{n-1,\, \alpha/2} \frac{s_d}{\sqrt{n}}$$

# Sample data
treatment_a <- c(85, 90, 78, 92, 88)
treatment_b <- c(80, 86, 75, 89, 85)

# Compute differences
differences <- treatment_a - treatment_b

# Perform one-sample t-test on the differences
t_test <- t.test(differences, mu = 0, alternative = "two.sided")

# Display results
t_test
#> 
#>  One Sample t-test
#> 
#> data:  differences
#> t = 9, df = 4, p-value = 0.0008438
#> alternative hypothesis: true mean is not equal to 0
#> 95 percent confidence interval:
#>  2.489422 4.710578
#> sample estimates:
#> mean of x 
#>       3.6

The output includes:

  • t-statistic: The calculated test statistic for the matched pairs.

  • p-value: The probability of observing such a difference under the null hypothesis.

  • Confidence Interval: The range of plausible values for the mean difference $\mu_D$.

  • If the p-value is less than α, reject H0 and conclude that there is a significant difference between the two treatments.

  • If the confidence interval does not include 0, this supports the conclusion of a significant difference.

Key Insights

  1. Reduced Variability: Positive correlation between paired observations reduces the variance of the differences, increasing test power.

  2. Use of Differences: The paired design converts the data into a single-sample problem for inference.

  3. Robustness: The paired t-test assumes normality of the differences $d_i$. For larger $n$, the Central Limit Theorem ensures robustness to non-normality.

Matched pair designs are a powerful way to control for subject-specific variability, leading to more precise comparisons between treatments.

4.4.5 Nonparametric Tests for Two Samples

For Matched Pair Designs or independent samples where normality cannot be assumed, we use nonparametric tests. These tests do not assume any specific distribution of the data and are robust alternatives to parametric methods.

Stochastic Order and Location Shift

Suppose $Y$ and $X$ are random variables with cumulative distribution functions (CDFs) $F_Y$ and $F_X$. Then $Y$ is stochastically larger than $X$ if, for all real numbers $u$:

$$P(Y > u) \ge P(X > u) \quad \text{(equivalently, } F_Y(u) \le F_X(u)\text{)}$$

If the two distributions differ only in their location parameters, say $\theta_y$ and $\theta_x$, then $Y$ is stochastically larger than $X$ precisely when $\theta_y > \theta_x$.

We test the following hypotheses:

  • Two-Sided Hypothesis: $H_0: F_Y = F_X \quad \text{vs.} \quad H_a: F_Y \ne F_X$
  • Upper One-Sided Hypothesis: $H_0: F_Y = F_X \quad \text{vs.} \quad H_a: F_Y < F_X$
  • Lower One-Sided Hypothesis: $H_0: F_Y = F_X \quad \text{vs.} \quad H_a: F_Y > F_X$

We generally avoid the completely non-directional alternative $H_a: F_Y \ne F_X$ (with no stochastic-ordering restriction) because it allows arbitrary differences between the distributions, without requiring one distribution to be stochastically larger than the other.

Nonparametric Tests

When the focus is on whether the two distributions differ only in location parameters, two equivalent nonparametric tests are commonly used:

  1. Wilcoxon Rank-Sum Test
  2. Mann-Whitney U Test

Both tests are mathematically equivalent and test whether one sample is systematically larger than the other.


4.4.5.1 Wilcoxon Rank-Sum Test

The Wilcoxon Rank-Sum Test is a nonparametric test used to compare two independent samples to assess whether their distributions differ in location. It is based on the ranks of the combined observations rather than their actual values.

Procedure

  1. Combine and Rank Observations:
    Combine all $n = n_y + n_x$ observations (from both groups) into a single dataset and rank them in ascending order. If ties exist, assign the average rank to tied values.

  2. Calculate Rank Sums:
    Compute the sum of ranks for each group:

    • $w_y$: Sum of the ranks for group $y$ (sample 1),
    • $w_x$: Sum of the ranks for group $x$ (sample 2).
      By definition: $w_y + w_x = \frac{n(n+1)}{2}$
  3. Test Statistic:
    The test focuses on the rank sum $w_y$. Reject $H_0$ if $w_y$ is large (indicating $y$ systematically has larger values) or, equivalently, if $w_x$ is small.

  4. Null Distribution:
    Under $H_0$ (no difference between groups), all possible arrangements of ranks among $y$ and $x$ are equally likely. The total number of possible rank arrangements is:

    $$\frac{(n_y + n_x)!}{n_y!\, n_x!}$$

  5. Computational Considerations:

    • For small samples, the exact null distribution of the rank sums can be calculated.
    • For large samples, an approximate normal distribution can be used.

Hypotheses

  • Null Hypothesis (H0): The two samples come from identical distributions.

  • Alternative Hypothesis (Ha): The two samples come from different distributions, or one distribution is systematically larger.

  • Two-Sided Test: $H_a: F_Y \ne F_X$

  • One-Sided Test: $H_a: F_Y > F_X \quad \text{or} \quad H_a: F_Y < F_X$

# Subset data for two species
irisVe <- iris$Petal.Width[iris$Species == "versicolor"]
irisVi <- iris$Petal.Width[iris$Species == "virginica"]

# Perform Wilcoxon Rank Test (approximate version, large sample)
wilcox_result <- wilcox.test(
    irisVe,
    irisVi,
    alternative = "two.sided", # Two-sided test
    conf.level = 0.95,         # Confidence level
    exact = FALSE,             # Approximate test for large samples
    correct = TRUE             # Apply continuity correction
)

# Display results
wilcox_result
#> 
#>  Wilcoxon rank sum test with continuity correction
#> 
#> data:  irisVe and irisVi
#> W = 49, p-value < 2.2e-16
#> alternative hypothesis: true location shift is not equal to 0

The output of wilcox.test includes:

  • W: The test statistic. For wilcox.test(y, x), R reports the rank sum of the first sample minus its minimum possible value, $w_y - n_y(n_y+1)/2$ (the Mann-Whitney $U$ for the first sample).

  • p-value: The probability of observing such a difference in rank sums under H0.

  • Alternative Hypothesis: Specifies whether the test was one-sided or two-sided.

  • Confidence Interval (if applicable): Provides a range for the difference in medians.

Decision Rule

  1. Reject $H_0$ at significance level $\alpha$ if the p-value $\le \alpha$.

  2. For large samples, compare the test statistic to a critical value from the normal approximation.

Key Features

  1. Robustness:
    The test does not require assumptions of normality and is robust to outliers.

  2. Distribution-Free:
    It evaluates whether two samples differ in location without assuming a specific distribution.

  3. Rank-Based:
    It uses the ranks of the observations, which makes it invariant to monotone transformations of the data.

Computational Considerations

  • For small sample sizes, the exact distribution of the rank sums is used.

  • For large sample sizes, the normal approximation with continuity correction is applied for computational efficiency.
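The rank sums behind the reported W can be computed by hand for the iris example; R's W equals $w_y - n_y(n_y+1)/2$:

```r
irisVe <- iris$Petal.Width[iris$Species == "versicolor"]
irisVi <- iris$Petal.Width[iris$Species == "virginica"]

r <- rank(c(irisVe, irisVi))        # combined-sample ranks (ties averaged)
w_y <- sum(r[seq_along(irisVe)])    # rank sum for versicolor
w_x <- sum(r[-seq_along(irisVe)])   # rank sum for virginica
n <- length(irisVe) + length(irisVi)

w_y + w_x == n * (n + 1) / 2                      # TRUE: ranks sum to n(n+1)/2
w_y - length(irisVe) * (length(irisVe) + 1) / 2   # 49, the W from wilcox.test
```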

4.4.5.2 Mann-Whitney U Test

The Mann-Whitney U Test is a nonparametric test used to compare two independent samples. It evaluates whether one sample tends to produce larger observations than the other, based on pairwise comparisons. The test does not assume normality and is robust to outliers.

Procedure

  1. Pairwise Comparisons:
    Compare each observation $y_i$ from sample $Y$ with each observation $x_j$ from sample $X$.

    • Let $u_y$ be the number of pairs where $y_i > x_j$.
    • Let $u_x$ be the number of pairs where $y_i < x_j$.

    By definition: $u_y + u_x = n_y n_x$, where $n_y$ is the sample size for group $Y$, and $n_x$ is the sample size for group $X$.

  2. Test Statistic:
    Reject $H_0$ if $u_y$ is large (or equivalently, if $u_x$ is small).

    The Mann-Whitney U Test and Wilcoxon Rank-Sum Test are related through the rank sums:

    $$u_y = w_y - \frac{n_y(n_y + 1)}{2}, \quad u_x = w_x - \frac{n_x(n_x + 1)}{2}$$

    Here, $w_y$ and $w_x$ are the rank sums for groups $Y$ and $X$, respectively.

Hypotheses

  • Null Hypothesis (H0): The two samples come from identical distributions.
  • Alternative Hypothesis (Ha):
    • Upper One-Sided: $F_Y < F_X$ (Sample $Y$ is stochastically larger).
    • Lower One-Sided: $F_Y > F_X$ (Sample $X$ is stochastically larger).
    • Two-Sided: $F_Y \ne F_X$ (Distributions differ in location).

Test Statistic for Large Samples

For large sample sizes $n_y$ and $n_x$, the null distribution of $U$ can be approximated by a normal distribution with:

  • Mean: $E(U) = \frac{n_y n_x}{2}$

  • Variance: $\mathrm{Var}(U) = \frac{n_y n_x (n_y + n_x + 1)}{12}$

The standardized test statistic $z$ (with continuity correction) is:

$$z = \frac{u_y - \frac{n_y n_x}{2} - \frac{1}{2}}{\sqrt{\frac{n_y n_x (n_y + n_x + 1)}{12}}}$$

The test rejects H0 at level α if:

$$z \ge z_\alpha \;\;\text{(one-sided)} \quad \text{or} \quad |z| \ge z_{\alpha/2} \;\;\text{(two-sided)}$$

For the two-sided test, we use:

  • $u_{\text{max}} = \max(u_y, u_x)$, and

  • $u_{\text{min}} = \min(u_y, u_x)$.

The p-value is given by:

$$p\text{-value} = 2P(U \ge u_{\text{max}}) = 2P(U \le u_{\text{min}})$$

When $y_i = x_j$ (ties), assign a value of $1/2$ to both $u_y$ and $u_x$ for that pair. While the exact sampling distribution differs slightly when ties exist, the large-sample normal approximation remains reasonable.

# Subset data for two species
irisVe <- iris$Petal.Width[iris$Species == "versicolor"]
irisVi <- iris$Petal.Width[iris$Species == "virginica"]

# Perform Mann-Whitney U Test
mann_whitney <- wilcox.test(
    irisVe, irisVi, 
    alternative = "two.sided", 
    conf.level = 0.95,
    exact = FALSE,   # Approximate test for large samples
    correct = TRUE   # Apply continuity correction
)

# Display results
mann_whitney
#> 
#>  Wilcoxon rank sum test with continuity correction
#> 
#> data:  irisVe and irisVi
#> W = 49, p-value < 2.2e-16
#> alternative hypothesis: true location shift is not equal to 0

Decision Rule

  1. Reject $H_0$ if the p-value is less than $\alpha$.

  2. For large samples, check whether $z \ge z_{\alpha}$ (one-sided) or $|z| \ge z_{\alpha/2}$ (two-sided).

Key Insights

  1. Robustness: The Mann-Whitney U Test does not assume normality and is robust to outliers.

  2. Relationship to Wilcoxon Test: The test is equivalent to the Wilcoxon Rank-Sum Test but formulated differently (based on pairwise comparisons).

  3. Large Sample Approximation: For large $n_y$ and $n_x$, the test statistic $U$ follows an approximate normal distribution, simplifying computation.

  4. Handling Ties: Ties are accounted for by assigning fractional contributions to $u_y$ and $u_x$.
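The pairwise-comparison definition can be checked directly against the rank-based statistic: with ties counted as $1/2$, $u_y$ equals the W reported by wilcox.test above.

```r
irisVe <- iris$Petal.Width[iris$Species == "versicolor"]
irisVi <- iris$Petal.Width[iris$Species == "virginica"]

# Count pairs with y_i > x_j, giving 1/2 credit to tied pairs
cmp <- outer(irisVe, irisVi, ">") + 0.5 * outer(irisVe, irisVi, "==")
u_y <- sum(cmp)
u_x <- length(irisVe) * length(irisVi) - u_y

c(u_y = u_y, u_x = u_x)   # u_y = 49, the W from wilcox.test
```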

References

Satterthwaite, Franklin E. 1946. “An Approximate Distribution of Estimates of Variance Components.” Biometrics Bulletin 2 (6): 110–14.