4.4 Two-Sample Inference
4.4.1 For Means
Suppose we have two sets of observations:
- $y_1, \ldots, y_{n_y}$
- $x_1, \ldots, x_{n_x}$
These are random samples from two independent populations with means $\mu_y$ and $\mu_x$ and variances $\sigma_y^2$ and $\sigma_x^2$. Our goal is to compare $\mu_y$ and $\mu_x$ or test whether $\sigma_y^2 = \sigma_x^2$.
4.4.1.1 Large Sample Tests
If $n_y$ and $n_x$ are large ($\geq 30$), the Central Limit Theorem allows us to make the following assumptions:
- Expectation: $E(\bar{y} - \bar{x}) = \mu_y - \mu_x$
- Variance: $\text{Var}(\bar{y} - \bar{x}) = \dfrac{\sigma_y^2}{n_y} + \dfrac{\sigma_x^2}{n_x}$
The test statistic is:
$$Z = \frac{\bar{y} - \bar{x} - (\mu_y - \mu_x)}{\sqrt{\dfrac{\sigma_y^2}{n_y} + \dfrac{\sigma_x^2}{n_x}}} \sim N(0, 1)$$
For large samples, replace the variances with their unbiased estimators $s_y^2$ and $s_x^2$, yielding the same large-sample distribution.
Confidence Interval
An approximate $100(1-\alpha)\%$ confidence interval for $\mu_y - \mu_x$ is:
$$\bar{y} - \bar{x} \pm z_{\alpha/2} \sqrt{\frac{s_y^2}{n_y} + \frac{s_x^2}{n_x}}$$
Hypothesis Test
Testing:
$$H_0: \mu_y - \mu_x = \delta_0 \quad \text{vs.} \quad H_a: \mu_y - \mu_x \neq \delta_0$$
The test statistic:
$$z = \frac{\bar{y} - \bar{x} - \delta_0}{\sqrt{\dfrac{s_y^2}{n_y} + \dfrac{s_x^2}{n_x}}}$$
Reject $H_0$ at the $\alpha$-level if:
$$|z| > z_{\alpha/2}$$
If $\delta_0 = 0$, this tests whether the two means are equal.
# Large sample test
y <- c(10, 12, 14, 16, 18)
x <- c(9, 11, 13, 15, 17)
# Mean and variance
mean_y <- mean(y)
mean_x <- mean(x)
var_y <- var(y)
var_x <- var(x)
n_y <- length(y)
n_x <- length(x)
# Test statistic
z <- (mean_y - mean_x) / sqrt(var_y / n_y + var_x / n_x)
p_value <- 2 * (1 - pnorm(abs(z)))
list(z = z, p_value = p_value)
#> $z
#> [1] 0.5
#>
#> $p_value
#> [1] 0.6170751
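A confidence interval for the difference can be computed from the same quantities; the sketch below reuses the small illustrative samples from above purely to show the mechanics (a real application of the large-sample formula would need $n \geq 30$ per group):

```r
# Approximate 95% CI for mu_y - mu_x using the large-sample formula
y <- c(10, 12, 14, 16, 18)
x <- c(9, 11, 13, 15, 17)
se <- sqrt(var(y) / length(y) + var(x) / length(x))  # standard error
ci <- (mean(y) - mean(x)) + c(-1, 1) * qnorm(0.975) * se
ci
#> [1] -2.919928  4.919928
```

The interval contains 0, consistent with the non-significant p-value above.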
4.4.1.2 Small Sample Tests
If the samples are small, assume the data come from independent normal distributions:
$$y_i \sim N(\mu_y, \sigma_y^2)$$
$$x_i \sim N(\mu_x, \sigma_x^2)$$
We can base inference on the Student’s T Distribution. There are two cases, depending on whether the population variances are assumed equal; first, check the assumptions below:
Assumption | Tests | Plots |
---|---|---|
Independence and Identically Distributed (i.i.d.) Observations | Test for serial correlation | |
Independence Between Samples | Correlation Coefficient | Scatterplot |
Normality | See Normality Assessment | See Normality Assessment |
Equality of Variances | F-Test, Levene’s Test, Bartlett’s Test | |
4.4.1.2.1 Equal Variances
Assumptions
- Independence and Identically Distributed (i.i.d.) Observations
Assume that observations in each sample are i.i.d., which implies:
$$\text{var}(\bar{y}) = \frac{\sigma_y^2}{n_y}, \quad \text{var}(\bar{x}) = \frac{\sigma_x^2}{n_x}$$
- Independence Between Samples
The samples are assumed to be independent, meaning no observation from one sample influences observations from the other. This independence allows us to write:
$$\text{var}(\bar{y} - \bar{x}) = \text{var}(\bar{y}) + \text{var}(\bar{x}) - 2\,\text{cov}(\bar{y}, \bar{x}) = \text{var}(\bar{y}) + \text{var}(\bar{x}) = \frac{\sigma_y^2}{n_y} + \frac{\sigma_x^2}{n_x}$$
This calculation assumes $\text{cov}(\bar{y}, \bar{x}) = 0$ due to the independence between the samples.
- Normality Assumption
We assume that the underlying populations are normally distributed. This assumption justifies the use of the Student’s T Distribution, which is critical for hypothesis testing and constructing confidence intervals.
- Equality of Variances
If the population variances are equal, i.e., $\sigma_y^2 = \sigma_x^2 = \sigma^2$, then $s_y^2$ and $s_x^2$ are both unbiased estimators of $\sigma^2$. This allows us to pool the variances.
The pooled variance estimator is calculated as:
$$s^2 = \frac{(n_y - 1)s_y^2 + (n_x - 1)s_x^2}{(n_y - 1) + (n_x - 1)}$$
The pooled variance estimate has degrees of freedom equal to:
$$df = n_y + n_x - 2$$
Test Statistic
The test statistic is:
$$T = \frac{\bar{y} - \bar{x} - (\mu_y - \mu_x)}{s\sqrt{\dfrac{1}{n_y} + \dfrac{1}{n_x}}} \sim t_{n_y + n_x - 2}$$
Confidence Interval
A $100(1-\alpha)\%$ confidence interval for $\mu_y - \mu_x$ is:
$$\bar{y} - \bar{x} \pm t_{n_y + n_x - 2, \alpha/2} \cdot s\sqrt{\frac{1}{n_y} + \frac{1}{n_x}}$$
Hypothesis Test
Testing:
$$H_0: \mu_y - \mu_x = \delta_0 \quad \text{vs.} \quad H_a: \mu_y - \mu_x \neq \delta_0$$
Reject $H_0$ if:
$$|T| > t_{n_y + n_x - 2, \alpha/2}$$
# Small sample test with equal variance
t_test_equal <- t.test(y, x, var.equal = TRUE)
t_test_equal
#>
#> Two Sample t-test
#>
#> data: y and x
#> t = 0.5, df = 8, p-value = 0.6305
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -3.612008 5.612008
#> sample estimates:
#> mean of x mean of y
#> 14 13
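The pooled quantities behind `t.test(..., var.equal = TRUE)` can be reproduced by hand; this sketch recomputes the statistic from the formulas above:

```r
y <- c(10, 12, 14, 16, 18)
x <- c(9, 11, 13, 15, 17)
n_y <- length(y)
n_x <- length(x)
# Pooled variance estimator
s2 <- ((n_y - 1) * var(y) + (n_x - 1) * var(x)) / (n_y + n_x - 2)
# Test statistic and two-sided p-value
t_stat <- (mean(y) - mean(x)) / (sqrt(s2) * sqrt(1 / n_y + 1 / n_x))
p_val <- 2 * pt(abs(t_stat), df = n_y + n_x - 2, lower.tail = FALSE)
c(s2 = s2, t = t_stat, p = p_val)  # matches t = 0.5, df = 8, p = 0.6305 above
```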
4.4.1.2.2 Unequal Variances
Assumptions
- Independence and Identically Distributed (i.i.d.) Observations
Assume that observations in each sample are i.i.d., which implies:
$$\text{var}(\bar{y}) = \frac{\sigma_y^2}{n_y}, \quad \text{var}(\bar{x}) = \frac{\sigma_x^2}{n_x}$$
- Independence Between Samples
The samples are assumed to be independent, meaning no observation from one sample influences observations from the other. This independence allows us to write:
$$\text{var}(\bar{y} - \bar{x}) = \text{var}(\bar{y}) + \text{var}(\bar{x}) - 2\,\text{cov}(\bar{y}, \bar{x}) = \text{var}(\bar{y}) + \text{var}(\bar{x}) = \frac{\sigma_y^2}{n_y} + \frac{\sigma_x^2}{n_x}$$
This calculation assumes $\text{cov}(\bar{y}, \bar{x}) = 0$ due to the independence between the samples.
- Normality Assumption
We assume that the underlying populations are normally distributed. This assumption justifies the use of the Student’s T Distribution, which is critical for hypothesis testing and constructing confidence intervals.
- Unequal Variances
$$\sigma_y^2 \neq \sigma_x^2$$
Test Statistic
The test statistic is:
$$T = \frac{\bar{y} - \bar{x} - (\mu_y - \mu_x)}{\sqrt{\dfrac{s_y^2}{n_y} + \dfrac{s_x^2}{n_x}}}$$
Degrees of Freedom (Welch-Satterthwaite Approximation) (Satterthwaite 1946)
The degrees of freedom are approximated by:
$$v = \frac{\left(\dfrac{s_y^2}{n_y} + \dfrac{s_x^2}{n_x}\right)^2}{\dfrac{(s_y^2/n_y)^2}{n_y - 1} + \dfrac{(s_x^2/n_x)^2}{n_x - 1}}$$
Since $v$ is generally fractional, truncate it to the nearest integer when using t-tables (software such as R uses the fractional value directly).
Confidence Interval
A $100(1-\alpha)\%$ confidence interval for $\mu_y - \mu_x$ is:
$$\bar{y} - \bar{x} \pm t_{v, \alpha/2} \sqrt{\frac{s_y^2}{n_y} + \frac{s_x^2}{n_x}}$$
Hypothesis Test
Testing:
$$H_0: \mu_y - \mu_x = \delta_0 \quad \text{vs.} \quad H_a: \mu_y - \mu_x \neq \delta_0$$
Reject $H_0$ if:
$$|T| > t_{v, \alpha/2}$$
where
$$t = \frac{\bar{y} - \bar{x} - \delta_0}{\sqrt{s_y^2/n_y + s_x^2/n_x}}$$
# Small sample test with unequal variance
t_test_unequal <- t.test(y, x, var.equal = FALSE)
t_test_unequal
#>
#> Welch Two Sample t-test
#>
#> data: y and x
#> t = 0.5, df = 8, p-value = 0.6305
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -3.612008 5.612008
#> sample estimates:
#> mean of x mean of y
#> 14 13
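The Welch–Satterthwaite degrees of freedom reported above can be verified directly from the formula:

```r
y <- c(10, 12, 14, 16, 18)
x <- c(9, 11, 13, 15, 17)
vy <- var(y) / length(y)  # s_y^2 / n_y
vx <- var(x) / length(x)  # s_x^2 / n_x
v <- (vy + vx)^2 / (vy^2 / (length(y) - 1) + vx^2 / (length(x) - 1))
v
#> [1] 8
```

With equal sample variances and equal sample sizes, $v$ reduces to $n_y + n_x - 2$, which is why the Welch df here coincides with the pooled df.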
4.4.2 For Variances
To compare the variances of two independent samples, we can use the F-test. The test statistic is defined as:
$$F_{ndf, ddf} = \frac{s_1^2}{s_2^2}$$
where $s_1^2 > s_2^2$, and $ndf = n_1 - 1$ and $ddf = n_2 - 1$ are the numerator and denominator degrees of freedom, respectively.
4.4.2.1 F-Test
The hypotheses for the F-test are:
$$H_0: \sigma_y^2 = \sigma_x^2 \ \text{(equal variances)} \quad \text{vs.} \quad H_a: \sigma_y^2 \neq \sigma_x^2 \ \text{(unequal variances)}$$
The test statistic is:
$$F = \frac{s_y^2}{s_x^2}$$
where $s_y^2$ and $s_x^2$ are the sample variances of the two groups.
Decision Rule
Reject $H_0$ if:
$F > F_{n_y - 1, n_x - 1, \alpha/2}$ (upper critical value), or
$F < F_{n_y - 1, n_x - 1, 1 - \alpha/2}$ (lower critical value).
Here:
- $F_{n_y - 1, n_x - 1, \alpha/2}$ and $F_{n_y - 1, n_x - 1, 1 - \alpha/2}$ are the critical points of the F-distribution with $n_y - 1$ and $n_x - 1$ degrees of freedom.
Assumptions
- The F-test requires that the data in both groups follow a normal distribution.
- The F-test is sensitive to deviations from normality (e.g., heavy-tailed distributions). If the normality assumption is violated, it may lead to an inflated Type I error rate (false positives).
Limitations and Alternatives
- Sensitivity to Non-Normality:
- When data have long-tailed distributions (positive kurtosis), the F-test may produce misleading results.
- To assess normality, see Normality Assessment.
- Nonparametric Alternatives:
- If the normality assumption is not met, use robust tests such as the Modified Levene Test (Brown-Forsythe Test), which compares group variances based on medians instead of means.
# Load iris dataset
data(iris)
# Subset data for two species
irisVe <- iris$Petal.Width[iris$Species == "versicolor"]
irisVi <- iris$Petal.Width[iris$Species == "virginica"]
# Perform F-test
f_test <- var.test(irisVe, irisVi)
# Display results
f_test
#>
#> F test to compare two variances
#>
#> data: irisVe and irisVi
#> F = 0.51842, num df = 49, denom df = 49, p-value = 0.02335
#> alternative hypothesis: true ratio of variances is not equal to 1
#> 95 percent confidence interval:
#> 0.2941935 0.9135614
#> sample estimates:
#> ratio of variances
#> 0.5184243
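The same decision can be reached by comparing the observed F statistic with the two critical values from `qf()`; this sketch uses $\alpha = 0.05$:

```r
irisVe <- iris$Petal.Width[iris$Species == "versicolor"]
irisVi <- iris$Petal.Width[iris$Species == "virginica"]
f_stat <- var(irisVe) / var(irisVi)
alpha <- 0.05
lower <- qf(alpha / 2, df1 = 49, df2 = 49)      # lower critical value
upper <- qf(1 - alpha / 2, df1 = 49, df2 = 49)  # upper critical value
f_stat < lower || f_stat > upper
#> [1] TRUE
```

The statistic falls below the lower critical value, so we reject $H_0$ at the 5% level, consistent with the p-value of 0.023 above.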
4.4.2.2 Levene’s Test
Levene’s Test is a robust method for testing the equality of variances across multiple groups. Unlike the F-test, it is less sensitive to departures from normality and is particularly useful for handling non-normal distributions and datasets with outliers. The test works by analyzing the deviations of individual observations from their group mean or median.
Test Procedure
- Compute the absolute deviations of each observation from its group mean or median:
- For group $y$: $d_{y,i} = |y_i - \text{Central Value}_y|$
- For group $x$: $d_{x,j} = |x_j - \text{Central Value}_x|$
- The “central value” can be either the mean (classic Levene’s test) or the median (the Modified Levene Test (Brown-Forsythe Test) variation, more robust for non-normal data).
- Perform a one-way ANOVA on the absolute deviations to test for differences in group variances.
Hypotheses
- Null Hypothesis (H0): All groups have equal variances.
- Alternative Hypothesis (Ha): At least one group has a variance different from the others.
Test Statistic
The Levene test statistic is calculated as an ANOVA on the absolute deviations. Let:
- $k$: Number of groups,
- $n_i$: Number of observations in group $i$,
- $n$: Total number of observations.
The test statistic is:
$$W = \frac{(n - k)\sum_{i=1}^{k} n_i (\bar{d}_i - \bar{d})^2}{(k - 1)\sum_{i=1}^{k}\sum_{j=1}^{n_i} (d_{i,j} - \bar{d}_i)^2}$$
where:
- $d_{i,j}$: Absolute deviations within group $i$,
- $\bar{d}_i$: Mean of the absolute deviations for group $i$,
- $\bar{d}$: Overall mean of the absolute deviations.
Under the null hypothesis, $W \sim F_{k-1, n-k}$.
Decision Rule
- Compute the test statistic $W$.
- Reject $H_0$ at significance level $\alpha$ if: $W > F_{k-1, n-k, \alpha}$
# Load required package
library(car)
# Note: leveneTest() centers at the group medians by default (the
# Brown-Forsythe variant); pass center = mean for the classic,
# mean-centered Levene's Test.
levene_test_median <-
  leveneTest(Petal.Width ~ Species, data = iris, center = median)
levene_test_mean <-
  leveneTest(Petal.Width ~ Species, data = iris, center = mean)
# Display results (median-centered version)
levene_test_median
#> Levene's Test for Homogeneity of Variance (center = median)
#>        Df F value    Pr(>F)
#> group   2  19.892 2.261e-08 ***
#>       147
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The output includes:
Df: Degrees of freedom for the numerator and denominator.
F-value: The computed value of the test statistic W.
p-value: The probability of observing such a value under the null hypothesis.
If the p-value is less than $\alpha$, reject $H_0$ and conclude that the group variances are significantly different.
Otherwise, fail to reject $H_0$ and conclude there is no evidence of a difference in variances.
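Because $W$ is just a one-way ANOVA F statistic on the absolute deviations, `leveneTest()` can be reproduced with base `aov()`; this sketch uses median centering:

```r
# Levene's statistic as an ANOVA on absolute deviations from group medians
abs_dev <- with(iris, abs(Petal.Width - ave(Petal.Width, Species, FUN = median)))
levene_anova <- anova(aov(abs_dev ~ iris$Species))
levene_anova$`F value`[1]  # matches the F value from leveneTest(center = median)
```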
Advantages of Levene’s Test
Robustness:
- Handles non-normal data and outliers better than the F-test.
Flexibility:
- By choosing the center value (mean or median), it can adapt to different data characteristics: use the mean for symmetric distributions and the median for non-normal or skewed data.
Versatility:
- Applicable to comparing variances across more than two groups, unlike the two-sample formulation of the Modified Levene Test (Brown-Forsythe Test) presented below.
4.4.2.3 Modified Levene Test (Brown-Forsythe Test)
The Modified Levene Test is a robust alternative to the F-test for comparing variances between two groups. Instead of using squared deviations (as in the F-test), this test considers the absolute deviations from the median, making it less sensitive to non-normal data and long-tailed distributions. It is, however, still appropriate for normally distributed data.
For each sample, compute the absolute deviations from the median:
$$d_{y,i} = |y_i - y_{.5}| \quad \text{and} \quad d_{x,j} = |x_j - x_{.5}|$$
where $y_{.5}$ and $x_{.5}$ denote the sample medians.
Let:
- $\bar{d}_y$ and $\bar{d}_x$ be the means of the absolute deviations for groups $y$ and $x$, respectively.
The test statistic is:
$$t_L^* = \frac{\bar{d}_y - \bar{d}_x}{s\sqrt{\dfrac{1}{n_y} + \dfrac{1}{n_x}}}$$
where the pooled variance $s^2$ is:
$$s^2 = \frac{\sum_{i=1}^{n_y}(d_{y,i} - \bar{d}_y)^2 + \sum_{j=1}^{n_x}(d_{x,j} - \bar{d}_x)^2}{n_y + n_x - 2}$$
Assumptions
- Constant Variance of Error Terms: The test assumes equal error variances in each group under the null hypothesis.
- Moderate Sample Size: The approximation $t_L^* \sim t_{n_y + n_x - 2}$ holds well for moderate or large sample sizes.
Decision Rule
- Compute $t_L^*$ using the formula above.
- Reject the null hypothesis of equal variances if: $|t_L^*| > t_{n_y + n_x - 2; \alpha/2}$
This is equivalent to applying a two-sample t-test to the absolute deviations.
# Absolute deviations from the median
dVe <- abs(irisVe - median(irisVe))
dVi <- abs(irisVi - median(irisVi))
# Perform t-test on absolute deviations
levene_test <- t.test(dVe, dVi, var.equal = TRUE)
# Display results
levene_test
#>
#> Two Sample t-test
#>
#> data: dVe and dVi
#> t = -2.5584, df = 98, p-value = 0.01205
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -0.12784786 -0.01615214
#> sample estimates:
#> mean of x mean of y
#> 0.154 0.226
For small sample sizes, use the unequal variance t-test directly on the original data as a robust alternative:
# Small sample t-test with unequal variances
small_sample_test <- t.test(irisVe, irisVi, var.equal = FALSE)
# Display results
small_sample_test
#>
#> Welch Two Sample t-test
#>
#> data: irisVe and irisVi
#> t = -14.625, df = 89.043, p-value < 2.2e-16
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -0.7951002 -0.6048998
#> sample estimates:
#> mean of x mean of y
#> 1.326 2.026
4.4.2.4 Bartlett’s Test
Bartlett’s Test is a statistical procedure for testing the equality of variances across multiple groups. It assumes that the data in each group are normally distributed and is sensitive to deviations from normality. When the assumption of normality holds, Bartlett’s Test is more powerful than Levene’s Test.
Hypotheses for Bartlett’s Test
- Null Hypothesis (H0): All groups have equal variances.
- Alternative Hypothesis (Ha): At least one group has a variance different from the others.
The test statistic for Bartlett’s Test is:
$$B = \frac{(n - k)\log(S_p^2) - \sum_{i=1}^{k}(n_i - 1)\log(S_i^2)}{1 + \dfrac{1}{3(k-1)}\left(\sum_{i=1}^{k}\dfrac{1}{n_i - 1} - \dfrac{1}{n - k}\right)}$$
Where:
- $k$: Number of groups,
- $n_i$: Number of observations in group $i$,
- $n = \sum_{i=1}^{k} n_i$: Total number of observations,
- $S_i^2$: Sample variance of group $i$,
- $S_p^2$: Pooled variance, given by: $S_p^2 = \dfrac{\sum_{i=1}^{k}(n_i - 1)S_i^2}{n - k}$
Under the null hypothesis, the test statistic $B \sim \chi^2_{k-1}$.
Assumptions
- Normality: The data in each group must follow a normal distribution.
- Independence: Observations within and between groups must be independent.
- Equal Sample Sizes (Optional): Bartlett’s Test is more robust if sample sizes are approximately equal.
Decision Rule
- Compute the test statistic $B$.
- Compare $B$ to the critical value of the Chi-Square distribution with $k - 1$ degrees of freedom at level $\alpha$.
- Reject $H_0$ if: $B > \chi^2_{k-1, \alpha}$
Alternatively, use the p-value:
- Reject $H_0$ if the p-value $\leq \alpha$.
# Perform Bartlett's Test
bartlett_test <- bartlett.test(Petal.Width ~ Species, data = iris)
# Display results
bartlett_test
#>
#> Bartlett test of homogeneity of variances
#>
#> data: Petal.Width by Species
#> Bartlett's K-squared = 39.213, df = 2, p-value = 3.055e-09
The output includes:
Bartlett’s K-squared: The value of the test statistic $B$.
df: Degrees of freedom ($k - 1$), where $k$ is the number of groups.
p-value: The probability of observing such a value of $B$ under $H_0$.
If the p-value is less than $\alpha$, reject $H_0$ and conclude that the variances are significantly different across groups.
If the p-value is greater than $\alpha$, fail to reject $H_0$ and conclude that there is no significant evidence of variance differences.
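The statistic $B$ can be computed directly from the formula above, reproducing `bartlett.test()` on the iris groups:

```r
groups <- split(iris$Petal.Width, iris$Species)
k   <- length(groups)
ni  <- sapply(groups, length)
n   <- sum(ni)
s2i <- sapply(groups, var)
sp2 <- sum((ni - 1) * s2i) / (n - k)  # pooled variance
num <- (n - k) * log(sp2) - sum((ni - 1) * log(s2i))
den <- 1 + (sum(1 / (ni - 1)) - 1 / (n - k)) / (3 * (k - 1))
B   <- num / den
c(B = B, p = pchisq(B, df = k - 1, lower.tail = FALSE))
# B matches Bartlett's K-squared = 39.213 reported above
```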
Limitations of Bartlett’s Test
- Sensitivity to Non-Normality: Bartlett’s Test is highly sensitive to departures from normality. Even slight deviations can lead to misleading results.
- Not Robust to Outliers: Outliers can disproportionately affect the test result.
- Alternatives: If the normality assumption is violated, use robust alternatives such as Levene’s Test (absolute deviations) or the Modified Levene Test (Brown-Forsythe Test) (median-based absolute deviations).
Advantages of Bartlett’s Test
High Power: Bartlett’s Test is more powerful than robust alternatives when the normality assumption holds.
Simple Implementation: The test is easy to perform and interpret.
4.4.3 Power
To evaluate the power of a test, we consider the situation where the variances are equal across groups:
$$\sigma_y^2 = \sigma_x^2 = \sigma^2$$
Under the assumption of equal variances, we take equal sample sizes from both groups, i.e., $n_y = n_x = n$.
Hypotheses for One-Sided Testing
We are testing:
$$H_0: \mu_y - \mu_x \leq 0 \quad \text{vs.} \quad H_a: \mu_y - \mu_x > 0$$
Test Statistic
The $\alpha$-level z-test rejects $H_0$ if the test statistic:
$$z = \frac{\bar{y} - \bar{x}}{\sigma\sqrt{\dfrac{2}{n}}} > z_\alpha$$
where:
- $\bar{y}$ and $\bar{x}$ are the sample means,
- $\sigma$ is the common standard deviation,
- $z_\alpha$ is the critical value from the standard normal distribution.
Power Function
The power of the test, denoted $\pi(\mu_y - \mu_x)$, is the probability of correctly rejecting $H_0$ when $\mu_y - \mu_x$ is some specified value. Under the alternative hypothesis, the power function is:
$$\pi(\mu_y - \mu_x) = \Phi\left(-z_\alpha + \frac{\mu_y - \mu_x}{\sigma}\sqrt{\frac{n}{2}}\right)$$
where:
- $\Phi$ is the cumulative distribution function (CDF) of the standard normal distribution,
- $\dfrac{\mu_y - \mu_x}{\sigma}\sqrt{\dfrac{n}{2}}$ represents the standardized effect size.
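The power function is straightforward to evaluate numerically; this sketch assumes $\sigma = 1$ and $\alpha = 0.05$:

```r
power_fn <- function(diff, n, sigma = 1, alpha = 0.05) {
  # One-sided z-test power: Phi(-z_alpha + diff / sigma * sqrt(n / 2))
  pnorm(-qnorm(1 - alpha) + diff / sigma * sqrt(n / 2))
}
power_fn(0,   n = 50)  # at the null, the "power" equals alpha = 0.05
power_fn(0.5, n = 50)  # roughly 0.80 for a true difference of 0.5
```

This anticipates the sample size calculation below: $n = 50$ per group gives about 80% power for $\delta = 0.5$.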
Determining the Required Sample Size
To achieve a desired power of $1 - \beta$ when the true difference is $\delta$ (the smallest difference of interest), we solve for the required sample size $n$. The power equation is:
$$\Phi\left(-z_\alpha + \frac{\delta}{\sigma}\sqrt{\frac{n}{2}}\right) = 1 - \beta$$
Rearranging for $n$, the required sample size is:
$$n = \frac{2\sigma^2}{\delta^2}(z_\alpha + z_\beta)^2$$
where:
- $\sigma$: The common standard deviation,
- $z_\alpha$: The critical value for the Type I error rate $\alpha$ (one-sided test),
- $z_\beta$: The critical value for the Type II error rate $\beta$ (related to power $1 - \beta$),
- $\delta$: The minimum detectable difference between the means.
# Parameters
alpha <- 0.05 # Significance level
beta <- 0.2 # Type II error rate (1 - Power = 0.2)
sigma <- 1 # Common standard deviation
delta <- 0.5 # Minimum detectable difference
# Critical values
z_alpha <- qnorm(1 - alpha)
z_beta <- qnorm(1 - beta)
# Sample size calculation
n <- (2 * sigma ^ 2 * (z_alpha + z_beta) ^ 2) / delta ^ 2
# Output the required sample size (per group)
ceiling(n)
#> [1] 50
Sample Size for Two-Sided Tests
For a two-sided test, replace $z_\alpha$ with $z_{\alpha/2}$ to account for the two-tailed critical region:
$$n = 2\left(\frac{\sigma(z_{\alpha/2} + z_\beta)}{\delta}\right)^2$$
This ensures that the test has the required power $1 - \beta$ to detect a difference of size $\delta$ between the means at significance level $\alpha$.
Adjustment for the Exact t-Test
When conducting an exact two-sample t-test for small sample sizes, the sample size calculation involves the non-central t-distribution. An approximate correction can be applied using the critical values from the t-distribution instead of the z-distribution.
The adjusted sample size is:
$$n^* = 2\left(\frac{\sigma(t_{2n-2; \alpha/2} + t_{2n-2; \beta})}{\delta}\right)^2$$
Where:
- $t_{2n-2; \alpha/2}$: The critical value of the t-distribution with $2n - 2$ degrees of freedom for significance level $\alpha/2$,
- $t_{2n-2; \beta}$: The critical value of the t-distribution with $2n - 2$ degrees of freedom for power $1 - \beta$.
This correction adjusts for the increased variability of the t-distribution, especially important for small sample sizes.
# Parameters
alpha <- 0.05 # Significance level
power <- 0.8 # Desired power
sigma <- 1 # Common standard deviation
delta <- 0.5 # Minimum detectable difference
# Calculate sample size for two-sided test
sample_size <-
power.t.test(
delta = delta,
sd = sigma,
sig.level = alpha,
power = power,
type = "two.sample",
alternative = "two.sided"
)
# Display results
sample_size
#>
#> Two-sample t test power calculation
#>
#> n = 63.76576
#> delta = 0.5
#> sd = 1
#> sig.level = 0.05
#> power = 0.8
#> alternative = two.sided
#>
#> NOTE: n is number in *each* group
Key Insights
Z-Test vs. T-Test:
- For large samples, the normal approximation (z-test) works well. For small samples, the correction using the t-distribution is essential.
Effect of Power and Significance Level:
- Increasing power ($1 - \beta$) or decreasing $\alpha$ requires larger sample sizes.
- A smaller minimum detectable difference ($\delta$) also requires a larger sample size.
Two-Sided Tests:
Two-sided tests require larger sample sizes compared to one-sided tests due to the split critical region.
Formula Summary
Test Type | Formula for Sample Size |
---|---|
One-Sided Test | $n = 2\left(\frac{\sigma(z_\alpha + z_\beta)}{\delta}\right)^2$ |
Two-Sided Test | $n = 2\left(\frac{\sigma(z_{\alpha/2} + z_\beta)}{\delta}\right)^2$ |
Approximate t-Test | $n^* = 2\left(\frac{\sigma(t_{2n-2; \alpha/2} + t_{2n-2; \beta})}{\delta}\right)^2$ |
4.4.4 Matched Pair Designs
In matched pair designs, two treatments are compared by measuring responses for the same subjects under both treatments. This ensures that the effects of subject-to-subject variability are minimized, as each subject serves as their own control.
We have two treatments, and the data are structured as follows:
Subject | Treatment A | Treatment B | Difference |
---|---|---|---|
1 | $y_1$ | $x_1$ | $d_1 = y_1 - x_1$ |
2 | $y_2$ | $x_2$ | $d_2 = y_2 - x_2$ |
… | … | … | … |
n | $y_n$ | $x_n$ | $d_n = y_n - x_n$ |
Here:
- $y_i$ represents the observation under Treatment A,
- $x_i$ represents the observation under Treatment B,
- $d_i = y_i - x_i$ is the difference for subject $i$.
Assumptions
- Observations $y_i$ and $x_i$ are measured on the same subjects, inducing correlation.
- The differences $d_i$ are independent and identically distributed (i.i.d.) and follow a normal distribution: $d_i \sim N(\mu_D, \sigma_D^2)$
Mean and Variance of the Difference
The mean difference $\mu_D$ and the variance $\sigma_D^2$ are given by:
$$\mu_D = E(y_i - x_i) = \mu_y - \mu_x$$
$$\sigma_D^2 = \text{Var}(y_i - x_i) = \text{Var}(y_i) + \text{Var}(x_i) - 2\,\text{Cov}(y_i, x_i)$$
- If the covariance between $y_i$ and $x_i$ is positive (the typical case), the variance of the differences $\sigma_D^2$ is reduced compared to the independent-sample case.
- This is the key benefit of Matched Pair Designs: reduced variability increases the precision of estimates.
Sample Statistics
For the differences $d_i = y_i - x_i$:
The sample mean of the differences: $$\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i = \bar{y} - \bar{x}$$
The sample variance of the differences: $$s_d^2 = \frac{1}{n-1}\sum_{i=1}^{n} (d_i - \bar{d})^2$$
Once the data are converted into differences $d_i$, the problem reduces to one-sample inference. We can use tests and confidence intervals (CIs) for the mean of a single sample.
Hypothesis Test
We test the following hypotheses:
$$H_0: \mu_D = 0 \quad \text{vs.} \quad H_a: \mu_D \neq 0$$
The test statistic is:
$$t = \frac{\bar{d}}{s_d/\sqrt{n}} \sim t_{n-1}$$
where $n$ is the number of subjects.
- Reject $H_0$ at significance level $\alpha$ if: $|t| > t_{n-1, \alpha/2}$
Confidence Interval
A $100(1-\alpha)\%$ confidence interval for $\mu_D$ is:
$$\bar{d} \pm t_{n-1, \alpha/2} \cdot \frac{s_d}{\sqrt{n}}$$
# Sample data
treatment_a <- c(85, 90, 78, 92, 88)
treatment_b <- c(80, 86, 75, 89, 85)
# Compute differences
differences <- treatment_a - treatment_b
# Perform one-sample t-test on the differences
t_test <- t.test(differences, mu = 0, alternative = "two.sided")
# Display results
t_test
#>
#> One Sample t-test
#>
#> data: differences
#> t = 9, df = 4, p-value = 0.0008438
#> alternative hypothesis: true mean is not equal to 0
#> 95 percent confidence interval:
#> 2.489422 4.710578
#> sample estimates:
#> mean of x
#> 3.6
The output includes:
t-statistic: The calculated test statistic for the matched pairs.
p-value: The probability of observing such a difference under the null hypothesis.
Confidence Interval: The range of plausible values for the mean difference $\mu_D$.
If the p-value is less than $\alpha$, reject $H_0$ and conclude that there is a significant difference between the two treatments.
If the confidence interval does not include 0, this supports the conclusion of a significant difference.
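Equivalently, `t.test()` with `paired = TRUE` forms the differences internally and produces the same test:

```r
treatment_a <- c(85, 90, 78, 92, 88)
treatment_b <- c(80, 86, 75, 89, 85)
paired_test <- t.test(treatment_a, treatment_b, paired = TRUE)
paired_test$statistic  # identical to the one-sample t-test on the differences
#> t 
#> 9
```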
Key Insights
Reduced Variability: Positive correlation between paired observations reduces the variance of the differences, increasing test power.
Use of Differences: The paired design converts the data into a single-sample problem for inference.
Robustness: The paired t-test assumes normality of the differences $d_i$. For larger $n$, the Central Limit Theorem ensures robustness to non-normality.
Matched pair designs are a powerful way to control for subject-specific variability, leading to more precise comparisons between treatments.
4.4.5 Nonparametric Tests for Two Samples
For Matched Pair Designs or independent samples where normality cannot be assumed, we use nonparametric tests. These tests do not assume any specific distribution of the data and are robust alternatives to parametric methods.
Stochastic Order and Location Shift
Suppose $Y$ and $X$ are random variables with cumulative distribution functions (CDFs) $F_Y$ and $F_X$. Then $Y$ is stochastically larger than $X$ if, for all real numbers $u$:
$$P(Y > u) \geq P(X > u) \quad (\text{equivalently, } F_Y(u) \leq F_X(u)).$$
If the two distributions differ only in their location parameters, say $\theta_y$ and $\theta_x$, then $Y$ is stochastically larger than $X$ if $\theta_y > \theta_x$.
We test the following hypotheses:
- Two-Sided Hypothesis: $H_0: F_Y = F_X$ vs. $H_a: F_Y \neq F_X$
- Upper One-Sided Hypothesis: $H_0: F_Y = F_X$ vs. $H_a: F_Y < F_X$
- Lower One-Sided Hypothesis: $H_0: F_Y = F_X$ vs. $H_a: F_Y > F_X$
We generally avoid the completely non-directional alternative $H_a: F_Y \neq F_X$ because it allows arbitrary differences between the distributions, without requiring one distribution to be stochastically larger than the other.
Nonparametric Tests
When the focus is on whether the two distributions differ only in location, two equivalent nonparametric tests are commonly used: the Wilcoxon Rank-Sum Test and the Mann-Whitney U Test.
Both tests are mathematically equivalent and test whether one sample is systematically larger than the other.
4.4.5.1 Wilcoxon Rank-Sum Test
The Wilcoxon Rank-Sum Test is a nonparametric test used to compare two independent samples to assess whether their distributions differ in location. It is based on the ranks of the combined observations rather than their actual values.
Procedure
1. Combine and Rank Observations: Combine all $n = n_y + n_x$ observations (from both groups) into a single dataset and rank them in ascending order. If ties exist, assign the average rank to tied values.
2. Calculate Rank Sums: Compute the sum of ranks for each group:
- $w_y$: Sum of the ranks for group $y$ (sample 1),
- $w_x$: Sum of the ranks for group $x$ (sample 2).
By definition: $$w_y + w_x = \frac{n(n+1)}{2}$$
3. Test Statistic: The test focuses on the rank sum $w_y$. Reject $H_0$ if $w_y$ is large (indicating $y$ systematically has larger values) or, equivalently, if $w_x$ is small.
4. Null Distribution: Under $H_0$ (no difference between groups), all possible arrangements of ranks among $y$ and $x$ are equally likely. The total number of possible rank arrangements is: $$\frac{(n_y + n_x)!}{n_y!\, n_x!}$$
Computational Considerations:
- For small samples, the exact null distribution of the rank sums can be calculated.
- For large samples, an approximate normal distribution can be used.
Hypotheses
Null Hypothesis ($H_0$): The two samples come from identical distributions.
Alternative Hypothesis ($H_a$): The two samples come from different distributions, or one distribution is systematically larger.
Two-Sided Test: $H_a: F_Y \neq F_X$
One-Sided Test: $H_a: F_Y > F_X$ or $H_a: F_Y < F_X$
# Subset data for two species
irisVe <- iris$Petal.Width[iris$Species == "versicolor"]
irisVi <- iris$Petal.Width[iris$Species == "virginica"]
# Perform Wilcoxon Rank Test (approximate version, large sample)
wilcox_result <- wilcox.test(
irisVe,
irisVi,
alternative = "two.sided", # Two-sided test
conf.level = 0.95, # Confidence level
exact = FALSE, # Approximate test for large samples
correct = TRUE # Apply continuity correction
)
# Display results
wilcox_result
#>
#> Wilcoxon rank sum test with continuity correction
#>
#> data: irisVe and irisVi
#> W = 49, p-value < 2.2e-16
#> alternative hypothesis: true location shift is not equal to 0
The output of wilcox.test includes:
W: The test statistic; in R this is the Mann-Whitney $U$ statistic for the first sample, i.e., its rank sum minus $n_y(n_y + 1)/2$.
p-value: The probability of observing such a difference in rank sums under $H_0$.
Alternative Hypothesis: Specifies whether the test was one-sided or two-sided.
Confidence Interval (if applicable): Provides a range for the difference in medians.
Decision Rule
Reject $H_0$ at significance level $\alpha$ if the p-value $\leq \alpha$.
For large samples, compare the test statistic to a critical value from the normal approximation.
Key Features
Robustness:
- The test does not require assumptions of normality and is robust to outliers.
Distribution-Free:
- It evaluates whether two samples differ in location without assuming a specific distribution.
Rank-Based:
- It uses the ranks of the observations, which makes it invariant to monotone transformations of the data.
Computational Considerations
For small sample sizes, the exact distribution of the rank sums is used.
For large sample sizes, the normal approximation with continuity correction is applied for computational efficiency.
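The rank sums behind the test can be computed directly; their total verifies the identity $w_y + w_x = n(n+1)/2$:

```r
irisVe <- iris$Petal.Width[iris$Species == "versicolor"]
irisVi <- iris$Petal.Width[iris$Species == "virginica"]
r <- rank(c(irisVe, irisVi))        # average ranks are used for ties
w_ve <- sum(r[seq_along(irisVe)])   # rank sum of versicolor
w_vi <- sum(r[-seq_along(irisVe)])  # rank sum of virginica
n <- length(irisVe) + length(irisVi)
c(w_ve = w_ve, w_vi = w_vi, total = n * (n + 1) / 2)
# w_ve - n_y(n_y + 1)/2 = 1324 - 1275 recovers W = 49 from wilcox.test()
```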
4.4.5.2 Mann-Whitney U Test
The Mann-Whitney U Test is a nonparametric test used to compare two independent samples. It evaluates whether one sample tends to produce larger observations than the other, based on pairwise comparisons. The test does not assume normality and is robust to outliers.
Procedure
1. Pairwise Comparisons: Compare each observation $y_i$ from sample $Y$ with each observation $x_j$ from sample $X$.
- Let $u_y$ be the number of pairs where $y_i > x_j$.
- Let $u_x$ be the number of pairs where $y_i < x_j$.
By definition: $$u_y + u_x = n_y n_x$$ where $n_y$ is the sample size for group $Y$, and $n_x$ is the sample size for group $X$.
2. Test Statistic: Reject $H_0$ if $u_y$ is large (or, equivalently, if $u_x$ is small). The Mann-Whitney U Test and the Wilcoxon Rank-Sum Test are related through the rank sums:
$$u_y = w_y - \frac{n_y(n_y + 1)}{2}, \quad u_x = w_x - \frac{n_x(n_x + 1)}{2}$$
Here, $w_y$ and $w_x$ are the rank sums for groups $Y$ and $X$, respectively.
Hypotheses
- Null Hypothesis ($H_0$): The two samples come from identical distributions.
- Alternative Hypothesis ($H_a$):
- Upper One-Sided: $F_Y < F_X$ (Sample $Y$ is stochastically larger).
- Lower One-Sided: $F_Y > F_X$ (Sample $X$ is stochastically larger).
- Two-Sided: $F_Y \neq F_X$ (Distributions differ in location).
Test Statistic for Large Samples
For large sample sizes $n_y$ and $n_x$, the null distribution of $U$ can be approximated by a normal distribution with:
Mean: $$E(U) = \frac{n_y n_x}{2}$$
Variance: $$\text{Var}(U) = \frac{n_y n_x (n_y + n_x + 1)}{12}$$
The standardized test statistic (with continuity correction) is:
$$z = \frac{u_y - \dfrac{n_y n_x}{2} - \dfrac{1}{2}}{\sqrt{\dfrac{n_y n_x (n_y + n_x + 1)}{12}}}$$
The test rejects $H_0$ at level $\alpha$ if:
$$z \geq z_\alpha \ (\text{one-sided}) \quad \text{or} \quad |z| \geq z_{\alpha/2} \ (\text{two-sided}).$$
For the two-sided test, we use:
$u_{\max} = \max(u_y, u_x)$, and
$u_{\min} = \min(u_y, u_x)$.
The p-value is given by:
$$p\text{-value} = 2P(U \geq u_{\max}) = 2P(U \leq u_{\min}).$$
When $y_i = x_j$ (ties), assign a value of $1/2$ to both $u_y$ and $u_x$ for that pair. While the exact sampling distribution differs slightly when ties exist, the large-sample normal approximation remains reasonable.
# Subset data for two species
irisVe <- iris$Petal.Width[iris$Species == "versicolor"]
irisVi <- iris$Petal.Width[iris$Species == "virginica"]
# Perform Mann-Whitney U Test
mann_whitney <- wilcox.test(
irisVe, irisVi,
alternative = "two.sided",
conf.level = 0.95,
exact = FALSE, # Approximate test for large samples
correct = TRUE # Apply continuity correction
)
# Display results
mann_whitney
#>
#> Wilcoxon rank sum test with continuity correction
#>
#> data: irisVe and irisVi
#> W = 49, p-value < 2.2e-16
#> alternative hypothesis: true location shift is not equal to 0
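The pairwise-comparison definition of $U$ can be checked directly against the W statistic reported by `wilcox.test()`, counting ties as $1/2$:

```r
irisVe <- iris$Petal.Width[iris$Species == "versicolor"]
irisVi <- iris$Petal.Width[iris$Species == "virginica"]
d <- outer(irisVe, irisVi, FUN = "-")   # all pairwise differences
u_ve <- sum(d > 0) + 0.5 * sum(d == 0)  # pairs where versicolor is larger
u_ve  # equals W = 49 from wilcox.test(irisVe, irisVi)
#> [1] 49
```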
Decision Rule
Reject $H_0$ if the p-value is less than $\alpha$.
For large samples, check whether $z \geq z_\alpha$ (one-sided) or $|z| \geq z_{\alpha/2}$ (two-sided).
Key Insights
Robustness: The Mann-Whitney U Test does not assume normality and is robust to outliers.
Relationship to Wilcoxon Test: The test is equivalent to the Wilcoxon Rank-Sum Test but formulated differently (based on pairwise comparisons).
Large Sample Approximation: For large $n_y$ and $n_x$, the test statistic $U$ follows an approximate normal distribution, simplifying computation.
Handling Ties: Ties are accounted for by assigning fractional contributions to $u_y$ and $u_x$.