4.2 Two Sample Inference

4.2.1 Means

Suppose we have 2 sets of observations,

  • \(y_1,..., y_{n_y}\)
  • \(x_1,...,x_{n_x}\)

that are random samples from two independent populations with means \(\mu_y\) and \(\mu_x\) and variances \(\sigma^2_y\), \(\sigma^2_x\). Our goal is to compare \(\mu_y\) and \(\mu_x\), or to test whether \(\sigma^2_y = \sigma^2_x\).

Large Sample Tests

Assume that \(n_y\) and \(n_x\) are large (\(\ge 30\)). Then,

\[ E(\bar{y} - \bar{x}) = \mu_y - \mu_x \\ Var(\bar{y} - \bar{x}) = \sigma^2_y /n_y + \sigma^2_x/n_x \]


\[ Z = \frac{\bar{y}-\bar{x} - (\mu_y - \mu_x)}{\sqrt{\sigma^2_y /n_y + \sigma^2_x/n_x}} \sim N(0,1) \]

by the Central Limit Theorem. For large samples, we can replace the variances by their unbiased estimators (\(s^2_y, s^2_x\)) and get the same large-sample distribution.

An approximate \(100(1-\alpha) \%\) CI for \(\mu_y - \mu_x\) is given by:

\[ \bar{y} - \bar{x} \pm z_{\alpha/2}\sqrt{s^2_y/n_y + s^2_x/n_x} \]

We can test the hypotheses

\[ H_0: \mu_y - \mu_x = \delta_0 \\ H_A: \mu_y - \mu_x \neq \delta_0 \]

at the \(\alpha\)-level with the statistic:

\[ z = \frac{\bar{y}-\bar{x} - \delta_0}{\sqrt{s^2_y /n_y + s^2_x/n_x}} \]

and reject \(H_0\) if \(|z| > z_{\alpha/2}\).
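As an illustration (not from the source; made-up data, assuming numpy and scipy are available), the large-sample z-test and CI can be sketched as:

```python
# Sketch of the large-sample two-sample z-test; data are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.normal(5.0, 2.0, size=50)   # sample from population Y (n_y >= 30)
x = rng.normal(4.0, 2.5, size=60)   # sample from population X (n_x >= 30)

# Standard error: sqrt(s_y^2 / n_y + s_x^2 / n_x)
se = np.sqrt(y.var(ddof=1) / len(y) + x.var(ddof=1) / len(x))

# Test H0: mu_y - mu_x = delta_0 = 0 against the two-sided alternative
z = (y.mean() - x.mean()) / se
p_value = 2 * stats.norm.sf(abs(z))

# Approximate 100(1 - alpha)% CI for mu_y - mu_x, alpha = 0.05
lo, hi = y.mean() - x.mean() + np.array([-1.0, 1.0]) * stats.norm.ppf(0.975) * se
```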

If \(\delta_0 = 0\), we are testing whether the two means are equal.

Small Sample Tests

If the two samples are independent and drawn from normal distributions, iid \(N(\mu_y,\sigma^2_y)\) and iid \(N(\mu_x,\sigma^2_x)\), we can base inference on the t-distribution.

Then we have two cases, depending on whether the population variances are equal.

Equal Variance


  • iid: so that \(var(\bar{y}) = \sigma^2_y / n_y ; var(\bar{x}) = \sigma^2_x / n_x\)
  • Independence between samples: no observation from one sample can influence any observation from the other sample, so that

\[ \begin{aligned} var(\bar{y} - \bar{x}) &= var(\bar{y}) + var(\bar{x}) - 2cov(\bar{y},\bar{x}) \\ &= var(\bar{y}) + var(\bar{x}) \\ &= \sigma^2_y / n_y + \sigma^2_x / n_x \end{aligned} \]

Let \(\sigma^2 = \sigma^2_y = \sigma^2_x\). Then \(s^2_y\) and \(s^2_x\) are both unbiased estimators of \(\sigma^2\), so we can pool them.

The pooled variance estimate \[ s^2 = \frac{(n_y - 1)s^2_y + (n_x - 1)s^2_x}{(n_y-1)+(n_x-1)} \] has \(n_y + n_x -2\) df.

Then the test statistic

\[ T = \frac{\bar{y}- \bar{x} -(\mu_y - \mu_x)}{s\sqrt{1/n_y + 1/n_x}} \sim t_{n_y + n_x -2} \]

\(100(1 - \alpha) \%\) CI for \(\mu_y - \mu_x\) is

\[ \bar{y} - \bar{x} \pm t_{n_y + n_x -2;\alpha/2} \, s\sqrt{1/n_y + 1/n_x} \]
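A sketch of the pooled-variance test in Python (hypothetical data; the manual computation is checked against scipy's equal-variance t-test):

```python
# Pooled (equal-variance) two-sample t-test; data are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.normal(10.0, 3.0, size=12)
x = rng.normal(8.0, 3.0, size=15)
ny, nx = len(y), len(x)

# Pooled variance with n_y + n_x - 2 df
s2 = ((ny - 1) * y.var(ddof=1) + (nx - 1) * x.var(ddof=1)) / (ny + nx - 2)

t = (y.mean() - x.mean()) / np.sqrt(s2 * (1 / ny + 1 / nx))
p_value = 2 * stats.t.sf(abs(t), ny + nx - 2)

# scipy's built-in equal-variance test gives the same answer
t_ref, p_ref = stats.ttest_ind(y, x, equal_var=True)
```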

Hypothesis testing:
\[ H_0: \mu_y - \mu_x = \delta_0 \\ H_1: \mu_y - \mu_x \neq \delta_0 \]

we reject \(H_0\) if \(|t| > t_{n_y + n_x -2;\alpha/2}\).

Unequal Variance

We can check the assumptions as follows:


  1. Two samples are independent
    1. Scatter plots
    2. Correlation coefficient (if normal)
  2. Independence of observation in each sample
    1. Test for serial correlation
  3. For each sample, homogeneity of variance
    1. Scatter plots
    2. Formal tests
  4. Normality
  5. Equality of variances (homogeneity of variance between samples)
    1. F-test
    2. Bartlett test
    3. Modified Levene Test (Brown-Forsythe Test)

To compare 2 normal \(\sigma^2_y \neq \sigma^2_x\), we use the test statistic:

\[ T = \frac{\bar{y}- \bar{x} -(\mu_y - \mu_x)}{\sqrt{s^2_y/n_y + s^2_x/n_x}} \]

In this case, T does not follow the t-distribution (its distribution depends on the ratio of the unknown variances \(\sigma^2_y, \sigma^2_x\)). For small sample sizes, we can approximate the test by the Welch-Satterthwaite method (Satterthwaite 1946): we assume T can be approximated by a t-distribution and adjust the degrees of freedom.

Let \(w_y = s^2_y /n_y\) and \(w_x = s^2_x /n_x\) (the \(w\)'s are the squares of the respective standard errors).
Then, the degrees of freedom are

\[ v = \frac{(w_y + w_x)^2}{w^2_y / (n_y-1) + w^2_x / (n_x-1)} \]

Since \(v\) is usually fractional, we truncate it down to the nearest integer.

\(100 (1-\alpha) \%\) CI for \(\mu_y - \mu_x\) is

\[ \bar{y} - \bar{x} \pm t_{v,\alpha/2} \sqrt{s^2_y/n_y + s^2_x /n_x} \]

Reject \(H_0\) if \(|t| > t_{v,\alpha/2}\), where

\[ t = \frac{\bar{y} - \bar{x}-\delta_0}{\sqrt{s^2_y/n_y + s^2_x /n_x}} \]
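The Welch-Satterthwaite computation can be sketched as follows (hypothetical data; note scipy's Welch test uses the fractional \(v\) rather than truncating, so the manual result matches it exactly):

```python
# Welch-Satterthwaite approximate df and test; data are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
y = rng.normal(5.0, 1.0, size=10)
x = rng.normal(5.0, 4.0, size=14)

wy = y.var(ddof=1) / len(y)   # squared standard error of ybar
wx = x.var(ddof=1) / len(x)   # squared standard error of xbar

# Welch-Satterthwaite degrees of freedom
v = (wy + wx) ** 2 / (wy ** 2 / (len(y) - 1) + wx ** 2 / (len(x) - 1))

t = (y.mean() - x.mean()) / np.sqrt(wy + wx)
p_value = 2 * stats.t.sf(abs(t), v)

# scipy's Welch test (equal_var=False) uses the same approximation
t_ref, p_ref = stats.ttest_ind(y, x, equal_var=False)
```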

4.2.2 Variances

\[ F_{ndf,ddf}= \frac{s^2_1}{s^2_2} \]

where \(s^2_1 > s^2_2\), \(ndf = n_1 - 1\), and \(ddf = n_2 - 1\).

F-test


\[ H_0: \sigma^2_y = \sigma^2_x \\ H_a: \sigma^2_y \neq \sigma^2_x \]

Consider the test statistic,

\[ F= \frac{s^2_y}{s^2_x} \]

Reject \(H_0\) if

  • \(F>f_{n_y -1,n_x -1,\alpha/2}\) or
  • \(F<f_{n_y -1,n_x -1,1-\alpha/2}\)

where \(f_{n_y -1,n_x -1,\alpha/2}\) and \(f_{n_y -1,n_x -1,1-\alpha/2}\) are the upper and lower \(\alpha/2\) critical points of an F-distribution with \(n_y-1\) and \(n_x-1\) degrees of freedom.
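scipy has no direct analogue of R's var.test, but the F-test is easy to sketch (hypothetical data):

```python
# Two-sample F-test for equality of variances; data are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
y = rng.normal(0.0, 2.0, size=25)
x = rng.normal(0.0, 1.0, size=30)
alpha = 0.05

F = y.var(ddof=1) / x.var(ddof=1)             # F = s_y^2 / s_x^2
dfy, dfx = len(y) - 1, len(x) - 1

upper = stats.f.ppf(1 - alpha / 2, dfy, dfx)  # upper alpha/2 critical point
lower = stats.f.ppf(alpha / 2, dfy, dfx)      # lower alpha/2 critical point
reject = (F > upper) or (F < lower)

# Equivalent two-sided p-value
p_value = 2 * min(stats.f.sf(F, dfy, dfx), stats.f.cdf(F, dfy, dfx))
```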


  • This test depends heavily on the assumption of normality.
  • In particular, it can give too many significant results when observations come from long-tailed distributions (i.e., positive kurtosis).
  • If we cannot find support for normality, we can use tests that are robust to non-normality, such as the Modified Levene Test (Brown-Forsythe Test).

#>  F test to compare two variances
#> data:  irisVe and irisVi
#> F = 0.51842, num df = 49, denom df = 49, p-value = 0.02335
#> alternative hypothesis: true ratio of variances is not equal to 1
#> 95 percent confidence interval:
#>  0.2941935 0.9135614
#> sample estimates:
#> ratio of variances 
#>          0.5184243

Modified Levene Test (Brown-Forsythe Test)

  • The test considers averages of absolute deviations rather than squared deviations, and is hence less sensitive to long-tailed distributions.
  • The test still performs well for normal data.

For each sample, we consider the absolute deviation of each observation from the median:

\[ d_{y,i} = |y_i - y_{.5}| \\ d_{x,i} = |x_i - x_{.5}| \]

Then, the test statistic is

\[ t_L^* = \frac{\bar{d}_y-\bar{d}_x}{s \sqrt{1/n_y + 1/n_x}} \]

The pooled variance \(s^2\) is given by:

\[ s^2 = \frac{\sum_{i=1}^{n_y}(d_{y,i}-\bar{d}_y)^2 + \sum_{j=1}^{n_x}(d_{x,j}-\bar{d}_x)^2}{n_y + n_x -2} \]

  • If the error terms have constant variance and \(n_y\) and \(n_x\) are not extremely small, then \(t_L^* \sim t_{n_x + n_y -2}\)
  • We reject the null hypothesis when \(|t_L^*| > t_{n_y + n_x -2;\alpha/2}\)
  • This is just the two-sample t-test applied to the absolute deviations.
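The recipe above can be sketched in Python (hypothetical data). Note that scipy's levene with center="median" is the Brown-Forsythe test; for two groups its F statistic equals \((t_L^*)^2\), so we can cross-check:

```python
# Modified Levene (Brown-Forsythe) test via absolute deviations from the median.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
y = rng.normal(0.0, 1.0, size=30)
x = rng.normal(0.0, 2.0, size=30)

d_y = np.abs(y - np.median(y))   # d_{y,i} = |y_i - median(y)|
d_x = np.abs(x - np.median(x))   # d_{x,i} = |x_i - median(x)|

# Two-sample pooled t-test applied to the absolute deviations
t_L, p_value = stats.ttest_ind(d_y, d_x, equal_var=True)

# scipy's Brown-Forsythe test; for two groups its F statistic equals t_L^2
F_ref, p_ref = stats.levene(y, x, center="median")
```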
#>  Two Sample t-test
#> data:  dVe and dVi
#> t = -2.5584, df = 98, p-value = 0.01205
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -0.12784786 -0.01615214
#> sample estimates:
#> mean of x mean of y 
#>     0.154     0.226

# small samples t-test  
#>  Welch Two Sample t-test
#> data:  irisVe and irisVi
#> t = -14.625, df = 89.043, p-value < 2.2e-16
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -0.7951002 -0.6048998
#> sample estimates:
#> mean of x mean of y 
#>     1.326     2.026

4.2.3 Power

Consider \(\sigma^2_y = \sigma^2_x = \sigma^2\).
Under the assumption of equal variances, we take equal-size samples from both groups (\(n_y = n_x = n\)).

For 1-sided testing,

\[ H_0: \mu_y - \mu_x \le 0 \\ H_a: \mu_y - \mu_x > 0 \]

\(\alpha\)-level z-test rejects \(H_0\) if

\[ z = \frac{\bar{y} - \bar{x}}{\sigma \sqrt{2/n}} > z_{\alpha} \]

The power function is

\[ \pi(\mu_y - \mu_x) = \Phi\left(-z_{\alpha} + \frac{\mu_y -\mu_x}{\sigma}\sqrt{n/2}\right) \]

We need a sample size \(n\) that gives at least \(1-\beta\) power when \(\mu_y - \mu_x = \delta\), where \(\delta\) is the smallest difference that we want to detect.

Setting the power equal to \(1 - \beta\):

\[ \Phi(-z_{\alpha} + \frac{\delta}{\sigma}\sqrt{n/2}) = 1 - \beta \]

4.2.4 Sample Size

Then, the sample size is

\[ n = 2\left(\frac{\sigma (z_{\alpha} + z_{\beta})}{\delta}\right)^2 \]

For 2-sided test, replace \(z_{\alpha}\) with \(z_{\alpha/2}\).
As with the one-sample case, to perform an exact 2-sample t-test sample size calculation, we must use a non-central t-distribution.

A correction that gives the approximate t-test sample size can be obtained by using the z-test n value in the formula:
\[ n^* = 2\left(\frac{\sigma (t_{2n-2;\alpha} + t_{2n-2;\beta})}{\delta}\right)^2 \]

where we use \(\alpha/2\) in place of \(\alpha\) for the two-sided test.
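To make the formulas concrete, here is a sketch with made-up inputs (\(\delta = 0.5\), \(\sigma = 1\), \(\alpha = 0.05\), power \(0.8\)) of the z-based sample size and the t-correction:

```python
# Per-group sample size for the one-sided two-sample z-test, plus t-correction.
import math
from scipy import stats

alpha, beta = 0.05, 0.20      # 80% power
sigma, delta = 1.0, 0.5       # smallest difference worth detecting (made up)

z_a = stats.norm.ppf(1 - alpha)
z_b = stats.norm.ppf(1 - beta)

n = math.ceil(2 * (sigma * (z_a + z_b) / delta) ** 2)   # round up

# Power check at mu_y - mu_x = delta
power = stats.norm.cdf(-z_a + (delta / sigma) * math.sqrt(n / 2))

# Approximate t-test correction: plug the z-based n into the t-quantile formula
t_a = stats.t.ppf(1 - alpha, 2 * n - 2)
t_b = stats.t.ppf(1 - beta, 2 * n - 2)
n_star = math.ceil(2 * (sigma * (t_a + t_b) / delta) ** 2)
```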

4.2.5 Matched Pair Designs

We have two treatments

| Subject | Treatment A | Treatment B | Difference          |
|---------|-------------|-------------|---------------------|
| 1       | \(y_1\)     | \(x_1\)     | \(d_1 = y_1 - x_1\) |
| 2       | \(y_2\)     | \(x_2\)     | \(d_2 = y_2 - x_2\) |
| ...     | ...         | ...         | ...                 |
| n       | \(y_n\)     | \(x_n\)     | \(d_n = y_n - x_n\) |

We assume \(y_i \sim^{iid} N(\mu_y, \sigma^2_y)\) and \(x_i \sim^{iid} N(\mu_x,\sigma^2_x)\), but since \(y_i\) and \(x_i\) are measured on the same subject, they are correlated.


\[ \mu_D = E(y_i - x_i) = \mu_y -\mu_x \\ \sigma^2_D = var(y_i - x_i) = var(y_i) + var(x_i) -2cov(y_i,x_i) \]

If the matching induces positive correlation, then the variance of the difference of the measurements is reduced compared to the independent case. This is the point of Matched Pair Designs. Although the covariance can be negative, giving a larger variance of the difference than in the independent case, usually the covariance is positive: both \(y_i\) and \(x_i\) are large for many of the same subjects, and both measurements are small for others. (We still assume that different subjects respond independently of each other, which is necessary for the iid assumption within groups.)

Let \(d_i = y_i - x_i\), then

  • \(\bar{d} = \bar{y}-\bar{x}\) is the sample mean of the \(d_i\)
  • \(s_d^2=\frac{1}{n-1}\sum_{i=1}^n (d_i - \bar{d})^2\) is the sample variance of the difference

Once the data are converted to differences, we are back to One Sample Inference and can use its tests and CIs.
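A sketch with made-up paired data: a shared subject effect induces the positive correlation, and the one-sample t-test on the differences matches scipy's paired test:

```python
# Matched-pair analysis: reduce to differences, then one-sample inference.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 20
subject = rng.normal(0.0, 2.0, size=n)        # subject effect -> correlation
y = subject + rng.normal(1.0, 1.0, size=n)    # treatment A
x = subject + rng.normal(0.0, 1.0, size=n)    # treatment B

d = y - x                                     # d_i = y_i - x_i
t, p_value = stats.ttest_1samp(d, 0.0)        # one-sample t-test on differences

# Identical to scipy's paired t-test
t_ref, p_ref = stats.ttest_rel(y, x)
```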

4.2.6 Nonparametric Tests for Two Samples

For Matched Pair Designs, we can use the One-sample Non-parametric Methods.

Assume that \(Y\) and \(X\) are random variables with CDFs \(F_Y\) and \(F_X\). Then \(Y\) is stochastically larger than \(X\) if, for all real numbers \(u\), \(P(Y > u) \ge P(X > u)\).

Equivalently, \(P(Y \le u) \le P(X \le u)\), i.e., \(F_Y(u) \le F_X(u)\), which we abbreviate as \(F_Y < F_X\).

If the two distributions are identical except that one is shifted relative to the other, then each distribution can be indexed by a location parameter, say \(\theta_y\) and \(\theta_x\). In this case, \(Y\) is stochastically larger than \(X\) if \(\theta_y > \theta_x\).

Consider the hypotheses,

\[ H_0: F_Y = F_X \\ H_a: F_Y < F_X \]

where the alternative is the upper one-sided alternative.

  • We can also consider the lower one-sided alternative \(H_a: F_Y > F_X\), or the two-sided alternative

\[ H_a: F_Y < F_X \text{ or } F_Y > F_X \]

  • In this case, we don’t use \(H_a: F_Y \neq F_X\), as that allows arbitrary differences between the distributions without requiring one to be stochastically larger than the other.

If the distributions only differ in terms of their location parameters, we can focus hypothesis tests on the parameters (e.g., \(H_0: \theta_y = \theta_x\) vs. \(\theta_y > \theta_x\))

We have 2 equivalent nonparametric tests that consider the hypothesis mentioned above

  1. Wilcoxon rank test
  2. Mann-Whitney U test

Wilcoxon rank test

  1. Combine all \(n= n_y + n_x\) observations and rank them in ascending order.
  2. Sum the ranks of the \(y\)’s and \(x\)’s separately. Let \(w_y\) and \(w_x\) be these sums. (\(w_y + w_x = 1 + 2 + ... + n = n(n+1)/2\))
  3. Reject \(H_0\) if \(w_y\) is large (equivalently, \(w_x\) is small)

Under \(H_0\), any arrangement of the \(y\)’s and \(x\)’s is equally likely to occur, and there are \((n_y + n_x)!/(n_y! n_x!)\) possible arrangements.

  • Technically, for each arrangement we can compute the values of \(w_y\) and \(w_x\), and thus generate the distribution of the statistic under the null hypothesis.
  • This can be computationally intensive, so in practice a large-sample approximation is often used instead:

wilcox.test(irisVe, irisVi,
  alternative = "two.sided",
  conf.level = 0.95,
  exact = FALSE,
  correct = TRUE
)
#>  Wilcoxon rank sum test with continuity correction
#> data:  irisVe and irisVi
#> W = 49, p-value < 2.2e-16
#> alternative hypothesis: true location shift is not equal to 0

Mann-Whitney U test

The Mann-Whitney test is computed as follows:

  1. Compare each \(y_i\) with each \(x_j\).
    Let \(u_y\) be the number of pairs in which \(y_i > x_j\) and \(u_x\) be the number of pairs in which \(y_i < x_j\) (assume there are no ties). There are \(n_y n_x\) such comparisons and \(u_y + u_x = n_y n_x\).
  2. Reject \(H_0\) if \(u_y\) is large (or \(u_x\) is small)

Mann-Whitney U test and Wilcoxon rank test are related:
\[ u_y = w_y - n_y(n_y+1) /2 \\ u_x = w_x - n_x(n_x +1)/2 \]

An \(\alpha\)-level test rejects \(H_0\) if \(u_y \ge u_{n_y,n_x,\alpha}\), where \(u_{n_y,n_x,\alpha}\) is the upper \(\alpha\) critical point of the null distribution of the random variable, U.

The p-value is defined to be \(P(U \ge u_y) = P(U \le u_x)\). One advantage of the Mann-Whitney U test is that we can use either \(u_y\) or \(u_x\) to carry out the test.

For large \(n_y\) and \(n_x\), the null distribution of U can be well approximated by a normal distribution with mean \(E(U) = n_y n_x /2\) and variance \(var(U) = n_y n_x (n+1)/12\). A large sample z-test can be based on the statistic:

\[ z = \frac{u_y - n_y n_x /2 -1/2}{\sqrt{n_y n_x (n+1)/12}} \]

The test rejects \(H_0\) at level \(\alpha\) if \(z \ge z_{\alpha}\) or if \(u_y \ge u_{n_y,n_x,\alpha}\) where

\[ u_{n_y, n_x, \alpha} \approx n_y n_x /2 + 1/2 + z_{\alpha}\sqrt{n_y n_x (n+1)/12} \]

For the two-sided test, we use the test statistics \(u_{max} = \max(u_y,u_x)\) and \(u_{min} = \min(u_y, u_x)\), and the p-value is given by

\[ p\text{-value} = 2P(U \ge u_{max}) = 2P(U \le u_{min}) \]

If there are ties (\(y_i = x_j\)), we count 1/2 towards both \(u_y\) and \(u_x\). The sampling distribution is then no longer exactly the same, but the large-sample approximation is still reasonable.
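The rank-sum and U relations above can be checked numerically (made-up data; note that recent scipy versions report the U statistic for the first argument of mannwhitneyu, so we check either orientation):

```python
# Wilcoxon rank sums and the Mann-Whitney U relation; data are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
y = rng.normal(1.0, 1.0, size=12)
x = rng.normal(0.0, 1.0, size=15)
ny, nx = len(y), len(x)

ranks = stats.rankdata(np.concatenate([y, x]))  # rank all n = ny + nx values
w_y = ranks[:ny].sum()                          # rank sum of the y's
w_x = ranks[ny:].sum()                          # rank sum of the x's

u_y = w_y - ny * (ny + 1) / 2                   # u_y from the rank sum
u_x = w_x - nx * (nx + 1) / 2                   # u_x from the rank sum

u_ref, p_value = stats.mannwhitneyu(y, x, alternative="two-sided")
```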


Satterthwaite, Franklin E. 1946. “An Approximate Distribution of Estimates of Variance Components.” Biometrics Bulletin 2 (6): 110–14.