3 A/B Testing
Product data science interviews tend to allocate 30 minutes to an hour for A/B test problems. You will need to be proficient in the following:
- A/B test assumptions (normality, randomization, no interference, etc.)
- Common problems that arise with A/B tests and relevant mitigations
- Sample size and power analysis
- Everything we discussed in Chapter 2.
If you are preparing for PDS interviews you should study the excellent book on experiments by Kohavi, Tang, and Xu (2020).
3.1 Sample Ratio Mismatch
Question: What are some reasons that can cause Sample Ratio Mismatch (SRM)? How do you test for SRM?
Answer: First, let's recall what SRM is: a Sample Ratio Mismatch occurs when the observed ratio of users across experiment groups differs significantly from the ratio the experiment was designed to produce (e.g., a 50/50 split that comes out 55/45).
Some of the reasons that SRM can happen include:
- Page redirects for the treatment (e.g., the treatment is implemented through web-page redirects that take significantly longer)
- Bad hash randomization (more generally, buggy randomization code)
- Trigger conditions for the treatment that are influenced by the experiment itself (more generally, bad trigger conditions can lead to imbalance)
- Data-pipeline processing, e.g., removing users who are inactive or deemed bots
- Time of day at which the treatment occurs for test and control, which can bias metric measurement
We can test for SRM with a chi-squared goodness-of-fit test, where the null hypothesis is that the groups follow the intended sample ratio. Consider, for instance, that we split 1,000 users evenly but the actual groups come out 550 and 450. We can compute the \(\chi^2\) statistic as follows:
\[ \chi^2 = \sum_i \frac{(O_i - N p_i)^2}{Np_i} = \frac{(550-500)^2 + (450-500)^2}{500} = 10 \] With one degree of freedom, a \(\chi^2\) value of 10 corresponds to a p-value of about 0.002, far too small to have come from the null, so we reject the null.
from scipy.stats import chisquare

observed = [550, 450]  # actual group sizes
expected = [500, 500]  # intended 50/50 split of 1000 users
stat, pval = chisquare(observed, f_exp=expected)
print(f"chi squared statistic: {stat} \np-val: {pval:.3f}")
## chi squared statistic: 10.0
## p-val: 0.002
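To see how an SRM shows up in practice, here is a minimal sketch (the assignment probability, sample size, and random seed are hypothetical) of a buggy randomizer that assigns treatment with probability 0.52 instead of the intended 0.50; the same chi-squared check flags the resulting mismatch:

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical buggy randomizer: assigns treatment with p = 0.52
# instead of the intended 0.50, producing an SRM at scale.
rng = np.random.default_rng(0)
n_users = 100_000
in_treatment = rng.random(n_users) < 0.52  # True = treatment group

observed = [in_treatment.sum(), n_users - in_treatment.sum()]
expected = [n_users / 2, n_users / 2]      # intended 50/50 split

stat, pval = chisquare(observed, f_exp=expected)
print(f"observed split: {observed}, chi2 = {stat:.1f}, p = {pval:.2e}")
```

Even a 2% skew in the assignment probability produces an enormous \(\chi^2\) statistic at this sample size, which is why SRM checks are run routinely on large experiments.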
3.2 Simpson’s paradox
Question: What is Simpson’s paradox and can you give an example where it can arise?
Answer: Simpson's paradox occurs when a relationship that appears within individual groups of data disappears or even reverses when the groups are combined. It typically occurs when there is a confounding variable that affects the composition of the individual groups.
Mathematically, there is really no paradox. Consider the following table:
| | Treatment A | Treatment B |
|---|---|---|
| Group 1 | a/b | A/B |
| Group 2 | c/d | C/D |
| Aggregate | \(\frac{a+c}{b+d}\) | \(\frac{A+C}{B+D}\) |
It is completely possible to have the following relationships:
\[ \begin{align} a/b &< A/B \\ c/d &< C/D \\ \frac{a+c}{b+d} &> \frac{A+C}{B+D} \end{align} \] This is what Simpson's paradox looks like mathematically; conceptually, however, it is trickier to explain, which is why it is called a paradox. Let's see an example.
Assume the scenario above, where we have two treatments, A and B, that can potentially cure a disease, and the disease can be either severe or not severe. If we don't properly randomize the populations taking the drugs, we may end up with a majority of severe patients in one group. For instance:
| | Treatment A | Treatment B |
|---|---|---|
| Not Severe | 9/10 | 85/100 |
| Severe | 60/100 | 5/10 |
| Aggregate | 69/110 | 90/110 |
So in the table above, 9/10 > 85/100 and 60/100 > 5/10 within each subgroup, but once combined, 69/110 < 90/110, and hence we have Simpson's paradox.
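To make the arithmetic concrete, here is a small sketch in Python (using the hypothetical cure counts from the table above) verifying that A wins within every severity subgroup yet loses in aggregate:

```python
from fractions import Fraction

# (cured, total) per severity subgroup, from the table above
treatment_A = {"not_severe": (9, 10), "severe": (60, 100)}
treatment_B = {"not_severe": (85, 100), "severe": (5, 10)}

def aggregate(groups):
    """Pooled cure rate across all subgroups."""
    cured = sum(c for c, _ in groups.values())
    total = sum(n for _, n in groups.values())
    return Fraction(cured, total)

# A beats B within each severity subgroup...
for g in treatment_A:
    a_rate = Fraction(*treatment_A[g])
    b_rate = Fraction(*treatment_B[g])
    print(g, a_rate > b_rate)  # prints True for both subgroups

# ...but B beats A once the subgroups are combined: Simpson's paradox
print(aggregate(treatment_A) < aggregate(treatment_B))  # True
```

Using exact fractions avoids any floating-point ambiguity when comparing the rates.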
Note that in order for Simpson's paradox to occur, the data splits need to be unequal (e.g., as a result of buggy randomization that introduces selection bias). For instance, in the example above, the paradox disappears if the groups are evenly split:
| | Treatment A | Treatment B |
|---|---|---|
| Not Severe | 90/100 | 85/100 |
| Severe | 60/100 | 50/100 |
| Aggregate | 150/200 | 135/200 |
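As a quick check (a sketch using the counts from the balanced table above), aggregating the evenly split groups now preserves the subgroup ordering, so no paradox arises:

```python
def agg_rate(groups):
    """Pooled cure rate over a list of (cured, total) subgroup counts."""
    cured = sum(c for c, _ in groups)
    total = sum(n for _, n in groups)
    return cured / total

# (cured, total) for [not severe, severe]; subgroups are evenly sized
A = [(90, 100), (60, 100)]
B = [(85, 100), (50, 100)]

print(agg_rate(A), agg_rate(B))  # 0.75 0.675: A wins in aggregate too
```

With equal subgroup sizes, the aggregate rate is just the average of the subgroup rates, so it cannot reverse the subgroup-level ordering when one treatment wins every subgroup.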
Simpson's paradox is an extremely popular interview topic, and it often arises in subtle ways (e.g., when combining experiments with different results). It is extremely important that you understand it well; otherwise it is easy to miss in an interview setting.
Simpson's paradox does not show up exclusively in A/B testing questions. For instance, it can come up in machine learning system design sessions (Huyen 2022, p. 186), as it can occur when comparing models in production (e.g., model B might be better than model A overall, yet model A performs better on each individual subgroup of the population).
3.3 Additional Questions
The following A/B testing questions are included in the complete book that defines the bar for product data science:
Question | Topics |
---|---|
Counterfactual definition | Counterfactual |
Randomization checks | Randomization |
Novelty and primacy effects | Novelty effects, Primacy effects |
False discovery control | False discovery rate, Multiple hypotheses testing, Benjamini & Hochberg, Bonferroni |
Interference | Interference |
AA tests | Variance |
Randomization level | Randomization, Variance |
Normality assumption | Normality |
Reducing variance in A/B testing | Variance |
Equal-sized treatment and control groups | Power, Variance, Sample size |
Multi-armed and contextual bandits in A/B testing | Contextual bandits, Multi-armed bandits |