2 Statistics

Product data science interviews tend to allocate 30 minutes to an hour for statistics problems. You will need to be proficient in the following:

  • Hypothesis testing
  • Central limit theorem
  • Expectation estimation
  • Variance, Covariance
  • Everything we discussed in Chapter 1.



If you are preparing for PDS interviews you should study the excellent book on experiments by Kohavi, Tang, and Xu (2020).


2.1 Two-sample t-test

Difficulty: Easy | Seen in a real interview | Topics: Hypothesis testing


Question: What is a two-sample t-test?

Answer: A two-sample t-test tests whether there is a statistically significant difference between the means of two populations. Assume two samples, \(X_t, X_c\); for simplicity, let us call them treatment and control samples. We can define the difference of their means as: \(\Delta = \bar{X}_t - \bar{X}_c\). Then, the two-sample t-statistic can be written (p.186, Kohavi, Tang, and Xu (2020)) as:

\[ T = \frac{\Delta}{\sqrt{Var(\Delta)}} \]

Let’s expand on the previous relationship:

\[ T = \frac{ \bar{X}_t - \bar{X}_c}{\sqrt{Var(\bar{X}_t) + Var(\bar{X}_c) - 2Cov(\bar{X}_t, \bar{X}_c)}} \]

Since the two samples are independent, the covariance of the means is zero. The variance of the treatment mean is:

\[ Var(\bar{X}_t) = Var\left(\frac{X_1+X_2+\dots+X_{n_t}}{n_t}\right) = \frac{1}{n_t^2} \bigg[ Var(X_1) + Var(X_2) + \dots + Var(X_{n_t}) \bigg] = \frac{n_t}{n_t^2} s_t^2 = \frac{s_t^2}{n_t} \]

where we assumed that all observations are independent and come from the same distribution, with empirical variance \(s_t^2\). The same derivation applies to the control group. Hence, our t-statistic can be written as:

\[ T = \frac{ \bar{X}_t - \bar{X_c}}{\sqrt{\frac{s_t^2}{n_t}+ \frac{s_c^2}{n_c}}} \]


This definition corresponds to Welch's t-test, which does not assume equal variances in the two samples. For a thorough discussion and comparison with Student's t-test, see https://en.wikipedia.org/wiki/Welch%27s_t-test
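The formula above can be checked numerically. The following is a minimal sketch, assuming synthetic normally distributed samples; the helper name `welch_t` and the sample parameters are illustrative and not from the book. It computes the statistic by hand and compares it with `scipy.stats.ttest_ind` called with `equal_var=False`.

```python
import numpy as np
from scipy import stats


def welch_t(x_t, x_c):
    """Welch's t statistic: difference in means over its estimated standard error."""
    x_t = np.asarray(x_t, dtype=float)
    x_c = np.asarray(x_c, dtype=float)
    delta = x_t.mean() - x_c.mean()
    # Var(mean) = s^2 / n, using the unbiased sample variance (ddof=1)
    se = np.sqrt(x_t.var(ddof=1) / x_t.size + x_c.var(ddof=1) / x_c.size)
    return delta / se


rng = np.random.default_rng(0)
treatment = rng.normal(loc=10.5, scale=2.0, size=200)  # synthetic treatment sample
control = rng.normal(loc=10.0, scale=3.0, size=250)    # synthetic control sample

t_manual = welch_t(treatment, control)
t_scipy, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"manual T = {t_manual:.4f}, scipy T = {t_scipy:.4f}, p-value = {p_value:.4f}")
```

The two statistics should agree up to floating-point error, since both divide the difference in means by the square root of the summed per-group variances of the mean.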


2.2 Relationship between p-value and confidence interval

Difficulty: Easy | Seen in a real interview | Topics: Confidence interval, P-value, Hypothesis testing


Question: What is the relationship between a p-value and a confidence interval?

Answer: A 95% confidence interval contains all values of a parameter that, if tested as the null hypothesis, would give a p-value \(\geq 0.05\).

More often, we use the relationship between confidence intervals and p-values in an experimental setting, where we compare two quantities and the null hypothesis is that there is no difference. In this specific context, a 95% confidence interval of the treatment effect that does not contain zero implies a p-value of \(p < 0.05\).
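This duality can be illustrated with a small sketch, assuming a normally distributed effect estimate with a known standard error. The function name `z_test_and_ci` and the effect sizes are hypothetical, chosen only for illustration.

```python
from scipy.stats import norm


def z_test_and_ci(delta, se, alpha=0.05):
    """Two-sided z-test p-value for H0: effect = 0, and the matching (1 - alpha) CI."""
    z = delta / se
    p_value = 2 * norm.sf(abs(z))                # two-sided tail probability
    half_width = norm.ppf(1 - alpha / 2) * se    # about 1.96 * se for alpha = 0.05
    return p_value, (delta - half_width, delta + half_width)


for delta in (0.5, 1.5, 2.5):  # hypothetical effect estimates, each with se = 1
    p, (lo, hi) = z_test_and_ci(delta, se=1.0)
    excludes_zero = lo > 0 or hi < 0
    print(f"delta={delta}: p={p:.4f}, CI=({lo:.2f}, {hi:.2f}), "
          f"CI excludes 0: {excludes_zero}, p < 0.05: {p < 0.05}")
```

In each row the two boolean flags coincide: the interval excludes zero exactly when the p-value falls below 0.05.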

2.3 Measuring sticks

Difficulty: Medium | Seen in a real interview | Topics: Variance


Question: Assume there are two sticks with lengths \(l1\) and \(l2\). You have an instrument that can measure the length of a stick with an error \(e \sim N(0,\sigma^2)\). Now assume that your budget constrains you to use the instrument only twice. What is the best way to use your budget in order to get the most accurate measurements of both sticks?

Answer: The tricky part here is to realize that your goal is to minimize the variance of the measurement. The naive approach would be to measure \(l1\) and then \(l2\). In that case:

\[ \begin{align*} \hat{l1} &= l1 + e_1 \\ \hat{l2} &= l2 + e_2 \\ Var(\hat{l1}) &= Var(e_1)= \sigma^2 \\ Var(\hat{l2}) &= Var(e_2)= \sigma^2 \end{align*} \]

A better way to do this is to measure the sum and the difference of the two lengths, for example by placing the sticks end to end and then side by side:

\[ \begin{align*} \hat{m1} &= l1 + l2 + e_1 \\ \hat{m2} &= l2 - l1 + e_2 \\ \hat{l1} &= \frac{1}{2}(\hat{m1} - \hat{m2}) = l1 + \frac{1}{2}(e_1-e_2) \\ \hat{l2} &= \frac{1}{2}(\hat{m1} + \hat{m2}) = l2 + \frac{1}{2}(e_1+e_2) \\ Var(\hat{l1}) &= \frac{1}{4} [Var(e_1) + Var(e_2)] = \frac{1}{2}\sigma^2 \\ Var(\hat{l2}) &= \frac{1}{4} [Var(e_1) + Var(e_2)] = \frac{1}{2}\sigma^2 \end{align*} \]

In the above, the covariance of the two errors is zero since they are independent measurements.
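A quick simulation can confirm the variance reduction. This is a minimal sketch; the true lengths, the error scale, and the number of simulated runs are arbitrary values assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
l1, l2, sigma = 3.0, 5.0, 1.0   # hypothetical true lengths and error scale
n_sim = 200_000                 # number of simulated measurement pairs

e1 = rng.normal(0.0, sigma, n_sim)  # error of the first use of the instrument
e2 = rng.normal(0.0, sigma, n_sim)  # error of the second use of the instrument

# Naive strategy: one direct measurement per stick
naive_l1 = l1 + e1
naive_l2 = l2 + e2

# Sum/difference strategy: measure l1 + l2 and l2 - l1, then solve for each length
m1 = l1 + l2 + e1
m2 = l2 - l1 + e2
combo_l1 = 0.5 * (m1 - m2)
combo_l2 = 0.5 * (m1 + m2)

print(f"naive: Var(l1_hat)={naive_l1.var():.3f}, Var(l2_hat)={naive_l2.var():.3f}")
print(f"combo: Var(l1_hat)={combo_l1.var():.3f}, Var(l2_hat)={combo_l2.var():.3f}")
```

With \(\sigma = 1\), the naive variances should come out close to 1 and the sum/difference variances close to 0.5, while both strategies remain unbiased.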

2.4 Biased coin

Difficulty: Medium | Seen in a real interview | Topics: Expectation, CLT, Binomial, Normal, Bernoulli, Hypothesis testing, CDF


Question: A coin was flipped 1000 times, and 550 of the flips turned out to be heads. Do you think this coin is biased?

Answer: We will solve this problem in two ways, both of which will invoke the Central Limit Theorem.

A. Solution through Binomial approximation: Assume that \(X_i\) represents a coin flip (1 for heads, 0 for tails). The random variable \(X = \sum_i X_i\) follows a Binomial distribution (sum of Bernoulli trials) and hence \(E[X] = np\) and \(Var(X) = np(1-p)\). With \(n=1000\) and \(p=0.5\), the mean is 500 and the variance is 250. Since the Binomial can be approximated by the Normal distribution, we can estimate the probability of observing 550 or more heads under the assumption (null hypothesis) that the coin is fair:

\[ \begin{align} \Pr(X \geq 550) &= 1 - \Pr(X < 550) \\ &= 1 - \Pr(Z < \frac{550-500}{\sqrt{250}}) \\ &= 1 - \Pr(Z < 3.16) \approx 0.0008 \end{align} \]

In the above, we standardized \(X\) to generate the Z statistic, even though it wasn't necessary, and we used the Binomial's mean and variance given above. Based on this result, we can reject the null hypothesis that the coin is fair, since under the null the probability of observing 550 heads or more would be almost zero.


Standardization helps us get a quick understanding of where we land in the null distribution. Under the standard normal, we know that about 2 standard deviations (1.96) are enough to reject a null hypothesis at significance level \(\alpha=0.05\) in a two-sided test. Hence, when we get a Z-statistic of 3.16, we know that we are further to the right than 2, and we can immediately see that it is unlikely for this coin to be fair.


If all of the above is confusing, you can always compute these tail probabilities directly in code:

from scipy.stats import norm
import numpy as np
print(f"Standard normal: {1- norm.cdf(3.16):.4f}")
## Standard normal: 0.0008
print(f"Original Binomial approximation: {1- norm.cdf(550,loc=500, scale=np.sqrt(250)):.4f}")
## Original Binomial approximation: 0.0008

B. Solution through Bernoulli trials and CLT: We can apply the Central Limit Theorem directly to the coin flips. Specifically, under the null hypothesis that the coin is fair, the sample proportion of heads \(\bar{X}\) approximately follows a normal \(N(0.5, \frac{0.5^2}{n})\), whose standard deviation is \(\sqrt{0.25/1000} \approx 0.0158\). Then, similarly to what we did before:

\[ \begin{align} \Pr(\bar{X} \geq \frac{550}{1000}) &= 1- \Pr(\bar{X} < 0.55) \\ &= 1 - \Pr(Z < \frac{0.55-0.5}{0.0158}) \\ &= 1 - \Pr(Z < 3.16) \approx 0.0008 \end{align} \]
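As a cross-check on the normal approximation, the same tail probability can be computed from the exact Binomial distribution and estimated by Monte Carlo simulation. The sketch below assumes a fair coin under the null; the number of simulated experiments and the variable names are illustrative choices.

```python
import numpy as np
from scipy.stats import binom, norm

n, heads, p0 = 1000, 550, 0.5

# Exact Binomial tail: P(X >= 550) = P(X > 549)
p_exact = binom.sf(heads - 1, n, p0)

# Normal approximation without continuity correction, as in the derivation above
p_normal = norm.sf((heads - n * p0) / np.sqrt(n * p0 * (1 - p0)))

# Monte Carlo: simulate many experiments of 1000 fair flips each
rng = np.random.default_rng(0)
sims = rng.binomial(n, p0, size=1_000_000)
p_sim = np.mean(sims >= heads)

print(f"exact: {p_exact:.5f}, normal approx: {p_normal:.5f}, simulated: {p_sim:.5f}")
```

All three estimates should be well below 0.05, so the conclusion does not depend on which method is used. The exact value differs slightly from the normal approximation because no continuity correction is applied.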

2.5 Prussian horses

Difficulty: Medium | Seen in a real interview | Topics: Poisson, Hypothesis testing, CDF


Question: The following dataset includes the number of deaths per year and corps:

library(tidyverse)
d =read_csv("../../data/prussian.csv")
d %>% head()
## # A tibble: 6 × 3
##   deaths  year corp 
##    <dbl> <dbl> <chr>
## 1      0    75 G    
## 2      2    76 G    
## 3      2    77 G    
## 4      1    78 G    
## 5      0    79 G    
## 6      0    80 G
d %>% summary
##      deaths         year           corp          
##  Min.   :0.0   Min.   :75.00   Length:280        
##  1st Qu.:0.0   1st Qu.:79.75   Class :character  
##  Median :0.0   Median :84.50   Mode  :character  
##  Mean   :0.7   Mean   :84.50                     
##  3rd Qu.:1.0   3rd Qu.:89.25                     
##  Max.   :4.0   Max.   :94.00

A. What kind of distribution does the number of deaths follow and with what parameters?

B. How would you test your response to question A?

C. If you had observed 4 deaths, could you reject the null that 4 deaths could be derived by the distribution of question A?

Answer:

A. Given that the data are non-negative integer counts of rare events, a natural choice is to fit the number of deaths with a Poisson distribution with \(\lambda = 0.7\) (the sample mean shown above). Doing so gives the following:

lambda = 0.7
NumberOfObs = nrow(d)
predicted = c()
for(i in 0:4){
  prob = exp(-lambda) * lambda^i/factorial(i)
  cat(i,prob, " expected number of obs:", round(prob * NumberOfObs,1), 
      " vs. actual obs: ", d %>% filter(deaths==i) %>% count %>% pull, "\n")
  predicted = c(predicted,round(prob * NumberOfObs) )
}
## 0 0.4965853  expected number of obs: 139  vs. actual obs:  144 
## 1 0.3476097  expected number of obs: 97.3  vs. actual obs:  91 
## 2 0.1216634  expected number of obs: 34.1  vs. actual obs:  32 
## 3 0.02838813  expected number of obs: 7.9  vs. actual obs:  11 
## 4 0.004967922  expected number of obs: 1.4  vs. actual obs:  2

The predicted number of observations for each number of deaths is pretty close to the actually observed ones, so, just by looking at these numbers, we can feel confident that we made a good choice of distribution.

B. We can do a chi-squared test to see if there is statistical evidence that our data indeed follows a Poisson distribution:

actual = d %>% count(deaths) %>%  pull
chisq.test(actual,p=predicted,rescale.p = T)
## 
##  Chi-squared test for given probabilities
## 
## data:  actual
## X-squared = 2.7801, df = 4, p-value = 0.5953

The null hypothesis of the above test is that the observed counts follow the specified (Poisson) distribution. With a p-value of 0.5953 we cannot reject the null, and hence the data are consistent with having been generated by a Poisson distribution.
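For readers working in Python, a comparable goodness-of-fit check can be sketched with `scipy.stats.chisquare`. The observed counts are copied from the output in part A. Unlike the R call, this sketch uses unrounded Poisson probabilities, so the statistic will differ slightly from the value reported above. The default degrees of freedom (number of categories minus one) are kept; a stricter analysis might pass `ddof=1` to account for \(\lambda\) being estimated from the data.

```python
import numpy as np
from scipy.stats import chisquare, poisson

# Observed number of corps-years with 0, 1, 2, 3, 4 deaths (from the output in part A)
observed = np.array([144, 91, 32, 11, 2])
lam = 0.7

# Poisson probabilities for 0..4 deaths, rescaled to sum to 1 over these categories
probs = poisson.pmf(np.arange(5), mu=lam)
expected = probs / probs.sum() * observed.sum()

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-squared = {stat:.4f}, p-value = {p_value:.4f}")
```

A large p-value here leads to the same conclusion as the R test: there is no evidence against the Poisson fit.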

C. Here the null hypothesis is that the observation of 4 deaths was drawn from the Poisson distribution of question A. The probability of observing 4 or more deaths under a Poisson distribution with \(\lambda=0.7\) is \(\Pr(X \geq 4) = 1 - \Pr(X \leq 3) \approx 0.0058\), which is less than 0.05 (assuming significance level \(\alpha=0.05\)). Hence there is evidence to reject the null hypothesis that this observation was generated by this particular Poisson.

In R:

cat(round(1 - ppois(3,0.7),4))
## 0.0058

In Python:

from scipy.stats import poisson
print(round(1-poisson.cdf(3,mu=0.7),4))
## 0.0058


Can the Poisson be approximated by a Normal? Yes: when \(\lambda > 10\), the Poisson is typically well approximated by the Normal \(N(\lambda, \lambda)\), that is, with mean and variance both equal to \(\lambda\).


For instance:

plot(dpois(0:20,10))
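The quality of the approximation can also be checked numerically. The following Python sketch compares the Poisson probability mass function with the density of \(N(\lambda, \lambda)\) at integer points; the helper name `max_gap` and the choice of evaluation grid are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm, poisson


def max_gap(lam, k_max=20):
    """Largest absolute gap between the Poisson pmf and the N(lam, lam) density on 0..k_max."""
    k = np.arange(0, k_max + 1)
    pois_pmf = poisson.pmf(k, mu=lam)
    norm_pdf = norm.pdf(k, loc=lam, scale=np.sqrt(lam))  # scale is the standard deviation
    return np.max(np.abs(pois_pmf - norm_pdf))


for lam in (0.7, 10):
    print(f"lambda={lam}: max |Poisson pmf - Normal pdf| = {max_gap(lam):.4f}")
```

The gap for \(\lambda = 10\) should be much smaller than for \(\lambda = 0.7\), which is why the normal approximation is not a good fit for the horse-kick data above.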




2.6 Additional Questions

The following statistics questions are included in the complete book that defines the bar for product data science:

Question | Topics
Confidence interval definition | Confidence interval, Hypothesis testing
P-value definition | P-value, Hypothesis testing
An intuitive way to write power | Power, Hypothesis testing
Tests for normality | Hypothesis testing, Normality
Confidence intervals that overlap | Confidence interval, Hypothesis testing
Manual estimation of flips | Normal, CDF, Binomial, CLT
CI of flipping heads | Confidence interval, CLT, Bernoulli trials
Buy and sell stocks | Gambler's ruin, Expectation, Recursion, Random walk
Expected number of consecutive heads | Expectation
Number of draws to get greater than 1 | Normal, Geometric, CDF, Expectation
Gambler's ruin win probability | Gambler's ruin, Random walk, Expectation
Distribution of a CDF | CDF, Inverse transform
Covariance of dependent variables | Variance, Uniform, Covariance, Expectation
Dynamic coin flips | Expectation, Simulation
Monotonic draws | Expectation

References

Kohavi, Ron, Diane Tang, and Ya Xu. 2020. Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press.