1 Probability fundamentals
Product data science interviews tend to allocate 15 minutes to an hour for probability theory problems. You will need to be proficient in the following:
- Probability axioms, independence, conditional probability, Bayes rule
- Combinatorics (permutations, combinations)
- Discrete and continuous random variables, probability mass function, cumulative distribution function, probability density function
- Distributions (Uniform, Bernoulli, Binomial, Poisson, Geometric)
For a deeper understanding of these topics, I suggest studying the excellent probability and statistics book by Pishro-Nik (2014).
1.1 Trailing by two: should we go for two or three?
Question: Assume you are a basketball coach, and your team is losing by 2 points. You have the ball for the last shot. The probability of winning in overtime is roughly 50%. The probability of making a three-pointer is 36%. The probability of making a two-pointer is 52%.
Should you go for a two-pointer and take the game to overtime, or should you go for a buzzer-beating three-pointer to win the game?
Assuming everything else stays the same, adjust the probability of making a two-pointer to change the decision. What should that probability be?
Answer: We will use conditional probability and independence to solve this problem and make a data-driven decision.
- The probability of winning if you choose a two-pointer will be:
\[ \begin{align} \Pr \text{(win | 2pointer)} &= \Pr \text{(Overtime | 2pointer)} \cdot \Pr\text{(win in overtime)} \\ &= 0.52 \cdot 0.5 = 0.26 \end{align} \]
Since 0.26 is less than 0.36, the coach should choose to go for a buzzer-beating three-pointer.
- To solve the second part, we need to solve for the two-pointer probability:
\[ \begin{align} \Pr \text{(win | 2pointer)} & \geq 0.36 \Leftrightarrow \\ \Pr \text{(Overtime | 2pointer)} & \geq 0.36/\Pr\text{(win in overtime)} \\ & = 0.36/0.5 \\ & = 0.72 \end{align} \]
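The arithmetic for both parts can be sanity-checked with a few lines of Python (the probabilities are those given in the question):

```python
# Win probabilities from the question
p_two, p_three, p_overtime = 0.52, 0.36, 0.50

# Expected win probability for each strategy
p_win_two = p_two * p_overtime   # make the two, then win in overtime
p_win_three = p_three            # make the three, win outright

print(f"two-pointer: {p_win_two:.2f}, three-pointer: {p_win_three:.2f}")

# Break-even: the two-point percentage at which both strategies tie
breakeven = p_three / p_overtime
print(f"break-even two-point percentage: {breakeven:.2f}")
```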
If you are interested in sports analytics and decision-making, check out the book by Winston, Nestler, and Pelechrinis (2022).
1.2 Red and blue balls
Question: Assume that there are 100 balls, 70 red and 30 blue.
If you pick 5 with replacement, what’s the probability that all 5 are red?
If you pick 5 without replacement, what’s the probability that all 5 are red?
If you pick 10 without replacement, what’s the probability that 5 are red and 5 are blue?
Answer: This question tests counting and combinations when sampling with and without replacement.
- Since we sample with replacement, the draws are independent and each ball is red with probability 7/10, so the number of red balls follows a binomial distribution \(Bin(5, 7/10)\):
\[ \Pr(5 \text{ red with repetition}) = \binom{5}{5} (7/10)^5 \approx 0.17 \]
- To get all 5 red, we have 70 options for the first ball, then 69, then 68, and so on. To get any 5 balls, we have 100 options, then 99, then 98, and so on. So we can compute this probability as:
\[ \Pr(5 \text{ red without repetition}) = \frac{70 * 69 * 68 * 67 * 66}{100 * 99 * 98 * 97 * 96} \]
Alternatively, we can use the binomial coefficient as follows:
\[ \Pr(5 \text{ red without repetition}) = \frac{\binom{70}{5}}{\binom{100}{5}} = \frac{\frac{70!}{5!65!}}{\frac{100!}{5! 95!}} = \frac{70 * 69 * 68 * 67 * 66}{100 * 99 * 98 * 97 * 96} \approx 0.16 \]
You can read the second equation as the number of ways I can choose 5 out of 70 reds over the number of ways that I can choose 5 out of 100 balls.
- Similar to the previous part, we can compute this as follows:
\[ \Pr(5 \text{ red and } 5 \text{ blue without repetition}) = \frac{\binom{70}{5} \binom{30}{5}}{\binom{100}{10}} \approx 0.1 \]
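All three probabilities can be verified directly with `math.comb`:

```python
from math import comb

# (a) With replacement: each draw is independently red with probability 0.7
p_a = 0.7 ** 5

# (b) Without replacement: ways to choose 5 of the 70 reds,
#     over ways to choose any 5 of the 100 balls
p_b = comb(70, 5) / comb(100, 5)

# (c) Without replacement: 5 of the 70 reds and 5 of the 30 blues,
#     over any 10 of the 100 balls
p_c = comb(70, 5) * comb(30, 5) / comb(100, 10)

print(f"(a) {p_a:.3f}  (b) {p_b:.3f}  (c) {p_c:.3f}")
```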
1.3 Games between two players
Question: Consider a game where two players play a sequential game (e.g., rolling a dice), and at each round, the player who plays has probability \(p\) of winning. What is the probability that the player who plays first wins?
Answer: This class of problems appears repeatedly in interview settings. It becomes straightforward once you have solved a few examples, but it can be tricky if you have never seen it before.
The main thing to identify in this case is that, although the game can in principle go on forever, the probability that one of the two players eventually wins goes to 1. At each round, the playing participant has probability \(p\) of winning. Keeping that in mind, we can set up a recursive relationship for the probability that the first player wins: either the first player wins on their first turn (probability \(p\)), or both players miss their first turns (probability \((1-p)^2\)) and the game resets to its initial state:
\[
\Pr\text{(win)} = p + (1-p)^2 \Pr\text{(win)} \Leftrightarrow \Pr\text{(win)} = \frac{p}{1-(1-p)^2} = \frac{1}{2-p}
\]
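A short Monte Carlo sketch can check the closed form \(1/(2-p)\); the choice \(p = 1/6\) (e.g., the first player to roll a six wins) is an illustrative assumption, not part of the question:

```python
import random

def first_player_wins(p: float, rng: random.Random) -> bool:
    """Alternate turns until someone wins; True if the first player wins."""
    turn = 0  # 0 -> first player, 1 -> second player
    while True:
        if rng.random() < p:
            return turn == 0
        turn ^= 1  # the other player takes the next turn

rng = random.Random(42)
p = 1 / 6
trials = 200_000
wins = sum(first_player_wins(p, rng) for _ in range(trials))
print(wins / trials, 1 / (2 - p))  # simulated estimate vs closed form
```

For \(p = 1/6\) the closed form gives \(1/(2 - 1/6) = 6/11 \approx 0.545\), and the simulation lands close to that value.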
1.4 Sum of normally distributed random variables
Question: Assume 4 standard normal random variables \(X_1, ..., X_4\). What is the probability of \(2X_1 + 4X_2 > X_3 + X_4\)?
Answer: To solve these types of problems formally, you need to know how to add independent normal random variables (https://online.stat.psu.edu/stat414/lesson/26/26.1):
If \(X_1, X_2, ..., X_n\) are mutually independent normal random variables with means \(\mu_1, \mu_2, ..., \mu_n\) and variances \(\sigma_1^2, \sigma_2^2, ..., \sigma_n^2\), then the linear combination \(Y = \sum_{i=1}^n c_i X_i\) follows the normal distribution with:
\[\begin{equation} \mu = \sum_i^n c_i \mu_i, \;\;\; \sigma^2 = \sum_i^n c_i^2 \sigma_i^2 \end{equation}\]
In this example, we are actually looking for the probability \(\Pr(Y>0)\), where \(Y = 2X_1 + 4X_2 - X_3 - X_4\). Following the above formula, \(Y\) follows a normal distribution with:
\[ \begin{align} \mu &= 2 * 0 + 4 * 0 - 0 - 0 = 0 \\ \sigma^2 &= 4 + 16 + 1 + 1 = 22 \end{align} \]
Hence \(Y \sim N(0,22)\). Since the normal distribution is symmetric, and since \(Y\)’s mean is zero:
\[ \Pr(Y > 0) = \Pr(Y < 0) = 0.5 \]
We did not need to compute the variance to solve this question, but we did so for completeness. The interviewer will primarily be looking for a general understanding that summing normal random variables yields a new normal random variable and that the normal distribution is symmetric around its mean.
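A quick simulation confirms both the symmetry argument and the variance of 22 (the sample size of 100,000 is an arbitrary choice for illustration):

```python
import random
import statistics

rng = random.Random(0)
n = 100_000

# Y = 2*X1 + 4*X2 - X3 - X4 with the X_i iid standard normal
ys = [2 * rng.gauss(0, 1) + 4 * rng.gauss(0, 1)
      - rng.gauss(0, 1) - rng.gauss(0, 1)
      for _ in range(n)]

p_positive = sum(y > 0 for y in ys) / n
print(f"P(Y > 0) ~ {p_positive:.3f}")                       # close to 0.5
print(f"sample variance ~ {statistics.variance(ys):.1f}")   # close to 22
```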
1.5 Additional Questions
The following probability questions are included in the complete book that defines the bar for product data science:
| Question | Topics |
|---|---|
| Paths to destination | Counting, Combinations |
| Documents edited by AI | Bayes rule, Conditional probability |
| Two children I | Prior evidence |
| Consecutive tails | Permutations, Repetition |
| Number of emails | Poisson distribution |
| Two fair die rolls | Independence, CDF, PMF |
| Repeated rolls until 4 | Geometric distribution |
| Unfair coin probability | Bayes rule, Conditional probability |
| Monty Hall | Bayes rule, Conditional independence, Prior evidence |
| Sample digits 1-10 | Sample from samples, Uniform |
| Median probability | Binomial, Uniform, CDF |
| Largest number rolled | Counting, Permutations, Repetition |