Chapter 2 Joint Probability Distributions

Lewis Hamilton is desperate to start tomorrow's Formula 1 Grand Prix on the front row. To do so, he must set the first or second fastest lap time in qualifying today. He has just recorded the current best time: 1 minute and 11.116 seconds. Only Max Verstappen and Sergio Perez are left to record a time. Your job is to determine the likelihood that Hamilton starts the race on the front row.

If Verstappen is faster than 1 minute and 11.116 seconds and Perez slower, then Lewis Hamilton is second and happily on the front row. Similarly, if Perez is faster than 1 minute and 11.116 seconds but Verstappen slower, again Hamilton is on the front row. However, if both Verstappen and Perez are faster than 1 minute and 11.116 seconds, then Hamilton will be disappointed. It is not enough to consider either driver in isolation; it is necessary to consider the joint behavior of both events.

How do we handle this type of problem mathematically?

2.1 Joint probability density functions

Throughout this chapter the outcomes of two random variables are considered simultaneously.

A fair coin is flipped three times. Let \(X\) be the number of heads observed. Let \(Y\) be the number of pairs of consecutive flips in which the same outcome is observed. What are the probabilities of each of the pairs of possible outcomes for \(X\) and \(Y\)?



The possible outcomes of the three coin flips are

\[HHH, \quad HHT,\quad HTH,\quad HTT,\quad THH,\quad THT,\quad TTH,\quad TTT\]

where \(H\) denotes a head and \(T\) a tail. Each of the eight outcomes occurs with probability \(\frac{1}{8}\). The values that \(X\) and \(Y\) take for each outcome are outlined in the following table:

| Outcome | \(HHH\) | \(HHT\) | \(HTH\) | \(HTT\) | \(THH\) | \(THT\) | \(TTH\) | \(TTT\) |
|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| \(X\)   | 3       | 2       | 2       | 1       | 2       | 1       | 1       | 0       |
| \(Y\)   | 2       | 1       | 0       | 1       | 1       | 0       | 1       | 2       |

From this complete set of outcomes, one can calculate the probabilities \(P(X=x, Y=y)\) for \(x=0,1,2,3\) and \(y=0,1,2\). These probabilities are outlined below:

|          | \(y=0\)          | \(y=1\)          | \(y=2\)          |
|----------|------------------|------------------|------------------|
| \(x=0\)  | \(0\)            | \(0\)            | \(\frac{1}{8}\)  |
| \(x=1\)  | \(\frac{1}{8}\)  | \(\frac{1}{4}\)  | \(0\)            |
| \(x=2\)  | \(\frac{1}{8}\)  | \(\frac{1}{4}\)  | \(0\)            |
| \(x=3\)  | \(0\)            | \(0\)            | \(\frac{1}{8}\)  |
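For readers who wish to verify these values computationally, the following minimal R sketch (an addition to the example; the variable names are illustrative) enumerates the eight equally likely outcomes and tabulates the joint PMF:

#Enumerate the eight equally likely outcomes of three fair coin flips
flips <- expand.grid(f1 = c("H", "T"), f2 = c("H", "T"), f3 = c("H", "T"),
                     stringsAsFactors = FALSE)

#X counts heads; Y counts consecutive pairs of flips showing the same outcome
x <- rowSums(flips == "H")
y <- (flips$f1 == flips$f2) + (flips$f2 == flips$f3)

#Each outcome has probability 1/8, so tabulate the pairs and divide by 8
joint_pmf <- table(X = x, Y = y) / 8
joint_pmf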

The values derived in Example 2.1.1 are the values taken by a joint probability mass function for two discrete random variables:

Let \(X\) and \(Y\) be two discrete random variables. The joint probability mass function (joint PMF) of \(X\) and \(Y\) is defined as

\(p_{X,Y}(x,y) = P(X=x,Y=y).\)



Note that the probability values in Example 2.1.1 sum to \(1\). This is because the collection of outcomes covers all possibilities, a property shared by all joint probability mass functions.

In this same vein, for some fixed outcome \(x\) of \(X\) one can calculate \(P(X=x)\) from the joint PMF by summing the probabilities \(P(X=x,Y=y)\) over all possible outcomes \(y\) of \(Y\): this collection of outcomes covers all possible scenarios in which \(X=x\). An analogous argument can be used to calculate \(p_Y(y)\). This argument is stated mathematically by the following lemma:

Joint PMFs can be used to calculate the PMFs of the individual random variables. Specifically
\[\begin{align*} p_{X}(x) &= P(X=x)= \sum\limits_{y} P(X=x,Y=y), \\[3pt] p_Y(y) &= P(Y=y)= \sum\limits_{x} P(X=x,Y=y). \end{align*}\]

In this context, \(p_X(x)\) and \(p_{Y}(y)\) are called marginal PMFs.

Consider the three coin flips in Example 2.1.1. What is the probability that there is exactly one pair of consecutive flips for which the same outcome is observed?



Using the notation of Example 2.1.1, calculate via Lemma 2.1.3 that

\[\begin{align*} p_Y(1) &= P(Y=1) \\[3pt] &= P(X=0, Y=1) + P(X=1, Y=1) + P(X=2, Y=1) + P(X=3, Y=1) \\[3pt] &= 0 + \frac{1}{4} + \frac{1}{4} + 0 \\[3pt] &=\frac{1}{2}. \end{align*}\]

The following app allows the user to calculate the probabilities of both marginal PMFs of the coin flip problem outlined in Example 2.1.1.
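For readers who prefer to compute the marginal PMFs directly in R rather than via the app, the following minimal sketch sums the joint PMF table `joint_pmf` built in the earlier sketch over the other variable:

#Marginal PMFs: sum the joint PMF over the other variable
p_X <- rowSums(joint_pmf)   #p_X(x) = sum over y of P(X = x, Y = y)
p_Y <- colSums(joint_pmf)   #p_Y(y) = sum over x of P(X = x, Y = y)
p_X
p_Y                         #in particular p_Y(1) = 1/2, agreeing with the calculation above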


An alternative to discrete random variables is continuous random variables: random variables that take values in some given range rather than from some given set. We are already familiar with probability density functions, the analogue for continuous random variables of the probability mass function for discrete random variables. The same analogy carries over to the joint setting.

For continuous random variables, the probability \(P(X=x,Y=y)\) is zero for any individual point \((x,y)\) in \(\mathbb{R}^2\). Therefore the joint probability density function does not directly give probabilities of individual points in \(\mathbb{R}^{2}\); instead it assigns probabilities to regions of \(\mathbb{R}^2\) via integration.

Let \(X\) and \(Y\) be continuous random variables. The joint probability density function (joint PDF) of \(X\) and \(Y\) is a function \(f_{X,Y}\) with the property that for all \(C \subset \mathbb{R}^{2}\):

\(P \big( (X,Y) \in C \big) = \iint_{C} f_{X,Y}(x,y) \,dx \,dy.\)



Abbie and Bertie are playing a game in which they aim to score as many points as they can. Abbie scores \(X\), and Bertie scores \(Y\). Although we are not fully aware of the rules of the game, we do know the joint PDF of \(X\) and \(Y\) is \[f_{X,Y}(x,y) = \begin{cases} 24x(1-x-y),& \text{if } x,y \geq 0 \text{ and } x+y \leq 1, \\ 0,& \text{otherwise.} \end{cases}\] Calculate the probability that Abbie beats Bertie.



The problem is interpreted mathematically as calculating the probability \(P(X>Y)\). Using Definition 2.1.5, this amounts to calculating the integral \(\iint_{C} f_{X,Y}(x,y) \,dx\,dy\) where \(C\) is the region in which \(x>y\). Since any integral of the constant function \(0\) evaluates to \(0\), we only need to consider the part of \(C\) that is contained within the region where \(f_{X,Y}(x,y)\) is non-zero, that is, \(\left\{ (x,y): x,y \geq 0 \text{ and } x+y \leq 1 \right\}\). Therefore the region \(C'\) over which we will integrate is bounded by the lines \(x=y\), \(x+y=1\) and \(y=0\).


Therefore

\[\begin{align*} P(X>Y) &= \int \int_{C} f_{X,Y}(x,y) \,dx\,dy \\[3pt] &= \int \int_{C'} 24x(1-x-y) \,dx\,dy \\[3pt] &= \int_{0}^{\frac{1}{2}} \int_{y}^{1-y} 24x-24x^2-24xy \,dx\,dy \\[3pt] &= \int_{0}^{\frac{1}{2}} \left[ 12x^2 - 8x^3 -12 x^2 y \right]_{y}^{1-y} \,dy \\[3pt] &= \int_{0}^{\frac{1}{2}} \left( 12 (1-y)^2 -8(1-y)^3 -12(1-y)^2 y \right) - \left( 12y^2 - 8y^3 -12y^3 \right) \,dy \\[3pt] &= \int_{0}^{\frac{1}{2}} 16y^3 -12y +4 \,dy \\[3pt] &= \left[ 4y^4 -6y^2 +4y \right]_{0}^{\frac{1}{2}} \\[3pt] &= \left( \frac{4}{16} - \frac{6}{4} +2 \right) - \left( 0 \right) \\[3pt] &= \frac{3}{4}. \end{align*}\]
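Before moving on, the value \(\frac{3}{4}\) can be double-checked numerically. The following sketch (an addition to the worked solution) nests two calls to R's integrate function over the region \(C'\):

#Numerically integrate 24x(1 - x - y) for x in [y, 1 - y], then y in [0, 1/2]
inner <- function(y) sapply(y, function(yy)
  integrate(function(x) 24 * x * (1 - x - yy), lower = yy, upper = 1 - yy)$value)
integrate(inner, lower = 0, upper = 1/2)$value   #approximately 0.75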

Given a joint PDF, we can write R code that returns samples from the joint distribution. The following R code calculates \(10000\) samples from the joint PDF \[f_{X,Y}(x,y) = \begin{cases} 6x^2y, & \text{if } 0 \leq x,y \leq 1, \\ 0, & \text{otherwise.} \end{cases}\] Note that the region on which this joint PDF is non-zero is rectangular.

The code works by generating \(50000\) candidate points in the region on which \(f_{X,Y}(x,y)\) is non-zero, then choosing \(10000\) of them with probabilities governed by \(f_{X,Y}(x,y)\). The samples are stored in the variable `random_sample`.

#Create 50000 candidate points in the rectangle [0,1] x [0,1]
xs <- runif(50000, 0, 1)
ys <- runif(50000, 0, 1)
dat <- matrix(c(xs, ys), ncol = 2)
dat <- data.frame(dat)
names(dat) <- c("x", "y")

#Define the joint PDF
joint_pdf <- function(x, y) {
  6*x^2*y
}

#Evaluate the PDF at each of our 50000 points
probs <- joint_pdf(xs, ys)

#Choose 10000 samples from our 50000 points with probabilities governed by the PDF
indices <- sample(1:50000, 10000, replace=TRUE, prob=probs)
random_sample <- dat[indices, ]
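As a quick sanity check (an addition to the original listing), the sample means of the two columns should be close to the theoretical means \(E[X]=\frac{3}{4}\) and \(E[Y]=\frac{2}{3}\), which follow from the marginal PDFs \(f_X(x)=3x^2\) and \(f_Y(y)=2y\) of this joint PDF:

#Sample means should be close to E[X] = 3/4 and E[Y] = 2/3
colMeans(random_sample)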

It is also possible to write R code that takes samples from a joint PDF where the region on which the function is non-zero is not rectangular. Note the joint PDF governing the game played by Abbie and Bertie in Example 2.1.6 is non-zero on a triangular region. The following R code calculates \(10000\) samples from this joint PDF. The code is similar to that above, the sole difference being the definition of the function `joint_pdf`.

#Create 50000 candidate points in the rectangle [0,1] x [0,1]
xs <- runif(50000, 0, 1)
ys <- runif(50000, 0, 1)
dat <- matrix(c(xs, ys), ncol = 2)
dat <- data.frame(dat)
names(dat) <- c("x", "y")

#Define the joint PDF (the candidate points already satisfy x, y >= 0)
joint_pdf <- function(x, y) {
  ifelse(x+y<1, 24*x*(1-x-y), 0) 
}

#Evaluate the PDF at each of our 50000 points
probs <- joint_pdf(xs, ys)

#Choose 10000 samples from our 50000 points with probabilities governed by the PDF
indices <- sample(1:50000, 10000, replace=TRUE, prob=probs)
random_sample <- dat[indices, ]

Crucially, by taking a sufficiently large sample, one can estimate probabilities related to the joint PDF. The following R code verifies the solution to Example 2.1.6 by estimating \(P(X>Y)\) using the variable `random_sample` defined in the prior block.

mean(random_sample$x > random_sample$y)

While R is useful for plotting and working with random samples, it is not good for working with symbolic mathematics.

Similarly to the discrete case, the PDF of \(X\) can be recovered from the joint PDF by integrating \(f_{X,Y}(x,y)\) over all possible values \(y\) of \(Y\). This is stated mathematically by the following lemma.

Joint PDFs can be used to calculate the PDFs of the individual random variables. Specifically
\[\begin{align*} f_{X}(x) &= \int_{-\infty}^{\infty} f_{X,Y}(x,y) \,dy, \\[3pt] f_Y(y) &= \int_{-\infty}^{\infty} f_{X,Y}(x,y) \,dx. \end{align*}\]

In this context, \(f_X(x)\) and \(f_{Y}(y)\) are called marginal PDFs.


Consider the game from Example 2.1.6. Calculate the marginal PDF of Abbie’s score.



The question asks us to calculate \(f_{X}(x)\). Since \(f_{X,Y}(x,y)\) is \(0\) whenever \(x<0\) or \(x>1\), it is enough to consider \(0 \leq x \leq 1\). By Lemma 2.1.7, calculate

\[\begin{align*} f_{X}(x) &= \int_{-\infty}^{\infty} f_{X,Y}(x,y) \,dy \\[3pt] &= \int_{0}^{1-x} 24x(1-x-y) \,dy \\[3pt] &= \left[ -12x(1-x-y)^2 \right]_{0}^{1-x} \\[3pt] &= 12x(1-x)^2. \end{align*}\]

Therefore

\(f_{X}(x) = \begin{cases} 12x(1-x)^2, & \text{if } 0 \leq x \leq 1, \\[3pt] 0, & \text{otherwise.} \end{cases}\)
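As an informal check, the derived marginal PDF can be compared with the simulated sample of Abbie and Bertie's scores generated earlier in the section. A minimal sketch, assuming the variable `random_sample` from that code block is still available:

#Overlay the derived marginal PDF on a density histogram of the sampled x-values
hist(random_sample$x, freq = FALSE, breaks = 40,
     main = "Marginal PDF of Abbie's score", xlab = "x")
curve(12 * x * (1 - x)^2, from = 0, to = 1, add = TRUE)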

2.2 Joint cumulative distribution functions

There are two random variables we are concerned with in the F1 racing problem: the time \(X\) that Max Verstappen sets and the time \(Y\) that Sergio Perez sets. It follows that

\[P \big(\text{Hamilton not on the front row} \big) = P \big( X \leq 1:11.116, Y \leq 1:11.116 \big)\]

Probabilities where two random variables are both less than some fixed values are a well-studied phenomenon in probability.

The joint cumulative distribution function of two random variables \(X\) and \(Y\) is defined by

\[F_{X,Y}(x,y) = P(X \leq x, Y \leq y).\]
Two friends are meeting in Nottingham city center. Both travel via buses on different routes that come once an hour. Let \(X\) and \(Y\) be the time, as a proportion of an hour, until the respective buses come. The joint cumulative distribution function is \[F_{X,Y}(x,y) = xy, \qquad \text{where } 0 \leq x,y \leq 1.\] What is the probability that both friends are on a bus to Nottingham within half an hour?



Since we want both buses to arrive within half an hour, we want the probability that \(X \leq \frac{1}{2}\) and \(Y \leq \frac{1}{2}\). Calculate

\[\begin{align*} P \left( X \leq \frac{1}{2}, Y \leq \frac{1}{2} \right) &= F_{X,Y} \left( \frac{1}{2}, \frac{1}{2} \right) \\[3pt] &= \frac{1}{2} \cdot \frac{1}{2} \\[3pt] &= \frac{1}{4}. \end{align*}\]
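As an aside, the joint CDF \(F_{X,Y}(x,y)=xy\) on \(0 \leq x,y \leq 1\) is exactly the joint CDF of two independent Uniform\((0,1)\) waiting times (the independence of this example is confirmed in Section 2.3), so the answer can be sanity-checked by simulation. A minimal sketch:

#Simulate many pairs of independent Uniform(0,1) waiting times
x <- runif(100000)
y <- runif(100000)
mean(x <= 0.5 & y <= 0.5)   #approximately 0.25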

In general we would like to define \(F_{X,Y}(x,y)\) on all of \(\mathbb{R}^{2}\), that is, for any real values \(x\) and \(y\). Consider the cumulative distribution function from Example 2.2.2: although it is stated only for \(0 \leq x,y \leq 1\), it extends naturally to all of \(\mathbb{R}^2\) by setting \(F_{X,Y}(x,y)=0\) whenever \(x<0\) or \(y<0\), and by replacing \(x\) (respectively \(y\)) by \(1\) whenever \(x>1\) (respectively \(y>1\)).

Given a joint CDF \(F_{X,Y}\), the CDFs of \(X\) and \(Y\), denoted \(F_X\) and \(F_Y\) and in this context called marginal CDFs, can be calculated. Specifically

\[\begin{align*} F_X(x) &= F_{X,Y} (x,\infty), \\[3pt] F_Y(y) &= F_{X,Y} (\infty, y). \end{align*}\]

Intuitively this lemma makes sense: \(P(X \leq x, Y< \infty) = P(X \leq x)\), since the event that \(Y\) is less than \(\infty\) always occurs.

Consider the buses of Example 2.2.2. Calculate the marginal CDF of both buses.



Since the problem is entirely symmetric under interchanging \(x\) and \(y\), it is enough to calculate only \(F_X(x)\). For \(0\leq x \leq 1\), by Lemma 2.2.3 we have that

\[\begin{align*} F_{X}(x) &= F_{X,Y}(x,\infty) \\[3pt] &= F_{X,Y}(x,1) \\[3pt] &= x. \end{align*}\]
It follows by the aforementioned symmetry that \(F_Y(y) = y\) for \(0 \leq y \leq 1\).

In Definition 2.2.1, we saw that \(F_{X,Y}(x,y) = P \left( X \leq x, Y \leq y \right)\). However by Definition 2.1.5, we also have that \(P \left( X \leq x, Y \leq y \right) = \int_{-\infty}^{y} \int_{-\infty}^{x} f_{X,Y}(u,v) \,du \,dv\). Therefore

\[F_{X,Y}(x,y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f_{X,Y}(u,v) \,du \,dv.\]

Differentiating both sides of this equation with respect to both \(x\) and \(y\), that is, taking a mixed second-order partial derivative, we obtain:

\[f_{X,Y}(x,y) = \frac{\partial^2 F_{X,Y}(x,y)}{\partial x \partial y}.\]

These equations give us a method to calculate the joint PDF from the joint CDF, and vice-versa.

Consider the buses of Example 2.2.2, the arrival times of which are governed by the joint CDF \[F_{X,Y}(x,y) = xy, \qquad \text{where } 0 \leq x,y \leq 1.\] Calculate the joint PDF of the bus arrival times.

By the above theory, for \(0\leq x,y \leq 1\) calculate that \[\begin{align*} f_{X,Y}(x,y) &= \frac{\partial^2 F_{X,Y}(x,y)}{\partial x \partial y} \\ &= \frac{\partial^2}{\partial x \partial y} \left( xy \right) \\ &= \frac{\partial}{\partial x} \left( x \right) \\ &= 1. \end{align*}\] So \(f_{X,Y}(x,y)=1\) when \(0\leq x,y \leq 1\), and \(f_{X,Y}(x,y)=0\) otherwise.
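The same relationship can be checked numerically by approximating the mixed partial derivative with finite differences. A minimal sketch (the evaluation point and step size h are arbitrary choices):

#Finite-difference approximation of the mixed partial derivative of F(x, y) = xy
Fxy <- function(x, y) x * y
h <- 1e-4
x0 <- 0.3; y0 <- 0.6
(Fxy(x0 + h, y0 + h) - Fxy(x0 + h, y0) - Fxy(x0, y0 + h) + Fxy(x0, y0)) / h^2   #approximately 1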

2.3 Independence

When considering two events, it can be tempting to calculate the probability of the desired outcome for both events and then multiply these probabilities together. But what if the two events are not independent? Then this argument breaks down.

In the F1 racing example, Max Verstappen and Sergio Perez both drive in cars designed by the Red Bull racing team. If one of the drivers is slow because of a deficiency in the car, then the other driver may also record a slow time because of the same deficiency. Alternatively if one driver posts a fast time because of an outstanding car, then it is perhaps more likely that the other driver will be similarly fast. The two lap times are not independent.

Recall that two events \(A\) and \(B\) are independent if \(P(A \cap B) = P(A) P(B)\). The notion of independence of events can be extended to a notion of independence of random variables.

The random variables \(X\) and \(Y\), with CDFs \(F_X\) and \(F_Y\) respectively, are said to be independent if \[F_{X,Y}(x,y) = F_{X}(x) F_{Y}(y),\] for all values of \(x\) and \(y\).

The condition \(F_{X,Y}(x,y) = F_{X}(x) F_{Y}(y)\) in Definition 2.3.1 is equivalent to the condition \[P(X \leq x, Y \leq y) = P(X \leq x)\cdot P(Y\leq y).\]

The following theorem allows us to check independence of two random variables from PDFs instead of CDFs.

Suppose the CDF \(F_{X,Y}(x,y)\) is differentiable. Then jointly continuous random variables \(X\) and \(Y\) are independent if and only if \[f_{X,Y}(x,y) = f_{X}(x) f_{Y}(y),\] for all values of \(x\) and \(y\).

Specifically, given two random variables \(X\) and \(Y\) with joint PDF \(f_{X,Y}(x,y)\), one can check independence by calculating both marginal PDFs \(f_{X}(x)\) and \(f_{Y}(y)\), then applying Theorem 2.3.2.


Consider again the game played by Abbie and Bertie in Example 2.1.6, the scores of which are governed by the joint PDF \[f_{X,Y}(x,y) = \begin{cases} 24x(1-x-y),& \text{if } x,y \geq 0 \text{ and } x+y \leq 1, \\ 0,& \text{otherwise.} \end{cases}\] Are the scores of Abbie and Bertie independent?



Recall from Example 2.1.8 that Abbie's score is distributed according to the marginal PDF

\[f_{X}(x) = \begin{cases} 12x(1-x)^2, & \text{if } 0 \leq x \leq 1, \\ 0, & \text{otherwise.} \end{cases}\]

Calculating the marginal PDF of Bertie's score for \(0\leq y \leq 1\):

\[\begin{align*} f_{Y}(y) &= \int_{-\infty}^{\infty} f_{X,Y}(x,y) \,dx \\[3pt] &= \int_{0}^{1-y} 24x(1-x-y) \,dx \\[3pt] &= \int_{0}^{1-y} 24x-24x^2-24xy \,dx \\[3pt] &= \left[ 12x^2 -8x^3 -12x^2 y \right]_{0}^{1-y} \\[3pt] &= 12(1-y)^2 - 8(1-y)^3 -12 (1-y)^2 y \\[3pt] &= 4(1-y)^3. \end{align*}\]

Therefore

\[f_{Y}(y) = \begin{cases} 4(1-y)^3, & \text{if } 0 \leq y \leq 1, \\ 0, & \text{otherwise.} \end{cases}\]

Immediately one can see that \(f_{X,Y}(x,y) \neq f_{X}(x) f_{Y}(y)\): for example \(f_{X,Y} \left(\frac{3}{4}, \frac{3}{4} \right) = 0\) since \(\frac{3}{4} + \frac{3}{4} > 1\), but \(f_{X} \left( \frac{3}{4} \right) f_{Y} \left( \frac{3}{4} \right) = \frac{9}{16} \cdot \frac{1}{16} = \frac{9}{256} \neq 0\).

Therefore Abbie's and Bertie's scores are dependent.
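This counterexample can also be verified numerically. A minimal sketch defining the three PDFs from this example as R functions (valid on the stated ranges) and evaluating both sides of the criterion at \(\left(\frac{3}{4},\frac{3}{4}\right)\):

#Joint PDF and the two marginal PDFs from the example
f_xy <- function(x, y) ifelse(x >= 0 & y >= 0 & x + y <= 1, 24 * x * (1 - x - y), 0)
f_x <- function(x) 12 * x * (1 - x)^2   #marginal PDF of Abbie's score on [0, 1]
f_y <- function(y) 4 * (1 - y)^3        #marginal PDF of Bertie's score on [0, 1]

f_xy(3/4, 3/4)        #0
f_x(3/4) * f_y(3/4)   #9/256, which is not 0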
Consider again the buses of Example 2.2.2, the arrival times of which are governed by the joint CDF \[F_{X,Y}(x,y) = xy, \qquad \text{where } 0 \leq x,y \leq 1.\] Are the arrival times of the buses independent?



In Example 2.2.4, we calculated the marginal distributions \(F_X (x) = x\) for \(0 \leq x \leq 1\) and \(F_Y (y) = y\) for \(0 \leq y \leq 1\). It follows that

\[F_{X,Y}(x,y) = xy = F_{X}(x) F_{Y}(y).\]
Therefore the bus arrival times are independent.

2.4 Three or more random variables

All the ideas and notions seen in this chapter can be extended to the setting of more than two random variables.

Specifically let \(n\) be a positive integer with \(n \geq 2\). Consider \(n\) random variables denoted \(X_1,X_2, \ldots, X_n\). Then

  • assuming \(X_1, X_2, \ldots, X_n\) are discrete, the joint PMF of \(X_1, X_2, \ldots, X_n\) is a function \(p_{X_1, X_2, \ldots, X_n}\) such that \[p_{X_1, X_2, \ldots, X_n}(x_1,\ldots,x_n) = P(X_1 =x_1, \ldots , X_n=x_n);\]

  • the joint CDF of \(X_1, X_2, \ldots, X_n\) is a function \(F_{X_1, X_2, \ldots, X_n}\) such that
    \[F_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n) = P \left( X_1 \leq x_1, X_2 \leq x_2, \ldots , X_n \leq x_n \right);\]

  • assuming \(X_1, X_2, \ldots, X_n\) are continuous, the joint PDF of \(X_1, X_2, \ldots, X_n\) is the function \(f_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n)\) such that for any region \(C \subseteq \mathbb{R}^{n}\), the probability \[P \big( \left(X_1, X_2, \ldots, X_n \right) \in C \big) = \int \cdots \int_{C} f_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n) \,dx_1 \cdots \,dx_n;\]

  • assuming \(X_1, X_2, \ldots, X_n\) are continuous, the joint PDF \(f_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n)\) and joint CDF \(F_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n)\) are related by the identity \[f_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n) = \frac{\partial^n F_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n)}{\partial x_1 \cdots \partial x_n};\]

  • the marginal CDF of \(X_i\) for \(1 \leq i \leq n\) can be obtained by substitution into the joint CDF: \[F_{X_i}(x_i) = F_{X_1, X_2, \ldots, X_n}(\infty, \infty, \ldots, \infty, x_i, \infty, \ldots, \infty);\]

  • assuming \(X_1, X_2, \ldots, X_n\) are continuous, the marginal PDF of \(X_i\) for \(1\leq i \leq n\) can be obtained by integration of the joint PDF over the other \(n-1\) variables: \[f_{X_i}(x_i) = \int \cdots \int_{\mathbb{R}^{n-1}} f_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n) \,dx_1 \,dx_2 \cdots \,dx_{i-1} \,dx_{i+1} \cdots \,dx_n;\]

  • the random variables \(X_1, X_2, \ldots , X_n\) are said to be independent if \[F_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n) = F_{X_1}(x_1) F_{X_2}(x_2) \cdots F_{X_n}(x_n).\]

Compare all of the statements here with the analogous definitions earlier in the chapter when two random variables were considered. In particular set \(n=2\) in each bullet point to recover an earlier definition or lemma.