Chapter 2 Joint Probability Distributions
Lewis Hamilton is desperate to start tomorrow's Formula 1 grand prix on the front row. To do so, he must set the first or second fastest lap time in qualifying today. He has just recorded the current best time: 1 minute and 11.116 seconds. Only Max Verstappen and Sergio Perez are left to record a time. Your job is to determine the likelihood that Hamilton starts the race on the front row.
If Verstappen is faster than 1 minute and 11.116 seconds and Perez is slower, then Lewis Hamilton is second and happily on the front row. Similarly, if Perez is faster than 1 minute and 11.116 seconds but Verstappen is slower, Hamilton is again on the front row. However, if both Verstappen and Perez are faster than 1 minute and 11.116 seconds, then Hamilton will be disappointed. It is not enough to consider either driver in isolation; it is necessary to consider the joint behavior of both events.
How do we handle this type of problem mathematically?
2.1 Joint probability density functions
Throughout this chapter, the outcomes of two random variables are considered simultaneously.
A fair coin is flipped three times. Let $X$ be the number of pairs of consecutive flips in which the same outcome is observed, and let $Y$ be the number of heads observed. What are the probabilities of each of the pairs of possible outcomes for $X$ and $Y$?
The possible outcomes of the three coin flips are
\[ HHH,\ HHT,\ HTH,\ HTT,\ THH,\ THT,\ TTH,\ TTT, \]
where $H$ denotes a head and $T$ a tail. Each of the outcomes occurs with probability $\frac{1}{8}$. The values that $X$ and $Y$ take for each of the outcomes are outlined in the following table:
\[
\begin{array}{c|cccccccc}
\text{Outcome} & HHH & HHT & HTH & HTT & THH & THT & TTH & TTT \\\hline
X & 2 & 1 & 0 & 1 & 1 & 0 & 1 & 2 \\
Y & 3 & 2 & 2 & 1 & 2 & 1 & 1 & 0
\end{array}
\]
From this complete set of outcomes, one can calculate the probabilities $P(X = x, Y = y)$ where $x = 0, 1, 2$ and $y = 0, 1, 2, 3$. These probabilities are outlined below:
\[
\begin{array}{c|cccc}
P(X = x, Y = y) & y = 0 & y = 1 & y = 2 & y = 3 \\\hline
x = 0 & 0 & \tfrac{1}{8} & \tfrac{1}{8} & 0 \\
x = 1 & 0 & \tfrac{2}{8} & \tfrac{2}{8} & 0 \\
x = 2 & \tfrac{1}{8} & 0 & 0 & \tfrac{1}{8}
\end{array}
\]
The values derived in Example 2.1.1 are the values taken by a joint probability mass function for two discrete random variables:
Let $X$ and $Y$ be two discrete random variables. The joint probability mass function (joint PMF) of $X$ and $Y$ is defined as
\[ p_{X,Y}(x, y) = P(X = x, Y = y). \]
Note that the probability values in Example 2.1.1 sum to $1$. This is because the collection of outcomes covers all possibilities. This is a property shared by all joint probability mass functions.
In this same vein, for some fixed outcome $x$ of $X$ one can calculate $P(X = x)$ from the joint PMF by summing the probabilities $P(X = x, Y = y)$ over all possible outcomes $y$ of $Y$: this collection of outcomes covers all possible scenarios in which $X = x$. An analogous argument holds for calculating $p_Y(y)$. This argument is stated mathematically by the following lemma:
\[ p_X(x) = \sum_{y} p_{X,Y}(x, y), \qquad p_Y(y) = \sum_{x} p_{X,Y}(x, y). \]
In this context, $p_X(x)$ and $p_Y(y)$ are called marginal PMFs.
Consider the three coin flips in Example 2.1.1. What is the probability that there is exactly one head observed?
Using the notation of Example 2.1.1, calculate via Lemma 2.1.3 that
\[ P(Y = 1) = p_Y(1) = p_{X,Y}(0, 1) + p_{X,Y}(1, 1) + p_{X,Y}(2, 1) = \tfrac{1}{8} + \tfrac{2}{8} + 0 = \tfrac{3}{8}. \]
The following app allows the user to calculate the probabilities of both marginal PMFs of the coin flip problem outlined in Example 2.1.1.
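These probabilities can also be checked by brute-force enumeration in R; the following is a small sketch (the variable names are illustrative, not part of the original notes).

# Enumerate the 8 equally likely outcomes of three fair coin flips
flips <- expand.grid(f1 = c("H", "T"), f2 = c("H", "T"), f3 = c("H", "T"))
# X counts consecutive pairs showing the same outcome, Y counts heads
X <- (flips$f1 == flips$f2) + (flips$f2 == flips$f3)
Y <- rowSums(flips == "H")
# Joint PMF: each outcome has probability 1/8
table(X, Y) / 8
# Marginal PMF of Y; in particular P(Y = 1) = 3/8
table(Y) / 8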
An alternative to discrete random variables is continuous random variables: random variables which take values in some given range, rather than from some given set. We are already familiar with probability density functions, the analogous concept for continuous random variables to the probability mass function for discrete random variables. This is no different in the case of jointly distributed random variables.
For any $(x, y)$ in $\mathbb{R}^2$, the probability $P(X = x, Y = y)$ is zero. Therefore the joint probability density function is not defined at individual points in $\mathbb{R}^2$, and is instead defined over regions in $\mathbb{R}^2$.
Let $X$ and $Y$ be continuous random variables. The joint probability density function (joint PDF) of $X$ and $Y$ is a function $f_{X,Y}$ with the property that for all $C \subset \mathbb{R}^2$:
\[ P\big((X, Y) \in C\big) = \iint_C f_{X,Y}(x, y)\,dx\,dy. \]
Abbie and Bertie are playing a game in which they aim to score as many points as they can. Abbie scores $X$, and Bertie scores $Y$. Although we are not fully aware of the rules of the game, we do know the joint PDF of $X$ and $Y$ is
\[ f_{X,Y}(x, y) = \begin{cases} 24x(1 - x - y), & \text{if } x, y \ge 0 \text{ and } x + y \le 1, \\ 0, & \text{otherwise.} \end{cases} \]
Calculate the probability that Abbie beats Bertie.
The problem is interpreted mathematically as calculating the probability $P(X > Y)$. Using Definition 2.1.5, this amounts to calculating the integral
\[ \iint_C f_{X,Y}(x, y)\,dx\,dy, \]
where $C$ is the region in which $x > y$. Since any integral of the constant function $0$ evaluates to $0$, we only need to consider the parts of $C$ contained within the region where $f_{X,Y}(x, y)$ is non-zero, that is, $\{(x, y) : x, y \ge 0 \text{ and } x + y \le 1\}$. Therefore the region $C'$ over which we will integrate is bounded by the lines $x = y$, $x + y = 1$, $x = 0$ and $y = 0$.

Therefore
\[ P(X > Y) = \int_0^{1/2} \int_y^{1-y} 24x(1 - x - y)\,dx\,dy = \int_0^{1/2} \left( 4(1 - y)^3 - 12y^2 + 20y^3 \right) dy = \frac{3}{4}. \]
Given a joint PDF, we can write R code that returns samples from the joint distribution. The following R code draws 10,000 samples from the joint PDF
\[ f_{X,Y}(x, y) = \begin{cases} 6x^2 y, & \text{if } 0 \le x, y \le 1, \\ 0, & \text{otherwise.} \end{cases} \]
Note that the region on which this joint PDF is non-zero is rectangular.
The code works by generating 50,000 points in the region on which $f_{X,Y}(x, y)$ is non-zero. These points are then chosen with probabilities governed by $f_{X,Y}(x, y)$. The samples are stored in the variable `random_sample`.
# Create 50,000 points in the rectangle [0,1] x [0,1]
xs <- runif(50000, 0, 1)
ys <- runif(50000, 0, 1)
dat <- matrix(c(xs, ys), ncol = 2)
dat <- data.frame(dat)
names(dat) <- c("x", "y")
# Define the joint PDF
joint_pdf <- function(x, y) {
  6 * x^2 * y
}
# Evaluate the PDF at each of our 50,000 points
probs <- joint_pdf(xs, ys)
# Choose 10,000 samples from our 50,000 points with probabilities governed by the PDF
indices <- sample(1:50000, 10000, replace = TRUE, prob = probs)
random_sample <- dat[indices, ]
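As a quick sanity check (a sketch, not part of the original code), the sample means of `random_sample` should be close to the exact values $E[X] = \frac{3}{4}$ and $E[Y] = \frac{2}{3}$ for this joint PDF:

# Sample means should be close to the exact expectations E[X] = 3/4 and E[Y] = 2/3
mean(random_sample$x)
mean(random_sample$y)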
It is also possible to write R code that takes samples from a joint PDF where the region on which the function is non-zero is not rectangular. Note that the joint PDF governing the game played by Abbie and Bertie in Example 2.1.6 is non-zero on a triangular region. The following R code draws 10,000 samples from this joint PDF. The code is similar to that above, with the sole difference being in the section defining the function `joint_pdf`.
# Create 50,000 points in the rectangle [0,1] x [0,1]
xs <- runif(50000, 0, 1)
ys <- runif(50000, 0, 1)
dat <- matrix(c(xs, ys), ncol = 2)
dat <- data.frame(dat)
names(dat) <- c("x", "y")
# Define the joint PDF (zero outside the triangle x + y <= 1)
joint_pdf <- function(x, y) {
  ifelse(x + y <= 1, 24 * x * (1 - x - y), 0)
}
# Evaluate the PDF at each of our 50,000 points
probs <- joint_pdf(xs, ys)
# Choose 10,000 samples from our 50,000 points with probabilities governed by the PDF
indices <- sample(1:50000, 10000, replace = TRUE, prob = probs)
random_sample <- dat[indices, ]
Crucially, by taking sufficiently large samples, one can estimate probabilities related to the joint PDF. The following R code verifies the solution to Example 2.1.6 by estimating $P(X > Y)$ using the variable `random_sample` defined in the prior block.
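The original code block is not reproduced here; a minimal sketch, assuming `random_sample` holds the 10,000 draws generated in the previous block, is:

# Proportion of sampled points with x > y; should be close to the exact value P(X > Y) = 3/4
mean(random_sample$x > random_sample$y)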
While R is useful for plotting and working with random samples, it is not good for working with symbolic mathematics.
Similarly to the discrete case, the PDF of $X$ can be recovered from the joint PDF by integrating $f_{X,Y}(x, y)$ over all possible values $y$ of $Y$. This is stated mathematically by the following lemma:
\[ f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy, \qquad f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx. \]
In this context, $f_X(x)$ and $f_Y(y)$ are called marginal PDFs.
Consider the game from Example 2.1.6. Calculate the marginal PDF of Abbie’s score.
The question asks us to calculate $f_X(x)$. Since $f_{X,Y}(x, y)$ is $0$ whenever $x < 0$ or $x > 1$, it is enough to consider $0 \le x \le 1$. By Lemma 2.1.7, calculate
\[ f_X(x) = \int_0^{1-x} 24x(1 - x - y)\,dy = 24x\left[(1 - x)y - \frac{y^2}{2}\right]_0^{1-x} = 12x(1 - x)^2. \]
Therefore
\[ f_X(x) = \begin{cases} 12x(1 - x)^2, & \text{if } 0 \le x \le 1, \\ 0, & \text{otherwise.} \end{cases} \]
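As a quick numerical check (a sketch, not part of the original notes), R's built-in `integrate` function confirms that this marginal PDF integrates to 1:

# The marginal PDF of Abbie's score should integrate to 1 over [0, 1]
integrate(function(x) 12 * x * (1 - x)^2, lower = 0, upper = 1)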
2.2 Joint cumulative distribution functions
There are two random variables that we are concerned with in the F1 racing problem: namely the time $X$ that Max Verstappen sets and the time $Y$ that Sergio Perez sets. Hamilton starts on the front row unless both drivers beat his time, so it follows that
\[ P(\text{Hamilton starts on the front row}) = 1 - P(X < 71.116, Y < 71.116), \]
where times are measured in seconds.
Probabilities where two random variables are both less than some fixed values are a well-studied phenomenon in probability.
The joint cumulative distribution function of two random variables $X$ and $Y$ is defined by
\[ F_{X,Y}(x, y) = P(X \le x, Y \le y). \]
Two friends are meeting in Nottingham city center. Both travel via buses on different routes that come once an hour. Let $X$ and $Y$ be the times, as a proportion of an hour, until the respective buses come. The joint cumulative distribution function is
\[ F_{X,Y}(x, y) = xy, \quad \text{where } 0 \le x, y \le 1. \]
What is the probability that both friends are on a bus to Nottingham within half an hour?
Since we want both buses to arrive within half an hour, we want the probability that $X \le \frac{1}{2}$ and $Y \le \frac{1}{2}$. Calculate
\[ P\left(X \le \tfrac{1}{2}, Y \le \tfrac{1}{2}\right) = F_{X,Y}\left(\tfrac{1}{2}, \tfrac{1}{2}\right) = \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{4}. \]
In general we would like to define $F_{X,Y}(x, y)$ on all of $\mathbb{R}^2$, that is, for $X$ and $Y$ taking any real number value. Consider the cumulative distribution function from Example 2.2.2.
Given a joint CDF $F_{X,Y}$, the CDFs of $X$ and $Y$, denoted $F_X$ and $F_Y$ and in this context called marginal CDFs, can be calculated. Specifically,
\[ F_X(x) = F_{X,Y}(x, \infty), \qquad F_Y(y) = F_{X,Y}(\infty, y), \]
where $F_{X,Y}(x, \infty)$ is interpreted as the limit of $F_{X,Y}(x, y)$ as $y \to \infty$, and similarly for $F_{X,Y}(\infty, y)$.
Intuitively this lemma makes sense: $P(X \le x, Y < \infty) = P(X \le x)$ since $Y$ being less than $\infty$ will always be true.
Consider the buses of Example 2.2.2. Calculate the marginal CDFs of the two bus arrival times.
Since the problem is entirely symmetrical when interchanging $x$ and $y$, it is enough to calculate only $F_X(x)$. For $0 \le x \le 1$, by Lemma 2.2.3 we have that
\[ F_X(x) = F_{X,Y}(x, \infty) = F_{X,Y}(x, 1) = x \cdot 1 = x, \]
since $Y \le 1$ always. By symmetry, $F_Y(y) = y$ for $0 \le y \le 1$.
In Definition 2.2.1, we saw that $F_{X,Y}(x, y) = P(X \le x, Y \le y)$. However, by Definition 2.1.5, we also have that $P(X \le x, Y \le y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f_{X,Y}(u, v)\,du\,dv$. Therefore
\[ F_{X,Y}(x, y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f_{X,Y}(u, v)\,du\,dv. \]
Taking the mixed second-order derivative of this equation, differentiating with respect to $x$ and $y$, we obtain:
\[ f_{X,Y}(x, y) = \frac{\partial^2 F_{X,Y}(x, y)}{\partial x \, \partial y}. \]
These equations give us a method to calculate the joint PDF from the joint CDF, and vice-versa.
Consider the buses of Example 2.2.2, the arrival times of which are governed by the joint CDF
\[ F_{X,Y}(x, y) = xy, \quad \text{where } 0 \le x, y \le 1. \]
Calculate the joint PDF of the bus arrival times.
By the above theory, for $0 \le x, y \le 1$ calculate that
\[ f_{X,Y}(x, y) = \frac{\partial^2 F_{X,Y}(x, y)}{\partial x \, \partial y} = \frac{\partial^2}{\partial x \, \partial y}(xy) = \frac{\partial}{\partial x}(x) = 1. \]
So $f_{X,Y}(x, y) = 1$ when $0 \le x, y \le 1$.
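A small numerical sketch (not part of the original notes) illustrates the relationship: a mixed finite difference of $F_{X,Y}(x, y) = xy$ recovers the joint PDF value $1$ at an interior point.

# Approximate the mixed partial derivative of F(x, y) = x * y at (0.5, 0.5)
F_xy <- function(x, y) x * y
h <- 1e-4
(F_xy(0.5 + h, 0.5 + h) - F_xy(0.5 + h, 0.5) - F_xy(0.5, 0.5 + h) + F_xy(0.5, 0.5)) / h^2
# The result is approximately 1, matching the joint PDF calculated above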
2.3 Independence
When considering two events, it can be tempting to calculate the probability of the desired outcome for both events and then multiply these probabilities together. But what if the two events are not independent? Then this argument breaks down.
In the F1 racing example, Max Verstappen and Sergio Perez both drive in cars designed by the Red Bull racing team. If one of the drivers is slow because of a deficiency in the car, then the other driver may also record a slow time because of the same deficiency. Alternatively if one driver posts a fast time because of an outstanding car, then it is perhaps more likely that the other driver will be similarly fast. The two lap times are not independent.
Recall that two events $A$ and $B$ are independent if $P(A \cap B) = P(A)P(B)$. The notion of independence of events can be extended to a notion of independence of random variables.
The random variables $X$ and $Y$, with CDFs $F_X$ and $F_Y$ respectively, are said to be independent if
\[ F_{X,Y}(x, y) = F_X(x) F_Y(y), \]
for all values of $x$ and $y$.
The condition $F_{X,Y}(x, y) = F_X(x) F_Y(y)$ in Definition 2.3.1 is equivalent to the condition
\[ P(X \le x, Y \le y) = P(X \le x) \cdot P(Y \le y). \]
The following theorem allows us to check independence of two random variables from PDFs instead of CDFs.
Suppose the CDF $F_{X,Y}(x, y)$ is differentiable. Then jointly continuous random variables $X$ and $Y$ are independent if and only if
\[ f_{X,Y}(x, y) = f_X(x) f_Y(y), \]
for all values of $x$ and $y$.
Specifically, given two random variables $X$ and $Y$ with joint PDF $f_{X,Y}(x, y)$, one can check independence by calculating both marginal PDFs $f_X(x)$ and $f_Y(y)$, then applying Theorem 2.3.2.
Consider again the game played by Abbie and Bertie in Example 2.1.6, the scores of which are governed by the joint PDF
\[ f_{X,Y}(x, y) = \begin{cases} 24x(1 - x - y), & \text{if } x, y \ge 0 \text{ and } x + y \le 1, \\ 0, & \text{otherwise.} \end{cases} \]
Are the scores of Abbie and Bertie independent?
Recall from Example 2.1.8 that Abbie's score is distributed according to the marginal PDF
\[ f_X(x) = \begin{cases} 12x(1 - x)^2, & \text{if } 0 \le x \le 1, \\ 0, & \text{otherwise.} \end{cases} \]
Calculating the marginal PDF of Bertie's score for $0 \le y \le 1$:
\[ f_Y(y) = \int_0^{1-y} 24x(1 - x - y)\,dx = \big[12x^2(1 - y) - 8x^3\big]_0^{1-y} = 4(1 - y)^3. \]
Therefore
\[ f_Y(y) = \begin{cases} 4(1 - y)^3, & \text{if } 0 \le y \le 1, \\ 0, & \text{otherwise.} \end{cases} \]
Immediately one can see that $f_{X,Y}(x, y) \ne f_X(x) f_Y(y)$: for example $f_{X,Y}\!\left(\tfrac{3}{4}, \tfrac{3}{4}\right) = 0$ since $\tfrac{3}{4} + \tfrac{3}{4} > 1$, but $f_X\!\left(\tfrac{3}{4}\right) f_Y\!\left(\tfrac{3}{4}\right) = \tfrac{9}{16} \cdot \tfrac{1}{16} = \tfrac{9}{256} \ne 0$.
Therefore Abbie's and Bertie's scores are dependent.

Consider again the buses of Example 2.2.2, the arrival times of which are governed by the joint CDF
\[ F_{X,Y}(x, y) = xy, \quad \text{where } 0 \le x, y \le 1. \]
Are the arrival times of the buses independent?
In Example 2.2.4, we calculated the marginal distributions $F_X(x) = x$ for $0 \le x \le 1$ and $F_Y(y) = y$ for $0 \le y \le 1$. It follows that
\[ F_{X,Y}(x, y) = xy = F_X(x) F_Y(y) \]
for all $0 \le x, y \le 1$, so by Definition 2.3.1 the arrival times of the buses are independent.
2.4 Three or more random variables
All the ideas and notions seen in this chapter can be extended to the setting of more than two random variables.
Specifically, let $n$ be a positive integer with $n \ge 2$. Consider $n$ random variables denoted $X_1, X_2, \ldots, X_n$. Then
assuming $X_1, X_2, \ldots, X_n$ are discrete, the joint PMF of $X_1, X_2, \ldots, X_n$ is a function $p_{X_1, X_2, \ldots, X_n}$ such that
\[ p_{X_1, X_2, \ldots, X_n}(x_1, \ldots, x_n) = P(X_1 = x_1, \ldots, X_n = x_n); \]
the joint CDF of $X_1, X_2, \ldots, X_n$ is a function $F_{X_1, X_2, \ldots, X_n}$ such that
\[ F_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n) = P(X_1 \le x_1, X_2 \le x_2, \ldots, X_n \le x_n); \]
assuming $X_1, X_2, \ldots, X_n$ are continuous, the joint PDF of $X_1, X_2, \ldots, X_n$ is the function $f_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n)$ such that for any region $C \subseteq \mathbb{R}^n$,
\[ P\big((X_1, X_2, \ldots, X_n) \in C\big) = \int \cdots \int_C f_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n)\,dx_1 \cdots dx_n; \]
assuming $X_1, X_2, \ldots, X_n$ are continuous, the joint PDF $f_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n)$ and joint CDF $F_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n)$ are related by the identity
\[ f_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n) = \frac{\partial^n F_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n)}{\partial x_1 \cdots \partial x_n}; \]
the marginal CDF of $X_i$ for $1 \le i \le n$ can be obtained by substitution into the joint CDF:
\[ F_{X_i}(x_i) = F_{X_1, X_2, \ldots, X_n}(\infty, \infty, \ldots, \infty, x_i, \infty, \ldots, \infty); \]
assuming $X_1, X_2, \ldots, X_n$ are continuous, the marginal PDF of $X_i$ for $1 \le i \le n$ can be obtained by integration of the joint PDF:
\[ f_{X_i}(x_i) = \int \cdots \int_{\mathbb{R}^{n-1}} f_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n)\,dx_1\,dx_2 \cdots dx_{i-1}\,dx_{i+1} \cdots dx_n; \]
the random variables $X_1, X_2, \ldots, X_n$ are said to be independent if
\[ F_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n) = F_{X_1}(x_1) F_{X_2}(x_2) \cdots F_{X_n}(x_n). \]
Compare all of the statements here with the analogous definitions earlier in the chapter when two random variables were considered. In particular, set $n = 2$ in each bullet point to recover an earlier definition or lemma.
2.5 Expectation
Norwegian Air runs a small five-person passenger plane from Oslo to Nordfjordeid. In order to analyse future financial prospects, the company would like to understand the fuel consumption of the flight. To calculate fuel consumption, Norwegian Air needs to know the total weight of the five passengers. Obviously this quantity will change from flight to flight, and so Norwegian Air will have to calculate an expected value for the total weight.
The next flight has a family of 5 booked on: a male parent, a female parent and three teenage children. Norwegian Air knows that the weight of an adult male is modeled by a continuous random variable $W_{\text{adult male}}$. Similarly, $W_{\text{adult female}}$ is a continuous random variable that models the weight of an adult female, and $W_{\text{teenager}}$ is a random variable that models the weight of a teenager. Norwegian Air would therefore like to predict the average value of $W_{\text{adult male}} + W_{\text{adult female}} + 3W_{\text{teenager}}$, that is, $E[W_{\text{adult male}} + W_{\text{adult female}} + 3W_{\text{teenager}}]$.
More generally, given a collection of continuous random variables $X_1, \ldots, X_n$, what is the expected value of some function $g(X_1, \ldots, X_n)$ of these random variables?
In MATH1055, we saw the solution to the analogous discrete problem with $n = 2$: how to calculate the expectation of a function of two discrete random variables.
Let $X_1$ and $X_2$ be two discrete random variables, with joint PMF denoted $p_{X_1, X_2}$. Then, for a function of the two random variables $g(X_1, X_2)$, we have
\[ E[g(X_1, X_2)] = \sum_{(x_1, x_2)} g(x_1, x_2)\, p_{X_1, X_2}(x_1, x_2). \]
In Theorem 2.5.1, the summation is over all pairs of values $x_1, x_2$ that $X_1, X_2$ can take respectively. Since the term inside the summation involves $p_{X_1, X_2}(x_1, x_2)$, it is enough to consider only pairs $(x_1, x_2)$ for which $p_{X_1, X_2}(x_1, x_2) \ne 0$, that is, only the pairs that have a chance of occurring.
A cafe wants to investigate the correlation between the temperature $X$ in degrees Celsius during winter and the number of customers $Y$ in the cafe each day. Based on existing data collected by the owner, the joint probability table is
\[
\begin{array}{c|ccc}
p_{X,Y}(x, y) & y = 15 & y = 75 & y = 150 \\\hline
x = 0 & 0.07 & 0.11 & 0.01 \\
x = 10 & 0.23 & 0.43 & 0.05 \\
x = 20 & 0.04 & 0.05 & 0.01
\end{array}
\]
What is the average number of customers per day?

In mathematical language, the question is asking us to calculate $E[Y]$. Setting $g(X, Y) = Y$, by Theorem 2.5.1:
\[
\begin{aligned}
E[Y] &= \sum_{(x, y)} g(x, y)\, p_{X,Y}(x, y) \\
&= 15\, p_{X,Y}(0, 15) + 75\, p_{X,Y}(0, 75) + 150\, p_{X,Y}(0, 150) \\
&\quad + 15\, p_{X,Y}(10, 15) + 75\, p_{X,Y}(10, 75) + 150\, p_{X,Y}(10, 150) \\
&\quad + 15\, p_{X,Y}(20, 15) + 75\, p_{X,Y}(20, 75) + 150\, p_{X,Y}(20, 150) \\
&= (15 \times 0.07) + (75 \times 0.11) + (150 \times 0.01) + (15 \times 0.23) + (75 \times 0.43) + (150 \times 0.05) \\
&\quad + (15 \times 0.04) + (75 \times 0.05) + (150 \times 0.01) \\
&= 59.85.
\end{aligned}
\]
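The same calculation can be reproduced in R from the joint probability table; the following is a short sketch (the matrix `p` simply stores the table above, with rows indexed by temperature and columns by customer numbers).

# Joint probability table: rows are x = 0, 10, 20; columns are y = 15, 75, 150
x_vals <- c(0, 10, 20)
y_vals <- c(15, 75, 150)
p <- matrix(c(0.07, 0.11, 0.01,
              0.23, 0.43, 0.05,
              0.04, 0.05, 0.01), nrow = 3, byrow = TRUE)
# E[Y] is the sum of y * p(x, y) over all cells
sum(colSums(p) * y_vals)   # 59.85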
Theorem 2.5.1 generalises to considering three or more variables. This is possible due to the introduction of joint PMFs in Section 2.4.
Let $X_1, \ldots, X_n$ be a collection of discrete random variables, with joint PMF denoted $p_{X_1, \ldots, X_n}$. Then, for a function of the random variables $g(X_1, \ldots, X_n)$, we have
\[ E[g(X_1, \ldots, X_n)] = \sum_{(x_1, \ldots, x_n)} g(x_1, \ldots, x_n)\, p_{X_1, \ldots, X_n}(x_1, \ldots, x_n). \]
The generalisation of Theorem 2.5.3 to the case of continuous random variables involves integrating over the continuous region of possibilities, rather than summing over the discrete collection of possibilities.
Let $X_1, \ldots, X_n$ be a collection of continuous random variables, with joint PDF denoted $f_{X_1, \ldots, X_n}$. Then, for a function of the random variables $g(X_1, \ldots, X_n)$, we have
\[ E[g(X_1, \ldots, X_n)] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(x_1, \ldots, x_n)\, f_{X_1, \ldots, X_n}(x_1, \ldots, x_n)\,dx_1 \cdots dx_n. \]
Consider two random variables with joint PDF given by
\[ f_{X,Y}(x, y) = \begin{cases} 2(x + y), & \text{if } 0 \le x \le y \le 1, \\ 0, & \text{otherwise.} \end{cases} \]
Calculate the expected value of $XY$.
The two-dimensional region on which $f_{X,Y}(x, y)$ is non-zero is the triangle $\{(x, y) : 0 \le x \le y \le 1\}$, with vertices $(0, 0)$, $(0, 1)$ and $(1, 1)$. Therefore, by Theorem 2.5.4,
\[ E[XY] = \int_0^1 \int_0^y xy \cdot 2(x + y)\,dx\,dy = \int_0^1 \left( \frac{2y^4}{3} + y^4 \right) dy = \int_0^1 \frac{5y^4}{3}\,dy = \frac{1}{3}. \]
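A numerical double integral in R (a rough sketch, not part of the original notes) confirms this value:

# Integrate x * y * f(x, y) over the triangle 0 <= x <= y <= 1
inner <- function(y) {
  sapply(y, function(yy) integrate(function(x) x * yy * 2 * (x + yy), 0, yy)$value)
}
integrate(inner, 0, 1)$value   # approximately 1/3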

2.6 Covariance
Consider two random variables $X$ and $Y$. We have seen the notion of independence: $X$ and $Y$ have no impact on each other. Alternatively, there could exist a positive relationship, that is, as one of $X, Y$ increases so does the other, or an inverse relationship, where as one of $X, Y$ increases the other decreases and vice-versa.
The mathematical quantities of covariance, studied in this section, and correlation, studied in Section 2.7, identify any such relationship and measure its strength.
The covariance of two random variables, $X$ and $Y$, is defined by
\[ \operatorname{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big]. \]
If $\operatorname{Cov}(X, Y)$ is positive, both random variables generally tend to be large and small at the same time. If $\operatorname{Cov}(X, Y)$ is negative, then as one random variable is large the other tends to be small. The strength of any such relationship between $X$ and $Y$ is not accounted for by the covariance.



Scatter plots of samples taken from random variables $X$ and $Y$ with negative covariance



There is a simpler formula than that of Definition 2.6.1 by which to calculate covariance. Specifically it can be shown that covariance is equal to the expected value of the product minus the product of the expected values.
The covariance of two random variables, $X$ and $Y$, can be calculated by
\[ \operatorname{Cov}(X, Y) = E[XY] - E[X]E[Y]. \]
Calculate the covariance of the random variables $X, Y$ in Example 2.5.2 that govern the temperature and number of customers daily in the cafe.
Motivated by the formula $\operatorname{Cov}(X, Y) = E[XY] - E[X]E[Y]$ for covariance, we seek to calculate $E[X]$, $E[Y]$ and $E[XY]$. From Example 2.5.2, we know $E[Y] = 59.85$. Now calculating $E[X]$ and $E[XY]$ using Theorem 2.5.1, obtain
\[
\begin{aligned}
E[X] &= 0 \times P(X = 0) + 10 \times P(X = 10) + 20 \times P(X = 20) \\
&= 0 \times (0.07 + 0.11 + 0.01) + 10 \times (0.23 + 0.43 + 0.05) + 20 \times (0.04 + 0.05 + 0.01) = 9.1, \\
E[XY] &= \sum_{n} n \cdot P(XY = n) \\
&= 0 \times \big(p_{X,Y}(0, 15) + p_{X,Y}(0, 75) + p_{X,Y}(0, 150)\big) + 150 \times p_{X,Y}(10, 15) + 300 \times p_{X,Y}(20, 15) \\
&\quad + 750 \times p_{X,Y}(10, 75) + 1500 \times \big(p_{X,Y}(10, 150) + p_{X,Y}(20, 75)\big) + 3000 \times p_{X,Y}(20, 150) \\
&= 0 \times (0.07 + 0.11 + 0.01) + 150 \times 0.23 + 300 \times 0.04 + 750 \times 0.43 + 1500 \times (0.05 + 0.05) + 3000 \times 0.01 \\
&= 549.
\end{aligned}
\]
Therefore it follows that
\[ \operatorname{Cov}(X, Y) = 549 - 9.1 \times 59.85 = 4.365. \]

Calculate the covariance of the random variables $X, Y$ in Example 2.5.5.
Again we seek to apply the formula $\operatorname{Cov}(X, Y) = E[XY] - E[X]E[Y]$. From Example 2.5.5, we know $E[XY] = \frac{1}{3}$. Calculating $E[X]$ and $E[Y]$ using Theorem 2.5.4, obtain
\[ E[X] = \int_0^1 \int_0^y x \cdot 2(x + y)\,dx\,dy = \int_0^1 \frac{5y^3}{3}\,dy = \frac{5}{12}, \qquad E[Y] = \int_0^1 \int_0^y y \cdot 2(x + y)\,dx\,dy = \int_0^1 3y^3\,dy = \frac{3}{4}. \]
Therefore it follows that
\[ \operatorname{Cov}(X, Y) = \frac{1}{3} - \frac{5}{12} \cdot \frac{3}{4} = \frac{1}{48}. \]

The covariance of two sample populations can be calculated in R. The following code calculates the covariance of a known sample of size 5 taken from two random variables $X$ and $Y$:
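The original sample is not reproduced here, so the five pairs below are made up purely for illustration; the calculation itself uses R's built-in `cov` function.

# Hypothetical sample of size 5 from X and Y
x <- c(2.1, 3.4, 1.8, 4.0, 2.9)
y <- c(5.0, 6.2, 4.1, 7.3, 5.8)
cov(x, y)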
Covariance has the following important properties:
If $X$ and $Y$ are independent, then $\operatorname{Cov}(X, Y) = 0$. However, if $\operatorname{Cov}(X, Y) = 0$, then $X$ and $Y$ do not necessarily have to be independent;
The covariance of a random variable with itself is equal to the variance of that random variable: $\operatorname{Cov}(X, X) = \operatorname{Var}(X)$;
There is a further relationship between variance and covariance:
\[ \operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X, Y). \]
That is to say, covariance describes the part of the variance of the random variable $X + Y$ that is not explained by the variances of the random variables $X$ and $Y$ (a short simulation check of this identity is given after this list);
More generally, the above relationship between variance and covariance generalises to:
\[ \operatorname{Var}\!\left(\sum_{i=1}^n a_i X_i\right) = \sum_{i=1}^n a_i^2 \operatorname{Var}(X_i) + 2 \sum_{1 \le i < j \le n} a_i a_j \operatorname{Cov}(X_i, X_j); \]
Assume that $X_1, X_2, \ldots, X_n$ are independent, so that $\operatorname{Cov}(X_i, X_j) = 0$ for $i \ne j$, and also assume that each $a_i$ is equal to $1$. This yields the formula:
\[ \operatorname{Var}\!\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \operatorname{Var}(X_i). \]
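The identity $\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X, Y)$ mentioned in the list above can be checked by simulation; the following sketch uses an assumed pair of correlated normal samples.

# Simulated correlated samples (assumed model, for illustration only)
set.seed(1)
x <- rnorm(100000)
y <- 0.5 * x + rnorm(100000)
# The two quantities below agree, since sample variances and covariances
# satisfy the same identity
var(x + y)
var(x) + var(y) + 2 * cov(x, y)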
Suppose $X$ and $Y$ are discrete random variables whose probability mass function is given by the following table:
What is the covariance of the two variables? Are they independent?

From the table, one can calculate the probability mass functions for $X$ and $Y$:
Using $p_X$, $p_Y$ and $p_{X,Y}$, we can calculate the expectations of $X$, $Y$ and $XY$ respectively:
The covariance can then be calculated using Definition 2.6.1:
This does not inform us whether $X$ and $Y$ are independent or not. However, note that
There are some rules which allow us to easily calculate the covariance of linear combinations of random variables.
Let $a, b$ be real numbers, and $X, X_1, X_2, Y, Y_1, Y_2$ be random variables. Each of the following rules pertaining to covariance holds in general:
\[ \operatorname{Cov}(aX + b, Y) = a\operatorname{Cov}(X, Y), \qquad \operatorname{Cov}(X, aY + b) = a\operatorname{Cov}(X, Y), \]
\[ \operatorname{Cov}(X_1 + X_2, Y) = \operatorname{Cov}(X_1, Y) + \operatorname{Cov}(X_2, Y), \qquad \operatorname{Cov}(X, Y_1 + Y_2) = \operatorname{Cov}(X, Y_1) + \operatorname{Cov}(X, Y_2). \]
Consider the cafe from Example 2.5.2. Calculate the covariance of the number of customers the cafe receives over a three-day period and the average temperature over these three days in degrees Fahrenheit. You may assume that the values of X and Y on any given day are independent of the values taken by X and Y on the other days.
In mathematical language, the total number of customers over a three-day period can be modeled by $3Y$. The average temperature in degrees Celsius is modeled by $\frac{1}{3}(X + X + X) = \frac{1}{3}(3X) = X$, which converted into degrees Fahrenheit is $\frac{9}{5}X + 32$. Therefore we aim to calculate $\operatorname{Cov}\!\left(3Y, \frac{9}{5}X + 32\right)$. Applying Lemma 2.6.6 and using the result of Example 2.6.3, obtain
\[ \operatorname{Cov}\!\left(3Y, \tfrac{9}{5}X + 32\right) = 3 \cdot \tfrac{9}{5}\operatorname{Cov}(Y, X) = \tfrac{27}{5}\operatorname{Cov}(X, Y) = 5.4 \times 4.365 = 23.571. \]
2.7 Correlation
Suppose two random variables have a large covariance. There are two factors that can contribute to this: the variances of X and Y as individual random variables could be high, that is, the magnitudes of X−E[X] and Y−E[Y] are particularly large, or the relationship between X and Y could be strong. We would like to isolate this latter contribution.
By scaling covariance to account for the variance of X and Y, one obtains a mathematical quantity, known as correlation, that solely tests the relationship between two random variables, and provides a measure of the strength of this relationship.
If Var(X)>0 and Var(Y)>0, then the correlation of X and Y is defined by
\[ \rho(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}. \]
The correlation of two sample populations can be calculated in R. The following code calculates the correlation of a known sample of size 5 taken from two random variables X and Y:
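The original sample is not reproduced here, so the five pairs below are made up purely for illustration; the calculation itself uses R's built-in `cor` function.

# Hypothetical sample of size 5 from X and Y
x <- c(2.1, 3.4, 1.8, 4.0, 2.9)
y <- c(5.0, 6.2, 4.1, 7.3, 5.8)
cor(x, y)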
Correlation has the following important properties:
−1≤ρ(X,Y)≤1;
If ρ(X,Y)=1, then there is a perfect linear positive correlation between X and Y. If ρ(X,Y)=−1, then there is a perfect linear inverse correlation between X and Y;
If X and Y are independent, then ρ(X,Y)=0. Note, again, that the converse is not true.
These properties can be explored in the following app.
While a correlation ρ(X,Y)=1 indicates a perfect positive linear relationship between X and Y, it is important to note that the correlation does not indicate the gradient of this linear relationship. Specifically an increase in X does not indicate an equal increase in Y. A similar statement holds for correlation ρ(X,Y)=−1.


Scatter plots of samples taken from random variables X and Y with correlation −1


Similarly to covariance, there are identities that help us to calculate the correlation of linear combinations of random variables.
Let a, b be real numbers, and X, Y be random variables. Both of the following rules pertaining to correlation hold in general:
\[ \rho(aX + b, Y) = \rho(X, Y) \ \text{if } a > 0, \qquad \rho(aX + b, Y) = -\rho(X, Y) \ \text{if } a < 0. \]
Let $X_i \sim \operatorname{Exp}(\lambda_i)$, where $\lambda_i > 0$, for $i = 0, 1, 2$ be a collection of independent random variables. Set
\[ Y_1 = X_0 + X_1, \qquad Y_2 = X_0 + X_2. \]
Calculate the correlation $\rho(Y_1, Y_2)$ of $Y_1$ and $Y_2$.
Calculate, using the independence of $X_0$ and $X_1$ and the fact that $\operatorname{Var}(X_i) = \frac{1}{\lambda_i^2}$ for an exponential random variable,
\[ \operatorname{Var}(Y_1) = \operatorname{Var}(X_0) + \operatorname{Var}(X_1) = \frac{1}{\lambda_0^2} + \frac{1}{\lambda_1^2}. \]
Similarly
\[ \operatorname{Var}(Y_2) = \frac{1}{\lambda_0^2} + \frac{1}{\lambda_2^2}. \]
Now, since the cross terms vanish by independence,
\[ \operatorname{Cov}(Y_1, Y_2) = \operatorname{Cov}(X_0 + X_1, X_0 + X_2) = \operatorname{Cov}(X_0, X_0) = \operatorname{Var}(X_0) = \frac{1}{\lambda_0^2}. \]
Therefore
\[ \rho(Y_1, Y_2) = \frac{\operatorname{Cov}(Y_1, Y_2)}{\sqrt{\operatorname{Var}(Y_1)\operatorname{Var}(Y_2)}} = \frac{1/\lambda_0^2}{\sqrt{\left(\frac{1}{\lambda_0^2} + \frac{1}{\lambda_1^2}\right)\left(\frac{1}{\lambda_0^2} + \frac{1}{\lambda_2^2}\right)}} = \frac{\lambda_1 \lambda_2}{\sqrt{(\lambda_0^2 + \lambda_1^2)(\lambda_0^2 + \lambda_2^2)}}. \]
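The result can be checked by simulation; the following sketch uses assumed rates $\lambda_0 = 1$, $\lambda_1 = 2$ and $\lambda_2 = 3$.

# Simulate the three independent exponential random variables
set.seed(1)
n <- 100000
x0 <- rexp(n, rate = 1)
x1 <- rexp(n, rate = 2)
x2 <- rexp(n, rate = 3)
# Empirical correlation of Y1 = X0 + X1 and Y2 = X0 + X2
cor(x0 + x1, x0 + x2)
# Theoretical value: lambda1 * lambda2 / sqrt((lambda0^2 + lambda1^2) * (lambda0^2 + lambda2^2))
(2 * 3) / sqrt((1^2 + 2^2) * (1^2 + 3^2))   # approximately 0.85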