# Chapter 6 Data and sampling

In chapters 2 and 5 we learned some techniques for cleaning data and calculating statistics in Excel. In chapters 3 and 4, we developed the basic theory of probability, random events, and random variables.

Our next step is to bring these two strands of concepts together. Over the next two chapters we will develop a framework for talking about data and the statistics calculated from that data as a random process that can be described using the theory of probability and random variables.

Chapter goals

In this chapter we will:

• Calculate and interpret joint, marginal and conditional distributions of two random variables.
• Calculate and interpret the covariance and correlation coefficient of two discrete random variables.
• Model the random process generating a data set.
• Apply and interpret the assumption of simple random sampling, and compare it to other sampling schemes.

## 6.1 Multiple random variables

Almost all interesting data sets have multiple observations and multiple variables. So before we start talking about data, we need to develop some tools and terminology for thinking about multiple random variables.

To keep things simple, most of the definitions and examples will be stated in terms of two random variables. The extension to more than two random variables is conceptually straightforward but will be skipped.

### 6.1.1 Joint distributions

Let $$x = x(b)$$ and $$y = y(b)$$ be two random variables defined in terms of the same underlying outcome $$b$$.

Their joint probability distribution assigns a probability to every event that can be defined in terms of $$x$$ and $$y$$, for example $$\Pr(x = 6 \cap y = 0)$$ or $$\Pr(x < y)$$.

This joint distribution can be fully described by the joint CDF: $F_{x,y}(a,b) = \Pr(x \leq a \cap y \leq b)$ or by the joint PDF: $f_{x,y}(a,b) = \begin{cases} \Pr(x = a \cap y = b) & \textrm{if x and y are discrete} \\ \frac{\partial^2 F_{x,y}(a,b)}{\partial a \partial b} & \textrm{if x and y are continuous} \\ \end{cases}$

The joint PDF in roulette

In our roulette example, the joint PDF of $$w_{red}$$ and $$w_{14}$$ can be derived from the original outcome.

If $$b=14$$, then both red and 14 win: \begin{align} f_{red,14}(1,35) &= \Pr(w_{red}=1 \cap w_{14} = 35) \\ &= \Pr(b \in \{14\}) = 1/37 \end{align} If $$b \in Red$$ but $$b \neq 14$$, then red wins but 14 loses: \begin{align} f_{red,14}(1,-1) &= \Pr(w_{red} = 1 \cap w_{14} = -1) \\ &= \Pr\left(b \in \left\{ \begin{gathered} 1,3,5,7,9,12,16,18,19,21,\\ 23,25,27,30,32,34,36 \end{gathered}\right\}\right) \\ &= 17/37 \end{align} Otherwise both red and 14 lose: \begin{align} f_{red,14}(-1,-1) &= \Pr(w_{red} = -1 \cap w_{14} = -1) \\ &= \Pr\left(b \in \left\{ \begin{gathered} 0,2,4,6,8,10,11,13,15,17, \\ 20,22,24,26,28,29,31,33,35 \end{gathered}\right\}\right) \\ &= 19/37 \end{align} All other values have probability zero.
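The derivation above can also be checked by brute force. The following is a minimal sketch, assuming a single-zero wheel with pockets 0 to 36 that are all equally likely; the names `RED`, `w_red`, `w_14`, and `joint_pdf` are illustrative choices, not notation from the text.

```python
from collections import Counter
from fractions import Fraction

# Red pockets on a single-zero wheel (pockets 0-36).
RED = {1, 3, 5, 7, 9, 12, 14, 16, 18, 19, 21, 23, 25, 27, 30, 32, 34, 36}

def w_red(b):
    """Net payout per dollar bet on red when the outcome is b."""
    return 1 if b in RED else -1

def w_14(b):
    """Net payout per dollar bet on the number 14 when the outcome is b."""
    return 35 if b == 14 else -1

# Every outcome has probability 1/37, so count outcomes for each (w_red, w_14) pair.
counts = Counter((w_red(b), w_14(b)) for b in range(37))
joint_pdf = {pair: Fraction(n, 37) for pair, n in counts.items()}

for pair, prob in sorted(joint_pdf.items()):
    print(pair, prob)
```

Only three pairs appear in `joint_pdf`; every other pair has probability zero, as stated above.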

The joint distribution tells you two things about these variables:

1. The probability distribution of each individual random variable, sometimes called that variable’s marginal distribution.
• For example, we can derive each variable’s CDF from the joint CDF: $F_x(a) = \Pr(x \leq a) = \Pr(x \leq a \cap y \leq \infty) = F_{x,y}(a,\infty)$ $F_y(b) = \Pr(y \leq b) = \Pr(x \leq \infty \cap y \leq b) = F_{x,y}(\infty,b)$
• We can also derive each variable’s PDF from the joint PDF
2. The relationship between the two variables.
• We will develop several ways of describing this relationship: conditional distribution, covariance, correlation, etc.

Note that while you can always derive the marginal distributions from the joint distribution, you cannot go the other way around unless you know everything about the relationship between the two variables.
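For discrete random variables, deriving a marginal PDF from the joint PDF amounts to summing the joint probabilities over all values of the other variable. The sketch below applies this to the roulette joint PDF; the helper name `marginal` is an illustrative choice.

```python
from collections import defaultdict
from fractions import Fraction

# Joint PDF of (w_red, w_14) from the roulette example.
joint_pdf = {
    (1, 35): Fraction(1, 37),
    (1, -1): Fraction(17, 37),
    (-1, -1): Fraction(19, 37),
}

def marginal(joint, index):
    """Marginal PDF of one variable: sum the joint PDF over the other variable."""
    out = defaultdict(Fraction)
    for pair, prob in joint.items():
        out[pair[index]] += prob
    return dict(out)

f_red = marginal(joint_pdf, 0)  # marginal PDF of w_red
f_14 = marginal(joint_pdf, 1)   # marginal PDF of w_14

print(f_red)
print(f_14)
```

Going the other way is not possible from `f_red` and `f_14` alone: many different joint PDFs share these two marginals.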

Three joint distributions with identical marginal distributions

The scatter plots in Figure 6.1 below depict simulation results for a pair of random variables $$(x,y)$$, with a different joint distribution in each graph. In all three graphs, $$x$$ and $$y$$ have the same marginal distribution (standard normal).

The differences between the graphs are in the relationship between $$x$$ and $$y$$.

• In the first graph, $$x$$ and $$y$$ are unrelated, so the data looks like a “cloud” of random dots.
• In the second graph, $$x$$ and $$y$$ have something of a positive relationship. High values of $$x$$ tend to go with high values of $$y$$.
• In the third graph, $$x$$ and $$y$$ are closely related. In fact, they are equal.

Figure 6.1: x and y have the same marginal distribution in all three graphs, but not the same joint distribution.

### 6.1.2 Conditional distributions

The conditional distribution of a random variable $$y$$ given another random variable $$x$$ assigns values to all probabilities of the form: $\Pr(y \in A| x \in B) = \frac{\Pr(y \in A \cap x \in B)}{\Pr(x \in B)}$ Since a conditional probability is just the ratio of the joint probability to the marginal probability, the conditional distribution can always be derived from the joint distribution.

We can describe a conditional distribution with either the conditional CDF: $F_{y|x}(a,b) = \Pr(y \leq a|x=b)$ or the conditional PDF $f_{y|x}(a,b) = \begin{cases} \Pr(y=a|x=b) & \textrm{if x and y are discrete} \\ \frac{\partial}{\partial a}F_{y|x}(a,b) & \textrm{if x and y are continuous} \\ \end{cases}$

Conditional PDFs in roulette

Let’s find the conditional PDF of the payout for a bet on 14 given the payout for a bet on red. \begin{align} \Pr(w_{14}=-1|w_{red}=-1) &= \frac{\Pr(w_{14}=-1 \cap w_{red}=-1)}{\Pr(w_{red}=-1)} \\ &=\frac{19/37}{19/37} = 1 \\ \Pr(w_{14}=35|w_{red}=-1) &= \frac{\Pr(w_{14}=35 \cap w_{red}=-1)}{\Pr(w_{red}=-1)} \\ &=\frac{0}{19/37} = 0 \\ \Pr(w_{14}=-1|w_{red}=1) &= \frac{\Pr(w_{14}=-1 \cap w_{red}=1)}{\Pr(w_{red}=1)} \\ &= \frac{17/37}{18/37} \approx 0.944 \\ \Pr(w_{14}=35|w_{red}=1) &= \frac{\Pr(w_{14}=35 \cap w_{red}=1)}{\Pr(w_{red}=1)} \\ &= \frac{1/37}{18/37} \approx 0.056 \end{align}
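The same four conditional probabilities can be computed mechanically as the ratio of a joint probability to a marginal probability. This is a small sketch using exact fractions; the function name `cond_w14_given_red` is an illustrative choice.

```python
from fractions import Fraction

# Joint PDF of (w_red, w_14) from the roulette example.
joint_pdf = {
    (1, 35): Fraction(1, 37),
    (1, -1): Fraction(17, 37),
    (-1, -1): Fraction(19, 37),
}

def cond_w14_given_red(a, b):
    """Pr(w_14 = a | w_red = b): joint probability over marginal probability."""
    pr_red = sum(p for (r, _), p in joint_pdf.items() if r == b)
    return joint_pdf.get((b, a), Fraction(0)) / pr_red

for b in (-1, 1):
    for a in (-1, 35):
        print(f"Pr(w_14={a} | w_red={b}) = {cond_w14_given_red(a, b)}")
```

For each value of $$w_{red}$$, the conditional probabilities over the two values of $$w_{14}$$ add up to one.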

### 6.1.3 Independent random variables

We say that $$x$$ and $$y$$ are independent if every event defined in terms of $$x$$ is independent of every event defined in terms of $$y$$. $\Pr(x \in A \cap y \in B) = \Pr(x \in A)\Pr(y \in B)$ As before, independence of $$x$$ and $$y$$ implies that the conditional distribution is the same as the marginal distribution: $\Pr(x \in A| y \in B) = \Pr(x \in A)$ $\Pr(y \in A| x \in B) = \Pr(y \in A)$ The first graph in Figure 6.1 shows what independent random variables look like in data: a cloud of unrelated points.

Independence also means that the joint and conditional PDF/CDF can be derived from the marginal PDF/CDF: $f_{x,y}(a,b) = f_x(a)f_y(b)$ $f_{y|x}(a,b) = f_y(a)$ $F_{x,y}(a,b) = F_x(a)F_y(b)$ $F_{y|x}(a,b) = F_y(a)$ As with independence of events, this will be very handy in simplifying the analysis. But remember: independence is an assumption that we can only make when it’s reasonable to do so.

Independence in roulette

The winnings from a bet on red $$(w_{red})$$ and the winnings from a bet on 14 $$(w_{14})$$ in the same game are not independent.

However the winnings from a bet on red and a bet on 14 in two different games are independent since the underlying outcomes are independent.
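Both claims can be checked against the product rule $$f_{x,y}(a,b) = f_x(a)f_y(b)$$. The sketch below assumes a single-zero wheel and assumes that two different games produce 37 × 37 equally likely pairs of outcomes; the helper names `joint_from` and `is_independent` are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

RED = {1, 3, 5, 7, 9, 12, 14, 16, 18, 19, 21, 23, 25, 27, 30, 32, 34, 36}

def w_red(b):
    return 1 if b in RED else -1

def w_14(b):
    return 35 if b == 14 else -1

def joint_from(outcomes, x, y):
    """Joint PDF of (x, y) when every listed outcome is equally likely."""
    counts = Counter((x(o), y(o)) for o in outcomes)
    total = sum(counts.values())
    return {pair: Fraction(n, total) for pair, n in counts.items()}

def is_independent(joint):
    """Check f_xy(a, b) == f_x(a) * f_y(b) for every pair of values."""
    xs = {a for a, _ in joint}
    ys = {b for _, b in joint}
    f_x = {a: sum(p for (a2, _), p in joint.items() if a2 == a) for a in xs}
    f_y = {b: sum(p for (_, b2), p in joint.items() if b2 == b) for b in ys}
    return all(
        joint.get((a, b), Fraction(0)) == f_x[a] * f_y[b] for a in xs for b in ys
    )

# Same game: one outcome b determines both payouts.
same_game = joint_from(range(37), w_red, w_14)

# Different games: red is bet in game 1, the number 14 in game 2.
two_games = list(product(range(37), repeat=2))
different_games = joint_from(two_games, lambda o: w_red(o[0]), lambda o: w_14(o[1]))

print(is_independent(same_game))
print(is_independent(different_games))
```

In the same game the product rule fails because red losing rules out 14 winning, so that pair has probability zero even though both marginal probabilities are positive.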

### 6.1.4 Expected values with multiple variables

We earlier showed that if $$x$$ is a random variable, we can take the expected value of any function of $$x$$, and that you can take the expected value “inside” any linear function of $$x$$: $E(a + bx) = a + b E(x)$ but in general $$E(g(x)) \neq g(E(x))$$.

Similarly, when $$x$$ and $$y$$ are random variables with a well-defined joint distribution, we can take the expected value of any function of $$x$$ and $$y$$.

When $$x$$ and $$y$$ are both discrete: $E(g(x,y)) = \sum_{a \in S_x} \sum_{b \in S_y} g(a,b)\Pr( x=a \cap y = b)$ That is, we add up $$g(x,y)$$ across all possible values for the pair $$(x,y)$$ with each value weighted by its probability.

When $$x$$ and $$y$$ are both continuous: $E(g(x,y)) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(a,b)f_{x,y}(a,b)dadb$ Again, this looks similar to the formula for the discrete case, but uses an integral instead of a sum.

You do not need to remember or use either of these formulas

As with a single random variable, you can take the expected value inside a linear function of multiple random variables: $E(ax + by + c) = aE(x) + bE(y) + c$ but in general $$E(g(x,y)) \neq g(E(x),E(y))$$. For example: $E(xy) \neq E(x)E(y)$ $E(x/y) \neq E(x)/E(y)$ etc.

Multiple bets in roulette

Suppose we bet \$100 on red and \$10 on 14. Our net payout will be: $w_{total} = 100*w_{red} + 10*w_{14}$ which has expected value: \begin{align} E(w_{total}) &= E(100 w_{red} + 10 w_{14}) \\ &= 100 \, \underbrace{E(w_{red})}_{\approx -0.027} + 10 \, \underbrace{E(w_{14})}_{\approx -0.027} \\ &\approx -3 \end{align} That is, we expect this betting strategy to lose an average of about \$3 per game.
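As a check on the linearity rule, the sketch below computes $$E(w_{total})$$ two ways: directly from the joint PDF using the discrete formula for $$E(g(x,y))$$, and by taking the expected value inside the linear function. The helper name `expect` is an illustrative choice.

```python
from fractions import Fraction

# Joint PDF of (w_red, w_14) from the roulette example.
joint_pdf = {
    (1, 35): Fraction(1, 37),
    (1, -1): Fraction(17, 37),
    (-1, -1): Fraction(19, 37),
}

def expect(g):
    """E(g(w_red, w_14)): sum g over all value pairs, weighted by probability."""
    return sum(g(r, n) * p for (r, n), p in joint_pdf.items())

e_red = expect(lambda r, n: r)
e_14 = expect(lambda r, n: n)

# Direct calculation versus the linearity shortcut.
e_total_direct = expect(lambda r, n: 100 * r + 10 * n)
e_total_linear = 100 * e_red + 10 * e_14

print(e_total_direct, float(e_total_direct))
print(e_total_linear, float(e_total_linear))
```

Both routes give the same exact fraction, which is about −2.97, consistent with the rounded figure of −3 above.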

### 6.1.5 Covariance and correlation

The covariance of two random variables $$x$$ and $$y$$ is defined as: $\sigma_{xy} = cov(x,y) = E[(x-E(x))*(y-E(y))]$ and their correlation is defined as: $\rho_{xy} = corr(x,y) = \frac{cov(x,y)}{\sqrt{var(x)var(y)}} = \frac{\sigma_{xy}}{\sigma_x\sigma_y}$ Both the covariance and correlation are measures of how $$x$$ and $$y$$ tend to move together.

Covariance and correlation in roulette

The covariance of $$w_{red}$$ and $$w_{14}$$ is: \begin{align} cov(w_{red},w_{14}) &= \begin{aligned}[t] & (1-\underbrace{E(w_{red})}_{\approx -0.027})(35-\underbrace{E(w_{14})}_{\approx -0.027})\underbrace{f_{red,14}(1,35)}_{1/37}\\ &+ (1-\underbrace{E(w_{red})}_{\approx -0.027})(-1-\underbrace{E(w_{14})}_{\approx -0.027})\underbrace{f_{red,14}(1,-1)}_{17/37} \\ &+ (-1-\underbrace{E(w_{red})}_{\approx -0.027})(-1-\underbrace{E(w_{14})}_{\approx -0.027})\underbrace{f_{red,14}(-1,-1)}_{19/37} \\ \end{aligned} \\ &\approx 0.999 \end{align} and their correlation is: \begin{align} corr(w_{red},w_{14}) &= \frac{cov(w_{red},w_{14})}{\sqrt{var(w_{red})*var(w_{14})}} \\ &\approx \frac{0.999}{\sqrt{1.0*34.1}} \\ &\approx 0.17 \end{align}
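The arithmetic above is tedious by hand, so here is a short sketch that applies the definitions of covariance and correlation directly to the roulette joint PDF. The helper `expect` and the variable names are illustrative choices.

```python
from fractions import Fraction
from math import sqrt

# Joint PDF of (w_red, w_14) from the roulette example.
joint_pdf = {
    (1, 35): Fraction(1, 37),
    (1, -1): Fraction(17, 37),
    (-1, -1): Fraction(19, 37),
}

def expect(g):
    """E(g(w_red, w_14)) under the joint PDF."""
    return sum(g(r, n) * p for (r, n), p in joint_pdf.items())

e_red = expect(lambda r, n: r)
e_14 = expect(lambda r, n: n)

# cov(x, y) = E[(x - E(x)) * (y - E(y))], and var(x) = cov(x, x).
cov = expect(lambda r, n: (r - e_red) * (n - e_14))
var_red = expect(lambda r, n: (r - e_red) ** 2)
var_14 = expect(lambda r, n: (n - e_14) ** 2)

corr = float(cov) / sqrt(float(var_red) * float(var_14))

print(f"cov  = {float(cov):.3f}")
print(f"var  = {float(var_red):.3f}, {float(var_14):.3f}")
print(f"corr = {corr:.3f}")
```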

The sign of the covariance or correlation tells us the direction of the relationship between $$x$$ and $$y$$:

• If positive: $$x$$ and $$y$$ tend to move in the same direction.
• If negative: $$x$$ and $$y$$ tend to move in opposite directions.
• If zero: $$x$$ and $$y$$ tend not to move together at all.

The covariance and correlation always have the same sign, since standard deviations are always positive (see footnote 1).

When random variables are independent, their covariance and correlation are both exactly zero. However, it does not go the other way around.

Uncorrelated does not imply independent

Figure 6.2 below shows a scatter plot from a simulation of two random variables that are clearly related (and therefore not independent) but whose covariance and correlation are exactly zero.

Intuitively, covariance and correlation are a measure of the linear relationship between two variables. When variables have a nonlinear relationship as in Figure 6.2, the covariance or correlation may miss it.

Figure 6.2: x and y are uncorrelated, but clearly related.
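A small discrete example makes the same point exactly. The sketch below uses a hypothetical pair of variables (not the simulation behind Figure 6.2): $$x$$ takes the values −1, 0, and 1 with equal probability, and $$y = x^2$$, so $$y$$ is completely determined by $$x$$.

```python
from fractions import Fraction

# Hypothetical example: x is -1, 0, or 1 with equal probability, and y = x^2.
joint_pdf = {(x, x * x): Fraction(1, 3) for x in (-1, 0, 1)}

def expect(g):
    """E(g(x, y)) under the joint PDF."""
    return sum(g(x, y) * p for (x, y), p in joint_pdf.items())

# cov(x, y) = E(xy) - E(x)E(y)
cov = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)

# Independence would require Pr(x=0 and y=0) == Pr(x=0) * Pr(y=0).
pr_joint = joint_pdf[(0, 0)]
pr_x0 = sum(p for (x, _), p in joint_pdf.items() if x == 0)
pr_y0 = sum(p for (_, y), p in joint_pdf.items() if y == 0)

print("cov(x, y) =", cov)
print("Pr(x=0 and y=0) =", pr_joint, "but Pr(x=0) * Pr(y=0) =", pr_x0 * pr_y0)
```

The covariance is exactly zero because the relationship is symmetric around $$x = 0$$, yet the product rule for independence fails.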

The magnitude of the covariance or correlation tells us about the strength of the relationship between the two variables:

• The correlation is a scale-free measure of the strength of the relationship:
• It always lies between -1 and 1.
• It is unchanged by any rescaling or change in units. That is, for any positive constants $$a$$ and $$b$$: $corr(ax,by) = corr(x,y)$
• When $$corr(x,y) \in \{-1,1\}$$, then that means $$y$$ is an exact linear function of $$x$$. That is, we can write it: $y = a + bx$ and $corr(x,y) = \begin{cases} 1 & \textrm{if b > 0} \\ -1 & \textrm{if b < 0} \\ \end{cases}$
• In contrast, the covariance depends on the scale of the variables, and always lies between $$-\sigma_x\sigma_y$$ and $$\sigma_x\sigma_y$$.

Earlier I said that the variance was a square, and has some implied properties similar to those of a square. In particular:

• The variance can be written $var(x) = E(x^2) - E(x)^2$
• We can take constants out of a variance: $var(ax) = a^2 var(x)$ since $$(ax)^2 = a^2x^2$$.

Similarly, the covariance is the expected value of a product, and has some implied properties similar to those of a product.

• The covariance can be written: $cov(x,y) = E(xy) - E(x)E(y)$
• The order in a covariance does not matter: $cov(x,y)=cov(y,x)$ since $$xy = yx$$.
• Covariances go through sums: $cov(x,y+z) = cov(x,y) + cov(x,z)$ since $$x*(y+z) = xy + xz$$
• Constant multiples can be taken out of covariances $cov(ax,by) = ab \, cov(x,y)$ since $$ax*by = ab*xy$$.
• The variance of a random variable is its covariance with itself: $cov(x,x) = var(x)$ since $$x*x = x^2$$
• The variance of a sum is: $var(x + y) = var(x) + var(y) + 2 \, cov(x,y)$ since $$(x+y)^2 = x^2 + y^2 + 2xy$$.
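These identities can be verified numerically on any joint distribution. As one illustration, the sketch below checks the variance-of-a-sum rule and $$cov(x,x) = var(x)$$ for the roulette payouts, using exact fractions; the helper names are illustrative choices.

```python
from fractions import Fraction

# Joint PDF of (w_red, w_14) from the roulette example.
joint_pdf = {
    (1, 35): Fraction(1, 37),
    (1, -1): Fraction(17, 37),
    (-1, -1): Fraction(19, 37),
}

def expect(g):
    """E(g(w_red, w_14)) under the joint PDF."""
    return sum(g(r, n) * p for (r, n), p in joint_pdf.items())

def var(g):
    """var(g) = E(g^2) - E(g)^2"""
    return expect(lambda r, n: g(r, n) ** 2) - expect(g) ** 2

def cov(g, h):
    """cov(g, h) = E(gh) - E(g)E(h)"""
    return expect(lambda r, n: g(r, n) * h(r, n)) - expect(g) * expect(h)

def red(r, n):
    return r

def fourteen(r, n):
    return n

def total(r, n):
    return r + n

lhs = var(total)
rhs = var(red) + var(fourteen) + 2 * cov(red, fourteen)

print(lhs, rhs)
print(cov(red, red) == var(red))
```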

I do not expect you to remember all of these formulas, but be prepared to see me use them

## 6.2 Data and the data generating process

Having invested in all of the probabilistic preliminaries, we can finally talk about data. Suppose for the rest of this chapter that we have a data set called $$D_n$$.

In this chapter, we will assume that $$D_n = (x_1,x_2,\ldots,x_n)$$ is a data set with one variable and $$n$$ observations. We use $$x_i$$ to refer to the value of our variable for an arbitrary observation $$i$$.

In real-world analysis, data tends to be more complex:

• In most applications, it will be a simple table of numbers with $$n$$ observations (rows) and $$K$$ variables (columns).
• However, it is occasionally something more abstract. For example, the data set at https://www.kaggle.com/c/dogs-vs-cats is a big folder full of dog and cat photos.
• A great deal of research in the statistical field of machine learning has been focused on developing methods for determining if a particular photo in this data set shows a dog or a cat.

Although our examples will all be based on simple data sets, many of our concepts and results can be applied to more complex data.

Data from 3 roulette games

Suppose we have a data set $$D_3$$ providing the result of $$n=3$$ independent games of roulette. Let $$b_i$$ be the outcome in game $$i$$, and let $$x_i$$ be the result of a bet on red: $x_i = I(b_i \in RED) = \begin{cases} 1 & b_i \in RED \\ 0 & b_i \notin RED \\ \end{cases}$ Then $$D_3 = (x_1,x_2,x_3)$$. For example, if red loses the first two games and wins the third game we have $$D_3 = (0,0,1)$$.

Our data set $$D_n$$ is a set of $$n$$ numbers, but we can also think of it as a set of $$n$$ random variables with unknown joint distribution $$P_D$$. The distinction here is a hard one for students to make, so give it some thought before proceeding.

The joint distribution of $$D_n$$ is called its data generating process or DGP. The exact DGP is assumed to be unknown, but we usually have at least some information about it.

The DGP for the roulette data

The joint distribution of $$D_n$$ can be derived. Let $p = \Pr(b \in Red)$ We showed in a previous chapter that $$p \approx 0.486$$ if the roulette wheel is fair. But rather than assuming it is fair, let’s treat $$p$$ as an unknown parameter.

The PDF of $$x_i$$ is $f_x(a) = \begin{cases}(1-p) & a = 0 \\ p & a = 1 \\ 0 & \textrm{otherwise} \\ \end{cases}$ Since the random variables in $$D_n$$ are independent, their joint PDF is: $\Pr(D_n) = f_x(x_1)f_x(x_2)f_x(x_3) = p^{x_1+x_2+x_3}(1-p)^{3-x_1-x_2-x_3}$ Note that even with a small data set of a simple random variable, the joint PDF is not easy to calculate. Once we get into larger data sets and more complex random variables, it can get very difficult. That’s OK, we don’t usually need to calculate it - we just need to know that it could be calculated.
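To make the formula concrete, the sketch below tabulates the joint PDF for all eight possible values of the three-game data set. It plugs in the fair-wheel value $$p = 18/37$$ purely for illustration, since the text treats $$p$$ as unknown; the function name `joint_pdf` is an illustrative choice.

```python
from fractions import Fraction
from itertools import product

def joint_pdf(data, p):
    """Pr(D_n = data) for independent observations with Pr(x_i = 1) = p."""
    wins = sum(data)
    return p**wins * (1 - p) ** (len(data) - wins)

# Illustrative value only: the probability of red on a fair single-zero wheel.
p = Fraction(18, 37)

# One entry for each possible data set (x_1, x_2, x_3).
table = {d: joint_pdf(d, p) for d in product((0, 1), repeat=3)}

for d, prob in table.items():
    print(d, float(prob))
```

The eight probabilities add up to one, and any two data sets with the same number of wins have the same probability.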

### 6.2.1 Simple random sampling

In order to model the data generating process, we need to model the entire joint distribution of $$D_n$$. As mentioned earlier, this means we must model both:

• The marginal probability distribution of each $$x_i$$
• The relationship between the $$x_i$$’s

Fortunately, we often can simplify this joint distribution quite a bit by assuming that $$D_n$$ is independent and identically distributed (IID) or a simple random sample from a large population.

A simple random sample has two features:

1. Independent: Each $$x_i$$ is independent of the others.
2. Identically distributed: Each $$x_i$$ has the same (unknown) marginal distribution.

This implies that its joint PDF can be written: $\Pr(D_n = (a_1,a_2,\ldots,a_n)) = f_x(a_1)f_x(a_2)\ldots f_x(a_n)$ where $$f_x(a) = \Pr(x_i = a)$$ is just the marginal PDF of a single observation. Independence allows us to write the joint PDF as the product of the marginal PDFs for each observation, and identical distribution allows us to use the same marginal PDF for each observation.

The reason we call this “independent and identically distributed” is hopefully obvious, but what does it mean to say we have a “random sample” from a “population”? Well, one simple way of generating an IID sample is to:

1. Define the population of interest, for example all Canadian residents.
2. Use some purely random mechanism (see footnote 2) to choose a small subset of cases from this population.
• The subset is called our sample
• “Purely random” here means some mechanism like a computer’s random number generator, which can then be used to dial random telephone numbers or select cases from a list.
3. Collect data from every case in our sample, usually by contacting them and asking them questions (survey).

It will turn out that a moderately-sized random sample provides surprisingly accurate information on the underlying population.
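The sampling procedure can be mimicked in a few lines. This is a minimal sketch under stated assumptions: the population is a made-up list of 100,000 cases of which 30% have some yes/no characteristic, the sample size of 2,000 is arbitrary, and sampling is done with replacement so that the draws are independent.

```python
import random

# Hypothetical population of 100,000 cases; 30% have some yes/no characteristic.
population = [1] * 30_000 + [0] * 70_000
population_share = sum(population) / len(population)

rng = random.Random(233)  # fixed seed so the run is repeatable

# Draw n cases with replacement, so each draw is independent of the others.
n = 2_000
sample = rng.choices(population, k=n)
sample_share = sum(sample) / n

print(f"population share: {population_share:.3f}")
print(f"sample share:     {sample_share:.3f}")
```

Even though the sample covers only 2% of this hypothetical population, the sample share typically lands within a few percentage points of the population share.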

Our roulette data is a random sample

Each observation $$x_i$$ in our roulette data is an independent random draw from the $$Bernoulli(p)$$ distribution where $$p = \Pr(b \in Red)$$.

Therefore, this data set satisfies the criteria for a simple random sample.

### 6.2.2 Time series data

Our employment data set is an example of time series data; it is made of observations of each variable at regularly-spaced points in time. Most macroeconomic data - GDP, population, inflation, interest rates - are time series.

Time series have several features that are inconsistent with the random sampling assumption:

• They usually have clear time trends.
• For example, Canada’s real GDP has been steadily growing for as long as we have data.
• This violates “identically distributed” since 2010 GDP is drawn from a distribution with a higher expected value than the distribution for 1910 GDP.
• They usually have clear recurring cyclical patterns or seasonality.
• For example, unemployment in Canada is usually lower from September through December.
• This also violates “identically distributed” since February unemployment has a higher expected value than November unemployment.
• They usually exhibit what is called autocorrelation.
• For example, shocks to the economy that affect GDP in one month or quarter (think of COVID or a financial crisis) are likely to have a similar (if smaller) effect on GDP in the next month or quarter.
• This violates “independence” since nearby time periods are positively correlated.
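Autocorrelation is easy to see in a simulation. The sketch below compares a series of independent shocks with a hypothetical persistent series in which 80% of the previous period's value carries over to the next period. The carry-over rate, the series length, and the helper `lag1_correlation` are all assumptions made for illustration.

```python
import random

def lag1_correlation(series):
    """Sample correlation between each value and the value one period earlier."""
    x, y = series[1:], series[:-1]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5_000

# Independent shocks: each period is a separate standard normal draw.
shocks = [rng.gauss(0, 1) for _ in range(n)]

# Hypothetical persistent series: 80% of last period's value carries over.
persistent = [shocks[0]]
for e in shocks[1:]:
    persistent.append(0.8 * persistent[-1] + e)

print(f"independent series: {lag1_correlation(shocks):.2f}")
print(f"persistent series:  {lag1_correlation(persistent):.2f}")
```

The independent series shows a correlation near zero between neighbouring periods, while the persistent series shows a strong positive one, which is the pattern that violates the independence assumption.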

We can calculate statistics for time series, and we already did in Chapter 5. However, time series data often requires more advanced techniques than we will learn in this class. ECON 433 addresses time series data.

### 6.2.3 Other sampling models

Not all useful data sets come from a simple random sample or a time series. For example:

• A stratified sample is collected by dividing the population into strata (subgroups) based on some observable characteristics, and then randomly sampling a predetermined number of cases within each stratum.
• Most professional surveys are constructed from stratified samples rather than random samples.
• Stratified sampling is often combined with oversampling of some smaller strata that are of particular interest.
• The LFS oversamples residents of Prince Edward Island (PEI) because a national random sample would not catch enough PEI residents to accurately measure PEI’s unemployment rate.
• Government surveys typically oversample disadvantaged groups.
• Stratified samples can usually be handled as if they were from a random sample, with some adjustments.
• A cluster sample is gathered by dividing the population into clusters, randomly selecting some of these clusters, and sampling cases within the cluster.
• Educational data sets are often gathered this way: we pick a random sample of schools, and then collect data from each student within those schools.
• Cluster samples can usually be handled as if they were from a random sample, with some adjustments.
• A census gathers data on every case in the population.
• For example, we might have data on all 50 US states, or all 10 Canadian provinces, or all of the countries of the world.
• Data from administrative sources such as tax records or school records often cover the entire population of interest as well.
• Censuses are often treated as random samples from some hypothetical population of “possible” cases.
• A convenience sample is gathered by whatever method is convenient.
• For example, we might gather a survey from people who walk by, or we might recruit our friends to participate in the survey.
• Convenience samples are the worst-case scenario; in many cases they simply aren’t usable for accurate statistical analysis.
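One common form of the "adjustments" mentioned above for stratified samples with oversampling is to weight each stratum by its share of the population. The sketch below uses made-up numbers (two strata, equal sample sizes, invented unemployment counts) and should not be read as the actual method used by any particular survey.

```python
from fractions import Fraction

# Hypothetical strata: population share, sample size, and unemployed in the sample.
strata = {
    "large province": {"pop_share": Fraction(96, 100), "n": 500, "unemployed": 30},
    "small province": {"pop_share": Fraction(4, 100), "n": 500, "unemployed": 50},
}

# Unweighted: pool all sampled cases as if this were a simple random sample.
total_n = sum(s["n"] for s in strata.values())
unweighted = Fraction(sum(s["unemployed"] for s in strata.values()), total_n)

# Weighted: average the within-stratum rates using population shares as weights.
weighted = sum(
    s["pop_share"] * Fraction(s["unemployed"], s["n"]) for s in strata.values()
)

print(f"unweighted rate: {float(unweighted):.4f}")
print(f"weighted rate:   {float(weighted):.4f}")
```

Because the small stratum makes up half of the sample but only 4% of the population, the unweighted rate overstates its influence; the weighted rate restores each stratum to its population share.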

Many data sets combine several of these elements. For example, Canada’s unemployment rate is calculated using data from the Labour Force Survey (LFS). The LFS is built from a stratified sample of the civilian non-institutionalized working-age population of Canada. There is also some clustering: the LFS will typically interview whole households, and will do some geographic clustering to save on travel costs. The LFS is gathered monthly, and the resulting unemployment rate is a time series.

### 6.2.4 Sample selection and representativeness

Random samples and their close relatives have the feature that they are representative of the population from which they are drawn. In a sense that will be made more clear over the next few chapters, any sufficiently large random sample “looks just like” the population.

Unfortunately, a simple random sample is quite difficult to collect from humans. Even if we are able to randomly select cases, we often run into the following problems:

• Nonresponse occurs when a sampled individual does not provide the information requested by the survey
• Survey-level nonresponse occurs when the sampled individual does not answer any questions.
• This can occur if the sampled individual cannot be found, refuses to answer, or cannot answer (for example, is incapacitated due to illness or disability).
• Recent response rates to telephone surveys have been around 9%, implying over 90% of those contacted do not respond.
• Item-level nonresponse occurs when the sampled individual does not answer a particular question.
• This can occur if the respondent refuses to answer, or the question is not applicable or has no valid answer.
• Item-level nonresponse is particularly common on sensitive questions including income.
• Censoring occurs when a particular quantity of interest cannot be observed for a particular case. Censored outcomes are extremely common in economics, for example:
• In labour market analysis, we cannot observe the market wage for individuals who are not currently employed.
• In supply/demand analysis, we only observe quantity supplied and quantity demanded at the current market price.

There are two basic solutions to these problems:

• Imputation: we assume or impute values for all missing quantities. For example, we might assume that the wage of each non-employed worker is equal to the average wage among employed workers with similar characteristics.
• Redefinition: we redefine the population so that our data can be correctly interpreted as a random sample from that population. For example, instead of having a random sample of Canadians, we can interpret our data as a random sample of Canadians who would answer these questions if asked.

This issue does not have a purely technical solution; it requires careful thought. If we are imputing values, do we believe that our imputation method is reasonable? If we are redefining the population, is the redefined population one we are interested in? There are no right or wrong answers to these questions, and sometimes our data are simply not good enough to answer our questions.

Nonresponse bias in recent US elections

Going into both the 2016 and 2020 US presidential elections, polls indicated that the Democratic candidate had a substantial lead over the Republican candidate:

• Hillary Clinton led Donald Trump by 4-6% nationally in 2016
• Joe Biden led Trump by 8% nationally in 2020.

The actual vote was much closer:

• Clinton won the popular vote (but lost the election) by 2%
• Biden won the popular vote (and won the election) by about 4.5%.

The generally accepted explanation among pollsters for the clear disparity between polls and voting is systematic nonresponse: for some reason, Trump voters are less likely to respond to polls. Since most people do not respond to standard telephone polls any more (response rates are typically around 9%), it does not take much difference in response rates to produce a large difference in responses. For example, suppose that:

• We call 1,000 voters
• These voters are equally split, with 500 supporting Biden and 500 supporting Trump.
• 10% of Biden voters respond (50 voters)
• 8% of Trump voters respond (40 voters)

The overall response rate is 9% (similar to what we usually see in surveys). Biden has the support of $$50/90 \approx 56\%$$ of the respondents, while Trump has the support of $$40/90 \approx 44\%$$. Actual support is even, but the poll shows a gap of about 11 percentage points (55.6% versus 44.4% before rounding), entirely because of the small difference in response rates.
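The arithmetic in this example is simple enough to script, which makes it easy to try other response rates. The sketch below reproduces the numbers from the example; the function name `poll_shares` is an illustrative choice.

```python
from fractions import Fraction

def poll_shares(supporters, response_rates):
    """Share of poll respondents backing each candidate, given who responds."""
    respondents = {c: supporters[c] * response_rates[c] for c in supporters}
    total = sum(respondents.values())
    return {c: r / total for c, r in respondents.items()}, total

# Numbers from the example: 1,000 voters called, evenly split.
supporters = {"Biden": 500, "Trump": 500}
response_rates = {"Biden": Fraction(10, 100), "Trump": Fraction(8, 100)}

shares, total_respondents = poll_shares(supporters, response_rates)
overall_rate = total_respondents / sum(supporters.values())
gap = shares["Biden"] - shares["Trump"]

print(f"overall response rate: {float(overall_rate):.1%}")
for candidate, share in shares.items():
    print(f"{candidate}: {float(share):.1%}")
print(f"gap: {float(gap):.1%}")
```

Setting both response rates to the same value brings the poll shares back to an even split, which confirms that the gap comes entirely from differential nonresponse.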

Polling organizations employ statisticians who are well aware of this problem, and they made various adjustments after 2016 to address it. For example, most now weight their analysis by education, since more educated people tend to have a higher response rate. Unfortunately, the 2020 results indicate that this adjustment was not enough. Some pollsters have argued that it makes more sense to just assume the nonresponse bias is 2-3% and adjust the numbers by that amount directly.

1. More precisely, either or both of $$\sigma_x$$ and $$\sigma_y$$ could be zero. In that case the covariance will also be zero, and the correlation will be undefined (zero divided by zero).

2. As a technical matter, the assumption of independence requires that we sample with replacement. This means we allow for the possibility that we sample the same case more than once. In practice this doesn’t matter as long as the sample is small relative to the population.