# Chapter 6 Data and sampling

In chapters 2 and 5 we learned some techniques for cleaning data and calculating statistics in Excel. In chapters 3 and 4, we developed the basic theory of probability, random events, and random variables.

Our next step is to bring these two strands of concepts together. Over the next two chapters we will develop a framework for talking about data and the statistics calculated from that data as a random process that can be described using the theory of probability and random variables.

*Chapter goals*

In this chapter we will:

- Calculate and interpret joint, marginal and conditional distributions of two random variables.
- Calculate and interpret the covariance and correlation coefficient of two discrete random variables.
- Model the random process generating a data set.
- Apply and interpret the assumption of simple random sampling, and compare it to other sampling schemes.

## 6.1 Multiple random variables

Almost all interesting data sets have multiple observations and multiple variables. So before we start talking about data, we need to develop some tools and terminology for thinking about multiple random variables.

To keep things simple, most of the definitions and examples will be stated in
terms of *two* random variables. The extension to more than two random
variables is conceptually straightforward but will be skipped.

### 6.1.1 Joint distributions

Let \(x = x(b)\) and \(y = y(b)\) be two random variables defined in terms of the same underlying outcome \(b\).

Their **joint probability distribution** assigns a probability
to every event that can be defined in terms of \(x\) and \(y\),
for example \(\Pr(x = 6 \cap y = 0)\) or \(\Pr(x < y)\).

This joint distribution can be fully described by the **joint CDF**:
\[F_{x,y}(a,b) = \Pr(x \leq a \cap y \leq b)\]
or by the **joint PDF**:
\[f_{x,y}(a,b) = \begin{cases} \Pr(x = a \cap y = b) & \textrm{if $x$ and $y$ are discrete} \\ \frac{\partial^2 F_{x,y}(a,b)}{\partial a \partial b} & \textrm{if $x$ and $y$ are continuous} \\ \end{cases}\]

**The joint PDF in roulette**

In our roulette example, the joint PDF of \(w_{red}\) and \(w_{14}\) can be derived from the original outcome.

If \(b=14\), then both red and 14 win: \[\begin{align} f_{red,14}(1,35) &= \Pr(w_{red}=1 \cap w_{14} = 35) \\ &= \Pr(b \in \{14\}) = 1/37 \end{align}\] If \(b \in Red\) but \(b \neq 14\), then red wins but 14 loses: \[\begin{align} f_{red,14}(1,-1) &= \Pr(w_{red} = 1 \cap w_{14} = -1) \\ &= \Pr\left(b \in \left\{ \begin{gathered} 1,3,5,7,9,12,16,18,19,21,\\ 23,25,27,30,32,34,36 \end{gathered}\right\}\right) \\ &= 17/37 \end{align}\] Otherwise both red and 14 lose: \[\begin{align} f_{red,14}(-1,-1) &= \Pr(w_{red} = -1 \cap w_{14} = -1) \\ &= \Pr\left(b \in \left\{ \begin{gathered} 0,2,4,6,8,10,11,13,15,17, \\ 20,22,24,26,28,29,31,33,35 \end{gathered}\right\}\right) \\ &= 19/37 \end{align}\] All other values have probability zero.
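The same joint PDF can be built mechanically by looping over the outcomes. The sketch below assumes a fair single-zero wheel with 37 equally likely pockets; the helper names `w_red` and `w_14` are illustrative choices that mirror the notation in the text.

```python
from fractions import Fraction

# Red pockets on a single-zero wheel (14 is red).
RED = {1, 3, 5, 7, 9, 12, 14, 16, 18, 19, 21, 23, 25, 27, 30, 32, 34, 36}


def w_red(b):
    """Net payout per dollar bet on red when the outcome is b."""
    return 1 if b in RED else -1


def w_14(b):
    """Net payout per dollar bet on 14 when the outcome is b."""
    return 35 if b == 14 else -1


# Joint PDF: each outcome b adds probability 1/37 to the pair it produces.
joint = {}
for b in range(37):  # pockets 0 through 36
    pair = (w_red(b), w_14(b))
    joint[pair] = joint.get(pair, Fraction(0)) + Fraction(1, 37)

for pair, prob in sorted(joint.items()):
    print(pair, prob)
```

Only three pairs receive positive probability, and their probabilities add up to one.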

The joint distribution tells you two things about these variables:

- The probability distribution of each individual random variable, sometimes called that variable’s *marginal* distribution.
  - For example, we can derive each variable’s CDF from the joint CDF: \[F_x(a) = \Pr(x \leq a) = \Pr(x \leq a \cap y \leq \infty) = F_{x,y}(a,\infty)\] \[F_y(b) = \Pr(y \leq b) = \Pr(x \leq \infty \cap y \leq b) = F_{x,y}(\infty,b)\]
  - We can also derive each variable’s PDF from the joint PDF.
- The *relationship* between the two variables.
  - We will develop several ways of describing this relationship: conditional distribution, covariance, correlation, etc.

Note that while you can always derive the marginal distributions from the joint distribution, you cannot go the other way around unless you know everything about the relationship between the two variables.
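For discrete variables, deriving a marginal PDF from the joint PDF amounts to summing the joint PDF over every value of the other variable. Here is a minimal sketch using the roulette joint PDF; the function name `marginal` is an illustrative choice.

```python
from fractions import Fraction

# Joint PDF of (w_red, w_14) from the roulette example.
joint = {
    (1, 35): Fraction(1, 37),
    (1, -1): Fraction(17, 37),
    (-1, -1): Fraction(19, 37),
}


def marginal(joint_pdf, index):
    """Sum the joint PDF over the other variable (index 0 or 1 picks which to keep)."""
    out = {}
    for pair, prob in joint_pdf.items():
        out[pair[index]] = out.get(pair[index], Fraction(0)) + prob
    return out


f_red = marginal(joint, 0)  # marginal PDF of w_red
f_14 = marginal(joint, 1)   # marginal PDF of w_14
print(f_red)
print(f_14)
```

Going the other way is not possible from `f_red` and `f_14` alone: many different joint PDFs share these two marginals.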

**Three joint distributions with identical marginal distributions**

The scatter plots in Figure 6.1 below depict simulation results for a pair of random variables \((x,y)\), with a different joint distribution in each graph. In all three graphs, \(x\) and \(y\) have the same marginal distribution (standard normal).

The differences between the graphs are in the relationship between \(x\) and \(y\).

- In the first graph, \(x\) and \(y\) are unrelated, so the
data look like a “cloud” of random dots.

- In the second graph, \(x\) and \(y\) have something of a positive relationship. High values of \(x\) tend to go with high values of \(y\).
- In the third graph, \(x\) and \(y\) are closely related. In fact, they are equal.
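A simulation along these lines can be sketched in a few lines of Python. This is not the code behind the figure: the correlation `rho = 0.5` for the middle case, the seed, and the sample size are all assumptions chosen for illustration. In each case both variables are standard normal.

```python
import math
import random


def corr(xs, ys):
    """Sample correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000
rho = 0.5  # assumed strength of the relationship in the middle case

x = [rng.gauss(0, 1) for _ in range(n)]
z = [rng.gauss(0, 1) for _ in range(n)]  # independent of x

y_unrelated = z                                   # no relationship with x
y_positive = [rho * a + math.sqrt(1 - rho**2) * c  # still standard normal
              for a, c in zip(x, z)]
y_equal = list(x)                                 # y is exactly x

for name, y in [("unrelated", y_unrelated),
                ("positive", y_positive),
                ("equal", y_equal)]:
    print(name, round(corr(x, y), 3))
```

The marginal distributions are identical across the three cases; only the relationship between the two variables changes.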

### 6.1.2 Conditional distributions

The **conditional distribution** of a random variable \(y\) given another random
variable \(x\) assigns values to all probabilities of the form:
\[\Pr(y \in A| x \in B) = \frac{\Pr(y \in A \cap x \in B)}{\Pr(x \in B)}\]
Since a conditional probability is just the ratio of the joint probability
to the marginal probability, the conditional distribution can always be
derived from the joint distribution.

We can describe a conditional distribution with either the **conditional CDF**:
\[F_{y|x}(a,b) = \Pr(y \leq a|x=b)\]
or the **conditional PDF**:
\[f_{y|x}(a,b) = \begin{cases} \Pr(y=a|x=b) & \textrm{if $x$ and $y$ are discrete} \\ \frac{\partial}{\partial a}F_{y|x}(a,b) & \textrm{if $x$ and $y$ are continuous} \\ \end{cases} \]

**Conditional PDFs in roulette**

Let’s find the conditional PDF of the payout for a bet on 14 given the payout for a bet on red. \[\begin{align} \Pr(w_{14}=-1|w_{red}=-1) &= \frac{\Pr(w_{14}=-1 \cap w_{red}=-1)}{\Pr(w_{red}=-1)} \\ &=\frac{19/37}{19/37} = 1 \\ \Pr(w_{14}=35|w_{red}=-1) &= \frac{\Pr(w_{14}=35 \cap w_{red}=-1)}{\Pr(w_{red}=-1)} \\ &=\frac{0}{19/37} = 0 \\ \Pr(w_{14}=-1|w_{red}=1) &= \frac{\Pr(w_{14}=-1 \cap w_{red}=1)}{\Pr(w_{red}=1)} \\ &= \frac{17/37}{18/37} \approx 0.944 \\ \Pr(w_{14}=35|w_{red}=1) &= \frac{\Pr(w_{14}=35 \cap w_{red}=1)}{\Pr(w_{red}=1)} \\ &= \frac{1/37}{18/37} \approx 0.056 \end{align}\]
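These four calculations all follow the same recipe: divide a joint probability by a marginal probability. A short sketch of that recipe is below; the function name is an illustrative choice.

```python
from fractions import Fraction

# Joint PDF of (w_red, w_14) from the roulette example.
joint = {
    (1, 35): Fraction(1, 37),
    (1, -1): Fraction(17, 37),
    (-1, -1): Fraction(19, 37),
}


def pdf_w14_given_red(w14_value, red_value):
    """Pr(w_14 = w14_value | w_red = red_value) = joint / marginal."""
    p_red = sum(p for (r, _), p in joint.items() if r == red_value)
    p_both = joint.get((red_value, w14_value), Fraction(0))
    return p_both / p_red


for red_value in (-1, 1):
    for w14_value in (-1, 35):
        prob = pdf_w14_given_red(w14_value, red_value)
        print(f"Pr(w14={w14_value} | wred={red_value}) = {prob}")
```

For each value of \(w_{red}\), the conditional probabilities add up to one, as they must for any PDF.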

### 6.1.3 Independent random variables

We say that \(x\) and \(y\) are **independent** if every event defined in
terms of \(x\) is independent of every event defined in terms of \(y\):
\[\Pr(x \in A \cap y \in B) = \Pr(x \in A)\Pr(y \in B)\]
As before, independence of \(x\) and \(y\) implies that the conditional
distribution is the same as the marginal distribution:
\[\Pr(x \in A| y \in B) = \Pr(x \in A)\]
\[\Pr(y \in A| x \in B) = \Pr(y \in A)\]
The first graph in Figure 6.1 shows
what independent random variables look like in data: a cloud of
unrelated points.

Independence also means that the joint and conditional PDF/CDF can be derived
from the marginal PDF/CDF:
\[f_{x,y}(a,b) = f_x(a)f_y(b)\]
\[f_{y|x}(a,b) = f_y(a)\]
\[F_{x,y}(a,b) = F_x(a)F_y(b)\]
\[F_{y|x}(a,b) = F_y(a)\]
As with independence of events, this will be very handy in simplifying the
analysis. But remember: independence is an *assumption* that
we can only make when it’s reasonable to do so.

**Independence in roulette**

The winnings from a bet on red \((w_{red})\) and the winnings from
a bet on 14 \((w_{14})\) in the same game are *not* independent.

However the winnings from a bet on red and a bet on 14 in two different
games *are* independent since the underlying outcomes are independent.
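Both claims can be checked by comparing the joint PDF with the product of the marginal PDFs. The sketch below assumes a fair 37-pocket wheel and, for the two-game case, that every pair of outcomes is equally likely (which is what independence of the underlying outcomes means here). The helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

RED = {1, 3, 5, 7, 9, 12, 14, 16, 18, 19, 21, 23, 25, 27, 30, 32, 34, 36}


def w_red(b):
    return 1 if b in RED else -1


def w_14(b):
    return 35 if b == 14 else -1


def add(pdf, key, prob):
    """Accumulate probability mass on a key."""
    pdf[key] = pdf.get(key, Fraction(0)) + prob


# Same game: both bets depend on the same outcome b.
same_game = {}
for b in range(37):
    add(same_game, (w_red(b), w_14(b)), Fraction(1, 37))

# Two games: the bet on red uses b1, the bet on 14 uses b2.
two_games = {}
for b1, b2 in product(range(37), repeat=2):
    add(two_games, (w_red(b1), w_14(b2)), Fraction(1, 37 * 37))


def is_independent(joint):
    """True if the joint PDF equals the product of its marginals everywhere."""
    f_x, f_y = {}, {}
    for (a, b), p in joint.items():
        add(f_x, a, p)
        add(f_y, b, p)
    return all(joint.get((a, b), Fraction(0)) == f_x[a] * f_y[b]
               for a in f_x for b in f_y)


print("same game independent?", is_independent(same_game))
print("two games independent?", is_independent(two_games))
```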

### 6.1.4 Expected values with multiple variables

We earlier showed that if \(x\) is a random variable, we can take the expected value of any function of \(x\), and that you can take the expected value “inside” any linear function of \(x\): \[E(a + bx) = a + b E(x)\] but in general \(E(g(x)) \neq g(E(x))\).

Similarly, when \(x\) and \(y\) are random variables with a well-defined joint distribution, we can take the expected value of any function of \(x\) and \(y\).

When \(x\) and \(y\) are both discrete: \[E(g(x,y)) = \sum_{a \in S_x} \sum_{b \in S_y} g(a,b)\Pr( x=a \cap y = b)\] That is, we add up \(g(x,y)\) across all possible values for the pair \((x,y)\) with each value weighted by its probability.

When \(x\) and \(y\) are both continuous: \[E(g(x,y)) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(a,b)f_{x,y}(a,b)dadb\] Again, this looks similar to the formula for the discrete case, but uses an integral instead of a sum.

You do not need to remember or use either of these formulas.

As with a single random variable, you can take the expected value inside
a *linear* function of multiple random variables:
\[E(ax + by + c) = aE(x) + bE(y) + c\]
but in general \(E(g(x,y)) \neq g(E(x),E(y))\). For example:
\[E(xy) \neq E(x)E(y)\]
\[E(x/y) \neq E(x)/E(y)\]
etc.

**Multiple bets in roulette**

Suppose we bet $100 on red and $10 on 14. Our net payout will be: \[w_{total} = 100*w_{red} + 10*w_{14}\] which has expected value: \[\begin{align} E(w_{total}) &= E(100 w_{red} + 10 w_{14}) \\ &= 100 \, \underbrace{E(w_{red})}_{\approx -0.027} + 10 \, \underbrace{E(w_{14})}_{\approx -0.027} \\ &\approx -3 \end{align}\] That is, we expect this betting strategy to lose an average of about $3 per game.
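As a check on the linearity rule, the expected payout can be computed two ways: directly from the 37 outcomes, and by combining the two individual expected values. The sketch assumes a fair 37-pocket wheel; `expect` is an illustrative helper.

```python
from fractions import Fraction

RED = {1, 3, 5, 7, 9, 12, 14, 16, 18, 19, 21, 23, 25, 27, 30, 32, 34, 36}
P = Fraction(1, 37)  # probability of each pocket on a fair wheel


def w_red(b):
    return 1 if b in RED else -1


def w_14(b):
    return 35 if b == 14 else -1


def expect(g):
    """Expected value of g(b) over the 37 equally likely outcomes."""
    return sum(g(b) * P for b in range(37))


e_red = expect(w_red)
e_14 = expect(w_14)

# Direct calculation from the outcomes.
e_total_direct = expect(lambda b: 100 * w_red(b) + 10 * w_14(b))
# Using E(ax + by) = aE(x) + bE(y).
e_total_linear = 100 * e_red + 10 * e_14

print(float(e_total_direct), float(e_total_linear))
```

Both routes give exactly \(-110/37\), or about \(-2.97\).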

### 6.1.5 Covariance and correlation

The **covariance** of two random variables \(x\) and \(y\)
is defined as:
\[\sigma_{xy} = cov(x,y) = E[(x-E(x))*(y-E(y))]\]
and their **correlation** is defined as:
\[\rho_{xy} = corr(x,y) = \frac{cov(x,y)}{\sqrt{var(x)var(y)}} = \frac{\sigma_{xy}}{\sigma_x\sigma_y}\]
Both the covariance and correlation are measures of how \(x\) and \(y\) tend to move together.

**Covariance and correlation in roulette**

The covariance of \(w_{red}\) and \(w_{14}\) is: \[\begin{align} cov(w_{red},w_{14}) &= \begin{aligned}[t] & (1-\underbrace{E(w_{red})}_{\approx -0.027})(35-\underbrace{E(w_{14})}_{\approx -0.027})\underbrace{f_{red,14}(1,35)}_{1/37}\\ &+ (1-\underbrace{E(w_{red})}_{\approx -0.027})(-1-\underbrace{E(w_{14})}_{\approx -0.027})\underbrace{f_{red,14}(1,-1)}_{17/37} \\ &+ (-1-\underbrace{E(w_{red})}_{\approx -0.027})(-1-\underbrace{E(w_{14})}_{\approx -0.027})\underbrace{f_{red,14}(-1,-1)}_{19/37} \\ \end{aligned} \\ &\approx 0.999 \end{align}\] and its correlation is: \[\begin{align} corr(w_{red},w_{14}) &= \frac{cov(w_{red},w_{14})}{\sqrt{var(w_{red})*var(w_{14})}} \\ &\approx \frac{0.999}{\sqrt{1.0*34.1}} \\ &\approx 0.17 \end{align}\]
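The arithmetic above is easy to get wrong by hand, so here is a sketch that reproduces it from the joint PDF using exact fractions. The helper `expect` is an illustrative name.

```python
import math
from fractions import Fraction

# Joint PDF of (w_red, w_14) from the roulette example.
joint = {
    (1, 35): Fraction(1, 37),
    (1, -1): Fraction(17, 37),
    (-1, -1): Fraction(19, 37),
}


def expect(g):
    """Expected value of g(w_red, w_14) under the joint PDF."""
    return sum(g(r, s) * p for (r, s), p in joint.items())


e_red = expect(lambda r, s: r)
e_14 = expect(lambda r, s: s)

cov = expect(lambda r, s: (r - e_red) * (s - e_14))
var_red = expect(lambda r, s: (r - e_red) ** 2)
var_14 = expect(lambda r, s: (s - e_14) ** 2)
corr = float(cov) / math.sqrt(float(var_red) * float(var_14))

print("cov  =", float(cov))
print("var  =", float(var_red), float(var_14))
print("corr =", corr)
```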

The *sign* of the covariance or correlation tells us the direction
of the relationship between \(x\) and \(y\).

- If *positive*: \(x\) and \(y\) tend to move in the *same* direction.
- If *negative*: \(x\) and \(y\) tend to move in *opposite* directions.
- If *zero*: \(x\) and \(y\) tend *not* to move together at all.

The covariance and correlation always have the same sign, since
standard deviations are always^{5} positive.

When random variables are independent, their covariance and correlation are both exactly zero. However, it does not go the other way around.

**Uncorrelated does not imply independent**

Figure 6.2 below shows a scatter plot from a simulation of two random variables that are clearly related (and therefore not independent) but whose covariance and correlation are exactly zero.

Intuitively, covariance and correlation are a measure of the *linear*
relationship between two variables. When variables have a nonlinear
relationship as in Figure 6.2, the
covariance or correlation may miss it.
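A small discrete example makes the same point with exact numbers. The distribution below is a hypothetical one chosen for illustration (it is not the one plotted in the figure): \(x\) takes the values \(-1\), \(0\), \(1\) with equal probability, and \(y = x^2\).

```python
from fractions import Fraction

third = Fraction(1, 3)
# Joint PDF of (x, y) when x is uniform on {-1, 0, 1} and y = x**2.
joint = {(-1, 1): third, (0, 0): third, (1, 1): third}


def expect(g):
    return sum(g(a, b) * p for (a, b), p in joint.items())


# Covariance via cov(x, y) = E(xy) - E(x)E(y).
cov = expect(lambda a, b: a * b) - expect(lambda a, b: a) * expect(lambda a, b: b)

# Independence would require Pr(x=0 and y=0) = Pr(x=0) * Pr(y=0).
p_x0 = sum(p for (a, _), p in joint.items() if a == 0)
p_y0 = sum(p for (_, b), p in joint.items() if b == 0)
p_both = joint[(0, 0)]

print("cov =", cov)
print("Pr(x=0, y=0) =", p_both, "but Pr(x=0)Pr(y=0) =", p_x0 * p_y0)
```

Here \(y\) is completely determined by \(x\), yet the covariance is exactly zero because the relationship is symmetric rather than linear.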

The *magnitude* of the covariance or correlation tells us about
the strength of the relationship between the two variables:

- The correlation is a scale-free measure of the strength of the relationship:
  - It always lies between -1 and 1.
  - It is unchanged by any rescaling or change in units. That is, for any positive constants \(a\) and \(b\): \[corr(ax,by) = corr(x,y)\]
  - When \(corr(x,y) \in \{-1,1\}\), \(y\) is an exact linear function of \(x\). That is, we can write: \[y = a + bx\] and \[corr(x,y) = \begin{cases} 1 & \textrm{if $b > 0$} \\ -1 & \textrm{if $b < 0$} \\ \end{cases}\]
- In contrast, the covariance depends on the scale of the variables, and always lies between \(-\sigma_x\sigma_y\) and \(\sigma_x\sigma_y\).

Earlier I said that the variance was a square, and has some implied properties similar to those of a square. In particular:

- The variance can be written \[var(x) = E(x^2) - E(x)^2\]
- We can take constants out of a variance: \[var(ax) = a^2 var(x)\] since \((ax)^2 = a^2x^2\).

Similarly, the covariance is the expected value of a product, and has some implied properties similar to those of a product.

- The covariance can be written: \[cov(x,y) = E(xy) - E(x)E(y)\]
- The order in a covariance does not matter: \[cov(x,y)=cov(y,x)\] since \(xy = yx\).
- Covariances go through sums: \[cov(x,y+z) = cov(x,y) + cov(x,z)\] since \(x*(y+z) = xy + xz\)
- Constant multiples can be taken out of covariances \[cov(ax,by) = ab \, cov(x,y)\] since \(ax*by = ab*xy\).
- The variance of a random variable is its covariance with itself: \[cov(x,x) = var(x)\] since \(x*x = x^2\)
- The variance of a sum is: \[var(x + y) = var(x) + var(y) + 2 \, cov(x,y)\] since \((x+y)^2 = x^2 + y^2 + 2xy\).

I do not expect you to remember all of these formulas, but be prepared to see me use them.
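To see two of these identities at work, the sketch below checks \(cov(x,y) = E(xy) - E(x)E(y)\) and \(var(x+y) = var(x) + var(y) + 2\,cov(x,y)\) on the roulette payouts, with \(x = w_{red}\) and \(y = w_{14}\). The helper names are illustrative, and exact fractions are used so the two sides can be compared with `==`.

```python
from fractions import Fraction

# Joint PDF of (w_red, w_14) from the roulette example.
joint = {
    (1, 35): Fraction(1, 37),
    (1, -1): Fraction(17, 37),
    (-1, -1): Fraction(19, 37),
}


def expect(g):
    return sum(g(r, s) * p for (r, s), p in joint.items())


def var(g):
    m = expect(g)
    return expect(lambda r, s: (g(r, s) - m) ** 2)


def cov(g, h):
    mg, mh = expect(g), expect(h)
    return expect(lambda r, s: (g(r, s) - mg) * (h(r, s) - mh))


def x(r, s):
    return r  # payout from the bet on red


def y(r, s):
    return s  # payout from the bet on 14


# cov(x, y) = E(xy) - E(x)E(y)
shortcut = expect(lambda r, s: r * s) - expect(x) * expect(y)

# var(x + y) = var(x) + var(y) + 2 cov(x, y)
lhs = var(lambda r, s: r + s)
rhs = var(x) + var(y) + 2 * cov(x, y)

print(cov(x, y) == shortcut, lhs == rhs)
```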

## 6.2 Data and the data generating process

Having invested in all of the probabilistic preliminaries, we can finally talk about data. Suppose for the rest of this chapter that we have a data set called \(D_n\).

In this chapter, we will assume that \(D_n = (x_1,x_2,\ldots,x_n)\) is a data set with one variable and \(n\) observations. We use \(x_i\) to refer to the value of our variable for an arbitrary observation \(i\).

In real-world analysis, data tends to be more complex:

- In most applications, it will be a simple table of numbers with \(n\) observations (rows) and \(K\) variables (columns).
- However, it is occasionally something more abstract. For example, the data set at https://www.kaggle.com/c/dogs-vs-cats is a big folder full of dog and cat photos.
  - A great deal of research in the statistical field of machine learning has been focused on developing methods for determining if a particular photo in this data set shows a dog or a cat.

Although our examples will all be based on simple data sets, many of our concepts and results can be applied to more complex data.

**Data from 3 roulette games**

Suppose we have a data set \(D_3\) providing the result of \(n=3\) independent games of roulette. Let \(b_i\) be the outcome in game \(i\), and let \(x_i\) be the result of a bet on red: \[x_i = I(b_i \in RED) = \begin{cases} 1 & b_i \in RED \\ 0 & b_i \notin RED \\ \end{cases}\] Then \(D_n = (x_1,x_2,x_3)\). For example, if red loses the first two games and wins the third game we have \(D_n = (0,0,1)\).

Our data set \(D_n\) is a set of \(n\) *numbers*, but we can also think of
it as a set of \(n\) *random variables* with unknown joint distribution
\(P_D\). The distinction here is a hard one for students to make, so give
it some thought before proceeding.

The joint distribution of \(D_n\) is called its **data generating process**
or DGP. The exact DGP is assumed to be unknown, but we usually have at
least some information about it.

**The DGP for the roulette data**

The joint distribution of \(D_n\) can be derived. Let \[p = \Pr(b \in Red)\] We showed in a previous chapter that \(p \approx 0.486\) if the roulette wheel is fair. But rather than assuming it is fair, let’s treat \(p\) as an unknown parameter.

The PDF of \(x_i\) is
\[f_x(a) = \begin{cases}(1-p) & a = 0 \\ p & a = 1 \\ 0 & \textrm{otherwise} \\ \end{cases} \]
Since the random variables in \(D_n\) are independent, their
joint PDF is:
\[\Pr(D_n) = f_x(x_1)f_x(x_2)f_x(x_3) = p^{x_1+x_2+x_3}(1-p)^{3-x_1-x_2-x_3}\]
Note that even with a small data set of a simple random variable,
the joint PDF is not easy to calculate. Once we get into larger
data sets and more complex random variables, it can get very
difficult. That’s OK, we don’t usually need to calculate it - we
just need to know that it *could* be calculated.
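For this small example the joint PDF can be written as a short function of the data and the parameter \(p\). The sketch below evaluates it at the fair-wheel value \(p = 18/37\), which is an assumption made only to get concrete numbers; the function name `joint_pdf` is illustrative.

```python
from itertools import product


def joint_pdf(data, p):
    """Probability of the 0/1 sequence `data` when each x_i is 1 with probability p."""
    wins = sum(data)
    return p ** wins * (1 - p) ** (len(data) - wins)


p_fair = 18 / 37  # assumed value of p for a fair wheel

# Probability of the data set (0, 0, 1): red loses twice, then wins.
prob_001 = joint_pdf((0, 0, 1), p_fair)

# The probabilities of all 8 possible data sets must add up to one.
total = sum(joint_pdf(d, p_fair) for d in product((0, 1), repeat=3))

print(prob_001, total)
```

The same function works for any \(n\), but the number of possible data sets doubles with each added observation, which is one reason we rarely tabulate the joint PDF in full.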

### 6.2.1 Simple random sampling

In order to model the data generating process, we need to model the entire joint distribution of \(D_n\). As mentioned earlier, this means we must model both:

- The marginal probability distribution of each \(x_i\)
- The relationship between the \(x_i\)’s

Fortunately, we often can simplify this joint distribution quite a bit
by assuming that \(D_n\)
is **independent and identically distributed** (IID)
or a **simple random sample** from a large **population**.

A simple random sample has two features:

- **Independent**: Each \(x_i\) is independent of the others.
- **Identically distributed**: Each \(x_i\) has the same (unknown) marginal distribution.

This implies that its joint PDF can be written: \[\Pr(D_n = (a_1,a_2,\ldots,a_n)) = f_x(a_1)f_x(a_2)\ldots f_x(a_n)\] where \(f_x(a) = \Pr(x_i = a)\) is just the marginal PDF of a single observation. Independence allows us to write the joint PDF as the product of the marginal PDFs for each observation, and identical distribution allows us to use the same marginal PDF for each observation.

The reason we call this “independent and identically distributed” is hopefully obvious, but what does it mean to say we have a “random sample” from a “population”? Well, one simple way of generating an IID sample is to:

- Define the population of interest, for example all Canadian residents.
- Use some purely random mechanism^{6} to choose a small subset of cases from this population.
  - The subset is called our *sample*.
  - “Purely random” here means some mechanism like a computer’s random number generator, which can then be used to dial random telephone numbers or select cases from a list.
- Collect data from every case in our sample, usually by contacting them and asking them questions (survey).

It will turn out that a moderately-sized random sample provides surprisingly accurate information on the underlying population.
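The procedure can be mimicked on a computer. In the sketch below the population is entirely made up (100,000 cases, 30% of which have the characteristic of interest), and the sample size and seed are arbitrary choices. Cases are drawn with replacement using `random.choices`, which matches the independence assumption exactly.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 1 = has the characteristic, 0 = does not.
population = [1] * 30_000 + [0] * 70_000
true_share = sum(population) / len(population)

# Simple random sample of 1,000 cases, drawn with replacement.
sample = rng.choices(population, k=1_000)
sample_share = sum(sample) / len(sample)

print("population share:", true_share)
print("sample share:    ", sample_share)
```

Even though the sample covers only 1% of this population, the sample share typically lands within a few percentage points of the population share.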

**Our roulette data is a random sample**

Each observation \(x_i\) in our roulette data is an independent random draw from the \(Bernoulli(p)\) distribution where \(p = \Pr(b \in Red)\).

Therefore, this data set satisfies the criteria for a simple random sample.

### 6.2.2 Time series data

Our employment data set is an example of **time series** data; it is made of
observations of each variable at regularly-spaced points in time.
Most macroeconomic data - GDP, population, inflation, interest rates - are
time series.

Time series have several features that are inconsistent with the random sampling assumption:

- They usually have clear *time trends*.
  - For example, Canada’s real GDP has been steadily growing for as long as we have data.
  - This violates “identically distributed” since 2010 GDP is drawn from a distribution with a higher expected value than the distribution for 1910 GDP.
- They usually have clear recurring cyclical patterns or *seasonality*.
  - For example, unemployment in Canada is usually lower from September through December.
  - This also violates “identically distributed” since February unemployment has a higher expected value than November unemployment.
- They usually exhibit what is called *autocorrelation*.
  - For example, shocks to the economy that affect GDP in one month or quarter (think of COVID or a financial crisis) are likely to have a similar (if smaller) effect on GDP in the next month or quarter.
  - This violates “independence” since nearby time periods are positively correlated.

We can calculate statistics for time series, and we already did in Chapter 5. However, time series data often requires more advanced techniques than we will learn in this class. ECON 433 addresses time series data.
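To get a feel for autocorrelation, the sketch below compares an IID series of shocks with a series in which each period carries over 80% of the previous period's value. This carry-over model and the value 0.8 are assumptions chosen for illustration, not a model of any actual economic series.

```python
import math
import random


def lag1_autocorr(series):
    """Sample correlation between each observation and the one before it."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)  # fixed seed so the run is repeatable
n = 5_000
carry_over = 0.8  # assumed share of last period's value that persists

# IID shocks: each period is unrelated to the one before.
shocks = [rng.gauss(0, 1) for _ in range(n)]

# Persistent series: this period = 0.8 * last period + a new shock.
persistent = [shocks[0]]
for t in range(1, n):
    persistent.append(carry_over * persistent[-1] + shocks[t])

print("IID series:       ", round(lag1_autocorr(shocks), 3))
print("persistent series:", round(lag1_autocorr(persistent), 3))
```

The IID shocks show essentially no correlation between neighbouring periods, while the persistent series shows a strong positive one.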

### 6.2.3 Other sampling models

Not all useful data sets come from a simple random sample or a time series. For example:

- A *stratified sample* is collected by dividing the population into *strata* (subgroups) based on some observable characteristics, and then randomly sampling a predetermined number of cases within each stratum.
  - Most professional surveys are constructed from stratified samples rather than random samples.
  - Stratified sampling is often combined with *oversampling* of some smaller strata that are of particular interest.
    - Canada’s Labour Force Survey (LFS) oversamples residents of Prince Edward Island (PEI) because a national random sample would not catch enough PEI residents to accurately measure PEI’s unemployment rate.
    - Government surveys typically oversample disadvantaged groups.
  - Stratified samples can usually be handled as if they were from a random sample, with some adjustments.
- A *cluster sample* is gathered by dividing the population into *clusters*, randomly selecting some of these clusters, and sampling cases within the cluster.
  - Educational data sets are often gathered this way: we pick a random sample of schools, and then collect data from each student within those schools.
  - Cluster samples can usually be handled as if they were from a random sample, with some adjustments.
- A *census* gathers data on every case in the population.
  - For example, we might have data on all 50 US states, or all 10 Canadian provinces, or all of the countries of the world.
  - Data from administrative sources such as tax records or school records often cover the entire population of interest as well.
  - Censuses are often treated as random samples from some hypothetical population of “possible” cases.
- A *convenience sample* is gathered by whatever method is convenient.
  - For example, we might gather a survey from people who walk by, or we might recruit our friends to participate in the survey.
  - Convenience samples are the worst-case scenario; in many cases they simply aren’t usable for accurate statistical analysis.
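One common adjustment for a stratified sample with oversampling is to weight each stratum's sample average by that stratum's share of the population. The sketch below illustrates the idea with made-up numbers: a large stratum of 9,000 people with a 5% unemployment rate, a small stratum of 1,000 with a 15% rate, and 500 cases sampled from each. None of these figures come from any real survey.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical population, split into two strata (1 = unemployed).
strata = {
    "large": [1] * 450 + [0] * 8_550,  # 9,000 people, 5% unemployed
    "small": [1] * 150 + [0] * 850,    # 1,000 people, 15% unemployed
}
pop_size = sum(len(cases) for cases in strata.values())
true_rate = sum(sum(cases) for cases in strata.values()) / pop_size

# Sample the same number from each stratum, which oversamples the small one.
n_per_stratum = 500
samples = {name: rng.sample(cases, n_per_stratum)
           for name, cases in strata.items()}

# Naive estimate: pool the sample and ignore the strata.
pooled = [v for s in samples.values() for v in s]
unweighted = sum(pooled) / len(pooled)

# Adjusted estimate: weight each stratum's rate by its population share.
weighted = sum((len(strata[name]) / pop_size) * (sum(s) / len(s))
               for name, s in samples.items())

print("true rate: ", true_rate)
print("unweighted:", unweighted)
print("weighted:  ", weighted)
```

The pooled average gives the small stratum half the weight even though it is only a tenth of the population, so it tends to overstate the rate; the weighted average corrects for this.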

Many data sets combine several of these elements. For example, Canada’s unemployment rate is calculated using data from the Labour Force Survey (LFS). The LFS is built from a stratified sample of the civilian non-institutionalized working-age population of Canada. There is also some clustering: the LFS will typically interview whole households, and will do some geographic clustering to save on travel costs. The LFS is gathered monthly, and the resulting unemployment rate is a time series.

### 6.2.4 Sample selection and representativeness

Random samples and their close relatives have the feature that they
are **representative** of the population from which they are
drawn. In a sense that will be made more clear over the next few chapters,
any sufficiently large random sample “looks just like” the population.

Unfortunately, a simple random sample is quite difficult to collect from humans. Even if we are able to randomly select cases, we often run into the following problems:

- *Nonresponse* occurs when a sampled individual does not provide the information requested by the survey.
  - *Survey-level* nonresponse occurs when the sampled individual does not answer any questions.
    - This can occur if the sampled individual cannot be found, refuses to answer, or cannot answer (for example, is incapacitated due to illness or disability).
    - Recent response rates to telephone surveys have been around 9%, implying over 90% of those contacted do not respond.
  - *Item-level* nonresponse occurs when the sampled individual does not answer a particular question.
    - This can occur if the respondent refuses to answer, or the question is not applicable or has no valid answer.
    - Item-level nonresponse is particularly common on sensitive questions including income.
- *Censoring* occurs when a particular quantity of interest cannot be observed for a particular case. Censored outcomes are extremely common in economics, for example:
  - In labour market analysis, we cannot observe the market wage for individuals who are not currently employed.
  - In supply/demand analysis, we only observe quantity supplied and quantity demanded at the current market price.

There are two basic solutions to these problems:

- *Imputation*: we assume or *impute* values for all missing quantities. For example, we might assume that the wage of each non-employed worker is equal to the average wage among employed workers with similar characteristics.
- *Redefinition*: we redefine the population so that our data can be correctly interpreted as a random sample from that population. For example, instead of having a random sample of *Canadians*, we can interpret our data as a random sample of *Canadians who would answer these questions if asked*.

This issue does not have a purely technical solution; it requires careful thought. If we are imputing values, do we believe that our imputation method is reasonable? If we are redefining the population, is the redefined population one we are interested in? There are no right or wrong answers to these questions, and sometimes our data are simply not good enough to answer our questions.

**Nonresponse bias in recent US elections**

Going into both the 2016 and 2020 US presidential elections, polls indicated that the Democratic candidate had a substantial lead over the Republican candidate:

- Hillary Clinton led Donald Trump by 4-6% nationally in 2016
- Joe Biden led Trump by 8% nationally in 2020.

The actual vote was much closer:

- Clinton won the popular vote (but lost the election) by 2%
- Biden won the popular vote (and won the election) by about 4.5%.

The generally accepted explanation among pollsters for the clear disparity between polls and voting is systematic nonresponse: for some reason, Trump voters are less likely to respond to polls. Since most people do not respond to standard telephone polls any more (response rates are typically around 9%), it does not take much difference in response rates to produce a large difference in responses. For example, suppose that:

- We call 1,000 voters
- These voters are equally split, with 500 supporting Biden and 500 supporting Trump.
- 10% of Biden voters respond (50 voters)
- 8% of Trump voters respond (40 voters)

The overall response rate is 9% (similar to what we usually see in surveys). Biden has the support of \(50/90 \approx 56\%\) of the respondents while Trump has the support of \(40/90 \approx 44\%\). Actual support is even, but the poll shows a gap of about 11 percentage points (\(10/90\)), entirely because of the small difference in response rates.
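The arithmetic in this example can be wrapped in a small function so that other response rates are easy to try. The function and argument names below are illustrative; the inputs are the hypothetical numbers from the example, not actual polling data.

```python
def poll_result(n_called, share_a, response_rate_a, response_rate_b):
    """Poll shares when supporters of A and B respond at different rates."""
    respondents_a = n_called * share_a * response_rate_a
    respondents_b = n_called * (1 - share_a) * response_rate_b
    total = respondents_a + respondents_b
    return {
        "response_rate": total / n_called,
        "poll_share_a": respondents_a / total,
        "poll_share_b": respondents_b / total,
    }


# 1,000 calls, support split evenly, 10% vs. 8% response rates.
result = poll_result(1_000, 0.5, 0.10, 0.08)
gap = result["poll_share_a"] - result["poll_share_b"]

print(result)
print("gap in poll shares:", round(gap, 3))
```

With equal response rates the function returns the true 50/50 split, so the entire gap in the poll comes from differential nonresponse.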

Polling organizations employ statisticians who are well aware of this problem, and they made various adjustments after 2016 to address it. For example, most now weight their analysis by education, since more educated people tend to have a higher response rate. Unfortunately, the 2020 results indicate that this adjustment was not enough. Some pollsters have argued that it makes more sense to just assume the nonresponse bias is 2-3% and adjust the numbers by that amount directly.

^{5} More precisely, either or both of \(\sigma_x\) and \(\sigma_y\) could be zero. In that case the covariance will also be zero, and the correlation will be undefined (zero divided by zero).

^{6} As a technical matter, the assumption of independence requires that we sample *with replacement*. This means we allow for the possibility that we sample the same case more than once. In practice this doesn’t matter as long as the sample is small relative to the population.