Chapter 8 Sampling and Inference
8.1 Getting Ready
This chapter explores both theoretical and practical issues related to sampling and statistical inference. This material is critical to understanding hypothesis testing, which is taken up in the next chapter, but it is also interesting in its own right. To follow along with the graphics and statistics used here, you need to load the county20.rda data set, as well as the DescTools library.
8.2 Statistics and Parameters
Almost always, social scientists are interested in making general statements about some social phenomenon. By general statements, I mean that we are interested in making statements that can be applied to a population of interest. For instance, I study voters and would like to make general statements that apply to all voters. Others might study funding patterns in local governments and would like to be able to generalize their findings to all local governments; or, someone who studies aversion to vaccines among a certain group would like to be able to make statements about all members of that group.
Unfortunately, gathering information on entire populations is usually not possible. Therefore, social scientists, as well as natural scientists, rely on samples drawn from the population. Hopefully, the statistics generated from these samples provide an accurate reflection of the underlying population values. We refer to the calculations we make with sample data as statistics, and we assume that those statistics are good representations of population parameters. The connection between sample statistics and population parameters is summarized in Table 8.1, using the already familiar statistics, the mean, variance, and standard deviation.
Table 8.1. Symbols and Formulas for Sample Statistics and Population Parameters.
Measure | Sample Statistic | Formula | Population Parameter | Formula |
---|---|---|---|---|
Mean | \(\bar{x}\) | \(\frac{\sum_{i=1}^n x_i}{n}\) | \(\mu\) | \(\frac{\sum_{i=1}^N x_i}{N}\) |
Variance | \(S^2\) | \(\frac{\sum_{i=1}^n({x_i}-\bar{x})^2}{n-1}\) | \(\sigma^2\) | \(\frac{\sum_{i=1}^N({x_i}-\mu)^2}{N}\) |
Standard Deviation | \(S\) | \(\sqrt{\frac{\sum_{i=1}^n({x_i}-\bar{x})^2}{n-1}}\) | \(\sigma\) | \(\sqrt{\frac{\sum_{i=1}^N({x_i}-\mu)^2}{N}}\) |
Whenever we try to generalize from a sample statistic to a population parameter, we are engaging in the process of statistical inference. We are inferring something about the population based on an estimate from a sample. For instance, we might be interested in knowing how often people access the internet to find news about politics, so we could use a sample of survey respondents and find the mean number of days per week people report looking for political news on the internet. So we are interested in the population parameter, \(\mu\) (pronounced mu, like a French cow saying “moo”), but must settle for the sample statistic (\(\bar{x}\)), which we hope is a good approximation of \(\mu\). Or, since we know from Chapter 7 that it is also important to know how much variation there is around any given mean, we calculate the sample variance (\(S^2\)) and standard deviation (\(S\)), expecting that they are good approximations of the population values, \(\sigma^2\) (sigma squared) and \(\sigma\) (sigma).
You might have noticed that the denominator for the sample variance and standard deviation is \(n-1\) while in the population formulas it is \(N\). The reason for this is somewhat complicated, but the short version is that dividing by \(n\) tends to underestimate the variance and standard deviation in samples, so dividing by \(n-1\) corrects for this. You should note, though, that in practical terms, dividing by \(n-1\) instead of \(n\) makes little difference in large samples. For small samples, however, it can make an important difference.22
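To see the difference in action, here is a quick sketch of my own (not from the text): R's built-in var() and sd() functions divide by \(n-1\), so a by-hand calculation that divides by \(n\) comes out a bit smaller.
#A small sample of four values
x <- c(2, 4, 6, 8)
#R's var() divides by n-1: 20/3 = 6.67
var(x)
#Dividing by n instead gives 20/4 = 5
sum((x - mean(x))^2)/length(x)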
Not all samples are equally useful, however, from the perspective of statistical inference. What we are looking for is a sample that is representative of the population, one that “looks like” the population. The key to a good sample, generally, is that it should be large and randomly drawn from the population. A pure random sample is hard to achieve but it is what we should strive for. The key benefit to this type of sample is that the principle of equal probability of selection is easier to meet than with other samples. What this means is that every unit in the population has an equal chance of being selected. If this is the case, then the sample should “look” a lot like the population.
8.3 Sampling Error
Even with a large, representative sample we can’t expect that a given sample statistic is going to have the same value as the population parameter. There will always be some error–this is the nature of sampling. Because we are using a sample and not the population, our sample estimates will not be the same as the population parameter. Let’s look at a decidedly non-social science illustration of this idea.
Imagine that you have 6000 colored ping pong balls (2000 yellow, 2000 red, and 2000 green), and you toss them all into a big bucket and give the bucket a good shake, so the yellow, red, and green balls are randomly dispersed in the bucket. Now, suppose you reach into the bucket and randomly pull out a sample of 600 ping pong balls. How many yellow, red, and green balls would you expect to find in your 600-ball sample? Would you expect to get exactly 200 yellow, 200 red, and 200 green balls, perfectly representing the color distribution in the full 6000-ball population? Odds are, it won’t work out that way. It is highly unlikely that you will end up with the exact same color distribution in the sample as in the population. However, if the balls are randomly selected, and there is no inherent bias (e.g., one color doesn’t weigh more than the others, or something like that) you should end up with close to one-third yellow, one-third red, and one-third green. This is the idea behind sampling error: sample statistics, by their very nature, will differ from population parameters; but given certain characteristics (large, random samples), they should be close approximations of the population parameters. Because of this, when you see reports of statistical findings, such as the results of a public opinion poll, they are often reported along with a caveat something like “a margin of error of plus or minus two percentage points.” This is exactly where this chapter is eventually headed, measuring and taking into account the amount of sampling error when making inferences about the population.
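Here is a minimal R simulation of the ping pong example (my own sketch; the exact counts depend on the seed you choose):
#Fill the bucket: 2000 balls of each color
set.seed(1234)
bucket <- rep(c("yellow", "red", "green"), each=2000)
#Draw a random sample of 600 balls and tabulate the colors
table(sample(bucket, 600))
The color counts will not be exactly 200 apiece, but each should land reasonably close to one-third of the sample.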
Let’s look at how this works in the context of some real-world political data. For the rest of this chapter, we will use county-level election returns from the 2020 presidential election to illustrate some principles of sampling and inference. When you begin working with a new data set, it is best to take a look at some of its attributes so you have a better sense of what you are working with. So, let’s take a look at the dimensions and variable names for this data set.
#Size of 'county20' data set
dim(county20)
[1] 3152 10
names(county20)
[1] "state" "county_fips" "county_name" "votes_gop20"
[5] "votes_dem20" "total_votes20" "other_vote20" "djtpct20"
[9] "jrbpct20" "d2pty20"
In the output above, you see there are 3152 rows (counties) and 10 columns (variables). The variables include state and county labels and identifiers; counts of Republican votes (votes_gop20), Democratic votes (votes_dem20), other party votes (other_vote20), and total votes cast (total_votes20); as well as Trump (djtpct20) and Biden (jrbpct20) percent of the total vote, and Biden’s percent of the two-party vote (d2pty20).23 We will focus our attention on Biden’s percent of the two-party vote, starting with a histogram and a few summary statistics below.
#Histogram of county-level Biden vote
hist(county20$d2pty20, xlab="Biden % of Two-Party Vote",
ylab="Number of Counties", main="")
#Add vertical lines at the mean and median.
abline(v=mean(county20$d2pty20, na.rm = T),lwd=2)
abline(v=median(county20$d2pty20, na.rm = T),lwd=2, lty=2)
#"na.rm = T" is used to handle missing data,"lwd=2" tells R to
#use a thick line, and "lty=2" results in a dashed line
#Add a legend
legend("topright", legend=c("Mean","Median"), lty=1:2)
#Get summary stats and skewness
summary(county20$d2pty20)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.114 21.327 30.640 34.044 43.396 94.467
Skew(county20$d2pty20, na.rm = T)
[1] 0.8091358
As you can see, this variable is somewhat skewed to the right, with the mean (34.04) a bit larger than the median (30.64), producing a skewness value of .81. Just to be clear, the histogram and statistics reported above are based on the population of counties, not a sample of counties. So, \(\mu=34.04\).24
To get a better idea of what we mean by sampling error, let’s take a sample of 500 counties, randomly drawn from the population of counties, and calculate the mean Democratic percent of the two-party vote in those counties. The R code below shows how you can do this. One thing I want to draw your attention to here is the set.seed() function. This tells R where to start the random selection process. It is not absolutely necessary to use this, but doing so provides reproducible results. This is important in this instance because, without it, the code would produce slightly different results every time it is run (and I would have to rewrite the text reporting the results over and over).
#Tell R where to start the process of making random selections
set.seed(250)
#draw a sample of 500 counties, using "d2pty20";
#store in "d2pty_500"
d2pty_500 <- sample(county20$d2pty20, 500)
#Get stats on Democratic % of two party vote from sample
summary(d2pty_500)
Min. 1st Qu. Median Mean 3rd Qu. Max.
6.148 21.139 30.794 33.510 42.340 94.467
From this sample, we get \(\bar{x}=33.51\). As you can see, the sample mean is not equal to, but is pretty close to the population value (34.04). One of the keys to understanding sampling is that if we take another sample of 500 counties and calculate another mean from that sample, not only will it not equal the population value, but it will also be different from the first sample mean, as shown below.
Let’s see:
#Different "set.seed" here because I want to show that the
#results from another sample are different.
#Draw a different sample of 500 counties
set.seed(200)
d2pty_500b <- sample(county20$d2pty20, 500)
summary(d2pty_500b)
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.031 20.460 30.931 33.683 43.751 85.791
We should not be surprised that we get a different result here (\(\bar{x}=33.68\)), since we are dealing with a different sample. That’s the nature of sampling. But since both samples are taken from the same population, we also should expect that the second sample mean is fairly close in value to the first sample mean, which it is, and that they are both close in value to the population mean, which they are.
8.4 Sampling Distributions
We could go on like this and calculate a lot of different sample means from a lot of different samples, and they would all take on different values. It is also likely that none of them would be exactly equal to the population value unless rounded off after a couple of places to the right of the decimal point. A distribution of sample means (or any other sample statistic) such as this is called a sampling distribution. Sampling distributions are theoretical distributions representing what would happen if we had population data and took repeated, independent samples from the population. Figure 8.2 provides an illustration of a theoretical sampling distribution, where Biden’s percent of the two-party vote in the population of counties is 34.04.
There are several things to note here: the mean of the distribution of sample means (sampling distribution) is equal to the population value (34.04); most of the sample means are relatively close to the population value, between about 32 and 36; and there are just a few means at the high and low ends of the horizontal axis.
This theoretical distribution reflects a number of important characteristics we can expect from a sampling distribution based on repeated large, random samples: it will be nearly normally distributed, its mean (the mean of all sample means) equals the population mean, \(\mu\), and it has a standard deviation equal to \(\frac{\sigma}{\sqrt{n}}\). This idea, known as the Central Limit Theorem, holds as long as the sample is relatively large (n > 30). In other words, the Central Limit Theorem tells us that statistics (means, proportions, standard deviations, etc.) from large, randomly drawn samples are good approximations of underlying population parameters.
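Stated compactly in notation (this is just a restatement of the points above, using \(n\) for the sample size):
\[\bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \text{ (approximately)}\]
That is, across repeated samples, \(\bar{x}\) is distributed roughly normally, centered on \(\mu\), with standard deviation \(\frac{\sigma}{\sqrt{n}}\).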
This makes sense, right? If we take multiple large, random samples from a population, some of our sample statistics should be a bit higher than the population parameter, and some will be a bit lower. And, of course, a few will be a lot higher and a few a lot lower than the population parameter, but most of them will be clustered near the mean (this is what gives the distribution its bell shape). By its very nature, sampling produces statistics that are different from the population values, but the sample statistics should be relatively close to the population values.
Here’s an important thing to keep in mind: The shape of the sampling distribution generally does not depend on the distribution of the empirical variable. In other words, the variable being measured does not have to be normally distributed in order for the sampling distribution to be normally distributed. This is good, since very few empirical variables follow a normal distribution! Note, however, that in cases where there are extreme outliers, this rule may not hold.
8.4.1 Simulating the Sampling Distribution
In reality, we don’t actually take repeated samples from the population–think about it, if we had the population data, we would not need a sample, let alone a sampling distribution. Instead, sampling distributions are theoretical distributions with known characteristics. Still, we can take multiple samples from the population of counties to illustrate that the pattern shown in Figure 8.2 is a realistic expectation. We know from earlier that in the population, the mean county-level share of the two-party vote for Biden was 34.04% (\(\mu=34.04\)), and the distribution is somewhat skewed to the right (skewness=.81), but with no extreme outliers. We can simulate a sampling distribution by taking repeated samples from the population of counties and then observing the shape of the distribution of means from those samples. The resulting distribution should start to look like the distribution presented in Figure 8.2, especially as the number of samples increases.
We start by collecting fifty different samples, each of which includes 50 counties, and then calculate the mean outcome from each sample. Note that this is a relatively small number of samples, but we should still see the distribution trending toward normal. (Don’t worry about understanding exactly what the code below is doing. If it makes you anxious, just copy it onto your script file and run it so you can follow along with the example.)
set.seed(251)
#create an object (sample_means50) with space to store fifty sample means
sample_means50 <- rep(NA, 50)
#Run through the data 50 times, getting a 50-county sample each time
#Store the fifty sample means in "sample_means50"
for(i in 1:50){
  samp <- sample(county20$d2pty20, 50)
  sample_means50[i] <- mean(samp, na.rm=T)
}
Let’s look at the fifty separate sample means:
#Show 50 different sample means
sample_means50
[1] 31.24475 34.72631 34.84539 37.19550 34.83349 33.52246 35.85605 35.42220
[9] 33.60402 32.97881 31.80286 30.33704 33.65232 33.28219 35.40609 35.97478
[17] 37.88739 34.55101 32.68982 30.96713 32.90207 37.19780 30.35538 33.33603
[25] 33.68002 29.96251 33.70846 35.77540 34.63077 35.51688 35.98887 34.35845
[33] 34.97154 38.55704 33.17959 36.86423 31.40136 30.17257 34.18038 35.64399
[41] 38.93687 30.61432 36.25869 32.81192 37.76805 33.14322 40.73194 34.50676
[49] 30.94900 35.40402
Here’s how you read this output. Each of the values represents a single mean drawn from a sample of 50 counties. The first sample drawn had a mean of 31.24, the second sample 34.73, and so on, with the 50th sample having a mean of 35.4. Most of the sample means are fairly close to the population value (34.04), and a few are more distant. Looking at the summary statistics (below), we see that, on average, these fifty sample means balance out to an overall mean of 34.29, which is very close to the population value, and the distribution has relatively little skewness (.21).
summary(sample_means50)
Min. 1st Qu. Median Mean 3rd Qu. Max.
29.96 32.92 34.43 34.29 35.74 40.73
Skew(sample_means50)
[1] 0.2099118
Figure 8.3 uses a density plot (solid line) of this sampling distribution, displayed alongside a normal distribution (dashed line), to get a sense of how closely the distribution fits the contours of a normal curve. As you can see, with just 50 relatively small samples, the sampling distribution is beginning to take on the shape of a normal distribution.
So, let’s see what happens when we create another sampling distribution but increase the number of samples to 500. In theory, this sampling distribution should resemble a normal distribution more closely. In the summary statistics, we see that the mean of this sampling distribution (33.98) is closer to the population value (34.04) than the mean in the previous example, and there is virtually no skewness. Further, if you examine the density plot shown in Figure 8.4, you will see that the sampling distribution of 500 samples of 50 counties follows the contours of a normal distribution more closely, as should be the case. If we increased the number of samples to 1000 or 2000, we would expect the sampling distributions to grow even more similar to a normal distribution.
#Gather sample means from 500 samples of 50 counties
set.seed(251)
#Create an object with space to store 500 sample means
sample_means500 <- rep(NA, 500)
for(i in 1:500){
  samp <- sample(county20$d2pty20, 50)
  sample_means500[i] <- mean(samp, na.rm=T)
}
summary(sample_means500)
Min. 1st Qu. Median Mean 3rd Qu. Max.
24.54 32.36 34.01 33.98 35.49 40.73
Skew(sample_means500)
[1] 0.02819022
8.5 Confidence Intervals
In both cases, the simulations support one of the central tenets of the Central Limit Theorem: with large, random samples, the sampling distribution will be nearly normal, and the mean of the sampling distribution equals \(\mu\). This is important because it is also the case that most sample means are fairly close to the mean of the sampling distribution. In fact, because sampling distributions follow the normal distribution, we know that approximately 68% of all sample means will be within one standard deviation of the population value. We refer to the standard deviation of a sampling distribution as the standard error. In the examples shown above, where the distributions represent collections of different sample means, this is referred to as the standard error of the mean. Remember this: the standard error is a measure of the standard deviation of the sampling distribution.
The formula for a standard error of the mean is:
\[\sigma_{\bar{x}}=\frac{\sigma}{\sqrt{n}}\]
Where \(\sigma\) is the standard deviation of the variable in the population, and \(n\) is the number of observations in the sample.
Of course, this assumes that we know the population value, \(\sigma\), which we do not. Fortunately, because of the Central Limit Theorem, we do know of a good estimate of the population standard deviation, the sample standard deviation. So, we can substitute S (the sample standard deviation) for \(\sigma\):
\[S_{\bar{x}}=\frac{S}{\sqrt{n}}\]
This formula is saying that the standard error of the mean is equal to the sample standard deviation divided by the square root of the sample size. So let’s go ahead and calculate the standard error for a sample of just 100 counties (this makes the whole \(\sqrt{n}\) business a lot easier). The observations for this sample are stored in a new object, d2pty100. First, just a few descriptive statistics, presented below. Of particular note here is that the mean from this sample (32.71) is again pretty close to \(\mu\), 34.04 (isn’t it nice how this works out?).
set.seed(251)
#draw a sample of d2pty20 from 100 counties
d2pty100 <- sample(county20$d2pty20, 100)
mean(d2pty100)
[1] 32.7098
sd(d2pty100)
[1] 17.11448
We can use the sample standard deviation (17.11) to calculate the standard error of the sampling distribution (based on the characteristics of this single sample of 100 counties):
se100=17.11/10
se100
[1] 1.711
Of course, we can also get this more easily:
#This function is in the "DescTools" package.
MeanSE(d2pty100)
[1] 1.711448
The mean from this sample is 32.71 and the standard error is 1.711. For right now, treating this as our only sample, 32.71 is our best guess for the population value. We can refer to this as the point estimate. But we know that this is probably not the population value, because we would get different sample means if we took additional samples, and they can’t all be equal to the population value; we also know, though, that most sample means, including ours, are going to be fairly close to \(\mu\).
Now, suppose we want to use this sample information to create a range of values that we are pretty confident includes the population parameter. We know that the sampling distribution is normally distributed and that the standard error is 1.711, so we can be confident that 68% of all sample means are within one standard error of the population value. Using our sample mean as the estimate of the population value (it’s the best guess we have), we can calculate a 68% confidence interval:
\[c.i._{.68}=\bar{x}\pm z_{.68}*S_{\bar{x}}\]
Here we are saying that a 68% confidence interval ranges from \(\bar{x}\) plus and minus the value of z that gives us 68% of the area under the curve around the mean, times the standard error of the mean. In this case, the multiplication is easy because the critical value of z (the z-score for an area above and below the mean of about .68) is 1, so:
\[c.i._{.68}=32.71\pm 1*1.711\]
#Estimate lower limit of confidence interval
LL.68=32.71-1.711
LL.68
[1] 30.999
#Estimate upper limit of confidence interval
UL.68=32.71+1.711
UL.68
[1] 34.421
\[30.999 \le \mu \le 34.421\] The lower limit of the confidence interval (LL) is about 31 and the upper limit (UL) is 34.42, a narrow range of just 3.42 that does happen to include the population value, 34.04.
You can also use the MeanCI function to get a confidence interval around a sample mean:
#Note that you need to specify the level of confidence
MeanCI(d2pty100, conf.level = .68, na.rm=T)
mean lwr.ci upr.ci
32.70980 30.99925 34.42035
As you can see, the results are very close to those we calculated, with the differences no doubt due to rounding error.
So, rather than just assuming that \(\mu=32.71\) based on just one sample mean, we can incorporate a bit of uncertainty that acknowledges the existence of sampling error. We can be 68% confident that this confidence interval includes \(\mu\). In this case, since we know that the population value is 34.04, we can see that the confidence interval does include the population value. But, usually, we do not know the population values (that’s why we use samples!) and we have to trust that the confidence interval includes \(\mu\). How much can we trust that this is the case? That depends on the confidence level specified when setting up the confidence interval. In this case, we can say that we are 68% certain that the population value is within this confidence interval.
Is \(\mu\) always going to be within the confidence interval limits? No. In fact, if we were to take another sample of 100 counties and calculate a 68% confidence interval for that sample, it might not include the population value. Expanding on this point, if we were to draw 50 samples of 100 counties and construct 68% confidence intervals for each of the 50 sample means, somewhere around 68% of them (34) would include \(\mu\), and up to 32% (16) would not, just due to chance.
I demonstrate this idea below, where I plot 50 different 68% confidence intervals around estimates of the Democratic percentage of the two-party vote, based on fifty samples of 100 counties. The expectation is that around 34 of the intervals will include the value of \(\mu\) and up to 16 intervals might not include \(\mu\). The circles represent the point estimates, and the horizontal lines represent the width of the confidence intervals:
Here you can see that most of the confidence intervals include \(\mu\) (the horizontal line at 34.04), but there are 12 confidence intervals that do not overlap with \(\mu\), identified with the solid dots on the horizontal line at the value of \(\mu\). This is consistent with our expectation that there is at least a .68 probability that any given confidence interval constructed using a z-score of 1 will overlap with \(\mu\).
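If you want to verify this pattern yourself, here is a minimal sketch of the simulation idea (my own object names, and not necessarily the exact code behind the figure; with a different seed, the count of misses will vary):
#Draw 50 samples of 100 counties; record whether each 68%
#confidence interval captures the population mean (34.04)
set.seed(101)
captures <- rep(NA, 50)
for(i in 1:50){
  samp <- sample(county20$d2pty20, 100)
  ci <- MeanCI(samp, conf.level = .68, na.rm=T)
  captures[i] <- ci["lwr.ci"] <= 34.04 & 34.04 <= ci["upr.ci"]
}
#Proportion of intervals that include the population mean
mean(captures)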
In reality, while a 68% confidence interval works well for demonstration purposes, we usually demand a bit more certainty. The standard practice in the social sciences is to use a 95% confidence interval. All we have to do to find the lower and upper limits of a 95% confidence interval is find the critical value of z that gives us 47.5% of the area under the curve between the mean and the z-score, leaving .025 of the area under the curve in each tail of the distribution. You can use a standard z-distribution table to find the critical value of z, or you can ask R to do it for you, using the qnorm function:
#qnorm gives the z-score for a specified area on the distribution tail.
#Specifying "lower.tail=F" instructs R to find the upper tail area.
qnorm(.025, lower.tail = F)
[1] 1.959964
The critical value for \(z_{.95} = 1.96\). So, now we can substitute this into the equation we used earlier for the 68% confidence interval to obtain the 95% confidence interval:
\[c.i._{.95}=32.71\pm 1.96*1.711\] \[c.i._{.95}=32.71\pm 3.35\]
#Estimate lower limit of confidence interval
LL.95=32.71-3.35
LL.95
[1] 29.36
#Estimate upper limit of confidence interval
UL.95=32.71+3.35
UL.95
[1] 36.06
\[29.36 \le \mu \le 36.06\]
Now we can say that we are 95% confident that the population value for the mean Democratic share of the two-party vote across counties is between the lower limit of 29.36 and the upper limit of 36.06. Technically, what we should say is that 95% of all confidence intervals based on z=1.96 include the value for \(\mu\), so the probability that \(\mu\) is in this confidence interval is .95.
Note that this interval is wider (almost 6.7 points) than the 68% interval (about 3.4 points), because we are demanding a higher level of confidence. So, suppose we want to narrow the width of the interval but we do not want to sacrifice the level of confidence. What can we do about this? The answer lies in the formula for the standard error:
\[S_{\bar{x}}=\frac{S}{\sqrt{n}}\] We only have one thing in this formula that we can manipulate: the sample size. We can’t really change the standard deviation, since it is a function of the population standard deviation. If we took another sample, we would get a very similar standard deviation, something around 17.11. However, we might be able to affect the sample size, and as the sample size increases, the standard error of the mean decreases.
Let’s look at this for a new sample of 500 counties.
set.seed(251)
#draw a sample of d2pty20 from 500 counties
d2pty500 <- sample(county20$d2pty20, 500)
mean(d2pty500)
[1] 33.24397
sd(d2pty500)
[1] 15.8867
MeanSE(d2pty500)
[1] 0.7104747
Here, you can see that the mean (33.24) and standard deviation (15.89) are fairly close in value to those obtained from the smaller sample of 100 counties (32.71 and 17.11), but the standard error of the mean is much smaller (.71 compared to 1.71). This difference in standard error, produced by the larger sample size, results in a much narrower confidence interval, even though the level of confidence (95%) is the same:
\[c.i._{.95}=33.24\pm 1.96*.71\] \[c.i._{.95}=33.24\pm 1.39\]
#Estimate lower limit of confidence interval (n=500)
LL.95=33.24-1.39
LL.95
[1] 31.85
#Estimate upper limit of confidence interval (n=500)
UL.95=33.24+1.39
UL.95
[1] 34.63
The width of the confidence interval is only 2.78 points, compared to 6.7 points for the 100-county sample. Let’s take a closer look at how the width of the confidence interval responds to sample size, using data from the current example:
As you can see, there are diminishing returns in error reduction with increases in sample size. Moving from small samples of 100 or so to larger samples of 500 or so results in a steep drop in the width of the confidence interval; moving from 500 to 1000 results in a smaller reduction in width; and moving from 1000 to 2000 results in an even smaller reduction in the width of the confidence interval. This pattern has important implications for real-world research. Depending on how a researcher is collecting their data, they may be limited by the very real costs associated with increasing the size of a sample. If conducting a public opinion poll, or recruiting experimental participants, for instance, each additional respondent costs money, and spending money on sample size inevitably means taking money away from some other part of the research enterprise. The take-away point from Figure 8.6 is that money spent on increasing the sample size from a very small sample (say, 100) to somewhere around 500 is money well spent. If resources are not a constraint, increasing the sample size beyond that point does have some payoff in terms of error reduction, but the returns on money spent diminish substantially for increases in sample size beyond about 1000.25
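The pattern in Figure 8.6 is easy to reproduce. Here is a quick sketch, assuming the sample standard deviation stays at roughly 17.11 (as it did in our samples):
#Width of a 95% confidence interval (2 x 1.96 x SE) at several sample sizes
n <- c(100, 250, 500, 1000, 2000)
width <- 2*1.96*(17.11/sqrt(n))
round(data.frame(n, width), 2)
Doubling the sample size from 1000 to 2000 shaves far less off the interval width than moving from 100 to 500 does.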
8.6 Proportions
Everything we have just seen regarding the distribution of sample means also applies to the distribution of sample proportions. It should, since a proportion is just the mean of a dichotomous variable scored 0 and 1. For example, with the same data used above, we can focus on a dichotomous variable that indicates whether Biden won (1) or lost (0) in each county. The mean of this variable across all counties is the proportion of counties won by Biden.
#Create dichotomous indicator for counties won by Biden
demwin <- as.numeric(county20$d2pty20 > 50)
table(demwin)
demwin
0 1
2595 557
mean(demwin, na.rm=T)
[1] 0.1767132
Biden won 557 counties and lost 2595, for a winning proportion of .1767. This is the population value (\(P\)).
Again, we can take samples from this population, and while none of the proportions calculated from them may match the value of \(P\) exactly, most of them should be fairly close in value, and the mean of the sample proportions should equal the population value over infinite sampling.
Let’s check this out for 500 samples of 50 counties each, stored in a new object, sample_prop500.
set.seed(251)
#Create an object with space to store 500 sample proportions
sample_prop500 <- rep(NA, 500)
#Run through the data 500 times, getting a 50-county sample of 'demwin'
#each time. Store the mean of each sample in 'sample_prop500'
for(i in 1:500){
  samp <- sample(demwin, 50)
  sample_prop500[i] <- mean(samp, na.rm=T)
}
summary(sample_prop500)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0200 0.1400 0.1800 0.1742 0.2000 0.3400
Skew(sample_prop500)
[1] 0.1670031
Here we see that the mean of the sampling distribution (.174) is, as expected, very close to the population proportion (.1767), and the distribution has very little skew. The density plots in Figure 8.7 show that the shape of the sampling distribution mimics the shape for the normal distribution fairly closely. This is all very similar to what we saw with the earlier analysis using the mean Democratic share of the two-party vote.
Everything we learned about confidence intervals around the mean also applies to sample estimates of the proportion. A 95% confidence interval for a sample proportion is:
\[c.i._{.95}=p\pm z_{.95}*S_{p}\]
Where the standard error of the proportion is calculated the same as the standard error of the mean–the standard deviation divided by the square root of the sample size–except in this case the standard deviation is calculated differently:
\[S_{p}=\frac{\sqrt{p*(1-p)}}{\sqrt{n}}=\sqrt{\frac{p*(1-p)}{n}}\]
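To see where this formula comes from, note that the standard deviation of a variable scored 0 and 1 works out to \(\sqrt{p*(1-p)}\). A quick check with the population indicator created above (my own sketch; the tiny discrepancy reflects R’s \(n-1\) divisor):
#Standard deviation of a 0/1 variable is sqrt(p*(1-p)); here p = .1767
sqrt(.1767*(1-.1767))
sd(demwin)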
Let’s turn our attention to estimating a confidence interval for the proportion of counties won by Biden from a single sample with 500 observations (demwin500).
set.seed(251)
#Sample 500 counties for demwin
demwin500 <- sample(demwin, 500)
mean(demwin500)
[1] 0.154
For the sample of 500 counties taken above, the mean is .154, which is quite a bit lower than the known population value of .1767. To calculate a 95% confidence interval, we need to estimate the standard error:
seprop500=sqrt((.154*(1-.154)/500))
seprop500
[1] 0.01614212
We can now plug the standard error of the proportion into the confidence interval:
\[c.i._{.95}=.154\pm 1.96*.0161\] \[c.i._{.95}=.154\pm .0315\]
#Estimate lower limit of confidence interval
LL.95=.154-.0315
LL.95
[1] 0.1225
#Estimate upper limit of confidence interval
UL.95=.154+.0315
UL.95
[1] 0.1855
\[.123\le P \le .186\]
The confidence interval is .063 points wide, meaning that we are 95% confident that the population value for this variable is between .123 and .186. If you want to put this in terms of percentages, we are 95% certain that Biden won between 12.3% and 18.6% of counties. We know that Biden actually won 17.7% of all counties, so, as expected, this confidence interval from our sample of 500 counties includes the population value.
The “\(\pm\)” part of the confidence interval might sound familiar to you from media reports of polling results. This figure is sometimes referred to as the margin of error. When you hear the results of a public opinion poll reported on television, the news reader usually adds language like “plus or minus 3.3 percentage points”; they are referring to the confidence interval (usually a 95% confidence interval), except they tend to report percentage points rather than proportions.
So, for instance, in the example shown below, the Fox News poll taken from September 12-September 15, 2021, with a sample of 1002 respondents, reports President Biden’s approval rating at 50%, with a margin of error of \(\pm 3.0\).
If you do the math (go ahead, give it a shot!), based on \({p}=.50\) and \(n=1002\), this \(\pm3.0\) corresponds with the upper and lower limits of a 95% confidence interval ranging from .47 to .53. So we can say with a 95% level of confidence that at the time of the poll, President Biden’s approval rating was between .47 (47%) and .53 (53%), according to this Fox News Poll.
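Here is one way to do that math (a quick sketch using this chapter’s formulas):
#95% confidence interval for the Fox News poll: p = .50, n = 1002
p <- .50
se <- sqrt(p*(1-p)/1002)
#Lower and upper limits work out to about .47 and .53
c(p - 1.96*se, p + 1.96*se)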
8.7 Next Steps
Now that you have learned about some of the important principles of statistical inference (sampling, sampling error, and confidence intervals), you have all the tools you need to learn about hypothesis testing, which is taken up in the next chapter. In fact, it is easy to illustrate how you could use one important tool–confidence intervals–to test certain types of hypotheses. For instance, using the results of the Fox News poll reported above, if someone stated that they thought Biden’s approval rating was no higher than 45% at the time of the poll, you could tell them that you were 95% certain they were wrong, since you have a 95% confidence interval for Biden’s approval rating that ranges from 47% to 53%, which means that the probability of it being 45% or lower is very small. Hypothesis testing gets a bit more complicated than this, but this example captures the spirit of it very well: we use sample data to test ideas about the values of population parameters. On to hypothesis testing!
8.8 Exercises
8.8.1 Concepts and Calculations
A group of students on a college campus are interested in how much students spend on books and supplies in a typical semester. They interview a random sample of 300 students and find that the average semester expenditure is $350 and the standard deviation is $78.
Are the results reported above from an empirical distribution or a sampling distribution? Explain your answer.
Calculate the standard error of the mean.
Construct and interpret a 95% confidence interval around the mean amount of money students spend on books and supplies per semester.
In the same survey used in question 1, students were asked if they were satisfied or dissatisfied with the university’s response to the COVID-19 pandemic. Among the 300 students, 55% reported being satisfied. The administration hailed this finding as evidence that a majority of students support the course they’ve taken in reaction to the pandemic. What do you think of this claim? Of course, as a bright college student in the midst of learning about political data analysis, you know that 55% is just a point estimate and you really need to construct a 95% confidence interval around this sample estimate before concluding that more than half the students approve of the administration’s actions. So, let’s get to it. (Hint: this is a “proportion” problem)
Calculate the standard error of the proportion. What does this represent?
Construct and interpret a 95% confidence interval around the reported proportion of students who are satisfied with the administration’s actions.
Is the administration right in their claim that a majority of students support their actions related to the pandemic?
One of the data examples used in Chapter Four combined five survey questions on LGBTQ rights into a single index ranging from 0 to 6. The resulting index has a mean of 3.4, a standard deviation of 1.56, and a sample size of 7,816.
- Calculate the standard error of the mean and a 95% confidence interval for this variable.
- What is the standard error of the mean if you assume that the sample size is 1,000?
- What is the standard error of the mean if you assume that the sample size is 300?
- Discuss how the magnitude of change in sample size is related to changes in the standard error in this example.
8.8.2 R Problems
For these problems, you should load the county20large data set to analyze the distribution of internet access across U.S. counties, using county20large$internet. This variable measures the percent of households with broadband access from 2015 to 2019.
1. Describe the distribution of county20large$internet, using a histogram and the mean, median, and skewness statistics. Note that since these 3142 counties represent the population of counties, the mean of this variable is the population value (\(\mu\)).
2. Use the code provided below to create a new object named web250 that represents a sample of internet access in 250 counties, drawn from county20large$internet.
set.seed(251)
web250 <- sample(county20large$internet, 250)
3. Once you’ve generated your sample, describe the distribution of web250 using a histogram and the sample mean. How does this distribution compare to the population values you produced in Question #1?
4. Create a new object that represents a sample of internet access rates in 750 counties, drawn from county20large$internet. Name this object web750. Describe the distribution of web750 using a histogram and the sample mean. How does this distribution compare to the population values you produced in Question #1? Does it resemble the population more closely than the distribution of web250 does? If so, in what ways?
5. Use MeanSE and MeanCI to produce the standard errors and 95% confidence intervals for both web250 and web750. Use words to interpret the substantive meaning of both confidence intervals. How are the confidence intervals and standard errors in these two samples different from each other? How do you explain these differences?
If this has piqued your interest, the following links include some interesting discussions of dividing by \(n-1\), some formal and hard to follow if you have my level of math expertise, and some a bit more accessible: https://stats.stackexchange.com/questions/3931/intuitive-explanation-for-dividing-by-n-1-when-calculating-standard-deviation , https://willemsleegers.com/content/posts/4-why-divide-by-n-1/why-divide-by-n-1.html , and https://youtu.be/9Z72nf6N938 ↩︎
Presidential votes cast in Alaska are not tallied at the county level (officially, Alaska does not use counties), so this data set includes the forty electoral districts that Alaska uses.↩︎
You might look at this distribution and wonder how the average Biden share of the two-party vote could be roughly 34% when he captured 52% of the national two party vote. The answer, of course, is that Biden won where more people live while Trump tended to win in counties with many fewer voters. In fact, just over 94 million votes were cast in counties won by Biden, compared to just over 64 million votes in counties won by Trump.↩︎
Of course there are other reasons to favor large samples, such as providing larger sub-samples for relatively small groups within the population, or in anticipation of missing information on some of the things being measured.↩︎