Chapter 4 How To Get a Good Sample

4.1 Collecting a Sample of Teachers or a Sample of Voters

We’ll get together in groups of 2 or 3 and brainstorm about the following:

Question: In the state of Kentucky, there are about $N=40000$ public school teachers at the K-12 level. Suppose it is necessary to collect a sample of $n=400$ of those teachers. Explain how you would go about collecting such a sample. Discuss what sort of real-life logistical problems you might encounter in attempting to collect data from this sample.

Question: You are trying to estimate what percent of voters nationally support President Donald Trump. How would you go about collecting such a sample? Again, discuss what sort of real-life logistical problems you might encounter in attempting to collect data from this sample.

It may seem amazing to you, but a relatively small sample, if collected using the methods of probability sampling, can accurately estimate these parameters, whereas a much larger sample collected in a sloppy fashion can be biased and therefore completely worthless.

An excellent article in Scientific American explains how a relatively small sample can be highly accurate, while a larger sample can be worthless.

http://www.scientificamerican.com/article/howcan-a-poll-of-only-100/

4.2 Non-Probability Sampling

If a sample is collected in a non-random or nonscientific manner, then we do not know the probability of a member of the population being selected. The data will be biased and may overrepresent some segments of the population and underrepresent others.

A famous historical example was the Literary Digest poll in 1936. This magazine, popular at the time, was trying to determine if President Franklin D. Roosevelt (Democrat) would win re-election, or if he would be defeated by Republican Alf Landon. They included a stamped addressed postcard in each issue of the magazine and had readers mail them in. Several hundred thousand postcards were returned. In addition, many thousands of people were contacted by telephone.

When the results were compiled, about 60% of readers supported Alf Landon. Since you’ve probably never heard of Alf Landon (who was the governor of Kansas), you know that he did not win the election in 1936. Why did this sample give such an inaccurate result?

It turned out that the readership of Literary Digest was mostly fairly wealthy people, especially during the Great Depression. The survey was biased and overrepresented wealthy Americans, who were more likely to be Republicans, and underrepresented the working class and unemployed, who were more likely to be Democrats.

4.3 Selection Bias

It is also important to consider who responds and who fails to respond (non-response) in a survey. This was an issue in the Literary Digest survey, and online polls provide a 21st century example.

It is very common for websites to have surveys that ask visitors some sort of question. For example, there is usually a poll asking some sort of sports-related question (e.g. who will win the Super Bowl?) at http://www.espn.go.com. These surveys are fun and are meant strictly for entertainment; no serious conclusions should be drawn from them.

4.4 Sampling Mistakes

Sample volunteers (voluntary response bias)
Sample “Conveniently” (convenience sampling)
Use a poor sampling frame (maybe one that misses parts of the population)
Undercoverage (portions of the population are undersampled or not sampled at all)
Nonresponse bias (I won’t take your call or mail back your survey)
Response bias (leading questions; questions about illegal or embarrassing or private activities one doesn’t want to admit)

4.5 Probability Sampling

How do we avoid selection bias and end up with a good scientific sample? By using a method based on probability sampling, where each member of the population has a known chance of being included in the sample. There are four main methods of probability sampling, and you likely described some of them when answering the question about how you would take a sample from the population of teachers.

Simple Random Sampling
Systematic Sampling
Stratified Sampling
Cluster Sampling

4.6 Simple Random Sampling

To conduct a simple random sample, we need to have a sampling frame. A sampling frame is a list of all of the elements of a population; for instance, the names of all 40000 public school teachers.

Each individual is assigned an integer between 1 and $N$. A total of $n$ distinct random integers between 1 and $N$ are drawn (that is, drawn without replacement, so nobody is chosen twice), and those individuals make up the simple random sample. If one of the randomly drawn numbers was #33769, and that was Mr. B. Stein, a high school economics teacher, then he would be in the sample.
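As a minimal sketch of this procedure, the Python snippet below draws 400 distinct ID numbers from 1 to 40000. The function name `simple_random_sample` and the seed are our own choices for illustration; in practice the ID numbers would be matched back to names in the sampling frame.

```python
import random

def simple_random_sample(N, n, seed=None):
    """Draw n distinct ID numbers from 1..N; every set of n IDs is equally likely."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so no ID can appear twice.
    return rng.sample(range(1, N + 1), n)

# Teachers example: N = 40000 teachers, n = 400 in the sample.
sample_ids = simple_random_sample(N=40000, n=400, seed=1)
print(sorted(sample_ids)[:5])  # the five smallest ID numbers that were drawn
```

Fixing the seed only makes the draw reproducible; leaving it as `None` gives a fresh sample each time.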

If we drew many such simple random samples, we would get somewhat different results; this is called sampling variability.
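To see sampling variability in action, here is a small simulation sketch. The population is entirely made up: we assume 55% of the 40000 teachers would answer "yes" to some question, and we draw five separate simple random samples of 400.

```python
import random

rng = random.Random(2)

# Hypothetical population: 1 = "yes", 0 = "no". The 55% figure is invented.
population = [1] * 22000 + [0] * 18000

def sample_proportion(pop, n, rng):
    """Proportion of 'yes' answers in one simple random sample of size n."""
    return sum(rng.sample(pop, n)) / n

# Five independent samples give five slightly different estimates.
estimates = [sample_proportion(population, 400, rng) for _ in range(5)]
print(estimates)
```

Each estimate should land near the true value of 0.55, but they will generally not agree exactly with one another.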

A major advantage of simple random sampling is that instead of allowing participants to select (or not select) themselves, such as in the Literary Digest or Drudge Report examples, impersonal chance determines who is chosen, removing selection bias.

One disadvantage is that simple random sampling requires a sampling frame, which may be flawed or unavailable. For example, it might be difficult or impossible to compile an accurate list of all teachers in Kentucky. It might be easier to obtain a sampling frame listing all school districts. Another disadvantage is that it can be cumbersome to draw many random numbers by hand, although computer software is typically used to draw random samples.

4.7 Systematic Sampling

Instead of drawing a simple random sample, we can draw a 1-in-$k$ sample, usually called a systematic sample. We compute $k=\frac{N}{n}$, choose a random integer between 1 and $k$, and use that individual and every $k^{th}$ individual after that as the sample.

For the teachers problem, $k=\frac{40000}{400}=100$. Choose a random number between 1 and 100. Suppose it is 37; we will sample the 37th teacher, 137th teacher, 237th teacher, and so on.
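The 1-in-$k$ rule can be sketched in a few lines of Python. The function name `systematic_sample` and its optional `start` argument are our own; passing `start=37` reproduces the example above, and this sketch assumes $N$ is an exact multiple of $n$.

```python
import random

def systematic_sample(N, n, start=None, seed=None):
    """Return the positions chosen by a 1-in-k systematic sample of a frame of size N."""
    k = N // n  # sampling interval; assumes N is an exact multiple of n
    if start is None:
        # Random starting point between 1 and k (inclusive).
        start = random.Random(seed).randint(1, k)
    # Take the start position and every k-th position after it.
    return list(range(start, N + 1, k))[:n]

positions = systematic_sample(N=40000, n=400, start=37)
print(positions[:3], len(positions))
```

Only one random number is needed, which is what makes the method so convenient.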

This can be very convenient in an industrial setting. We might choose to sample every $200^{th}$ computer off the assembly line to check for quality control purposes.

Although systematic sampling is convenient to use, it can be flawed if there is a cyclical pattern in the sampling frame. For example, suppose a systematic sample ended up sampling the number of absent students or patients admitted to a hospital in such a way that the entire sample was from the same day of the week (say Friday). I think you can see the potential for inadvertent bias.
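A short sketch shows how this can happen. We assume a hypothetical frame of 364 consecutive days (52 weeks) in which day 1 is a Monday, and compare a sampling interval of 7 with an interval of 10.

```python
def weekday(day):
    """0 = Monday, ..., 4 = Friday, 6 = Sunday, assuming day 1 is a Monday."""
    return (day - 1) % 7

days = list(range(1, 365))  # hypothetical frame: 364 days = 52 full weeks

# 1-in-7 sample starting at day 5: the interval matches the weekly cycle.
every_7th = days[5 - 1::7]
weekdays_hit_by_7 = {weekday(d) for d in every_7th}

# 1-in-10 sample starting at day 5: the interval does not match the cycle.
every_10th = days[5 - 1::10]
weekdays_hit_by_10 = {weekday(d) for d in every_10th}

print(weekdays_hit_by_7)   # only Fridays
print(weekdays_hit_by_10)  # a mix of weekdays
```

When the interval lines up with the cycle, every sampled day is a Friday, so the sample says nothing about the other days of the week.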

4.8 Stratified Sampling

Sometimes it makes sense to divide the population into homogeneous groups, which are called strata (the singular is stratum), before the stratified random sample is selected.

For example, in the teachers example, I might divide the population into three groups: elementary teachers (K-5), middle school teachers (6-8), and high school teachers (9-12). Suppose that 50% of the population are elementary teachers, 20% middle school, and 30% high school.

Then three separate simple random samples would be collected. With the desired total sample size of $n=400$, we would select $n_1=200$ elementary, $n_2=80$ middle school, and $n_3=120$ high school teachers.
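The allocation above is proportional: each stratum gets the same share of the sample that it has of the population. A minimal sketch follows, assuming each stratum's teachers are numbered 1 up to the stratum size (the stratum sizes come from applying the percentages to $N=40000$; the function name is our own).

```python
import random

def proportional_allocation(n, shares):
    """Split a total sample size n across strata in proportion to their shares."""
    # Rounding may not sum exactly to n for awkward shares; it does here.
    return {name: round(n * share) for name, share in shares.items()}

shares = {"elementary": 0.50, "middle": 0.20, "high": 0.30}
allocation = proportional_allocation(400, shares)
print(allocation)

# Stratum sizes implied by the shares when N = 40000.
stratum_sizes = {"elementary": 20000, "middle": 8000, "high": 12000}

# One separate simple random sample of ID numbers within each stratum.
rng = random.Random(3)
stratified_sample = {
    name: rng.sample(range(1, stratum_sizes[name] + 1), allocation[name])
    for name in allocation
}
```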

Other categorical variables, such as sex, race, or age group, can be used as the stratification variable.

4.9 Why Use Stratified Sampling?

As you can see, collecting a stratified sample as opposed to a simple random sample is a bit more involved. Why do we bother going through the extra trouble, especially since both samples are random?

A big advantage of stratified sampling is that it can reduce the total variability of the sample statistics. In a political poll or opinion survey, this means that we can either have a smaller margin of error (desirable) or use a smaller total sample size to obtain a particular margin of error (also desirable).
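The reduction in variability can be checked with a simulation sketch. Everything about the population below is invented: we assume the three grade-level strata answer "yes" at very different rates (90%, 50%, and 10%), then repeatedly estimate the overall proportion using a simple random sample of 400 and a proportionally allocated stratified sample of 400.

```python
import random
import statistics

rng = random.Random(6)

# Hypothetical population: stratum -> list of 0/1 answers (1 = "yes").
# The 90% / 50% / 10% "yes" rates are made up for illustration.
strata = {
    "elementary": [1] * 18000 + [0] * 2000,
    "middle": [1] * 4000 + [0] * 4000,
    "high": [1] * 1200 + [0] * 10800,
}
allocation = {"elementary": 200, "middle": 80, "high": 120}  # n = 400
everyone = [y for answers in strata.values() for y in answers]

def srs_estimate():
    """Overall 'yes' proportion from one simple random sample of 400."""
    return sum(rng.sample(everyone, 400)) / 400

def stratified_estimate():
    """Overall 'yes' proportion from one proportionally allocated stratified sample."""
    total = sum(sum(rng.sample(strata[s], allocation[s])) for s in strata)
    return total / 400

# Repeat each design 500 times and compare the spread of the estimates.
srs_sd = statistics.stdev([srs_estimate() for _ in range(500)])
strat_sd = statistics.stdev([stratified_estimate() for _ in range(500)])
print(srs_sd, strat_sd)
```

Under these assumptions the stratified estimates should vary less from sample to sample. The gain depends on how different the strata are from one another; if the strata had similar "yes" rates, the two designs would perform about the same.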

In addition, we would often be interested in reporting the statistics by strata: for example, to see whether there are differences in opinion among elementary, middle school, and high school teachers, between men and women, among different racial groups, etc.

4.10 Cluster Sampling

Another sampling technique, which is often confused with stratified sampling, is cluster sampling.

In this method, the population is divided into representative subgroups, which are heterogeneous rather than homogeneous. These subgroups are called clusters. Then, instead of taking a simple random sample of individuals in the population, a simple random sample of the clusters is taken. In the purest form of cluster sampling, everyone in the chosen clusters is used, although in practice it is common to take a simple random sample or stratified sample within the chosen clusters.

In the teachers example, it might be difficult or impossible to get a sampling frame listing all teachers in the state. A frame of all school districts in Kentucky would be easier to obtain. We could randomly select a certain number of school districts and use the teachers in those districts.
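A minimal sketch of the pure form of cluster sampling is given below. The frame is hypothetical: the number of districts (120), the district sizes, and the ID formats are all invented, since only a list of districts is needed to draw the sample.

```python
import random

def make_hypothetical_frame(rng, n_districts=120):
    """Build a made-up frame: district name -> list of teacher IDs."""
    frame = {}
    for d in range(1, n_districts + 1):
        size = rng.randint(50, 500)  # invented district sizes
        frame[f"district_{d:03d}"] = [f"d{d:03d}_t{t:03d}" for t in range(size)]
    return frame

def cluster_sample(frame, n_clusters, rng):
    """Randomly choose whole districts, then take every teacher in them."""
    chosen = rng.sample(sorted(frame), n_clusters)
    teachers = [t for district in chosen for t in frame[district]]
    return chosen, teachers

rng = random.Random(4)
frame = make_hypothetical_frame(rng)
chosen, teachers = cluster_sample(frame, n_clusters=10, rng=rng)
print(chosen[:3], len(teachers))
```

Notice that the total sample size is not fixed in advance; it depends on how large the chosen districts happen to be.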

4.11 Why Use Cluster Sampling?

Students often confuse stratified and cluster sampling. They appear similar because both methods involve dividing the population into subgroups.

In stratified sampling, the idea is to divide into strata where everyone within a stratum is similar on a characteristic of interest (e.g. teaches the same grade level, is the same sex, is the same race, etc.). Often this is done to make sure the sample “fairly” represents the population.

In cluster sampling, the idea is to divide into clusters where ideally the individual clusters are microcosms of the population. Then, we can reduce the time and money needed to collect the sample by sampling clusters rather than individuals.

4.12 Multistage Sampling

Most companies (Gallup, Rasmussen, news organizations, etc.) that take large surveys regularly combine several of these techniques into what is called a multistage sample.

First, cluster sampling might be used to randomly select certain geographic areas; it is common to use census tracts or neighborhood blocks, which are subdivisions used by the U.S. Census Bureau. Then the company or researcher might choose to use stratified sampling within the chosen census tracts on one or more stratification variables of interest.
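The two stages can be sketched as follows. This is only an illustration under made-up assumptions: 50 hypothetical tracts of 200 residents each, age group as the stratification variable, and a fixed number of people drawn per age group in each chosen tract. Real survey designs are considerably more elaborate.

```python
import random

AGE_GROUPS = ["18-34", "35-54", "55+"]  # hypothetical stratification variable

def make_hypothetical_tracts(rng, n_tracts=50, residents=200):
    """Build a made-up frame: tract name -> list of (person ID, age group)."""
    tracts = {}
    for t in range(n_tracts):
        tracts[f"tract_{t:02d}"] = [
            (f"t{t:02d}_r{r:03d}", rng.choice(AGE_GROUPS)) for r in range(residents)
        ]
    return tracts

def multistage_sample(tracts, n_tracts, per_stratum, rng):
    # Stage 1 (cluster sampling): randomly choose whole tracts.
    chosen = rng.sample(sorted(tracts), n_tracts)
    sample = []
    for tract in chosen:
        # Stage 2 (stratified sampling): a simple random sample per age group.
        for group in AGE_GROUPS:
            members = [pid for pid, g in tracts[tract] if g == group]
            take = min(per_stratum, len(members))
            sample.extend((tract, group, pid) for pid in rng.sample(members, take))
    return chosen, sample

rng = random.Random(5)
tracts = make_hypothetical_tracts(rng)
chosen_tracts, people = multistage_sample(tracts, n_tracts=5, per_stratum=10, rng=rng)
print(chosen_tracts, len(people))
```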

These sample designs can be quite complicated, but the intention is to try to gain the advantages of the various methods, especially when a simple random sample is not feasible.