Chapter 1 Qualitative Basis of Statistics




1.1 Motivation


For starters, we must establish proper methodology for data collection and analysis. If you have taken a science class before, you know that this material is unapologetically dry; fortunately, our time with it in this book is fleeting. Still, it is important to develop a strong understanding of these fundamentals, since they are the building blocks that the rest of the book (and all of Statistics) rest upon.




1.2 Populations


Much of Statistics exists because of the sheer, unavoidable fact that in our world it is usually difficult (if not impossible) to obtain information about every single object we are concerned about.

This is a pretty bleak part of life, since it means that we can never really answer the question “How much does a banana weigh on average?” That’s because, to answer this question exactly, you would need to gather every single banana in the world (this is called a ‘census’), and no one has the capacity needed to do that.

Thankfully, Statistics comes to the rescue. Using tools that you will learn in this book, it is possible to generate guesses of varying precision and even to quantify just how far off these guesses might be.


So, in this book, we will start with a population (call it the population of all bananas), take a sample from that population (say, 10 bananas), and attempt to answer questions about the population as a whole based on our sample (how heavy are bananas on average?).
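If it helps to see the idea in code, here is a tiny Python sketch of that process. The banana weights and population size below are invented purely for illustration; the point is only that the average of a small sample serves as our guess for the average of the whole population.

    import random

    random.seed(42)  # fixed seed so this illustration is reproducible

    # Pretend this list is the entire population of banana weights, in grams
    # (hypothetical values; in real life we could never collect all of these).
    population = [random.gauss(120, 15) for _ in range(100_000)]

    # In practice we only get to see a sample, say 10 bananas.
    sample = random.sample(population, k=10)

    # The sample average is our guess for the population average.
    sample_mean = sum(sample) / len(sample)
    print(f"Sample mean weight: {sample_mean:.1f} g")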




1.3 The Statistical Toolbox


The statistical toolbox holds many ways of gathering data; the two major ones we will be dealing with are Observational Studies and Experiments. The focus in this book is not on performing these studies, but on collecting the data from them and understanding them well enough to make logical conclusions based on their results. So, while you will likely never have to conduct one of these yourself (in the scope of this book, at least), they are still important to understand because we will be working with their results.


The key distinction between an Observational Study and an Experiment is the word treatment. Treatments are imposed in Experiments and are not imposed in Observational Studies. A treatment is basically an action performed on a group of subjects in an experiment, as opposed to a passive observational study, where subjects are only observed from the sidelines.

For example, consider testing if a new energy drink actually energizes the consumer. In an experiment, the treatment would be giving (treating) the drink to one group in the sample, while holding another group (the ‘control’ group) constant (not giving them the drink).


Another important difference between these two types of studies is the pair of concepts association and causation. Since an observational study does not impose any treatment (it merely watches subjects), it can never conclude any sort of causation (for reasons we will get to soon).

For example, if we were observing home decor tastes in subjects, we might notice that when subjects place a large, evergreen tree in their living room they tend to spend much more that month. While we could say that spending and the presence of this tree are positively associated (as one increases, the other also tends to increase), we can never conclude a causal relationship: here, that the presence of the tree somehow causes the homeowners to wish to spend more money, perhaps whipped up into a spending frenzy by the scent of the needles.

In this example, the reason why there isn’t a causal relationship is obvious. There is something lurking in the background: the holiday celebration of Christmas, which causes the increases in both the presence of trees in the home and the spending on gifts. However, this ‘lurking variable’ (which we will formally define later) won’t always be so easy to spot, so it’s important to be wary of causal relationships from observational studies.


Let’s break each of these down further:


a. Observational Study


These come in three types.


The first is cross-sectional, where subjects are measured at one point in time (or over a very short period of time). One example could be measuring the current weight of a sample of local high schoolers.


The second is Retrospective (also called ‘Case-Control’). Here, the researcher effectively goes back in time for data. One example would be observing (which is really just looking up) crime rates for different cities in the 1920s.


The third is Prospective (also called ‘Longitudinal’ or ‘Cohort’). Here, you go forward in time. This sounds dicey, but it really just means that you pick subjects and observe them as they develop over time. An example would be identifying 10 children in 1st grade and observing their heights or weights as they grow through high school.


Remember, a survey is also an observational study, since no treatment is imposed. It would usually be cross-sectional.


b. Experiment


If you have taken a science course before, you likely are familiar with good experimental techniques. Namely, the pillars of any good experiment are the explanatory and response variables.

The explanatory variable is the same thing as an independent variable: it’s what you, the experimenter, change to manipulate the subjects. In the energy drink example from earlier, the explanatory variable would be the administering (giving or not giving) of the drink to the subjects. The response variable, then, is what is affected by the explanatory variable and is what you measure. In this example, it would be the amount of energy the subjects gained.


Another important factor is the control group, or basically a group without treatment. The idea of these groups is to isolate the effect of the explanatory variable on the subjects. Ideally, if you have two groups, one with treatment and one without, you would like the only difference between the two to be that treatment, and not something like wealth, status or age that might blur the results. So, the control and experimental groups should (ideally) be identical in all but the treatment.


A good rule of thumb is that an effective experiment should have three factors: control (which we just discussed), replication (the steps are clear enough that the experiment can be repeated) and randomization. This last concept is the idea that samples and experimental groups should be obtained randomly to avoid some of the biases we will discuss in the next bit. In this book, we will use Simple Random Samples, or a drawing process where every individual in the population has an equal chance of being selected to be in the sample or group.
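As a rough illustration, here is a short Python sketch of drawing a simple random sample and then randomly splitting it into a treatment and a control group, in the spirit of the energy drink example. The subject IDs and group sizes are hypothetical, not something specified in the text.

    import random

    random.seed(1)  # fixed seed so this illustration is reproducible

    # Hypothetical roster of 500 subjects in the population of interest.
    population = [f"subject_{i}" for i in range(500)]

    # Simple random sample: every individual has an equal chance of selection.
    sample = random.sample(population, k=20)

    # Randomization: shuffle the sampled subjects, then split them in half.
    random.shuffle(sample)
    treatment_group = sample[:10]   # these subjects receive the energy drink
    control_group = sample[10:]     # these subjects do not (the control group)

    print("Treatment:", treatment_group)
    print("Control:  ", control_group)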


1.4 Danger!


Of course, every rose has its thorn, and every statistical result has pitfalls that we must be wary of. Usually, they can be avoided with good, fundamental techniques. Again, for this book, we will usually just have to point out these faults, not actually avoid them when carrying out an experiment.



1. Lurking Variables


As briefly described in the previous section, these are variables that affect your results (the response variable) from afar. The trouble is that they never identify themselves, so you may be under the impression that some relationship exists between two variables when really it is a lurking variable that causes it.


We saw one excellent example of lurking variables above: a casual observer might think that the presence of an evergreen tree in the living room sparks some sort of spending spree. However, the lurking variable is actually the occurrence of Christmas, which drives both the presence of the tree and the heightened spending patterns.


How do we avoid Lurking Variables? With experimental control. The idea is that if the experimental and control groups are identical in all but their treatment, then no lurking variable could possibly make its way in and mess with the results.


2. Correlation/Association vs. Causation


Here is the same idea from the Observational Studies portion: just because two variables move together does NOT mean that one causes the other. For example, just because cavities tend to decrease as SAT scores increase, we cannot say that healthy teeth cause intelligence. We saw why we would need an experiment to address this fallacy (to control for lurking variables), and thus we should NEVER jump to conclude a causal relationship from an association alone.

This is a concept that will appear for the entirety of the book, so be sure to familiarize yourself with it now.


3. Sample Bias


We already discussed a good way to sample for experiments (simple random sampling). Here are two common bad ways:


3a. Voluntary Response Sample


This is a sample where people respond only if they would like to. For example, you might post a survey in the newspaper asking whether or not people like whipped cream, where readers respond by calling the number provided.

It’s clear that the only people who will take the time to respond to this survey are the ones that are REALLY passionate about whipped cream (positively or negatively; hopefully positively!). Therefore, your results will be blurred.


3b. Convenience Sampling


This is a sample where the experimenter selects subjects because they are easy to obtain. Say you read about a study that claimed the average height for humans was now 4.5 feet, and that this study was conducted by an elementary school teacher. It’s clear that the teacher was a victim of convenience sampling: he did not randomly select subjects from the entire population but from the small subset that was easiest for him to reach.


There are also other types of bias arising from sampling that we must be aware of. These include:


3c. Selection Bias


The simple definition is that some groups in the population are over- or under-represented in the sample. In the example above, children are certainly overrepresented. This can be countered by random sampling, where every person in the population has an equal chance of being sampled. Here, groups should be represented in the sample in roughly the same proportions as they are in the population, for the simple reason that members of a large group in the population are more likely to be randomly selected when sampling.
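Here is a small Python simulation of that last claim. The 70/30 split between adults and children below is made up, but it shows how a random sample tends to mirror the population's proportions.

    import random

    random.seed(7)  # fixed seed so this illustration is reproducible

    # Hypothetical population: 70% adults, 30% children.
    population = ["adult"] * 7_000 + ["child"] * 3_000

    # Draw a simple random sample and check the share of children in it.
    sample = random.sample(population, k=500)
    child_share = sample.count("child") / len(sample)
    print(f"Children in sample: {child_share:.1%}")  # typically close to 30%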


3d. Nonresponse Bias


This occurs when some sampled subjects cannot be reached. For example, if you were asking about income levels over email, you would not be able to reach subjects who do not have an email address. This would certainly blur the results, since you would be missing a significant number of data points from the final analysis.


3e. Response Bias/Framing


This occurs when a question is deliberately framed to nudge respondents toward a particular answer, or is simply poorly worded.

First, we can consider framing. Think about the following questions:


  • Do you enjoy eating small cucumbers preserved in vinegar?


  • Do you like pickles?


While these questions are essentially asking the same thing, the first certainly sounds less appetizing and would therefore likely get fewer affirmative responses than the second.

Framing is also important because it could be deliberately used by people trying to make a point. If someone were on an anti-pickle campaign, they could survey people with that first question, likely get a low number of ‘yes’ responses, and then claim that our population is seriously against pickles. Again, the key is to be as objective as possible.


Another problem is confusing, poorly worded questions. Consider this:


  • Is it unlikely that you would disagree that you don’t dislike pickles?


Here, the quadruple negative means the exact same thing as ‘Do you like pickles?’ but is extremely confusing and could lead to different answers than intended. It’s simple to avoid these: just ask questions simply!