1.2 Vocab, part 1 of many

1.2.1 Introduction

We’ll use the tomato experiment as an example here. It’s not exactly thrilling, I know, but you’re already familiar with it and, hey, it’s late summer in Massachusetts and I’m eating tomatoes, like, all the time right now.

Here’s a look at the dataset:

data("tomato.data")
tomato.data
##    pos pounds fertilizer
## 1    1   29.9          A
## 2    2   11.4          A
## 3    3   26.6          B
## 4    4   23.7          B
## 5    5   25.3          A
## 6    6   28.5          B
## 7    7   14.2          B
## 8    8   17.9          B
## 9    9   16.5          A
## 10  10   21.1          A
## 11  11   24.3          B

Again, the goal here was to determine which type of fertilizer led to better yield.

1.2.2 Data terminology

You’ve seen rectangular data like this before – data that you can represent in a table. Previously, you’ve referred to each row as a case, or maybe a subject: a thing on which you are taking a measurement. In experimental design, we also use the term experimental unit. The rows in a table are also often called runs. This term reflects the more “active” nature of experiments – remember, in experiments you change or decide something, unlike with observational data. You run an experiment.

Each column, meanwhile, corresponds to a variable: some quantity or quality you’re measuring. Variables have different types.

Categorical variables are like multiple-choice questions: there’s only a limited set of things they can be. In experimental design, we often use the term factors to refer to categorical variables. In the tomato data, the fertilizer type is a factor. Each possible value of a categorical variable is called a level; the fertilizer factor has two levels, A and B.
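
If you want to poke at this in R, here's a quick sketch using the tomato.data frame from above (I'm assuming fertilizer comes in as text or as a factor; the factor() call is harmless if it already is one):

tomato.data$fertilizer <- factor(tomato.data$fertilizer)  # make sure R treats it as categorical
levels(tomato.data$fertilizer)                            # "A" "B": the two levels
table(tomato.data$fertilizer)                             # how many runs got each level (5 A's, 6 B's)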

Quantitative variables are quantities that live on the number line. For this dataset, the yield in pounds is quantitative. Whenever you get a dataset, though, always make sure that the things that look quantitative really are quantitative. A researcher could have encoded the fertilizer types as "1" and "2" instead of "A" and "B", but that wouldn't mean one is twice as good as the other!
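
To see why that matters, here's a hypothetical sketch: the vector below re-codes the same fertilizer assignments as 1's and 2's. R will cheerfully do arithmetic on it, which is exactly the trap; converting to a factor makes the categorical-ness explicit.

coded <- c(1, 1, 2, 2, 1, 2, 2, 2, 1, 1, 2)             # hypothetical numeric coding: 1 = A, 2 = B
mean(coded)                                              # legal, but meaningless
factor(coded, levels = c(1, 2), labels = c("A", "B"))    # treat it as categorical instead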

There’s a third data type you may have encountered, called ordinal. Ordinal data is categorical, but the categories have some inherent ordering. For example, “education level” could be treated as ordinal – it’s not on the number line, you can’t take an average, but “college” is definitely more than “eighth grade.” There’s a whole heap of special methods for dealing with ordinal data which you can learn about in various elective courses later in life. We won’t really get into them in this course – we’ll generally decide to treat something as either categorical or quantitative.
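
(For the curious: R can at least store the ordering. A tiny sketch with made-up education data:)

edu <- factor(c("college", "eighth grade", "high school"),
              levels = c("eighth grade", "high school", "college"),
              ordered = TRUE)                             # an "ordered factor"
edu[1] > edu[2]                                           # TRUE: college is more than eighth grade
mean(edu)                                                 # NA, with a warning: you still can't average categories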

You may also recall that you can always turn a quantitative variable into a categorical variable if you want, by binning it. For example, we could convert pounds into a categorical variable with three levels: “high yield” above 25, “medium” for 15-25, and “low” below 15. But you can’t go back! If someone hands you a dataset with yield recorded as high/medium/low, you can’t convert it to quantitative information, because you don’t have the actual quantitative values.
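
In R, cut() does this kind of binning. Here's a sketch using the cutoffs above (exactly which bin the boundary values 15 and 25 would land in is a choice you'd have to make, though it doesn't matter for these particular yields):

yield.category <- cut(tomato.data$pounds,
                      breaks = c(-Inf, 15, 25, Inf),
                      labels = c("low", "medium", "high"))
table(yield.category)                                     # 2 low, 5 medium, 4 high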

Data are great, but they get really useful when you combine them with a model. A model is, in vague terms, a description of the way you think the world works (or maybe works) – how variables relate to each other. In models you’ve built in the past, one of the variables was called the response – the one you were interested in explaining or predicting. And correspondingly, the others were called explanatory or predictor variables. All those words are still true, but in the experimental design context, we often call the explanatory variables factors, especially if they’re categorical. We may also call them treatment factors.
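
In R's formula notation, the response goes on the left of the ~ and the explanatory variables (factors) go on the right. Just as a preview of what a model for the tomato data might look like (don't worry about the fitting details yet):

tomato.model <- lm(pounds ~ fertilizer, data = tomato.data)  # response ~ treatment factor
summary(tomato.model)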

1.2.3 Blocking: a first look

The thing about models is, they're not perfect. Every model is a simplification of how the world actually works. It's like drawing a map: if you drew a map that was exactly perfect in every detail, it'd be useless because it'd be the same size as the area you mapped. Maps, like models, are useful because they leave stuff out. The goal is that they retain the important stuff.

There's a very cool short story by Jorge Luis Borges called "Del rigor en la ciencia" ("On Exactitude in Science") about exactly this idea. Go look it up sometime when you're bored.

The problem is, how do you know what’s the important stuff? How do you know you even measured the important stuff in the first place?

With observational data…you don't. You can never be sure there isn't something you neglected to measure that's actually really important. Such an unmeasured-but-important variable is called a lurking variable. For example, what if – unknown to you – some of the tomato plants are a different breed?

Suppose you’re doing this experiment. Your helpful assistant brings you a flat of tomato seedlings to fertilize, so you plop some fertilizer A on there. Then you do the next batch with fertilizer B. But your assistant forgot to tell you: the first set were nice normal Sungold tomatoes, and the second set were a new breed called Monster Heavyweight Tomato King.

Response moment: What’s going to happen? What will you conclude from the experiment?

Well, the Tomato Kings are going to do what they do best, and have super high yield. But you'll think that the high yield was because of fertilizer B, because you don't know about the tomato breeds. This is an example of confounding: you see an effect, but you can't tell which variable is actually driving it, because fertilizer type and tomato breed are tangled together. We'll spend a lot more time with that concept in the future.

So how do we deal with this? Possibly your answer is: come on, Prof T, get a botanist to tell you what breed of tomatoes you’re working with. Which, fair.

Blocking is similar to stratified sampling in observational studies. For example, if you’re polling university students about their thoughts on remote learning, you want to make sure to ask some remote students and some on-campus students. In both cases, there’s an “external” variable that may be important to the response (tomato type, remote status), and you want to be able to pull out the effects of this external variable.

If you know that there's a variable like this involved – something that you can't control, but that might affect the response – you can introduce a process called blocking. I can't control the breed of each individual tomato plant, but I can make sure that I don't give fertilizer A to all the Sungolds and B to all the Tomato Kings. Instead, I should probably distribute them: half the Sungolds get A and half get B, and likewise for the Tomato Kings. Then I'll be able to see which fertilizer works better for a given tomato breed. In this example, the Sungolds are one block, and the Tomato Kings are a second block.
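
Here's a sketch of what that balanced, blocked assignment could look like, with hypothetical breed labels (and an even 6 plants per breed, to keep the split clean):

breed      <- rep(c("Sungold", "Tomato King"), each = 6)       # two blocks of 6 plants
fertilizer <- rep(rep(c("A", "B"), each = 3), times = 2)       # 3 A's and 3 B's within each block
data.frame(breed, fertilizer)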

1.2.4 Randomization

Okay, we solved that problem. But what about the seedlings’ age at planting? And how much water they were given at the nursery? And the amount of shade they get in the field?

There’s actually a lot of work out there on doing causal inference from observational data. It’s an important topic because, as noted earlier, sometimes experiments are too expensive/slow/unethical and you need to do your best with observational studies. A starting point is to try to measure as many potential lurking variables as possible – that’s why you see studies with phrases like “the difference remained after accounting for gender, education, and income level.” Beyond that it gets complicated, but maybe we’ll offer an elective on it sometime :)

The problem is, no matter how much blocking you do, you can never be sure you accounted for everything. That's why you don't get to make causal statements based on observational data: there could always be a lurking variable (or causation running in the opposite direction from the one you think!).

In experiments, we solve this problem via randomization. Let’s go back to you and the tomatoes. Remember, I originally said that you put fertilizer A on half the plants, and then went and put fertilizer B on the other ones. But what if, instead, you randomly chose which plants to give A, and which B? Then it’s really pretty unlikely that you would accidentally give A to all the Sungolds and B to all the Tomato Kings. You’d probably – and I say “probably” in a statistical sense – end up with a roughly even mix of tomato breeds getting each fertilizer.
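
In R, that kind of random assignment is just shuffling a vector of treatment labels. A sketch (set.seed() makes the "randomness" reproducible, so you'd get the same assignment every time you ran this):

set.seed(1)
fertilizer <- sample(rep(c("A", "B"), times = c(5, 6)))   # 5 A's and 6 B's, in random order
data.frame(pos = 1:11, fertilizer)                        # one random assignment for the 11 plants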

The great thing about this is that you didn’t have to know there were Sungolds and Tomato Kings. Just by randomizing which plants got which treatment, you kept the tomato type from confounding, or obscuring, the effect of the fertilizer. And you also dealt with all those other potential lurking variables I mentioned, and all the ones I didn’t even think of.

Of course, you might get unlucky. You could accidentally randomize all the Sungolds to get fertilizer A. That’s why if you know about a factor, you should do blocking, and then randomize within the blocks: randomly choose half the Sungolds to get A, and randomly choose half the Tomato Kings to get A.
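
And here's what randomizing within blocks could look like, again with hypothetical breed labels: each block gets its own independent shuffle of half A's and half B's.

set.seed(2)
breed      <- rep(c("Sungold", "Tomato King"), each = 6)
fertilizer <- c(sample(rep(c("A", "B"), each = 3)),       # a random half of the Sungolds get A
                sample(rep(c("A", "B"), each = 3)))       # a random half of the Tomato Kings get A
data.frame(breed, fertilizer)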

But beyond that, we just hope we don’t get unlucky. Which is how statistics works in general, after all; that’s why we have all those phrases like “with 95% confidence.” If you want to be certain about being correct, you should do pure math, and never attempt to work with real-world data :)