4.9 An optional historical side-note: Gosset and the t

William Sealy Gosset was the quality control chemist and mathematician for the Guinness company – which brewed beer, but also did a bunch of agriculture – in the early 1900s. So what did that have to do with all this distribution theory?

Well, Gosset worked on quality control, of the product and the ingredients that went into it. As a QC person, he spent a lot of time measuring various quantities – checking the chemical characteristics of barley, that kind of thing. He noticed that when he created confidence intervals using the normal distribution,

\[ \bar{y} \pm z^* \frac{s}{\sqrt{n}} \]

the confidence intervals were too narrow: he was sending back too many samples as abnormal (the old “false positive” problem). So he did some simulations with data whose mean and SD he knew. (Please note that this was before computers: he did his simulations by writing values on three thousand pieces of cardboard and manually shuffling them until they were in random order. Don’t you feel better about the tidyverse now?) He divided his data into many small groups of size 4, then calculated the mean and SD of each little sample. Then he found the “z-score” of each group mean, but pretended he didn’t know \(\sigma\) and used the group’s \(s\) instead:

\[ z = \frac{\bar{y} - \mu}{s/\sqrt{4}} \]
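The cardboard experiment is easy to imitate in R. Here is a minimal sketch, not a reconstruction of Gosset's actual data: the normal population, its mean of 100 and SD of 15, the seed, and all the object names are assumptions made up for illustration. It shuffles 3000 values, deals them into groups of 4, and computes the score above for each group.

```r
set.seed(1908)  # arbitrary seed (assumption), only for reproducibility

# A stand-in population of 3000 "cards"; the normal shape, mean, and SD are assumptions.
population <- rnorm(3000, mean = 100, sd = 15)
mu <- mean(population)  # the known population mean

# Shuffle the cards and deal them into 750 groups of 4 (one group per row).
groups <- matrix(sample(population), ncol = 4)

ybar <- rowMeans(groups)   # mean of each little sample
s <- apply(groups, 1, sd)  # SD of each little sample

# The purported z-scores: each group's s stands in for the unknown sigma.
z <- (ybar - mu) / (s / sqrt(4))

# Share of scores beyond the usual 95% cutoffs, next to two reference values.
observed_tail <- mean(abs(z) > qnorm(0.975))
normal_tail <- 0.05
t_tail <- 2 * pt(-qnorm(0.975), df = 3)

c(observed = observed_tail, normal = normal_tail, t_3df = t_tail)
```

If the scores really followed a standard normal distribution, about 5% of them would land beyond the cutoffs. The last line puts the simulated share beside that figure and beside the corresponding share for a t distribution with 3 degrees of freedom. Running `hist(z)` gives a picture of the same comparison.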

When he did a histogram of these purported z-scores, it turned out that the “tails” were way too long for a normal distribution (such a distribution is sometimes called heavy-tailed). Why would that happen?

Well, as we’ve seen, \(s\) is often smaller than the true \(\sigma\). Dividing by a denominator that is too small pushes \(z\) too far from 0, in either direction. And because these little samples were so small, \(s\) and therefore \(z\) were often way off. The effect would be less pronounced with larger samples, since \(s\) would be more reliable and a better estimate of \(\sigma\).
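A quick simulation makes the point about \(s\) concrete. This is only a sketch under stated assumptions: the data are drawn from a normal distribution with \(\sigma = 1\), the comparison sample size of 30 is an arbitrary choice, and `simulate_s`, `n_reps`, `s_small`, and `s_large` are names invented for this example.

```r
set.seed(4)  # arbitrary seed (assumption), only for reproducibility

sigma <- 1       # the "true" SD, known here because we chose it
n_reps <- 10000  # number of simulated samples per sample size (assumption)

# Draw n_reps normal samples of size n and return each sample's s.
simulate_s <- function(n) {
  replicate(n_reps, sd(rnorm(n, mean = 0, sd = sigma)))
}

s_small <- simulate_s(4)
s_large <- simulate_s(30)

# How often does s come out below sigma, and how much does s vary?
data.frame(
  n = c(4, 30),
  share_s_below_sigma = c(mean(s_small < sigma), mean(s_large < sigma)),
  sd_of_s = c(sd(s_small), sd(s_large))
)
```

For normal samples of size 4, \(s\) falls below \(\sigma\) roughly six times out of ten, and it bounces around a lot from sample to sample. At a sample size of 30, the share is closer to one half and the spread of \(s\) is much smaller, which is why the heavy tails fade as samples get larger.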

Gosset figured out what the right distribution actually was, but Guinness wouldn’t let him publish his work under his own name for… reasons. (Some say they were afraid of revealing trade secrets.) So he published it under a pseudonym: Student.

See this surprisingly entertaining article: https://www.jstor.org/stable/2683142.

As a side note to the side note, Gosset had an extended correspondence with a couple of other pro statisticians about this. In one letter to them, he wrote (about reading other people’s mathematical writing): “It’s not so much the mathematics. I can often say ‘Well, of course, that’s beyond me, but we’ll take it as correct’ but when I come to ‘Evidently’ I know that means two hours hard work at least before I can see why.”

Some things never change.