Chapter 1 Introduction to Stats

Motivating scenario: You have seen or heard about statistics, but you don’t really know what people are getting at. You hear words like “p-value”, “statistically significant”, “effect size”, “bias”, or “nonindependent” but can’t pinpoint exactly what they mean?

Learning goals: By the end of this chapter you should be able to

Get a sense of what motivates statisticians.
Be able to explain the difference between a population and a sample, a parameter and an estimate, the relationship between these four concepts, and the things which can make a sample deviate from a population.
Have a sense of why as statisticians, we must consider the type of data and variables in building models.
Understand the differences in motivation and what we can get out of an observational and experimental study.

1.1 Goals of (Bio)stats

Statistics claims to be obsessed with the TRUTH. It’s meant to be a way of understanding the world. It’s meant to describe and differentiate, and ideally, to tease apart causation from correlation. These three goals

Estimation
Hypothesis testing, and
Causal inference,

Are the beams of light we, as statisticians, try to shine on questions. That is – this is what you call a statistician to do.

My goal is to enable you to unleash these powers. But also to realize that on their own, estimation, hypothesis testing, and even causal inference are of pretty limited utility — we could go around estimating useless trivia, or testing hypotheses that have little to do with our motivating questions, or separating between unlikely or non-exclusive causes. As such, in BIOSTATS our goals are more ambitious — we constantly ask:

What is the motivating biological question?
What experiments can I do/have been done and/or data can I collect to address this question?
How can I design this experiment and/or collect these data in a way that will map cleanly onto both a statistical model and my motivating idea?
Do results support an interesting conclusion?
What are the shortcomings of statistical models and causal frameworks in the analysis?
How do I best communicate my results (including estimates, visualizations, conclusions, and caveats)?

The more time I spend as The Statistician in the room, the more I think the best skill you can cultivate is the ability to remain calm and repeatedly ask "What question are you trying to answer?"
— Bryan Howie (@bryan_howie) December 13, 2018

1.2 Sampling from Populations

To a statistician the TRUTH is a population – a collection of all individuals of a circumscribed type (e.g. all grasshoppers in North America), or a process for generating these individuals (see note below). We characterize a population by its parameters (e.g. its TRUE mean, or its TRUE variance).

However, it is often impractical or impossible to study every individual in a population. As such, we often deal with a sample – a subset of a population. We characterize a sample by taking an estimate of population parameters.

So a major goal of a statistical analysis is how to go from conclusions about a sample which we can measure and observe, to the population(s) we care about. In doing so we must worry about random differences between a sample and a population (known as Sampling Error, as well as any systematic issues in our sampling or measuring procedure which will cause estimates to reliably differ from the population (known as Sampling Bias).

Sometimes we think of a population as a process generating our data. For example, when we compare binomial outcomes to the flip of a fair coin, we do not think about all coins flipped in the world. Rather we think about what we would expect to observe from the process of flipping a fair coin.

Sampling Error

Estimates of any sample will differ somewhat from population parameters by chance sampling and/or imprecise measurement. This is why we always consider and report uncertainty in estimates and note that chance can contribute to any result. Much of this course focuses on acknowledging and accommodating sampling error. We can reliably decrease sampling error by increasing the sample size. Note that despite its name, sampling error does not imply that the researcher made a mistake.

I therefore think it’s more appropriate to call this chance deviation of estimates away from parameters “The First Law of sampling” rather than sampling error.

Sampling Bias

Image from [xkcd](https://xkcd.com/1999/)

Figure 1.1: Image from xkcd

There are many ways for samples to systematically deviate from the population.

For example, non-randomly selected individuals will likely differ from the population in important ways. e.g. students who raise their hand in class are perhaps more likely to know the answer than a randomly chosen student, or brightly colored individuals are more likely to be spotted than drab individuals. Such phenomena are known as volunteer bias.

Another devious form of bias is known as survivorship bias, in which survivors differ from a population in critical ways. For example, students at the University of Minnesota likely had higher high school GPAs than a random sample of individuals from across the twin cities of the same age.

Image from [wikipedia](https://en.wikipedia.org/wiki/Survivorship_bias)

Figure 1.2: Image from wikipedia

A classic example of survivorship bias is this distribution of gunshot holes in airplanes returning from WWII — the army initially reacted to the prevalence of holes in the wings and tail by suggesting that such parts of the plane be reinforced. However, Abraham Wald pointed out that airplanes that did not return were the ones who needed more help and these where likely shot where we don’t see holes.

(Non) independence

Most (intro) stats assumes that samples are independent. If observations in a sample are independent of on another, the probability that one individual is studied is unrelated to the probability that any other individual is studied. While this is desirable, it is sometimes impossible, so we address methods for accommodating non-independence later in the term.

1.3 Models and Hypothesis Testing

Statistical Models

While statisticians claim to be interested in TRUTH and POPULATIONS, they quickly give up on this. Rather than characterizing a population in full, or describing the complex processes such as metabolism, meiosis, gene expression that generate data, statisticians build simple models to abstract away the biological complexity and make questions approachable. As such, rather than fully describing a distribution from a population, statisticians often use well characterized statistical distributions to approximate and model the actual biological process.

So, while we use statistics to search for truth, we must always remember this underlying fiction. As we conduct our statistics always consider the relationship between our biological models/phenomena and the stats model we use to approximate it. If our statistical model is inappropriate for our biological question, we cannot reasonably take home much from the result. That said, we’re doing statistics not physiology, and so our goal is to make use of appropriate statistical abstractions, not to reconstruct the whole biological system (even if we’re doing statistics on physiology).

“All models are wrong, but some are useful.”

— Attributed to statistician George Box

Hypothesis Testing

A common use of statistics — especially the statistics we focus on this term — is hypothesis testing. Here, we use our estimates from samples to ask if data come from populations with different parameters. This approach often relies on statistical models.

1.4 Types of Variables and Data

As we build and evaluate statistical models, a key consideration is the type of data and the process that generates these data. Variables are things which differ among individuals (or sampling units) of our study. So, for example, height, or eye color, or the type of fertilizer applied to a site, or the number of insect species per hectare are all variables.

Explanatory and Response variables

We often care to distinguish between explanatory variables, which we think underlie or are associated with the biological process of interest, from response variables, the outcome we aim to understand. This distinction helps us build and consider our statistical model and relate the results to our biological motivation.

The difference between an explanatory and response variable often depends on the motivation and/or study design. For example if we where interested to know if fertilizer type had an (?indirect?) impact on insect diversity, the type of fertilizer would be the explanatory variable and the number of insect species per hectare would be the response variable.

Types of Data

Data can come in different flavors. It is important to understand these, as they should direct our model building and data summaries, interpretation and data visualization.

Flavors of numeric variables.

Numeric variables are quantitative and have magnitude, and come in a few sub-flavors. As we will see soon, these guide our modeling approaches:

Discrete variables come in chunks. For example the number of individuals is an integer, we don’t have 1/2 people.
Continuous variables can take any value within some reasonable range. For example, height, weight, temperature, etc. are classic continuous variables. Some variables are trickier – for example, age is continuous, but we often analyze it as if it’s discrete. In practice, these tricky cases rarely present a serious problem for our analyses (except in the rare cases in which they do).

Not all numbers are numeric. For example, gene ID is a number but it is an arbitrary marker and is not quantitative.

Flavors of categorical variables.

Categorical variables are qualitative, and include,

Nominal variables which cannot be ordered and have names – like sample ID, species, hair color etc…
Binary variables are special types of nominal variables, which have only two options (or for which we only consider two options. Alive/dead, pass/fail, on/off are classic binary variables).
Ordinal variables can be ordered, but do not correspond to a magnitude. For example, bronze, silver and gold medals in the Olympics are ranked from best to worst, but first isn’t some reliable distance away from second or third etc… .

1.5 Types of Studies

The third major goal of statistics (in addition to estimation and hypothesis testing) is inferring causation. When hoping to do so, we must consider the type of study.

A randomized control experiment, in which treatments are randomly assigned to individuals is considered the gold-standard for establishment of causation. If neither the subjects nor the experimenters know the treatment, a randomized control study can allow us to infer causation. However, some caution is warranted here – a causal relationship in a controlled experiment under specified settings may not imply a causal relationship in nature. For example, exceptionally high doses of a pesticide may cause squirrels to die, but this does not mean that the smaller dose of insecticide found around farm fields is responsible for squirrel death.

Additionally, randomized control experiments are often unfeasible, impractical, and/or impossible. As such, scientists often rely on observational studies in which treatments are not randomly assigned to individuals. In observational studies, confounding variables – unmodelled variables which distort the relationship between explanatory and response variables – can bias our interpretation. We therefore must show extreme caution when attempting to learn about causation from observational studies. However, the rapidly developing field of causal inference is establishing a set of approaches that can help us disentangle correlation and causation.

Remember: the distinction between an experimental and an observational experimental study is not about the equipment used or the lab vs. the field. For example, comparing patterns of methylation between live bearing and egg laying species of fish is an observation, not an experiment, as we did not randomly assign live-bearing or egg-laying to fish in this study.

1.6 Quiz

1.7 Definitions

Population A collection of all individuals of a circumscribed type, or a generative process from which we can generate samples.
Sample A subset of a population – individuals we measure.

Parameter A true measure of a population.
Estimate A guess at a parameter that’s made from a finite sample.

Sampling error describes the deviation between parameter and estimate attributable to the finite process of sampling.
Sampling bias describes the deviation between parameter and estimate attributable to non representative sampling.
Independence Samples are independent if the probability that one individual is studied is unrelated to the probability that any other individual is studied.

Explanatory variables are variables we think underlie or are associated with the biological process of interest.
Response variables are the outcome we aim to understand.

Categorical variables are qualitative – they cannot be assigned a meaningful value on the number line.
Numeric variables are quantitative – they can be assigned a meaningful value on the number line.