Applied Biostats
Preface
A book of (com)passion
Course philosophy / goals
R, RStudio, and the tidyverse
What is this ‘book’ and how will we use it?
How will this term work / look?
0.1 Example mini-chapter: Types of Variables and Data
0.1.1 Explanatory and Response variables
0.1.2 Types of Data
0.1.3 Quiz
0.1.4 Definitions
I Crash course in R and stats
1 Introduction to Stats
1.1 Goals of (Bio)stats
1.2 Sampling from Populations
1.2.1 Sampling Error
1.2.2 Sampling Bias
1.2.3 (Non) independence
1.3 Models and Hypothesis Testing
1.3.1 Statistical Models
1.3.2 Hypothesis Testing
1.4 Inferring Cause
1.4.1 Confounds and DAGs
1.4.2 Types of Studies
1.5 Quiz
1.6 Homework
1.7 Definitions
2 Day one of R and RStudio
2.1 What is R and why / how do we use it?
2.2 Why use R?
2.2.1 What is RStudio?
2.2.2 What is the tidyverse?
2.3 Observations and suggestions for learning R / computer stuff
2.3.1 Observations
2.3.2 Suggestions
2.4 Installing R (or not)
2.4.1 Getting on RStudio Cloud
2.4.2 Installing RStudio
2.5 The RStudio IDE
2.5.1 A tour of the RStudio IDE.
2.6 Tidy data
2.7 Intro to R
2.7.1 Assigning variables in R
2.7.2 Using functions in R
2.8 Loading, installing, and using packages
2.8.1 Reading data into R
2.8.2 Vectors and Tibbles in R
2.8.3 Writing R scripts and beyond
II Describing and visualizing data
3 Visualizing data in R – An intro to ggplot
3.1 A quick intro to data visualization.
3.1.1 Exploratory and explanatory visualizations
3.1.2 Centering plots on biology
3.2 The idea of ggplot
3.2.1 Mapping aesthetics to variables
3.2.2 Adding geom layers
3.3 Making scatterplots
3.4 Making histograms
3.5 Making jitterplots
3.6 ggplot2 review / reference
3.6.1 ggplot2: cheat sheet
3.7 And so much more
4 Handling data in R
4.1 Intro
4.2 OPTIONAL Entering data into R
4.3 Dealing with data in R
4.3.1 mutate your data
4.3.2 Nice helper verbs
4.3.3 Summarize your data
4.4 Think about reproducibility
4.5 Functions covered in Handling data in R
4.5.1 dplyr cheat sheet
5 Summarizing data
5.1 Four things we want to describe
5.2 Data sets for today
5.3 Measures of location
5.3.1 Getting summaries in R
5.3.2 Getting summaries by group in R
5.4 Summarizing shape of data
5.4.1 Shape of data: Skewness
5.4.2 Shape of data: Unimodal, bimodal, trimodal
5.5 Measures of width
5.5.1 Measures of width: Boxplots, Ranges and Interquartile range (IQR)
5.5.2 Measures of width: Variance, Standard Deviation, and coefficient of variation.
5.6 Parameters and estimates
5.7 Rounding and number of digits, etc…
5.8 Summarizing data quiz
Summarizing data: Definitions, Notation, Equations, and Useful functions
Summarizing data: Definitions, Notation, and Equations
Here are the functions we came across in this chapter that help us summarize data.
6 Sampling
6.1 Populations have parameters
6.2 We estimate population parameters by sampling
6.2.1 (Avoiding) Sampling Bias
6.2.2 (Avoiding) nonindependence of Samples
6.2.3 There is no avoiding sampling Error
6.3 The sampling distribution
6.3.1 Building a sampling distribution
6.4 Standard error
6.5 Minimizing sampling error
6.5.1 Be wary of exceptional results from small samples
6.5.2 Small samples, overestimation, and the file drawer problem
Sampling Quiz
Functions for sampling in R
7 Uncertainty
7.1 Review of relevant material:
7.1.1 Review: Populations have parameters
7.1.2 Review: Sampling involves chance
7.1.3 Review: The standard error
7.2 Estimation with uncertainty
7.2.1 Generating a sampling distribution
7.2.2 Estimating the standard error
7.2.3 Confidence intervals
7.2.4 The bootstrap confidence interval
7.3 Visualizing uncertainty
7.4 Common mathematical rules of thumb
7.5 Uncertainty Quiz
7.5.1 Optional: Use existing R packages to bootstrap for you
Potential Datasets For Projects
Zika and head size
Salmon size
Life at high altitude
Hearing color
Eat less, live longer?
A gene for monogamy?
Running with a lighter load
Protected fish
Clarkia pollen movement
8 Review of R and New tricks
8.1 When R goes wrong
8.1.1 Warnings and Errors, Mistakes and Impasses
8.1.2 Common gotchas
8.2 Making Reproducible examples to get help
8.3 Readable and usable R code
8.3.1 Saving well-organized R scripts
8.3.2 RMarkdown
8.4 R again Quiz
9 Review and reflect on the most important stats concepts we have learned so far
9.1 Questions to consider
III Probability and hypothesis testing
10 Probabilistic thinking
10.1 Why do we care about probability?
10.2 Probability Concepts
10.2.1 Sample space
10.2.2 Probabilities and proportions
10.2.3 Exclusive vs non-exclusive events
10.2.4 Conditional probabilities and (non-)independence
10.3 Probability Rules
10.3.1 Add probabilities for OR statements
10.3.2 Multiply probabilities for AND statements
10.3.3 The law of total probability
10.3.4 Probability trees
10.3.5 Flipping conditional probabilities
10.4 Probability rules and math: Quiz and Summary
Probabilistic thinking: Critical definitions
11 Simulate for Probabilistic thinking
11.1 Why Simulate?
11.2 Simulating exclusive events with the sample() function in R
Keeping track of and visualizing simulated proportions from a single sample in R
11.3 Simulating proportions for many samples in R
11.4 Simulating Non-Exclusive events
11.4.1 Simulating independent events
11.4.2 Simulating Non-independence
11.5 A biologically inspired simulation
11.6 How to do probability – Math or simulation?
11.7 Probabilistic simulations: Quiz
12 Hypothesis Testing
12.1 Review and motivation for null hypothesis significance testing
12.2 Null hypothesis significance testing
12.2.1 Statistical hypotheses
12.2.2 The test statistic
12.2.3 The sampling distribution under the null hypothesis
12.2.4 P-values
12.2.5 Interpreting results and drawing a conclusion
12.3 The Effect of Sample Size
12.4 Problems with P-values and their interpretation
12.4.1 Why the interpretation of p-values is hard
12.5 Never report P-values without context
Hypothesis testing quiz
Hypothesis testing: Definitions
OPTIONAL: Alternatives to null hypothesis significance testing
Bayesian stats as another approach to statistics
13 Shuffling labels to generate a null
13.1 One simple trick to generate a null distribution
13.1.1 Motivation:
13.2 Case study: Mate choice & fitness in frogs
13.3 What to do with data?
13.3.1 Visualize patterns
13.3.2 Estimate parameters
13.3.3 Quantify uncertainty
13.4 Permute to generate a null distribution!
13.4.1 Permutation: State \(H_0\) and \(H_A\)
13.4.2 Permutation: Decide on a test statistic.
13.4.3 Permutation: Calculate the test statistic for the actual data.
13.4.4 Permutation: Permute the data.
13.4.5 Permutation: Calculate the test statistic on permuted data.
13.4.6 Permutation: Permute a bunch to build the null sampling distribution
13.4.7 Permutation: Calculate a p-value.
13.4.8 Permutation: Interpret the p-value.
13.5 Example write up.
13.6 How to deal with nonindependence by permutation
13.7 Permute quiz
13.8 Code for figure
14 Associations between continuous variables
14.1 Associations between continuous variables
14.2 Example: Disease and sperm viability
14.3 Summarizing associations between two continuous variables
14.3.1 Covariance
14.3.2 Correlation
14.4 Bootstrap to quantify uncertainty
Bootstrap to quantify uncertainty: Step 1. Resample.
Bootstrap to quantify uncertainty: Step 2. Summarize a bootstrapped resample.
Bootstrap to quantify uncertainty: Step 3. Repeat many times to approximate the sampling distribution.
Bootstrap to quantify uncertainty: Step 4. Summarize the bootstrap distribution.
14.5 Testing the null hypothesis of no association
14.5.1 By permutation
14.5.2 With math
14.5.3 Care in interpreting (and running) correlation coefficients.
14.6 Attenuation: Random noise in X and/or Y brings correlations closer to zero.
14.6.1 Attenuation demonstration
14.7 Two binary variables:
14.7.1 Quantifying associations Between Categorical Variables
14.7.2 Testing for associations between categorical variables
14.8 Homework
14.8.1 Guess the correlation
14.8.2 Quiz
14.8.3 Correlation appendix:
15 Reflect, Review, and Relax
15.1 Review / Setup
15.2 How science goes wrong
15.3 Review
15.4 Homework on Canvas
16 Project 2
IV Linear Models
17 Normal distribution
17.1 Probability densities for continuous variables.
17.2 The many normal distributions
17.2.1 Using R to calculate a probability density
17.2.2 The probability density of a sample mean
17.2.3 The standard normal distribution and the Z transform
17.3 Properties of a normal distribution
17.3.1 A Normal Distribution is symmetric about its mean
17.3.2 Probability that X falls in a range
17.3.3 Quantiles of a Normal Distribution
17.4 Is it normal?
17.4.1 “Quantile-Quantile” plots and the eye test
17.4.2 What normal distributions look like
17.4.3 Examples of a sample not from a normal distribution
17.5 Why normal distributions are common
17.5.1 How large must a sample be for us to trust the Central Limit theorem?
17.6 Transforming data
17.6.1 Rules for legit transformations
17.6.2 Common transformations
17.7 Quiz
17.8 Optional Log Likelihood of \(\mu\)
18 Samples from a normal distribution
18.1 The dilemma and the solution
18.2 t is a common test statistic
18.3 Calculations for a t-distribution
18.3.1 Calculating t
18.3.2 Calculating the degrees of freedom
18.3.3 Calculating a confidence interval
18.3.4 Calculating a p-value
18.3.5 Calculating the effect size
18.4 Assumptions of the t-distribution
18.4.1 What to do when we violate assumptions
18.5 Example of a one sample t-test
18.5.1 Estimation
18.5.2 Evaluating assumptions
18.5.3 Uncertainty
18.5.4 Hypothesis testing
18.5.5 Conclusion
18.6 Paired t-test
18.6.1 Paired t-test example:
18.7 Quiz
18.8 Extra material for the advanced / bored / curious
18.8.1 Showing the t is like the z with unknown sd
18.8.2 A sign test for data that don’t meet normality assumptions
18.8.3 Likelihood based inference for a sample from the normal
19 Two samples from normal distributions
19.1 Calculations for a two sample t-test
19.1.1 The pooled variance
19.1.2 The standard error
19.1.3 Quantifying surprise
19.2 A two sample t-test in R
19.2.1 With the t.test() function
19.3 Assumptions
19.3.1 Assumptions of the two-sample t-test
19.3.2 The two sample t-test assumes equal variance among groups
19.3.3 The two-sample t-test assumes normally distributed residuals
19.4 Alternatives to a two sample t-test
19.4.1 Permutation / bootstrapping
19.4.2 Welch’s t-test when variances differ by group
19.4.3 Mann-Whitney U and Wilcoxon rank sum tests
19.5 Quiz
19.6 Optional
19.6.1 Boring math
19.7 Other things to do when data do not meet assumptions
19.7.1 Fit your own likelihood model
19.8 Simulations to evaluate test performance
20 More than two samples from normal distributions
20.1 Setup
20.1.1 Why not all pairwise t-tests?
20.1.2 More on multiple testing
20.2 The ANOVA’s solution to the multiple testing problem
20.2.1 Mathemagic of the ANOVA
20.2.2 Parameters and estimates in an ANOVA
20.3 Concepts and calculations for an ANOVA
20.3.1 ANOVA partitions variance
20.3.2 \(r^2 = \frac{SS_{model}}{SS_{total}}\) as the “proportion variance explained”
20.3.3 F as ANOVA’s test statistic
20.3.4 Calculating F
20.3.5 Testing the null hypothesis that \(F=1\)
20.3.6 Exploring F’s sampling distribution and critical values.
20.4 An ANOVA in R and understanding its output
20.5 Assumptions of an ANOVA
20.5.1 The ANOVA assumes equal variance among groups
20.5.2 The ANOVA assumes normally distributed residuals
20.6 Alternatives to an ANOVA
20.6.1 Permuting as an alternative to the ANOVA based p-value
20.7 Post-hoc tests and significance groups
20.7.1 Unplanned comparisons
20.8 Quiz
21 Linear Models
21.1 What is a linear model?
Optional: A view from linear algebra
21.2 The one sample t-test as the simplest linear model
21.3 Residuals describe the difference between predictions and observed values.
21.4 More kinds of linear models
21.4.1 A two-sample t-test as a linear model
21.4.2 An ANOVA as a linear model
21.4.3 A regression as a linear model
21.5 Assumptions of a linear model
21.6 Predictions from linear models
21.7 Quiz
22 Predicting one continuous variable from another
22.1 Setup and review
22.1.1 The dataset
22.1.2 Review
22.2 The regression as a linear model
22.2.1 Finding the “line of best fit”
22.3 Estimation for a linear regression
22.3.1 Estimating the slope of the best fit line
22.3.2 Estimating the intercept of the best fit line
22.3.3 Fitting a linear regression with the lm() function.
22.4 Hypothesis testing and uncertainty for the slope
22.4.1 The standard error of the slope
22.4.2 Confidence intervals and null hypothesis significance testing
22.4.3 Putting this all together
22.5 Uncertainty in predictions
22.6 A linear regression as an ANOVA
22.6.1 The squared correlation and the proportion variance explained
22.7 Caveats and considerations for linear regression
22.7.1 Effect of measurement error on estimated slopes
22.7.2 Be wary of extrapolation
22.8 Assumptions of linear regression and what to do when they are violated
22.8.1 Assumptions of linear regression
22.8.2 When assumptions aren’t met
22.8.3 Polynomial regression example
22.9 Quiz
23 Predicting one continuous variable from two (or more) things
23.1 Review of Linear Models
23.1.1 Test statistics for a linear model
23.1.2 Assumptions of a linear model
23.2 Polynomial regression example
23.2.1 Polynomial regression example
23.3 Type I Sums of Squares (and others)
23.4 Two categorical variables without an interaction
23.4.1 Estimation and Uncertainty
23.4.2 Hypothesis testing with two categorical predictors
23.4.3 Post-hoc tests for bigger linear models
23.5 Quiz
24 More considerations when predicting one continuous variable from two (or more) things
24.1 Review of Linear Models
24.2 Statistical interactions
24.2.1 Visualizing main & interactive effects
24.3 Interaction case study
24.3.1 Biological hypotheses
24.3.2 The data
24.3.3 Fitting a linear model with an interaction in R
24.4 Hypothesis testing
24.4.1 Statistical hypotheses
24.4.2 Evaluating assumptions
24.4.3 Hypothesis testing in an ANOVA framework: Types of Sums of Squares
24.4.4 Biological conclusions for case study
24.5 Quiz
25 Break
26 More Break
V Big picture – what are we trying to do?
27 Designing scientific studies
27.1 Review / Set up
27.2 Challenges in inferring causation
27.2.1 Correlation does not necessarily imply causation
27.3 One weird trick to infer causation
27.3.1 When experiments are not enough
27.4 Minimizing bias in study designs
27.4.1 Potential Biases in Experiments
27.4.2 Eliminating Bias in Experiments
27.5 Inferring causation when we can’t do experiments
27.6 Minimizing sampling error
27.6.1 Increasing the sample size decreases sampling error.
27.6.2 How to decrease the standard deviation.
27.7 Planning for power and precision
27.7.1 Estimating an appropriate sample size.
27.7.2 Simulating to estimate power and precision
27.8 Quiz
28 Causal Inference
28.1 What is a cause?
28.2 DAGs, confounds, and experiments
28.2.1 Confounds
28.2.2 Randomized Controlled Experiments
28.2.3 DAGs
28.3 When correlation is (not) good enough
28.4 Multiple regression, and causal inference
28.4.1 Imaginary scenario
28.5 Additional reading
28.6 Wrap up
28.7 Quiz
29 Bias and bad science
29.1 Review / Setup
29.2 The dark origin of stats
29.3 Poor statistical conclusions motivated by race are common
29.4 Modern machine learning offers new opportunities for amplifying bias
29.5 Quiz
VI Beyond the standard linear model
30 Likelihood based inference
30.1 Likelihood Based Inference
30.1.1 Remember probabilities?
30.1.2 Likelihoods
30.1.3 Log likelihoods
30.2 Log Likelihood of \(\mu\)
30.2.1 The likelihood profile
30.2.2 Maximum likelihood estimate
30.3 Likelihood based inference: Uncertainty and hypothesis testing.
30.3.1 Example 1: Are species moving uphill
30.4 Bayesian inference
30.4.1 Prior sensitivity
30.5 Quiz
31 Analyzing proportions
31.1 Probabilities and proportions
31.2 Estimating parameters and testing if they deviate from a mathematical distribution
31.3 Deriving the Binomial distribution.
31.3.1 Example: Hardy-Weinberg Equilibrium
31.3.2 General case of binomial sampling
31.4 The binomial equation
31.4.1 The binomial equation in R
31.4.2 Simulating from a binomial
31.5 Quantifying Variability in a Binomial sample
31.6 Uncertainty in our estimate of the proportion
31.7 Testing the null hypothesis that \(p = p_0\)
31.7.1 Testing the null hypothesis that \(p = p_0\) in R
31.7.2 Using the binomial sampling distribution for Bayesian Inference
31.8 Assumptions of the binomial distribution
31.9 Example: Does latex breed diversity?
31.9.1 Estimate variability and uncertainty
31.9.2 Hypothesis Testing
31.9.3 Conclusion
31.10 Testing if data follow a binomial distribution
31.11 Quiz
32 Generalized linear models I: Yes/No
32.1 Review
32.1.1 Linear models and their assumptions
32.1.2 Likelihood-Based Inference
32.2 Generalized Linear Models
32.3 GLM Example 1: “logistic regression”
32.4 Logistic Regression Example: Cold fish
32.4.1 (Imperfect) option 1: Linear model with zero / one data
32.4.2 (Imperfect) option 2: Linear model with proportions
32.4.3 (Better) option 3 – logistic regression
32.4.4 Assumptions of generalized linear models
32.5 Quiz
32.6 Advanced / under the hood.
32.6.1 How did we find the maximum likelihood?
33 Ordination
33.1 Yaniv’s lecture (required)
33.2 More (optional) resources
33.3 Quiz
34 Random effects
34.1 Review
34.1.1 Linear models
34.1.2 Analysis of Variance (ANOVA)
34.2 Intro to Random effects
34.2.1 Fixed vs random effects
34.2.2 Motivating Example:
34.2.3 Estimating variance components
34.2.4 Hypothesis testing for random effects
34.2.5 Repeatability
34.2.6 Shrinkage
34.2.7 Assumptions of random effects models
34.2.8 Random effects Quiz
34.3 Mixed effects models
34.3.1 Thinking about mixed effects models
34.3.2 Mixed model: Random intercept example
34.3.3 Example II: Loblolly Pine Growth
34.3.4 Mixed effects quiz
34.4 Additional related topics
34.4.1 Generalized Least Squares
34.4.2 More complex covariance structures
34.4.3 Generalized Linear Mixed Effect Models
VII Doing stats out of stats class
35 Collecting and storing data
35.1 Protecting and nurturing your data
35.2 Data in spreadsheets
35.2.1 Entering data into spreadsheets
35.2.2 Dealing with (other people’s) data in spreadsheets
35.2.3 Saving data
35.2.4 Organizing folders
35.3 Loading A Spreadsheet into R
35.4 Tidying messy data
Assignment
Quiz
35.5 Functions covered in Handling data in R
35.5.1 tidyr cheat sheet
35.5.2 importing data cheat sheet
36 Data presentation and visualization. Goals, considerations, and best practices
36.1 Why make a plot?
36.1.1 Why make exploratory plots?
36.1.2 Why make explanatory plots?
36.2 Telling a story and making a point
36.2.1 What point are you making???
36.2.2 Did you make your intended point?
36.2.3 The process
36.3 The audience and the goal
36.3.1 The audience.
36.3.2 The goal
36.4 Making a good plot
36.4.1 Be honest
36.4.2 Be transparent
36.4.3 Be clear
36.4.4 Consider Accessibility and Universal Design
36.5 Writing about and discussing figures
36.5.1 Writing about figures in text
36.5.2 Writing good figure legends
36.6 Data tables
37 Making betteR figures
37.1 How to implement best practices and make fun and attractive plots
37.2 Review – what makes a good plot
37.3 Combining plots to tell a story.
37.4 Making plots attuned to the audience and method of presentation.
37.4.1 Considering your method of presentation
Considering your method of presentation: Adjusting text size
37.5 Making Honest, Transparent, Clear, Accessible, and Fun plots
37.5.1 Making honest figures
37.5.2 Show your data to be transparent
37.5.3 Supercharge your ggplot skills to make clear plots
37.5.4 Accessibility and Universal Design
37.6 Extra style
37.6.1 Place your legend
37.6.2 Pick your theme
References
Chapter 25 Break
Work on your linear model project.