Chapter 4 Geocentric models
This chapter introduces linear regression as a Bayesian procedure. Under a probability interpretation, which is necessary for Bayesian work, linear regression uses a Gaussian (normal) distribution to describe our golem’s uncertainty about some measurement of interest. This type of model is simple, flexible, and commonplace. Like all statistical models, it is not universally useful. But linear regression has a strong claim to being foundational, in the sense that once you learn to build and interpret linear regression models, you can more easily move on to other types of regression which are less normal. —McElreath, p. 71
4.1 Why normal distributions are normal
Imagine there are 1000 people standing on the center line of a soccer field and they each have a coin. They each toss their coin 16 times and with each toss, heads means one step left, and tails means one step right. After everyone has done their 16 tosses, can you imagine the distribution of where they would be on the field? It would be roughly normal centered on the center of the field.
4.1.1 Normal by addition
To simulate this in R, we can give each coin toss a random step size between -1 (a step left) and 1 (a step right) and then add the 16 steps together.
pos <- replicate(1000, sum(runif(16, -1, 1)))
plot(density(pos))
Now let's visualize each person's walk.
#load packages
library(dplyr)
library(tidyr)
library(ggplot2)

#set seed for reproducible plots
set.seed(11)

#create dataframe
pos_df <- replicate(100, runif(16, -1, 1)) %>% #simulation (turned down to 100 people)
  as_tibble() %>% # make it a tibble
  rbind(0, .) %>% # add a row of zeros above the coin tosses as a starting point
  mutate(step = 0:16) %>% #this is the step index
  gather(key, value, -step) %>% #this will convert the data into long format
  mutate(person = rep(1:100, each = 17)) %>% #this adds a person index (17 rows per person: the starting 0 plus 16 tosses)
  group_by(person) %>% #for each person
  mutate(position = cumsum(value)) %>% #create a cumulative sum for each person
  ungroup()

#create plot
pos_df %>%
  ggplot(aes(x = step, y = position, group = person)) +
  geom_vline(xintercept = c(4, 8, 16), linetype = 2) +
  geom_line(aes(color = person < 2, alpha = person < 2)) +
  scale_color_manual(values = c('skyblue4', 'black')) +
  scale_alpha_manual(values = c(1/5, 1)) +
  scale_x_continuous('step number', breaks = c(0, 4, 8, 12, 16)) +
  theme(legend.position = 'none',
        panel.grid = element_blank())
You can see that the more steps you add, the more normal the distribution becomes.
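To check this claim, here is a quick sketch (using the pos_df data frame from the chunk above) that compares the distribution of positions after 4, 8, and 16 steps:
#compare the spread of positions at 4, 8, and 16 steps (assumes pos_df is in memory)
pos_df %>%
  filter(step %in% c(4, 8, 16)) %>%
  ggplot(aes(x = position)) +
  geom_density() +
  facet_wrap(~step, scales = 'free') +
  theme(panel.grid = element_blank())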
4.1.2 Normal by multiplication
Suppose that an organism's growth rate is influenced by 12 loci, each with alleles that add a small percentage to growth, and that these loci interact with each other so that their effects multiply.
set.seed(11)
prod(1 + runif(12, 0, 0.1))
## [1] 1.448587
Or more explicitly:
set.seed(11)
a = 1
b = runif(12, 0, 0.1)
c = a + b
prod(c)
## [1] 1.448587
as a density:
set.seed(11)
growth <- replicate(10000, prod(1 + runif(12, 0, 0.1)))
plot(density(growth))
Now see how the distribution behaves if you change the effect size of the loci
set.seed(11)
big <- replicate(10000, prod(1 + runif(12, 0, 0.5)))
plot(density(big))
set.seed(11)
small <- replicate(10000, prod(1 + runif(12, 0, 0.01)))
plot(density(small))
4.2 A language for describing models
Learning the language
1. Recognize the variables you are working with. Observable variables are data; unobservable variables are parameters.
2. Define each variable in terms of the other variables or as a probability distribution.
3. The combination of variables and their probability distributions defines a joint generative model that can be used to simulate hypothetical observations as well as analyze real ones.
Summarize the model in a mathy way:
\[\begin{aligned} y_{i} &\sim \text{Normal}(\mu_{i}, \sigma) \\ \mu_{i} &= \beta x_{i} \\ \beta &\sim \text{Normal}(0, 10) \\ \sigma &\sim \text{Exponential}(1) \\ x_{i} &\sim \text{Normal}(0, 1) \end{aligned}\]
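Because this is a joint generative model, we can simulate hypothetical observations directly from it. Here is a minimal sketch (my own illustration, not code from the text) that draws each variable from its definition above:
#simulate one hypothetical data set from the generic model
set.seed(11)
n <- 100
beta <- rnorm(1, 0, 10) #beta ~ Normal(0, 10)
sigma <- rexp(1, 1) #sigma ~ Exponential(1)
x <- rnorm(n, 0, 1) #x_i ~ Normal(0, 1)
mu <- beta * x #mu_i = beta * x_i
y <- rnorm(n, mu, sigma) #y_i ~ Normal(mu_i, sigma)
plot(x, y)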
4.2.1 Re-describing the globe tossing model.
Recall that the globe tossing model had the variables defined as:
\[\begin{aligned} W &\sim \text{Binomial}(N, p) \\ p &\sim \text{Uniform}(0, 1) \end{aligned}\]
These definitions carry implied meanings. For example, the binomial distribution of \(W\) implies that each of the \(N\) tosses is independent of the others. The first line of these simple models is the likelihood from Bayes’ theorem. The remaining lines define priors for any parameters.
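Because these two lines also define a generative model, we can simulate globe tosses from it. A minimal sketch, assuming \(N = 9\) tosses as in the earlier globe examples:
#simulate water counts: draw p from its prior, then W from the binomial likelihood
set.seed(11)
p <- runif(1e4, 0, 1) #p ~ Uniform(0, 1)
W <- rbinom(1e4, size = 9, prob = p) #W ~ Binomial(N, p) with N = 9
plot(table(W))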
4.3 Gaussian model of height
In this section we will build the beginnings of a regression model in which the outcome follows a Gaussian distribution with mean \(\mu\) and standard deviation \(\sigma\). The model takes in the data and evaluates the plausibility of every combination of \(\mu\) and \(\sigma\), producing what is essentially a distribution of distributions.
4.3.1 The data
Let’s load up McElreath’s data
library(rethinking)
data(Howell1)
d <- Howell1
Now we can see the new data frame
str(d)
## 'data.frame': 544 obs. of 4 variables:
## $ height: num 152 140 137 157 145 ...
## $ weight: num 47.8 36.5 31.9 53 41.3 ...
## $ age : num 63 63 65 41 51 35 32 27 19 54 ...
## $ male : int 1 0 0 1 0 1 0 1 0 1 ...
precis(d, hist=FALSE) #rethinking package
## mean sd 5.5% 94.5%
## height 138.2635963 27.6024476 81.108550 165.73500
## weight 35.6106176 14.7191782 9.360721 54.50289
## age 29.3443934 20.7468882 1.000000 66.13500
## male 0.4724265 0.4996986 0.000000 1.00000
And for this model we will just be using height of adults
d2 <- d[d$age >= 18, ] #create a new data frame with only the rows where age is 18 or greater
4.3.2 The Model
We want to model these heights with a Gaussian distribution. Let’s see the data first
dens(d2$height)
This is pretty normal looking, so we can go ahead and set the model up to use a Gaussian distribution to describe the probability distribution of the data. Because we are still unsure which particular Gaussian (which mean and standard deviation) is the right one, we will leave \(\mu\) and \(\sigma\) as parameters to be estimated.
\[\begin{equation} h_i \sim \text{Normal}(\mu, \sigma) \end{equation}\]
This is our likelihood function for the model. Next we will need to define our \(\mu\) and \(\sigma\) variables.
\[\begin{aligned} \mu &\sim \text{Normal}(178, 20) \\ \sigma &\sim \text{Uniform}(0, 50) \end{aligned}\]
Here McElreath uses 178 cm as the mean because that is his height, and spreads the 95% interval 40 cm (two standard deviations) in either direction (138–218 cm). Let's see what this prior looks like
curve(dnorm(x, 178, 20), from = 100, to = 250)
Here we are looking across heights from 100 cm to 250 cm, and you can see that the prior puts most of its probability mass between roughly 140 and 220 cm.
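A quick check of that interval (the central 95% of a Normal(178, 20) prior, which comes out close to the two-standard-deviation range quoted above):
qnorm(c(0.025, 0.975), mean = 178, sd = 20)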
\(\sigma\) has a uniform prior that holds its value somewhere between 0 and 50 cm.
curve(dunif(x, 0, 50), from = -10, to = 60)
Here you can see that any value from 0 to 50 has equal prior probability. Now we can create a prior predictive distribution by simulating heights from the priors we've defined, to see what they imply before the model sees the observed heights.
sample_mu <- rnorm(1e4, 178, 20) #random draws from the mu prior
sample_sigma <- runif(1e4, 0, 50) #random draws from the sigma prior
prior_h <- rnorm(1e4, sample_mu, sample_sigma) #simulate heights implied by the priors
dens(prior_h)
4.3.3 Grid approximation of the posterior distribution
Here we will approximate the posterior in detail with grid approximation, for learning purposes. First is the method presented by McElreath, and second a tidyverse version adapted from Kurz.
mu_list <- seq(from = 150, to = 160, length.out = 100) #100 equally spaced values from 150 to 160
sigma_list <- seq(from = 7, to = 9, length.out = 100) #100 equally spaced values from 7 to 9
post <- expand.grid(mu = mu_list, sigma = sigma_list) #expand to a grid of all combinations of the above (100 x 100)
post$LL <- sapply(1:nrow(post), function(i) sum(
  dnorm(d2$height, post$mu[i], post$sigma[i], log = TRUE))) #calculate the log likelihood for each row of post across the data values of d2$height
post$prod <- post$LL + dnorm(post$mu, 178, 20, TRUE) + dunif(post$sigma, 0, 50, TRUE) #add the log priors to the log likelihood
post$prob <- exp(post$prod - max(post$prod)) #relative posterior probability: subtract the maximum before exponentiating so the values don't all round to zero
Now we can see the posterior with:
contour_xyz(post$mu, post$sigma, post$prob)
or:
image_xyz(post$mu, post$sigma, post$prob)
In tidyverse language
n <- 100

d_grid <-
  tibble(mu    = seq(from = 150, to = 160, length.out = n),
         sigma = seq(from = 7, to = 9, length.out = n))

d_grid <- d_grid %>%
  tidyr::expand(mu, sigma)
head(d_grid)
## # A tibble: 6 × 2
## mu sigma
## <dbl> <dbl>
## 1 150 7
## 2 150 7.02
## 3 150 7.04
## 4 150 7.06
## 5 150 7.08
## 6 150 7.10
grid_function <- function(mu, sigma){
  dnorm(d2$height, mean = mu, sd = sigma, log = TRUE) %>%
    sum()
}
d_grid <- d_grid %>%
  mutate(log_likelihood = purrr::map2(mu, sigma, grid_function)) %>%
  unnest(cols = c(log_likelihood)) %>%
  mutate(prior_mu    = dnorm(mu, mean = 178, sd = 20, log = TRUE),
         prior_sigma = dunif(sigma, min = 0, max = 50, log = TRUE)) %>%
  mutate(product = log_likelihood + prior_mu + prior_sigma) %>%
  mutate(probability = exp(product - max(product)))
head(d_grid)
## # A tibble: 6 × 7
## mu sigma log_likelihood prior_mu prior_sigma product probability
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 150 7 -1299. -4.89 -3.91 -1308. 1.96e-35
## 2 150 7.02 -1298. -4.89 -3.91 -1307. 3.78e-35
## 3 150 7.04 -1298. -4.89 -3.91 -1307. 7.20e-35
## 4 150 7.06 -1297. -4.89 -3.91 -1306. 1.36e-34
## 5 150 7.08 -1296. -4.89 -3.91 -1305. 2.53e-34
## 6 150 7.10 -1296. -4.89 -3.91 -1305. 4.65e-34
d_grid %>%
  ggplot(aes(x = mu, y = sigma, z = probability)) +
  geom_contour() +
  labs(x = expression(mu),
       y = expression(sigma)) +
  xlim(range(d_grid$mu)) +
  ylim(range(d_grid$sigma)) +
  theme(panel.grid = element_blank())

d_grid %>%
  ggplot(aes(x = mu, y = sigma)) +
  geom_raster(aes(fill = probability), interpolate = TRUE) +
  scale_fill_viridis_c(option = 'A') +
  labs(x = expression(mu),
       y = expression(sigma)) +
  theme(panel.grid = element_blank())
4.3.4 Sampling from the posterior
Now that we have a posterior calculated, we can pull samples from it.
sample_rows <- sample(1:nrow(post), size = 1e4, replace = TRUE, prob = post$prob) #sample rows in proportion to their probability
sample_mu <- post$mu[sample_rows] #index samples of mu by the rows sampled
sample_sigma <- post$sigma[sample_rows] #index samples of sigma by the rows sampled
Now we can plot these samples to see where the most common values for \(\mu\) and \(\sigma\) might overlap
plot(sample_mu, sample_sigma, cex = 1, pch = 16, col = col.alpha(rangi2, 0.1))
We can also look at the samples individually
dens(sample_mu)
dens(sample_sigma)
And we can get summary points from these sample distributions just like before
PI(sample_mu)
## 5% 94%
## 153.9394 155.2525
PI(sample_sigma)
## 5% 94%
## 7.323232 8.252525
And the tidyverse version:
d_grid_samples <- d_grid %>%
  sample_n(size = 1e4, replace = TRUE, weight = probability)
d_grid_samples %>%
  ggplot(aes(x = mu, y = sigma)) +
  geom_point(size = 0.9, alpha = 1/15) +
  scale_fill_viridis_c() +
  labs(x = expression(mu[samples]),
       y = expression(sigma[samples])) +
  theme(panel.grid = element_blank())
d_grid_samples %>%
  select(mu, sigma) %>%
  gather() %>%
  ggplot(aes(x = value)) +
  geom_density(fill = 'grey35', size = 0) +
  scale_y_continuous(NULL, breaks = NULL) +
  xlab(NULL) +
  theme(panel.grid = element_blank()) +
  facet_wrap(~key, scales = 'free')
library(tidybayes)
d_grid_samples %>%
  select(mu, sigma) %>%
  gather() %>%
  group_by(key) %>%
  mode_hdi(value)
## # A tibble: 2 × 7
## key value .lower .upper .width .point .interval
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 mu 155. 154. 156. 0.95 mode hdi
## 2 sigma 7.75 7.22 8.35 0.95 mode hdi
d_grid_samples %>%
  select(mu, sigma) %>%
  gather() %>%
  group_by(key) %>%
  median_qi(value, .width = 0.5)
## # A tibble: 2 × 7
## key value .lower .upper .width .point .interval
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 mu 155. 154. 155. 0.5 median qi
## 2 sigma 7.77 7.57 7.97 0.5 median qi
4.3.5 Finding the posterior distribution with quap
Recall that quadratic approximation uses the maximum a posteriori (MAP) as the peak of the distribution and creates the shape of the distribution as a negative quadratic based on the slope near the peak.
To set up, let's rerun the data input.
data(Howell1)
d <- Howell1
d2 <- d[d$age >= 18, ]
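As a rough sketch of the idea described above (my own illustration, not quap's actual internals), we could find the MAP ourselves with optim() and use the curvature of the log posterior at that peak for the spread of the approximation:
#MAP by optimization plus Hessian-based spread (a sketch of the quadratic approximation idea)
neg_log_post <- function(par) {
  mu <- par[1]; sigma <- par[2]
  if (sigma <= 0) return(Inf) #keep the optimizer away from invalid sigma values
  -(sum(dnorm(d2$height, mu, sigma, log = TRUE)) +
      dnorm(mu, 178, 20, log = TRUE) +
      dunif(sigma, 0, 50, log = TRUE))
}
fit <- optim(c(170, 10), neg_log_post, hessian = TRUE)
fit$par #MAP estimates of mu and sigma
sqrt(diag(solve(fit$hessian))) #approximate posterior standard deviations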
Now we have to define our model parameters in R code
flist <- alist(
  height ~ dnorm(mu, sigma),
  mu ~ dnorm(178, 20),
  sigma ~ dunif(0, 50)
)
And then we can fit the model to the data
m4.1 <- quap(flist, data = d2)
and the posterior is available to view here
precis(m4.1)
## mean sd 5.5% 94.5%
## mu 154.606930 0.4120200 153.948443 155.265418
## sigma 7.731809 0.2914308 7.266046 8.197571
Changing the prior for \(\mu\) to be quite narrow
m4.2 <- quap(
  alist(
    height ~ dnorm(mu, sigma),
    mu ~ dnorm(178, 0.1),
    sigma ~ dunif(0, 50)
  ),
  data = d2)
precis(m4.2)
## mean sd 5.5% 94.5%
## mu 177.86375 0.1002354 177.70356 178.02395
## sigma 24.51748 0.9289156 23.03289 26.00207
Notice that because the prior on \(\mu\) was so narrow, its posterior mean barely moved from the prior value. \(\sigma\), on the other hand, changed a great deal from the previous example even though its prior stayed the same: since the model cannot move the mean, it has to absorb the variation in the data through a larger standard deviation.
4.3.6 Sampling with quap
Because a quadratic approximation with more than one parameter is inherently multi-dimensional, we need access to the covariances between pairs of parameters. Here is the variance-covariance matrix for the first model.
vcov(m4.1)
## mu sigma
## mu 0.1697604695 0.0002160376
## sigma 0.0002160376 0.0849319320
The above contains both the variance of each parameter and the covariance describing how the parameters change together. To see them separately:
diag(vcov(m4.1)) #variance of each parameter (sqrt = sd)
## mu sigma
## 0.16976047 0.08493193
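Taking square roots of those variances recovers the standard deviations reported by precis(m4.1) above:
sqrt(diag(vcov(m4.1))) #should match the sd column of precis(m4.1)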
cov2cor(vcov(m4.1)) #Correlation matrix of the parameters
## mu sigma
## mu 1.000000000 0.001799184
## sigma 0.001799184 1.000000000
The correlation matrix above indicates that \(\mu\) and \(\sigma\) are essentially uncorrelated. Now let's pull vectors of values from the multi-dimensional posterior (made easy with rethinking's extract.samples()).
post <- extract.samples(m4.1, n = 1e4)
head(post)
## mu sigma
## 1 154.7748 7.358637
## 2 155.5361 7.869098
## 3 154.4503 7.648488
## 4 154.9172 7.283003
## 5 153.7071 7.978019
## 6 154.6316 8.848559
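For a quap fit, extract.samples() is essentially drawing from a multivariate normal defined by the MAP means and the variance-covariance matrix; here is a sketch of the same idea done by hand with MASS::mvrnorm:
#sample the quadratic approximation by hand (a sketch, not rethinking's exact code)
library(MASS)
post_manual <- mvrnorm(n = 1e4, mu = coef(m4.1), Sigma = vcov(m4.1))
head(post_manual)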
Now we can compare the posterior samples to the quadratic approximation (MAP) values.
precis(post)
## mean sd 5.5% 94.5% histogram
## mu 154.610172 0.4093737 153.948552 155.26072 ▁▁▅▇▂▁▁
## sigma 7.728621 0.2873045 7.267368 8.18118 ▁▁▁▂▅▇▇▃▁▁▁▁
4.4 Linear Prediction
Now that we have a model for height, we can get to the ‘regression’ part where we add a predictor variable. Let’s add weight to our height model and see what that looks like
data(Howell1); d <- Howell1; d2 <- d[d$age >= 18, ]
plot(d2$height ~ d2$weight)
Seems to be a pretty clear relationship between height and weight. The linear model strategy will take the predictor variable (weight) and assume that it has a constant additive relationship to the mean of the outcome (height). Now we have to put our predictor into the model structure.
\[\begin{aligned} h_{i} &\sim \text{Normal}(\mu_{i}, \sigma) \\ \mu_{i} &= \alpha + \beta (x_{i} - \overline{x}) \\ \alpha &\sim \text{Normal}(178, 20) \\ \beta &\sim \text{Normal}(0, 10) \\ \sigma &\sim \text{Uniform}(0, 50) \end{aligned}\]
4.4.0.1 Probability of the data (first line of the model)
Now that both the height \(h_{i}\) and the mean \(\mu_{i}\) are indexed by \(i\), the mean is no longer a single shared parameter: each data point or row in the dataframe gets its own \(\mu_{i}\), computed from that row's predictor value.
4.4.0.2 Linear model (second line of the model)
The value of \(\mu_{i}\) is now made from a combination of \(\alpha\), \(\beta\), and observed weight data (\(x\)) where \(\overline{x}\) describes the mean of all weight values observed.
There are two parts to this line in the model that make it ‘linear’
1. When \(x_{i} = \overline{x}\), the term \(\beta(x_{i} - \overline{x})\) is 0 and \(\mu_{i} = \alpha\), which makes \(\alpha\) the 'intercept': the expected height at the average weight.
2. When \(x_{i}\) increases by one unit, the expected height changes by \(\beta\) units, which makes \(\beta\) the 'slope' (see the worked example below).
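For example, with round numbers close to the eventual estimates (\(\alpha = 154\), \(\beta = 0.9\), \(\overline{x} = 45\) kg), a person weighing 50 kg has an expected height of \(\mu = 154 + 0.9 \times (50 - 45) = 158.5\) cm.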
4.4.0.3 Priors (lines 3-5 of the model)
The \(\alpha\) and \(\sigma\) priors are basically the same as before (what was the prior for \(\mu\) is now the prior for \(\alpha\)). The \(\beta\) prior is deliberately centered on 0 because we don't want to assume the direction of the relationship between weight and height in advance (even though the plot suggests a positive slope). To illustrate what these priors imply, McElreath simulates the prior predictive distribution using the observed weights.
set.seed(2971)
N <- 100 # 100 lines
a <- rnorm(N, 178, 20) # alpha prior
b <- rnorm(N, 0, 10) # beta prior
Let’s see how this random selection from the priors looks
plot(NULL, xlim = range(d2$weight), ylim = c(-100, 400),
xlab = 'weight', ylab = 'height')
abline(h = 0, lty = 2) #dashed line at height = 0
abline(h = 272, lty = 1, lwd = 0.5) #solid line at world's tallest person
mtext('b ~ dnorm(0, 10)') #plot title
xbar <- mean(d2$weight) #average weight
for(i in 1:N){
curve(a[i] + b[i]*(x - xbar), #linear equation for each pair
from = min(d2$weight), to = max(d2$weight), add = TRUE,
col = col.alpha('black',0.2))
}
This is clearly absurd: many of these prior lines imply adults with negative heights or heights well above 272 cm (the world's tallest recorded person). We can fix this by constraining \(\beta\) to be positive with a log-normal prior.
b <- rlnorm(1e4, 0, 1)
dens(b, xlim = c(0, 5), adj = 0.1)
Now we can use this new \(\beta\) prior in the last plot
set.seed(2971)
N <- 100 # 100 lines
a <- rnorm(N, 178, 20) # alpha prior
b <- rlnorm(N, 0, 1) # NEW beta prior
plot(NULL, xlim = range(d2$weight), ylim = c(-100, 400),
xlab = 'weight', ylab = 'height')
abline(h = 0, lty = 2) #dashed line at height = 0
abline(h = 272, lty = 1, lwd = 0.5) #solid line at world's tallest person
mtext('log(b) ~ dnorm(0, 1)') #plot title
xbar <- mean(d2$weight) #average weight
for(i in 1:N){
curve(a[i] + b[i]*(x - xbar), #linear equation for each pair
from = min(d2$weight), to = max(d2$weight), add = TRUE,
col = col.alpha('black',0.2))
}
This new plot seems far more plausible for an adult population.
4.4.1 Finding the posterior distribution
Just like before, we can use the quap function to get a quadratic approximation, or alternatively we can use brms to fit the model.
quap method
#reload the data
data(Howell1); d <- Howell1; d2 <- d[d$age >= 18, ]

#define the mean weight of observed values
xbar <- mean(d2$weight)

#fit the model
m4.3 <- quap(
  alist(
    height ~ dnorm(mu, sigma), #height likelihood distribution
    mu <- a + b*(weight - xbar), #linear equation
    a ~ dnorm(178, 20), #alpha prior
    b ~ dlnorm(0, 1), #beta prior
    sigma ~ dunif(0, 50) #sigma prior
  ),
  data = d2
)
precis(m4.3)
## mean sd 5.5% 94.5%
## a 154.6013671 0.27030766 154.1693633 155.0333710
## b 0.9032807 0.04192363 0.8362787 0.9702828
## sigma 5.0718809 0.19115478 4.7663786 5.3773831
brms method
library(brms)
b4.3_priors <- get_prior(data = d2, formula = height ~ 1 + weight)

b4.3 <- brm(data = d2, #data to be used
            family = gaussian, #likelihood of height
            height ~ 1 + weight, #linear equation
            prior = b4.3_priors,
            iter = 41000, warmup = 40000, chains = 4, cores = 4, seed = 11) #model mechanics
## Compiling Stan program...
## Start sampling
The plot
plot(b4.3)
The alpha values (b_Intercept in brms) appear quite different here. This is because we haven't centered the weight values yet (\(x - \overline{x}\)), so let's do that and run it again.
d2 <- d2 %>%
  mutate(weight_c = weight - mean(weight))

b4.3b_priors <- get_prior(data = d2, formula = height ~ 1 + weight_c)

b4.3b <- brm(data = d2, #data to be used
             family = gaussian, #likelihood of height
             height ~ 1 + weight_c, #linear equation
             prior = b4.3b_priors,
             iter = 41000, warmup = 40000, chains = 4, cores = 4, seed = 11) #model mechanics
## Compiling Stan program...
## recompiling to avoid crashing R session
## Start sampling
plot(b4.3b)
4.4.2 Interpreting the posterior distribution
The model can only report the posterior distribution. This is the right answer to the question you asked. But it’s your responsibility to process the answer and make sense of it.
—McElreath, p.98
You can make sense of model outputs in two ways:
1. read summary tables
2. plot simulations
Reading summary tables becomes rather complicated as the models become more complex.
4.4.2.1 Tables of marginal distributions
precis(m4.3)
## mean sd 5.5% 94.5%
## a 154.6013671 0.27030766 154.1693633 155.0333710
## b 0.9032807 0.04192363 0.8362787 0.9702828
## sigma 5.0718809 0.19115478 4.7663786 5.3773831
In this table, the first row is the marginal quadratic approximation for \(\alpha\), the second for \(\beta\), and the last for \(\sigma\). The \(\beta\) estimate of about 0.90 means that a person 1 kg heavier is expected to be about 0.90 cm taller.
repeated for our brms model
round(posterior_summary(b4.3b, probs = c(0.055,0.945))[1:3,], 2)
## Estimate Est.Error Q5.5 Q94.5
## b_Intercept 154.59 0.27 154.16 155.03
## b_weight_c 0.91 0.04 0.84 0.97
## sigma 5.10 0.19 4.81 5.41
Now let’s see the variance-covariance matrix
round(vcov(m4.3), 3)
## a b sigma
## a 0.073 0.000 0.000
## b 0.000 0.002 0.000
## sigma 0.000 0.000 0.037
repeated for brms model
posterior_samples(b4.3b) %>%
select(-lp__) %>%
cor() %>%
round(digits = 2)
## Warning: Method 'posterior_samples' is deprecated. Please see ?as_draws for
## recommended alternatives.
## b_Intercept b_weight_c sigma
## b_Intercept 1.00 0.03 0.05
## b_weight_c 0.03 1.00 -0.05
## sigma 0.05 -0.05 1.00
Visual approach
pairs(m4.3)
Very little covariance here.
4.4.2.2 Plotting the posterior inference against the data
plot(height ~ weight, data = d2, col = rangi2) #plot the data points
post <- extract.samples(m4.3) #extract samples from the posterior
a_map <- mean(post$a) #get the mean of the samples of alpha
b_map <- mean(post$b) #get the mean of the samples of beta
curve(a_map + b_map*(x - xbar), add = TRUE)
This line is very plausible, but there are many others that are nearly as plausible.
d2 %>%
  ggplot(aes(x = weight, y = height)) +
  geom_abline(intercept = fixef(b4.3)[1],
              slope = fixef(b4.3)[2]) +
  geom_point(shape = 1, size = 2, color = 'royalblue') +
  theme_bw() +
  theme(panel.grid = element_blank())
4.4.2.3 Adding uncertainty around the mean
There are many combinations of \(\alpha\) and \(\beta\) that could be plausible
post <- extract.samples(m4.3)
post[1:5, ]
## a b sigma
## 1 154.5789 0.9376825 5.220756
## 2 154.4067 0.8937310 4.752735
## 3 154.4622 0.9150822 5.341227
## 4 154.2649 0.9236067 5.160423
## 5 155.1258 0.9495934 5.108891
We will also alter the number of observations to see how it influences the spread of these combinations.
N <- 10 #let's pull 10 data points
dN <- d2[1:N, ]
mN <- quap(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b*(weight - mean(weight)),
    a ~ dnorm(178, 20),
    b ~ dlnorm(0, 1),
    sigma ~ dunif(0, 50)
  ), data = dN)
#and plot 20 samples
post <- extract.samples(mN, n = 20)

#display data and sample size
plot(dN$weight, dN$height, xlim = range(d2$weight), ylim = range(d2$height),
     col = rangi2, xlab = 'weight', ylab = 'height')
mtext(concat('N = ', N))
#add sample lines
for(i in 1:20){
  curve(post$a[i] + post$b[i]*(x - mean(dN$weight)),
        col = col.alpha('black', 0.3), add = TRUE)
}
You can see that the plausible regression lines spread out much more at the extreme weight values. This pattern fades when the model has more data to work with and becomes more confident about the mean.
N <- 300 #let's pull 300 data points
dN <- d2[1:N, ]
mN <- quap(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b*(weight - mean(weight)),
    a ~ dnorm(178, 20),
    b ~ dlnorm(0, 1),
    sigma ~ dunif(0, 50)
  ), data = dN)

#and plot 25 samples
post <- extract.samples(mN, n = 25)

#display data and sample size
plot(dN$weight, dN$height, xlim = range(d2$weight), ylim = range(d2$height),
     col = rangi2, xlab = 'weight', ylab = 'height')
mtext(concat('N = ', N))
#add sample lines
for(i in 1:25){
  curve(post$a[i] + post$b[i]*(x - mean(dN$weight)),
        col = col.alpha('black', 0.3), add = TRUE)
}
4.4.2.4 Plotting regression intervals and contours
Let’s use 50 kg as a focal weight and grab some samples
post <- extract.samples(m4.3)
mu_at_50 <- post$a + post$b * (50 - xbar)
Now we can look at the distribution of plausible heights for a person of 50kg weight
dens(mu_at_50, col = rangi2, xlab = 'mu|weight = 50kg')
The average height (\(\mu\)) for someone who weighs 50 kg is almost certainly between 158 and 160.5 cm.
PI(mu_at_50, prob = 0.89)
## 5% 94%
## 158.5706 159.6649
And 89% of the ways the model can produce the data place the average height of a 50 kg person between roughly 158.6 and 159.7 cm.
Now, how do we get this kind of distribution of \(\mu\) for every weight value, not just 50 kg? The link function does the repetitive work for us.
mu <- link(m4.3)
str(mu)
## num [1:1000, 1:352] 157 157 158 157 157 ...
Now we have a large matrix that has 1000 samples (rows) for 352 data points (observations). But we want to have increments of 1 kg steps across the x axis.
weight.seq <- seq(from = 25, to = 70, by = 1)

mu <- link(m4.3, data = data.frame(weight = weight.seq))
str(mu)
## num [1:1000, 1:46] 137 136 136 137 135 ...
Now our new matrix has 1000 samples across 46 weight steps. Let’s visualize the uncertainty of each weight value.
plot(height ~ weight, d2, type = 'n') #type 'n' hides the data
for (i in 1:100){
points(weight.seq, mu[i,], pch = 16, col = col.alpha(rangi2, 0.1))
}
Now we can compute a mean \(\mu\) value and the distribution for each weight value
mu_mean <- apply(mu, 2, mean)
mu_PI <- apply(mu, 2, PI, prob = 0.89)
And add them to the plot
plot(height ~ weight, data = d2, col = col.alpha(rangi2, 0.5))
shade(mu_PI, weight.seq, col = 'grey80')
lines(weight.seq, mu_mean)
4.4.2.5 Prediction Intervals
Now that we have a good handle on \(\mu_{i}\) we can combine it with \(\sigma\) to create estimates of height rather than just the mean.
sim.height <- sim(m4.3, data = list(weight = weight.seq))
str(sim.height)
## num [1:1000, 1:46] 142 139 138 138 145 ...
height.PI <- apply(sim.height, 2, PI, prob = 0.89)
plot(height ~ weight, d2, type = 'n')
shade(height.PI, weight.seq, col = 'grey85')
shade(mu_PI, weight.seq, col = 'grey60')
points(x = d2$weight, y = d2$height, pch = 16, col = col.alpha(rangi2, 0.5))
lines(weight.seq, mu_mean)
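For intuition, the work that sim() does can be sketched by hand: for each weight in weight.seq, draw heights from a Gaussian whose mean and standard deviation come from the posterior samples (a sketch of the same logic, not the function's exact code).
#roll our own simulated heights from the posterior samples
post <- extract.samples(m4.3)
sim_manual <- sapply(weight.seq, function(w)
  rnorm(n = nrow(post), mean = post$a + post$b * (w - xbar), sd = post$sigma))
str(sim_manual) #same shape as sim.height: samples (rows) by weight values (columns)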
Note: shade() seems to be broken for me; without specifying a colour, it is completely transparent.
If you want to smooth the edges of the interval, you can draw more samples in the sim() call.
sim.height <- sim(m4.3, data = list(weight = weight.seq), n = 1e4)
height.PI <- apply(sim.height, 2, PI, prob = 0.89)
plot(height ~ weight, d2, type = 'n')
#plot(height ~ weight, data = d2, col = col.alpha(rangi2, 0.5))
shade(height.PI, weight.seq, col = 'grey85')
shade(mu_PI, weight.seq, col = 'grey60')
points(x = d2$weight, y = d2$height, pch = 16, col = col.alpha(rangi2, 0.5))
lines(weight.seq, mu_mean)
4.5 Curves from lines
There are two common methods to build curves from lines. One is Polynomial Regression and the other is B-Splines.
4.5.1 Polynomial regression
This method uses powers of a variable (squares and cubes) as extra predictors. Let's grab all the height and weight data, not just adults.
data(Howell1)
d <- Howell1
plot(height~weight, d)
This is quite the curve. We can try to model it with a parabolic model:
\[\begin{equation} \mu_{i} = \alpha + \beta_{1} x_{i} + \beta_{2} x_{i}^{2} \end{equation}\]
where each weight observation \(x_{i}\) is now used twice: once, as before, in the linear term, and a second time squared, with \(\beta_{2}\) measuring the curvature.
To fit the model to the data we will need to standardize the predictor variable.
d$weight_s <- (d$weight - mean(d$weight))/sd(d$weight)
d$weight_s2 <- d$weight_s^2

m4.5 <- quap(
  alist(
    height ~ dnorm(mu, sigma), #same as before
    mu <- a + b1*weight_s + b2*weight_s2, #adding the b2 and squared term here
    a ~ dnorm(178, 20), #same
    b1 ~ dlnorm(0, 1), #same as the beta prior before
    b2 ~ dnorm(0, 1), #new prior that could be positive or negative
    sigma ~ dunif(0, 50)
  ), data = d)
Now try and interpret the results of the summary table
precis(m4.5)
## mean sd 5.5% 94.5%
## a 146.057411 0.3689756 145.467716 146.647105
## b1 21.733065 0.2888891 21.271364 22.194766
## b2 -7.803264 0.2741839 -8.241463 -7.365065
## sigma 5.774476 0.1764652 5.492450 6.056501
It is not very straightforward to see how the numbers in the table relate to the data.
Let’s find the mean for \(\mu\) again now with \(\beta_{1}\) and \(\beta_{2}\) in the equation.
weight.seq <- seq(from = -2.2, to = 2, length.out = 30) #grab 30 equally spaced (standardized) weight values
pred_dat <- list(weight_s = weight.seq, weight_s2 = weight.seq^2) #new weight values
mu <- link(m4.5, data = pred_dat) #get new estimates from new data
mu.mean <- apply(mu, 2, mean) #calculate the mu mean for the line
mu.PI <- apply(mu, 2, PI, prob = 0.89) #calculate an 89% interval around the mean for mu
sim.height <- sim(m4.5, data = pred_dat) #calculate estimates for heights based on mu and sigma with new weight values
height.PI <- apply(sim.height, 2, PI, prob = 0.89) #calculate 89% intervals of height
plot(height~weight_s, d, type = 'n')
lines(weight.seq, mu.mean)
shade(mu.PI, weight.seq)
shade(height.PI, weight.seq)
points(x = d$weight_s, y = d$height, col = col.alpha(rangi2, 0.5))
Now let’s try adding a cubic predictor term \(\beta_{3}x_{i}^3\)
d$weight_s3 <- d$weight_s^3

m4.6 <- quap(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b1*weight_s + b2*weight_s2 + b3*weight_s3,
    a ~ dnorm(178, 20),
    b1 ~ dlnorm(0, 1),
    b2 ~ dnorm(0, 1),
    b3 ~ dnorm(0, 1),
    sigma ~ dunif(0, 50)
  ), data = d)
weight.seq <- seq(from = -2.2, to = 2, length.out = 30) #grab 30 equally spaced (standardized) weight values
pred_dat <- list(weight_s = weight.seq, weight_s2 = weight.seq^2, weight_s3 = weight.seq^3) #new weight values
mu <- link(m4.6, data = pred_dat) #get new estimates from new data
mu.mean <- apply(mu, 2, mean) #calculate the mu mean for the line
mu.PI <- apply(mu, 2, PI, prob = 0.89) #calculate an 89% interval around the mean for mu
sim.height <- sim(m4.6, data = pred_dat) #calculate estimates for heights based on mu and sigma with new weight values
height.PI <- apply(sim.height, 2, PI, prob = 0.89) #calculate 89% intervals of height
plot(height~weight_s, d, type = 'n')
lines(weight.seq, mu.mean)
shade(mu.PI, weight.seq)
shade(height.PI, weight.seq)
points(x = d$weight_s, y = d$height, col = col.alpha(rangi2, 0.5))
And here is how you would plot the above with the original x-axis
plot(height ~ weight_s, d, col=col.alpha(rangi2, 0.5), type = 'n', xaxt = 'n') #turn off x axis text
#Rcode 4.71
at <- c(-2, -1, 0, 1, 2)
labels <- at*sd(d$weight) + mean(d$weight)
axis(side = 1, at = at, labels = round(labels, 1))
lines(weight.seq, mu.mean)
shade(mu.PI, weight.seq)
shade(height.PI, weight.seq)
points(x = d$weight_s, y = d$height, col = col.alpha(rangi2, 0.5))
4.5.2 Splines
Splines are smooth functions built from smaller component (basis) functions; the combined curve usually takes on a wiggly appearance.
#load cherry blossom bloom time over a thousand years
library(rethinking)
data(cherry_blossoms)
d <- cherry_blossoms
precis(d)
## mean sd 5.5% 94.5% histogram
## year 1408.000000 350.8845964 867.77000 1948.23000 ▇▇▇▇▇▇▇▇▇▇▇▇▁
## doy 104.540508 6.4070362 94.43000 115.00000 ▁▂▅▇▇▃▁▁
## temp 6.141886 0.6636479 5.15000 7.29470 ▁▃▅▇▃▂▁▁
## temp_upper 7.185151 0.9929206 5.89765 8.90235 ▁▂▅▇▇▅▂▂▁▁▁▁▁▁▁
## temp_lower 5.098941 0.8503496 3.78765 6.37000 ▁▁▁▁▁▁▁▃▅▇▃▂▁▁▁
Let’s see the bloom time over the years
plot(x=d$year, y=d$doy)
abline(h=mean(d$doy, na.rm = T), col = 'red', lty = 2)
Similar to the polynomial method, B-splines will break the predictor (year) into chunks with different basis functions to fit the data in that chunk. When we stitch the chunks together, we often get a wiggly line through the data.
The new linear function for \(\mu_{i}\) will look a bit different.
\[\begin{equation} \mu_{i} = \alpha + w_{1}B_{i,1} + w_{2}B_{i,2} + w_{3}B_{i,3} + \dots \end{equation}\]
Here, the \(B_{i,n}\) values are the basis functions (computed from year) and the \(w\) parameters are the weights for each function.
For the cherry blossom data we will place 15 knots at evenly spaced quantiles of year and have the basis functions operate between those knots.
d2 <- d[complete.cases(d$doy), ] #remove rows with missing doy
num_knots <- 15 #number of knots
knot_list <- quantile(d2$year, probs = seq(0, 1, length.out = num_knots))
knot_list
## 0% 7.142857% 14.28571% 21.42857% 28.57143% 35.71429% 42.85714% 50%
## 812 1036 1174 1269 1377 1454 1518 1583
## 57.14286% 64.28571% 71.42857% 78.57143% 85.71429% 92.85714% 100%
## 1650 1714 1774 1833 1893 1956 2015
So now that we have the break points in the year range of the data, we can choose a polynomial degree. This degree determines how many basis functions combine at any given point; if we choose a cubic spline, four basis functions are active in each year range.
library(splines)
B <- bs(d2$year, knots = knot_list[-c(1, num_knots)], degree = 3, intercept = TRUE)
We just created a matrix that contains 827 rows and 17 columns. Each row is a year in d2
and each column is a basis function. Let’s see what it looks like (fig 4.13a)
plot(NULL, xlim = range(d2$year), ylim=c(0,1), xlab='year', ylab='basis')
for(i in 1:ncol(B)){
lines(d2$year, B[,i])
}
points(x = knot_list, y = rep(0.9, length(knot_list)), col = 'red')
Now we need to calculate the parameter weights (\(w\)) for each basis function. To do this we need to run the model.
\[\begin{aligned} D_{i} &\sim \text{Normal}(\mu_{i}, \sigma) \\ \mu_{i} &= \alpha + \sum_{k=1}^{K} w_{k} B_{k,i} \\ \alpha &\sim \text{Normal}(100, 10) \\ w_{j} &\sim \text{Normal}(0, 10) \\ \sigma &\sim \text{Exponential}(1) \end{aligned}\]
m4.7 <- quap(
  alist(
    D ~ dnorm(mu, sigma),
    mu <- a + B %*% w,
    a ~ dnorm(100, 10),
    w ~ dnorm(0, 10),
    sigma ~ dexp(1)
  ),
  data = list(D = d2$doy, B = B),
  start = list(w = rep(0, ncol(B)))
)
Now we can extract some samples and look at the basis weights (fig 4.13b)
post <- extract.samples(m4.7)
w <- apply(post$w, 2, mean)
plot(NULL, xlim = range(d2$year), ylim = c(-6, 6), xlab = 'year', ylab = 'basis * weight')
for(i in 1:ncol(B)){
  lines(d2$year, w[i]*B[,i])
}
points(x = knot_list, y = rep(4.9, length(knot_list)), col = 'red')
Now we can calculate a posterior interval for \(\mu\) at each year (fig 4.13c)
mu <- link(m4.7)
mu_PI <- apply(mu, 2, PI, prob = 0.97)
mu_mean <- apply(mu, 2, mean)
plot(d2$year, d2$doy, col = col.alpha(rangi2, 0.3), pch = 16)
shade(mu_PI, d2$year, col = col.alpha('black', 0.5))
lines(d2$year, mu_mean)
A whole other class of similar models, GAMs (Generalized Additive Models), is worth looking into if you have time.
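As a pointer in that direction, here is a minimal sketch (assuming the mgcv package is installed) that fits the same cherry blossom trend with a GAM smooth on year:
#fit a GAM with a smooth term on year and plot the estimated smooth
library(mgcv)
m_gam <- gam(doy ~ s(year), data = d2)
plot(m_gam, residuals = TRUE, pch = 16, cex = 0.3)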