Chapter 7 Ulysses’ compass
The hero of Homer’s Odyssey, Ulysses, had to navigate between two dire dangers to complete his voyage: on one side, a many-headed beast with a taste for men; on the other, a sea monster that sank ships. In statistical modelling, we navigate equally treacherous waters while trying to avoid two things:
- Overfitting
- Underfitting
Both dangers lead to the same result: poor predictions. And all of this while still avoiding confounds, as we saw in earlier chapters.
7.1 The problem with parameters
Most people describe the performance of a model by its \(R^2\), the proportion of variance explained:
\[R^{2} = \frac{\text{var(outcome)} - \text{var(residuals)}} {\text{var(outcome)}} = 1 - \frac{\text{var(residuals)}}{\text{var(outcome)}}\]
But \(R^2\) is misleading because adding in more predictor variables almost always improves model fit, even if they are a bunch of random numbers that aren’t related to the outcome.
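As a quick illustration of this point (not from the original text, just base R): adding pure-noise predictors to an ordinary lm() fit never lowers \(R^2\), and usually raises it. The data and variable names here are hypothetical.
# hypothetical data: an outcome and three predictors that are all unrelated noise
set.seed(7)
dd <- data.frame(y = rnorm(20), x1 = rnorm(20), x2 = rnorm(20), x3 = rnorm(20))
summary(lm(y ~ x1, data = dd))$r.squared            # some small R^2
summary(lm(y ~ x1 + x2, data = dd))$r.squared       # never smaller, usually larger
summary(lm(y ~ x1 + x2 + x3, data = dd))$r.squared  # R^2 only ever creeps up as predictors are added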
7.1.1 More parameters (almost) always improve fit
Unfortunately overfitting happens automatically. To see this in action, we will look at seven hominin species’ average body mass and brain volume.
sppnames <- c("afarensis", "africanus", "habilis", "boisei", "rudolfensis", "ergaster", "sapiens")
brainvolcc <- c(438, 452, 612, 521, 752, 871, 1350)
masskg <- c(37.0, 35.5, 34.5, 41.5, 55.5, 61, 53.5)
d <- data.frame(species = sppnames, brain = brainvolcc, mass = masskg)
If we were interested in brain volume as a function of body mass, we could fit a polynomial regression (see ch. 4). But first we have to rescale the variables. The predictor, body mass, is standardized as usual. The outcome, brain volume, is instead rescaled as a proportion of the largest observed brain, which keeps the observed values positive and keeps zero meaning no brain at all.
d$mass_std <- (d$mass - mean(d$mass))/sd(d$mass)
d$brain_std <- d$brain / max(d$brain)
Let’s build the linear model first. We will consider \(\sigma\) to be log-normal to keep it positive. We will also include rather vague priors for \(\alpha\) and \(\beta\).
library(rethinking)
m7.1 <- quap(
  alist(
    brain_std ~ dnorm(mu, exp(log_sigma)),
    mu <- a + b*mass_std,
    a ~ dnorm(0, 0.5),
    b ~ dnorm(0, 10),
    log_sigma ~ dnorm(0, 1)
  ), data = d
)
Rethinking: OLS and Bayesian. The same linear model can also be fit with ordinary least squares, and extract.samples() will still return approximate posterior samples:
m7.1_OLS <- lm(brain_std ~ mass_std, data = d)
post <- extract.samples(m7.1_OLS)
Now let’s calculate the \(R^{2}\) value for our linear model
set.seed(11)
s <- sim(m7.1)
r <- apply(s, 2, mean) - d$brain_std
resid_var <- var2(r)
outcome_var <- var2(d$brain_std)
1 - resid_var/outcome_var
## [1] 0.4731086
And let’s write a function that we can quickly apply to other models
R2_is_bad <- function(quap_fit){
  s <- sim(quap_fit, refresh = 0)
  r <- apply(s, 2, mean) - d$brain_std
  1 - var2(r)/var2(d$brain_std)
}
This is our second-degree polynomial, m7.2. It just adds another predictor term constructed from the same data as m7.1.
m7.2 <- quap(
  alist(
    brain_std ~ dnorm(mu, exp(log_sigma)),
    mu <- a + b[1]*mass_std + b[2]*mass_std^2,
    a ~ dnorm(0.5, 1),
    b ~ dnorm(0, 10),
    log_sigma ~ dnorm(0, 1)
  ), data = d, start = list(b = rep(0, 2))
)
Note: quap() needs to know how many \(\beta\) parameters there are, so we set this with the start list.
Now we will write additional higher-order polynomials, each increasing in degree.
m7.3 <- quap(
  alist(
    brain_std ~ dnorm(mu, exp(log_sigma)),
    mu <- a + b[1]*mass_std + b[2]*mass_std^2 +
          b[3]*mass_std^3,
    a ~ dnorm(0.5, 1),
    b ~ dnorm(0, 10),
    log_sigma ~ dnorm(0, 1)
  ), data = d, start = list(b = rep(0, 3))
)
m7.4 <- quap(
  alist(
    brain_std ~ dnorm(mu, exp(log_sigma)),
    mu <- a + b[1]*mass_std + b[2]*mass_std^2 +
          b[3]*mass_std^3 + b[4]*mass_std^4,
    a ~ dnorm(0.5, 1),
    b ~ dnorm(0, 10),
    log_sigma ~ dnorm(0, 1)
  ), data = d, start = list(b = rep(0, 4))
)
m7.5 <- quap(
  alist(
    brain_std ~ dnorm(mu, exp(log_sigma)),
    mu <- a + b[1]*mass_std + b[2]*mass_std^2 +
          b[3]*mass_std^3 + b[4]*mass_std^4 +
          b[5]*mass_std^5,
    a ~ dnorm(0.5, 1),
    b ~ dnorm(0, 10),
    log_sigma ~ dnorm(0, 1)
  ), data = d, start = list(b = rep(0, 5))
)
For the sixth-degree polynomial, \(\sigma\) needs to be replaced with a small constant to get an output.
m7.6 <- quap(
  alist(
    brain_std ~ dnorm(mu, 0.001),
    mu <- a + b[1]*mass_std + b[2]*mass_std^2 +
          b[3]*mass_std^3 + b[4]*mass_std^4 +
          b[5]*mass_std^5 + b[6]*mass_std^6,
    a ~ dnorm(0.5, 1),
    b ~ dnorm(0, 10)
  ), data = d, start = list(b = rep(0, 6))
)
Now we can plot all our models. Below is sample plotting code for a single model. You can also use brain_plot() from the rethinking package.
#7.1
post <- extract.samples(m7.1)
mass_seq <- seq(from = min(d$mass_std), to = max(d$mass_std), length.out = 100)
l <- link(m7.1, data = list(mass_std = mass_seq))
mu <- apply(l, 2, mean)
ci <- apply(l, 2, PI)
plot(brain_std ~ mass_std, data = d, xlab = 'body mass (std)', ylab = 'brain volume (proportion of max)')
lines(mass_seq, mu)
shade(ci, mass_seq)
mtext(paste("m7.1: R^2 = ", round(R2_is_bad(m7.1), 2)))
To get an output for m7.6, sigma had to be fixed at a small constant because the model passes through every data point exactly, leaving no residual variance. Notice also that these high-\(R^2\) models do very poorly in the gaps in the data: between about 41 and 54 kg the higher-order models swing wildly upward, and then dip steeply between 55 and 60 kg.
7.1.2 Too few parameters hurts, too
An underfit model describes the existing data poorly and also lacks predictive power. One way to gauge over- versus underfitting is to drop data points one at a time and refit: an underfit model barely changes, while an overfit model is very sensitive to the exact sample.
Let’s plot the linear model and the fourth-degree polynomial for comparison.
par(mfrow=c(1,2))
brain_loo_plot(m7.1)
brain_loo_plot(m7.4)
The (underfit) linear model barely changes, compared with the wild swings of m7.4.
7.2 Entropy and accuracy
7.2.1 Firing the weatherperson
Accuracy starts with picking the target. We should be worried about:
1. Cost-benefit analysis. What is the cost of being wrong? How much do we gain by being right?
2. Accuracy in context. How do we judge accuracy in a way that accounts for how much a model could possibly improve prediction?
Suppose that in Vancouver, CBC puts out a 10-day forecast with uncertain predictions of rain, while CTV claims its forecast is better because it predicts sunshine for all 10 days.
CBC forecast
Day        | 1 | 2 | 3 | 4   | 5   | 6   | 7   | 8   | 9   | 10  |
Prediction | 1 | 1 | 1 | 0.6 | 0.6 | 0.6 | 0.6 | 0.6 | 0.6 | 0.6 |
Observed   | 1 | 1 | 1 | 0   | 0   | 0   | 0   | 0   | 0   | 0   |

CTV forecast
Day        | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
Prediction | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0  |
Observed   | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0  |
If we define accuracy as the hit rate, the average chance of a correct prediction, CBC scores 5.8 hits (\(3 \times 1 + 7 \times 0.4\)) across 10 days (5.8 / 10 = 0.58), while CTV scores 7 hits (\(3 \times 0 + 7 \times 1\)) for a rate of 0.7. CTV wins by this metric.
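A quick check of those hit rates in R (an illustration only; the vector names here are just for convenience):
obs      <- c(1, 1, 1, rep(0, 7))   # 1 = rain, 0 = sun
cbc_pred <- c(1, 1, 1, rep(0.6, 7)) # CBC's stated probability of rain each day
ctv_pred <- rep(0, 10)              # CTV's stated probability of rain each day

# average chance of a correct call over the 10 days
mean(ifelse(obs == 1, cbc_pred, 1 - cbc_pred)) # 0.58
mean(ifelse(obs == 1, ctv_pred, 1 - ctv_pred)) # 0.7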
7.2.1.1 Cost and benefits
As a member of the public, you may choose your own happiness as the metric for scoring these forecasts. Suppose you dislike getting caught in the rain, but you also dislike carrying an umbrella: getting caught in the rain costs you -5 happiness points and carrying an umbrella costs -1. Assume you carry an umbrella with the same probability that the forecast gives for rain.
# observed weather and the happiness points each forecast earns per day
Day <- 1:10
Observed <- rep(c(1, 0), times = c(3, 7))
CBC <- rep(c(-1, -0.6), times = c(3, 7))
CTV <- rep(c(-5, 0), times = c(3, 7))

d <- rbind(Day, Observed, CBC, CTV)

knitr::kable(d, "html")
Day | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
Observed | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
CBC | -1 | -1 | -1 | -0.6 | -0.6 | -0.6 | -0.6 | -0.6 | -0.6 | -0.6 |
CTV | -5 | -5 | -5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
So the CBC forecast is costing you 7.2 happiness points, while CTV’s forecast costs more than double that, at 15.
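Checking those totals with the vectors defined above:
sum(CBC) # -7.2 happiness points
sum(CTV) # -15 happiness points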
7.2.1.2 Measuring accuracy
The hit rate calculated above is not the only way to measure accuracy. We could instead compute the probability of correctly predicting the whole sequence of days, which is analogous to a joint likelihood. For the CBC forecast this is \(1^{3} \times 0.4^{7} \approx 0.002\), which doesn’t look like a great probability. But compare it to the CTV forecast: \(0^{3} \times 1^{7} = 0\). There is no way for CTV to be correct over the whole sequence, because its forecast assigns zero probability to rain on the days it rained.
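The same arithmetic in R:
1^3 * 0.4^7 # CBC: about 0.0016
0^3 * 1^7   # CTV: exactly 0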
7.2.2 Information and uncertainty
Now that we have a way to measure accuracy, we want to use it to compare competing models. To do this we can measure the distance of a model’s prediction from the true value. But how do we measure distance from the true model? The answer comes from Information Theory.
Information: The reduction in uncertainty when we learn an outcome
Now we need a way to measure uncertainty. Our measure should have a few properties:
- It should be continuous in scale
- It should increase as the number of possible events increases
- It should be additive (i.e., the uncertainty over combinations of events, such as weather and temperature, rain/cold, rain/hot, sun/cold, sun/hot, should be the sum of the separate uncertainties)
Information Entropy: The uncertainty contained in a probability distribution is the average log-probability of an event.
\[H(p) = -E\text{log}(p_{i}) = -\sum_{i = 1}^{n} p_{i}\text{log}(p_{i})\]
Here \(n\) is the number of events, \(i\) is an individual event with the probability of \(p_{i}\).
To use in an example, let’s say that the true probability of rain is \(p_{1} = 0.3\) and sun is \(p_{2} = 0.7\). Then we can plug into the above equation:
\[H(p) = -(p_{1}\text{log}(p_{1}) + p_{2}\text{log}(p_{2})) \approx 0.61\]
in R:
p <- c(0.3, 0.7)
-sum(p*log(p))
## [1] 0.6108643
If instead we were in Abu Dhabi, the chance of rain would be about 0.01 and sunshine about 0.99 (1 - rain). The uncertainty drops way down, to about 0.06, because we know rain is rare.
p <- c(0.01, 0.99)
-sum(p*log(p))
## [1] 0.05600153
And when sun (0.7), rain (0.15), and snow (0.15) are all in the mix, uncertainty increases again.
p <- c(0.7, 0.15, 0.15)
-sum(p*log(p))
## [1] 0.8188085
7.2.3 From entropy to accuracy
Divergence: The additional uncertainty induced by using probabilities from one distribution to describe another distribution.
Now we can measure how much additional uncertainty is introduced when we use a model distribution \(q\) in place of the true distribution \(p\):
\[D_{KL}(p, q) = \sum_{i} p_{i}(\text{log}(p_{i}) - \text{log}(q_{i})) = \sum_{i} p_{i} \text{log} \left(\frac{p_{i}}{q_{i}}\right)\]
Say our true distribution is \(p = \{0.3, 0.7\}\), and we consider model distributions ranging from \(q = \{0.01, 0.99\}\) to \(q = \{0.99, 0.01\}\).
p1 <- 0.3
p2 <- 0.7
q1 <- seq(from = 0.01, to = 0.99, by = 0.01)
q2 <- (1 - q1)
d_kl <- (p1 * log(p1/q1)) + (p2 * log(p2/q2))

d <- data.frame(p1, p2, q1, q2, d_kl)
plot(x = d$q1, y = d$d_kl, type = 'n')
lines(d$q1, d$d_kl, col = 'blue', lwd = 2)
abline(v = 0.3, lty = 2)
text(x = 0.35, y = 2.2, label = 'q = p')
In the plot we can see that where \(q_{1} = p_{1}\) (0.3), the divergence is at its minimum (0), and it only increases as \(q\) moves away from the true value.
Now we can use divergence to compare each model’s estimated probabilities of events (observations) against the probabilities of the events themselves.
7.2.4 Estimating divergence
Because we don’t know \(p\) (especially if we are forecasting the weather), we need to estimate each model’s distance from \(p\) by comparing models to each other; this lets us infer which model is closer to \(p\) without ever knowing \(p\) directly. The key is that every model’s divergence contains the same unknown term \(E\,\text{log}(p_{i})\), so when we compare two models that term cancels and only each model’s average log-probability matters. So we sum the log-probabilities over all observations for each model:
\[S(q) = \sum_{i} \text{log}(q_{i})\]
for model \(q\) and :
\[S(r) = \sum_{i} \text{log}(r_{i})\]
for model \(r\).
For Bayesian models we score accuracy with the entire posterior distribution rather than a single point estimate, so that we don’t throw away the information contained in the distribution of each log-likelihood. The resulting score is the log-pointwise-predictive-density (lppd). There is a function in the rethinking package that computes it for a model, but we will also do it by hand below.
set.seed(11)
lppd(m7.1, n = 1e4)
## [1] 0.6232812 0.6540091 0.5198331 0.6381867 0.4796153 0.4417278 -0.9146812
The equation for lppd consists of data \(y\) and a distribution \(\Theta\) such that:
\[\text{lppd}(y, \Theta) = \sum_{i}\text{log}\frac{1}{S}\sum_{s}p(y_{i}|\Theta_{s})\]
and \(S\) is the number of samples and \(\Theta_{s}\) is the s-th set of sampled parameter values.
set.seed(11)
logprob <- sim(m7.1, ll = TRUE, n = 1e4) # log-likelihood samples as rows, observations as columns
n <- ncol(logprob) # seven observations
ns <- nrow(logprob) # 1e4 samples
f <- function(i){
  log_sum_exp(logprob[,i]) - log(ns) # exponentiates the log-probs, sums them, logs the sum, then subtracts the log of the number of samples
}
(lppd <- sapply(1:n, f))
## [1] 0.6232812 0.6540091 0.5198331 0.6381867 0.4796153 0.4417278 -0.9146812
They match (yay!)
7.2.5 Scoring the right data
One problem still exists. lppd also increases as models get more complex, just like \(R^2\).
set.seed(11)
sapply(list(m7.1,m7.2,m7.3,m7.4,m7.5,m7.6), function(m) sum(lppd(m)))
## [1] 2.478986 2.584124 3.634849 5.325457 14.052121 39.535568
Here is the way around this problem:
1. Suppose there is a training sample of size \(N\).
2. Compute the posterior distribution of a model for the training sample, then compute the score (\(D_{train}\)).
3. Suppose another sample of size \(N\) from the same process. This is the test sample.
4. Compute the score on the test sample, using the posterior trained on the training sample (\(D_{test}\)).
As an example we will train and test models of the form:
\[y_{i} \sim \text{Normal}(\mu_{i}, 1) \\ \mu_{i} = (0.15)x_{1,i} - (0.4)x_{2,i}\]
Here, \(\sigma\) is static, \(\alpha = 0\), and our \(\beta\) coefficients are known.
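To make the data-generating process concrete, here is what one simulated sample from the true model looks like (a sketch only, with hypothetical variable names; the sim_train_test() function used below generates samples like this internally):
# one hypothetical training sample of size 20 from the true model above
N  <- 20
x1 <- rnorm(N)
x2 <- rnorm(N)
y  <- rnorm(N, mean = 0.15*x1 - 0.4*x2, sd = 1)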
How it works:
n_sim <- 10 # number of simulations
kseq <- 1:5 # number of parameters in the model

# simulation function
my_sim <- function(k) {
  r <- replicate(n_sim, sim_train_test(N = n, k = k))
  c(mean(r[1,]), mean(r[2,]), sd(r[1,]), sd(r[2,]))
}

# run for 20 cases
n <- 20
dev_20 <- sapply(kseq, my_sim) # this will take a while
# run for 100 cases
#n <- 100
#dev_100 <- sapply(kseq, my_sim) # run if interested but with more data, adding parameters will not largely affect the deviance
#plot
plot(1:5, dev_20[1,], ylim=c(min(dev_20[1:2,])-5, max(dev_20[1:2,])+10),
xlim = c(1, 5.1), xlab = 'number of parameters', ylab='deviance',
pch = 16, col = rangi2)
mtext('N = 20')
points((1:5)+0.1, dev_20[2,])
for(i in kseq){
  pts_in <- dev_20[1,i] + c(-1, +1)*dev_20[3,i]
  pts_out <- dev_20[2,i] + c(-1, +1)*dev_20[4,i]
  lines(c(i,i), pts_in, col = rangi2)
  lines(c(i,i)+0.1, pts_out)
}
Notice that the in-sample deviance (blue points) always decreases as you add parameters (just as \(R^2\) always increases), but the black out-of-sample points do best at 3 parameters, which is the correct number of parameters for the data-generating model.
7.3 Golem taming: regularization
Regularization means tuning the priors to improve out-of-sample model performance.
Standard model:
\[y_{i} \sim \text{Normal}(\mu_{i}, \sigma)\\
\mu_{i} = \alpha + \beta x_{i}\\
\alpha \sim \text{Normal}(0,100)\\
\beta \sim \text{Normal}(0,1)\\
\sigma \sim \text{Exponential}(1)\]
Here the prior on \(\alpha\) is so flat that it has no real effect on inference. The prior on \(\beta\), though, is narrower: it doesn’t expect a 1-unit increase in \(x\) to produce a change of more than about 2 units in \(y\) (two standard deviations of the prior).
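As a minimal sketch of what regularization does (not from the original notes; the data and model names here are simulated and hypothetical), the same simple model is fit twice, once with the flat-ish priors above and once with a tighter, skeptical prior on the slope:
library(rethinking)

set.seed(7)
sim_dat <- data.frame(x = rnorm(50))
sim_dat$y <- rnorm(50, 0.3*sim_dat$x, 1)

m_flat <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b*x,
    a ~ dnorm(0, 100), # very flat
    b ~ dnorm(0, 1),
    sigma ~ dexp(1)
  ), data = sim_dat
)

m_reg <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b*x,
    a ~ dnorm(0, 100),
    b ~ dnorm(0, 0.2), # skeptical, regularizing prior on the slope
    sigma ~ dexp(1)
  ), data = sim_dat
)

precis(m_flat)
precis(m_reg) # the slope estimate is pulled toward zero
The tighter prior typically hurts the in-sample fit a little but tends to improve out-of-sample prediction, which is the whole point of regularization.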
7.4 Predicting predictive accuracy
7.4.1 Cross-validation
Leave-one-out cross-validation (LOO) and PSIS are common tools for cross-validating models.
Full LOO can be computationally expensive for large datasets, so PSIS (Pareto-smoothed importance sampling cross-validation) is used as a shortcut: it approximates leave-one-out by weighting each observation by how much it influences the posterior (its importance).
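As a quick illustration (output not shown; this assumes the rethinking package and the quap fits from earlier in the chapter, and with only seven data points you should expect Pareto k warnings), PSIS can be computed per model or used as the scoring function in compare():
PSIS(m7.1)
compare(m7.1, m7.2, m7.3, func = PSIS)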
7.4.2 Information criteria
AIC (the Akaike Information Criterion) provides a simple estimate of out-of-sample deviance:
\[\text{AIC} = D_{train} + 2p = -2\text{lppd} +2p\]
with \(p\) representing the number of free parameters in the posterior distribution. Unfortunately, AIC is unreliable with skeptical (regularizing) priors or with multilevel models, so instead we will use the Widely Applicable Information Criterion (WAIC). It is a bit more complicated:
\[\text{WAIC}(y, \Theta) = -2\left(\text{lppd} - \sum_{i}\text{var}_{\theta}\text{log}p(y_{i}|\theta)\right)\]
Where the lppd equation is written above in section 7.2 and the term to the right of lppd is a penalty term proportional to the variance in the posterior predictions.
Here is how you can compute it in R, in case that helps in understanding the process:
#standard linear model
data(cars)
m <- quap(
  alist(
    dist ~ dnorm(mu, sigma),
    mu <- a + b*speed,
    a ~ dnorm(0, 100),
    b ~ dnorm(0, 10),
    sigma ~ dexp(1)
  ), data = cars
)
set.seed(11)
post <- extract.samples(m, n = 1000) # pull samples from the posterior
# calculate the log-likelihood of each observation for each posterior sample
n_samples <- 1000
logprob <- sapply(1:n_samples,
  function(s){
    mu <- post$a[s] + post$b[s]*cars$speed
    dnorm(cars$dist, mu, post$sigma[s], log = TRUE)
  })
The output of the above is a matrix with the 50 observations in rows and the 1000 posterior samples in columns.
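A quick sanity check on the shape (not in the original notes):
dim(logprob) # 50 rows (observations) by 1000 columns (posterior samples)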
#calculate the lppd
n_cases <- nrow(cars) # number of observations
lppd <- sapply(1:n_cases, function(i) log_sum_exp(logprob[i,]) - log(n_samples))
#penalty term in WAIC
pWAIC <- sapply(1:n_cases, function(i) var(logprob[i,]))
-2*(sum(lppd) - sum(pWAIC))
## [1] 423.0532
#compare to automatic function
rethinking::WAIC(m)
## WAIC lppd penalty std_err
## 1 422.4586 -206.8399 4.38937 17.32067
#calculate the standard error of the WAIC for yourself
waic_vec <- -2*(lppd - pWAIC)
sqrt(n_cases*var(waic_vec))
## [1] 17.83286
7.5 Model comparison
You may see people select the best model and discard the others, but this is a mistake.
7.5.1 Model mis-selection
Let’s start looking at some models. Recall the plant growth models from section 6.2.
set.seed(11)
compare(m6.6, m6.7, m6.8, func=WAIC)
## WAIC SE dWAIC dSE pWAIC weight
## m6.7 330.9678 16.15098 0.00000 NA 4.054545 1.000000e+00
## m6.8 395.6257 11.23436 64.65787 14.23936 2.901013 9.114282e-15
## m6.6 402.7142 11.05810 71.74634 16.16689 1.697835 2.633180e-16
set.seed(11)
#compute dSE pointwise
waic_m6.7 <- WAIC(m6.7, pointwise = TRUE)$WAIC
waic_m6.8 <- WAIC(m6.8, pointwise = TRUE)$WAIC
n <- length(waic_m6.7)
diff_m6.7_6.8 <- waic_m6.7 - waic_m6.8
sqrt(n*var(diff_m6.7_6.8))
## [1] 14.19359
An approximate 99% interval for the difference in WAIC between m6.7 and m6.8 (dWAIC plus or minus z * dSE):
65 + c(-1,1)*14.2*2.9 # dWAIC +/- z*dSE (z = 2.9 here; z = 2.6 gives a conventional 99% interval)
## [1] 23.82 106.18
plot(compare(m6.6, m6.7, m6.8)) # rethinking's built-in comparison plot
Next, the pointwise dSE between m6.6 (the intercept-only model) and m6.8:
set.seed(11)
waic_m6.6 <- WAIC(m6.6, pointwise = TRUE)$WAIC
diff_m6.6_m6.8 <- waic_m6.6 - waic_m6.8
sqrt(n*var(diff_m6.6_m6.8))
## [1] 6.790192
The two models are only about 7 points of deviance apart. Now let’s look at the dSE values for all three models:
set.seed(11)
compare(m6.6, m6.7, m6.8)@dSE
## m6.6 m6.7 m6.8
## m6.6 NA 16.16689 6.807944
## m6.7 16.166887 NA 14.239356
## m6.8 6.807944 14.23936 NA
Notice that the dSE for the m6.6–m6.8 comparison wasn’t in the earlier compare() table, which only reports differences against the best model (m6.7); the full dSE matrix above contains all pairwise comparisons.
7.5.2 Outliers and other illusions
Let’s refit the waffle–divorce models.
data(WaffleDivorce)
d <- WaffleDivorce
d$A <- standardize(d$MedianAgeMarriage)
d$D <- standardize(d$Divorce)
d$M <- standardize(d$Marriage)
m5.1 <- quap(
  alist(
    D ~ dnorm(mu, sigma),
    mu <- a + bA*A,
    a ~ dnorm(0, 0.2),
    bA ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data = d
)
m5.2 <- quap(
  alist(
    D ~ dnorm(mu, sigma),
    mu <- a + bM*M,
    a ~ dnorm(0, 0.2),
    bM ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data = d
)
m5.3 <- quap(
  alist(
    D ~ dnorm(mu, sigma),
    mu <- a + bM*M + bA*A,
    a ~ dnorm(0, 0.2),
    bM ~ dnorm(0, 0.5),
    bA ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data = d
)
precis(m5.3) #remember marriage rate not useful with age
## mean sd 5.5% 94.5%
## a -3.141054e-05 0.09707997 -0.1551840 0.1551211
## bM -6.536477e-02 0.15078018 -0.3063406 0.1756111
## bA -6.135209e-01 0.15099044 -0.8548328 -0.3722090
## sigma 7.851597e-01 0.07785368 0.6607345 0.9095849
Now look at a PSIS comparison of the three models:
set.seed(11)
compare(m5.1, m5.2, m5.3, func=PSIS)
## Some Pareto k values are very high (>1). Set pointwise=TRUE to inspect individual points.
## Some Pareto k values are very high (>1). Set pointwise=TRUE to inspect individual points.
## PSIS SE dPSIS dSE pPSIS weight
## m5.1 127.6505 15.03249 0.000000 NA 4.767972 0.880458424
## m5.3 131.6777 16.57977 4.027213 1.903536 7.038831 0.117546749
## m5.2 139.8303 10.84369 12.179770 11.146999 3.352505 0.001994827
When we get the high Pareto \(k\) warning, we may want to inspect the individual points:
set.seed(11)
PSIS_m5.3 <- PSIS(m5.3, pointwise = TRUE)
## Some Pareto k values are high (>0.5). Set pointwise=TRUE to inspect individual points.
set.seed(11)
WAIC_m5.3 <- WAIC(m5.3, pointwise = TRUE)
labels <- list(x = PSIS_m5.3[PSIS_m5.3$k > 0.4, 5], labels = d$Loc[c(9, 13, 20)], y = WAIC_m5.3[c(9, 13, 20), 3])
plot(PSIS_m5.3$k, WAIC_m5.3$penalty, xlab = "PSIS Pareto k",
ylab = "WAIC penalty", col=rangi2, lwd=2)
text(x = labels$x-0.05, y = labels$y, labels = labels$labels)
Robust regression to deal with outliers
Change the outcome distribution from Gaussian (normal) to Student-t, whose extra parameter \(\nu\) (the degrees of freedom) controls how thick the tails are, and therefore how often rare, extreme values are expected.
m5.3t <- quap(
  alist(
    D ~ dstudent(2, mu, sigma),
    mu <- a + bM*M + bA*A,
    a ~ dnorm(0, 0.2),
    bM ~ dnorm(0, 0.5),
    bA ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data = d
)
PSIS(m5.3t) # no warning about high Pareto k values this time
## PSIS lppd penalty std_err
## 1 134.1568 -67.07842 7.138106 11.87741
precis(m5.3)
## mean sd 5.5% 94.5%
## a -3.141054e-05 0.09707997 -0.1551840 0.1551211
## bM -6.536477e-02 0.15078018 -0.3063406 0.1756111
## bA -6.135209e-01 0.15099044 -0.8548328 -0.3722090
## sigma 7.851597e-01 0.07785368 0.6607345 0.9095849
precis(m5.3t)
## mean sd 5.5% 94.5%
## a 0.015066723 0.09943806 -0.1438545 0.1739880
## bM 0.003731365 0.21184966 -0.3348453 0.3423080
## bA -0.701153916 0.13045818 -0.9096513 -0.4926565
## sigma 0.552696770 0.08005500 0.4247534 0.6806401
The age-at-marriage effect (bA) has moved slightly farther from zero, while the marriage-rate effect (bM) has moved closer to zero, with a wider standard deviation.
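One quick way to see this shift (not in the original notes) is to plot the coefficient tables of the two fits side by side with rethinking’s coeftab():
plot(coeftab(m5.3, m5.3t)) # bA moves away from zero, bM hugs zero with wider uncertainty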