An introduction to likelihood based inference.
Motivating Scenario: We aim to make inferences on data from various types of null models.
So far, we’ve focused extensively on the normal distribution in our linear modeling. This makes sense: many things are approximately normal, but certainly not everything. We’ve also assumed independence — though this assumption is more tenuous, as true independence usually only holds in carefully controlled experiments.
The good news is that we can step beyond the assumptions of linear models and build inferences under any model we can describe through an equation using likelihood-based inference! This approach allows us to model more complex scenarios, like phylogenies, genome sequences, or any structured data. For now, however, we’ll stick to our simple normal distribution so that we can follow along.
Remember, a probability represents the proportion of times a model would produce a particular result if we did the same thing many many times.
Mathematically, calculating a likelihood is identical to calculating a probability. The key difference lies in our perspective.
In this way, a likelihood represents a conditional probability.
\[P(\text{Data} | \text{Model}) = \mathscr{L}(\text{Model} | \text{Data})\]
If our data points are independent, we can calculate the likelihood of all data points by taking the product of each individual likelihood. However, this approach introduces some practical challenges: multiplying many small numbers quickly produces values so close to zero that computers struggle to represent them, and such tiny products are awkward to work with.
To address these issues, we typically take the log of each likelihood and sum them all. In log space, addition replaces multiplication — for example, 0.5 * 0.2 = 0.1 becomes log(0.5) + log(0.2) = -2.302585, and \(e^{-2.302585} = 0.1\). So you will almost always see people talking about log-likelihoods instead of raw likelihoods.
How can we calculate likelihoods for a parameter of a normal distribution? Here’s how!
Suppose we have a sample with values 0.01, 0.07, and 2.2, and we know the population standard deviation is one, but we don’t know the population mean. We can find the likelihood of a proposed mean by multiplying the probability of each observation, given the proposed mean. For example, the likelihood of \(\mu = 0 | \sigma = 1, \text{ and } Data = \{0.01, 0.07, 2.2\}\) is:
dnorm(x = 0.01, mean = 0, sd = 1) *
dnorm(x = 0.07, mean = 0, sd = 1) *
dnorm(x = 2.20, mean = 0, sd = 1)
[1] 0.00563186
A more compact way to write this is to hand dnorm() the whole vector of observations at once and multiply the resulting densities.
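For example, a minimal sketch that reproduces the product above in a single call:

prod(dnorm(x = c(0.01, 0.07, 2.2), mean = 0, sd = 1)) # same answer as above: 0.00563186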
Remember, we multiply because we assume that all observations are independent.
As discussed above, we typically work with log likelihoods rather than linear likelihoods. Because multiplying on the linear scale is equivalent to adding on the log scale, the log likelihood in this case is:
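A sketch of that calculation, using the log = TRUE argument of dnorm():

sum(dnorm(x = c(0.01, 0.07, 2.2), mean = 0, sd = 1, log = TRUE)) # roughly -5.179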
Reassuringly, \(ln(0.00563186) =\) -5.1793155, and similarly \(e^{-5.179316} =\) 0.0056319.
We can consider a range of plausible parameter values and calculate the (log) likelihood of the data for each of these “models.” This is called a likelihood profile. We can create a likelihood profile with the following:
library(tidyverse) # for tibble() and the dplyr verbs used below

obs <- c(0.01, 0.07, 2.2) # our observations
proposed.means <- seq(-1, 2, .001) # proposing means from -1 to 2 in 0.001 increments

likelihood_profile <- tibble(proposed_mean = proposed.means) %>%
  group_by(proposed_mean) %>%
  # find the log-likelihood of each proposed "model"
  mutate(log_lik = sum(dnorm(x = obs, mean = proposed_mean, sd = 1, log = TRUE))) %>%
  ungroup()
Now you can scroll through and see the log likelihood of each proposed mean.
Above, I just said the standard deviation was one, but more often we estimate it from our data. However, the standard deviation in our likelihood calculations is a bit different from the one we are used to: the maximum likelihood estimate divides the sum of squared deviations by n rather than n - 1 (we will use this below when we analyze the range-shift data).
How can we use this for inference? One estimate of our parameter is the value that “maximizes the likelihood” of our data (e.g., the x-value corresponding to the highest point on the plot above).
To find the maximum likelihood estimate, we find the largest value in our likelihood profile (i.e., the log likelihood closest to zero, since they are all negative). We can find this by arranging our data set from the largest to the smallest log likelihood.
Or by filtering our data so we only keep the proposed mean that maximizes the likelihood of our data, as in the sketch below.
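A sketch of both approaches; the second builds the MLE object printed below:

# arrange from the largest to the smallest log-likelihood: the MLE is in the top row
likelihood_profile %>% arrange(desc(log_lik))

# or keep only the row with the largest log-likelihood
MLE <- likelihood_profile %>% filter(log_lik == max(log_lik))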
print(MLE)
# A tibble: 1 × 2
proposed_mean log_lik
<dbl> <dbl>
1 0.76 -4.31
This equals the mean of our observations: mean(c(0.01, 0.07, 2.2)) = 0.76. In general, for normally distributed data, the maximum likelihood estimate of the population mean is the sample mean.
We need one more trick to use the likelihood profile to estimate uncertainty – twice the drop in log likelihood from its maximum is roughly \(\chi^2\) distributed with degrees of freedom equal to the number of parameters we’re trying to estimate (here, just one – corresponding to the mean). So the 95% confidence interval is everything within qchisq(p = .95, df = 1) / 2 = 1.92 log-likelihood units of the maximum likelihood estimate.
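One way to build the CI object used in the plot below is sketched here, keeping every proposed mean within 1.92 log-likelihood units of the maximum:

cutoff <- max(likelihood_profile$log_lik) - qchisq(p = 0.95, df = 1) / 2
in_CI  <- likelihood_profile %>% filter(log_lik >= cutoff) # proposed means we cannot reject
CI     <- tibble(CI = range(in_CI$proposed_mean))          # lower and upper 95% CI limits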
CI |
---|
-0.371 |
1.891 |
Now we can plot our likelihood profile, noting our MLE (in red) and 95% confidence intervals (the shaded area):
ggplot(likelihood_profile)+
geom_rect(data = . %>% summarise(ymin = min(log_lik),ymax = max(log_lik),
xmin = min(pull(CI )),xmax = max(pull(CI ))),
aes(xmin = xmin,xmax = xmax, ymin = ymin, ymax = ymax),
fill= "lightgrey", alpha = .4)+
geom_line(aes(x = proposed_mean, y = log_lik))+
geom_vline(xintercept = pull(MLE, proposed_mean), color = "red")+
theme_light()
We can find a p-value and test the null hypothesis by comparing the likelihood of our MLE (\(log\mathscr{L}(MLE|D)\)) to the likelihood of the null model (\(log\mathscr{L}(H_0|D)\)). We call this a likelihood ratio test, because we divide the likelihood of the MLE by the likelihood of the null – but we’re doing this in logs, so we subtract rather than divide. For this arbitrary example, let’s pretend we want to test the null hypothesis that the true mean is zero:
Log likelihood of the MLE: \(log\mathscr{L}(MLE|D)\) = the sum of the log-likelihood of each observation under the MLE = pull(MLE, log_lik) = -4.313.
Log likelihood of the Null: \(log\mathscr{L}(H_0|D)\) = the sum of the log-likelihood of each observation under the null = likelihood_profile %>% filter(proposed_mean == 0) %>% pull(log_lik) = -5.179.
We then calculate \(D\), which is simply two times this difference in log likelihoods, and calculate a p-value by noting that \(D\) is \(\chi^2\) distributed with degrees of freedom equal to the number of parameters we’re inferring (here, just one – corresponding to the mean). We see that our p-value is greater than 0.05, so we fail to reject the null.
D | p_val |
---|---|
1.733 | 0.188 |
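These values can be reproduced in a few lines (a sketch, using the MLE and likelihood_profile objects from above); this also defines the D used in the Bartlett correction below:

log_lik_MLE <- pull(MLE, log_lik)                                                   # -4.313
log_lik_H0  <- likelihood_profile %>% filter(proposed_mean == 0) %>% pull(log_lik)  # -5.179
D     <- 2 * (log_lik_MLE - log_lik_H0)                                             # 1.733
p_val <- pchisq(q = D, df = 1, lower.tail = FALSE)                                   # 0.188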
Unfortunately, the LRT gives poorly calibrated p-values – especially when we have a small sample size. Compare our LRT p-value of 0.188 to that from a one-sample t-test.
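For comparison, a sketch of that t-test (obs is the vector of three observations from above):

t.test(obs, mu = 0) # the two-sided p-value is roughly 0.4, much larger than the LRT's 0.188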
A simple solution is applying Bartlett’s correction for small samples, here \(D_\text{corrected} = \frac{D}{1+\frac{1}{2\times n}}\). This gets us closer, but we must still recognize that the LRT is a very rough approximation.
Bartlett_D <- D / (1+1/(2 * length(obs)))
Bartlett_p_val <- pchisq(q = Bartlett_D, df = 1, lower.tail = FALSE)
print(Bartlett_p_val)
[1] 0.2229538
Chen et al. (2011) tested the idea that organisms move to higher elevation as the climate warms. To test this, they collected data (available here) from 31 species, plotted below (Fig 1).
We could conduct a one-sample t-test of the null hypothesis that there has been zero net change on average.
lm(elevationalRangeShift ~ 1, data = range_shift) %>%
  tidy() %>%
  mutate(p.value = p.value * 10^8) %>% mutate_at(2:5, round, digits = 2) %>%
  mutate(p.value = paste(p.value, "x 10^-8")) %>% gt()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 39.33 | 5.51 | 7.14 | 6.06 x 10^-8 |
There is nothing wrong with that one-sample t-test, but let’s use this as an opportunity to learn how to apply likelihood. First we grab our observations and write down our proposed means – let’s say from negative one hundred to two hundred in increments of 0.01.
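A sketch of that setup, assuming the Chen et al. data have been read into the range_shift data frame used in the t-test above:

observations   <- range_shift$elevationalRangeShift # the 31 observed range shifts
n              <- length(observations)              # sample size
proposed_means <- seq(-100, 200, by = 0.01)         # proposed means from -100 to 200 by 0.01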
We then go on to make a likelihood surface by:
1. Making a vector of proposed means.
2. Calculating the population standard deviation (sigma) for each proposed parameter value.
3. For each proposed mean, finding the log likelihood of each observation (with dnorm).
4. Summing the log likelihoods of each observation for a given proposed mean to find the log likelihood of that parameter estimate given the data.
Just for fun, we’ll arrange them with the MLE on top!
log_lik_uphill <- tibble(mu = proposed_means) %>%          # Step 1
  group_by(mu) %>%                                         # work within one proposed mean at a time
  mutate(sigma = sqrt(sum((observations - mu)^2) / n),     # Step 2: the ML estimate of the standard deviation (divide by n, not n - 1)
         log_lik = dnorm(x = observations,                 # Step 3: find the
                         mean = mu,                        # log-likelihood
                         sd = sigma,                       # of each data
                         log = TRUE) %>%                   # point
           sum()) %>%                                      # Step 4: sum the log-likelihoods
  ungroup() %>%
  arrange(desc(log_lik))
This sorted list shows that our maximum likelihood estimate is about 39 meters. The actual MLE is in the top row:
mu | sigma | log_lik |
---|---|---|
39.3 | 30.2 | -149.6 |
Again, because twice the drop in log likelihood from its maximum is roughly \(\chi^2\) distributed with one degree of freedom, everything within qchisq(p = .95, df = 1) / 2 = 1.92 log-likelihood units of the maximum likelihood estimate is in the 95% confidence interval:
CI |
---|
28.38 |
50.28 |
Now we can plot our likelihood profile, noting our MLE (in red) and 95% confidence intervals (the shaded area):
ggplot(log_lik_uphill)+
geom_rect(data = . %>% summarise(ymin = min(log_lik),ymax = max(log_lik),
xmin = min(pull(CI )),xmax = max(pull(CI ))),
aes(xmin = xmin,xmax = xmax, ymin = ymin, ymax = ymax),
fill= "lightgrey", alpha = .4)+
geom_line(aes(x = mu, y = log_lik))+
geom_vline(xintercept = pull(MLE, mu), color = "red")+
theme_light()
As above, we find the p-value by comparing two times the difference in log likelihoods under the MLE and the null model (\(D\)) to the \(\chi^2\) distribution with one degree of freedom. This value is in the same ballpark as what we got from the one-sample t-test (\(6.06 \times 10^{-8}\)).
log_lik_MLE | log_lik_H0 | D | p_val |
---|---|---|---|
-149.59 | -164.99 | 30.79 | 2.87 x 10^-8 |
You can grab the log likelihoods of a model with the logLik function.
MLE_model <- lm(elevationalRangeShift ~ 1, data = range_shift) # Estimate the intercept
logLik(MLE_model ) %>% as.numeric()
[1] -149.5937
Null_model <- lm(elevationalRangeShift ~ 0, data = range_shift) # Set intercept to zero
logLik(Null_model ) %>% as.numeric()
[1] -164.9888
Or with the glance() function from broom:
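For example, a sketch using the two models fit above (glance() returns a one-row summary that includes a logLik column):

library(broom)
glance(MLE_model)  %>% pull(logLik) # -149.59
glance(Null_model) %>% pull(logLik) # -164.99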
You can also use the lrtest function in the lmtest package to conduct this test:
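The call producing the output below would look something like this (a sketch using the two lm() fits from above):

library(lmtest)
lrtest(Null_model, MLE_model)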
Likelihood ratio test
Model 1: elevationalRangeShift ~ 0
Model 2: elevationalRangeShift ~ 1
#Df LogLik Df Chisq Pr(>Chisq)
1 1 -164.99
2 2 -149.59 1 30.79 2.875e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We often care about how probable our model is given the data, rather than the reverse. Likelihoods can help us approach this! Remember Bayes’ theorem:
\[ P(Model|Data) = \frac{P(Data|Model) \times P(Model)}{P(Data)} \]
Taking this apart: \(P(\text{Data} | \text{Model})\) is the likelihood, \(P(\text{Model})\) is the prior probability of the model, \(P(\text{Data})\) is the evidence (the probability of the data averaged over all models under consideration), and \(P(\text{Model} | \text{Data})\) is the posterior probability of the model given the data.
Today, we’ll assign an arbitrary prior probability for demonstration purposes. This is not ideal—Bayesian inferences are only meaningful to the extent that we can meaningfully interpret the posterior, which depends on having a well-justified prior. However, for the sake of this example, let’s assume a prior: the parameter is normally distributed with a mean of 0 and a standard deviation of 30.
We can grab interesting things from the posterior distribution.
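The bayes_uphill object used below is not shown being built; one way to construct it by grid approximation might be the following sketch, which reuses log_lik_uphill and the normal(0, 30) prior assumed above:

bayes_uphill <- log_lik_uphill %>%
  arrange(mu) %>%                                  # sort by mu so the cumulative sums below make sense
  mutate(lik       = exp(log_lik),                 # back-transform to the likelihood scale
         prior     = dnorm(mu, mean = 0, sd = 30), # normal(0, 30) prior density at each proposed mean
         prior     = prior / sum(prior),           # normalize the prior over the grid of proposed means
         evidence  = sum(lik * prior),             # P(Data): the likelihood averaged over the prior
         posterior = lik * prior / evidence)       # Bayes' theorem: likelihood times prior, divided by evidence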
bayes_uphill %>%
filter(posterior == max(posterior)) %>% data.frame()
mu sigma log_lik lik prior evidence
1 38.08 30.19035 -149.6203 1.048907e-65 5.944371e-05 8.550126e-67
posterior
1 0.0007292396
Note that this MAP (maximum a posteriori) estimate does not equal our MLE, as it is pulled away from the MLE toward our prior.
bayes_uphill %>%
mutate(cumProb = cumsum(posterior)) %>%
filter(cumProb > 0.025 & cumProb < 0.975)%>%
summarise(lower_95cred = min(mu),
upper_95cred = max(mu)) %>% data.frame()
lower_95cred upper_95cred
1 26.18 52.48
In a good world our priors are well calibrated.
In a better world, the evidence in the data is so strong, that our priors don’t matter.
A good thing to do is to compare our posterior distributions across different prior models. The plot below shows that if our prior is very tight, we have trouble moving the posterior away from it. Another way to say this is that if your prior belief is strong, it would take loads of evidence to get you to change it.
MCMC / STAN / brms
With more complex models, we usually can’t use the math above to solve Bayesian problems. Rather, we use computational tricks – most notably Markov chain Monte Carlo (MCMC) – to approximate the posterior distribution.
The programming here can be tedious, so there are many programs – notably WinBUGS, JAGS, and Stan – that make the computation easier. But even those can be a lot of work. Here I use the R package brms, which runs Stan for us, to run an MCMC and do Bayesian stats. I suggest looking into brms if you want to get started, and learning Stan for more serious analyses.
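A sketch of what such a brm() call could look like for this example: the formula and data match the lm() above, the normal(0, 30) prior on the intercept is the one we assumed earlier, and the object name and seed are arbitrary:

library(brms)
uphill_brm <- brm(elevationalRangeShift ~ 1, # intercept-only model, as in lm() above
                  data  = range_shift,
                  prior = set_prior("normal(0, 30)", class = "Intercept"),
                  seed  = 1)                 # MCMC results will vary slightly from run to run
summary(uphill_brm)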
term | mean | se_mean | sd | 2.5% | 25% | 50% | 75% | 97.5% | n_eff | Rhat |
---|---|---|---|---|---|---|---|---|---|---|
b_Intercept | 37.82 | 0.06 | 5.69 | 26.56 | 34.11 | 37.86 | 41.59 | 48.82 | 7863 | 1 |
sigma | 31.70 | 0.05 | 4.25 | 24.84 | 28.68 | 31.24 | 34.20 | 41.34 | 6444 | 1 |
lp__ | -156.70 | 0.02 | 1.04 | -159.48 | -157.09 | -156.38 | -155.97 | -155.70 | 4340 | 1 |