Lecture 5 Bayesian Inference
5.1 Overview
This lecture provides an introduction to Bayesian inference, the second major approach to statistical inference alongside likelihood-based inference. After a short recap of likelihood-based inference in Section 5.2, we continue in Section 5.3 with the motivation for Bayesian inference, which arises from a different philosophical perspective on the parameter of interest, namely that the parameter itself is treated as a random quantity with its own distribution. Using Bayes’ Theorem in Section 5.4, we explain how the posterior distribution of the parameter is linked to three key components: the likelihood, the prior, and the marginal distribution of the data. Section 5.5 then provides an example of such a Bayesian approach, which can be explored interactively with the Shiny app in Section 5.6. Finally, Section 5.7 presents different types of prior distributions for the parameter of interest.
5.2 Reminder: Likelihood Inference
In likelihood inference, we have redefined the density of the data as the likelihood of the parameter:
$$L(\theta) = f(y \mid \theta).$$
This means our objective function is the likelihood of the observed random data $y$ given the fixed parameter $\theta$. To estimate the true population parameter $\theta$, we maximize the likelihood with respect to $\theta$:
$$\hat{\theta}_{ML} = \arg\max_{\theta} L(\theta).$$
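As a brief worked illustration (not part of the original recap), consider i.i.d. Poisson counts $y_1, \dots, y_n$ with rate $\lambda$, the setting that will reappear in Section 5.5. Maximizing the log-likelihood yields the familiar estimator:
$$\ell(\lambda) = \sum_{i=1}^{n}\left(y_i \log\lambda - \lambda - \log y_i!\right), \qquad \frac{\partial \ell(\lambda)}{\partial \lambda} = \frac{\sum_{i=1}^{n} y_i}{\lambda} - n \stackrel{!}{=} 0 \;\Longrightarrow\; \hat{\lambda}_{ML} = \bar{y}.$$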
5.3 Philosophical Consideration: Parameter Distribution
Up to now we have assumed that only the data come from a random distribution while the parameter vector $\theta$ is fixed:
$$y \sim f(y \mid \theta), \qquad \theta \text{ fixed (but unknown)}.$$
However, with a frequentist approach (maximum likelihood) it is only possible to quantify the uncertainty of the estimate, but not the distribution of the parameter itself.
Bayesian statistics builds upon the question of whether a parameter can really be regarded as fixed or whether it is in fact random. It assumes that the parameter vector $\theta$ itself follows a random prior distribution:
$$\theta \sim p(\theta).$$
This leads to our quantity of interest now being $p(\theta \mid y)$. It is called the posterior distribution of $\theta$, since it is the distribution of the parameter after observing the data. In the following sections we adopt this Bayesian approach and assume randomness in the parameter $\theta$.
5.4 Bayes’ Theorem
5.4.1 General Equation
Bayes’ Theorem, due to Thomas Bayes, provides a mathematical rule for inverting conditional probabilities. It can be written as the following fraction:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)},$$
where $A$ and $B$ are events and $P(B) \neq 0$.
- $P(A \mid B)$ is the conditional probability of event $A$ occurring given that $B$ is true.
- $P(B \mid A)$ is the conditional probability of event $B$ occurring given that $A$ is true.
- $P(A)$ and $P(B)$ are the probabilities of observing $A$ and $B$, respectively, without any given conditions. They are known as the marginal probabilities.
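As a quick numerical illustration with made-up values, suppose $P(A) = 0.01$, $P(B \mid A) = 0.9$ and $P(B) = 0.05$. Then
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} = \frac{0.9 \cdot 0.01}{0.05} = 0.18,$$
i.e., observing $B$ raises the probability of $A$ from 1% to 18%.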
The question now is: How can this be used for inference?
5.4.2 Application: Bayesian Inference
Substituting the target distribution $p(\theta \mid y)$ into the left-hand side of Bayes’ Theorem yields the following equation:
$$p(\theta \mid y) = \frac{f(y \mid \theta)\,p(\theta)}{f(y)},$$
where
- $p(\theta \mid y)$ is the posterior distribution of the random parameter $\theta$ given the data $y$.
- $f(y \mid \theta)$ is the likelihood of the data given the fixed parameter.
- $p(\theta)$ is the prior probability of the parameter.
- $f(y)$ is the marginal distribution of the data.
We can now reinterpret this fraction as a framework for updating probabilities based on new evidence. Since the denominator $f(y)$ does not depend on $\theta$, it is called the normalizing constant and does not affect the mode of the posterior. Hence, it is often sufficient to consider only the numerator, as the two expressions are proportional:
$$p(\theta \mid y) \propto f(y \mid \theta)\,p(\theta).$$
In Bayesian inference we now search for $p(\theta \mid y)$, i.e., the distribution of the parameter given the data. For this purpose we only need the likelihood $f(y \mid \theta)$ and the prior distribution $p(\theta)$.
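To make this concrete, the following minimal R sketch (not part of the lecture; the data and prior parameters are made up for illustration) evaluates the unnormalized posterior on a grid as likelihood times prior, previewing the Poisson–Gamma setting of Section 5.5:

```r
# Minimal sketch: posterior on a grid as likelihood x prior (illustrative values)
set.seed(1)
y      <- rpois(20, lambda = 3)          # hypothetical Poisson data
a      <- 2; b <- 1                      # hypothetical Gamma prior parameters
lambda <- seq(0.01, 10, by = 0.01)       # grid over the parameter space

lik    <- sapply(lambda, function(l) prod(dpois(y, l)))  # likelihood f(y | lambda)
prior  <- dgamma(lambda, shape = a, rate = b)            # prior p(lambda)
post_u <- lik * prior                                    # unnormalized posterior

# Normalizing only rescales the curve; its shape (and mode) is unchanged
post   <- post_u / sum(post_u * 0.01)
lambda[which.max(post)]                  # posterior mode read off the grid
```

Normalizing the grid values only rescales the curve, which is exactly why the denominator can be ignored whenever we are interested in the shape or mode of the posterior.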
5.5 Practical Example
5.5.1 Setting
We assume that the data $y_1, \dots, y_n$ are realizations of a Poisson-distributed random variable $Y \sim \mathrm{Po}(\lambda)$. However, the rate parameter $\lambda$ of this distribution is also assumed to be random. Given that $\lambda$ can only take positive values, suitable prior distributions include the Chi-squared, the Exponential, and the Gamma distribution. Among these, the Gamma distribution is often the preferred choice, which we will also use in the following example:
$$\lambda \sim \mathrm{G}(a, b).$$
Reminder: PMF of Poisson and Density of Gamma
Probability Mass Function Poisson:
$$P(Y = y \mid \lambda) = \frac{\lambda^{y} e^{-\lambda}}{y!}, \qquad y = 0, 1, 2, \dots$$
Density Gamma:
$$p(\lambda \mid a, b) = \frac{b^{a}}{\Gamma(a)}\,\lambda^{a-1} e^{-b\lambda}, \qquad \lambda > 0.$$
Note: In our case $\lambda$ is the random variable in the Gamma density.
5.5.2 Result
The next step is to simplify the term $p(\lambda \mid y) \propto f(y \mid \lambda)\,p(\lambda)$ and remove all factors that do not depend on $\lambda$:
$$p(\lambda \mid y) \propto \left(\prod_{i=1}^{n} \frac{\lambda^{y_i} e^{-\lambda}}{y_i!}\right) \frac{b^{a}}{\Gamma(a)}\,\lambda^{a-1} e^{-b\lambda} \;\propto\; \lambda^{\sum_{i=1}^{n} y_i + a - 1}\, e^{-(n + b)\lambda}.$$
When having a closer look, the result has the same kernel structure as the Gamma prior. We conclude:
$$\lambda \mid y \sim \mathrm{G}\!\left(a + \sum_{i=1}^{n} y_i,\; b + n\right).$$
The moments of a Gamma distribution with parameters $a$ and $b$ are given by $\mathbb{E}(\lambda) = \frac{a}{b}$ and $\mathrm{Var}(\lambda) = \frac{a}{b^{2}}$.
In our exemplary case, the posterior mean is given by
$$\mathbb{E}(\lambda \mid y) = \frac{a + \sum_{i=1}^{n} y_i}{b + n} = \frac{b}{b + n}\cdot\frac{a}{b} + \frac{n}{b + n}\cdot\bar{y},$$
and the posterior variance is expressed as
$$\mathrm{Var}(\lambda \mid y) = \frac{a + \sum_{i=1}^{n} y_i}{(b + n)^{2}}.$$
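As a quick numerical check (with hypothetical counts and prior parameters), the following R sketch computes the parameters of the conjugate update and verifies the weighted-average form of the posterior mean:

```r
# Minimal sketch of the conjugate Gamma-Poisson update (illustrative values)
y <- c(4, 2, 3, 5, 1)        # hypothetical Poisson counts
n <- length(y)
a <- 2; b <- 1               # hypothetical Gamma prior: shape a, rate b

a_post <- a + sum(y)         # posterior shape
b_post <- b + n              # posterior rate

post_mean <- a_post / b_post             # E(lambda | y)
post_var  <- a_post / b_post^2           # Var(lambda | y)

# The posterior mean as a weighted average of prior mean and MLE:
weight_prior    <- b / (b + n)
post_mean_check <- weight_prior * (a / b) + (1 - weight_prior) * mean(y)
all.equal(post_mean, post_mean_check)    # TRUE
```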
5.5.3 Interpretation
We observe that both the posterior mean and variance combine the respective moments derived from the MLE and the prior distribution; in particular, the posterior mean is a weighted average of the prior mean $a/b$ and the MLE $\bar{y}$. Notably, a larger prior rate parameter $b$, which corresponds to a smaller prior variance, increases the prior’s influence. However, as the number of observations $n$ increases, the likelihood’s influence becomes more dominant relative to the prior.
5.6 Interactive Shiny App
An example of Poisson-distributed data is the number of days in August with a maximum temperature below 0°C on the Zugspitze (Germany), observed over the years 1936 to 2016. In the following app you can get an impression of how the shapes of the likelihood and the Gamma prior influence the resulting posterior distribution:
Click here for the full version of the ‘Introduction to Bayesian Statistics’ shiny app. It even includes a second example from the WeatherGermany dataset to play around with.
5.7 Types of Priors
A suitable prior distribution is one that matches the parameter space of the respective parameter. However, there are three special types of priors that are often used in practice:
- Conjugate priors
- Non-informative priors (not relevant)
- Jeffreys priors (not relevant)
5.7.1 Conjugate Priors
A prior distribution is called a conjugate prior for the likelihood if the posterior distribution belongs to the same distribution family as the prior. This property allows for simpler calculation of the posterior, making Bayesian inference easier. A conjugate prior can be identified by examining the kernel of the likelihood and the support of the prior. See the example in Section 5.5, where the Gamma distribution is conjugate to the Poisson distribution with respect to $\lambda$.
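For a further illustration of conjugacy beyond the lecture’s example, consider (as an additional, standard pair not discussed above) a Binomial likelihood with success probability $\pi$ and a Beta prior $\pi \sim \mathrm{Be}(\alpha, \beta)$:
$$p(\pi \mid y) \propto \pi^{y}(1-\pi)^{n-y}\cdot \pi^{\alpha-1}(1-\pi)^{\beta-1} = \pi^{y+\alpha-1}(1-\pi)^{n-y+\beta-1},$$
which is again the kernel of a Beta distribution, i.e., $\pi \mid y \sim \mathrm{Be}(\alpha+y,\ \beta+n-y)$.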
5.7.2 Excursus: Non-informative Priors
Non-informative priors are used when we are indifferent about the parameter’s distribution. They assign equal probability to all possible values of the parameter vector $\theta$:
$$p(\theta) \propto \text{const}.$$
Since the prior is proportional to a constant, these priors are also referred to as “flat priors”. Common examples of non-informative priors include the uniform prior and Jeffreys prior (Section 5.7.3). Typically, the use of flat priors in Bayesian inference yields a posterior that does not differ much from the results of the frequentist approach (maximum likelihood).
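This last point follows directly from the proportionality in Section 5.4.2: under a flat prior $p(\theta) \propto \text{const}$ the posterior is driven by the likelihood alone,
$$p(\theta \mid y) \propto f(y \mid \theta)\,p(\theta) \propto f(y \mid \theta),$$
so in particular the posterior mode coincides with the maximum likelihood estimate.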
5.7.3 Excursus: Jeffreys Prior
Jeffreys prior is commonly used as a non-informative prior because of its desirable properties. Its density function is proportional to the square root of the determinant of the Fisher information matrix:
$$p(\theta) \propto \sqrt{\det I(\theta)}.$$
One of its key features is that it is invariant under reparametrization. In contrast, a uniform prior is not invariant, i.e., if you apply a nonlinear transformation, the prior distribution changes in a way that might introduce bias.
Example: Jeffreys Prior for Gaussian Distribution
For the Gaussian distribution of a real-valued observation $y \sim \mathcal{N}(\mu, \sigma^{2})$ with $\mu$ fixed, the Jeffreys prior for the standard deviation $\sigma > 0$ is
$$p(\sigma) \propto \frac{1}{\sigma}.$$
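As a short sketch of where this comes from: for $Y \sim \mathcal{N}(\mu, \sigma^{2})$ with $\mu$ known, the Fisher information with respect to $\sigma$ is
$$I(\sigma) = \mathbb{E}\!\left[-\frac{\partial^{2}}{\partial\sigma^{2}}\log f(Y \mid \sigma)\right] = \frac{2}{\sigma^{2}}, \qquad\text{hence}\qquad p(\sigma) \propto \sqrt{I(\sigma)} \propto \frac{1}{\sigma}.$$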