3.2 Bayesian Inference
Bayesian inference extends the logic of Bayes’ Law by replacing a single prior probability estimate that \(\theta\) is true with a prior probability distribution over \(\theta\). Rather than saying, “I am x% certain \(\theta\) is true,” you are saying, “I believe the probability that \(\theta\) is true lies somewhere in a range whose most probable value is x%.”
Let \(\Pi(\theta)\) be the prior probability distribution of \(\theta\), with pmf or pdf \(P(\theta)\). The generative model for the observed data \(D\) is a set of conditional distributions indexed by \(\theta\), \(\{f_\theta(D): \theta \in \Omega\}\), where \(f_\theta(D)\) is the likelihood of observing \(D\) given \(\theta\). The product \(f_\theta(D)P(\theta)\) is the joint distribution of \((D, \theta)\).
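As a concrete illustration (the symbols \(a\), \(b\), \(n\), and \(k\) are introduced only for this example), suppose \(\theta\) is a coin’s heads probability with a \(\mathrm{Beta}(a, b)\) prior, \(P(\theta) = \theta^{a-1}(1-\theta)^{b-1}/B(a, b)\), and \(D\) consists of \(k\) heads in \(n\) independent flips, so that \(f_\theta(D) = \binom{n}{k}\theta^k(1-\theta)^{n-k}\). The joint density is then
\[f_\theta(D)P(\theta) = \binom{n}{k}\frac{\theta^{a+k-1}(1-\theta)^{b+n-k-1}}{B(a, b)}\]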
For continuous prior distributions, the marginal distribution for \(D\), called the prior predictive distribution, is
\[m(D) = \int_\Omega f_\theta(D)P(\theta) d\theta\]
For discrete prior distributions, replace the integral with a sum, \(m(D) = \sum\nolimits_\Omega f_\theta(D) P(\theta)\). The posterior probability distribution of \(\theta\), conditioned on the observed data \(D\), is \(\Pi(\cdot|D)\). It is the joint density, \(f_\theta(D) P(\theta),\) divided by the marginal density, \(m(D)\).
\[P(\theta | D) = \frac{f_\theta(D) P(\theta)}{m(D)}\] The numerator makes the posterior proportional to the likelihood times the prior. The denominator is a normalizing constant that scales the numerator into a proper probability distribution (one that sums or integrates to 1).
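Continuing the coin example, the marginal is \(m(D) = \binom{n}{k} B(a+k,\, b+n-k)/B(a, b)\), and dividing the joint density by it cancels both the binomial coefficient and the prior’s normalizing constant, leaving
\[P(\theta | D) = \frac{\theta^{a+k-1}(1-\theta)^{b+n-k-1}}{B(a+k,\, b+n-k)}\]
which is a \(\mathrm{Beta}(a+k,\, b+n-k)\) density. This cancellation is why the posterior is often written as simply proportional to the likelihood times the prior.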
It is helpful to look first at discrete priors, where a list of candidate values of \(\theta\) each carries a prior probability, to see how the observed evidence shifts those prior probabilities into posterior probabilities. From there it is a straightforward step to the more abstract case of continuous prior and posterior distributions.
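A minimal sketch of the discrete case in Python, assuming three candidate values of \(\theta\) (a coin’s heads probability), an illustrative prior over them, and a binomial generative model; the particular numbers are hypothetical:

```python
from math import comb

# Candidate values of theta and a prior pmf over them (illustrative numbers only).
thetas = [0.2, 0.5, 0.8]
prior  = [0.25, 0.50, 0.25]

# Observed data D: k heads in n independent flips, with binomial likelihood f_theta(D).
n, k = 10, 7
likelihood = [comb(n, k) * t**k * (1 - t)**(n - k) for t in thetas]

# Prior predictive (marginal) m(D) = sum over theta of f_theta(D) * P(theta).
m_D = sum(f * p for f, p in zip(likelihood, prior))

# Posterior P(theta | D) = f_theta(D) * P(theta) / m(D).
posterior = [f * p / m_D for f, p in zip(likelihood, prior)]

for t, post in zip(thetas, posterior):
    print(f"P(theta = {t} | D) = {post:.3f}")
```

With these numbers, the evidence (7 heads in 10 flips) pulls probability mass away from \(\theta = 0.2\) and toward the larger candidate values, which is exactly the prior-to-posterior shift described above.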