Chapter 1 Probability and Inference
Bayes’ Rule: If \(A\) and \(B\) are events in event space \(F\), then Bayes’ rule states that \[ P(A\mid B) = \frac{P(B\mid A)P(A)}{P(B)} = \frac{P(B\mid A)P(A)}{P(B\mid A)P(A) + P(B\mid A^c)P(A^c)} \] Let
- \(y\) be the data we will collect from an experiment,
- \(K\) be everything we know for certain about the world (aside from \(y\)), and
- \(\theta\) be anything we don’t know for certain.
A Bayesian statistician is an individual who makes decisions based on the probability distribution of what is unknown, conditional on what is known, i.e. \(p(\theta\mid y, K)\). Typical inference tasks include the following (a small numeric illustration of Bayes’ rule follows the list):
- Parameter estimation: \(p(\theta\mid y, M)\), where \(M\) is the model with parameter vector \(\theta\)
- Hypothesis testing: \(p(M_j\mid y)\), where \(M_j\) is one of the candidate models (hypotheses)
- Prediction: \(p(\tilde{y}\mid y, M)\)
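As a quick numeric illustration of Bayes’ rule with two complementary events, here is a short Python sketch; the probabilities are hypothetical and chosen only for illustration.

```python
# Bayes' rule with two complementary events A and A^c (hypothetical numbers).
p_A = 0.01           # prior P(A)
p_B_given_A = 0.95   # P(B | A)
p_B_given_Ac = 0.10  # P(B | A^c)

# Law of total probability: P(B) = P(B|A)P(A) + P(B|A^c)P(A^c).
p_B = p_B_given_A * p_A + p_B_given_Ac * (1 - p_A)

# Bayes' rule: P(A|B) = P(B|A)P(A) / P(B).
p_A_given_B = p_B_given_A * p_A / p_B
print(f"P(A | B) = {p_A_given_B:.4f}")  # about 0.0876
```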
Parameter estimation example: exponential model
Let \(Y\mid \theta \sim Exp(\theta)\), so the likelihood is \(p(y\mid \theta) = \theta \exp(-\theta y)\) for \(y > 0\). Assume a prior \(\theta \sim Ga(a, b)\), i.e. \(p(\theta) = \frac{b^a}{\Gamma(a)}\theta^{a-1}e^{-b\theta}\). Then the prior predictive distribution is \[ p(y)=\int p(y \mid \theta) p(\theta) d \theta=\frac{b^{a}}{\Gamma(a)} \frac{\Gamma(a+1)}{(b+y)^{a+1}} \] and the posterior is \[ p(\theta\mid y) = \frac{p(y\mid \theta)p(\theta)}{p(y)} = \frac{(b+y)^{a+1}}{\Gamma(a+1)}\theta^{a+1-1}e^{-(b+y)\theta}, \] thus \(\theta\mid y \sim Ga(a+1, b+y)\).
If \(p(y) < \infty\), we can use \(p(\theta\mid y) \propto p(y\mid \theta)p(\theta)\) to find the posterior. In the example, \(\theta^{a}e^{-(b+y)\theta}\) is the kernel of a \(Ga(a+1, b+y)\) distribution.
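A minimal numerical check of this conjugate update, assuming numpy and scipy are available; the values of \(a\), \(b\), and \(y\) are hypothetical.

```python
import numpy as np
from scipy import stats

# Gamma(a, b) prior (shape a, rate b) and one exponential observation y
# (hypothetical values, chosen only for illustration).
a, b, y = 2.0, 1.0, 0.7

# Unnormalized posterior on a grid: likelihood * prior.
theta = np.linspace(1e-6, 20.0, 20_000)
dtheta = theta[1] - theta[0]
unnorm = theta * np.exp(-theta * y) * stats.gamma.pdf(theta, a, scale=1.0 / b)
grid_post = unnorm / (unnorm.sum() * dtheta)   # normalize numerically

# Closed-form conjugate posterior: Ga(a + 1, b + y), i.e. shape a + 1, rate b + y.
closed_form = stats.gamma.pdf(theta, a + 1, scale=1.0 / (b + y))

print(np.max(np.abs(grid_post - closed_form)))  # near zero, up to grid error
```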
Bayesian learning: \(p(\theta) \rightarrow p(\theta\mid y_1) \rightarrow p(\theta\mid y_1, y_2) \rightarrow \ldots\), i.e. the posterior after each observation serves as the prior for the next.
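A small sketch of this sequential updating for the exponential-gamma model above; the prior hyperparameters and the simulated data are hypothetical.

```python
import numpy as np

# Sequential Bayesian learning in the exponential-gamma model:
# prior Gamma(a, b); each observation y_i updates (a, b) -> (a + 1, b + y_i).
a, b = 2.0, 1.0                                  # hypothetical prior hyperparameters
rng = np.random.default_rng(0)
data = rng.exponential(scale=1 / 0.5, size=5)    # simulated y_i with true theta = 0.5

for i, y in enumerate(data, start=1):
    a, b = a + 1, b + y
    print(f"after y_{i}: theta | y_1..y_{i} ~ Ga({a:.0f}, {b:.2f}), "
          f"posterior mean = {a / b:.3f}")

# The final (a, b) coincides with the batch update Ga(a0 + n, b0 + sum(y_i)).
```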
Model selection
Formally, to select a model, we use \(p(M_j\mid y) \propto p(y\mid M_j)p(M_j)\), where \(p(y\mid M_j) = \int p(y\mid \theta_j, M_j)\,p(\theta_j\mid M_j)\,d\theta_j\) is the marginal likelihood (prior predictive) under model \(M_j\). Thus, a Bayesian approach provides a natural way to learn about models, i.e. \(p(M_j) \rightarrow p(M_j\mid y)\).
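For the exponential-gamma setup, the marginal likelihood of a single observation is the prior predictive \(p(y\mid M_j) = a_j b_j^{a_j}/(b_j+y)^{a_j+1}\) derived above. A small sketch comparing two hypothetical models that differ only in their priors:

```python
# Posterior model probabilities p(M_j | y) proportional to p(y | M_j) p(M_j)
# for two hypothetical exponential-gamma models that differ only in their priors.
def marginal_likelihood(y, a, b):
    """Prior predictive p(y) = a * b**a / (b + y)**(a + 1) for one observation."""
    return a * b**a / (b + y) ** (a + 1)

y = 0.7                                          # a single observation (illustrative)
models = {"M1": (1.0, 1.0), "M2": (5.0, 1.0)}    # (a, b) hyperparameters per model
prior = {"M1": 0.5, "M2": 0.5}                   # equal prior model probabilities

unnorm = {m: marginal_likelihood(y, *ab) * prior[m] for m, ab in models.items()}
total = sum(unnorm.values())
for m, u in unnorm.items():
    print(f"p({m} | y) = {u / total:.3f}")
```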
Prediction
\(p(\tilde{y}\mid y) = \int p(\tilde{y}, \theta \mid y)d\theta = \int p(\tilde{y}\mid \theta)p(\theta\mid y)d\theta\), assuming \(\tilde{y}\) and \(y\) are conditionally independent given \(\theta\). Continuing the previous example, let \(y_1, \ldots, y_n \overset{iid}{\sim} Exp(\theta)\) and \(\theta \sim Ga(a,b)\); the same kernel argument gives the posterior \(\theta\mid y \sim Ga(a+n, b+n\bar{y})\). Then \[ \begin{aligned} p(\tilde{y} \mid y) &=\int p(\tilde{y} \mid \theta) p(\theta \mid y) d \theta \\ &=\int \theta e^{-\theta \tilde{y}} \frac{(b+n \bar{y})^{a+n}}{\Gamma(a+n)} \theta^{a+n-1} e^{-\theta(b+n \bar{y})} d \theta \\ &=\frac{(b+n \bar{y})^{a+n}}{\Gamma(a+n)} \int \theta^{a+n+1-1} e^{-\theta(b+n \bar{y}+\tilde{y})} d \theta \\ &=\frac{(b+n \bar{y})^{a+n}}{\Gamma(a+n)} \frac{\Gamma(a+n+1)}{(b+n \bar{y}+\tilde{y})^{a+n+1}} \\ &=\frac{(a+n)(b+n \bar{y})^{a+n}}{(\tilde{y}+b+n \bar{y})^{a+n+1}}, \end{aligned} \] which is the Lomax distribution for \(\tilde{y}\) with parameters \(a + n\) and \(b + n\bar y\).
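A Monte Carlo sanity check of this predictive distribution, assuming numpy is available: draw \(\theta\) from the posterior, then \(\tilde y\mid\theta\) from the exponential likelihood, and compare the simulated mean with the Lomax mean \((b+n\bar y)/(a+n-1)\). All numbers below are hypothetical.

```python
import numpy as np

# Posterior predictive by simulation: theta ~ Ga(a + n, b + n*ybar),
# then y_tilde | theta ~ Exp(theta).
a, b = 2.0, 1.0                                   # hypothetical prior hyperparameters
rng = np.random.default_rng(1)
y = rng.exponential(scale=1 / 0.5, size=20)       # simulated data, true theta = 0.5
n, ybar = len(y), y.mean()

theta_draws = rng.gamma(shape=a + n, scale=1 / (b + n * ybar), size=200_000)
y_tilde = rng.exponential(scale=1 / theta_draws)  # one predictive draw per theta

print("Monte Carlo mean:", y_tilde.mean())
print("Lomax mean      :", (b + n * ybar) / (a + n - 1))
```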
Probability: A subjective probability describes an individual’s personal judgement about how likely a particular event is to occur.
Rational individuals can differ about the probability of an event because they have different knowledge, i.e. \(P(E \mid K_1) \neq P(E \mid K_2)\). But given enough data \(y\), we might have \(P(E \mid K_1, y) \approx P(E \mid K_2, y)\).