Chapter 1 Probability and Inference
Bayes’ Rule: If $A$ and $B$ are events in an event space $\mathcal{F}$, then Bayes’ rule states that
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid A^c)\,P(A^c)}.$$
Let
- $y$ be the data we will collect from an experiment,
- $K$ be everything we know for certain about the world (aside from $y$), and
- $\theta$ be anything we don't know for certain.
A Bayesian statistician is an individual who makes decisions based on the probability distribution of those things we don't know conditional on what we know, i.e. $p(\theta \mid y, K)$. Typical inference tasks include:
- Parameter estimation: $p(\theta \mid y, M)$, where $M$ is the model with parameter vector $\theta$
- Hypothesis testing: $p(M_j \mid y, M)$
- Prediction: $p(\tilde{y} \mid y, M)$
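As a minimal numerical illustration of the event-form Bayes' rule stated at the top of the chapter, here is a sketch using made-up probabilities for a diagnostic-test style calculation (all numbers are hypothetical):

```python
# Event-form Bayes' rule with hypothetical numbers:
# A = "condition present", B = "test positive"
p_A = 0.01            # P(A): prior probability of the condition
p_B_given_A = 0.95    # P(B | A): test sensitivity
p_B_given_Ac = 0.05   # P(B | A^c): false positive rate

# P(A | B) = P(B | A) P(A) / [P(B | A) P(A) + P(B | A^c) P(A^c)]
p_A_given_B = (p_B_given_A * p_A) / (p_B_given_A * p_A + p_B_given_Ac * (1 - p_A))
print(p_A_given_B)    # about 0.16: a positive test still leaves substantial uncertainty
```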
Parameter Estimation Example: exponential model
Let $Y \mid \theta \sim Exp(\theta)$, so the likelihood is $p(y \mid \theta) = \theta \exp(-\theta y)$. Let's assume a prior $\theta \sim Ga(a, b)$ with density $p(\theta) = \frac{b^a}{\Gamma(a)} \theta^{a-1} e^{-b\theta}$. Then the prior predictive distribution is
$$p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta = \frac{b^a}{\Gamma(a)} \frac{\Gamma(a+1)}{(b+y)^{a+1}}.$$
The posterior is
$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} = \frac{(b+y)^{a+1}}{\Gamma(a+1)} \theta^{a+1-1} e^{-(b+y)\theta},$$
thus $\theta \mid y \sim Ga(a+1, b+y)$.
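A quick numerical sanity check of the prior predictive: the sketch below, assuming illustrative values $a = 2$, $b = 3$, $y = 1.5$, integrates $p(y \mid \theta)\, p(\theta)$ with `scipy.integrate.quad` and compares the result to the closed form above.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma as gamma_fn

a, b = 2.0, 3.0   # hypothetical Ga(a, b) prior parameters
y = 1.5           # hypothetical observation

def integrand(theta):
    likelihood = theta * np.exp(-theta * y)                           # Exp(theta) density at y
    prior = b**a / gamma_fn(a) * theta**(a - 1) * np.exp(-b * theta)  # Ga(a, b) density
    return likelihood * prior

numeric, _ = quad(integrand, 0, np.inf)
closed_form = b**a / gamma_fn(a) * gamma_fn(a + 1) / (b + y)**(a + 1)
print(numeric, closed_form)  # the two values should agree closely
```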
If $p(y) < \infty$, we can use $p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$ to find the posterior. In the example, $\theta^a e^{-(b+y)\theta}$ is the kernel of a $Ga(a+1, b+y)$ distribution.
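To see the kernel argument numerically, the following sketch (same hypothetical $a$, $b$, $y$ as above) normalizes $\theta^a e^{-(b+y)\theta}$ on a grid and compares it to the $Ga(a+1, b+y)$ density.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

a, b, y = 2.0, 3.0, 1.5                              # hypothetical prior parameters and observation

theta = np.linspace(1e-6, 10, 2001)                  # grid over theta
kernel = theta**a * np.exp(-(b + y) * theta)         # unnormalized posterior kernel
grid_posterior = kernel / trapezoid(kernel, theta)   # normalize numerically

exact = stats.gamma.pdf(theta, a + 1, scale=1 / (b + y))   # Ga(a+1, b+y) density
print(np.max(np.abs(grid_posterior - exact)))        # should be close to zero
```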
Bayesian learning: $p(\theta) \to p(\theta \mid y_1) \to p(\theta \mid y_1, y_2) \to \ldots$
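In the conjugate exponential example, this sequential learning amounts to repeatedly applying the update $(a, b) \to (a + 1, b + y_i)$; the sketch below, with made-up observations, checks that one-observation-at-a-time updates match the batch update to $Ga(a+n, b+n\bar{y})$.

```python
import numpy as np

a, b = 2.0, 3.0                     # hypothetical Ga(a, b) prior
data = np.array([1.5, 0.7, 2.2])    # hypothetical exponential observations

# Sequential updates: each observation adds 1 to the shape and y_i to the rate
a_seq, b_seq = a, b
for yi in data:
    a_seq, b_seq = a_seq + 1, b_seq + yi

# Batch update: Ga(a + n, b + n * ybar)
a_batch, b_batch = a + len(data), b + data.sum()

print((a_seq, b_seq), (a_batch, b_batch))  # identical posterior parameters
```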
Model selection
Formally, to select a model, we use $p(M_j \mid y) \propto p(y \mid M_j)\, p(M_j)$. Thus, a Bayesian approach provides a natural way to learn about models, i.e. $p(M_j) \to p(M_j \mid y)$.
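As an illustration (not part of the notes), take two candidate models that are both exponential likelihoods but carry different hypothetical $Ga(a_j, b_j)$ priors; then $p(y \mid M_j)$ is the prior predictive derived earlier, and the posterior model probabilities follow directly.

```python
import numpy as np
from scipy.special import gammaln

y = 1.5                                        # hypothetical single observation
models = {"M1": (2.0, 3.0), "M2": (0.5, 0.5)}  # hypothetical Ga(a, b) priors per model
prior_prob = {"M1": 0.5, "M2": 0.5}            # equal prior model probabilities

def log_marginal(a, b, y):
    """log p(y | M) = log[ b^a Gamma(a+1) / (Gamma(a) (b+y)^(a+1)) ]."""
    return a * np.log(b) + gammaln(a + 1) - gammaln(a) - (a + 1) * np.log(b + y)

# p(M_j | y) is proportional to p(y | M_j) p(M_j), normalized over the candidates
log_post = {m: log_marginal(a, b, y) + np.log(prior_prob[m])
            for m, (a, b) in models.items()}
log_norm = np.logaddexp(*log_post.values())
post_prob = {m: float(np.exp(lp - log_norm)) for m, lp in log_post.items()}
print(post_prob)  # posterior model probabilities, summing to 1
```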
Prediction
The posterior predictive distribution is
$$p(\tilde{y} \mid y) = \int p(\tilde{y}, \theta \mid y)\, d\theta = \int p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta.$$
From the previous example, let $y_i \sim Exp(\theta)$ independently for $i = 1, \ldots, n$ and $\theta \sim Ga(a, b)$, so that $\theta \mid y \sim Ga(a+n, b+n\bar{y})$. Then
$$
\begin{aligned}
p(\tilde{y} \mid y) &= \int p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta
= \int \theta e^{-\theta\tilde{y}}\, \frac{(b+n\bar{y})^{a+n}}{\Gamma(a+n)}\, \theta^{a+n-1} e^{-\theta(b+n\bar{y})}\, d\theta \\
&= \frac{(b+n\bar{y})^{a+n}}{\Gamma(a+n)} \int \theta^{a+n+1-1} e^{-\theta(b+n\bar{y}+\tilde{y})}\, d\theta
= \frac{(b+n\bar{y})^{a+n}}{\Gamma(a+n)}\, \frac{\Gamma(a+n+1)}{(b+n\bar{y}+\tilde{y})^{a+n+1}} \\
&= \frac{(a+n)(b+n\bar{y})^{a+n}}{(\tilde{y}+b+n\bar{y})^{a+n+1}},
\end{aligned}
$$
which is called the Lomax distribution for $\tilde{y}$ with parameters $a+n$ and $b+n\bar{y}$.
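As a check with made-up data, one can simulate from the posterior predictive (draw $\theta$ from the $Ga(a+n, b+n\bar{y})$ posterior, then $\tilde{y} \mid \theta$ from $Exp(\theta)$) and compare against the closed-form Lomax; `scipy.stats.lomax` parameterizes this density with shape $c = a+n$ and scale $b+n\bar{y}$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

a, b = 2.0, 3.0                      # hypothetical Ga(a, b) prior
data = np.array([1.5, 0.7, 2.2])     # hypothetical observations
n, ybar = len(data), data.mean()

# Simulate the posterior predictive: theta ~ Ga(a+n, b+n*ybar), then ytilde | theta ~ Exp(theta)
theta = rng.gamma(shape=a + n, scale=1 / (b + n * ybar), size=100_000)
ytilde = rng.exponential(scale=1 / theta)

# Compare against the closed-form Lomax(a+n, b+n*ybar):
# its mean is (b + n*ybar) / (a + n - 1), and scipy uses shape c = a+n, scale = b+n*ybar
print(ytilde.mean(), (b + n * ybar) / (a + n - 1))
print(np.quantile(ytilde, 0.9), stats.lomax.ppf(0.9, c=a + n, scale=b + n * ybar))
```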
Probability: A subjective probability describes an individual's personal judgement about how likely a particular event is to occur.
Rational individuals can differ about the probability of an event because they have different knowledge, i.e. $P(E \mid K_1) \neq P(E \mid K_2)$. But given enough data $y$, we might have $P(E \mid K_1, y) \approx P(E \mid K_2, y)$.
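A small simulation sketch of this idea in the exponential-gamma setting (priors and data are hypothetical): two analysts with very different $Ga(a, b)$ priors report nearly identical posteriors once $n$ is large.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_theta = 0.8
data = rng.exponential(scale=1 / true_theta, size=500)        # simulated shared data
n, ybar = len(data), data.mean()

priors = {"analyst 1": (1.0, 1.0), "analyst 2": (20.0, 2.0)}  # hypothetical, very different priors

for name, (a, b) in priors.items():
    post = stats.gamma(a + n, scale=1 / (b + n * ybar))       # Ga(a+n, b+n*ybar) posterior
    lo, hi = post.ppf([0.025, 0.975])
    print(f"{name}: posterior mean {post.mean():.3f}, 95% interval ({lo:.3f}, {hi:.3f})")
# With n = 500 the two summaries are nearly indistinguishable despite the different priors.
```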