Chapter 1 Probability and Inference
Bayes’ Rule: If $A$ and $B$ are events in an event space $\mathcal{F}$, then Bayes’ rule states that
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid A^c)\,P(A^c)}.$$
Let
- $y$ be the data we will collect from an experiment,
- $K$ be everything we know for certain about the world (aside from $y$), and
- $\theta$ be anything we don't know for certain.
A Bayesian statistician is an individual who makes decisions based on the probability distribution of those things we don't know conditional on what we know, i.e. $p(\theta \mid y, K)$. Typical inference tasks include:
- Parameter estimation: $p(\theta \mid y, M)$, where $M$ is the model with parameter vector $\theta$
- Hypothesis testing: $p(M_j \mid y, M)$
- Prediction: $p(\tilde{y} \mid y, M)$
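As a minimal numerical illustration of the event-form Bayes' rule stated at the top of the chapter, here is a sketch using made-up probabilities for a diagnostic-test style calculation (all numbers are hypothetical):

```python
# Event-form Bayes' rule with hypothetical numbers:
# A = "condition present", B = "test positive"
p_A = 0.01            # P(A): prior probability of the condition
p_B_given_A = 0.95    # P(B | A): test sensitivity
p_B_given_Ac = 0.05   # P(B | A^c): false positive rate

# P(A | B) = P(B | A) P(A) / [P(B | A) P(A) + P(B | A^c) P(A^c)]
p_A_given_B = (p_B_given_A * p_A) / (p_B_given_A * p_A + p_B_given_Ac * (1 - p_A))
print(p_A_given_B)    # about 0.16: a positive test still leaves substantial uncertainty
```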
Parameter Estimation Example: exponential model
Let $Y \mid \theta \sim Exp(\theta)$, so the likelihood is $p(y \mid \theta) = \theta \exp(-\theta y)$. Let's assume a prior $\theta \sim Ga(a, b)$ with density $p(\theta) = \frac{b^a}{\Gamma(a)} \theta^{a-1} e^{-b\theta}$. Then the prior predictive distribution is
$$p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta = \frac{b^a}{\Gamma(a)} \frac{\Gamma(a+1)}{(b+y)^{a+1}}.$$
The posterior is
$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} = \frac{(b+y)^{a+1}}{\Gamma(a+1)} \theta^{a+1-1} e^{-(b+y)\theta},$$
thus $\theta \mid y \sim Ga(a+1, b+y)$.
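A quick numerical sanity check of the prior predictive: the sketch below, assuming illustrative values $a = 2$, $b = 3$, $y = 1.5$, integrates $p(y \mid \theta)\, p(\theta)$ with `scipy.integrate.quad` and compares the result to the closed form above.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma as gamma_fn

a, b = 2.0, 3.0   # hypothetical Ga(a, b) prior parameters
y = 1.5           # hypothetical observation

def integrand(theta):
    likelihood = theta * np.exp(-theta * y)                           # Exp(theta) density at y
    prior = b**a / gamma_fn(a) * theta**(a - 1) * np.exp(-b * theta)  # Ga(a, b) density
    return likelihood * prior

numeric, _ = quad(integrand, 0, np.inf)
closed_form = b**a / gamma_fn(a) * gamma_fn(a + 1) / (b + y)**(a + 1)
print(numeric, closed_form)  # the two values should agree closely
```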
If $p(y) < \infty$, we can use $p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$ to find the posterior. In the example, $\theta^a e^{-(b+y)\theta}$ is the kernel of a $Ga(a+1, b+y)$ distribution.
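To see the kernel argument numerically, the following sketch (same hypothetical $a$, $b$, $y$ as above) normalizes $\theta^a e^{-(b+y)\theta}$ on a grid and compares it to the $Ga(a+1, b+y)$ density.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

a, b, y = 2.0, 3.0, 1.5                              # hypothetical prior parameters and observation

theta = np.linspace(1e-6, 10, 2001)                  # grid over theta
kernel = theta**a * np.exp(-(b + y) * theta)         # unnormalized posterior kernel
grid_posterior = kernel / trapezoid(kernel, theta)   # normalize numerically

exact = stats.gamma.pdf(theta, a + 1, scale=1 / (b + y))   # Ga(a+1, b+y) density
print(np.max(np.abs(grid_posterior - exact)))        # should be close to zero
```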
Bayesian learning: $p(\theta) \to p(\theta \mid y_1) \to p(\theta \mid y_1, y_2) \to \ldots$
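In the conjugate exponential example, this sequential learning amounts to repeatedly applying the update $(a, b) \to (a + 1, b + y_i)$; the sketch below, with made-up observations, checks that one-observation-at-a-time updates match the batch update to $Ga(a+n, b+n\bar{y})$.

```python
import numpy as np

a, b = 2.0, 3.0                     # hypothetical Ga(a, b) prior
data = np.array([1.5, 0.7, 2.2])    # hypothetical exponential observations

# Sequential updates: each observation adds 1 to the shape and y_i to the rate
a_seq, b_seq = a, b
for yi in data:
    a_seq, b_seq = a_seq + 1, b_seq + yi

# Batch update: Ga(a + n, b + n * ybar)
a_batch, b_batch = a + len(data), b + data.sum()

print((a_seq, b_seq), (a_batch, b_batch))  # identical posterior parameters
```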
Model selection
Formally, to select a model, we use $p(M_j \mid y) \propto p(y \mid M_j)\, p(M_j)$. Thus, a Bayesian approach provides a natural way to learn about models, i.e. $p(M_j) \to p(M_j \mid y)$.
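As an illustration (not part of the notes), take two candidate models that are both exponential likelihoods but carry different hypothetical $Ga(a_j, b_j)$ priors; then $p(y \mid M_j)$ is the prior predictive derived earlier, and the posterior model probabilities follow directly.

```python
import numpy as np
from scipy.special import gammaln

y = 1.5                                        # hypothetical single observation
models = {"M1": (2.0, 3.0), "M2": (0.5, 0.5)}  # hypothetical Ga(a, b) priors per model
prior_prob = {"M1": 0.5, "M2": 0.5}            # equal prior model probabilities

def log_marginal(a, b, y):
    """log p(y | M) = log[ b^a Gamma(a+1) / (Gamma(a) (b+y)^(a+1)) ]."""
    return a * np.log(b) + gammaln(a + 1) - gammaln(a) - (a + 1) * np.log(b + y)

# p(M_j | y) is proportional to p(y | M_j) p(M_j), normalized over the candidates
log_post = {m: log_marginal(a, b, y) + np.log(prior_prob[m])
            for m, (a, b) in models.items()}
log_norm = np.logaddexp(*log_post.values())
post_prob = {m: float(np.exp(lp - log_norm)) for m, lp in log_post.items()}
print(post_prob)  # posterior model probabilities, summing to 1
```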
Prediction
The posterior predictive distribution is
$$p(\tilde{y} \mid y) = \int p(\tilde{y}, \theta \mid y)\, d\theta = \int p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta.$$
From the previous example, let $y_i \sim Exp(\theta)$ independently for $i = 1, \ldots, n$ and $\theta \sim Ga(a, b)$, so that $\theta \mid y \sim Ga(a+n, b+n\bar{y})$. Then
$$
\begin{aligned}
p(\tilde{y} \mid y) &= \int p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta
= \int \theta e^{-\theta\tilde{y}}\, \frac{(b+n\bar{y})^{a+n}}{\Gamma(a+n)}\, \theta^{a+n-1} e^{-\theta(b+n\bar{y})}\, d\theta \\
&= \frac{(b+n\bar{y})^{a+n}}{\Gamma(a+n)} \int \theta^{a+n+1-1} e^{-\theta(b+n\bar{y}+\tilde{y})}\, d\theta
= \frac{(b+n\bar{y})^{a+n}}{\Gamma(a+n)}\, \frac{\Gamma(a+n+1)}{(b+n\bar{y}+\tilde{y})^{a+n+1}} \\
&= \frac{(a+n)(b+n\bar{y})^{a+n}}{(\tilde{y}+b+n\bar{y})^{a+n+1}},
\end{aligned}
$$
which is called the Lomax distribution for $\tilde{y}$ with parameters $a+n$ and $b+n\bar{y}$.
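As a check with made-up data, one can simulate from the posterior predictive (draw $\theta$ from the $Ga(a+n, b+n\bar{y})$ posterior, then $\tilde{y} \mid \theta$ from $Exp(\theta)$) and compare against the closed-form Lomax; `scipy.stats.lomax` parameterizes this density with shape $c = a+n$ and scale $b+n\bar{y}$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

a, b = 2.0, 3.0                      # hypothetical Ga(a, b) prior
data = np.array([1.5, 0.7, 2.2])     # hypothetical observations
n, ybar = len(data), data.mean()

# Simulate the posterior predictive: theta ~ Ga(a+n, b+n*ybar), then ytilde | theta ~ Exp(theta)
theta = rng.gamma(shape=a + n, scale=1 / (b + n * ybar), size=100_000)
ytilde = rng.exponential(scale=1 / theta)

# Compare against the closed-form Lomax(a+n, b+n*ybar):
# its mean is (b + n*ybar) / (a + n - 1), and scipy uses shape c = a+n, scale = b+n*ybar
print(ytilde.mean(), (b + n * ybar) / (a + n - 1))
print(np.quantile(ytilde, 0.9), stats.lomax.ppf(0.9, c=a + n, scale=b + n * ybar))
```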
Probability: A subjective probability describes an individual's personal judgement about how likely a particular event is to occur.
Rational individuals can differ about the probability of an event because they have different knowledge, i.e. $P(E \mid K_1) \neq P(E \mid K_2)$. But given enough data $y$, we might have $P(E \mid K_1, y) \approx P(E \mid K_2, y)$.
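A small simulation sketch of this idea in the exponential-gamma setting (priors and data are hypothetical): two analysts with very different $Ga(a, b)$ priors report nearly identical posteriors once $n$ is large.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_theta = 0.8
data = rng.exponential(scale=1 / true_theta, size=500)        # simulated shared data
n, ybar = len(data), data.mean()

priors = {"analyst 1": (1.0, 1.0), "analyst 2": (20.0, 2.0)}  # hypothetical, very different priors

for name, (a, b) in priors.items():
    post = stats.gamma(a + n, scale=1 / (b + n * ybar))       # Ga(a+n, b+n*ybar) posterior
    lo, hi = post.ppf([0.025, 0.975])
    print(f"{name}: posterior mean {post.mean():.3f}, 95% interval ({lo:.3f}, {hi:.3f})")
# With n = 500 the two summaries are nearly indistinguishable despite the different priors.
```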